
Summarization with Precise Length Control

Lesly Miculicich  Yujia Xie  Song Wang  Pengcheng He
Microsoft
{leslym,yujiaxie,song.wang,pengcheng.h}@microsoft.com
Abstract

Many applications of text generation, such as summarization, benefit from accurately controlling the text length. Existing approaches to length-controlled summarization either degrade performance or can only control the length approximately. In this work, we present a framework to generate summaries with precisely the specified number of tokens or sentences, while maintaining or even improving the text quality. In addition, we jointly train the models to predict the length, so our models can generate summaries with optimal length when none is specified. We evaluate the proposed framework on the CNNDM dataset and show improved performance compared to existing methods.

1 Introduction

Controlling the length of the output is an important aspect of text summarization, as the desired length of the summary can vary depending on factors such as the size of the input document and the level of detail required. For example, a summary can range from a single sentence, providing a brief overview of the main topic or idea in the document, to several paragraphs providing a more detailed account of the content. This is particularly relevant in applications such as customizable summarization and constrained or fixed-length summarization, where the text must fit specific device constraints such as screen width.

The success of length-controlled summarization is measured by both the language quality and the length accuracy of the generated summary. Initial approaches introduced the desired length as a parameter or vector embedding in the model Kikuchi et al. (2016); Liu et al. (2018), but these methods produced summaries with lower Rouge scores than their baselines. More successful approaches divide the training data into buckets or bins, each with a specific length range (e.g., 0-30, 30-60), and use this information to build a summary prefix Fan et al. (2018); He et al. (2020) or as a constraint Takase and Okazaki (2019). While the summary quality improves in comparison with earlier approaches, these methods lose the ability to accurately control the length to a specific number of tokens. Another way to control the summary length is by manipulating the probabilities of the end-of-sentence (EOS) token Chan et al. (2021); Liu et al. (2022), but this may lead to fluency issues and lack of coverage when forcing an earlier sequence termination.

In this work, we consider two practical cases of length-controlled summarization: token-level control and sentence-level control. We propose an easy-to-implement method for each case. The first, REverse Position Induced Length-cOntrolled Text generation (REPILOT), controls the precise number of tokens to produce. The second, explicit sentence enumeration (SentEnum), controls the number of sentences. Both methods are highly accurate and improve Rouge scores on two datasets. Moreover, we jointly train the models to predict the optimal length of the summary given the input document, so our models can handle cases where the target length is not provided; this prediction also replaces the use of a length penalty during inference. Our evaluation results show that these methods produce summaries whose quality is comparable with state-of-the-art models while being significantly more accurate in length than previous approaches.

In summary, our contributions are: 1) A method for highly accurate length control of tokens and sentences with comparable or improved quality of the summary. 2) A baseline for sentence-based length control for summarization. 3) Length prediction in the summarization models to manage cases where the length is not given.

2 Related Work and Baselines

Various methods for controlling the length of summaries have been proposed in the literature. Kikuchi et al. (2016) proposed a method that inputs a learnable length embedding at the beginning of the decoding process, called LenInit. They also experimented with inputting the remaining length at each time step of the decoding, called LenEmb. Makino et al. (2019) improved upon this approach by optimizing the loss function with an overweight penalty, called GOLC. Liu et al. (2018) proposed a method that uses a length parameter as part of their CNN-based model, called LC. Liu et al. (2022) used two attention mechanisms, one for controlling the information selection and another for the end-of-sentence token. They first pre-trained a model with length-balanced data (LAAM), and then fine-tuned it with the original data (PtLAAM). Saito et al. (2020) proposed to control the summary length by inputting an extracted summary prototype (LPAS), and Takase and Okazaki (2019) modified the sinusoidal positional embedding to allow length control during decoding.

Some additional studies have proposed generic approaches for controlled summarization with different features, including length. Fan et al. (2018) proposed a method that divides the training data into buckets and prepends the corresponding bucket id to the summary. He et al. (2020) pre-trained a model by prepending key-phrases extracted from the summary. For length control, they divided the training data into predefined buckets and used the mean number of key-phrases per bucket to control the summary length. Chan et al. (2021) proposed a method to control the decoding based on a Constrained Markov Decision Process. For length control, they use a cost function that computes the normalized distance between the bucket ids of the generated summary and the reference.

3 Length Controlled Summarization

Denote $x=\{x_{1},\cdots,x_{n}\}$ as the input document, and $\ell$ as the desired length, i.e., the number of tokens or sentences. Our goal is to train a probability model $p(y|x,\ell)$, where $y=\{y_{1},\cdots,y_{\ell}\}$ is the summary. If the length is not given, our models can also estimate the length of the summary as $p(\ell|x)$.

Figure 1: Left: Regular model. Right: Reverse Position Induced Length-Controlled Text generation (REPILOT).

3.1 REPILOT

The REverse Position Induced Length-cOntrolled Text generation (REPILOT) method is a lightweight solution for accurately controlling the generated text length. Specifically, we simply reverse the indices of the position encoding, e.g., “7, 6, …, 2, 1, 0”, as illustrated on the right of Figure 1. In this way, at each decoding step the model is aware of how many more tokens remain to be decoded. By starting the position indices at the target length, we can control the length.

In practice, we observe that training the model with the exact number of tokens may lead to an abrupt ending of the generation, i.e., the generation ends once the position index reaches 0, regardless of whether the sentence has semantically ended. Therefore, we add a scalar noise to the position indices. Specifically, we sample a scalar $\delta\sim\mathcal{N}(0,1)$, truncate it to an integer, and add it to the sequence of position indices.
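The following is a minimal sketch of how the reversed, noise-shifted position indices could be constructed. The function name, the clamping to non-negative indices, and the PyTorch framing are our assumptions; the paper only specifies reversing the indices and adding truncated Gaussian noise:

```python
import torch

def repilot_position_ids(target_len: int, seq_len: int,
                         training: bool = True) -> torch.Tensor:
    """Reversed decoder position indices for REPILOT (illustrative sketch).

    Instead of "0, 1, 2, ...", each decoding step sees a countdown toward 0,
    i.e., how many tokens remain before the requested length is reached.
    """
    # Countdown: target_len, target_len - 1, ..., target_len - seq_len + 1.
    positions = target_len - torch.arange(seq_len)
    if training:
        # Scalar noise delta ~ N(0, 1), truncated to an integer, shifts the
        # countdown so generation does not end abruptly exactly at index 0.
        delta = int(torch.randn(1).item())
        positions = positions + delta
    # Clamp to valid (non-negative) embedding indices -- our assumption.
    return positions.clamp(min=0)

# A 10-token target yields indices 10, 9, ..., 1 before noise is applied.
print(repilot_position_ids(target_len=10, seq_len=10, training=False))
```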

3.2 SentEnum

We propose a simple yet effective solution for generating a desired number of sentences in the output text. To the best of our knowledge, it is the first solution for controlling the number of sentences to generate, as previous solutions control the number of tokens. Controlling sentences is a challenging problem because sentence boundaries such as punctuation marks and spacing are often ambiguous. We guide the generation by using explicit enumeration of sentences. The required length is indicated as a number prefix to the summary. The inserted numbers are preceded by a special token [SN] to differentiate them from the text.¹ The following is an example:

[SN]3 [SEP] [SN]1 Nearly 40 endangered forest elephants were killed in 2 parks. [SN]2 Sudanese poachers on horseback are believed to be responsible. [SN]3 Forest and savanna elephant populations have declined drastically

¹ We preprocess the training data with this annotation using the sentence tokenizer from NLTK (https://www.nltk.org/api/nltk.tokenize.html) and post-process the summary to remove the annotation at inference time.
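A possible pre- and post-processing routine for this annotation, sketched under the assumption that NLTK's sent_tokenize (mentioned in the footnote) provides the sentence boundaries; the helper names are illustrative:

```python
import re
from nltk.tokenize import sent_tokenize  # requires: nltk.download("punkt")

def annotate(summary: str) -> str:
    """Add the SentEnum annotation: a length prefix plus per-sentence
    enumeration, following the [SN]/[SEP] format of the example above."""
    sents = sent_tokenize(summary)
    body = " ".join(f"[SN]{i + 1} {s}" for i, s in enumerate(sents))
    return f"[SN]{len(sents)} [SEP] {body}"

def strip_annotation(generated: str) -> str:
    """Post-process a generated summary: remove all [SN]/[SEP] markers."""
    return re.sub(r"\[SN\]\d+\s*(?:\[SEP\]\s*)?", "", generated).strip()
```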

3.3 Length Prediction

The models are trained to predict both the length and the summary via multi-task learning. In the case of REPILOT, we use a classification head for predicting the number of tokens. The loss is computed as a weighted average of the length prediction loss and the summarization loss, where we adopt the mean squared error loss for length prediction and the cross-entropy loss for text generation. In the case of SentEnum, we simply train the model to predict the length prefix together with the summary as a text string. This method is easier to implement, and preliminary experiments show similar performance to having a separate head to predict the number of sentences.
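A sketch of the joint objective for REPILOT as we read it; the weight alpha and the tensor shapes are assumptions, since the paper specifies a weighted average of the two losses but not the weight value:

```python
import torch
import torch.nn.functional as F

def joint_loss(gen_logits: torch.Tensor,  # (batch, seq_len, vocab) decoder logits
               target_ids: torch.Tensor,  # (batch, seq_len) gold summary tokens
               len_pred: torch.Tensor,    # (batch,) lengths from the classification head
               len_gold: torch.Tensor,    # (batch,) gold number of tokens
               alpha: float = 0.5) -> torch.Tensor:
    # Cross-entropy loss for text generation; cross_entropy expects
    # the class (vocab) dimension second, hence the transpose.
    ce = F.cross_entropy(gen_logits.transpose(1, 2), target_ids)
    # Mean squared error loss for the length prediction head.
    mse = F.mse_loss(len_pred.float(), len_gold.float())
    # Weighted average of the two task losses.
    return alpha * ce + (1.0 - alpha) * mse
```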

4 Experimental Setup

We initialize our models with Zcode++ He et al. (2022), a large pre-trained language model that reported strong results when fine-tuned for summarization. Unless otherwise specified, we use the same hyperparameters and configuration.

Datasets: We perform experiments on CNNDM Nallapati et al. (2016) and Arxiv Cohan et al. (2018). Information and statistics are detailed in Appendix A.

Metrics: We use the standard Rouge scores (R1, R2) for evaluating the summary outputs. To evaluate the success of the length control, we measure the exact-match accuracy (Acc.) and the mean absolute difference (Diff.) between generated and requested lengths.
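For concreteness, a small sketch of how we interpret these two length-control metrics; the function is illustrative, not the authors' evaluation code:

```python
def length_metrics(generated_lens, requested_lens):
    """Exact-match accuracy (Acc.) and mean absolute difference (Diff.)
    between generated and requested lengths."""
    pairs = list(zip(generated_lens, requested_lens))
    acc = sum(g == r for g, r in pairs) / len(pairs)
    diff = sum(abs(g - r) for g, r in pairs) / len(pairs)
    return acc, diff

# Two of three summaries hit the requested length exactly:
print(length_metrics([3, 4, 5], [3, 4, 7]))  # approximately (0.667, 0.667)
```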

5 Results and Analysis

5.1 Summary quality

Table 1 shows the Rouge scores on CNNDM. We compare with the reported results of the previous works described in Section 2. All previous approaches use the length of the annotated reference summary to report Rouge scores (marked as G in Table 1), and some of them use an external model to extract information from the document, such as key-phrases or key-sentences (marked as E in Table 1), which makes them not directly comparable to ours. Our approach is the first to predict the expected length jointly with the summary. We show the results of our two models, REPILOT and SentEnum, with both golden and predicted lengths. In addition, we report the scores of Zcode++ fine-tuned on CNNDM. We use beam search of size 3 and no length penalty. The length penalty adjusts the model to the typical length of the test set. We argue that the length prediction can replace this hyper-parameter, and preliminary experiments showed that our models achieve better scores without it. We nevertheless report results of Zcode++ with and without length penalty.

Both REPILOT and SentEnum achieve higher Rouge scores than Zcode++ when using the reference target lengths. In addition, our models with predicted lengths performed better than Zcode++ without the length penalty adjustment, and on par with Zcode++ with length penalty. The results are also comparable with the reported results of similar approaches, even though we do not use an external model to extract information from the document to guide the summary generation.

5.2 Length Control Accuracy on REPILOT

Table 2 shows the performance of the length control measured in number of tokens for REPILOT. We compare it with three baselines: a fine-tuned Zcode++ model and two models that group the summaries into different numbers of buckets and use the bucket ids as the length control Fan et al. (2018); He et al. (2020). The models are evaluated using the golden lengths of the reference summaries and the same decoding method: greedy decoding without n-gram blocking. Our results show that REPILOT achieves better Rouge scores and a lower mean absolute difference between requested and generated lengths.

5.3 Length Control Accuracy on SentEnum

Table 3 shows the evaluation for SentEnum. We compare it with three baselines: a fine-tuned Zcode++, a model that groups the summaries into buckets Fan et al. (2018); He et al. (2020), and a model that uses only the required number of sentences as a prefix without sentence enumeration (SentPrefix). As the diversity of the number of sentences per summary is small in CNNDM (Appendix A), we include experiments on Arxiv. The results are obtained with greedy decoding without n-gram blocking, which achieved the best accuracy for all methods. We use the golden lengths as input and evaluate the generated vs. the golden length. SentEnum is significantly more accurate at generating the required number of sentences and shows higher R2 scores. Appendix B gives additional details on the percentage of over- and under-generation with respect to the input length; SentEnum shows the lowest percentage of errors.

Additionally, we evaluated the accuracy on an out-of-domain test set of 50 samples, using lengths from 1 to 8 for all examples. Figure 2 shows the results comparing Buckets, SentPrefix, and SentEnum. The vertical axis shows the accuracy of the generated length and the horizontal axis the input lengths. The SentEnum model shows higher accuracy for all input lengths, especially in the middle range.

5.4 Length Prediction Accuracy

Finally, we evaluate the length predictor. Table 4 shows the Diff. of the predicted vs. the reference length. We compare with an encoder-based classifier trained using DeBERTa He et al. (2021) with a mean squared error loss. The jointly trained predictors are slightly more accurate than the individually trained ones, showing that the multi-task approach is effective.

6 Conclusions

We study two simple methods for precisely controlling the length of tokens and sentences in text summarization. These techniques generated text with a specified length more accurately than previous methods. Additionally, we introduced a length predictor, making the models more versatile and easier to use without requiring an input length. These techniques can also be applied to other tasks such as text simplification and translation.

| Model | Len. info | R1 | R2 |
|---|---|---|---|
| w/o pre-trained LM | | | |
| LenInit Kikuchi et al. (2016) | G | 25.87 | 8.27 |
| LenEmb Kikuchi et al. (2016) | G | 26.73 | 8.39 |
| LC Liu et al. (2018) | G | 35.45 | 14.50 |
| GOLC Makino et al. (2019) | G | 38.27 | 16.30 |
| LenCtrl Fan et al. (2018) | G | 39.16 | 15.54 |
| LenAttn Yu et al. (2021) | G | 39.82 | 17.31 |
| GPT2 CMDP Chan et al. (2021) | G | 41.72 | 17.99 |
| LPAS Saito et al. (2020) | GE | 42.55 | 20.09 |
| w/ pre-trained LM | | | |
| BART Lewis et al. (2020) | N | 44.16 | 21.28 |
| BLPAS Liu et al. (2022) | GE | 42.95 | 20.29 |
| LAAM Liu et al. (2022) | GE | 43.55 | 20.44 |
| PtLAAM Liu et al. (2022) | GE | 44.17 | 20.63 |
| CtrlSum He et al. (2020) | E | 45.65 | 22.35 |
| CtrlSum He et al. (2020) | GE | 46.26 | 22.60 |
| Zcode++ He et al. (2022) | N | 45.53 | 22.55 |
| Zcode++ w/o length penalty | N | 45.19 | 22.41 |
| + REPILOT | G | 46.20 | 22.03 |
| + REPILOT + length pred. | N | 45.61 | 22.13 |
| + SentEnum | G | 46.02 | 22.60 |
| + SentEnum + length pred. | N | 45.54 | 22.56 |

Table 1: Evaluation Results on CNNDM data. G: Use length from the reference summary. E: Use extracted information from the document. N: None of the above.

| Model | R1 ↑ | R2 ↑ | Diff. ↓ |
|---|---|---|---|
| Zcode++ | 44.76 | 21.33 | 16.68 |
| Buckets-10 | 45.82 | 21.76 | 5.84 |
| Buckets-100 | 45.86 | 21.54 | 1.43 |
| REPILOT | 46.36 | 22.08 | 1.30 |

Table 2: Results of the REPILOT model on CNNDM using greedy decoding and without n-gram blocking.

| Dataset | Model | R1 ↑ | R2 ↑ | Acc. ↑ | Diff. ↓ |
|---|---|---|---|---|---|
| CNNDM | Zcode++ | 44.8 | 21.3 | 60.1 | 0.5 |
| CNNDM | Buckets | 45.8 | 21.8 | 87.1 | 0.1 |
| CNNDM | SentPrefix | 45.7 | 21.7 | 94.0 | 0.1 |
| CNNDM | SentEnum | 45.7 | 22.1 | 98.6 | 0.02 |
| Arxiv | Zcode++ | 46.3 | 19.5 | 20.3 | 1.5 |
| Arxiv | Buckets | 50.1 | 21.0 | 77.5 | 0.2 |
| Arxiv | SentPrefix | 50.0 | 20.9 | 79.9 | 0.2 |
| Arxiv | SentEnum | 49.8 | 21.3 | 93.3 | 0.1 |

Table 3: Results of the SentEnum model using greedy decoding and without n-gram blocking.

| Model | Tokens | Sentences |
|---|---|---|
| Encoder-based classifier | 15.42 | 1.04 |
| Jointly trained classifier | 15.13 | 0.90 |

Table 4: Diff. of the predicted length vs. gold length.
Figure 2: Accuracy of the produced summary length with respect to the input number of sentences. The evaluation is done on an out-of-distribution set.

Limitations

Our methods are not a perfect solution to control summarization granularity. We will keep exploring semantics-aware length control to control the granularity of generated text in a more meaningful way. Additionally, the methods were not tested on lengths unseen in the training data. SentEnum is less accurate when the input length is higher, due to the smaller number of long examples in the training data. This suggests it does not generalize to all lengths. Another limitation of SentEnum is the quality of the sentence splitter. In our experiments, the errors introduced by the splitter did not have a significant impact on the results, but this may not be the case for noisier datasets or different languages.

Ethics Statement

Our work complies with the ACL Ethics Policy (https://www.aclweb.org/portal/content/acl-code-ethics). We presented methods that are simple to replicate and do not require large computational resources, as they can build on pre-trained language or summarization models. We acknowledge the risk of any text generation model producing outputs that could lead to misinformation, bias, or misuse. We are, however, committed to using publicly available datasets whose content is relatively safe.


Appendix A Data Statistics

The following table shows the number of examples for training, development, and testing for the reported datasets:

| Dataset | Train | Dev. | Test |
|---|---|---|---|
| CNNDM | 287,113 | 13,368 | 11,490 |
| Arxiv | 202,914 | 6,436 | 6,440 |

The following statistics are calculated from the summaries of the training sets.

| Dataset | Unit | Max | Min | Mean | Med. | P75 | P95 | STD |
|---|---|---|---|---|---|---|---|---|
| CNNDM | Words | 1,246 | 4 | 49 | 46 | 57 | 85 | 20 |
| CNNDM | Sent. | 36 | 1 | 3.8 | 4 | 4 | 6 | 1.3 |
| Arxiv | Words | 26K | 2 | 278 | 164 | 237 | 482 | 587 |
| Arxiv | Sent. | 20 | 1 | 6.1 | 6 | 8 | 10 | 2.2 |

Appendix B Over and Under Length Generation

We count the percentage of examples with generated length shorter or longer than the gold length, which we call %Under and %Over generation, respectively. These results were obtained with the SentEnum model using greedy decoding and without n-gram blocking on the CNNDM and Arxiv datasets. Given the data distribution, the number of shorter training examples is higher than that of longer ones. Thus, all methods tend to under-produce rather than over-produce. However, SentEnum shows significantly better results.

| Dataset | Model | %Over ↓ | %Under ↓ |
|---|---|---|---|
| CNNDM | Zcode++ | 14.5 | 25.4 |
| CNNDM | Buckets | 1.8 | 11.1 |
| CNNDM | SentPrefix | 1.1 | 4.9 |
| CNNDM | SentEnum | 0.8 | 0.6 |
| Arxiv | Zcode++ | 19.9 | 59.8 |
| Arxiv | Buckets | 6.8 | 16.6 |
| Arxiv | SentPrefix | 5.9 | 14.7 |
| Arxiv | SentEnum | 2.4 | 4.3 |