A Focused Study on Sequence Length for Dialogue Summarization
Abstract
Output length is critical to dialogue summarization systems. The dialogue summary length is determined by multiple factors, including dialogue complexity, summary objective, and personal preferences. In this work, we approach dialogue summary length from three perspectives. First, we analyze the length differences between existing models' outputs and the corresponding human references and find that summarization models tend to produce more verbose summaries due to their pretraining objectives. Second, we identify salient features for summary length prediction by comparing different model settings. Third, we experiment with a length-aware summarizer and show notable improvement over existing models when summary length is well incorporated. Analysis and experiments are conducted on the popular DialogSum and SAMSum datasets to validate our findings. (Code available: https://github.com/BinWang28/LA-BART)
Index Terms— Dialogue summarization, Seq2seq model, Controllable generation
1 Introduction
Previous work in summarization focuses on the news domain, where both extractive and abstractive methods are effective [1]. Dialogue summarization aims to generate summaries for conversations between individuals or groups. Because of the verbal or written format of dialogues, their summarization is abstractive in nature and poses unique challenges [2, 3]. A good dialogue summary should be not only coherent and fluent but also concise while covering essential details [4].
Two main factors determine the desired summary length: 1) dialogue complexity and 2) user preferences. Traditional text summarization systems require both the source text and the summary length as input to satisfy specific needs such as display constraints [5]. In comparison, recent text summarizers, especially pre-trained models, generate a summary without explicit length modeling [4, 6, 7]. However, the output sequence length is a crucial factor for dialogue summarization, since a concise yet informative summary is favorable in both applications and evaluations. A study of the evaluation, predictability, and controllability of dialogue summary length is still missing.
Table 1. An example dialogue from DialogSum with three human reference summaries and a BARTLarge output.
Dialogue:
#Person1#: May, do you mind helping me prepare for the picnic?
#Person2#: Sure. Have you checked the weather report?
#Person1#: Yes. It says it will be sunny all day. No sign of rain at all. This is your father's favorite sausage. Sandwiches for you and Daniel.
#Person2#: No, thanks Mom. I'd like some ...
Summaries:
Human1: May is helping her mother to do some preparation for the picnic.
Human2: Mom asks May to help to prepare for the picnic and May agrees.
Human3: May's mother asks May for help in preparing for a picnic. May gives her a hand.
BARTLarge: #Person1# asks May to help her prepare for the picnic. May takes some fruit salad, crackers, sausage, toast, chicken wings, napkins, cups, and picnic blanket to the living room.
Table 2. ROUGE scores and absolute summary length difference (Len., in words) on DialogSum.
Model       | ROUGE-1 (Prec. / Rec. / F-1) | ROUGE-2 (Prec. / Rec. / F-1) | ROUGE-L (Prec. / Rec. / F-1) | Len.
BARTLarge   | 44.11 / 54.25 / 47.22        | 19.86 / 23.78 / 20.92        | 41.97 / 49.77 / 44.57        | 7.9
BARTBase    | 45.87 / 49.10 / 46.12        | 19.78 / 20.71 / 19.63        | 43.43 / 45.97 / 43.81        | 5.4
T5Base      | 40.91 / 48.61 / 43.25        | 17.06 / 19.58 / 17.72        | 38.85 / 44.89 / 40.86        | 6.6
T5Small     | 38.30 / 42.64 / 39.12        | 14.44 / 15.44 / 14.40        | 36.77 / 40.14 / 37.55        | 6.2
Inter-human | 54.16 / 54.68 / 53.34        | 27.17 / 27.35 / 26.70        | 51.38 / 51.78 / 50.84        | 4.2
For length-controllable summarization, [5] and [8] propose to use length embeddings as input to the decoder of LSTM-based or CNN-based models. [9] suggests controlling the summary length by first extracting salient tokens from the original text. In comparison, our goal is not to propose a better controllable model but to study the importance of summary length. No prior work focuses on summary length analysis for dialogue summarization systems.
With the availability of a multi-reference dialogue summarization dataset [10], we initiate the study of dialogue summary length while referring to cross-human statistics. First, SOTA summarization models are compared with humans, and we find that most models suffer from verbosity due to their pretraining objectives; humans still achieve a much higher length agreement than machines. Second, we investigate whether summary length is predictable and what information is critical for length prediction. Last, we adapt existing models into length-controllable ones by simply adding the output length as an additional input. Through experiments, we show that significant improvements can be achieved when the reference summary length is given, which suggests that better modeling of summary length offers substantial room for improvement over existing models.
Table 3. Summary length correlation between system outputs and human references (and among humans) on DialogSum.
Model       | Correlation
BARTLarge   | 74.6 / 72.9 / 54.6
BARTBase    | 72.7 / 71.8 / 54.4
T5Base      | 71.4 / 72.1 / 54.3
T5Small     | 69.6 / 69.6 / 51.0
Inter-human | 76.9 / 75.6 / 58.6
2 Performance of Existing Models
Experimental setup: DialogSum is a recently released dialogue summarization dataset [10]. It consists of 13,460 everyday dialogues, which are divided into three subsets: train (12,460), validation (500) and test (500). Each dialogue is associated with one reference summary except for the test set, where three reference summaries are given. An example is shown in Table 1 with references and one model output. In this section, we focus on DialogSum because its multi-reference summaries allow us to study inter-human statistics. We compute inter-human performance by comparing summaries from any two annotators and report the average. Four pre-trained generation models are selected for experiments: BARTLarge/Base [6] and T5Base/Small [11]. We do not include dedicated dialogue summarization models [4, 7, 12] because their performance is close to that of BARTLarge.
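For clarity, the following is a minimal sketch of how the inter-human scores can be computed from the three test references. The `score_fn` argument is a placeholder for any summary-level metric (e.g., a ROUGE F-1 scorer), not the exact implementation used here.

```python
from itertools import permutations

def inter_human_score(references, score_fn):
    """Average pairwise score among the human references of one dialogue.

    `references`: the three reference summaries of a DialogSum test dialogue.
    `score_fn(hyp, ref)`: any summary-level metric, e.g. a ROUGE scorer.
    Ordered pairs are used so that asymmetric metrics (precision/recall)
    are averaged over both directions.
    """
    pairs = list(permutations(references, 2))
    return sum(score_fn(hyp, ref) for hyp, ref in pairs) / len(pairs)
```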
Our Findings: Table 2 shows model performance on ROUGE scores [13] (the 'py-rouge' implementation is used throughout this paper for fair comparison [14]) and the absolute length difference between summaries for each dialogue. Summary length is counted as the number of words split by whitespace. As expected, an obvious gap still exists between automatic summarizers and humans. There are several major differences. First, humans achieve an almost perfect balance between precision and recall in all ROUGE scores. In contrast, all neural summarizers achieve higher recall than precision, which indicates that the summarization models cover the content of the reference summary well but tend to generate more unnecessary phrases. More specifically, the ROUGE-1 recall of BARTLarge is extremely close to inter-human performance (54.25 vs. 54.68), yet a lower precision leads to a lower F-1 score. Second, the absolute length differences of the summarizers are larger than those of humans. The agreement between humans is a 4.2-word length difference. In comparison, existing summarizers are less competent in determining the summary length and tend to generate longer summaries than desired. Third, Table 3 shows the length correlation across humans and models over all dialogues. Summarizers have comparable correlation with humans, which indicates that they can determine the relative summary length well. In other words, both humans and summarizers can decide which dialogue requires a lengthier summary.
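As a concrete illustration of the length statistics above, the sketch below counts summary length by whitespace-split words and computes the mean absolute length difference (the "Len." column in Table 2) together with a length correlation. Pearson correlation is used here as one plausible choice; the exact correlation measure behind Table 3 is not restated in this section, so that part is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr

def summary_len(text: str) -> int:
    # Summary length = number of whitespace-separated words.
    return len(text.split())

def length_stats(system_summaries, reference_summaries):
    sys_len = np.array([summary_len(s) for s in system_summaries], dtype=float)
    ref_len = np.array([summary_len(r) for r in reference_summaries], dtype=float)
    abs_diff = float(np.mean(np.abs(sys_len - ref_len)))  # "Len." column in Table 2
    corr, _ = pearsonr(sys_len, ref_len)                  # one plausible length correlation
    return abs_diff, 100.0 * corr
```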
The above analysis shows that existing summarizers suffer from verbosity. The main reason is the denoising objective used in pretraining: it encourages output sequences similar in length to the input [6], while the summarization task expects a condensed output. Therefore, future pretrained summarization models should pay more attention to the verbose problem during unsupervised pretraining or fine-tuning.
Table 4. Absolute length difference (in words) of the summary length predictor under different settings.
Model       | DialogSum | SAMSum
Surface     | 5.12      | 6.84
Single      | 5.96      | 6.34
Single+     | 4.69      | 6.33
Multi       | 4.83      | 7.19
Multi+      | 4.53      | 7.35
Inter-human | 4.21      | –
Table 5. Results on DialogSum and SAMSum. "Sum. Len." indicates the summary length given as input (✗: none, pseudo: predicted length, ✓: ground-truth length); "Len." is the absolute length difference (in words) from the reference.
Model          | Sum. Len. | DialogSum R-1 / R-2 / R-L / Len. | SAMSum R-1 / R-2 / R-L / Len.
Previous Results
Ext-Oracle     | ✗         | 38.68 / 17.28 / 40.06 / -        | 42.12 / 17.08 / 40.15 / -
T5Base         | ✗         | 43.25 / 17.72 / 40.86 / 6.6      | 52.11 / 26.88 / 49.32 / 10.2
△UniLMv2Base   | ✗         | 47.04 / 21.13 / 45.04 / -        | 50.53 / 26.62 / 48.81 / -
△BARTLarge     | ✗         | 47.28 / 21.18 / 44.83 / -        | 53.12 / 27.95 / 49.15 / -
⋄MV-BARTLarge  | ✗         | - / - / - / -                    | 53.42 / 27.98 / 49.97 / 7.2
⋆CODS          | ✗         | - / - / - / -                    | 52.65 / 27.84 / 50.79 / -
Our Results
BARTLarge      | ✗         | 47.22 / 20.92 / 44.57 / 7.9      | 52.93 / 28.21 / 49.88 / 9.3
+w/ len. out.  | ✗         | 47.29 / 20.77 / 45.01 / 5.9      | 53.44 / 28.08 / 50.14 / 8.2
LA-BARTLarge   | pseudo    | 47.28 / 21.09 / 45.11 / 5.5      | 53.43 / 28.28 / 49.94 / 8.7
+w/ len. out.  | pseudo    | 47.11 / 20.60 / 44.91 / 4.9      | 53.56 / 28.59 / 50.29 / 8.0
LA-BARTLarge   | ✓         | 49.81 / 22.81 / 47.40 / 2.6      | 57.81 / 31.73 / 53.46 / 4.1
+w/ len. out.  | ✓         | 49.29 / 22.19 / 46.97 / 1.7      | 57.89 / 31.85 / 53.95 / 2.7
Table 6. BERTScore results on DialogSum and SAMSum.
Model              | DialogSum | SAMSum
BARTLarge          | 51.6      | 56.8
+w/ len. out.      | 52.3      | 56.7
LA-BARTLarge (ps)  | 52.5      | 56.3
+w/ len. out.      | 52.6      | 56.7
LA-BARTLarge       | 54.5      | 59.5
+w/ len. out.      | 54.3      | 59.8
3 Summary Length Predictor
Previous dialogue summarization models have no length prediction or length control module. Here, we investigate how difficult it is to predict the optimal summary length and what kind of information is beneficial for this task. We fine-tune a T5Small model and design five variants with different input features and training objectives. Experiments are conducted on the DialogSum and SAMSum [16] datasets. The details of our training objectives are as follows (a minimal input-construction sketch is given after the list):
- Surface: The surface information of a dialogue is used as input.
  "Length of dialogue: #{x}. Number of utterance: #{y}."
- Single: The dialogue is used as input.
  "Dialogue: {D}."
- Single+: Both the surface information and the dialogue are used as input.
  "Length of dialogue: #{x}. Number of utterance: #{y}. Dialogue: {D}."
- Multi: The input is the same as Single. The output contains summary length prediction and summary generation in a multi-task learning manner.
- Multi+: The same as Multi, except that its input is the same as Single+.
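Below is a minimal sketch of how the five input variants could be assembled for the T5Small length predictor. The Surface/Single/Single+ templates follow the descriptions above; the exact target format of the Multi/Multi+ variants is not specified here, so the multi-task target template in the sketch is an assumption.

```python
def build_length_predictor_example(dialogue: str, utterances: list, summary: str, variant: str):
    """Return an (input, target) text pair for the T5-small length predictor."""
    surface = (f"Length of dialogue: #{len(dialogue.split())}. "
               f"Number of utterance: #{len(utterances)}.")
    inputs = {
        "surface": surface,                              # Surface
        "single": f"Dialogue: {dialogue}.",              # Single / Multi
        "single+": f"{surface} Dialogue: {dialogue}.",   # Single+ / Multi+
    }
    key = {"surface": "surface", "single": "single", "single+": "single+",
           "multi": "single", "multi+": "single+"}[variant]
    gold_len = len(summary.split())
    if variant in ("multi", "multi+"):
        # Multi-task target: length prediction plus summary generation
        # (the exact wording of this target template is an assumption).
        target = f"Length of summary: #{gold_len}. Summary: {summary}"
    else:
        target = str(gold_len)
    return inputs[key], target
```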
Table 7. Human evaluation: average ranking scores (higher is better).
Summary          | Ranking Score
Reference 1      | 3.6
Reference 2      | 3.6
Reference 3      | 3.3
BART             | 2.2
LA-BART (pseudo) | 2.3
LA-BART (true)   | 2.6
Table 4 shows the length prediction results. On the DialogSum dataset, surface information serves as a strong baseline and is even better than using the whole dialogue. A further boost is obtained by combining surface information and the dialogue as input. Meanwhile, multi-task learning brings extra benefits. As a result, Multi+ achieves the best performance and is close to human performance, with only a 0.3-word difference in length (4.53 vs. 4.21). Unlike DialogSum, the SAMSum dataset gives less explicit instructions in its annotation process. Therefore, its summary length is less consistent and more challenging to forecast. We find that the dialogue itself plays the most important role in length prediction, while multi-task learning does not help and is even worse than surface information. In the following section, we choose the best model for each dataset to acquire pseudo summary lengths as the input to the proposed length-controllable model.
4 Length-Aware Model
In this section, we first present a simple method to adapt existing models for length controllability. Then, we experiment with several methods to enhance the model's length awareness.
Length-aware models: Unlike previous methods that control the output length by learning additional length embeddings as input to LSTM-based or CNN-based seq2seq models [5, 8], we directly provide a textual description of the desired output length as additional input along with the dialogue, serving as a signal for the desired summary length. More specifically, the desired length is prepended to the dialogue as a short textual prompt, in the same style as the templates in Section 3.
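The sketch below illustrates this input construction at inference time with a BART checkpoint fine-tuned on length-augmented inputs. The exact wording of the length prompt is an assumption (following the style of the Section 3 templates), and `./la-bart-large` is a hypothetical local checkpoint path.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Hypothetical checkpoint fine-tuned with length-augmented inputs.
tokenizer = BartTokenizer.from_pretrained("./la-bart-large")
model = BartForConditionalGeneration.from_pretrained("./la-bart-large")

def length_aware_input(dialogue: str, target_len: int) -> str:
    # The desired summary length is prepended as a short textual prompt;
    # this template wording is an assumption, not necessarily the paper's exact string.
    return f"Length of summary: #{target_len}. Dialogue: {dialogue}."

dialogue = "#Person1#: May, do you mind helping me prepare for the picnic? #Person2#: Sure. ..."
batch = tokenizer(length_aware_input(dialogue, 12), return_tensors="pt",
                  truncation=True, max_length=1024)
summary_ids = model.generate(**batch, num_beams=4, max_length=100)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```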
Results in Table 5 show that this simple length-controllable adaptation is effective with pre-trained neural summarizers. We then experiment with the following three methods to probe the models' length awareness:
- LA-BARTv1: Ground-truth summary length is used during training. Pseudo summary length, acquired from the length predictor in Section 3, is used during inference.
- LA-BARTv2: Instead of the pseudo length as in v1, the reference summary length is used during inference. In this case, we assume the ground-truth summary length is revealed. This tests a model's length controllability and provides an upper bound for length-aware models.
- Multi-task learning: The summarizer is trained to predict the summary length along with summary generation. We expect a model to be more aware of summary length under this multi-task learning objective.
We use BARTLarge as the baseline for most experiments and provide more baseline results in the Appendix. Recent dialogue summarization methods are listed for comparison [4, 10, 15]. Besides the standard ROUGE scores, we also report BERTScore [17] for a more comprehensive evaluation.
First, LA-BART with gold summary length significantly improves ROUGE and BERTScore on both datasets. The absolute length difference is only 2.6 and 4.1 words, respectively, which indicates that the reference summary length can control the generated summary length well. The evaluation metrics favor generated summaries whose length is close to the reference, which balances precision and recall. This shows that the performance of summarizers can be largely improved if summary length can be accurately predicted or well incorporated.
Second, we observe a smaller absolute length difference with pseudo summary length as input. On DialogSum, the length difference improves from 7.9 to 5.5 words, and a noticeable improvement is also shown on BERTScore. In contrast, the improvement in length difference is minor for SAMSum, and no improvement is shown on BERTScore. We believe this is because the summary length of SAMSum is more challenging to predict than that of DialogSum (as shown in Table 4), since the latter gives more explicit instructions in the summary labeling process.
Third, with multi-task learning, i.e., adding summary length prediction as an additional objective, we consistently observe a clear reduction in the length difference of the generated summaries. It indicates that the model pays more attention to the output length. Some improvement is also shown on ROUGE and BERTScore; in particular, on the DialogSum dataset, BERTScore improves from 51.6 to 52.3 after adding the multi-task learning objective. Here, we show that a simple length-awareness trick can improve a summarizer's performance. Therefore, future dialogue summarization models should pay more attention to generated summary length to move closer to human-level performance.
Human analysis: To compensate for the drawbacks of automatic summarization metrics, we also conduct a human evaluation on dialogue-summary pairs from the DialogSum dataset. Our comparative evaluation works as follows. First, annotators are presented with the original dialogue and six candidate summaries. Three summaries are the human-written references from the dataset. The remaining three are machine-generated summaries: the outputs of BARTLarge, LA-BARTLarge with pseudo length labels as input, and LA-BARTLarge with the ground-truth length labels as input. Then, the annotator performs a comparative ranking of the summaries based on overall quality. The highest-ranked summary gets a score of 5, while the lowest is scored 0. Ten dialogues are randomly sampled, and each dialogue is evaluated by 4 annotators.
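For reference, a small sketch of how the comparative rankings can be turned into the scores reported in Table 7 (best rank scores 5, worst scores 0, averaged over annotators); the data layout and variable names are illustrative assumptions.

```python
import numpy as np

def ranking_scores(annotator_rankings):
    """Convert per-annotator rankings into average scores.

    `annotator_rankings`: list of rankings, one per annotator; each ranking is a
    list of the six summary names ordered from best to worst. With six candidates,
    the top-ranked summary receives 5 points and the bottom-ranked one 0.
    """
    scores = {}
    for ranking in annotator_rankings:
        n = len(ranking)
        for rank, name in enumerate(ranking):
            scores.setdefault(name, []).append(n - 1 - rank)
    return {name: float(np.mean(vals)) for name, vals in scores.items()}
```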
Human evaluation results are shown in Table 7. As expected, the human-written reference summaries receive the highest scores. LA-BARTLarge with ground-truth length achieves the best score among the machine-generated summaries, and LA-BARTLarge with the pseudo length label shows comparable performance to the baseline.
5 Conclusion
In this work, we study the summary length of dialogues. We find that recent dialogue summarizers suffer from verbosity due to their pretraining objective. We show through experiments that a model's potential can be unlocked by explicitly considering the summary length. We hope this work raises attention to summary length and facilitates the development of summarization models.
References
- [1] Wafaa S El-Kassas, Cherif R Salama, Ahmed A Rafea, and Hoda K Mohamed, “Automatic text summarization: A comprehensive survey,” Expert Systems with Applications, vol. 165, pp. 113679, 2021.
- [2] Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson, “A simplest systematics for the organization of turn-taking for conversation,” Language, vol. 50, no. 4, pp. 696–735, 1974.
- [3] Bin Wang, Chen Zhang, Yan Zhang, Yiming Chen, and Haizhou Li, “Analyzing and evaluating faithfulness in dialogue summarization,” arXiv preprint arXiv:2210.11777, 2022.
- [4] Jiaao Chen and Diyi Yang, “Multi-view sequence-to-sequence models with conversational structure for abstractive dialogue summarization,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, Nov. 2020, pp. 4106–4118, Association for Computational Linguistics.
- [5] Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura, “Controlling output length in neural encoder-decoders,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, Nov. 2016, pp. 1328–1338, Association for Computational Linguistics.
- [6] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020, pp. 7871–7880, Association for Computational Linguistics.
- [7] Zhengyuan Liu, Ke Shi, and Nancy Chen, “Coreference-aware dialogue summarization,” in Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, Singapore and Online, July 2021, pp. 509–519, Association for Computational Linguistics.
- [8] Angela Fan, David Grangier, and Michael Auli, “Controllable abstractive summarization,” in Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, Melbourne, Australia, July 2018, pp. 45–54, Association for Computational Linguistics.
- [9] Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Atsushi Otsuka, Hisako Asano, Junji Tomita, Hiroyuki Shindo, and Yuji Matsumoto, “Length-controllable abstractive summarization by guiding with summary prototype,” arXiv preprint arXiv:2001.07331, 2020.
- [10] Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang, “DialogSum: A real-life scenario dialogue summarization dataset,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, Aug. 2021, pp. 5062–5074, Association for Computational Linguistics.
- [11] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
- [12] Junpeng Liu, Yanyan Zou, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Caixia Yuan, and Xiaojie Wang, “Topic-aware contrastive learning for abstractive dialogue summarization,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, Nov. 2021, pp. 1229–1243, Association for Computational Linguistics.
- [13] Chin-Yew Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.
- [14] Xiachong Feng, Xiaocheng Feng, and Bing Qin, “A survey on dialogue summarization: Recent advances and new frontiers,” arXiv preprint arXiv:2107.03175, 2021.
- [15] Chien-Sheng Wu, Linqing Liu, Wenhao Liu, Pontus Stenetorp, and Caiming Xiong, “Controllable abstractive dialogue summarization with sketch supervision,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, Aug. 2021, pp. 5108–5122, Association for Computational Linguistics.
- [16] Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer, “SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization,” arXiv preprint arXiv:1911.12237, 2019.
- [17] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi, “BERTScore: Evaluating text generation with BERT,” in International Conference on Learning Representations, 2020.
- [18] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, et al., “The ami meeting corpus: A pre-announcement,” in International workshop on machine learning for multimodal interaction. Springer, 2005, pp. 28–39.
- [19] Guy Feigenblat, Chulaka Gunasekara, Benjamin Sznajder, Sachindra Joshi, David Konopnicki, and Ranit Aharonov, “TWEETSUMM - a dialog summarization dataset for customer service,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, Nov. 2021, pp. 245–260, Association for Computational Linguistics.
- [20] Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, and Dragomir Radev, “QMSum: A new benchmark for query-based multi-domain meeting summarization,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, June 2021, pp. 5905–5921, Association for Computational Linguistics.
- [21] Don Tuggener, Margot Mieskes, Jan Deriu, and Mark Cieliebak, “Are we summarizing the right way? a survey of dialogue summarization data sets,” in Proceedings of the Third Workshop on New Frontiers in Summarization, Online and in Dominican Republic, Nov. 2021, pp. 107–118, Association for Computational Linguistics.
- [22] Abigail See, Peter J. Liu, and Christopher D. Manning, “Get to the point: Summarization with pointer-generator networks,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, July 2017, pp. 1073–1083, Association for Computational Linguistics.
- [23] Zhengyuan Liu, Angela Ng, Sheldon Lee, Ai Ti Aw, and Nancy F Chen, “Topic-aware pointer-generator networks for summarizing spoken conversations,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 814–821.
6 Related Work
Dialogue summarization has attracted increasing research attention with the availability of large-scale labeled datasets. Data from several domains can be formatted as dialogues for summarization, including meetings, emails, customer service, and chit-chat [18, 16, 19, 20]. Each released dataset has its own summarization objective, as introduced in [21]. In this work, we focus on the DialogSum [10] and SAMSum [16] datasets because they are the first large-scale general-purpose chit-chat dialogue summarization datasets.
Dialogue summarization models are mainly abstractive neural models because of the interactive nature of dialogues. Information is scattered across utterances, which poses challenges to summarizing dialogues. [22] propose a pointer-generator network to copy words from the source content through attention mechanisms. [4] and [23] leverage topic segmentation and conversational structures to better model information exchange in the encoding process of summarization. [7] proposes to improve co-reference handling by adjusting the attention distribution within summarization models. In terms of length control, [5] and [8] propose to use length embeddings as input to the decoder of LSTM-based or CNN-based models. [9] suggests controlling the summary length by first extracting salient tokens from the original text. In comparison, our goal is not to propose a better controllable model but to study the importance of summary length. No prior work focuses on summary length analysis for dialogue summarization systems. We initiate the first cross-human study with the availability of a multi-reference dataset.
Table 8 shows more details of DialogSum and SAMSum datasets. The datasets are composed of dialogues of daily activities and human-written summaries. Each dialogue is associated with one human-written summary except for the test set of DialogSum, where three references are given. More dataset details can be found in [10] and [16].
Table 8. Dataset statistics for DialogSum and SAMSum.
Dataset   | # Train | # Val | # Test | Comp. Rate
DialogSum | 12,460  | 1,500 | 1,500  | 17.04%
SAMSum    | 14,731  | 818   | 819    | 21.65%
7 More Results and Case Study
We show the results with BARTBase as the baseline in Table 9. In general, they show the same trend as the results of BARTLarge in Table 5, and similar conclusions can be drawn.
Table 10 is a case study of LA-BART on length controllability. By increasing the input length signal, we observe that longer summaries can be generated accordingly, and more dialogue details are conveyed in the generated summary.
8 Verbose Problem of Pre-trained Models
In Section 2, we identify the verbose problem of existing pre-trained generation models when applied to summarization tasks. BART and T5 are two popular pre-trained encoder-decoder models. During pre-training, their inputs and outputs have similar lengths, which biases the models toward generating outputs of similar length to the input. In contrast, summarization is an information compression process, and the compression rate is usually below 30% (Table 8), i.e., the output length is less than 30% of the corresponding input length. This applies to both chit-chat dialogue summarization and other summarization types. Therefore, researchers should be aware of the verbose problem; it is also an opportunity to improve current pre-trained dialogue summarization models.
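The compression rate in Table 8 can be estimated as the ratio between summary length and dialogue length. A minimal word-level sketch, consistent with the whitespace-based length count used earlier, is shown below; whether the reported rates are word- or token-based is not stated, so that detail is an assumption.

```python
def compression_rate(dialogues, summaries):
    """Average ratio of summary length to dialogue length, counted in words."""
    rates = [len(s.split()) / max(len(d.split()), 1)
             for d, s in zip(dialogues, summaries)]
    return 100.0 * sum(rates) / len(rates)   # e.g. roughly 17% for DialogSum, 22% for SAMSum
```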
Table 9. Results with BARTBase as the baseline on DialogSum and SAMSum (same layout as Table 5).
Model          | Sum. Len. | DialogSum R-1 / R-2 / R-L / Len. | SAMSum R-1 / R-2 / R-L / Len.
BARTBase       | ✗         | 45.12 / 18.80 / 42.75 / 5.9      | 51.39 / 26.66 / 48.73 / 9.1
+w/ len. out.  | ✗         | 45.27 / 18.76 / 42.99 / 5.7      | 51.19 / 25.87 / 47.88 / 7.7
LA-BARTBase    | pseudo    | 45.48 / 18.93 / 43.14 / 4.6      | 51.60 / 26.58 / 48.26 / 7.8
+w/ len. out.  | pseudo    | 45.41 / 19.08 / 43.32 / 4.9      | 51.36 / 26.30 / 47.93 / 7.9
LA-BARTBase    | ✓         | 47.50 / 20.34 / 44.96 / 1.3      | 55.87 / 29.45 / 51.56 / 2.2
+w/ len. out.  | ✓         | 47.61 / 20.73 / 45.39 / 1.7      | 55.82 / 29.06 / 51.49 / 2.3
Table 10. Case study of LA-BART length controllability on a DialogSum dialogue.
Input Dialogue:
#Person1#: Happy Birthday, this is for you, Brian.
#Person2#: I'm so happy you remember, please come in and enjoy the party. Everyone's here, I'm sure you have a good time.
#Person1#: Brian, may I have a pleasure to have a dance with you?
#Person2#: Ok.
#Person1#: This is really wonderful party.
#Person2#: Yes, you are always popular with everyone. and you look very pretty today.
#Person1#: Thanks, that's very kind of you to say. I hope my necklace goes with my dress, and they both make me look good I feel.
#Person2#: You look great, you are absolutely glowing.
#Person1#: Thanks, this is a fine party. We should have a drink together to celebrate your birthday.
Reference:
#Person1# and Brian are at the birthday party of Brian. Brian thinks #Person1# looks great and is popular.
LA-BART outputs with increasing length signals:
- #Person1# celebrates Brian's birthday with him.
- #Person1# dances with Brian at his birthday party.
- #Person1# celebrates Brian's birthday and dances with him at the party. Brian thinks it's great.
- #Person1# wishes Brian a happy birthday and invites him to dance. Brian agrees and admires #Person2#'s dress and outfit.
- #Person1# wishes Brian a happy birthday and invites him to dance. Brian agrees and admires #Person2#'s dress and necklace. They will have a drink together.
- #Person1# wishes Brian a happy birthday and invites him to dance. Brian agrees and compliments #Person2#'s dress and necklace. They think the party is fine and decide to have a drink together.
- #Person1# wishes Brian a happy birthday and invites him to dance. Brian agrees and compliments #Person2#'s dress and necklace. They think the party is fine and decide to have a drink together to celebrate his birthday.