Controllable Topic-Focused Abstractive Summarization
Abstract.
Controlled abstractive summarization focuses on producing condensed versions of a source article that cover specific aspects, by shifting the distribution of generated text towards a desired style, e.g., a set of topics. Consequently, the resulting summaries can be tailored to user-defined requirements. This paper presents a new Transformer-based architecture capable of producing topic-focused summaries. The architecture modifies the cross-attention mechanism of the Transformer to bring topic-focus control to the generation process without adding any further parameters to the model. We show that our model sets a new state of the art on the NEWTS dataset for topic-focused abstractive summarization, both in terms of ROUGE scores and a topic-prevalence score. Moreover, we show via extensive experiments that our proposed topical cross-attention mechanism can be plugged into various Transformer models, such as BART and T5, improving their performance on the CNN/DailyMail and XSum benchmark datasets for abstractive summarization. This is achieved via fine-tuning, without requiring training from scratch. Finally, we show through human evaluation that our model generates more faithful summaries, outperforming the state-of-the-art Frost model.
Automatic document summarization produces a condensed version of a source document, covering its main aspects. Summarization systems are mainly classified into two categories: extractive summarization and abstractive summarization.
Extractive summarization (Nallapati et al., 2017) selects sentences of a source document based on a scoring scheme and combines those exact sentences to produce a summary. Conversely, abstractive summarization aims at producing shortened versions of a source document by generating sentences that do not necessarily appear in the original document. The latter has recently received increased attention, owing to the capabilities of neural summarizers and pre-trained conditional language models (Lewis et al., 2020; Raffel et al., 2019). The advent of sequence-to-sequence (seq2seq) architectures based on Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) with attention and pointer-copy mechanisms (Nallapati et al., 2017; Bahdanau et al., 2014; See et al., 2017), followed by Transformer architectures with multi-headed self-attention (Vaswani et al., 2017), significantly contributed to this trend.

Transformer-based models (Vaswani et al., 2017) generate high-quality summaries, reaching unparalleled levels of fidelity and even outperforming human-written LEAD-3 baselines on summarization tasks (Lewis et al., 2020; Bahrainian et al., 2022). These models consist of an encoder that generates a rich representation of a source article, a decoder that generates summary tokens guided by the encoder representation, and learnable attention distributions that connect the encoder and decoder so that the model can be trained end-to-end. Moreover, the generation capabilities of these models can be controlled and directed towards producing text with a specific word distribution, e.g., focusing on topics of interest or even generating more faithful summaries (Dou et al., 2021).
In this paper, we present a new abstractive summarization framework featuring topic-focused text generation, which adjusts the topics of a generated summary to conform to a target probability distribution over words, i.e., topics. A main component of this model is the topical cross-attention mechanism used to connect the encoder and decoder, guided by a topic model. We introduce the new architecture by adopting a topical attention mechanism that can be easily integrated into any Transformer model. We name our model Conformer, for CONtrolled topic-based transFORMER. It uses a topic model to compute distributions over words which, in turn, are learned by a feedforward neural network that guides the cross-attention mechanism of a summarization model to give more weight to specific user-defined topics when generating a summary. Using conditional generation (e.g., prompting), we show that the model further focuses its generated summaries on a target topic distribution. In this setting, a topic is a probability distribution over the vocabulary (i.e., a point on the simplex) such that the higher a word's probability, the stronger its presence in that topic. Therefore, the co-occurrence of high-probability words forms a high-level semantic concept that may be the focus of a summary. Figure 1 illustrates an example document from the NEWTS dataset (Bahrainian et al., 2022) summarized twice, each time focusing on one of its main topics. We show that our model, Conformer, is not only highly effective in standard abstractive summarization, but also sets a new state-of-the-art performance on controlled topic-based summarization.
Latent Dirichlet Allocation (LDA) topic modeling (Blei et al., 2003) follows the same co-occurrence notion and is used to compute topics from corpora of textual documents. Conformer utilizes LDA for distinguishing topics. While Conformer is initialized using a pre-trained BART (Lewis et al., 2020) model, we also demonstrate the ease of integrating our topic control mechanism into other Transformer models by integrating it into the T5 (Text-to-Text Transfer Transformer) model (Raffel et al., 2019). Our experimental results show that Conformer improves abstractive summarization performance on the CNN/DailyMail and XSum benchmark datasets and sets a new state of the art on NEWTS for topic-based abstractive summarization.
The main contributions of this paper are:
- We present Conformer, a fundamentally new architecture for abstractive summarization based on a pre-trained conditional language model, requiring no additional parameters beyond those of the original pre-trained model.
- We show that Conformer is highly effective at controlling the topical focus of a generated summary and demonstrate that the proposed architecture improves the faithfulness of the generated summaries.
The remainder of this paper is organized as follows: Section 1 presents related work on abstractive summarization as well as the use of topic models (Bahrainian et al., 2021a) in summarization. In Section 2, we present our proposed topic-sensitive Transformer-based summarization model. In Section 3, we evaluate our proposed approach comprehensively. Finally, Section 4 concludes the paper and presents an outlook on future directions.
1. Related Work
In this section, we discuss the body of related work on (1) attention-based abstractive summarization models, (2) controlled text generation, and (3) the use of topic models in summarization.
1.1. Attention in Abstractive Summarization
One of the early deep learning architectures shown to be effective in abstractive summarization was the attention-based encoder-decoder proposed by Bahdanau et al. (2014), also known as vanilla seq2seq, and applied to summarization by Nallapati et al. (2016). Attention mechanisms have been shown to enhance the basic encoder-decoder model (Bahdanau et al., 2014). The main bottleneck of the basic encoder-decoder architecture was its fixed-size representation ("thought vector"), which could not capture all the relevant information of the input sequence as the model or input scales up. The attention mechanism, which relies on the notion that at each generation step only parts of the input are relevant, has proven to be an effective solution to this problem. Following the vanilla seq2seq scheme, models such as pointer networks (Vinyals et al., 2015) and the Pointer-Generator Network (PGN) with copy mechanism (See et al., 2017) emerged in an effort to address the challenge of out-of-vocabulary words.
Attention was later shown to be effective in Transformer architectures (Vaswani et al., 2017) as well, and several large pre-trained conditional language models for text generation were introduced. One such model is T5 (Raffel et al., 2019), which can be fine-tuned for different seq2seq tasks. Another model in this category is BART (Lewis et al., 2020), a denoising autoencoder for pre-training seq2seq models. BART is trained by “corrupting text with an arbitrary noising function and learning a model to reconstruct the original text” (Lewis et al., 2020). Our proposed model, Conformer, is initialized from a pre-trained BART model. To demonstrate the ease of integrating our proposed topical cross-attention mechanism into arbitrary seq2seq Transformer models without adding new model parameters, we also plug this component into a T5 model and show the efficacy of this approach.
1.2. Controlled Text Summarization
Previous work has studied various control mechanisms to enforce a specific style (Blinova et al., 2023), form, or use of content in a generated summary. Gehrmann et al. (2018) propose a content selector to choose phrases from a source document that should be part of a generated summary. Likewise, Li et al. (2018) introduce an information selection layer to explicitly model the information selection process in abstractive summarization. Other work has studied guiding the generation process based on topics (Bahrainian et al., 2021b; Dathathri et al., 2020), aspects (Gu et al., 2022), entities (Narayan et al., 2021; Maddela et al., 2022), or word relations such as triplets (Dou et al., 2021), in order to shift the content of a generated summary toward a desired target form, such as formal or informal text (Briakou et al., 2021), biased versus neutral text (Pryzant et al., 2020), simplified text (Cao et al., 2020), or even a sentimental stance (Shen et al., 2017).
Here we study controlled text summarization using an information selection signal in the form of latent topics.
1.3. Topic Models in Summarization
Previous work has utilized topic information in seq2seq problems such as neural response generation (Bahrainian et al., 2021b; Xing et al., 2017). The work of Xing et al. (2017) uses a topic model named Twitter LDA to respond to messages, a different objective from ours. Sim et al. (2022) study topic preservation in multi-document summarization and conclude that many summaries appear less informative due to a lack of topic coherence between them and their source articles. Cui and Hu (2021) incorporate a topic-based approach in a multi-document summarization setting to jointly discover latent topics that can act as a semantic bridge across documents and provide global information to guide summary generation. Liu and Yang (2021) use a classifier to distill the training of a summarization model with respect to topical consistency between an input document and a generated summary, observing that this improves standard abstractive summarization.
Another recent piece of related work integrates topic information into Transformer models for abstractive summarization (Wang et al., 2020). The authors utilize the Poisson factor analysis (Zhou et al., 2012) topic model. They add three new modules to a Transformer architecture at the cost of additional parameters. These three modules are semantic-informed attention, topic embedding with masked attention, and document-related modulation, which together integrate topic information into a Transformer architecture. Our work differs from theirs in three crucial ways: (1) our design does not introduce any additional model parameters to the original Transformer model, making it more efficient. (2) No training from scratch is required for the models in our approach, and standard fine-tuning for several epochs brings a significant improvement in summarization performance. (3) Our approach is more generic in that it can be integrated into various families of seq2seq models with attention and is not limited to Transformers.
In summary, topic information has been used in previous neural models as an input, and Wang et al. (2018) argue that it results in the diversification of words appearing in summaries. However, the novelty of this paper lies in using this source of information to systematically influence the output summaries to conform to a specific topic distribution. This results in enhanced ROUGE scores for standard abstractive summarization while introducing a mechanism to control the topical distribution of generated summaries. Moreover, our approach is highly generic such that it can be easily integrated into existing seq2seq models at no additional computational cost (i.e., no new parameters are added to the Transformer model). At the same time, Conformer’s novel topic-guided generation produces highly faithful summaries.
2. The Conformer Model
In this section, we introduce our new architecture featuring a topic-distribution adjusting mechanism capable of controlling the topics covered in the generated summaries. The model is equipped with a topical attention mechanism that can be easily plugged into any seq2seq Transformer-based model.
In the following, we first present the computation of topic-word probability distributions. Subsequently, we detail the integration of the topic information into the Transformer architecture.
2.1. Computing Topic-word Distributions using LDA
Although we use LDA as our topic model of choice, any other topic model that can factorize a corpus of text into two matrices, namely, the topic-word matrix and the document-topic matrix, may be used as a replacement. This requirement matches the properties of most existing topic models.
In order to compute the topical attention weights, after training an LDA model on the training data, we map the target summary corresponding to each document to its LDA space. In other words, we compute the prevalence of all topics for each target summary, which results in a per-target-summary probability distribution over all LDA topics. Furthermore, for each topic, LDA computes a probability distribution over the words of the entire vocabulary $V$, i.e., over the simplex. Therefore, for a given input document we can calculate a topical word vector $\mathbf{t}$ of dimension $|V|$ considering all the words in that document, such that:

$$\mathbf{t} \;=\; \sum_{k=1}^{K} \theta_k \, \boldsymbol{\phi}_k \qquad (1)$$

where $\theta_k$ is the probability of LDA topic $k$ being present in the target summary, and $\boldsymbol{\phi}_k$ is the $|V|$-dimensional vector consisting of the probabilities of all words in vocabulary $V$ under topic $k$.
After having obtained the topic distribution of a target summary, we can compute the probability scores of all tokens/words of the corresponding source document. In this way, during training, all tokens/words from the source article are scored according to the target topics that an abstractive summarization decoder needs to focus on when generating output summaries. This also means that the topic-words distribution is sorted and trimmed to match the source document's dimensions. In other words, each word from the source article is associated with a corresponding topic-words probability score computed using the LDA model. Stopwords and rare words that do not appear in the topic model vocabulary are assigned a very small positive probability $\epsilon$.
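As a concrete illustration, the following sketch shows one way this computation could look with a gensim LDA model; the function names, the use of gensim, and the exact value of the small constant are our assumptions rather than the authors' implementation.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel


def topic_word_vector(lda: LdaModel, dictionary: Dictionary, target_summary_tokens):
    """Eq. (1): t = sum_k theta_k * phi_k, where theta is the topic distribution
    of the target summary and phi_k is topic k's distribution over the vocabulary."""
    bow = dictionary.doc2bow(target_summary_tokens)
    theta = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        theta[topic_id] = prob
    phi = lda.get_topics()          # shape (K, V): one word distribution per topic
    return theta @ phi              # shape (V,): the topic-word vector t


def score_source_tokens(t, dictionary: Dictionary, source_tokens, epsilon=1e-8):
    """Score each source-document token under t; tokens missing from the topic-model
    vocabulary (stopwords, rare words) receive a small positive constant epsilon."""
    token2id = dictionary.token2id
    return np.array([t[token2id[w]] if w in token2id else epsilon for w in source_tokens])
```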
The advantage of using topics of target summaries to guide the topical attention is that, if a word in the source document does not appear in any of the topics present in the target summary, it receives a near-zero probability in the topic-words probability vector $\mathbf{t}$ described in Equation (1).
By relying on topic information derived from the target summaries, we achieve two goals: (1) This focuses the summary generation attention precisely on the target topics as indicated by the target summaries, making our model a suitable solution for both standard and controlled abstractive summarization. This will direct the model to assign very low probabilities to tokens/words of the source document outside the scope of the corresponding target summary’s topics. (2) According to previous research (Wang et al., 2018) the incorporation of topical information enables a language generation model to further diversify the vocabulary used in the generation process, thus generating a richer summary.
Due to the absence of target summary information at test time, we train a feedforward neural network to map an input vector representing the frequency of words in an input article to its corresponding topic-words distribution computed by LDA as defined in Equation (1). The feedforward network has a single layer with input and output of equal size, and is trained by optimizing the cross-entropy loss. The mapping of source-article word frequencies to topic-words values is learned in the training phase, when target summaries are available. Subsequently, the topic-words distribution computed by LDA is replaced by the one learned by the feedforward network, both during training and testing, to circumvent the absence of target summaries at test time. To elaborate, the feedforward network is trained based on LDA as described above, prior to fine-tuning the summarizer model. From this point on, $\mathbf{t}$ refers to the topic-words distribution computed by the feedforward network.
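A minimal PyTorch sketch of such a mapping network, under the assumption of a single linear layer with equal input and output size and a cross-entropy loss against the (soft) LDA target distribution; the class and function names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopicWordPredictor(nn.Module):
    """Maps a document's word-frequency vector to a topic-word distribution,
    approximating the LDA-derived target of Eq. (1) when summaries are unavailable."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.linear = nn.Linear(vocab_size, vocab_size)  # single layer, equal in/out size

    def forward(self, word_freqs: torch.Tensor) -> torch.Tensor:  # (batch, V)
        return F.log_softmax(self.linear(word_freqs), dim=-1)     # log-probabilities


def train_step(model, optimizer, word_freqs, lda_target):
    """One optimization step with a soft-target cross-entropy:
    loss = -sum_w p_LDA(w) * log p_pred(w), averaged over the batch."""
    optimizer.zero_grad()
    log_pred = model(word_freqs)
    loss = -(lda_target * log_pred).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```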
2.2. Integrating Topic Information into Transformers
This section discusses the integration of topical attention into the seq2seq Transformer architecture. Although we use a pre-trained BART model to initialize the Conformer, any summarization Transformer model may be used as well. In the experimental section, we demonstrate this using a T5 model.
Once we have calculated the topic probabilities of an input source document according to the topics of its corresponding target summary, we end up with a vector of topic probability values from the feedforward network of size $|V|$. Subsequently, we trim and reorder this vector to match the size and the order of the input document. Figure 2 shows the integration of topical attention into the Transformer architecture, which is shared by both BART and T5 models.
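In code, this trimming and reordering amounts to a simple gather of per-token scores, sketched below with assumed variable names (and assuming the topic-word scores have already been mapped into the summarizer's token vocabulary).

```python
import torch


def per_token_topic_scores(topic_word_vector: torch.Tensor,
                           input_ids: torch.Tensor) -> torch.Tensor:
    """Trim and reorder the |V|-dimensional topic-word vector for one document:
    look up, for each input token position, the score of its vocabulary entry.

    topic_word_vector: (V,)       output of the feedforward network, assumed to be
                                  indexed by the summarizer's token vocabulary.
    input_ids:         (src_len,) token ids of the (possibly truncated) source article.
    Returns:           (src_len,) per-token topic scores aligned with the encoder input.
    """
    return topic_word_vector[input_ids]
```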
To integrate this vector into a Transformer architecture, we insert the topic-words probability scores derived from the output of the feedforward network into the cross-attention modules that connect the encoder with the decoder. That is, for each word of the source article, we compute a topic probability score indicating the word's likelihood of appearing in the summary. Recalling the original Transformer paper (Vaswani et al., 2017), each encoder layer computes keys, queries, and values within the same layer. In the cross-attention module, however, the keys and values are computed by the encoder, while the decoder computes the queries.
This is because cross-attention essentially bridges the encoder blocks and the decoder blocks. Figure 2 shows how the key and value matrices flow from the encoder side to the decoder.
While, for reasons of computational efficiency, the computations in a Transformer are performed at the matrix level, in the following, we present the steps of topical attention of the Conformer at the document level for illustrative clarity:
Recall from the original Transformer paper (Vaswani et al., 2017) that attention is computed as:
$$\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \qquad (2)$$
The same equation holds in the case of cross-attention, with the keys and values being outputs of the encoder while the queries are computed on the decoder side. We observe from the equation that a dot product is computed between the keys and the queries. For a single source document and summary pair, this dot product yields, at each position, a single score indicating the importance of the corresponding key given the query. This score is then normalized by dividing it by $\sqrt{d_k}$ and passing it through a softmax function. At this step, we compute an average between the result of the latter softmax function and the softmax of the topic-words probabilities derived from Equation (1). We therefore define topical attention as:
$$\mathrm{TopicalAttention}(Q, K, V, \mathbf{t}) \;=\; \frac{1}{2}\left(\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) + \mathrm{softmax}(\mathbf{t})\right) V \qquad (3)$$
Topical attention effectively reduces the importance of a source-document word that is not covered by the topics of the respective target summary: a small value (or even zero) is added to its attention weight, and the sum is divided by 2, so the resulting weight becomes too small for that word to be the focus of the generation process. Conversely, the weight of a word related to a topic discussed in the target summary can potentially increase or remain largely unchanged. As explained above, Conformer obtains its topic information from the feedforward network component and is trained in an end-to-end fashion.
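A minimal sketch of the resulting cross-attention step, Eq. (3), is shown below; it is our own single-head rendering with hypothetical names, not the authors' exact implementation, but it illustrates that the modification adds no parameters.

```python
import math
import torch
import torch.nn.functional as F


def topical_cross_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                            token_topic_scores: torch.Tensor) -> torch.Tensor:
    """Eq. (3): average the usual cross-attention weights with the softmax of the
    per-source-token topic scores (cf. per_token_topic_scores above). No new parameters.

    q:                  (batch, tgt_len, d_k) decoder queries
    k, v:               (batch, src_len, d_k) encoder keys and values
    token_topic_scores: (batch, src_len)      topic scores aligned with the encoder input
    """
    d_k = q.size(-1)
    attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)  # (batch, tgt, src)
    topic = F.softmax(token_topic_scores, dim=-1).unsqueeze(1)          # (batch, 1, src)
    weights = 0.5 * (attn + topic)                                      # rows still sum to 1
    return weights @ v                                                  # (batch, tgt, d_k)
```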

3. Evaluation
3.1. Datasets
We evaluate the effectiveness of all methods on three publicly available benchmark datasets.
NEWTS (Bahrainian et al., 2022) is a topical abstractive summarization dataset that provides two alternative human-written summaries, with different topical foci, for each source article in a subset of the CNN/DailyMail collection. This dataset contains 4,800 training examples and 1,200 test samples. We use this dataset for evaluating controlled topical summarization.
The CNN/DailyMail dataset (Hermann et al., 2015; Nallapati et al., 2016), contains news articles from the CNN and Daily Mail websites. The experiments reported in this paper are based on the non-anonymized version of the dataset, containing 287,226 pairs of training articles and summaries, 13,368 validation pairs, and 11,490 test pairs.
The XSum (Narayan et al., 2018) dataset is included to show the efficacy of our approach on shorter news articles. This dataset is a collection of online articles from the British Broadcasting Corporation (BBC) and their corresponding summaries. It contains 204,045 training samples, 11,332 validation samples, and 11,334 test samples.
3.2. Experimental Results
Topic-focused Summarization: The primary feature of Conformer is controlling the topic distribution of the output summaries. Therefore, we first evaluate Conformer on the NEWTS dataset (Bahrainian et al., 2022) to measure the topical focus of the generated summaries with respect to the target topics that should appear in a summary. This dataset presents each topic with four prompt types, namely, topic words, topic phrases, topic IDs, and topic sentences. In this paper, we only use the sentence-based prompt type to steer our Conformer model towards topic-focused summaries. In this experiment, our model is initialized from the BART base model. During the fine-tuning phase, the topical attention distribution enforces the presence of the target topics while limiting the rest; that is, the weights of topics that are not the target of a summary are reduced to nearly zero. Table 1 presents the results of this experiment. We report the baseline results from the original paper, using the same naming scheme. In the table, ‘b’ following a model name indicates a ‘base’ model size, while ‘L’ indicates a ‘large’ model size. Additionally, ‘T-W’ indicates a ‘topic-words’ prompt, ‘T-Ph’ a ‘topic-phrase’ prompt, ‘T-Sent’ a ‘topic-sentence’ prompt, ‘no prompt’ means that no prompting was used while fine-tuning a model, and ‘CNN-DM’ indicates that the model was fine-tuned on the same source articles of our dataset paired with their original CNN/DailyMail summaries. To compute the topic-focus score, we used the evaluation script from the GitHub repository of the NEWTS dataset (Bahrainian et al., 2022). Additionally, we include results from a very large language model, namely ChatGPT (version: Aug 2023), to compare it against our model. To do so, we use the prompt: “Summarize this article with respect to topic: ‘T-Sent’”. We observe that our model defines a new state of the art both in terms of ROUGE performance and the topic-focus score computed by an LDA model, as explained in the original paper. For completeness, the topic-focus score computes the prevalence of each target topic in the corresponding generated summary, averaged over all generated summaries. Furthermore, we observe that a significantly smaller Conformer outperforms ChatGPT on the controlled topic-focused summarization task on the NEWTS dataset. Based on the results of this experiment, we conclude that Conformer is highly capable of shifting and controlling the topics of the output summaries.
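For illustration only, a rough stand-in for this topic-focus computation is sketched below; it is not the official NEWTS evaluation script, and the function and variable names are our assumptions.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel


def topic_focus_score(lda: LdaModel, dictionary: Dictionary,
                      summaries, target_topic_ids) -> float:
    """Average prevalence of each summary's target topic under the LDA model.

    summaries:        list of tokenized generated summaries (lists of tokens)
    target_topic_ids: list of the LDA topic id each summary was supposed to focus on
    """
    prevalences = []
    for tokens, topic_id in zip(summaries, target_topic_ids):
        bow = dictionary.doc2bow(tokens)
        topic_dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        prevalences.append(topic_dist.get(topic_id, 0.0))
    return float(np.mean(prevalences))
```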
Table 1. Topic-focused summarization results on the NEWTS dataset (ROUGE scores and the LDA-based topic-focus score).

Model | R1 | R2 | RL | Topic Focus
---|---|---|---|---
BART-b + T-W | 31.14 | 10.46 | 19.94 | 0.1375 |
BART-b + T-Ph | 31.01 | 10.36 | 19.91 | 0.1454 |
BART-b + T-Sent | 30.38 | 09.70 | 19.48 | 0.1513 |
BART-b + T-ID | 30.97 | 10.23 | 20.08 | 0.1399 |
BART-b no prompt | 16.48 | 0.75 | 11.71 | 0.0080 |
BART-b CNN-DM | 26.23 | 7.24 | 17.12 | 0.1338 |
T5-b + T-W | 31.78 | 10.83 | 20.54 | 0.1386 |
T5-b + T-Ph | 31.55 | 10.75 | 20.27 | 0.1426 |
T5-b + T-Sent | 31.40 | 10.37 | 20.35 | 0.1528 |
T5-b + T-ID | 31.44 | 10.64 | 20.06 | 0.1342 |
T5-b no prompt | 30.98 | 10.19 | 20.23 | 0.1379 |
T5-b CNN-DM | 27.87 | 8.55 | 18.41 | 0.1305 |
T5-L + T-W | 30.92 | 10.01 | 20.19 | 0.1598 |
T5-L + T-Ph | 31.40 | 10.50 | 20.27 | 0.1457 |
T5-L + T-Sent | 30.64 | 09.84 | 19.91 | 0.1462 |
T5-L + T-ID | 30.35 | 9.93 | 19.77 | 0.1335 |
T5-L no prompt | 30.06 | 9.55 | 19.25 | 0.1366 |
T5-L CNN-DM | 28.44 | 8.49 | 18.61 | 0.1286 |
ProphetNet + T-W | 31.91 | 10.80 | 20.66 | 0.1362 |
ProphetNet + T-Ph | 31.56 | 10.35 | 20.17 | 0.1474 |
ProphetNet + T-Sent | 31.40 | 10.03 | 20.02 | 0.1633 |
ProphetNet no prompt | 30.22 | 9.67 | 19.27 | 0.1316 |
ProphetNet CNN-DM | 28.71 | 8.53 | 18.69 | 0.1295 |
PPLM | 29.63 | 9.08 | 18.76 | 0.1482 |
CATS | 30.12 | 9.35 | 19.11 | 0.1519 |
ChatGPT | 32.47 | 11.26 | 20.58 | 0.1573 |
Conformer-b(Ours) | 34.16 | 11.67 | 21.93 | 0.1759 |
Evaluating Topical Attention: This section evaluates our proposed model both in terms of overall summarization performance and its ability to shift the topical focus of the generated summaries. Model parameters are presented in the appendix. Following standard practice, we evaluate our proposed model against the baseline methods in terms of ROUGE 1, ROUGE 2, and ROUGE L scores using the official Perl-based implementation of ROUGE (Lin, 2004).
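The official Perl toolkit is used for all reported numbers; purely as a convenience, the sketch below shows how comparable (though not identical) scores could be computed with the Python rouge-score package, assuming predictions and references are available as plain strings.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)


def average_rouge(pairs):
    """pairs: iterable of (generated_summary, reference_summary) strings.
    Returns mean F1 for ROUGE-1, ROUGE-2, and ROUGE-Lsum."""
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeLsum": 0.0}
    count = 0
    for prediction, reference in pairs:
        scores = scorer.score(reference, prediction)   # note: (target, prediction) order
        for key in totals:
            totals[key] += scores[key].fmeasure
        count += 1
    return {key: total / count for key, total in totals.items()}
```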
Table 2. ROUGE results on the CNN/DailyMail dataset.

Models | R1 | R2 | RL
---|---|---|---
T5-Topical-base (ours) | 42.91 | 20.64 | 39.88 |
T5-base (Raffel et al., 2019) | 42.05 | 20.34 | 39.40 |
Conformer-base (ours) | 43.79 | 21.17 | 40.77 |
BART-base (Lewis et al., 2020) | 42.85 | 20.79 | 39.88 |
Conformer-large (ours) | 45.96 | 22.27 | 42.51 |
BART-large (Lewis et al., 2020) | 44.16 | 21.28 | 40.90 |
For each dataset, an LDA model is trained, followed by a feedforward network reconstructing the topic-words vector. LDA returns probability distributions over the vocabulary representing the latent topics discussed in the respective training set. Since the actual number of underlying topics $K$ is an unknown parameter of the LDA model, it needs to be estimated first. For this purpose, similar to the method proposed in (Griffiths and Steyvers, 2004; Bahrainian and Crestani, 2018), we perform a model selection process. It involves keeping the LDA hyperparameters (commonly referred to as $\alpha$ and $\beta$) fixed while assigning several values to $K$ and running the LDA model for each value. We pick the model that minimizes the negative log-likelihood $-\log P(W \mid K)$, where $W$ contains all the words in the vocabulary of all the documents in the training data. This process is repeated until we find the topic model with the optimal number of topics. The training of each LDA model on CNN/DailyMail takes nearly a day, so we could only repeat it for a limited number of values of $K$: concretely, we trained LDA models over a range of $K$ values in fixed increments and selected the optimal values for the CNN/DailyMail and XSum datasets accordingly. Subsequently, the feed-forward network learns the LDA output for each input document.
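A sketch of this selection loop using gensim is shown below; the log-likelihood is approximated by gensim's variational bound, and the hyperparameter settings in the sketch are placeholders rather than the values used in our experiments.

```python
from gensim.models import LdaModel


def select_num_topics(corpus, dictionary, candidate_ks):
    """Train one LDA model per candidate K with fixed alpha/eta and keep the model
    maximizing the variational lower bound on log P(W | K), i.e., minimizing the
    negative log-likelihood of the training corpus W."""
    best_model, best_bound = None, float("-inf")
    for k in candidate_ks:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       alpha="symmetric", eta=None, passes=5, random_state=0)
        bound = lda.bound(corpus)      # lower bound on the corpus log-likelihood
        if bound > best_bound:
            best_model, best_bound = lda, bound
    return best_model
```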
Tables 2 and 3 present the results of this experiment. For all models and metrics, we observe superior performance of the Conformer on both the CNN/DailyMail and the XSum dataset compared with the original models.
Additionally, we observe that the relative performance improvement of all models over their corresponding original models is higher for the ROUGE 1 metric. One interpretation of this observation is that the semantic similarity information obtained from the topic model encourages a higher usage of individual topic-related words in the generated summaries. For ROUGE 2 and ROUGE L, on the other hand, the language model must also respect particular grammatical and word-order patterns, so the improvements are slightly smaller than those in ROUGE 1. Dou et al. (2021) also report that generation guided by auxiliary signals, in this case topicality, improves summarization performance.
It is worth noting that the demonstrated improvements in summarization performance come at no additional cost in terms of parameters added to the original models. This property stands in stark contrast to other existing approaches such as (Wang et al., 2020), which require the addition of a significant number of model parameters. As an example, for the different Transformer models they study, they introduce between 10.25 million and 38.91 million new parameters according to their paper. Our approach, in contrast, integrates generically with Transformer-based models and requires the same run time as the unmodified variants. Furthermore, our controlled generation approach is not language-specific and can be trained for any language.
Table 3. ROUGE results on the XSum dataset.

Models | R1 | R2 | RL
---|---|---|---
T5-Topical-base | 42.70 | 19.56 | 36.01 |
T5-base (Raffel et al., 2019) | 41.79 | 18.90 | 35.04 |
Conformer-base | 43.30 | 20.03 | 36.28 |
BART-base (Lewis et al., 2020) | 42.32 | 19.14 | 35.52 |
Conformer-large | 47.67 | 24.95 | 39.81 |
BART-large (Lewis et al., 2020) | 45.14 | 22.27 | 37.25 |
Table 4. Comparison to the state of the art on the CNN/DailyMail and XSum datasets.

Models | CNN/DM R1 | CNN/DM R2 | CNN/DM RL | XSum R1 | XSum R2 | XSum RL
---|---|---|---|---|---|---
UniLM (Dong et al., 2019) | 43.33 | 20.21 | 40.51 | 42.63 | 19.10 | 33.13 |
T5-largest (Raffel et al., 2019) | 43.52 | 21.55 | 40.69 | - | - | - |
BART-large (Lewis et al., 2020) | 44.16 | 21.28 | 40.90 | 45.14 | 22.27 | 37.25 |
ProphetNet (Yan et al., 2020) | 44.20 | 21.17 | 41.30 | - | - | - |
BART+TA (Wang et al., 2020) | 44.47 | 21.39 | 41.32 | 45.76 | 22.68 | 38.03 |
GSum (Dou et al., 2021) | 45.94 | 22.32 | 42.48 | 45.40 | 21.89 | 36.67 |
BRIO (Liu et al., 2022) | 47.78 | 23.55 | 44.57 | 49.07 | 25.59 | 40.40 |
Conformer-large (ours) | 45.96 | 22.27 | 42.51 | 47.67 | 24.95 | 39.81 |
Comparison to the State of the Art: Having demonstrated the positive effect of topical attention and topic controllability, we compare our top-performing model to approaches from the literature. Baselines: The top-performing baseline models are UniLM (Dong et al., 2019), BART (Lewis et al., 2020), T5 (Raffel et al., 2019), BART+TA (Wang et al., 2020), GSum (Dou et al., 2021), and BRIO (Liu et al., 2022).
Table 4 presents the results on the CNN/DailyMail and XSum datasets. This table directly cites the results reported in the original papers, where available. We observe that our model outperforms recent Transformer models such as ProphetNet, the standard BART model, and the approach of Wang et al. (2020) on both datasets. At the same time, the top-performing approach of Wang et al. (2020), which is also based on BART-large, adds 38.91 million new parameters to the architecture, while our approach adds no further parameters to the BART-large model. Comparing our results with theirs, we conclude that Conformer is highly effective, to the point that the massive addition of hidden states in (Wang et al., 2020) does not lead to improvements over our approach. Moreover, in addition to the better computational efficiency of our model, our approach is more generic in that it can be integrated with various architectures, as demonstrated in the case of T5 models. Finally, we observe that Conformer achieves superior performance over GSum (Dou et al., 2021) on the XSum dataset while matching GSum's performance on the CNN/DailyMail dataset. However, the more recent BRIO model, with its contrastive training objective, outperforms our model. Studying the initialization of Conformer from a BRIO model remains future work.
Output Summaries: We also include sample output summaries of our proposed Conformer model in Table 6 below. The shift in the topical distribution of the words generated by the model when focusing on the topic ‘Marriage’ versus the topic ‘Charges and Arrest’ is evident in the two summaries on the right side of the table.
Table 6. Sample output summaries produced by Conformer with different topical prompts, alongside the human-written reference summary.

Human-Written Summary | Conformer (General) | Conformer (Marriage) | Conformer (Charges and Arrest)
---|---|---|---
Liana Barrientos married ten men in eleven years - even marrying six of them in one year alone . All of her marriages took place in New York state. Her first marriage took place in 1999, followed by two in 2001, six in 2002, and her tenth marriage in 2010. Barrientos allegedly described her 2010 nuptials as ’her first and only marriage’ She is reportedly divorced from four of her ten husbands. The Department of Homeland Security was ’involved’ in Barrientos’ case, the Bronx district attorney’s office has said. | 39-year-old Lianna Barrientos married ten men in eleven years and married six of them in one year alone. Authorities reportedly slapped her with offering a false instrument for filing felony charges in November 2014. According to the New York Times, she has divorced at least four times. The Department of Homeland Security was ’involved’ in Barrientos’ case, the Bronx district attorney’s office told the New York Times. According to her Facebook profile she is from the Dominican Republic and lives in Manhattan. | 39-year-old Lianna Barrientos married ten men in eleven years. She married the ten men from 1999 to 2010 and married six of the men in 2002 alone. The Times reported that for her first three 2002 marriages, Barrientos wed Koridze on Valentine’s Day, before she went on to marry Goktepe approximately two weeks later, then soon tied the knot with Paharelau. She claimed her last marriage to Salle Keita was her first and only marriage. | Authorities reportedly slapped Lianna Barrientos with offering a false instrument for filing felony charges in November 2014 She was arrested and charged in November 2014 for lying regarding her marriage to Salle Keita. Barrientos has been arrested multiple times, including for loitering, drug possession, and jumping a turnstile and trespassing, according to the Daily News. Department of Homeland Security was ’involved’ in Barrientos’ case, the Bronx district attorney’s office revealed to the Times. |
Summary Faithfulness: It has previously been demonstrated that when generation is guided by auxiliary signals such as relations between words, e.g., triplets (Dou et al., 2021), or entities (Narayan et al., 2021), models generate more faithful summaries with less hallucination. In this experiment, we investigate whether guiding the generation process by topics, which are in essence groups of related words, yields more faithful summaries. To this end, we randomly select summaries from the test set of the CNN/DailyMail dataset and study the effect. We conduct a human study with three human judges, providing them with the outputs of our model, of a standard BART-large model, and of Frost (Narayan et al., 2021), the state of the art in faithfulness. Similar to (Narayan et al., 2021), each summary is rated on a scale whose lowest value corresponds to very unfaithful content, whose midpoint corresponds to 50-50, and whose highest value indicates a very faithful summary. The results of this experiment show that our model, Conformer, receives an average score of 4.62, while Frost and BART-large score 4.39 and 4.18, respectively, in how faithful their summaries are to the original document. This demonstrates another advantage of our model, indicating that a more granular guided approach (i.e., topics versus entities) is superior in terms of faithfulness and factual correctness. In the future, we will further investigate the proclivity for hallucination when the model is prompted with various topic prompts, including those not appearing in the source article.
4. Conclusions and Future Work
In this article, we introduced Conformer, a new summarization model that conforms its summaries to a target topic distribution. It is based on a novel topical attention mechanism that is sufficiently generic to be integrated end-to-end into most seq2seq models via cross-attention. Our approach does not add any model parameters; therefore, it is computationally as efficient as the original models. For the same reason, fine-tuning Conformer is as easy as fine-tuning the original BART model. We showed that the incorporation of topical attention offers significant improvements in terms of ROUGE scores and that our Conformer model achieves state-of-the-art performance in topic-based abstractive summarization on the NEWTS dataset. The approach is not limited to a specific language and can be trained for any language. Finally, we showed that our approach yields summaries that outperform the state of the art in faithfulness. In the future, we plan to study controlled text generation that includes or excludes various informative signals other than topics, and to investigate incorporating user preference signals into the generation process.
Appendix A Model Parameters
For completeness, we report the settings used for fine-tuning the Transformer-based variants. The feed-forward network reconstructing the word-topic probabilities is trained for 15 epochs with a single layer of size 300. The models were fine-tuned for 5 epochs on a Tesla V100 GPU.
References
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
- Bahrainian (2019) Seyed Ali Bahrainian. 2019. Just-In-Time Information Retrieval and Summarization for Personal Assistance. Ph.D. Dissertation. Università della Svizzera italiana.
- Bahrainian and Crestani (2018) Seyed Ali Bahrainian and Fabio Crestani. 2018. Augmentation of Human Memory: Anticipating Topics That Continue in the Next Meeting. In Proceedings of the 2018 Conference on Human Information Interaction & Retrieval (CHIIR ’18). 150–159.
- Bahrainian et al. (2022) Seyed Ali Bahrainian, Sheridan Feucht, and Carsten Eickhoff. 2022. NEWTS: A Corpus for News Topic-Focused Summarization. In Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland, 493–503.
- Bahrainian et al. (2021a) Seyed Ali Bahrainian, Martin Jaggi, and Carsten Eickhoff. 2021a. Self-Supervised Neural Topic Modeling. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 3341–3350.
- Bahrainian et al. (2021b) Seyed Ali Bahrainian, George Zerveas, Fabio Crestani, and Carsten Eickhoff. 2021b. CATS: Customizable Abstractive Topic-Based Summarization. ACM Trans. Inf. Syst. 40, 1, Article 5 (oct 2021), 24 pages.
- Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (2003), 993–1022.
- Blinova et al. (2023) Sofia Blinova, Xinyu Zhou, Martin Jaggi, Carsten Eickhoff, and Seyed Ali Bahrainian. 2023. SIMSUM: Document-level Text Simplification via Simultaneous Summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 9927–9944. https://aclanthology.org/2023.acl-long.552
- Briakou et al. (2021) Eleftheria Briakou, Di Lu, Ke Zhang, and Joel R. Tetreault. 2021. Olá, Bonjour, Salve! XFORMAL: A Benchmark for Multilingual Formality Style Transfer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021. Association for Computational Linguistics, 3199–3216.
- Cao et al. (2020) Yixin Cao, Ruihao Shui, Liangming Pan, Min-Yen Kan, Zhiyuan Liu, and Tat-Seng Chua. 2020. Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1061–1071.
- Cui and Hu (2021) Peng Cui and Le Hu. 2021. Topic-Guided Abstractive Multi-Document Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 1463–1472. https://doi.org/10.18653/v1/2021.findings-emnlp.126
- Dathathri et al. (2020) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In International Conference on Learning Representations.
- Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems. 13042–13054.
- Dou et al. (2021) Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2021. GSum: A General Framework for Guided Neural Abstractive Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 4830–4842.
- Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-Up Abstractive Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. 4098–4109.
- Griffiths and Steyvers (2004) Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National academy of Sciences (2004).
- Gu et al. (2022) Yuxuan Gu, Xiaocheng Feng, Sicheng Ma, Lingyuan Zhang, Heng Gong, and Bing Qin. 2022. A Distributional Lens for Multi-Aspect Controllable Text Generation. https://doi.org/10.48550/ARXIV.2210.02889
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems. 1693–1701.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, 7871–7880.
- Li et al. (2018) Wei Li, Xinyan Xiao, Yajuan Lyu, and Yuanzhuo Wang. 2018. Improving Neural Abstractive Document Summarization with Explicit Information Selection Modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. 1787–1796.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out (2004).
- Liu and Yang (2021) Jingzhou Liu and Yiming Yang. 2021. Enhancing Summarization with Text Classification via Topic Consistency. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part III (Bilbao, Spain). 661–676.
- Liu et al. (2022) Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing Order to Abstractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 2890–2903.
- Maddela et al. (2022) Mounica Maddela, Mayank Kulkarni, and Daniel Preotiuc-Pietro. 2022. EntSUM: A Data Set for Entity-Centric Extractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland, 3355–3366.
- Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016. 280–290.
- Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1797–1807.
- Narayan et al. (2021) Shashi Narayan, Yao Zhao, Joshua Maynez, Gonçalo Simões, Vitaly Nikolaev, and Ryan McDonald. 2021. Planning with Learned Entity Prompts for Abstractive Summarization. Transactions of the Association for Computational Linguistics 9 (2021), 1475–1492.
- Pryzant et al. (2020) Reid Pryzant, Richard Diehl Martinez, Nathan Dass, Sadao Kurohashi, Dan Jurafsky, and Diyi Yang. 2020. Automatically Neutralizing Subjective Bias in Text. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 480–489.
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
- See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. 1073–1083.
- Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi S. Jaakkola. 2017. Style Transfer from Non-Parallel Text by Cross-Alignment. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 6830–6841.
- Sim et al. (2022) Mong Yuan Sim, Wei Emma Zhang, and Congbo Ma. 2022. An Empirical Study on Topic Preservation in Multi-Document Summarization. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop. Association for Computational Linguistics, Online, 61–67.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems. 2692–2700.
- Wang et al. (2018) Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. 2018. A Reinforced Topic-aware Convolutional Sequence-to-sequence Model for Abstractive Text Summarization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18). 4453–4460.
- Wang et al. (2020) Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, and Mingyuan Zhou. 2020. Friendly Topic Assistant for Transformer Based Abstractive Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 485–497.
- Xing et al. (2017) Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In Thirty-First AAAI Conference on Artificial Intelligence.
- Yan et al. (2020) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training. arXiv preprint arXiv:2001.04063 (2020).
- Zhou et al. (2012) Mingyuan Zhou, Lauren Hannah, David Dunson, and Lawrence Carin. 2012. Beta-negative binomial process and Poisson factor analysis. In Artificial Intelligence and Statistics. PMLR, 1462–1471.