
Attention-based Multi-hypothesis Fusion for Speech Summarization

Abstract

Speech summarization, which generates a text summary from speech, can be achieved by combining automatic speech recognition (ASR) and text summarization (TS). With this cascade approach, we can exploit state-of-the-art models and large training datasets for both subtasks, i.e., Transformer for ASR and Bidirectional Encoder Representations from Transformers (BERT) for TS. However, ASR errors directly affect the quality of the output summary in the cascade approach. We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary. We investigate several schemes to combine ASR hypotheses. First, we propose using the sum of sub-word embedding vectors weighted by their posterior values provided by an ASR system as an input to a BERT-based TS system. Then, we introduce a more general scheme that uses an attention-based fusion module added to a pre-trained BERT module to align and combine several ASR hypotheses. Finally, we perform speech summarization experiments on the How2 dataset and a newly assembled TED-based dataset that we will release with this paper (https://github.com/nttcslab-sp-admin/TEDSummary). These experiments show that retraining the BERT-based TS system with these schemes can improve summarization performance and that the attention-based fusion module is particularly effective.

Index Terms—  Speech Summarization, Automatic Speech Recognition, BERT, Attention-based Fusion.

1 Introduction

Speech summarization generates a text summary from given speech data. It is challenging because it needs to process lengthy speech data (a sequence of utterances) and extract important information to create a compact representation of the content. Moreover, in contrast to a text input, speech contains fillers, disfluencies, redundancies (e.g. repetition of the same phrases), and colloquial language. There are two main types of summarization approaches, extractive and abstractive. Extractive summarization aims at identifying the most relevant segments of the input text/speech document and then concatenating them to assemble a summary. Abstractive summarization aims at directly generating a summary by paraphrasing the intent of the input document. Although the latter type is challenging, it has achieved great progress with the introduction of powerful deep learning models for text summarization (TS) such as Bidirectional Encoder Representations from Transformers (BERT) [1]. Moreover, abstractive summarization can potentially normalize spoken text to remove disfluencies, redundancies, and colloquial language, making the resulting summaries more understandable than extractive ones. Consequently, in this paper, we focus on abstractive summarization.

Speech summarization is achieved by combining two main sub-modules: an automatic speech recognition (ASR) module, which transcribes speech into a corresponding text document, and a TS system, which generates a compact representation of that document. Such a cascade connection permits using state-of-the-art modules optimized for each task individually, without requiring a large amount of paired data composed of speech data and associated summaries. Moreover, each module can operate on very different time resolutions, i.e., ASR is performed for each utterance, while TS requires the entire text document. However, a cascade connection discards speech-specific information [2, 3], such as intonation, which has the potential to enrich the summary. Moreover, the TS system receives input text containing ASR errors that affect the performance of the speech summarization system [4]. This study focuses on the latter problem.

Many works have attempted to mitigate the influence of ASR errors on a natural language processing (NLP) back-end. For example, studies on speech translation [5, 6, 7, 8, 9, 10] have reported that ASR errors could be mitigated during translation by considering multiple recognition hypotheses at the input of the translation back-end. Some studies [8, 9] have used posterior probabilities to weight ASR hypotheses based on their confidence. Sperber et al. [10] proposed directly inputting recognition lattices to the back-end translation system. In general, recognition lattices hold an approximation of the entire ASR search space, representing word-level multiple hypotheses in a compact form. Therefore, this approach helps mitigate ASR errors by considering alternative word candidates within the recognition lattices. However, it is difficult to use this approach directly with pre-trained state-of-the-art TS models like BERT, since the BERT model expects a sub-word sequence as an input instead of a lattice. For speech summarization, some studies built systems that are robust to ASR errors [11, 12, 13]. Weng et al. [4] achieved robust speech summarization by adding confidence scores associated with each recognized word to the input of a BERT-based TS system. The confidence scores help the TS system to ignore unreliable words in the ASR output. However, this provides only a limited ability to recover unreliable information from alternative word candidates, as it derives from a single (the 1-best) ASR hypothesis rather than an N-best list or a lattice. Ogawa et al. [11] proposed inputting confusion networks (CNs) to a compressive (i.e., non-neural) TS system. The TS system then selects recognized words from the CNs to form a summary that maximizes an ILP-based objective function. These studies confirmed that inputting multiple ASR hypotheses (e.g. lattices and CNs) and auxiliary information (e.g. confidence scores) to the NLP module is useful [10, 8, 7].

In this paper, we propose a speech summarization model that can exploit multiple ASR hypotheses to mitigate the influence of ASR errors and be used with pre-trained TS systems like BERT. We explore two approaches to combining the ASR hypotheses. First, we propose replacing the input sub-word embedding of BERT with a sum of sub-word embedding vectors weighted by their ASR posterior values, as done in previous studies on speech translation [8]. The second approach uses an attention fusion mechanism to combine different ASR hypotheses within the BERT module. In terms of fusing multiple inputs with an attention mechanism, this attention fusion is similar to the hierarchical attention [14, 15] proposed for multi-stream combination [16] and audio-visual processing [17]. However, our proposal does not have a hierarchical architecture; it is an attention function that fuses multiple hypotheses in a self-attention manner, where the query is the 1-best hypothesis and the keys and values are the multiple hypotheses. The attention fusion can be implemented at the input or within the BERT model. The latter can exploit BERT’s strong modeling capability to perform the hypotheses fusion.

Posterior fusion is similar to the confidence-based approach [4] in the sense that both approaches simply exploit the confidence of individual words or tokens within the 1-best hypothesis at token level and do not explicitly model multiple hypotheses at sequence level. In contrast, the attention fusion explicitly models multiple token sequence hypotheses by using two attention steps. First, we align the hypotheses with the 1-best hypothesis using an attention mechanism across the tokens of the 1-best and each hypothesis. This process allows using hypotheses with different lengths or redundant information. In the second step, we employ an attention mechanism over the aligned tokens and across the hypotheses to combine them. This attention fusion is similar to system combination approaches like ROVER [18], posterior probability decoding [19] and minimum Bayes risk decoding [20], but the combination is performed within the BERT encoding process.

We performed experiments to confirm the effectiveness of the proposed methods on two speech summarization datasets, i.e., YouTube How2 video and TED Talk summarization. The TED Talk summarization corpus is a newly assembled corpus that associates the TEDLIUM ASR corpus with publicly available TED Talk summaries. We release this new corpus with this paper. Experimental results show that retraining a BERT-based TS system with the proposed multi-hypothesis combination schemes can improve summarization performance on both datasets.

2 Speech summarization

Let us consider a spoken document $D$, which contains $K$ utterances. Let $X_{k}$ and $S_{k}$ be the speech signal and the associated transcription of the $k$-th utterance of this spoken document. The summarization task consists of generating a compact document $Y$ from the input spoken document $D$. This problem is addressed in two stages. First, we use an ASR system to transcribe each speech utterance $X_{k}$ into text. Then, we use a TS system to generate the summary $Y$ from the transcribed spoken document.

2.1 ASR system

We use a state-of-the-art Transformer model for ASR [21, 22]. The ASR system predicts the word sequence $\hat{S}_{k}$ associated with the speech signal $X_{k}$ using beam search. This is achieved by combining the scores from the Transformer and language models,

\hat{S}_{k} = \mathrm{argmax}_{S}\left(\log p_{\text{transf}}(S|X_{k}) + \lambda \log p_{\text{lm}}(S)\right), \qquad (1)

where $p_{\text{transf}}(S|X_{k})$ is the posterior probability of $S$ given $X_{k}$ obtained with the Transformer model, $p_{\text{lm}}(S)$ represents the language model, and $\lambda$ is the shallow fusion weight of the language model.
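As a concrete illustration of Eq. (1), the sketch below performs one beam-search expansion step with shallow fusion; the tensor layout and function signature are illustrative assumptions, not ESPnet's actual decoding API.

```python
import torch

def shallow_fusion_step(asr_logprobs, lm_logprobs, beam_scores, lam=0.3, beam_size=10):
    """One beam-search expansion step with shallow fusion (Eq. (1)).

    asr_logprobs: (beam, vocab) log p_transf(token | X_k, prefix)
    lm_logprobs:  (beam, vocab) log p_lm(token | prefix)
    beam_scores:  (beam,) accumulated scores of the current prefixes
    """
    vocab = asr_logprobs.size(1)
    # Per-token score: ASR log-posterior plus LM log-probability weighted by lambda.
    step_scores = asr_logprobs + lam * lm_logprobs
    total = beam_scores.unsqueeze(1) + step_scores                # (beam, vocab)
    # Keep the beam_size best (prefix, token) continuations.
    top_scores, top_idx = total.view(-1).topk(beam_size)
    prev_beam = torch.div(top_idx, vocab, rounding_mode="floor")  # which prefix
    next_token = top_idx % vocab                                  # which token
    return top_scores, prev_beam, next_token
```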

In this paper, we assume that we can obtain multiple ASR hypotheses. These hypotheses can be the $N$-best recognition hypotheses obtained using beam search decoding or $N$ hypotheses obtained with different ASR systems, as often done for system combination [23]. In the following, $\hat{S}^{n}_{k}$ represents the $n$-th hypothesis out of a list of $N$ hypotheses.

In practice, ASR operates on sub-word units such as byte pair encoding (BPE), and thus a recognition hypothesis can be expressed as $\hat{S}^{n}_{k}=[\hat{s}^{n}_{k,1},\ldots,\hat{s}^{n}_{k,M_{k}^{n}}]^{\mathrm{T}}\in\mathbb{R}^{M_{k}^{n}\times V}$, where $\hat{s}^{n}_{k,m}$ is a one-hot vector representing the $m$-th token of the $n$-th hypothesis of the $k$-th utterance, $M_{k}^{n}$ is the length of the hypothesis, $V$ is the vocabulary size (the number of sub-word units), and $\mathrm{T}$ is the transpose operation. To obtain recognition hypotheses for the entire spoken document, we can simply concatenate the hypotheses of all utterances as $\hat{S}^{n}=[\hat{s}^{n}_{1,1},\ldots,\hat{s}^{n}_{K,M_{K}^{n}}]^{\mathrm{T}}$. By abuse of notation, we remove the utterance index $k$ and redefine $\hat{S}^{n}=[\hat{s}^{n}_{1},\ldots,\hat{s}^{n}_{M^{n}}]^{\mathrm{T}}\in\mathbb{R}^{M^{n}\times V}$, where $M^{n}$ is the total length of the concatenated $n$-th hypotheses of the spoken document, i.e., $M^{n}=\sum_{k}M^{n}_{k}$. With beam search, we can also obtain the posterior probabilities associated with each token in the hypothesis, which we denote as $\hat{p}^{n}_{m}$.
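To make this notation concrete, the following sketch builds the document-level $n$-th hypothesis and its posterior sequence by concatenating the per-utterance hypotheses; the input layout (a list of token-ID/posterior pairs per utterance and hypothesis) is a hypothetical convention chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: for each utterance k and hypothesis n we have a list of
# (token_id, posterior) pairs produced by the ASR beam search.
Utterance = List[Tuple[int, float]]

def concatenate_document(hyps_per_utt: List[List[Utterance]], n: int):
    """Build the document-level n-th hypothesis and its posteriors by
    concatenating the n-th hypothesis of every utterance (Section 2.1)."""
    tokens, posteriors = [], []
    for utt_hyps in hyps_per_utt:          # loop over utterances k = 1..K
        for tok, post in utt_hyps[n]:      # n-th hypothesis of utterance k
            tokens.append(tok)             # token ID of \hat{s}^n_m
            posteriors.append(post)        # \hat{p}^n_m
    return tokens, posteriors              # length M^n = sum_k M^n_k
```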

2.2 TS system

Early works on text summarization used combinatorial optimization approaches to find important content in a text document [24, 25, 26]. It was difficult to generate consistent abstractive summaries with such approaches, so most research focused on extractive summaries. The success of deep learning-based language models greatly improved the quality of abstractive summarization [27, 28, 29]. Recently, Transformer-based models have become state-of-the-art models for abstractive text summarization [30]. For example, BERTSum [29] leverages the strong language modeling capability of a pre-trained BERT [1] model to achieve high-quality abstractive summarization through transfer learning. Since the original BERT model assumes single sentences as input instead of a sequence of sentences as in summarization tasks, BERTSum introduces BERT’s classification (CLS) tokens at the start of each sentence of the input document. BERTSum uses the BERT model as an encoder and adds a Transformer-based decoder to generate the summary. The model is then fine-tuned on the text summarization task.

The TS system accepts the entire transcription of the spoken document as an input. In general, we can simply use the 1-best hypothesis $\hat{S}^{1}$. We employ a BERTSum model to generate a summary from $\hat{S}^{1}$ as

E = \text{Emb}(\hat{S}^{1}), \qquad (2)
Z = \text{Enc}^{\text{bert}}(E), \qquad (3)
Y = \text{Dec}^{\text{transf}}(Z), \qquad (4)

where $\text{Emb}(\cdot)$ is an embedding layer that converts the one-hot token sequence into a sequence of embedding vectors $E=[e_{1},\ldots,e_{M^{1}}]^{\mathrm{T}}\in\mathbb{R}^{M^{1}\times B}$, $B$ is the size of the embedding, $\text{Enc}^{\text{bert}}(\cdot)$ represents a BERT encoder, $Z$ is an intermediate representation of the document, and $\text{Dec}^{\text{transf}}(\cdot)$ represents a Transformer-based decoder that generates a summary based on $Z$. BERTSum takes advantage of the strong language modeling capability of the pre-trained BERT model to generate high-quality summaries. By inserting CLS symbols between consecutive sentences, the BERT encoder can process a sequence of multiple utterances that forms a long document.
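The sketch below illustrates the data flow of Eqs. (2)-(4) using a pre-trained BERT encoder and a generic Transformer decoder. It is a simplified stand-in for BERTSum, not its actual implementation: it omits the per-sentence CLS tokens, long-document handling, and fine-tuning, and the model names and layer sizes are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")             # Enc^bert
decoder_layer = torch.nn.TransformerDecoderLayer(d_model=768, nhead=8,
                                                 batch_first=True)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers=6)   # Dec^transf
out_proj = torch.nn.Linear(768, tokenizer.vocab_size)                # token logits

doc = "first sentence of the transcribed document. second sentence."
inputs = tokenizer(doc, return_tensors="pt")                         # \hat{S}^1 as IDs
Z = encoder(**inputs).last_hidden_state                              # Eq. (3)

# Teacher-forced decoding of a (dummy) summary prefix, Eq. (4).
summary_prefix = tokenizer("summary so far", return_tensors="pt")["input_ids"]
E_dec = encoder.embeddings.word_embeddings(summary_prefix)           # reuse BERT embeddings
logits = out_proj(decoder(tgt=E_dec, memory=Z))                      # next-token scores
```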

There are two issues with the cascade connection of ASR and BERTSum. First, the BERT model assumes discrete features representing the IDs of the sub-word units as inputs. Thus, it cannot directly accept an uncertain input such as the posteriors of the ASR system, and recognition errors will directly affect the summary. We discuss proposals to address this problem in Section 3.

Second, although both systems use sub-word units, the definitions of the sub-word units usually differ. Typically, ASR performs better with much smaller BPE units than those of the conventional BERT system. However, since we would like to exploit a pre-trained BERT model, we consider two options. With the first option, we re-tokenize the word sequence after ASR to match the sub-word definition of the BERT model. With this approach, recognition performance may be optimal, but it offers limited possibilities for the interconnection of the ASR and TS systems. The other option is to use an ASR system trained with the same sub-word unit definitions as the BERT model. This will degrade ASR performance but allow more flexibility in combining the two systems. We will analyze the impact of these options in the experiments of Section 5.2.

2.3 Conventional interconnection of ASR and TS systems

Many ASR+NLP systems such as speech summarization, translation, and spoken dialog systems use a cascade of ASR and NLP sub-systems built independently. The performance of the downstream NLP task is directly affected by the recognition errors [31]. Previous studies improved the robustness of the NLP back-end to ASR errors using various auxiliary information sources from the ASR system, e.g., probabilities, recognition hypotheses, and hidden states [32, 33, 34, 35, 8, 11, 36, 6, 10, 4, 37].

For speech summarization, Weng et al. proposed including a confidence embedding in the input of BERTSum [29] to achieve robust speech summarization [4]. They modified the input embedding vectors of BERTSum to be the sum of the sub-word embedding vector and a confidence embedding. Here, we implemented a similar method. We use $\hat{p}^{1}_{m}$, as introduced in Section 2.1, as a confidence value, which we map to a hidden vector using a linear mapping as

c_{m} = \text{Emb}^{\text{conf}}(\hat{p}^{1}_{m}), \qquad (5)

where $c_{m}\in\mathbb{R}^{B}$ is a confidence embedding, $\text{Emb}^{\text{conf}}(\cdot)$ is a linear embedding layer, and $B$ is the dimension of the projected embedding vectors. Then the modified embedding vector $e_{m}^{\text{conf}}$ is obtained as the summation of the confidence and word embeddings:

e_{m}^{\text{conf}} = c_{m} + e_{m}. \qquad (6)
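A minimal sketch of the confidence-embedding input of Eqs. (5)-(6), assuming a 768-dimensional embedding space; the class and variable names are illustrative.

```python
import torch

class ConfidenceEmbedding(torch.nn.Module):
    """Illustration of Eqs. (5)-(6): map a scalar token confidence to a
    B-dimensional vector and add it to the word embedding."""

    def __init__(self, emb_dim: int = 768):
        super().__init__()
        self.emb_conf = torch.nn.Linear(1, emb_dim)    # Emb^conf

    def forward(self, word_emb: torch.Tensor, conf: torch.Tensor) -> torch.Tensor:
        # word_emb: (M, B) embeddings e_m of the 1-best tokens
        # conf:     (M,)   posterior-based confidences \hat{p}^1_m
        c = self.emb_conf(conf.unsqueeze(-1))          # Eq. (5): (M, B)
        return word_emb + c                            # Eq. (6): e_m^conf
```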

3 Multi-hypothesis Summarization

In this paper, we propose a BERT-based summarization model that takes into account multiple speech recognition hypotheses. We propose combining several ASR hypotheses to mitigate the influence of ASR errors. Conceptually, the combined hypothesis can be obtained as a weighted sum,

e_{m}^{*} = \sum_{n=1}^{N} \alpha^{n}_{m} e_{m}^{n}, \qquad (7)

where $\alpha^{n}_{m}$ denotes the weight of the $n$-th hypothesis at the $m$-th token position and $e_{m}^{*}$ denotes a modified embedding vector.

First, we apply a method that incorporates posterior probabilities, originally proposed for speech translation, to the BERT summarization model. Next, we explain the hypothesis fusion method using an attention mechanism.

3.1 Speech summarization with posterior fusion

The posterior fusion consists of summing the embedding vectors of all sub-words weighted by their posterior probabilities $\hat{p}_{m}^{n}$, as

e_{m}^{\text{post}} = \sum_{n=1}^{N} \hat{p}_{m}^{n} e_{m}^{n}, \qquad (8)

where $e_{m}^{\text{post}}$ is a modified embedding vector and $e_{m}^{n}$ is the $m$-th embedding vector of the $n$-th hypothesis obtained with Eq. (2). The modified embedding $e_{m}^{\text{post}}$ can account for the uncertainty of the ASR system. Computing $e_{m}^{\text{post}}$ requires that all hypotheses are aligned and have the same length. We thus create the $N$ hypotheses as follows. First, during decoding we save, for the best beam-search path, the sequence of summed output log-softmax values from the ASR and language models. After decoding, for each step in the 1-best path, we select the $N{=}10$ tokens with the highest saved scores. This way of creating $N$ hypotheses may generate more diverse hypotheses than simply obtaining the $N$-best list from beam search decoding. Note that since the posterior-based fusion modifies the input of the TS model, we need to retrain the BERTSum model using ASR hypotheses.
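A minimal sketch of Eq. (8), assuming the $N$ hypotheses have already been aligned to the same length $M$ and are given as token-ID and posterior matrices.

```python
import torch

def posterior_fusion(hyp_token_ids: torch.Tensor,
                     hyp_posteriors: torch.Tensor,
                     embedding: torch.nn.Embedding) -> torch.Tensor:
    """Posterior-weighted sum of sub-word embeddings (Eq. (8)).

    hyp_token_ids:  (N, M) token IDs of N aligned hypotheses of length M
    hyp_posteriors: (N, M) posteriors \\hat{p}^n_m of each token
    embedding:      the BERT sub-word embedding table Emb(.)
    Returns the fused embedding sequence of shape (M, B).
    """
    emb = embedding(hyp_token_ids)                     # (N, M, B) -> e^n_m
    weighted = hyp_posteriors.unsqueeze(-1) * emb      # weight each token embedding
    return weighted.sum(dim=0)                         # e^post_m, summed over n
```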

3.2 Attention-based multi-hypothesis fusion

Posterior probability fusion trusts the ASR weighting and may mitigate the influence of unreliable tokens when performing summarization. However, when the ASR system outputs the correct word only in, e.g., the 10-th hypothesis, it may be difficult to recover that information from the modified embedding of Eq. (8) because the weight of the correct word, $\hat{p}^{10}_{m}$, is too small. In contrast, our proposal re-computes the weights of all hypotheses based on the BERT representations. Thus, even if the correct word appears only in the 10-th hypothesis, our proposal can assign it a high weight.

Attention-based fusion consists of two steps. In the first step, we pick a representative ASR hypothesis and align the other hypotheses to it. We use the most confident ASR hypothesis, i.e., the 1-best hypothesis $\hat{S}^{1}$. If we use multiple ASR systems, we use the hypothesis of the presumably best-performing ASR system as $\hat{S}^{1}$.

We obtain the embedding vectors for the $n$-th hypothesis, $\tilde{E}^{n}=[\tilde{e}^{n}_{1},\ldots,\tilde{e}^{n}_{M^{1}}]\in\mathbb{R}^{M^{1}\times B^{\prime}}$, which are time-aligned with the $M^{1}$-length hypothesis $\hat{S}^{1}$, based on the attention mechanism as:

\tilde{E}^{n} = \text{softmax}\left((E^{1}W^{Q})(E^{n}W^{K})^{\mathrm{T}}\right) E^{n} W^{V}, \qquad (9)

where $E^{n}=[e_{1}^{n},\ldots,e_{M^{n}}^{n}]^{\mathrm{T}}\in\mathbb{R}^{M^{n}\times B}$ is the sequence of embedding vectors associated with the $n$-th hypothesis, $E^{1}$ is the sequence of the 1-best embedding vectors used as the query, $\text{softmax}(\cdot)$ is the softmax operation, and $W^{Q}\in\mathbb{R}^{B\times B^{\prime}}$, $W^{K}\in\mathbb{R}^{B\times B^{\prime}}$, $W^{V}\in\mathbb{R}^{B\times B^{\prime}}$ are the query, key, and value projection matrices, respectively.
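The alignment step of Eq. (9) can be sketched as follows for a single hypothesis; the dot-product form is shown here, whereas the experiments described later use cosine similarity, and the shapes follow the notation above.

```python
import torch
import torch.nn.functional as F

def align_hypothesis(E1: torch.Tensor, En: torch.Tensor,
                     WQ: torch.Tensor, WK: torch.Tensor, WV: torch.Tensor):
    """Align the n-th hypothesis to the 1-best hypothesis (Eq. (9)).

    E1: (M1, B) embeddings of the 1-best hypothesis (query)
    En: (Mn, B) embeddings of the n-th hypothesis (key/value), any length
    Returns the aligned embeddings with shape (M1, B').
    """
    q = E1 @ WQ                                       # (M1, B')
    k = En @ WK                                       # (Mn, B')
    v = En @ WV                                       # (Mn, B')
    attn = F.softmax(q @ k.T, dim=-1)                 # (M1, Mn) alignment weights
    return attn @ v                                   # (M1, B'), time-aligned to E1
```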

In the second step, we perform attention over the different hypotheses for every aligned sub-word position $m$, in a similar way as hierarchical attention [14, 34, 38]. Let $C_{m}=[\tilde{e}^{1}_{m},\dots,\tilde{e}^{N}_{m}]\in\mathbb{R}^{B^{\prime}\times N}$ be a matrix containing the $N$ aligned embedding vectors for the $m$-th sub-word position in the sequence. We can perform attention over the hypotheses to obtain a modified embedding vector $e^{\text{att}}_{m}$ as

\alpha_{m} = \text{softmax}\left((e^{1}_{m})^{\mathrm{T}} W^{Q} C_{m}\right), \qquad (10)
e^{\text{att}}_{m} = \alpha_{m} C_{m}^{\mathrm{T}}, \qquad (11)

where $\alpha_{m}\in\mathbb{R}^{1\times N}$ are attention weights over the recognition hypotheses. Eq. (11) performs a similar summation over embedding vectors as in Eq. (8), but using the attention mechanism to compute the weights and the time-aligned embedding vectors.
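A matching sketch of the per-position fusion of Eqs. (10)-(11) for a single token position $m$; shapes follow the notation above.

```python
import torch
import torch.nn.functional as F

def fuse_aligned_hypotheses(e1_m: torch.Tensor, C_m: torch.Tensor,
                            WQ: torch.Tensor) -> torch.Tensor:
    """Attention over the N aligned hypotheses at token position m (Eqs. (10)-(11)).

    e1_m: (B,)     1-best embedding e^1_m used as the query
    C_m:  (B', N)  matrix of aligned embeddings \\tilde{e}^n_m
    WQ:   (B, B')  query projection
    Returns the fused embedding e^att_m with shape (B',).
    """
    alpha = F.softmax(e1_m @ WQ @ C_m, dim=-1)        # (N,) weights over hypotheses
    return C_m @ alpha                                # weighted sum, Eq. (11)
```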

The proposed attention fusion can also be extended to multi-head attention. In that case, we can obtain an aligned hypothesis for each attention head, $\tilde{E}_{h}^{n}$, using an equation similar to Eq. (9) with different projection matrices for each head, i.e., $W^{Q}_{h}$, $W^{K}_{h}$, and $W^{V}_{h}$, where $h$ is the head index. We can then also define $C_{m,h}$ for each head and compute attention weights $\alpha_{m,h}$ and fused hypotheses $e^{\text{att}}_{m,h}$ for each head as in Eqs. (10) and (11). We then obtain the fused hypothesis as,

e^{\text{att}}_{m} = \text{concatenate}(e^{\text{att}}_{m,1}, \ldots, e^{\text{att}}_{m,H}) W^{o}, \qquad (12)

where $H$ is the number of attention heads and $W^{o}\in\mathbb{R}^{(HB^{\prime})\times B}$ is an output projection matrix as used in previous work [39]. We used the multi-head implementation in our experiments. Note that we also used the cosine similarity instead of the dot product to compute the attention weights in Eqs. (9) and (10).
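A sketch of the multi-head extension of Eq. (12), assuming the per-head aligned matrices $C_{m,h}$ and query projections have already been computed as above; the explicit loop over heads is for clarity only.

```python
import torch
import torch.nn.functional as F

def multihead_fusion(e1_m, C_heads, WQ_heads, Wo):
    """Multi-head variant of the fusion step (Eq. (12)).

    e1_m:     (B,)  1-best embedding at position m
    C_heads:  list of H matrices C_{m,h}, each (B', N)
    WQ_heads: list of H query projections W^Q_h, each (B, B')
    Wo:       (H*B', B) output projection
    """
    fused = []
    for C, WQ in zip(C_heads, WQ_heads):
        alpha = F.softmax(e1_m @ WQ @ C, dim=-1)   # Eq. (10) per head
        fused.append(C @ alpha)                    # Eq. (11) per head
    return torch.cat(fused, dim=-1) @ Wo           # Eq. (12)
```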

Unlike the posterior fusion, the proposed attention fusion can use hypotheses of different lengths thanks to the time-alignment step of Eq. (9). Moreover, although we derived the attention fusion assuming it was performed at the input of the BERT encoder, we can also perform attention fusion at any layer within the BERT encoder. When performing the fusion within the BERT encoder, the multiple hypotheses are embedded in the intermediate representations of the BERT model. Therefore, we can exploit BERT’s strong language modeling capabilities to combine the hypotheses. Figure 1 illustrates how we apply the proposed attention fusion within the BERT encoder.

Fig. 1: Attention fusion process.

Note that introducing a randomly initialized attention fusion layer may corrupt the BERT encoder, making the training slow or unstable. In this paper, $B$ equals $B^{\prime}$; therefore, we initialize the projection matrices $W^{Q}_{h}$, $W^{K}_{h}$, and $W^{V}_{h}$ to identity matrices, so that at the beginning of training, the fusion of Eqs. (10) and (11) assigns the highest weight to the first hypothesis and $e^{\text{att}}_{m}$ is close to the 1-best embedding $e^{1}_{m}$.
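A minimal sketch of this initialization, assuming $B = B^{\prime} = 768$; the variable names are hypothetical.

```python
import torch

B = 768  # with B = B', the square projections can start as identity matrices
W_q = torch.nn.Parameter(torch.eye(B))   # W^Q_h
W_k = torch.nn.Parameter(torch.eye(B))   # W^K_h
W_v = torch.nn.Parameter(torch.eye(B))   # W^V_h
# With identity projections (and cosine similarity), the aligned 1-best
# embedding matches the query exactly, so the softmax over hypotheses
# initially puts most weight on the 1-best hypothesis.
```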

Table 1: Comparison of the text and speech summarization corpora.
Dataset | Text documents | Speech documents | Compression rate (sent. / word) | Source length (sent. / word) | Target length (sent. / word) | WER
CNNDM | 162,018 | n/a | 14% / 9% | 35 / 853 | 3 / 60 | n/a
How2 | 72,983 | 12,798 | 16% / 16% | 14 / 303 | 1 / 34 | 13.0%
TED | 4,001 | 1,495 | 6% / 5% | 102 / 2210 | 4 / 79 | 8.5%

4 TED Speech summarization corpus

We used three corpora to train and test our speech summarization systems. In particular, we assembled a new corpus derived from TED Talks that we will release upon acceptance of the paper. Table 1 compares the characteristics of the TED corpus with the two other corpora used in this paper.

4.1 Descriptions of the corpora

CNN-DailyMail (CNNDM) is a large-scale corpus used for text summarization of news documents. We use this corpus to pre-train the TS system. How2 is a publicly available corpus for speech summarization [40]. It consists of summaries of How2 videos taken from YouTube. The target for summarization consists of the brief video description provided on YouTube. Although the corpus also includes videos, allowing multi-modal summarization, here we only use the audio content of the corpus.

The TED corpus consists of summaries of TED Talks. We created this corpus by associating TED Talks included in the TEDLIUM corpus with their summaries obtained from the TED website (https://www.ted.com/talks). For the TED summarization task, we add the speaker name to the speech document to allow the TS systems to output speaker names, which commonly appear in the reference summaries. The target summary consists of the title and abstract of the talk. Note that others have also used TED Talks for speech summarization [41], but those corpora were small. Furthermore, they did not publicly release their corpus.

4.2 Analysis of the complexity of the TED summarization task

Table 1 shows the number of text and speech documents, the compression rate, the source and target lengths, and the word error rate (WER) of the three corpora. The compression rate is expressed as the ratio between the output and input document lengths. Thus, a lower value means higher information compression. This table shows that our proposed TED summarization task is challenging because it has a relatively small amount of training data and consists of long input speech documents that require higher information compression than existing datasets. Note that the WER is about 8.5%, which makes the TED task slightly easier than the How2 corpus in terms of ASR performance.

Table 2: ROUGE scores of ideal extractive summaries and word overlap.
Dataset ROUGE-1 ROUGE-2 ROUGE-L Word overlap
CNNDM 45.0 29.9 42.8 83%
How2 27.9 10.4 23.3 58%
TED 34.4 19.8 33.5 72%

Another way of analyzing the complexity of a summarization task is to examine how well extractive summarization can perform on it. We can measure ideal extractive summarization scores by using the reference summary to select the set of sentences from the input document that achieves the highest summarization score. Table 2 shows the ROUGE scores [42] and word overlap of the ideal extractive summaries for the three datasets. The word overlap measures the percentage of words from the target summaries that appear in the source documents. CNNDM, which consists of summaries of news articles, is well suited for extractive summarization; therefore, its oracle scores and word overlap are relatively high. In contrast, the How2 corpus consists of relatively casual speech, which leads to much lower oracle scores and word overlap. The TED summarization corpus consists of relatively formal speech, and its oracle extractive summarization scores lie between those of CNNDM and How2.
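The exact selection procedure is not detailed here, so the sketch below shows one common way to approximate these quantities: a word-overlap measure and a greedy oracle sentence selection driven by an external scoring function (e.g., a ROUGE implementation). Both are illustrative assumptions rather than the authors' scripts.

```python
def word_overlap(source: str, target: str) -> float:
    """Percentage of target-summary words that also appear in the source
    document (cf. the 'Word overlap' column of Table 2); counting every
    target token, not only unique words, is an assumption of this sketch."""
    src_vocab = set(source.lower().split())
    tgt_tokens = target.lower().split()
    return 100.0 * sum(w in src_vocab for w in tgt_tokens) / max(len(tgt_tokens), 1)

def greedy_oracle_extract(sentences, reference, score_fn, max_sents=3):
    """Greedy approximation of the ideal extractive summary: repeatedly add
    the source sentence that most increases score_fn (e.g., ROUGE) against
    the reference summary."""
    selected, best_score = [], 0.0
    for _ in range(max_sents):
        best_sent = None
        for sent in sentences:
            if sent in selected:
                continue
            score = score_fn(" ".join(selected + [sent]), reference)
            if score > best_score:
                best_score, best_sent = score, sent
        if best_sent is None:   # no remaining sentence improves the score
            break
        selected.append(best_sent)
    return selected, best_score
```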

Table 3: ROUGE scores of abstractive summarization with BERTSum using the ground-truth transcriptions (BERTSum (oracle)).
Dataset ROUGE-1 ROUGE-2 ROUGE-L
CNNDM 41.7 19.4 38.8
How2 56.5 37.8 59.3
TED 32.1 6.2 19.0

Finally, we compare the performance of abstractive text summarization on the ground-truth transcriptions. The results indicate an upper-bound value for the speech summarization systems we investigated. Table 3 shows the ROUGE scores obtained with BERTSum models for the three tasks. We found that the difficulty of abstractive summarization is highly dependent on the length of the input and the compression rate, which makes the proposed TED summarization task also challenging for abstractive summarization.

Comparing the results of Tables 2 and 3 shows that the ideal extractive summarization scores on the TED corpus, in particular the ROUGE-2 and ROUGE-L scores, are higher than those of abstractive summarization. We will investigate extractive summarization on the TED corpus and compare it with abstractive summarization in future work.

Table 4: ASR WER (%) for each BPE size. 30.5k corresponds to the BPE size of the BERT model.
Dataset | 500 | 5k | 10k | 20k | 30k | 30.5k
How2 | n/a | 13.0 | 13.6 | 14.1 | 14.3 | 14.6
TED | 8.5 | n/a | 8.7 | 9.5 | 10.0 | 10.4

5 Experiments

We performed experiments using the two speech summarization datasets, TED and How2, described in Section 4.

5.1 System configuration

Our baseline consists of a cascade of ASR and TS systems trained separately. We built Transformer-based ASR models using the ESPnet toolkit (https://github.com/espnet/espnet), following the published recipes for the TEDLIUM2 [43] and How2 [40] tasks, except that we varied the BPE size. For the TS model, we built a BERTSum model using a pre-trained BERT model as an encoder and a Transformer decoder in the same way as [38]. We used the pre-trained BERT model provided by Hugging Face (https://huggingface.co/transformers/). We pre-trained the BERTSum model on CNNDM data and fine-tuned it on each summarization task.

We consider three baseline systems: “(1) BERTSum (oracle),” which uses ground-truth transcriptions as input to the BERTSum model; “(2) BERTSum (ASR-BPE),” which uses transcriptions obtained with an ASR system trained with the optimal BPE definition for ASR; and “(3) BERTSum (BERT-BPE),” which uses an ASR system trained with the BPE definition of the pre-trained BERT model. The first baseline illustrates the upper-bound performance on the tasks. The second baseline is optimal for ASR, but it requires re-tokenizing the recognized text to match the BPE of the BERTSum model, and it cannot be used to pass ASR information such as confidence or posteriors to the summarization back-end. The third baseline uses the same BPE definition as our proposed methods.

Table 5: ROUGE scores for the different speech summarization systems. Systems (6) and (7) are the proposed methods.
Method | TED (ROUGE-1 / ROUGE-2 / ROUGE-L) | How2 (ROUGE-1 / ROUGE-2 / ROUGE-L)
(1) BERTSum (oracle) | 32.1 / 6.2 / 19.0 | 56.5 / 37.8 / 59.3
(2) BERTSum (ASR-BPE) | 29.9 / 6.9 / 18.3 | 47.4 / 27.1 / 46.1
(3) BERTSum (BERT-BPE) | 28.9 / 6.2 / 17.8 | 45.3 / 26.8 / 45.0
(4) BERTSum retrain | 31.5 / 5.6 / 20.4 | 47.2 / 27.0 / 45.6
(5) BERTSum confidence | 30.1 / 6.8 / 20.4 | 48.4 / 29.0 / 47.3
(6) BERTSum Pos. fusion | 31.6 / 6.1 / 20.3 | 48.2 / 27.8 / 46.5
(7) BERTSum Att. fusion | 31.9 / 6.0 / 19.3 | 49.3 / 28.8 / 48.2

In addition to the above systems, we created a baseline by retraining (3) BERTSum (BERT-BPE) on the transcriptions generated with ASR (“(4) BERTSum retrain”). We also implemented a system similar to [4] that uses BPE confidence scores as an auxiliary feature at the input of BERTSum (“(5) BERTSum confidence”) as described in Section 2.3.

Table 6: How2 dataset summarization examples. The red color highlights the main differing parts.
Method | Example
Reference | learn how to form a b sound for ventriloquists with expert voice throwing tips from a professional comedian in this free online ventriloquism lesson video clip
(5) BERTSum confidence | practice your ventriloquists with expert voice throwing tips from a professional comedian in this free online ventriloquism lesson video clip
(7) BERTSum Att. fusion | learn how to make b sound for ventriloquists with expert voice throwing tips from a professional comedian in this free online ventriloquism lesson video clip
Reference | when choosing the right hair style for your face , pull your hair back and take into account the shape of your face choose the right hair style with tips from a beauty professional in this free video on hair care
(5) BERTSum confidence | when picking a hairstyle for face , it ' s important to pick a hairstyle that is n ' t fit choose a hairstyle with tips from a beauty professional in this free video on hair care
(7) BERTSum Att. fusion | when choosing a flattering hairstyle , take an measurements of the face shape and pull all of the hair back into account choose a flat hairstyle with tips from a beauty professional in this free video on hair care

The proposed posterior-based hypothesis fusion used the same ASR and TS systems as the baseline systems, i.e., (3) BERTSum (BERT-BPE), except that the BERTSum system was fine-tuned on input embeddings obtained with Eq. (8) using $N{=}10$, as described in Section 3.1. We call this system “(6) BERTSum Pos. fusion”.

For the proposed attention-based fusion, we inserted the attention fusion layer at the fifth layer of the BERT encoder and used four attention heads ($H{=}4$). We used the same approach to generate the recognition hypotheses as for the posterior fusion described in Section 3.1, except that we used the attention fusion described in Section 3.2 instead of Eq. (8). We used five hypotheses ($N{=}5$) for attention fusion due to GPU memory constraints of our experimental environment. We trained all BERTSum models following the original recipe, except that we used a learning rate of 0.0002 and 20k warm-up steps when retraining on the ASR outputs (systems (4) to (7)). We compared the summarization performance of each method with ROUGE [42].

5.2 Effect of BPE size

Both the ASR and BERT models use BPE to represent sub-words; however, the BPE unit definitions used by the two systems differ. Typically, ASR systems achieve optimal performance with a smaller BPE size than that of the BERT model. However, since our proposed method requires that the BPE unit definition of the ASR system match that of the BERT model, we expect some ASR degradation due to the larger BPE size. Thus, we first investigate the effect of the BPE size on ASR performance.

Table 4 shows the WER as a function of the BPE size for the How2 and TED corpora. We observe a relative WER increase of 12% for How2 and of more than 20% for the TED corpus when adopting the BPE definition used by BERT. Although this is a significant WER increase, we discuss its impact on summarization in the following subsection.

5.3 Speech summarization results

Table 5 shows the ROUGE scores for the baseline systems (systems (1) to (5)) and the proposed methods with (6) posterior and (7) attention-based fusion. Note that we confirmed the validity of our implementation of BERTSum, as it achieved a similar level of performance on the How2 corpus as [29], which reported ROUGE-1 and ROUGE-L scores of 48.3 and 44.0, respectively, although the systems cannot be directly compared because of differences in the training data and ASR front-end.

Fig. 2: ROUGE score improvement from (4) BERTSum retrain for (5) BERTSum confidence and the proposed methods as a function of the WER of the input documents.

We observe a large gap in summarization performance when using ASR transcriptions (system (2)) instead of ground-truth transcriptions (system (1)). As discussed in Section 5.2, using BERT’s BPE definition for ASR (system (3)) induces more recognition errors, which clearly degrades summarization performance on both tasks compared to using the BPE definition optimal for ASR (system (2)). However, this performance degradation can be mostly recovered by fine-tuning on the ASR hypotheses (system (4)). Using confidence scores (system (5)) [29] improves ROUGE scores on the How2 corpus, but degrades ROUGE-1 on the TED corpus compared to simply retraining on the ASR hypotheses.

The proposed posterior (system (6)) and attention-based fusion (system (7)) approaches both achieve ROUGE-1 scores equivalent or superior to those of the baseline system retrained on ASR hypotheses (system (4)) for both corpora. In particular, the proposed attention-based fusion improves ROUGE-1 by 2 points on the How2 corpus. This result indicates that the proposed system can better mitigate ASR errors by using multiple hypotheses with the BERTSum model.

Figure 2 shows the ROUGE-1 score difference between systems (4) and (5)-(7) as a function of the WER of the spoken documents. These results show that our proposed attention fusion system achieves higher robustness to ASR errors than systems (5) and (6). We hypothesize that this is because the proposed attention fusion approach explicitly models multiple ASR hypotheses and can thus choose alternative word candidates within the ASR hypotheses to generate the summary. This may make the attention fusion-based system robust to ASR errors when the correct words are included in the hypotheses. In addition to the ROUGE scores, Table 6 provides a couple of summaries generated by our proposed system for the How2 corpus (we provide more examples, as well as examples for the TED corpus, on our webpage: https://github.com/takatomokano/ted_summary). We include summaries obtained with BERTSum fine-tuned on the ASR hypotheses for comparison. We confirm that both systems generate highly readable summaries that are close to the reference.

The experiments with the proposed attention fusion used aligned hypotheses of the same length. However, the proposed method can also handle hypotheses of different lengths thanks to the alignment mechanism of Eq. (9). We also tested the attention fusion using hypotheses generated by five different ASR systems, each with a different BPE size (the systems used in the experiments of Section 5.2). In this case, the hypotheses were unaligned and of different lengths. The proposed attention fusion achieved a ROUGE-1 score of 48.0 on the How2 task, which shows some improvement over the baseline systems (2)-(4). Although this result is behind that of our best-performing system, it shows that the proposed method can handle hypotheses of different lengths. We plan to further investigate such combinations by using more diverse ASR systems to generate the hypotheses in future work.

6 Conclusion

In this paper, we proposed a speech summarization system that exploits multiple hypotheses generated by an ASR system to mitigate the impact of recognition errors. We proposed two schemes, i.e., posterior and attention-based fusion, which can be integrated into a BERT-based TS model. We showed that both approaches could reduce the impact of ASR errors on summarization and achieved competitive results on two tasks.

Future work will include investigating a tighter interconnection of the ASR front-end and TS back-end to further mitigate ASR errors and to exploit speech-specific information, such as intonation, to create richer and more informative speech summaries.

7 Acknowledgement

We would like to thank Jiatong Shi at Johns Hopkins University for providing a script of ROVER-based system combinations.

References

  • [1] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in proc. NAACL-HLT, 2019, pp. 4171–4186.
  • [2] T. Kano, S. Takamichi, S. Sakti, G. Neubig, T. Toda, and S. Nakamura, “Generalizing continuous-space translation of paralinguistic information,” in proc. INTERSPEECH, 2013, pp. 2614–2618.
  • [3] Q. T. Do, S. Sakti, and S. Nakamura, “Sequence-to-sequence models for emphasis speech translation,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 26, no. 10, pp. 1873–1883, 2018.
  • [4] S. Weng, T. Lo, and B. Chen, “An effective contextual language modeling framework for speech summarization with augmented features,” in proc. EUSIPCO, 2020, pp. 316–320.
  • [5] N. Bertoldi, R. Zens, and M. Federico, “Speech translation by confusion network decoding,” in proc. ICASSP, 2007, pp. 1297–1300.
  • [6] M. Hopkins and J. May, “Tuning as ranking,” in proc. ACL, 2011, pp. 1352–1362.
  • [7] M. Ohgushi, G. Neubig, S. Sakti, T. Toda, and S. Nakamura, “An empirical comparison of joint optimization techniques for speech translation,” in proc. INTERSPEECH, 2013, pp. 2619–2623.
  • [8] K. Osamura, T. Kano, S. Sakriani, K. Sudoh, and S. Nakamura, “Using spoken word posterior features in neural machine translation,” in proc. IWSLT, 2018.
  • [9] P. Bahar, T. Bieschke, R. Schlüter, and H. Ney, “Tight integrated end-to-end training for cascaded speech translation,” in proc. SLT, 2021, pp. 950–957.
  • [10] M. Sperber, G. Neubig, J. Niehues, and A. Waibel, “Neural lattice-to-sequence models for uncertain inputs,” in proc. IWSLT, 2017, pp. 1380–1389.
  • [11] A. Ogawa, T. Hirao, T. Nakatani, and M. Nagata, “Ilp-based compressive speech summarization with content word coverage maximization and its oracle performance analysis,” in proc. ICASSP, 2019, pp. 7190–7194.
  • [12] S. Xie and Y. Liu, “Using n-best lists and confusion networks for meeting summarization,” IEEE Trans. Speech Audio Process., vol. 19, no. 5, pp. 1160–1169, 2011.
  • [13] S. Lin and B. Chen, “Improved speech summarization with multiple-hypothesis representations and kullback-leibler divergence measures,” in proc. INTERSPEECH, 2009, pp. 1847–1850.
  • [14] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy, “Hierarchical attention networks for document classification,” in proc. NAACL-HLT, 2016, pp. 1480–1489.
  • [15] J. Libovický and J. Helcl, “Attention strategies for multi-source sequence-to-sequence learning,” in proc. ACL, 2017, pp. 196–202.
  • [16] X. Wang, R. Li, S. H. Mallidi, T. Hori, S. Watanabe, and H. Hermansky, “Stream attention-based multi-array end-to-end speech recognition,” in proc. ICASSP, 2019, pp. 7105–7109.
  • [17] C. Hori, T. Hori, T. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi, “Attention-based multimodal fusion for video description,” in proc. ICCV, 2017, pp. 4203–4212.
  • [18] J. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (rover),” in proc. ASRU, 1997, pp. 347–354.
  • [19] G. Evermann and W. Philip, “Posterior probability decoding, confidence estimation and system combination,” in proc. Speech Transcription Workshop, 2000.
  • [20] V. Goel and W. Byrne, “Minimum bayes-risk automatic speech recognition,” in proc. Computer Speech & Language, 2000, pp. 115–135.
  • [21] S. Karita, X. Wang, S. Watanabe, T. Yoshimura, W. Zhang, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, and R. Yamamoto, “A comparative study on transformer vs RNN in speech applications,” in proc. ASRU, 2019, pp. 449–456.
  • [22] H. Inaguma, S. Kiyono, K. Duh, S. Karita, N. Yalta, T. Hayashi, and S. Watanabe, “Espnet-st: All-in-one speech translation toolkit,” in proc. ACL, A. Celikyilmaz and T. Wen, Eds., 2020, pp. 302–311.
  • [23] S. Jalalvand, M. Negri, D. Falavigna, M. Matassoni, and M. Turchi, “Automatic quality estimation for ASR system combination,” Comput. Speech Lang., vol. 47, pp. 214–239, 2018.
  • [24] R. T. McDonald, “A study of global inference algorithms in multi-document summarization,” in proc. ECIR, 2007, vol. 4425 of Lecture Notes in Computer Science, pp. 557–564.
  • [25] W. Yih, J. Goodman, L. Vanderwende, and H. Suzuki, “Multi-document summarization by maximizing informative content-words,” in proc. IJCAI, 2007, pp. 1776–1782.
  • [26] H. Takamura and M. Okumura, “Text summarization model based on the budgeted median problem,” in proc. ACM, 2009, pp. 1589–1592.
  • [27] J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu, “PEGASUS: pre-training with extracted gap-sentences for abstractive summarization,” in proc. ICML, 2020, pp. 11328–11339.
  • [28] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” in proc. ACL, 2017, pp. 1073–1083.
  • [29] Y. Liu and M. Lapata, “Text summarization with pretrained encoders,” in proc. EMNLP-IJCNLP, 2019, pp. 3728–3738.
  • [30] A. A. Syed, F. L. Gaol, and T. Matsuo, “A survey of the state-of-the-art models in neural abstractive text summarization,” IEEE Access, vol. 9, pp. 13248–13265, 2021.
  • [31] M. Paul, M. Federico, and S. Stüker, “Overview of the iwslt 2010 evaluation campaign,” in proc. IWSLT, 2010.
  • [32] X. He, L. Deng, and A. Acero, “Why word error rate is not a good metric for speech recognizer training for the speech translation task?,” in proc. ICASSP, 2011, pp. 5632–5635.
  • [33] T. Kano, S. Sakti, S. Takamichi, G. Neubig, T. Toda, and S. Nakamura, “A method for translation of paralinguistic information,” in proc. IWSLT, 2012, pp. 158–163.
  • [34] P. Manakul, M. J. F. Gales, and L. Wang, “Abstractive spoken document summarization using hierarchical model with multi-stage attention diversity optimization,” in proc. INTERSPEECH, 2020, pp. 4248–4252.
  • [35] N. Ruiz, Q. Gao, W. Lewis, and M. Federico, “Adapting machine translation models toward misrecognized speech with text-to-speech pronunciation rules and acoustic confusability,” in proc. INTERSPEECH, 2015, pp. 2247–2251.
  • [36] T. Kano, S. Sakti, and S. Nakamura, “Transformer-based direct speech-to-speech translation with transcoder,” in proc. SLT, 2021, pp. 958–965.
  • [37] S. Dalmia, B. Yan, V. Raunak, F. Metze, and S. Watanabe, “Searchable hidden intermediates for end-to-end models of decomposable sequence tasks,” in proc. NAACL-HLT, 2021, pp. 1882–1896.
  • [38] T. Liu, S. Liu, and B. Chen, “A hierarchical neural summarization framework for spoken documents,” in proc. ICASSP, 2019, pp. 7185–7189.
  • [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in proc. NIPS, 2017, pp. 5998–6008.
  • [40] R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze, “How2: A large-scale dataset for multimodal language understanding,” CoRR, vol. abs/1811.00347, 2018.
  • [41] F. Koto, S. Sakti, G. Neubig, T. Toda, M. Adriani, and S. Nakamura, “The use of semantic and acoustic features for open-domain TED talk summarization,” in proc. APSIPA, 2014, pp. 1–4.
  • [42] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in proc. ACL, 2004, pp. 74–81.
  • [43] A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: an automatic speech recognition dedicated corpus,” in proc. LREC, 2012, pp. 125–129.