
Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
* Corresponding Author: Jianzong Wang, [email protected]

Chendong Zhao†⋆‡, Jianzong Wang†∗, Wenqi Wei, Xiaoyang Qu, Haoqian Wang, Jing Xiao
‡ Work done during an internship at Ping An Technology.
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
The Shenzhen International Graduate School, Tsinghua University, China
Abstract

The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot easily be applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax-based attention mechanism makes it difficult to highlight important speech information. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments in different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme so that each self-attention structure better fits its corresponding head. The monotonic attention deploys regularization to prune redundant heads in the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.

Index Terms:
Automatic Speech Recognition, Sparse Attention, Monotonic Attention, Self-Attention.

I Introduction

Automatic speech recognition (ASR) has been well explored since it was first conceived, and end-to-end ASR has seen large improvements in recent years. The great success of end-to-end paradigms for ASR is largely attributable to several advanced models, such as Connectionist Temporal Classification (CTC) [1], the Recurrent Neural Network Transducer (RNN-T) [2], and attention-based ASR [3, 4, 5]. The Transformer architecture, which combines an acoustic model, a language model, and an acoustic-to-text alignment mechanism, has shown promising performance for end-to-end ASR [6, 7, 8, 9]. Unlike recurrent neural networks, the Transformer parallelizes training very well because it abandons the recurrent structure with its inherently sequential nature [10, 11, 12, 13]. Moreover, multi-head attention in the Transformer can significantly improve the network's performance, since it achieves dynamic embeddings and captures strong intrinsic structures [14, 15, 16, 17]. However, although the Transformer shows expressive power on Natural Language Processing (NLP) tasks, there remains room to exploit it further in speech tasks.

Transformer ASR methods [6, 7] exploit the self-attention approach, where each self-attention layer in both the encoder and the decoder uses a multi-head structure to extract different input representations. While it has been shown in other tasks that some attention heads are useless, we observed this phenomenon in Transformer ASR as well, where the attention function degenerates to something close to an identity mapping. This observation motivates us to find and address the redundant calculations in such self-attention, so in this work we propose to remove the redundant information without degrading overall performance. For Transformer-based ASR models, one fundamental problem is that the softmax-based attention mechanism makes it difficult to highlight important speech information. In the attention mechanism, softmax is typically used to compute the attention weights, and its output is always dense and non-negative [18, 19, 20]. Dense attention means that every input always contributes a little to the decision, which wastes attention. Moreover, the attention score distribution becomes flat as the input length increases, so it is difficult to focus on the critical information in the speech, especially for long utterances. Since dense attention cannot highlight important information in a long utterance, we use sparse transformations to obtain sparse attention distributions. Inspired by [21], we apply the adaptive α-entmax to the ASR task for the first time. Specifically, we replace the softmax in each self-attention with α-entmax and keep α learnable.

For attention-based ASR models, another fundamental challenge is accurate alignment between the input speech and the output text [22]. Monotonicity keeps the marginalization over alignments between the text output and the speech input tractable. However, monotonic multi-head alignment cannot ensure that all heads contribute to the final predictions. We therefore propose an adaptive monotonic multi-head attention with a regularization constraint. In our experiments, we found techniques that highlight the attention scores of important parts of the speech and offer better alignment accuracy, but searching for the best function by hand is time-consuming. To tackle this problem, we introduce an approach based on automatically adaptive attention. With such alignment, the trained model attends more sharply and accurately across its multiple attention heads; in this way, it can adaptively eliminate redundant heads and let every head learn its alignment properly.

Overall, our contributions are as follows. First, we unify sparse attention and monotonic attention within ASR. Second, we propose an adaptive method that learns a unique α parameter for each self-attention in the Transformer. Third, we deploy a regularized monotonic multi-head alignment scheme for acoustic-to-text alignment.

II Related Works and Preliminary

II-A Sparse Attention

Recent works [23, 24, 25] based on the attention function propose sparse normalizing transformations such as sparsemax [19]. Sparsemax can omit items with small probability and distribute the probability mass over outputs with higher probabilities, improving performance and interpretability to a certain extent [20]. However, each head of the multi-head attention mechanism used in the Transformer is designed to capture characteristics from a different point of view. For example, while the heads in the first encoder layer focus on feature extraction, the heads in the last decoder layer focus on classification of results; intuitively the attention functions in these two scenarios should differ. Replacing all attention mechanisms with sparsemax may therefore not suit the Transformer [26]. In NLP, there have been many breakthroughs regarding sparsity in the Transformer. From the perspective of Tsallis entropy [27], [28] unifies these transformation functions in the form of α-entmax: with α = 1, α-entmax recovers softmax; with α = 2, it recovers sparsemax; and any α > 1 yields sparse outputs. Moreover, to make α-entmax fit different heads for different tasks, [21] introduces an adaptive method that allows the parameter α to be learned.

II-B Monotonic Attention

The monotonic attention technique [29] restricts the transition of attention to a past-to-now pipeline and is generally adopted to reduce the complexity of decoding. In strongly context-dependent tasks such as natural language understanding, a full attention function is commonly applied over the input and output sequences, because word order or part of speech is less important. This may not work well for ASR, however, because the speech waveform and the target text sequence must be temporally monotonic. To strengthen the alignment between waveform and text, [30] combines monotonic attention with the connectionist temporal classification objective in a multi-task learning scheme that shares the same encoder; this recipe has become the standard optimization pipeline for most attention-based models [31, 32, 33]. In [34], a monotonic chunkwise attention (MoChA) approach was introduced for streaming attention: the encoder output latents are first split into short chunks, and soft attention is then applied within these small chunks. Furthermore, an adaptive-length chunk approach was proposed in [35]. Specifically, monotonic infinite lookback (MILK) attention [36] considers all preceding acoustic features to jointly learn an adaptive latency-quality trade-off schedule.

II-C Attention-based ASR

Although attention-based architectures [37, 4, 5] have achieved remarkable success in offline end-to-end ASR systems, they cannot easily be applied to online or streaming ASR. Several optimization paradigms have been proposed to overcome this limitation. Monotonic multi-head attention [38] extends monotonicity to the multi-head setting to extract useful representations and complex alignments; for this model, HeadDrop [39] is proposed to eliminate redundant heads. MoChA [34] bridges the gap between monotonic attention and soft attention. As online speech recognition becomes more and more important, these approaches also bring drawbacks: additional designs in the network and training criteria, and accuracy degradation. Because the Transformer is large, a compressed structure [40] has been proposed. Conformer [41] is a convolution-augmented Transformer for speech recognition. In triggered attention [42], a CTC task is trained to learn the alignment that triggers attention decoding. The work in [43] uses chunk-hopping for latency-controlled end-to-end ASR.

II-D Transformer ASR Model Architecture

As shown in Figure 1, our network architecture is a Transformer-based ASR model consisting of an encoder and a decoder. Let the input sequence be $x=\{x_1,\dots,x_T\}$ with $T$ the length of the frame sequence, the corresponding encoder states be $h=\{h_1,\dots,h_S\}$, and the output sequence be $y=\{y_1,\dots,y_U\}$ with $U$ the length of the character sequence. The encoder first processes the input with a down-sampling convolution layer followed by a stack of encoder blocks. The input speech features are 80-dimensional log-mel spectral energies. The down-sampling convolution uses a stride of 2 and $3\times 3$ kernels, followed by a ReLU activation; it extracts useful encodings, shortens the acoustic representation, and promotes faster alignment during decoding. The main modules of the Transformer comprise positional encoding, multi-head self-attention, an output linear layer with softmax, a position-wise feed-forward network, residual connections, and layer normalization. Here we emphasize self-attention and multi-head attention. Self-attention learns internal dependencies by computing a representation of a single sequence. The Transformer employs the scaled dot-product self-attention formulated as:

$head_i=\mathrm{softmax}\!\left(\frac{Q_iK_i^T}{\sqrt{d/h}}\right)V_i,\quad i=1,2,\dots,h,$ (1)
where $Q_i=XW_i^q$, $K_i=XW_i^k$, $V_i=XW_i^v$. (2)

Here, $h$ is the number of heads of the multi-head attention. Since $X$ in the Transformer has dimension $d_{model}$, the projection matrices are $W_i^q\in\mathbb{R}^{d_{model}\times d_q}$, $W_i^k\in\mathbb{R}^{d_{model}\times d_k}$, and $W_i^v\in\mathbb{R}^{d_{model}\times d_v}$, with $d_q=d_k=d_v=d_{model}/h$. The output $O$ of self-attention is the concatenation over heads, formulated as follows:

$O=\mathrm{concat}(head_1,head_2,\dots,head_h)W^O.$ (3)

$W^O\in\mathbb{R}^{hd_v\times d_{model}}$ and $\mathrm{concat}(\cdot)$ denotes concatenation. In addition, the model uses the sinusoidal positional encoding scheme, which allows the attention calculation to adapt well to different speech lengths.
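To make the above concrete, the following PyTorch sketch implements Eqs. (1)-(3) for the baseline configuration ($d_{model}=512$, $h=8$); the module and variable names are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product multi-head self-attention, Eqs. (1)-(3) (a sketch)."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_head = d_model // num_heads          # d_q = d_k = d_v = d_model / h
        self.w_q = nn.Linear(d_model, d_model)      # W^q for all heads, fused
        self.w_k = nn.Linear(d_model, d_model)      # W^k
        self.w_v = nn.Linear(d_model, d_model)      # W^v
        self.w_o = nn.Linear(d_model, d_model)      # W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        b, t, _ = x.shape
        def split(z):  # (b, t, d_model) -> (b, h, t, d_head)
            return z.view(b, t, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # Q_i K_i^T / sqrt(d/h)
        attn = F.softmax(scores, dim=-1)            # replaced by alpha-entmax in Sec. III-A
        heads = attn @ v                            # one context per head, Eq. (1)
        out = heads.transpose(1, 2).reshape(b, t, -1)  # concat(head_1, ..., head_h)
        return self.w_o(out)                        # Eq. (3)
```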

III Approach

In order to systematically evaluate the performance of the adaptive sparse Transformer, we design multiple Transformer structures based on different sparse functions, including a sparsemax Transformer, a 1.5-entmax Transformer, and an adaptive α-entmax Transformer. Our innovations lie in two components: an adaptive sparse multi-head self-attention, in which different sparse functions replace the softmax in self-attention and an adaptive scheme gives each self-attention its own degree of sparsity (Sec. III-A), and a regularized, adaptive monotonic multi-head attention (Sec. III-B). Figure 1 shows how these two core components are integrated into the baseline.

Figure 1: The diagram of our model architecture. Note that the residual connections and layer normalization are omitted.

III-A Adaptive Sparse Multi-head Self-Attention

As shown in Figure 2, we introduce several sparse functions into self-attention to avoid the drawbacks of softmax. The first idea is to substitute softmax with sparsemax for every independent head of the Transformer, which can entirely omit low-probability attention and distribute the probability mass only over elements of high interest. It is obtained by computing the Euclidean projection of the input vector z onto the probability simplex:

$\mathrm{Sparsemax}(\mathbf{z})=\arg\min_{\mathbf{p}\in\Delta^J}\frac{1}{2}\|\mathbf{z}-\mathbf{p}\|_2^2,$ (4)

where $\Delta^J=\{\mathbf{p}\in\mathbb{R}^J \mid \mathbf{p}\geq 0,\ \sum_{j}^{J}p_j=1\}$ is the J-dimensional probability simplex. To overcome the potential issue of softmax, we propose to treat the α value of each attention head differently: some heads may learn to be informatively sparser, while others may stay close to the original softmax. We also propose to treat the α values as learnable variables, optimized via back-propagation along with the other network parameters. In [19], the sparsity of sparsemax is proved and a closed-form solution is given:

$\mathbf{p}^*=\mathrm{sparsemax}(\mathbf{z})=[\mathbf{z}-\tau]_+,$ (5)

where $[\cdot]_+$ denotes $\max(\cdot,0)$ and τ is the Lagrange multiplier for the constraint $\sum_j p_j=1$. Here τ can also be interpreted as a threshold below which the corresponding probability is set to zero. Next, we extend the concept of sparse attention to the multi-head architecture. The core change is that we use α-entmax [28] to introduce sparsity into the attention mechanism of sparse self-attention. For the three transformations mentioned (softmax, sparsemax, α-entmax), the relationship among them can be analyzed from the perspective of entropy:

$\alpha\text{-entmax}(\mathbf{z})=\arg\max_{\mathbf{p}\in\Delta^J}\mathbf{p}^T\mathbf{z}+H_\alpha^T(\mathbf{p}).$ (6)

$H_\alpha^T(\mathbf{p})$ is the Tsallis α-entropy family, defined as:

$H_\alpha^T(\mathbf{p})=\begin{cases}\frac{1}{\alpha(\alpha-1)}\sum_{j}(p_j-p_j^\alpha) & \alpha\neq 1\\ -\sum_{j}p_j\log p_j & \alpha=1\end{cases}$ (7)

Here α is a hyperparameter. For the cases α = 1 and α = 2, the corresponding α-entropies are the Shannon and Gini entropies, and the corresponding α-entmaxes are softmax and sparsemax, respectively. This establishes the connection between softmax and sparsemax and also sheds light on intermediate models between the two transformations. [28] proves the uniqueness of τ and proposes an exact method for calculating α-entmax when α = 1.5 (also called 1.5-entmax). [44] proposes an extended iterative bisection approach applicable to general α-entmax. It is therefore reasonable to set α to different values in the Transformer so that it better adapts to different heads in different layers.
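As a concrete illustration of Eq. (5), the following PyTorch sketch computes sparsemax (the α = 2 case) with the standard sort-based closed form of [19]; 1.5-entmax admits a similar exact algorithm [28], and general α can be handled by bisection [44]. The function name and example values are illustrative only.

```python
import torch

def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sort-based closed form of Eq. (5): p* = [z - tau]_+ (a sketch)."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    cumsum = z_sorted.cumsum(dim)                      # running sum of sorted scores
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)
    support = 1 + k * z_sorted > cumsum                # entries that stay in the support S
    k_z = support.sum(dim=dim, keepdim=True)           # |S|, size of the support
    tau = (cumsum.gather(dim, k_z - 1) - 1) / k_z      # threshold (Lagrange multiplier)
    return torch.clamp(z - tau, min=0.0)               # [z - tau]_+

# example: only the two largest logits keep probability mass
print(sparsemax(torch.tensor([1.0, 0.8, 0.1])))        # tensor([0.6000, 0.4000, 0.0000])
```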

Inspired by [21], we apply an adaptive α-entmax attention method in the multi-head attention mechanism for ASR tasks. α-entmax is a convex optimization problem; when α is treated as a variable rather than a hyperparameter, we have $\mathbf{p}^*=\alpha\text{-entmax}(\mathbf{z},\alpha)$. It is not trivial to derive $\frac{\partial\,\alpha\text{-entmax}(\mathbf{z},\alpha)}{\partial\alpha}$ from the Lagrangian and optimality conditions by taking the derivative with respect to z as in [19], and current deep learning frameworks cannot differentiate through this optimization problem automatically. We therefore tackle the problem from the perspective of the solution. For $j\notin S$, $p_j^*=0$ is a constant, so $\frac{\partial p_j^*}{\partial\alpha}=0$. To simplify the gradient for $i\in S$, we use $\overline{\mathbf{p}}^*$ and $\overline{\mathbf{z}}$ to denote the corresponding vectors whose indices lie in the support S. Finally, we can solve for the components of the Jacobian.
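A minimal numerical sketch of Eq. (6) using the bisection idea of [44] is shown below. It assumes a given α > 1 per call; the custom backward pass for a learnable α described above (and in [21]) is omitted, so this is not the paper's exact implementation.

```python
import torch

def entmax_bisect(z: torch.Tensor, alpha: float = 1.5,
                  n_iter: int = 50, dim: int = -1) -> torch.Tensor:
    """alpha-entmax of Eq. (6) via bisection on the threshold tau (a sketch).

    For alpha > 1 the solution takes the form
        p_j = [ (alpha - 1) * z_j - tau ]_+ ** (1 / (alpha - 1)),
    with tau chosen so that the probabilities sum to one.
    """
    if alpha == 1.0:                                  # alpha = 1 recovers softmax
        return torch.softmax(z, dim=dim)
    x = (alpha - 1) * z
    tau_lo = x.max(dim=dim, keepdim=True).values - 1  # at this tau, total mass >= 1
    tau_hi = x.max(dim=dim, keepdim=True).values      # at this tau, total mass = 0
    for _ in range(n_iter):
        tau = (tau_lo + tau_hi) / 2
        p = torch.clamp(x - tau, min=0.0) ** (1.0 / (alpha - 1))
        mass = p.sum(dim=dim, keepdim=True)
        tau_lo = torch.where(mass >= 1, tau, tau_lo)  # too much mass: raise the threshold
        tau_hi = torch.where(mass < 1, tau, tau_hi)   # too little mass: lower the threshold
    p = torch.clamp(x - tau, min=0.0) ** (1.0 / (alpha - 1))
    return p / p.sum(dim=dim, keepdim=True)           # absorb the residual bisection error

# alpha = 2 recovers sparsemax, alpha -> 1 approaches softmax
print(entmax_bisect(torch.tensor([1.0, 0.8, 0.1]), alpha=2.0))  # ~[0.6, 0.4, 0.0]
```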

Figure 2: Illustrations of the proposed adaptive sparse attention.

III-B Adaptive Monotonic Multi-head Attention

For end-to-end ASR, another fundamental challenge is the alignment between the input and output sequences. CTC [1] and the RNN-Transducer [2] are monotonic, so the alignment can be marginalized tractably, whereas the attention approach yields non-monotonic alignments, which may diminish recognition quality. The hard monotonic attention mechanism was introduced for streaming, time-synchronous decoding with attention-based ASR, and proceeds as follows. At decoding time-step i, the attention mechanism begins inspecting encoder entries starting from the position selected at the preceding output time-step, denoted $t_{i-1}$. It then computes an energy scalar $e_{i,j}$ with a monotonic energy function $\mathrm{MonotonicEnergy}(\cdot)$, which takes the $(i-1)$-th output state $s_{i-1}$ and the encoder features $h_j$ as inputs, where $j=t_{i-1},t_{i-1}+1,\dots$. The energy scalar is computed as follows:

TABLE I: Results of different max functions (CER (%) on AISHELL, WER (%) on WSJ); T denotes the softmax temperature. Numbers in the last three rows correspond to sparsifying only the encoder / only the decoder / both.
Method Sparse Attention Sparse Attention and Monotonic Attention
   AISHELL-Dev    AISHELL-Test    WSJ-eval93    AISHELL-Dev    AISHELL-Test    WSJ-eval93
softmax 8.60 9.33 9.58 8.23 9.00 9.36
softmax(T=0.75) 8.54 9.03 9.34 8.43 8.91 9.07
softmax(T=0.5) 7.80 8.64 8.80 7.73 8.59 8.64
softmax(T=0.25) 8.07 8.91 9.05 7.87 8.69 8.91
sparsemax 8.91 / 8.99 / 8.98 9.56 / 9.53 / 9.55 9.69 / 9.64 / 9.62 8.64 / 8.58 / 8.64 9.13 / 9.29 / 9.06 9.45 / 9.38 / 9.34
1.5-entmax 7.69 / 7.71 / 7.69 8.57 / 8.62 / 8.49 8.47 / 8.58 / 8.56 7.68 / 7.67 / 7.68 8.51 / 8.59 / 8.47 8.49 / 8.55 / 8.39
α-entmax (adaptive) 7.67 / 7.71 / 7.65 8.54 / 8.60 / 8.53 8.35 / 8.47 / 8.38 7.71 / 7.69 / 7.64 8.51 / 8.52 / 8.45 8.38 / 8.49 / 8.37
$e_{i,j}=\mathrm{MonotonicEnergy}(s_{i-1},h_j).$ (8)

Details of the monotonic energy function can be found in [38]. The energy values $e_{i,j}$ are then passed through a sigmoid function $\sigma(\cdot)$ to produce the selection probability $p_{i,j}$, formulated as

$p_{i,j}=\sigma(e_{i,j}).$ (9)

The selection probability $p_{i,j}$ is then used to produce a discrete decision variable $z_{i,j}$, formulated as

$z_{i,j}\sim \mathrm{Bernoulli}(p_{i,j}).$ (10)

When $p_{i,j}$ is high, $z_{i,j}$ is likely to be activated (i.e., to take the value 1). Once $z_{i,j}=1$, the model stops scanning, sets $t_i=j$, and uses $c_i=h_{t_i}$ as the context.
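The scan described by Eqs. (8)-(10) can be sketched as follows for a single decoding step at test time, where the stochastic Bernoulli draw is replaced by the usual deterministic threshold $p_{i,j}>0.5$; the energy function is passed in as a placeholder rather than reproducing the parameterization of [38].

```python
import torch

def hard_monotonic_attend(energy_fn, s_prev, h, t_prev):
    """One decoder step of the hard monotonic scan of Eqs. (8)-(10) at test time.

    A sketch: `energy_fn` stands in for MonotonicEnergy of [38]; h is a (S, d)
    tensor of encoder states, s_prev the previous decoder state, t_prev the
    index chosen at step i-1.  Returns the chosen index t_i and c_i = h[t_i].
    """
    S = h.size(0)
    for j in range(t_prev, S):                 # inspect entries left to right
        e_ij = energy_fn(s_prev, h[j])         # Eq. (8): monotonic energy
        p_ij = torch.sigmoid(e_ij)             # Eq. (9): selection probability
        if p_ij.item() > 0.5:                  # deterministic surrogate for Eq. (10)
            return j, h[j]                     # stop: t_i = j, c_i = h_{t_i}
    return S - 1, h[-1]                        # no trigger: fall back to the last frame

# e.g. energy_fn = lambda s, hj: (s * hj).sum()   # hypothetical dot-product energy
```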

In contrast to the single attention head of a recurrent sequence-to-sequence model, the multi-head mechanism is capable of learning contexts with diverse paces between the input and output sequences and of learning complex alignments. Specifically, the result of each head can be regarded as corresponding to certain frames in the speech, and different heads are related to each other: while some heads detect boundaries, others capture strong intrinsic speech structures.

Dense attention mechanisms can fail to capture or exploit the strong intrinsic structures present in speech. By taking advantage of the dependencies among the frames that different heads focus on, we can emphasize the key information in the speech and highlight the characteristic representation of those frames. However, there is no guarantee that all heads contribute to the overall context learning. Moreover, monotonic multi-head alignment should prevent delayed decoding caused by heads devoted to boundary detection, and every monotonic attention head must learn its alignment properly, so it is necessary to prune redundant monotonic attention heads. Thus, to enhance agreement among heads on token-boundary discovery, we use L1 regularization to eliminate information-redundant heads in shallow layers, as sketched below.
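A minimal sketch of such an L1 penalty follows; the tensor being penalized (per-head expected selection probabilities) and the coefficient are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def l1_head_regularizer(head_attn: torch.Tensor, coeff: float = 1e-2) -> torch.Tensor:
    """L1 penalty over per-head monotonic attention weights (a sketch).

    `head_attn` is assumed to hold the expected selection probabilities with
    shape (batch, heads, target_len, source_len).  Heads whose weights are
    driven to zero can then be treated as redundant and pruned.
    """
    return coeff * head_attn.abs().sum(dim=(-2, -1)).mean()

# training-loop sketch: loss = attention_loss + ctc_loss + l1_head_regularizer(attn)
```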

IV Experiment

In this section, we compare our method with the standard Transformer architecture using the softmax transform. For a fair comparison, we also compare against two other method variants that explore different normalizing transformations:

1.5-entmax: a fixed value α = 1.5 is used for all heads of a sparse entmax Transformer. This setting is still novel here, since 1.5-entmax had previously been introduced only for recurrent ASR methods and had not been explored in the Transformer. The major difference is that in a seq2seq method the sparse entmax affects only one attention module, whereas here it is an integral component of all the Transformer's attention modules.

α-entmax: to benefit from an adaptive value of the sparse entmax transformation, a learnable $\alpha_{i,j}^t$ is assigned to every attention head, as sketched below.
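A minimal sketch of a per-head learnable α is shown below; squashing an unconstrained parameter into (1, 2), initialized at 1.5, is an assumed parameterization for illustration rather than the exact scheme of [21].

```python
import torch
import torch.nn as nn

class PerHeadAlpha(nn.Module):
    """One learnable alpha per attention head (a sketch of the adaptive variant).

    An unconstrained parameter is squashed into (1, 2), so each head can move
    between softmax-like (alpha -> 1) and sparsemax-like (alpha -> 2) behaviour;
    zero initialization gives alpha = 1.5.
    """
    def __init__(self, num_heads: int = 8):
        super().__init__()
        self.raw = nn.Parameter(torch.zeros(num_heads))   # one scalar per head

    def forward(self) -> torch.Tensor:
        return 1.0 + torch.sigmoid(self.raw)              # alpha_i in (1, 2)
```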

IV-A Dataset and Setup

Datasets: We systematically evaluate model performance on AISHELL-1 [45] and WSJ. AISHELL-1 is a Chinese corpus consisting of 400 speakers and almost 170 hours of recorded utterances, split into training, development, and test sets: the training set contains 120,098 utterances from 340 speakers, the development set contains 14,326 utterances from 40 speakers, and the test set contains 7,176 utterances from 20 speakers. In the AISHELL-1 experiments we do not use a language model. For the WSJ dataset, we use eval-92 for evaluation and eval-93 for testing, and we train an N-gram language model on the transcribed text.

Metrics: Word error rate (WER) and character error rate (CER) are used as evaluation metrics. WER is the fraction of words whose recognized form differs from the ground-truth labels. CER is computed from the Levenshtein distance between the recognized and ground-truth character sequences, divided by the reference length. For both metrics, lower values indicate better performance.
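For reference, CER can be computed with a standard Levenshtein dynamic program, as in the following sketch; WER is obtained the same way over word tokens.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length (a sketch)."""
    r, h = list(reference), list(hypothesis)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]   # d[i][j]: distance r[:i] vs h[:j]
    for i in range(len(r) + 1):
        d[i][0] = i                                       # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                                       # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + sub)          # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)

print(cer("speech recognition", "speech recognitoin"))   # 2 edits / 18 chars = 0.111...
```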

Training Details: As mentioned above, the input speech waveform is first converted to 80-dimensional filterbank features with a 10 ms hop size and a 25 ms window, followed by mean subtraction and normalization over time and across speakers. The architecture uses $N_e=12$ encoder blocks, $N_d=6$ decoder blocks, and a feed-forward dimension of $d_{ff}=1024$; the feature dimension $d_{model}=512$ and the number of attention heads $h=8$ are kept the same as the baseline. We use the Adam optimizer with $\beta_1=0.9$, $\beta_2=0.98$, 25,000 warmup steps, and a dropout rate of 0.1. For decoding, we use beam search with a beam size of 5. We only change the softmax function in the encoder self-attention and add adaptive monotonic multi-head attention in the decoder. For comparison, we adopt an RNN-Transducer whose encoder is a 4-layer bidirectional LSTM with 320 hidden units per direction and whose prediction module is a 2-layer unidirectional LSTM with 512 hidden units; this architecture considers both acoustic and linguistic features and projects the non-linear activation output through a softmax.
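The optimizer settings can be reproduced roughly as below; the paper specifies the Adam betas and the 25,000 warmup steps, while the inverse-square-root (Noam) schedule and the epsilon value are assumptions commonly paired with them.

```python
import torch

def noam_lr(step: int, d_model: int = 512, warmup: int = 25000) -> float:
    """Inverse-square-root warmup factor (an assumed Noam-style schedule)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(512, 512)                         # placeholder for the ASR model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,  # scaled by the lambda below
                             betas=(0.9, 0.98), eps=1e-9)  # eps is an assumed value
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
# per training step: optimizer.step(); scheduler.step()
```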

IV-B Overall Results

TABLE II: WER (%) comparisons on WSJ eval93
       Method       WER
Transformer 9.39
wav2letter [46] 9.5
Jasper [47] 9.9
Explicit Sparse Transformer [48] 12.98
Factorized Attention Transformer [49] 8.97
Combiner [50] 9.32
Proposed method 8.22
TABLE III: CER (%) comparisons on AISHELL-1
      Method       Dev       Test
Transformer 8.47 9.32
RNN-T 10.13 11.82
LAS [5] - 10.56
SA-T [2] 8.30 9.30
Sync-Transformer [51] 7.91 8.91
Proposed method 7.58 8.40

The results are shown in Table I. When sparse transforms are added to either the Transformer encoder or the decoder alone, both variants perform better, and applying them to both the encoder and the decoder achieves the best results. We speculate that this is because the sparse transform enhances the discrimination of the acoustic features, whereas, by comparison, discrimination of the text information matters less. In addition, the model that completely replaces softmax with the sparse transform achieves the best results. The sparse methods also improve significantly over simply changing the softmax temperature. A low softmax temperature makes the softmax distribution steeper, thereby highlighting some information, but the output of softmax remains strictly positive everywhere, which still wastes attention. It is also worth noting that a temperature that is too low makes the model focus on one part of the information and ignore other important information, leading to a decline in the results.

TABLE IV: Sparsity comparisons (WER %) on LibriSpeech Corpus.
   Method    dev-clean    dev-other    test-clean    test-other
softmax 3.64 11.89 3.86 11.95
sparsemax 3.88 12.01 4.00 12.19
1.5-entmax 3.48 10.43 3.79 11.76
α-entmax 3.36 10.29 3.70 11.64

In self-attention, replacing the softmax function with adaptive α-entmax consistently reduces WER, which shows that adding sparsity benefits the model. Sparsemax, however, degrades performance, which indicates that excessive sparsity causes loss of information: excessive sparsity cannot highlight the key information in the speech, and only a properly designed adaptive sparsity can do so effectively. Combining the adaptive α-entmax in self-attention, we report the best results in Tables II and III.

IV-C The Analysis of Adaptively Sparse Attention

In this section, we retain the monotonic attention and study the effect of different sparse transformations in the sparse self-attention. In Table I, we keep the self-attention module in its original state (softmax) and introduce the transformations into the cross-head attention block for experiments; the third to sixth rows of Table I correspond to different max functions in that block. Owing to the small number of attention heads, a max function that is too sparse (sparsemax) leads to loss of information, while a max function with insufficient sparsity provides no sparsifying effect; it is worth mentioning that the adaptive α-entmax function tends toward α = 1 there. To further investigate the proposed method on more challenging data, we also conduct ablation studies on the English LibriSpeech dataset in Table IV. In Figure 3, we show an example with different sparse transformations in sparse self-attention. Compared with softmax, sparsemax produces competitively sparse output but misses some output information; 1.5-entmax produces more complete outputs and thus better performance; and α-entmax lets each self-attention module use a specific α according to its emphasis, obtaining the best performance. We can see that α-entmax not only avoids wasting attention scores but also effectively highlights important frames in the speech; darker colors indicate higher attention scores.

Figure 3: Attention map of different transformations in sparse self-attention.
Figure 4: (a) The change of α in sparse self-attention during training; (b) the statistics of the final scores of the various attention mechanisms.
TABLE V: Ablation studies about the L1 regularization on monotonic attention.
     Method      AISHELL-Test      WSJ-eval93
reg. w/o reg. reg. w/o reg.
softmax 9.00 9.29 9.36 9.43
sparsemax 9.13 9.32 9.45 9.47
α-entmax 8.49 8.50 8.37 8.35

In Figure 4(a), we track the change of the α parameter during training. At the beginning of training, α decreases to some degree, which shows that a dense, softmax-like output is desirable in the early stage; in the encoder in particular, the closer a layer is to the feature input, the closer its α is to 1, indicating that a softmax-like function effectively enhances the distinguishability of features. After a certain amount of training, all parts except the first encoder layer move toward sparsity, which shows that the initialization of α cannot determine the degree of sparsity in advance and, to some extent, that treating α as a learnable parameter suits multi-head attention. In Figure 4(b), we visualize the final output attention scores of the cross-head attention: softmax generates many low attention scores due to its lack of sparsity, whereas the structure we propose produces more high scores and avoids wasting attention.

IV-D The Analysis of Adaptively monotonic Attention

In Table I, compared with sparse attention alone, the performance of the sparsemax methods improves by 0.17% absolute on average with the help of monotonic attention. Monotonic attention is also complementary to the sparsemax and entmax methods: test results improve by 0.13% CER on AISHELL and 0.07% WER on WSJ, respectively. These results show that monotonic attention can cooperate with sparse attention to further improve performance.

We further examine the L1 regularization on the monotonic attention. As shown in Table V, with regularization all max-function methods gain improvements on both AISHELL-1 and WSJ. For the original softmax version, closing the gap between the unregularized variant and the proposed one corresponds to a reduction from 9.29% to 9.00% on AISHELL-1 and from 9.43% to 9.36% on WSJ. When adaptive sparsity drives α toward 1.0, the monotonic multi-head alignment can still prevent delayed token generation caused by heads devoted to boundary detection.

V Conclusion

This paper integrates an adaptive sparse attention mechanism and an adaptive monotonic attention mechanism into a Transformer-based end-to-end ASR model. The adaptive sparse attention mechanism is implemented in multi-head self-attention by replacing the softmax function with α-entmax, highlighting important information in the input speech. The adaptive monotonic multi-head attention is implemented by adding a regularization term. The experimental results show that our method focuses more attention on the more important information and achieves the best results. The adaptive mechanism relies only on gradient-based optimization, searching for per-head sparsity while computing α-entmax.

VI Acknowledgment

This paper is supported by the Key Research and Development Program of Guangdong Province under grant No.2021B0101400003. Corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd ([email protected]).

References

  • [1] A. Graves, S. Fernández, and F. Gomez, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” International Conference on Machine Learning, pp. 369–376, 2006.
  • [2] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen, “Self-attention transducers for end-to-end speech recognition.” Interspeech, 2019.
  • [3] S. Karita, N. E. Y. Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and T. Nakatani, “Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration,” in Proc. Interspeech 2019, pp. 1408–1412.
  • [4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” Advances in neural information processing systems, vol. 28, pp. 577–585, 2015.
  • [5] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
  • [6] X. Song, Z. Wu, Y. Huang, C. Weng, D. Su, and H. Meng, “Non-autoregressive transformer asr with ctc-enhanced decoder input,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 5894–5898.
  • [7] M. Li, C. Zorilă, and R. Doddipatla, “Head-synchronous decoding for transformer-based streaming asr,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 5909–5913.
  • [8] C. Zhao, J. Wang, X. Qu, H. Wang, and J. Xiao, “r-g2p: Evaluating and enhancing robustness of grapheme to phoneme conversion by controlled noise introducing and contextual information incorporation,” ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6197–6201, 2022.
  • [9] E. Tsunoo, Y. Kashiwagi, and S. Watanabe, “Streaming transformer asr with blockwise synchronous beam search,” in 2021 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2021, pp. 22–29.
  • [10] Z. Hong, J. Wang, X. Qu, J. Liu, C. Zhao, and J. Xiao, “Federated learning with dynamic transformer for text to speech,” in Interspeech, 2021.
  • [11] C. Zhao, J. Wang, L. Li, X. Qu, and J. Xiao, “Adaptive few-shot learning algorithm for rare sound event detection,” 2022 International Joint Conference on Neural Networks (IJCNN), 2022.
  • [12] N. Zhang, J. Wang, Z. Hong, C. Zhao, X. Qu, and J. Xiao, “Dt-sv: A transformer-based time-domain approach for speaker verification,” 2022 International Joint Conference on Neural Networks (IJCNN), 2022.
  • [13] Z. Hong, J. Wang, W. Wei, J. Liu, X. Qu, B. Chen, Z. Wei, and J. Xiao, “When hearing the voice, who will come to your mind,” 2021 International Joint Conference on Neural Networks (IJCNN), 2021.
  • [14] Z. Deng, Y. Cai, L. Chen, Z. Gong, Q. Bao, X. Yao, D. Fang, W. Yang, S. Zhang, and L. Ma, “Rformer: Transformer-based generative adversarial network for real fundus image restoration on a new clinical benchmark,” IEEE Journal of Biomedical and Health Informatics, 2022.
  • [15] Y. Cai, J. Lin, X. Hu, H. Wang, X. Yuan, Y. Zhang, R. Timofte, and L. V. Gool, “Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [16] Y. Cai, X. Hu, H. Wang, Y. Zhang, H. Pfister, and D. Wei, “Learning to generate realistic noisy images via pixel-level noise-aware adversarial training,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • [17] Y. Cai, J. Lin, X. Hu, H. Wang, X. Yuan, Y. Zhang, R. Timofte, and L. V. Gool, “Coarse-to-fine sparse transformer for hyperspectral image reconstruction,” in European Conference on Computer Vision (ECCV), 2022.
  • [18] R. Fan, W. Chu, P. Chang, and J. Xiao, “Cass-nat: Ctc alignment-based single step non-autoregressive transformer for speech recognition,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 5889–5893.
  • [19] A. Martins and R. Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” in International Conference on Machine Learning, 2016, pp. 1614–1623.
  • [20] V. Niculae and M. Blondel, “A regularized framework for sparse and structured neural attention,” in Advances in Neural Information Processing Systems, 2017, pp. 3338–3348.
  • [21] G. Correia, V. Niculae, and A. F. Martins, “adaptively sparse transformer,” in Conference on Empirical Methods in Natural Language Processing, 2019.
  • [22] X. Song, Z. Wu, Y. Huang, C. Weng, D. Su, and H. Meng, “Non-autoregressive transformer asr with ctc-enhanced decoder input,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 5894–5898.
  • [23] J. Xue, T. Zheng, and J. Han, “Structured sparse attention for end-to-end automatic speech,” International Conference on Acoustics Speech and Signal Processing (ICASSP), 2020.
  • [24] V. Niculae, A. F. Martins, M. Blondel, and C. Cardie, “Sparsemap: Differentiable sparse structured inference,” International Conference on Machine Learning, 2018.
  • [25] C. Malaviya, P. Ferreira, and A. F. Martins, “Sparse and constrained attention for neural machine translation,” Annual Meeting of the Association for Computational Linguistics (ACL)., 2018.
  • [26] D. Mareček and R. Rosa, “Extracting syntactic trees from transformer encoder self-attentions,” in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018, pp. 347–349.
  • [27] A. Plastino and A. Plastino, “Stellar polytropes and tsallis’ entropy,” vol. 174, no. 5-6.   Elsevier, 1993, pp. 384–386.
  • [28] B. Peters, V. Niculae, and A. F. Martins, “Sparse sequence-to-sequence models,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1504–1519.
  • [29] C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online and linear-time attention by enforcing monotonic alignments,” in International Conference on Machine Learning.   PMLR, 2017, pp. 2837–2846.
  • [30] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2017, pp. 4835–4839.
  • [31] A. H. Liu, H.-y. Lee, and L.-s. Lee, “Adversarial training of end-to-end speech recognition using a criticizing language model,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 6176–6180.
  • [32] K. Kim, K. Lee, D. Gowda, J. Park, S. Kim, S. Jin, Y.-Y. Lee, J. Yeo, D. Kim, S. Jung et al., “Attention based on-device streaming speech recognition with large speech corpus,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2019, pp. 956–963.
  • [33] T. Nakatani, “Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,” in Proc. Interspeech 2019, 2019.
  • [34] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” International Conference on Learning Representations (ICLR), 2017.
  • [35] R. Fan, P. Zhou, W. Chen, J. Jia, and G. Liu, “An online attention-based model for speech recognition,” Proc. Interspeech 2019, pp. 4390–4394, 2019.
  • [36] N. Arivazhagan, C. Cherry, W. Macherey, C.-C. Chiu, S. Yavuz, R. Pang, W. Li, and C. Raffel, “Monotonic infinite lookback attention for simultaneous machine translation,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1313–1323.
  • [37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [38] X. Ma, J. Pino, J. Cross, L. Puzon, and J. Gu, “Monotonic multihead attention,” International Conference on Learning Representations (ICLR), 2019.
  • [39] H. Inaguma, M. Mimura, and T. Kawahara, “Enhancing monotonic multihead attention for streaming asr,” INTERSPEECH, 2020.
  • [40] S. Li, D. Raj, X. Lu, P. Shen, T. Kawahara, and H. Kawai, “Improving transformer-based speech recognition systems with compressed structure and speech attributes augmentation.” in INTERSPEECH, 2019, pp. 4400–4404.
  • [41] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” INTERSPEECH, 2020.
  • [42] N. Moritz, T. Hori, and J. Le Roux, “Triggered attention for end-to-end speech recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5666–5670.
  • [43] L. Dong, F. Wang, and B. Xu, “Self-attention aligner: A latency-control end-to-end model for asr using self-attention network and chunk-hopping,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5656–5660.
  • [44] M. Blondel, A. Martins, and V. Niculae, “Learning classifiers with fenchel-young losses: Generalized entropies, margins, and algorithms,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 606–615.
  • [45] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA).   IEEE, 2017, pp. 1–5.
  • [46] N. Zeghidour et al., “Fully convolutional speech recognition,” CoRR, vol. abs/1812.06864, 2018.
  • [47] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” in Interspeech 2019.
  • [48] G. Zhao, J. Lin, and X. Ren, “Explicit sparse transformer: Concentrated attention through explicit selection,” arXiv preprint arXiv:1912.11637, 2019.
  • [49] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers.” Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019.
  • [50] J. Wu, C. Xiong, and T. Schnabel, “Combiner: Inductively learning tree structured attention in transformer,” in International Conference on Learning Representations, 2020.
  • [51] Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, “Synchronous transformer for end-to-end speech recognition,” International Conference on Acoustics Speech and Signal Processing, 2020.