
Understanding Self-Attention of Self-Supervised Audio Transformers

Abstract

Self-supervised Audio Transformers (SAT) have achieved great success in many downstream speech applications like ASR, but how they work has not been widely explored yet. In this work, we present multiple strategies for the analysis of attention mechanisms in SAT. We categorize attentions into explainable categories and discover that each category possesses its own unique functionality. We provide a visualization tool for understanding multi-head self-attention, importance ranking strategies for identifying critical attentions, and attention refinement techniques to improve model performance.

Index Terms: Self-supervised Learning, Self-attention, Transformer Encoders, Interpretability

1 Introduction

Adapting the idea of self-supervised learning [1, 2, 3, 4] to continuous speech has received much attention in recent work [5, 6, 7, 8, 9, 10], where Transformer Encoders with multi-head self-attention [11] are pre-trained on large amounts of audio data in a self-supervised scheme. Once pre-trained, they are used to improve various downstream supervised tasks, including phone classification, speaker recognition, SLU, and ASR. Despite the great success of these Self-supervised Audio Transformers (SAT; the pre-trained Transformer Encoders have several different names in their original papers, and we refer to them as SAT for simplicity), their internal attentions are often neglected and left unexplored: we have little understanding of how they work or of the knowledge they acquire from large amounts of unlabeled data. Understanding how SAT models draw conclusions is crucial for both their improvement and their application. In natural language processing (NLP), explaining and interpreting pre-trained black-box models like BERT has been a well-explored topic [12, 13, 14, 15, 16, 17]. However, the analysis of models pre-trained on speech has not seen such widespread exploration, and it remains an important and challenging endeavor for the speech community.

In this work, we propose to analyze the multi-head self-attention mechanism of SAT through the following methods: visualization, categorization, functionality study, and importance ranking. We find that the self-attentions of SAT models tend to converge into three categories: global attentions, vertical attentions, and diagonal attentions. Diagonal attentions either attend strongly to their $\pm t$ neighbors or are highly correlated with phoneme boundaries; vertical attentions often concentrate on specific phonemes. For the noisy global attentions, we provide a visualization tool to draw insights into their implicit operations. Through our quantitative ranking analysis, we conclude that diagonal attentions are the most important, followed by vertical attentions. Last but not least, we introduce attention refinement methods that improve the learned representations by moderately removing global attentions or constraining the attention span, resulting in faster inference and higher performance.

2 Self-Supervised Audio Transformers

The main idea of BERT-style pre-training in NLP [1, 2, 3, 4] is to corrupt the input word tokens by randomly masking or permuting them according to a probability policy; layers of Transformer Encoders [11] are then trained together with a classifier that estimates the masked words at the output. Inspired by this idea, previous works [5, 7, 6, 8, 9, 10] proposed self-supervised learning for audio with Transformer Encoders. In this work, we refer to these models as Self-supervised Audio Transformers (SAT). Unlike BERT, whose inputs are discrete text tokens, the inputs of SATs are acoustic features (e.g., MFCC, FBANK, Mel-spectrogram), which are long continuous sequences that are locally smooth. Also, unlike BERT, which is trained by estimating discrete tokens, SATs replace the classifier with a feed-forward network and minimize the reconstruction error between real and predicted frames.

Among all SAT variants, we focus on those that take continuous acoustic frames as input [5, 7, 6], rather than those that operate on quantized discrete vectors of speech [8, 9, 10]. In our analysis, we specifically consider the framework described in Mockingjay [5], mainly because of its designs for handling the long and locally smooth nature of speech. In Mockingjay, two techniques, downsampling and consecutive masking, are introduced to address these issues. Downsampling is applied to the input features to adapt SATs to long sequences: to reduce the number of frames by a factor of $R_{factor}$, every $R_{factor}$ consecutive frames are reshaped and stacked into one frame [5, 18, 19]. Consecutive masking is applied during pre-training to prevent the model from exploiting the local smoothness of acoustic frames: instead of masking a single frame, $C_{num}$ consecutive frames are masked to zero. To study the attentions of SAT models, we use the LARGE model described in Mockingjay [5], which consists of 12 layers of Transformer Encoders. We train three models on the LibriSpeech [20] train-clean-360 subset with settings identical to [5], except for $C_{num}\in\{3,6,9\}$, and we name them M3, M6, and M9, respectively.
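For concreteness, the sketch below illustrates the two techniques in numpy; the masking probability and the block-wise sampling policy are illustrative assumptions, not Mockingjay's exact masking schedule.

```python
import numpy as np

def downsample(frames, r_factor=3):
    """Stack every r_factor consecutive frames into one frame,
    reducing the sequence length by r_factor.
    frames: (T, d) acoustic features -> (T // r_factor, d * r_factor)."""
    T, d = frames.shape
    T_new = T // r_factor
    return frames[:T_new * r_factor].reshape(T_new, d * r_factor)

def consecutive_mask(frames, c_num=3, mask_prob=0.15, seed=0):
    """Zero out blocks of c_num consecutive frames so the model cannot simply
    copy locally smooth neighbors when reconstructing the masked portion."""
    rng = np.random.default_rng(seed)
    masked = frames.copy()
    for start in range(0, frames.shape[0], c_num):
        if rng.random() < mask_prob:   # block-wise sampling policy assumed for illustration
            masked[start:start + c_num] = 0.0
    return masked
```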

3 Notations

We start by defining notation for the self-attention mechanism and SAT representations. Given a length-$T$ sequence of vectors $\bm{x}=x_1,\dots,x_T\in\mathbb{R}^d$, we denote $A^h_u\in\mathbb{R}^{T\times T}$ as the attention weights for all query-key pairs of a head $h$ when propagating an utterance $u$. Hence, $A^h_u[q,k]\in\mathbb{R}$ is the attention weight of $x_q$ attending to $x_k$. We use $q$ for the timestamp of the query and $k$ for the timestamp of the key, where $1\leq q,k\leq T$. As a result, $A^h_u[q]\in\mathbb{R}^T$ is the attention distribution formed by $x_q$, which is a row if we view $A^h_u$ as a map. When analyzing the representations of an $L$-layer SAT, we denote $\bm{x}^l=x^l_1,\dots,x^l_T\in\mathbb{R}^d$ as the representations of a given layer, where $0\leq l\leq L$ and $\bm{x}^0$ represents the input features.
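To make the indexing concrete, the following toy example (with arbitrary sizes, not the paper's actual dimensions) shows how $A^h_u$, $A^h_u[q]$, and $A^h_u[q,k]$ would be accessed in code.

```python
import numpy as np

T, d, n_heads = 128, 160, 12                 # illustrative sizes only
x = np.random.randn(T, d)                    # an utterance u: x_1, ..., x_T
A = np.random.rand(n_heads, T, T)
A /= A.sum(axis=-1, keepdims=True)           # each row is a distribution over keys

A_h = A[0]          # A_u^h for one head h: a (T, T) attention map
row = A_h[10]       # A_u^h[q]: how query x_10 distributes attention over all keys
w = A_h[10, 42]     # A_u^h[q, k]: attention weight of x_10 attending to x_42
```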

4 Visualization and Categorization

Figure 1: Attention maps of heads favored by G, V, D, visualized with the same utterance. (a)(c)(e) are average cases; (b)(d)(f) are extreme cases found by maximizing the metrics.

We plot $A^h_u\in\mathbb{R}^{T\times T}$ as an attention map, where $A^h_u[0,0]$ is at the upper-left corner, as in Fig 1. (Due to the page limit, we provide additional attention maps and supplementary materials on our demo page: https://github.com/leo19941227/Self-Attention-on-SATs.) SAT attentions tend to converge into three categories: (1) global: flat attention distributions; (2) vertical: attention maps with vertical lines; and (3) diagonal: attention maps with a clear diagonal. Because the attention maps of a head are similar across utterances with respect to these three categories, we study self-attention at the level of heads instead of single attention maps. To classify heads into the three categories, we define three metrics that quantify a head $h$: globalness $G$, verticality $V$, and diagonality $D$, given in Equations 1, 2, and 3, respectively.

G(h)=𝔼uU[1Tq=1T(Auh[q])]G(h)=\underset{u\sim U}{\mathbb{E}}\bigg{[}\ \frac{1}{T}\sum_{q=1}^{T}\mathbb{H}(\ A^{h}_{u}[q]\ )\ \bigg{]}\vspace{-2pt} (1)
V(h)=𝔼uU[(1Tq=1TAuh[q])]V(h)=\underset{u\sim U}{\mathbb{E}}\bigg{[}-\mathbb{H}(\ \frac{1}{T}\sum_{q=1}^{T}A^{h}_{u}[q]\ )\ \bigg{]}\vspace{-10pt} (2)
D(h)=𝔼uU[1T2q=1Tk=1T|qk|Auh[q,k]]D(h)=\underset{u\sim U}{\mathbb{E}}\bigg{[}-\frac{1}{T^{2}}\sum_{q=1}^{T}\sum_{k=1}^{T}|q-k|\cdot A^{h}_{u}[q,k]\ \bigg{]}\vspace{-2pt} (3)

where $\mathbb{H}$ is the standard entropy and $U$ is a speech corpus. Based on $G$, $V$, and $D$, we obtain three ranking lists over all heads. If a head ranks highest on the list for $G$ among the three lists, it is categorized as global, and so on. We use rankings instead of raw values because the metrics do not share the same numerical scale. Fig 1 shows two attention maps for each category.
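A direct numpy implementation of the three metrics for a single utterance could look as follows; corpus-level scores are obtained by averaging over utterances, as in the expectations above.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy of each attention distribution along the last axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def globalness(A):
    """Eq. (1): mean entropy of the per-query attention distributions. A: (T, T)."""
    return entropy(A).mean()

def verticality(A):
    """Eq. (2): negative entropy of the query-averaged attention distribution."""
    return -entropy(A.mean(axis=0))

def diagonality(A):
    """Eq. (3): negative mean |q - k| distance weighted by attention."""
    T = A.shape[0]
    q, k = np.meshgrid(np.arange(T), np.arange(T), indexing="ij")
    return -(np.abs(q - k) * A).sum() / T**2

# Heads are then ranked separately by each metric and assigned to the
# category in which they rank highest.
```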

Diagonal attentions attend to local neighbors for every query. Some are highly focused, as in Fig 1(f), and some are block diagonal, as in Fig 1(e). Interestingly, no SAT contains a highly focused diagonal attention on the main diagonal: the diagonals shift either to the left or to the right, and a larger masking span $C_{num}$ is accompanied by a larger shift, possibly because the SAT model tries to obtain useful information from frames further away. The functionality of block diagonal attentions is discussed in Section 5. Vertical attentions like Fig 1(c)(d) always attend to similar locations for all queries of a given utterance; global attentions like Fig 1(a)(b) behave randomly. These two categories are discussed in Section 6. Finally, we visualize the head distribution according to the metrics (see our demo page) and find that models trained with a larger masking span $C_{num}$ have more global heads. On the contrary, M3 contains the most diagonal heads, suggesting that a smaller $C_{num}$ makes SAT focus more on local structure. As for the distribution across layers, all categories of heads are distributed quite uniformly.

5 Phoneme Segmentation

Figure 2: Four images on the left side are plotted with the same utterance. (a) a block diagonal attention map. (b) a block diagonal map plotted with true phoneme boundaries. Two orange dotted lines show two examples of boundaries. (c) similarity matrix for (a). (d) similarity matrix for MFCCs. (e) precision-recall curve for M3, M6, M9 attentions and MFCCs.

Some attention maps show a clear block diagonal structure, like Fig 2(a). The borders of the blocks might be phoneme boundaries, as illustrated in Fig 2(b); it seems that diagonal attention is aware of phoneme intervals. We conduct phoneme segmentation to examine this correlation. We mainly follow the algorithm proposed in [21], which first computes a similarity matrix containing all pairwise distances between the features in a sequence, and then extracts boundary points from the similarity matrix. For segmentation with an attention map, its rows are treated as the feature sequence for computing the similarity matrix (see our demo page). Examples of similarity matrices computed from an attention map and from MFCCs are shown in Fig 2(c) and (d), respectively. We slightly modify the boundary-point-extraction algorithm of [21]; the modification makes the algorithm a little more stable, but yields only a small performance difference.
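A rough sketch of this pipeline is given below; the cosine similarity and the novelty-style boundary scoring are illustrative stand-ins for the kernel-Gram-based algorithm of [21] and for our modification, which are not reproduced exactly here.

```python
import numpy as np

def similarity_matrix(features):
    """Pairwise cosine similarity between feature frames (rows of an attention
    map, or MFCC frames for the baseline)."""
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    return normed @ normed.T

def boundary_scores(sim, win=5):
    """Novelty score along the diagonal: high within-block similarity on both
    sides of a candidate point, combined with low similarity across it,
    suggests a phoneme transition. Peak-picking on this curve yields
    boundary hypotheses."""
    T = sim.shape[0]
    scores = np.zeros(T)
    for t in range(win, T - win):
        left, right = slice(t - win, t), slice(t, t + win)
        within = (sim[left, left].mean() + sim[right, right].mean()) / 2
        across = sim[left, right].mean()
        scores[t] = within - across
    return scores
```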

TIMIT [22] is used for evaluating the phoneme segmentation since it provides ground-truth phoneme boundaries. We follow the setup in [23], which uses a small subset of the training set as a validation set to adjust a few algorithm parameters and evaluates on the test set. We use a 20ms tolerance window and evaluate with the R-value [24] and precision-recall curves. We hand-pick a visually block diagonal head for each of M3, M6, and M9. We choose MFCC as the baseline feature since it is the most prevailing feature for segmentation [21, 25, 26, 27]. Little performance difference is found between MFCC and $\bm{x}^0$ (Mel-scale spectrogram).
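For reference, the sketch below illustrates tolerance-window hit counting and the R-value following the definition in [24], with hit rate and over-segmentation expressed as fractions; it is not the exact scoring script used in our experiments.

```python
import math

def precision_recall(hyp, ref, tol=0.02):
    """Greedy one-to-one matching of hypothesized and reference boundaries
    (in seconds) within a tolerance window (here 20 ms)."""
    unmatched = list(ref)
    hits = 0
    for b in hyp:
        candidates = [r for r in unmatched if abs(r - b) <= tol]
        if candidates:
            unmatched.remove(min(candidates, key=lambda r: abs(r - b)))
            hits += 1
    return hits / max(len(hyp), 1), hits / max(len(ref), 1)  # precision, recall

def r_value(hit_rate, over_segmentation):
    """R-value combining hit rate (recall) and over-segmentation, as in [24]."""
    r1 = math.sqrt((1 - hit_rate) ** 2 + over_segmentation ** 2)
    r2 = (-over_segmentation + hit_rate - 1) / math.sqrt(2)
    return 1 - (abs(r1) + abs(r2)) / 2
```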

Fig 2(e) verifies the correlation between block diagonal attentions and phoneme boundaries: attentions clearly surpass MFCC under the same setting. As for the R-value, under the strict hit counting scenario [24], MFCC achieves 76.68, while M3, M6, and M9 achieve 79.99, 78.43, and 78.19, respectively. Interestingly, a larger masking span $C_{num}$ leads to poorer performance. The reason is that when $C_{num}=3$, the masked portions typically lie within a phoneme interval, so the model learns to utilize features in the same interval for reconstruction. On the other hand, $C_{num}=9$ can sometimes mask an entire phoneme interval, so the model instead tries to retrieve information beyond the interval.

It is worth mentioning that similarity matrices computed on MFCCs and on learned block diagonal attentions differ fundamentally: the former show high activation on similar but distant frames, as in Fig 2(d), while the latter are more aware of the neighboring phoneme structure. Figures similar to Fig 2(d) also appear (see our demo page) when we compute similarity matrices on the Mel-scale spectrogram or on SAT representations, suggesting that although similar frames can be located far apart, block diagonal heads have learned to ignore distant information and focus on the neighboring structure.

6 Phoneme Relation Map

To study the functionality of global and vertical heads, we propose to align attentions with phoneme relations, to see whether some heads look for specific phoneme relations in the utterances. For a sequence of input features $\bm{x}^0$, there exist frame-wise phoneme labels $\bm{y}\in\bm{Y}^T$, where $\bm{Y}$ is a predefined phone set. We interpret $x^l_q$ attending to $x^l_k$ as the head looking for phoneme $y_k$ when observing phoneme $y_q$. We quantify a phoneme relation $Y_m\rightarrow Y_n$ inside a head $h$ by summing all attention weights $A^h_u[q,k]$ whose phoneme relation $y_q\rightarrow y_k$ equals $Y_m\rightarrow Y_n$, over the entire speech corpus. More specifically, we plot a phoneme relation map (PRM) $P_h\in\mathbb{R}^{|\bm{Y}|\times|\bm{Y}|}$ using the following equations:

P'_{h}[m,n]=\underset{u\sim U}{\mathbb{E}}\bigg[\frac{1}{T}\sum_{q=1}^{T}\sum_{k=1}^{T}\mathbb{I}_{y_{q}=Y_{m}}\cdot\mathbb{I}_{y_{k}=Y_{n}}\cdot A^{h}_{u}[q,k]\bigg] \quad (4)
P_{h}[m,n]=\frac{P'_{h}[m,n]-P_{U}[m,n]}{P_{U}[m,n]} \quad (5)

where $1\leq m,n\leq|\bm{Y}|$, $\mathbb{I}$ is the indicator function, $P'_h,P_U\in\mathbb{R}^{|\bm{Y}|\times|\bm{Y}|}$, and $P_U$ is the distribution of all possible phoneme relations in the speech corpus $U$ (see our demo page), which normalizes out the effect of dominating relations like $sil\rightarrow sil$ that appear in all utterances. As a result, positive values in $P_h$ represent a preference for specific phoneme relations; negative values represent the opposite.
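A possible implementation of Eq. (4) and (5) is sketched below. Here $P_U$ is approximated by the prior over phoneme pairs induced by uniform attention, which is an assumption for illustration; the estimation of $P_U$ used in the paper may differ.

```python
import numpy as np

def phoneme_relation_map(attn_maps, label_seqs, n_phones):
    """Eq. (4)-(5): accumulate attention mass per (query phoneme, key phoneme)
    pair, then normalize by a corpus prior over phoneme relations.
    attn_maps: list of (T, T) maps for one head; label_seqs: matching (T,) int labels."""
    P_head = np.zeros((n_phones, n_phones))
    P_corpus = np.zeros((n_phones, n_phones))
    for A, y in zip(attn_maps, label_seqs):
        onehot = np.eye(n_phones)[y]                 # (T, n_phones)
        P_head += onehot.T @ A @ onehot / A.shape[0]  # Eq. (4) for this utterance
        # Prior assuming every query attends uniformly over the utterance:
        P_corpus += np.outer(onehot.mean(axis=0), onehot.mean(axis=0))
    P_head /= len(attn_maps)
    P_corpus /= len(attn_maps)
    return (P_head - P_corpus) / (P_corpus + 1e-12)   # Eq. (5)
```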

Figure 3: Some observed operations from phoneme relation map (PRM): (a) sil attends to sil; non-sil attends to non-sil (b) sil attends to non-sil; non-sil attends to sil, ch, sh (c) attends to identity, the same phoneme as query (d) not attends to identity (e) attends to sil (f) not attends to sil (g) attends to ch, jh, s, sh (h) not attends to s, sh.

PRMs are plotted using TIMIT [22] with 39 phonemes, and the results for several heads are shown in Fig 3 (more on our demo page). Since diagonal heads are interpretable by themselves, we focus on vertical and global heads. We observe several operations: attending to silence, to the identical phoneme, or to specific phonemes, as well as the negations of these (not operations). We observe that vertical heads tend to either focus on or neglect specific phonemes for all queries, and we quantify this connection below. In the later discussion, we use focus and neglect to refer to these types of behaviours.

While a PRM characterizes all phoneme relations of a head, we further define the concentration $C_h\in\mathbb{R}^{|\bm{Y}|}$ of a head $h$, where each $C_h[n]\in\mathbb{R}$ quantifies the amount of focus (when positive) or neglect (when negative) of the head on a specific phoneme $Y_n$, over all queries:

C_{h}[n]=\frac{1}{|\bm{Y}|}\sum_{m=1}^{|\bm{Y}|}P_{h}[m,n] \quad (6)
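In code, the concentration is simply a column-wise average of the PRM computed above:

```python
def concentration(P):
    """Eq. (6): average each PRM column over query phonemes. Positive entries
    mean the head focuses on that key phoneme; negative entries mean neglect."""
    return P.mean(axis=0)
```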
Figure 4: (a) The relation between the verticality V of a head and its extreme concentration value. Each dot represents a head h. (b) zooms in on the bottom-left of (a), since outliers dominate too much. PRMs of representative heads are marked by red squares: Fig 3(e)(g)(f)(h) correspond to 1, 2, 3, 4 respectively; Fig 4(c) corresponds to 5.

Fig 4 verifies the connection between verticality and concentration. We report the maximum focus or neglect for each head. Fig 4(a) shows that heads with high verticality do focus on specific phonemes; Fig 4(b) shows that even a slight increase in the verticality V of a head is correlated with concentration, for both focus and neglect. Some low-verticality heads with extreme neglect at the bottom-left of Fig 4(b) are diagonal heads, which always attend dynamically to their neighbors and therefore show extreme neglect for all phonemes.

7 Importance Ranking

To evaluate the importance of different attention patterns, we conduct two pruning-based probing experiments that ablate part of the self-attention functionality directly at inference time: (1) ablating entire heads, and (2) limiting the visible span of all heads. If an attention pattern is essential, ablating it should cause an immediate loss in the quality of the final representations. We examine representation quality with three probing tasks: spectrogram reconstruction, phoneme classification, and speaker recognition. The first task examines how rich the refined representations are in terms of spectrogram detail; we reuse the reconstruction head from pre-training and measure the L1 loss against the original spectrogram. The latter two tasks examine the usefulness of the refined representations for downstream tasks. For phoneme and speaker classification, we train the downstream models on the LibriSpeech [20] train-clean-100 subset for a fixed 50k steps. In the frame-level setting, we use a single-hidden-layer MLP; in the utterance-level setting, we use mean-pooling followed by a linear transform. Phoneme classification is conducted in the frame-level setting; speaker recognition is conducted in both the frame-level and utterance-level settings. Following [5], phoneme labels are obtained with the Montreal Forced Aligner [28], and all evaluations are done on the LibriSpeech test-clean subset.
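The two probing classifiers can be sketched as follows; the MLP hidden size is an assumption, and training details (optimizer, learning rate) are omitted.

```python
import torch.nn as nn

class FrameProbe(nn.Module):
    """Frame-level probe: a single-hidden-layer MLP applied to every frame."""
    def __init__(self, dim, n_classes, hidden=768):   # hidden size is an assumption
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, reps):          # reps: (batch, T, dim) refined representations
        return self.net(reps)         # (batch, T, n_classes)

class UtteranceProbe(nn.Module):
    """Utterance-level probe: mean-pool over time, then a linear classifier."""
    def __init__(self, dim, n_speakers):
        super().__init__()
        self.linear = nn.Linear(dim, n_speakers)

    def forward(self, reps):          # reps: (batch, T, dim)
        return self.linear(reps.mean(dim=1))
```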

Figure 5: Performance curves of attention pruning. Curves marked by dots are in the frame-level setting; the others are in the utterance-level setting.

7.1 Head-based Pruning

For each head $h$, we first compute $G(h)$, $V(h)$, and $D(h)$, and then cumulatively prune heads from high to low for each metric by setting $A^h_u=0$, resulting in the three curves shown in Fig 5(a)(b)(c). We can thus rank the importance of the three categories by observing which pruning causes the larger performance drop. We find the ranking results are consistent across different $C_{num}$, so we only show the result of M3. There are several findings: (1) Diagonal heads are the most important: performance on all three tasks drops significantly with only 24 heads pruned. (2) Vertical heads rank second: while pruning them does not hurt reconstruction or phoneme classification much, performance drops faster than for global heads in speaker recognition, suggesting that vertical attentions are more related to speaker identity. (3) Global heads are the least important: pruning them has the smallest effect on all tasks. (4) Both global and vertical heads are harmful to the phonetic structure; Fig 5(b) shows that pruning them even improves phoneme classification accuracy. For vertical heads, we speculate that they focus on distant phonemes when forming a new representation, independently of the query phoneme, which might corrupt the local phonetic structure. (5) In Fig 5(b), when we prune according to diagonality, phoneme accuracy drops dramatically for the first 24 heads pruned, yet surprisingly increases as we prune more heads. This is because, beyond 24 pruned diagonal heads, we start to prune heads that are more vertical or global than diagonal, supporting the previous finding that vertical and global attentions are harmful to phonemic information.
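The head-based pruning procedure amounts to zeroing the attention maps of the top-ranked heads at inference time, as in the sketch below; the flattened head ordering is an implementation detail assumed here.

```python
import numpy as np

def prune_heads_by_metric(attn, scores, n_prune):
    """Zero the attention of the n_prune heads scoring highest on a metric
    (globalness, verticality, or diagonality).
    attn: (n_layers, n_heads, T, T) maps for one utterance;
    scores: per-head metric values, flattened layer-major to match attn."""
    pruned = attn.copy()
    n_heads = attn.shape[1]
    order = np.argsort(scores)[::-1]          # highest metric value first
    for idx in order[:n_prune]:
        layer, head = divmod(idx, n_heads)
        pruned[layer, head] = 0.0             # A_u^h = 0: the head contributes nothing
    return pruned
```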

We further show the result of ranking head importance by the maximum attention weight, denoted as weight in Fig 5(a)(b)(c), which has been shown to be a strong baseline in previous work [29]. Fig 5(c) shows that pruning based on globalness has less influence than pruning based on weight. Fig 6(a) visualizes the difference between the two ranking strategies: although they agree on which heads are essential, they slightly diverge on which are not. Their decision boundaries for the first 24 heads to be pruned are shown by the red and blue lines in Fig 6(a), which is the direct cause of their performance difference. Globalness prunes the blue dots while leaving the red dots unpruned, and vice versa for weight (both prune the green dots). Since globalness performs better than weight after pruning, the heads corresponding to red dots are more important than those corresponding to blue dots. We select the head with the highest ranking difference from the red dots and from the blue dots, and plot their PRMs in Fig 6(b) and (c), respectively. While Fig 6(b) shows strong neglect, (c) shows no observable operation. In fact, most red-dot heads show clear neglect (see our demo page). We argue that this is the main reason why globalness performs better after pruning: heads with neglect are essential to speaker identity, and globalness, defined via entropy, is able to recognize neglect and score such heads higher. On the other hand, weight is confused by attentions with large weights but without meaningful operations, suggesting that weight does not always reflect the importance of heads. When we plot the PRMs of the dots located at the upper left (the red box in Fig 6(a)), they typically show neglect (see our demo page). These observations are consistent for all of M3, M6, and M9. We speculate that these heads might learn to neglect less useful frames, like sil in Fig 6(b), and focus more on frames with more speaker information [30]. Based on the above observations, we choose globalness as our refinement metric. Fig 5(d)(e)(f) show the pruning results for M3, M6, and M9. The importance of global heads becomes smaller for larger $C_{num}$, and we keep observing a performance boost for phoneme classification. Although all three models drop in speaker recognition, the drop is mitigated dramatically in the utterance-level setting (a more common scenario), suggesting that global heads are not necessary when speaker classification is performed at the utterance level. In conclusion, we can prune more than 50% of SAT heads without sacrificing performance.

Figure 6: (a) Different rankings of heads according to globalness and attention weight. Each dot is a head, and a higher ranking number indicates higher importance. The first 24 heads to be pruned are the green and blue dots for globalness, and the green and red dots for weight. (b) and (c) are PRMs of the heads marked by red and blue dots, respectively.

7.2 Span-based Pruning

Since most heads have an attention span over a long range (regardless of category), we further conduct attention-span pruning to examine whether global information is genuinely unhelpful for extracting phonetic information. We limit the visible span of all heads to a length $r$ on either side of the query; that is, we set $A^h_u[q,k]=0$ for any $|q-k|>r$. Results are presented in Fig 5(g)(h)(i).
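Span-based pruning can be sketched as applying a banded mask to each attention map; whether to renormalize the surviving weights is left as an option in this sketch.

```python
import numpy as np

def limit_attention_span(attn, r, renormalize=False):
    """Zero attention weights outside a +/- r frame window around each query,
    i.e. A_u^h[q, k] = 0 whenever |q - k| > r.
    attn: (..., T, T) attention maps."""
    T = attn.shape[-1]
    q, k = np.meshgrid(np.arange(T), np.arange(T), indexing="ij")
    mask = (np.abs(q - k) <= r).astype(attn.dtype)
    clipped = attn * mask
    if renormalize:
        clipped = clipped / (clipped.sum(axis=-1, keepdims=True) + 1e-12)
    return clipped
```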

8 Conclusion

In this paper, we present multiple strategies for analyzing the self-attention mechanism in SATs, including phoneme segmentation, the phoneme relation map, and two kinds of pruning. We identify several attention functionalities and operations, identify the critical attentions, and show that our visualization tool is useful for understanding pruning behavior. Finally, we conclude that we can refine the representations and speed up inference for a given SAT in two ways: removing global heads or constraining the attention span. More materials are provided on our demo page.

References

  • [1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [2] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in Advances in neural information processing systems, 2019, pp. 5753–5763.
  • [3] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [4] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
  • [5] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 6419–6423.
  • [6] D. Jiang, X. Lei, W. Li, N. Luo, Y. Hu, W. Zou, and X. Li, “Improving transformer-based speech recognition using unsupervised pre-training,” arXiv preprint arXiv:1910.09932, 2019.
  • [7] X. Song, G. Wang, Z. Wu, Y. Huang, D. Su, D. Yu, and H. Meng, “Speech-xlnet: Unsupervised acoustic model pretraining for self-attention networks,” arXiv preprint arXiv:1910.10387, 2019.
  • [8] P. Wang, L. Wei, Y. Cao, J. Xie, and Z. Nie, “Large-scale unsupervised pre-training for end-to-end spoken language understanding,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7999–8003.
  • [9] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” arXiv preprint arXiv:1910.05453, 2019.
  • [10] A. Baevski, M. Auli, and A. Mohamed, “Effectiveness of self-supervised pre-training for speech recognition,” arXiv preprint arXiv:1911.03912, 2019.
  • [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [12] B. v. Aken, B. Winter, A. Löser, and F. A. Gers, “Visbert: Hidden-state visualizations for transformers,” in Companion Proceedings of the Web Conference 2020, 2020, pp. 207–211.
  • [13] Y. Hao, L. Dong, F. Wei, and K. Xu, “Visualizing and understanding the effectiveness of bert,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4134–4143.
  • [14] O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky, “Revealing the dark secrets of bert,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4365–4374.
  • [15] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning, “What does bert look at? an analysis of bert’s attention,” in Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2019, pp. 276–286.
  • [16] I. Tenney, D. Das, and E. Pavlick, “Bert rediscovers the classical nlp pipeline,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4593–4601.
  • [17] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. Van Durme, S. Bowman, D. Das et al., “What do you learn from context? probing for sentence structure in contextualized word representations,” in 7th International Conference on Learning Representations, 2019.
  • [18] M. Sperber, J. Niehues, G. Neubig, S. Stüker, and A. Waibel, “Self-attentional acoustic models,” Proc. Interspeech 2018, pp. 3723–3727, 2018.
  • [19] N.-Q. Pham, T.-S. Nguyen, J. Niehues, M. Müller, and A. Waibel, “Very deep self-attention networks for end-to-end speech recognition,” Proc. Interspeech 2019, pp. 66–70, 2019.
  • [20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
  • [21] S. Bhati, S. Nayak, and S. R. M. Kodukula, “Unsupervised segmentation of speech signals using kernel-gram matrices,” Communications in Computer and Information Science, vol. 841, pp. 139–149, 2018.
  • [22] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1,” STIN, vol. 93, p. 27403, 1993.
  • [23] A. Stan, C. Valentini-Botinhao, B. Orza, and M. Giurgiu, “Blind speech segmentation using spectrogram image-based features and mel cepstral coefficients,” in 2016 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2016, pp. 597–602.
  • [24] O. J. Räsänen, U. K. Laine, and T. Altosaar, “An improved speech segmentation quality measure: the r-value,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
  • [25] S. Bhati, S. Nayak, K. S. R. Murty, and N. Dehak, “Unsupervised acoustic segmentation and clustering using siamese network embeddings,” Proc. Interspeech 2019, pp. 2668–2672, 2019.
  • [26] O. Scharenborg, V. Wan, and M. Ernestus, “Unsupervised speech segmentation: An analysis of the hypothesized phone boundaries,” The Journal of the Acoustical Society of America, vol. 127, no. 2, pp. 1084–1095, 2010.
  • [27] I. Mporas, T. Ganchev, and N. Fakotakis, “Phonetic segmentation using multiple speech features,” International Journal of Speech Technology, vol. 11, no. 2, pp. 73–85, 2008.
  • [28] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi,” Proc. Interspeech 2017, pp. 498–502, 2017.
  • [29] Y. Hao, L. Dong, F. Wei, and K. Xu, “Self-attention attribution: Interpreting information interactions inside transformer,” arXiv preprint arXiv:2004.11207, 2020.
  • [30] Q. Wang, K. Okabe, K. A. Lee, H. Yamamoto, and T. Koshinaka, “Attention mechanism in speaker recognition: What does it learn in deep speaker embedding?” in 2018 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2018, pp. 1052–1059.