Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric
Abstract
Phoneme boundary detection has been studied due to its central role in various speech applications. In this work, we point out that this task needs to be addressed not only algorithmically but also in terms of evaluation metrics. To this end, we first propose a state-of-the-art phoneme boundary detector that operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT and Buckeye corpora demonstrate that SuperSeg identifies phoneme boundaries more accurately than existing models by a significant margin. Furthermore, we note a limitation of the popular R-value evaluation metric and propose new evaluation metrics that prevent each boundary from contributing to the evaluation multiple times. The proposed metrics reveal the weaknesses of non-autoregressive baselines and establish a reliable criterion that is suitable for evaluating phoneme boundary detection.
Index Terms— Phoneme boundary detection, acoustic analysis, speech segmentation
1 Introduction
Phoneme-level segmentation of speech provides useful information for various speech applications such as speech recognition, speech synthesis, language identification, and singing voice synthesis. For instance, Liu et al. [1] improved spoken language identification by aggregating latent variables between phoneme boundaries, and Lee et al. [2] showed promising results in voice style transfer by using variable-length segments that are essentially phoneme-level intervals. In light of the usefulness of phoneme boundaries across a broad range of speech processing tasks, it is necessary to develop an effective automatic system that can detect phoneme boundaries with high accuracy from a given speech source.
Phoneme boundary detection is typically performed under two different settings, depending on the availability of a phoneme transcription. In the text-dependent scenario known as forced alignment, an utterance is given together with its phoneme sequence, and the start and end timestamps of each phoneme are estimated. This can be performed via an automatic speech recognition (ASR) system based on a hidden Markov model [3], attention alignments from a speech synthesis model [4], or an explicit phoneme-to-audio aligner trained on supervisory signals obtained directly from phoneme boundaries [5, 6]. On the other hand, in the text-independent setting, the goal is to detect phoneme-level transitions in a speech source without knowledge of the corresponding phoneme sequence. Recent studies have proposed various methods for solving this problem in both supervised and unsupervised manners [7, 8, 9].
We focus on developing a phoneme boundary detector for the text-independent scenario under the supervised learning setting. First, we note that a naive classification approach is prone to producing multiple boundaries around a true boundary (i.e., over-segmentation). To tackle this, we propose SuperSeg, which employs an autoregressive architecture to leverage previous boundary estimates. SuperSeg prevents unwanted boundary repetitions by feeding in additional information about whether the previous frames were classified as boundaries. Another issue is the volume of available data. In most cases, we have a smaller amount of data with phoneme boundary labels (e.g., TIMIT [10]: 5.4 hours) compared to other speech datasets (e.g., MLS [11]: 60.5k hours), since phoneme boundary annotation is conducted at the frame level and requires domain expertise. This lack of data often weakens generalization to test data. To mitigate this, we adopt data augmentation techniques such as pitch/formant perturbation [12] and masking blocks of frequency channels [13]. Furthermore, we propose new evaluation metrics for assessing phoneme-level segmentation. Previous metrics do not sufficiently penalize duplicated estimates around true boundaries, thus overrating non-autoregressive baselines. Our proposed metrics provide a more trustworthy evaluation criterion by restricting the multiple contributions of each boundary to the evaluation scores.
2 Related work
There have been many studies on phoneme boundary detection. Franke et al. [14] optimizes bidirectional LSTMs with a cross-entropy loss that assigns more weight to phoneme boundaries. Kreuk et al. [7] employs learnable segmental features and identifies boundary candidates that minimize a structured loss via dynamic programming. They also show that additional supervisory signals from phoneme labels yield improvements in boundary detection. Kamper et al. [8] utilizes the discrete codes of pretrained vector-quantized networks without using any phoneme boundary labels. Zhu et al. [9] adopts a contrastive learning scheme together with a phoneme recognition task; their model can perform forced alignment using a dynamic time warping algorithm as well as text-independent segmentation. Similarly, Kreuk et al. [15] uses noise contrastive estimation and leverages a large amount of unlabeled audio data. Lin et al. [16] employs a regularized attention mechanism on a pretrained acoustic encoder and performs text-dependent phoneme segmentation.
For the evaluation of phoneme boundary detection, Räsänen et al. [17] introduces the R-value, a metric designed so that it cannot be inflated by random boundary insertion and thus penalizes over-segmentation. To the best of our knowledge, the R-value has so far been the most reliable metric for assessing boundary segmentation.
3 Proposed Method

3.1 Architecture
SuperSeg consists of three main parts: an acoustic encoder, a boundary embedder, and a boundary decoder. The acoustic encoder extracts latent features h_1, ..., h_T from the mel-spectrogram of a given speech signal, where T denotes the total number of frames. More specifically, the acoustic encoder first transforms the log-scale mel-spectrogram into higher-dimensional frame-wise variables with a linear layer and processes them using convolutional neural networks (CNNs) consisting of ReLU activations, layer normalization [18], and dropout layers [19]. The boundary embedder maps the binary boundary label of each frame (1 if the frame contains a phoneme boundary, 0 otherwise) to an embedding vector e_t. The boundary decoder employs a unidirectional LSTM [20] that receives as inputs the t-th latent feature h_t and the previous boundary embedding vector e_{t-1}, and outputs a Bernoulli parameter p_t that estimates the probability that the current frame contains a phoneme boundary. Fig. 1 shows the detailed architecture of SuperSeg.
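To make the structure concrete, a minimal PyTorch sketch of the three components is given below. The class and argument names (SuperSeg, ConvBlock, n_mels, hidden_dim, etc.) are illustrative rather than taken from the original implementation, and details such as padding or the exact placement of the latent projection are assumptions.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv1d -> LayerNorm -> ReLU -> Dropout over frame-level features."""
    def __init__(self, dim, kernel_size, dilation, dropout):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2   # keep the frame length unchanged
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=pad, dilation=dilation)
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                          # x: (batch, frames, dim)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.drop(torch.relu(self.norm(y)))

class SuperSeg(nn.Module):
    def __init__(self, n_mels=80, hidden_dim=256, latent_dim=192, n_blocks=6,
                 kernel_size=3, dilations=(1, 2, 4, 1, 2, 4), dropout=0.4,
                 embed_dim=64, lstm_dim=256):
        super().__init__()
        # acoustic encoder: initial linear layer followed by dilated conv blocks
        self.pre = nn.Linear(n_mels, hidden_dim)
        self.blocks = nn.Sequential(*[
            ConvBlock(hidden_dim, kernel_size, dilations[i % len(dilations)], dropout)
            for i in range(n_blocks)])
        self.to_latent = nn.Linear(hidden_dim, latent_dim)
        # boundary embedder: one vector per binary boundary label
        self.embed = nn.Embedding(2, embed_dim)
        # boundary decoder: unidirectional LSTM emitting a per-frame Bernoulli parameter
        self.lstm = nn.LSTM(latent_dim + embed_dim, lstm_dim, batch_first=True)
        self.out = nn.Linear(lstm_dim, 1)

    def encode(self, mel):                          # mel: (batch, frames, n_mels)
        return self.to_latent(self.blocks(self.pre(mel)))

    def forward(self, mel, prev_labels):            # prev_labels: (batch, frames), long, in {0, 1}
        h = self.encode(mel)
        e = self.embed(prev_labels)
        y, _ = self.lstm(torch.cat([h, e], dim=-1))
        return torch.sigmoid(self.out(y)).squeeze(-1)  # per-frame boundary probability p_t
```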
3.2 Training
SuperSeg is optimized to solve a per-frame binary classification task using the binary cross entropy (BCE) loss. To respect the autoregressive structure, the boundary embedding vectors obtained from the ground-truth labels are shifted to the right by one time step and passed to the boundary decoder (i.e., teacher forcing). For better generalization, we adopt data augmentation algorithms such as pitch/formant perturbation [12] and masking blocks of frequency channels [13].
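A single teacher-forced training step might then look as follows. The helper training_step is hypothetical and builds on the sketch above; the augmentations from [12, 13] are assumed to have been applied before this step.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, mel, labels):
    """One teacher-forced update. mel: (B, T, n_mels); labels: (B, T) in {0, 1}."""
    # shift the ground-truth boundary labels right by one frame so that the
    # decoder at frame t only sees the label of frame t-1 (teacher forcing)
    prev = torch.cat([torch.zeros_like(labels[:, :1]), labels[:, :-1]], dim=1)
    probs = model(mel, prev.long())
    loss = F.binary_cross_entropy(probs, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```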
3.3 Inference
At test time, SuperSeg sequentially identifies phoneme-level boundaries on a frame-by-frame basis. To decide the label of the t-th frame, SuperSeg compares the model output p_t with a threshold θ (1 if p_t ≥ θ and 0 otherwise). The threshold is determined by a grid search on the validation set with respect to the evaluation metric (e.g., R-value [17]) before evaluating on the test set.
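Because the ground-truth labels are unavailable at test time, the previous-label input is the model's own thresholded prediction, so decoding proceeds frame by frame. Below is a minimal sketch using the hypothetical SuperSeg class above; detect_boundaries and its default threshold are illustrative, with the actual threshold tuned on the validation set as described.

```python
import torch

@torch.no_grad()
def detect_boundaries(model, mel, threshold=0.5):
    """Autoregressive decoding. mel: (1, T, n_mels); returns a (T,) tensor in {0, 1}."""
    h = model.encode(mel)                           # (1, T, latent_dim)
    num_frames = h.shape[1]
    labels = torch.zeros(num_frames, dtype=torch.long)
    prev = torch.zeros(1, dtype=torch.long)         # no boundary before the first frame
    state = None
    for t in range(num_frames):
        e = model.embed(prev).unsqueeze(1)          # (1, 1, embed_dim)
        x = torch.cat([h[:, t:t + 1], e], dim=-1)   # current feature + previous label
        y, state = model.lstm(x, state)
        p = torch.sigmoid(model.out(y))[0, 0, 0]    # Bernoulli parameter for frame t
        labels[t] = int(p >= threshold)             # compare with the tuned threshold
        prev = labels[t:t + 1]
    return labels
```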
3.4 Evaluation metric
Table 1. Precision, recall, F1-score, and R-value (%) with the conventional metrics and a 20-ms tolerance; the first four score columns are TIMIT, the last four Buckeye.

| Model | Use text | Precision | Recall | F1-score | R-value | Precision | Recall | F1-score | R-value |
|---|---|---|---|---|---|---|---|---|---|
| Franke et al. [14] | ✗ | 91.10 | 88.10 | 89.60 | 90.80 | 87.80 | 83.30 | 85.50 | 87.17 |
| Kreuk et al. [7] | ✗ | 94.03 | 90.46 | 92.22 | 92.79 | 85.40 | 89.12 | 87.23 | 88.76 |
| Lin et al. [16] | ✓ | 93.42 | 95.96 | 94.67 | 95.18 | 88.49 | 90.33 | 89.40 | 90.90 |
| SuperSeg (non-AR) | ✗ | 94.88 | 95.88 | 95.38 | 96.05 | 89.81 | 92.46 | 91.12 | 92.24 |
| SuperSeg (AR) | ✗ | 95.63 | 94.77 | 95.20 | 95.82 | 89.92 | 89.94 | 89.93 | 91.40 |

In the conventional calculation of precision (P) and recall (R) [7, 9, 15], boundaries are usually evaluated not as a sequence but as individual elements, which overrates repeated estimates around the true boundaries. We note that this often leads to unreliable F1 and R-value scores, since they are computed from precision and recall. To tackle this issue, we propose to measure precision and recall by evaluating each boundary sequentially, as shown in Alg. 1. The proposed algorithm avoids multiple but redundant contributions, thus giving a low score for over-segmentation. To compute recall, we use the true boundaries as the reference sequence and the detected boundaries as the candidate sequence, and we swap the two to compute precision. The F1-score and R-value are obtained from these precision and recall values in the same way as before. Fig. 2 illustrates the conventional and proposed hit counting methods.
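Since Alg. 1 is not reproduced in this text, the following is a minimal greedy sketch of the idea under our own naming: boundaries are walked through in temporal order and each one can be consumed by at most one match within the tolerance, so repeated estimates around a single true boundary no longer inflate the hit count. The F1-score and R-value [17] are then derived from precision and recall in the standard way.

```python
import numpy as np

def count_hits(ref, est, tol=0.02):
    """Greedy one-to-one matching of sorted boundary times (in seconds).

    Each element of `est` can be matched to at most one element of `ref`
    within the tolerance, so duplicated estimates are not rewarded.
    """
    hits, j = 0, 0
    for r in ref:
        while j < len(est) and est[j] < r - tol:     # too early to match r or any later ref
            j += 1
        if j < len(est) and abs(est[j] - r) <= tol:  # consume this estimate for r
            hits += 1
            j += 1
    return hits

def precision_recall(true_bounds, pred_bounds, tol=0.02):
    # recall: fraction of true boundaries covered by distinct predictions
    recall = count_hits(true_bounds, pred_bounds, tol) / max(len(true_bounds), 1)
    # precision: swap the roles so redundant predictions count as misses
    precision = count_hits(pred_bounds, true_bounds, tol) / max(len(pred_bounds), 1)
    return precision, recall

def f1_rvalue(precision, recall):
    """F1 and R-value (Räsänen et al. [17]) from precision/recall in [0, 1]."""
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    os = recall / (precision + 1e-12) - 1.0          # over-segmentation measure
    r1 = np.sqrt((1.0 - recall) ** 2 + os ** 2)
    r2 = (-os + recall - 1.0) / np.sqrt(2.0)
    return f1, 1.0 - (abs(r1) + abs(r2)) / 2.0
```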
4 Experiments

4.1 Experimental setup
We used as input a log-scale 80-channel mel-spectrogram computed on 40-millisecond windows with a stride of 10 milliseconds. The output dimension of the initial linear layer was set to 256, and the dimension of the latent feature was set to 192. We used 6 blocks, each consisting of a convolutional layer, layer normalization, a ReLU activation, and a dropout layer with a dropout probability of 0.4. The kernel size of the convolutional layers was set to 3, and we used a dilation cycle of [1, 2, 4, 1, 2, 4]. The dimension of the boundary embedding vector was set to 64. Lastly, the unidirectional LSTM used hidden states of size 256 and transformed them into one-dimensional Bernoulli parameters.
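With the hypothetical SuperSeg class sketched in Section 3.1, this configuration would correspond to an instantiation along these lines (argument names are ours, not from the original implementation):

```python
model = SuperSeg(
    n_mels=80,                       # 80-channel log-mel input
    hidden_dim=256,                  # output dimension of the initial linear layer
    latent_dim=192,                  # latent feature dimension
    n_blocks=6,                      # number of conv blocks
    kernel_size=3,
    dilations=(1, 2, 4, 1, 2, 4),    # dilation cycle
    dropout=0.4,
    embed_dim=64,                    # boundary embedding dimension
    lstm_dim=256,                    # unidirectional LSTM hidden size
)
```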
We conducted experiments using two popular benchmarks for phoneme boundary detection: the TIMIT [10] and Buckeye [21] corpora. For the TIMIT dataset, we split the original training set into training and validation sets at a ratio of 9:1. The test split of the TIMIT dataset was used as is. For the Buckeye dataset, we constructed training, validation, and test sets by splitting the whole data at a ratio of 8:1:1 based on speaker identities, following Kreuk et al. [15].
We trained SuperSeg on the TIMIT and Buckeye datasets using the AdamW optimizer [22] for up to 1600 and 1000 epochs, respectively. The training was performed on a single RTX 3090 GPU with a batch size of 256. We employed data augmentation methods to reduce overfitting. For frequency masking, we uniformly sampled a mask size from a fixed range for every training example. For pitch/formant shift, we sampled a pitch multiplier and a formant multiplier from predefined ranges.
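As an illustration of the masking augmentation, a SpecAugment-style [13] frequency mask on a log-mel input could look roughly like the sketch below; the max_width value is a placeholder since the exact mask-size range is not reproduced here, and the pitch/formant perturbation of [12] would be applied to the waveform before computing the mel-spectrogram.

```python
import torch

def mask_frequency_block(mel, max_width=27):
    """SpecAugment-style frequency masking. mel: (frames, n_mels).

    Zeroes one contiguous block of mel channels whose width is drawn
    uniformly from [0, max_width]; max_width is a placeholder, not the
    range used for the experiments in this paper.
    """
    n_mels = mel.shape[1]
    width = int(torch.randint(0, max_width + 1, (1,)))
    start = int(torch.randint(0, n_mels - width + 1, (1,)))
    masked = mel.clone()
    masked[:, start:start + width] = 0.0
    return masked
```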
4.2 Results on conventional metrics
Table 2. Precision, recall, F1-score, and R-value (%) computed with the proposed algorithm; the first four score columns are TIMIT, the last four Buckeye.

| Model | Precision | Recall | F1-score | R-value | Precision | Recall | F1-score | R-value |
|---|---|---|---|---|---|---|---|---|
| Kreuk et al. [7]† | 94.30 | 80.78 | 87.01 | 86.28 | 91.18 | 80.06 | 85.26 | 85.57 |
| SuperSeg (non-AR) | 83.69 | 82.54 | 83.11 | 85.56 | 74.94 | 74.35 | 74.65 | 78.38 |
| SuperSeg (AR) | 93.79 | 93.39 | 93.59 | 94.50 | 87.96 | 87.70 | 87.83 | 89.61 |
Table 3. Ablation study on the data augmentation techniques (A1, A2).

| Corpus | A1 | A2 | P | R | F1 | R-val |
|---|---|---|---|---|---|---|
| TIMIT | ✗ | ✗ | 92.20 | 92.18 | 92.19 | 93.33 |
| TIMIT | ✓ | ✗ | 93.66 | 93.34 | 93.50 | 94.44 |
| TIMIT | ✗ | ✓ | 94.16 | 93.33 | 93.75 | 94.59 |
| TIMIT | ✓ | ✓ | 93.79 | 93.39 | 93.59 | 94.50 |
| Buckeye | ✗ | ✗ | 86.08 | 85.67 | 85.87 | 87.93 |
| Buckeye | ✓ | ✗ | 86.18 | 88.25 | 87.21 | 89.01 |
| Buckeye | ✗ | ✓ | 88.65 | 86.90 | 87.77 | 89.44 |
| Buckeye | ✓ | ✓ | 87.96 | 87.70 | 87.83 | 89.61 |
First, we compare the proposed method with the baselines using the conventional evaluation metrics. The methods proposed by Franke et al. [14] and Kreuk et al. [7] operate in the text-independent scenario, and Lin et al. [16] performs forced alignment given a phoneme sequence. These baselines are all optimized under the supervised learning setting. We also trained and evaluated a non-autoregressive (non-AR) version of SuperSeg with the boundary embedder excluded. In Table 1, we report precision (P), recall (R), F1-score (F1), and R-value (R-val) with a tolerance level of 20 milliseconds. The proposed model, SuperSeg (AR), achieves higher F1-scores and R-values than all the previous models on both the TIMIT and Buckeye benchmarks. It is noteworthy that our text-independent model trained solely on the TIMIT or Buckeye corpus outperforms the previous state-of-the-art text-dependent model [16], which leverages the pretrained wav2vec 2.0 model. Unexpectedly, however, the non-AR version of SuperSeg shows better performance than the proposed AR model. In Section 3.4, we pointed out that the conventional evaluation criteria are vulnerable to multiple but redundant contributions of adjacent boundaries to the scores. To verify this, we investigated the boundaries predicted by the AR and non-AR SuperSeg models; the results are shown in Fig. 3. It can be observed that the non-AR model produces repeated boundaries around the target boundaries, exploiting this vulnerability of the evaluation metrics. In contrast, the AR model detects phoneme boundaries appropriately without duplicated predictions.
4.3 Results on proposed metrics
Table 2 shows the results of several models using the proposed algorithm for computing precision and recall. First of all, the F1-scores and R-values all drop compared to the scores in Table 1. This is an expected outcome since the proposed algorithm does not allow multiple contributions of each boundary to the evaluation scores. Secondly, the scores of the non-AR model are significantly reduced. This demonstrates that the non-AR model exploits the weaknesses of the conventional metrics and does not actually operate as desired. The other baseline [7] also shows degraded performance on the proposed metrics. Thirdly, the proposed AR model shows decent performance comparable to the previous results, with little decrease in the scores. Through this, we can verify that the proposed AR model produces trustworthy phoneme-level segments that match our expectations.
4.4 Ablation study
To check the effect of frequency masking and pitch/formant perturbation, we additionally trained three SuperSeg models with different data augmentation settings. The experimental results are presented in Table 3. When trained without data augmentation, the model performance degrades slightly on both the TIMIT and Buckeye corpora. Interestingly, when only one of the data augmentations is used, the F1-score and R-value are comparable to or even better than the scores of the model trained with both techniques. This suggests that data augmentation is useful for training SuperSeg, albeit to a certain extent.
5 Conclusion
We introduced SuperSeg, which builds on an autoregressive architecture to use previous model estimates as an additional input. To prevent the overfitting issue arising from the limited volume of existing annotated datasets, we proposed to utilize data augmentation techniques such as frequency masking and pitch/formant perturbation. Furthermore, we proposed a new evaluation algorithm suitable for assessing phoneme boundary detection. Through the experiments, we showed that SuperSeg achieves state-of-the-art performance in phoneme boundary detection on both the TIMIT and Buckeye corpora. We also demonstrated that the conventional metrics are vulnerable to the multiple contributions of a single boundary to a score, and that the proposed evaluation provides a reliable criterion by restricting these redundant contributions.
For future work, a text-dependent version of SuperSeg can be developed to find an alignment between a phoneme sequence and audio. We also expect that various speech applications will benefit from phoneme boundaries detected by SuperSeg. For instance, variable-length voice conversion could be implemented by merging phoneme-level feature segments and expanding them back to different lengths based on the prosody of a target speaker.
References
- [1] Hexin Liu, Leibny Paola Garcia Perera, Andy WH Khong, Suzy J Styles, and Sanjeev Khudanpur, “Pho-lid: A unified model incorporating acoustic-phonetic and phonotactic information for language identification,” arXiv preprint arXiv:2203.12366, 2022.
- [2] Sang-Hoon Lee, Ji-Hoon Kim, Hyunseung Chung, and Seong-Whan Lee, “Voicemixer: Adversarial voice style mixup,” Advances in Neural Information Processing Systems, vol. 34, pp. 294–308, 2021.
- [3] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi.,” in Interspeech, 2017, vol. 2017, pp. 498–502.
- [4] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “Fastspeech: Fast, robust and controllable text to speech,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [5] Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer, and Dan Chazan, “Phoneme alignment based on discriminative learning,” 2005.
- [6] Jiahong Yuan, Neville Ryant, Mark Liberman, Andreas Stolcke, Vikramjit Mitra, and Wen Wang, “Automatic phonetic segmentation using boundary models.,” in Interspeech, 2013, pp. 2306–2310.
- [7] Felix Kreuk, Yaniv Sheena, Joseph Keshet, and Yossi Adi, “Phoneme boundary detection using learnable segmental features,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8089–8093.
- [8] Herman Kamper and Benjamin van Niekerk, “Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks,” arXiv preprint arXiv:2012.07551, 2020.
- [9] Jian Zhu, Cong Zhang, and David Jurgens, “Phone-to-audio alignment without text: A semi-supervised approach,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 8167–8171.
- [10] John S Garofolo, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
- [11] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert, “Mls: A large-scale multilingual dataset for speech research,” ArXiv, vol. abs/2012.03411, 2020.
- [12] Hyeong-Seok Choi, Juheon Lee, Wansoo Kim, Jie Lee, Hoon Heo, and Kyogu Lee, “Neural analysis and synthesis: Reconstructing speech from self-supervised representations,” Advances in Neural Information Processing Systems, vol. 34, pp. 16251–16265, 2021.
- [13] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” arXiv preprint arXiv:1904.08779, 2019.
- [14] Joerg Franke, Markus Mueller, Fatima Hamlaoui, Sebastian Stueker, and Alex Waibel, “Phoneme boundary detection using deep bidirectional lstms,” in Speech Communication; 12. ITG Symposium. VDE, 2016, pp. 1–5.
- [15] Felix Kreuk, Joseph Keshet, and Yossi Adi, “Self-supervised contrastive learning for unsupervised phoneme segmentation,” arXiv preprint arXiv:2007.13465, 2020.
- [16] Binghuai Lin and Liyuan Wang, “Learning acoustic frame labeling for phoneme segmentation with regularized attention mechanism,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7882–7886.
- [17] Okko Johannes Räsänen, Unto Kalervo Laine, and Toomas Altosaar, “An improved speech segmentation quality measure: the r-value,” in Tenth Annual Conference of the International Speech Communication Association. Citeseer, 2009.
- [18] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
- [19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [20] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [21] Mark A Pitt, Keith Johnson, Elizabeth Hume, Scott Kiesling, and William Raymond, “The buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability,” Speech Communication, vol. 45, no. 1, pp. 89–95, 2005.
- [22] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.