Text-Conditioned Transformer for Automatic Pronunciation Error Detection

Zhan Zhang [email protected] Yuehai Wang [email protected] Jianyi Yang [email protected] Department of Information and Electronic Engineering, Zhejiang University, China

Abstract

Automatic pronunciation error detection (APED) plays an important role in the domain of language learning. As for the previous ASR-based APED methods, the decoded results need to be aligned with the target text so that the errors can be found out. However, since the decoding process and the alignment process are independent, the prior knowledge about the target text is not fully utilized. In this paper, we propose to use the target text as an extra condition for the Transformer backbone to handle the APED task. The proposed method can output the error states with consideration of the relationship between the input speech and the target text in a fully end-to-end fashion. Meanwhile, as the prior target text is used as a condition for the decoder input, the Transformer works in a feed-forward manner instead of autoregressive in the inference stage, which can significantly boost the speed in the actual deployment. We set the ASR-based Transformer as the baseline APED model and conduct several experiments on the L2-Arctic dataset. The results demonstrate that our approach can obtain 8.4% relative improvement on the $F_{1}$ score metric.

keywords:

automatic pronunciation error detection (APED), computer-assisted pronunciation training (CAPT), Transformer

^†^†journal: Journal of LaTeX Templates

1 Introduction

With the quick development of globalization and education, the number of language learners is rapidly increasing. However, most learners are facing the problem of teacher shortage or finding a proper time to follow systematic learning. Thus, recently, the computer-assisted language learning (CALL)[1] systems have been studied to offer a flexible education service, which can be used to reach the language learning requirement in fragmented time. In particular, oral practice is an important part of daily communication, and computer-assisted pronunciation training (CAPT)[2] systems are designed for this task. Such systems generally play the role of automatic pronunciation error detection (APED). The APED system first gives a predefined utterance text (and a reference speech of a professional teacher if needed), and the learner tries to pronounce this target text correctly. For example, a learner wants to study the pronunciation of “apple” (its phonemes are “AE P AH L”), but the learner may mispronounce it to “AE P AO L”. We call “AE P AO L” as the canonical pronunciation. By accurately detecting the pronunciation errors and providing precise feedback that “AH” is mispronounced, the APED system guides the learner to correct the pronunciation towards the target utterance and improve the speaking ability.

APED has been widely studied for decades. Depending on how to evaluate the matching degree between the student pronounced speech and the standard pronunciation, several comparison-based or goodness of pronunciation (GOP) methods have been proposed to solve the APED task[3, 4, 5, 6, 7, 8]. Recently, with the rising trend for neural networks and the development of automatic speech recognition (ASR) technologies, some end-to-end APED models [9, 10] have been studied to simplify the workflow. They use ASR backbones to recognize the canonical pronunciation and obtain where the errors are, based on the alignment between the predicted phonemes and the standard phonemes. The ASR-based methods can significantly decrease the deploying efforts compared with conventional GOP methods or comparison-based methods. In particular, recently, the Transformer structure[11] shows a good performance for sequence-to-sequence (seq2seq) modelling, and gets promising performance in ASR tasks [12, 13, 14, 15]. Thus, we choose the Transformer as the backbone for APED tasks in this paper.

However, the main deficiency of the conventional ASR-based Transformer for APED tasks is that the autoregressive decoding will slow the inference speed[16]. Unfortunately, the APED task generally requires the system to give a quick response about the errors so that the learners can adapt their pronunciations and evaluate again. Another consideration is that, for the ASR-based APED, the decoded text sequence needs to be aligned with the target text to detect the errors. Since the target text is already known in advance, it is a waste to ignore this prior knowledge during the autoregressive inference. On the one hand, the length of the target text is fixed, but the autoregressive decoding is length-agnostic. On the other hand, the recognized sequence is generally close to the prior target text in this evaluation task. These two factors inspire us to use the target text as extra input for the network.

In this paper, we propose an ASR and alignment unified Transformer-based APED workflow, which can incorporate both the audio feature and the text information, and output the error states directly. Compared with ASR-based methods which optimize the recognition result to improve the APED performance, the proposed method works in a fully end-to-end manner. Thus, the proposed method can optimize the APED metric directly. We observe a 8.4% relative improvement on the $F_{1}$ score for the L2-Arctic dataset[17] with the proposed method. Meanwhile, by using the prior target text as an input condition, the inference process works in a feed-forward manner rather than autoregressive, which can significantly boost the inference speed as suggested in [18, 19].

The rest of this paper is organized as follows. In Section 2, we analyze the related works about the APED task and how we are inspired to propose the text-conditioned feed-forward Transformer; In Section 3, we compare the baseline ASR-based autoregressive APED Transformer and describe the proposed ASR and alignment unified feed-forward Transformer in detail; Next, we analyze the results obtained by the conventional methods and the proposed method in Section 4; Finally, we show the conclusion of this paper in Section 5.

2 Related Works

From the perspective of language learning, an error detected in the APED system can be described as that the produced pronunciation is a nonstandard one. In other words, the pronounced speech deviates too far from the standard target speech. Based on this simple idea, comparison-based APED methods [3, 4, 5, 6] have been explored. These methods generally adopt dynamic time warping (DTW) [20] algorithms to align the extracted features of the input speech with the standard target speech. Depending on the distance between each text unit, the pronunciation quality score can be calculated. To this end, the comparison-based methods need to prepare a standard speech for reference, which are inconvenient to evaluate a new utterance.

Apart from directly comparing to a specific standard speech, the input speech can also be evaluated by whether a standard acoustic model can recognize each phoneme. In particular, the likelihood of each phoneme has proven to be an effective feature for indicating whether the error happens, and such a likelihood-based scoring method is often referred to as GOP [7, 8]. In practice, this approach utilizes the hidden Markov model (HMM) to model the sequential phone states. The likelihood score is calculated from the force-aligned states and the open phone states. Since the first proposal of GOP by [7], many variants [21, 22, 23, 24, 25] have been studied to adapt its original equation for better measurement of the goodness.

With the rise of deep learning, the performance of the ASR tasks has been greatly improved. Thus, by utilizing the advanced acoustic model of an ASR system and recognizing the input speech, ASR-based APED can be another efficient approach to detect the errors. Such a method can also avoid the deploying efforts of conventional HMM-based GOP methods or comparison-based DTW methods, and several ASR-based APED systems have been proposed [9, 10]. Currently, the ASR systems are generally built upon CTC loss [26] or attention mechanism [27, 28] to handle the sequential features. The main deficiency of CTC loss is the conditional independent assumption. Such an assumption may not be valid for the continuous speech. The ASR performance is reported to be better by combining the CTC loss with the attention mechanism [29] or using the Transformer structure[14, 15]. In particular, the Transformer structure, which is originally designed to handle the natural language processing (NLP) problems [30, 31], has been successfully utilized in several other domains, such as computer vision (CV)[32, 33], and speech-related tasks including text to speech (TTS) [34, 35, 18, 19], voice conversion (VC)[36], and ASR [12, 13].

Despite the convenience of ASR-based APED systems, alignment is still an inevitable process to obtain the final evaluation results. The recognized phonemes should be aligned with the target phonemes to find out the mispronunciations. As the alignment process is not integrated into the backward optimization of the ASR model, such a method is not fully end-to-end. In other words, the decoding process and the evaluation process are independent. However, intuitively, human raters will first keep the target text in mind, then try to compare the input speech to find out where the errors take place. Focussing on the prior target text limits the search space for the decoding process. Extended Recognition Network (ERN) [37] utilizes this idea to incorporate prior knowledge about common mispronunciations into the HMM states. However, the predefined error HMM paths will lead to bad performance when faced with unseen mispronunciations. Despite its weakness, ERN still shows that the prior knowledge is of vital importance to facilitate the performance of APED tasks. This inspires us to directly take the prior target text as an extra condition, together with the speech features for input. Meanwhile, the attention mechanism can be a logical approach to fuse both the speech feature and the text feature. Thus, the attention-based seq2seq models including Listen, attend and spell (LAS)[28] and Transformer[11] are ideal backbones to start with. Transformer uses the positional encoding to model the time information, instead of a recurrent architecture in LAS. The ASR performance of Transformer is reported to be better in [15]. Thus, we use Transformer as the backbone in this paper.

However, the conventional attention-based Transformer generally adopt autoregressive decoding to predict the next entity. This will lead to a slow inference, which can be a deficiency for the APED system. As analyzed in [16], for each decoding step, the current prediction depends on the earlier decoded output to get the conditional probability. However, since the output target is already known in the training stage, the Transformer can assume this target as a decoded result (this is called as “teacher-forcing”). Thus, the Transformer do not need to wait for the decoded output and the Transformer can run in parallel. In contrast, this prior does not exist in the inference stage, and the Transformer must run sequentially to predict the next entity for several decoding steps until meeting the end-of-sentence-tag ( $\langle$ EOS $\rangle$ ). On the contrary, Transformers which work in a feed-forward manner can greatly boost the speed [16, 18, 19]. Thus, for the APED task, if we can utilize the prior text to be evaluated, and unify the ASR and alignment process, the conventional Transformers can decode in a feed-forward manner, and the aforementioned limitation will no longer exist.

Based on the analysis above, we propose the text-conditioned ASR and alignment unified feed-forward Transformer for the APED task. We give a detailed description of the proposed method in the next section.

3 Proposed Method

In this section, we first show the conventional ASR-based APED workflow for comparison. Next, we demonstrate the proposed fully end-to-end workflow and describe the network structure and its training method in detail.

3.1 ASR-Based APED

Refer to caption — Figure 1: Workflow for the ASR-based APED method. The alignment process is independent from the phoneme prediction process.

A typical workflow for the ASR-based APED is depicted in Fig.1. The training dataset is generally constructed by three parts, the target text to be read, the collected speech, and the canonical pronounced text marked by professional annotators. For example, L2-Arctic dataset[17] manually labels the correct phonemes and mispronunciation error tags about the collected speech. Three annotators who are experienced in transcribing speech samples of native or non-native English speakers participate in the annotating process to ensure the high quality. Based on such a dataset, an ASR model is trained to recognize the canonical phoneme-level text $\mathbf{p}=(p_{1},p_{2},...,p_{n},p_{n+1})$ from the extracted audio features $\mathbf{x}=(x_{1},x_{2},...,x_{m})$ . We should note that the described ASR-based APED is general, and can be applied with any ASR systems that can translate the audio features into phonemes. However, as we focus on Transformer, we limit our description on the attention-based training and inference in the following paragraph. For the attention-based models, the cross-entropy loss is used between the predict phonemes $\hat{\mathbf{p}}$ and the canonical phonemes $\mathbf{p}$ :

l_{asr}=CrossEntropy(\hat{\mathbf{p}},\mathbf{p}),

(1)

where $p_{n+1}=\langle$ EOS $\rangle$ .

For the inference stage, the Transformer works quite differently from the training stage. The Transformer uses autoregressive outputting method to recognize the canonical phonemes sequentially. The recognized phonemes string will end with $\langle$ EOS $\rangle$ . Next, Needleman-Wunscha algorithm[38] is applied to align the recognized sequence $\hat{\mathbf{p}}$ with the target phonemes $\mathbf{t}=(t_{1},t_{2},...,t_{k})$ . After the alignment process, the error states $\mathbf{e}=(e_{1},e_{2},...,e_{k})$ with consideration of the target phonemes can be returned to the user. An alignment example is shown in Table 1. We can observe that this sample includes 1 deletion and 2 substitution errors. The mispronounced phonemes whose error states are marked as 1 can be returned to the users.

Table 1: Alignment sample

	IF		YOU		ONLY				COULD			KNOW		HOW		I	THANK				YOU
Target	IH	F	Y	UW	OW	N	L	IY	K	UH	D	N	OW	HH	AW	AY	TH	AE	NG	K	Y	UW
Pronounced	IH	F	Y	UW	AO	N	L	IY	K	UH	-	N	AO	HH	AW	AY	TH	AE	NG	K	Y	UW
Error States	0	0	0	0	1	0	0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	0

For better clarification, we summarize the training and the inference stage of the ASR-based model in Table 2. We use a 39-dim Mel frequency cepstral coefficients (MFCC) feature as the encoder input. The start-of-sentence tag ( $\langle$ SOS $\rangle$ ) and the right-shifted 1-dim label of the canonical phonemes are concatenated as the decoder input in the training stage. This input is replaced by $\langle$ SOS $\rangle$ and a regressively decoded phonemes string in the inference stage. The decoder tries to predict the probability of the next phoneme and $\langle$ EOS $\rangle$ for output. There are in total 42 tags for classification, including 39 phonemes and $\langle$ SOS $\rangle$ $\langle$ EOS $\rangle$ $\langle$ PAD $\rangle$ .

Table 2: Training and inference summary of the ASR-Based Transformer

	Training Stage
	EncoderInput	DecoderInput	DecoderOutput
data	SpeechFeatures	$\langle$ SOS $\rangle$ +Canonical Phonemes(Shifted)	Canonical Phonemes+ $\langle$ EOS $\rangle$
loss	-	-	$l_{asr}$
len	m	1+n	n+1
dim	39	1	42
	Inference Stage
	EncoderInput	DecoderInput	DecoderOutput
data	SpeechFeatures	$\langle$ SOS $\rangle$ +Recognized Phonemes	Next Recognized Phonemes
len	m	End with $\langle$ EOS $\rangle$	End with $\langle$ EOS $\rangle$
dim	39	1	42

We should note that there are several lengths defined for the described sequences. First, the attention mechanism is adopted to match the speech features ( $length=m$ ) and the recognized phonemes ( $length=n+1$ ). Next, the alignment operation is applied to find out the error states, whose length is equal to that of the target phonemes ( $length=k$ ). However, such an alignment operation is performed in the inference stage, thus not jointly optimized with the ASR model. Such a dilemma inspires us to integrate the alignment operation or the target text into the training stage.

3.2 Fully End-to-end APED

As shown in Fig.2, for the proposed method, we move the alignment operation into the data preparing stage. We align the canonical phonemes and the target phonemes to obtain where the errors occur in advance.

Next, we directly evaluate the relationship between speech features and the target phonemes. Thus, the network can be viewed as a fusion model. Moreover, the mother language (L1) of the speakers shows to affect the acoustic characteristics when studying a new language (L2) [39, 40]. Meanwhile, the extracted L1 features have also been proved to be helpful in the APED task [41]. Thus, we introduce the accent related auxiliary task to extract the L1 information. As shown in Fig.3, we append an extra classifier after the encoder and use the cross-entropy loss between the predicted accent $\hat{a}$ and the ground truth accent $a$ presented in the dataset:

l_{a}=CrossEntropy(\hat{a},a).

(2)

Since the speech evaluation dataset is scarce, we first obtain a basic acoustic model by training the model on ASR datasets. The training process is similar to conventional ASR-based APED methods discussed in Section 3.1, and the new ASR loss function is,

l_{asr}^{{}^{\prime}}=l_{asr}+\alpha l_{a},

(3)

where $\alpha$ is the weight of the auxiliary accent task.

We further adapt this basic acoustic model to the APED task. A training and inference summary of the proposed model is shown in Table 3. We will discuss the details and the differences between the proposed model and the ASR-based model in the remaining paragraphs.

Table 3: Training and inference summary of the proposed Transformer

	Training Stage
	EncoderInput	EncoderOutput	DecoderInput	DecoderOutput1	DecoderOutput2
data	SpeechFeatures	Accent	$\langle$ SOS $\rangle$ +Target Phonemes	Aligned Canonical Phonemes+ $\langle$ EOS $\rangle$	$\langle$ SOS $\rangle$ +Error States
loss	-	$l_{a}$	-	$l_{asr}$	$l_{eval}$
len	m	1	1+k	k+1	1+k
dim	39	6	1	42	1
	Inference Stage
	EncoderInput	EncoderOutput	DecoderInput	DecoderOutput1	DecoderOutput2
data	SpeechFeatures	Accent	$\langle$ SOS $\rangle$ +Target Phonemes	Canonical Phonemes+ $\langle$ EOS $\rangle$	$\langle$ SOS $\rangle$ +Error States
len	m	1	1+k	k+1	1+k
dim	39	6	1	42	1

Firstly, for the auxiliary accent classification task, while the input audio features are sequential, the accent is a 1-dim global attribute. We try to process the sequential data with gated recurrent units (GRU)[42] or a simple GlobalMean. Experiments in Section 4 show that GlobalMean performs a little better. Note that there are 6 kinds of accent including Arabic, Chinese, Hindi, Korean, Spanish, and Vietnamese, for the used dataset in our experiments.

Secondly, the prior target phonemes are used as an extra condition for the decoder input instead of the canonical pronounced phonemes, in both the training and the inference stage. For the audio features $\mathbf{x}$ and a certain target phoneme $t_{i}$ (target phoneme at step $i$ ), the decoder output is changed to $\hat{e_{i}}$ , which indicates the matching degree of the audio features and $t_{i}$ . As we use a binary state to judge its goodness, we use the sigmoid activation at the last layer for binary classification in Fig.3.

As the whole process is differentiable, we can directly optimize the loss between the predicted error states $\hat{\mathbf{e}}$ and the ground truth error states $\mathbf{e}$ . For now, several classification losses can be used for this model. We first apply a basic binary cross-entropy (BCE) loss between the predicted error states $\hat{\mathbf{e}}=(\hat{e_{0}},\hat{e_{1}},\hat{e_{2}},...,\hat{e_{k}})$ and the ground truth error states $\mathbf{e}=(e_{0},e_{1},e_{2},...,e_{k})$ as the evaluation loss,

l_{eval}^{BCE}=BCE(\hat{\mathbf{e}},\mathbf{e}),

(4)

where $e_{0}=\langle$ SOS $\rangle$ . A further discussion about the choice of loss functions is presented in Section 4.5.

However, compared with ASR-based methods, a binary state only concerns about whether the target phoneme is correct or mispronounced. Thus, the model may lose information about the exact phoneme. To fix this, we still require the proposed model to conduct the ASR task with an auxiliary weight of $\beta$ , and the whole loss function is,

l=l_{eval}+\beta l_{asr}+\alpha l_{a}.

(5)

The canonical phonemes to be recognized are aligned with the target phonemes for the proposed model using the aforementioned Needleman-Wunscha algorithm in Section 3.1 to make these two phoneme strings have equal length $k+1$ .

Lastly, we should note that the proposed model has a consistent behavior in the training and inference stage, as shown in Table 3. This characteristic makes the inference in our method faster compared with ASR-based autoregressive Transformers, shown in Section 4.2.

4 Experiment

We use the SpeechTransformer backbone proposed in [43] for experiments. The SpeechTransformer is constructed by 6 encoder and 6 decoder layers in our experiments. Meanwhile, the attention modelling dimension $d_{model}=512$ , 4 attention heads, and the feed-forward dimension $d_{ff}=1024$ are adopted. We extract the MFCC features of the audio files by Kaldi toolkit[44]. These MFCC features are subsampled with a factor of $n=4$ , and stacked with $m=5$ number of frames, which is the same as the settings in [43]. We demonstrate the ASR performance for phoneme recognition in the first subsection 4.1. Then we use this pretrained model to adapt for the APED task and show the latency, metric, and results in the next three subsections. Finally, we analyze the loss functions and the behavior of the proposed model in the last two subsections, 4.5 and 4.6, correspondingly.

4.1 Phoneme Recognition

We use Librispeech [45] as the dataset for ASR training. This dataset contains approximately 1000 hours of 16kHz read English speech. It is divided into “clean” (460 hours) and “other” (500 hours) parts based on its recognition difficulty. The “clean” part is further divided into the training set of 100 hours and 360 hours, development set (dev-clean), and test set (test-clean). As the APED task focuses on the phoneme-level error, we first convert the dataset into phoneme-level transcriptions using the Montreal Forced Aligner tool[46]. Next, we train the Transformer on different parts of the trainset for 300 epochs, including train-clean of 100 hours (train-100h), the whole train-clean part (train-460h), and the whole train part (train-960h). We use dev-clean as the validation dataset to choose the best model and test-clean for inference performance comparison. Adam optimizer, with a learning rate of $10^{-3}$ , is used. We use a CTC-based ASR model called Jasper5x3 proposed in [47] for comparison. This model is constructed by 5 repeated Jasper blocks, and each Jasper block is constructed by 3 repeated Conv1D sub-blocks. The model parameters of Jasper5x3 are 44M, while the proposed Transformer is 32M. We show the phone error rate (PER) performance in Table 4. As we can see from the table, for different amounts of training resources, the attention-based Transformer structure generally performs better than the CTC-based method on PER. This observation is in accord with the conclusion in [14, 15], as the attention mechanism in Transformer can capture more relevant information compared with the CTC loss which holds the conditional independent assumption.

Table 4: Performance of PER on different subsets of Librispeech dataset.

train-	dev-clean	test-clean	dev-clean¹¹1Since this result on the development set dev-clean is obtained by using the teacher-forcing training, PER is much lower.	test-clean
	CTC-Based		Transformer-Based
100h	8.13%	8.50%	4.55%	8.11%
460h	4.88%	5.50%	2.32%	4.24%
960h	4.02%	4.23%	1.70%	3.17%

4.2 Latency Experiment

Next, we conduct the APED task on L2-Arctic dataset [17]. This corpus contains 26,867 utterances with 6 different accents, from 24 nonnative speakers. The 3,599 utterances annotated on phoneme-level are used for the APED task. The trainset, valset, and testset are divided into 8:1:1. For the testset, each sentence contains about 30 target phonemes on average. This suggests that the conventional autoregressive ASR-based models need to forward about 30 times on average to get each decoded phoneme sequentially. On the contrary, the proposed methods only need to forward once. We conduct the latency evaluation on a server with Intel Xeon E5-2680 CPU, and 1 NVIDIA P100 GPU. As shown in Table 5, the proposed method can bring great speedup for the APED inference.

Table 5: Latency comparsion.

b

stands for batchsize here. The latency is computed as the average time to decode each sentence in the testset.

	Latency(ms)	Speedup
ASR-Based ( $b=1$ )	1194±198	1.00 $\times$
ASR-Based ( $b=4$ )	966±106	1.24 $\times$
Proposed ( $b=1$ )	88±13	13.6 $\times$
Proposed ( $b=4$ )	67±6	17.8 $\times$

Table 6: Comparison between different models.

Accent

Classification

Phoneme

Classification

FAR

FRR

Acc

Precision

Recall

GOP-Based

GMM-HMM(Librispeech)²²2This result is taken from [17], Fig.4. It is trained on Librispeech train-960 and tested on L2-Arctic dataset.

0.290

ASR-Based

Initial(Librispeech)

✓

0.485

0.207

0.753

0.295

0.515

0.375

Fine-tuned(L2-Arctic)

✓

0.375

0.103

0.858

0.504

0.625

0.558

Fine-tuned(L2-Arctic)

✓

0.353

0.106

0.859

0.507

0.647

0.568

Proposed

BCE Loss

✓

0.458

0.051

0.890

0.639

0.542

0.587

BCE Loss

✓

0.429

0.054

0.890

0.641

0.571

0.603

F1 Loss

✓

0.442

0.055

0.889

0.630

0.558

0.591

F1 Loss

✓

0.428

0.058

0.889

0.622

0.572

0.596

Focal Loss

✓

0.424

0.060

0.888

0.617

0.576

0.595

Focal Loss

✓

0.423

0.055

0.882

0.636

0.577

0.605

4.3 APED Metric

For the APED task, the model should make a good balance of detecting the wrong pronunciations and accepting the correct ones. Thus, $F_{1}$ score is chosen as the main indicator for the performance. As defined in [48], the hierarchical evaluation structure is first divided into correct pronunciations and wrong pronunciations by the canonical pronounced phoneme. Next, depending on whether the predicted error state matches the ground truth label, the outcomes are further divided into true acceptance (TA), false rejection (FR), false acceptance (FA), and true rejection (TR). In other words, T/F suggests whether the prediction of the model is correct for the APED task, and A/R is the decision of the model. Based on this evaluation structure, $F_{1}$ score of the APED system is defined as follows:

Precision=\frac{TR}{TR+FR},

(6)

Recall=\frac{TR}{TR+FA},

(7)

F_{1}=2\frac{Precision*Recall}{Precision+Recall}.

(8)

To count $TR,TA,FR,FA$ for metrics, the predicted binary error states $(\hat{e_{1}},\hat{e_{2}},...,\hat{e_{k}})$ are firstly filtered by a threshold of $\theta=0.5$ to transform from a continuous float with the range of $(0,1)$ into discrete binary integer $\{0,1\}$ ,

\hat{e}\leftarrow\begin{cases}1,&\text{if }\hat{e}\geq\theta\\ 0.&\text{otherwise}\end{cases}

(9)

Next, each outcome is calculated by following equations,

TR=\sum_{i=1}^{k}(\hat{e_{i}}*e_{i}),

(10)

FR=\sum_{i=1}^{k}(\hat{e_{i}}*(1-e_{i})),

(11)

FA=\sum_{i=1}^{k}((1-\hat{e_{i}})*e_{i}),

(12)

TA=\sum_{i=1}^{k}((1-\hat{e_{i}})*(1-e_{i})).

(13)

Apart from the conventional classification-related metrics including $F_{1}$ score, accuracy, precision and recall, the false rejection rate (FRR) and the false acceptance rate (FAR) are also of vital importance to the APED task. They are calculated as follows,

FRR=\frac{FR}{TA+FR},

(14)

FAR=\frac{FA}{FA+TR}.

(15)

4.4 APED Result

We first conduct experiments to explore the auxiliary accent classification task. We start from the model obtained on Librispeech dataset, and train for another 200 epochs, with the learning rate decreased to $10^{-4}$ . We find that the GlobalMean method performs a little better than the GRU, as shown in Fig.4. We use $\alpha=0.7$ for the ASR-based Transformer in Eq.3, and lower it to $\alpha=0.1$ in Eq.5 to balance the $l_{asr}$ loss for further experiments about the proposed text-conditioned version.

Next, we adapt this pretrained ASR-based model to the proposed text-conditioned version. We still train the whole model for 200 epochs, with the learning rate of $10^{-4}$ . We set the ASR-based model without the auxiliary task for baseline and the proposed methods with ablation for comparison. We find that $\beta=0.3$ generally gives the best performance. The results are shown in Table 6. The performance of the GOP method tested on this dataset[17] and the initial Transformer model pretrained on Librispeech is also reported in this table. First of all, as the initial model is purely trained on Librispeech, which only includes standard pronunciations, its FRR is relative high as it directly treats some unseen accents as wrong pronunciations. When adapted to handle the L2-Arctic dataset, the model has a significant improvement. By further employing the accent auxiliary task, the $F_{1}$ score is increased by nearly 0.1. If we simply use the target text as the condition and change the prediction target to the error states, the basic binary cross-entropy loss can bring a 0.19 improvement in terms of the $F_{1}$ score. We discuss the effect of different loss functions for the proposed method in the next subsection.

4.5 Loss functions

First of all, as the $F_{1}$ score is an important metric for the APED task, inspired by [49], we directly utilize the generalized $F_{1}$ score to optimize the predicted error states. To make it differentiable, the sums of probabilities are used instead of counts. We do not apply Eq.9 before calculating Eq.10 - 13. We should also note that Eq.9 is applied only for metrics calculation. For all the loss functions discussed in this subsection, $\hat{e}$ is continuous. As we try to maximize the $F_{1}$ score, the $F_{1}$ evaluation loss of the proposed method is,

l_{eval}^{F_{1}}=1-F_{1}.

(16)

Another consideration is that only 14.56% of the labelled phone segments are mispronounced for the L2-Arctic dataset³³3Dataset document at https://psi.engr.tamu.edu/l2-arctic-corpus-docs/, which may cause an unbalance between correct pronunciations and mispronunciations. Thus, we adopt focal loss[50] to mine the hard labels. Formally, if we define $e_{t}$ as:

e_{t}=\begin{cases}\hat{e},&\text{if }e=1\\ 1-\hat{e}.&\text{otherwise}\end{cases}

(17)

The focal loss is,

l_{eval}^{focal}=-(1-e_{t})^{\gamma}log(e_{t}),

(18)

where $\gamma$ modulates how much the well-classified samples are down-weighted. When $\gamma=0$ , this loss function is equivalent to Eq.4.

We apply $F_{1}$ loss function and focal loss with different $\gamma$ values to the proposed model. We can see from Table 6 that, when adopting the $F_{1}$ loss function instead of the basic BCE loss, the result can be slightly improved. For the focal loss, we find that a small $\gamma$ value ( $\gamma$ =0.5 in our experiments) performs the best, and a bigger value will lead to a degraded $F_{1}$ score. Meanwhile, the auxiliary ASR task can boost the performance for all these loss functions. The focal loss version has the highest $F_{1}$ score 0.605 for the default $\theta=0.5$ , which is a relative 8.4% improvement over the baseline ASR-based method.

4.6 Analysis

We further analyse the behavior of the proposed method.

For the APED task, we need to make a trade-off between FAR and FRR. Meanwhile, as noted in [51], it is usually more unacceptable to take the correct pronunciations as wrong ones (false reject) than to regard the mispronunciations as correct ones (false acceptance). We can observe from Table 6 that the proposed methods all have a higher FAR and decreased FRR compared with ASR-based models, which suggests our model is behaving in a more acceptable way.

For the actual deployment, as the proficiency level of the target language varies among different students, the trade-off between FAR and FRR should be easy to adjust. Compared with ASR-based models, the proposed method can simply change the threshold $\theta$ to control how strict the APED system is. We further explore the effect of changing $\theta$ for different loss functions. We use a step of 0.1, i.e., $\theta\in[0.1,0.2,...,0.9]$ . The metrics are shown in Fig.5. By increasing $\theta$ , according to Eq.9, more output is judged as correct, and less output is judged as error. As a result, FAR (and precision) increases, while FRR (and recall) drops. Compared with the F1 loss version, the BCE loss and the focal loss version have a wider range of FAR and FRR when adjusting $\theta$ and can be a better choice for the actual deployment.

Quantile	Error Rate	ASR-Based F1	Proposed F1	Improvement
25%	[0,7.8%]	0.338	0.390	15.38%
50%	(7.8%,13.0%]	0.461	0.516	11.93%
75%	(13.0%,19.4%]	0.550	0.609	10.73%
100%	(19.4%,46.8%]	0.674	0.694	2.97%

Table 7: Pronunciation error rate breakdown on the testset.

Since the proposed method is related to the input text, we break the testset into different parts to show the impact of how much the canonical phonemes differ from the target phonemes, i.e., the pronunciation error rate. A larger pronunciation error rate suggests that the input text information becomes less related to the input speech. We use the focal loss version and the ASR-based baseline for comparison. The result is shown in Table 7. We can observe that the proposed method makes higher relative improvement when the error rate is lower.

Finally, we try to analyze the auxiliary ASR task. For the ASR-based Transformer, on the one hand, the encoder extracts the speech-related features as embeddings; On the other hand, the decoder uses the attention mechanism to query the corresponding weight of each memory for the input text. Thus, the attention mechanism in the decoder will conduct an alignment between the text and the corresponding speech. What will the proposed text-conditioned Transformer do if the input text is not the canonical pronunciation but the target one? We plot the attention map of the proposed method without or with the auxiliary ASR task in Fig.6 to explore its behavior. For simplicity, we call these two models as the simple version and the full version in the following discussion.

As shown in Fig.6, for the simple version, it still tries to align the speech feature with the target phonemes for the shallow layers. As for deeper layers, the alignment between the phonemes and the speech features becomes vague. We conjecture that the training target causes this phenomenon. As for the ASR task, the network has to predict the next phoneme exactly; However, for the APED task, the network just needs to handle the error pattern for each phoneme and outputs a binary state, which is an easier task. Under such a cosy target, the deeper layers may not work hard to do the alignment, but choose to focus on summarizing the error patterns. As pretraining on the ASR task can be viewed as a sequential adaption[52], the pretrained weights perform as a regularization for the APED optimization space. Meanwhile, as suggested in [53], the adapted model does not deviate from the pretrained weights significantly. Thus, based on the pretrained ASR weight, the model still has the ability to distinguish different phonemes and match the input phonemes with the audio features memory. This may be the reason that the simple version can still get a satisfying improvement, as shown in 6. When the model is required to conduct ASR task, namely, the full version, the attention maps appear to be regular, which are similar to those Transformers that are applied for ASR tasks. As a result, the full version generally performs better than the simple version.

5 Conclusion

In this study, we propose a text-conditioned Transformer for automatic pronunciation error detection. By conditioning the target phonemes as an extra input, the Transformer can directly evaluate the relationship between the input speech and the target phonemes. Thus, the error states are obtained in a fully end-to-end manner. Meanwhile, unlike the conventional autoregressive Transformer, the proposed method works in a feed-forward manner in both the training and the inference stage. We conduct a number of experiments to compare the performance of different methods and find that the proposed text-conditioned Transformer can boost the $F_{1}$ score of the APED task on the L2-Arctic dataset. The proposed method has a more reasonable FAR and FRR, and the degree of strictness can be easily adjusted by the threshold $\theta$ parameter.

References

[1] K. Beatty, Teaching and Researching: Computer-assisted Language Learning, Routledge, 2013. doi:10.4324/9781315833774.
[2] N. Stenson, B. Downing, J. Smith, K. Smith, The effectiveness of computer-assisted pronunciation training, Calico Journal (1992) 5–19.
[3] A. Lee, J. Glass, Pronunciation assessment via a comparison-based system, in: Speech and Language Technology in Education, 2013.
[4] A. Lee, Y. Zhang, J. Glass, Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, IEEE, 2013, pp. 8227–8231. doi:10.1109/icassp.2013.6639269.
[5] A. Lee, N. F. Chen, J. Glass, Personalized mispronunciation detection and diagnosis based on unsupervised error pattern discovery, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2016, pp. 6145–6149. doi:10.1109/icassp.2016.7472858.
[6] A. Lee, J. Glass, A comparison-based approach to mispronunciation detection, in: 2012 IEEE Spoken Language Technology Workshop (SLT), IEEE, IEEE, 2012, pp. 382–387. doi:10.1109/slt.2012.6424254.
[7] S. M. Witt, Use of speech recognition in computer-assisted language learning.
[8] S. Witt, S. Young, Phone-level pronunciation scoring and assessment for interactive language learning, Speech Communication 30 (2-3) (2000) 95–108. doi:10.1016/s0167-6393(99)00044-8.
[9] W.-K. Leung, X. Liu, H. Meng, CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2019, pp. 8132–8136. doi:10.1109/icassp.2019.8682654.
[10] L. Zhang, Z. Zhao, C. Ma, L. Shan, H. Sun, L. Jiang, S. Deng, C. Gao, End-to-end automatic pronunciation error detection based on improved hybrid CTC/Attention architecture, Sensors 20 (7) (2020) 1809. doi:10.3390/s20071809.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, u. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 6000–6010.
[12] N. Moritz, T. Hori, J. Le, Streaming automatic speech recognition with the transformer model, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2020, pp. 6074–6078. doi:10.1109/icassp40776.2020.9054476.
[13] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, S. Kumar, Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-t loss, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2020, pp. 7829–7833. doi:10.1109/icassp40776.2020.9053896.
[14] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al., Espnet: End-to-end speech processing toolkit, arXiv preprint arXiv:1804.00015.
[15] L. Dong, S. Xu, B. Xu, Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2018, pp. 5884–5888. doi:10.1109/icassp.2018.8462506.
[16] J. Gu, J. Bradbury, C. Xiong, V. O. Li, R. Socher, Non-autoregressive neural machine translation, arXiv preprint arXiv:1711.02281.
[17] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, R. Gutierrez-Osuna, L2-arctic: A non-native english speech corpus, in: Proc. Interspeech, 2018, p. 2783–2787. doi:10.21437/Interspeech.2018-1110.
[18] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, T.-Y. Liu, Fastspeech: Fast, robust and controllable text to speech, in: Advances in Neural Information Processing Systems, 2019, pp. 3171–3180.
[19] K. Peng, W. Ping, Z. Song, K. Zhao, Parallel neural text-to-speech, arXiv preprint arXiv:1905.08459.
[20] D. J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series., in: KDD workshop, Vol. 10, Seattle, WA, USA:, 1994, pp. 359–370.
[21] Y. Kim, H. Franco, L. Neumeyer, Automatic pronunciation scoring of specific phone segments for language instruction, in: 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE Comput. Soc. Press, 1997. doi:10.1109/icassp.1997.596227.
[22] H. Franco, L. Neumeyer, M. Ramos, H. Bratt, Automatic detection of phone-level mispronunciation for language learning, in: Sixth European Conference on Speech Communication and Technology, 1999.
[23] J. Proença, C. Lopes, M. Tjalve, A. Stolcke, S. Candeias, F. Perdigão, Detection of mispronunciations and disfluencies in children reading aloud, in: Interspeech 2017, ISCA, 2017, pp. 1437–1441. doi:10.21437/interspeech.2017-1522.
[24] W. Hu, Y. Qian, F. K. Soong, A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL)., in: Interspeech, 2013, pp. 1886–1890.
[25] J. Cheng, X. Chen, A. Metallinou, Deep neural network acoustic models for spoken assessment applications, Speech Communication 73 (2015) 14–27. doi:10.1016/j.specom.2015.07.006.
[26] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in: Proceedings of the 23rd international conference on Machine learning - ICML ’06, ACM Press, 2006, pp. 369–376. doi:10.1145/1143844.1143891.
[27] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition, Advances in Neural Information Processing Systems 2015-January (2015) 577–585.
[28] W. Chan, N. Jaitly, Q. Le, O. Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2016, pp. 4960–4964. doi:10.1109/icassp.2016.7472621.
[29] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, T. Hayashi, Hybrid CTC/Attention architecture for end-to-end speech recognition, IEEE J. Sel. Top. Signal Process. 11 (8) (2017) 1240–1253. doi:10.1109/jstsp.2017.2763455.
[30] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, R. Salakhutdinov, Transformer-xl: Attentive language models beyond a fixed-length context, arXiv preprint arXiv:1901.02860.
[31] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805.
[32] L. Sampaio Ferraz Ribeiro, T. Bui, J. Collomosse, M. Ponti, Sketchformer: Transformer-based representation for sketched structure, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2020, pp. 14153–14162. doi:10.1109/cvpr42600.2020.01416.
[33] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, arXiv preprint arXiv:2005.12872.
[34] T. Okamoto, T. Toda, Y. Shiga, H. Kawai, Transformer-based text-to-speech with weighted forced attention, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2020, pp. 6729–6733. doi:10.1109/icassp40776.2020.9053915.
[35] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, Neural speech synthesis with transformer network, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 6706–6713.
[36] R. Liu, X. Chen, X. Wen, Voice conversion with transformer network, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2020, pp. 7759–7759. doi:10.1109/icassp40776.2020.9054523.
[37] A. M. Harrison, W.-K. Lo, X.-j. Qian, H. Meng, Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training, in: International Workshop on Speech and Language Technology in Education, 2009.
[38] V. Likic, The needleman-wunsch algorithm for sequence alignment, Lecture given at the 7th Melbourne Bioinformatics Course, Bi021 Molecular Science and Biotechnology Institute, University of Melbourne (2008) 1–46.
[39] C. Chang, First language phonetic drift during second language acquisition, Ph.D. thesis (10 2010).
[40] Y. Jiao, M. Tu, V. Berisha, J. Liss, Accent identification by combining deep neural networks and recurrent neural networks trained on long and short term features, in: Interspeech 2016, ISCA, 2016, pp. 2388–2392. doi:10.21437/interspeech.2016-1148.
[41] M. Tu, A. Grabek, J. Liss, V. Berisha, Investigating the role of l1 in automatic pronunciation evaluation of l2 speech, arXiv preprint arXiv:1807.01738.
[42] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078.
[43] Y. zhao, J. Li, X. Wang, Y. Li, The speechtransformer for large-scale mandarin Chinese speech recognition, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2019, pp. 7095–7099. doi:10.1109/icassp.2019.8682586.
[44] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., The kaldi speech recognition toolkit, in: IEEE 2011 workshop on automatic speech recognition and understanding, no. CONF, IEEE Signal Processing Society, 2011.
[45] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, 2015, pp. 5206–5210. doi:10.1109/icassp.2015.7178964.
[46] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, M. Sonderegger, Montreal forced aligner: Trainable text-speech alignment using kaldi, in: Interspeech 2017, Vol. 2017, ISCA, 2017, pp. 498–502. doi:10.21437/interspeech.2017-1386.
[47] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, R. T. Gadde, Jasper: An end-to-end convolutional neural acoustic model, arXiv preprint arXiv:1904.03288.
[48] X. Qian, F. K. Soong, H. Meng, Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT), in: Eleventh Annual Conference of the International Speech Communication Association, 2010.
[49] E. Eban, M. Schain, A. Mackey, A. Gordon, R. Rifkin, G. Elidan, Scalable learning of non-decomposable objectives, in: Artificial Intelligence and Statistics, 2017, pp. 832–840.
[50] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, Focal loss for dense object detection, in: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE, 2017, pp. 2980–2988. doi:10.1109/iccv.2017.324.
[51] M. Eskenazi, An overview of spoken language technology for education, Speech Communication 51 (10) (2009) 832–844. doi:10.1016/j.specom.2009.04.005.
[52] H. H. Mao, A survey on self-supervised pre-training for sequential transfer learning in neural networks, arXiv preprint arXiv:2007.00800.
[53] V. Sanh, T. Wolf, A. M. Rush, Movement pruning: Adaptive sparsity by fine-tuning, arXiv preprint arXiv:2005.07683.