
Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection

Abstract

Current de-facto dysfluency modeling methods [1, 2] utilize template matching algorithms which are not generalizable to out-of-domain real-world dysfluencies across languages and do not scale with increasing amounts of training data. To address these problems, we propose Stutter-Solver: an end-to-end framework that detects dysfluency with accurate type and time transcription, inspired by the YOLO [3] object detection algorithm. Stutter-Solver can handle co-dysfluencies and is a natural multi-lingual dysfluency detector. To improve scalability and boost performance, we also introduce three novel dysfluency corpora: VCTK-Pro, VCTK-Art, and AISHELL3-Pro, simulating natural spoken dysfluencies, including repetition, block, missing, replacement, and prolongation, through articulatory-encodec and TTS-based methods. Our approach achieves state-of-the-art performance on all available dysfluency corpora. Code and datasets are open-sourced at https://github.com/eureka235/Stutter-Solver.

Index Terms—  dysfluency, co-dysfluency, end-to-end, multi-lingual, simulation, aphasia, clinical


(a) Stutter-Solver architecture


(b) Spatial encoder

Fig. 1: We utilize the pretrained VITS speech and text encoders to process the spectrogram and reference text, respectively, generating the soft speech-text alignment passed into the detector. The output matrix contains an existence confidence score and 5 type confidence scores (start and end bounds are omitted in the diagram). The higher the brightness, the higher the score, indicating the existence and type of dysfluency. (a) shows the serial structure of our detector: a spatial encoder followed by a temporal encoder. (b) is a diagram of a spatial encoder block; grouped convolutions are important for extracting local spatial features without completely collapsing information across the text dimension.

1 Introduction

Speech dysfluency modeling is a core module in speech therapy. The U.S. speech therapy market is projected to reach USD 6.93 billion by 2030 [4]. Technically, speech dysfluency modeling is a speech recognition problem, a field dominated by large-scale developments [5, 6, 7, 8]. However, these large ASR models struggle with dysfluent speech because ASR is inherently a dysfluency-removal process. For a long time, researchers have mainly treated dysfluency modeling as a classification problem. Early methods relied on hand-crafted features [9, 10, 11, 12, 13]; more recently, end-to-end classification approaches have been developed [14, 15, 16, 17, 18]. However, two major problems remain. First, dysfluency depends on the reference text, which previous methods have ignored. Second, simple classification is too coarse to be deployed in real speech therapy systems. [1] proposed 2D-alignment, the alignment between the reference text and a phoneme-level forced alignment, on which a template matching algorithm (with each dysfluency type as a template) performs dysfluency detection. The subsequent work, H-UDM [2], proposed a recursive extension of UDM [1] that updates word boundary segments together with alignment prediction. Nevertheless, these methods still exhibit certain limitations. First, UDM [1] and H-UDM [2] are essentially feature engineering approaches, which may not adequately handle real-world dysfluencies that do not conform to predefined templates. Second, developing templates for each language is impractical, as dysfluency templates are inherently language-dependent. Third, template matching algorithms do not utilize training data, so they cannot scale as more dysfluency data becomes available.

To address the aforementioned limitations, we approach dysfluency modeling from a simple, new perspective: dysfluency modeling can be regarded as an object detection problem in the 1D domain. We therefore conceptualize it as a detection task, inspired by YOLO [3]. We propose Stutter-Solver, which takes dysfluent speech and the reference ground-truth text as input and directly predicts dysfluency types and time boundaries in an end-to-end manner. Note that [19] uses a similar idea; however, Stutter-Solver focuses on co-dysfluencies, multi-linguality, articulatory simulation, and co-dysfluency TTS simulation. Stutter-Solver requires high-quality annotated dysfluency data (with precisely annotated types and time boundaries). Therefore, we propose an innovative articulatory-based dysfluency simulation method [20] and perform comparative experiments against TTS-based methods. We developed three synthetic dysfluency datasets: VCTK-Pro and AISHELL3-Pro, using VITS [21] for the TTS-based method, and VCTK-Art, using Articulatory Encodec [20] as a vocal tract articulation simulation tool. Both VCTK-Pro and VCTK-Art build upon the VCTK corpus [22], whereas AISHELL3-Pro builds upon the AISHELL3 corpus [23]. These datasets include repetition, missing, block, replacement, and prolongation at the phoneme and word levels for English and at the character level for Mandarin. As such, Stutter-Solver is naturally a multi-lingual co-dysfluency detector with no hand-crafted templates involved. As part of the speech therapy process, we have 38 English-speaking and 8 Chinese Mandarin-speaking nfvPPA subjects [24] from clinical collaborations. The proposed Stutter-Solver achieves state-of-the-art accuracy, bound loss, and time F1 score on our new benchmark (VCTK-Art, VCTK-Pro, AISHELL3-Pro), public corpora, and nfvPPA speech.

2 Articulatory-based Simulation

Previous research on dysfluent speech simulation [1, 14] has focused on direct manipulation of the waveform, which results in poor naturalness, as evidenced in Table 2. To address this limitation, we perform simulation in two orthogonal spaces: the articulatory space and the textual space. This section details articulatory-based simulation, while Section 3 elaborates on the textual-space approach (TTS-based simulation).

For the articulatory-based method, we simulate dysfluency by directly editing the articulatory control space, utilizing offline articulatory inversion and synthesis models (Articulatory Encodec [20]). Articulatory Encodec is composed of an acoustic-to-articulatory inversion (AAI) model and an articulatory vocoder. [20] shows that the articulatory encodec can be successfully applied to arbitrary accents and speaker identities with high performance. The pipeline is detailed below.

2.1 Method Pipeline

We first run MFA to align raw VCTK speech with its ground-truth text, obtaining 50 Hz phoneme-level forced alignments that match the EMA features from the AAI module. Various types of dysfluency are then introduced by editing the EMA features. Repetition: the target phoneme segment is duplicated 2-4 times. Replace: we sample a random phoneme from the current EMA feature to replace the target phoneme. Block: a silence segment of 10-15 units is inserted after the target phoneme, with the silence frames sampled from the beginning of the current EMA feature. Missing: the target phoneme is removed. Prolongation: we interpolate within the target phoneme, extending its duration to 4-6 times its original length. For repetition and prolongation, the target phonemes are, respectively, the first phoneme of a randomly selected word and a randomly chosen vowel. For the other dysfluency types, the target phonemes are selected arbitrarily without further restrictions. To ensure smooth auditory perception, we insert a 2-unit interpolation buffer before and after each modification. All interpolation operations mentioned above use bilinear interpolation. Beyond the phoneme-level modifications above, we also implement word-level repetition and missing, where the target word is modified instead of the target phoneme, with all other aspects remaining identical. The whole pipeline is depicted in Fig. 2.
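To make the editing operations concrete, below is a minimal NumPy sketch of phoneme-level repetition and prolongation on an EMA feature matrix. The array shapes, helper names, and the use of per-channel linear interpolation (rather than the bilinear interpolation used in our pipeline) are illustrative assumptions, not the exact implementation; the 2-unit buffer insertion is omitted for brevity.

import numpy as np

def repeat_phoneme(ema, start, end, n_copies=3):
    """Duplicate the EMA frames of one phoneme segment (repetition)."""
    segment = ema[start:end]                                   # (T_seg, n_channels)
    repeated = np.concatenate([segment] * n_copies, axis=0)
    return np.concatenate([ema[:start], repeated, ema[end:]], axis=0)

def prolong_phoneme(ema, start, end, factor=5):
    """Stretch one phoneme segment by interpolating along time (prolongation)."""
    segment = ema[start:end]
    t_old = np.linspace(0.0, 1.0, num=segment.shape[0])
    t_new = np.linspace(0.0, 1.0, num=segment.shape[0] * factor)
    stretched = np.stack(
        [np.interp(t_new, t_old, segment[:, c]) for c in range(segment.shape[1])],
        axis=1,
    )
    return np.concatenate([ema[:start], stretched, ema[end:]], axis=0)

# ema: 50 Hz articulatory features, e.g. shape (T, 12); phoneme boundaries come from MFA.
ema = np.random.randn(500, 12)
edited = repeat_phoneme(ema, start=100, end=110, n_copies=3)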

Note that only the English version of the articulatory-encodec model is available at this time, limiting our simulation contribution to English. However, we explore multi-lingual simulation with the TTS-based approach, detailed in the next section.

Fig. 2: Pipeline of articulatory-based simulation.

3 Multi-Lingual TTS-based Simulation

3.1 Method pipeline

The pipeline of TTS-based simulation can be divided into the following steps. 1) Dysfluency injection: for VCTK-Pro, we convert VCTK [22] text into IPA sequences via the VITS phonemizer, and for AISHELL3-Pro, we convert Mandarin text into pinyin sequences. We then add different types of dysfluencies at the phoneme/word level (English) and pinyin level (Chinese) according to the TTS rules (Sec. 3.2). 2) VITS [21] inference: we take the dysfluency-injected IPA/pinyin sequences as inputs, run the VITS inference procedure, and obtain the dysfluent speech. 3) Annotation: we retrieve phoneme alignments from the VITS duration model and annotate the type of dysfluency on the dysfluent region.
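As a rough illustration of step 1, the following sketch injects a repetition or a block into a tokenized phoneme sequence before VITS inference. The token conventions (e.g. a "sil" symbol for blocks) and helper names are assumptions for illustration and do not reproduce the exact rules of Fig. 3.

import random

def inject_repetition(tokens, idx, n_copies=3):
    """Repeat the token at position idx n_copies times (sound repetition)."""
    return tokens[:idx] + [tokens[idx]] * n_copies + tokens[idx:]

def inject_block(tokens, idx, sil_token="sil", n_sil=3):
    """Insert silence tokens after position idx to simulate a block."""
    return tokens[: idx + 1] + [sil_token] * n_sil + tokens[idx + 1 :]

ipa = ["p", "l", "iː", "z", "k", "ɔː", "l"]          # IPA tokens for "please call"
idx = random.randrange(len(ipa))
dysfluent = inject_repetition(ipa, idx, n_copies=3)
# The dysfluent sequence is then fed to VITS; the duration model yields phoneme
# alignments from which the dysfluent region is annotated with its type and bounds.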

3.2 Co-Dysfluency TTS rules

For VCTK-Pro, we incorporate phoneme and word-level dysfluency; for AISHELL3-Pro, we introduce character-level dysfluency. Dysfluencies are simulated via TTS rules [19], with examples provided in Fig. 3.

Fig. 3: TTS rules for VCTK-Pro and AISHELL3-Pro

In VCTK-Pro, we introduce co-dysfluency, adding multiple dysfluencies to a single utterance. Co-dysfluency is categorized into single-type and multi-type. For single-type, we insert 2-3 instances of the same type of dysfluency (covering every type mentioned above) at various positions within an utterance. For multi-type, we incorporate 5 combinations of dysfluencies: (rep-missing), (rep-block), (missing-block), (replace-block), and (prolong-block), with 2 random positions chosen for each combination within the utterance. Note that, due to ethical concerns, de-identification techniques [25] might also be involved in the process.
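A small sketch of how a multi-type co-dysfluency plan could be sampled, assuming single-type injection helpers like those sketched in Sec. 3.1; the function name and return format are hypothetical.

import random

# The five multi-type combinations used in VCTK-Pro.
COMBINATIONS = [("rep", "missing"), ("rep", "block"), ("missing", "block"),
                ("replace", "block"), ("prolong", "block")]

def sample_multi_type_plan(n_tokens):
    """Pick one combination and two distinct injection positions within the utterance."""
    combo = random.choice(COMBINATIONS)
    positions = sorted(random.sample(range(n_tokens), 2))
    return list(zip(combo, positions))

plan = sample_multi_type_plan(n_tokens=40)    # e.g. [('rep', 7), ('missing', 23)]
# Each (type, position) pair is then injected with the corresponding TTS rule.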

The statistics of the three simulated datasets are detailed in Table 1.

Table 1: Statistics of simulated datasets (hours)
Dysfluency VCTK-Pro VCTK-Art AISHELL3-Pro
Repetition 258.33 111.34 102.37
Missing 180.89 107.43 100.22
Block 132.41 56.95 104.91
Replace 128.18 56.85 100.84
Prolongation 87.50 59.62 103.03
Co-dysfluency 337.84 - -
Total 1125.1 392.19 511.37

4 Dysfluency Detection as Object Detection

Accurate dysfluency detection necessitates handling text dependencies, since stutters are not necessarily monotonic. In this work, we adopt the soft speech-text alignment from VITS [21], one of the state-of-the-art TTS models. Given this speech-text alignment as input, our model requires two main components: an optimal spatial and temporal downsampling method, and an extraction mechanism that accurately attends to the relevant dysfluent signal. Region-wise dysfluency detection can be viewed as a 1D extension of the 2D object detection problem in computer vision. Drawing inspiration from the YOLO [3] method, we design a detector that takes the soft speech-text alignment and produces a fixed-size 64 × 8 (temporal dim × output dim) output matrix. At each timestep, 8 values are predicted: the dysfluency start and end bounds, a confidence score, and C (=5) class predictions. The detector, which utilizes a region-wise prediction scheme, consists of spatial pattern collector blocks followed by a temporal analysis unit. The entire paradigm is shown in Fig. 1, and the corresponding modules are detailed in the following.
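For concreteness, here is a hedged sketch of how the 64 × 8 output could be decoded at inference time, assuming a per-region layout of [b_start, b_end, confidence, 5 class scores]; the class ordering and confidence threshold are hypothetical.

import torch

CLASS_NAMES = ("rep", "block", "missing", "replace", "prolong")   # assumed ordering

def decode_predictions(output, conf_threshold=0.5):
    """output: (64, 8) = [b_start, b_end, confidence, 5 class scores] per region."""
    detections = []
    for region in output:                              # iterate over 64 temporal regions
        b_start, b_end = region[0].item(), region[1].item()   # normalized bounds in [0, 1]
        conf = torch.sigmoid(region[2]).item()
        if conf < conf_threshold:
            continue                                   # no dysfluency in this region
        cls = torch.argmax(region[3:]).item()
        detections.append({"type": CLASS_NAMES[cls], "start": b_start,
                           "end": b_end, "confidence": conf})
    return detections

detections = decode_predictions(torch.randn(64, 8))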

4.1 Soft speech-text alignments

We obtain a $|c_{\text{text}}| \times |z|$ monotonic attention matrix $A$ from VITS [21] that represents how each input phoneme aligns with the target speech, where $c_{\text{text}}$ is the text dimension and $z$ the speech duration. For training, we use the soft alignments $A$ and apply a softmax operation across the text dimension, computing the maximum attention value for each time step. To calculate the soft alignments, we use the original pre-trained text and speech encoders: the former is a Transformer encoder [26] with relative positional embeddings, and the latter uses the non-causal residual blocks from WaveGlow [27]. This soft-alignment attention matrix is then passed into the detection head for training and inference.
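A minimal sketch of how the alignment input could be prepared from the VITS attention matrix, under the assumption that $A$ has shape $(|c_{\text{text}}|, |z|)$; the tensor sizes are illustrative.

import torch

def soft_alignment(attn):
    """Softmax across the text dimension so each speech frame gets a distribution over phonemes."""
    return torch.softmax(attn, dim=0)        # attn: (|c_text|, |z|)

A = torch.randn(42, 800)                     # 42 text tokens, 800 speech frames
A_soft = soft_alignment(A)
peak_per_frame = A_soft.max(dim=0).values    # maximum attention value at each time step
# A_soft is the soft-alignment matrix passed to the detector described in Sec. 4.2.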

4.2 Spatial-Temporal Encoders

We adopt the same spatial-temporal encoder architecture as [19], which consists of a region-wise spatial encoder and a temporal encoder.

4.2.1 Spatial Encoder

Learnable spatial pattern collector blocks are used to preserve local spatial features. Here we elaborate on the intuition. Traditional speech recognition tasks take speech features such as mel spectrograms as input, and the de facto encoder [28] is applied. However, the soft alignments $A$ mentioned above are spatially different from speech features, so that separable convolutions (pointwise and depthwise) are ineffective for such input representations. Therefore, a modified convolution paradigm was proposed [19], in which a depthwise convolution is followed by a grouped convolution rather than a pointwise convolution. This has been experimentally shown to preserve region-wise information, as visualized in Fig. 1 (b).
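A hedged PyTorch sketch of one spatial pattern collector block as described above: a depthwise convolution followed by a grouped (rather than pointwise) convolution. Channel counts, kernel sizes, group counts, and the activation are illustrative assumptions, not the exact configuration of [19].

import torch
import torch.nn as nn

class SpatialEncoderBlock(nn.Module):
    """Depthwise conv + grouped conv to keep region-wise (text x time) structure."""
    def __init__(self, channels=16, groups=4):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        self.grouped = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=groups)   # grouped, not pointwise
        self.act = nn.GELU()

    def forward(self, x):
        # x: (batch, channels, text_dim, time_dim); channels assume a prior
        # projection of the 1-channel alignment matrix.
        x = self.act(self.depthwise(x))
        return self.act(self.grouped(x))

out = SpatialEncoderBlock()(torch.randn(2, 16, 42, 800))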

4.2.2 Temporal Encoder

Since the task is to predict time-aware dysfluencies, it is technically a region-wise aggregation, i.e., a 1D sequential timing problem. To achieve this, a Transformer encoder [26] is simply applied to handle both global and local timing alignments. We employ a Transformer-base configuration in this setting.
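A minimal sketch of the temporal stage under our assumptions: the spatially encoded features are processed by a standard Transformer encoder, pooled down to 64 temporal regions, and projected to the 8-dimensional per-region output of Sec. 4. The model dimensions and the pooling choice are illustrative.

import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    def __init__(self, d_model=256, n_regions=64, out_dim=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pool = nn.AdaptiveAvgPool1d(n_regions)     # downsample time to 64 regions
        self.head = nn.Linear(d_model, out_dim)         # -> [start, end, conf, 5 classes]

    def forward(self, x):                               # x: (batch, time, d_model)
        x = self.encoder(x)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (batch, 64, d_model)
        return self.head(x)                               # (batch, 64, 8)

pred = TemporalEncoder()(torch.randn(2, 800, 256))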

4.3 Training Loss

The speech utterance is split into segments of fixed steps. For each segment, we predict three things: (1) the dysfluency confidence score (whether and how confident we are that a dysfluency exists in this segment), denoted as $y_{i}$; (2) the boundaries of the dysfluency, $b_{\text{start}}$ and $b_{\text{end}}$; and (3) the dysfluency type $c_{n}$: whether that dysfluency is a block, repetition, missing, replacement, or prolongation. The bound values are normalized between 0 and 1, using the fixed padded length as the maximum bound value. The balancing factors are $\lambda_{\text{bound}}=5$, $\lambda_{\text{conf}}=1$, and $\lambda_{\text{class}}=0.5$. $S$ is the number of regions and $n$ is the number of classes. $\mathbbm{1}_{\text{obj}}$ indicates whether a dysfluency appears in that segment. The loss function is given by:

$$\mathbb{L}=\lambda_{\text{bound}}\frac{1}{S}\sum_{i=0}^{S}\mathbbm{1}_{\text{obj}}\left[(b_{\text{start}}-\hat{b}_{\text{start}})^{2}+(b_{\text{end}}-\hat{b}_{\text{end}})^{2}\right]$$

$$-\ \lambda_{\text{conf}}\frac{1}{S}\sum_{i=0}^{S}\left[\hat{y}_{i}\log(p(y_{i}))+(1-\hat{y}_{i})\log(1-p(y_{i}))\right]$$

$$-\ \lambda_{\text{class}}\frac{1}{S}\sum_{i=0}^{S}\sum_{j=0}^{n}c_{j}\log(p(\hat{c}_{j}))$$
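A hedged PyTorch sketch of this loss, assuming predictions and targets are arranged per region as [b_start, b_end, confidence, 5 class scores]; the masking and reduction details are illustrative rather than the exact implementation.

import torch
import torch.nn.functional as F

def detection_loss(pred, target, lam_bound=5.0, lam_conf=1.0, lam_class=0.5):
    """pred/target: (S, 8) = [b_start, b_end, confidence, 5 class scores] per region."""
    obj = target[:, 2]                                   # 1 if a dysfluency lies in the region
    S = pred.shape[0]

    # (1) boundary regression, only for regions that contain a dysfluency
    bound = (obj * ((pred[:, 0] - target[:, 0]) ** 2 +
                    (pred[:, 1] - target[:, 1]) ** 2)).sum() / S

    # (2) confidence: binary cross-entropy on the existence score
    conf = F.binary_cross_entropy_with_logits(pred[:, 2], obj, reduction="sum") / S

    # (3) classification: cross-entropy over the 5 dysfluency types, objects only
    logp = F.log_softmax(pred[:, 3:], dim=-1)
    cls = -(obj.unsqueeze(-1) * target[:, 3:] * logp).sum() / S

    return lam_bound * bound + lam_conf * conf + lam_class * cls

target = torch.zeros(64, 8)
target[5] = torch.tensor([0.10, 0.22, 1.0, 1, 0, 0, 0, 0])   # one repetition in region 5
loss = detection_loss(torch.randn(64, 8), target)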

5 Experiments

5.1 Datasets

  • VCTK [22] includes recordings from 109 native English speakers. Each speaker reads out about 400 sentences from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive. This corpus encompasses about 48 hours of accented speech. It is used for simulating VCTK-Pro and VCTK-Art.

  • AISHELL-3 [23] is a large-scale, high-fidelity multi-speaker Mandarin speech corpus that includes 218 native Chinese Mandarin speakers with roughly 85 hours of emotion-neutral recordings. It is used for simulating AISHELL3-Pro.

  • LibriStutter [14] is a synthesized dataset which contains artificially stuttered speech and stutter classification labels for 5 stutter types. It was generated using 20 hours of audio selected from the ‘dev-clean-100’ section of [29].

  • UCLASS [30] contains recordings from 128 children and adults who stutter. Only 25 files have been annotated, and they are not annotated for the block class; we only used those files and did not use the block class in the corresponding experiments.

  • SEP-28K, curated by [31], contains 28,177 clips extracted from publicly available podcasts. These clips are labeled with five event types, including block, prolongation, sound/word repetition, and interjection. Clips labeled as “unsure” were excluded from the dataset.

  • Aphasia Speech is collected from our clinical collaborators. The data comprises 46 participants (38 English speakers and 8 Chinese Mandarin speakers) diagnosed with Primary Progressive Aphasia (PPA), larger than the data used in [1, 2], which includes only 3 English speakers.

5.2 Training

We trained the detector for 30 epochs using a 90/10 train/test split, applied separately to the three simulated datasets. We used a batch size of 64 and the Adam optimizer, configured with beta values of 0.9 and 0.999 and a learning rate of 3e-4. We chose not to use dropout or weight decay during training. Training on VCTK-Art, VCTK-Pro (without co-dysfluency), and AISHELL3-Pro takes a total of 39, 41, and 36 hours, respectively, on an RTX A6000.
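For reference, the optimizer settings above correspond roughly to the following setup; the detector definition is a hypothetical stand-in, and only the hyperparameters mirror the description.

import torch
import torch.nn as nn

detector = nn.Linear(512, 64 * 8)   # hypothetical stand-in for the Stutter-Solver detector
optimizer = torch.optim.Adam(detector.parameters(), lr=3e-4, betas=(0.9, 0.999))
# 30 epochs, batch size 64, 90/10 train/test split per simulated dataset;
# no dropout or weight decay is used.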

5.3 Metrics

  • Phoneme Error Rate (PER) measures how many errors (insertions, deletions, and substitutions) appear in the predicted phoneme sequence compared to the reference phoneme sequence. It is calculated by dividing the number of phoneme errors by the total number of phonemes.

  • Accuracy (Acc.) refers to the correctness of predictions regarding types of dysfluency within regions that exhibit some form of dysfluency.

  • Bound loss is calculated as the mean squared error between the predicted and actual boundaries of dysfluent regions within a 1024-length padded spectrogram, converted to a time scale using 20 ms per frame (a small sketch follows this list). For co-dysfluency analyses, the bound loss is averaged across all identified dysfluent regions.

  • Time F1 [1] measures the accuracy of boundary predictions by assessing the overlap between predicted and actual dysfluent region bounds. A sample is classified as a True Positive if any intersection occurs between these bounds.
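To make the boundary metrics concrete (referenced in the bound-loss bullet above), here is a minimal sketch under our assumptions: bounds are normalized over a 1024-frame padded spectrogram with 20 ms per frame, and a prediction counts as a true positive for Time F1 if it overlaps the ground-truth region at all; the exact averaging of the bound loss follows the description above.

def to_ms(bound, pad_len=1024, frame_ms=20.0):
    """Convert a normalized bound in [0, 1] to milliseconds on the padded spectrogram."""
    return bound * pad_len * frame_ms

def is_true_positive(pred, true):
    """Time F1 criterion: any overlap between predicted and true (start, end) bounds."""
    return max(pred[0], true[0]) < min(pred[1], true[1])

# example: a predicted region [0.10, 0.22] vs. ground truth [0.12, 0.25]
pred, true = (0.10, 0.22), (0.12, 0.25)
print(to_ms(pred[0]), is_true_positive(pred, true))   # 2048.0 (ms), True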

5.4 Evaluation of dysfluency simulation

5.4.1 MOS tests

To evaluate the plausibility and naturalness of the three datasets we constructed, we collected Mean Opinion Score (MOS, 1-5) ratings from 11 listeners. The results are displayed in Table 2. Our three simulated datasets were perceived to be far more natural than the VCTK++ [1] baseline corpus. Notably, VCTK-Pro was rated as closely mimicking human speech.

Table 2: MOS for Simulated datasets
Dysfluency Type VCTK++ VCTK-Art VCTK-Pro AISHELL3-Pro
Repetition 1.40 ± 0.70 2.61 ± 1.05 3.33 ± 0.86 3.88 ± 0.73
Missing N/A 3.44 ± 1.23 3.89 ± 1.05 3.37 ± 1.06
Block 2.80 ± 0.63 3.35 ± 1.13 3.22 ± 1.09 2.96 ± 1.03
Replace N/A 3.48 ± 1.42 2.62 ± 1.21 3.13 ± 0.74
Prolongation 1.20 ± 0.79 2.55 ± 0.53 3.00 ± 1.00 2.64 ± 1.12
Overall 1.80 ± 0.74 3.08 ± 1.12 3.21 ± 0.97 3.19 ± 0.95

5.4.2 Dysfluency intelligibility

To verify the intelligibility of the simulated datasets, we use a phoneme recognition model [32] to evaluate raw VCTK (denoted “/”) and the various types of dysfluent speech from VCTK-Pro and VCTK-Art. Results in Table 3 show generally low PER, indicating good intelligibility and usability despite higher PERs than raw VCTK. Comparatively, VCTK-Pro performs better overall, while VCTK-Art excels particularly in repetition and block. AISHELL3-Pro was not evaluated due to the lack of an available high-quality pinyin-level Chinese speech recognition model.

Table 3: Phoneme Transcription Evaluation
PER (%\%\downarrow)
Type / Repetition Missing Block Replace Prolongation
VCTK-Art 6.243 8.328 8.250 10.118 9.665 9.893
VCTK-Pro 8.869 7.600 11.974 8.004 6.346
Table 4: Accuracy (Acc) and Bound loss (BL) of the five dysfluency types trained on the VCTK-Pro and VCTK-Art.
Trainable Rep Block Miss Replace Prolong
Methods parameters Dataset Acc.% BL Acc.% BL Acc.% BL Acc.% BL Acc.% BL
H-UDM [2] 92M VCTK-Art 84.29 29ms 97.59 24ms 29.11 27ms - - - -
Stutter-Solver(VCTK-Art) 33M VCTK-Art 87.55 21ms 99.64 15ms 91.17 12ms 66.81 15ms 79.16 17ms
H-UDM [2] 92M VCTK-Pro 74.66 68ms 88.44 85ms 15.00 100ms - - - -
Stutter-Solver(VCTK-Pro) 33M VCTK-Pro 98.78 27ms 98.71 78ms 70.00 8ms 73.33 10ms 93.74 12ms
Stutter-Solver(AISHELL3-Pro) 33M AISHELL3-Pro 93.33 17ms 99.98 52 ms 95.00 2ms 95.16 13ms 96.55 16ms
Table 5: Dysfluency evaluation on Aphasia speech.
Methods Ave. Acc. (%\%, \uparrow) Best Acc.(%\%, \uparrow) Ave. BL (ms, \downarrow)
H-UDM [2] 41.8 70.22 52ms
Stutter-Solver(VCTK-Art) 52.82 93.47 (Repetition) 41ms
Stutter-Solver(VCTK-Pro) 54.19 92.54 (Block) 21ms
Stutter-Solver(AISHELL3-Pro) 72.37 94.85 (Block) 13ms

5.5 Dysfluency detection

To assess the performance of the trained detector, we conduct evaluations on the three simulated datasets, as well as on the PPA data. The results, which include type-specific detection accuracy and bound loss, are detailed in Table 4 for the simulated datasets and in Table 5 for the PPA data. Additionally, we compare our results with previous works by validating on UCLASS, LibriStutter, and SEP-28K, computing type-specific accuracy and Time F1, as shown in Table 6.

In Table 4, we used H-UDM [2] as the baseline. Both versions of Stutter-Solver (VCTK-Art and VCTK-Pro) surpassed H-UDM across all metrics. Notably, Stutter-Solver (VCTK-Pro) showed stronger results for English, and Stutter-Solver (AISHELL3-Pro) performed even better, likely due to the unique pronunciation traits of Chinese and its noticeable character-level dysfluency. In Table 6, we presented our results on publicly available datasets (UCLASS, LibriStutter, and SEP-28K). Since the original benchmarks use private test sets, direct accuracy comparisons may not be completely fair. We instead emphasized time-aware detection, reporting the Time F1 score for each dataset. All baselines, except H-UDM, scored 0. Our proposed methods consistently outperformed H-UDM in these evaluations. In Table 5, all versions of Stutter-Solver outperformed H-UDM, and the Chinese model performed best on Chinese PPA. However, the average accuracy remained low, underscoring the challenge of accurately capturing the real distribution of dysfluency.

Table 6: Type-specific accuracy (ACC) and time F1-score
Methods Dataset Accuracy (%\%, \uparrow) Time F1 (\uparrow)
Rep Prolong Block
Kourkounakis et al.  [14] UCLASS 84.46 94.89 - 0
Jouaiti et al.  [16] UCLASS 89.60 99.40 - 0
Lian et al.  [2] UCLASS 75.18 - 50.09 0.700
Stutter-Solver (VCTK-Art) UCLASS 82.56 84.83 64.42 0.806
Stutter-Solver (VCTK-Pro) UCLASS 92.00 91.43 56.00 0.893
Kourkounakis et al.  [14] LibriStutter 82.24 92.14 - 0
Lian et al.  [2] LibriStutter 85.00 - - 0.660
Stutter-Solver (VCTK-Art) LibriStutter 89.04 62.58 - 0.686
Stutter-Solver (VCTK-Pro) LibriStutter 89.71 67.74 - 0.697
Jouaiti et al.  [16] SEP-28K 78.70 93.00 - 0
Lian et al.  [2] SEP-28K 70.99 - 66.44 0.699
Stutter-Solver (VCTK-Art) SEP-28K 78.31 74.99 68.02 0.786
Stutter-Solver (VCTK-Pro) SEP-28K 82.01 89.19 68.09 0.813

5.6 Co-dysfluency

In Section 3.2, we incorporated co-dysfluency into VCTK-Pro. We trained Stutter-Solver on single-type, multi-type, and mixed-type (single & multi) co-dysfluency respectively, and measured average accuracy and bound loss using corresponding simulated data. The results, shown in Table 7, demonstrate that our detector’s performance on co-dysfluency matches its capability in simpler scenarios with only one dysfluency per utterance. This indicates that our detector handles co-dysfluency effectively. It is worth noting that this fundamental property is missing in previous work.

Table 7: Evaluation of Co-dysfluency
Methods Co-dysfluency Ave Acc.(%\%, \uparrow) Ave. BL (ms, \downarrow)
Stutter-Solver(/) / 91.24 29ms
Stutter-Solver(Single-type) Single-type 90.22 26ms
Stutter-Solver(Multi-type) Multi-type 89.59 15ms
Stutter-Solver(Mix-type) Mix-type 90.08 24 ms
Table 8: Evaluation of Multi-lingual
Methods Dataset Ave Acc.(%\%, \uparrow) Ave. BL (ms, \downarrow)
Stutter-Solver(VCTK-Pro) VCTK-Pro 91.24 29ms
Stutter-Solver(Multi-lingual) VCTK-Pro 93.98 21ms
Stutter-Solver(AISHELL3-Pro) AISHELL3-Pro 96.00 20ms
Stutter-Solver(Multi-lingual) AISHELL3-Pro 86.88 43ms

5.7 Multi-lingual

In addition to training the detector separately on single languages, we trained it simultaneously on two languages to evaluate its performance in a multi-lingual scenario. We randomly sampled 300 hours of data from both VCTK-Pro and AISHELL3-Pro for training. The results, presented in Table 8, show that multi-lingual training slightly improved detection performance for English but significantly reduced it for Chinese compared with training separately on a single language. This indicates that multi-lingual training has varying effects on detection accuracy depending on the language. It is important to note that our method does not require additional language-specific dysfluency templates, in contrast to the previous state-of-the-art work by [2].

6 Conclusions and limitations

We propose Stutter-Solver, which detects speech dysfluencies in an end-to-end manner. Stutter-Solver handles co-dysfluencies within an utterance and is a natural multi-lingual dysfluency detector. We also propose three annotated simulated dysfluency corpora, with which Stutter-Solver achieves state-of-the-art performance on several dysfluency corpora. However, limitations exist. First, performance on real nfvPPA speech is far worse than on simulated speech; future work will focus on reducing the gap between simulated and real dysfluency distributions. Second, the proposed simulated corpora are not yet large-scale, and we have not reached the limits of scaling; future work will push these limits as more data and resources become available. Third, it is worth exploring simulation in gestural space [33, 34] or rtMRI space [35] instead of the articulatory EMA space for finer-grained control. It is also worth exploring both speaker-dependent and speaker-independent dysfluencies via disentangled analysis and synthesis [36, 37, 38, 39, 40], which serves as the foundation for behavioral dysfluency studies.

7 Acknowledgement

We thank the UC Noyce Initiative, the Society of Hellman Fellows, NIH/NIDCD, and the Schwab Innovation Fund for their support.

References

  • [1] Jiachen Lian, Carly Feng, Naasir Farooqi, Steve Li, Anshul Kashyap, Cheol Jun Cho, Peter Wu, Robbie Netzorg, Tingle Li, and Gopala Krishna Anumanchipalli, “Unconstrained dysfluency modeling for dysfluent speech transcription and detection,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
  • [2] Jiachen Lian and Gopala Anumanchipalli, “Towards hierarchical spoken language dysfluency modeling,” Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 2024.
  • [3] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” 2016.
  • [4] “ppa market,” https://www.fortunebusinessinsights.com/u-s-speech-therapy-market-105574.
  • [5] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
  • [6] Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al., “Google usm: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
  • [7] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al., “Scaling speech technology to 1,000+ languages,” arXiv preprint arXiv:2305.13516, 2023.
  • [8] Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, and Michael Auli, “Av-data2vec: Self-supervised learning of audio-visual speech representations with contextualized target representations,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.
  • [9] Ooi Chia Ai, M. Hariharan, Sazali Yaacob, and Lim Sin Chee, “Classification of speech dysfluencies with mfcc and lpcc features,” Expert Systems with Applications, vol. 39, no. 2, pp. 2157–2165, 2012.
  • [10] Lim Sin Chee, Ooi Chia Ai, M. Hariharan, and Sazali Yaacob, “Automatic detection of prolongations and repetitions using lpcc,” in 2009 International Conference for Technical Postgraduates (TECHPOS), 2009, pp. 1–4.
  • [11] Iman Esmaili, Nader Jafarnia Dabanloo, and Mansour Vali, “Automatic classification of speech dysfluencies in continuous speech based on similarity measures and morphological image processing tools,” Biomedical Signal Processing and Control, vol. 23, pp. 104–114, 2016.
  • [12] Melanie Jouaiti and Kerstin Dautenhahn, “Dysfluency classification in speech using a biological sound perception model,” in 2022 9th International Conference on Soft Computing & Machine Intelligence (ISCMI), 2022, pp. 173–177.
  • [13] Tedd Kourkounakis, Amirhossein Hajavi, and Ali Etemad, “Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6089–6093.
  • [14] Tedd Kourkounakis, Amirhossein Hajavi, and Ali Etemad, “Fluentnet: End-to-end detection of stuttered speech disfluencies with deep learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2986–2999, 2021.
  • [15] Sadeen Alharbi, Madina Hasan, Anthony JH Simons, Shelagh Brumfitt, and Phil Green, “Sequence labeling to detect stuttering events in read speech,” Computer Speech & Language, vol. 62, pp. 101052, 2020.
  • [16] Melanie Jouaiti and Kerstin Dautenhahn, “Dysfluency classification in stuttered speech using deep learning for real-time applications,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6482–6486.
  • [17] Peter Howell and Stevie Sackin, “Automatic recognition of repetitions and prolongations in stuttered speech,” in Proceedings of the first World Congress on fluency disorders. University Press Nijmegen Nijmegen, The Netherlands, 1995, vol. 2, pp. 372–374.
  • [18] Payal Mohapatra, Bashima Islam, Md Tamzeed Islam, Ruochen Jiao, and Qi Zhu, “Efficient stuttering event detection using siamese networks,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
  • [19] Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, et al., “Yolo-stutter: End-to-end region-wise speech dysfluency detection,” Interspeech, 2024.
  • [20] Cheol Jun Cho, Peter Wu, Tejas S. Prabhune, Dhruv Agarwal, and Gopala K. Anumanchipalli, “Articulatory encodec: Vocal tract kinematics as a codec for speech,” arXiv preprint arXiv:2406.12998, 2024.
  • [21] Jaehyeon Kim, Jungil Kong, and Juhee Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540.
  • [22] Junichi Yamagishi, Christophe Veaux, Kirsten MacDonald, et al., “Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019.
  • [23] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li, “Aishell-3: A multi-speaker mandarin tts corpus and the baselines,” 2015.
  • [24] Maria Luisa Gorno-Tempini, Argye E Hillis, Sandra Weintraub, Andrew Kertesz, Mario Mendez, Stefano F Cappa, Jennifer M Ogar, Jonathan D Rohrer, Steven Black, Bradley F Boeve, et al., “Classification of primary progressive aphasia and its variants,” Neurology, vol. 76, no. 11, pp. 1006–1014, 2011.
  • [25] Yang Gao, Jiachen Lian, Bhiksha Raj, and Rita Singh, “Detection and evaluation of human and machine generated speech in spoofing attacks on automatic speaker verification systems,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 544–551.
  • [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems. 2017, vol. 30, Curran Associates, Inc.
  • [27] Ryan J. Prenger, Rafael Valle, and Bryan Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621, 2018.
  • [28] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, “Conformer: Convolution-augmented transformer for speech recognition,” 2020.
  • [29] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 5206–5210.
  • [30] Peter Howell, Steve Davis, and Jon Bartrip, “The uclass archive of stuttered speech,” Journal of Speech Language and Hearing Research, vol. 52, pp. 556, 2009.
  • [31] Colin Lea, Vikramjit Mitra, Aparna Joshi, Sachin Kajarekar, and Jeffrey P Bigham, “Sep-28k: A dataset for stuttering event detection from podcasts with people who stutter,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6798–6802.
  • [32] Xinjian Li, Siddharth Dalmia, Juncheng Li, Matthew Lee, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R Mortensen, Graham Neubig, Alan W Black, and Metze Florian, “Universal phone recognition with a multilingual allophone system,” in ICASSP 2020. IEEE, 2020, pp. 8249–8253.
  • [33] Jiachen Lian, Alan W Black, Louis Goldstein, and Gopala K. Anumanchipalli, “Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition,” in Proc. Interspeech 2022, 2022, pp. 4686–4690.
  • [34] Jiachen Lian, Alan W Black, Yijing Lu, Louis Goldstein, Shinji Watanabe, and Gopala K Anumanchipalli, “Articulatory representation learning via joint factor analysis and neural matrix factorization,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [35] Peter Wu, Tingle Li, Yijing Lu, Yubin Zhang, Jiachen Lian, Alan W Black, Louis Goldstein, Shinji Watanabe, and Gopala K. Anumanchipalli, “Deep Speech Synthesis from MRI-Based Articulatory Representations,” in Proc. INTERSPEECH 2023, 2023, pp. 5132–5136.
  • [36] Jiachen Lian, Chunlei Zhang, Gopala K. Anumanchipalli, and Dong Yu, “Unsupervised tts acoustic modeling for tts with conditional disentangled sequential vae,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2548–2557, 2023.
  • [37] Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, and Shiyu Chang, “Contentvec: An improved self-supervised speech representation by disentangling speakers,” in ICML, 2022.
  • [38] Jiachen Lian, Chunlei Zhang, and Dong Yu, “Robust disentangled variational speech representation learning for zero-shot voice conversion,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6572–6576.
  • [39] Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, and Dong Yu, “Towards Improved Zero-shot Voice Conversion with Conditional DSVAE,” in Proc. Interspeech 2022, 2022, pp. 2598–2602.
  • [40] Hyeong-Seok Choi, Jinhyeok Yang, Juheon Lee, and Hyeongju Kim, “Nansy++: Unified voice synthesis with neural analysis and synthesis,” ICLR, 2022.