Articulatory Representation Learning Via Joint Factor Analysis and Neural Matrix Factorization
Abstract
Articulatory representation learning is fundamental to modeling the neural speech production system. Our previous work established a deep paradigm that decomposes articulatory kinematics data into gestures, which explicitly model the phonological and linguistic structure encoded in the human speech production mechanism, and the corresponding gestural scores. We continue this line of work by raising two concerns: (1) the articulators are entangled in the original algorithm, so that some of the gestures do not capture articulator-specific moving patterns, which limits the interpretability of both gestures and gestural scores; (2) the EMA data is sparsely sampled from the articulators, which limits the intelligibility of the learned representations. In this work, we propose a novel articulatory representation decomposition algorithm that takes advantage of guided factor analysis to derive articulator-specific factors and factor scores. A neural convolutive matrix factorization algorithm is then applied to the factor scores to derive the new gestures and gestural scores. We experiment with the rtMRI corpus, which captures fine-grained vocal tract contours. Both subjective and objective evaluation results suggest that the newly proposed system delivers articulatory representations that are intelligible, generalizable, efficient and interpretable.
Index Terms— Articulatory, Factor Analysis, Gestural Scores
1 Introduction
The mainstream research in deep speech representation learning aims to develop a human-like neural speech processing system. However, the gap between human and machine intelligence is not straightforward to close in the post-transformer era, by which we refer to the period in which the transformer architecture has been widely adopted across artificial intelligence [1, 2, 3, 4]. Inspired by the concept of an autonomous human intelligence system [2], we aim to develop a human speech processing system that is interpretable, efficient and effective. There are broadly two types of such systems. The first is the neural speech perception system [5, 6, 7], which is the theoretical model underlying most current neural networks. The second is the neural speech production system [8, 7], which learns speech representations in a way that simulates the human speech production process. We focus on the second line. Directly modeling the speech production system from raw speech is fairly difficult, and little work has explored this direction. To simplify the problem and make an initial attempt, we model the speech production system and derive speech representations from the articulatory signal, which we call articulatory representation learning.
Articulatory representation learning is defined within the framework of articulatory phonology [9], which models the relation between phonological representations, a set of discrete units called gestures, and their variability in time: the magnitude of their activation and the temporal intervals over which they are active, as represented in gestural scores. The gestures explicitly capture the moving patterns of the different articulators. The gestural scores are the ultimate form of the articulatory representations, and they should be 1) intelligible, 2) interpretable and 3) sparse. Together, the gestures and gestural scores form a simple yet effective speech production system. Previously, only a few works have attempted to derive gestures and gestural scores in a data-driven manner. [10] adopts convolutive sparse non-negative matrix factorization (CSNMF) to decompose articulatory data into gestures and gestural scores. Our previous work [11] proposed an end-to-end neural convolutive matrix factorization (NCMF) paradigm so that both gestures and gestural scores can be learned automatically via neural networks. However, directly applying matrix factorization to the articulatory data is problematic. In the task dynamics model of speech production [12], each articulator contributes to the production of speech with a certain percentage, and these percent contributions should be carefully determined [13]. In the NCMF paradigm [11], the gestural scores, which implicitly encode the percent contributions of the articulators, are highly sensitive to the parameter initialization, so the contributions are effectively determined at random. This also affects the dynamical patterns of the gestures, so that some gestures do not capture articulator-specific moving patterns, which limits the interpretability of both gestures and gestural scores.
To alleviate the aforementioned problems, we propose a two-step articulatory decomposition scheme. In the first step, we adopt the guided factor analysis algorithm [14, 13] to extract factors and factor scores from the articulatory data. Each factor characterizes spatial variation in the position and shape of an articulator, while the factor scores parameterize how the positions and shapes of all articulators change over time. This first step ensures that the percent contribution of each articulator stays within a reasonable range and that articulator-specific patterns can be captured. In the second step, we perform NCMF [11] on the factor scores to obtain the sparse gestural scores and gestures. As mentioned, the gestural scores are the ultimate form of the articulatory representation. Different from NCMF [11], however, we define the matrix product of the NCMF gestures and the factors as the new gestures, which capture fine-grained articulatory moving patterns. Details of the new gesture formulation are given in Sec. 3, Eq. 11. Combining these two decomposition steps, we obtain the gestural scores and gestures from the articulatory data. Another disadvantage of NCMF [11] is that the MNGU0 EMA data is sparsely sampled from the articulators, which limits the intelligibility of the learned representations. In this paper, we instead experiment with the rtMRI corpus [15], which captures fine-grained vocal tract contours, to improve the intelligibility of the learned representations. Both subjective and objective evaluation results suggest that the newly proposed system delivers decent articulatory representations in terms of intelligibility, interpretability, efficiency and generalizability.

2 Guided Factor Analysis
2.1 Problem Formulation
Given the vocal tract video representation $X \in \mathbb{R}^{T \times 2M}$, which is a set of $T$ consecutive frames of vocal tract contours, guided factor analysis [14, 13] aims to decompose $X$ into articulator-specific factors $F \in \mathbb{R}^{K \times 2M}$ and a time-domain representation, the factor scores $H \in \mathbb{R}^{T \times K}$, where $2M$ counts the x, y coordinates of the $M$ contour vertices and $K$ denotes the number of factors. Guided factor analysis can be formulated as in Eq. 1:

$X \approx H F$    (1)
The objective of guided factor analysis is to parameterize the vocal tract representation $X$ as a linear combination [13] of factors such that each factor characterizes spatial variation in the position and shape of an articulator, while the factor scores $H$ characterize the temporal variation in the positions and shapes of the articulators, i.e., of the factors. In this work, we consider 5 articulators: jaw, tongue, lips, velum and larynx, and we have one factor per articulator. Thus, there are $K = 5$ factors in total:

$F = [F_{jaw}^{\top},\ F_{tongue}^{\top},\ F_{lip}^{\top},\ F_{velum}^{\top},\ F_{larynx}^{\top}]^{\top}, \quad F_j \in \mathbb{R}^{1 \times 2M}$    (2)
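As a shape check for Eqs. 1 and 2, the following minimal NumPy sketch instantiates the decomposition with toy dimensions; the values of T, M and the random data are placeholders, not the corpus settings:

```python
import numpy as np

# Toy shapes: T frames, M contour points (x and y interleaved), K = 5 factors.
T, M, K = 100, 85, 5                  # placeholder values, not the corpus settings
X = np.random.randn(T, 2 * M)         # vocal tract contour data
F = np.random.randn(K, 2 * M)         # stacked articulator-specific factors (Eq. 2)

H = X @ np.linalg.pinv(F)             # least-squares factor scores for this toy F
X_hat = H @ F                         # rank-K reconstruction X ~ H F (Eq. 1)
print(X.shape, H.shape, F.shape, X_hat.shape)   # (100, 170) (100, 5) (5, 170) (100, 170)
```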
2.2 Factors
Factors are extracted from the vocal tract representation $X$ via an eigenvalue decomposition algorithm. In order to obtain articulator-specific representations, we apply an articulator-specific projection matrix $P_j$ to $X$ such that only the contour points on the current articulator are kept while all other entries are set to zero. The projection mechanism is formulated in Eq. 3, where $j \in \{\text{jaw}, \text{tongue}, \text{lip}, \text{velum}, \text{larynx}\}$ and $P_j \in \{0, 1\}^{2M \times 2M}$ is a binary, diagonal projection matrix. Given an index $i$, the entries $P_j(2i-1, 2i-1)$ and $P_j(2i, 2i)$, which correspond to the x and y components of the $i$-th contour point, are 1 if and only if this point lies on the current articulator $j$. $P_j$ is derived manually from the articulatory segmentation labels of the data [13].

$X_j = X P_j$    (3)
It is also possible to keep a set of articulators, since the motions of some articulators such as the jaw, tongue and lips co-occur with each other. For example, to keep both the jaw and lip parts, the joint projection becomes $X_{jaw,lip} = X (P_{jaw} + P_{lip})$. In the next step, eigenvalue decomposition is applied separately to extract the articulator-specific factors. Consistent with [13], factors are extracted in a two-step manner.
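A minimal sketch of the projection step, assuming per-point articulator labels are available from the segmentation; the helper function and toy labels below are illustrative, not part of the released pipeline:

```python
import numpy as np

def projection_matrix(labels, articulators):
    """Binary diagonal P_j (Eq. 3): keep the x, y entries of contour points whose
    segmentation label is in `articulators`, zero out everything else."""
    keep = np.zeros(2 * len(labels))
    for i, lab in enumerate(labels):
        if lab in articulators:
            keep[2 * i] = keep[2 * i + 1] = 1.0   # x and y components of the i-th point
    return np.diag(keep)

# Toy segmentation: 6 contour points spread over three articulators.
labels = ["jaw", "jaw", "lip", "lip", "tongue", "tongue"]
X = np.random.randn(50, 2 * len(labels))

X_jaw = X @ projection_matrix(labels, {"jaw"})                 # Eq. 3
X_jaw_lip = X @ projection_matrix(labels, {"jaw", "lip"})      # joint projection
```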
First, we extract the jaw factor, which captures the largest part of the vocal tract motion. Following [13], the jaw factor is defined in Eq. 4, where $X_{jaw} = X P_{jaw}$ and $X_{jaw,lip,tongue} = X (P_{jaw} + P_{lip} + P_{tongue})$. $U_{jaw}$ and $\Lambda_{jaw}$ denote the eigenvector matrix and the diagonal variance matrix of the covariance matrix $X_{jaw}^{\top} X_{jaw}$, i.e., $X_{jaw}^{\top} X_{jaw} = U_{jaw} \Lambda_{jaw} U_{jaw}^{\top}$. Different from [13], we do not apply a scaling factor to the covariance matrix since it has little effect on the results. Note that $U_{jaw}$ and $\Lambda_{jaw}$ capture the direction and variance of the jaw movement; multiplying them with $X_{jaw,lip,tongue}$ makes the jaw factor capture the joint motion of the jaw, lip and tongue articulators. More details can be found in [13]. After obtaining the jaw factor $F_{jaw}$, the vocal tract data can be recovered via Eq. 5, where $(\cdot)^{+}$ denotes the Moore-Penrose pseudo-inverse.

$F_{jaw} = \Lambda_{jaw}^{-\frac{1}{2}} U_{jaw}^{\top} X_{jaw}^{\top} X_{jaw,lip,tongue}$    (4)

$\hat{X}_{jaw} = X F_{jaw}^{+} F_{jaw}$    (5)
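The NumPy sketch below follows the reconstruction of Eqs. 4 and 5 given above (whitened jaw principal directions regressed against the jaw/lip/tongue data); it illustrates that reading and is not the authors' implementation:

```python
import numpy as np

def jaw_factor(X, P_jaw, P_lip, P_tongue, eps=1e-8):
    """Jaw factor (Eq. 4 as reconstructed above)."""
    X_jaw = X @ P_jaw
    X_jlt = X @ (P_jaw + P_lip + P_tongue)         # jaw + lip + tongue data
    lam, U = np.linalg.eigh(X_jaw.T @ X_jaw)       # eigendecomposition of the covariance
    lam, U = lam[::-1], U[:, ::-1]                 # sort eigenvalues in descending order
    valid = lam > eps                              # masked-out coordinates have zero variance
    W = np.diag(lam[valid] ** -0.5) @ U[:, valid].T
    F_all = W @ X_jaw.T @ X_jlt                    # one candidate factor per component
    return F_all[:1]                               # keep only the first principal component

def recover_jaw_part(X, F_jaw):
    """Jaw-explained portion of the data (Eq. 5)."""
    return X @ np.linalg.pinv(F_jaw) @ F_jaw
```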
Second, we extract the tongue, lip, velum and larynx factors. The motion captured by these factors should be independent of the jaw factor [13]. In order to remove such dependence, we remove the jaw factor component from the vocal tract data via Eq. 6. We then apply the same mask as in Eq. 3 to obtain the articulator-specific, jaw-free vocal tract contours, as formulated in Eq. 7.

$\tilde{X} = X - \hat{X}_{jaw} = X - X F_{jaw}^{+} F_{jaw}$    (6)

$\tilde{X}_j = \tilde{X} P_j, \quad j \in \{\text{tongue}, \text{lip}, \text{velum}, \text{larynx}\}$    (7)
Following [13], the factor for a given articulator $j$ can be derived via Eq. 8, where $U_j$ and $\Lambda_j$ are the eigenvector and diagonal variance matrices of the covariance matrix $\tilde{X}_j^{\top} \tilde{X}_j = U_j \Lambda_j U_j^{\top}$:

$F_j = \Lambda_j^{-\frac{1}{2}} U_j^{\top} \tilde{X}_j^{\top} \tilde{X}_j$    (8)
For all factors, we take only the first principal component.
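Continuing the same sketch under the reconstruction of Eqs. 6 to 8 above, each remaining factor reduces to removing the jaw-explained component, masking to one articulator, and keeping the first principal direction:

```python
import numpy as np

def articulator_factor(X, F_jaw, P_j):
    """Tongue / lip / velum / larynx factor (Eqs. 6-8 as reconstructed above)."""
    X_res = X - X @ np.linalg.pinv(F_jaw) @ F_jaw   # Eq. 6: remove the jaw component
    X_j = X_res @ P_j                               # Eq. 7: articulator-specific mask
    lam, U = np.linalg.eigh(X_j.T @ X_j)            # Eq. 8: eigendecomposition
    u1, l1 = U[:, -1], lam[-1]                      # first principal direction and its variance
    return (np.sqrt(l1) * u1)[None, :]              # 1 x 2M factor row

# F = np.vstack([F_jaw, F_tongue, F_lip, F_velum, F_larynx])   # stack into Eq. 2
```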
2.3 Factor scores
Given the stacked factors $F$ in Eq. 2, the factor scores are obtained by projecting the vocal tract data onto the factors, i.e., as the least-squares solution of Eq. 1:

$H = X F^{+}$    (9)
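Under this least-squares reading of Eq. 9, computing the factor scores is a single pseudo-inverse projection:

```python
import numpy as np

def factor_scores(X, F):
    """Factor scores H (Eq. 9 as reconstructed above): project X onto the factors."""
    return X @ np.linalg.pinv(F)    # (T, K)
```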
3 Neural Convolutive Matrix Factorization
In the factor analysis algorithm, the vocal tract data is projected onto articulator-specific bases called factors, and the factor scores are the actual representations that retain most of the information in the original data. However, the articulatory representation is also expected to be sparse. In this section, we apply neural convolutive matrix factorization [11] to the factor scores to obtain the sparse representations (gestural scores) and the gestures. According to [11], the convolutive matrix factorization (reconstruction) is formulated in Eq. 10:

$\hat{H}^{\top} = \sum_{t=0}^{\phi-1} W_t\, \overset{t\rightarrow}{G}$    (10)
Now we explain these terms one by one. $\hat{H}$ denotes the reconstruction of the factor scores $H$. $W$ is a 1-D convolutional kernel with kernel size $\phi$, input channel size $N_g$ and output channel size $K$, and $W_t \in \mathbb{R}^{K \times N_g}$ is its slice at the $t$-th kernel position. Note that in [11], $W$ is called the gestures and $N_g$ is the number of gestures. In this work, we still denote $N_g$ as the number of gestures; however, we define the new gestures in Eq. 11. $G \in \mathbb{R}^{N_g \times T}$ is the gestural scores, which are the ultimate form of the articulatory representation. $\overset{t\rightarrow}{G}$ indicates that the columns of $G$ are shifted $t$ steps to the right.
$\tilde{W}_t = W_t^{\top} F \in \mathbb{R}^{N_g \times 2M}, \quad t = 0, \dots, \phi - 1$    (11)
According to [11], the entire matrix factorization can be implemented via an auto-encoder framework. The encoder takes the factor scores $H$ derived from Eq. 9 as input and outputs the gestural scores $G$. The decoder takes the gestural scores as input and reconstructs the factor scores $\hat{H}$. The encoder can be any type of neural network such that $G = \mathrm{Enc}(H)$, while the decoder is a single 1-D convolutional layer whose weights are the kernel $W$ in Eq. 10.
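A minimal PyTorch sketch of this auto-encoder view of Eqs. 10 and 11. The encoder width, the number of gestures N_g, the window size phi and the causal left-padding are illustrative assumptions; only the overall structure (a small convolutional encoder plus a single 1-D convolutional decoder whose weights play the role of W) follows the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

K, N_g, phi, T, M = 5, 12, 8, 200, 85     # N_g, phi, M are placeholders, not the paper's values

# Encoder: factor scores (K channels) -> gestural scores (N_g channels).
encoder = nn.Sequential(
    nn.Conv1d(K, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(64, N_g, kernel_size=3, padding=1), nn.ReLU(),  # ReLU keeps the scores non-negative
)
# Decoder: a single 1-D convolution; its weight tensor (K, N_g, phi) is W in Eq. 10.
decoder = nn.Conv1d(N_g, K, kernel_size=phi, bias=False)

H = torch.randn(1, K, T)                                  # factor scores, channels-first
G = encoder(H)                                            # gestural scores
H_hat = decoder(F_nn.pad(G, (phi - 1, 0)))                # causal conv ~ right-shifted sum in Eq. 10

# New gestures (Eq. 11): combine each kernel slice W_t with the factors (K x 2M).
factors = torch.randn(K, 2 * M)
new_gestures = torch.einsum("knt,km->ntm", decoder.weight, factors)   # (N_g, phi, 2M)
```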
We use almost the same loss objectives as introduced in [11]. The reconstruction loss $\mathcal{L}_{recon}$ is the L2 loss between $H$ and $\hat{H}$. The sparsity loss $\mathcal{L}_{sparse}$ is computed over the row vectors of $G$, i.e., only along the time dimension, and penalizes low row-wise sparsity as measured by Eq. 12, consistent with [16], where $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the L1 and L2 norms. The Connectionist Temporal Classification (CTC) [17] loss $\mathcal{L}_{ctc}$ is introduced when performing phoneme recognition. $\lambda_1$ and $\lambda_2$ are balancing weights in the overall objective (Eq. 13). Different from [11], the entropy loss is removed, since we experimentally find it less helpful in improving the interpretability of the gestural scores.

$S(G) = \frac{1}{N_g} \sum_{i=1}^{N_g} \frac{\sqrt{T} - \|G_{i,:}\|_1 / \|G_{i,:}\|_2}{\sqrt{T} - 1}$    (12)

$\mathcal{L} = \mathcal{L}_{recon} + \lambda_1 \mathcal{L}_{sparse} + \lambda_2 \mathcal{L}_{ctc}$    (13)
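A sketch of the loss terms in Eqs. 12 and 13, assuming the Hoyer sparsity is averaged over the gesture (row) dimension of G as described above; the default balancing weights below are placeholders, since their exact values are not recoverable from this text:

```python
import torch

def hoyer_sparsity(G, eps=1e-8):
    """Row-wise Hoyer sparsity [16] of gestural scores G (..., N_g, T), averaged
    over gestures: 1 = maximally sparse, 0 = maximally dense (Eq. 12)."""
    T = G.shape[-1]
    l1 = G.abs().sum(dim=-1)
    l2 = G.pow(2).sum(dim=-1).sqrt()
    return ((T ** 0.5 - l1 / (l2 + eps)) / (T ** 0.5 - 1)).mean()

def total_loss(H, H_hat, G, lambda1=1.0, lambda2=0.0, ctc_loss=None):
    """Eq. 13 with placeholder weights: L2 reconstruction + sparsity (+ CTC)."""
    loss = torch.mean((H - H_hat) ** 2) + lambda1 * (1.0 - hoyer_sparsity(G))
    if ctc_loss is not None:          # only active during phoneme recognition / fine-tuning
        loss = loss + lambda2 * ctc_loss
    return loss
```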
4 Experiments
4.1 Dataset and Preprocessing
The rtMRI dataset includes eight (four male, four female) speakers of American English [15]. All speakers were asked to read the same visually presented text from a paper card, and each speaker produced the sequence of utterances ten times. The vocal tract constrictions were recorded in real-time MRI videos; the real-time MRI pulse sequence parameters can be found in [13]. The video frame rate is 83 frames per second. We directly use the segmentation results from [13], which adopts the segmentation method of [18]. As mentioned in Sec. 2.1, 5 articulators are involved: jaw, tongue, lips, velum and larynx, and each image frame is represented by a fixed set of contour points with x, y coordinates. For the phoneme experiments, we also extract mel-spectrograms (win=1024, hop=256) and WavLM [19] base features from the audio waveform. The sampling rate of the rtMRI audio is 16 kHz. We use MFA [20] to extract monophones given text-audio pairs; there are 72 monophones in total.
4.2 Module Details
The entire articulatory representation learning pipeline is shown in Fig. 1. Guided factor analysis is done offline and we use the same implementation as [13]. The encoder takes the factor scores and generates the gestural scores, which are fed into the decoder to reconstruct the factor scores. We use a similar encoder/decoder to [11]: the encoder consists of 2 convolutional layers and the decoder is a single convolutional layer. The configurations of these three layers are determined by the number of contour points per image, the number of gestures $N_g$ and the window size $\phi$ of the convolutive matrix factorization. Note that the weights of the decoder are the "gestures" as defined in [11]; in this work, the gestures are the product of these "gestures" and the factors, as shown in Eq. 11. The phoneme recognizer takes different types of features as input and is optimized via CTC [17] training. For rtMRI data, we simply reshape each image frame into a vector so that the entire rtMRI video sequence becomes a 2-D feature; all other features are kept in their original form. There are two types of phoneme recognizers: base and large. The base model consists of 3 multi-head attention blocks with 4 attention heads and a feed-forward dimension of 128. The large model consists of 6 multi-head attention blocks with 8 attention heads and a feed-forward dimension of 256.
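A sketch of the CTC-Base recognizer described above (3 attention blocks, 4 heads, feed-forward dimension 128, 72 monophones plus a CTC blank); the model dimension and the input projection are our assumptions:

```python
import torch
import torch.nn as nn

class CTCPhonemeRecognizer(nn.Module):
    def __init__(self, input_dim, d_model=128, n_layers=3, n_heads=4,
                 ff_dim=128, n_phones=72):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)          # map any feature type to d_model
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=ff_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_phones + 1)       # +1 for the CTC blank symbol

    def forward(self, x):                                  # x: (batch, frames, input_dim)
        return self.head(self.encoder(self.proj(x))).log_softmax(dim=-1)

# CTC-Large: n_layers=6, n_heads=8, ff_dim=256.
```

The CTC loss between these log-probabilities and the MFA monophone targets can then be computed with torch.nn.CTCLoss, which expects time-major inputs.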
4.3 Implementation Details
We follow [13] to first obtain the factors and factor scores. We then consider two sets of experiments. (1) We perform rtMRI factor score resynthesis to extract the gestural scores and derive the gestures. (2) We use various features, including the original rtMRI data, factor scores, gestural scores, mel spectrograms and WavLM [19] features, to perform phoneme recognition. For the gestural scores, we additionally fine-tune both the encoder and the decoder, denoted as Gestural Scores (FT) in Table 1. For the overall loss function defined in Eq. 13, $\lambda_2$ is set to zero by default and the CTC term is only enabled when we fine-tune the gestural scores. For all experiments, the Adam [21] optimizer is used with an initial learning rate of 0.001 and a weight decay of 4e-4. All experiments are trained for 1k updates with a batch size of 4, and the learning rate is decayed every 10 updates by a factor of 0.95. For the phoneme recognition experiments, we also vary the training size by gradually increasing the number of speakers. For each speaker, 3 random utterances are held out as the test set and the remainder is used for training. Beam search with a width of 10 is used for decoding. We evaluate the intelligibility, generalizability and efficiency of the gestural scores and the interpretability of both the gestural scores and the gestures.
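This optimization setup maps onto a standard PyTorch loop; the model and the dummy loss below are stand-ins for the actual encoder/decoder (or recognizer) and Eq. 13:

```python
import torch

model = torch.nn.Linear(10, 10)        # stand-in for the actual encoder/decoder or recognizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=4e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.95)  # decay every 10 updates

for step in range(1000):               # 1k updates, batch size 4
    loss = model(torch.randn(4, 10)).pow(2).mean()   # dummy loss in place of Eq. 13
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```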
4.4 Results and Discussion
4.4.1 Intelligibility and Generalizability
The intelligibility of the representations is evaluated via phoneme error rate (PER). In Table 1, rtMRI achieves a better PER than Mel and WavLM because it captures most of the vocal tract information, while the rtMRI audio is noisy. Since information is reduced from rtMRI to factor scores and from factor scores to gestural scores, the PER increases correspondingly. Although the PER of the gestural scores is the highest, it is still a promising number, indicating that the gestural scores are intelligible. WavLM achieves a better PER than Mel under all settings, which is in line with [22, 23]. We also observe that the articulatory features (rtMRI, Factor Scores, Gestural Scores) are less sensitive to data and model size. We use "Range" and "Model Variance" to evaluate this. Range is the absolute difference between the PER obtained with the smallest training size (1 speaker) and that with the largest training size (8 speakers). Model variance is the mean squared PER difference between CTC-Base and CTC-Large, averaged over all speakers; since it measures the change with respect to CTC-Base, it is only reported for CTC-Large (marked "/" for CTC-Base). As shown in the table, Mel and WavLM have the largest Range and Model Variance, indicating that they are quite sensitive to data and model scale. We conclude that articulatory features are more generalizable representations.
Table 1: Phoneme error rate (PER, %) as the number of training speakers increases from 1 to 8, together with Range, Model Variance and Sparsity for each feature.

| Speakers | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Range | Model Variance | Sparsity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **CTC-Base** | | | | | | | | | | | |
| rtMRI | 18.4 | 18.3 | 17.5 | 16.4 | 16.3 | 17.1 | 15.7 | 12.1 | 6.3 | / | 0.14 |
| Mel | 32.1 | 25.1 | 24.2 | 24.1 | 19.7 | 16.4 | 15.3 | 13.5 | 18.6 | / | 0.33 |
| WavLM | 27.2 | 25.2 | 24.1 | 20.5 | 18.3 | 17.9 | 14.2 | 13.0 | 14.2 | / | 0.25 |
| Factor Scores | 21.2 | 21.7 | 21.4 | 19.5 | 18.5 | 17.4 | 16.9 | 16.2 | 5 | / | 0.28 |
| Gestural Scores | 22.4 | 22.2 | 21.1 | 21.8 | 21.1 | 22.2 | 20.1 | 19.7 | 2.3 | / | 0.93 |
| Gestural Scores (FT) | 22.1 | 21.0 | 21.0 | 21.1 | 20.4 | 18.7 | 19.2 | 17.6 | 4.5 | / | 0.80 |
| **CTC-Large** | | | | | | | | | | | |
| rtMRI | 16.1 | 16.0 | 15.8 | 15.1 | 15.0 | 14.8 | 14.0 | 11.1 | 5 | 3.3 | 0.14 |
| Mel | 22.1 | 20.1 | 17.2 | 17.1 | 15.7 | 13.4 | 13.3 | 12.5 | 9.6 | 31.6 | 0.33 |
| WavLM | 20.2 | 19.2 | 17.1 | 16.5 | 16.3 | 14.9 | 13.2 | 12.1 | 8.1 | 20.6 | 0.25 |
| Factor Scores | 20.2 | 21.2 | 19.4 | 18.8 | 18.9 | 17.2 | 16.5 | 15.2 | 5 | 0.88 | 0.28 |
| Gestural Scores | 20.4 | 20.2 | 19.1 | 19.8 | 19.2 | 18.2 | 18.3 | 17.5 | 2.9 | 5.5 | 0.93 |
| Gestural Scores (FT) | 18.5 | 18.1 | 18.0 | 17.1 | 18.4 | 16.7 | 16.7 | 16.0 | 2.5 | 7.9 | 0.74 |
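For reference, the two generalizability metrics can be computed directly from the per-speaker PER rows of Table 1 (the helper names are ours; the numbers are the rtMRI rows):

```python
def per_range(pers):
    """Absolute PER difference between the smallest (1 speaker) and largest (8 speakers) training size."""
    return abs(pers[0] - pers[-1])

def model_variance(pers_base, pers_large):
    """Mean squared PER difference between CTC-Base and CTC-Large over speakers."""
    return sum((b - l) ** 2 for b, l in zip(pers_base, pers_large)) / len(pers_base)

rtmri_base  = [18.4, 18.3, 17.5, 16.4, 16.3, 17.1, 15.7, 12.1]
rtmri_large = [16.1, 16.0, 15.8, 15.1, 15.0, 14.8, 14.0, 11.1]
print(round(per_range(rtmri_base), 1),
      round(model_variance(rtmri_base, rtmri_large), 1))   # 6.3 3.3, matching Table 1
```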
4.4.2 Sparsity and Efficiency
Fig. 1 shows that the derived gestural scores are sparse. We use Eq. 12 to compute the sparsity of the features. As shown in Table 1, the gestural scores have the largest sparsity while reaching comparable PER values. This indicates that much of the information in traditional speech features might not be useful for recognition tasks. Fine-tuning yields lower sparsity, since we experimentally observe a trade-off between intelligibility and sparsity. We conclude that gestural scores are efficient representations.
4.4.3 Interpretability
The learned gestural scores are not only intelligible, generalizable and efficient, but also highly interpretable. Consider Fig. 1: gestures 1, 2 and 9 are sequentially activated for the word "bird". Gesture 1 corresponds to the phone /B/, since the lower lip, jaw and tongue move down and the velum moves up. Gesture 2 corresponds to the phone /ER/, since the tongue tip moves up and the tongue dorsum moves down, while the jaw and lower lip move down and the velum still moves up. Gesture 9 shows the opposite moving pattern of gesture 2, indicating mouth closure. The interpretability of the gestural scores is consistent across different tokens. In Fig. 2, we list two sets of examples. The first set visualizes the gestural scores of "bat", "bet" and "bite". We observe that gesture 1, gesture 8 and gesture 10 are activated for all three words. This is expected, since gesture 1 corresponds to the phone /B/. Regarding the phones /AE/, /EH/ and /AY/, their articulatory patterns are almost the same: the lower lip moves down, the tongue tip and tongue body move down, the jaw moves down and the velum moves up, as indicated by gesture 10. Gesture 8 is the opposite of gesture 10 and corresponds to mouth closure. Since gesture 10 and gesture 8 are able to represent /AE/, /EH/ and /AY/, the only difference among these words is the time interval over which each gesture is activated. For example, in "bat", gesture 10 is activated for a longer interval, while in "bite" it is activated for a shorter interval. The second set visualizes the gestural scores of "beet", "bit" and "bait". Gesture 4 and gesture 5 can represent /IY/, /IH/ and /EY/. This is reasonable because, when pronouncing these phones, the tongue tip moves down and the tongue body moves up. Similarly, the only difference among these words lies in the gestural score patterns. In all of these examples, each articulator in the active gestures has a reasonable physical meaning, which is not the case in [11].

5 Conclusions and Limitations
Articulatory representation learning is a fundamental methodology for modeling the neural speech production system. We take advantage of guided factor analysis and neural convolutive matrix factorization to extract gestural scores and gestures from rtMRI data, a fine-grained vocal tract corpus. The learned articulatory representations, i.e., the gestural scores, are intelligible, generalizable, efficient and interpretable. However, there are still some limitations. First, the presence of noise in the rtMRI audio raises concerns regarding the conclusions about generalizability. Second, the factor analysis algorithm is independent of the neural matrix factorization, which makes the entire representation learning pipeline less efficient. Third, the current representations, i.e., gestural scores, have to be derived from articulatory data, whereas a neural speech production system should only have the speech signal accessible. Deriving an articulatory-free neural speech production system is our future work.
6 Acknowledgement
This research is supported by the following grants to PI Anumanchipalli — NSF award 2106928, Rose Hills Foundation and Noyce Foundation.
References
- [1] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., “A comparative study on transformer vs rnn in speech applications,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 449–456.
- [2] Yann LeCun, “A path towards autonomous machine intelligence,” preprint posted on openreview, 2022.
- [3] Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, et al., “Self-supervised speech representation learning: A review,” arXiv preprint arXiv:2205.10643, 2022.
- [4] Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Junaid Qadir, and Björn W Schuller, “Deep representation learning in speech processing: Challenges, recent advances, and future trends,” arXiv preprint arXiv:2001.00378, 2020.
- [5] Ronald A Cole and Jola Jakimik, “A model of speech perception,” Perception and production of fluent speech, vol. 133, no. 64, pp. 133–42, 1980.
- [6] Randy L Diehl, Andrew J Lotto, Lori L Holt, et al., “Speech perception,” Annual review of psychology, vol. 55, no. 1, pp. 149–179, 2004.
- [7] Elizabeth D Casserly and David B Pisoni, “Speech perception and production,” Wiley Interdisciplinary Reviews: Cognitive Science, vol. 1, no. 5, pp. 629–647, 2010.
- [8] Gunnar Fant, Acoustic theory of speech production, Number 2. Walter de Gruyter, 1970.
- [9] Catherine P Browman and Louis Goldstein, “Articulatory phonology: An overview,” Phonetica, vol. 49, no. 3-4, pp. 155–180, 1992.
- [10] Vikram Ramanarayanan, Louis Goldstein, and Shrikanth S Narayanan, “Spatio-temporal articulatory movement primitives during speech production: Extraction, interpretation, and validation,” The Journal of the Acoustical Society of America, vol. 134, no. 2, pp. 1378–1394, 2013.
- [11] Jiachen Lian, Alan W Black, Louis Goldstein, and Gopala Krishna Anumanchipalli, “Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition,” in Proc. Interspeech 2022, 2022, pp. 4686–4690.
- [12] Elliot L Saltzman and Kevin G Munhall, “A dynamical approach to gestural patterning in speech production,” Ecological psychology, vol. 1, no. 4, pp. 333–382, 1989.
- [13] Tanner Sorensen, Asterios Toutios, Louis Goldstein, and Shrikanth Narayanan, “Task-dependence of articulator synergies,” The Journal of the Acoustical Society of America, vol. 145, no. 3, pp. 1504–1520, 2019.
- [14] Shinji Maeda, “Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model,” in Speech production and speech modelling, pp. 131–149. Springer, 1990.
- [15] Johannes Töger, Tanner Sorensen, Krishna Somandepalli, Asterios Toutios, Sajan Goud Lingala, Shrikanth Narayanan, and Krishna Nayak, “Test–retest repeatability of human speech biomarkers from static and real-time dynamic magnetic resonance imaging,” The Journal of the Acoustical Society of America, vol. 141, no. 5, pp. 3323–3336, 2017.
- [16] Patrik O Hoyer, “Non-negative matrix factorization with sparseness constraints.,” Journal of machine learning research, vol. 5, no. 9, 2004.
- [17] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376.
- [18] Erik Bresch and Shrikanth Narayanan, “Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images,” IEEE transactions on medical imaging, vol. 28, no. 3, pp. 323–338, 2008.
- [19] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- [20] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi.,” in Interspeech, 2017, vol. 2017, pp. 498–502.
- [21] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [22] Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, and Dong Yu, “Towards improved zero-shot voice conversion with conditional dsvae,” arXiv preprint arXiv:2205.05227, 2022.
- [23] Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, and Dong Yu, “Utts: Unsupervised tts with conditional disentangled sequential variational auto-encoder,” arXiv preprint arXiv:2206.02512, 2022.