\newunicodechar

ə@ \newunicodecharɛE \newunicodecharɪI \newunicodecharʊU \newunicodecharɑA \newunicodecharʌ2 ¹Electrical and Computer Engineering Department, University of British Columbia, Canada
²College of Arts, Media and Design, Northeastern University, USA
[email protected], [email protected], [email protected]

A comparative study of two-dimensional vocal tract acoustic modeling based on Finite-Difference Time-Domain methods

Abstract

The two-dimensional (2D) numerical approaches for vocal tract (VT) modelling can afford a better balance between the low computational cost and accurate rendering of acoustic wave propagation. However, they require a high spatio-temporal resolution in the numerical scheme for a precise estimation of acoustic formants at the simulation run-time expense. We have recently proposed a new VT acoustic modelling technique, known as the 2.5D Finite-Difference Time-Domain (2.5D FDTD), which extends the existing 2D FDTD approach by adding tube depth to its acoustic wave solver. In this work, first, the simulated acoustic outputs of our new model are shown to be comparable with the 2D FDTD and a realistic 3D FEM VT model at a low spatio-temporal resolution. Next, a radiation model is developed by including a circular baffle around the VT as head geometry. The transfer functions of the radiation model are analyzed using five different vocal tract shapes for vowel sounds /a/, /e/, /i/, /o/ and /u/.

Keywords: computational acoustics, vocal tract, FDTD, articulatory speech synthesis

1 Introduction

Various articulatory models have been developed to approximate the sophisticated upper vocal tract geometry and capture its acoustic features using a physics-based acoustic wave solver. The 3D acoustic analysis [10, 11] can precisely characterize the vocal tract’s spectral features while offering better geometrical flexibility. However, such models are not pragmatic for designing real-time speech synthesizers due to their high computational cost. In contrast to 3D, the 1D vocal tract models are computationally lightweight. Nevertheless, their over-simplified representation of a tube structure substantially eliminates its geometrical details. Hence, the acoustic analysis of 1D models is not accurate at higher formants.

As an alternative, the 2D acoustic wave solvers [2, 8] represent the complex vocal tract geometry as a 2D contour (a mid-sagittal cut of a 3D VT geometry) and capture the wave interaction only along the mid-sagittal plane. Due to the dimensionality reduction, the 2D models are lightweight compare to 3D ones. The most well-known 2D acoustic analysis methods for vocal tract modelling are the finite element method (FEM) and FDTD. Since 2D models do not lump off-plane wave interaction, direct use of cross-sectional areas from the vocal tract MRI images to represent 2D contours inside a simulation grid yields erroneous acoustic output. Alternatively, a nonlinear area function transformation method needs to be implemented to tune the acoustic features of 2D models that can match the more complex 3D ones. However, this procedure kills the performance boost of 2D models. Consequently, we have proposed a new vocal tract acoustic model (2.5D FDTD) that improves the 2D model by capturing the wave interaction across the mid-sagittal plane. The next section briefly discusses the numerical implementation of a 3D tube using 2D and 2.5D FDTD schemes.

2 Methodology

2.1 Acoustic Wave Solver

We implemented the FDTD numerical analysis technique to discretize the 2D acoustic wave equation on a staggered Yee grid. The 2D Yee scheme consists of a rectangular grid, define acoustic parameters (pressure ( $p^{(n)}$ ) and velocity ( $v_{x}^{(n)}$ , $v_{y}^{(n)}$ )) at each grid point, a squared 2D cell [1, 12], as shown in Figure 1. However, the employed technique for acoustic simulation does not facilitate dynamic boundary conditions. Hence, a new scalar field $0\leq\beta(x,y,t)\leq 1$ was introduced to the solver, which could transit smoothly between $\beta=1$ (air) and $\beta=0$ (boundary). At $\beta=0$ , a prescribed velocity $v=\bm{v_{b}}$ was enforced to handle the boundary condition. Specifically, per each time step ( $n$ ), the wave solver updates pressure component $p^{(n)}$ and velocity components ( $v_{x}^{(n)}$ , $v_{y}^{(n)}$ ) at each grid point, by solving the below linear wave equations in the time domain [14]:

	$\displaystyle\frac{\partial p}{\partial t}+(1-\beta)p\ =\ -\rho c^{2}(\ \frac{\partial v_{x}}{\partial x}\ +\ \frac{\partial v_{y}}{\partial y}\ )$		(1)
	$\displaystyle\beta\frac{\partial\bm{v}}{\partial t}\ +\ \left(1-\beta\right)\bm{v}\ =\ -\beta^{2}\frac{\bm{\nabla}p}{\rho}\ +\ \left(1-\beta\right)\bm{v}_{b}$		(2)

The 2.5D FDTD follows the above rationale but improves upon it by incorporating new impedance terms in its 2D acoustic wave solver, known as tube depth. The tube’s depth $(D)$ , i.e., its continuous extension along the $z$ axis (with $x$ and $y$ axis being the dimension of starting 2D scheme), is derived from the cross-sectional area of the tube and sampled at each point of the scheme as shown in Figure 1. Then the resulting depths are mapped to their respective acoustic parameters ( $p$ , $\bm{v_{x}}$ , $\bm{v_{y}}$ ) as demonstrated by [6]. The inclusion of tube depth allows the wave solver to apprehend wave interactions across the mid-sagittal plane, in turn eliminating the nonlinear cross-sectional area functions transformation requirement. Therefore, it is expected that the time complexity of a 2.5D FDTD tube model will be better than the existing 2D models. Basically, with the help of following discrete update rules, we adopt a time marching algorithm (see Algorithm 1) to sample the acoustic parameters at every grid point:

Refer to caption — Figure 1: Representation of acoustic parameters ( $p$ , $\bm{v_{x}}$ , $\bm{v_{y}}$ ) and tube depth ( $\bar{D}$ , $D_{(x)}$ and $D_{(y)}$ ) inside the 2D Yee scheme

	$\displaystyle p^{(n+1)}\ =\ \frac{\bar{D}p^{(n)}-\rho c^{2}\Delta t\ \bm{\widetilde{\nabla}}\cdot\bm{V}^{(n)}}{\bar{D}}$		(3)
	$\displaystyle\bm{v}^{(n+1)}\ =\ \frac{\beta\bm{v}^{(n)}-\beta^{2}\Delta t\ \widetilde{\nabla}p^{(n+1)}/\rho\ +\ \Delta t(1-\beta)\bm{v_{b}}}{\beta+\Delta t(1-\beta)}$		(4)

with:

\displaystyle\bm{V}\ =\ (\ D_{(x)}v_{x},\ D_{(y)}v_{y}\ )

(5)

Algorithm 1 FDTD Time-Marching algorithm

Input: VT area function $a(x)$ , audio time $t$

1: Initialize physical constants: air density

\rho

, sound speed

c

, boundary admittance coefficient

\mu

2: Initialize the simulation sampling rate

R

3: Set the temporal resolution (

\Delta t

) and grid resolution (

\Delta s

) with

R

and CFL condition.

4: Create a optimized grid having size

(M\times N)

5: Set the total number of time steps

T

of the simulator using

t

and

\Delta t

6: Define boundary cells with

a(x)

and normal acoustic impedance

Z

7: Set depth values (

\bar{D}

D_{(x)}

and

D_{(y)}

) for each grid cell from

a(x)

8: Define source excitation cells

9: Initialize source excitation velocity

v_{e}

10: Initialize acoustic components (

p

v_{x}

and

v_{y}

) for each grid cell.

11: for

n=1...T

12: for

i=1...M

13: for

j=1...N

14: Update

p^{n+1}(i,j)

with

v^{n}_{x}(i,j)

v^{n}_{y}(i,j)

and

D(i,j)

(Eq. 3 and Eq.5)

15: if

(i,j)=

excitation cell then

16:

v^{n}_{x}\leftarrow v^{n}_{x}+v^{n}_{e}

and

v^{n}_{y}\leftarrow v^{n}_{y}+v^{n}_{e}

17: end if

18: if

(i,j)=

boundary cell then

19:

v^{n}_{b}\leftarrow p^{n}(i,j)/Z

20: end if

21: Update

v^{n+1}_{x}(i,j)

with

p^{n+1}(i,j)

and

v^{n}_{b}

(Eq. 4)

22: Update

v^{n+1}_{y}(i,j)

with

p^{n+1}(i,j)

and

v^{n}_{b}

(Eq. 4)

23: end for

24: end for

25: end for

2.2 Vocal Tract Wall Losses

To simulate vocal tract wall reflection, we adopted the local reactive boundary approach as proposed here [13, 10]. The method estimates wall losses (locally reacting soft walls) and enforces boundary condition by using $\bm{v_{w}}$ , the normal particle velocity going into or coming from a wall.

\bm{v_{w}}=\frac{p_{w}}{Z_{n}}\bm{\hat{n}}

(6)

where $p_{w}$ is the current pressure value in an air cell, located in front of the wall cell. And $Z_{n}$ , the normal acoustic impedance can be computed by using the normal sound absorption coefficient $\alpha_{n}$ as follows,

Z_{n}=\rho c\frac{1+\sqrt{1-\alpha_{n}}}{1-\sqrt{1-\alpha_{n}}}

(7)

The boundary absorption coefficient $\alpha_{n}$ can be derived from the boundary admittance coefficient $\mu\in(0,1)$ . During the simulation, the solver combines equation (6) with equation (4) by setting $\bm{v_{b}}=\bm{v_{w}}$ for all the velocity vectors which are positioned in between an air cell and wall cell. However, this method does not comprehend frequency dependent wall losses.

2.3 Excitation Model

For acoustic analysis and speech production, articulatory models require a source excitation function just above the glottal-end to input acoustic energy into the tube [5]. In this work, we are only characterizing the acoustic behaviour of a vocal tract tube through its transfer function analysis. Hence, a band-passed velocity pulse having frequency range 2Hz-20kHz was injected near the glottis for various vocal tract shapes, and the corresponding transfer functions (impulse response) were obtained at the mouth-end.

2.4 Radiation Model

The outward wave propagation at the mouth-end (lips) contributes to acoustic energy losses in the vocal tract. It is one of the essential dissipation mechanisms in voice production. Hence, a nonzero radiation model of a vocal tract influences the synthesized audio output by lowering the formant frequencies and increasing their bandwidths. It might be challenging to build a model that can precisely attain the radiation effects, considering a realistic human head geometry. However, it has already been demonstrated that the inclusion of a spherical baffle (a simplified approximation of the actual head geometry) around the 3D vocal tract model offers promising results for vowel production [3]. Since the 2D model uses a mid-sagittal cut of a 3D vocal tract, the spherical baffle can be replaced by a circular baffle for the 2.5D radiation model as shown in Figure 2(a). Moreover, to illustrate acoustic waves’ propagation emanating from the mouth toward infinity (Figure 2(b)), the computational domain can be extended out of the vocal tract by using perfectly matched layers (PMLs). PMLs help in eliminating spurious wave reflection at the domain boundaries.

We create a fixed-sized circular baffle with a diameter of 0.20m around the 2.5D vocal tract contour inside the simulation domain. The circular baffle center has been adjusted to fit in the entire vocal tract contour while connecting its exit cross-section. Inside the 2.5D scheme, the region enclosed by the vocal tract and baffle boundary is set to air cells. In order to absorb outgoing waves and approximate infinite space, $6$ layers of PML are added at all sides of the domain boundary.

3 Experimental Setup

A comparative study was carried out for two different scenarios to characterize the acoustic features of the 2.5D vocal tract model: (1) numerical simulation with computational domain ended at the mouth aperture by imposing Dirichlet boundary conditions (open-end boundary condition) and (2) numerical simulation with free-field radiation, which includes head geometry and PML layers (radiation condition). These conditions were examined using the transfer functions analysis approach for five different vocal tract shapes(vowel sounds: /a/, /e/, /i/, /o/ and /u/). The 2.5D vocal tract shapes were derived using Story’s area function dataset [9], as it is the standard one for the vocal tract acoustic analysis. We have only considered the vocal tract geometry as a straight tube with circular cross-sections for this study as shown in Figure 3.

For open-end boundary conditions, we use a precise 3D FEM vocal tract model as the reference for the transfer function analysis. A virtual microphone is placed 3mm inside the mouth opening to record pressure variations. In the case of free-field radiation, we compare the transfer functions of the 2.5D model with Story’s calculated formants, obtained from the frequency-domain analysis of vocal tracts [7]. And the microphone is positioned 3mm outside of the mouth to capture the radiating acoustic waves. Courant-Friedrichs-Lewy (CFL) stability condition in two dimensions: $\Delta t\leq\Delta s/\sqrt{2}c$ is imposed, where $c$ is the speed of sound.

The simulation generates $50$ ms of synthesized audio for every vowel sounds. During the simulation, we set the following physical parameters fixed: air density $\rho=1.14$ kg/m³, boundary admittance $\mu=0.005$ and sound speed $c=350$ m/s. We implement both the 2D and 2.5D vocal tract models in the MATLAB environment. The simulation runs on a workstation equipped with an Intel Core i7-8700K processor. The custom code for the 2.5D vocal tract model is publicly available here¹¹1https://github.com/Debasishray19/Talking-Tube.

4 Model Validation

We follow the below steps to analyze the acoustic behaviour of the 2.5D model:

(A)

Transfer function analysis with open-end boundary condition for the 2D FDTD and 2.5D FDTD vocal tract models (vowel sound: /a/, /i/ and /u/).
(B)

Model performance evaluation (simulation run-rime).
(C)

Transfer function analysis for the 2.5D vocal tract model with free-field radiation (vowel sound: /a/, /e/, /i/. /o/ and /u/).

For Step (A) and (B), we simulate each vowel sound under three different spatial grid resolution for both 2D and 2.5D FDTD methods: low ( $\Delta s=0.74$ mm), mid ( $\Delta s=0.28$ mm) and high ( $\Delta s=0.18$ mm). The high spatial resolution should produce precise acoustic output as it yields vocal tract geometry with greater details. However, it requires a larger simulation domain size. The grid resolution values are empirically derived. Step (A) and (B) examine the improvement in a 2.5D vocal tract model over its 2D simulation due to the addition of tube depth. Step (C) investigates the change in transfer function with the inclusion of a circular baffle for the radiation model.

5 Result

We applied Fast Fourier Transformation (FFT) to the recorded pressure waves to obtain their transfer function. Formants were extracted by considering local maxima in transfer function curves. Here we are only analyzing the first three formants since they are mainly responsible for distinguishing vowel sounds [11]. The formants for the 3D FEM model can be found here [4].

5.1 Step A: Transfer Function Analysis

Transfer function analysis for the 2D and 2.5D simulation with open-end boundary condition²²2Colormap: For 2D simulation, For 2.5D simulation.^,³³3All the formant frequencies are in Hz..

Resolution ( $\Delta s$ )	F1	F2	F3
low (% Error)	660(-5.16) 700(0.57)	1040(-2.61) 1040(-2.61)	2960(-2.33) 3020(-0.36)
mid (% Error)	660(-5.16) 680(-2.29)	1040(-2.61) 1060(-0.74)	3000(-1.02) 3060(0.95)
high(% Error)	700(0.57) 680(-2.29)	1060(-0.74) 1040(-2.62)	3020(-0.36) 3040(0.29)

Table 1: Positional errors of the first three formants computed for vowel sound /a/, with respect to 3D FEM model.

Resolution ( $\Delta s$ )	F1	F2	F3
low (% Error)	240(-8.74) 260(-1.14)	2120(0.42) 2160(2.32)	3020(0.33) 3060(1.66)
mid (% Error)	260(-1.14) 260(-1.14)	2140(1.37) 2140(1.37)	3000(-0.33) 3040(0.99)
high(% Error)	260(-1.14) 260(-1.40)	2140(1.37) 2160(2.32)	3020(0.33) 3040(0.99)

Table 2: Positional errors of the first three formants computed for vowel sound /i/, with respect to 3D FEM model.

Resolution ( $\Delta s$ )	F1	F2	F3
low (% Error)	260(0.38) 260(0.38)	700(-7.52) 720(-4.88)	2260(-0.17) 2300(1.59)
mid (% Error)	220(-15.0) 260(0.38)	600(-20.7) 700(-7.52)	2260(-0.17) 2300(1.59)
high(% Error)	260(0.38) 260(0.38)	700(-7.52) 720(-4.88)	2280(0.70) 2300(1.59)

Table 3: Positional errors of the first three formants computed for vowel sound /u/, with respect to 3D FEM model.

5.2 Step B: Model Performance

The computational cost of FDTD vocal tract model depends upon the spatial grid resolution and simulation domain size. We implement an algorithm that enforces an optimized domain size for FDTD models. The table below presents the simulation domain size and time duration corresponding to each grid resolution for vowel /u/.

Resolution ( $\Delta s$ )	Domain Size (width $\times$ height)	Duration (Approx.) (in seconds)
low ( $0.74mm$ )	$266\times 35$	280
mid ( $0.28mm$ )	$702\times 85$	$4.98\text{e+}3$
high( $0.18mm$ )	$1051\times 127$	$1.85\text{e+}4$

Table 4: Simulation run-time at different grid resolutions.

5.3 Step C: Transfer Functions Analysis

The table below presents transfer function analysis for the 2.5D vocal tract models with the free-field radiation. We have only considered the low grid resolution ( $\Delta s=0.74$ mm) for simulation of vocal tract radiation models.

Vowel	F1	F2	F3
/a/	$640$ $-7.78\%$	$1020$ $8.28\%$	$3000$ $1.76\%$
/e/	$340$ $-29.75\%$	$1920$ $1.74\%$	$2320$ $-3.49\%$
/i/	$260$ $-20.00\%$	$2080$ $-2.75\%$	$2960$ $-0.53\%$
/o/	$480$ $-3.80\%$	$760$ $-1.04\%$	$2340$ $-3.34\%$
/u/	$260$ $-17.19\%$	$720$ $2.56\%$	$2180$ $-5.13\%$

Table 5: Formant frequencies and the percentage positional errors of the first three formants for the vocal tract radiation model.

6 Discussion & Conclusion

The transfer function analysis is a standard practice to validate acoustic features of area-based articulatory models. In the first step of our evaluation, the 2.5D FDTD simulation produces positional errors below $2\%$ at a low grid resolution. Moreover, this characteristic is consistent across all three vowels. However, the positional error is high for the 2D FDTD simulation at a low grid resolution in most cases. Table 1, 2 and 3 show that the acoustic outputs of the 2D FDTD model can be tuned by increasing its computational grid resolution whereas this approach doesn’t make much difference for the 2.5D FDTD simulation. Table 4 shows for a real-time speech synthesizer the low grid resolution is always desirable.

In the case of free-field of radiation, Table 5 presents the first three formants and their positional errors. The positional error of the first formant is high across all the vowels. And the second and third formants produce relatively lesser error values. However, these positional errors are determined relative to Story’s 1D vocal tract model, an oversimplified representation of the actual vocal tract. Hence, as future work, the 2.5D radiation model’s acoustic characteristics need to be compared with a 3D vocal tract radiation model.

Additionally, our current 2.5D model should only be seen as simple approximations of the voice radiation mechanism. Nevertheless, the influence of few head details like the nose and cross-section of the mouth aperture is worth studying. As well as, our current 2.5D FDTD model does not include boundary losses for the tube depth. All these current limitations need to be incorporated for the future development of the 2.5D vocal tract acoustic model.

7 Acknowledgements

This work is supported by the Natural Sciences and Engineering Research Council (NSERC).

References

[1] Andrew Allen and Nikunj Raghuvanshi “Aerophones in flatland: Interactive wave simulation of wind instruments” In ACM Transactions on Graphics (TOG) 34.4 ACM New York, NY, USA, 2015, pp. 1–11
[2] Marc Arnela and Oriol Guasch “Two-dimensional vocal tracts with three-dimensional behavior in the numerical generation of vowels” In The Journal of the Acoustical Society of America 135.1 ASA, 2014, pp. 369–379
[3] Marc Arnela, Oriol Guasch and Francesc Alías “Effects of head geometry simplifications on acoustic radiation of vowel sounds based on time-domain finite-element simulations” In The Journal of the Acoustical Society of America 134.4 Acoustical Society of America, 2013, pp. 2946–2954
[4] Marc Arnela Coll “Numerical production of vowels and diphthongs using finite element methods”, 2015
[5] Debasish Ray Mohapatra and Sidney Fels “Limitations Of Source-Filter Coupling In Phonation” In arXiv preprint arXiv:1811.07435, 2018
[6] Debasish Ray Mohapatra, Victor Zappi and Sidney Fels “An extended two-dimensional vocal tract model for fast acoustic simulation of single-axis symmetric three-dimensional tubes” In arXiv preprint arXiv:1909.09585, 2019
[7] Man Sondhi and Juergen Schroeter “A hybrid time-frequency domain articulatory speech synthesizer” In IEEE Transactions on Acoustics, Speech, and Signal Processing 35.7 IEEE, 1987, pp. 955–967
[8] Matt Speed, Damian Murphy and David M Howard “Characteristics of two-dimensional finite difference techniques for vocal tract analysis and voice synthesis” In Tenth Annual Conference of the International Speech Communication Association, 2009
[9] Brad H Story “Comparison of magnetic resonance imaging-based vocal tract area functions obtained from the same speaker in 1994 and 2002” In The Journal of the Acoustical Society of America 123.1 ASA, 2008, pp. 327–335
[10] Hironori Takemoto, Parham Mokhtari and Tatsuya Kitamura “Acoustic analysis of the vocal tract during vowel production by finite-difference time-domain method” In The Journal of the Acoustical Society of America 128.6 ASA, 2010, pp. 3724–3738
[11] Tomáš Vampola, Jaromír Horáček, Anne-Maria Laukkanen and Jan G Švec “Human vocal tract resonances and the corresponding mode shapes investigated by three-dimensional finite-element modelling based on CT measurement” In Logopedics Phoniatrics Vocology 40.1 Taylor & Francis, 2015, pp. 14–23
[12] Kane Yee “Numerical solution of initial boundary value problems involving Maxwell’s equations in isotropic media” In IEEE Transactions on antennas and propagation 14.3 IEEE, 1966, pp. 302–307
[13] Takatoshi Yokota, Shinichi Sakamoto and Hideki Tachibana “Visualization of sound propagation and scattering in rooms” In Acoustical science and technology 23.1 ACOUSTICAL SOCIETY OF JAPAN, 2002, pp. 40–46
[14] Victor Zappi et al. “Towards real-time two-dimensional wave propagation for articulatory speech synthesis” In Proceedings of Meetings on Acoustics 171ASA 26.1, 2016, pp. 045005 ASA