
A Deep-Bayesian Framework for Adaptive Speech Duration Modification

Ravi Shankar and Archana Venkataraman This work was supported by NSF CAREER award 1845430 (PI Venkataraman). The authors are with the Department of Electrical and Computer Engineering at the Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: [email protected], [email protected]).
Abstract

We propose the first method to adaptively modify the duration of a given speech signal. Our approach uses a Bayesian framework to define a latent attention map that links frames of the input and target utterances. We train a masked convolutional encoder-decoder network to produce this attention map via a stochastic version of the mean absolute error loss function; our model also predicts the length of the target speech signal using the encoder embeddings. The predicted length determines the number of steps for the decoder operation. During inference, we generate the attention map as a proxy for the similarity matrix between the given input speech and an unknown target speech signal. Using this similarity matrix, we compute a warping path of alignment between the two signals. Our experiments demonstrate that this adaptive framework produces results similar to dynamic time warping, which relies on a known target signal, on both voice conversion and emotion conversion tasks. We also show that our technique generates speech whose quality is on par with state-of-the-art vocoders.

Index Terms:
Prosody, Encoder-Decoder, Attention, Adaptive Duration Modification, Dynamic Time Warping

I Introduction

Human speech is a rich and varied mode of communication that encompasses both language/semantic information and the mood/intent of the speaker. The latter is primarily conveyed by prosodic features, such as pitch, energy, and speaking rate. There are many applications where understanding and manipulating these prosodic features is required. Consider voice conversion systems, where pitch and energy modifications are used to inject emotional cues into the speech or to change the overall speaking style [1, 2, 3, 4, 5]. Prosodic features are also used to evaluate the quality of human-machine dialog systems [6], and they play a significant role in speaker identification and recognition systems [7].

While there are many approaches for automated pitch and energy modification [8, 9, 10, 11, 12], comparatively little progress has been made in changing the speaking rate of an utterance. Yet, the speaking rate plays a crucial role in conveying emotion [13] and in diagnosing human speech pathologies [14]. The speaking rate is difficult to manipulate because, unlike pitch or energy, there is no explicit coding for the signal duration. Rather, it is implicitly defined by a collection of frame-wise spectral representations (e.g., the short-time Fourier transform or Mel-frequency cepstral coefficients). As a result, existing duration modification algorithms are not adaptive; they either require considerable user supervision, or they are geared towards aligning two known speech signals.

Perhaps the earliest duration modification method is the time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm [15]. TD-PSOLA modifies the pitch and duration of a speech signal by replicating and interpolating between individual frames. However, the user must manually specify both the portion of speech to modify and the exact manner in which it should be altered. Hence, the method is neither automated nor adaptive. An alternative approach is dynamic time warping (DTW), which finds the optimal time alignment between two parallel speech utterances [16]. DTW constructs a pairwise similarity matrix between all frames of the two utterances and estimates a warping path between the starting $(0,0)$ and ending $(T_s, T_t)$ points of the utterances based on a Viterbi-like decoding of the similarity matrix. While simple, DTW requires both the source and target utterances to be known a priori. Hence, it cannot be used for on-the-fly modification of new signals.

Finally, recent advancements in deep learning have led to a new generation of neural vocoders, which disentangle the semantic content from the speaking style [17, 18, 19]. These vocoders can alter the speaking rate via the learned style embeddings. While these models represent seminal contributions to speech synthesis, the latent representations are learned in an unsupervised manner, which makes it difficult to control the output speaking voice. Another drawback of these methods is the computational overhead and data resources required to train the models and generate new speech [20].

In this paper, we introduce the first fully-automated adaptive speech duration modification scheme. Our approach combines the representation capabilities of deep neural networks with the structured simplicity of dynamic decoding. Namely, we model the alignment between a source and target utterance via a latent attention map; these maps are used as the similarity matrix for backtracking. We train a masked convolutional encoder-decoder network to estimate these attention maps using a stochastic mean absolute error (MAE) formulation. We demonstrate our framework on a voice conversion task using the CMU-Arctic dataset [21] and on three multi-speaker emotion conversion tasks using the VESUS dataset [22]. Our experiments confirm that the proposed model can perform open-loop duration modification and produces high-quality speech. Finally, our approach differs fundamentally from the conventional DTW algorithm [16], which requires both the source and target utterances in order to warp one onto the other.

II Method

Figure 1: Graphical model for duration modification. $\gamma$ and $\theta$ are the model parameters, which are inferred during training.

Fig. 1 illustrates our underlying generative process. Given an utterance $X$, we first estimate the length $T$ of the (unknown) target utterance $Y$ and subsequently use it to estimate a mask $M$ for the attention map. The mask restricts the domain of the attention vectors $A_t$ at each frame $t$ to mitigate distortion of the output speech. We use paired data $(X_{tr}, Y_{tr})$ to train an encoder-decoder network to generate the attention vectors. During testing, we first generate the attention map from the input $X$ and use it to produce the target speech $Y$.

II-A Loss Function

Let $X \in \mathbb{R}^{D \times T_s}$ denote the input speech. In this work, $X$ corresponds to the filter-bank energies, where $D$ is the number of filter-banks, and $T_s$ is the number of temporal frames in the utterance. Similarly, we denote the target speech as $Y \in \mathbb{R}^{D \times T}$. Notice that the target utterance length $T$ may differ from $T_s$.
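For concreteness, the sketch below shows one possible way to compute the filter-bank energy matrix $X$ with librosa. Only the 80 Mel bands are stated in Sec. II-C; the sampling rate, FFT size, and hop length here are assumptions rather than the authors' settings.

```python
import librosa
import numpy as np

def melspec_features(wav_path, n_mels=80, n_fft=1024, hop_length=256):
    """Compute a D x T_s matrix of log Mel filter-bank energies.

    The paper uses 80 Mel filter-banks; the sampling rate and STFT
    window/hop sizes below are placeholders, since the exact analysis
    settings are not specified in the text.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6)  # shape: (n_mels, T_s)
```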

Our generative process for the target speech is as follows:

$$T \sim \text{Laplace}(T^0, b_T) \quad \text{and} \quad Y_t \sim \text{Laplace}(Y_t^0, b_y), \qquad (1)$$

where $T$ is the length of the target utterance, and $Y_t$ is the target features at time $t$. The parameters $\{T^0, b_T, Y_t^0, b_y\}$ of the distributions are unknown, and we implicitly estimate them via a deep neural network parameterized by $\gamma$ and $\theta$.

By treating the unknown parameters as functions of the input XX, we obtain the following estimating equations for the target sequence length and frame-wise filter-bank energies:

$$\hat{T} = f_\gamma(X) \quad \text{and} \quad \hat{Y}_t = X \cdot A_t + f_\theta(X, \hat{Y}_{0:t-1}). \qquad (2)$$

The functions $f_\gamma(\cdot)$ and $f_\theta(\cdot,\cdot)$ correspond to deep networks. The variable $A_t \in \mathbb{R}^{T_s}$ is an attention vector that combines frame-wise features of the source utterance $X$ to generate the target frame $\hat{Y}_t$. Notice that the residual, which cannot be explained by the input utterance, depends on the predictions $\hat{Y}_{0:t-1}$ at previous time steps. This autoregressive property allows the neural network to learn a time-varying component that can differentiate between the speakers or emotions.
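To make Eq. (2) concrete, the tiny NumPy sketch below forms a single target frame as an attention-weighted mix of source frames; the residual term stands in for the output of $f_\theta$ and is left as a placeholder.

```python
import numpy as np

D, T_s = 80, 120              # feature dimension and number of source frames
X = np.random.randn(D, T_s)   # source filter-bank energies (stand-in data)

# Attention vector for target frame t: a distribution over the T_s source frames.
A_t = np.random.rand(T_s)
A_t /= A_t.sum()

# Residual predicted by f_theta from X and previous outputs (placeholder here).
residual_t = np.zeros(D)

# Eq. (2): the target frame is an attention-weighted combination of source
# frames plus an autoregressive residual.
Y_hat_t = X @ A_t + residual_t   # shape: (D,)
```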

During training, we use paired data $(X, Y)$ to maximize the likelihood of the target speech signal with respect to the neural network weights $\{\theta, \gamma\}$. This likelihood can be written as:

$$P(\hat{Y}, \hat{T} \,|\, X) = P(\hat{T} \,|\, X) \prod_{t=1}^{\hat{T}} P(\hat{Y}_t \,|\, X, \hat{T}, \hat{Y}_{0:t-1}), \qquad (3)$$

where the second term of Eq. (3) can be expanded as follows:

$$\begin{aligned}
P(\hat{Y}_t \,|\, X, \hat{T}, \hat{Y}_{0:t-1}) &= \sum_{A_t} P(\hat{Y}_t, A_t \,|\, X, \hat{T}, \hat{Y}_{0:t-1}, M) \\
&= \sum_{A_t} P(\hat{Y}_t \,|\, X, \hat{T}, A_t, \hat{Y}_{0:t-1})\, P(A_t \,|\, X, \hat{Y}_{0:t-1}, M)
\end{aligned} \qquad (4)$$

The variable $M$ in Eq. (4) denotes the attention mask and is introduced for convenience; it is a deterministic function of the source speech length $T_s$ and the estimated target length $\hat{T}$.

We use a variational free energy formulation [23] to derive an upper bound on the negative data log-likelihood (see the supplemental materials for the complete derivation). This bound can be translated into the following neural network loss function:

$$\begin{aligned}
L &= E_{A_t \sim q_\theta}\big[\log P(\hat{Y}_t \,|\, X, A_t, \hat{Y}_{0:t-1})\big] + \log P(\hat{T} \,|\, X) \\
&= \lambda_1 \times E_{A_t}\big[\|\hat{Y}_t - Y_t^0\|_1\big] + \lambda_2 \times \|\hat{T} - T^0\|_1
\end{aligned} \qquad (5)$$

Here, $\lambda_1$ and $\lambda_2$ are model hyperparameters and implicitly contain the variances of the Laplace distributions in Eq. (1). The distribution $q_\theta$ is a variational distribution which is approximated by the fully convolutional neural network in Fig. 3.
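As a concrete illustration of Eq. (5), the following minimal PyTorch sketch computes the loss given frames reconstructed with a sampled attention vector. The values of $\lambda_1$ and $\lambda_2$ are not reported in the text, so they are placeholders here.

```python
import torch
import torch.nn.functional as F

def duration_model_loss(Y_hat, Y_true, T_hat, T_true, lam1=1.0, lam2=1.0):
    """Stochastic MAE loss of Eq. (5), assuming Y_hat was reconstructed
    with attention vectors A_t sampled from q_theta.

    Y_hat  : (T, D) reconstructed target filter-bank energies
    Y_true : (T, D) ground-truth target filter-bank energies
    T_hat  : predicted target length (scalar tensor)
    T_true : true target length (scalar tensor)
    lam1, lam2 : placeholder values for lambda_1 and lambda_2.
    """
    recon = F.l1_loss(Y_hat, Y_true)            # Monte-Carlo estimate of E_{A_t}[||Y_hat_t - Y_t||_1]
    length = torch.abs(T_hat - T_true).mean()   # ||T_hat - T||_1
    return lam1 * recon + lam2 * length
```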

Figure 2: Binary attention masks with 3 different slopes.
Figure 3: Neural network architecture used for the sequence-to-sequence speech generation. The encoder and decoder modules consist of 10 identical blocks. The projection layers are simple feed-forward layers without any non-linearity that project the features into a higher-dimensional space.

II-B Masking

The mask $M$ is used to constrain the scope of the attention mechanism so that the output remains similar in time-scale to the input. This procedure is important for two reasons. From a speech quality perspective, large swings in speaking rate may generate unintelligible speech. From an estimation perspective, the utterances contain hundreds (sometimes thousands) of frames, and it is difficult to robustly train a deep network to generate such long attention vectors from small datasets.

We use masks derived from the Itakura parallelogram [24], as illustrated in Fig. 2. The Itakura parallelogram is commonly used to speed up DTW when the speaking rates of the source and target utterances are expected to be similar [24]. The slope of the Itakura parallelogram specifies the minimum and maximum speaking rates that the reconstructed utterance is allowed to possess relative to the input speech.
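To make the masking concrete, the sketch below builds a binary Itakura-parallelogram mask for a source of length $T_s$ and a target of length $T_t$. The exact construction (in particular, how the parallelogram is generalized to unequal lengths) is our assumption; the slope value of 1.25 is the one mentioned in the supplement.

```python
import numpy as np

def itakura_mask(T_s, T_t, slope=1.25):
    """Binary Itakura-parallelogram mask of shape (T_s, T_t).

    M[i, j] = 1 when a path through source frame i and target frame j keeps
    the local speaking-rate ratio between 1/slope and slope, relative to
    both the start (0, 0) and the end (T_s - 1, T_t - 1) of the utterances.
    """
    u = np.arange(T_s)[:, None] / max(T_s - 1, 1)   # normalized source time
    v = np.arange(T_t)[None, :] / max(T_t - 1, 1)   # normalized target time
    lower = np.maximum(u / slope, 1.0 - slope * (1.0 - u))
    upper = np.minimum(slope * u, 1.0 - (1.0 - u) / slope)
    return ((v >= lower - 1e-9) & (v <= upper + 1e-9)).astype(np.float32)
```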

function modifyDuration $(X)$;
Input : filter-bank energies ($X \in \mathbb{R}^{D \times T_s}$) and initial frame $Y_0$
Output : alignments $((x_1, y_1), (x_2, y_2), \ldots)$
Predict the length of the target sequence $T_t = f_\gamma(X)$;
Create the attention mask $M \in \mathbb{R}^{T_s \times T_t}$ and set $t = 0$;
while $t < T_t$ do
      Using the mask $M_t$, $X$, and $Y_{0:t-1}$, estimate $A_t$;
      Using $X$, $Y_{0:t-1}$, and $A_t$, predict $Y_t$;
      $t \leftarrow t + 1$;
end while
Run DTW backtracking on the attention matrix $A$;
return alignments $((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$;
Algorithm 1 Strategy for duration modification

II-C Neural Network Architecture

We adapt the neural network architecture from [25] by adding skip connections to the last layer and changing the configuration of the attention module. Fig. 3 shows the encoder, decoder and the new attention module of the convolutional neural network. The encoder is responsible for generating feature embeddings for the decoder and for predicting the relative length of target speech. The sample operation in Fig. 3 is responsible for generating a sample from the attention distribution required for reconstruction and backpropagation.

Refer to caption
Figure 4: Error in length prediction using encoder embeddings.

We train our model using the Adam optimizer [26] with a fixed learning rate of $10^{-4}$. The input $X$ is an 80-dimensional vector of Mel-filterbank energies. The projection layer expands this input to 256 dimensions. Both the encoder and decoder consist of 10 convolutional layers, each followed by a gated linear unit. We use data augmentation to stabilize the network. Specifically, we reverse the input-output sequences and randomly extract intervals (with probability 0.5) from the full utterance. Our full model training procedure is described in the supplementary materials. The source code can be downloaded from: https://engineering.jhu.edu/nsa/links/.
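A minimal sketch of the two augmentations described above is given below. The paper does not specify how the reversal and cropping are combined, nor the minimum crop length, so those choices (and the per-augmentation probabilities) are assumptions.

```python
import numpy as np

def augment_pair(X, Y, prob=0.5, min_frac=0.5, rng=np.random):
    """Joint augmentation of a paired (source, target) feature matrix.

    (i) With probability `prob`, time-reverse both sequences.
    (ii) With probability 0.5 (as stated in the text), extract a random
    sub-interval; the same relative span is taken from both utterances,
    and `min_frac` bounds how short the crop can be (an assumption).
    """
    X_aug, Y_aug = X, Y

    # (i) time reversal of the paired sequences
    if rng.rand() < prob:
        X_aug, Y_aug = X_aug[:, ::-1].copy(), Y_aug[:, ::-1].copy()

    # (ii) random interval extraction with probability 0.5
    if rng.rand() < 0.5:
        frac = rng.uniform(min_frac, 1.0)        # relative length of the crop
        start = rng.uniform(0.0, 1.0 - frac)     # relative start position

        def crop(Z):
            T = Z.shape[1]
            a, b = int(start * T), int((start + frac) * T)
            return Z[:, a:max(b, a + 1)]

        X_aug, Y_aug = crop(X_aug), crop(Y_aug)
    return X_aug, Y_aug
```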

II-D DTW Back-Tracking

Our final step is to use the attention map produced by the decoder as a proxy for the DTW similarity matrix between the source and target speech frames. Effectively, we use the robust dynamic programming operation to get a path of alignment within the mask boundary, rather than rely on the noisy spectral reconstruction (see Algorithm 1). To avoid skipping phonemes, the path is constrained to take at most one horizontal or vertical step consecutively while backtracking. We finally use this alignment as a lookup table to synthesize the target speech from the input via the WORLD vocoder [27].
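The back-tracking step can be made concrete with the sketch below: it accumulates the attention scores with the standard DTW recursion and then traces the best path from the end back to the origin, never taking two horizontal or two vertical steps in a row. The accumulation and tie-breaking details are our assumptions, since the text describes the procedure only at a high level.

```python
import numpy as np

def backtrack_alignment(A):
    """DTW-style backtracking on an attention map A of shape (T_s, T_t).

    The attention map is treated as a similarity matrix: similarity is
    accumulated with the usual diagonal/horizontal/vertical recursion, and
    the path is traced back from (T_s - 1, T_t - 1) to (0, 0). To avoid
    skipping phonemes, the trace-back never takes two horizontal or two
    vertical steps consecutively (a diagonal step is always allowed).
    """
    T_s, T_t = A.shape
    S = np.full((T_s, T_t), -np.inf)
    S[0, 0] = A[0, 0]
    for i in range(T_s):
        for j in range(T_t):
            if i == 0 and j == 0:
                continue
            best_prev = max(
                S[i - 1, j - 1] if i > 0 and j > 0 else -np.inf,
                S[i - 1, j] if i > 0 else -np.inf,
                S[i, j - 1] if j > 0 else -np.inf,
            )
            S[i, j] = A[i, j] + best_prev

    i, j = T_s - 1, T_t - 1
    path, last_move = [(i, j)], None
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append(("D", S[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0 and last_move != "V":           # no two vertical steps in a row
            candidates.append(("V", S[i - 1, j], (i - 1, j)))
        if j > 0 and last_move != "H":           # no two horizontal steps in a row
            candidates.append(("H", S[i, j - 1], (i, j - 1)))
        if not candidates:                       # boundary case: relax the constraint
            if i > 0:
                candidates.append(("V", S[i - 1, j], (i - 1, j)))
            if j > 0:
                candidates.append(("H", S[i, j - 1], (i, j - 1)))
        last_move, _, (i, j) = max(candidates, key=lambda c: c[1])
        path.append((i, j))
    return path[::-1]   # alignment from (0, 0) to (T_s - 1, T_t - 1)
```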

III Experimental Results

We evaluate our model on two multi-speaker datasets: CMU-ARCTIC [21] and VESUS [22]. We examine three properties of our model on four tasks, as described below.

III-A Data and Voice Morphing Tasks

CMU-ARCTIC has 4 American English speakers (two male, two female), who are paired according to gender for voice conversion. We train our duration modification framework using 2164 utterances from the database and use the remaining 100 utterances for testing the open-loop modification properties.

VESUS is an emotional speech corpus containing 250 phrases read by 10 speakers in 4 emotion classes: neutral, angry, happy, and sad. VESUS also contains crowd-sourced emotional annotations. Here, we primarily use those utterances that are correctly annotated by at least half of the listeners.

We train three duration models corresponding to the three neutral-emotional pairs. This results in the following split:

  • Neutral to Angry Conversion: 2385 utterances for training, 72 for validation, and 61 for testing.

  • Neutral to Happy Conversion: 2431 utterances for training, 43 for validation, and 43 for testing.

  • Neutral to Sad Conversion: 2371 utterances for training, 75 for validation, and 63 for testing.

Given the smaller sample sizes and shorter sequences, we fine-tune the model trained on CMU-ARCTIC for each emotion conversion task in lieu of training the networks from scratch.

Figure 5: Alignment similarity between our method and DTW.
Figure 6: MOS of speech generated by our model evaluated via AMT.

III-B Length Prediction

As seen in Fig. 3, we use the encoder embeddings to predict the length of the target utterance as a ratio of the source utterance length. Fig. 4 shows the error in predicting this ratio, expressed in milliseconds of error per second of speech (ms/sec). Notice that our framework mispredicts the utterance lengths by only 40 ms/sec and 65 ms/sec on CMU-ARCTIC and VESUS, respectively. Duration prediction is particularly challenging on VESUS due to marked differences between neutral and emotional utterances. However, our framework performs well even in this challenging scenario, likely due to our fusion of deep representations and Bayesian regularization.
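One plausible way to compute the ms/sec error reported in Fig. 4 is sketched below; the 10 ms frame shift is an assumption, since the analysis hop size is not stated in the text.

```python
def length_error_ms_per_sec(T_hat, T_true, hop_ms=10.0):
    """Length-prediction error in milliseconds per second of speech.

    T_hat and T_true are frame counts of the predicted and true target
    utterances; hop_ms is the (assumed) frame shift of the filter-bank
    analysis.
    """
    err_ms = abs(T_hat - T_true) * hop_ms    # absolute duration error in ms
    dur_sec = T_true * hop_ms / 1000.0       # true duration in seconds
    return err_ms / dur_sec
```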

III-C Attention Alignment

Next, we compare the alignment between source and target speech frames obtained by our method with that of the original DTW algorithm. Recall that DTW requires access to the target speech utterance, whereas our approach does not. To compare the warping paths, we code the horizontal, diagonal, and vertical moves of the backtracking procedure into three classes. We then compute the edit distance between the DTW alignment and the attention-map based alignment. Fig. 5 illustrates the match ratio normalized by the average length. As seen, the match ratio varies between 0.70 and 0.85, which suggests that our approach captures the general characteristics of an unseen target utterance. To our knowledge, this is the first demonstration of an adaptive duration modification framework.
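The comparison metric can be sketched as follows: each warping path is coded into horizontal/diagonal/vertical moves, a Levenshtein distance is computed between the two move sequences, and the result is normalized by the average number of moves. This is one plausible reading of the metric described above, not the authors' exact implementation.

```python
import numpy as np

def path_to_moves(path):
    """Encode a warping path [(i, j), ...] as H / V / D move symbols."""
    moves = []
    for (i0, j0), (i1, j1) in zip(path[:-1], path[1:]):
        di, dj = i1 - i0, j1 - j0
        moves.append("D" if di and dj else ("V" if di else "H"))
    return moves

def match_ratio(path_a, path_b):
    """Similarity of two alignments via edit distance over their move codes,
    normalized by the average number of moves."""
    a, b = path_to_moves(path_a), path_to_moves(path_b)
    # standard Levenshtein distance over the two move sequences
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for m in range(1, len(a) + 1):
        for n in range(1, len(b) + 1):
            dp[m, n] = min(dp[m - 1, n] + 1, dp[m, n - 1] + 1,
                           dp[m - 1, n - 1] + (a[m - 1] != b[n - 1]))
    avg_len = 0.5 * (len(a) + len(b))
    return 1.0 - dp[len(a), len(b)] / avg_len
```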

Fig. 7 shows the effect of modifying the slope of the Itakura parallelogram and the horizontal/vertical movement constraint during DTW. As expected, relaxing the slope constraint and increasing the number of horizontal/vertical moves provide more flexibility in adjusting the speaking rate of generated speech. However, this flexibility can lead to missing or distorted phonemes, suggesting a trade-off between changing the speaking rhythm and preserving naturalness. Our framework allows the user to tune these knobs for their own application.

Figure 7: Effect of slope and step constraint on the alignment. The tuple under each image is in (slope, constraint) format. The red curve is the optimal path obtained using back-tracking on the attention maps.

III-D Reconstruction Quality

Finally, we crowd-source the mean opinion score (MOS) for the re-synthesized speech in the test set using Amazon Mechanical Turk (AMT). As seen in Fig. 6, our framework achieves an average MOS between 3.7 and 4.0 across the four tasks. This performance is on par with state-of-the-art neural vocoders trained on hundreds of hours of speech. We note that the CMU-ARCTIC task has the lowest MOS, perhaps due to its longer and more complex utterances. Interestingly, the MOS is unaffected by errors in length prediction, as evidenced by the VESUS neutral-to-angry emotion conversion task. This suggests that our approach of combining the neural network attention weights with a structured DTW algorithm provides robustness to both the speech characteristics and estimation errors.

IV Conclusions

We have presented a novel deep-Bayesian framework for adaptive speech duration modification. Our model uses a convolutional encoder-decoder architecture to estimate attention maps that associate frames of the input speech with frames of the target. The attention maps are modeled as latent variables, which leads to a stochastic formulation of the MAE loss for model training. During testing, the attention map is directly used to approximate the similarity matrix for a DTW-style backtracking procedure. We evaluated our framework on a voice conversion task and three separate emotion conversion tasks. Overall, our framework produces similar duration modifications to vanilla DTW, but without requiring access to the target utterance. Further, we show that the re-synthesized speech has similar quality to that of most state-of-the-art neural vocoders.

References

  • [1] J. A. Russell, J.-A. Bachorowski, and J.-M. Fernandez-Dols, “Facial and vocal expressions of emotion,” Annual Review of Psychology, vol. 54, pp. 329–349, 11 2003.
  • [2] D. Schacter, D. T. Gilbert, and D. M. Wegner, Psychology (2nd Edition).   Worth Publications, 2011.
  • [3] R. Shankar, H.-W. Hsieh, N. Charon, and A. Venkataraman, “Automated Emotion Morphing in Speech Based on Diffeomorphic Curve Registration and Highway Networks,” in Proc. Interspeech 2019, 2019, pp. 4499–4503.
  • [4] R. Shankar, J. Sager, and A. Venkataraman, “A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective,” in Proc. Interspeech 2019, 2019, pp. 2848–2852.
  • [5] R. Valle, J. Li, R. Prenger, and B. Catanzaro, “Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens,” 2019.
  • [6] M. Swerts and E. Krahmer, “On the use of prosody for on-line evaluation of spoken dialogue systems,” 04 2000.
  • [7] S. J. Park, C. Sigouin, J. Kreiman, P. Keating, J. Guo, G. Yeung, F.-Y. Kuo, and A. Alwan, “Speaker identity and voice quality: Modeling human responses and automatic speaker recognition,” in Interspeech 2016, 2016, pp. 1044–1048.
  • [8] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, Nov 2007.
  • [9] R. Aihara, R. Takashima, T. Takiguchi, and Y. Ariki, “GMM-based emotional voice conversion using spectrum and prosody features,” American Journal of Signal Processing, vol. 2, pp. 134–138, 12 2012.
  • [10] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” CoRR, vol. abs/1711.11293, 2017.
  • [11] R. Shankar, J. Sager, and A. Venkataraman, “Non-Parallel Emotion Conversion Using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator,” in Proc. Interspeech 2020, 2020, pp. 3396–3400.
  • [12] R. Shankar, H.-W. Hsieh, N. Charon, and A. Venkataraman, “Multi-Speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network,” in Proc. Interspeech 2020, 2020, pp. 3391–3395.
  • [13] J. Schmidt, E. Janse, and O. Scharenborg, “Perception of emotion in conversational speech by younger and older listeners,” Frontiers in Psychology, vol. 7, p. 781, 2016.
  • [14] S. P. Bayerl, F. Hönig, J. Reister, and K. Riedhammer, “Towards automated assessment of stuttering and stuttering therapy,” 2020.
  • [15] F. Charpentier and M. Stella, “Diphone synthesis using an overlap-add technique for speech waveforms concatenation,” ICASSP ’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 11, pp. 2015–2018, 1986.
  • [16] Dynamic Time Warping (DTW).   Dordrecht: Springer Netherlands, 2008, pp. 570–570.
  • [17] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016.
  • [18] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions,” CoRR, vol. abs/1712.05884, 2017.
  • [19] Y. Wang, R. J. Skerry-Ryan, Y. Xiao, D. Stanton, J. Shor, E. Battenberg, R. Clark, and R. A. Saurous, “Uncovering latent style factors for expressive speech synthesis,” CoRR, vol. abs/1711.00520, 2017.
  • [20] Y. Yasuda, X. Wang, and J. Yamagishi, “Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis,” 2020.
  • [21] J. Kominek and A. W. Black, “The CMU Arctic speech databases,” SSW5-2004, 01 2004.
  • [22] J. Sager, R. Shankar, J. Reinhold, and A. Venkataraman, “VESUS: A Crowd-Annotated Database to Study Emotion Production and Perception in Spoken English,” in Proc. Interspeech 2019, 2019, pp. 316–320.
  • [23] M. J. Beal, “Variational algorithms for approximate bayesian inference,” Ph.D. dissertation, UCL (University College London), 2003.
  • [24] F. Itakura, “Minimum prediction residual principle applied to speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, no. 1, pp. 67–72, 1975.
  • [25] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” 2017.
  • [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2015.
  • [27] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99.D, pp. 1877–1884, 07 2016.

Supplemental Materials:
A Deep-Bayesian Framework for Adaptive Speech Duration Modification

IV-A Loss derivation

We use a convolutional neural network to predict the length and target speech frame using the following expression:

$$\hat{T} = f_\gamma(X) \quad \text{and} \quad \hat{Y}_t = X \cdot A_t + f_\theta(X, \hat{Y}_{0:t-1}). \qquad (6)$$

We maximize the log-likelihood of the observed data $(X, Y)$ to estimate the weights of the neural network, denoted by $\{\theta, \gamma\}$. This data likelihood can be written as:

$$P(\hat{Y}, \hat{T} \,|\, X) = P(\hat{T} \,|\, X) \prod_{t=1}^{\hat{T}} P(\hat{Y}_t \,|\, X, \hat{T}, \hat{Y}_{0:t-1}) \qquad (7)$$

By expanding the second term of Eq. (7) and using the conditional independence implied by the graphical model, we have:

$$\begin{aligned}
P(\hat{Y}_t \,|\, X, \hat{T}, \hat{Y}_{0:t-1}) &= \sum_{A_t} P(\hat{Y}_t, A_t \,|\, X, \hat{T}, \hat{Y}_{0:t-1}, M) \\
&= \sum_{A_t} P(\hat{Y}_t \,|\, X, \hat{T}, A_t, \hat{Y}_{0:t-1})\, P(A_t \,|\, X, \hat{Y}_{0:t-1}, M)
\end{aligned} \qquad (8)$$

The attention mask $M$ is deterministically constructed from the source speech length $T_s$ and the estimated target length $\hat{T}$.

In this work, we encode the attention $A_t$ as a one-hot vector across the $T_s$ input frames of the source speech. Therefore, it follows a multinomial distribution. For simplicity, we model $A_t$ as conditionally independent of the utterance length $T$ given the mask $M$ and the input $X$. Specifically, taking the $\log(\cdot)$ of Eq. (7) and combining with Eq. (8) yields:

$$\begin{aligned}
\mathcal{L}(\theta, \gamma) &= -\log\Big(\sum_{A_t} P(\hat{Y}_t, A_t \,|\, X, \hat{T}, \hat{Y}_{0:t-1}, M)\Big) - \log\big(P(\hat{T} \,|\, X)\big) \\
&= -\log\Big(\sum_{A_t} \frac{q_\theta(A_t \,|\, X, \hat{Y}_{0:t-1}, M)}{q_\theta(A_t \,|\, X, \hat{Y}_{0:t-1}, M)}\, P(\hat{Y}_t, A_t \,|\, X, \hat{T}, \hat{Y}_{0:t-1}, M)\Big) - \log\big(P(\hat{T} \,|\, X)\big) \\
&\leq -\sum_{A_t} q_\theta(A_t \,|\, X, \hat{Y}_{0:t-1}, M) \log\big(P(\hat{Y}_t \,|\, X, A_t, \hat{Y}_{0:t-1})\big) - \log\big(P(\hat{T} \,|\, X)\big) + KL\big(q_\theta(A_t)\,\|\,P(A_t)\big) \\
&= -\sum_{A_t} q_\theta(A_t \,|\, X, \hat{Y}_{0:t-1}, M) \log\big(P(\hat{Y}_t \,|\, X, A_t, \hat{Y}_{0:t-1})\big) - \log\big(P(\hat{T} \,|\, X)\big) - H(q_\theta) + \text{const.} \\
&\leq -\sum_{A_t} q_\theta(A_t \,|\, X, \hat{Y}_{0:t-1}, M) \log\big(P(\hat{Y}_t \,|\, X, A_t, \hat{Y}_{0:t-1})\big) - \log\big(P(\hat{T} \,|\, X)\big) + \text{const.}
\end{aligned} \qquad (9)$$

The distribution $q_\theta(\cdot)$ above is an approximating distribution for the attention vectors implemented by a convolutional network. The first inequality uses the convexity of the $-\log$ function, and the second inequality comes from the fact that the entropy $H(q_\theta) \geq 0$. Notice that we have implicitly assumed that $P(A_t \,|\, X, \hat{Y}_{0:t-1}, M)$ is a uniform distribution over the masked region. This is a reasonable assumption given that the masking process reduces the attention domain to a small region. However, the form of $q_\theta$ is not penalized for deviating from this uniform distribution during training. This flexibility allows the network to learn realistic attention vectors during autoregressive decoding. Eq. (9) can be easily translated into a neural network loss function which we minimize for $\{\theta, \gamma\}$:

$$\begin{aligned}
L &= E_{A_t \sim q_\theta}\big[\log P(\hat{Y}_t \,|\, X, A_t, \hat{Y}_{0:t-1})\big] + \log P(\hat{T} \,|\, X) \\
&= \lambda_1 \times E_{A_t}\big[\|\hat{Y}_t - Y_t^0\|_1\big] + \lambda_2 \times \|\hat{T} - T^0\|_1
\end{aligned} \qquad (10)$$

$\lambda_1$ and $\lambda_2$ are the model hyperparameters that adjust the trade-off between the two objectives and implicitly contain the variances of the Laplace distributions introduced in the main text. Notice that the loss in Eq. (10) computes an expectation over the attention maps. We use a Monte-Carlo estimate by sampling from the attention map at each time-step. The training procedure is therefore stochastic in nature due to this random sampling. At the beginning of training, we mix this stochastic version with the maximum a posteriori (MAP) estimate of the attention vector, using the stochastic version with probability 0.1.
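A minimal PyTorch-style sketch of this mixed sampling strategy is shown below: with a small probability early in training, a one-hot attention vector is drawn from the masked attention distribution; otherwise the soft attention weights are used, corresponding to the deterministic branch of Algorithm 2. The function and argument names are ours, and how gradients are propagated through the discrete sample is not detailed here.

```python
import torch

def attention_sample(logits, mask, stochastic_prob=0.1):
    """Pick between a sampled one-hot attention vector and its soft version.

    logits : (T_s,) unnormalized attention scores for one target frame
    mask   : (T_s,) binary Itakura mask restricting the attention domain
    stochastic_prob : probability of using the sampled (stochastic) attention;
                      it starts small (about 0.1) and is increased later.
    """
    probs = torch.softmax(logits.masked_fill(mask == 0, float("-inf")), dim=-1)
    if torch.rand(1).item() < stochastic_prob:
        # Monte-Carlo branch: draw a 1-hot sample A_t from q_theta(A_t | X, Y_hat, M)
        idx = torch.multinomial(probs, num_samples=1)
        return torch.zeros_like(probs).scatter_(0, idx, 1.0)
    # Deterministic branch of Algorithm 2: use the full attention weights.
    return probs
```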

IV-B Training Algorithm

1 function trainModelParameters $(X, Y)$;
Input : filter-bank energies ($X \in \mathbb{R}^{D \times T_s}$, $Y \in \mathbb{R}^{D \times T_t}$)
Output : model parameters ($\theta, \gamma$)
2 Set epoch = 0 and threshold;
3 while epoch < MaxEpochs do
4       Predict the length of the target sequence $\hat{T} = f_\gamma(X)$;
5       Create the attention mask $M \in \mathbb{R}^{T_s \times T_t}$ and set $t = 0$;
6       Estimate $A \in \mathbb{R}^{T_s \times T_t}$ using masked convolution;
7       Sample $u \sim U(0, 1)$;
8       if $u <$ threshold then
9             Sample $a \in \mathbb{R}^{T_s \times T_t}$ as 1-hot vectors from $A$;
10            Reconstruct using $\hat{Y}_t = X \cdot a + f_\theta(X, Y_{0:t-1})$;
11      else
12            Reconstruct using $\hat{Y}_t = X \cdot A + f_\theta(X, Y_{0:t-1})$;
13      end if
14      Compute the reconstruction and length prediction errors;
15      Update the parameters $\theta, \gamma$ using backpropagation;
16      epoch $\leftarrow$ epoch + 1;
17 end while
18 return $\theta$ and $\gamma$;
Algorithm 2 Strategy for model training

We start with a small threshold in line 8 (i.e., a low contribution of the stochastic loss) to prevent the model from diverging in sub-optimal directions; the MAP estimate helps in this regard. Once the number of training epochs exceeds a fixed value, we increase the threshold to place more emphasis on the stochastic loss. Empirically, we found this to be extremely helpful in generating monotonic attention. We fixed the slope of the attention mask in line 5 to 1.25 based on the relative difference in length observed in the training datasets.
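As an illustration, the threshold schedule can be as simple as a step function. Only the small initial value (about 0.1) is implied by the text; the warm-up length and the final threshold below are placeholders.

```python
def stochastic_threshold(epoch, warmup_epochs=10, low=0.1, high=0.9):
    """Schedule for the sampling threshold in line 8 of Algorithm 2.

    Training starts with a small threshold (mostly deterministic attention)
    and switches to a larger one once a fixed number of epochs has elapsed.
    The warm-up length and the two values here are assumptions.
    """
    return low if epoch < warmup_epochs else high
```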