
Improved Normalizing Flow-Based Speech Enhancement using an
All-pole Gammatone Filterbank for Conditional Input Representation

Abstract

Deep generative models for Speech Enhancement (SE) have received increasing attention in recent years. The most prominent examples are Generative Adversarial Networks (GANs), while normalizing flows (NF) have received less attention despite their potential. Building on previous work, architectural modifications are proposed, along with an investigation of different conditional input representations. Despite being a common choice in related works, Mel-spectrograms prove to be inadequate for the given scenario. Alternatively, a novel All-Pole Gammatone filterbank (APG) with high temporal resolution is proposed. Although computational evaluation metric results would suggest that state-of-the-art GAN-based methods perform best, a perceptual evaluation via a listening test indicates that the presented NF approach (based on time domain and APG) performs best, especially at lower SNRs. On average, APG outputs are rated as having good quality, which is unmatched by the other methods, including GAN-based ones.

Index Terms—  speech enhancement, normalizing flows, all-pole gammatone filterbank, DNN

1 Introduction

Speech Enhancement (SE) aims to improve the quality of speech degraded by disturbing background noise [1]. Therefore, it plays a vital role as a front end for automatic speech recognition systems [2, 3, 4] and far-field speech processing [5]. SE has been investigated extensively, and approaches based on Deep Neural Networks (DNNs) have largely overtaken traditional techniques like spectral subtraction [6], Wiener filtering [7], or subspace methods [8]. Most commonly, a separation mask is estimated by minimizing a distance metric to extract the clean speech components in the Time-Frequency (TF) domain [9, 10] or a learned subspace [11]. Still, in recent years there has been an increasing interest in generative approaches that model the probability distribution of speech signals. The most prominent examples include Generative Adversarial Networks (GANs) [12, 13], Variational Autoencoders (VAEs) [14], autoregressive models [15], and diffusion probabilistic models [16]. GAN-based architectures stand out in their performance. For instance, MetricGAN+ [12] and HiFi-GAN-2 [13] are the respective successors of adversarially trained DNNs for SE. MetricGAN+ is directly optimized on PESQ [17] or STOI [18], reporting high values in the corresponding metrics at the output. HiFi-GAN-2 is pretrained in a discriminative way, followed by adversarial optimization to improve perceptual quality. Although the presented results are genuinely impressive, GANs in general are known to be difficult to train in a stable manner and tend to suffer from mode collapse [19].
Diffusion probabilistic models are a recent example of generative models in which the transformation from Gaussian noise to the clean input is learned by a diffusion process. Lu et al. [16] were the first to apply this approach to SE, restoring clean speech by conditioning the process on noisy speech. They show leading performance among time-domain generative models and promising generalization in mismatched conditions. Still, sampling from a diffusion process is rather slow and computationally expensive [20]. Normalizing Flows (NFs) [21] are another generative modelling technique. They are trained by maximizing the likelihood of the data directly, making them easy and stable to train. Despite increasing success in fields like computer vision [22] or speech synthesis [23], their application to SE has received less attention. Nugraha et al. [24] applied NFs in combination with a VAE to learn a deep speech prior to be combined with an SE algorithm of choice. In contrast, Strauss et al. [25] used NFs to learn the mapping from Gaussian noise to clean speech conditioned on a noisy speech sample entirely in the time domain. While outperforming other time-domain GAN-based methods, the overall performance evaluated on computational metrics lags behind comparable TF domain approaches.
Building on previous work, the aim of this paper is to give further insights into NF-based SE. We improve the architecture of [25] by a simple double coupling scheme to ensure that the entire input signal is processed in one flow block. Further, different input representations for the conditional noisy input signal are considered. Our experiments show that, despite Mel-spectrograms being a common choice for conditional signal representation in related fields like neural vocoders [23, 26], they are inadequate for our scenario. Alternatively, the usage of a Bark-spaced All-Pole Gammatone filterbank (APG) [27] is proposed. Similar to Mel, this design makes use of a perceptually motivated filterbank to mimic the human auditory system and reduce the dimensionality of the filter output compared to a standard Short-Time Fourier Transform (STFT). At the same time, the design of this filterbank increases the temporal resolution, which overcomes the limitations of a standard Mel-spectrogram. Perceptual evaluation via a listening test indicates that the presented NF approach (based on time domain and APG) performs better than state-of-the-art GAN-based methods, especially at lower SNRs, even though this is not reflected by computational evaluation metrics.

2 Normalizing flow-based speech enhancement

Let us define two $D$-dimensional random variables $\mathbf{x}\in\mathbb{R}^{D}$ and $\mathbf{z}\in\mathbb{R}^{D}$. A NF is defined by a differentiable function $f$ with a differentiable inverse, allowing a bijective transformation between the two random variables [21], i.e.,

$\mathbf{x} = f(\mathbf{z}), \qquad \mathbf{z} = f^{-1}(\mathbf{x}).$   (1)

The invertibility of $f$ ensures that the probability density of $\mathbf{x}$ can be computed from that of $\mathbf{z}$ by a change of variables, i.e.,

$p_{x}(\mathbf{x}) = p_{z}(\mathbf{z}) \left\lvert \det\left(J(\mathbf{x})\right) \right\rvert,$   (2)

where $J(\mathbf{x}) = \partial\mathbf{z}/\partial\mathbf{x}$ is the Jacobian containing all first-order derivatives. Since $f$ is invertible, this also holds for a sequence of functions $f_{1:T}$, i.e.,

$\mathbf{x} = f_{1} \circ f_{2} \circ \cdots \circ f_{T}(\mathbf{z}).$   (3)
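To make the change of variables concrete, the following toy sketch (not part of the original publication) evaluates Eq. (2) for a single elementwise affine transform in PyTorch; all names and values are illustrative.

```python
# Toy illustration of Eqs. (1)-(2): an elementwise affine flow x = s * z + t,
# for which J(x) = dz/dx = diag(1/s) and log|det J(x)| = -sum(log|s|).
import torch

D = 4
s = torch.rand(D) + 0.5            # elementwise scales (nonzero, hence invertible)
t = torch.randn(D)                 # elementwise shifts

def f(z):                          # z -> x
    return s * z + t

def f_inv(x):                      # x -> z
    return (x - t) / s

def log_px(x):
    # log p_x(x) = log p_z(f^{-1}(x)) + log|det(dz/dx)|
    z = f_inv(x)
    log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum()
    log_det = -torch.log(s.abs()).sum()
    return log_pz + log_det

x = f(torch.randn(D))
print(float(log_px(x)))
```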

Let us now introduce a single-channel noisy speech signal $\mathbf{y}\in\mathbb{R}^{N}$ with sequence length $N$, obtained by the summation of a clean speech utterance $\mathbf{x}\in\mathbb{R}^{N}$ and background noise $\mathbf{n}\in\mathbb{R}^{N}$:

$\mathbf{y} = \mathbf{x} + \mathbf{n}.$   (4)

Moreover, $\mathbf{z}\in\mathbb{R}^{N}$ is defined to be sampled from a Gaussian distribution with zero mean and unit variance, i.e.,

$\mathbf{z} \sim \mathcal{N}(\mathbf{z}\,|\,\mathbf{0},\mathbf{I}).$   (5)

The aim of NF-based SE is to model the conditional probability distribution $p_{x}(\mathbf{x}\,|\,\mathbf{y})$ with a DNN with parameters $\theta$. Hence, the overall training objective is the maximization of the log-likelihood, i.e.,

$\log p_{x}(\mathbf{x}\,|\,\mathbf{y};\theta) = \log p_{z}(f^{-1}_{\theta}(\mathbf{x})\,|\,\mathbf{y}) + \log\left\lvert\det\left(J(\mathbf{x})\right)\right\rvert.$   (6)

To enhance a signal, the learned network is inverted: a noise sample drawn from $p_{z}(\mathbf{z})$ is conditioned on a noisy speech utterance and mapped to the distribution of clean speech utterances, resulting in an enhanced speech output.
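A minimal sketch of how the objective in Eq. (6) is typically implemented is given below; `flow` denotes a hypothetical module that returns $f^{-1}_{\theta}(\mathbf{x}\,|\,\mathbf{y})$ together with the accumulated log-determinant. The interface is an assumption, not the authors' code.

```python
# Minimal sketch of the training objective in Eq. (6); `flow` is a hypothetical
# module returning z = f_theta^{-1}(x | y) and the accumulated log|det J(x)|.
import math
import torch

def nll_loss(flow, x, y):
    z, log_det = flow(x, cond=y)                       # assumed interface
    # log p_z(z) under the standard Gaussian prior, summed over samples
    log_pz = -0.5 * (z ** 2 + math.log(2.0 * math.pi)).sum(dim=-1)
    return -(log_pz + log_det).mean()                  # minimize negative log-likelihood

# Enhancement (Sec. 2): invert the learned flow on Gaussian noise conditioned
# on the noisy utterance y (sigma = 0.9 as described later in Sec. 4.2), e.g.:
# x_hat = flow.inverse(0.9 * torch.randn_like(y), cond=y)
```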

3 Proposed methods

3.1 Model architecture

The model used in the experiments builds upon [25]. This network consists of a sequence of so-called flow blocks to transform the input clean speech utterance to Gaussian noise. One flow block (Figure 1) contains a combination of an invertible 1x1 convolutional layer [28] and an affine coupling layer [22]. Similar to the Glow [28] network, each block processes multiple channels of the input signal at once. Therefore, the input $x\in\mathbb{R}^{1\times N}$ with sequence length $N$ is subsampled by a factor $G$ to create a multichannel signal $x\in\mathbb{R}^{G\times(N/G)}$. After the invertible convolutional layer, the input is separated into two halves along the channel dimension, with one part being provided to the subnetwork inside the coupling layer to learn affine transformation parameters $s$ and $t$ for the second half. The transformed signal is concatenated with the unchanged second part and serves as input for the next block. This operation is invertible, ensuring that the network is invertible overall, although the subnetwork inside the coupling layer estimating the affine parameters does not need to be invertible.
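The subsampling of the waveform into $G$ channels (the "squeeze" operation of Glow-like models) can be sketched as follows; the exact sample grouping used in [25] is an assumption here.

```python
# Sketch of the waveform subsampling into G channels and its inverse.
import torch

def squeeze(x, G):
    # x: (batch, 1, N) waveform -> (batch, G, N // G) multichannel signal
    B, _, N = x.shape
    x = x[..., : (N // G) * G]                   # drop samples that do not fit
    return x.reshape(B, N // G, G).transpose(1, 2)

def unsqueeze(x):
    # inverse: (batch, G, M) -> (batch, 1, G * M)
    B, G, M = x.shape
    return x.transpose(1, 2).reshape(B, 1, G * M)

x = torch.randn(2, 1, 16000)
assert torch.allclose(unsqueeze(squeeze(x, 12)), x[..., : (16000 // 12) * 12])
```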

To increase the capacity of each block, a double coupling scheme inspired by [29] was implemented where the output of the affine transformation is reused as an input to calculate the affine parameters for the second part, i.e.,

$\hat{x}_{1} = s_{1}(x_{2}) \odot x_{1} + t_{1}(x_{2}),$
$\hat{x}_{2} = s_{2}(\hat{x}_{1}) \odot x_{2} + t_{2}(\hat{x}_{1}),$   (7)

where the input $x$ is separated into $x_{1}$ and $x_{2}$, and $s_{1}$, $t_{1}$, as well as $s_{2}$ and $t_{2}$, are estimated by respective subnetworks. The output is concatenated, i.e., $\hat{x}=\left[\hat{x}_{1},\hat{x}_{2}\right]$, and passed to the next flow block. This procedure is illustrated in Figure 1.

Fig. 1: Double coupling scheme. Before entering the affine coupling layer, the subsampled input $x$ passes through the invertible 1x1 convolution. The conditional input $y$ serves as input to both subnetworks. For single coupling, $x_{2}$ is passed unchanged through the identity function, essentially leading to $\hat{x}_{2}=x_{2}$.
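A possible realization of the double coupling in Eq. (7) is sketched below. `net1` and `net2` are placeholder subnetworks, each returning a log-scale and a shift from its input and the conditional signal; the exponential parameterization of $s$ is an assumption, not necessarily the authors' choice.

```python
# Sketch of the double coupling of Eq. (7) and its inverse.
import torch

def double_coupling_forward(x, y, net1, net2):
    x1, x2 = torch.chunk(x, 2, dim=1)            # split along the channel dimension
    log_s1, t1 = net1(x2, y)
    x1_hat = torch.exp(log_s1) * x1 + t1         # first coupling
    log_s2, t2 = net2(x1_hat, y)
    x2_hat = torch.exp(log_s2) * x2 + t2         # second coupling reuses x1_hat
    log_det = log_s1.sum(dim=(1, 2)) + log_s2.sum(dim=(1, 2))
    return torch.cat([x1_hat, x2_hat], dim=1), log_det

def double_coupling_inverse(x_hat, y, net1, net2):
    x1_hat, x2_hat = torch.chunk(x_hat, 2, dim=1)
    log_s2, t2 = net2(x1_hat, y)
    x2 = (x2_hat - t2) * torch.exp(-log_s2)      # invert the second coupling first
    log_s1, t1 = net1(x2, y)
    x1 = (x1_hat - t1) * torch.exp(-log_s1)
    return torch.cat([x1, x2], dim=1)
```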

Similar to Waveglow [23], the subnetwork used in this paper is a stack of dilated convolutions with skip connections applied to the input signal. The conditional signal is also subsampled by the factor $G$ before being processed by a single convolutional layer and introduced to each layer of the subnetwork by a gated activation, i.e., a combination of a tanh and a sigmoid function, as proposed for WaveNet [30].
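As an illustration (not the actual implementation), a single dilated layer with such a gated conditional injection could look as follows; channel sizes and the residual connection are simplifications.

```python
# Sketch of a WaveNet-style gated activation injecting the conditional signal.
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    def __init__(self, channels, cond_channels, dilation):
        super().__init__()
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                                 padding=dilation, dilation=dilation)
        self.cond = nn.Conv1d(cond_channels, 2 * channels, kernel_size=1)

    def forward(self, x, c):
        # x: (batch, channels, T), c: (batch, cond_channels, T)
        a, b = torch.chunk(self.dilated(x) + self.cond(c), 2, dim=1)
        return x + torch.tanh(a) * torch.sigmoid(b)

layer = GatedLayer(channels=128, cond_channels=80, dilation=2)
out = layer(torch.randn(1, 128, 1333), torch.randn(1, 80, 1333))
```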

3.2 Conditional input representations

In addition to the original time-domain representation of the conditional signal, experiments with additional variations are conducted. The Mel-spectrogram is a common choice for the conditional representation and works well, e.g., in neural vocoders [23, 26]. Each Mel spectral coefficient is, as usual, computed from FFT magnitudes by multiplication with a triangular spectral weighting function and summation. For instance, in [23] an FFT length of 1024 samples (at $f_{s}=22$ kHz) was chosen to obtain a frequency resolution appropriate for achieving sufficient accuracy of the lowest Mel coefficients. This, however, leads to a time resolution far below that of the human auditory system at higher frequencies, so that fine temporal structures are not sufficiently represented. Further, the time frames are up-sampled to match the time input dimension using a transposed convolution layer. This step is rather redundant, since the low number of time frames needs to be brought up to full time resolution again without adding further information. Consequently, in initial experiments using a Mel representation, phase/time shifts occurred in the enhanced output, which we attributed to the low temporal resolution of the conditional input. Moreover, during enhancement the only useful information provided to the network is the noisy conditional signal, and corrupted time frames potentially do not provide sufficient information.
Therefore, we investigated the use of a specifically designed complex-valued All-Pole Gammatone filterbank (APG). Motivated by human hearing, the center frequencies have constant distances on the Bark scale, with bandwidths increasing with frequency, proportional to the Bark bandwidths. The IIR filters operate directly on the time-domain input signal with cascades of first-order complex-valued stages. Thus, filter outputs at the input sampling rate are obtained. Although we use only the output magnitudes as conditional input, their temporal resolution is only limited by the extent of the impulse responses, which are longer for the narrow bands with lower center frequencies, but relatively short for the wider bands with higher center frequencies, as can be seen in Figure 2.

Fig. 2: Magnitudes of the filter responses of the All-Pole Gammatone filterbank (APG). The filter outputs are delay compensated. For better visibility, only every 10th band is displayed. (Best viewed in color)
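To make the idea concrete, a rough sketch of a Bark-spaced analysis with cascades of first-order complex one-pole filters is given below. It is only loosely modeled after the APG of [27]: the Bark mapping (Traunmüller), the bandwidth rule (Zwicker's critical bandwidth), the pole placement, the normalization, and the omitted delay compensation are all simplifying assumptions.

```python
# Rough sketch of Bark-spaced magnitude outputs from cascaded one-pole filters.
import numpy as np
from scipy.signal import lfilter

def hz_to_bark(f):                         # Traunmüller approximation
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_to_hz(b):
    return 1960.0 * (b + 0.53) / (26.28 - b)

def apg_magnitudes(x, fs=16000, n_bands=80, f_min=40.0, order=4):
    """Magnitude outputs (n_bands, len(x)) at the input sampling rate."""
    f_max = 0.95 * fs / 2.0
    fc = bark_to_hz(np.linspace(hz_to_bark(f_min), hz_to_bark(f_max), n_bands))
    bw = 25.0 + 75.0 * (1.0 + 1.4 * (fc / 1000.0) ** 2) ** 0.69   # Bark bandwidth
    out = np.zeros((n_bands, len(x)))
    for k in range(n_bands):
        pole = np.exp(-np.pi * bw[k] / fs + 2j * np.pi * fc[k] / fs)
        gain = (1.0 - np.abs(pole)) ** order        # rough peak normalization
        y = x.astype(complex)
        for _ in range(order):                      # cascade of first-order stages
            y = lfilter([1.0], [1.0, -pole], y)
        out[k] = gain * np.abs(y)
    return out
```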

4 Experimental setup

4.1 Dataset

For the experiments we consider the commonly used VoiceBank-DEMAND dataset [31]. It includes 30 speakers, separated into 28 for training and 2 for testing. The dataset items consist of speech samples from the VoiceBank corpus [32] corrupted with noise items from the DEMAND database [33] and artificially generated speech-shaped and babble noise. The items are mixed at Signal-to-Noise Ratios (SNRs) of 0, 5, 10, and 15 dB for training. For testing, SNRs of 2.5, 7.5, 12.5, and 17.5 dB are used. In our experiments, one male and one female speaker are taken out of the training set to build a development set. All items are re-sampled to 16 kHz.

4.2 Model configurations

The models are constructed with 16 flow blocks and a subsampling factor $G=12$. The subnetwork has 8 layers of dilated convolutions implemented as depthwise separable convolutions [34]. In contrast to [25], the output channels of the dilated convolutions are set to 128 and the conditional input layer is replaced by a depthwise separable convolution. This configuration of the model using the time-domain input has a total of 8.8 M parameters, which is significantly lower than in [25]. Using the double coupling scheme approximately doubles the number of parameters, since two subnetworks have to be learned for each block. From each training audio file, 1 s long chunks are randomly extracted and given as inputs to the network. During enhancement the entire signal is processed at once. The models are trained with a batch size of 4 and the Adam optimizer (learning rate $=0.001$). The learning rate is decayed by a factor of 0.5 if the validation loss does not decrease for 10 consecutive epochs. All models are trained for 200 epochs. Similar to previous works [23, 25], using a lower standard deviation for the sampling distribution during inference experimentally showed slightly better performance. Hence, the standard deviation was lowered from $\sigma=1.0$ in training to $\sigma=0.9$ in enhancement. For the Mel-spectrogram the FFT parameters are: an FFT and window size of 512 samples, a Hann window, and 75% overlap. The spectrogram includes 80 frequency bands. The APG is implemented with a filter order of 4 and a lookahead factor for group delay compensation of 0.7. The minimum center frequency is set to 40 Hz, with a total of 80 frequency bands and the maximum center frequency just below the Nyquist frequency.
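For convenience, the hyperparameters described above are summarized in the following configuration sketch; the parameter names are illustrative and do not correspond to the actual training code.

```python
# Hyperparameters as described in Sec. 4.2 (names are illustrative only).
config = {
    "n_flow_blocks": 16,
    "subsampling_factor_G": 12,
    "subnet_layers": 8,               # dilated depthwise separable convolutions
    "subnet_channels": 128,
    "segment_length_s": 1.0,          # random 1 s chunks during training
    "batch_size": 4,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "lr_decay_factor": 0.5,           # after 10 epochs without improvement
    "epochs": 200,
    "sigma_train": 1.0,
    "sigma_inference": 0.9,
    "mel": {"n_fft": 512, "hop": 128, "window": "hann", "n_mels": 80},  # 75% overlap
    "apg": {"order": 4, "lookahead": 0.7, "f_min_hz": 40, "n_bands": 80},
}
```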

5 Evaluation

We compare the following flow-based systems. The original model with single coupling and time-domain conditional input [25] is denoted as SE-Flow$_{\text{sc}}$, while the proposed double coupling version is referred to as SE-Flow. The varied input conditions are indicated by the suffixes -Mel and -APG. Two state-of-the-art generative models are also considered, namely MetricGAN+ [12] and CDiffuSE [16]. The samples enhanced by MetricGAN+ are obtained from the model in the SpeechBrain project [35]. For CDiffuSE, the outputs were kindly provided by the authors of the corresponding paper.

5.1 Computational evaluation metrics

The methods are first evaluated with computational metrics. PESQ [17] (worst: -0.5; best: 4.5) and the composite measures CSIG, CBAK, and COVL [36] (worst: 1; best: 5), which estimate mean opinion scores, are commonly reported on this dataset. For further insights, STOI [18] (worst: 0; best: 1) and the 2f-model score [37, 38] (worst: 0; best: 100) are also reported.
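PESQ and STOI can be reproduced with openly available Python packages; the following sketch assumes the `pesq`, `pystoi`, and `soundfile` packages, which are not necessarily the implementations used for Table 1, and the file names are placeholders. The composite measures and the 2f-model require separate implementations.

```python
# Sketch of computing PESQ and STOI for one clean/enhanced pair.
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, fs = sf.read("clean.wav")        # placeholder file names
deg, _ = sf.read("enhanced.wav")
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]

print("PESQ (wide-band):", pesq(fs, ref, deg, "wb"))   # "wb" requires fs = 16 kHz
print("STOI:", stoi(ref, deg, fs, extended=False))
```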

5.2 Listening test

The considered methods are also compared via a listening test following the MUSHRA methodology [39] with a reference and a 3.5 kHz low-pass anchor. The participants were instructed to rate the overall sound quality of the presented items with regard to the reference. The test items were selected from the BUS and CAFE noise settings and only the most difficult SNR conditions, i.e., 2.5 and 7.5 dB SNR. Per test speaker, one item of at least 3 s was randomly selected for a total of 8 items. The computational evaluation was repeated on the test items and confirmed that this item selection was not biased towards a particular model. The samples used for this test along with the unprocessed input signals can be found online at https://www.audiolabs-erlangen.de/resources/2022-SLT-improved_SE_Flow.

The raw outputs from the different systems differ greatly in overall energy. For one example item, the output integrated loudness [40] ranges from -17 to -24 loudness units full scale (LUFS), while both the noisy input and the clean speech are at -22.8 LUFS. Moreover, different levels of leaking noise are observed after processing. This can make it very difficult to assign an overall quality score to the compared systems, as noise suppression and speech quality typically trade off against each other. In order to ensure a fair comparison of the systems, leaking noise level matching and loudness normalization are carried out, similarly to [41]. First, background components are obtained by subtracting the enhanced output or the clean reference from the input mixture. Then:

  1. The reference condition is created by mixing the reference clean speech with the corresponding background component, attenuated by 30 dB.
  2. Speech activity information is determined by thresholding the envelope of the clean reference.
  3. The integrated loudness of the non-speech parts of the reference condition is determined (gating deactivated).
  4. For each test condition, the noise attenuation level is obtained iteratively until the non-speech parts reach the same loudness as in the reference condition, reusing the speech activity information gathered for the reference condition (a simplified sketch of this search is given after the list).
  5. Each condition is normalized to -23 LUFS (integrated loudness, gating deactivated).
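The following is a simplified, non-authoritative sketch of the iterative attenuation search in step 4. A plain RMS level serves as a stand-in for the ITU-R BS.1770 integrated loudness with gating deactivated, and all function and variable names are illustrative.

```python
# Sketch of step 4: attenuate the leaking noise of a test condition until its
# non-speech loudness matches the target measured on the reference condition.
import numpy as np

def rms_db(x):
    # Plain RMS level in dB as a stand-in for ungated BS.1770 loudness.
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def match_noise_level(speech_est, noise_est, nonspeech_mask, target_db,
                      tol_db=0.1, max_iter=100):
    """Return the attenuation (dB) for the leaking-noise component so that the
    non-speech parts of the mixed test condition reach target_db."""
    att_db = 0.0
    for _ in range(max_iter):
        mix = speech_est + noise_est * 10.0 ** (-att_db / 20.0)
        level = rms_db(mix[nonspeech_mask])
        if abs(level - target_db) < tol_db:
            break
        att_db += level - target_db          # simple fixed-point update
    return att_db
```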

The test was conducted online using webMUSHRA [42], with each participant using their own PC and headphones. The participants were 20 fellow colleagues with various levels of experience in audio research. No results had to be removed in accordance with the MUSHRA post-screening procedure. Note that, in order to keep the amount of leaking noise comparable across systems, the test items for CDiffuSE are generated from the raw network output, while the numbers reported in the corresponding paper include a recombination with the original noisy signal.

6 Results and Discussion

Table 1: Computational evaluation results for the test set (VoiceBank-DEMAND). Mean values. Best results in bold.
Method PESQ CSIG CBAK COVL STOI 2f-model
Noisy 1.97 3.35 2.45 2.63 0.92 31.70
CDiffuSE [16] 2.52 3.72 2.91 3.10 0.91 33.65
MetricGAN+ [12] 3.13 4.08 3.16 3.60 0.93 34.36
SE-Flow$_{\text{sc}}$ 2.24 3.60 2.95 2.91 0.90 44.47
SE-Flow 2.41 3.79 3.11 3.09 0.93 46.92
SE-Flow-Mel 1.63 2.84 2.00 2.19 0.85 35.63
SE-Flow-APG 2.05 3.30 2.51 2.65 0.89 40.16

6.1 Computational evaluation results

Table 1 shows the results of the computational evaluation on the test set. The computational metrics were also evaluated separately for the conditions 7.5 and 2.5 dB input SNR, as well as considering exclusively the items selected for the listening test. While the metrics show significantly different absolute values, the main trends and the ranking of the methods remain the same as in Table 1.

The metrics show that SE-Flow outperforms the single coupling version, confirming the benefits of the proposed double coupling architecture at the cost of a more complex network. SE-Flow with time-domain conditional input shows the best performance among the flow-based models. SE-Flow-Mel shows the lowest performance, with some metrics even worse than the noisy baseline. One possible explanation is that the Mel representation of the noisy speech is sub-optimal for the application at hand, possibly because it does not provide phase information about the input. SE-Flow-APG exhibits lower results in PESQ and in the composite measures than the other methods, but its 2f-model score is surpassed only by the time-domain flow models.

MetricGAN+ shows the best results in PESQ and the composite measures. These results are somewhat to be expected, since MetricGAN+ directly optimizes PESQ. In terms of STOI, MetricGAN+ shows the best performance together with SE-Flow. CDiffuSE outperforms all flow-based models in terms of PESQ and the composite measures, staying behind only MetricGAN+. This confirms the results reported in the corresponding publication with regard to other time-domain generative models. With regard to the 2f-model, the time-domain SE-Flow shows the best performance among all methods.

6.2 Listening test results

Fig. 3: Listening test results (20 listeners). Mean values and 95% confidence intervals (Student's t-distribution). The results are shown for the different input conditions (7.5 dB and 2.5 dB SNR) and over all items in the test. The hidden reference (not shown here) was always rated $\geq 95$. (Best viewed in color)

The results of the listening test are depicted in Figure 3. The average results over all items show that SE-Flow-APG performs best among all methods, being rated as having good quality on average, with confidence intervals not overlapping with those of the other methods. SE-Flow follows. Despite the high values in the computational evaluation metrics, MetricGAN+ performs worse than SE-Flow, SE-Flow-APG, and CDiffuSE. Inspecting some of the enhanced samples reveals artifacts in the voice timbre, which could partly explain the low scores. CDiffuSE takes the third place behind both suggested flow-based approaches. Examining the respective samples reveals a low-pass-filter-like characteristic of the outputs, which could explain the results to some extent. As indicated also by the computational evaluation, SE-Flow-Mel exhibits the worst performance, with scores similar to the low-pass anchor.
Considering the results grouped by input SNR condition, SE-Flow-APG performs best at the lowest SNR condition, while being on par with SE-Flow at 7.5 dB SNR. In fact, the performance of SE-Flow drops dramatically going from 7.5 to 2.5 dB SNR, where the superiority of SE-Flow-APG is evident. Also, MetricGAN+ is close to the good quality range for the higher SNR condition, but it drops 15 MUSHRA points when tested at 2.5 dB SNR. CDiffuSE shows more robustness across SNRs, but overall lower quality than SE-Flow-APG.
It is worth highlighting that the results from the listening test are in partial disagreement with the results from the computational metrics. In fact, even if computational metrics can be extremely useful for their convenience and reproducibility, their correlation with perceived audio quality is often low [38]. For this reason, conclusions drawn exclusively from computational metrics should be taken with care, as they can be partially misleading in terms of perceived quality.

7 Conclusion

In this paper, several improvements to a flow-based SE model are introduced. With the presented double coupling scheme, the model processes the entire input signal in each coupling layer, leading to higher capacity and performance. Additional experiments consider different representations for the conditional input. Despite being a common choice in related fields, Mel-spectrograms proved not to be a suitable choice for flow-based SE. As an alternative, the proposed Bark-spaced All-Pole Gammatone filterbank-based pre-processing with increased time resolution overcomes the Mel-induced problems. While the results of popular computational metrics lag behind state-of-the-art generative models, the outcome of a listening test indicates that flow-based SE using a time-domain or gammatone-filtered conditional signal has favourable perceptual performance. It was shown that the proposed method not only outperforms the compared generative models, but also that its performance remains strong across different SNR conditions.

8 ACKNOWLEDGMENTS

The authors would like to thank Yen-Ju Lu and Yu Tsao for sharing the test samples of their diffusion model for comparison.

References

  • [1] P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2nd edition, 2013.
  • [2] A. Pandey, C. Liu, et al., “Dual Application of Speech Enhancement for Automatic Speech Recognition,” in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 223–228.
  • [3] T. O’Malley et al., “A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation,” in Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 304–311.
  • [4] C. Li et al., “ESPnet-se: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration,” in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 785–792.
  • [5] R. Haeb-Umbach et al., “Far-Field Automatic Speech Recognition,” Proceedings of the IEEE, vol. 109, no. 2, pp. 124–148, 2021.
  • [6] H. Gustafsson, S.E. Nordholm, and I. Claesson, “Spectral Subtraction Using Reduced Delay Convolution and Adaptive Averaging,” IEEE Trans. Speech Audio Process., vol. 9, no. 8, pp. 799–807, 2001.
  • [7] J. Chen, J. Benesty, et al., “New Insights into the Noise Reduction Wiener Filter,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1218–1234, 2006.
  • [8] M. Klein and P. Kabal, “Signal Subspace Speech Enhancement with Perceptual Post-Filtering,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2002, pp. 537–540.
  • [9] Y. Koizumi, S. Karita, S. Wisdom, et al., “DF-Conformer: Integrated Architecture of Conv-TasNet and Conformer Using Linear Complexity Self-Attention for Speech Enhancement,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2021, pp. 161–165.
  • [10] D. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, 2018.
  • [11] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, pp. 1256–1266, 2019.
  • [12] S.-W. Fu, C. Yu, et al., “MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement,” in Proc. Interspeech Conf., 2021, pp. 201–205.
  • [13] J Su, Z. Jin, and A. Finkelstein, “HiFi-GAN-2: Studio-Quality Speech Enhancement via Generative Adversarial Networks Conditioned on Acoustic Features,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2021, pp. 166–170.
  • [14] S. Leglaive, X. Alameda-Pineda, et al., “A Recurrent Variational Autoencoder for Speech Enhancement,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 371–375.
  • [15] K. Qian, Y. Zhang, et al., “Speech Enhancement Using Bayesian Wavenet,” in Proc. Interspeech Conf., 2017, pp. 2013–2017.
  • [16] Y.-J. Lu, Z.-Q. Wang, et al., “Conditional Diffusion Probabilistic Model for Speech Enhancement,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7402–7406.
  • [17] International Telecommunication Union, “Recommendation ITU–T P.862 Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone band and wideband digital codes,” 2000.
  • [18] C. H. Taal, R. C. Hendriks, et al., “A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 4214–4217.
  • [19] L. Metz, B. Poole, et al., “Unrolled Generative Adversarial Networks,” in 5th Int. Conf. on Learning Representations, ICLR, 2017.
  • [20] J. Song, C. Meng, and S. Ermon, “Denoising Diffusion Implicit Models,” in 9th Int. Conf. on Learning Representations, ICLR, 2021.
  • [21] G. Papamakarios, E.T. Nalisnick, D.J. Rezende, et al., “Normalizing Flows for Probabilistic Modeling and Inference,” Journal of Machine Learning Research, vol. 22, no. 57, pp. 1–64, 2021.
  • [22] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density Estimation Using Real NVP,” in 5th Int. Conf. on Learning Representations, ICLR, 2017.
  • [23] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A Flow-based Generative Network for Speech Synthesis,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 3617–3621.
  • [24] A. A. Nugraha, K. Sekiguchi, and K. Yoshii, “A Flow-Based Deep Latent Variable Model for Speech Spectrogram Modeling and Enhancement,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 1104–1117, 2020.
  • [25] M. Strauss and B. Edler, “A Flow-Based Neural Network for Time Domain Speech Enhancement,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 5754–5758.
  • [26] A. Mustafa, N. Pia, and G. Fuchs, “StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6034–6038.
  • [27] R. F. Lyon, “All-pole models of auditory filtering,” in Diversity in Auditory Mechanics, 1997, pp. 205–211.
  • [28] D.P. Kingma and P. Dhariwal, “Glow: Generative Flow with Invertible 1x1 Convolutions,” in Advances in Neural Information Processing Systems 31, 2018, pp. 10215–10224.
  • [29] L. Ardizzone, J. Kruse, et al., “Analyzing Inverse Problems with Invertible Neural Networks,” in 7th Int. Conf. on Learning Representations, ICLR, 2019.
  • [30] A. van den Oord, S. Dieleman, H. Zen, et al., “WaveNet: A Generative Model for Raw Audio,” in arXiv:1609.03499, 2016.
  • [31] C. Valentini-Botinhao, X. Wang, et al., “Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System Using Deep Recurrent Neural Networks,” in Proc. Interspeech Conf., 2016, pp. 352–356.
  • [32] C. Veaux, J. Yamagishi, and S. King, “The Voice Bank Corpus: Design, collection and data analysis of a large regional accent speech database,” in Int. Conf. Oriental COCOSDA held jointly with the Conf. on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013, pp. 1–4.
  • [33] J. Thiemann, N. Ito, and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” Proc. of Meetings on Acoustics, vol. 19, no. 1, pp. 035081, 2013.
  • [34] F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800–1807.
  • [35] M. Ravanelli et al., “SpeechBrain: A General-Purpose Speech Toolkit,” arXiv:2106.04624, 2021.
  • [36] Y. Hu and P. Loizou, “Evaluation of Objective Quality Measures for Speech Enhancement,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, 2008.
  • [37] T. Kastner and J. Herre, “An Efficient Model for Estimating Subjective Quality of Separated Audio Source Signals,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2019, pp. 95–99.
  • [38] M. Torcoli, T. Kastner, and J. Herre, “Objective Measures of Perceptual Audio Quality Reviewed: An Evaluation of their Application Domain Dependence,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 1530–1541, 2021.
  • [39] International Telecommunication Union, “Recommendation ITU–R BS.1534-3 Method for the subjective assessment of intermediate quality level of audio systems,” 2015.
  • [40] International Telecommunication Union, “Recommendation ITU–R BS.1770-4 Algorithms to measure audio programme loudness and true-peak audio level,” 2015.
  • [41] M. Strauss, J. Paulus, M. Torcoli, and B. Edler, “A Hands-On Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation,” in Proc. Interspeech Conf., 2021, pp. 3900–3904.
  • [42] M. Schoeffler et al., “webMUSHRA — A Comprehensive Framework for Web-based Listening Tests,” Journal of Open Research Software, vol. 6, no. 1, pp. 8, 2018.