
DDD: A Perceptually Superior Low-Response-Time DNN-based Declipper

Abstract

Clipping is a common nonlinear distortion that occurs whenever the input or output of an audio system exceeds the supported range. This phenomenon undermines not only the perception of speech quality but also downstream processes utilizing the disrupted signal. Therefore, a real-time-capable, robust, and low-response-time method for speech declipping (SD) is desired. In this work, we introduce DDD (Demucs-Discriminator-Declipper), a real-time-capable speech-declipping deep neural network (DNN) that requires less response time by design. We first observe that a previously untested real-time-capable DNN model, Demucs, exhibits reasonable declipping performance. We then utilize adversarial learning objectives to increase the perceptual quality of output speech without additional inference overhead. Subjective evaluations on harshly clipped speech show that DDD outperforms the baselines by a wide margin in terms of speech quality. We perform detailed waveform and spectral analyses to gain insight into the output behavior of DDD in comparison to the baselines. Finally, our streaming simulations show that DDD achieves sub-decisecond mean response times, outperforming the state-of-the-art DNN approach by a factor of six.

Index Terms—  Speech Declipping, Speech Enhancement, Adversarial Training

1 Introduction

Clipping occurs when an input or output of an audio system exceeds its supported range. The out-of-bounds samples are typically replaced with the maximum or minimum value, resulting in a jarring sound when played back [1]. Clipping may degrade perception for human listeners or hinder subsequent machine processing of speech [2, 3]. Therefore, lightweight methods to declip speech signals are desired.

While speech declipping (SD) has not been extensively investigated, audio declipping (AD) has been widely researched as a constrained optimization problem using non-DNN methods [4]. Among the past approaches, the non-parametric A-SPADE [5] has shown superb reconstruction performance under low-SNR conditions (≤ 5 dB) when evaluated on music [4]. However, when applied to speech, it was found to fall short of some DNN-based approaches [6] in terms of speech reconstruction quality.

Fig. 1: The architecture used to train DDD. In training, the output from our lightweight generator (green) and the original signal (black) are fed to the discriminators, providing an adversarial training objective that enhances the perceptual quality of restored speech signals. The discriminators are dropped at inference and incur no overhead.

On the other hand, although a few SD or AD methods employ deep neural networks (DNN), they often fail to outperform A-SPADE in the audio domain in terms of metrics such as ΔSNR (Signal-to-Noise Ratio) [7, 8]. Some other DNN-based methods reference a high number of “future” samples and have very high response times by design [9]. Others require much more computation and are not real-time capable [10, 6, 11]. While a practitioner’s first intuition may be to resort to fast speech enhancement (SE) models such as Conv-Tasnet [12] or DPRNN [13], our preliminary experiments showed that these models fail to converge on our SD dataset, presumably due to the fundamental differences between the additive environmental noises and the subtractive clipping noises.

To this end, we propose, train, and extensively evaluate DDD (Demucs-Discriminator-Declipper), a DNN-based declipper. We train Demucs [14] with an adversarial training objective given by the HiFiGAN [15] discriminator to declip better while preserving inference speed. Our subjective evaluations on harshly clipped speech confirm that DDD outperforms T-UNet [9] and A-SPADE in terms of perceived audio quality. We qualitatively analyze waveforms and spectra to investigate how the behavior of DDD differs from other DNN-based methods. Finally, we find that our model, which is real-time capable on consumer CPUs, can be tuned to achieve response times under 100 ms, one-sixth that of T-UNet.

All source code and pretrained models required to reproduce our experiments are publicly available at https://github.com/stet-stet/DDD. Audio samples can be heard at https://stet-stet.github.io/DDD.

2 Method

2.1 Problem Formulation

Let $\mathbf{y}\in\mathbf{S}$ be an arbitrary-length speech signal, where $\mathbf{S}=\cup_{t=1}^{\infty}[-1,1]^{t}$. Let us also denote the length of a signal $\mathbf{y}$ by $\text{len}(\mathbf{y})$; in other words, $\mathbf{y}\in[-1,1]^{\text{len}(\mathbf{y})}$. Given any threshold $\theta\in[0,1]$, we can define a hard-clip map $f_{\theta}:\mathbf{S}\to\mathbf{S}$:

$$[f_{\theta}(\mathbf{y})]_{i}=\begin{cases}y_{i}&\text{if }|y_{i}|\leq\theta\\ \theta\cdot\mathrm{sgn}(y_{i})&\text{otherwise}\end{cases}\qquad(1)$$

where $y_{i}$ is the $i$-th element of $\mathbf{y}$. The aim of the SD task is to infer $\mathbf{y}$ from $f_{\theta}(\mathbf{y})$. This is in stark contrast to common formulations of speech enhancement (SE), where the aim is frequently to recover the original signal from its sum with an instance of roughly stochastic environmental noise.
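For concreteness, the hard-clip map of Eq. (1) amounts to a single NumPy call; the function below is a minimal sketch under that reading, not the released implementation.

```python
import numpy as np

def hard_clip(y: np.ndarray, theta: float) -> np.ndarray:
    """Apply the hard-clip map f_theta of Eq. (1) to a signal y in [-1, 1]."""
    # Samples within [-theta, theta] pass through; the rest saturate at +-theta.
    return np.clip(y, -theta, theta)
```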

2.2 DDD

Our setup for training DDD is shown in Fig. 1. We sought improvements on two fronts: perceived speech quality and inference lookahead. To this end, we adopt the Generative Adversarial Network (GAN) [16] framework to introduce adversarial training objectives.

The Generator. We sought a model allowing for low-response-time, real-time inference on consumer-grade CPUs. We found an instance of causal Demucs [14], shown in Fig. 1, to be the most suitable. The five 1D-strided-convolution blocks of the Demucs encoder each downsample the input by a factor of 4. Channels are doubled every block, with the exception of the first block, which maps monochannel audio to a 64-channel representation. The decoder reverses these steps using transposed convolutions. Between the encoder and decoder is a causal 2-layer LSTM. To further reduce temporal lookahead, we upsample the input and downsample the output by a factor of four. Thus, the generator needs no more than 500 samples of lookahead. Although this exact setting was reported to be incapable of real-time streaming inference [14], we explain in Section 3.4 a way to accelerate inference at the expense of response time. Notably, faster SE models such as Conv-Tasnet [12] and DPRNN [13] failed to converge in our preliminary experiments.
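As a rough illustration of the layer arithmetic described above, the following is a minimal, simplified PyTorch sketch of a causal Demucs-style generator; kernel size 8 and stride 4 are assumed from the Demucs defaults, and the skip connections, gated activations, and 4x input/output resampling of the actual model are omitted.

```python
import torch
import torch.nn as nn

class CausalDemucsSketch(nn.Module):
    """Sketch of a causal Demucs-style generator: 5 strided-conv encoder blocks
    (stride 4, channels doubled each block starting from 64), a causal 2-layer
    LSTM bottleneck, and a mirrored transposed-conv decoder."""

    def __init__(self, depth=5, init_channels=64, kernel=8, stride=4):
        super().__init__()
        self.encoder, self.decoder = nn.ModuleList(), nn.ModuleList()
        ch_in, ch_out = 1, init_channels
        for i in range(depth):
            self.encoder.append(nn.Sequential(
                nn.Conv1d(ch_in, ch_out, kernel, stride), nn.ReLU()))
            # Mirror the encoder; the final output layer (i == 0) has no ReLU.
            self.decoder.insert(0, nn.Sequential(
                nn.ConvTranspose1d(ch_out, ch_in, kernel, stride),
                nn.ReLU() if i > 0 else nn.Identity()))
            ch_in, ch_out = ch_out, ch_out * 2
        self.lstm = nn.LSTM(ch_in, ch_in, num_layers=2, batch_first=True)

    def forward(self, x):                     # x: (batch, 1, time)
        for block in self.encoder:
            x = block(x)
        x, _ = self.lstm(x.permute(0, 2, 1))  # unidirectional, hence causal
        x = x.permute(0, 2, 1)
        for block in self.decoder:
            x = block(x)
        return x
```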

The Discriminator. We use Multi-Scale Discriminators (MSD) and Multi-Period Discriminators (MPD), which have been successfully applied to speech generation [15] and speech enhancement [17]. These discriminators are stacks of strided convolutional layers that take different subsets of the time-domain speech samples as input: speech subsampled with periods of {2, 3, 5, 7, 11} for the MPD, and {1, 2, 4}x average-pooled speech for the MSD. The discriminators are trained to label clean speech signals as 1 and restored signals as 0. We use LS-GAN [18] objectives to stabilize training. For the generator, we additionally use the feature-matching loss outlined in [19] and successfully applied in [15, 17].
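For reference, the LS-GAN and feature-matching objectives mentioned above take the simple form sketched below; this is a generic sketch assuming each discriminator exposes its intermediate feature maps, not the exact code used for DDD.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    """LS-GAN discriminator objective: push scores for clean speech toward 1
    and scores for restored speech toward 0."""
    return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """LS-GAN generator objective: push scores for restored speech toward 1."""
    return torch.mean((d_fake - 1.0) ** 2)

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between intermediate discriminator features of clean and
    restored speech, summed over layers (MelGAN-style feature matching)."""
    return sum(torch.mean(torch.abs(r - f))
               for r, f in zip(feats_real, feats_fake))
```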

Training Objective. We used the L1 loss and multi-resolution STFT loss, as follows.

$$\frac{1}{T}\Big[\,\|\mathbf{y}-\hat{\mathbf{y}}\|_{1}+\sum_{i=1}^{3}L_{\mathrm{stft}}^{(i)}(\mathbf{y},\hat{\mathbf{y}})\,\Big]\qquad(2)$$

where $\mathbf{y}$ is the ground truth, $\hat{\mathbf{y}}$ is the network output, and

$$L_{\mathrm{stft}}^{(i)}(\mathbf{y},\hat{\mathbf{y}})=\frac{\big\|\,|\mathrm{STFT}^{(i)}(\mathbf{y})|-|\mathrm{STFT}^{(i)}(\hat{\mathbf{y}})|\,\big\|_{F}}{\big\|\,|\mathrm{STFT}^{(i)}(\mathbf{y})|\,\big\|_{F}}+\big\|\log|\mathrm{STFT}^{(i)}(\mathbf{y})|-\log|\mathrm{STFT}^{(i)}(\hat{\mathbf{y}})|\,\big\|_{1},\qquad(3)$$

where $\|\cdot\|_{1}$ and $\|\cdot\|_{F}$ denote the L1 and Frobenius norms, respectively. FFT sizes of 512, 1024, and 2048 were used for each $\mathrm{STFT}^{(i)}$. The final training objective for the generator was the sum of the above, the feature-matching loss multiplied by four, and the adversarial loss given by the discriminators.
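A possible PyTorch rendering of Eqs. (2)-(3) is sketched below; since hop and window lengths are not specified in the text, a Hann window with 75% overlap is assumed, and a small clamp is added to keep the logarithm finite.

```python
import torch

def stft_mag(x, n_fft):
    """Magnitude STFT with a Hann window; hop = n_fft // 4 (assumed)."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=n_fft // 4,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)  # clamp keeps log() finite

def stft_loss(y, y_hat, n_fft):
    """One resolution of Eq. (3): spectral convergence + log-magnitude L1."""
    Y, Y_hat = stft_mag(y, n_fft), stft_mag(y_hat, n_fft)
    sc = torch.norm(Y - Y_hat, p="fro") / torch.norm(Y, p="fro")
    log_mag = torch.sum(torch.abs(torch.log(Y) - torch.log(Y_hat)))
    return sc + log_mag

def reconstruction_loss(y, y_hat, fft_sizes=(512, 1024, 2048)):
    """Eq. (2): L1 waveform loss plus the multi-resolution STFT loss,
    averaged over the number of samples T."""
    T = y.numel()
    loss = torch.sum(torch.abs(y - y_hat))
    loss = loss + sum(stft_loss(y, y_hat, n) for n in fft_sizes)
    return loss / T
```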

Fig. 2: Violin plots of subjective evaluation results on the VBDM-1dB-Testset (left) and the DNS-1dB-Testset (right).

Fig. 3: Reconstruction results for a hard-clipped signal (SNR = 1 dB). DDD, T-UNet, and DD outputs are shown together with the original clean speech. (a) Typical waveform of clean and declipped speech; the two gray horizontal lines denote the clipping thresholds used. (b) Spectrum of the region in (a) without LUFS normalization; DD and T-UNet often exhibit a flat, low spectrum beyond 3-4 kHz, failing to model any formants in that range. DDD-declipped speech shows elements of natural speech that the T-UNet and DD baselines do not exhibit: their declipped waveforms typically fail to recreate (a) the “spiky” contours of clean speech as well as (b) some higher-order formants.

3 Experimental Setup

3.1 Dataset & Preprocessing

The “clean” splits of the Voicebank-DEMAND dataset [20], comprising a 9.4-hour “train” split and a 35-minute “test” split, were downsampled to 16 kHz. During training, the “train”-split samples were clipped on-the-fly with a random threshold and aligned with the unprocessed speech. To place more emphasis on heavily clipped speech, we sampled $s$ uniformly from $[-2.0, -0.9]$ and used $\theta=10^{s}$ as the threshold. This range of $\theta$ covers SNRs of approximately 1 dB to 9 dB.
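The on-the-fly clipping augmentation described above reduces to a few lines; the sketch below assumes a NumPy-based data pipeline.

```python
import numpy as np

def random_clip(y: np.ndarray, rng: np.random.Generator):
    """Sample s ~ U[-2.0, -0.9], set theta = 10**s, and hard-clip the clean
    waveform; the (clipped, clean) pair then forms one training example."""
    s = rng.uniform(-2.0, -0.9)
    theta = 10.0 ** s
    return np.clip(y, -theta, theta), theta
```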

The 150 clean “test”-split utterances of the Interspeech 2020 DNS-Challenge dataset [21], also downsampled to 16 kHz, were used together with the “test” split of Voicebank-DEMAND for evaluation. We clip these sets, henceforth “VBDM-Testset” and “DNS-Testset”, to SNRs of 1 dB, 3 dB, 7 dB, and 15 dB, calculating the SNR as

$$\text{SNR}=10\log_{10}\frac{\|\mathbf{y}\|^{2}}{\|\mathbf{x}-\mathbf{y}\|^{2}},\qquad(4)$$

where $\mathbf{y}$ and $\mathbf{x}$ represent the clean and clipped signals, respectively. We denote each of these sets as “{VBDM,DNS}-XdB-Testset”, where X is the SNR.
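The text does not state how the clipping threshold is chosen to hit each target SNR; one straightforward option, sketched here under that assumption, is a bisection search over θ using Eq. (4), since the SNR increases monotonically with the threshold.

```python
import numpy as np

def snr_db(clean: np.ndarray, clipped: np.ndarray) -> float:
    """Eq. (4): SNR of the clipped signal with respect to the clean signal."""
    noise = clean - clipped
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def clip_to_snr(clean: np.ndarray, target_db: float, iters: int = 50) -> np.ndarray:
    """Find a hard-clip threshold giving approximately `target_db` SNR via
    bisection (SNR increases monotonically with the threshold)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        clipped = np.clip(clean, -theta, theta)
        if snr_db(clean, clipped) < target_db:
            lo = theta   # too harsh: raise the threshold
        else:
            hi = theta   # too mild: lower the threshold
    return np.clip(clean, -hi, hi)
```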

3.2 Baseline Approaches

As a DNN baseline, we re-implemented and trained T-UNet [9], the state of the art for the SD task. This architecture employs strided convolutions to downsample the input speech by a factor of two at every layer, then upsamples with sub-pixel convolutions to generate raw waveforms. We do not use a discriminator for T-UNet. As a non-DNN baseline, we use A-SPADE, which has shown competitive performance in AD [8]; we use the MATLAB implementation provided by a past work [4]. Finally, as an ablation, we train our generator without the discriminators and refer to the resulting model as DD.

3.3 Network Training Details

For T-UNet, we followed the training and inference procedures given in the original paper [9]. We padded and split all utterances into $2^{14}$-sample segments for compatibility with T-UNet; at inference, the segments were stitched back together. For DD and DDD, 24,000-sample-long utterances were used, and inference was performed without any segmentation.

All models were trained for a total of 75 epochs with the AdamW optimizer, using a learning rate of $10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of $10^{-2}$. A batch size of 32 was used for DD and T-UNet, and 2 for DDD.
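Combining the components sketched in Section 2.2, one generator update might look roughly as follows; `reconstruction_loss`, `lsgan_g_loss`, and `feature_matching_loss` refer to the earlier sketches, the discriminator is assumed to return a score together with its intermediate features, and the 4x feature-matching weight follows Section 2.2. The AdamW settings listed above would be used when constructing `optim_g`.

```python
import torch

def generator_step(generator, discriminator, optim_g, clipped, clean):
    """Sketch of one generator update: reconstruction + adversarial
    + 4x feature-matching loss (interfaces are assumptions, not released code)."""
    restored = generator(clipped)
    score_fake, feats_fake = discriminator(restored)
    _, feats_real = discriminator(clean)
    loss = (reconstruction_loss(clean, restored)
            + lsgan_g_loss(score_fake)
            + 4.0 * feature_matching_loss(feats_real, feats_fake))
    optim_g.zero_grad()
    loss.backward()
    optim_g.step()
    return loss.item()

# Optimizer construction mirroring the hyperparameters stated in the text:
# optim_g = torch.optim.AdamW(generator.parameters(), lr=1e-4,
#                             betas=(0.9, 0.999), weight_decay=1e-2)
```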

Table 1: All models were trained on the VoiceBank-DEMAND dataset and evaluated on the indicated datasets. The proposed models have faster response times than the baseline approaches. PESQ / STOI (%) values are reported for each set and input SNR. Notably, DDD was trained in a GAN framework, so the objective metrics below may not fully represent perceptual audio quality, which is better illustrated by Fig. 2.
Method          | VBDM 1 dB | VBDM 3 dB | VBDM 7 dB | VBDM 15 dB | DNS 1 dB  | DNS 3 dB  | DNS 7 dB  | RTF   | Response (ms)
Clipped         | 1.15 / 78 | 1.39 / 86 | 2.02 / 94 | 3.20 / 98  | 1.14 / 70 | 1.33 / 82 | 1.93 / 92 | -     | -
A-SPADE [5]     | 1.54 / 80 | 2.02 / 90 | 2.85 / 96 | 3.89 / 99  | 1.37 / 76 | 1.73 / 87 | 2.66 / 96 | >10   | -
T-UNet [9]      | 2.97 / 94 | 3.40 / 97 | 3.84 / 98 | 4.18 / 99  | 2.13 / 87 | 2.87 / 93 | 3.61 / 97 | 0.634 | 544
DD (Ablation)   | 2.99 / 94 | 3.40 / 97 | 3.85 / 98 | 4.24 / 100 | 2.16 / 88 | 2.78 / 93 | 3.54 / 97 | 0.881 | 88
DDD (Proposed)  | 2.55 / 93 | 3.27 / 96 | 3.85 / 98 | 4.20 / 99  | 1.82 / 86 | 2.48 / 92 | 3.36 / 96 | 0.881 | 88

Each cell reports PESQ / STOI (%) at the given input SNR. VBDM: VoiceBank-DEMAND test set; DNS: DNS-Challenge test set; RTF and Response are from the streaming simulations.

3.4 Evaluation Metrics

Subjective Evaluation. Subjective tests were performed to more accurately gauge absolute speech quality. We used the webMUSHRA framework [22] to run a MUSHRA-like test [23]. Twenty and fifteen samples randomly selected from the VBDM-1dB-Testset and the DNS-1dB-Testset, respectively, were run through the trained models. The inputs, outputs, and associated clean samples were then normalized to -27 LUFS. Twenty-four participants with more than two years of experience in music production or audio engineering were asked to score each normalized sample on a 100-point scale based on speech quality, taking any alteration of speech content into account. Most participants were unfamiliar with hearing tests. Via post-survey interviews, we filtered out respondents who failed to recognize the existence of a hidden reference within the first four sets.

We excluded A-SPADE outputs from the longer 15 DNS-Testset questions to minimize listener fatigue. Excluding it from only one of the two datasets is justifiable since A-SPADE is a non-parametric model.

Inference Time Evaluation. We ran inference on the VBDM-Testset with a batch size of 1 and divided each approach's total inference time by the total set duration to calculate its throughput. We also calculated the MACs (multiply-accumulate operations) per sample for each DNN-based approach.

To measure response times and real-time factors (RTF), we simulated a typical audio streaming application. We fed samples into the model at a rate of 16,000 samples/sec for 100 seconds, then calculated the mean time interval between the feeding and the output of every 500th sample. Samples were buffered and fed into the network once enough had been gathered to yield output. Due to its architecture, T-UNet needs at least $2^{14}$ samples to generate an output, so $2^{14}$ samples were buffered before yielding any output. DD and DDD need much less, requiring only enough samples to form one input vector for the LSTM. To enable our model to run in real time in streaming settings, we buffered enough samples to produce four input vectors for the LSTM. In this case, the required lookahead is 1,429 samples, owing to how we upsample inputs by a factor of four. This evaluation was performed on a Ryzen 7 5700X CPU @ 3.4 GHz.
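The simulation can be approximated with the loop below; `model_forward`, the per-block timing, and the use of each block's first sample as the reference point are simplifications of the procedure above (which tracks every 500th sample), and wall-clock numbers will of course differ across machines.

```python
import time
import numpy as np

def simulate_streaming(model_forward, signal, sr=16000, chunk=1429):
    """Crude streaming simulation: samples arrive at `sr` Hz, are buffered in
    blocks of `chunk` samples, and each block is processed as soon as it is
    complete. The response time of each block's first sample (buffering delay
    plus processing backlog) is reported."""
    t_start = time.perf_counter()
    responses = []
    for begin in range(0, len(signal) - chunk + 1, chunk):
        block = np.asarray(signal[begin:begin + chunk])
        block_ready = t_start + (begin + chunk) / sr   # last sample's arrival
        now = time.perf_counter()
        if now < block_ready:
            time.sleep(block_ready - now)              # wait for the full block
        model_forward(block)
        first_arrival = t_start + begin / sr
        responses.append(time.perf_counter() - first_arrival)
    return float(np.mean(responses))
```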

Objective Evaluation. Two objective metrics were employed: PESQ (Perceptual Evaluation of Speech Quality) [24] and STOI (Short-Time Objective Intelligibility) [25]. PESQ is a similarity-based metric that predicts speech quality as a score between -0.5 and 4.5. STOI is likewise a similarity-based metric that aims to predict the intelligibility of speech. Although both have recently been singled out as unreliable measures of absolute speech quality because they rely on similarity to a reference [26], they have seen popular use in the SD literature [27, 9]. Following common practice, we report these metrics for a variety of input SNRs.
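Both metrics are available as Python packages (`pesq` and `pystoi`); assuming those package interfaces, evaluation of a clean/declipped pair at 16 kHz looks roughly as follows.

```python
import numpy as np
from pesq import pesq     # pip install pesq
from pystoi import stoi   # pip install pystoi

def score_pair(clean: np.ndarray, declipped: np.ndarray, sr: int = 16000):
    """Wide-band PESQ (-0.5 to 4.5) and STOI (0 to 1, reported here as %)."""
    p = pesq(sr, clean, declipped, "wb")
    s = stoi(clean, declipped, sr, extended=False)
    return p, 100.0 * s
```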

4 Results

4.1 Subjective Evaluations

Subjective evaluation results are presented in Fig. 2. Out of twenty-four responses, four were filtered out as described in Section 3.4. To account for each participant's different scoring criteria, the 20 scores for each sample were averaged before drawing the violin plots. A pairwise post-hoc paired t-test with Bonferroni correction concludes that the differences between DDD and all other models are statistically significant for both datasets (p < 0.0001). While still distinguishable from the hidden reference, DDD appears to be a significant improvement over T-UNet and DD. Moreover, comparing DD and DDD suggests that the adversarial training objectives may have helped boost the quality of speech produced by DDD.

The scores for the DNS-Testset are lower than those for the Voicebank-DEMAND dataset, a natural result since the models were trained on the latter; the DNS-Challenge evaluations are in effect zero-shot evaluations on utterances recorded in a different environment. The long "tail" of the score distribution exhibited by T-UNet on this set hints at catastrophic failures of T-UNet in unseen recording environments, implying that DDD is more robust than T-UNet.

4.2 Qualitative Analysis

In Fig. 3a, a typical waveform of clean and clipped (1 dB) speech is presented alongside speech declipped by T-UNet, DD, and DDD. Notably, T-UNet and DD are often unable to reconstruct the up-and-down "ruggedness" of the clean-speech waveform.

In regions that were previously saturated, the DD and T-UNet outputs are less "rugged" than the clean signal; indeed, over the entire VBDM-1dB-Testset, the associated clean signals contain a total of 1.52M local extrema within the previously saturated regions, about 30-40% more than the DD (1.17M extrema) or T-UNet outputs (1.09M extrema). This problem is alleviated by DDD (1.67M extrema). We therefore suspect that the inability of DD and T-UNet to achieve a perfect reconstruction of speech is at least partially associated with their incapacity to reproduce the up-and-down "ruggedness" of the clean-speech waveform.
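The extrema count can be reproduced with a few NumPy operations; in the sketch below, a sample is treated as previously saturated when the clipped signal sits at ±θ, and a local extremum is a strict sign change of the first difference. Both are assumptions on our part, since the exact criterion is not spelled out above.

```python
import numpy as np

def count_saturated_extrema(clean: np.ndarray, clipped: np.ndarray, theta: float) -> int:
    """Count local extrema of `clean` that fall inside regions where `clipped`
    was saturated at +-theta."""
    saturated = np.abs(clipped) >= theta - 1e-8          # previously clipped samples
    diff_sign = np.sign(np.diff(clean))
    # A local extremum sits where the first difference changes sign.
    extrema = np.flatnonzero(diff_sign[:-1] * diff_sign[1:] < 0) + 1
    return int(np.count_nonzero(saturated[extrema]))
```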

Fig. 3b shows the Fourier transform coefficients of the same region. Across the entire dataset, all model outputs conform to the original spectrum in the sub-1 kHz region. However, T-UNet frequently fails to reproduce higher-order formants at 3 kHz and beyond. In contrast, DDD does model the higher-frequency (2-4 kHz) region, sometimes producing formants minutely different from those of the clean speech, yet yielding coherent, high-quality speech nonetheless. These observations may explain the higher subjective scores of DDD presented in Fig. 2.

4.3 Objective Evaluations

Table 1 presents objective evaluation results. DD performs comparably to T-UNet in terms of similarity-based metrics, with DDD following close behind and A-SPADE scoring far lower. However, we cannot conclude that T-UNet or DD provides better reconstructions: subjective evaluations are known to be more reliable [23]. Moreover, similarity-based scores have previously been pointed out to be less reliable for adversarially trained speech networks such as DDD [17, 26]. Reflecting on the previous analyses, we may conclude that DDD has traded away an acceptable degree of dataset conformance for perceptual quality.

RTF and response time measurements are presented in the rightmost columns of Table 1. MACs per sample and throughput with respect to temporal length were measured as well: 0.19M MACs/sample and 0.15x for T-UNet, and 0.48M MACs/sample and 0.34x for DD and DDD. As noted in past works [4], the RTF of A-SPADE varies with input SNR but was typically over 20. While both DNN-based approaches have a sub-unity RTF, T-UNet has a mean response time of over half a second, which is unacceptably high for many applications. In contrast, DD and DDD have sub-decisecond mean response times. Moreover, as discussed in Section 3.4, DD and DDD can trade RTF against response time via buffering: they can, for instance, be configured to match the RTF of T-UNet and still yield one third the response time. Therefore, DD and DDD may be better suited to real-time audio processing.

5 Conclusion

In this work we proposed and trained DDD, a DNN model capable of real-time, low-response-time, high-quality declipping. We adopted the GAN framework, along with several training refinements, to boost the capabilities of Demucs [14]. MUSHRA-like subjective evaluations on harshly clipped speech revealed that DDD outperforms previously proposed methods by a large margin. Our qualitative analyses showed how existing approaches suffer from "round-waveform" behavior accompanied by a neglect of high-frequency modeling. Finally, DDD was found to exhibit a response time far lower than the previous state of the art. While adversarial training objectives enabled perceptually better speech declipping with lower response times, exact declipping of speech remains an open problem.

References

  • [1] C. Laguna and A. Lerch, “An efficient algorithm for clipping detection and declipping audio,” in Audio Engineering Society Convention 141. Audio Engineering Society, 2016.
  • [2] C. Tan, B. C. J. Moore, and N. Zacharov, “The effect of nonlinear distortion on the perceived quality of music and speech signals,” Journal of the Audio Engineering Society, vol. 51, no. 11, pp. 1012–1031, 2003.
  • [3] Y. Tachioka, T. Narita, and J. Ishii, “Speech recognition performance estimation for clipped speech based on objective measures,” Acoustical Science and Technology, vol. 35, no. 6, pp. 324–326, 2014.
  • [4] P. Záviška, R. Rajmic, A. Ozerov, and L. Rencker, “A survey and an extensive evaluation of popular audio declipping methods,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 1, pp. 5–24, 2020.
  • [5] S. Kitić, N. Bertin, and R. Gribonval, “Sparsity and cosparsity for audio declipping: a flexible non-convex approach,” in Latent Variable Analysis and Signal Separation: 12th International Conference, LVA/ICA 2015, Liberec, Czech Republic, August 25-28, 2015, Proceedings 12. Springer, 2015, pp. 243–250.
  • [6] T. Tanaka, K. Yatabe, and Y. Oikawa, “Upglade: Unplugged plug-and-play audio declipper based on consensus equilibrium of dnn and sparse optimization,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [7] T. Tanaka, K. Yatabe, M. Yasuda, and Y. Oikawa, “Applade: Adjustable plug-and-play audio declipper combining dnn with sparse optimization,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 1011–1015.
  • [8] J. Imort, G. Fabbro, M. A. Martínez-Ramírez, S. Uhlich, Y. Koyama, and Y. Mitsufuji, “Distortion audio effects: Learning how to recover the clean signal,” in 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022.
  • [9] A. A. Nair and K. Koishida, “Cascaded time + time-frequency unet for speech enhancement: Jointly addressing clipping, codec distortions, and gaps,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7153–7157.
  • [10] E. Moliner, J. Lehtinen, and V. Välimäki, “Solving audio inverse problems with a diffusion model,” arXiv preprint arXiv:2210.15228, 2022.
  • [11] H. Liu, Q. Kong, Q. Tian, Y. Zhao, D. Wang, C. Huang, and Y. Wang, “Voicefixer: Toward general speech restoration with neural vocoder,” arXiv preprint arXiv:2109.13731, 2021.
  • [12] Y. Luo and N. Mesgarani, “Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  • [13] Y. Luo, Z. Chen, and T. Yoshioka, “Dual-path rnn: efficient long sequence modeling for time-domain single-channel speech separation,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 46–50.
  • [14] A. Défossez, G. Synnaeve, and Y. Adi, “Real time speech enhancement in the waveform domain,” Proc. Interspeech 2020, pp. 3291–3295, 2020.
  • [15] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [17] J. Su, Z. Jin, and A. Finkelstein, “Hifi-gan-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021, pp. 166–170.
  • [18] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2794–2802.
  • [19] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. C. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [20] C. Valentini-Botinhao, “Noisy speech database for training speech enhancement algorithms and tts models, [dataset],” University of Edinburgh. School of Informatics. Centre for Speech Technology Research (CSTR), 2016.
  • [21] C. K. A. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, “The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,” arXiv preprint arXiv:2005.13981, 2020.
  • [22] M. Schoeffler, S. Bartoschek, F. Stöter, M. Roess, S. Westphal, B. Edler, and J. Herre, “webmushra—a comprehensive framework for web-based listening tests,” Journal of Open Research Software, vol. 6, no. 1, 2018.
  • [23] “Method for the subjective assessment of intermediate quality level of audio systems,” Rec. ITU-R BS.1534-3, ITU, Oct. 2015.
  • [24] ITU-T, “Perceptual evaluation of speech quality (pesq): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P. 862, 2001.
  • [25] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
  • [26] P. Manocha, Z. Jin, and A. Finkelstein, “Audio Similarity is Unreliable as a Proxy for Audio Quality,” in Proc. Interspeech 2022, 2022, pp. 3553–3557.
  • [27] W. Mack and E. A. P. Habets, “Declipping speech using deep filtering,” in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019, pp. 200–204.