
A lightweight and robust method for blind wideband-to-fullband extension of speech

Jan Büthe

Jan Büthe is with Xiph.org ([email protected]).
Abstract

Reducing the bandwidth of speech is common practice in resource-constrained environments like low-bandwidth speech transmission or low-complexity vocoding. We propose a lightweight and robust method for extending the bandwidth of wideband speech signals that is inspired by classical methods developed in the speech coding context. The resulting model has just ∼370 K parameters and a complexity of ∼140 MFLOPS (or ∼70 MMACS). With a frame size of 10 ms and a lookahead of just 0.27 ms, the model is well-suited for common wideband speech codecs. We evaluate the model's robustness by pairing it with the Opus SILK speech codec (1.5 release) and verify in a P.808 DCR listening test that it significantly improves quality from 6 to 12 kb/s. We also demonstrate that Opus 1.5 together with the proposed bandwidth extension at 9 kb/s meets the quality of 3GPP EVS at 9.6 kb/s and that of Opus 1.4 at 18 kb/s, showing that the blind bandwidth extension can meet the quality of classical guided bandwidth extensions.

I Introduction

Limiting the bandwidth of speech is a common technique for dealing with constrained resources. The most prominent example is speech coding for real-time communication which often uses narrowband codecs (e.g. G.711 [1]) or wideband codecs (e.g. AMR-WB [2], Opus SILK [27]). A second example is neural vocoding in complexity-constrained environments (e.g. LPCNet [25]) which is used for many applications like text-to-speech synthesis or speech enhancement.

While bandwidth reduction is effective for saving resources and (mostly) maintains speech intelligibility, it also degrades the listening experience and can lead to listener fatigue. Therefore, a blind bandwidth extension (BWE) method can have a large positive impact for billions of listeners every day. However, low complexity is critical for the applications stated above as typical target devices like smartphones can have rather limited compute. Furthermore, robustness is essential since any real-world deployment faces huge variability of input signals.

BWE is a well-studied topic and both classical ([18, 8, 29, 28, 4]) and DNN-based ([10, 23, 17, 19, 13]) methods have been proposed for this task. While classical methods are very low in complexity, they struggle with blind highband estimation and are therefore most effective when provided with a bit of side information. DNN-based methods, on the other hand, are much better at high-band modeling, but even dedicated low-complexity algorithms still operate in the range of multiple GFLOPS (e.g. ∼13 GFLOPS in [23] or ∼7 GFLOPS in [13]), which prevents their deployment on smaller devices.

In this paper we seek to overcome this problem by combining the high-band modeling capacity of data-driven, DNN-based methods with the simplicity and low complexity of DSP-based BWE methods. The approach is inspired by classical time-domain bandwidth extension, where a bandwidth-extending operation like non-linear function application or spectral folding is applied to the upsampled signal and combined with time-varying spectral shaping filters. The signal-processing part of the resulting algorithm only consists of classical DSP, i.e. fixed non-linear mapping, fixed and time-varying linear filtering, and time-varying sample-wise weighting. The time-varying filters and sample-wise weights in turn are produced by a small DNN which thus governs the content and shape of the generated highband signal. The resulting model has ∼370 K parameters and a computational complexity of ∼140 MFLOPS (or ∼70 MMACS) which makes it suitable for use even on older smartphone devices. Furthermore, since it is built around a low-delay upsampler, the signal path induces a delay of only 13 samples at 48 kHz or ∼0.27 ms and the feature path has a framing delay of 10 ms. This means that if combined with a speech production system that operates on multiples of 10 ms (which most speech codecs do), the total added delay is as low as ∼0.27 ms.

To test model robustness, we combine it with the Opus codec (1.5 release) and confirm in a P.808 listening test (demo samples, including vocoding examples not part of the listening test, are available at https://janpbuethe.github.io/BWEDemo) that BWE provides consistent improvement for all tested bitrates. We furthermore include the superwideband codec EVS [3] at 9.6 kb/s and Opus 1.4 at 18 kb/s, which produces fullband speech in a hybrid coding mode. The test results show that both are statistically tied with the bandwidth-extended Opus 1.5 at 9 kb/s, showing that Opus 1.5 with blind bandwidth extension can meet the quality of codecs with classical parametric or semi-parametric bandwidth extensions.

Finally, it should be pointed out that the implementation of the proposed BWE method is based on the adaptive DDSP modules AdaConv and AdaShape proposed in [7] and [6]. The successful application to low-complexity BWE demonstrates the general usefulness of this adaptive signal processing approach, which likely extends to further applications.

A python implementation is available at https://gitlab.xiph.org/xiph/opus/-/tree/exp_bwe/dnn/torch/osce (BBWENet).

II Model Description

We propose a model based on the classical approach of pre-filtering, upsampling, bandwidth extension and post-filtering [18]. A high-level overview of the model is given in Figure 1.

Adaptive pre- and post-filtering of the signal is implemented using the AdaConv module proposed in [7] and extended to multiple input and output channels in [6]. AdaConv is similar to regular Conv1d layers but the weights are adapted at a fixed rate (200 Hz in this case) based on a latent feature vector provided by the feature encoder depicted on the left side of Figure 1.

For upsampling, the model leverages the libopus 16 to 48 kHz upsampler (https://gitlab.xiph.org/xiph/opus), which operates in two stages: in the first stage the signal is upsampled by a factor of two using IIR filters, and in the second stage a 1.5x interpolation with short FIR filters is performed. The upsampler is both low in complexity and has a low delay of 13 samples at 48 kHz, which is also the total delay of the signal path on the right-hand side of Figure 1. The IIR filters are approximated by long FIR filters for training.
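As a rough reference point, the following is a minimal two-stage 16-to-48 kHz resampling sketch in Python using scipy. It is not the libopus resampler: the filter designs (and therefore the delay) differ from the 13-sample-delay implementation described above, and both stages here use polyphase FIR filters rather than the IIR first stage used in libopus.

import numpy as np
from scipy import signal

def upsample_16_to_48(x16):
    # Stage 1: upsample by a factor of two (libopus uses IIR filters here;
    # a polyphase FIR is used in this sketch for simplicity).
    x32 = signal.resample_poly(x16, up=2, down=1)
    # Stage 2: 1.5x interpolation (3/2 polyphase resampling).
    x48 = signal.resample_poly(x32, up=3, down=2)
    return x48

x16 = np.random.randn(160)    # 10 ms of audio at 16 kHz
x48 = upsample_16_to_48(x16)  # 480 samples at 48 kHz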

[Figure 1 block diagram. Feature encoder: Conv(k=3, s=1), Conv(k=3, s=1), TConv(k=2, s=2), GRU, producing latent vectors $\varphi_{n}$. Signal path components: $y_{16}(t)$, AdaConv(k=15), 2x upsampling, AdaShape, NonLin, AdaConv(k=25), 1.5x upsampling, $y_{32}(t)$, AdaShape, NonLin, AdaConv(k=15), $y_{48}(t)$.]
Figure 1: High-level overview of the blind bandwidth-extension model. The feature encoder on the left side calculates a sequence of latent feature vectors from a sequence of 72-dimensional feature vectors containing spectral information. The latent feature vectors are then used by the AdaConv modules to steer pre- and post-filtering and by the AdaShape module to adaptively extend the bandwidth of the input signal.

For the actual bandwidth extension, the model uses a hybrid approach that combines two common methods for time-domain bandwidth extension. The first one is non-linear function application, which has the advantage of generating a consistent harmonic extension if the baseband is quasi-periodic (voiced speech). We deviate from the usual choice of non-linearity (absolute value or rectified linear unit) and instead choose a non-linearity that extends the signal more aggressively, given in (1). It is designed to approximately preserve the scale of the signal and to induce a similar amount of distortion regardless of the scale of the input signal.

$f(x) = x\sin(\log|x|)$ (1)
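A minimal numpy sketch of the non-linearity (1); the eps guard against log(0) is an implementation detail assumed here rather than taken from the text.

import numpy as np

def extension_nonlinearity(x, eps=1e-9):
    # f(x) = x * sin(log|x|) from Eq. (1); eps avoids log(0) for zero samples.
    return x * np.sin(np.log(np.abs(x) + eps))

x16_up = np.random.randn(480)              # hypothetical upsampled baseband frame
extended = extension_nonlinearity(x16_up)  # roughly scale-preserving, adds high-frequency distortion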

The second extension method is motivated by spectral folding as proposed in [18]. However, we use the term spectral folding in the broader sense of multiplying the signal with a locally periodic sequence of non-negative weights. We implement this using the AdaShape module proposed in [6], which multiplies the input signal with a sequence of weights calculated from the sequence of latent feature vectors:

$\operatorname{AdaShape}(x(\cdot),\phi(\cdot))(n) = \alpha(n,\phi(\cdot),x(\cdot))\cdot x(n).$ (2)

Folding, especially when combined with spectral flattening as pre-filtering, provides an effective way for extending unvoiced signal parts. In principle, this module is also capable of extending the signal by sharpening pulses in voiced signal parts. However, an a-posteriori analysis (Figure 3) shows the model mostly uses folding for extending unvoiced signal parts and the non-linear function extrapolation for extending voiced signal parts.
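The following PyTorch sketch illustrates the sample-wise weighting in (2); it is not the AdaShape module from [6]. In particular, the weights here are derived from the latent features by a single linear layer and, unlike in (2), do not depend on the input signal; the layer names and dimensions are assumptions for illustration only.

import torch
import torch.nn as nn

class ToyAdaShape(nn.Module):
    def __init__(self, latent_dim=64, frame_size=240):
        super().__init__()
        # one non-negative weight per output sample of each 200 Hz frame
        self.alpha = nn.Sequential(nn.Linear(latent_dim, frame_size), nn.Softplus())

    def forward(self, x, phi):
        # x:   (batch, num_frames * frame_size) time-domain signal
        # phi: (batch, num_frames, latent_dim) latent feature vectors
        w = self.alpha(phi).flatten(1)   # per-sample weights alpha(n)
        return w * x                     # AdaShape(x, phi)(n) = alpha(n) * x(n)

shaper = ToyAdaShape()
y = shaper(torch.randn(1, 4 * 240), torch.randn(1, 4, 64))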

The features for the feature encoder on the left side of Figure 1 are designed to provide the model with basic signal properties like spectral envelope, pitch, and voicing while being simple to compute. We chose a Hanning-window STFT with a window size of 20 ms and a hop size of 10 ms, from which we compute a 32-band ERB-scale log-magnitude spectrogram and instantaneous frequency information for the first 40 STFT bins in the form

$\operatorname{IF}(k,n) = \dfrac{X(n,k)\,X^{*}(n-1,k)}{|X(n,k)\,X^{*}(n-1,k)|}.$ (3)

Here, $n$ denotes the frame index, $k$ denotes the frequency index, and $^{*}$ denotes complex conjugation. Instantaneous frequencies are included as features since it has been demonstrated that they are sufficient for high-accuracy pitch estimation [24]. The feature encoder upsamples the 72-dimensional feature vectors from 100 to 200 Hz and includes a GRU for accumulating context.
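As an illustration, the following numpy/scipy sketch computes the instantaneous-frequency part of the features according to (3). The 32-band ERB-scale log-magnitude part and the exact packing of the complex values into the feature vector are not reproduced here, and the eps guard is an assumption.

import numpy as np
from scipy.signal import stft

def if_features(x16, fs=16000, num_bins=40, eps=1e-9):
    # 20 ms Hanning window, 10 ms hop at 16 kHz
    _, _, X = stft(x16, fs=fs, window='hann', nperseg=320, noverlap=160)
    X = X[:num_bins, :]                   # first 40 STFT bins, shape (bins, frames)
    num = X[:, 1:] * np.conj(X[:, :-1])   # X(n,k) X*(n-1,k)
    return num / (np.abs(num) + eps)      # Eq. (3): unit-modulus complex values

IF = if_features(np.random.randn(16000))  # shape (40, num_frames - 1)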

III Training

III-A Strategy and Data

Bandwidth extension, viewed as an inverse problem, is generally ill-posed since even for the same source signal the actual recording will depend on environmental factors such as the acoustic environment or the recording device. It is possible to a certain extent to fit a model to the characteristics of a specific homogeneous dataset, but we observed that this can lead to poor generalisation. This is in line with the findings in [15], which in particular highlight the critical impact of microphone channels. The training procedure for the proposed model therefore prioritizes plausibility and robustness over correctness.

To achieve plausibility, we follow the standard approach and use an adversarial loss to bring the distribution of the extended signal closer to the distribution of a native fullband signal [10]. We do, however, use a family of frequency-domain discriminators instead of the commonly used multi-scale and multi-period time-domain discriminators as we found this to lead to faster convergence and higher quality.

To achieve robustness, we train on a mixture of multiple high-quality TTS datasets [9, 16, 22, 11, 14, 20, 26, 12, 5] containing more than 900 speakers in 34 languages and dialects. Since some of the datasets contained upsampled wideband and superwideband recordings, we filter out items with very little energy in the high frequency range. Furthermore, we apply the following data augmentation steps to increase model robustness to unseen speech:

  1. we apply a random EQ filter that is constant above 4 kHz to 40% of the training clips

  2. we add stationary wideband noise with random gain to 20% of training clips

  3. we apply a random RIR from the Aachen Impulse Response Database (https://www.openslr.org/20) to 20% of training clips

  4. we add a random DC offset to 10% of training clips

Finally, we filter the 48 kHz target clips with a 20 kHz lowpass filter to remove ambiguity from mixing 44.1 and 48 kHz speech samples.

Step 1) prevents the model from relying too much on low frequencies, which should be irrelevant for extending the wideband signal. Step 2) teaches the model not to extend the noise but only the speech from the baseband. This is an intentional design decision stemming from the observation that adding a bandwidth extension to a signal of poor quality can reduce perceived quality, as the resulting signal sounds noisier than the bandlimited input signal. Steps 2) and 3), when both carried out on the same item, are executed in random order.

The model input is derived from the augmented 48 kHz signals by applying a random lowpass filter with cutoff between 7.5 and 8 kHz and varying slope. Furthermore, the target signal is delayed by 13 samples to compensate for the resampling delay in the signal path.
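A sketch of this input/target construction under stated assumptions: the Butterworth filter with randomized order stands in for the unspecified random lowpass with varying slope, and the downsampling of the filtered signal to 16 kHz is implied by the model operating on wideband input rather than stated explicitly.

import numpy as np
from scipy import signal

def make_training_pair(y48, fs=48000, rng=np.random.default_rng()):
    # random lowpass with cutoff between 7.5 and 8 kHz and varying slope
    cutoff = rng.uniform(7500.0, 8000.0)
    order = int(rng.integers(4, 9))                  # randomizes the rolloff slope
    sos = signal.butter(order, cutoff, btype='low', fs=fs, output='sos')
    x48 = signal.sosfilt(sos, y48)
    x16 = signal.resample_poly(x48, up=1, down=3)    # wideband model input
    # delay the fullband target by 13 samples to match the signal-path delay
    target = np.concatenate([np.zeros(13), y48])[:len(y48)]
    return x16, target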

III-B Losses and Training

Training is split into a pre-training phase, using only regression losses, and an adversarial training phase using both a discriminator loss and a regression loss for regularization.

We use three regression losses: the STFT-based envelope matching loss $\mathcal{L}_{env}$ and the spectral fine structure loss $\mathcal{L}_{spec}$ from [7], which are averaged over multiple STFT resolutions with window sizes $3\cdot 2^{n}$ for $6\leq n\leq 11$, and a time-domain $L^{2}$ loss $\mathcal{L}_{tdlp}$ on the low frequency range to enforce lowband reconstruction. The lowpass filter is a 15-tap, zero-phase filter with a cutoff frequency of 4 kHz and a gentle rolloff. The total regression loss for pretraining is given by

$\mathcal{L}_{pre} = \frac{1}{13}\mathcal{L}_{env} + \frac{2}{13}\mathcal{L}_{spec} + \frac{10}{13}\mathcal{L}_{tdlp}$ (4)
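A hedged sketch of how (4) could be assembled; envelope_loss, fine_structure_loss, and low_pass_15tap are hypothetical placeholders for the $\mathcal{L}_{env}$ and $\mathcal{L}_{spec}$ definitions from [7] and the 15-tap zero-phase 4 kHz lowpass, which are not reproduced here.

import torch

def pretraining_loss(y_hat, y, envelope_loss, fine_structure_loss, low_pass_15tap):
    window_sizes = [3 * 2 ** n for n in range(6, 12)]   # 192, 384, ..., 6144
    l_env = torch.stack([envelope_loss(y_hat, y, w) for w in window_sizes]).mean()
    l_spec = torch.stack([fine_structure_loss(y_hat, y, w) for w in window_sizes]).mean()
    l_tdlp = torch.mean((low_pass_15tap(y_hat) - low_pass_15tap(y)) ** 2)
    return (l_env + 2 * l_spec + 10 * l_tdlp) / 13      # Eq. (4)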

Pretraining is carried out on 1 s segments using the Adam optimizer with a batch size of 256, an initial learning rate of $5\times 10^{-4}$ and a weight decay factor of $2.5\times 10^{-5}$ for 50 epochs.

For adversarial training, we use a modification of the discriminator architecture in [6], a multi-layer 2d-convolutional model on log-magnitude spectrograms that includes frequency-positional embeddings to allow the discriminator to take the frequency range into account. The main changes are increasing the STFT sizes by a factor of 3 to compensate for the sampling rate increase, increasing the kernel size along the frequency axis from $3\times 3$ to $7\times 3$ to maintain the frequency widths of the receptive fields, and limiting the maximal channel number to 64 to compensate for the parameter increase caused by the wider convolution kernels. Otherwise, adversarial training is identical to [6] with $\mathcal{L}_{reg}=0.6\,\mathcal{L}_{pre}$. Adversarial training is carried out on 0.9-second segments with a batch size of 64 using the Adam optimizer with a constant learning rate of $10^{-4}$ for 40 epochs.

IV Evaluation

IV-A Subjective Evaluation

Figure 2: Results of the P.808 DCR listening test on samples of the EARS dataset. Bandwidth-extended signals are labeled with '+ BWE'. All bandwidth-extended signals show significant improvement over their wideband source signals ($p=0.95$). Furthermore, extended Opus 1.5 can match the quality of higher-bandwidth codecs that use classical, guided methods for coding above-wideband content.

To test model performance and robustness, we carried out a multi-bitrate speech coding test using the P.808 DCR methodology with samples from the EARS dataset [21] from which no portion was included in the training data. To this end, we extracted three sentence pairs per speaker from the regular speech category. We tested the proposed bandwidth extension both for Opus 1.5 with decoder complexity 10 (which means NoLACE enhancement will be applied) and for clean speech input and report significant improvement for all conditions. In particular, the improvement of Opus 1.5 at 6 kb/s is remarkable since the baseband already exhibits very audible distortions.

As comparison points we included the 3GPP EVS codec at 9.6 kb/s, and the results show that Opus with lowband enhancement and blind bandwidth extension matches its quality at a comparable bitrate. This indirectly compares the proposed blind BWE method to a guided, parametric BWE. Furthermore, we included EnCodec at 6 and 12 kb/s to test the hybrid approach against a fully end-to-end neural coding approach. While EnCodec delivers better quality at 6 kb/s, even Opus 1.5 wideband at 9 kb/s already significantly outperforms EnCodec at 12 kb/s. This could in part be a domain shift problem, as the originally provided samples at 12 kb/s sound better than the coded samples from the EARS dataset. In particular, a pre-test also included the neural codec AudioDec, which exhibited severe quality degradation on this dataset and was therefore excluded from the final test (examples are included on the demo page). These results suggest that the hybrid approach may be more robust than the end-to-end approach, and they certainly suggest that robustness must be a concern when considering deploying an end-to-end codec.

Figure 3: A decomposition of the signals $y_{32}(t)$ (top) and $y_{48}(t)$ (bottom) as a sum of the bypass, AdaShape and NonLin contributions. The spectrograms show the model mainly uses AdaShape for extending unvoiced speech and NonLin for extending voiced speech.

Finally, we added Opus 1.4 at 18 kb/s as a self-comparison point to evaluate the effect of combined lowband enhancement and highband extension. Opus 1.5 at 9 kb/s is already statistically tied with Opus 1.4 at 18 kb/s, and equal quality is likely achieved around 10 kb/s. In combination, these backward-compatible enhancement methods thus yield a bitrate reduction of 45 to 50%. Furthermore, this comparison gives a second, indirect comparison of the blind BWE to an Opus-coded highband in hybrid mode.

The evaluation on clean, uncoded speech shows that, while the proposed bandwidth extension significantly improves quality in a direct comparison to the fullband reference signal, it is still distinguishable from the original.

IV-B Model Inspection

Due to the simple architecture of the signal-processing part of the model, it is straightforward to inspect the contribution of the individual modules to the final bandwidth extension. In particular, since the mixing or post-filtering layers are linear, the second bypass signal $y_{32}(t)$ and the output signal $y_{48}(t)$ can be decomposed as sums of signals stemming from the previous bypass channel and the output channels of the AdaShape and NonLin modules.

Spectrograms of these contributions are displayed in Figure 3. In the first stage, unvoiced signal parts are primarily extended by the AdaShape module while voiced signal parts are extended by the non-linearity. In the second stage, the SWB to FB extension is primarily constructed from the AdaShape output and from imaging remaining from the short FIR interpolation filters.

This analysis shows the usefulness of the dual approach of combining these two extension methods. It also suggests that the second NonLin module could likely be omitted without loss of quality, which would result in a small complexity saving. While no formal listening test was carried out on this matter, we observed in informal (blind) listening that omitting either the non-linearity or the AdaShape module from the model results in audible degradation of the extended signal.

V Conclusion

We proposed a lightweight method for wideband-to-fullband extension of speech. We demonstrated the method’s effectiveness and robustness by applying it to both coded and clean speech. While the underlying model was only trained on clean speech, the bandwidth extension resulted in significant quality improvement even when paired with a low-bitrate speech codec.

References

  • [1] G.711 : Pulse code modulation (PCM) of voice frequencies. Technical report, ITU-T.
  • [2] G.722.2 : Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB). Technical report, ITU-T.
  • [3] 3GPP. TS 26.453: Codec for Enhanced Voice Services (EVS). Technical report, ETSI.
  • [4] Venkatraman Atti, Venkatesh Krishnan, Duminda Dewasurendra, Venkata Chebiyyam, Shaminda Subasingha, Daniel J. Sinder, Vivek Rajendran, Imre Varga, Jon Gibbs, Lei Miao, Volodya Grancharov, and Harald Pobloth. Super-wideband bandwidth extension for speech in the 3GPP EVS codec. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5927–5931, 2015.
  • [5] Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, and Yang Zhang. Hi-Fi Multi-Speaker English TTS Dataset. In Proc. Interspeech 2021, pages 2776–2780, 2021.
  • [6] Jan Büthe, Ahmed Mustafa, Jean-Marc Valin, Karim Helwani, and Michael M. Goodwin. Nolace: Improving low-complexity speech codec enhancement through adaptive temporal shaping. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 476–480, 2024.
  • [7] Jan Büthe, Jean-Marc Valin, and Ahmed Mustafa. Lace: A light-weight, causal model for enhancing coded speech through adaptive convolutions. In 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 1–5, 2023.
  • [8] Yan Ming Cheng, D. O’Shaughnessy, and P. Mermelstein. Statistical recovery of wideband speech from narrowband speech. IEEE Transactions on Speech and Audio Processing, 2(4):544–548, 1994.
  • [9] Isin Demirsahin, Oddur Kjartansson, Alexander Gutkin, and Clara Rivera. Open-source Multi-speaker Corpora of the English Accents in the British Isles. In Proceedings of The 12th Language Resources and Evaluation Conference (LREC), pages 6532–6541, Marseille, France, May 2020. European Language Resources Association (ELRA).
  • [10] Sefik Emre Eskimez, Kazuhito Koishida, and Zhiyao Duan. Adversarial training for speech super-resolution. IEEE Journal of Selected Topics in Signal Processing, 13(2):347–358, 2019.
  • [11] Adriana Guevara-Rukoz, Isin Demirsahin, Fei He, Shan-Hui Cathy Chu, Supheakmungkol Sarin, Knot Pipatsrisawat, Alexander Gutkin, Alena Butryna, and Oddur Kjartansson. Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech. In Proceedings of The 12th Language Resources and Evaluation Conference (LREC), pages 6504–6513, Marseille, France, May 2020. European Language Resources Association (ELRA).
  • [12] Alexander Gutkin, Işın Demirşahin, Oddur Kjartansson, Clara Rivera, and Kọ́lá Túbọ̀sún. Developing an Open-Source Corpus of Yoruba Speech. In Proceedings of Interspeech 2020, pages 404–408, Shanghai, China, October 2020. International Speech and Communication Association (ISCA).
  • [13] Esteban Gómez, Mohammad Hassan Vali, and Tom Bäckström. Low-complexity real-time neural network for blind bandwidth extension of wideband speech. In 2023 31st European Signal Processing Conference (EUSIPCO), pages 31–35, 2023.
  • [14] Fei He, Shan-Hui Cathy Chu, Oddur Kjartansson, Clara Rivera, Anna Katanova, Alexander Gutkin, Isin Demirsahin, Cibu Johny, Martin Jansche, Supheakmungkol Sarin, and Knot Pipatsrisawat. Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems. In Proceedings of The 12th Language Resources and Evaluation Conference (LREC), pages 6494–6503, Marseille, France, May 2020. European Language Resources Association (ELRA).
  • [15] Peter J Huber. Robust statistics, volume 523. John Wiley & Sons, 2004.
  • [16] Oddur Kjartansson, Alexander Gutkin, Alena Butryna, Isin Demirsahin, and Clara Rivera. Open-Source High Quality Speech Datasets for Basque, Catalan and Galician. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 21–27, Marseille, France, May 2020. European Language Resources association (ELRA).
  • [17] Haohe Liu, Woo Yong Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, and Deliang Wang. Neural vocoder is all you need for speech super-resolution. ArXiv, abs/2203.14941, 2022.
  • [18] J. Makhoul and M. Berouti. High-frequency regeneration in speech coding systems. In ICASSP ’79. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 428–431, 1979.
  • [19] Moshe Mandel, Or Tal, and Yossi Adi. Aero: Audio super resolution in the spectral domain. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
  • [20] Yin May Oo, Theeraphol Wattanavekin, Chenfang Li, Pasindu De Silva, Supheakmungkol Sarin, Knot Pipatsrisawat, Martin Jansche, Oddur Kjartansson, and Alexander Gutkin. Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech. In Proceedings of The 12th Language Resources and Evaluation Conference (LREC), pages 6328–6339, Marseille, France, May 2020. European Language Resources Association (ELRA).
  • [21] J. Richter, Y.-C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann. EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation. In Proc. Interspeech 2024, pages 4874–4877, 2024.
  • [22] Keshan Sodimana, Knot Pipatsrisawat, Linne Ha, Martin Jansche, Oddur Kjartansson, Pasindu De Silva, and Supheakmungkol Sarin. A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese. In Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), pages 66–70, Gurugram, India, August 2018.
  • [23] Erfan Soltanmohammadi, Paris Smaragdis, and Mike Goodwin. Low-complexity streaming speech super-resolution. In IEEE 2023 Workshop on Machine Learning for Signal Processing (MLSP), 2023.
  • [24] Krishna Subramani, Jean-Marc Valin, Jan Büthe, Paris Smaragdis, and Mike Goodwin. Noise-robust dsp-assisted neural pitch estimation with very low complexity. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11851–11855, 2024.
  • [25] Jean-Marc Valin and Jan Skoglund. LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. CoRR, abs/1810.11846, 2018.
  • [26] Daniel van Niekerk, Charl van Heerden, Marelie Davel, Neil Kleynhans, Oddur Kjartansson, Martin Jansche, and Linne Ha. Rapid development of TTS corpora for four South African languages. In Proc. Interspeech 2017, pages 2178–2182, Stockholm, Sweden, August 2017.
  • [27] Koen Vos, K. V. Sørensen, S. S. Jensen, and Jean-Marc Valin. Voice coding with Opus. In 135th Audio Engineering Society Convention, pages 722–731, 2013.
  • [28] H. Yasukawa. Restoration of wide band signal from telephone speech using linear prediction error processing. In Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP ’96, volume 2, pages 901–904 vol.2, 1996.
  • [29] Hiroshi Yasukawa. Signal restoration of broad band speech using nonlinear processing. In 1996 8th European Signal Processing Conference (EUSIPCO 1996), pages 1–4, 1996.