Towards Error-Resilient Neural Speech Coding
Abstract
Neural audio coding has recently shown very promising results in the literature, largely outperforming traditional codecs, but limited attention has been paid to its error resilience. Neural codecs trained only for source coding tend to be extremely sensitive to channel noise, especially in wireless channels with high error rates. In this paper, we investigate how to improve the error resilience of neural audio codecs against the packet losses that often occur during real-time communications. We propose a feature-domain packet loss concealment algorithm (FD-PLC) for real-time neural speech coding. Specifically, we introduce a self-attention-based module that operates on the received latent features to recover lost frames in the feature domain before decoding. A hybrid segment-level and frame-level frequency-domain discriminator guides the network to focus on both the generative quality of lost frames and the continuity with neighbouring frames. Experimental results on several error patterns show that the proposed scheme achieves better robustness than the corresponding error-free and error-resilient baselines. We also show that feature-domain concealment is superior to its waveform-domain counterpart used as post-processing.
Index Terms: error resilience, packet loss concealment, neural audio coding, real-time communication

1 Introduction
High-fidelity audio transmission over wireless channels has become increasingly important. However, audio packets transmitted over the Internet are prone to various types of errors, e.g. random bit errors, packet losses, network congestion and jitter. If not handled properly, these errors may lead to severe distortion and discontinuity in the received audio. Error resilience is therefore a crucial topic and has been extensively studied in traditional audio coding. Forward error correction (FEC) [1] is a traditional way to protect the compressed bitstream at the sender side, while modern signal-processing-based audio codecs are typically equipped with a packet loss concealment (PLC) module [2, 3] to restore delayed and missing packets at the receiver side. Numerous studies on adaptive quantizers [4], data partitioning, unequal error protection [5, 6] and PLC algorithms have been proposed to improve the error resilience of coding.
In recent years, neural audio/speech coding schemes have shown great vitality in providing extremely high coding efficiency, either by using a strong decoder to recover the waveform from acoustic features [7, 8, 9, 10] or by end-to-end neural coding [11, 12, 13, 14, 15]. They have demonstrated high audio quality at very low bitrates, largely outperforming traditional audio codecs like Opus [16]. However, these schemes target only coding efficiency and do not take error resilience into account; according to our experiments, such source-coding-only models are extremely sensitive to channel noise. This paper aims to fill this gap by investigating how to handle packet losses in a neural coding scheme.
Existing PLC algorithms can be partitioned into two groups, i.e. parametric-domain and waveform-domain PLC. Parametric-domain PLC algorithms aim to predict lost parameters at the codec level, which are then used to synthesize the audio waveform. One example is NetEQ [2], the standard PLC algorithm in WebRTC, which uses linear prediction coefficients (LPC) to estimate the voiced and unvoiced components of the signal and interpolates samples as linear combinations of highly correlated pitch periods. Waveform-domain PLC algorithms instead apply a post-processing step to the decoded waveform. Time-scale modification (TSM) techniques such as Waveform Similarity Overlap-and-Add (WSOLA) [3] have been widely used for waveform-domain PLC because of their ability to extrapolate audio samples in the time domain with good quality. These signal-processing-based methods yield good quality for short packet losses but tend to produce robot-like artifacts for long burst losses.
Comparatively, with the significant breakthroughs in deep learning and generative models, deep-learning-based PLC algorithms have recently demonstrated superior restoration ability, especially for long-term packet losses. Most existing deep PLC algorithms are waveform-domain methods that reconstruct lost packets as a post-processing stage. Generally, they can be divided into auto-regressive networks [17, 2] and generative adversarial networks (GANs) [18, 19, 20]. Auto-regressive methods use recurrent neural networks such as LSTM [17], WaveRNN [21] and WaveNetEQ [2] as regression models of waveform samples to perform PLC in a real-time setup. These methods usually need special tuning of the sampling process to generate audio samples after the lost packet and introduce extra delay by feeding the output back into the network input. In contrast, GAN-based PLC algorithms can generate speech in a frame-in/frame-out manner without any auto-regression [19, 20]. They typically employ an adversarial training strategy, taking an auto-encoder architecture as the generator and one or more discriminators as a learnable loss function for the restoration task, and have been verified to outperform both the auto-regressive counterparts and the classical methods. One problem of post-processing-based methods is that they are highly dependent on the codec's output: models trained on uncompressed speech usually degrade considerably when used directly on decoded audio without fine-tuning or retraining on the codec output. Moreover, their maximum potential is limited by the codec's quality.
Thanks to the learning capability of neural audio/speech coding, error resilience can be optimized jointly with source coding to push the boundary further. Our previous work [15] shows an example of this joint optimization. In this paper, we dig further into this problem and investigate it in a more systematic way. Specifically, we propose a feature-domain PLC (FD-PLC) for neural audio/speech coding, akin to the parametric-domain PLC in traditional audio coding. A lightweight attention-based PLC block is introduced at the decoder to recover lost feature frames. This structure efficiently captures both local and global correlations along the temporal dimension with different attentiveness to lost and non-lost frames. Furthermore, both a multi-scale spectrogram-based loss and a hybrid segment-level and frame-level adversarial loss are utilized to achieve natural and temporally coherent reconstruction quality. Taking the end-to-end neural speech coding network [15] for real-time communications as the backbone, our experimental results show that the proposed method largely enhances the output quality under several packet loss patterns.
2 The Proposed Scheme
A typical neural speech coding network is composed of an encoder, a vector quantizer and a decoder, as illustrated in Fig. 1. A single-channel recording $x$ is mapped to a sequence of embeddings $z$ with low latency. The vector quantizer then discretizes $z$ into quantized features $\hat{z}$ with a set of finite codebooks to meet the target bitrate. Without channel losses, the decoder produces a lossy reconstruction $\hat{x}$ from $\hat{z}$. When the channel suffers from packet losses, only lossy quantized features $\tilde{z}$ are available, and the decoder needs to recover $x$ from $\tilde{z}$ under both quantization noise and packet losses. To facilitate this recovery, we introduce the FD-PLC module right after the inverse quantization in the feature domain. Let $\bar{z}$ denote the features recovered by FD-PLC; the decoder then reconstructs the whole waveform from $\bar{z}$. The decoder side of the network is trained to minimize the distortion at the given bitrate, packet loss rate and loss pattern. Multiple discriminators are designed for adversarial training to produce natural and temporally coherent output with high fidelity. The following subsections describe these components in detail.
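For concreteness, the sketch below traces features through the chain encoder → quantizer → lossy channel → FD-PLC → decoder. The module classes are placeholder stand-ins rather than the actual TFNet components, and zero-filling the features of lost frames at the receiver is one plausible convention assumed here for illustration.

```python
# Minimal data-flow sketch of feature-domain PLC (placeholder modules).
import torch

def simulate_receiver(z_q: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Zero out the quantized features of frames whose packets were lost.
    z_q: (batch, frames, dim); loss_mask: (batch, frames) with 1 = lost."""
    return z_q * (1.0 - loss_mask).unsqueeze(-1)

batch, frames, dim = 2, 100, 120
encoder = torch.nn.Identity()   # stand-in: x -> z
fd_plc  = torch.nn.Identity()   # stand-in: concealment in the feature domain
decoder = torch.nn.Identity()   # stand-in: recovered features -> output

x = torch.randn(batch, frames, dim)                    # per-frame input features
z = encoder(x)                                         # latent embeddings
z_q = z                                                # (quantization omitted here)
loss_mask = (torch.rand(batch, frames) < 0.2).float()  # 20% random frame loss
z_recv = simulate_receiver(z_q, loss_mask)             # features seen by the decoder
z_bar = fd_plc(z_recv)                                 # concealed features
x_hat = decoder(z_bar)                                 # reconstruction
print(x_hat.shape)
```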
2.1 Backbone network
We take the low-latency neural speech coding network TFNet from our previous work [15] as the backbone. It takes the time-frequency spectrum with a 20 ms window and a 5 ms hop length as input, with power-law compressed normalization on the magnitude. The encoder and decoder are composed of causal convolutions and deconvolutions for capturing frequency dependencies, with two kinds of causal temporal filtering modules in-between, i.e. a dilated temporal convolution module (TCM) and a group-wise gated recurrent unit (G-GRU), for capturing temporal dependencies. The two temporal filtering modules are organized in an interleaved way to efficiently extract both local and long-term temporal dependencies. For vector quantization, the latent embeddings are split into several groups, and each group uses an independent codebook with a fixed number of codewords. At the target bitrate of 6 kbps, we combine 4 overlapped frames (corresponding to 20 ms of new data) into one vector for quantization.
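To illustrate the group-wise quantization described above, the sketch below performs a nearest-neighbour lookup per group. The number of groups, codebook size and feature dimensions are illustrative assumptions, as the exact values are not restated here.

```python
import torch

def group_vq(z: torch.Tensor, codebooks: list) -> torch.Tensor:
    """Quantize z (batch, frames, dim) with one codebook per group.
    Each codebook has shape (num_codewords, dim // num_groups)."""
    groups = torch.chunk(z, chunks=len(codebooks), dim=-1)
    quantized = []
    for g, cb in zip(groups, codebooks):
        # nearest codeword per frame (Euclidean distance)
        dist = torch.cdist(g.reshape(-1, g.shape[-1]), cb)   # (B*T, K)
        idx = dist.argmin(dim=-1)
        quantized.append(cb[idx].reshape(g.shape))
    return torch.cat(quantized, dim=-1)

# illustrative sizes: 6 groups of 20 dims, 256 codewords each (assumed)
codebooks = [torch.randn(256, 20) for _ in range(6)]
z = torch.randn(2, 100, 120)
z_q = group_vq(z, codebooks)
```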
2.2 FD-PLC module
The FD-PLC module is composed of two kinds of causal modules stacked together, i.e. a group-wise temporal self-attention (G-TSA) block and a TCM module. The G-TSA block is similar to the multi-head self-attention (MHSA) block in the Transformer [22], but we turn it into a causal operation along the temporal dimension using a finite window, i.e. each frame only has access to a limited number of past frames without any look-ahead. It captures different attentiveness to different frames and thus provides temporal adaptation to the content and to the lost/non-lost status of each frame. As shown in Fig. 2, we use a pre-norm residual unit similar to that in [23] for the G-TSA block, with two convolutional layers that first reduce and then restore the feature dimension. The TCM block is similar to that used in the backbone network, but with layer normalization preceding the convolutions and GELU as the activation function. Several TCM blocks with increasing dilation rates are stacked to form a large TCM module with a large receptive field. We use G-TSA to extract the local correlations that matter most to the current frame at a fine granularity, and the TCM to aggregate the G-TSA output features and capture long-term dependencies at a coarse granularity.
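A minimal sketch of such a causal windowed self-attention block is given below. The head count and bottleneck width are illustrative assumptions; the 32-frame window follows the setting reported in Section 3.1.

```python
import torch
import torch.nn as nn

class CausalWindowedSelfAttention(nn.Module):
    """Pre-norm self-attention where each frame attends only to itself and
    the previous (window - 1) frames. Sizes are illustrative."""
    def __init__(self, dim: int, bottleneck: int, heads: int, window: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.reduce = nn.Conv1d(dim, bottleneck, kernel_size=1)   # dim reduction
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.expand = nn.Conv1d(bottleneck, dim, kernel_size=1)   # dim restoration
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        t = x.shape[1]
        h = self.norm(x)
        h = self.reduce(h.transpose(1, 2)).transpose(1, 2)
        # causal band mask: True = position may not be attended
        idx = torch.arange(t, device=x.device)
        mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= self.window)
        h, _ = self.attn(h, h, h, attn_mask=mask)
        h = self.expand(h.transpose(1, 2)).transpose(1, 2)
        return x + h   # residual connection

block = CausalWindowedSelfAttention(dim=120, bottleneck=64, heads=4, window=32)
y = block(torch.randn(2, 100, 120))
```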

2.3 Adversarial training
Adversarial training is widely used in restoration tasks to achieve good reconstructed audio quality. We employ two frequency-domain discriminators with discrimination capability at different granularities. Both take the magnitude spectrum, computed with fixed window and hop lengths, as input. The first is a segment-level discriminator used to judge the overall quality of an audio clip. It consists of four convolutional layers with a kernel size of (3, 3) and a stride of (2, 2), each followed by normalization and Leaky ReLU activation. The number of channels is progressively increased to 64 with the depth of the network. Finally, a fully connected layer aggregates all channels into one, and average pooling over the (down-sampled) time and frequency dimensions yields a single logit at the output. The second discriminator targets frame-level discrimination. For this purpose, we use a kernel size of (2, 5) with a stride of (1, 2) in the convolution blocks so as to down-sample frequency bins while keeping the temporal resolution, and only frequency-dimension average pooling is applied to obtain a one-dimensional sequence of logits at the output. In both discriminators, we use spectral normalization [24] for the first convolution block and instance normalization [25] for the others for stable training, with sigmoid activation on the output.
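As an illustration, the sketch below follows the frame-level discriminator described above. The channel schedule, the padding and the 1×1 output convolution standing in for the fully connected channel aggregation are our own assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class FrameLevelDiscriminator(nn.Module):
    """Frame-level discriminator sketch: (2, 5) kernels with stride (1, 2)
    roughly keep the temporal resolution while down-sampling frequency."""
    def __init__(self, channels=(16, 32, 48, 64)):
        super().__init__()
        layers, in_ch = [], 1
        for i, out_ch in enumerate(channels):
            conv = nn.Conv2d(in_ch, out_ch, kernel_size=(2, 5),
                             stride=(1, 2), padding=(1, 2))
            # spectral norm on the first block, instance norm on the others
            if i == 0:
                layers += [spectral_norm(conv), nn.LeakyReLU(0.2)]
            else:
                layers += [conv, nn.InstanceNorm2d(out_ch), nn.LeakyReLU(0.2)]
            in_ch = out_ch
        self.convs = nn.Sequential(*layers)
        self.out = nn.Conv2d(in_ch, 1, kernel_size=1)   # channel aggregation

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (batch, 1, frames, freq_bins) magnitude spectrum
        h = self.out(self.convs(mag))
        # average over the frequency dimension only -> one score per frame
        return torch.sigmoid(h.mean(dim=-1)).squeeze(1)

d = FrameLevelDiscriminator()
scores = d(torch.rand(2, 1, 100, 257))   # (batch, frames') per-frame scores
```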
We adopt the binary cross-entropy (BCE) loss generally used for GANs [26], i.e.

$\mathcal{L}_{D} = \mathbb{E}_{x}\big[-\log D(x)\big] + \mathbb{E}_{\hat{x}}\big[-\log\big(1 - D(\hat{x})\big)\big]$  (1)

$\mathcal{L}_{adv} = \mathbb{E}_{\hat{x}}\big[-\log D(\hat{x})\big]$  (2)

where $D(\cdot)$ denotes the discriminator output, $x$ the target speech and $\hat{x}$ the decoded speech.
We also use a feature matching loss [27] as an additional constraint on the generator, given by

$\mathcal{L}_{FM} = \mathbb{E}\Big[\sum_{i=1}^{N} \frac{1}{N_i}\big\| D^{(i)}(x) - D^{(i)}(\hat{x}) \big\|_1\Big]$  (3)

where $N$ denotes the number of discriminator layers used for the feature loss, and $D^{(i)}(\cdot)$ and $N_i$ denote the features and feature size of the $i$-th layer of the discriminator, respectively.
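For reference, a minimal PyTorch sketch of the three losses above is given below. It assumes the discriminator outputs are sigmoid probabilities and that the feature lists come from its intermediate layers.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """BCE loss of Eq. (1): push real outputs towards 1, generated towards 0."""
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

def adversarial_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Generator-side BCE loss of Eq. (2): push generated outputs towards 1."""
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))

def feature_matching_loss(feats_real: list, feats_fake: list) -> torch.Tensor:
    """Eq. (3): mean L1 distance between intermediate discriminator features."""
    loss = 0.0
    for f_real, f_fake in zip(feats_real, feats_fake):
        loss = loss + torch.abs(f_real.detach() - f_fake).mean()
    return loss / len(feats_real)
```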
2.4 Training objectives
The neural coding network works as the generator in adversarial training. We use a combination of several loss terms to guide it towards good decoded audio quality:

$\mathcal{L}_{G} = \lambda_1 \mathcal{L}_{PLC} + \lambda_2 \mathcal{L}_{mel} + \lambda_3 \mathcal{L}_{mag} + \lambda_4 \mathcal{L}_{adv} + \lambda_5 \mathcal{L}_{FM}$  (4)

where the scalars $\lambda_1, \dots, \lambda_5$ are weights that balance the different terms and are set empirically in our implementation. $\mathcal{L}_{adv}$ and $\mathcal{L}_{FM}$ are the adversarial and feature matching loss terms in Eqs. 2 and 3.
The first term supervises the FD-PLC module to recover lost frames from $\tilde{z}$. During training, we select a proportional number of lost and non-lost frames to keep the data balanced; let $\mathcal{S}$ denote this set. The loss term is then given by

$\mathcal{L}_{PLC} = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \big\| \bar{z}_i - \hat{z}_i \big\|_1$  (5)

where $\bar{z}_i$ and $\hat{z}_i$ are the $i$-th frames of $\bar{z}$ and $\hat{z}$, respectively, and $|\mathcal{S}|$ denotes the number of frames in $\mathcal{S}$. The L1 loss measures the distance between $\bar{z}_i$ and $\hat{z}_i$. In our implementation, we take the error-free baseline as the pretrained model for the encoder and the codebook, and train the FD-PLC module and the decoder jointly. More ablation studies on the training algorithm can be found in Section 4.2.
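A possible realization of this balanced frame selection is sketched below, assuming a 1:1 ratio between lost and non-lost frames; the exact proportion is an implementation detail not restated here.

```python
import torch

def fd_plc_loss(z_bar: torch.Tensor, z_hat: torch.Tensor,
                loss_mask: torch.Tensor) -> torch.Tensor:
    """L1 loss between recovered and target features on a balanced frame set:
    all lost frames plus an equal number of randomly chosen non-lost frames.
    z_bar, z_hat: (frames, dim); loss_mask: (frames,) with 1 = lost."""
    lost = loss_mask.nonzero(as_tuple=True)[0]
    kept = (1 - loss_mask).nonzero(as_tuple=True)[0]
    if lost.numel() == 0:
        return torch.abs(z_bar - z_hat).mean()
    # sample as many non-lost frames as lost ones to balance the set
    picked = kept[torch.randperm(kept.numel())[: lost.numel()]]
    sel = torch.cat([lost, picked])
    return torch.abs(z_bar[sel] - z_hat[sel]).mean()
```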
The second and third terms are quality terms on the mel-band and frequency-bin levels, respectively. As shown in [28], the time-frequency distribution can be effectively captured by jointly optimizing multi-resolution spectral and adversarial loss functions, so we use a mel-spectrogram loss at multiple resolutions to achieve high perceptual quality:

$\mathcal{L}_{mel} = \sum_{r} \frac{1}{M_r} \big\| \mathrm{Mel}_r(x) - \mathrm{Mel}_r(\hat{x}) \big\|_1$  (6)

where $\mathrm{Mel}_r(\cdot)$ denotes the mel-scale spectrum at the $r$-th resolution and $M_r$ the number of mel bands. To achieve high fidelity, we also use a frequency-bin-wise quality term:

$\mathcal{L}_{mag} = \big\| |X|^{c} - |\hat{X}|^{c} \big\|_2^2$  (7)

where $|X|^{c}$ is the power-law compressed magnitude spectrum of $x$, and the L2 distance metric is used.
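The sketch below gives one possible implementation of Eqs. (6) and (7). The STFT resolutions, mel-band counts and the compression exponent are illustrative assumptions; the 20 ms window / 5 ms hop at 16 kHz follows Section 2.1.

```python
import torch
import torchaudio

def multi_resolution_mel_loss(x, x_hat,
                              resolutions=((512, 128, 64), (1024, 256, 80), (2048, 512, 128)),
                              sample_rate=16000):
    """L1 distance between mel spectrograms at several STFT resolutions.
    The (n_fft, hop, n_mels) triples are illustrative, not from the paper."""
    loss = 0.0
    for n_fft, hop, n_mels in resolutions:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
        loss = loss + torch.abs(mel(x) - mel(x_hat)).mean()
    return loss / len(resolutions)

def power_law_magnitude_loss(x, x_hat, n_fft=320, hop=80, power=0.3):
    """L2 distance between power-law compressed magnitude spectra
    (20 ms window / 5 ms hop at 16 kHz; the 0.3 exponent is an assumption)."""
    win = torch.hann_window(n_fft)
    def mag(sig):
        spec = torch.stft(sig, n_fft=n_fft, hop_length=hop, window=win,
                          return_complex=True)
        return spec.abs().clamp(min=1e-7) ** power
    return torch.mean((mag(x) - mag(x_hat)) ** 2)

# usage on a one-second 16 kHz signal
x = torch.randn(16000)
total = multi_resolution_mel_loss(x, x + 0.01 * torch.randn(16000)) \
        + power_law_magnitude_loss(x, x + 0.01 * torch.randn(16000))
```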


Table 1: Results on the blind test set with real packet loss traces.

Scheme | MCD [dB] | PLC-MOS | NISQA-MOS | NISQA-Discontinuity | NISQA-TTS
---|---|---|---|---|---
Error-free baseline | 1.859 | 2.461 | 2.799 | 2.844 | 2.850
Error-resilient baseline | 1.430 | 3.713 | 3.509 | 3.572 | 2.985
Baseline-GAN | 1.182 | 4.159 | 3.875 | 4.028 | 3.121
Atten-GAN | 1.158 | 4.238 | 3.978 | 4.174 | 3.183
Post-PLC | 1.502 | 4.256 | 3.663 | 3.790 | 3.076
FD-PLC | 1.239 | 4.247 | 3.957 | 4.165 | 3.325
3 Experimental Setup
3.1 Datasets and settings
We use the 16 kHz raw clean speech data from the Deep Noise Suppression Challenge at ICASSP 2021 [29], which includes multilingual speech and emotional clips. For packet loss, we simulate several random loss rates, each with a corresponding maximum burst loss length in milliseconds. Besides, we simulate WLAN packet loss patterns with three-state Markov models [30]. For training, we synthesized 600 hours of data, 100 hours for each loss rate category. For testing, we use the blind test set from the Audio Deep Packet Loss Concealment Challenge at INTERSPEECH 2022 [31]. This test set uses real packet loss traces captured from real Microsoft Teams calls, with maximum burst loss lengths of up to 1000 milliseconds. We also use another test set with synthetic traces spanning a range of random loss rates for a deeper investigation.
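For illustration, a simple three-state Markov loss simulator is sketched below. The transition matrix, per-state loss probabilities and state semantics are illustrative and not the parameters used to generate our training data.

```python
import numpy as np

def markov_loss_trace(num_packets: int, trans: np.ndarray,
                      loss_prob: np.ndarray, seed: int = 0) -> np.ndarray:
    """Generate a packet loss trace from a three-state Markov chain.
    trans[i, j] is the probability of moving from state i to state j and
    loss_prob[i] the probability that a packet is lost while in state i."""
    rng = np.random.default_rng(seed)
    state, trace = 0, np.zeros(num_packets, dtype=np.int8)
    for t in range(num_packets):
        trace[t] = rng.random() < loss_prob[state]
        state = rng.choice(len(trans), p=trans[state])
    return trace

# e.g. state 0: good, state 1: lossy, state 2: burst loss (assumed semantics)
trans = np.array([[0.95, 0.04, 0.01],
                  [0.30, 0.60, 0.10],
                  [0.10, 0.20, 0.70]])
loss_prob = np.array([0.0, 0.3, 1.0])
trace = markov_loss_trace(1000, trans, loss_prob)
print(trace.mean())   # overall loss rate of the simulated trace
```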
For training, we use the Adam optimizer [32] for both the generator and the discriminator. The generator uses a fixed learning rate, while the discriminator's learning rate is decayed every epoch from its initial value. The temporal window size of the FD-PLC module is set to 32 frames. The network is trained for 60 epochs with a batch size of 200.
3.2 Evaluation metrics
We measure signal quality at various packet loss rates with the bitrate fixed at 6 kbps. For a generative task, speech distortion, perceptual quality and naturalness are all important factors. To this end, we employ mel cepstral distortion (MCD, in dB) [33] to measure speech distortion, which focuses on perceptually relevant characteristics of the short-term speech spectrum. To measure the overall quality, we use PLCMOS, the evaluation tool of the PLC Challenge at INTERSPEECH 2022 [31], and NISQA [34]. NISQA is a deep learning framework for speech quality evaluation covering several sources of degradation, including packet losses and audio compression. We use NISQA-MOS to evaluate the overall quality, NISQA-Discontinuity for audio continuity, and NISQA-TTS for the naturalness of the synthesized speech. Our experiments find that these metrics reasonably match our perception when the same neural audio codec is used.
3.3 Baselines for comparison
We compare several schemes to verify the effectiveness of the proposed FD-PLC: the error-free baseline, the error-resilient baseline, Baseline-GAN, Atten-GAN, Post-PLC and the proposed FD-PLC, as shown in Fig. 3 and Table 1. The error-free and error-resilient baselines are trained without and with consideration of packet losses, respectively. Baseline-GAN differs from the error-resilient baseline by adding adversarial training. Atten-GAN further introduces the FD-PLC module into Baseline-GAN so that it uses the same network as the proposed FD-PLC; it differs from the proposed FD-PLC only in that no $\mathcal{L}_{PLC}$ loss is used. Finally, Post-PLC moves the PLC module to the waveform domain, acting as a post-processor on the output of Baseline-GAN. The Post-PLC module takes a U-Net structure with a causal convolutional encoder and decoder and skip connections between them; TCM and TSA blocks similar to those in FD-PLC are used between the encoder and decoder. It is designed with a model size similar to the FD-PLC module but with a much larger receptive field.
4 Results
4.1 Comparison with other schemes
Fig. 3 shows the MCD and NISQA-TTS comparisons on synthetic traces at different packet loss rates, measuring the signal distortion and the naturalness of the concealed speech, respectively. The error-free baseline works well when there are no packet losses, but its quality drops sharply once packets are lost, showing its sensitivity to channel noise. The error-resilient schemes surpass the error-free baseline not only in lossy scenarios but also in the loss-free scenario, showing stronger robustness and restoration capability. Among the error-resilient schemes, all except the error-resilient baseline and Post-PLC are comparable in terms of MCD. The proposed FD-PLC scheme achieves relatively low MCD and the highest NISQA-TTS MOS scores consistently over all loss rates. Post-PLC has a much larger MCD, indicating large differences at the signal level. Atten-GAN also shows promising results, especially at high packet loss rates: compared with the model without the attention module, more frequency bins are restored since more attention is put on the lost parts, but the content it generates is not as coherent as that of the proposed scheme. Similar results can be found in Table 1, evaluated on the blind test set with real traces: the proposed FD-PLC scheme achieves the best NISQA-TTS MOS and results comparable to the best-performing scheme on the other metrics, and it also outperforms Post-PLC in both signal fidelity and perceptual quality.
4.2 Ablation study on training algorithms
Here we investigate several training schemes: end-to-end training, multi-stage training, and the proposed one. In multi-stage training, the encoder and codebook are pretrained as in the proposed scheme, but the FD-PLC module is trained first on the decoding side and the decoder is fine-tuned afterwards. As Table 2 shows, end-to-end training performs worst, because the model is confused by the PLC task while the target quantized features are still at a preliminary stage. The proposed joint training of the FD-PLC module and the decoder provides more room for the trade-off between packet loss recovery and quantization noise recovery.
Table 2: Ablation study on training algorithms.

Scheme | MCD [dB] | NISQA-TTS
---|---|---
Error-resilient baseline | 1.430 | 2.985
End-to-end training | 1.222 | 2.897
Multi-stage training | 1.245 | 3.253
Proposed | 1.239 | 3.325
5 Conclusions
We propose a feature-domain packet loss concealment algorithm for real-time neural speech coding in this paper. Experimental results show that it achieves both better signal fidelity and better perceptual quality than a waveform-domain post-PLC. The proposed self-attention-based generative network is able to recover burst losses of up to 120 ms in length and degrades gracefully with longer burst losses. The proposed FD-PLC module can easily be applied to other neural audio/speech coding networks as well.
References
- [1] C. Perkins, O. Hodson, and V. Hardman, “A survey of packet loss recovery techniques for streaming audio,” IEEE Network, vol. 12, no. 5, pp. 40–48, 1998.
- [2] F. Stimberg, A. Narest, A. Bazzica, L. Kolmodin, P. B. González, O. Sharonova, H. Lundin, and T. C. Walters, “Waveneteq—packet loss concealment with wavernn,” in 2020 54th Asilomar Conference on Signals, Systems, and Computers. IEEE, 2020, pp. 672–676.
- [3] A. Stenger, K. B. Younes, R. Reng, and B. Girod, “A new error concealment technique for audio transmission with packet loss,” in 1996 8th European Signal Processing Conference (EUSIPCO 1996). IEEE, 1996, pp. 1–4.
- [4] G. Simkus, M. Holters, and U. Zölzer, “Error resilience enhancement for a robust adpcm audio coding scheme,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 3685–3689.
- [5] J. Zhou, Q. Zhang, Z. Xiong, and W. Zhu, “Error resilient scalable audio coding (ersac) for mobile applications,” in 2001 IEEE Fourth Workshop on Multimedia Signal Processing (Cat. No. 01TH8564). IEEE, 2001, pp. 307–312.
- [6] D. Yang, H. Ai, C. Kyriakakis, and C.-C. J. Kuo, “Error-resilient design of high-fidelity scalable audio coding,” in Digital Wireless Communications IV, vol. 4740. International Society for Optics and Photonics, 2002, pp. 53–63.
- [7] W. B. Kleijn, F. Lim, A. Luebs, and J. Skoglund, “WaveNet based low rate speech coding,” in ICASSP. IEEE, 2018, pp. 676–680.
- [8] W. Kleijn, A. Storus, M. Chinen, T. Denton, F. Lim, A. Luebs, J. Skoglund, and H. Yeh, “Generative speech coding with predictive variance regularization,” in arXiv:2102.09660, 2021.
- [9] J. Klejsa, P. Hedelin, C. Zhou, R. Fejgin, and L. Villemoes, “High-quality speech coding with sample RNN,” in ICASSP. IEEE, 2019, pp. 7155–7159.
- [10] R. Fejgin, J. Klejsa, L. Villemoes, and C. Zhou, “Source coding of audio signals with a generative model,” in ICASSP. IEEE, 2020, pp. 341–345.
- [11] C. Gârbacea, A. van den Oord, Y. Li, F. Lim, A. Luebs, O. Vinyals, and T. C. Walters, “Low bit-rate speech coding with VQ-VAE and a WaveNet decoder,” in 2019 IEEE Int. Conf. Acoust Speech Signal Processing (ICASSP). IEEE, 2019, pp. 735–739.
- [12] J. Williams, Y. Zhao, E. Cooper, and J. Yamagishi, “Learning disentangled phone and speaker representations in a semi-supervised VQ-VAE paradigm,” in 2021 IEEE Int. Conf. Acoust Speech Signal Processing (ICASSP). IEEE, 2021.
- [13] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: an end-to-end neural audio codec,” in arXiv:2107.03312v1, 2021.
- [14] K. Zhen, J. Sung, M. Lee, S. Beack, and M. Kim, “Cascaded cross-module residual learning towards lightweight end-to-end speech coding,” in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), 2019.
- [15] X. Jiang, X. Peng, C. Zheng, H. Xue, Y. Zhang, and Y. Lu, “End-to-end neural audio coding for real-time communications,” arXiv preprint arXiv:2201.09429, 2022.
- [16] J.-M. Valin, K. Vos, and T. Terriberry, “Definition of the Opus audio codec,” IETF RFC 6716, Sep. 2012.
- [17] J. Lin, Y. Wang, K. Kalgaonkar, G. Keren, D. Zhang, and C. Fuegen, “A time-domain convolutional recurrent network for packet loss concealment,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7148–7152.
- [18] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, “High fidelity speech synthesis with adversarial networks,” arXiv preprint arXiv:1909.11646, 2019.
- [19] S. Pascual, J. Serrà, and J. Pons, “Adversarial auto-encoding for packet loss concealment,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021, pp. 71–75.
- [20] J. Wang, Y. Guan, C. Zheng, R. Peng, and X. Li, “A temporal-spectral generative adversarial network based end-to-end packet loss concealment for wideband speech transmission,” The Journal of the Acoustical Society of America, vol. 150, no. 4, pp. 2577–2588, 2021.
- [21] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in International Conference on Machine Learning. PMLR, 2018, pp. 2410–2419.
- [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [23] Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and L. S. Chao, “Learning deep transformer models for machine translation,” arXiv preprint arXiv:1906.01787, 2019.
- [24] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.
- [25] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
- [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
- [27] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in neural information processing systems, vol. 32, 2019.
- [28] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.
- [29] C. K. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, “Icassp 2021 deep noise suppression challenge,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6623–6627.
- [30] B. P. Milner and A. B. James, “An analysis of packet loss models for distributed speech recognition,” in Eighth International Conference on Spoken Language Processing, 2004.
- [31] L. Diener, “INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge,” Jan. 2022. [Online]. Available: https://github.com/microsoft/PLC-Challenge/blob/INTERSPEECH2022DeepPLCChallenge.pdf
- [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [33] R. Kubichek, “Mel-cepstral distance measure for objective speech quality assessment,” in Proceedings of IEEE pacific rim conference on communications computers and signal processing, vol. 1. IEEE, 1993, pp. 125–128.
- [34] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “Nisqa: A deep cnn-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” arXiv preprint arXiv:2104.09494, 2021.