
Preserving background sound in noise-robust voice conversion via multi-task learning

Abstract

Background sound is an informative form of art that helps provide a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research on VC, mainly focusing on clean voices, pays little attention to VC with background sound. The critical problems for preserving background sound in VC are the inevitable speech distortion introduced by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework via multi-task learning which sequentially cascades a source separation (SS) module, a bottleneck feature extraction module and a VC module. Specifically, the source separation task explicitly considers critical phase information and confines the distortion caused by the imperfect separation process. The source separation task, the typical VC task and the unified task share a uniform reconstruction loss constrained by joint training to reduce the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving comparable quality and speaker similarity to VC models trained with clean data.

Index Terms—  Voice conversion, background sound, multi-task learning, end-to-end

1 Introduction

Voice conversion (VC) is a speech signal transformation technique that changes the voice of the original speaker into that of a target speaker while keeping the linguistic content unchanged [1]. VC has attracted long-term research interest due to its various applications, such as voice dubbing for movies, personalized speech synthesis and speaker anonymization for voice privacy protection [2]. With the increasing diversity of VC applications such as audiobook and movie dubbing, there is a rising need for handling noisy sources during conversion. First, the source speech in real applications is usually entangled with background sound, such as the mixed soundtrack of speech and background music in movies and audiobooks, which challenges the robustness of current VC systems. Second, to provide a better listening experience, preserving the original background sound of the source in the converted audio is also desired. Ideally, a controllable VC system should be able to keep or remove the source background sound in the synthesized target speech according to the specific application. Besides, background sound is also a valuable resource for VC-based data augmentation to improve the robustness of downstream systems, such as automatic speech recognition (ASR) [3] or automatic speaker verification (ASV) [4].

Typical VC studies aim to disentangle the linguistic content and speaker timbre from the source speech, without explicitly handling background sounds in the source speech [5, 6, 7, 8]. As a noisy source may affect the quality of the converted speech, some prior works [9, 10] addressed the background noise problem by using phonetic posteriorgrams (PPGs) or bottleneck features to represent the linguistic content, which are extracted from a multi-condition trained noise-robust automatic speech recognition (ASR) system. Another straightforward approach is to use an extra denoising model to handle the noisy source speech before voice conversion. Although the noise is suppressed, the inevitable speech distortion induced by the imperfect denoising process propagates downstream and degrades the quality of the converted speech as well [11].

For noise-controllable speech generation, adversarial training has been extensively studied in text-to-speech [12, 13, 14, 15], where the generated speech can be clean or noisy conditioned on an acoustic tag, no matter whether the speaker training data is clean or noisy. The problem is that the synthetic noisy speech only contains some kind of “averaged” noise learned from the training data. Such an approach is not suitable for background sound preservation in the VC scenario. To address this problem, a noisy-to-noisy VC framework [16, 17] was proposed which adopts a pre-trained neural denoising module to conduct VC while preserving the background sound at the same time. Specifically, the denoised speech is fed to the VC module and the residual (the noisy speech minus the denoised speech) is regarded as the background sound, which is finally combined with the converted speech. Similarly, with the help of a pre-trained source separation model, the singing voice conversion approach in [18] manages to preserve background music (BGM) in the converted singing by simply combining the separated BGM with the converted speech.

The above approaches deal with the noisy source by cascading individually trained denoising/separation and VC models. Besides the different training objectives, as mentioned earlier, the inevitable distortion (error) in the speech as well as in the background sound induced by the imperfect front-end module may lead to inferior sound quality with audible artifacts in the converted audio.

Fig. 1: (a) System overview of the proposed multi-task framework, BN represents bottleneck features. (b) SS module. (c) VC module. The solid line represents the forward propagation, and the dashed line represents which loss functions are used for the module output.

To address this problem, we propose an end-to-end (E2E) framework in a multi-task learning manner for VC in the presence of background sound. We build our E2E approach by sequentially cascading a source separation (SS) module, a bottleneck feature extraction module and a VC module with multiple specifically designed learning objectives. Specifically, the source separation task is formed with a deep complex convolution recurrent network (DCCRN) [19] optimized by a power-law compressed phase-aware (PLCPA) loss and an asymmetry loss [20]. Critical phase information is thus explicitly considered and distortion is confined by leveraging recent advances in neural source separation. Importantly, besides the source separation task and the typical VC task, the unified task shares a uniform reconstruction loss constrained by joint training to reduce the mismatch between the SS and VC modules while ensuring the quality of the converted speech. Experiments and ablation studies show the significant advantages of the proposed approach.

2 Proposed Multi-Task Framework

As shown in Figure 1(a), the proposed framework consists of three modules, i.e. the bottleneck feature extraction module, the SS module, and the VC module. With these modules, we approach the problem of VC with background sound via a three-step process: 1) separating the vocal and background sound from the input signal using the SS module, 2) conducting voice conversion on the vocal using the VC module with the bottleneck feature as input, and 3) superimposing the converted voice with the background sound extracted by the SS module. The bottleneck feature extraction module is a pre-trained ASR encoder built with the WeNet toolkit [21] and available on the official website (https://github.com/wenet-e2e/wenet). The SS and VC modules are first trained separately on different datasets, and then jointly trained via multi-task learning with the bottleneck feature extraction module frozen.
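To make the three-step process concrete, the sketch below outlines a possible inference path; the module interfaces (`ss_module`, `bn_extractor`, `vc_module`) and their call signatures are illustrative assumptions rather than the released implementation.

```python
import torch

def convert_with_background(mixture, ss_module, bn_extractor, vc_module,
                            target_spk_id, keep_background=True):
    """Hypothetical inference path of the proposed three-step process.

    mixture: [1, samples] waveform containing speech and background sound.
    """
    # Step 1: separate the vocal and the background sound with the SS module.
    vocal, background = ss_module(mixture)

    # Step 2: extract bottleneck features with the frozen ASR encoder and
    # convert the vocal to the target speaker with the VC module.
    with torch.no_grad():
        bn_features = bn_extractor(vocal)              # [1, frames, bn_dim]
    converted = vc_module(bn_features, target_spk_id)  # [1, samples]

    # Step 3: optionally superimpose the separated background sound, which
    # makes background preservation controllable at inference time.
    if keep_background:
        length = min(converted.shape[-1], background.shape[-1])
        converted = converted[..., :length] + background[..., :length]
    return converted
```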

2.1 Source Separation Module

Due to their large number of parameters and complex structures, most current source separation models are too bulky to be jointly trained with the VC task. Therefore, we adopt DCCRN as our SS module [19], which has only 3.7M parameters and achieved the best performance in the real-time track of the Deep Noise Suppression (DNS) challenge [22].

The original DCCRN estimates a complex ratio mask (CRM) from the real and imaginary parts of the noisy complex spectrogram. In this paper, DCCRN is used as a source separation model by additionally estimating a CRM for the clean speech, as shown in Figure 1(b). Meanwhile, we use the power-law compressed phase-aware asymmetric (PLCPA-ASYM) loss to replace the original scale-invariant signal-to-noise ratio (SI-SNR) loss to confine speech distortion and achieve better power control [20]. Suppose $x$ and $\hat{x}$ are the estimated and clean spectrograms, respectively. The PLCPA loss is defined as follows:

$$\mathcal{L}_{a}(t,f) = \left| |x(t,f)|^{p} - |\hat{x}(t,f)|^{p} \right|^{2} \qquad (1)$$
$$\mathcal{L}_{p}(t,f) = \left| |x(t,f)|^{p} e^{j\varphi(x(t,f))} - |\hat{x}(t,f)|^{p} e^{j\varphi(\hat{x}(t,f))} \right|^{2}$$
$$\mathcal{L}_{\texttt{plcpa}} = \frac{1}{T}\frac{1}{F}\sum_{t}^{T}\sum_{f}^{F}\left(\alpha\mathcal{L}_{a}(t,f) + (1-\alpha)\mathcal{L}_{p}(t,f)\right),$$

where $T$ and $F$ are the total numbers of time frames and frequency bins, respectively, while $t$ and $f$ are the time and frequency indices. The spectral compression factor $p$ is set to 0.3, the operator $\varphi$ returns the argument of a complex number, and $\alpha$ is the weighting coefficient between the amplitude and phase-aware components. The asymmetric loss [23] is applied to the amplitude part of the PLCPA loss to alleviate the over-suppression (OS) issue, and is defined as

$$h(x) = \begin{cases} 0, & \text{if } x \leq 0, \\ x, & \text{if } x > 0, \end{cases} \qquad (2)$$
$$\mathcal{L}_{\texttt{os}}(t,f) = \Big| h\big(|x(t,f)|^{p} - |\hat{x}(t,f)|^{p}\big) \Big|^{2}.$$

The final PLCPA-ASYM loss is then defined as

$$\mathcal{L}_{\texttt{plcpa-asym}} = \mathcal{L}_{\texttt{plcpa}} + \beta \frac{1}{T}\frac{1}{F}\sum_{t}^{T}\sum_{f}^{F}\mathcal{L}_{\texttt{os}}(t,f), \qquad (3)$$

where $\beta$ is the positive weighting coefficient for $\mathcal{L}_{\texttt{os}}(t,f)$.
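As a reference, Eqs. (1)-(3) could be implemented on complex spectrograms roughly as in the minimal PyTorch sketch below; only the compression factor p = 0.3 comes from the text, while the values of alpha and beta are assumed placeholders.

```python
import torch

def plcpa_asym_loss(est_spec, ref_spec, p=0.3, alpha=0.5, beta=1.0, eps=1e-8):
    """PLCPA-ASYM loss on complex spectrograms, following Eqs. (1)-(3).

    est_spec / ref_spec: complex tensors [batch, freq, time], where est_spec
    is the estimate x and ref_spec the clean reference x_hat of the text.
    """
    est_mag = est_spec.abs().clamp_min(eps) ** p   # power-law compressed |x|^p
    ref_mag = ref_spec.abs().clamp_min(eps) ** p

    # Amplitude term L_a: squared difference of compressed magnitudes.
    loss_a = (est_mag - ref_mag) ** 2

    # Phase-aware term L_p: difference of compressed complex spectra.
    est_pc = torch.polar(est_mag, est_spec.angle())
    ref_pc = torch.polar(ref_mag, ref_spec.angle())
    loss_p = (est_pc - ref_pc).abs() ** 2

    plcpa = alpha * loss_a + (1.0 - alpha) * loss_p

    # Asymmetric term of Eq. (2): h(.) zeroes out negative differences, so
    # only bins with a positive compressed-magnitude difference contribute.
    loss_os = torch.relu(est_mag - ref_mag) ** 2

    # The mean over time-frequency bins plays the role of the 1/(T*F) sums.
    return (plcpa + beta * loss_os).mean()
```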

2.2 Voice Conversion Module

After extracting the speech from source waveforms with background sound through the SS module, we employ a VC module to transform the speech into the target speaker's voice. As illustrated in Figure 1(c), the VC module goes through single-stage training for efficient end-to-end learning. The VC module consists of a convolutional long short-term memory (CLSTM) encoder and a HiFi-GAN-based decoder [24], which are responsible for high-level linguistic representation encoding and waveform reconstruction, respectively. The CLSTM encoder consists of three stacked convolution layers, each followed by a LeakyReLU activation function, and an LSTM layer. The speaker embedding from a lookup table is fed to the decoder as the condition for target voice generation. The architecture and objective of the decoder generator follow the same configuration as HiFi-GAN [24].
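The following is a schematic PyTorch sketch of such a CLSTM encoder and its speaker-conditioned decoder interface; the channel sizes, kernel widths, bottleneck-feature dimension, and the way the speaker embedding is concatenated are illustrative assumptions, and the HiFi-GAN generator itself is treated as an external component.

```python
import torch
import torch.nn as nn

class CLSTMEncoder(nn.Module):
    """Three Conv1d + LeakyReLU stacks followed by an LSTM layer."""

    def __init__(self, bn_dim=256, hidden_dim=256, num_convs=3):
        super().__init__()
        layers, in_dim = [], bn_dim
        for _ in range(num_convs):
            layers += [nn.Conv1d(in_dim, hidden_dim, kernel_size=5, padding=2),
                       nn.LeakyReLU(0.1)]
            in_dim = hidden_dim
        self.convs = nn.Sequential(*layers)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, bn):                       # bn: [B, T, bn_dim]
        h = self.convs(bn.transpose(1, 2))       # [B, hidden_dim, T]
        h, _ = self.lstm(h.transpose(1, 2))      # [B, T, hidden_dim]
        return h

class VCModule(nn.Module):
    """CLSTM encoder + speaker lookup table + HiFi-GAN style decoder."""

    def __init__(self, num_speakers, decoder, bn_dim=256, hidden_dim=256,
                 spk_dim=128):
        super().__init__()
        self.encoder = CLSTMEncoder(bn_dim, hidden_dim)
        self.spk_table = nn.Embedding(num_speakers, spk_dim)
        self.decoder = decoder    # external HiFi-GAN generator (assumed API)

    def forward(self, bn, spk_id):
        h = self.encoder(bn)                         # [B, T, hidden_dim]
        spk = self.spk_table(spk_id).unsqueeze(1)    # [B, 1, spk_dim]
        spk = spk.expand(-1, h.size(1), -1)          # broadcast over time
        cond = torch.cat([h, spk], dim=-1).transpose(1, 2)
        return self.decoder(cond)                    # waveform [B, samples]
```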

2.3 Multi-task Training

To reduce the mismatch between the SS module and the VC module during voice conversion in the presence of background sound, we adopt a multi-task learning strategy with the following stages: 1) only the VC module is optimized by the VC training loss with the SS module frozen; 2) only the SS module is optimized by the PLCPA-ASYM loss with the VC module frozen; 3) both the SS and VC modules are jointly optimized in a multi-task learning manner. The bottleneck feature extraction module is frozen at every stage.
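A minimal sketch of this staged optimization, assuming standard PyTorch modules and an arbitrary learning rate, might freeze and unfreeze parameters as follows.

```python
import itertools
import torch

def set_trainable(module, flag):
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

def build_stage_optimizer(stage, ss_module, vc_module, bn_extractor, lr=2e-4):
    """Stage 1: VC only (SS frozen); stage 2: SS only (VC frozen);
    stage 3: joint multi-task training. The BN extractor is always frozen."""
    set_trainable(bn_extractor, False)
    set_trainable(ss_module, stage in (2, 3))
    set_trainable(vc_module, stage in (1, 3))
    params = itertools.chain(
        ss_module.parameters() if stage in (2, 3) else [],
        vc_module.parameters() if stage in (1, 3) else [])
    return torch.optim.AdamW(params, lr=lr)
```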

Reconstruction loss for the unified task. The unified task aims at optimizing the composed waveform with vocals and background sound by minimizing the reconstruction loss. We denote the Mel-spectrogram of the training data as $M = (M_{\texttt{s}}, M_{\texttt{b}})$, composed of the Mel-spectrograms of the speech audio $X_{\texttt{s}}$ and the background sound $X_{\texttt{b}}$. The reconstruction loss for the unified task is defined as

$$\mathcal{L}_{\texttt{rec}}^{\texttt{uni}} = \|M - \hat{M}\|_{1}, \qquad (4)$$

where $\hat{M}$ is the Mel-spectrogram of the reconstructed waveform containing speech and background sound.

SS training loss. The source separation task considers critical phase information and uses the PLCPA-ASYM loss to optimize the estimated CRM and confine the distortion of the separated speech and background sound. The PLCPA-ASYM loss is calculated for the separated speech and the separated background sound, denoted by $\mathcal{L}_{\texttt{s}}^{\texttt{ss}}$ and $\mathcal{L}_{\texttt{b}}^{\texttt{ss}}$, respectively.

VC training loss. The objective of the typical VC task with no background sound follows HiFi-GAN [24], which consists of the reconstruction loss $\mathcal{L}_{\texttt{rec}}^{\texttt{vc}}$, the feature matching loss $\mathcal{L}_{\texttt{fm}}^{\texttt{vc}}$, and the adversarial loss $\mathcal{L}_{\texttt{adv}}^{\texttt{vc}}$. We adopt the L1 loss as the reconstruction loss to optimize the Mel-spectrogram $\hat{M}_{\texttt{s}}$ of the predicted speech $\hat{X}_{\texttt{s}}$ as

$$\mathcal{L}_{\texttt{rec}}^{\texttt{vc}} = \|M_{\texttt{s}} - \hat{M}_{\texttt{s}}\|_{1}. \qquad (5)$$

To improve the performance of voice conversion, we employ adversarial training for more natural speech. The adversarial generator loss of the single-task VC is calculated as

$$\mathcal{L}_{\texttt{adv}}^{\texttt{gen}} = (D(\hat{X}_{\texttt{s}}) - 1)^{2}, \qquad (6)$$
$$\mathcal{L}_{\texttt{adv}}^{\texttt{dis}} = (D(X_{\texttt{s}}) - 1)^{2} + D(\hat{X}_{\texttt{s}})^{2}, \qquad (7)$$

where $D$ is a discriminator network. For adversarial training stability, a feature matching loss is also used, defined as

$$\mathcal{L}_{\texttt{fm}}^{\texttt{vc}} = \sum_{i=1}^{T}\frac{1}{N_{i}}\left\|D_{i}(X_{\texttt{s}}) - D_{i}(\hat{X}_{\texttt{s}})\right\|_{1}, \qquad (8)$$

where $T$ denotes the total number of layers in the discriminator, and $D_{i}$ produces the feature map of the $i$-th discriminator layer with $N_{i}$ features. Note that if the source separation and voice conversion losses were not added, joint training would collapse the SS and VC modules into a single unit, rendering the separated intermediate results useless and making it impossible to control whether the background sound is preserved.
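For clarity, Eqs. (6)-(8) correspond to the usual least-squares GAN and feature-matching objectives of HiFi-GAN; a compact sketch is given below, assuming a discriminator that returns its final score together with a list of intermediate feature maps.

```python
import torch

def discriminator_loss(discriminator, real_wav, fake_wav):
    """Eq. (7): push real scores towards 1 and (detached) fake scores to 0."""
    real_score, _ = discriminator(real_wav)
    fake_score, _ = discriminator(fake_wav.detach())
    return ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()

def generator_and_fm_loss(discriminator, real_wav, fake_wav):
    """Eq. (6): push fake scores towards 1; Eq. (8): L1 feature matching."""
    _, real_feats = discriminator(real_wav)
    fake_score, fake_feats = discriminator(fake_wav)
    loss_adv = ((fake_score - 1) ** 2).mean()
    # The mean over each feature map approximates the 1/N_i normalization.
    loss_fm = sum(torch.abs(r.detach() - f).mean()
                  for r, f in zip(real_feats, fake_feats))
    return loss_adv, loss_fm
```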

The SS and VC modules share a uniform reconstruction loss constrained by joint training, and the SS module can be regarded as data augmentation or a desirable pre-processing model that improves the robustness of the VC module. The final loss function of our multi-task learning approach is a weighted sum of the following losses:

$$\mathcal{L}_{\texttt{mtl}} = \lambda_{\texttt{uni}}\mathcal{L}_{\texttt{rec}}^{\texttt{uni}} + \lambda_{\texttt{ss}}(\mathcal{L}_{\texttt{s}}^{\texttt{ss}} + \mathcal{L}_{\texttt{b}}^{\texttt{ss}}) + \lambda_{\texttt{vc}}(\mathcal{L}_{\texttt{rec}}^{\texttt{vc}} + \mathcal{L}_{\texttt{adv}}^{\texttt{vc}} + \mathcal{L}_{\texttt{fm}}^{\texttt{vc}}), \qquad (9)$$

where $\lambda_{\texttt{uni}}$, $\lambda_{\texttt{ss}}$, and $\lambda_{\texttt{vc}}$ are hyper-parameters balancing the loss terms of each task. We empirically set $\lambda_{\texttt{uni}}=45$, $\lambda_{\texttt{ss}}=1$ and $\lambda_{\texttt{vc}}=1$.
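Putting the pieces together, Eq. (9) with the reported weights can be expressed as a simple weighted sum; the helper below also shows the L1 Mel reconstruction used in Eqs. (4) and (5), with a per-element mean as an assumed normalization.

```python
import torch

def mel_l1(mel_est, mel_ref):
    """L1 reconstruction loss on Mel-spectrograms, as in Eqs. (4) and (5)."""
    return torch.abs(mel_est - mel_ref).mean()

def multitask_loss(rec_uni, ss_speech, ss_background, rec_vc, adv_vc, fm_vc,
                   lambda_uni=45.0, lambda_ss=1.0, lambda_vc=1.0):
    """Weighted sum of Eq. (9) with the weights reported in the paper."""
    return (lambda_uni * rec_uni
            + lambda_ss * (ss_speech + ss_background)
            + lambda_vc * (rec_vc + adv_vc + fm_vc))
```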

3 Experiments

3.1 Dataset

We train the SS module and the VC module on two different datasets. To train the SS module, we use the MUSDB18 training set [25], which contains 100 full-length music tracks of different genres along with their isolated drums, bass, vocals, and other stems. The VC module is trained on the VCTK corpus [26], which contains 110 English speakers with 400 utterances per speaker. For the joint training of the SS and VC modules, the MUSDB18 training set and VCTK are mixed to evaluate VC with background sound. The background sound is randomly selected from the dataset, clipped to the same length as the speech, and mixed at a signal-to-noise ratio (SNR) randomly chosen from 0 to 10 dB. We randomly select 4 source speakers (p245, p264, p270, and p361) and 2 target speakers (p294 and p334) from the VCTK corpus and mix their speech with clips from the MUSDB18 test set for evaluation. For each source and target speaker, 30 utterances are reserved as the test set. All training and evaluation data are at a 16 kHz sampling rate.
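As an illustration of this data preparation, background clips could be mixed with speech at a random SNR roughly as follows; looping short background clips is an assumption, not a detail stated in the paper.

```python
import random
import torch

def mix_at_random_snr(speech, background, snr_range=(0.0, 10.0), eps=1e-8):
    """Clip/loop a background clip to the speech length ([1, samples] tensors)
    and mix at an SNR drawn uniformly from snr_range (in dB)."""
    if background.shape[-1] < speech.shape[-1]:
        repeats = speech.shape[-1] // background.shape[-1] + 1
        background = background.repeat(1, repeats)
    background = background[..., : speech.shape[-1]]

    snr_db = random.uniform(*snr_range)
    speech_power = speech.pow(2).mean()
    bg_power = background.pow(2).mean().clamp_min(eps)
    # Scale the background so that 10*log10(P_speech / P_background) = snr_db.
    scale = torch.sqrt(speech_power / (bg_power * 10 ** (snr_db / 10)))
    return speech + scale * background
```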

3.2 Comparisons and Evaluation Metrics

For a fair comparison, we conduct evaluations on the following systems with the same VC module: 1) Upper Bound, where the audio is converted from the clean source voice and then superimposed with the original background sound; 2) Baseline1 (Sep), consisting of a VC module and an individually trained separation model (https://github.com/bytedance/music_source_separation), the same source separation approach as [18]; 3) Baseline2 (Sep + Denoise), containing a VC module, the same separation model as Baseline1 and the same denoising model as [17], which can be regarded as a more robust front-end than [17]; 4) Proposed, the multi-task learning framework proposed in this paper.

The effectiveness of the proposed framework compared with the baseline systems is investigated by objective and subjective evaluations. For the subjective evaluation, we use the comparative mean opinion score (CMOS) to compare the converted speech with background sound against the baseline systems, using the Upper Bound results as the reference to determine the ensemble quality. To evaluate the quality and similarity more effectively, we also conduct a mean opinion score (MOS) test on the converted speech without the superimposed background sound. Scale-invariant signal-to-distortion ratio (SI-SDR) [27] and perceptual evaluation of speech quality (PESQ) [28] are employed as the objective metrics to evaluate the separated background sound and the converted speech. Audio samples are available online (https://yaoxunji.github.io/background_sound_vc/).
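For reference, SI-SDR can be computed from a pair of waveforms as sketched below; PESQ requires an external implementation and is not shown.

```python
import torch

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR (in dB) between two 1-D waveforms."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = torch.dot(estimate, reference) / (reference.pow(2).sum() + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))
```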

Table 1: Ensemble quality CMOS and speech MOS comparison of the proposed framework vs. different baseline systems. “F” and “M” represent female and male, corresponding to different gender combinations of source and reference voices.
System        | CMOS  | Quality: M2M / M2F / F2F / F2M / Mean                      | Similarity: M2M / M2F / F2F / F2M / Mean
Upper Bound   |  0    | 3.96±0.07 / 3.89±0.08 / 3.92±0.07 / 3.83±0.08 / 3.90±0.08  | 3.82±0.06 / 3.87±0.06 / 3.88±0.07 / 3.83±0.07 / 3.85±0.07
Sep           | -0.54 | 2.92±0.10 / 2.81±0.11 / 3.01±0.11 / 2.86±0.11 / 2.90±0.11  | 3.11±0.10 / 3.21±0.09 / 3.17±0.08 / 3.23±0.10 / 3.18±0.09
Sep + Denoise | -0.35 | 3.44±0.10 / 3.35±0.10 / 3.49±0.09 / 3.40±0.10 / 3.42±0.10  | 3.40±0.10 / 3.44±0.10 / 3.45±0.10 / 3.47±0.10 / 3.44±0.10
Proposed      | -0.12 | 3.80±0.05 / 3.69±0.06 / 3.74±0.05 / 3.65±0.06 / 3.72±0.05  | 3.69±0.08 / 3.74±0.06 / 3.73±0.06 / 3.64±0.07 / 3.70±0.07

3.3 Experimental Results

3.3.1 Subjective Evaluation

We first evaluate the overall performance of VC in the presence of background sound by CMOS tests between the compared systems and the Upper Bound, as shown in Table 1. The CMOS results, i.e. -0.54 for the separation-only approach, -0.35 for the separation-plus-denoising approach, and -0.12 for Proposed, demonstrate that the proposed framework is significantly preferred over the baseline systems. In addition, the score of -0.12 indicates that the converted speech with background sound from the proposed framework shows only a slight regression in ensemble quality compared to the Upper Bound.

To evaluate the quality and speaker similarity of the converted speech, we conduct MOS tests on the different systems, as shown in Table 1. For more accurate ratings, we use the converted speech without the superimposed background sound for the listening tests. The proposed framework achieves better audio quality than the two baselines. Meanwhile, the MOS also indicates that the audio quality of intra-gender conversion is better than that of cross-gender conversion. For speaker similarity, the results show that our proposed framework significantly outperforms the baselines in both cross-gender and intra-gender settings. The subjective results demonstrate that the proposed multi-task learning framework can effectively deal with noisy sources and convert the voice to the target speaker while preserving the background sound.

Table 2: The objective evaluation results of the converted speech and background sound before superimposing.
System        | Speech: SI-SDR / PESQ | Background Sound: SI-SDR / PESQ
Sep           | 11.76 / 2.75          |  9.85 / 2.29
Sep + Denoise | 12.01 / 3.43          |  9.85 / 2.29
Proposed      | 12.10 / 3.43          | 11.11 / 2.56

3.3.2 Objective Evaluation

For precise evaluation, we calculate the SI-SDR and PESQ results separately on the speech and the background sound before superimposing, since the superimposed audio may conceal details of the speech and background sound. Because Sep and Sep+Denoise employ the same separation model, the two baselines share the same SI-SDR and PESQ results for background sound.

The objective results are shown in Table 2. For the evaluation on speech, the proposed framework achieves higher SI-SDR and PESQ scores than the two baselines, which verifies the good performance of the proposed framework on the separated voice. In addition, we find that Sep+Denoise obtains the same PESQ score as our proposed framework. This demonstrates that our SS module can achieve similar separated speech quality to the state-of-the-art denoising model, which further indicates the effectiveness of the proposed SS module in multi-task learning.

Regarding background sound, our proposed framework outperforms the two baselines and achieves the highest SI-SDR and PESQ scores, i.e. 11.11 and 2.56, respectively. According to the objective results, our multi-task learning framework achieves better separation results than the baseline systems with a single source separation module.

3.3.3 Ablation Study

To further investigate the effectiveness of each component in our framework, we conduct ablation studies for speech and background sound using subjective and objective metrics, respectively, as shown in Table 3. In the ablation studies, we remove the source separation loss, the voice conversion loss, or the joint training procedure, one at a time.

Table 3: Ablation studies of speech MOS and background sound objective evaluation results. To evaluate the contribution of each component, we remove them one at a time.
Variants                              | Speech: Quality / Similarity | Background Sound: SI-SDR / PESQ
Proposed                              | 3.69±0.06 / 3.71±0.07        | 11.11 / 2.56
  $-\mathcal{L}_{*}^{\texttt{ss}}$    | 3.36±0.07 / 3.51±0.06        |  7.91 / 1.96
  $-\mathcal{L}_{*}^{\texttt{vc}}$    | 3.27±0.07 / 3.37±0.07        | 10.19 / 2.70
  $-$Joint Training                   | 3.14±0.05 / 3.23±0.06        |  9.91 / 2.31

The results demonstrate that the variants without the SS loss ($-\mathcal{L}_{*}^{\texttt{ss}}$) and without the VC loss ($-\mathcal{L}_{*}^{\texttt{vc}}$) show worse performance on both subjective and objective metrics. When removing $\mathcal{L}_{*}^{\texttt{ss}}$, the separated background sound degrades by a large margin in terms of SI-SDR and PESQ, and the Mel-spectrogram of the vocal part is severely damaged. The variant $-\mathcal{L}_{*}^{\texttt{vc}}$ achieves worse speech quality and similarity than $-\mathcal{L}_{*}^{\texttt{ss}}$ but a slightly better PESQ result than the proposed framework, which indicates that removing the VC loss makes the variant pay more attention to separation. In addition, the variant without joint training degrades both speech and background sound performance. We believe the reason is a mismatch between the SS module and the VC module when the separation results are used directly for conversion, which leads to distortion in the conversion result. The ablation studies verify the effectiveness of our proposed multi-task learning approach to VC with background sound.

4 Conclusion

In this paper, we present an end-to-end framework trained in a multi-task manner that produces high-quality VC in the presence of background sound and is capable of flexibly controlling the linguistic content, background sound, and speaker timbre. The framework sequentially cascades an SS module, a bottleneck feature extraction module and a VC module. The SS module is built on DCCRN and confines distortion by leveraging the PLCPA-ASYM loss. To tackle the problem of disentangling the vocal from the background sound, we employ a multi-task loss to constrain the outputs of the SS and VC modules. Furthermore, the source separation task, the voice conversion task, and the unified task share a uniform reconstruction loss to reduce the mismatch between the SS module and the VC module. The subjective and objective evaluation results demonstrate that the proposed framework achieves higher speech quality and similarity, effectively narrowing the gap in ensemble quality between the baselines and the upper bound.

References

  • [1] Berrak Sisman, Junichi Yamagishi, Simon King, and Haizhou Li, “An overview of voice conversion and its challenges: From statistical modeling to deep learning,” IEEE/ACM Trans Audio Speech Lang. Process., vol. 29, pp. 132–157, 2020.
  • [2] Seyed Hamidreza Mohammadi and Alexander Kain, “An overview of voice conversion systems,” Speech Comm., vol. 88, pp. 65–82, 2017.
  • [3] S. Shahnawazuddin, Nagaraj Adiga, Kunal Kumar, Aayushi Poddar, and Waquar Ahmad, “Voice conversion based data augmentation to improve children’s speech recognition in limited data scenario,” in Proc. Interspeech, 2020, pp. 4382–4386.
  • [4] S. Shahnawazuddin, Waquar Ahmad, Nagaraj Adiga, and Avinash Kumar, “In-domain and out-of-domain data augmentation to improve children’s speaker verification system in limited data scenario,” in Proc. ICASSP, 2020, pp. 7554–7558.
  • [5] Bac Nguyen and Fabien Cardinaux, “Nvc-net: End-to-end adversarial voice conversion,” in Proc. ICASSP, 2022, pp. 7012–7016.
  • [6] Yen-Hao Chen, Da-Yi Wu, Tsung-Han Wu, and Hung-yi Lee, “Again-vc: A one-shot voice conversion using activation guidance and adaptive instance normalization,” in Proc. ICASSP, 2021, pp. 5954–5958.
  • [7] Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, and Helen Meng, “VQMIVC: vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion,” in Proc. Interspeech, 2021, pp. 1344–1348.
  • [8] Trung Dang, Dung N. Tran, Peter Chin, and Kazuhito Koishida, “Training robust zero-shot voice conversion models with self-supervised features,” in Proc. ICASSP, 2022, pp. 6557–6561.
  • [9] Damien Ronssin and Milos Cernak, “AC-VC: non-parallel low latency phonetic posteriorgrams based voice conversion,” in Proc. ASRU, 2021, pp. 710–716.
  • [10] Hongqiang Du, Lei Xie, and Haizhou Li, “Noise-robust voice conversion with domain adversarial training,” Neur. Net., vol. 148, pp. 74–84, 2022.
  • [11] Yangyang Xia, Sebastian Braun, Chandan K. A. Reddy, Harishchandra Dubey, Ross Cutler, and Ivan Tashev, “Weighted speech distortion losses for neural-network-based real-time speech enhancement,” in Proc. ICASSP, 2020, pp. 871–875.
  • [12] Liumeng Xue, Shan Yang, Na Hu, Dan Su, and Lei Xie, “Learning noise-independent speech representation for high-quality voice conversion for noisy target speakers,” in Proc. Interspeech, 2022, pp. 2548–2552.
  • [13] Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang, Yonghui Wu, and James R. Glass, “Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization,” in Proc. ICASSP, 2019, pp. 5901–5905.
  • [14] Jian Cong, Shan Yang, Lei Xie, Guoqiao Yu, and Guanglu Wan, “Data efficient voice cloning from noisy samples with domain adversarial training,” in Proc. Interspeech, 2020, pp. 811–815.
  • [15] Shan Yang, Yuxuan Wang, and Lei Xie, “Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise,” IEEE Signal Process. Lett., vol. 27, pp. 1730–1734, 2020.
  • [16] Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, and Tomoki Toda, “Noisy-to-noisy voice conversion framework with denoising model,” in Proc. APSIPA, 2021, pp. 814–820.
  • [17] Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, and Tomoki Toda, “Direct noisy speech modeling for noisy-to-noisy voice conversion,” in Proc. ICASSP, 2022, pp. 6787–6791.
  • [18] Divyesh G. Rajpura, Jui Shah, Maitreya Patel, Harshit Malaviya, Kirtana Phatnani, and Hemant A. Patil, “Effectiveness of transfer learning on singing voice conversion in the presence of background music,” in Proc. SPCOM, 2020, pp. 1–5.
  • [19] Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie, “DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement,” in Proc. Interspeech, 2020, pp. 2472–2476.
  • [20] Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang, Zhuo Chen, and Xuedong Huang, “Personalized speech enhancement: new models and comprehensive evaluation,” in Proc. ICASSP, 2022, pp. 356–360.
  • [21] Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei, “Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit,” in Proc. Interspeech, 2021, pp. 4054–4058.
  • [22] Chandan K. A. Reddy, Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, and Sriram Srinivasan, “ICASSP 2021 deep noise suppression challenge,” in Proc. ICASSP, 2021, pp. 6623–6627.
  • [23] Quan Wang, Ignacio Lopez-Moreno, Mert Saglam, Kevin W. Wilson, Alan Chiao, Renjie Liu, Yanzhang He, Wei Li, Jason Pelecanos, Marily Nika, and Alexander Gruenstein, “Voicefilter-lite: Streaming targeted voice separation for on-device speech recognition,” in Proc. Interspeech, 2020, pp. 2677–2681.
  • [24] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020.
  • [25] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner, “The MUSDB18 corpus for music separation,” 2017.
  • [26] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., “Superseded-cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit,” 2016.
  • [27] Jonathan Le Roux, Scott Wisdom, Hakan Erdogan, and John R Hershey, “Sdr–half-baked or well done?,” in Proc. ICASSP, 2019, pp. 626–630.
  • [28] Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra, “Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, 2001, pp. 749–752.