DIFFRENT: A DIFFUSION MODEL FOR RECORDING ENVIRONMENT TRANSFER OF SPEECH
Abstract
Properly setting up recording conditions, including microphone type and placement, room acoustics, and ambient noise, is essential to obtaining the desired acoustic characteristics of speech. In this paper, we propose Diff-R-EN-T, a Diffusion model for Recording ENvironment Transfer, which transforms input speech to have the recording conditions of a reference speech while preserving the speech content. Our model comprises a content enhancer, a recording environment encoder, and a diffusion decoder that generates the target mel-spectrogram using the outputs of the enhancer and the encoder as input conditions. We evaluate DiffRENT in speech enhancement and acoustic matching scenarios. The results show that DiffRENT generalizes well to unseen environments and new speakers. The proposed model also achieves superior performance in both objective and subjective evaluations. Sound examples are available online: https://jakeoneijk.github.io/diffrent-demo.
Index Terms— diffusion probabilistic model, generative model, recording environment transfer, speech enhancement, acoustic matching
1 Introduction
The acoustic characteristics of speech are determined by various recording conditions, such as the type and position of microphones, room acoustics, and ambient noise. The proper conditions depend on the intended use of the recorded speech. For instance, voice-overs or audiobooks require a clean setup with high-quality microphones and a non-reverberant space. On the other hand, automated dialogue replacement (ADR) demands post-production that replicates the acoustic qualities conveying the environment of the original dialog.
While its significance is indisputable, producing speech audio in a targeted recording environment requires considerable professional knowledge and effort. To tackle this problem, we introduce “recording environment transfer”, which transforms input speech so that it has the recording conditions of a reference speech.
Previous studies of recording environment transfer mainly concentrate on a single target recording environment. One example is speech enhancement, which converts speech recorded in a noisy environment into a clean voice. Conventional speech enhancement models aim to mitigate noise or reverberation, and a number of recent studies have tackled many different distortions simultaneously to achieve a realistic recording environment [15, 28]. Despite their success, models that handle only a single target environment are limited in usability.
Another case of recording environment transfer is simulating the conditions of a desired acoustic environment. Previous work has implemented this by estimating or generating the room impulse response (RIR) of the target environment [5, 26, 2, 24], by matching audio effects such as equalization (EQ) [6] or reverb [23], or by converting audio from a source microphone to a target microphone [18]. However, most of these approaches focus on only one recording condition.
This work aims to address general recording environment transfer that accounts for microphone type and placement, room acoustics, and ambient noise all at once. This holistic environment transfer was previously attempted by the acoustic matching model [27], which matches reverberation, EQ distortion, and noise to the target environment using a reference speech that contains the recording environment. While it shows promising results, the model has several limitations. First, it tends to overfit to the environments in the training set; as a result, its performance degrades when generating audio in unseen recording environments. Second, it mainly focuses on the case where the target environment is reverberant and noisy, so its performance in transforming a noisy, reverberant environment into a clean one is relatively unexplored. Lastly, the model is not capable of transferring realistic noise.
In this paper, we propose a unified environment transfer model that can faithfully change the acoustic characteristics of speech to arbitrary recording conditions. We implement it with a diffusion model [25], which has shown impressive performance in generating images [7, 22] and audio [12, 16], and thus we call it Diff-R-EN-T, a Diffusion model for Recording ENvironment Transfer. DiffRENT consists of three modules: the recording environment encoder, the content enhancer, and the diffusion decoder. The recording environment encoder extracts a recording environment embedding from a reference speech with the target environment. The content enhancer filters out the recording conditions of the source speech while preserving the speech content. The diffusion decoder generates the transformed speech with the target environment given the outputs of the recording environment encoder and the content enhancer. We show that DiffRENT effectively transfers arbitrary recording environments by disentangling the recording environment from the speech content. We validate it in speech enhancement and acoustic matching scenarios and show that the model achieves superior performance in both objective and subjective evaluations.
2 DIFFRENT
Figure 1 illustrates the overview of the proposed model. Let $x \in \mathbb{R}^{L_x}$, $r \in \mathbb{R}^{L_r}$, and $y \in \mathbb{R}^{L_x}$ be the content speech (input), the reference speech in the target recording environment, and the output speech carrying the content of $x$ and the recording environment of $r$, respectively, where $L_x$ and $L_r$ are the numbers of samples. Their log mel-spectrograms are denoted $X \in \mathbb{R}^{F \times T_x}$, $R \in \mathbb{R}^{F \times T_r}$, and $Y \in \mathbb{R}^{F \times T_x}$, where $F$ is the number of frequency bins and $T_x$ and $T_r$ are the numbers of time frames. DiffRENT generates $Y$ given $X$ and $R$ as input conditions. The target audio signal $y$ is obtained from $Y$ using a pre-trained HiFi-GAN vocoder [10]. The proposed model comprises the content enhancer $E_c$, the recording environment encoder $E_r$, and the diffusion decoder $D_\theta$. In the following, we describe each component in detail.
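For concreteness, the overall inference flow can be summarized with a minimal Python sketch. The module objects, their call signatures, and the `sample` method are placeholders for illustration under the notation above, not the authors' implementation.

```python
import torch

def diffrent_inference(x_mel, r_mel, content_enhancer, env_encoder,
                       diffusion_decoder, vocoder):
    """Sketch of the DiffRENT inference flow; all module internals are assumed."""
    with torch.no_grad():
        x_hat = content_enhancer(x_mel)               # E_c(X): content condition
        e_r = env_encoder(r_mel)                      # e_r = E_r(R): environment embedding
        y_mel = diffusion_decoder.sample(x_hat, e_r)  # iterative denoising -> Y
        y_audio = vocoder(y_mel)                      # pre-trained HiFi-GAN: mel -> waveform
    return y_audio
```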
2.1 Recording Environment Encoder
The role of the recording environment encoder is to extract environmental acoustic features other than the speech content from the reference audio. We assume that the recording environment remains unchanged over time. From the input mel-spectrogram $R$, $E_r$ produces an embedding $e_r = E_r(R)$, a disentangled representation of the recording environment. The diffusion decoder takes $e_r$ as an input condition. Without any additional objectives, $E_r$ and $D_\theta$ are trained jointly with the denoising objective. We employ the ECAPA-TDNN [3] architecture for the recording environment encoder. ECAPA-TDNN was originally designed to extract speaker embeddings for speaker verification; given its strong capability in capturing static features, we deem it suitable for our task. We removed the AAM-softmax in ECAPA-TDNN and modified the number of nodes in the final fully-connected layer to match the condition channel size of the diffusion decoder.
2.2 Content Enhancer
In order to preserve the speech content of the input audio while filtering out the recording environment features, the content enhancer $E_c$ is trained by minimizing the following mean absolute error (MAE) loss:

$$\mathcal{L}_{E_c} = \left\lVert E_c(X) - X^{\mathrm{clean}} \right\rVert_1, \qquad (1)$$

where $X^{\mathrm{clean}}$ is the log mel-spectrogram of the clean speech having the same content as $X$. Note that the goal of $E_c$ is to facilitate improved content capturing for the diffusion decoder rather than speech enhancement per se. Since it is trained with a simple L1 loss, $E_c(X)$ contains some artifacts and distortion; however, these are addressed in the diffusion decoder, so their influence on the overall model performance remains negligible. We employed a ResUNet from [11], slightly modified to operate on mel-spectrograms as in [8].
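As a concrete illustration, here is a minimal PyTorch sketch of the loss in Eq. (1); the `enhancer` module and tensor shapes are assumptions.

```python
import torch.nn.functional as F

def content_enhancer_loss(enhancer, x_mel, x_clean_mel):
    """MAE (L1) loss of Eq. (1): push E_c(X) toward the clean mel-spectrogram
    that carries the same content as X."""
    x_hat = enhancer(x_mel)                # E_c(X), shape (batch, n_mels, frames)
    return F.l1_loss(x_hat, x_clean_mel)   # mean absolute error
```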
2.3 Diffusion Decoder
To generate a target mel-spectrogram $Y$ (denoted $Y_0$ in the diffusion formulation) from the two input conditions $E_c(X)$ and $e_r$, we apply a denoising diffusion probabilistic model [7]. A diffusion probabilistic model belongs to the category of generative models that learn a model distribution $p_\theta(Y_0)$ approximating the data distribution $q(Y_0)$. With a total number of diffusion steps $T$, it is a latent variable model of the form $p_\theta(Y_0) = \int p_\theta(Y_{0:T})\, dY_{1:T}$, where $Y_1, \ldots, Y_T$ are latent variables of the same dimensionality as $Y_0$. It comprises the two processes described below.
2.3.1 Diffusion/Forward Process
The diffusion/forward process is a fixed Markov process that gradually adds small Gaussian noise to the data over $T$ iterations, with the aim of making the distribution of $Y_T$ a standard Gaussian:

$$q(Y_{1:T} \mid Y_0) = \prod_{t=1}^{T} q(Y_t \mid Y_{t-1}), \qquad (2)$$

with $q(Y_t \mid Y_{t-1}) = \mathcal{N}\!\left(Y_t; \sqrt{1-\beta_t}\, Y_{t-1}, \beta_t I\right)$, where $\beta_t$ is a deterministic noise schedule constant satisfying $0 < \beta_1 < \beta_2 < \cdots < \beta_T < 1$. The forward process at each step can be marginalized as follows:

$$q(Y_t \mid Y_0) = \mathcal{N}\!\left(Y_t; \sqrt{\bar{\alpha}_t}\, Y_0, (1-\bar{\alpha}_t) I\right), \qquad (3)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
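The closed-form marginal of Eq. (3) allows sampling $Y_t$ directly from $Y_0$. Below is a minimal sketch, assuming a linear $\beta_t$ schedule and $T = 100$; both are illustrative choices, not values stated in the paper.

```python
import torch

T = 100                                    # total diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.06, T)      # noise schedule beta_t (assumed linear)
alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)  # bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(y0, t, noise):
    """Draw Y_t ~ q(Y_t | Y_0) in closed form, Eq. (3)."""
    a_bar = alpha_bars[t].view(-1, 1, 1)   # broadcast over (batch, n_mels, frames)
    return a_bar.sqrt() * y0 + (1.0 - a_bar).sqrt() * noise
```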
2.3.2 Reverse Process
In DiffRENT, the reverse process is defined as a Markov chain with learned Gaussian conditional transition distributions, starting from $p(Y_T) = \mathcal{N}(Y_T; 0, I)$:

$$p_\theta(Y_{0:T} \mid E_c(X), e_r) = p(Y_T) \prod_{t=1}^{T} p_\theta(Y_{t-1} \mid Y_t, E_c(X), e_r), \qquad (4)$$

$$p_\theta(Y_{t-1} \mid Y_t, E_c(X), e_r) = \mathcal{N}\!\left(Y_{t-1}; \mu_\theta(Y_t, t, E_c(X), e_r), \sigma_t^2 I\right), \qquad (5)$$

where $\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$. The diffusion decoder is trained to optimize the learnable parameters $\theta$, enabling the reversal of the diffusion process.
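A minimal sketch of one reverse step following Eqs. (4)-(5), assuming the standard DDPM noise-prediction parameterization of $\mu_\theta$ (the exact parameterization is an assumption here, not taken from the paper); `t` is a Python integer timestep.

```python
import torch

@torch.no_grad()
def p_sample(eps_model, y_t, t, x_hat, e_r, betas, alphas, alpha_bars):
    """One reverse step Y_t -> Y_{t-1}, conditioned on E_c(X) and e_r."""
    eps = eps_model(y_t, t, x_hat, e_r)                     # predicted noise eps_theta
    mean = (y_t - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                                         # no noise added at the last step
    var = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
    return mean + var.sqrt() * torch.randn_like(y_t)        # sigma_t^2 of Eq. (5)
```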
2.3.3 Optimizing Diffusion Decoder
To maximize the intractable likelihood $p_\theta(Y_0 \mid E_c(X), e_r)$, the evidence lower bound (ELBO) is used to train the diffusion decoder [7]. Given $Y_0$, $X$, and $R$ sampled from the training dataset, the objective function for training the diffusion decoder is as follows:

$$\mathcal{L}_{D} = \mathbb{E}_{Y_0, \epsilon, t}\!\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, Y_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t,\; E_c(X),\; E_r(R)\right)\right\rVert^2\right], \qquad (6)$$

where $\epsilon \sim \mathcal{N}(0, I)$ and $t$ is sampled uniformly from $\{1, \ldots, T\}$. The trainable parameters of the diffusion decoder and the recording environment encoder are concurrently optimized while $E_c$ is fixed.
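Below is a minimal PyTorch sketch of one training step for Eq. (6). The $\epsilon$-prediction network `eps_model`, the squared-error form, and the tensor shapes are assumptions consistent with the simplified DDPM objective of [7].

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(eps_model, env_encoder, x_hat, r_mel, y0, alpha_bars, T):
    """Compute the loss of Eq. (6) for one batch; x_hat = E_c(X) is precomputed and fixed."""
    b = y0.size(0)
    t = torch.randint(0, T, (b,), device=y0.device)           # t sampled uniformly
    noise = torch.randn_like(y0)                               # epsilon ~ N(0, I)
    a_bar = alpha_bars[t].view(-1, 1, 1)
    y_t = a_bar.sqrt() * y0 + (1.0 - a_bar).sqrt() * noise     # Eq. (3)
    e_r = env_encoder(r_mel)                                   # trained jointly with the decoder
    eps_pred = eps_model(y_t, t, x_hat, e_r)
    return F.mse_loss(eps_pred, noise)
```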
3 Experiments
3.1 Dataset and Preprocessing
We evaluated our proposed model on the DDS dataset [14] which includes 12 hours of clean speech data with 48 speakers (24 female and 24 male) and corresponding paired data re-recorded with different combinations of 3 microphones, 9 spaces, and 6 microphone positions. For the test set, we selected four speakers (two males and two females) and one recording environment setup (Uber Microphone, livingroom1, and F position).
To enhance the robustness of content capturing, we applied data augmentation by convolving clean speech with diverse impulse responses and adding noise. The noise samples were sourced from the REVERB Challenge database [9], and a total of 1,207 impulse responses were obtained from the DetmoldSRIR dataset [1] and the MIT Impulse Response Survey dataset [29]. Note that data augmentation is applied exclusively to the content speech, because the goal of the diffusion decoder is to learn the distribution of realistic environments rather than synthetic ones.
For our task, it is important that the neural vocoder operates in a speaker-independent manner and can synthesize speech in diverse recording environments. To achieve this, we trained the HiFi-GAN vocoder [10] on the DDS dataset [14] and the LibriSpeech corpus [20]. The LibriSpeech corpus, originally designed for automatic speech recognition (ASR) research, contains 982 hours of speech from 2,484 speakers. Although the dataset includes a considerable number of noisy audio files, these align with the requirements of our task, as the vocoder is also required to generate noise effectively.
All audio was resampled to 16 kHz. Speech data were randomly segmented or padded to a length of 4 seconds. Spectrograms were extracted by the short-time Fourier transform (STFT) with a Hann window of 1024 samples and a hop size of 256 samples. The log mel-spectrogram was computed with an 80-channel mel filterbank followed by log magnitude compression. For the diffusion decoder, it was normalized to the range between -1 and 1 by min-max normalization.
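A minimal preprocessing sketch with librosa following the settings above; the exact log compression constant and whether the min-max statistics are per-utterance or dataset-wide are assumptions.

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, seg_sec=4.0, eps=1e-5):
    """Resample to 16 kHz, pad/crop to 4 s, and compute a normalized log mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    n = int(seg_sec * sr)
    y = y[:n] if len(y) >= n else np.pad(y, (0, n - len(y)))
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, win_length=1024, hop_length=256,
        window="hann", n_mels=80)
    log_mel = np.log(mel + eps)                               # log magnitude compression
    lo, hi = log_mel.min(), log_mel.max()                     # per-utterance min-max (assumed)
    return 2.0 * (log_mel - lo) / (hi - lo + 1e-8) - 1.0      # range [-1, 1], shape (80, frames)
```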
3.2 Implementation Details
Diffusion Decoder: We conducted experiments with two diffusion decoders: the WaveNet [30] architecture used in [16] and the U-Net architecture in [22]. The embedding channel size, contingent on the conditioning method of the diffusion decoder, was set to 256 for WaveNet and 80 for U-Net. We empirically found that this difference in channel size does not yield a significant variance in the performance of the recording environment encoder. To align its number of time frames with $E_c(X)$, the embedding $e_r$ is repeated along the time axis. For U-Net, the final condition is formed by concatenating $e_r$ with $E_c(X)$. For the WaveNet architecture, $E_c(X)$ is projected to 256 channels via a linear layer and then summed with $e_r$ to form the input condition.
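A minimal sketch of the two conditioning schemes; tensor shapes and the direction of the 256-channel projection are inferred from the channel sizes above rather than stated explicitly, so treat them as assumptions.

```python
import torch
import torch.nn as nn

def make_condition(x_hat, e_r, decoder_type, proj=None):
    """x_hat: (B, 80, T) from the content enhancer; e_r: (B, C) environment embedding."""
    e_r = e_r.unsqueeze(-1).expand(-1, -1, x_hat.size(-1))    # repeat along the time axis
    if decoder_type == "unet":                                # C = 80: concatenate channels
        return torch.cat([x_hat, e_r], dim=1)                 # (B, 160, T)
    # WaveNet: project E_c(X) to 256 channels, then sum with e_r (C = 256)
    proj = proj if proj is not None else nn.Linear(x_hat.size(1), e_r.size(1))
    return proj(x_hat.transpose(1, 2)).transpose(1, 2) + e_r  # (B, 256, T)
```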
Training Procedure: We first trained the content enhancer for 225k steps. Then, the diffusion decoder and the recording environment encoder were trained jointly for 400k steps. We adopted a training and inference procedure similar to [16] for the diffusion decoder. All models were trained with the AdamW optimizer and a batch size of 32. The learning rate started at 0.0008 and was halved every 20k steps.
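A sketch of the joint training loop described above (AdamW, initial learning rate 8e-4 halved every 20k steps); the loss function and data loader are placeholders rather than the authors' code.

```python
import torch

def train_jointly(diffusion_decoder, env_encoder, loss_fn, data_loader, num_steps=400_000):
    """Jointly optimize the diffusion decoder and the recording environment encoder."""
    params = list(diffusion_decoder.parameters()) + list(env_encoder.parameters())
    optimizer = torch.optim.AdamW(params, lr=8e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20_000, gamma=0.5)
    step = 0
    while step < num_steps:
        for batch in data_loader:               # batch size 32 is set in the loader
            loss = loss_fn(batch)               # e.g., the objective of Eq. (6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            step += 1
            if step >= num_steps:
                break
```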
3.3 Evaluation
We compare DiffRENT with the acoustic matching model [27]. Due to the absence of source code, we reproduced the model based on the description provided in [27]. To evaluate the speech enhancement capability of our model, we also compare our method with CDiffuSE [17] and VoiceFixer [15], state-of-the-art speech enhancement models, using their official implementations. All comparison models were trained with the identical data configuration used for the proposed model.
We conducted an ablation study to evaluate the effectiveness of each component in DiffRENT. For the recording environment encoder, we used as a baseline a simple encoder composed of a 1D convolution with a kernel size of 1, followed by attentive statistics pooling [19] and batch normalization. Furthermore, we evaluated the effect of the content enhancer by comparing against models that use the original input $X$ instead of the enhanced speech $E_c(X)$. In the result tables, W/U denotes the WaveNet/U-Net diffusion decoder, R1/R2 the baseline/ECAPA-TDNN recording environment encoder, and the suffix -C the use of the content enhancer; a minimal sketch of the baseline encoder follows.
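The baseline encoder (R1) can be sketched as below: a kernel-size-1 Conv1d, attentive statistics pooling [19], and batch normalization. Hidden and embedding sizes are assumptions.

```python
import torch
import torch.nn as nn

class BaselineEnvEncoder(nn.Module):
    """Baseline recording environment encoder (R1); channel sizes are assumed."""
    def __init__(self, n_mels=80, hidden=256, embed_dim=256):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=1)
        self.attention = nn.Sequential(          # frame-level attention weights per channel
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.Tanh(),
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.Softmax(dim=-1))
        self.proj = nn.Linear(2 * hidden, embed_dim)
        self.bn = nn.BatchNorm1d(embed_dim)

    def forward(self, mel):                      # mel: (B, n_mels, T)
        h = torch.relu(self.conv(mel))           # (B, hidden, T)
        w = self.attention(h)                    # attention over the time axis
        mu = (w * h).sum(dim=-1)                 # attentive mean
        var = (w * h ** 2).sum(dim=-1) - mu ** 2
        sigma = var.clamp(min=1e-6).sqrt()       # attentive standard deviation
        return self.bn(self.proj(torch.cat([mu, sigma], dim=-1)))
```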
3.3.1 Objective Evaluation
To assess the adaptability of our model to multiple tasks, we employed three test cases:

1. Env-to-Clean: evaluates speech enhancement performance.
2. Clean-to-Env: evaluates the transformation of a clean environment into an unseen environment.
3. Env-to-Env: evaluates the transformation of an unseen environment into a seen environment.
For each test case, 500 excerpt pairs composed of content speech and reference speech were chosen. For Env-to-Env, we randomly selected the target environment of each reference speech. We adopt the log-spectral distance (LSD) [4] and structural similarity (SSIM) [31] as objective metrics. Furthermore, for Env-to-Clean, the wideband perceptual evaluation of speech quality (PESQ-wb) [21] and the scale-invariant spectrogram-to-noise ratio (SiSPNR) [15] are employed. SiSPNR is the scale-invariant signal-to-noise ratio (SiSNR) [13] computed on the magnitude spectrogram.
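For reference, hedged sketches of the two spectrogram-domain metrics; the exact log base, power, and averaging conventions of [4, 13, 15] may differ from the assumptions made here.

```python
import numpy as np

def lsd(spec_ref, spec_est, eps=1e-8):
    """Log-spectral distance between magnitude spectrograms of shape (freq, frames)."""
    log_ref = np.log10(spec_ref ** 2 + eps)
    log_est = np.log10(spec_est ** 2 + eps)
    return np.mean(np.sqrt(np.mean((log_ref - log_est) ** 2, axis=0)))

def sispnr(spec_ref, spec_est, eps=1e-8):
    """Scale-invariant spectrogram-to-noise ratio: SiSNR computed on magnitudes."""
    ref = spec_ref.flatten() - spec_ref.mean()
    est = spec_est.flatten() - spec_est.mean()
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```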
3.3.2 Subjective Evaluation
We conducted a subjective evaluation through a listening test with 20 participants for each test case. Each test case encompassed ten questions. For Env-to-Clean, participants were provided with the content speech as a low anchor and were instructed to rate two criteria: content preservation and enhancement quality. For Clean-to-Env and Env-to-Env, participants were presented with the target speech as a high anchor and asked to rate content preservation and environment similarity. In each question, participants rated the audio generated by our proposed model and by a comparison model; the target speech and the content speech were also rated to estimate upper and lower bounds. Note that content preservation measures how well the speech content is preserved regardless of the recording environment, so both the target speech and the (unprocessed) content speech are expected to receive high scores on this criterion. We used U-R2-C as our model for all test cases; VoiceFixer was the comparison model for Env-to-Clean, and the acoustic matching network for Clean-to-Env and Env-to-Env. Each item was rated on a scale of 1 to 5 points.
4 Results
Table 1(a): Objective evaluation results for Env-to-Clean.

| Method | PESQ ↑ | SiSPNR ↑ | LSD ↓ | SSIM ↑ |
|---|---|---|---|---|
| Unprocessed | 1.34 | 8.45 | 1.24 | 0.82 |
| Target-Mel | 2.9 | 13.83 | 0.23 | 0.99 |
| Target | 4.64 | 128.51 | 0.0 | 1.0 |
| A-Match [27] | 1.4 | 9.68 | 0.9 | 0.9 |
| CDiffuSE [17] | 1.32 | 9.51 | 0.94 | 0.87 |
| VoiceFixer [15] | 1.6 | 11.55 | 0.58 | 0.94 |
| W-R1 | 1.4 | 9.97 | 0.83 | 0.91 |
| W-R2 | 1.33 | 9.29 | 0.95 | 0.9 |
| W-R1-C | 1.47 | 11.12 | 0.66 | 0.93 |
| W-R2-C | 1.53 | 11.24 | 0.64 | 0.93 |
| U-R1 | 1.63 | 10.72 | 0.75 | 0.92 |
| U-R2 | 1.7 | 10.92 | 0.69 | 0.92 |
| U-R1-C | 1.47 | 11.12 | 0.66 | 0.93 |
| U-R2-C | 1.61 | 11.36 | 0.63 | 0.93 |
Table 1(b): Objective evaluation results for Clean-to-Env and Env-to-Env.

| Method | Clean-to-Env LSD ↓ | Clean-to-Env SSIM ↑ | Env-to-Env LSD ↓ | Env-to-Env SSIM ↑ |
|---|---|---|---|---|
| Unprocessed | 1.24 | 0.82 | 0.73 | 0.87 |
| Target-Mel | 0.2 | 0.98 | 0.21 | 0.98 |
| Target | 0.0 | 1.0 | 0.0 | 1.0 |
| A-Match [27] | 0.82 | 0.81 | 0.69 | 0.86 |
| W-R1 | 0.86 | 0.84 | 0.68 | 0.87 |
| W-R2 | 1.0 | 0.83 | 0.8 | 0.86 |
| W-R1-C | 0.64 | 0.85 | 0.62 | 0.87 |
| W-R2-C | 0.58 | 0.87 | 0.57 | 0.89 |
| U-R1 | 0.76 | 0.84 | 0.71 | 0.86 |
| U-R2 | 0.71 | 0.86 | 0.59 | 0.89 |
| U-R1-C | 0.64 | 0.85 | 0.62 | 0.87 |
| U-R2-C | 0.59 | 0.87 | 0.55 | 0.9 |
4.1 Objective Evaluation
Table 1 shows the results of the objective evaluation. In all test cases, all proposed models except those using the WaveNet decoder without the content enhancer outperform the acoustic matching model on all metrics. In Env-to-Clean, three proposed models achieve better PESQ values than VoiceFixer, which performs best among the comparison models, and model U-R2-C achieves performance comparable to VoiceFixer across all metrics. In the majority of cases, models employing the U-Net architecture in the diffusion decoder outperform those with the WaveNet decoder. Additionally, models with the R2 encoder consistently outperform models with the R1 encoder, which implies that better-disentangled features improve the decoder's ability to generate sound in the target environment. There is only a single case where the model with R1 outperforms the model with R2 (W-R1 vs. W-R2); this might be attributed to the limited capacity of the decoder, which could lead to degraded performance with diverse input conditions. The results also show the effectiveness of the content enhancer.
Table 2: Subjective evaluation results (CP: content preservation, EQ: enhancement quality, ES: environment similarity).

| Method | Env-to-Clean CP ↑ | Env-to-Clean EQ ↑ | Clean-to-Env CP ↑ | Clean-to-Env ES ↑ | Env-to-Env CP ↑ | Env-to-Env ES ↑ |
|---|---|---|---|---|---|---|
| Unprocessed | 4.39 | 2.31 | 4.35 | 1.57 | 4.26 | 3.03 |
| Target | 4.72 | 4.74 | 4.68 | 4.66 | 4.71 | 4.60 |
| VoiceFixer [15] | 3.52 | 3.64 | - | - | - | - |
| A-Match [27] | - | - | 3.87 | 2.92 | 4.02 | 3.45 |
| DiffRENT | 4.22 | 4.38 | 4.39 | 4.00 | 4.33 | 4.15 |
4.2 Subjective Evaluation
Table 2 shows the results of the subjective evaluation. While no significant differences were observed in the objective evaluation, our model notably surpasses VoiceFixer in the subjective evaluation, implying that our model generates sound with greater perceptual quality than VoiceFixer. Across all metrics, our model also outperforms the acoustic matching network. We found that the performance of the acoustic matching network deteriorates when there is a substantial difference between the environments of the target and the content speech, whereas our model maintains consistent performance.
4.3 Analysis of the Recording Environment Encoder
We investigated the recording environment encoder to better understand its behavior by visualizing its latent space with t-SNE. We used 400 audio files composed of 4 unseen speakers, 10 excerpts, and 10 environments. The test environments include a clean environment, environments with one unseen condition, and a totally unseen environment. Fig. 2 shows the t-SNE plots of the embeddings from the acoustic embedding network [27] and the two recording environment encoders of DiffRENT. Points sharing the same color represent audio from the identical recording environment. The point locations show that the two encoders in DiffRENT disentangle the recording environment better than the acoustic embedding network. Furthermore, the encoder adopting ECAPA-TDNN exhibits clearer separation among different recording environments than the baseline encoder.
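A minimal sketch of such an embedding visualization, assuming `embeddings` is an (N, D) array of encoder outputs and `env_labels` the corresponding environment IDs; plotting details are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_env_embeddings(embeddings, env_labels):
    """Project recording-environment embeddings to 2-D with t-SNE and color by environment."""
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    for env in np.unique(env_labels):
        idx = env_labels == env
        plt.scatter(coords[idx, 0], coords[idx, 1], s=8, label=str(env))
    plt.legend(fontsize=6)
    plt.show()
```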
5 Conclusions
We propose a novel recording environment transfer diffusion model for speech, which models microphone type and placement, room acoustics, and noise from a reference speech. The model effectively disentangles the recording environment and is adaptable to multiple acoustic transformation scenarios driven by the reference speech. Both objective and subjective evaluations show its effectiveness in speech enhancement and acoustic matching. Future work will involve further enhancing audio quality by improving the neural vocoder. Additionally, we plan to develop a technique to reduce the number of refinement steps in the diffusion decoder for fast inference.
References
- [1] S. V. Amengual Gari, B. Sahin, D. Eddy, and M. Kob. Open database of spatial room impulse responses at detmold university of music. In Audio Engineering Society Convention 149. Audio Engineering Society, 2020.
- [2] C. Chen, U. Jain, C. Schissler, S. V. A. Gari, Z. Al-Halah, V. K. Ithapu, P. Robinson, and K. Grauman. Soundspaces: Audio-visual navigation in 3d environments. In ECCV, pages 17–36. Springer, 2020.
- [3] B. Desplanques, J. Thienpondt, and K. Demuynck. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. Interspeech 2020, pages 3830–3834, 2020.
- [4] A. Erell and M. Weintraub. Estimation using log-spectral-distance criterion for noise-robust speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, pages 853–856. IEEE, 1990.
- [5] H. Gamper and I. J. Tashev. Blind reverberation time estimation using a convolutional neural network. In IWAENC, pages 136–140. IEEE, 2018.
- [6] F. G. Germain, G. J. Mysore, and T. Fujioka. Equalization matching of speech recordings in real-world environments. In ICASSP, pages 609–613. IEEE, 2016.
- [7] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- [8] J. Im, S. Choi, S. Yong, and J. Nam. Neural vocoder feature estimation for dry singing voice separation. In APSIPA ASC, pages 809–814. IEEE, 2022.
- [9] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, A. Sehr, W. Kellermann, and R. Maas. The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In WASPAA, pages 1–4, 2013.
- [10] J. Kong, J. Kim, and J. Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033, 2020.
- [11] Q. Kong, Y. Cao, H. Liu, K. Choi, and Y. Wang. Decoupling magnitude and phase estimation with deep resunet for music source separation. In ISMIR, pages 342–349, 2021.
- [12] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In ICLR, 2021.
- [13] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey. Sdr–half-baked or well done? In ICASSP, pages 626–630. IEEE, 2019.
- [14] H. Li and J. Yamagishi. DDS: A new device-degraded speech dataset for speech enhancement. In Proc. Interspeech 2022, pages 2913–2917, 2022.
- [15] H. Liu, X. Liu, Q. Kong, Q. Tian, Y. Zhao, D. Wang, C. Huang, and Y. Wang. VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration. In Proc. Interspeech 2022, pages 4232–4236, 2022.
- [16] S. Liu, Y. Cao, D. Su, and H. Meng. Diffsvc: A diffusion probabilistic model for singing voice conversion. In ASRU, pages 741–748. IEEE, 2021.
- [17] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao. Conditional diffusion probabilistic model for speech enhancement. In ICASSP, pages 7402–7406. IEEE, 2022.
- [18] A. Mathur, A. Isopoussu, F. Kawsar, N. Berthouze, and N. D. Lane. Mic2mic: using cycle-consistent generative adversarial networks to overcome microphone variability in speech systems. In ACM/IEEE IPSN, pages 169–180, 2019.
- [19] K. Okabe, T. Koshinaka, and K. Shinoda. Attentive statistics pooling for deep speaker embedding. Proc. Interspeech 2018, pages 2252–2256, 2018.
- [20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. Librispeech: an asr corpus based on public domain audio books. In ICASSP, pages 5206–5210. IEEE, 2015.
- [21] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra. Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221), volume 2, pages 749–752. IEEE, 2001.
- [22] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
- [23] A. Sarroff and R. Michaels. Blind arbitrary reverb matching. In DAFx, volume 2, 2020.
- [24] N. Singh, J. Mentch, J. Ng, M. Beveridge, and I. Drori. Image2reverb: Cross-modal reverb impulse response synthesis. In ICCV, pages 286–295, 2021.
- [25] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
- [26] C. J. Steinmetz, V. K. Ithapu, and P. Calamia. Filtered noise shaping for time domain room impulse response estimation from reverberant speech. In WASPAA, 2021.
- [27] J. Su, Z. Jin, and A. Finkelstein. Acoustic matching by embedding impulse responses. In ICASSP, pages 426–430. IEEE, 2020.
- [28] J. Su, Z. Jin, and A. Finkelstein. Hifi-gan-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features. In WASPAA, pages 166–170. IEEE, 2021.
- [29] J. Traer and J. H. McDermott. Statistics of natural reverberation enable perceptual separation of sound and space. Proceedings of the National Academy of Sciences, 113(48):E7856–E7865, 2016.
- [30] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop, pages 125–125, 2016.
- [31] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.