Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms
Abstract
We describe our approach for the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition. We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples. The mel-spectrograms generated by the model are then inverted back to the audio domain. As a result, our generated samples substantially improve upon the competition baseline, both qualitatively and quantitatively, for all emotions. More precisely, even for our worst-performing emotion (awe), we obtain an FAD of 1.76 compared to the baseline of 4.81 (for reference, the FAD between the train/validation sets for awe is 0.776). Code for this work can be found at https://github.com/marcojira/stylegan3-melspectrograms.
1 Introduction
The Emotion Generation Track (ExVo Generate) of the ICML Expressive Vocalizations Competition presents a unique opportunity to examine generative models and optimize their ability to create audio samples that resemble those of a novel dataset of non-verbal vocal bursts (Cowen et al., 2022). The hope is that the resulting models can provide exciting insights into the interplay between emotion and audio.
2 Related Work
While there is extensive research on mel-spectrograms as a bridge between the audio and visual domains, most work focuses on mapping mel-spectrograms to audio (using GANs or other techniques) rather than on the unconditional generation of the mel-spectrograms themselves (Kumar et al., 2019; Shen et al., 2018; Prenger et al., 2019; Wang et al., 2018). By combining the methods above with an additional network that maps text sequences to mel-spectrograms, state-of-the-art voice synthesis is possible (Tan et al., 2021).
However, because the first part of this process typically depends on a text sequence being provided as input, unconditional generation of mel-spectrograms remains poorly understood. This area is crucial for the non-verbal domain, where we cannot provide complete text input to the model. Instead, the model must perform either completely unconditional generation or conditional generation with very little information (e.g., only a specific emotion to generate).
The closest work examining the unconditional generation of non-verbal audio samples is Afsar et al. (2021), which uses an MSG-GAN to generate realistic laughter from a relatively small dataset of laughter samples. Conditional generation of laughter using additional information such as gender and age is also briefly explored, and interpolations between genders demonstrate the possibility of learning a meaningful latent space.
Our Contributions. We show that a conditional StyleGAN2 trained on mel-spectrograms of preprocessed vocal bursts, combined with mel-spectrogram inversion, generates emotional vocal bursts that substantially improve upon the competition baseline in terms of FAD and the combined score for every emotion, and in terms of HEEP for all but two emotions.
3 Method
At a high level, our model’s audio generation process consists of 4 steps:
1. Preprocess the audio to improve sample quality (for all emotions).
2. Convert the processed audio to square mel-spectrograms of dimension 128x128 or 256x256.
3. Train StyleGAN2 on the mel-spectrograms and use it to generate new ones.
4. Convert the generated mel-spectrograms back to audio.
3.1 Audio preprocessing
To improve sample quality, training, and subsequent generation, we aim to create samples composed almost solely of the vocal burst. As we observe that many of the dataset's samples have varying noise levels, we first denoise the audio using the noisereduce library (Sainburg; Sainburg et al., 2020), which uses spectral gating (i.e., a filtering threshold on the spectrogram of audio samples) to remove noise.
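A minimal sketch of this denoising step is shown below; it assumes 16 kHz mono recordings and uses noisereduce's spectral-gating interface with default parameters (the parameter values shown are illustrative rather than the exact ones used in our pipeline).

```python
import librosa
import noisereduce as nr
import soundfile as sf

# Load a vocal burst as a mono waveform (16 kHz assumed for illustration).
audio, sr = librosa.load("sample.wav", sr=16000, mono=True)

# Spectral gating: estimate a per-frequency noise threshold from the
# spectrogram and attenuate everything below it (default parameters).
denoised = nr.reduce_noise(y=audio, sr=sr)

sf.write("sample_denoised.wav", denoised, sr)
```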
Once the audio is denoised, we trim the silence at the start and end of each sample. As samples often have varying amounts of silence at the start and end of recordings (usually an artifact of how the participant recorded the sample), trimming ensures that a larger proportion of the mel-spectrogram contains information about the vocal burst.
Finally, we remove empty or noise-only samples with a simple thresholding test, discarding every sample for which too large a fraction of the log-scale mel-spectrogram falls below a fixed amplitude threshold, as well as every sample that is too short after trimming. Combined, this eliminates 1047 samples.
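A sketch of the trimming and filtering steps is given below; the numeric thresholds (top_db, minimum length, amplitude cutoff, quiet fraction) are illustrative placeholders rather than the values used in our experiments.

```python
import librosa
import numpy as np

def trim_and_filter(path, top_db=30, min_length_s=0.5,
                    quiet_fraction=0.9, amp_threshold_db=-60.0):
    """Trim leading/trailing silence and decide whether to keep the sample.

    All numeric thresholds are illustrative placeholders, not the values
    used in our experiments.
    """
    audio, sr = librosa.load(path, sr=16000, mono=True)

    # Trim silence at the start/end of the recording.
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)

    # Discard samples that are too short after trimming.
    if len(trimmed) / sr < min_length_s:
        return None

    # Discard samples whose log-scale mel-spectrogram is mostly below the
    # amplitude threshold (i.e., essentially empty or pure noise).
    mel = librosa.feature.melspectrogram(y=trimmed, sr=sr,
                                         n_mels=128, hop_length=256)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    if np.mean(log_mel < amp_threshold_db) > quiet_fraction:
        return None

    return trimmed
```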
3.2 Mel-spectrogram conversion
We transform the remaining processed samples into mel-spectrograms with torchaudio's (Yang et al., 2021) mel-spectrogram transform, using filter values from librosa (Brian McFee et al., 2015) for a more straightforward inversion. Specifically, we generate two sets at different resolutions (128 and 256) to experiment with the impact of higher resolution on audio quality and the training process.
The number of mel filterbanks determines the resolution of the resulting mel-spectrogram: we use 128 for the first set and 256 for the second. While the higher resolution does allow for higher quality audio after inversion, we find that the improvement is not significant compared to the potential improvement in generation quality (i.e., great low-resolution mel-spectrograms will sound better once inverted than merely good high-resolution mel-spectrograms).
The hop length is the distance between the windows on which the Short-Time Fourier Transform (STFT) is applied. We use 256 for both sets and either pad (with a constant value) or truncate the resulting samples so that they are of size 128x128 or 256x256. As most samples (after being trimmed) are between 1 and 3 seconds long, this process avoids excessive padding or truncation. However, a more principled approach could be used in the future to truly minimize padding/truncation.
Lastly, we convert the samples to images by rescaling pixel values to be between 0 and 255. Using all of the above yields two datasets of grayscale samples (i.e. with one channel) of dimensions 128x128 or 256x256. Examples of the resulting samples can be seen in Figure 1.
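A sketch of this conversion, from waveform to grayscale image, is shown below; the sample rate, FFT size, padding value, and per-sample rescaling are assumptions made for illustration rather than the exact settings used.

```python
import torch
import torchaudio

SR = 16000          # assumed sample rate
N_MELS = 128        # 128 for the low-resolution set, 256 for the high-resolution one
HOP_LENGTH = 256
N_FRAMES = N_MELS   # square images: 128x128 or 256x256

# Slaney-style filters/normalization match librosa's default mel filterbank,
# which simplifies the inversion step later.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR,
    n_fft=1024,          # assumption: FFT size not specified in the text
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
    norm="slaney",
    mel_scale="slaney",
)

def to_image(waveform: torch.Tensor) -> torch.Tensor:
    """Convert a mono waveform (1, T) into a uint8 grayscale image (1, N_MELS, N_FRAMES)."""
    mel = mel_transform(waveform)                # (1, N_MELS, frames)
    log_mel = torch.log10(mel.clamp(min=1e-5))   # log scale; clamp avoids log(0)

    # Pad (with the minimum log value, an assumption) or truncate in time
    # so that the image is square.
    frames = log_mel.shape[-1]
    if frames < N_FRAMES:
        pad = torch.full((1, N_MELS, N_FRAMES - frames), float(log_mel.min()))
        log_mel = torch.cat([log_mel, pad], dim=-1)
    else:
        log_mel = log_mel[..., :N_FRAMES]

    # Rescale to [0, 255]; a dataset-wide scale could be used instead of a
    # per-sample one.
    img = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    return (img * 255).to(torch.uint8)
```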
3.3 StyleGAN training
We use the StyleGAN2 architecture (Karras et al., 2020a;b) from the StyleGAN3 repository (Karras et al., 2021), which maps the latent code to an intermediate latent space whose vectors are fed (after an affine transformation) to each layer of the generator to modulate its convolutions (replacing the adaptive instance normalization of the original StyleGAN). On the resulting grayscale samples, we train both an unconditional model (on samples from all emotions at once) and a conditional one, at both resolutions, yielding a total of 4 models. We set the R1 regularization weight gamma as recommended for images of that resolution and train for 72 hours with a batch size of 32 on an RTX 8000.
The unconditional version is trained on the entire dataset (including the test set), while the conditional version is trained on the training and validation sets only (as we do not have access to the classes of the test set). For each emotion, we select all samples whose score for that emotion is 1 (except for Triumph, which lacks such samples).
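For the conditional models, the emotion of each selected sample must be exposed to the training code as a class label. The sketch below shows one way to produce the dataset.json label file read by the StyleGAN3 repository's dataset loader; the helper function and directory layout are illustrative assumptions.

```python
import json
import os

# Map each emotion to an integer class index for conditional training.
EMOTIONS = ["Amusement", "Awe", "Awkwardness", "Distress", "Excitement",
            "Fear", "Horror", "Sadness", "Surprise", "Triumph"]
EMOTION_TO_IDX = {e: i for i, e in enumerate(EMOTIONS)}

def write_labels(image_dir: str, sample_emotions: dict) -> None:
    """Write the dataset.json label file read by the StyleGAN3 repo's dataset loader.

    `sample_emotions` maps an image filename (relative to `image_dir`) to the
    emotion whose score for that sample is 1.
    """
    labels = [[fname, EMOTION_TO_IDX[emotion]]
              for fname, emotion in sample_emotions.items()]
    with open(os.path.join(image_dir, "dataset.json"), "w") as f:
        json.dump({"labels": labels}, f)
```

Conditional training is then enabled through the repository's `--cond` training flag.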
3.4 Mel-spectrogram inversion
After generating mel-spectrograms with the trained StyleGAN2 models, we invert them back to audio samples. To do so, we first rescale them from [-1, 1] back to the log-amplitude scale and then invert the log transformation (i.e., exponentiate). Finally, we apply librosa's (Brian McFee et al., 2015) mel-spectrogram inversion function, which uses the Griffin-Lim algorithm (Griffin & Lim, 1984) to convert mel-spectrograms to audio.
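A minimal sketch of this inversion is shown below, under the assumptions of the conversion sketch above (base-10 log, 16 kHz sample rate, FFT size 1024) and assuming the rescaling bounds were stored when the dataset was built.

```python
import librosa
import numpy as np

SR = 16000          # assumed sample rate, matching the forward transform
N_FFT = 1024        # assumed FFT size, matching the forward transform
HOP_LENGTH = 256

def image_to_audio(img: np.ndarray, log_min: float, log_max: float) -> np.ndarray:
    """Invert a generated mel-spectrogram image with values in [-1, 1] to audio.

    `log_min` and `log_max` are the log-amplitude bounds used when the
    training images were rescaled; they must be stored alongside the dataset.
    """
    # [-1, 1] -> log-amplitude scale.
    log_mel = (img + 1.0) / 2.0 * (log_max - log_min) + log_min
    # Undo the log transform (base-10 log assumed).
    mel = np.power(10.0, log_mel)
    # Griffin-Lim based inversion of the mel-spectrogram.
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=SR, n_fft=N_FFT, hop_length=HOP_LENGTH)
```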
4 Results
The StyleGAN2 architecture learns to produce realistic mel-spectrograms in both the unconditional and conditional cases with relatively little tuning. Audio samples from the best conditional model checkpoint are available at https://bit.ly/39Ea0gk. The FAD (Fréchet Audio Distance), HEEP (Human-Evaluated Expression Precision), and combined scores of the best model can be found in Table 1. Overall, our StyleGAN2-based method significantly improves the generation of each emotion in terms of FAD and combined score, while HEEP is significantly improved for every emotion except Awe and Horror, which are marginally worse.
Before continuing, we distinguish between the visual FID (Heusel et al., 2017) and the evaluation metric used in (Baird et al., 2022), which we designate as FAD (Fréchet Audio Distance): it still computes the Fréchet distance between fitted Gaussians, but relies on features from the last layer of a model trained on audio data rather than on features from Inception Net (Szegedy et al., 2016).
HEEP, on the other hand, uses human evaluations of the submitted samples, where each sample is assigned a rating for each emotion based on how intensely the sample expresses that emotion. The resulting rating matrix is then compared to the matrix of one-hot encoded samples (i.e., 1 for the emotion the sample is supposed to represent and 0 for the others); we refer the reader to Baird et al. (2022) for the precise formula.
Finally, the competition's combined score aggregates FAD and HEEP into a single measure (higher is better); again, see Baird et al. (2022) for its exact definition.
Figure 3 compares FAD and FID as training progresses. Even after only a few epochs, the generator learns to produce mel-spectrograms that, to the untrained eye, resemble those obtained from the real samples. For most configurations we experimented with, the lowest visual FID is achieved roughly halfway through training. Nonetheless, sample quality keeps increasing qualitatively until training eventually breaks down, with mode collapse observed for all models (similar to the collapse observed by Brock et al. (2019)). Visually, this progression can be seen in Figure 2.
Table 1: FAD, HEEP, and combined score per emotion for our best model and the competition baseline.

| Emotion | Amuse. | Awe | Awkward. | Distress | Excite. | Fear | Horror | Sadness | Surprise |
|---|---|---|---|---|---|---|---|---|---|
| FAD (lower is better) | | | | | | | | | |
| Lower Bound | .634 | .776 | 1.20 | .866 | .697 | .649 | 1.25 | .992 | .341 |
| This Work | 1.28 | 1.76 | 1.76 | 1.77 | 1.75 | 1.57 | 1.34 | 0.94 | 1.67 |
| Baseline | 4.92 | 4.81 | 8.27 | 6.11 | 6.00 | 5.71 | 5.64 | 5.00 | 6.08 |
| HEEP (higher is better) | | | | | | | | | |
| This Work | 0.707 | 0.455 | 0.312 | 0.372 | 0.212 | 0.229 | 0.205 | 0.359 | 0.599 |
| Baseline | 0.490 | 0.46 | 0.036 | 0.32 | 0.084 | 0.042 | 0.27 | -0.033 | 0.22 |
| Combined score (higher is better) | | | | | | | | | |
| This Work | 0.744 | 0.512 | 0.440 | 0.467 | 0.392 | 0.433 | 0.476 | 0.711 | 0.598 |
| Baseline | 0.347 | 0.334 | 0.078 | 0.242 | 0.125 | 0.109 | 0.224 | 0.084 | 0.192 |
To generate the samples submitted to the competition, we ultimately select the conditional 128x128 model, sampled at a checkpoint slightly before mode collapse: manual inspection combined with FAD indicates that this checkpoint yields the highest quality samples. We do not use the truncation trick, which would improve the precision of the generated samples at the cost of diversity.
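For reference, sampling from a trained checkpoint can be sketched as follows, mirroring the generation script pattern of the StyleGAN3 repository; it assumes the code runs from within that repository (for its dnnlib and legacy modules), and the checkpoint filename is hypothetical.

```python
import numpy as np
import torch

# dnnlib and legacy ship with the NVlabs StyleGAN3 repository.
import dnnlib
import legacy

device = torch.device("cuda")
network_pkl = "network-snapshot-001234.pkl"  # hypothetical checkpoint path

with dnnlib.util.open_url(network_pkl) as f:
    G = legacy.load_network_pkl(f)["G_ema"].to(device)

# One-hot label selecting the emotion to generate (conditional model).
label = torch.zeros([1, G.c_dim], device=device)
label[0, 3] = 1  # e.g., class index 3

z = torch.from_numpy(np.random.randn(1, G.z_dim)).to(device)

# truncation_psi=1.0 disables the truncation trick (keeps diversity);
# values below 1 trade diversity for precision.
img = G(z, label, truncation_psi=1.0, noise_mode="const")  # (1, 1, 128, 128), roughly in [-1, 1]
```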
While we do not report results for the models trained on 256x256 mel-spectrograms, we find that, although they achieve decent results, the 128x128 equivalents are almost always better in terms of sample quality. The higher resolution models also suffer from the same mode collapse issue.
5 Discussion
Generally, we find that StyleGAN2 is a suitable architecture for generating realistic mel-spectrograms, as long as early stopping (or some manual equivalent) with respect to FAD is used, due to the presence of mode collapse.
While StyleGAN2 has generally been more resistant to mode collapse than other architectures, the issue in this case is perhaps exacerbated by the relatively high image resolution combined with a small dataset. Further work is needed to better understand its cause and to find methods to avoid or delay it.
As shown in Figure 3, even after FID starts increasing, FAD keeps improving (as does our qualitative evaluation of the generated audio samples). This discrepancy seems to indicate two things: FAD is a better metric than FID for evaluating the generator, and StyleGAN2 is able to learn domain-specific features that are ordinarily not required for visual learning tasks (or, at least, that are not learned by Inception Net). This difference could also explain why it is hard to visually observe the quality of the mel-spectrograms increasing (after the initial stage of training) even though the quality of the generated audio keeps improving.
The experiments with 256x256 mel-spectrograms illustrate a tradeoff that needs to be made between improving the inversion quality (a higher resolution mel-spectrogram will be inverted to higher quality audio) and the difficulty of the image-generation task (it is harder to learn to generate higher resolution images). While 128x128 seemed to offer a good middle ground (a not too significant loss in inversion quality with a resolution that StyleGAN2 has generally performed well on), it is possible that a lower or higher resolution would better optimize that tradeoff.
6 Conclusion
While this initial foray into unconditional, non-verbal audio generation is promising, multiple parts of this audio generation process can be further optimized.
Due to time constraints, little work was done on hyperparameter exploration (amount of preprocessing, mel-spectrogram generation parameters, StyleGAN2 training parameters, etc.). A more principled search of these hyperparameters (especially the resolution of the images), as well as the addition of some form of extra regularization, could significantly improve performance. For example, Tseng et al. (2021) find that a regularization term penalizing the discriminator for producing different outputs for real and fake images helps postpone mode collapse and stabilize training. There has also been recent work on optimizer-related techniques for stabilizing GAN training; exploring optimizers other than Adam, such as ExtraAdam (Gidel et al., 2019), could also address mode collapse.
Additionally, instead of using the Griffin-Lim algorithm for mel-spectrogram inversion, taking advantage of a model such as MelGAN (Kumar et al., 2019) could improve inversion quality and speed. It would be particularly interesting to see whether the two networks could be trained jointly or whether it would be better to train them separately (MelGAN might also yield better inversion from low-resolution mel-spectrograms).
While StyleGAN2 works well as a general-purpose image generation framework, many of the optimizations it introduces are not designed with the domain of mel-spectrogram images in mind. Further domain-specific techniques (for example, adapting the architecture to generate rectangular images, which are better suited to audio) could bring significant improvements in mel-spectrogram generation capabilities.
Beyond improving generation capabilities, a principled investigation of the latent space of the trained GANs could yield interesting insights. It would be particularly relevant to attempt interpolation between different emotions. We could then perhaps answer questions such as: what sound is halfway between horror and awe?
References
- Afsar et al. (2021) Afsar, M. M., Park, E., Paquette, E., Gidel, G., Mathewson, K. W., and Muller, E. Generating diverse realistic laughter for interactive art. NeurIPS Workshop on ML for Creativity and Design, Sydney, Australia, 2021. doi: 10.48550/ARXIV.2111.03146.
- Baird et al. (2022) Baird, A., Tzirakis, P., Gidel, G., Jiralerspong, M., Muller, E. B., Mathewson, K., Schuller, B., Cambria, E., Keltner, D., and Cowen, A. The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts, 2022.
- Brian McFee et al. (2015) Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and Music Signal Analysis in Python. In Kathryn Huff and James Bergstra (eds.), Proceedings of the 14th Python in Science Conference, 2015. doi: 10.25080/Majora-7b98e3ed-003.
- Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
- Cowen et al. (2022) Cowen, A., Bard, A., Tzirakis, P., Opara, M., Kim, L., Brooks, J., and Metrick, J. The hume vocal burst competition dataset (H-VB) — raw data [exvo: updated 02.28.22] [data set]. Zenodo, 2022. URL https://doi.org/10.5281/zenodo.6308780.
- Gidel et al. (2019) Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-Julien, S. A Variational Inequality Perspective on Generative Adversarial Networks. In International Conference on Learning Representations, 2019.
- Griffin & Lim (1984) Griffin, D. and Lim, J. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984. doi: 10.1109/TASSP.1984.1164317.
- Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Karras et al. (2020a) Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. 2020a. doi: 10.48550/ARXIV.2006.06676.
- Karras et al. (2020b) Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020b.
- Karras et al. (2021) Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Stylegan3. https://github.com/NVlabs/stylegan3, 2021.
- Kumar et al. (2019) Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brébisson, A., Bengio, Y., and Courville, A. C. Melgan: Generative adversarial networks for conditional waveform synthesis. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Prenger et al. (2019) Prenger, R., Valle, R., and Catanzaro, B. Waveglow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019. doi: 10.1109/ICASSP.2019.8683143.
- Sainburg, T. timsainb/noisereduce: v1.0. doi: 10.5281/zenodo.3243139.
- Sainburg et al. (2020) Sainburg, T., Thielk, M., and Gentner, T. Q. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires. PLoS computational biology, 16, 2020.
- Shen et al. (2018) Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerrv-Ryan, R., Saurous, R. A., Agiomvrgiannakis, Y., and Wu, Y. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018. doi: 10.1109/ICASSP.2018.8461368.
- Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
- Tan et al. (2021) Tan, X., Qin, T., Soong, F., and Liu, T.-Y. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561, 2021.
- Tseng et al. (2021) Tseng, H.-Y., Jiang, L., Liu, C., Yang, M.-H., and Yang, W. Regularizing generative adversarial networks under limited data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
- Wang et al. (2018) Wang, Y., Stanton, D., Zhang, Y., Ryan, R.-S., Battenberg, E., Shor, J., Xiao, Y., Jia, Y., Ren, F., and Saurous, R. A. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research. PMLR, 10–15 Jul 2018.
- Yang et al. (2021) Yang, Y.-Y., Hira, M., Ni, Z., Chourdia, A., Astafurov, A., Chen, C., Yeh, C.-F., Puhrsch, C., Pollack, D., Genzel, D., Greenberg, D., Yang, E. Z., Lian, J., Mahadeokar, J., Hwang, J., Chen, J., Goldsborough, P., Roy, P., Narenthiran, S., Watanabe, S., Chintala, S., Quenneville-Bélair, V., and Shi, Y. Torchaudio: Building blocks for audio and speech processing. arXiv preprint arXiv:2110.15018, 2021.