ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model
Abstract
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate first-order ambisonics (FOA) audio, a conventional spatial audio format comprising four channels that can be rendered to multichannel spatial output. The proposed generative system is composed of a spatial audio codec that maps FOA audio to latent components, a latent diffusion model trained on various user input types, namely text prompts, spatial, temporal, and environmental acoustic parameters, and optionally a spatial audio and text encoder trained in a Contrastive Language and Audio Pretraining (CLAP) style. We propose metrics to evaluate the quality and spatial adherence of the generated spatial audio. Finally, we assess the model performance in terms of generation quality and spatial conformance, comparing the two proposed modes: "descriptive", which uses spatial text prompts, and "parametric", which uses non-spatial text prompts and spatial parameters. Our evaluations demonstrate promising results that are consistent with the user conditions and reflect reliable spatial fidelity.
Index Terms— Generative spatial audio, immersive audio synthesis, latent diffusion model, ambisonic audio generation, interactive spatial audio generation.
1 Introduction
In today’s dynamic technology landscape, there is an increasing demand for immersive audio experiences across many domains, including virtual and augmented reality platforms, healthcare, education, and entertainment. Generative machine learning can play a significant role in these applications by creating audio content in dynamic and interactive ways. While generative audio models have made substantial progress recently, they are still limited to generating mono or stereo audio, without the capability to accurately pan sources to desired spatial locations [1, 2, 3].

In [1], Liu et al. introduced AudioLDM, which overcomes the need for large-scale audio-text annotation by using CLAP [5]. By mapping audio to the latent domain with an audio encoder, the authors demonstrated that general audio latents can be generated and then decoded into the desired final audio. In [2], the authors extended AudioLDM to generate a variety of sound types, including speech, music, and other sound effects. Despite their capability to generate good-quality audio content, neither AudioLDM variant is designed to generate spatial audio.
Several other models have been proposed to produce stereo audio. MusicGen [6] employs a single-stage transformer language model combined with efficient token interleaving patterns to create stereo-channel music conditioned on textual descriptions or melodic features. Jen-1 [7] uses a transformer latent diffusion structure to generate text-conditioned stereo-channel music. Moûsai [8] uses a two-stage cascaded latent diffusion model with a 1D convolutional U-Net to generate stereo music. Stable Audio [3] adopts a 1D convolutional network with temporal conditioning in a U-Net configuration to produce stereo music. Its subsequent version [9] incorporates a transformer-based diffusion model with an enhanced codec to generate extended stereo-channel music audio. Although these stereo models can generate two-channel audio, they lack the ability to map audio sources to desired spatial locations. For example, we found that they are unable to generate sound exclusively from the left channel in response to a prompt similar to "a dog barking on the left". This limitation hinders their suitability for creating stereo audio experiences with precise spatial rendering.
This paper introduces ImmerseDiffusion, a text-conditioned generative spatial audio model designed to produce 3D Ambisonic soundscapes with localization in the horizontal, vertical, and distance dimensions, while also accounting for temporal and environmental factors such as room size and reverberation. Our model generates FOA-domain audio and features two operational modes: descriptive and parametric. The descriptive model generates spatial audio based on textual descriptions of sound sources and their spatial and environmental details, suited for narrative-driven applications such as cinematic audio. The parametric model combines textual descriptions of sound sources with numerical spatial parameters, ideal for machine-centered uses such as game engines and virtual simulations. To the best of our knowledge, this is the first spatial audio generative model. Consequently, we introduce new metrics to evaluate the quality of the generated FOA audio. These metrics include Ambisonics Fréchet Audio Distance (FAD), spatial Kullback–Leibler (KL) divergence, and spatial CLAP scores. Additionally, we assess spatial accuracy using azimuth, elevation, and distance L1 errors, as well as the spatial angle, all derived from sound intensity vectors [10].
2 Methodology
ImmerseDiffusion integrates three main components: a spatial autoencoder that encodes 4-channel ambisonic-domain signals into a continuous latent domain and decodes them back to the waveform representation, a cross-attention-based conditioner block that incorporates spatial information in two distinct manners, tailored to the descriptive and parametric generation modes, and a transformer-based diffusion model operating on the autoencoder latents. This architecture enables the generation of spatial audio aligned with audio source descriptions and spatial, environmental, and temporal cues, enhancing realism and immersion across various interactive and narrative-driven applications.
2.1 Spatial Audio Format
The Ambisonic surround-sound system encodes and reproduces sound directions and amplitudes to deliver immersive 3D audio experiences. For simplicity, and given that the first-order Ambisonics (FOA) format generally provides adequate spatial resolution [11], this paper focuses exclusively on generating FOA components. However, the approach can be generalized to higher-order Ambisonics for improved spatial fidelity, albeit with increased complexity. FOA achieves spatial rendering using four channel components, denoted W, X, Y, and Z, as follows:
$W = \frac{p}{\sqrt{2}}, \qquad X = p\,\cos\theta\,\cos\phi, \qquad Y = p\,\sin\theta\,\cos\phi, \qquad Z = p\,\sin\phi \qquad (1)$

where $p$ is the sound pressure, $W$ is the omnidirectional component, $X$ is the front-back directional component, defined by the azimuth angle $\theta$ and elevation angle $\phi$, $Y$ is the left-right directional component, defined by the azimuth angle $\theta$ and elevation angle $\phi$, and $Z$ is the up-down directional component, defined by the elevation angle $\phi$. Both the training data and the generated audio conform to this standard.
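As an illustrative sketch (not part of the proposed system), the following code encodes a mono pressure signal into the four FOA channels according to equation (1); the $1/\sqrt{2}$ scaling of W follows the conventional B-format normalization, and the function name and example values are our own assumptions.

```python
import numpy as np

def encode_foa(p: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Encode a mono pressure signal p into FOA (W, X, Y, Z) channels for a
    plane-wave source at the given azimuth/elevation (radians), per equation (1)."""
    w = p / np.sqrt(2.0)                          # omnidirectional component
    x = p * np.cos(azimuth) * np.cos(elevation)   # front-back component
    y = p * np.sin(azimuth) * np.cos(elevation)   # left-right component
    z = p * np.sin(elevation)                     # up-down component
    return np.stack([w, x, y, z], axis=0)         # shape: (4, T)

# Example: a 1 kHz tone placed 90 degrees to the left, at ear height.
t = np.arange(48000) / 48000.0
foa = encode_foa(np.sin(2 * np.pi * 1000 * t), azimuth=np.pi / 2, elevation=0.0)
```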
2.2 Spatial Codec
Large model sizes, high-dimensional audio data, iterative diffusion processes, and the need to capture long-term structures necessitate efficient computation in generative audio models. Autoencoder architectures are commonly employed to manage these demands by operating within compressed latent domains. For our model, which generates four FOA components simultaneously, compression is even more critical.
To address compression, various approaches have been explored in the literature. Some models leverage autoencoders with quantized latents, such as Vector Quantized Variational Autoencoders (VQ-VAE, e.g., [12]) or Residual Vector Quantization (RVQ-VAE, e.g., [13]). Other models use continuous VAE latents to avoid quantization information loss [2, 3, 9]. Following [3], we utilize the 1D convolutional U-Net autoencoder from the Descript Audio Codec (DAC) [14] and replace the discrete RVQ bottleneck with a continuous VAE bottleneck. We also eliminate the decoder's output activation to prevent harmonic distortion, as noted in [9]. Additionally, we modify the training loss functions to suit the FOA components. Unlike stereo codecs, which leverage losses such as the Mid-Side Short Time Fourier Transform (MS-STFT) loss for proper stereo reconstruction, FOA does not need such losses due to the inherent orthogonality of the X, Y, and Z channels. The training loss function of our FOA codec is as follows:
$\mathcal{L} = \sum_{c \in \{W,X,Y,Z\}} \lambda_{\mathrm{MRSTFT}}\,\mathcal{L}_{\mathrm{MRSTFT}}^{(c)} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{FM}}\,\mathcal{L}_{\mathrm{FM}} \qquad (2)$

In this context, $\mathcal{L}_{\mathrm{MRSTFT}}^{(c)}$ represents the Multi-Resolution STFT loss for each channel $c$, collectively contributing to the generation loss, $\mathcal{L}_{\mathrm{KL}}$ denotes the KL divergence loss applied to the VAE bottleneck, $\mathcal{L}_{\mathrm{adv}}$ signifies the adversarial (discrimination) loss, and $\mathcal{L}_{\mathrm{FM}}$ refers to the feature matching loss. The $\lambda$ parameters are perceptually assigned weights that adjust the contribution of each loss to the total [3]. Our ambisonic codec transforms a 4-channel FOA signal of length $L$ into a $C$-channel latent of length $L/f$, with the overall compression factors reported in Table 1.
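As a minimal sketch (not the authors' implementation), the loss in equation (2) could be assembled as follows. The weight values are placeholders, the adversarial and feature-matching terms are assumed to be supplied by a separate discriminator, and auraloss provides the multi-resolution STFT loss.

```python
import torch
import auraloss

mrstft = auraloss.freq.MultiResolutionSTFTLoss()  # per-channel reconstruction loss

def codec_loss(foa_hat, foa, mu, logvar, adv_loss, fm_loss,
               w_rec=1.0, w_kl=1e-4, w_adv=0.1, w_fm=5.0):
    """Weighted FOA codec loss in the spirit of equation (2).
    foa_hat/foa: (B, 4, T) tensors; mu/logvar parameterize the VAE bottleneck;
    adv_loss/fm_loss come from the discriminator. Weights are illustrative."""
    # Multi-resolution STFT loss summed over the W, X, Y, Z channels.
    rec = sum(mrstft(foa_hat[:, c:c + 1], foa[:, c:c + 1]) for c in range(4))
    # KL divergence of the VAE bottleneck against a standard normal prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return w_rec * rec + w_kl * kl + w_adv * adv_loss + w_fm * fm_loss
```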
Table 1: Evaluation of the FOA codec models.

| FOA Codec Model | Downsampling | Latent Channels | STFT | MEL | L1 Azimuth (rad) | L1 Elevation (rad) | L1 Distance (m) | Angular (rad) |
|---|---|---|---|---|---|---|---|---|
| 1D-Conv U-Net (32X) | 1024 | 128 | 1.69 | 2.83 | 1.55 | 0.29 | 2.05 | 1.56 |
| 1D-Conv U-Net (64X) | 1024 | 64 | 1.70 | 2.62 | 1.57 | 0.24 | 1.82 | 1.57 |
| 1D-Conv U-Net (128X) | 2048 | 64 | 1.73 | 2.55 | 1.57 | 0.16 | 1.49 | 1.57 |
| 1D-Conv U-Net-α (128X) | 2048 | 64 | 2.33 | 4.13 | 1.43 | 0.15 | 1.53 | 1.51 |
| Measurement Error | — | — | — | — | 0.17 | 0.14 | 1.36 | 1.03 |
2.3 Spatial, Temporal and Environmental Conditioning
Many generative audio methods use text prompts for user control [12, 13]. Recent approaches have transitioned from pure language models to CLAP-based models, such as LAION CLAP [5] for audio and MuLan [15] for music, due to their enhanced performance [3, 2, 16]. These models are effective in aligning language and audio embeddings, enabling cross-modality training and inference [2]. Additionally, various conditioning types have been explored, including control over temporal order, pitch, and energy [17], music genre [12], and rhythm and chords [18].
ImmerseDiffusion offers spatial control in audio generation via two models: descriptive and parametric. The descriptive model utilizes text-based prompts to define the sound source, its spatial attributes, and the environmental acoustics. In contrast, the parametric model uses non-spatial text prompts only for sound source descriptions while providing numerical values for the spatial and environmental parameters. Moreover, in line with previous stereo diffusion models [3, 9], both models employ temporal conditioning, namely start time and total time. In Figure 1, (I) and (II) illustrate how descriptive and parametric conditioning incorporate location and environmental parameters through text or a combination of text and numerical values, along with temporal conditioning.
To generate conditioning spatial text embeddings for the descriptive model, we use the text encoder of the Embeddings for Language and Spatial Audio (ELSA) [4] model. Trained on spatial audio-text pairs using synthetic spatial variants of AudioCaps [19], FreeSound [20], and Clotho [21], ELSA demonstrated superior performance in spatial audio retrieval tasks [4]. ELSA's design for comprehending spatial qualities makes it ideal for text conditioning in our model. Thanks to the ELSA text conditioner and the aforementioned spatial training datasets, the proposed descriptive model accounts for spatial description details (e.g., right corner, far left, front top) and environment acoustics (e.g., medium-sized acoustically damped room, large highly reverberant hall) in spatial audio generation. In addition, two cross-attention conditioners are employed to incorporate the temporal conditions (i.e., start time and total time).
For the parametric model, we replace the ELSA text encoder with the LAION CLAP [5] text encoder, which is trained solely on non-spatial prompts, to validate parametric conditioning and avoid potential spatial bias from ELSA embeddings. Alongside the text embeddings, this model includes three numerical spatial conditioning parameters (azimuth, elevation, distance) and two environmental acoustic conditioning parameters (room size, 30 dB decay time). It also includes two temporal conditioning parameters, namely start time and total time, as in the descriptive model.
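The sketch below illustrates one way the parametric conditioning could be assembled: each numeric condition is projected into its own cross-attention token and concatenated with the projected text embedding. The class name, token dimensions, and the one-token-per-parameter layout are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ParametricConditioner(nn.Module):
    """Projects a text embedding plus numeric conditioning values
    (azimuth, elevation, distance, room size, decay time, start time, total time)
    into a sequence of cross-attention tokens. Dimensions are illustrative."""
    def __init__(self, text_dim=512, token_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, token_dim)
        # One learned projection per scalar condition -> one token each.
        self.param_proj = nn.ModuleList([nn.Linear(1, token_dim) for _ in range(7)])

    def forward(self, text_emb, params):
        # text_emb: (B, text_dim) from the (non-spatial) CLAP text encoder.
        # params:   (B, 7) numeric conditions, e.g. normalized to [-1, 1].
        tokens = [self.text_proj(text_emb).unsqueeze(1)]
        tokens += [proj(params[:, i:i + 1]).unsqueeze(1)
                   for i, proj in enumerate(self.param_proj)]
        return torch.cat(tokens, dim=1)  # (B, 8, token_dim) cross-attention tokens
```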
2.4 Diffusion Model
Several generative audio latent diffusion models utilize 1D convolutional U-Net structures for their diffusion blocks (e.g., [2, 3]). However, the Diffusion Transformer (DiT), initially proposed for image generation [22], has recently been adapted for music generation [9, 23]. In our work, we employ a DiT that features a cascade of blocks comprising self-attention, cross-attention, and gated MLP components with layer normalization and skip connections, similar to that used in [9], with structural adaptations for the FOA domain. We adjust the conditioning token dimensions and introduce a conditioning projection layer to meet dimensionality requirements. The model is trained using the v-objective approach [24] to minimize the Mean Squared Error (MSE) between the true $v_t$ and the predicted $\hat{v}_t$ at timestep $t$, conditioned on $c$ and given the latent variable $z_t$, as follows:
$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,\,z_t,\,c}\!\left[\,\lVert v_t - \hat{v}_\theta(z_t, t, c) \rVert_2^2\,\right] \qquad (3)$
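The following sketch shows a single v-objective training step corresponding to equation (3); the cosine alpha/sigma schedule and the model call signature are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def v_objective_step(model, z0, cond, t):
    """One v-objective training step (equation 3). z0: clean codec latents
    (B, C, L); cond: conditioning tokens; t: timesteps in [0, 1].
    A cosine alpha/sigma schedule is used here as an illustrative choice."""
    alpha = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1)
    sigma = torch.sin(0.5 * torch.pi * t).view(-1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = alpha * z0 + sigma * eps           # noised latent
    v_target = alpha * eps - sigma * z0      # v-objective target
    v_pred = model(z_t, t, cond)             # DiT prediction, conditioned on cond
    return torch.nn.functional.mse_loss(v_pred, v_target)
```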
2.5 Datasets
We trained the ambisonic codec and both the descriptive and parametric ImmerseDiffusion models using synthetically spatialized versions of the FreeSound [20], AudioCaps [19], and Clotho [21] datasets, as provided in [4]. To improve generalization and performance on speech audio, we also included a spatialized version of LibriSpeech [25] and the corresponding FOA noise dataset, detailed in [26]. Each dataset contains spatialized audio, original captions, spatial and environmental parameters, and spatially derived captions, as described in [4]. Dataset information is provided in Table 2.
Table 2: Training datasets.

| Dataset | Clips | Duration (hours) |
|---|---|---|
| Spatial Clotho [4, 21] | 8,546 | 55.0 |
| Spatial AudioCaps [4, 19] | 98,459 | 258.12 |
| Spatial Freesound [4, 20] | 783,033 | 4,425.53 |
| Spatial LibriSpeech [26, 25] | 132,377 | 658.6 |
| Spatial LibriSpeech Noise [26, 25] | 132,377 | 658.6 |
2.6 Training
We train our models with a cluster of A100-80GB GPUs. For the ambisonic audio codec, we use a batch size of 96 and an excerpt length of 1.48 seconds for 600K steps. The spatial codec generator and discriminator were trained with base learning rates of and respectively, using the AdamW optimizer, applying a decay factor and momentum parameters of 0.8 and 0.99.
Both descriptive and parametric models share the same training setup. We trained the Diffusion transformers with frozen ambisonic codec and conditioner weights with a batch size of 576 and a context window of 5.95 seconds for 500K steps. The base learning rate was set to with the AdamW optimizer, employing momentum values of 0.9 and 0.99. Spatial and non-spatial text conditioning for descriptive and parametric models utilized the official pre-trained ELSA [4] and LAION CLAP [5] models, respectively.
Table 3: Quality and spatial accuracy evaluation of the descriptive and parametric ImmerseDiffusion models.

| Diffusion Model | FAD_ELSA | KL_ELSA | CLAP_ELSA | L1 Azimuth (rad) | L1 Elevation (rad) | L1 Distance (m) | Spatial Angle (rad) |
|---|---|---|---|---|---|---|---|
| Descriptive | 0.28 | 0.11 | 0.64 | 1.08 | 0.43 | 1.88 | 1.35 |
| Parametric | 1.75 | 0.094 | 0.59 | 0.36 | 0.34 | 1.92 | 1.12 |
2.7 Evaluation Metrics
The quality of the FOA codec models is evaluated using STFT and MEL distances between the original and reconstructed FOA audio from the test set, computed with AuraLoss [27] using default settings, similar to [3, 9].
We evaluate the plausibility of the generated FOA audio clips by computing the FAD, which compares the statistical properties of the generated and reference audio embeddings. To compute the reference and generated FOA audio embeddings, we use the ELSA audio encoder, which supports FOA audio at 48 kHz. Additionally, we report the KL divergence using a pre-trained ELSA model.
The CLAP score is computed as the cosine similarity between the conditioning spatial text embeddings and the generated FOA audio embeddings, using the ELSA model. For the parametric model, KL divergence and CLAP score are computed using corresponding spatial captions from the test set, despite the model being trained on non-spatial captions and spatial parameters.
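As a small illustrative sketch, the CLAP score reduces to an average cosine similarity between paired embeddings; the ELSA encoder calls that would produce `audio_emb` and `text_emb` are omitted here and treated as placeholders.

```python
import torch
import torch.nn.functional as F

def clap_score(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between generated-audio and conditioning-text embeddings,
    averaged over the test set. Embeddings are assumed to come from the ELSA
    audio and text encoders (placeholder tensors of shape (N, D) here)."""
    return F.cosine_similarity(audio_emb, text_emb, dim=-1).mean()
```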
To measure the spatial accuracy of the FOA audio generated by ImmerseDiffusion, we report metrics based on the ground truth and estimated azimuth ($\theta$), elevation ($\phi$), and distance ($d$), using equation set (5). The intensity vectors $I_x$, $I_y$, and $I_z$ are derived by multiplying the omnidirectional channel $W$ with the corresponding directional channels, as outlined in (4).
$I_x = W \cdot X, \qquad I_y = W \cdot Y, \qquad I_z = W \cdot Z \qquad (4)$
$\hat{\theta} = \arctan\!\left(\frac{I_y}{I_x}\right), \qquad \hat{\phi} = \arctan\!\left(\frac{I_z}{\sqrt{I_x^2 + I_y^2}}\right), \qquad \hat{d} = \sqrt{I_x^2 + I_y^2 + I_z^2} \qquad (5)$
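A minimal sketch of intensity-vector direction-of-arrival estimation following equations (4) and (5) is shown below; time-averaging the intensity components over the clip is our simplifying assumption, and distance estimation is omitted.

```python
import numpy as np

def estimate_doa(foa: np.ndarray):
    """Estimate azimuth and elevation (radians) from a (4, T) FOA clip using
    time-averaged intensity vectors, per equations (4)-(5). arctan2 is used
    for quadrant-aware angles; averaging over time is an illustrative choice."""
    w, x, y, z = foa
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.sqrt(ix**2 + iy**2))
    return azimuth, elevation
```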
For all three parameters, we report the L1 norm between the ground truth and estimated values for azimuth, elevation, and distance across the test set. To calculate the L1 norm for azimuth, we use the circular difference (see equation (6)). This accounts for the circular nature of azimuth angles, ensuring we measure the shortest angular distance between the ground truth azimuth $\theta$ and the estimated azimuth $\hat{\theta}$. It avoids the issue with linear differences, where angles such as -3.13 and +3.13 rad point in nearly the same direction but appear far apart under a linear measurement.
$\Delta\theta = \min\!\left(\,|\theta - \hat{\theta}|,\; 2\pi - |\theta - \hat{\theta}|\,\right) \qquad (6)$
To quantify the difference between the ground truth and estimated direction of arrival (DoA), we also include their spatial angle [28] calculated using equations (7) and (8) as follows:
$h = \sin^2\!\left(\frac{\Delta\phi}{2}\right) + \cos\phi\,\cos\hat{\phi}\,\sin^2\!\left(\frac{\Delta\theta}{2}\right) \qquad (7)$

$\sigma = 2\arcsin\!\left(\sqrt{h}\right) \qquad (8)$

where $\Delta\phi$ and $\Delta\theta$ denote the linear and circular differences for elevation and azimuth, respectively, and $\phi$ and $\hat{\phi}$ represent the ground truth and estimated elevations.
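The sketch below implements the circular azimuth difference of equation (6) and the spatial angle of equations (7)-(8), assuming all angles are in radians.

```python
import numpy as np

def circular_diff(theta: float, theta_hat: float) -> float:
    """Shortest angular distance between two azimuths (equation 6)."""
    d = abs(theta - theta_hat)
    return min(d, 2 * np.pi - d)

def spatial_angle(phi: float, phi_hat: float, theta: float, theta_hat: float) -> float:
    """Great-circle angle between ground-truth and estimated directions
    (equations 7-8), from the linear elevation difference and the circular
    azimuth difference."""
    d_phi = phi - phi_hat                        # linear elevation difference
    d_theta = circular_diff(theta, theta_hat)    # circular azimuth difference
    h = np.sin(d_phi / 2) ** 2 + np.cos(phi) * np.cos(phi_hat) * np.sin(d_theta / 2) ** 2
    return 2 * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))
```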
3 Results and Discussions
We summarize the evaluation results of the ambisonic codec models in Table 1. All models were trained with identical hyperparameters. The 1D-Conv U-Net (32X) is based on the baseline [3] structure, with the decoder output activation removed. The 1D-Conv U-Net (64X) and 1D-Conv U-Net (128X) follow the same structure with 2X and 4X higher compression ratios, respectively. We also include 1D-Conv U-Net-α (128X), which builds on the [9] baseline, to evaluate the effect of a learned snake activation frequency α. This model maintains the same compression ratio as the 1D-Conv U-Net (128X), which is twice that of the original model [9].
Comparing the first three models shows that higher compression achieves similar FOA reconstruction quality, indicating that increased compression is suitable for this task. Notably, as compression increases, MEL performance improves, likely because the model focuses more on reconstructing lower frequencies, which are crucial for the MEL distance. Additionally, comparing 1D-Conv U-Net (128X) with 1D-Conv U-Net-α (128X) reveals that the model without the learned α outperforms the one with it. The inconsistent efficacy of the learned α has also been noted in previous work on music codecs [9].
Table 3 shows the quality and spatial accuracy evaluations for both the descriptive and parametric ImmerseDiffusion models. The descriptive model significantly outperforms the parametric model in terms of the FAD_ELSA score. However, this advantage may be influenced by the use of ELSA embeddings as a conditioner in the descriptive model, which the parametric model lacks. In terms of the KL divergence and CLAP scores, on the other hand, the parametric model performs similarly to the descriptive model, slightly outperforming it on one and underperforming on the other. This comparable performance underlines the effectiveness of parametric training: even though the parametric model is trained on non-spatial prompts, it is capable of producing spatially plausible audio content thanks to the additional spatial parameterization.
We report the spatial accuracy of both models using the metrics proposed in Section 2.7. Given that localization estimation inherently includes some error, we also account for the measurement error, i.e., the lower bound on the error a model can achieve, obtained by comparing the estimated spatial parameters with the ground truth labels. Both models demonstrate high localization accuracy when this error is considered. However, the parametric model significantly outperforms the descriptive model across all Direction of Arrival (DoA) metrics. This superior DoA performance is expected, as the parametric model uses precise absolute location values, offering more accurate spatial conditioning than the broader ranges implied by the descriptive labels. In contrast, for distance estimation, both models perform similarly.
It is worth mentioning that the signal processing methods used for spatial adherence evaluation offer two key advantages: simplicity, which is a crucial criterion for evaluation metrics, and generality, since they neither require pre-trained models nor risk training-data biases, unlike deep learning methods. However, given the higher lower-bound errors in some metrics (e.g., distance or spatial angle differences) shown in Table 3, future extensions of this work could leverage deep learning-based methods such as [29] to improve measurement accuracy.
4 Conclusion
This paper introduced ImmerseDiffusion, an end-to-end model for generating 3D immersive soundscapes based on spatial, temporal, and environmental conditions. Using a spatial audio codec and a latent diffusion model, ImmerseDiffusion generates first-order ambisonics (FOA) audio. We presented two models: a descriptive model that creates spatial audio from text prompts detailing sound sources and contexts, and a parametric model that uses text prompts with explicit spatial and environmental parameters. Experimental evaluations demonstrated that ImmerseDiffusion produces spatially accurate audio according to user-defined conditions.
References
- [1] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley, “Audioldm: Text-to-audio generation with latent diffusion models,” Proc. 40th PMLR ICML, 2023.
- [2] Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
- [3] Zach Evans, CJ Carr, Josiah Taylor, Scott H Hawley, and Jordi Pons, “Fast timing-conditioned latent audio diffusion,” in Proc. 41st PMLR ICML, 235, 2024, pp. 12652–12665.
- [4] Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, and Miguel Sarabia, “Learning Spatially-Aware Language and Audio Embeddings,” arXiv preprint, 2024.
- [5] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. IEEE ICASSP, 2023, pp. 1–5.
- [6] Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [7] Peike Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, and Alex Wang, “Jen-1: Text-guided universal music generation with omnidirectional diffusion models,” arXiv preprint arXiv:2308.04729, 2023.
- [8] Flavio Schneider, Ojasv Kamal, Zhijing Jin, and Bernhard Schölkopf, “Moûsai: Text-to-music generation with long-context latent diffusion,” arXiv preprint arXiv:2301.11757, 2023.
- [9] Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons, “Long-form music generation with latent diffusion,” arXiv preprint arXiv:2404.10301, 2024.
- [10] Despoina Pavlidi, Symeon Delikaris-Manias, Ville Pulkki, and Athanasios Mouchtaris, “3d localization of multiple sound sources with intensity vector estimates in single source zones,” in Proc. IEEE EUSIPCO, 2015, pp. 1556–1560.
- [11] David G Malham and Anthony Myatt, “3-d sound spatialization using ambisonic techniques,” Computer music journal, vol. 19, no. 4, pp. 58–70, 1995.
- [12] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
- [13] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al., “Audiolm: a language modeling approach to audio generation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 31, pp. 2523–2533, 2023.
- [14] Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [15] Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, and Daniel PW Ellis, “Mulan: A joint embedding of music audio and natural language,” arXiv preprint arXiv:2208.12415, 2022.
- [16] Ge Zhu, Yutong Wen, Marc-André Carbonneau, and Zhiyao Duan, “Edmsound: Spectrogram based diffusion models for efficient and high-quality audio synthesis,” arXiv preprint arXiv:2311.08667, 2023.
- [17] Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, and Xiangdong Wang, “Audio generation with multiple conditional diffusion model,” in Proc. AAAI Conference on Artificial Intelligence, 2024, vol. 38, pp. 18153–18161.
- [18] Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, and Yi-Hsuan Yang, “Musicongen: Rhythm and chord control for transformer-based text-to-music generation,” arXiv preprint arXiv:2407.15060, 2024.
- [19] Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim, “Audiocaps: Generating captions for audios in the wild,” in Proc. NAACL, 2019, 2019, pp. 119–132.
- [20] Frederic Font, Gerard Roma, and Xavier Serra, “Freesound technical demo,” in Proc. 21st ACM Multimedia, 2013, pp. 411–412.
- [21] Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen, “Clotho: An audio captioning dataset,” in Proc. IEEE ICASSP, 2020, pp. 736–740.
- [22] William Peebles and Saining Xie, “Scalable diffusion models with transformers,” in Proc. IEEE/CVF ICCV, 2023, pp. 4195–4205.
- [23] Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, and Tom Nickson, “Controllable music production with diffusion models and guidance gradients,” arXiv preprint arXiv:2311.00613, 2023.
- [24] Tim Salimans and Jonathan Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv preprint arXiv:2202.00512, 2022.
- [25] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. IEEE ICASSP, 2015, pp. 5206–5210.
- [26] Miguel Sarabia, Elena Menyaylenko, Alessandro Toso, Skyler Seto, Zakaria Aldeneh, Shadi Pirhosseinloo, Luca Zappella, Barry-John Theobald, Nicholas Apostoloff, and Jonathan Sheaffer, “Spatial Librispeech: An augmented dataset for spatial audio learning,” in Proc. ISCA Interspeech, 2023, pp. 3724–3728.
- [27] Christian J Steinmetz and Joshua D Reiss, “auraloss: Audio focused loss functions in PyTorch,” in Digital Music Research Network One-day Workshop (DMRN+15), 2020.
- [28] Glen Van Brummelen, Heavenly mathematics: The forgotten art of spherical trigonometry, Princeton University Press, 2012.
- [29] Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, 2018.