Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array
*This work was conducted during Yue Qiao's internship at Tencent AI Lab, Bellevue, USA.
Abstract
Spatial audio formats like Ambisonics are playback device layout-agnostic and well-suited for applications such as teleconferencing and virtual reality. Conventional Ambisonic encoding methods often rely on spherical microphone arrays for efficient sound field capture, which limits their flexibility in practical scenarios. We propose a deep learning (DL)-based approach, leveraging a two-stage network architecture for encoding circular microphone array signals into second-order Ambisonics (SOA) in multi-speaker environments. In addition, we introduce: (i) a novel loss function based on spatial power maps to regularize inter-channel correlations of the Ambisonic signals, and (ii) a channel permutation technique to resolve the ambiguity of encoding vertical information using a horizontal circular array. Evaluation on simulated speech and noise datasets shows that our approach consistently outperforms traditional signal processing (SP) and DL-based methods, providing significantly better timbral and spatial quality and higher source localization accuracy. Binaural audio demos with visualizations are available at https://bridgoon97.github.io/NeuralAmbisonicEncoding/.
Index Terms:
Ambisonic Encoding, Spatial Audio, Deep Learning, Microphone Array Processing.
I Introduction
Ambisonics [1] is a widely used spatial audio format for capturing, synthesizing, and rendering sound fields. It uses spherical harmonics (SH) [2] as basis functions to decompose a sound field into Ambisonic channels of different orders, each of which encodes distinctive spatial information. In general, sound fields captured by spherical microphone arrays are encoded as Ambisonic signals through linear transformations. These encoded signals can be used to reproduce the sound field accurately, either through loudspeakers arranged in specific configurations or through binaural rendering for headphones. To ensure high-fidelity spatial audio reproduction, it is essential to minimize errors introduced during the Ambisonic encoding process [3].
The effectiveness of Ambisonic encoding is fundamentally limited by the microphone array used in practice. Ideally, numerous microphones uniformly distributed on a sphere are required to obtain Ambisonic signals with high spatial resolution. However, practical microphone arrays often have a limited number of capsules distributed irregularly, depending on the application scenario (e.g., wearable devices [4]). This can lead to issues such as spatial aliasing [5] and poor spatial coverage of the captured sound field, ultimately degrading the sound field fidelity after encoding. To mitigate the adverse effects of practical microphone arrays, existing studies have explored different Ambisonic encoding approaches. These include traditional least squares (LS)-based optimization using the steering vectors of microphone arrays [6, 7, 8], adding constraints to regularize the orthogonality of Ambisonic channels [9], and parameterizing the sound field in terms of source directions and diffuseness [4].
Recently, deep learning (DL) has been applied to Ambisonic encoding [10, 11] and other tasks related to spatial audio, such as generating spatial audio from mono microphone recordings [12, 13, 14], binaural rendering from Ambisonics [15], upsampling Ambisonics to higher orders [16], and estimating virtual microphone signals from existing arrays [17]. The spatial audio generation methods leverage visual data to learn the spatial distribution of sound sources, facilitating the Ambisonic encoding process. However, encoding from only microphone signals is challenging, as it requires implicit extraction of spatial cues from the amplitude and phase information captured by the microphone array(s).
In [10, 11], convolution-based DNNs are adopted to learn the transformation from the microphone signals to Ambisonic signals. In [10], the DNN consists of convolutional layers for different frequency bands, with an additional sparsity-promoting norm penalty in the loss function to enforce network sparsity. In [11], the DNN is adapted from the U-net architecture [18], with channel-wise coherence and energy-based regularization introduced into the loss function. Although these methods have shown similar or better encoding performance compared to traditional LS-based methods under regular microphone array geometries, they may not generalize well to other array layouts in practice, as the network architectures and loss functions used are tailored neither to the encoding problem nor to the intrinsic properties of Ambisonic signals (e.g., orthogonality).

In this paper, we aim to further improve the performance of DL-based Ambisonic encoding under more challenging conditions, for multi-speaker scenarios such as teleconferencing (Fig. 1). Specifically, we choose a circular microphone array situated on the horizontal plane to encode full-3D Ambisonic signals. To guide the DNN in learning the spatial structures of the sound field, we propose a two-stage network architecture that mimics the processes of plane wave decomposition and Ambisonics synthesis. Additionally, we introduce a spatial power map-based loss function to regularize the inter-channel correlation of Ambisonic signals. To address the ambiguity of encoding vertical sound field information using horizontal microphone arrays, we introduce a channel permutation process that discriminates the upper and lower half-space at model inference. We evaluate our proposed method against existing SP- and DL-based encoding methods for: (i) timbral audio quality, (ii) spatial audio quality, and (iii) source localization accuracy.
II Problem Formulation
In the general problem of Ambisonic encoding, we consider a model consisting of $M$ omnidirectional microphones placed in the free field at positions $\mathbf{r}_m$, $m = 1, \dots, M$, and $L$ plane waves from directions $\Omega_l$, $l = 1, \dots, L$. The signals received by the microphones at frequency $f$, $\mathbf{p}(f)$, are related to the signals (or source strengths) associated with the plane waves, $\mathbf{s}(f)$, and the microphone noise signals, $\mathbf{n}(f)$, as

$$\mathbf{p}(f) = \mathbf{A}(f)\,\mathbf{s}(f) + \mathbf{n}(f), \tag{1}$$

where $\mathbf{p}(f) \in \mathbb{C}^{M}$, $\mathbf{s}(f) \in \mathbb{C}^{L}$, $\mathbf{n}(f) \in \mathbb{C}^{M}$, and $\mathbf{A}(f) \in \mathbb{C}^{M \times L}$ denotes the array steering matrix, with each element $A_{ml}(f)$ referring to the transfer function from the $l$-th plane wave to the $m$-th microphone. These plane waves are represented in the Ambisonic domain by utilizing the spherical harmonic (SH) functions as

$$\mathbf{b}(f) = \mathbf{Y}\,\mathbf{s}(f), \tag{2}$$

where $\mathbf{b}(f) \in \mathbb{C}^{(N+1)^2}$ are the encoded Ambisonic signals of order $N$, $\mathbf{Y} \in \mathbb{R}^{(N+1)^2 \times L}$ is the SH matrix for the given plane waves, with columns $\mathbf{y}(\Omega_l) = [Y_{00}(\Omega_l), Y_{1(-1)}(\Omega_l), \dots, Y_{NN}(\Omega_l)]^{\mathsf{T}}$, where $Y_{nm}$ is the SH function of order $n$ and degree $m$ corresponding to the angle $\Omega_l$.
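To make Eq. (2) concrete, the following sketch encodes plane-wave source signals into ideal SOA. It assumes ACN channel ordering and SN3D normalization, which the paper does not specify, and uses the standard closed-form real SH expressions up to order 2.

```python
import numpy as np

def real_sh_soa(azimuth, elevation):
    """Real spherical harmonics up to order 2 (ACN order, SN3D normalization)
    for a direction given in radians. Returns a length-9 vector."""
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    return np.array([
        1.0,                                 # W  (0, 0)
        y,                                   # Y  (1,-1)
        z,                                   # Z  (1, 0)
        x,                                   # X  (1, 1)
        np.sqrt(3.0) * x * y,                # V  (2,-2)
        np.sqrt(3.0) * y * z,                # T  (2,-1)
        0.5 * (3.0 * z ** 2 - 1.0),          # R  (2, 0)
        np.sqrt(3.0) * x * z,                # S  (2, 1)
        0.5 * np.sqrt(3.0) * (x ** 2 - y ** 2),  # U  (2, 2)
    ])

def encode_plane_waves(source_signals, directions):
    """Eq. (2): b = Y s. source_signals: (L, T); directions: list of
    (azimuth, elevation) tuples. Returns (9, T) SOA signals."""
    Y = np.stack([real_sh_soa(az, el) for az, el in directions], axis=1)  # (9, L)
    return Y @ source_signals

# Example: two speakers at +/-45 degrees azimuth on the horizontal plane.
sigs = np.random.randn(2, 16000)
b = encode_plane_waves(sigs, [(np.pi / 4, 0.0), (-np.pi / 4, 0.0)])
print(b.shape)  # (9, 16000)
```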
This study aims to develop a DL-based end-to-end system $\mathcal{F}(\cdot)$ that transforms the microphone signals to an approximation of the ideal Ambisonic signals for arbitrary source directions:

$$\hat{\mathbf{b}}(f) = \mathcal{F}\big(\mathbf{p}(f)\big) \approx \mathbf{b}(f). \tag{3}$$
III Proposed Neural Ambisonic Encoding DNN

The overall architecture of the proposed Ambisonic encoder DNN is depicted in Fig. 2, which consists of two stages jointly trained in an end-to-end approach: (i) virtual loudspeaker signal estimation and (ii) Ambisonics generation. The DNN input is the stacked $M$-channel microphone signals. After the short-time Fourier transform (STFT), the time-frequency domain signals, $\mathbf{P}(t, f)$, are used to extract two types of audio features: the directional feature [19], which is the cosine difference between the inter-channel phase difference and the target-dependent phase difference, and the spatial covariance matrix (SCM) of the microphone signals.
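As an illustration of these two input features, here is a minimal numpy sketch under a far-field plane-wave model; the candidate direction grid, microphone-pair selection, and tensor layout are assumptions for illustration rather than the paper's exact implementation.

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

def directional_feature(P, mic_pos, pairs, dirs_xyz, freqs):
    """Cosine of (observed inter-channel phase difference - theoretical phase
    difference) per candidate direction, following the directional-feature idea.
    P: (M, T, F) complex STFT; mic_pos: (M, 3); pairs: list of (i, j);
    dirs_xyz: (D, 3) unit vectors; freqs: (F,) in Hz. Returns (D, n_pairs, T, F)."""
    feats = []
    for d in dirs_xyz:
        per_pair = []
        for i, j in pairs:
            ipd = np.angle(P[i] * np.conj(P[j]))       # observed phase difference
            tau = (mic_pos[i] - mic_pos[j]) @ d / C    # relative delay for direction d
            tpd = 2.0 * np.pi * freqs[None, :] * tau   # target-dependent phase difference
            per_pair.append(np.cos(ipd - tpd))
        feats.append(np.stack(per_pair))
    return np.stack(feats)

def spatial_covariance(P):
    """Per-bin spatial covariance matrix, flattened into real/imag channels.
    P: (M, T, F) -> (2*M*M, T, F); for an 8-mic array this gives 128 channels."""
    scm = np.einsum('mtf,ntf->mntf', P, np.conj(P))
    M, _, T, F = scm.shape
    scm = scm.reshape(M * M, T, F)
    return np.concatenate([scm.real, scm.imag], axis=0)
```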
III-A Stage I: Virtual Loudspeaker Signal Estimation
In this stage, the DNN uses the extracted features to predict complex-valued ratio filters (cRFs) for estimating the virtual loudspeaker signals corresponding to sound sources captured by the microphone array. This process is conceptually similar to plane wave decomposition of the captured sound field [20]. The extracted features are first processed through a one-layer long short-term memory (LSTM) module, a multi-head self-attention (MHSA) [21] module, and a linear layer for feature aggregation. The aggregated features are then passed through a series of four 1-D time-domain convolutional layers (T-Conv) to estimate the cRF masks, $\mathbf{H}_j(t, f, \tau_t, \tau_f)$, for generating the virtual loudspeaker signals, $V_j(t, f)$:

$$V_j(t, f) = \sum_{\tau_t = -K}^{K} \sum_{\tau_f = -K}^{K} \mathbf{H}_j(t, f, \tau_t, \tau_f)^{\mathsf{H}}\, \mathbf{P}(t + \tau_t, f + \tau_f), \tag{4}$$

where $K$ represents the taps of the cRFs, i.e., each filter spans $(2K+1) \times (2K+1)$ neighboring time-frequency bins.
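A rough sketch of how such a cRF could be applied, assuming the symmetric $(2K+1) \times (2K+1)$ time-frequency neighborhood written in Eq. (4); the paper's exact filter support and tensor layout may differ.

```python
import numpy as np

def apply_crf(P, H, K=1):
    """Apply complex ratio filters to multichannel STFT signals.
    P: (M, T, F) complex microphone STFT.
    H: (J, M, 2K+1, 2K+1, T, F) complex filters predicted by the DNN,
       one per virtual loudspeaker channel j.
    Returns V: (J, T, F) virtual loudspeaker signals."""
    J, M, _, _, T, F = H.shape
    Ppad = np.pad(P, ((0, 0), (K, K), (K, K)))           # zero-pad time/frequency edges
    V = np.zeros((J, T, F), dtype=complex)
    for dt in range(2 * K + 1):
        for df in range(2 * K + 1):
            shifted = Ppad[:, dt:dt + T, df:df + F]      # (M, T, F) shifted copy of P
            V += np.einsum('jmtf,mtf->jtf', np.conj(H[:, :, dt, df]), shifted)
    return V
```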
III-B Stage II: Ambisonics Generation
In this stage, the DNN uses the estimated virtual loudspeaker signals and a concatenated single-channel microphone signal (see Fig. 2) to estimate a nonlinear spatial transformation for generating Ambisonic signals. Incorporating the microphone signal helps the DNN accurately generate the zeroth-order (omnidirectional) Ambisonic signal, similar to how residual connections in deep learning preserve key information and stabilize training. The concatenated signals are passed through linear layers for dimension reduction and gated recurrent unit (GRU) layers for feature aggregation, followed by four narrow-band blocks from SpatialNet [22] to estimate another set of cRFs. Each block includes an MHSA module and a time-convolutional feedforward module (T-ConvFFN), which perform spatial clustering and temporal smoothing/filtering, respectively. The MHSA also helps adapt the DNN weights to each loudspeaker's spatial position. For more details on the narrow-band block, see [22]. The Ambisonic signals, $\hat{B}_i(t, f)$, are generated by filtering the concatenated signals, $\tilde{\mathbf{V}}(t, f)$, with the estimated cRFs, $\mathbf{G}_i(t, f, \tau_t, \tau_f)$:

$$\hat{B}_i(t, f) = \sum_{\tau_t = -K}^{K} \sum_{\tau_f = -K}^{K} \mathbf{G}_i(t, f, \tau_t, \tau_f)^{\mathsf{H}}\, \tilde{\mathbf{V}}(t + \tau_t, f + \tau_f). \tag{5}$$

Next, energy normalization is employed on the estimated Ambisonic signals to ensure the energy of the estimated zeroth-order Ambisonics matches that of the microphone signal. Finally, $\hat{\mathbf{B}}(t, f)$ is transformed to the time domain, $\hat{\mathbf{b}}(t)$, with the inverse-STFT operation.
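A minimal sketch of the energy normalization step, assuming a single global gain that matches the zeroth-order channel energy to that of the reference microphone; the paper does not state whether the scaling is global or per-frame.

```python
import numpy as np

def energy_normalize(B_hat, p_ref, eps=1e-8):
    """Scale the estimated Ambisonics so that the zeroth-order channel energy
    matches that of the reference microphone signal.
    B_hat: (9, T, F) estimated SOA STFT; p_ref: (T, F) microphone STFT."""
    e_w = np.sum(np.abs(B_hat[0]) ** 2)
    e_ref = np.sum(np.abs(p_ref) ** 2)
    gain = np.sqrt(e_ref / (e_w + eps))
    return B_hat * gain
```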
III-C Loss functions
We use four loss functions to train the proposed DNN. The first three, the magnitude norm loss, the scale-invariant signal-to-noise ratio (SI-SNR, [23]) loss, and the coherence loss [11], aim to minimize the channel-wise errors between the estimated and ground truth Ambisonics. They are defined as

$$\mathcal{L}_{\mathrm{mag}} = \frac{1}{I} \sum_{i=1}^{I} \big\| |\hat{B}_i(t, f)| - |B_i(t, f)| \big\|_1, \tag{6}$$

$$\mathcal{L}_{\mathrm{SI\text{-}SNR}} = -\frac{1}{I} \sum_{i=1}^{I} 10 \log_{10} \frac{\|\alpha_i\, b_i\|^2}{\|\hat{b}_i - \alpha_i\, b_i\|^2}, \quad \alpha_i = \frac{\hat{b}_i^{\mathsf{T}} b_i}{\|b_i\|^2}, \tag{7}$$

where $\hat{b}_i$ and $b_i$ are the estimated and ground truth time-domain Ambisonic signals of channel $i$ ($I = (N+1)^2$ channels in total), $\hat{B}_i$ and $B_i$ are their STFT-domain counterparts, and the coherence loss is $\mathcal{L}_{\mathrm{coh}} = \frac{1}{I} \sum_{i=1}^{I} (1 - C_i)$, with

$$C_i = \frac{\big|\sum_{t, f} \hat{B}_i(t, f)\, B_i^{*}(t, f)\big|}{\sqrt{\sum_{t, f} |\hat{B}_i(t, f)|^2 \sum_{t, f} |B_i(t, f)|^2}}. \tag{8}$$
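For reference, a numpy sketch of these channel-wise losses; the exact norm and averaging conventions used in the paper may differ slightly.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) between time-domain signals of one channel."""
    est, ref = est - est.mean(), ref - ref.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)   # optimal scaling of the reference
    target = alpha * ref
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(est - target, est - target) + eps))

def coherence(Est, Ref, eps=1e-8):
    """Magnitude coherence between the STFTs of one channel (1 = identical up to scale)."""
    num = np.abs(np.sum(Est * np.conj(Ref)))
    den = np.sqrt(np.sum(np.abs(Est) ** 2) * np.sum(np.abs(Ref) ** 2)) + eps
    return num / den

def channelwise_losses(b_hat, b, B_hat, B):
    """b_hat, b: (9, T) time domain; B_hat, B: (9, T, F) STFT domain."""
    l_mag = np.mean(np.abs(np.abs(B_hat) - np.abs(B)))                        # magnitude loss
    l_sisnr = -np.mean([si_snr(b_hat[i], b[i]) for i in range(b.shape[0])])   # SI-SNR loss
    l_coh = np.mean([1.0 - coherence(B_hat[i], B[i]) for i in range(B.shape[0])])
    return l_mag, l_sisnr, l_coh
```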
The fourth loss function is based on the spatial power map [24] derived from Ambisonic signals. The power map, $m_d(t, f)$, is computed using fixed beamformer weights, $\mathbf{w}_d$, corresponding to $D$ directions, $\Omega_d$:

$$m_d(t, f) = \big| \mathbf{w}_d^{\mathsf{H}}\, \mathbf{B}(t, f) \big|^2, \quad d = 1, \dots, D. \tag{9}$$

In this work, we choose $\mathbf{w}_d$ to be the maximum directivity index (max-DI) beamformer weights [25] for $D = 1296$ directions (spherical design of degree 50 [26]). The power map loss minimizes the difference between the estimated power map, $\hat{\mathbf{m}}$, and the ground truth power map, $\mathbf{m}$:

$$\mathcal{L}_{\mathrm{map}} = \lambda_1 \big\| \hat{\mathbf{m}} - \mathbf{m} \big\|_2^2 + \lambda_2\, D_{\mathrm{KL}}\big(\mathbf{m} \,\|\, \hat{\mathbf{m}}\big), \tag{10}$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence between the power map distributions, and $\lambda_1$ and $\lambda_2$ are weighting parameters. The loss definition is inspired by [27], where audio localization maps are considered. The power map loss helps regularize the cross-channel correlation and therefore enhances the prediction of spatial information. During training, these four loss functions are equally weighted and combined.
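A sketch of the power map computation (Eq. (9)) and a power map loss combining an MSE term and a KL term as in Eq. (10); the max-DI beamformer weights are assumed to be precomputed (and real-valued), and the exact weighting and normalization used in training may differ.

```python
import numpy as np

def power_map(B, W):
    """Spatial power map from Ambisonic STFT signals.
    B: (9, T, F) SOA STFT; W: (D, 9) fixed real-valued beamformer weights (e.g., max-DI).
    Returns (D, T, F): beam power per direction and time-frequency bin."""
    beams = np.einsum('dc,ctf->dtf', W, B)   # steer one beam towards each direction
    return np.abs(beams) ** 2

def power_map_loss(m_hat, m, w_mse=1.0, w_kl=1.0, eps=1e-8):
    """MSE plus KL divergence between the power maps, with the maps normalized
    to distributions over direction for the KL term."""
    mse = np.mean((m_hat - m) ** 2)
    p = m / (m.sum(axis=0, keepdims=True) + eps)        # ground truth distribution
    q = m_hat / (m_hat.sum(axis=0, keepdims=True) + eps)
    kl = np.mean(np.sum(p * np.log((p + eps) / (q + eps)), axis=0))
    return w_mse * mse + w_kl * kl
```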
III-D Vertical channel permutation
When using a horizontal microphone array to encode Ambisonics, the vertical sound field components cannot be accurately encoded, because sound sources "mirrored" about the horizontal plane cannot be distinguished. However, if all sound sources are located in the same half-space and it is known which one, this ambiguity can be resolved by permuting the vertical Ambisonic channels. The assumption is often valid in scenarios like teleconferencing, where most speakers are either above or below the microphone array. Specifically, we permute the vertical channels, i.e., those whose SH functions are odd in elevation (for second-order Ambisonics (SOA), $Y_{1,0}$, $Y_{2,-1}$, and $Y_{2,1}$), by multiplying them with $-1$ (i.e., inverting the phase, equivalent to vertically flipping the sound sources) at model inference when the sources are in the lower half-space. This ensures a one-to-one mapping from the microphone input to the estimated sound field.
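A minimal sketch of the permutation, assuming ACN channel ordering so that the elevation-odd SOA channels are indices 2, 5, and 7.

```python
import numpy as np

# SOA channels (ACN ordering) whose SH functions are odd in the vertical
# coordinate z: Z (ACN 2), T (ACN 5), S (ACN 7).
VERTICAL_ODD_CHANNELS = [2, 5, 7]

def permute_vertical(b_hat, sources_in_lower_half):
    """Flip the sign of the z-odd channels when the sources are known to lie in
    the lower half-space, mirroring the encoded sound field vertically.
    b_hat: (9, T) estimated SOA signals."""
    if sources_in_lower_half:
        b_hat = b_hat.copy()
        b_hat[VERTICAL_ODD_CHANNELS] *= -1.0
    return b_hat
```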
IV Experimental Setup
We use an open microphone array with eight omnidirectional microphone capsules arranged in a circle with a 5-cm radius on the horizontal plane for capturing the sound field. The target Ambisonics format is SOA (9 channels). The STFT/iSTFT used in the model has an FFT size of 512 and a hop size of 256, assuming a 16 kHz sampling frequency. The directional feature is computed with 128 uniformly sampled directions and 6 microphone pairs, and the SCM feature contains 128 channels (real and imaginary parts of the 64-channel SCM concatenated). In the first stage, the MHSA module has 8 heads, and the T-Conv modules use kernels with size 3, stride 1, dilation 1, and padding 1 on both sides. The number of virtual loudspeaker channels is set to 50. In the second stage, the narrow-band block parameters follow the "SpatialNet-small" preset from [22]. The cRFs in both stages use the same number of filter taps. The channel dimension of all intermediate layers is 256, resulting in a total model size of 8.1 M parameters.
We simulate an audio dataset with speech scenarios involving 1, 2, and 3 speakers for training and evaluation. The speakers are 1 m from the array center, with directions randomly sampled on a sphere; the elevation range is restricted, as speakers at extreme elevations are uncommon in practical applications. In multi-speaker scenarios, the speakers are in the same half-space, and the upper/lower information is known for the permutation operation. Speech signals are simulated by convolving dry signals with room impulse responses (RIRs) for each microphone using the FRAM-RIR method [28]. The room size and RT60 are randomized for each simulated room. The ground truth Ambisonic channels are treated as co-located microphones at the array center, with directivity patterns equivalent to the SH functions. For each speaker, the signals corresponding to the direct sound and secondary reflections are filtered individually based on the source locations and summed together. The speech dataset includes 158 hours for training, 8 hours for validation, and 2 hours for evaluation. A noise dataset containing single white-noise sources evenly distributed in space is also simulated to evaluate source localization accuracy in the estimated Ambisonic signals.
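To illustrate how the ground-truth Ambisonics are assembled, here is a sketch that weights each simulated arrival (direct sound or reflection) by the real SH values of its direction; the per-arrival (delay, gain, direction) representation is a hypothetical interface for illustration, not FRAM-RIR's actual API.

```python
import numpy as np

FS = 16000  # sampling rate used in the experiments

def ambisonic_rir(reflections, length, sh_fn):
    """Assemble a ground-truth SOA room impulse response from per-arrival
    parameters (delay_s, gain, azimuth, elevation). sh_fn maps a direction to
    its 9 real SH values (e.g., real_sh_soa from the earlier sketch), i.e.,
    the SH-shaped directivities of co-located virtual microphones at the
    array center."""
    h = np.zeros((9, length))
    for delay, gain, az, el in reflections:
        n = int(round(delay * FS))
        if n < length:
            h[:, n] += gain * sh_fn(az, el)
    return h

# Per-speaker ground truth: convolve the dry speech with each SOA channel of h
# and sum the contributions of all speakers.
```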
V Evaluation
Table I: Audio quality (timbral: temporal SI-SNR and ENV, spectral LSD and Coherence; spatial: SH-domain RMSE_MAP, binaural RMSE_ILD and RMSE_IC) and source localization accuracy (ERR_AZI, ERR_ELEV) of the baseline and proposed methods.

| Methods | SI-SNR (dB) | ENV | LSD (dB) | Coherence | RMSE_MAP | RMSE_ILD (dB) | RMSE_IC | ERR_AZI (°) | ERR_ELEV (°) |
|---|---|---|---|---|---|---|---|---|---|
| DSB + Ambi. Synth. | -2.842 | 0.077 | 1.209 | 0.376 | 0.270 | 2.152 | 0.563 | 6.651 | 52.601 |
| LS-based Filtering I | 2.677 | 0.063 | 0.776 | 0.434 | 0.169 | 1.908 | 0.585 | 2.857 | 40.307 |
| LS-based Filtering II | 3.141 | 0.063 | 0.776 | 0.434 | 0.141 | 1.904 | 0.585 | 2.866 | 12.259 |
| U-net-based (8.4M) [11] | 3.283 | 0.060 | 0.776 | 0.534 | 0.139 | 1.738 | 0.258 | 20.847 | 30.162 |
| SpatialNet (8.3M) [22] | 6.351 | 0.051 | 0.503 | 0.626 | 0.122 | 1.302 | 0.316 | 16.845 | 32.361 |
| Proposed DL-system (8.1M) | 6.353 | 0.042 | 0.350 | 0.610 | 0.072 | 1.043 | 0.140 | 1.555 | 10.204 |
| (w/o PowerMap loss) | 6.781 | 0.045 | 0.470 | 0.620 | 0.118 | 1.469 | 0.231 | 2.515 | 13.814 |
| (w/o Perm. & PowerMap loss) | 6.345 | 0.040 | 0.314 | 0.616 | 0.138 | 1.200 | 0.132 | 18.752 | 32.215 |

V-A Metrics
We evaluate two aspects of Ambisonic encoding performance using the speech and noise datasets: audio quality (timbral and spatial) and source localization accuracy. For timbral audio quality, we choose SI-SNR (Eq. 7) and the envelope distance (ENV, [12]) between the estimated ($\hat{b}_i$) and ground truth ($b_i$) Ambisonic signals after the Hilbert transform to measure temporal quality. The log spectral distance (LSD, [30]) and channel-wise coherence (Eq. 8) between $\hat{B}_i$ and $B_i$ are used for spectral quality. For spatial quality, we calculate the root mean square error (RMSE) between the power maps $\hat{\mathbf{m}}$ and $\mathbf{m}$ in the SH domain; in the binaural domain, following [4], we decode the Ambisonics into binaural signals using head-related transfer functions (HRTFs) from the KEMAR dataset [31], then compute the RMSE for the inter-aural level difference (ILD) and inter-aural coherence (IC). For source localization accuracy, using the noise dataset, we estimate the direction of arrival by finding the maximum RMS value of the spatial power map, $\hat{\mathbf{m}}$, from the estimated Ambisonics and then calculate the azimuth and elevation errors relative to the ground truth. For simplicity, all metrics are computed for SOA and averaged across time and frequency over all evaluation samples.
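A sketch of the localization-error computation described above, assuming a hypothetical direction grid accompanying the power map; azimuth errors are wrapped to ±180°.

```python
import numpy as np

def doa_from_power_map(m, dirs_az_el):
    """Pick the direction with the maximum RMS power over time and frequency.
    m: (D, T, F) power map; dirs_az_el: (D, 2) azimuth/elevation in radians."""
    rms = np.sqrt(np.mean(m ** 2, axis=(1, 2)))
    return dirs_az_el[np.argmax(rms)]

def angular_errors(est_az_el, gt_az_el):
    """Azimuth error with wrap-around and elevation error, both in degrees."""
    d_az = np.angle(np.exp(1j * (est_az_el[0] - gt_az_el[0])))   # wrap to (-pi, pi]
    d_el = est_az_el[1] - gt_az_el[1]
    return np.degrees(np.abs(d_az)), np.degrees(np.abs(d_el))
```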
V-B Baseline methods
We implement four baseline encoding methods (two SP-based and two DL-based) for performance comparison with the proposed method. The first SP-based method uses a delay-and-sum beamformer (DSB) to generate virtual loudspeaker signals for 160 fixed directions and synthesizes Ambisonics by filtering each loudspeaker signal with the corresponding SH functions. The second SP-based method, similar to the conventional baselines in [6, 4, 11], filters the microphone signals using an encoding matrix, $\mathbf{E}(f)$, which linearly transforms the signals as

$$\hat{\mathbf{b}}(f) = \mathbf{E}(f)\,\mathbf{p}(f), \tag{11}$$

where $\hat{\mathbf{b}}(f)$ is the estimated Ambisonic signals. A closed-form solution for $\mathbf{E}(f)$ can be obtained in a least-squares sense, by minimizing the expectation of the $\ell_2$-norm difference between $\hat{\mathbf{b}}(f)$ and $\mathbf{b}(f)$ [3, 32]:

$$\mathbf{E}(f) = \mathbf{Y}\,\mathbf{A}^{\mathsf{H}}(f)\,\big(\mathbf{A}(f)\,\mathbf{A}^{\mathsf{H}}(f) + \lambda\,\mathbf{I}\big)^{-1}, \tag{12}$$

where $\lambda$ is a regularization parameter set to 0.01 in our experiment. The steering matrix $\mathbf{A}(f)$ is sampled at 5° resolution. To address the ambiguity issue (Sec. III-D), when it is unknown whether the sound sources are above or below the horizontal plane, the encoding matrix is generated using only the steering vectors above the horizontal plane; when this information is known, separate encoding matrices are generated for the two cases using the corresponding steering vectors. The third method adopts the U-net-based architecture [11] with unchanged hyperparameters and loss functions, while the fourth method adopts the "SpatialNet-big" preset from SpatialNet [22], employing all the loss functions from Sec. III-C except the spatial power map loss. No channel permutation is applied to these DNN models.
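A sketch of the regularized LS encoder of Eqs. (11)-(12), assuming free-field plane-wave steering vectors for the omnidirectional capsules (a particular sign convention is assumed); the SH matrix $\mathbf{Y}$ can be built as in the earlier SOA encoding sketch.

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

def steering_matrix(freq, mic_pos, dirs_xyz):
    """Free-field plane-wave steering matrix A(f) for omnidirectional capsules.
    mic_pos: (M, 3); dirs_xyz: (L, 3) unit vectors pointing towards the sources."""
    k = 2.0 * np.pi * freq / C
    return np.exp(1j * k * mic_pos @ dirs_xyz.T)          # (M, L)

def ls_encoding_matrix(freq, mic_pos, dirs_xyz, Y, reg=0.01):
    """Regularized least-squares Ambisonic encoding matrix (cf. Eq. (12)).
    Y: (9, L) real SH matrix evaluated at the same directions."""
    A = steering_matrix(freq, mic_pos, dirs_xyz)          # (M, L)
    M = A.shape[0]
    inv = np.linalg.inv(A @ A.conj().T + reg * np.eye(M))
    return Y @ A.conj().T @ inv                           # (9, M)
```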
V-C Results
Table I shows the performance of the four baseline methods and the proposed method (including two ablation settings) in terms of audio quality and source localization accuracy. The second and third rows refer to LS-based filtering with a single filter (I) generated using steering vectors above the horizontal plane, and with two filters (II) generated separately for the upper and lower half-spaces, respectively.
Regarding audio quality, the DSB-based method performs worst in both timbral and spatial quality, likely due to the frequency-dependent energy roll-off (which leads to signal coloration) and low spatial resolution outside the horizontal plane. LS-based filtering improves audio quality, especially in phase-related metrics such as SI-SNR, with further gains when separate filters are used. The DL-based baselines outperform SP-based ones as discretization artifacts are avoided, and SpatialNet outperforms U-net, possibly due to its narrow-band processing. Our proposed model performs similarly or better in timbral quality and SH-domain spatial quality, with significantly better binaural spatial quality, likely due to the addition of the power map loss. For source localization accuracy, SP-based baselines, despite having lower audio quality, achieve better azimuth prediction than DL-based ones. This indicates that DNNs trained with only channel-wise loss functions may not fully preserve spatial information. In addition, using separate LS-based filters reduces the elevation error, confirming the ambiguity between upper and lower half-spaces. In comparison, our proposed method yields the lowest localization errors for both azimuth and elevation angles. The impact of adding channel permutation and the power map loss is analyzed by comparing the last three rows, where we see significant improvements in spatial quality and localization accuracy, with slight degradation in timbral quality. Compared to SpatialNet, our model without permutation and the power map loss still yields better ENV, LSD, and binaural quality, with similar performance in other metrics.
Fig. 3 shows the spatial power maps computed from the ground truth and estimated Ambisonics, for a segment of a two-speaker scenario. The proposed method demonstrates better preservation of the spatial information in the captured sound field, while the two baseline methods introduce artifacts, such as “ghost” source images and shifts in the source positions.
VI Conclusion and future work
In this paper, we presented a DL-based Ambisonic encoding method designed for multi-speaker scenarios, such as teleconferencing, using a circular microphone array. The two-stage network architecture mimics plane wave decomposition and Ambisonics synthesis, incorporating channel permutation and a spatial loss function to enhance spatial information preservation. Evaluation with simulated speech and noise datasets demonstrated that our method significantly improves spatial audio quality and source localization accuracy compared to existing baseline methods. Future work could explore the impact of different microphone array layouts and further optimize the network architecture.
References
- [1] M. A. Gerzon, “Periphony: With-height sound reproduction,” Journal of the audio engineering society, vol. 21, no. 1, pp. 2–10, 1973.
- [2] B. Rafaely, Fundamentals of spherical array processing. Springer, 2015, vol. 8.
- [3] S. Moreau, J. Daniel, and S. Bertet, “3D sound field recording with higher order ambisonics: Objective measurements and validation of a 4th order spherical microphone,” in 120th Convention of the AES, 2006, pp. 20–23.
- [4] L. McCormack, A. Politis, R. Gonzalez, T. Lokki, and V. Pulkki, “Parametric ambisonic encoding of arbitrary microphone arrays,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2062–2075, 2022.
- [5] T. Lübeck, J. M. Arend, and C. Pörschmann, “Spatial upsampling of sparse spherical microphone array signals,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1163–1174, 2023.
- [6] S. Gao, X. Wu, and T. Qu, “High order ambisonics encoding method using differential microphone array,” in Audio Engineering Society Convention 144. Audio Engineering Society, 2018.
- [7] A. Bastine, L. Birnie, T. D. Abhayapala, P. Samarasinghe, and V. Tourbabin, “Ambisonics capture using microphones on head-worn device of arbitrary geometry,” in 2022 30th European Signal Processing Conference (EUSIPCO). IEEE, 2022, pp. 309–313.
- [8] Y. Gayer, V. Tourbabin, Z. Ben-Hur, J. Donley, and B. Rafaely, “Ambisonics encoding for arbitrary microphone arrays incorporating residual channels for binaural reproduction,” arXiv preprint arXiv:2402.17362, 2024.
- [9] C. Schörkhuber and R. Höldrich, “Ambisonic microphone encoding with covariance constraint,” in Proceedings of the International Conference on Spatial Audio, 2017, pp. 7–10.
- [10] S. Gao, J. Lin, X. Wu, and T. Qu, “Sparse DNN model for frequency expanding of higher order ambisonics encoding process,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1124–1135, 2022.
- [11] M. Heikkinen, A. Politis, and T. Virtanen, “Neural ambisonics encoding for compact irregular microphone arrays,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 701–705.
- [12] P. Morgado, N. Vasconcelos, T. Langlois, and O. Wang, “Self-supervised generation of spatial audio for 360° video,” Advances in Neural Information Processing Systems, vol. 31, 2018.
- [13] A. Rana, C. Ozcinar, and A. Smolic, “Towards generating ambisonics using audio-visual cue for virtual reality,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2012–2016.
- [14] W. Lim and J. Nam, “Enhancing spatial audio generation with source separation and channel panning loss,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 8321–8325.
- [15] Y. Zhu, Q. Kong, J. Shi, S. Liu, X. Ye, J.-C. Wang, H. Shan, and J. Zhang, “End-to-end paired ambisonic-binaural audio rendering,” IEEE/CAA Journal of Automatica Sinica, vol. 11, no. 2, pp. 502–513, 2024.
- [16] G. Routray, S. Basu, P. Baldev, and R. M. Hegde, “Deep-sound field analysis for upscaling ambisonic signals,” in EAA Spatial Audio Signal Processing Symposium, 2019, pp. 1–6.
- [17] T. Ochiai, M. Delcroix, T. Nakatani, R. Ikeshita, K. Kinoshita, and S. Araki, “Neural network-based virtual microphone estimator,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6114–6118.
- [18] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241.
- [19] R. Gu, S.-X. Zhang, M. Yu, and D. Yu, “3D spatial features for multi-channel target speech separation,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 996–1002.
- [20] B. Rafaely, “Plane-wave decomposition of the sound field on a sphere by spherical convolution,” The Journal of the Acoustical Society of America, vol. 116, no. 4, pp. 2149–2157, 2004.
- [21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
- [22] C. Quan and X. Li, “SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1310–1323, 2024.
- [23] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR: half-baked or well done?” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 626–630.
- [24] L. McCormack, S. Delikaris-Manias, and V. Pulkki, “Parametric acoustic camera for real-time sound capture, analysis and tracking,” in Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), 2017, pp. 412–419.
- [25] F. Lluís, N. Meyer-Kahlen, V. Chatziioannou, and A. Hofmann, “Direction specific ambisonics source separation with end-to-end deep learning,” Acta Acustica, vol. 7, p. 29, 2023.
- [26] M. Gräf and D. Potts, “On the computation of spherical designs by a new optimization approach based on fast spherical Fourier transforms,” Numerische Mathematik, vol. 119, no. 4, pp. 699–724, 2011. [Online]. Available: https://homepage.univie.ac.at/manuel.graef/quadrature.php
- [27] X. Wu, Z. Wu, L. Ju, and S. Wang, “Binaural audio-visual localization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2961–2968.
- [28] Y. Luo and R. Gu, “Fast random approximation of multi-channel room impulse response,” in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW). IEEE, 2024, pp. 449–454.
- [29] L. McCormack and A. Politis, “SPARTA & COMPASS: Real-time implementations of linear and parametric spatial audio reproduction and processing methods,” in AES International Conference on Immersive and Interactive Audio. Audio Engineering Society, 2019.
- [30] A. Gray and J. Markel, “Distance measures for speech processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 380–391, 1976.
- [31] W. G. Gardner and K. D. Martin, “HRTF measurements of a KEMAR,” The Journal of the Acoustical Society of America, vol. 97, no. 6, pp. 3907–3908, 1995.
- [32] A. Politis and H. Gamper, “Comparing modeled and measurement-based spherical harmonic encoding filters for spherical microphone arrays,” in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 224–228.