Audio Spotforming Using Nonnegative Tensor Factorization with Attractor-Based Regularization
Abstract
Spotforming is a target-speaker extraction technique that uses multiple microphone arrays. It applies beamforming (BF) to each microphone array and estimates the components common to the BF outputs as the target source. This study proposes a new common-component extraction method based on nonnegative tensor factorization (NTF) that offers higher model interpretability and makes spotforming more robust to hyperparameter settings. Moreover, attractor-based regularization is introduced to facilitate the automatic selection of optimal target bases in the NTF. Experimental results show that the proposed method outperforms conventional methods in spotforming and exhibits characteristics suitable for practical use.
Index Terms:
Microphone arrays, beamforming, nonnegative matrix/tensor factorization, attractor-based regularization
I Introduction
Target speaker extraction extracts only the target source from the observed signal. This technique can be applied to front-end systems in various audio applications, including automatic speech recognition.
Beamforming (BF) [1] is the most common approach for target speaker extraction when using a microphone array. Because BF emphasizes all source signals present in a specific direction from the microphone array, interference sources in the same direction cannot be suppressed. Spotforming [2] using multiple microphone arrays was proposed to solve this problem. Spotforming aims to extract only the target source from a specific area, as shown in Fig. 1. A spatial spotforming filter using all synchronized microphone arrays was proposed [2]. In addition, the optimal arrangement of multiple microphone arrays for spotforming was investigated [3].
As another effective approach, spotforming utilizing nonnegative matrix factorization (NMF) [4] has been proposed [5]. This method emphasizes the target directions using each microphone array with a BF. Then, it applies NMF [4] to the BF outputs concatenated in the time-frame dimension, as shown in Fig. 2 (a). The target source is estimated as the common component extracted from the NMF decomposition results using a binary mask obtained by thresholding the activation matrix. However, this decomposition lacks model interpretability because of the absence of explicit modeling of the relationship between each basis vector and each of the BF outputs, making it difficult to introduce effective regularization to enhance discriminative basis learning. Moreover, the performance depends on the setting of the hyperparameters, including the number of basis vectors and the threshold value for extraction of common components. These hyperparameters must be tuned in advance based on the observed signal characteristics, such as the signal length and time-frequency structures of the target and interference sources.


In this study, we propose a new spotforming method that has higher model interpretability and achieves spotforming that is more robust to hyperparameter settings. To this end, we utilize nonnegative tensor factorization (NTF) [6] for common component extraction, as shown in Fig. 2 (b). An allocation matrix is introduced to represent the basis vectors corresponding to the target source components, resulting in higher model interpretability. In addition, this approach enables the incorporation of attractor-based regularization to simultaneously facilitate discriminative basis learning and automatic optimization of the number of basis vectors for each source. While our proposed method involves two hyperparameters, the number of basis vectors and the weight coefficient of the regularizer, the spotforming performance is shown to be relatively robust to variations in these values.
II Spotforming Using Multiple Microphone Arrays
II-A Scenario of Spotforming and Its Signal Model
In this study, we consider the scenario depicted in Fig. 1. The target and interference sources are located in the same direction relative to each microphone array. Although the BF in each microphone array can enhance the target source, the interference sources in the same direction are also enhanced. However, because the residual interference sources are different for each BF output, estimation of the target source becomes feasible by extracting only the common components among the BF output signals.
Let denote the time-frequency components of the multichannel observed signal captured by -th microphone array, with its elements defined as , where , , , and are the indices of frequency bins, time frames, microphone arrays, and microphones within each array, respectively. In spotforming using distributed microphone arrays, BF is initially applied within each microphone array as , where represents the output spectrogram of the BF operation and denotes a specific look direction. We also define the tensor for as , with its elements denoted as . The objective of spotforming is to estimate the components of the target source from that correspond to the common components of the BF outputs .
II-B Conventional NMF-Based Spotforming
In the conventional method [5], NMF extracts common components. Let denote an input matrix of NMF, with its elements defined as , where is the index of columns. This matrix is constructed by concatenating the amplitude spectrograms in the time-frame dimension, as shown in Fig. 2 (a), namely,
(1) |
where for matrices denotes an element-wise absolute operation and . Matrix is decomposed using NMF as follows:
(2) |
where and are the basis and activation matrices, respectively, and is the index of NMF basis vectors. The optimization problem of (2) is defined as follows:
(3) |
where is a divergence function.
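As a rough illustration of (1)–(3), the sketch below concatenates stand-in amplitude spectrograms along the time-frame axis and factorizes the result with the standard multiplicative updates for the generalized Kullback–Leibler divergence. The names `X`, `W`, `H`, the array sizes, the random stand-in data, and the initialization range are assumptions made for this example only, not values taken from the experiments.

```python
import numpy as np

def generalized_kl(X, Y, eps=1e-12):
    """Generalized KL divergence D(X | Y), summed over all elements."""
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))

def kl_nmf(X, n_bases, n_iter=100, seed=0, eps=1e-12):
    """Factorize a nonnegative matrix X (freq x columns) as W @ H."""
    rng = np.random.default_rng(seed)
    n_freq, n_cols = X.shape
    # Assumed initialization range; chosen only to keep values positive.
    W = rng.uniform(0.1, 1.0, size=(n_freq, n_bases))
    H = rng.uniform(0.1, 1.0, size=(n_bases, n_cols))
    costs = []
    for _ in range(n_iter):
        R = X / (W @ H + eps)                        # element-wise ratio X / (WH)
        W *= (R @ H.T) / (H.sum(axis=1) + eps)       # basis update
        R = X / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # activation update
        costs.append(generalized_kl(X, W @ H))
    return W, H, costs

# Stand-in BF output spectrograms for N arrays (random complex values).
N, I, J, K = 2, 16, 20, 4
rng = np.random.default_rng(1)
Y = [rng.standard_normal((I, J)) + 1j * rng.standard_normal((I, J)) for _ in range(N)]

# Concatenate the amplitude spectrograms in the time-frame dimension, as in (1).
X = np.concatenate([np.abs(Yn) for Yn in Y], axis=1)
W, H, costs = kl_nmf(X, K, n_iter=50)
```

With this layout, columns `n*J` to `(n+1)*J - 1` of `H` hold the activations belonging to the `n`-th array, which is what the thresholding step operates on.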
After the estimation of and , a binary mask matrix is calculated as
(4) |
where is an element of and is a thresholding value. This binary mask defines the component with an activation greater than in all the microphone arrays as the common target source component and sets it to unity.
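The thresholding described above can be sketched as follows. The sketch assumes the activation matrix is stored as bases by concatenated frames, with one contiguous block of frames per microphone array; the function name `common_component_mask` and the toy numbers are hypothetical.

```python
import numpy as np

def common_component_mask(H, n_arrays, tau):
    """Return a (bases x frames) binary mask.

    H is assumed to have shape (K, n_arrays * J), where columns
    n*J ... (n+1)*J - 1 belong to the n-th microphone array.
    An entry is set to 1 only if the activation exceeds tau
    in every array at that basis and time frame.
    """
    n_bases, total = H.shape
    n_frames = total // n_arrays
    blocks = H.reshape(n_bases, n_arrays, n_frames)   # (K, N, J)
    return (blocks > tau).all(axis=1).astype(float)

# Toy activations: K = 2 bases, N = 2 arrays, J = 2 frames.
H = np.array([[0.9, 0.1, 0.8, 0.7],
              [0.2, 0.6, 0.9, 0.1]])
mask = common_component_mask(H, n_arrays=2, tau=0.5)
# Basis 0, frame 0 is active in both arrays (0.9 and 0.8), so only that entry is 1.
```

Because the mask depends on a single global threshold, a component that is slightly weaker in one array is discarded entirely, which is one way to see why the threshold needs tuning per recording.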
The spectrogram of the target source can be estimated by a Wiener filter using the binary mask as follows. (Footnote 1: The method proposed in [5] does not apply (5) but directly uses the estimated NMF model to obtain with a phase recovery technique. Because (5) slightly improves the performance, we employ (5) in this paper.)
(5) |
where is the -th element of . To obtain the time domain signals , an inverse short-time Fourier transform (STFT) is applied to , where is the signal length of . Finally, a delay-and-sum operation using is performed for further enhancement of the target source signal.
III Proposed Method
III-A Motivations
The NMF model in Fig. 2 (a) lacks model interpretability because of the absence of explicit modeling of the relationship between each basis vector in and each of the BF outputs . Consequently, it is difficult to regularize or to enhance discrimination between the target and interference source components. Moreover, the hyperparameters and must be appropriately tuned depending on the observed signal in advance, which increases the difficulty of putting the conventional method into practical use.
To address these shortcomings, we propose a novel spotforming method that utilizes an NTF to model the spectrograms of the BF outputs (Fig. 2 (b)). The proposed method allocates basis vectors in to each microphone array using an allocation matrix . Furthermore, we introduce an attractor-based regularization into to automatically allocate each basis vector to the corresponding BF outputs. This significantly increases the interpretability of the model and makes the basis vectors more discriminative than those of the conventional method. This regularization also automatically optimizes the number of basis vectors for the target source, making spotforming robust to hyperparameter settings.
III-B NTF-Based Spotforming
In the proposed method, the input tensor of the NTF is defined as follows (Fig. 2 (b)):
(6) |
where is an element of . In contrast to (1), the three-dimensional tensor maintains the physical dimensions of the microphone array, frequency bins, and time frames. Then, is decomposed into three matrices, i.e., the allocation matrix , the basis matrix , and the activation matrix , as
(7) |
where and are the elements of and , respectively, and and are the column vectors of and , respectively.
Matrix allocates basis vectors to microphone arrays to approximate input tensor . Because the target source components are commonly included across all BF outputs , such basis vectors should be allocated to all microphone arrays. Therefore, if represents the target source, should be . In contrast, if corresponds to the other (interference) sources, should be a one-hot vector. This allocation can be interpreted as a partitional clustering of the basis vectors. Although such optimization can be facilitated by (7) and the low-rank approximation property in NTF, we introduce a new attractor-based regularization to enhance the aforementioned partitional clustering further.
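To make the role of the allocation matrix concrete, the sketch below builds the three-way model as a sum over bases of allocation, basis, and activation factors. The names `A`, `W`, `H`, the sizes, and in particular the uniform value 1/N used for the target column are assumptions for this illustration (they presume that each allocation vector sums to one).

```python
import numpy as np

def ntf_reconstruct(A, W, H):
    """Model tensor: X_hat[n, i, j] = sum_k A[n, k] * W[i, k] * H[k, j]."""
    return np.einsum("nk,ik,kj->nij", A, W, H)

rng = np.random.default_rng(0)
n_arrays, n_freq, n_frames, n_bases = 2, 5, 7, 3
W = rng.uniform(0.1, 1.0, size=(n_freq, n_bases))
H = rng.uniform(0.1, 1.0, size=(n_bases, n_frames))

# Column 0: target basis, shared by both arrays (assumed uniform allocation).
# Column 1: interference basis seen only by array 0 (one-hot).
# Column 2: interference basis seen only by array 1 (one-hot).
A = np.array([[0.5, 1.0, 0.0],
              [0.5, 0.0, 1.0]])

X_hat = ntf_reconstruct(A, W, H)
```

In this toy setting each slice `X_hat[n]` is the shared target component plus the interference component private to array `n`, which is the clustering structure that the allocation vectors are meant to encode.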
The optimization problem of the proposed method is formulated as follows:
(8) |
where is the weight coefficient, and the regularization term is defined as
(9) | ||||
(10) | ||||
(11) |
Also, is an element of , is the index of attractor vectors , and . The set encompasses attractor vectors for each class; corresponds to the target-source class, while correspond to the other interference-source classes related to each microphone array. calculated by (11) corresponds to the index of the nearest attractor vector (class) from the current allocation vector . Thus, the regularization term in (9) forces to be closer to the nearest attractor vector, , emphasizing source clustering of the basis vectors . This can further affect the basis matrix, resulting in more discriminative basis vectors. In addition, this regularization automatically classifies basis vectors into target and interference sources. Consequently, the optimal number of basis vectors for the target source is estimated jointly during the optimization.
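The nearest-attractor search and the resulting regularization value can be sketched as below. Several details are assumptions of this example rather than statements of the method: the target attractor is taken to be the uniform vector with entries 1/N, the interference attractors are one-hot vectors, and the generalized Kullback–Leibler divergence is evaluated from the attractor to the allocation vector (with 0 log 0 treated as 0).

```python
import numpy as np

def make_attractors(n_arrays):
    """Rows: index 0 is the assumed target attractor, 1..N are one-hot."""
    target = np.full((1, n_arrays), 1.0 / n_arrays)
    return np.vstack([target, np.eye(n_arrays)])      # shape (N + 1, N)

def attractor_divergences(A, C, eps=1e-12):
    """D[m, k] = generalized KL divergence D(C[m] | A[:, k])."""
    Cb = C[:, :, None]                                # (M, N, 1)
    Ab = A[None, :, :]                                # (1, N, K)
    ratio = np.where(Cb > 0, Cb / (Ab + eps), 1.0)    # log(1) = 0 where C is 0
    return np.sum(Cb * np.log(ratio) - Cb + Ab, axis=1)

# Allocation matrix with four bases over two arrays (columns are bases).
A = np.array([[0.5, 0.9, 0.05, 0.6],
              [0.5, 0.1, 0.95, 0.4]])
C = make_attractors(2)
D = attractor_divergences(A, C)
m = D.argmin(axis=0)          # index of the nearest attractor for each basis
reg = D.min(axis=0).sum()     # regularization value for the current A
```

Here the first and last bases are closest to the target attractor and the middle two are closest to the attractors of array 0 and array 1, respectively, so the regularizer would pull each column further toward its own class.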
After the optimization of , , and , a binary vector is calculated as follows:
(12) |
where indicates that corresponds to the target source. In contrast to (4), (12) is independent of the time frame .
Similar to (5), the spectrogram of the target source is obtained using the following Wiener filter:
(13) |
The other post-processing steps are the same as those used in the conventional method.
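A minimal sketch of the masking step follows, assuming the Wiener filter takes the usual ratio form: the model restricted to target bases divided by the full model, applied to the complex BF output of each array. The names `Y`, `A`, `W`, `H`, `z` and the random stand-in data are illustrative, and the binary vector is supplied by hand here rather than computed by (12).

```python
import numpy as np

def wiener_target(Y, A, W, H, z, eps=1e-12):
    """Apply a ratio mask built from the bases flagged in z.

    Y: complex BF outputs, shape (N, I, J)
    A: allocation (N, K), W: bases (I, K), H: activations (K, J)
    z: binary vector (K,), 1 for bases treated as target
    """
    full = np.einsum("nk,ik,kj->nij", A, W, H)
    target = np.einsum("nk,ik,kj->nij", A * z[None, :], W, H)
    return (target / (full + eps)) * Y

rng = np.random.default_rng(0)
N, I, J, K = 2, 6, 8, 3
Y = rng.standard_normal((N, I, J)) + 1j * rng.standard_normal((N, I, J))
A = rng.uniform(0.1, 1.0, size=(N, K))
W = rng.uniform(0.1, 1.0, size=(I, K))
H = rng.uniform(0.1, 1.0, size=(K, J))

S_hat = wiener_target(Y, A, W, H, z=np.array([1.0, 0.0, 0.0]))
S_all = wiener_target(Y, A, W, H, z=np.ones(K))    # every basis kept
S_none = wiener_target(Y, A, W, H, z=np.zeros(K))  # every basis rejected
```

Since the gain lies between 0 and 1 at every time-frequency point, the filter can only attenuate the BF output and it reuses the observed phase.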
III-C Derivation of Update Rules
The cost function in (8) can be minimized using a majorization-minimization (MM) algorithm [7]. In this study, we use a generalized Kullback–Leibler divergence
(14) |
in (3), (8), and (11), which provides better performance in many audio source separation tasks, e.g., [8]. We define the cost function in (8) as . By applying Jensen’s inequality, we obtain the majorization function as follows:
(15) |
where denotes equality up to a constant and is an auxiliary variable that satisfies . The equality holds if and only if
(16) |
From , we obtain
Thus,
(17) |
By substituting (16) into (17), we obtain the update rule for as
(18) |
Note that index must always be updated by (11) before updating . Similar to (18), the update rules for and can be derived as follows:
(19) | ||||
(20) |
To ensure and , we apply the normalization of and after (18) and (19), respectively, such that the cost function does not change by scaling each column of .
For the convergence of the proposed optimization algorithm, we can state the following theorem. This ensures a theoretical nonincrease in the cost function in the proposed method.
Theorem 1. The cost function in (8) is monotonically nonincreasing under the update rules (11) and (18)–(20).
Proof.
Based on the MM algorithm, update rules (18)–(20) ensure a monotonic nonincrease of the cost function in (8). Thus, the monotonic nonincrease of the entire algorithm (11), (18)–(20) depends on whether (11) has a monotonic nonincrease property. Let and represent the old and updated indices, respectively. As is updated such that is minimized in (11), the following inequality holds:
(21) |
Therefore, the update rule (11) does not increase the value of the cost function in (8). ∎
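The nonincrease property can be checked numerically with a small sketch. The updates below are not a transcription of (18)–(20); they are MM-style updates derived independently for this example under stated assumptions: the regularizer is the generalized Kullback–Leibler divergence from the nearest attractor to each allocation vector, the target attractor is uniform with entries 1/N, the weight `mu` is constant, all factors start from random values, and no normalization is applied.

```python
import numpy as np

EPS = 1e-12

def gkl(P, Q):
    """Generalized KL divergence D(P | Q), with 0*log(0) treated as 0."""
    ratio = np.where(P > 0, P / (Q + EPS), 1.0)
    return float(np.sum(P * np.log(ratio) - P + Q))

def reconstruct(A, W, H):
    return np.einsum("nk,ik,kj->nij", A, W, H)

def nearest(A, C):
    """Index of the attractor (row of C) closest to each column of A."""
    Cb, Ab = C[:, :, None], A[None, :, :]
    ratio = np.where(Cb > 0, Cb / (Ab + EPS), 1.0)
    return np.sum(Cb * np.log(ratio) - Cb + Ab, axis=1).argmin(axis=0)

def cost(X, A, W, H, C, m, mu):
    return gkl(X, reconstruct(A, W, H)) + mu * gkl(C[m].T, A)

def step(X, A, W, H, C, mu):
    m = nearest(A, C)                                  # index update first
    R = X / (reconstruct(A, W, H) + EPS)
    A = (A * np.einsum("nij,ik,kj->nk", R, W, H) + mu * C[m].T) / (
        W.sum(axis=0) * H.sum(axis=1) + mu)
    R = X / (reconstruct(A, W, H) + EPS)
    W = W * np.einsum("nij,nk,kj->ik", R, A, H) / (
        A.sum(axis=0) * H.sum(axis=1) + EPS)
    R = X / (reconstruct(A, W, H) + EPS)
    H = H * np.einsum("nij,nk,ik->kj", R, A, W) / (
        (A.sum(axis=0) * W.sum(axis=0))[:, None] + EPS)
    return A, W, H, m

rng = np.random.default_rng(0)
N, I, J, K, mu = 2, 6, 8, 4, 5.0
X = rng.uniform(0.1, 1.0, size=(N, I, J))              # stand-in input tensor
A = rng.uniform(0.1, 1.0, size=(N, K))
W = rng.uniform(0.1, 1.0, size=(I, K))
H = rng.uniform(0.1, 1.0, size=(K, J))
C = np.vstack([np.full((1, N), 1.0 / N), np.eye(N)])   # assumed attractors

costs = []
for _ in range(30):
    A, W, H, m = step(X, A, W, H, C, mu)
    costs.append(cost(X, A, W, H, C, m, mu))
```

Each recorded cost uses the index chosen at the start of that iteration, so re-selecting the nearest attractor at the next iteration can only keep or lower the value, mirroring the argument in the proof.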
IV Experiment
IV-A Conditions
A spotforming experiment was conducted to validate the proposed method. To simulate the recording environment illustrated in Fig. 3, we used a two-dimensional image method implemented in Pyroomacoustics [9]. We simulated two reverberation times for each environment, and ms, resulting in four recording conditions. The speech signals listed in Table I, randomly selected from the LibriTTS [10] dataset, were used as the dry sources. These dry sources were normalized to have uniform signal energies before being convolved with the room impulse responses. The effect of background noise was not considered in this experiment.
We applied a minimum variance distortionless response (MVDR) BF to each observed signal of the microphone array and obtained enhanced signals . The target steering vectors and noise covariance matrices for each MVDR BF were set to their oracle values calculated from impulse responses. This setting simulates ideal BF preprocessing so that the net performance of the common-component extraction in the conventional and proposed methods can be compared. All microphones were synchronized in this experiment.
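For reference, the textbook MVDR weight vector for one frequency bin is the inverse noise covariance applied to the steering vector, normalized so that the target direction passes with unit gain. The sketch below uses a random stand-in covariance matrix and a synthetic steering vector; neither is an oracle value from the simulated rooms, and the function name `mvdr_weights` is hypothetical.

```python
import numpy as np

def mvdr_weights(R_noise, d):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin."""
    Rinv_d = np.linalg.solve(R_noise, d)
    return Rinv_d / (d.conj() @ Rinv_d)

rng = np.random.default_rng(0)
n_mics = 4

# Stand-in Hermitian positive-definite noise covariance matrix.
G = rng.standard_normal((n_mics, 50)) + 1j * rng.standard_normal((n_mics, 50))
R_noise = G @ G.conj().T / 50 + 1e-3 * np.eye(n_mics)

# Synthetic steering vector (pure phase shifts across microphones).
d = np.exp(-1j * np.pi * 0.3 * np.arange(n_mics))

w = mvdr_weights(R_noise, d)
w_ds = d / np.vdot(d, d)                         # delay-and-sum weights for comparison

gain_target = np.vdot(w, d)                      # w^H d
noise_mvdr = np.vdot(w, R_noise @ w).real        # w^H R w
noise_ds = np.vdot(w_ds, R_noise @ w_ds).real
```

Both weight vectors leave the target direction undistorted, but the MVDR solution minimizes the residual noise power among all such vectors, so `noise_mvdr` does not exceed `noise_ds`.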
As an evaluation criterion, we used the source-to-distortion ratio (SDR) [11] of the target source, a common score reflecting the total source separation quality. Because the NMF and NTF results depend on the initial random values of the parameters, we used 10 random seeds. We calculated the average SDR scores and their standard deviations. The other conditions are listed in Table II.


IV-B Results and Discussion
Figs. 4 and 5 show the results of the two- and three-microphone-array cases, respectively. The hyperparameter for the conventional method was tested with 12 settings, and the three best-performing settings are shown in Figs. 4 and 5. These results confirm that the proposed method outperforms the conventional method in all cases. Moreover, the proposed method maintained better performance as increased, whereas the performance of the conventional method degraded for large values of under some settings of .
The hyperparameter in the proposed method also affected the performance. Fig. 6 shows the behavior of the proposed method for various settings of . The proposed method achieved its best performance when was set to a sufficiently large value, e.g., . Strong regularization with (9) results in a hard classification of the basis vectors , namely, each allocation vector coincides with one of the attractor vectors . This phenomenon was consistently confirmed under other values of and . Thus, such optimization tends to provide better spotforming performance for the proposed method. Thanks to this property, better results can be obtained robustly by using sufficiently large values of and .
TABLE I: Dry source signals
File name | Source |
---|---|
84_121123_000008_000002.wav | Target |
652_130737_000012_000000.wav | Interf. 0 |
3000_15664_000020_000005.wav | Interf. 1 |
1272_141231_000024_000005.wav | Interf. 2 |

TABLE II: Experimental conditions
Condition | Value |
---|---|
Sampling frequency | Downsampled to 16 kHz |
Window function used in STFT | Hann window |
Window length in STFT | 32 ms |
Window shift length in STFT | 16 ms |
Number of iterations in NMF/NTF | 100 |
Initial values of , , and | Uniform random values in the range |
Initial values of | All the elements are set to |
Weight coefficient | for first 50 iterations, and for the rest of iterations |




V Conclusion
This study has proposed a new spotforming algorithm that unifies NTF and attractor-based regularization. The regularization term is designed based on the partitional clustering of the NTF basis vectors into target and interference sources. The experimental results revealed that the proposed method outperformed the conventional NMF-based spotforming technique.
References
- [1] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications. Springer-Verlag Berlin Heidelberg New York, 2001.
- [2] M. Taseska and E. A. P. Habets, “Spotforming: spatial filtering with distributed arrays for position-selective sound acquisition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 7, pp. 1291–1304, 2016.
- [3] K. Sekiguchi, Y. Bando, K. Itoyama, and K. Yoshii, “Layout optimization of cooperative distributed microphone arrays based on estimation of source separation performance,” J. Robotics and Mechatronics, vol. 29, no. 1, pp. 83–93, 2017.
- [4] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
- [5] Y. Kagimoto, K. Itoyama, K. Nishida, and K. Nakadai, “Spotforming by NMF using multiple microphone arrays,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., pp. 9253–9258, 2022.

