
Determined Multichannel Blind Source Separation with Clustered Source Model

Jianyu Wang, Shanzheng Guan
Northwestern Polytechnical University, China
{alexwang96, gshanzheng}@mail.nwpu.edu.cn
Abstract

The independent low-rank matrix analysis (ILRMA) method stands out as a prominent technique for multichannel blind audio source separation. It leverages nonnegative matrix factorization (NMF) and nonnegative canonical polyadic decomposition (NCPD) to model source parameters. While it effectively captures the low-rank structure of sources, the NMF model overlooks inter-channel dependencies. On the other hand, NCPD preserves intrinsic structure but lacks interpretable latent factors, making it challenging to incorporate prior information as constraints. To address these limitations, we introduce a clustered source model based on nonnegative block-term decomposition (NBTD). This model defines blocks as outer products of vectors (clusters) and matrices (for spectral structure modeling), offering interpretable latent vectors. Moreover, it enables straightforward integration of orthogonality constraints to ensure independence among source images. Experimental results demonstrate that our proposed method outperforms ILRMA and its extensions in anechoic conditions and surpasses the original ILRMA in simulated reverberant environments.

Index Terms:
Independent low-rank matrix analysis, multichannel blind audio source separation, block-term decomposition.

I Introduction

In real-world applications, the observation of a source signal of interest frequently suffers from interference. It is therefore essential to use signal processing techniques to extract the latent sources from mixtures observed by multiple microphones [1, 2]. Generally, there are two different paradigms for source separation. One is beamforming, which extracts the components from the mixed signals using direction information and spatial filtering techniques [3, 4, 5]. Such methods usually assume that the array geometry and the incidence angle of each source are known. The other paradigm is multichannel blind audio source separation (MBASS) based on independent component analysis (ICA) [6], which exploits the statistical independence of the source signals. This study focuses on the latter paradigm.

Generally, MBASS is conducted in the short-time Fourier transform (STFT) domain to address convolutive mixing. A significant challenge, however, arises from inner permutations, which can greatly degrade separation performance. To tackle this issue, independent vector analysis (IVA) [7] was adopted, and the majorization-minimization (MM) principle [8] was introduced to derive fast update rules for IVA [9]. Despite their effectiveness in handling permutations, IVA-based methods often overlook spectral structure information. To incorporate spectral structure, nonnegative matrix factorization (NMF) [10] was extended to multichannel cases for MBASS [11, 12, 13]. However, multichannel NMF methods are computationally demanding and sensitive to parameter initialization. In response to these challenges, the independent low-rank matrix analysis (ILRMA) method was devised [14]. ILRMA enforces an interpretable low-rank structure constraint on the spectrogram and employs a rank-one relaxation for the spatial model.

To further improve performance, efforts have been directed towards enhancing the source model by generalizing the distribution of the source signals and achieving initialization-robust performance [15, 16, 17, 18, 19]. Additionally, research has explored the integration of prior constraints on the parameters of the source model to enhance MBASS performance [20, 21]. However, these NMF-based methods typically fail to capture the inter-channel dependencies and higher-order structures inherent in multichannel data. While Kitamura et al. [14] also discussed a CPD-based source model, its ability to represent distinct spectral structures remains limited.

In this paper, we propose a clustered source model for determined blind source separation. Utilizing nonnegative block-term decomposition (NBTD), the source model parameters are expressed as a summation of several components, each being the outer product of a vector and a matrix. By applying orthogonality constraints to the latent vectors of this decomposition, they gain a clear interpretation, revealing distinct clusters of sources.

II Signal model and problem formulation

Let $M$ and $N$ be the number of microphones and sources, respectively. The observed signal in the STFT domain can be expressed as

\mathbf{x}_{ij}=\sum_{n=1}^{N}\mathbf{c}_{nij}=\mathbf{A}_{i}\mathbf{s}_{ij},\quad 1\leq i\leq I,\quad 1\leq j\leq J, (1)

where the subscripts $i$ and $j$ denote the frequency and frame indices, respectively, $I$ and $J$ denote the total numbers of STFT bins and time frames, $\mathbf{x}_{ij}=\left[\begin{array}{ccc}x_{1ij}&\dots&x_{Mij}\end{array}\right]^{\intercal}\in\mathbb{C}^{M}$ and $\mathbf{s}_{ij}=\left[\begin{array}{ccc}s_{1ij}&\dots&s_{Nij}\end{array}\right]^{\intercal}\in\mathbb{C}^{N}$ are the vectors consisting of the STFT coefficients of the observed and source signals, respectively, $\mathbf{c}_{nij}=\left[\begin{array}{ccc}c_{nij1}&\dots&c_{nijM}\end{array}\right]^{\intercal}\in\mathbb{C}^{M}$ represents the image of the $n$th source, and $\mathbf{A}_{i}=\left[\begin{array}{ccc}\mathbf{a}_{1i}&\dots&\mathbf{a}_{Ni}\end{array}\right]\in\mathbb{C}^{M\times N}$ is the mixing matrix. All signals are assumed to have zero mean.

With the signal model given in (1), the problem of blind source separation (BSS) becomes one of identifying a demixing matrix $\mathbf{D}_{i}$ such that

\mathbf{y}_{ij}=\mathbf{D}_{i}\mathbf{x}_{ij},\quad 1\leq i\leq I,\quad 1\leq j\leq J, (2)

where $\mathbf{y}_{ij}=\left[\begin{array}{ccc}y_{1ij}&\dots&y_{Nij}\end{array}\right]^{\intercal}\in\mathbb{C}^{N}$ denotes an estimate of the source signal $\mathbf{s}_{ij}$, and $\mathbf{D}_{i}=\left[\begin{array}{ccc}\mathbf{d}_{1i}&\dots&\mathbf{d}_{Mi}\end{array}\right]\in\mathbb{C}^{N\times M}$. The difficulty of this identification depends on many factors, e.g., the number of sources, the number of sensors, the nature of the source signals, and the properties of the mixing system. In this work, we focus on the case where the number of sources is equal to the number of microphones, i.e., $M=N$, and assume that the source signals are mutually independent and their distributions are stationary. Under these conditions, the MBASS problem can be solved using only second-order statistics [2].
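As a concrete illustration of (2), the following minimal sketch (our own, with illustrative names; not code from the paper) applies one demixing matrix per frequency bin to a multichannel STFT:

```python
import numpy as np

def demix(X, D):
    """Apply y_ij = D_i x_ij for every time-frequency bin.

    X : complex array, shape (I, J, M)  -- observed mixtures
    D : complex array, shape (I, N, M)  -- one demixing matrix per frequency bin
    Returns Y : complex array, shape (I, J, N), the separated signals.
    """
    # Y[i, j, n] = sum_m D[i, n, m] * X[i, j, m]
    return np.einsum('inm,ijm->ijn', D, X)
```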

The covariance matrix of the mixtures is $\mathbf{X}_{ij}=\mathbb{E}\left[\mathbf{x}_{ij}\mathbf{x}_{ij}^{\mathsf{H}}\right]$. Substituting the signal model given in (1), we obtain

\mathbf{X}_{ij}=\mathbf{A}_{i}\mathbb{E}\left[\mathbf{s}_{ij}\mathbf{s}_{ij}^{\mathsf{H}}\right]\mathbf{A}_{i}^{\mathsf{H}}. (3)

Under the assumption of statistical independence among the sources, the covariance matrix of the sources, $\boldsymbol{\Lambda}_{ij}=\mathbb{E}\left[\mathbf{s}_{ij}\mathbf{s}_{ij}^{\mathsf{H}}\right]$, is a diagonal matrix, which can be written as $\boldsymbol{\Lambda}_{ij}=\mbox{Diag}\left(\lambda_{1ij},\lambda_{2ij},\cdots,\lambda_{Nij}\right)$. Let us stack all the nonzero parameters in $\boldsymbol{\Lambda}_{ij}$, over all bins and frames, to form a third-order tensor, i.e.,

\boldsymbol{\lambda}=\left[\begin{array}{cccccc}\boldsymbol{\lambda}_{1}&\boldsymbol{\lambda}_{2}&\cdots&\boldsymbol{\lambda}_{n}&\cdots&\boldsymbol{\lambda}_{N}\end{array}\right], (5)

where $\boldsymbol{\lambda}_{n}\in\mathbb{R}^{I\times J}$ is a matrix consisting of all the parameters associated with the $n$th source, and $\boldsymbol{\lambda}\in\mathbb{R}^{N\times I\times J}$ is a third-order tensor that encompasses all the parameters of the source model.

In ILRMA, the NMF tool is used to decompose the matrix $\boldsymbol{\lambda}_{n}$, $n=1,\cdots,N$, into the following form:

\boldsymbol{\lambda}_{n}=\mathbf{T}_{n}\mathbf{V}_{n}, (6)

where $\mathbf{T}_{n}$ is a basis matrix of size $I\times K$, with $K$ being the number of bases, and $\mathbf{V}_{n}$ is the activation matrix of size $K\times J$. Source modeling based on this decomposition is referred to as the NMF source model.
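For illustration, a minimal NumPy sketch of the NMF source model (6); the dimensions and the random initialization are ours and only serve as an example:

```python
import numpy as np

I, J, K = 513, 128, 10            # frequency bins, frames, bases (example values)
rng = np.random.default_rng(0)
T_n = rng.random((I, K))          # nonnegative basis matrix
V_n = rng.random((K, J))          # nonnegative activation matrix
lambda_n = T_n @ V_n              # low-rank model of the n-th source, Eq. (6)
```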

Another way to model the source parameters is through a rank-one tensor decomposition of $\boldsymbol{\lambda}$, leading to the nonnegative CPD (NCPD) based source model. In NCPD, the tensor $\boldsymbol{\lambda}$ is expressed as

\boldsymbol{\lambda}=\sum_{k=1}^{K}\mathbf{z}_{k}\circ\tilde{\mathbf{t}}_{k}\circ\tilde{\mathbf{v}}_{k}, (7)

where

\mathbf{z}_{k}=\left[\begin{array}{cccc}z_{1k}&z_{2k}&\cdots&z_{Nk}\end{array}\right]^{\intercal}, (9)
\tilde{\mathbf{t}}_{k}=\left[\begin{array}{cccc}\tilde{t}_{1k}&\tilde{t}_{2k}&\cdots&\tilde{t}_{Ik}\end{array}\right]^{\intercal}, (11)
\tilde{\mathbf{v}}_{k}=\left[\begin{array}{cccc}\tilde{v}_{1k}&\tilde{v}_{2k}&\cdots&\tilde{v}_{Jk}\end{array}\right]^{\intercal}, (13)

and $\circ$ denotes the outer product.
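To make the NCPD model concrete, the following sketch (our illustration, not code from the paper) reconstructs the parameter tensor of (7); the array names mirror the symbols above:

```python
import numpy as np

def ncpd_model(Z, T_tilde, V_tilde):
    """Z: (N, K), T_tilde: (I, K), V_tilde: (J, K)  ->  lambda: (N, I, J).

    Implements lambda[n, i, j] = sum_k Z[n, k] * T_tilde[i, k] * V_tilde[j, k],
    i.e., a sum of K rank-one tensors as in Eq. (7).
    """
    return np.einsum('nk,ik,jk->nij', Z, T_tilde, V_tilde)
```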

II-A Proposed source model

Figure 1: Illustration of the block-term decomposition.

While the NCPD method captures the multichannel nature of the data by expressing the source parameter tensor as a sum of rank-one tensors, multichannel signals often exhibit a more nuanced structure in the time-frequency domain. This structure can be better modeled with the so-called nonnegative block-term decomposition (NBTD), a tensor factorization model well suited to capturing localized and recurring patterns in the time-frequency representation of speech signals [22]. The decomposition is illustrated in Fig. 1 and can be written mathematically as follows:

\boldsymbol{\lambda}=\sum_{o=1}^{O}\left(\mathbf{T}_{o}\mathbf{V}_{o}\right)\circ\mathbf{g}_{o}, (14)
\mathbf{T}_{o}=\begin{bmatrix}t_{o11}&t_{o12}&\cdots&t_{o1K}\\ t_{o21}&t_{o22}&\cdots&t_{o2K}\\ \vdots&\vdots&\ddots&\vdots\\ t_{oI1}&t_{oI2}&\cdots&t_{oIK}\end{bmatrix},
\mathbf{V}_{o}=\begin{bmatrix}v_{o11}&v_{o12}&\cdots&v_{o1J}\\ v_{o21}&v_{o22}&\cdots&v_{o2J}\\ \vdots&\vdots&\ddots&\vdots\\ v_{oK1}&v_{oK2}&\cdots&v_{oKJ}\end{bmatrix},
\mathbf{g}_{o}=\left[\begin{array}{cccc}g_{1o}&g_{2o}&\cdots&g_{No}\end{array}\right]^{\intercal}, (16)

where $\boldsymbol{\lambda}\in\mathbb{R}_{\geq 0}^{N\times I\times J}$, $\mathbf{T}_{o}\in\mathbb{R}_{\geq 0}^{I\times K}$, $\mathbf{V}_{o}\in\mathbb{R}_{\geq 0}^{K\times J}$, and $\mathbf{G}=\left[\begin{array}{cccc}\mathbf{g}_{1}&\mathbf{g}_{2}&\dots&\mathbf{g}_{O}\end{array}\right]\in\mathbb{R}_{\geq 0}^{N\times O}$ are all nonnegative, and $\circ$ denotes the outer product. It is easy to check that, in this decomposition, each element of $\boldsymbol{\lambda}$ is given by $\lambda_{nij}=\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}$. Source modeling based on this decomposition is referred to as the NBTD source model.
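A minimal sketch of the NBTD source model (14), assuming every block shares the same rank $K$ and using illustrative array names:

```python
import numpy as np

def nbtd_model(T, V, G):
    """T: (O, I, K), V: (O, K, J), G: (N, O)  ->  lambda: (N, I, J).

    Implements lambda[n, i, j] = sum_o G[n, o] * sum_k T[o, i, k] * V[o, k, j].
    """
    B = np.einsum('oik,okj->oij', T, V)    # per-block spectrogram T_o V_o, shape (O, I, J)
    return np.einsum('no,oij->nij', G, B)  # mix the blocks with the cluster matrix G
```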

Let us introduce two matrices:

\overset{\circ}{\boldsymbol{\Lambda}}_{ij}=\begin{bmatrix}\sum_{k}t_{1ik}v_{1kj}&\dots&0\\ \vdots&\ddots&\vdots\\ 0&\dots&\sum_{k}t_{Oik}v_{Okj}\end{bmatrix}\in\mathbb{R}_{\geq 0}^{O\times O}, (17)
\mathbf{U}=\begin{bmatrix}g_{11}^{\frac{1}{2}}&g_{12}^{\frac{1}{2}}&\dots&g_{1O}^{\frac{1}{2}}\\ g_{21}^{\frac{1}{2}}&g_{22}^{\frac{1}{2}}&\dots&g_{2O}^{\frac{1}{2}}\\ \vdots&\vdots&\ddots&\vdots\\ g_{N1}^{\frac{1}{2}}&g_{N2}^{\frac{1}{2}}&\dots&g_{NO}^{\frac{1}{2}}\end{bmatrix}\in\mathbb{R}_{\geq 0}^{N\times O}, (18)

where the matrix $\mathbf{U}$ satisfies an orthogonality constraint. This constraint not only relates the NBTD-based source model to k-means clustering of spectral components, but also allows (14) to be expressed as a diagonal matrix at each time-frequency bin:

\boldsymbol{\Lambda}_{ij}=\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}. (19)

Under the orthogonality constraint on $\mathbf{U}$, we have $\mathbf{U}\mathbf{U}^{\intercal}=\mathbf{I}_{N}$, which can also be expressed as

\sum_{o}g_{no}=1,\quad n=1,\dots,N. (20)

This constraint makes the rows of $\mathbf{G}$ interpretable as cluster assignments of the sources in the proposed clustered ILRMA (cILRMA), which fully exploits the spectral structure. We then build the generative model of the covariance matrix for cILRMA as

\hat{\mathbf{X}}_{ij}=\sum_{n}\mathbf{a}_{in}\mathbf{a}_{in}^{\mathsf{H}}\left(\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}\right)
=\mathbf{A}_{i}\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\mathbf{A}_{i}^{\mathsf{H}}. (21)
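For clarity, the model covariance (21) of a single time-frequency bin can be sketched as follows (illustrative names; the mixing matrix, the cluster matrix and the block variances are assumed to be given):

```python
import numpy as np

def model_covariance(A_i, U, block_var):
    """A_i: (M, N) mixing matrix, U: (N, O) matrix of g_no^{1/2},
    block_var: length-O vector holding sum_k t_oik v_okj for one bin (i, j).
    Returns X_hat_ij = A_i U diag(block_var) U^T A_i^H, shape (M, M)."""
    Lambda_ring = np.diag(block_var)                   # Eq. (17)
    return A_i @ U @ Lambda_ring @ U.T @ A_i.conj().T  # Eq. (21)
```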

Generally, it is assumed that the mixed signal in each STFT bin follows a zero-mean complex Gaussian distribution, i.e.,

\mathbf{x}_{ij}\sim\mathcal{N}_{\mathbb{C}}\left(\mathbf{x}_{ij}\,\big|\,\mathbf{0},\hat{\mathbf{X}}_{ij}\right). (22)

The maximum likelihood cost function for estimating the model parameters is then written as

\mathcal{L}=\sum_{ij}\Big[\mbox{tr}\left(\mathbf{D}_{i}^{-1}\mathbf{y}_{ij}\mathbf{y}_{ij}^{\mathsf{H}}\left(\mathbf{D}_{i}^{\mathsf{H}}\right)^{-1}\mathbf{D}_{i}^{\mathsf{H}}\left(\mathbf{U}^{\intercal}\right)^{-1}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})^{-1}\mathbf{U}^{-1}\mathbf{D}_{i}\right)
+\log\left(\det\mathbf{A}_{i}\right)\left(\det\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\right)\left(\det\mathbf{A}_{i}^{\mathsf{H}}\right)\Big]
+\sigma\,\mbox{tr}\left(\mathbf{U}\mathbf{U}^{\intercal}-\mathbf{I}_{N}\right), (23)

where $\sigma$ denotes the Lagrange multiplier associated with the orthogonality constraint on $\mathbf{U}$. The cILRMA problem is then converted into one of estimating the source-model parameters $t_{oik}$, $v_{okj}$, $g_{no}$ and the spatial-model parameters $\mathbf{D}_{i}$.

III Parameters Optimization

In this section, we derive the update rules for the source-model parameters using the objective function given in (23). For $\overset{\circ}{\boldsymbol{\Lambda}}_{ij}$, the objective function can be expressed as

\mathcal{L}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})=\mbox{tr}\left(\mathbf{y}_{ij}\mathbf{y}_{ij}^{\mathsf{H}}\left(\mathbf{U}^{\intercal}\right)^{\dagger}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})^{-1}\mathbf{U}^{\dagger}\right)+\log\det\left(\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\right). (24)

Since $\mathbf{U}$ is orthogonal, (24) can be further expressed as

\mathcal{L}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})=\mbox{tr}\left(\mathbf{U}^{\intercal}\mathbf{y}_{ij}\mathbf{y}_{ij}^{\mathsf{H}}\mathbf{U}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})^{-1}\right)+\log\det\left(\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\right)
=\sum_{ij}\Bigg[\sum_{o}\frac{\sum_{n}g_{no}|y_{nij}|^{2}}{\sum_{k}t_{oik}v_{okj}}+\sum_{n}\log\sum_{o}g_{no}\left(\sum_{k}t_{oik}v_{okj}\right)\Bigg]. (25)

Direct optimization of $\mathcal{L}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})$ with respect to $\overset{\circ}{\boldsymbol{\Lambda}}_{ij}$ is rather difficult. To circumvent this, we apply Jensen's inequality and the tangent-line inequality to obtain the following auxiliary function for (25):

\mathcal{Q}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})=\sum_{ij}\Bigg[\sum_{ok}\frac{\sum_{n}g_{no}|y_{nij}|^{2}}{t_{oik}v_{okj}}\alpha_{okij}^{2}
+\frac{1}{\beta_{oij}}\Bigg(\sum_{k}t_{oik}v_{okj}-\beta_{oij}\Bigg)+\log\beta_{oij}\Bigg], (26)

where $\alpha_{okij}$ and $\beta_{oij}$ are two auxiliary variables, and the equality holds if and only if $\alpha_{okij}=\frac{t_{oik}v_{okj}}{\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}}$ and $\beta_{oij}=\sum_{k}t_{oik}v_{okj}$.

Taking the partial derivatives of $\mathcal{Q}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})$ with respect to $t_{oik}$ and $v_{okj}$ and setting the results to zero, we obtain:

t_{oik}\leftarrow t_{oik}\sqrt{\frac{\sum_{nj}|y_{nij}|^{2}g_{no}v_{okj}\left(\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}\right)^{-2}}{\sum_{j}v_{okj}\left(\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}\right)^{-1}}}, (27)
v_{okj}\leftarrow v_{okj}\sqrt{\frac{\sum_{ni}|y_{nij}|^{2}g_{no}t_{oik}\left(\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}\right)^{-2}}{\sum_{i}t_{oik}\left(\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}\right)^{-1}}}. (28)
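A vectorized sketch of the multiplicative updates (27) and (28), under the same array conventions as the NBTD sketch above (Y2 holds $|y_{nij}|^{2}$); the small eps guard is our addition to avoid division by zero:

```python
import numpy as np

def update_TV(T, V, G, Y2, eps=1e-12):
    """T: (O, I, K), V: (O, K, J), G: (N, O), Y2: (N, I, J)."""
    D = np.einsum('oik,okj->oij', T, V) + eps      # sum_k t_oik v_okj
    P = np.einsum('no,nij->oij', G, Y2)            # sum_n g_no |y_nij|^2

    # Eq. (27): multiplicative update of t_oik
    num_t = np.einsum('oij,okj->oik', P / D**2, V)
    den_t = np.einsum('oij,okj->oik', 1.0 / D, V) + eps
    T = T * np.sqrt(num_t / den_t)

    # recompute the block model with the new T before updating V
    D = np.einsum('oik,okj->oij', T, V) + eps

    # Eq. (28): multiplicative update of v_okj
    num_v = np.einsum('oij,oik->okj', P / D**2, T)
    den_v = np.einsum('oij,oik->okj', 1.0 / D, T) + eps
    V = V * np.sqrt(num_v / den_v)
    return T, V
```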

Similarly, the maximum likelihood function with respect to $\mathbf{U}$ can be expressed as

\mathcal{L}(\mathbf{U})=\mbox{tr}\left(\mathbf{y}_{ij}\mathbf{y}_{ij}^{\mathsf{H}}\left(\mathbf{U}^{\intercal}\right)^{-1}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})^{-1}\mathbf{U}^{-1}\right)
+\log\det\left(\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\right)+\sigma\,\mbox{tr}\left(\mathbf{U}\mathbf{U}^{\intercal}-\mathbf{I}_{N}\right). (29)

Following the previous analysis, one can obtain the auxiliary function for (29) with respect to $\mathbf{G}$:

\mathcal{Q}(\mathbf{G})=\sum_{ijn}\left[\sum_{o}\frac{|y_{nij}|^{2}}{g_{no}\sum_{k}t_{oik}v_{okj}}\hat{\alpha}_{onij}^{2}\right]+\sigma\left(\sum_{no}g_{no}-N\right)
+\sum_{ijn}\left[\frac{1}{\hat{\beta}_{nij}}\bigg(\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}-\hat{\beta}_{nij}\bigg)+\log\hat{\beta}_{nij}\right], (30)

where $\hat{\alpha}_{onij}$ and $\hat{\beta}_{nij}$ are two auxiliary variables, and the equality holds if and only if $\hat{\alpha}_{onij}=\frac{g_{no}\sum_{k}t_{oik}v_{okj}}{\sum_{o^{\prime}}g_{no^{\prime}}\sum_{k}t_{o^{\prime}ik}v_{o^{\prime}kj}}$ and $\hat{\beta}_{nij}=\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}$. Taking the partial derivative of (30) with respect to $g_{no}$ and setting the result to zero gives the following update rule:

g_{no}\leftarrow g_{no}\sqrt{\frac{\sum_{ij}|y_{nij}|^{2}\left(\sum_{k}t_{oik}v_{okj}\right)\left(\sum_{o^{\prime}k}g_{no^{\prime}}t_{o^{\prime}ik}v_{o^{\prime}kj}\right)^{-2}}{\sum_{ij}\left(\sum_{k}t_{oik}v_{okj}\right)\left(\sum_{o^{\prime}k}g_{no^{\prime}}t_{o^{\prime}ik}v_{o^{\prime}kj}\right)^{-1}+\sigma}}. (31)
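The corresponding sketch of the update rule (31) for the cluster matrix $\mathbf{G}$, under the same array conventions (sigma is the Lagrange multiplier in (23)):

```python
import numpy as np

def update_G(G, T, V, Y2, sigma=1.0, eps=1e-12):
    """G: (N, O), T: (O, I, K), V: (O, K, J), Y2: (N, I, J)."""
    D = np.einsum('oik,okj->oij', T, V) + eps      # sum_k t_oik v_okj
    R = np.einsum('no,oij->nij', G, D) + eps       # sum_o g_no sum_k t_oik v_okj
    num = np.einsum('nij,oij->no', Y2 / R**2, D)
    den = np.einsum('nij,oij->no', 1.0 / R, D) + sigma
    return G * np.sqrt(num / den)
```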

The update rules for the demixing matrix $\mathbf{D}_{i}$ in cILRMA are similar to those in AuxIVA [9] and are given by:

\mathbf{O}_{ni}=\frac{1}{J}\sum_{j}\frac{1}{\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}}\mathbf{x}_{ij}\mathbf{x}_{ij}^{\mathsf{H}}, (32)
\mathbf{d}_{ni}\leftarrow\left[\mathbf{D}_{i}\mathbf{O}_{ni}\right]^{-1}\mathbf{e}_{n}, (33)
\mathbf{d}_{ni}\leftarrow\mathbf{d}_{ni}\left[\mathbf{d}_{ni}^{\mathsf{H}}\mathbf{O}_{ni}\mathbf{d}_{ni}\right]^{-\frac{1}{2}}, (34)

where $\mathbf{O}_{ni}$ denotes the auxiliary variable, and $\mathbf{e}_{n}$ denotes the $n$th column vector of the $N\times N$ identity matrix.
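A sketch of the iterative-projection updates (32)-(34) for one frequency bin, in the spirit of AuxIVA [9]; treating the $n$th row of $\mathbf{D}_{i}$ as $\mathbf{d}_{ni}^{\mathsf{H}}$ follows common AuxIVA implementations and is our assumption, as are the variable names:

```python
import numpy as np

def update_demixing(D_i, X_i, R_i, eps=1e-12):
    """D_i: (N, M) demixing matrix for bin i (determined case, M = N),
    X_i: (J, M) observed STFT frames at bin i,
    R_i: (N, J) model variances r_nij = sum_o g_no sum_k t_oik v_okj."""
    N, M = D_i.shape
    J = X_i.shape[0]
    for n in range(N):
        # Eq. (32): weighted covariance matrix O_ni
        w = 1.0 / (R_i[n] + eps)                    # shape (J,)
        O_ni = (X_i.T * w) @ X_i.conj() / J         # shape (M, M)
        # Eq. (33): solve (D_i O_ni) d = e_n instead of forming the inverse
        e_n = np.zeros(N, dtype=complex)
        e_n[n] = 1.0
        d = np.linalg.solve(D_i @ O_ni, e_n)
        # Eq. (34): normalization
        d = d / np.sqrt(np.real(d.conj() @ O_ni @ d) + eps)
        D_i[n, :] = d.conj()                        # n-th row of D_i set to d_ni^H
    return D_i
```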

Figure 2: SDR and SIR improvements for the studied methods. Panels (a), (b): female+female; (c), (d): male+male; (e), (f): female+male.

IV Experiments

IV-A Experimental configuration

We followed the SiSEC challenge [23] and selected speech signals from the Wall Street Journal (WSJ0) corpus [24] as the clean source signals. We then constructed evaluation signals for a speech separation task with $M=N=2$ in simulated room environments. The dimensions of the room were set to $8\times 8\times 3$ m. Two microphones were positioned at the center of the room with a spacing of $6$ cm. The two sources were located 2 m from the center of the two microphones. The incident angles of the two source signals were $80^{\circ}$ and $110^{\circ}$, respectively, with the direction normal to the line connecting the two microphones defined as $0^{\circ}$. We employed the image method [25] to generate room impulse responses, where the sound absorption coefficients were determined using Sabine's formula [26]. The reverberation time $T_{60}$ was varied from 0 to $600$ ms in steps of $50$ ms. For each of the three gender combinations and every value of $T_{60}$, 100 sets of mixed signals were generated for evaluation. The sampling rate is $16$ kHz.

The value of the hyperparameter $\sigma$ in cILRMA was set to 1. We compared cILRMA with AuxIVA [9], MNMF [13], ILRMA [14], tILRMA [15], generalized Gaussian distribution ILRMA (GGDILRMA) [17], and mILRMA [27]. The signal-to-distortion ratio (SDR) and source-to-interference ratio (SIR) are used as the performance metrics; their definitions can be found in [28].
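As a usage note, the BSS Eval metrics of [28] can be computed with, e.g., the mir_eval package (a tooling assumption on our part; the paper does not state which implementation was used):

```python
import mir_eval

def evaluate(reference, estimate):
    """reference, estimate: arrays of shape (n_sources, n_samples)."""
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimate)
    return sdr, sir
```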

Figure 3: Average SDR and SIR improvements versus the value of $O$. Conditions: source signals are from two female speakers in WSJ0 and there is no reverberation.

Figure 4: Average SDR and SIR improvements versus the number of bases. Conditions: source signals are from two female speakers in WSJ0 and there is no reverberation.

Figure 5: Average SDR and SIR improvements versus iteration number. Conditions: source signals are from two female speakers in WSJ0 and there is no reverberation.

IV-B Main results

Figure 2 plots the average performance in different reverberant environments. It is evident that cILRMA achieves larger improvements in SDR and SIR than the other algorithms, although the gap narrows as the reverberation time increases.

An essential parameter of the proposed source model is the number of blocks $O$. To explore its impact on performance, a set of experiments was conducted. Figure 3 depicts the SDR and SIR improvements across various values of $O$, with the source signals originating from two female speakers. The results indicate that performance improves with increasing $O$, suggesting that a larger $O$ leads to a more accurate source model.

Figure 4 illustrates the SDR and SIR improvements achieved by cILRMA and ILRMA across varying numbers of bases. The results show that, regardless of the number of bases, cILRMA consistently outperforms ILRMA by approximately $4$ dB.

The convergence behavior of cILRMA and ILRMA is shown in Fig. 5. It can be seen that cILRMA requires approximately 100 iterations to surpass ILRMA.

V Conclusions

The paper presented a clustered source model tailored for ILRMA-based MBASS. Leveraging the NBTD technique, this model defines blocks as outer products of vectors (clusters) and matrices for spectral structure modeling, thus providing interpretable latent vectors. By integrating orthogonality constraints, the model ensures independence among source images. Experimental results demonstrated the superiority of the proposed method over its traditional counterparts in anechoic environments.

References

  • [1] Y. Huang, J. Benesty, and J. Chen, Acoustic MIMO signal processing, Springer Science & Business Media, 2006.
  • [2] A. Belouchrani, K. Abed-Meraim, J.F. Cardoso, and E. Moulines, “A blind source separation technique using second-order statistics,” IEEE Trans. Signal Process., vol. 45, no. 2, pp. 434–444, Feb. 1997.
  • [3] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing, Springer, 2008.
  • [4] J. Benesty, I. Cohen, and J. Chen, Fundamentals of signal enhancement and array signal processing, John Wiley & Sons, Singapore Pte. Ltd., 2018.
  • [5] S. Lee, S. H. Park, and K.M. Sung, “Beamspace-domain multichannel nonnegative matrix factorization for audio source separation,” IEEE Signal Process. Lett., vol. 19, no. 1, pp. 43–46, Jan. 2011.
  • [6] P. Comon, “Independent component analysis, a new concept?,” Signal Process., vol. 36, no. 3, pp. 287–314, Apr. 1994.
  • [7] T. Kim, T. Eltoft, and T.-W. Lee, “Independent vector analysis: An extension of ICA to multivariate components,” in Proc. Int. Conf. Independent Compon. Anal. Blind Source Separation. Springer, Oct. 2006, pp. 165–172.
  • [8] Y. Sun, P. Babu, and D. P. Palomar, “Majorization-minimization algorithms in signal processing, communications, and machine learning,” IEEE Trans. Signal Process., vol. 65, no. 3, pp. 794–816, Feb. 2016.
  • [9] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. IEEE, Oct. 2011, pp. 189–192.
  • [10] D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proc. Adv. Neural Inf. Process. Syst. May. 2000, pp. 556–562, MIT Press.
  • [11] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 550–563, Mar. 2009.
  • [12] N. Duong, E. Vincent, and R. Gribonval, “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1830–1840, Sept. 2010.
  • [13] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Multichannel extensions of non-negative matrix factorization with complex-valued data,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 971–982, May. 2013.
  • [14] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE Trans. Audio, Speech, Lang. Process., vol. 24, no. 9, pp. 1626–1641, Sept. 2016.
  • [15] S. Mogami, D. Kitamura, Y. Mitsui, N. Takamune, H. Saruwatari, and N. Ono, “Independent low-rank matrix analysis based on complex student’s t-distribution for blind audio source separation,” in Proc. IEEE 27th Int. Workshop Mach. Learn. Signal Process. IEEE, Sept. 2017, pp. 1–6.
  • [16] D. Kitamura, S. Mogami, Y. Mitsui, N. Takamune, H. Saruwatari, N. Ono, Y. Takahashi, and K. Kondo, “Generalized independent low-rank matrix analysis using heavy-tailed distributions for blind source separation,” EURASIP J. Adv. Signal Process., vol. 2018, no. 1, pp. 1–25, May. 2018.
  • [17] R. Ikeshita and Y. Kawaguchi, “Independent low-rank matrix analysis based on multivariate complex exponential power distribution,” in Proc. IEEE ICASSP. IEEE, Apr. 2018, pp. 741–745.
  • [18] S. Mogami, N. Takamune, D. Kitamura, H. Saruwatari, Y. Takahashi, K. Kondo, H. Nakajima, and N. Ono, “Independent low-rank matrix analysis based on time-variant sub-gaussian source model,” in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf. IEEE, Nov. 2018, pp. 1684–1691.
  • [19] S. Mogami, N. Takamune, D. Kitamura, H. Saruwatari, Y. Takahashi, K. Kondo, and N. Ono, “Independent low-rank matrix analysis based on time-variant sub-gaussian source model for determined blind source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 28, pp. 503–518, Dec. 2019.
  • [20] Y. Mitsui, D. Kitamura, S. Takamichi, N. Ono, and H. Saruwatari, “Blind source separation based on independent low-rank matrix analysis with sparse regularization for time-series activity,” in Proc. IEEE ICASSP. IEEE, Mar. 2017, pp. 21–25.
  • [21] J. Wang, S. Guan, S. Liu, and X. Zhang, “Minimum-volume multichannel nonnegative matrix factorization for blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3089–3103, Oct. 2021.
  • [22] L. De Lathauwer, “Decompositions of a higher-order tensor in block terms—Part II: Definitions and uniqueness,” SIAM J. Matrix Anal. Appl., vol. 30, no. 3, pp. 1033–1066, 2008.
  • [23] S. Araki, F. Nesta, E. Vincent, Z. Koldovský, G. Nolte, A. Ziehe, and A. Benichoux, “The 2011 signal separation evaluation campaign (SiSEC2011): Audio source separation,” in Proc. LVA/ICA. Springer, 2012, pp. 414–422.
  • [24] J. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete LDC93S6A,” Linguistic Data Consortium, vol. 83, 1993.
  • [25] J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, June. 1979.
  • [26] R. W. Young, “Sabine reverberation equation and sound power calculations,” J. Acoust. Soc. Am., vol. 31, no. 7, pp. 912–921, July 1959.
  • [27] J. Wang, S. Guan, and X.-L. Zhang, “Minimum-volume regularized ILRMA for blind audio source separation,” in Proc. APSIPA ASC, 2021, pp. 630–634.
  • [28] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, July. 2006.