
Determined Multichannel Blind Source Separation with Clustered Source Model

Jianyu Wang, Shanzheng Guan
Northwestern Polytechnical University, China
{alexwang96, gshanzheng}@mail.nwpu.edu.cn
Abstract

The independent low-rank matrix analysis (ILRMA) method stands out as a prominent technique for multichannel blind audio source separation. It leverages nonnegative matrix factorization (NMF) and nonnegative canonical polyadic decomposition (NCPD) to model source parameters. While it effectively captures the low-rank structure of sources, the NMF model overlooks inter-channel dependencies. On the other hand, NCPD preserves intrinsic structure but lacks interpretable latent factors, making it challenging to incorporate prior information as constraints. To address these limitations, we introduce a clustered source model based on nonnegative block-term decomposition (NBTD). This model defines blocks as outer products of vectors (clusters) and matrices (for spectral structure modeling), offering interpretable latent vectors. Moreover, it enables straightforward integration of orthogonality constraints to ensure independence among source images. Experimental results demonstrate that our proposed method outperforms ILRMA and its extensions in anechoic conditions and surpasses the original ILRMA in simulated reverberant environments.

Index Terms:
Independent low-rank matrix analysis, multichannel blind audio source separation, block-term decomposition.

I Introduction

In real-world applications, the observation of a source signal of interest frequently suffers from interference. It is therefore essential to use signal processing techniques to extract the latent sources from mixtures observed by multiple microphones [1, 2]. Generally, there are two different paradigms for source separation. One is beamforming, which extracts the components from the mixed signals using direction information and spatial filtering techniques [3, 4, 5]. Such methods usually assume that the array geometry and the incidence angle of each source are known. The other paradigm is multichannel blind audio source separation (MBASS) based on independent component analysis (ICA) [6], which exploits the statistical independence of the source signals. This study focuses on the latter paradigm.

Generally, MBASS is conducted in the short-time Fourier transform (STFT) domain to address convolutive mixing. A significant challenge, however, arises from inner permutations, which can greatly degrade separation performance. To tackle this issue, independent vector analysis (IVA) [7] was adopted, and the majorization-minimization (MM) principle [8] was introduced to derive fast update rules for IVA [9]. Despite their effectiveness in handling permutations, IVA-based methods often overlook spectral structure information. To incorporate spectral structure, nonnegative matrix factorization (NMF) [10] was extended to multichannel cases for MBASS [11, 12, 13]. However, multichannel NMF methods are computationally demanding and sensitive to parameter initialization. In response to these challenges, the independent low-rank matrix analysis (ILRMA) method was devised [14]. ILRMA enforces an interpretable low-rank structure constraint on the spectrogram and employs a rank-one relaxation for the spatial model.

To further improve performance, efforts have been directed towards enhancing the source model by generalizing the distribution of the source signals and achieving initialization-robust performance [15, 16, 17, 18, 19]. Additionally, research has explored the integration of prior constraints on the parameters of the source model to enhance MBASS performance [20, 21]. However, these NMF-based methods typically fail to capture the inter-channel dependencies and higher-order structures inherent in multichannel data. While Kitamura et al. [14] also discussed a CPD-based source model, its ability to represent distinct spectral structures remains limited.

In this paper, we propose a clustered source model for determined blind source separation. Utilizing nonnegative block-term decomposition (NBTD), the source model parameters are expressed as a summation of several components, each being the outer product of a vector and a matrix. By applying orthogonality constraints to the latent vectors of this decomposition, they gain a clear interpretation, revealing distinct clusters of sources.

II Signal model and problem formulation

Let $M$ and $N$ be the number of microphones and sources, respectively. The observed signal in the STFT domain can be expressed as

\mathbf{x}_{ij}=\sum_{n=1}^{N}\mathbf{c}_{nij}=\mathbf{A}_{i}\mathbf{s}_{ij},\quad 1\leq i\leq I,\quad 1\leq j\leq J, (1)

where the subscripts $i$ and $j$ denote the frequency and frame indices, respectively, $I$ and $J$ denote the total numbers of STFT bins and time frames, $\mathbf{x}_{ij}=\left[\begin{array}{ccc}x_{1ij}&\dots&x_{Mij}\end{array}\right]^{\intercal}\in\mathbb{C}^{M}$ and $\mathbf{s}_{ij}=\left[\begin{array}{ccc}s_{1ij}&\dots&s_{Nij}\end{array}\right]^{\intercal}\in\mathbb{C}^{N}$ are the vectors consisting of the STFT coefficients of the observed and source signals, respectively, $\mathbf{c}_{nij}=\left[\begin{array}{ccc}c_{nij1}&\dots&c_{nijM}\end{array}\right]^{\intercal}\in\mathbb{C}^{M}$ represents the image of the $n$th source, and $\mathbf{A}_{i}=\left[\begin{array}{ccc}\mathbf{a}_{1i}&\dots&\mathbf{a}_{Ni}\end{array}\right]\in\mathbb{C}^{M\times N}$ is the mixing matrix. All signals are assumed to have zero mean.

With the signal model given in (1), the problem of blind source separation (BSS) becomes one of identifying a demixing matrix $\mathbf{D}_{i}$ such that

\mathbf{y}_{ij}=\mathbf{D}_{i}\mathbf{x}_{ij},\quad 1\leq i\leq I,\quad 1\leq j\leq J, (2)

where $\mathbf{y}_{ij}=\left[\begin{array}{ccc}y_{1ij}&\dots&y_{Nij}\end{array}\right]^{\intercal}\in\mathbb{C}^{N}$ denotes an estimate of the source signal $\mathbf{s}_{ij}$, and $\mathbf{D}_{i}=\left[\begin{array}{ccc}\mathbf{d}_{1i}&\dots&\mathbf{d}_{Mi}\end{array}\right]\in\mathbb{C}^{N\times M}$. The difficulty of this identification depends on many factors, e.g., the number of sources, the number of sensors, the nature of the source signals, and the properties of the mixing system. In this work, we focus on the case where the number of sources is equal to the number of microphones, i.e., $M=N$, and assume that the source signals are mutually independent and their distributions are stationary. Under these conditions, the MBASS problem can be solved using only second-order statistics [2].
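As a concrete illustration of (2), the following minimal sketch (our own, with illustrative names; not code from the paper) applies one demixing matrix per frequency bin to a multichannel STFT:

```python
import numpy as np

def demix(X, D):
    """Apply y_ij = D_i x_ij for every time-frequency bin.

    X : complex array, shape (I, J, M)  -- observed mixtures
    D : complex array, shape (I, N, M)  -- one demixing matrix per frequency bin
    Returns Y : complex array, shape (I, J, N), the separated signals.
    """
    # Y[i, j, n] = sum_m D[i, n, m] * X[i, j, m]
    return np.einsum('inm,ijm->ijn', D, X)
```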

The covariance matrix of the mixtures is $\mathbf{X}_{ij}=\mathbb{E}\left[\mathbf{x}_{ij}\mathbf{x}_{ij}^{\mathsf{H}}\right]$. Substituting the signal model given in (1), we obtain

\mathbf{X}_{ij}=\mathbf{A}_{i}\mathbb{E}\left[\mathbf{s}_{ij}\mathbf{s}_{ij}^{\mathsf{H}}\right]\mathbf{A}_{i}^{\mathsf{H}}. (3)

Under the assumption of statistical independence among the sources, the covariance matrix of the sources, $\boldsymbol{\Lambda}_{ij}=\mathbb{E}\left[\mathbf{s}_{ij}\mathbf{s}_{ij}^{\mathsf{H}}\right]$, is a diagonal matrix, which can be written as $\boldsymbol{\Lambda}_{ij}=\mbox{Diag}\left(\lambda_{1ij},\lambda_{2ij},\cdots,\lambda_{Nij}\right)$. Let us stack all the nonzero parameters in $\boldsymbol{\Lambda}_{ij}$, over all bins and frames, to form a third-order tensor, i.e.,

\boldsymbol{\lambda}=\left[\begin{array}{cccccc}\boldsymbol{\lambda}_{1}&\boldsymbol{\lambda}_{2}&\cdots&\boldsymbol{\lambda}_{n}&\cdots&\boldsymbol{\lambda}_{N}\end{array}\right], (5)

where $\boldsymbol{\lambda}_{n}\in\mathbb{R}^{I\times J}$ is a matrix consisting of all the parameters associated with the $n$th source, and $\boldsymbol{\lambda}\in\mathbb{R}^{N\times I\times J}$ is a third-order tensor that encompasses all the parameters of the source model.

In ILRMA, the NMF tool is used to decompose the matrix $\boldsymbol{\lambda}_{n}$, $n=1,\cdots,N$, into the following form:

\boldsymbol{\lambda}_{n}=\mathbf{T}_{n}\mathbf{V}_{n}, (6)

where $\mathbf{T}_{n}$ is a basis matrix of size $I\times K$, with $K$ being the number of bases, and $\mathbf{V}_{n}$ is the activation matrix of size $K\times J$. Source modeling based on this decomposition is referred to as the NMF source model.
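For illustration, a minimal NumPy sketch of the NMF source model (6); the dimensions and the random initialization are ours and only serve as an example:

```python
import numpy as np

I, J, K = 513, 128, 10            # frequency bins, frames, bases (example values)
rng = np.random.default_rng(0)
T_n = rng.random((I, K))          # nonnegative basis matrix
V_n = rng.random((K, J))          # nonnegative activation matrix
lambda_n = T_n @ V_n              # low-rank model of the n-th source, Eq. (6)
```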

Another way to model the source parameters is through a rank-one tensor decomposition of $\boldsymbol{\lambda}$, leading to the nonnegative CPD (NCPD) based source model. In NCPD, the tensor $\boldsymbol{\lambda}$ is expressed as

\boldsymbol{\lambda}=\sum_{k=1}^{K}\mathbf{z}_{k}\circ\tilde{\mathbf{t}}_{k}\circ\tilde{\mathbf{v}}_{k}, (7)

where

\mathbf{z}_{k}=\left[\begin{array}{cccc}z_{1k}&z_{2k}&\cdots&z_{Nk}\end{array}\right]^{\intercal}, (9)
\tilde{\mathbf{t}}_{k}=\left[\begin{array}{cccc}\tilde{t}_{1k}&\tilde{t}_{2k}&\cdots&\tilde{t}_{Ik}\end{array}\right]^{\intercal}, (11)
\tilde{\mathbf{v}}_{k}=\left[\begin{array}{cccc}\tilde{v}_{1k}&\tilde{v}_{2k}&\cdots&\tilde{v}_{Jk}\end{array}\right]^{\intercal}, (13)

and $\circ$ denotes the outer product.
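To make the NCPD model concrete, the following sketch (our illustration, not code from the paper) reconstructs the parameter tensor of (7); the array names mirror the symbols above:

```python
import numpy as np

def ncpd_model(Z, T_tilde, V_tilde):
    """Z: (N, K), T_tilde: (I, K), V_tilde: (J, K)  ->  lambda: (N, I, J).

    Implements lambda[n, i, j] = sum_k Z[n, k] * T_tilde[i, k] * V_tilde[j, k],
    i.e., a sum of K rank-one tensors as in Eq. (7).
    """
    return np.einsum('nk,ik,jk->nij', Z, T_tilde, V_tilde)
```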

II-A Proposed source model

Figure 1: Illustration of the block-term decomposition.

While the NCPD method captures the multichannel nature of the data by expressing the source parameter tensor as a sum of rank-one tensors, multichannel signals often exhibit a more nuanced structure in the time-frequency domain. This structure can be better modeled with the so-called nonnegative block-term decomposition (NBTD), a tensor factorization model well suited to capturing localized and recurring patterns in the time-frequency representation of speech signals [22]. The decomposition is illustrated in Fig. 1 and can be written mathematically as follows:

\boldsymbol{\lambda}=\sum_{o=1}^{O}\left(\mathbf{T}_{o}\mathbf{V}_{o}\right)\circ\mathbf{g}_{o}, (14)
\mathbf{T}_{o}=\begin{bmatrix}t_{o11}&t_{o12}&\cdots&t_{o1K}\\ t_{o21}&t_{o22}&\cdots&t_{o2K}\\ \vdots&\vdots&\ddots&\vdots\\ t_{oI1}&t_{oI2}&\cdots&t_{oIK}\end{bmatrix},
\mathbf{V}_{o}=\begin{bmatrix}v_{o11}&v_{o12}&\cdots&v_{o1J}\\ v_{o21}&v_{o22}&\cdots&v_{o2J}\\ \vdots&\vdots&\ddots&\vdots\\ v_{oK1}&v_{oK2}&\cdots&v_{oKJ}\end{bmatrix},
\mathbf{g}_{o}=\left[\begin{array}{cccc}g_{1o}&g_{2o}&\cdots&g_{No}\end{array}\right]^{\intercal}, (16)

where $\boldsymbol{\lambda}\in\mathbb{R}_{\geq 0}^{N\times I\times J}$, $\mathbf{T}_{o}\in\mathbb{R}_{\geq 0}^{I\times K}$, $\mathbf{V}_{o}\in\mathbb{R}_{\geq 0}^{K\times J}$, and $\mathbf{G}=\left[\begin{array}{cccc}\mathbf{g}_{1}&\mathbf{g}_{2}&\dots&\mathbf{g}_{O}\end{array}\right]\in\mathbb{R}_{\geq 0}^{N\times O}$ are all nonnegative, and $\circ$ denotes the outer product. It is easy to check that, in this decomposition, each element of $\boldsymbol{\lambda}$ is given by $\lambda_{nij}=\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}$. Source modeling based on this decomposition is referred to as the NBTD source model.
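A minimal sketch of the NBTD source model (14), assuming every block shares the same rank $K$ and using illustrative array names:

```python
import numpy as np

def nbtd_model(T, V, G):
    """T: (O, I, K), V: (O, K, J), G: (N, O)  ->  lambda: (N, I, J).

    Implements lambda[n, i, j] = sum_o G[n, o] * sum_k T[o, i, k] * V[o, k, j].
    """
    B = np.einsum('oik,okj->oij', T, V)    # per-block spectrogram T_o V_o, shape (O, I, J)
    return np.einsum('no,oij->nij', G, B)  # mix the blocks with the cluster matrix G
```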

Let us introduce two matrices:

\overset{\circ}{\boldsymbol{\Lambda}}_{ij}=\begin{bmatrix}\sum_{k}t_{1ik}v_{1kj}&\dots&0\\ \vdots&\ddots&\vdots\\ 0&\dots&\sum_{k}t_{Oik}v_{Okj}\end{bmatrix}\in\mathbb{R}_{\geq 0}^{O\times O}, (17)
\mathbf{U}=\begin{bmatrix}g_{11}^{\frac{1}{2}}&g_{12}^{\frac{1}{2}}&\dots&g_{1O}^{\frac{1}{2}}\\ g_{21}^{\frac{1}{2}}&g_{22}^{\frac{1}{2}}&\dots&g_{2O}^{\frac{1}{2}}\\ \vdots&\vdots&\ddots&\vdots\\ g_{N1}^{\frac{1}{2}}&g_{N2}^{\frac{1}{2}}&\dots&g_{NO}^{\frac{1}{2}}\end{bmatrix}\in\mathbb{R}_{\geq 0}^{N\times O}, (18)

where the matrix $\mathbf{U}$ satisfies an orthogonality constraint. This constraint not only relates the NBTD-based source model to k-means clustering of spectral components, but also allows (14) to be expressed as a diagonal matrix at each time-frequency bin:

\boldsymbol{\Lambda}_{ij}=\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}. (19)

Under the orthogonality constraint on $\mathbf{U}$, we have $\mathbf{U}\mathbf{U}^{\intercal}=\mathbf{I}_{N}$, which can also be expressed as

\sum_{o}g_{no}=1,\quad n=1,\dots,N. (20)

This constraint makes the rows of $\mathbf{G}$ interpretable as cluster assignments of the sources in the proposed clustered ILRMA (cILRMA), which fully exploits the spectral structure. We then build the generative model of the covariance matrix for cILRMA as

\hat{\mathbf{X}}_{ij}=\sum_{n}\mathbf{a}_{in}\mathbf{a}_{in}^{\mathsf{H}}\left(\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}\right)
=\mathbf{A}_{i}\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\mathbf{A}_{i}^{\mathsf{H}}. (21)
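For clarity, the model covariance (21) of a single time-frequency bin can be sketched as follows (illustrative names; the mixing matrix, the cluster matrix and the block variances are assumed to be given):

```python
import numpy as np

def model_covariance(A_i, U, block_var):
    """A_i: (M, N) mixing matrix, U: (N, O) matrix of g_no^{1/2},
    block_var: length-O vector holding sum_k t_oik v_okj for one bin (i, j).
    Returns X_hat_ij = A_i U diag(block_var) U^T A_i^H, shape (M, M)."""
    Lambda_ring = np.diag(block_var)                   # Eq. (17)
    return A_i @ U @ Lambda_ring @ U.T @ A_i.conj().T  # Eq. (21)
```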

Generally, it is assumed that the mixed signal in each STFT bin follows a zero-mean complex Gaussian distribution, i.e.,

\mathbf{x}_{ij}\sim\mathcal{N}_{\mathbb{C}}\left(\mathbf{x}_{ij}\,\big|\,\mathbf{0},\hat{\mathbf{X}}_{ij}\right). (22)

The maximum likelihood cost function for estimating the model parameters is then written as

\mathcal{L}=\sum_{ij}\Big[\mbox{tr}\left(\mathbf{D}_{i}^{-1}\mathbf{y}_{ij}\mathbf{y}_{ij}^{\mathsf{H}}\left(\mathbf{D}_{i}^{\mathsf{H}}\right)^{-1}\mathbf{D}_{i}^{\mathsf{H}}\left(\mathbf{U}^{\intercal}\right)^{-1}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})^{-1}\mathbf{U}^{-1}\mathbf{D}_{i}\right)
+\log\left(\det\mathbf{A}_{i}\right)\left(\det\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\right)\left(\det\mathbf{A}_{i}^{\mathsf{H}}\right)\Big]
+\sigma\,\mbox{tr}\left(\mathbf{U}\mathbf{U}^{\intercal}-\mathbf{I}_{N}\right), (23)

where $\sigma$ denotes the Lagrange multiplier associated with the orthogonality constraint on $\mathbf{U}$. The cILRMA problem is then converted into one of estimating the source-model parameters $t_{oik}$, $v_{okj}$, $g_{no}$ and the spatial-model parameters $\mathbf{D}_{i}$.

III Parameters Optimization

In this section, we derive the update rules for the source-model parameters using the objective function given in (23). For $\overset{\circ}{\boldsymbol{\Lambda}}_{ij}$, the objective function can be expressed as

\mathcal{L}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})=\mbox{tr}\left(\mathbf{y}_{ij}\mathbf{y}_{ij}^{\mathsf{H}}\left(\mathbf{U}^{\intercal}\right)^{\dagger}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})^{-1}\mathbf{U}^{\dagger}\right)+\log\det\left(\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\right). (24)

Since $\mathbf{U}$ is orthogonal, (24) can be further expressed as

\mathcal{L}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})=\mbox{tr}\left(\mathbf{U}^{\intercal}\mathbf{y}_{ij}\mathbf{y}_{ij}^{\mathsf{H}}\mathbf{U}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})^{-1}\right)+\log\det\left(\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\right)
=\sum_{ij}\Bigg[\sum_{o}\frac{\sum_{n}g_{no}|y_{nij}|^{2}}{\sum_{k}t_{oik}v_{okj}}+\sum_{n}\log\sum_{o}g_{no}\left(\sum_{k}t_{oik}v_{okj}\right)\Bigg]. (25)

Direct optimization of $\mathcal{L}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})$ with respect to $\overset{\circ}{\boldsymbol{\Lambda}}_{ij}$ is rather difficult. To circumvent this, we apply Jensen's inequality and the tangent-line inequality to obtain the following auxiliary function for (25):

\mathcal{Q}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})=\sum_{ij}\Bigg[\sum_{ok}\frac{\sum_{n}g_{no}|y_{nij}|^{2}}{t_{oik}v_{okj}}\alpha_{okij}^{2}
+\frac{1}{\beta_{oij}}\Bigg(\sum_{k}t_{oik}v_{okj}-\beta_{oij}\Bigg)+\log\beta_{oij}\Bigg], (26)

where $\alpha_{okij}$ and $\beta_{oij}$ are two auxiliary variables, and the equality holds if and only if $\alpha_{okij}=\frac{t_{oik}v_{okj}}{\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}}$ and $\beta_{oij}=\sum_{k}t_{oik}v_{okj}$.

Taking the partial derivatives of $\mathcal{Q}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})$ with respect to $t_{oik}$ and $v_{okj}$ and setting the results to zero, we obtain:

t_{oik}\leftarrow t_{oik}\sqrt{\frac{\sum_{nj}|y_{nij}|^{2}g_{no}v_{okj}\left(\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}\right)^{-2}}{\sum_{j}v_{okj}\left(\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}\right)^{-1}}}, (27)
v_{okj}\leftarrow v_{okj}\sqrt{\frac{\sum_{ni}|y_{nij}|^{2}g_{no}t_{oik}\left(\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}\right)^{-2}}{\sum_{i}t_{oik}\left(\sum_{k^{\prime}}t_{oik^{\prime}}v_{ok^{\prime}j}\right)^{-1}}}. (28)
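A vectorized sketch of the multiplicative updates (27) and (28), under the same array conventions as the NBTD sketch above (Y2 holds $|y_{nij}|^{2}$); the small eps guard is our addition to avoid division by zero:

```python
import numpy as np

def update_TV(T, V, G, Y2, eps=1e-12):
    """T: (O, I, K), V: (O, K, J), G: (N, O), Y2: (N, I, J)."""
    D = np.einsum('oik,okj->oij', T, V) + eps      # sum_k t_oik v_okj
    P = np.einsum('no,nij->oij', G, Y2)            # sum_n g_no |y_nij|^2

    # Eq. (27): multiplicative update of t_oik
    num_t = np.einsum('oij,okj->oik', P / D**2, V)
    den_t = np.einsum('oij,okj->oik', 1.0 / D, V) + eps
    T = T * np.sqrt(num_t / den_t)

    # recompute the block model with the new T before updating V
    D = np.einsum('oik,okj->oij', T, V) + eps

    # Eq. (28): multiplicative update of v_okj
    num_v = np.einsum('oij,oik->okj', P / D**2, T)
    den_v = np.einsum('oij,oik->okj', 1.0 / D, T) + eps
    V = V * np.sqrt(num_v / den_v)
    return T, V
```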

Similarly, the maximum likelihood function with respect to $\mathbf{U}$ can be expressed as

\mathcal{L}(\mathbf{U})=\mbox{tr}\left(\mathbf{y}_{ij}\mathbf{y}_{ij}^{\mathsf{H}}\left(\mathbf{U}^{\intercal}\right)^{-1}(\overset{\circ}{\boldsymbol{\Lambda}}_{ij})^{-1}\mathbf{U}^{-1}\right)
+\log\det\left(\mathbf{U}\overset{\circ}{\boldsymbol{\Lambda}}_{ij}\mathbf{U}^{\intercal}\right)+\sigma\,\mbox{tr}\left(\mathbf{U}\mathbf{U}^{\intercal}-\mathbf{I}_{N}\right). (29)

Following the previous analysis, one can obtain the auxiliary function for (29) with respect to $\mathbf{G}$:

\mathcal{Q}(\mathbf{G})=\sum_{ijn}\left[\sum_{o}\frac{|y_{nij}|^{2}}{g_{no}\sum_{k}t_{oik}v_{okj}}\hat{\alpha}_{onij}^{2}\right]+\sigma\left(\sum_{no}g_{no}-N\right)
+\sum_{ijn}\left[\frac{1}{\hat{\beta}_{nij}}\bigg(\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}-\hat{\beta}_{nij}\bigg)+\log\hat{\beta}_{nij}\right], (30)

where $\hat{\alpha}_{onij}$ and $\hat{\beta}_{nij}$ are two auxiliary variables, and the equality holds if and only if $\hat{\alpha}_{onij}=\frac{g_{no}\sum_{k}t_{oik}v_{okj}}{\sum_{o^{\prime}}g_{no^{\prime}}\sum_{k}t_{o^{\prime}ik}v_{o^{\prime}kj}}$ and $\hat{\beta}_{nij}=\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}$. Taking the partial derivative of (30) with respect to $g_{no}$ and setting the result to zero gives the following update rule:

g_{no}\leftarrow g_{no}\sqrt{\frac{\sum_{ij}|y_{nij}|^{2}\left(\sum_{k}t_{oik}v_{okj}\right)\left(\sum_{o^{\prime}k}g_{no^{\prime}}t_{o^{\prime}ik}v_{o^{\prime}kj}\right)^{-2}}{\sum_{ij}\left(\sum_{k}t_{oik}v_{okj}\right)\left(\sum_{o^{\prime}k}g_{no^{\prime}}t_{o^{\prime}ik}v_{o^{\prime}kj}\right)^{-1}+\sigma}}. (31)
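The corresponding sketch of the update rule (31) for the cluster matrix $\mathbf{G}$, under the same array conventions (sigma is the Lagrange multiplier in (23)):

```python
import numpy as np

def update_G(G, T, V, Y2, sigma=1.0, eps=1e-12):
    """G: (N, O), T: (O, I, K), V: (O, K, J), Y2: (N, I, J)."""
    D = np.einsum('oik,okj->oij', T, V) + eps      # sum_k t_oik v_okj
    R = np.einsum('no,oij->nij', G, D) + eps       # sum_o g_no sum_k t_oik v_okj
    num = np.einsum('nij,oij->no', Y2 / R**2, D)
    den = np.einsum('nij,oij->no', 1.0 / R, D) + sigma
    return G * np.sqrt(num / den)
```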

The update rules for the demixing matrix $\mathbf{D}_{i}$ in cILRMA are similar to those in AuxIVA [9] and are given by:

\mathbf{O}_{ni}=\frac{1}{J}\sum_{j}\frac{1}{\sum_{o}g_{no}\sum_{k}t_{oik}v_{okj}}\mathbf{x}_{ij}\mathbf{x}_{ij}^{\mathsf{H}}, (32)
\mathbf{d}_{ni}\leftarrow\left[\mathbf{D}_{i}\mathbf{O}_{ni}\right]^{-1}\mathbf{e}_{n}, (33)
\mathbf{d}_{ni}\leftarrow\mathbf{d}_{ni}\left[\mathbf{d}_{ni}^{\mathsf{H}}\mathbf{O}_{ni}\mathbf{d}_{ni}\right]^{-\frac{1}{2}}, (34)

where $\mathbf{O}_{ni}$ denotes the auxiliary variable, and $\mathbf{e}_{n}$ denotes the $n$th column vector of the $N\times N$ identity matrix.
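A sketch of the iterative-projection updates (32)-(34) for one frequency bin, in the spirit of AuxIVA [9]; treating the $n$th row of $\mathbf{D}_{i}$ as $\mathbf{d}_{ni}^{\mathsf{H}}$ follows common AuxIVA implementations and is our assumption, as are the variable names:

```python
import numpy as np

def update_demixing(D_i, X_i, R_i, eps=1e-12):
    """D_i: (N, M) demixing matrix for bin i (determined case, M = N),
    X_i: (J, M) observed STFT frames at bin i,
    R_i: (N, J) model variances r_nij = sum_o g_no sum_k t_oik v_okj."""
    N, M = D_i.shape
    J = X_i.shape[0]
    for n in range(N):
        # Eq. (32): weighted covariance matrix O_ni
        w = 1.0 / (R_i[n] + eps)                    # shape (J,)
        O_ni = (X_i.T * w) @ X_i.conj() / J         # shape (M, M)
        # Eq. (33): solve (D_i O_ni) d = e_n instead of forming the inverse
        e_n = np.zeros(N, dtype=complex)
        e_n[n] = 1.0
        d = np.linalg.solve(D_i @ O_ni, e_n)
        # Eq. (34): normalization
        d = d / np.sqrt(np.real(d.conj() @ O_ni @ d) + eps)
        D_i[n, :] = d.conj()                        # n-th row of D_i set to d_ni^H
    return D_i
```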

Figure 2: SDR and SIR improvements for the studied methods. Panels (a), (b): female+female; (c), (d): male+male; (e), (f): female+male.

IV Experiments

IV-A Experimental configuration

We followed the SiSEC challenge [23] and selected speech signals from the Wall Street Journal (WSJ0) corpus [24] as the clean source signals. We then constructed evaluation signals for a speech separation task with $M=N=2$ in simulated room environments. The dimensions of the room were set to $8\times 8\times 3$ m. Two microphones were positioned at the center of the room with a spacing of $6$ cm. The two sources were located 2 m from the center of the two microphones. The incident angles of the two source signals were $80^{\circ}$ and $110^{\circ}$, respectively, with the direction normal to the line connecting the two microphones defined as $0^{\circ}$. We employed the image method [25] to generate room impulse responses, where the sound absorption coefficients were determined using Sabine's formula [26]. The reverberation time $T_{60}$ was varied from 0 to $600$ ms in steps of $50$ ms. For each of the three gender combinations and every value of $T_{60}$, 100 sets of mixed signals were generated for evaluation. The sampling rate is $16$ kHz.

The value of the hyperparameter $\sigma$ in cILRMA was set to 1. We compared cILRMA with AuxIVA [9], MNMF [13], ILRMA [14], tILRMA [15], generalized Gaussian distribution ILRMA (GGDILRMA) [17], and mILRMA [27]. The signal-to-distortion ratio (SDR) and source-to-interference ratio (SIR) are used as the performance metrics; their definitions can be found in [28].
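As a usage note, the BSS Eval metrics of [28] can be computed with, e.g., the mir_eval package (a tooling assumption on our part; the paper does not state which implementation was used):

```python
import mir_eval

def evaluate(reference, estimate):
    """reference, estimate: arrays of shape (n_sources, n_samples)."""
    sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimate)
    return sdr, sir
```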

Figure 3: Average SDR and SIR improvements versus the value of $O$. Conditions: source signals are from two female speakers in WSJ0 and there is no reverberation.

Figure 4: Average SDR and SIR improvements versus the number of bases. Conditions: source signals are from two female speakers in WSJ0 and there is no reverberation.

Figure 5: Average SDR and SIR improvements versus iteration number. Conditions: source signals are from two female speakers in WSJ0 and there is no reverberation.

IV-B Main results

Figure 2 plots the average performance in different reverberant environments. It is evident that cILRMA achieves larger improvements in SDR and SIR than the other algorithms, although the gap narrows as the reverberation time increases.

An essential parameter of the proposed source model is the number of blocks $O$. To explore its impact on performance, a set of experiments was conducted. Figure 3 depicts the SDR and SIR improvements across various values of $O$, with the source signals originating from two female speakers. The results indicate that performance improves with increasing $O$, suggesting that a larger $O$ leads to a more accurate source model.

Figure 4 illustrates the SDR and SIR improvements achieved by cILRMA and ILRMA across varying numbers of bases. The results show that, regardless of the number of bases, cILRMA consistently outperforms ILRMA by approximately $4$ dB.

The convergence behavior of cILRMA and ILRMA is shown in Fig. 5. It can be seen that cILRMA requires approximately 100 iterations to surpass ILRMA.

V Conclusions

The paper presented a clustered source model tailored for ILRMA-based MBASS. Leveraging the NBTD technique, this model defines blocks as outer products of vectors (clusters) and matrices for spectral structure modeling, thus providing interpretable latent vectors. By integrating orthogonality constraints, the model ensures independence among source images. Experimental results demonstrated the superiority of the proposed method over its traditional counterparts in anechoic environments.

References

  • [1] Y. Huang, J. Benesty, and J. Chen, Acoustic MIMO signal processing, Springer Science & Business Media, 2006.
  • [2] A. Belouchrani, K. Abed-Meraim, J.F. Cardoso, and E. Moulines, “A blind source separation technique using second-order statistics,” IEEE Trans. Signal Process., vol. 45, no. 2, pp. 434–444, Feb. 1997.
  • [3] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing, Springer, 2008.
  • [4] J. Benesty, I. Cohen, and J. Chen, Fundamentals of signal enhancement and array signal processing, John Wiley & Sons, Singapore Pte. Ltd., 2018.
  • [5] S. Lee, S. H. Park, and K.M. Sung, “Beamspace-domain multichannel nonnegative matrix factorization for audio source separation,” IEEE Signal Process. Lett., vol. 19, no. 1, pp. 43–46, Jan. 2011.
  • [6] P. Comon, “Independent component analysis, a new concept?,” Signal Process., vol. 36, no. 3, pp. 287–314, Apr. 1994.
  • [7] T. Kim, T. Eltoft, and T.-W. Lee, “Independent vector analysis: An extension of ICA to multivariate components,” in Proc. Int. Conf. Independent Compon. Anal. Blind Source Separation. Springer, Oct. 2006, pp. 165–172.
  • [8] Y. Sun, P. Babu, and D. P. Palomar, “Majorization-minimization algorithms in signal processing, communications, and machine learning,” IEEE Trans. Signal Process., vol. 65, no. 3, pp. 794–816, Feb. 2016.
  • [9] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. IEEE, Oct. 2011, pp. 189–192.
  • [10] D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proc. Adv. Neural Inf. Process. Syst. May. 2000, pp. 556–562, MIT Press.
  • [11] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 550–563, Mar. 2009.
  • [12] N. Duong, E. Vincent, and R. Gribonval, “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1830–1840, Sept. 2010.
  • [13] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, “Multichannel extensions of non-negative matrix factorization with complex-valued data,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 971–982, May. 2013.
  • [14] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE Trans. Audio, Speech, Lang. Process., vol. 24, no. 9, pp. 1626–1641, Sept. 2016.
  • [15] S. Mogami, D. Kitamura, Y. Mitsui, N. Takamune, H. Saruwatari, and N. Ono, “Independent low-rank matrix analysis based on complex student’s t-distribution for blind audio source separation,” in Proc. IEEE 27th Int. Workshop Mach. Learn. Signal Process. IEEE, Sept. 2017, pp. 1–6.
  • [16] D. Kitamura, S. Mogami, Y. Mitsui, N. Takamune, H. Saruwatari, N. Ono, Y. Takahashi, and K. Kondo, “Generalized independent low-rank matrix analysis using heavy-tailed distributions for blind source separation,” EURASIP J. Adv. Signal Process., vol. 2018, no. 1, pp. 1–25, May. 2018.
  • [17] R. Ikeshita and Y. Kawaguchi, “Independent low-rank matrix analysis based on multivariate complex exponential power distribution,” in Proc. IEEE ICASSP. IEEE, Apr. 2018, pp. 741–745.
  • [18] S. Mogami, N. Takamune, D. Kitamura, H. Saruwatari, Y. Takahashi, K. Kondo, H. Nakajima, and N. Ono, “Independent low-rank matrix analysis based on time-variant sub-gaussian source model,” in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf. IEEE, Nov. 2018, pp. 1684–1691.
  • [19] S. Mogami, N. Takamune, D. Kitamura, H. Saruwatari, Y. Takahashi, K. Kondo, and N. Ono, “Independent low-rank matrix analysis based on time-variant sub-gaussian source model for determined blind source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 28, pp. 503–518, Dec. 2019.
  • [20] Y. Mitsui, D. Kitamura, S. Takamichi, N. Ono, and H. Saruwatari, “Blind source separation based on independent low-rank matrix analysis with sparse regularization for time-series activity,” in Proc. IEEE ICASSP. IEEE, Mar. 2017, pp. 21–25.
  • [21] J. Wang, S. Guan, S. Liu, and X. Zhang, “Minimum-volume multichannel nonnegative matrix factorization for blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3089–3103, Oct. 2021.
  • [22] L. De Lathauwer, “Decompositions of a higher-order tensor in block terms—Part II: Definitions and uniqueness,” SIAM J. Matrix Anal. Appl., vol. 30, no. 3, pp. 1033–1066, 2008.
  • [23] S. Araki, F. Nesta, E. Vincent, Z. Koldovský, G. Nolte, A. Ziehe, and A. Benichoux, “The 2011 signal separation evaluation campaign (SiSEC2011): Audio source separation,” in Proc. LVA/ICA. Springer, 2012, pp. 414–422.
  • [24] J. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete LDC93S6A,” Linguistic Data Consortium, vol. 83, 1993.
  • [25] J. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, June. 1979.
  • [26] R. W. Young, “Sabine reverberation equation and sound power calculations,” J. Acoust. Soc. Am., vol. 31, no. 7, pp. 912–921, July 1959.
  • [27] J. Wang, S. Guan, and X.-L. Zhang, “Minimum-volume regularized ILRMA for blind audio source separation,” in Proc. APSIPA ASC, 2021, pp. 630–634.
  • [28] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, July. 2006.