
Distributed Node-Specific Block-Diagonal LCMV Beamforming in Wireless Acoustic Sensor Networks

Xinwei Guo, Minmin Yuan, Chengshi Zheng ([email protected]), Xiaodong Li. Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Research Institute of Highway, Ministry of Transport, Beijing 100088, China.
Abstract

This paper derives the analytical solution of a novel distributed node-specific block-diagonal linearly constrained minimum variance beamformer from the centralized linearly constrained minimum variance (LCMV) beamformer under the assumption that the noise covariance matrix is block-diagonal. To further reduce the computational complexity of the proposed beamformer, the Sherman-Morrison-Woodbury formula is introduced to compute the inverse of the noise sample covariance matrix. By doing so, the signals exchanged between nodes have lower dimensions, while the optimal LCMV beamformer is still obtained at each node as if each node had transmitted all of its raw sensor signal observations. The proposed beamformer is fully distributable, imposing no restrictions on the underlying network topology, and is completely scalable, i.e., there is no increase in the per-node computational complexity when new nodes are added to the network. Compared with state-of-the-art distributed node-specific algorithms, which are often time-recursive, the proposed beamformer solves the LCMV problem optimally frame by frame, has much lower computational complexity, and is more robust to acoustic transfer function estimation errors and voice activity detector errors. Extensive experimental results are presented to validate the effectiveness of the proposed beamformer.

keywords:
distributed beamforming, node-specific, speech enhancement, wireless acoustic sensor networks.
journal: Signal Processing

1 Introduction

Wireless acoustic sensor networks (WASNs) generally consist of several nodes, where each node has one or more sensors, a processing unit, and a wireless communication module that allows nodes to exchange data. Compared with a traditional single sensor array [1], WASNs can physically cover a wider area and have more opportunities to select a subset of nodes close to the target sources, so a higher signal-to-noise ratio (SNR) and direct-to-reverberant ratio (DRR) can be expected [2, 3]. As a next-generation technology for audio acquisition and processing, WASNs have many potential applications, such as binaural hearing aids [4, 5, 6], (hands-free) speech communication systems [7, 8, 9], and acoustic monitoring systems [10, 11, 12, 13].

In principle, all the sensor signal observations from the different nodes can be transmitted to a fusion center, where an optimal beamformer can then be computed; this approach is known as centralized estimation [14, 15, 16]. Centralized estimation requires a large communication bandwidth, a large transmission power at the individual nodes, and a nonnegligible computational load at the fusion center. However, both the power and the communication bandwidth in WASNs are often limited. Furthermore, in many WASN applications, a fusion center may be undesirable due to privacy considerations [17]. A trivial solution is to use only the local sensor signal observations at a single node, without any communication link to other nodes. However, this solution cannot exploit the information available across the entire WASN and is therefore only sub-optimal. A promising alternative is a suitable distributed approach, which often has three stages [18]. In the first stage, each node processes its own sensor signal observations to obtain some compressed signals. In the second stage, only these compressed signals are transmitted, reducing the communication bandwidth. In the last stage, the target signal is obtained by properly merging all the compressed signals.

Distributed speech enhancement algorithms can be roughly divided into two main categories: node-specific and non-node-specific. In node-specific estimation, each node in the WASN estimates a different target signal; that is, a target source for one node may be an interfering source for another node, and vice versa. The node-specific estimation problem arises naturally in a blind beamforming framework, where the acoustic transfer functions (ATFs) between the target sound sources and the sensors are generally unknown. For these blind beamformers, subspace estimation algorithms can be used to estimate the subspace spanned by the ATFs [19, 20], and the target signal can then be estimated as it is observed at a reference sensor. If each node in the WASN chooses one of its own local sensors as reference, the spatial information of the target source is preserved. Therefore, node-specific estimation algorithms are preferred in many practical applications [21].

Several non-node-specific speech enhancement algorithms have been presented in [17, 18], and [22], where different nodes share a common reference sensor. In [17], each node was assumed to have one sensor, and a distributed delay and sum (DDS) beamformer in a randomly connected network was proposed. The DDS with the randomized gossip algorithm [23] is an iterative algorithm for solving averaging consensus problems in a distributed way, where the outputs of all nodes are expected to converge to the same optimal average value. The DDS typically needs many iterations to converge to the optimal solution, as well as multiple (re)broadcasts of the intermediate variables, so it is more suitable for estimating fixed or slowly varying parameters [24]. In [18], a time-recursive distributed generalized sidelobe canceler (DGSC) was proposed for a fully connected network. The DGSC has two components, a constraints subspace and its corresponding null-space, and updates the filter coefficients during speech-absent segments. The DGSC needs to transmit $S$-dimensional raw signal observations, with $S$ the number of target sources, in addition to the compressed signals, to construct the constraints-subspace component. It therefore requires a larger communication bandwidth than distributed algorithms that transmit only the compressed signals when one aims to estimate the $S$ target source signals separately. In [22], the proposed block-diagonal LCMV (BD-LCMV) beamformer utilizes a set of linear equality constraints to reduce the full-element noise sample covariance matrix to a block-diagonal form, and the imposed block-diagonal structure of the estimated sample covariance matrix results in a naturally separable objective function. The distributed optimization problem can then be solved by the primal-dual method of multipliers (PDMM) [25]. The BD-LCMV requires many iterations to achieve high performance, so in practice a trade-off must be made between per-frame optimality and communication overhead.

Several node-specific estimation algorithms have been proposed in [4, 5, 14, 24, 26, 27], and [28], where two main criteria are used: the minimum mean square error (MMSE) and the minimum variance distortionless response (MVDR). The mean square error (MSE) between the output signal and the desired signal comprises two components, namely the desired signal distortion and the residual noise [18]. The MVDR beamformer, first proposed by Capon [29], minimizes the noise power at the output while maintaining a distortionless response towards the desired direction. Er and Cantoni [30] generalized the single distortionless response to a set of linear constraints, and denoted this beamformer as LCMV.

In [4], a distributed node-specific speech enhancement algorithm was proposed using the MMSE criterion in a 2-node network for binaural hearing aid applications, where node-specific estimation is required to preserve the auditory cues at the two ears. This method relies on the speech-distortion-weighted multichannel Wiener filter (SDW-MWF) and is referred to as the distributed MWF (DB-MWF). In [5], an iterative distributed MVDR (DB-MVDR) beamformer was introduced for a similar binaural hearing aid setting. Both methods assume a single target source to obtain convergence and optimality, and they are equivalent when the trade-off factor between noise reduction and distortion in the SDW-MWF is zero. A more general case was presented in [24], [26], and [27], where multiple target sources and $J~(J\geq 2)$ nodes are considered in a so-called distributed adaptive node-specific signal estimation (DANSE) scheme. The scheme treats each node in the WASN as a data sink that gathers the compressed signals from the other nodes and then estimates the optimal filter coefficients in an iterative fashion. In [26] and [27], the algorithms were proposed for a fully connected network and for a network with a tree topology (T-DANSE), respectively. In [24], the algorithm is topology-independent (TI-DANSE). The TI-DANSE algorithm has a slower convergence rate than [26] and [27] and requires a larger number of frames to obtain near-optimal performance [22]. In [14], a distributed LCMV beamformer referred to as LC-DANSE was proposed by combining the DANSE scheme with the LCMV beamformer. The DANSE algorithms in [24], [26], and [27] attempt to align the signal components from the same source in different microphone signals. However, this alignment is only possible when the filter length is at least twice the maximum time difference of arrival (TDOA) between all the sensors, which means that, in general, the noise reduction performance degrades with increasing TDOA for a fixed filter length [6]. In the LC-DANSE algorithm, the raw signal observations at one node and the compressed signals from the other nodes are concatenated into a new vector. This vector still has a high dimension, especially when the number of target sources $S$ is large, and computing the inverse of its covariance matrix is expensive. Besides, both the DANSE and the LC-DANSE algorithms are time-recursive and require multiple frames to reach optimality, which results in slow tracking performance [22]. In [28], each node was assumed to have more than one sensor. The recursive estimation of the inverse noise or noisy sample covariance matrix is structured as a consensus problem and is realized in a distributed manner via the randomized gossip algorithm for arbitrary topologies, similar to [17]. In each iteration, each node needs to transmit $M$-dimensional signals, with $M$ the total number of sensors in the WASN, to obtain the product of the $M\times M$ inverse sample covariance matrix and the $M$-dimensional sensor signal observations. The communication cost may be higher than that of the centralized algorithm, and the convergence error accumulates over time when this product is not accurately estimated.

In this paper, we propose a distributed node-specific block-diagonal linearly constrained minimum variance (DNBD-LCMV) beamformer. The DNBD-LCMV utilizes a set of linear equality constraints to reduce the full-element noise sample covariance matrix to a block-diagonal form, so that its analytical solution can be derived directly from the centralized LCMV beamformer. The inverse noise sample covariance matrix at each node is updated by the Sherman-Morrison-Woodbury formula [31] and is used to compute the exchanged signals. The proposed DNBD-LCMV significantly reduces the number of signals exchanged between nodes, yet obtains the optimal LCMV beamformer at each node as if each node had transmitted all of its raw sensor signal observations. The DNBD-LCMV is fully distributable for any network topology and is completely scalable, i.e., there is no increase in the per-node computational complexity when new nodes are added to the network. Compared with state-of-the-art distributed node-specific algorithms, the DNBD-LCMV solves the LCMV beamformer optimally in each frame, has much lower computational complexity, and is more robust to ATF estimation errors and voice activity detector (VAD) errors.

The remainder of this paper is organized as follows. In Section 2, the signal model is introduced, together with the centralized LCMV beamformer. In Section 3, the DNBD-LCMV and its extensions are presented, and its computational complexity and communication bandwidth are analyzed in Section 4. Experimental results are presented in Section 5 and conclusions are given in Section 6.

2 Preliminaries

2.1 Signal Model

We consider a WASN with $J$ sensor nodes, where the set of nodes is denoted as $\mathcal{J}=\{1,\cdots,j,\cdots,J\}$. Each node $j$ is equipped with $M_{j}$ microphones, so the total number of microphones is $M=\sum_{j=1}^{J}M_{j}$. The distributed speech enhancement problem is often formulated in the short-time Fourier transform (STFT) domain, and the stacked signal vector $\mathbf{y}(f,l)$ is given by

$$\mathbf{y}(f,l)=\left[\mathbf{y}^{T}_{1}(f,l),\cdots,\mathbf{y}^{T}_{j}(f,l),\cdots,\mathbf{y}^{T}_{J}(f,l)\right]^{T}, \qquad (1)$$

where $f$ denotes the frequency index and $l$ denotes the time-frame index. $\mathbf{y}_{j}(f,l)$ is an $M_{j}\times 1$ vector consisting of the locally received microphone signals at the $j$th node, and the superscript $T$ denotes the transpose operator. $\mathbf{y}(f,l)$ can be modeled as

$$\mathbf{y}(f,l)=\mathbf{H}(f,l)\mathbf{s}(f,l)+\mathbf{n}(f,l), \qquad (2)$$

where $\mathbf{s}(f,l)$ is a signal vector containing the $S$ speech sources, $\mathbf{n}(f,l)$ is a noise vector, and

$$\mathbf{H}(f,l)=\left[\mathbf{H}^{T}_{1}(f,l),\cdots,\mathbf{H}^{T}_{j}(f,l),\cdots,\mathbf{H}^{T}_{J}(f,l)\right]^{T} \qquad (3)$$

is a full-rank $M\times S$ ATF matrix. In particular, $\mathbf{H}_{j}(f,l)$ is the ATF matrix between the $S$ speech sources and the $M_{j}$ microphones at the $j$th node. In the following, $\mathbf{H}(f,l)$ is approximated as time-invariant within each frame, and all derivations refer to a single frequency bin. The frame index $l$ and the frequency index $f$ are omitted when no confusion arises.

2.2 Centralized LCMV Beamforming

For the centralized LCMV beamformer, node $j$ applies an $M$-dimensional estimator $\mathbf{w}_{j}$ to the $M$-dimensional microphone signals $\mathbf{y}$ to obtain the node-specific output $d_{j}=\mathbf{w}^{H}_{j}\mathbf{y}$, where the superscript $H$ denotes the conjugate transpose operator. $\mathbf{w}_{j}$ can be obtained from the following general optimization problem [32, 33]

$$\begin{gathered}\min_{\mathbf{w}_{j}}~~\mathbf{w}^{H}_{j}\mathbf{R}_{nn}\mathbf{w}_{j},\\ \mathrm{s.t.}~~\mathbf{w}^{H}_{j}\mathbf{H}=\mathbf{g}^{H}_{j},\end{gathered} \qquad (4)$$

where $\mathbf{R}_{nn}=E\{\mathbf{n}\mathbf{n}^{H}\}$ is the noise covariance matrix and $E\{\cdot\}$ denotes the expectation operator. $\mathbf{g}_{j}$ is an $S\times 1$ desired response vector for the $S$ speech sources; its entries usually consist of ones and zeros so as to preserve the target sources and simultaneously eliminate the interfering sources. The solution of (4) is given by

$$\mathbf{w}_{j}=\mathbf{R}^{-1}_{nn}\mathbf{H}\left(\mathbf{H}^{H}\mathbf{R}^{-1}_{nn}\mathbf{H}\right)^{-1}\mathbf{g}_{j}. \qquad (5)$$

The node-specific output can be expressed as

$$\begin{split}d_{j}&=\mathbf{w}^{H}_{j}\mathbf{y}\\ &=\mathbf{g}^{H}_{j}\left(\mathbf{H}^{H}\mathbf{R}^{-1}_{nn}\mathbf{H}\right)^{-1}\mathbf{H}^{H}\mathbf{R}^{-1}_{nn}\left(\mathbf{H}\mathbf{s}+\mathbf{n}\right)\\ &=\sum_{k=1}^{S}g^{*}_{j}(k)s(k)+\mathbf{w}^{H}_{j}\mathbf{n},\end{split} \qquad (6)$$

where $g_{j}(k)$ and $s(k)$ are the $k$th entries of $\mathbf{g}_{j}$ and $\mathbf{s}$, respectively, and the superscript $*$ denotes the complex conjugate.

Equations (5) and (6) require each node to have access to all microphone signals $\mathbf{y}$ to estimate $\mathbf{R}_{nn}$ and then obtain $d_{j}$. Therefore, all the locally received signals $\mathbf{y}_{j}$ at the $j$th node need to be transmitted, which results in a large communication bandwidth and a large transmission power in the WASN. Besides, the computational cost of inverting $\mathbf{R}_{nn}$ grows rapidly with $M$.
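For concreteness, the computation in (5) and (6) can be sketched per frequency bin as follows. This is our own illustrative NumPy sketch, not the authors' implementation; all function and variable names are ours.

```python
# Illustrative NumPy sketch of the centralized LCMV beamformer in (5)-(6)
# for one frequency bin; names are ours, not from the paper.
import numpy as np

def centralized_lcmv(R_nn, H, g_j):
    """w_j = R_nn^{-1} H (H^H R_nn^{-1} H)^{-1} g_j, Eq. (5).

    R_nn : (M, M) complex noise covariance matrix
    H    : (M, S) complex ATF matrix
    g_j  : (S,)   desired response vector of node j
    """
    Rinv_H = np.linalg.solve(R_nn, H)        # R_nn^{-1} H without an explicit inverse
    A = H.conj().T @ Rinv_H                  # H^H R_nn^{-1} H, an S x S matrix
    return Rinv_H @ np.linalg.solve(A, g_j)  # Eq. (5)

# Node-specific output d_j = w_j^H y, Eq. (6):
# d_j = centralized_lcmv(R_nn, H, g_j).conj() @ y
```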

3 Method

In the previous section, it is assumed that each node transmits all its microphone signals to every other node in the WASN such that each node can compute (5) and (6). We now consider, instead, the case where each node only transmits a linearly compressed version of its microphone signals by means of the distributed node-specific block-diagonal LCMV (DNBD-LCMV) beamformer. For ease of exposition, we first assume that the ATF matrix $\mathbf{H}$ is known, and describe the DNBD-LCMV for a fully connected network, where each node is able to directly communicate with every other node in the WASN. Then, all derivations are extended to a blind beamforming framework and to arbitrary topologies, similar to [22].

3.1 DNBD-LCMV with Known ATF Matrix

If the reverberation time of a room is moderate or large and/or the noise sources are far away from the microphones, i.e., the distance between the noise and the microphones is larger than the reverberation radius, the reverberant sound field is diffuse, homogeneous, and isotropic [22, 34]. In this case, the normalized correlation ${\Omega}_{m,m_{1}}(f)$ between two microphones $m$ and $m_{1}$ with distance ${\lambda}_{m,m_{1}}$ at frequency $f$ is given by

$${\Omega}_{m,m_{1}}(f)=\frac{\sin\left(2\pi f{\lambda}_{m,m_{1}}/c\right)}{2\pi f{\lambda}_{m,m_{1}}/c}, \qquad (7)$$

where $c=343~\mathrm{m/s}$ is the speed of sound. The correlation can be roughly divided into two frequency regions: one highly correlated at low frequencies and one much less correlated at high frequencies. The boundary between the two regions occurs at the first zero-crossing frequency $f_{c}=c/(2{\lambda}_{m,m_{1}})$. The larger the distance ${\lambda}_{m,m_{1}}$, the smaller $f_{c}$; for example, $f_{c}$ equals 171.5 Hz for ${\lambda}_{m,m_{1}}=1$ m.
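As a quick numerical check of (7), the sketch below (our own, using NumPy's normalized sinc) evaluates the diffuse-field coherence and the zero-crossing frequency $f_c$.

```python
# Sketch evaluating the diffuse-field coherence (7); np.sinc(x) = sin(pi x)/(pi x),
# so Omega_{m,m1}(f) = np.sinc(2 f lambda / c).
import numpy as np

c = 343.0                            # speed of sound, m/s
lam = 1.0                            # microphone distance lambda_{m,m1}, m
f = np.linspace(1.0, 4000.0, 4000)

omega = np.sinc(2.0 * f * lam / c)   # normalized correlation of Eq. (7)
f_c = c / (2.0 * lam)                # first zero crossing: 171.5 Hz for lam = 1 m
print(f_c)
```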

In a WASN, the microphones within a node are usually close together, whereas the microphones of different nodes are further apart. The noise can thus be assumed to be uncorrelated across the different nodes [22], and $\mathbf{R}_{nn}(l)$ approximately has the following block-diagonal form

$$\begin{split}\mathbf{R}_{nn}(l)&=\mathrm{Blockdiag}\left(\mathbf{\Delta}_{nn,1}(l),\cdots,\mathbf{\Delta}_{nn,j}(l),\cdots,\mathbf{\Delta}_{nn,J}(l)\right)\\ &=\begin{bmatrix}\mathbf{\Delta}_{nn,1}(l)&\cdots&\mathbf{0}_{M_{1}\times M_{j}}&\cdots&\mathbf{0}_{M_{1}\times M_{J}}\\ \vdots&\ddots&\vdots&&\vdots\\ \mathbf{0}_{M_{j}\times M_{1}}&\cdots&\mathbf{\Delta}_{nn,j}(l)&\cdots&\mathbf{0}_{M_{j}\times M_{J}}\\ \vdots&&\vdots&\ddots&\vdots\\ \mathbf{0}_{M_{J}\times M_{1}}&\cdots&\mathbf{0}_{M_{J}\times M_{j}}&\cdots&\mathbf{\Delta}_{nn,J}(l)\end{bmatrix},\end{split} \qquad (8)$$

where $\mathbf{\Delta}_{nn,j}(l)$ is the noise covariance matrix at the $j$th node, and $\mathbf{0}_{M_{j_{1}}\times M_{j_{2}}}$ is an $M_{j_{1}}\times M_{j_{2}}$ null matrix.

Note that the noise covariance matrix used in existing node-specific algorithms, such as [14, 24, 26, 27], and [28], is full-element. In the non-node-specific BD-LCMV algorithm [22], the block-diagonal $\mathbf{R}_{nn}(l)$ is adopted and the weight vector associated with the $j$th node is given by

$$\mathbf{w}_{j}(l)=\mathbf{\Delta}^{-1}_{nn,j}(l)\mathbf{H}_{j}\bm{\mu}(l), \qquad (9)$$

where $\bm{\mu}(l)$ is a Lagrange multiplier shared by all nodes. The dual optimization problem is then introduced to compute the optimal $\bm{\mu}(l)$ and is solved by PDMM. Finally, the signal $e_{j}(l)=\mathbf{w}^{H}_{j}(l)\mathbf{y}_{j}(l)$ is exchanged between nodes and the output $d_{j}(l)$ is obtained by summing all $e_{j}(l)$.

In this paper, the proposed beamformer follows a completely different scheme. From (5) and (8), the weight vector corresponding to the $l$th-frame microphone signals can be expressed as

$$\begin{split}\mathbf{w}_{j}(l)&=\mathbf{R}^{-1}_{nn}(l)\mathbf{H}\left(\mathbf{H}^{H}\mathbf{R}^{-1}_{nn}(l)\mathbf{H}\right)^{-1}\mathbf{g}_{j}\\ &=\begin{bmatrix}\bm{\Delta}^{-1}_{nn,1}(l)\mathbf{H}_{1}\\ \vdots\\ \bm{\Delta}^{-1}_{nn,j}(l)\mathbf{H}_{j}\\ \vdots\\ \bm{\Delta}^{-1}_{nn,J}(l)\mathbf{H}_{J}\end{bmatrix}\left(\sum_{j_{1}=1}^{J}\mathbf{H}^{H}_{j_{1}}\bm{\Delta}^{-1}_{nn,j_{1}}(l)\mathbf{H}_{j_{1}}\right)^{-1}\mathbf{g}_{j}.\end{split} \qquad (10)$$

The $S$-dimensional compressed signals $\mathbf{z}_{j}(l)$ and the $S\times S$ matrix $\mathbf{D}_{j}(l)$ related to the $j$th node are defined as

$$\mathbf{z}_{j}(l)=\mathbf{H}^{H}_{j}\mathbf{\Delta}^{-1}_{nn,j}(l)\mathbf{y}_{j}(l),\quad \mathbf{D}_{j}(l)=\mathbf{H}^{H}_{j}\mathbf{\Delta}^{-1}_{nn,j}(l)\mathbf{H}_{j}. \qquad (11)$$

From (6) and (10), the node-specific output can be rewritten as

$$d_{j}(l)=\mathbf{w}^{H}_{j}(l)\mathbf{y}(l)=\mathbf{g}^{H}_{j}\tilde{\mathbf{z}}(l), \qquad (12)$$

where $\tilde{\mathbf{z}}(l)$ is the product of $\mathbf{D}^{-1}(l)$ and $\mathbf{z}(l)$,

$$\tilde{\mathbf{z}}(l)=\mathbf{D}^{-1}(l)\mathbf{z}(l). \qquad (13)$$

In particular, $\mathbf{D}(l)$ and $\mathbf{z}(l)$ are obtained by summing all $\mathbf{D}_{j}(l)$ and all $\mathbf{z}_{j}(l)$, respectively,

$$\mathbf{D}(l)=\sum_{j=1}^{J}\mathbf{D}_{j}(l),\quad \mathbf{z}(l)=\sum_{j=1}^{J}\mathbf{z}_{j}(l). \qquad (14)$$
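The per-frame computation in (11)-(14) can be sketched as follows for one frequency bin. This is our own illustrative NumPy sketch, with the network exchange abstracted into plain Python lists.

```python
# Sketch of the per-frame DNBD-LCMV quantities in (11)-(14); names are ours.
import numpy as np

def node_compress(H_j, Delta_j, y_j):
    """Compressed signal z_j and matrix D_j of node j, Eq. (11)."""
    z_j = H_j.conj().T @ np.linalg.solve(Delta_j, y_j)   # S-dimensional
    D_j = H_j.conj().T @ np.linalg.solve(Delta_j, H_j)   # S x S
    return z_j, D_j

def node_output(z_list, D_list, g_j):
    """Node-specific output d_j = g_j^H D^{-1} z, Eqs. (12)-(14)."""
    z = sum(z_list)                  # z(l), Eq. (14)
    D = sum(D_list)                  # D(l), Eq. (14)
    z_tilde = np.linalg.solve(D, z)  # Eq. (13)
    return g_j.conj() @ z_tilde      # Eq. (12)
```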

The above derivations assume that the noise covariance matrix $\mathbf{\Delta}_{nn,j}(l)$ at the $j$th node is perfectly known. In practice, $\mathbf{\Delta}_{nn,j}(l)$ is unknown and needs to be estimated from the noisy observations. This usually requires a hard or soft VAD to determine whether the speakers are present. Note that the design of a VAD mechanism is a research topic in its own right and is out of the scope of this paper [18, 24, 35]. In the following, two methods are considered to estimate $\mathbf{\Delta}_{nn,j}(l)$: non-recursive (moving average) smoothing and first-order recursive smoothing.

3.1.1 Non-Recursive Smoothing Method

For the non-recursive smoothing method, the set of microphone signal frames and the set of noise-only frames within the current block of microphone signals are denoted by $\mathcal{L}_{y}$ and $\mathcal{L}_{n}~(\mathcal{L}_{n}\subseteq\mathcal{L}_{y})$, respectively. The block contains $\lvert\mathcal{L}_{y}\rvert$ frames of microphone signals, where $\lvert\cdot\rvert$ denotes the cardinality of a set. The noise covariance matrix at the $j$th node can be estimated from the set of noise-only frames,

$$\mathbf{\Delta}_{nn,j}(l)\approx\widehat{\mathbf{\Delta}}_{nn,j}(l)=\frac{1}{\lvert\mathcal{L}_{n}\rvert}\sum_{l_{n}\in\mathcal{L}_{n}}\mathbf{y}_{j}(l_{n})\mathbf{y}^{H}_{j}(l_{n}). \qquad (15)$$

Rigorously, all estimated quantities should carry a hat, as in (15), to distinguish them from their true values. For brevity, we omit the hat in the following when no confusion arises. For the current block of microphone signals, $\mathbf{\Delta}_{nn,j}(l)$ only needs to be estimated once, and from (11), $\mathbf{D}_{j}(l)$ also only needs to be computed once.

We assume that each block contains $B$ frames of microphone signals, i.e., $\lvert\mathcal{L}_{y}\rvert=B$. The scheme of the DNBD-LCMV with the non-recursive smoothing method is shown in Fig. 1 and consists of the following steps (a per-block code sketch is given after the discussion of Fig. 1):

1) Initialize: $i\leftarrow 0$.

2) Each node $j\in\mathcal{J}$ performs the following operation cycle:

$\bullet$ Collect the current block of microphone signals $\mathbf{y}_{j}(l),~l\in\mathcal{L}_{y}=\{iB,iB+1,\cdots,(i+1)B-1\}$.

$\bullet$ For the current block of microphone signals, estimate $\mathbf{\Delta}_{nn,j}(l)$ with (15).

$\bullet$ Apply $\mathbf{H}^{H}_{j}\mathbf{\Delta}^{-1}_{nn,j}(l)$ to the $M_{j}$-dimensional microphone signals $\mathbf{y}_{j}(l)$ to obtain the $S$-dimensional compressed signals $\mathbf{z}_{j}(l)$ with (11); $\mathbf{D}_{j}(l)$ is obtained at the same time.

$\bullet$ Transmit $\mathbf{z}_{j}(l)$ and $\mathbf{D}_{j}(l)$. In particular, $\mathbf{D}_{j}(l)$ only needs to be transmitted once per block of microphone signals.

$\bullet$ Each node then has access to all $\mathbf{z}_{j}(l)$ and $\mathbf{D}_{j}(l)$, and computes $\mathbf{z}(l)$ and $\mathbf{D}(l)$ with (14). The inverse matrix $\mathbf{D}^{-1}(l)$ is multiplied with $\mathbf{z}(l)$ to obtain $\tilde{\mathbf{z}}(l)$. Finally, $\mathbf{g}_{j}$ is applied to $\tilde{\mathbf{z}}(l)$ to obtain $d_{j}(l)$.

3) $i\leftarrow i+1$.

4) Return to step 2).

Figure 1: The scheme of the DNBD-LCMV with the non-recursive smoothing method for the current block of microphone signals ($J=2$). Each node $j$ computes its node-specific output $d_{j}(l)$ using its own $S$-dimensional compressed signals $\mathbf{z}_{j}(l)$ and the $S$-dimensional compressed signals transmitted by the other node ($\mathbf{D}_{j}(l)$ only needs to be transmitted once for the current block of microphone signals).

As shown in Fig. 1, for each block of microphone signals, $\mathbf{D}_{j}(l)$ only needs to be transmitted once. Since $\mathbf{D}_{j}(l)$ is a Hermitian matrix whose main diagonal entries are real, its transmission amounts to a total of $S^{2}$ real numbers per frequency bin per block. This transmission cost is small and can be neglected.
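The per-block cycle above can be sketched as follows at a single node and frequency bin. This is our own illustrative NumPy sketch, in which the actual transmission is replaced by a caller-supplied `exchange` function assumed to return the network-wide sums of (14).

```python
# Sketch of one block of the non-recursive DNBD-LCMV cycle (Section 3.1.1).
# 'exchange' stands in for the transmission step and is assumed to return
# the sums z(l) and D(l) over all nodes, Eq. (14).
import numpy as np

def process_block(Y_j, noise_frames, H_j, g_j, exchange):
    """Y_j: (M_j, B) STFT frames of the current block at node j;
    noise_frames: indices of the noise-only frames in the block."""
    N = Y_j[:, noise_frames]
    Delta_j = (N @ N.conj().T) / N.shape[1]              # Eq. (15), once per block
    D_j = H_j.conj().T @ np.linalg.solve(Delta_j, H_j)   # transmitted once per block

    d_j = np.empty(Y_j.shape[1], dtype=complex)
    for l in range(Y_j.shape[1]):
        z_j = H_j.conj().T @ np.linalg.solve(Delta_j, Y_j[:, l])  # Eq. (11)
        z_all, D_all = exchange(z_j, D_j)                         # Eq. (14)
        d_j[l] = g_j.conj() @ np.linalg.solve(D_all, z_all)       # Eqs. (12)-(13)
    return d_j
```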

3.1.2 First-Order Recursive Smoothing Method

For the first-order recursive smoothing method, the noise sample covariance matrix $\mathbf{\Delta}_{nn,j}(l)$ at the $j$th node is updated by [36]

$$\begin{split}\mathbf{\Delta}_{nn,j}(l)&=\left(1-\mathcal{P}_{j}(l)\right)\big(\alpha\mathbf{\Delta}_{nn,j}(l-1)+(1-\alpha)\mathbf{y}_{j}(l)\mathbf{y}^{H}_{j}(l)\big)\\ &\quad+\mathcal{P}_{j}(l)\mathbf{\Delta}_{nn,j}(l-1),\end{split} \qquad (16)$$

where $\mathcal{P}_{j}(l)=1$ when a speech component is detected in the $l$th-frame microphone signals $\mathbf{y}_{j}(l)$, and $\mathcal{P}_{j}(l)=0$ otherwise. When the binary VAD decision is replaced by a soft speech presence probability (SPP) [37], $\mathcal{P}_{j}(l)$ can vary continuously from 0 to 1; this case is not considered further here. $\alpha$ is a forgetting factor between 0 and 1.

When $\mathcal{P}_{j}(l)=1$, $\mathbf{\Delta}_{nn,j}(l)$ is not updated, i.e., $\mathbf{\Delta}_{nn,j}(l)=\mathbf{\Delta}_{nn,j}(l-1)$, and, similar to Section 3.1.1, only the transmission of $\mathbf{z}_{j}(l)$ is required. In contrast, for $\mathcal{P}_{j}(l)=0$, i.e., $\mathbf{y}_{j}(l)=\mathbf{n}_{j}(l)$, the current noise-only frame is included to update $\mathbf{\Delta}_{nn,j}(l)$ according to

$$\mathbf{\Delta}_{nn,j}(l)=\alpha\mathbf{\Delta}_{nn,j}(l-1)+(1-\alpha)\mathbf{y}_{j}(l)\mathbf{y}^{H}_{j}(l). \qquad (17)$$

A naive solution is to transmit $\mathbf{D}_{j}(l)$ and $\mathbf{z}_{j}(l)$ for each noise-only frame, which requires a total of $(S^{2}+2S)$ transmitted real numbers as well as the inversion of $\mathbf{\Delta}_{nn,j}(l)$. The resulting communication cost and computational load are substantial and need to be reduced.

With the help of the Sherman-Morrison-Woodbury formula [31], the inverse of $\mathbf{\Delta}_{nn,j}(l)$ in (17) can be expressed as

$$\mathbf{\Delta}^{-1}_{nn,j}(l)=\frac{1}{\alpha}\Bigg(\mathbf{\Delta}^{-1}_{nn,j}(l-1)-\frac{\mathbf{\Delta}^{-1}_{nn,j}(l-1)\mathbf{y}_{j}(l)\mathbf{y}^{H}_{j}(l)\mathbf{\Delta}^{-1}_{nn,j}(l-1)}{\alpha/(1-\alpha)+\mathbf{y}^{H}_{j}(l)\mathbf{\Delta}^{-1}_{nn,j}(l-1)\mathbf{y}_{j}(l)}\Bigg). \qquad (18)$$

We further define $\bar{\mathbf{c}}_{j}(l)$, $c_{j}(l)$, and $\tilde{\mathbf{c}}_{j}(l)$ as follows:

$$\begin{gathered}\bar{\mathbf{c}}_{j}(l)=\mathbf{H}^{H}_{j}\mathbf{\Delta}^{-1}_{nn,j}(l-1)\mathbf{y}_{j}(l),\\ c_{j}(l)=\mathbf{y}^{H}_{j}(l)\mathbf{\Delta}^{-1}_{nn,j}(l-1)\mathbf{y}_{j}(l),\\ \tilde{\mathbf{c}}_{j}(l)=\mathbf{\Delta}^{-1}_{nn,j}(l-1)\mathbf{y}_{j}(l).\end{gathered} \qquad (19)$$

By substituting (18) and (19) into (11), we get

$$\begin{split}\mathbf{D}_{j}(l)&=\mathbf{H}^{H}_{j}\mathbf{\Delta}^{-1}_{nn,j}(l)\mathbf{H}_{j}\\ &=\frac{1}{\alpha}\left(\mathbf{D}_{j}(l-1)-\frac{\bar{\mathbf{c}}_{j}(l)\bar{\mathbf{c}}^{H}_{j}(l)}{\alpha/(1-\alpha)+c_{j}(l)}\right),\end{split} \qquad (20)$$

and

$$\begin{split}\mathbf{z}_{j}(l)&=\mathbf{H}^{H}_{j}\mathbf{\Delta}^{-1}_{nn,j}(l)\mathbf{y}_{j}(l)\\ &=\frac{\bar{\mathbf{c}}_{j}(l)}{\alpha}\left(1-\frac{c_{j}(l)}{\alpha/(1-\alpha)+c_{j}(l)}\right).\end{split} \qquad (21)$$

By substituting (19) into (18), (18) can be further written as

$$\mathbf{\Delta}^{-1}_{nn,j}(l)=\frac{1}{\alpha}\left(\mathbf{\Delta}^{-1}_{nn,j}(l-1)-\frac{\tilde{\mathbf{c}}_{j}(l)\tilde{\mathbf{c}}^{H}_{j}(l)}{\alpha/(1-\alpha)+c_{j}(l)}\right). \qquad (22)$$

For a noise-only frame, i.e., $\mathcal{P}_{j}(l)=0$, from (20), (21), and (22), the scheme of the DNBD-LCMV with the first-order recursive smoothing method is shown in Fig. 2 and consists of the following steps:

1) Each node $j\in\mathcal{J}$ performs the following operation cycle:

$\bullet$ Apply the inverse matrix $\mathbf{\Delta}^{-1}_{nn,j}(l-1)$ to the $M_{j}$-dimensional microphone signals $\mathbf{y}_{j}(l)$ to obtain the $M_{j}$-dimensional vector $\tilde{\mathbf{c}}_{j}(l)$ and the real number $c_{j}(l)$ with (19); the $S$-dimensional vector $\bar{\mathbf{c}}_{j}(l)$ is obtained at the same time using $\mathbf{H}_{j}$. Then, $\mathbf{\Delta}^{-1}_{nn,j}(l)$ is updated with (22).

$\bullet$ Transmit $\bar{\mathbf{c}}_{j}(l)$ and $c_{j}(l)$.

$\bullet$ Each node then has access to all $\bar{\mathbf{c}}_{j}(l)$ and $c_{j}(l)$, and reconstructs $\mathbf{D}_{j}(l)$ and $\mathbf{z}_{j}(l)$ with (20) and (21), respectively. Then, $\mathbf{D}(l)$ and $\mathbf{z}(l)$ are computed with (14). The inverse matrix $\mathbf{D}^{-1}(l)$ is multiplied with $\mathbf{z}(l)$ to obtain $\tilde{\mathbf{z}}(l)$. Finally, $\mathbf{g}_{j}$ is applied to $\tilde{\mathbf{z}}(l)$ to obtain $d_{j}(l)$.

2) $l\leftarrow l+1$.

3) Return to step 1).

As shown in Fig. 2, for each noise-only frame, the $S$-dimensional vector $\bar{\mathbf{c}}_{j}(l)$ and the real number $c_{j}(l)$ are transmitted to reconstruct the $S\times S$ matrix $\mathbf{D}_{j}(l)$ and the $S$-dimensional vector $\mathbf{z}_{j}(l)$ at the other nodes, instead of transmitting $\mathbf{D}_{j}(l)$ and $\mathbf{z}_{j}(l)$ themselves. Moreover, the explicit inversion of $\mathbf{\Delta}_{nn,j}(l)$ is only required at the beginning. The communication cost and computational load are thus greatly reduced compared to the naive solution mentioned above.
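The rank-1 updates (18)-(22) for a noise-only frame can be sketched as follows. This is an illustrative NumPy sketch with our own naming, showing that only $\bar{\mathbf{c}}_{j}(l)$ and $c_{j}(l)$ ever need to leave the node.

```python
# Sketch of the per-frame rank-1 updates (18)-(22) at node j for a noise-only
# frame (P_j(l) = 0); for a speech frame the state is simply left unchanged.
import numpy as np

def smw_update(Delta_inv, D_j, H_j, y_j, alpha):
    c_til = Delta_inv @ y_j             # c~_j(l), Eq. (19)
    c_bar = H_j.conj().T @ c_til        # c_bar_j(l), Eq. (19)
    c = np.real(y_j.conj() @ c_til)     # c_j(l), Eq. (19), a real scalar
    denom = alpha / (1.0 - alpha) + c

    Delta_inv = (Delta_inv - np.outer(c_til, c_til.conj()) / denom) / alpha  # Eq. (22)
    D_j = (D_j - np.outer(c_bar, c_bar.conj()) / denom) / alpha              # Eq. (20)
    z_j = c_bar / alpha * (1.0 - c / denom)                                  # Eq. (21)
    # Only c_bar (S complex numbers) and c (one real number) are transmitted;
    # receiving nodes reconstruct D_j and z_j with the same two lines above.
    return Delta_inv, D_j, z_j, c_bar, c
```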

Figure 2: The scheme of the DNBD-LCMV with the first-order recursive smoothing method ($J=2$, $\mathcal{P}_{j}(l)=0$). For each node $j$, the $S$-dimensional vector $\bar{\mathbf{c}}_{j}(l)$ and the real number $c_{j}(l)$ are transmitted to reconstruct $\mathbf{D}_{j}(l)$ and $\mathbf{z}_{j}(l)$ with (20) and (21), respectively, at the other node. The inverse matrix $\mathbf{\Delta}^{-1}_{nn,j}(l)$ is updated with (22). Each node $j$ computes its node-specific output $d_{j}(l)$ using its own $S$-dimensional compressed signal $\mathbf{z}_{j}(l)$ and the reconstructed signal $\mathbf{z}_{q}(l),~q\in\mathcal{J}\setminus\{j\}$, together with $\mathbf{D}_{j}(l)$ and the reconstructed matrix $\mathbf{D}_{q}(l)$.

3.2 DNBD-LCMV with Unknown ATF Matrix

In general, the ATF matrix $\mathbf{H}$ is unknown and needs to be estimated online. Subspace estimation algorithms can be used to estimate the column space of $\mathbf{H}$ [19, 20]. Define $\mathbf{Q}=[\mathbf{q}_{1},\cdots,\mathbf{q}_{S}]$ to be any basis that spans the column space of $\mathbf{H}$,

$$\mathbf{H}=\mathbf{Q}\mathbf{\Theta}, \qquad (23)$$

where $\mathbf{\Theta}$ is an $S\times S$ matrix comprising the projection coefficients of the original ATFs onto the basis vectors.

When the speakers change their positions only slowly with respect to their initial positions, as in teleconferencing, we can estimate the column space $\mathbf{Q}$ at the initial stage (e.g., in a centralized way) [22]. Slight movements of the speakers may then introduce some estimation error in $\mathbf{Q}$, so a robust beamformer is preferred.

The goal for node $j$ is to estimate the target signal from the signal vector $\mathbf{s}$ as observed by one of node $j$'s microphones, referred to as the reference microphone. Without loss of generality, the first microphone of each node is chosen as the reference microphone; for node $j$, the index of its reference microphone is $m=1+\sum_{q=1}^{j-1}M_{q}$. Accordingly, the weight vector in (5) can be rewritten as

$$\begin{gathered}\bar{\mathbf{w}}_{j}=\mathbf{R}^{-1}_{nn}\mathbf{Q}\left(\mathbf{Q}^{H}\mathbf{R}^{-1}_{nn}\mathbf{Q}\right)^{-1}\bar{\mathbf{g}}_{j},\\ \bar{g}_{j}(k)=g_{j}(k)Q^{*}(m,k),\end{gathered} \qquad (24)$$

where $Q(m,k)$ is the entry in the $m$th row and $k$th column of $\mathbf{Q}$, and $\bar{g}_{j}(k)$ and $g_{j}(k)$ are the $k$th entries of $\bar{\mathbf{g}}_{j}$ and $\mathbf{g}_{j}$, respectively. The node-specific output in (6) is modified as follows [14]

$$\begin{split}\bar{d}_{j}&=\bar{\mathbf{w}}^{H}_{j}\mathbf{y}\\ &=\bar{\mathbf{g}}^{H}_{j}\left(\mathbf{Q}^{H}\mathbf{R}^{-1}_{nn}\mathbf{Q}\right)^{-1}\mathbf{Q}^{H}\mathbf{R}^{-1}_{nn}\left(\mathbf{Q}\mathbf{\Theta}\mathbf{s}+\mathbf{n}\right)\\ &=\sum_{k=1}^{S}g^{*}_{j}(k)H(m,k)s(k)+\bar{\mathbf{w}}^{H}_{j}\mathbf{n},\end{split} \qquad (25)$$

where $H(m,k)$ is the entry in the $m$th row and $k$th column of $\mathbf{H}$.

Clearly, different nodes have different desired response vectors due to their different reference microphones, i.e., $\bar{\mathbf{g}}_{j}\neq\bar{\mathbf{g}}_{q}$ for $j\neq q$. Therefore, each node extracts a node-specific target signal based on its own reference microphone, which preserves the spatial information of the target source, such as time difference cues that are very important for target source localization.
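The modification in (24) only rescales the desired response by the reference-microphone row of $\mathbf{Q}$; a one-line sketch (our own) makes this explicit.

```python
# Sketch of the node-specific response vector of Eq. (24):
# g_bar_j(k) = g_j(k) * conj(Q[m, k]), with m the reference-mic row of node j.
import numpy as np

def node_specific_response(Q, g_j, m):
    return g_j * Q[m, :].conj()

# The beamformer itself is unchanged: the sketches above apply with H (or H_j)
# replaced by Q (or its rows at node j) and g_j replaced by g_bar_j.
```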

3.3 Extension of DNBD-LCMV

When the noise covariance matrix $\mathbf{\Delta}_{nn,j}$ in (8) is replaced by the noisy covariance matrix $\mathbf{\Delta}_{yy,j}$, that is,

$$\mathbf{R}_{yy}=\mathrm{Blockdiag}\left(\mathbf{\Delta}_{yy,1},\cdots,\mathbf{\Delta}_{yy,j},\cdots,\mathbf{\Delta}_{yy,J}\right), \qquad (26)$$

one can obtain the distributed node-specific block-diagonal linearly constrained minimum power (DNBD-LCMP) beamformer, which is given by

$$\bar{\mathbf{w}}^{\prime}_{j}=\mathbf{R}^{-1}_{yy}\mathbf{Q}\left(\mathbf{Q}^{H}\mathbf{R}^{-1}_{yy}\mathbf{Q}\right)^{-1}\bar{\mathbf{g}}_{j}. \qquad (27)$$

This beamformer does not require an estimate of the noise covariance matrix.

In particular, when $\mathbf{\Delta}_{nn,j}$ is an identity matrix $\mathbf{I}_{j}$, the DNBD-LCMV reduces to the distributed node-specific delay and sum (DNDS) beamformer

$$\bar{\mathbf{w}}^{\prime\prime}_{j}=\mathbf{Q}\left(\mathbf{Q}^{H}\mathbf{Q}\right)^{-1}\bar{\mathbf{g}}_{j}. \qquad (28)$$

This beamformer only needs to be calculated once at the beginning. However, it cannot suppress sound sources that are not included in $\bar{\mathbf{g}}_{j}$.

Similar to [22], the proposed beamformers, including DNBD-LCMV/DNBD-LCMP and DNDS, can be implemented in WASNs with arbitrary topologies. This is achieved by a slight modification of the data transmission process of each node, where each node aggregates its transmitted data, including $\mathbf{z}_{j}(l)$ and $\mathbf{D}_{j}(l)$, with the data received from its neighbors. This allows the transmitted data to disperse through the WASN by means of an in-network summation, as demonstrated for the tree topology shown in Fig. 3.

Figure 3: The tree topology with two sub-topologies: chain and star.

We denote by $\mathcal{N}_{j}$ the set of neighbors of node $j$ in the topology, with node $j$ itself excluded. After the tree is formed with any tree formation algorithm [38, 39], one arbitrary node is assigned as the root node and the nodes only communicate with their neighbors (for example, $\mathcal{N}_{6}=\{5,7,8,9\}$ and node 1 is the root node in Fig. 3). The following data-driven signal flow is executed for each block of microphone signals (the non-recursive smoothing method is considered below):

1) Any leaf node, i.e., a non-root node $q$ having only a single neighbor, can immediately fire and transmit $\mathbf{z}_{q}(l)$ and $\mathbf{D}_{q}(l)$ to its single neighbor (toward the root node). Any non-root node $j$ with more than a single neighbor waits until it has received the transmitted data from all its neighbors except for the single neighbor that has yet to fire, say node $j^{\prime}$, and then computes the sums

$$\check{\mathbf{z}}_{j}(l)=\mathbf{z}_{j}(l)+\sum_{q\in\mathcal{N}_{j}\setminus j^{\prime}}\mathbf{z}_{q}(l),\quad \check{\mathbf{D}}_{j}(l)=\mathbf{D}_{j}(l)+\sum_{q\in\mathcal{N}_{j}\setminus j^{\prime}}\mathbf{D}_{q}(l). \qquad (29)$$

Next, $\check{\mathbf{z}}_{j}(l)$ and $\check{\mathbf{D}}_{j}(l)$ are transmitted to node $j^{\prime}$ (toward the root node). This process is repeated at every non-root node in the tree until the root node is reached.

2) Once the data-driven signal flow has reached the root node, say node $j^{\prime\prime}$, the vector $\tilde{\mathbf{z}}(l)$ is obtained by

$$\tilde{\mathbf{z}}(l)=\check{\mathbf{D}}^{-1}_{j^{\prime\prime}}(l)\check{\mathbf{z}}_{j^{\prime\prime}}(l),\quad \check{\mathbf{D}}_{j^{\prime\prime}}(l)=\mathbf{D}(l),~\check{\mathbf{z}}_{j^{\prime\prime}}(l)=\mathbf{z}(l). \qquad (30)$$

The vector $\tilde{\mathbf{z}}(l)$ is then flooded through the WASN (away from the root node) so that it reaches every node, where the nodes simply act as relays passing $\tilde{\mathbf{z}}(l)$ further through the tree. Finally, the output $d_{j}(l)=\mathbf{g}^{H}_{j}\tilde{\mathbf{z}}(l)$ is obtained.

Based on the data-driven signal flow above, any leaf node transmits only a single block of compressed signals $\mathbf{z}_{q}(l)$, and any non-leaf node transmits at most two blocks of signals, $\check{\mathbf{z}}_{j}(l)$ and $\tilde{\mathbf{z}}(l)$, first toward the root node and then away from it. Note that $\mathbf{D}_{q}(l)$ and $\check{\mathbf{D}}_{j}(l)$ only need to be transmitted once per block of microphone signals, so their cost can be ignored. A sketch of this in-network summation is given below.
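The data-driven signal flow can be expressed as a recursion over the tree. This is our own illustrative sketch, with the topology given as an adjacency list and the transmissions abstracted away.

```python
# Sketch of the in-network summation of Section 3.3 on a tree topology.
import numpy as np

def gather_to_root(tree, root, z, D):
    """tree: dict node -> set of neighbours N_j; z, D: dicts of per-node
    z_j(l) and D_j(l). Returns z(l) and D(l) of Eq. (14) via Eq. (29)."""
    def collect(j, parent):
        z_sum, D_sum = z[j].copy(), D[j].copy()
        for q in tree[j]:
            if q != parent:               # leaves fire first via the recursion
                z_q, D_q = collect(q, j)  # partial sums arriving from subtree q
                z_sum += z_q              # Eq. (29)
                D_sum += D_q
        return z_sum, D_sum               # transmitted toward the root
    return collect(root, None)

# At the root node j'': z_tilde = np.linalg.solve(D_total, z_total), Eq. (30);
# z_tilde is then flooded back so every node computes d_j = g_j^H z_tilde.
```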

When the first-order recursive smoothing method is applied, any leaf node $q$ transmits $\bar{\mathbf{c}}_{q}(l)$ and $c_{q}(l)$, from which $\mathbf{D}_{q}(l)$ and $\mathbf{z}_{q}(l)$ are reconstructed with (20) and (21) at the non-root node $j$. The rest proceeds as in the non-recursive smoothing method. In particular, $\check{\mathbf{D}}_{j}(l)$ needs to be transmitted for each noise-only frame.

4 Analysis of Complexity and Bandwidth

This section analyzes the computational complexity and the communication bandwidth of the proposed beamformers, together with those of state-of-the-art beamformers. For the centralized LCMV/LCMP beamformer, each node needs access to the microphone signals of all other nodes, so all microphone signals in the WASN are transmitted. In general, one wishes to estimate the $S$ speech sources separately, in which case the constraint vector $\bar{\mathbf{g}}_{j}$ is replaced by an $S\times S$ matrix $\bar{\mathbf{G}}_{j}$ with only one non-zero entry per column. Without loss of generality, we assume that each node has $M_{j}=N$ microphones. Note that the costs discussed below do not include the overhead of the VAD used by some of the algorithms.

The total number of transmissions, which determines the communication bandwidth, depends not only on the choice of beamformer but also on the WASN topology, so it is difficult to analytically bound the transmission cost for an arbitrary topology. The comparison of the communication bandwidth and the computational complexity of the different beamformers is therefore performed in a fully connected network, where the centralized LCMV/LCMP, LC-DANSE [14], BD-LCMV/BD-LCMP [22], and the beamformers proposed in this paper have the same update rate.

We count the transmission of one real number as one transmission. For the DNBD-LCMV/DNBD-LCMP with the first-order recursive smoothing method, each node needs to transmit an $S$-dimensional complex vector $\bar{\mathbf{c}}_{j}(l)$ and a real number $c_{j}(l)$, so the total number of transmissions is $J(2S+1)$. Besides, the inversion of the $S\times S$ matrix $\mathbf{D}(l)$ in (13) has a complexity of $\mathcal{O}(S^{3})$. For the DNDS, each node needs to transmit the $S$-dimensional compressed signals $\mathbf{z}_{j}(l)$ and the total number of transmissions is $2JS$. The computational complexity and the communication bandwidth of the different beamformers are summarized in Table 1.

Table 1: Complexity and bandwidth of different beamformers. The last two columns evaluate the general expressions for $N=6$, $J=4$, $S=2$, $\lvert\mathcal{L}_{y}\rvert=100$, and $t_{\max}=1$¹.

| Smoothing method | Beamformer | Complexity | Bandwidth | Complexity (example) | Bandwidth (example) |
|---|---|---|---|---|---|
| Non-recursive | LCMV/LCMP | $\mathcal{O}((JN)^{3})/\lvert\mathcal{L}_{y}\rvert$ | $2JN$ | $\mathcal{O}(13824)/100$ | 48 |
| Non-recursive | LC-DANSE | $\mathcal{O}((N+(J-1)S)^{3})/\lvert\mathcal{L}_{y}\rvert$ | $2JS$ | $\mathcal{O}(1728)/100$ | 16 |
| Non-recursive | BD-LCMV/BD-LCMP | $\mathcal{O}(N^{3})/\lvert\mathcal{L}_{y}\rvert$ | $2JS$ | $\mathcal{O}(216)/100$ | 16 |
| Non-recursive | DNBD-LCMV/DNBD-LCMP | $\mathcal{O}(S^{3})/\lvert\mathcal{L}_{y}\rvert$ | $2JS$ | $\mathcal{O}(8)/100$ | 16 |
| Non-recursive | DNDS | $\mathcal{O}(S^{3})$ (once)² | $2JS$ | $\mathcal{O}(8)$ (once)² | 16 |
| First-order recursive | LCMV/LCMP | $\mathcal{O}((JN)^{3})$ | $2JN$ | $\mathcal{O}(13824)$ | 48 |
| First-order recursive | BD-LCMV/BD-LCMP | $\mathcal{O}(N^{3})$ | $2JSt_{\max}(S+1)$ | $\mathcal{O}(216)$ | 48 |
| First-order recursive | DNBD-LCMV/DNBD-LCMP | $\mathcal{O}(S^{3})$ | $J(2S+1)$ | $\mathcal{O}(8)$ | 20 |

¹ $t_{\max}$ is the maximum number of iterations for the BD-LCMV/BD-LCMP.
² The weight vector $\bar{\mathbf{w}}^{\prime\prime}_{j}$ in (28) only needs to be calculated once at the beginning.

From Table 1, first, among the beamformers with the first-order recursive smoothing method, the DNBD-LCMV/DNBD-LCMP has the lowest complexity and bandwidth. Among the beamformers with the non-recursive smoothing method, the DNBD-LCMV/DNBD-LCMP has the lowest complexity and the same bandwidth as the LC-DANSE and the BD-LCMV/BD-LCMP. Second, the bandwidth of the DNDS is no higher than that of the other beamformers, and its complexity is negligible because $\bar{\mathbf{w}}^{\prime\prime}_{j}$ in (28) only needs to be calculated once at the beginning. Finally, the complexity of the proposed beamformers and of the BD-LCMV/BD-LCMP is independent of the number of nodes $J$, so they are completely scalable: there is no increase in the per-node complexity when new nodes are added to the network.

5 Experimental Results

This section evaluates the performance of the proposed beamformers, including DNBD-LCMV/DNBD-LCMP and DNDS, and compares them with three state-of-the-art beamformers, the centralized LCMV/LCMP and LC-DANSE [14], using three objective measures: SNR, short-time objective intelligibility (STOI) [40], and the average TDOA error (ATE) of the speaker. The evaluation is carried out in a simulated room with errors in the column space $\mathbf{Q}$ and in the VAD, for reverberation times $T_{60}=0.3~\mathrm{s}$ and $T_{60}=0.5~\mathrm{s}$. The room impulse responses (RIRs) are generated by the image method [41, 42].

5.1 Experimental Setup

The dimensions of the simulated room are $5~\mathrm{m}\times 5~\mathrm{m}\times 3~\mathrm{m}$, with reverberation times $T_{60}=0.3~\mathrm{s}$ and $T_{60}=0.5~\mathrm{s}$. The WASN consists of $J=4$ nodes, each having $M_{j}=6$ microphones forming a uniform linear array with an inter-microphone distance of 3 cm. Four point sound sources are present: $S=2$ speech sources and two babble noise sources. The configuration of the nodes and sound sources is depicted in Fig. 4. The two speakers, producing speech sentences taken from the NOIZEUS corpus [43], have the same power. The two babble directional noise sources are mutually uncorrelated, each with a power that is 10% of the power of either speaker, i.e., 10 dB SNR. In addition, all microphone signals contain an uncorrelated white Gaussian noise component at 30 dB SNR with respect to the superimposed speech signals in the first microphone of node 1. The sampling frequency is 8 kHz and the speed of sound is $c=343~\mathrm{m/s}$.

Figure 4: The acoustic scenario used in the experiment. There are $S=2$ speakers, two babble directional noise sources, and $J=4$ nodes with $M_{j}=6$ microphones each. The nodes are located at the center of each of the four walls, 30 cm from the walls. All nodes and all sources are in the same horizontal plane, 1.5 m above ground level.

To approximate a realistic acoustic scenario, the positions of the speakers at which the column space $\mathbf{Q}$ was estimated, referred to as training positions, were uniformly distributed over a sphere centered around the true source positions depicted in Fig. 4. For Speaker 1 and Speaker 2, the erroneous training positions had radii $r_{1}$ and $r_{2}$, respectively, ranging from 0 to 10 cm. The column space estimation error can therefore be modeled as a function of the positional error between the training positions and the true source positions [22]. For every value of the positional error, the average performance over 100 different setups was measured. Each setup used the same source signals at the true source positions, but a different set of training positions and a different realization of the microphone self-noise.

In the experiment, the non-recursive smoothing method was adopted to estimate $\mathbf{\Delta}_{yy,j}$ and $\mathbf{\Delta}_{nn,j}$, and the estimation was performed over the entire signal length [24, 44],

$$\begin{gathered}\mathbf{\Delta}_{yy,j}=\frac{1}{\lvert\mathcal{L}_{y}\rvert}\sum_{l\in\mathcal{L}_{y}}\mathbf{y}_{j}(l)\mathbf{y}^{H}_{j}(l),\\ \mathbf{\Delta}_{nn,j}=\frac{1}{\lvert\mathcal{L}_{y}\rvert}\Big(\sum_{l_{1}\in\mathcal{L}_{y}\setminus\mathcal{L}^{\prime}_{y}}\mathbf{n}_{j}(l_{1})\mathbf{n}^{H}_{j}(l_{1})+\sum_{l_{2}\in\mathcal{L}^{\prime}_{y}}\mathbf{y}_{j}(l_{2})\mathbf{y}^{H}_{j}(l_{2})\Big),\end{gathered} \qquad (31)$$

where $\mathcal{L}_{y}$ is the set of frames of the entire time horizon, and $\mathcal{L}^{\prime}_{y}$ is the set of frames of the noisy signals used to estimate $\mathbf{\Delta}_{nn,j}$. The set $\mathcal{L}^{\prime}_{y}$ is used to simulate the VAD error, where noisy frames containing a speech component are erroneously detected as noise-only frames. This error is measured by the scalar

$$R=\frac{\lvert\mathcal{L}^{\prime}_{y}\rvert}{\lvert\mathcal{L}_{y}\rvert}\times 100\%. \qquad (32)$$

When $\mathcal{L}^{\prime}_{y}$ is empty, i.e., $\mathcal{L}^{\prime}_{y}=\varnothing$ and $R=0$, an ideal VAD is obtained.
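The estimates in (31) and the error measure (32) can be sketched as follows (our own NumPy sketch), where a boolean mask marks the frames of $\mathcal{L}^{\prime}_{y}$ that are wrongly treated as noise-only.

```python
# Sketch of the covariance estimates (31) with simulated VAD error (32).
import numpy as np

def estimate_covariances(Y, N, wrong_mask):
    """Y, N: (M_j, L) noisy and noise-only STFT frames at node j;
    wrong_mask: boolean mask of length L marking the frames in L'_y."""
    L = Y.shape[1]
    Delta_yy = (Y @ Y.conj().T) / L       # Eq. (31), first line
    mix = np.where(wrong_mask, Y, N)      # y_j(l) on L'_y, n_j(l) elsewhere
    Delta_nn = (mix @ mix.conj().T) / L   # Eq. (31), second line
    R = wrong_mask.mean() * 100.0         # VAD error in percent, Eq. (32)
    return Delta_yy, Delta_nn, R
```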

The SNR is the ratio between the powers of the desired speech component and the noise, and can be defined as

$$\mathrm{SNR}=10\log\frac{E\{\bar{d}^{2}_{j}(t)\}-E\{\bar{n}^{2}_{j}(t)\}}{E\{\bar{n}^{2}_{j}(t)\}}, \qquad (33)$$

where $\bar{d}_{j}(t)$ and $\bar{n}_{j}(t)$ denote the time-domain beamformer output and the time-domain noise component (which also contains the residual competing speech component [18]) at the $j$th node, respectively.

The TDOA error $\Delta{T}_{j1,s}$ between the $j$th node and the 1st node for the $s$th speaker can be expressed as

$$\begin{gathered}\Delta{T}_{j1,s}=\begin{cases}T_{j1,s}-\hat{T}_{j1,s},&\mathrm{if}~T_{j1,s}-\hat{T}_{j1,s}\geq 0;\\ \hat{T}_{j1,s}-T_{j1,s},&\mathrm{if}~T_{j1,s}-\hat{T}_{j1,s}<0,\end{cases}\\ T_{j1,s}=\left(\lVert\mathbf{x}_{j1}-\bar{\mathbf{x}}_{s}\rVert-\lVert\mathbf{x}_{11}-\bar{\mathbf{x}}_{s}\rVert\right)/c,\end{gathered} \qquad (34)$$

where $\mathbf{x}_{j1}$ is the location of the 1st microphone at the $j$th node, $\bar{\mathbf{x}}_{s}$ is the location of the $s$th speaker, and $\lVert\cdot\rVert$ denotes the two-norm of a vector. $T_{j1,s}$ is the theoretical TDOA and $\hat{T}_{j1,s}$ is the TDOA estimated from the output signals of the different nodes using the generalized cross-correlation phase transform (GCC-PHAT) [45]. More details about TDOA estimation can be found in [46]. The ATE is defined by

$$\mathrm{ATE}=\frac{1}{J-1}\sum_{j=2}^{J}\Delta{T}_{j1,s}. \qquad (35)$$
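The ATE evaluation in (34)-(35) can be sketched as follows; the GCC-PHAT estimator below is a simplified stand-in for [45], and all names are our own.

```python
# Sketch of the TDOA/ATE evaluation of Eqs. (34)-(35); a basic GCC-PHAT
# estimator stands in for [45].
import numpy as np

def gcc_phat_tdoa(x, ref, fs):
    """Estimated TDOA (seconds) of x relative to ref via GCC-PHAT."""
    n = len(x) + len(ref)
    G = np.fft.rfft(x, n) * np.fft.rfft(ref, n).conj()
    cc = np.fft.irfft(G / (np.abs(G) + 1e-12), n)     # phase transform
    cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))  # lags -n/2 .. n/2-1
    return (np.argmax(np.abs(cc)) - n // 2) / fs

def average_tdoa_error(outputs, ref_mic_pos, src_pos, fs, c=343.0):
    """ATE over nodes j = 2..J, Eq. (35); outputs[j] is node j's time-domain
    output and ref_mic_pos[j] the position of its reference microphone."""
    errs = []
    for j in range(1, len(outputs)):
        T = (np.linalg.norm(ref_mic_pos[j] - src_pos)
             - np.linalg.norm(ref_mic_pos[0] - src_pos)) / c  # Eq. (34)
        T_hat = gcc_phat_tdoa(outputs[j], outputs[0], fs)
        errs.append(abs(T - T_hat))                           # Eq. (34)
    return float(np.mean(errs))
```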

In the following performance comparison, the first microphone of each node is chosen as the reference microphone and the experimental results for Speaker 1 are presented. In particular, since LC-DANSE [14] is time-recursive and converges to the solution of its centralized algorithm, LC-DANSE is replaced by the centralized LCMV [22].

5.2 Robustness to Column Space Estimation Error

Fig. 5 shows the performance of the different beamformers under different positional errors for reverberation time $T_{60}=0.3$ s and VAD error $R=5\%$. The spectrograms of the desired signal received by the first microphone of node 1 and of the output signals of node 1 for the different beamformers are depicted in Fig. 6 for $T_{60}=0.3$ s, $R=5\%$, and $r_{1}=r_{2}=5$ cm.

Figure 5: Comparison of SNR, STOI, and ATE for different beamformers ($T_{60}=0.3$ s and $R=5\%$). (a)-(b) SNR and STOI of the output signal of node 1; (c) ATE of the output signals of the different nodes.

From Fig. 5, first, one can see that the performance of the five beamformers decreases with increasing positional error. When there is no positional error, i.e., $r_{1}=r_{2}=0$ cm, the SNR and STOI of DNBD-LCMP are higher than those of LCMP and DNDS, and slightly lower than those of LCMV and DNBD-LCMV. With positional error, the SNR and STOI of DNBD-LCMP drop significantly and fall well below those of DNDS, LCMV, and DNBD-LCMV; across the positional error values, the SNR and STOI of LCMP are the lowest. LCMP and DNBD-LCMP are thus less robust to positional error than DNDS, LCMV, and DNBD-LCMV. Second, among DNDS, LCMV, and DNBD-LCMV, the SNR and STOI of DNDS are the lowest when the positional error is zero, but they approach those of LCMV and DNBD-LCMV with increasing positional error (even exceeding LCMV). This is because DNDS cannot adequately suppress the directional noise sources that are not included in the desired response vector, yet it is more robust to positional error [47]. DNBD-LCMV has higher SNR and STOI than LCMV for non-zero positional error and is less sensitive to it, since $\mathbf{\Delta}_{nn,j}$ has lower dimensions than the full-element noise sample covariance matrix used in the LCMV and is numerically more favorable. Third, the ATEs of the different beamformers do not show significant differences.

Figure 6: The spectrograms of the desired signal received by the 1st microphone of node 1, and of the output signals of node 1 of the different beamformers for $T_{60}=0.3$ s, $R=5\%$, and $r_{1}=r_{2}=5$ cm. (a) Desired signal; (b) LCMP; (c) LCMV; (d) DNDS; (e) DNBD-LCMP; (f) DNBD-LCMV.

From Fig. 6, the speech component in the output signal of LCMP is almost completely removed: the column space estimation error causes the actual speech component to be suppressed while the signal in the direction of the wrongly estimated column space is preserved. DNBD-LCMP suffers from a similar problem; however, its performance degradation is not as severe as that of LCMP, because DNBD-LCMP has fewer degrees of freedom to satisfy the wrong distortionless constraint and therefore suppresses less of the speech component [22]. Note that LCMP and DNBD-LCMP are not considered further in the following experiments due to their poor performance under signal model mismatch. LCMV causes more speech distortion than DNBD-LCMV. DNDS preserves the speech component like DNBD-LCMV; however, it also preserves more of the noise component around 0.5 kHz.
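
The target-cancellation mechanism behind Fig. 6 can be reproduced in a toy narrowband simulation: when the covariance matrix contains the target (as in minimum-power designs such as LCMP) and the distortionless constraint points at a wrongly estimated column space, the beamformer treats the actual target as interference and nulls it. The following sketch uses an illustrative half-wavelength uniform linear array and arbitrary directions; it is not the simulation setup of this paper:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 6, 5000                        # microphones, snapshots (illustrative)

def steer(theta_deg, M):
    """Far-field steering vector for a half-wavelength-spaced ULA."""
    theta = np.deg2rad(theta_deg)
    return np.exp(-1j * np.pi * np.arange(M) * np.sin(theta))

a_true = steer(20.0, M)               # actual target direction
a_bad = steer(26.0, M)                # wrongly estimated column space

s = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
v = 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
y = np.outer(a_true, s) + v           # target-plus-noise snapshots

R = y @ y.conj().T / N                # sample covariance containing the target
for a, tag in ((a_true, "correct"), (a_bad, "mismatched")):
    Ri_a = np.linalg.solve(R, a)
    w = Ri_a / (a.conj() @ Ri_a)      # minimum-power beamformer with w^H a = 1
    gain = np.abs(np.vdot(w, a_true)) ** 2   # response toward the real target
    print(f"{tag:>10s} constraint: power gain toward target = {gain:.3f}")
```

With the correct constraint the gain toward the target is exactly one; with the mismatched constraint it collapses toward zero, which is the spectral hole seen for LCMP in Fig. 6 (b).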

5.3 Robustness to VAD Error

Figure 7: Comparison of SNR, STOI, and ATE for different beamformers ($T_{60}=0.3$ s and $r_{1}=r_{2}=5$ cm). (a)-(b) SNR and STOI of the output signal of node 1; (c) ATE of the output signals of different nodes.

In this subsection, we compare the performance of DNDS, LCMV, and DNBD-LCMV under different VAD errors with reverberation time $T_{60}=0.3$ s and positional error $r_{1}=r_{2}=5$ cm, as shown in Fig. 7. First, for an ideal VAD, i.e., $R=0$, LCMV has the highest SNR and STOI. However, VAD errors are often inevitable in practical applications: the SNR and the STOI of LCMV deteriorate significantly and become the lowest with increasing VAD error, which indicates that LCMV is the most sensitive to VAD errors. Second, as the VAD error grows, the SNR and the STOI of DNBD-LCMV decrease and gradually approach those of DNDS, which is independent of the VAD. Third, the ATEs of the three beamformers are almost identical and do not grow with increasing VAD error.
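
The sensitivity of LCMV to VAD errors can be understood from the noise covariance estimate: frames in which speech is active but are mislabeled as noise-only leak speech energy into the noise sample covariance, biasing the beamformer toward cancelling the target. A toy sketch of this effect follows; the label-flipping error model and real-valued signals are simplifying assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
M, F, R = 6, 2000, 0.05          # mics, frames, VAD error rate (illustrative)

a = np.ones(M) / np.sqrt(M)      # toy target steering vector
active = rng.random(F) < 0.5                  # speech active in half the frames
speech = rng.standard_normal(F) * active
noise = 0.3 * rng.standard_normal((M, F))
y = np.outer(a, speech) + noise               # microphone frames

flip = rng.random(F) < R                      # fraction R of frames mislabeled
vad = np.where(flip, ~active, active)         # erroneous VAD decisions

def noise_cov(labels):
    """Noise sample covariance from frames labeled as noise-only."""
    frames = y[:, ~labels]
    return frames @ frames.T / frames.shape[1]

for labels, tag in ((active, "ideal VAD"), (vad, f"VAD error R={R:.0%}")):
    Rn = noise_cov(labels)
    # Speech leakage appears as excess energy along the target direction a.
    print(f"{tag:>16s}: a^T R_n a = {a @ Rn @ a:.3f}")
```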

5.4 Robustness to Reverberation Time

In this subsection, we compare the performance of DNDS, LCMV, and DNBD-LCMV when the reverberation time increases to $T_{60}=0.5$ s, as shown in Fig. 8. First, one can observe that the SNR and the STOI of LCMV are the lowest for non-zero positional error and VAD error. The SNR and the STOI of DNBD-LCMV are reduced and gradually approach, or even fall below, those of DNDS with increasing positional error or VAD error. Second, no significant differences can be observed in the ATEs of the different beamformers, similar to Fig. 5 (c) and Fig. 7 (c).

Figure 8: Comparison of SNR, STOI, and ATE for different beamformers ($T_{60}=0.5$ s). (a)-(c) $R=5\%$; (d)-(f) $r_{1}=r_{2}=5$ cm.

Comparing Figs. 5, 7, and 8, the performance of DNDS, LCMV, and DNBD-LCMV in terms of SNR, STOI, and ATE degrades for a larger reverberation time, and DNDS and DNBD-LCMV are less sensitive to the reverberation time than LCMV.

6 Conclusion

In this paper, we propose a distributed node-specific block-diagonal linearly constrained minimum variance (DNBD-LCMV) beamformer, whose analytical solution is derived from the centralized LCMV beamformer under the assumption of a block-diagonal noise covariance matrix. By updating the inverse of the noise sample covariance matrix with the Sherman-Morrison-Woodbury formula, the exchanged signals can be computed more efficiently. The proposed DNBD-LCMV significantly reduces the number of signals exchanged between nodes and solves the LCMV beamforming problem exactly in each frame. Analysis and experimental results confirm that the proposed DNBD-LCMV has much lower computational complexity and is more robust to column space estimation errors and VAD errors than other state-of-the-art distributed node-specific algorithms.

References

  • [1] M. Brandstein, D. Ward, Microphone Arrays: Signal Processing Techniques and Applications, Springer-Verlag, Berlin, Germany, 2001.
  • [2] A. Bertrand, Applications and trends in wireless acoustic sensor networks: A signal processing perspective, in: Proc. IEEE Symp. Commun. Veh. Technol. (SCVT), Ghent, Belgium, 2011, pp. 1–6.
  • [3] D. Culler, D. Estrin, M. Srivastava, Overview of sensor networks, IEEE Computer. 37 (8) (2004) 41–49.
  • [4] S. Doclo, M. Moonen, T. Van den Bogaert, J. Wouters, Reduced-bandwidth and distributed MWF-based noise reduction algorithms for binaural hearing aids, IEEE Trans. Audio, Speech, Lang. Process. 17 (1) (2009) 38–51.
  • [5] S. Markovich Golan, S. Gannot, I. Cohen, A reduced bandwidth binaural MVDR beamformer, in: Proc. Int. Workshop Acoust. Echo Noise Contr. (IWAENC), Tel-Aviv, Israel, 2010.
  • [6] A. Bertrand, M. Moonen, Robust distributed noise reduction in hearing aids with external acoustic sensor nodes, EURASIP J. Adv. Signal Process. 2009 (1) (2009) 14.
  • [7] Y. Jia, Y. Luo, Y. Lin, I. Kozintsev, Distributed microphone arrays for digital home and office, in: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Toulouse, France, 2006, pp. 1065–1068.
  • [8] Z. C. Liu, Z. Y. Zhang, L. W. He, P. Chou, Energy-based sound source localization and gain normalization for ad hoc microphone arrays, in: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Honolulu, HI, USA, 2007, pp. 761–764.
  • [9] M. H. Chen, Z. C. Liu, L. W. He, P. Chou, Z. Y. Zhang, Energy-based position estimation of microphones and speakers for ad hoc microphone arrays, in: Proc. IEEE Workshop. Applications. Signal Process. Audio. Acoust. (WASPAA), New Paltz, NY, USA, 2007, pp. 22–25.
  • [10] A. Griffin, A. Alexandridis, D. Pavlidi, Y. Mastorakis, A. Mouchtaris, Localizing multiple audio sources in a wireless acoustic sensor network, Signal Process. 107 (2015) 54–67.
  • [11] A. Alexandridis, A. Mouchtaris, Multiple sound source location estimation in wireless acoustic sensor networks using DOA estimates: the data-association problem, IEEE/ACM Trans. Audio, Speech, Lang. Process. 26 (2) (2018) 342–356.
  • [12] X. W. Guo, Z. F. Chen, X. Q. Hu, X. D. Li, Multi-source localization using time of arrival self-clustering method in wireless sensor networks, IEEE Access. 7 (2019) 82110–82121.
  • [13] M. Cobos, J. J. Perez-Solano, S. Felici-Castell, J. Segura, J. M. Navarro, Cumulative-Sum-Based localization of sound events in low-cost wireless acoustic sensor networks, IEEE/ACM Trans. Audio, Speech, Lang. Process. 22 (12) (2014) 1792–1802.
  • [14] A. Bertrand, M. Moonen, Distributed node-specific LCMV beamforming in wireless sensor networks, IEEE Trans. Signal Process. 60 (1) (2012) 233–246.
  • [15] A. Bertrand, M. Moonen, Distributed adaptive estimation of covariance matrix eigenvectors in wireless sensor networks with application to distributed PCA, Signal Process. 104 (2014) 120–135.
  • [16] J. Szurley, A. Bertrand, M. Moonen, Distributed adaptive node-specific signal estimation in heterogeneous and mixed-topology wireless sensor networks, Signal Process. 117 (2015) 44–60.
  • [17] Y. Zeng, R. C. Hendriks, Distributed delay and sum beamformer for speech enhancement via randomized gossip, IEEE Trans. Audio, Speech, Lang. Process. 22 (1) (2014) 260–273.
  • [18] S. Markovich Golan, S. Gannot, I. Cohen, Distributed multiple constraints generalized sidelobe canceler for fully connected wireless acoustic sensor networks, IEEE Trans. Audio, Speech, Lang. Process. 21 (2) (2013) 343–356.
  • [19] S. Markovich Golan, S. Gannot, I. Cohen, Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals, IEEE Trans. Audio, Speech, Lang. Process. 17 (6) (2009) 1071–1086.
  • [20] S. Markovich Golan, S. Gannot, I. Cohen, Subspace tracking of multiple sources and its application to speakers extraction, in: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Dallas, TX, USA, 2010, pp. 201–204.
  • [21] T. Van den Bogaert, J. Wouters, S. Doclo, M. Moonen, Binaural cue preservation for hearing aids using an interaural transfer function multichannel Wiener filter, in: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Honolulu, HI, USA, 2007, pp. 565–568.
  • [22] A. I. Koutrouvelis, T. W. Sherson, R. Heusdens, R. C. Hendriks, A low-cost robust distributed linearly constrained beamformer for wireless acoustic sensor networks with arbitrary topology, IEEE/ACM Trans. Audio, Speech, Lang. Process. 26 (8) (2018) 1434–1448.
  • [23] S. Boyd, A. Ghosh, B. Prabhakar, D. Shah, Randomized gossip algorithms, IEEE Trans. Inf. Theory. 52 (6) (2006) 2508–2530.
  • [24] J. Szurley, A. Bertrand, M. Moonen, Topology-independent distributed adaptive node-specific signal estimation in wireless sensor networks, IEEE Trans. Signal Info. Process. Over Netw. 3 (1) (2017) 130–144.
  • [25] G. Q. Zhang, R. Heusdens, Distributed optimization using the primal-dual method of multipliers, IEEE Trans. Signal Info. Process. Over Netw. 4 (1) (2018) 173–187.
  • [26] A. Bertrand, M. Moonen, Distributed adaptive node-specific signal estimation in fully connected sensor networks-part I: Sequential node updating, IEEE Trans. Signal Process. 58 (10) (2010) 5277–5291.
  • [27] A. Bertrand, M. Moonen, Distributed adaptive estimation of node-specific signals in wireless sensor networks with a tree topology, IEEE Trans. Signal Process. 59 (5) (2011) 2196–2210.
  • [28] Y. Zeng, R. C. Hendriks, Distributed estimation of the inverse of the correlation matrix for privacy preserving beamforming, Signal Process. 107 (2015) 109–122.
  • [29] J. Capon, High-resolution frequency-wavenumber spectrum analysis, Proc. IEEE. 57 (8) (1969) 1408–1418.
  • [30] M. Er, A. Cantoni, Derivative constraints for broad-band element space antenna array processors, IEEE Trans. Acoust., Speech, Signal Process. ASSP-31 (6) (1983) 1378–1393.
  • [31] J. Sherman, W. J. Morrison, Adjustment of an inverse matrix corresponding to a change in one element of a given matrix, Ann. Math. Stat. 21 (1) (1950) 124–127.
  • [32] B. D. Van Veen, K. M. Buckley, Beamforming: A versatile approach to spatial filtering, IEEE ASSP Mag. 5 (2) (1988) 4–24.
  • [33] O. L. Frost, III, An algorithm for linearly constrained adaptive array processing, Proc. IEEE. 60 (8) (1972) 926–935.
  • [34] S. Gannot, E. Vincent, S. Markovich Golan, A. Ozerov, A consolidated perspective on multimicrophone speech enhancement and source separation, IEEE/ACM Trans. Audio, Speech, Lang. Process. 25 (4) (2017) 692–730.
  • [35] S. Markovich Golan, A. Bertrand, M. Moonen, S. Gannot, Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks, Signal Process. 107 (2015) 4–20.
  • [36] C. S. Zheng, A. Deleforge, X. D. Li, W. Kellermann, Statistical analysis of the multichannel Wiener filter using a bivariate normal distribution for sample covariance matrices, IEEE/ACM Trans. Audio, Speech, Lang. Process. 26 (5) (2018) 951–966.
  • [37] T. Gerkmann, R. C. Hendriks, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay, IEEE Trans. Audio, Speech, Lang. Process. 20 (4) (2012) 1383–1393.
  • [38] H. Chen, A. Campbell, B. Thomas, A. Tamir, Minimax flow tree problems, Networks. 54 (3) (2009) 117–129.
  • [39] Y. Choi, M. Khan, V. S. Anil Kumar, G. Pandurangan, Energy-optimal distributed algorithms for minimum spanning trees, IEEE J. Sel. Areas Commun. 27 (7) (2009) 1297–1304.
  • [40] C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Trans. Audio, Speech, Lang. Process. 19 (7) (2011) 2125–2136.
  • [41] J. B. Allen, D. A. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Amer. 65 (4) (1979) 943–950.
  • [42] E. A. Lehmann, A. M. Johansson, Prediction of energy decay in room impulse responses simulated with an image-source model, J. Acoust. Soc. Amer. 124 (1) (2008) 269–277.
  • [43] Y. Hu, P. C. Loizou, Subjective comparison and evaluation of speech enhancement algorithms, Speech Communication. 49 (7) (2007) 588–601.
  • [44] A. Spriet, M. Moonen, J. Wouters, The impact of speech detection errors on the noise reduction performance of multi-channel Wiener filtering and generalized sidelobe cancellation, Signal Process. 85 (2005) 1073–1088.
  • [45] C. H. Knapp, G. C. Carter, The generalized correlation method for estimation of time delay, IEEE Trans. Acoust., Speech, Signal Process. ASSP-24 (4) (1976) 320–327.
  • [46] Y. T. Chan, K. C. Ho, A simple and efficient estimator for hyperbolic location, IEEE Trans. Signal Process. 42 (8) (1994) 1905–1915.
  • [47] H. Cox, Resolving power and sensitivity to mismatch of optimum array processors, J. Acoust. Soc. Amer. 54 (3) (1973) 771–785.