
Implicit Filter-and-sum Network for Multi-channel Speech Separation

Abstract

Various neural network architectures have been proposed in recent years for the task of multi-channel speech separation. Among them, the filter-and-sum network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and has proven effective in both ad-hoc and fixed microphone array geometries. In this paper, we investigate multiple ways to improve the performance of FaSNet. From the problem formulation perspective, we change the explicit time-domain filter-and-sum operation, which involves all the microphones, into an implicit filter-and-sum operation in the latent space of only the reference microphone. The filter-and-sum operation is applied to a context around the frame to be separated. This allows the problem formulation to better match the objective of end-to-end separation. From the feature extraction perspective, we replace the sample-level normalized cross correlation (NCC) features with feature-level NCC (fNCC) features. This makes the model better match the implicit filter-and-sum formulation. Experiment results on both ad-hoc and fixed microphone array geometries show that the proposed modification to FaSNet, which we refer to as iFaSNet, significantly outperforms the benchmark FaSNet across all conditions with comparable model complexity.

Index Terms: Speech separation, speech enhancement, multi-channel, end-to-end

1 Introduction

The design of multi-channel speech separation systems has been one of the active topics in the speech separation community in recent years. Despite the advances in time-frequency domain neural beamformers, where a neural network assists a conventional beamformer for better robustness and performance [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], time-domain architectures have also attracted attention from the community due to their ability to perform purely end-to-end optimization towards the target signals. Moreover, as conventional beamformers often require a temporal context for better estimation of spatial features [12, 13], time-domain systems have the potential to operate at frame level with a lower theoretical system latency.

Recent time-domain systems can be classified into three categories. The first category reformulates the multi-channel separation problem as a single-channel separation problem on a selected reference microphone, with the help of additional cross-channel features. Various cross-channel features have been proposed and analyzed on a range of datasets [14, 15, 16, 17, 18]. The second category processes the multi-channel mixtures with a convolutional encoder where the channels are treated as different feature maps in a convolutional operation. Such systems learn a direct mapping between the mixtures and the target signals [19, 20]. The third category performs end-to-end beamforming without solving the optimization problems required in conventional beamformers. The beamforming filters can be compared to the masks in single-channel separation systems, as both operate at frame level [21, 22].

One of the systems in the third category is the filter-and-sum network (FaSNet), which performs time-domain end-to-end beamforming [21, 22]. FaSNet attempts to directly estimate the beamforming filters with a neural network, and previous results have proven its effectiveness on simulated noisy reverberant datasets compared with single-channel methods and baseline methods in the first category. However, FaSNet has several drawbacks in its problem formulation. Since FaSNet formulates the separation problem as time-domain beamforming and the end-to-end training target is typically the reverberant clean signals, asking the filters to reconstruct not only the direct-path signal but also all the reverberation components may not be optimal, as it requires the filters to achieve a complex beampattern that preserves the reverberation. Although it is possible to perform joint dereverberation and separation by setting the direct-path signals as the training target, doing so may significantly increase the task difficulty. Moreover, one reason time-frequency domain beamformers are preferred over time-domain beamformers is their advantage in both robustness and performance due to the use of the short-time Fourier transform (STFT); it is therefore also worth exploring the beamforming operation in a latent space.

In this paper, we explore multiple aspects to improve the original FaSNet architecture. We investigate four modifications to FaSNet. First, we compare the original multi-input-multi-output (MIMO) formulation with a multi-input-single-output (MISO) formulation, where the filter is only estimated for the reference channel instead of all channels. Second, we consider estimating the filter in a learnable latent space instead of the original waveform domain. Together with the MISO assumption, this also matches the design of the systems in the first category. Third, we look for cross-channel features more suitable for the MISO and latent-space filtering design. Fourth, we utilize context-aware processing to further improve the model performance. We refer to the modified FaSNet as the implicit FaSNet (iFaSNet). Ablation experiments show that these modifications allow iFaSNet to outperform the original FaSNet across various data configurations while maintaining the same model size and complexity.

The rest of the paper is organized as follows. Section 2 briefly goes over the original FaSNet architecture and introduces the proposed iFaSNet. Section 3 provides the experiment configurations. Section 4 presents the results and discussions. Section 5 concludes the paper.

2 Implicit Filter-and-sum Network

Fig. 1: Flowchart for the proposed iFaSNet architecture. The modifications to the original FaSNet are highlighted, which include (A) the use of MISO design instead of the original MIMO design, (B) the use of implicit filtering in the latent space instead of the original explicit filtering on the waveforms, (C) the use of feature-level NCC feature for cross-channel information instead of the original sample-level NCC feature, and (D) the use of context-aware filtering instead of the original context-independent filtering.

2.1 Filter-and-sum Network recap

Filter-and-sum network (FaSNet) performs time-domain filter-and-sum beamforming at frame level and directly estimates the beamforming filters with a neural network. For each frame of input mixtures from the $M$ channels $\{\mathbf{y}^{i}\}_{i=1}^{M}\in\mathbb{R}^{1\times L}$, a context window of $W$ samples in both past and future is concatenated to $\mathbf{y}^{i}$, resulting in a context frame $\hat{\mathbf{y}}^{i}\in\mathbb{R}^{1\times(L+2W)}$. For each target source, $M$ time-domain beamforming filters $\{\mathbf{h}^{i}\}_{i=1}^{M}\in\mathbb{R}^{1\times(1+2W)}$ are estimated from $\{\mathbf{y}^{i}\}_{i=1}^{M}$ by a neural network, and the filters are applied to the input to obtain the separated output:

\hat{\mathbf{x}} = \sum_{i=1}^{M} \hat{\mathbf{y}}^{i} \circledast \mathbf{h}^{i}    (1)

where $\circledast$ represents the convolution operation. The estimation of the filters relies on both channel-wise features and cross-channel features. FaSNet applies a linear fully-connected (FC) layer to extract the channel-wise features for each input channel:

\mathbf{s}^{i} = \hat{\mathbf{y}}^{i}\hat{\mathbf{W}}    (2)

where $\hat{\mathbf{W}}\in\mathbb{R}^{(L+2W)\times N}$ is a learnable parameter matrix. The linear FC layer can also be regarded as a linear 1-D convolutional layer. The cross-channel feature used by FaSNet is the normalized cross correlation (NCC) feature $\mathbf{q}^{i}\in\mathbb{R}^{1\times(1+2W)}$ calculated between the center frame at the reference microphone $\mathbf{y}^{1}$ and the context frames at all microphones $\{\hat{\mathbf{y}}^{i}\}_{i=1}^{M}$:

\begin{cases} \hat{\mathbf{y}}_{j}^{i} = \hat{\mathbf{y}}^{i}[j:j+L-1] \\ q_{j}^{i} = \dfrac{\mathbf{y}^{1}\hat{\mathbf{y}}_{j}^{iT}}{\|\mathbf{y}^{1}\|_{2}\,\|\hat{\mathbf{y}}_{j}^{i}\|_{2}} \end{cases}, \quad j = 1,\ldots,2W+1    (3)
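For concreteness, the following is a minimal NumPy sketch of the frame-level filter-and-sum in equation 1 and the sample-level NCC in equation 3; the function names and the use of NumPy are illustrative choices, not the authors' released implementation.

import numpy as np

def filter_and_sum(context_frames, filters):
    # Equation 1: convolve each channel's context frame (length L + 2W)
    # with its estimated filter (length 1 + 2W) and sum over channels.
    # "valid" convolution returns exactly L output samples per channel.
    out = None
    for y_ctx, h in zip(context_frames, filters):
        conv = np.convolve(y_ctx, h, mode="valid")
        out = conv if out is None else out + conv
    return out

def tncc(y_ref, y_ctx, eps=1e-8):
    # Equation 3: cosine similarity between the reference center frame
    # (length L) and every L-sample segment of a channel's context frame,
    # giving a vector of 1 + 2W values.
    L = len(y_ref)
    ref_norm = np.linalg.norm(y_ref) + eps
    q = np.empty(len(y_ctx) - L + 1)
    for j in range(len(q)):
        seg = y_ctx[j:j + L]
        q[j] = y_ref @ seg / (ref_norm * (np.linalg.norm(seg) + eps))
    return q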

Various neural network architectures can be selected to estimate the filters. FaSNet makes use of the dual-path RNN (DPRNN) architecture [23] together with the transform-average-concatenate (TAC) module [22] to perform robust filter estimation and to make the model invariant to the number and permutation (locations) of the microphones, which is particularly important in ad-hoc array scenarios.

2.2 Implicit Filter-and-sum Network

To improve the performance of FaSNet, we propose the implicit filter-and-sum network (iFaSNet) which modifies the standard FaSNet in multiple aspects. Figure 1 shows the flowchart of iFaSNet, and the modifications to the original FaSNet are highlighted.

2.2.1 Multi-input-single-output design

FaSNet applies time-domain filter-and-sum beamforming together with an end-to-end training objective. The training target for FaSNet is typically set to the reverberant clean signals at a selected reference microphone (we do not consider joint separation and dereverberation here). However, the learned beamforming filters at different channels have their own beampatterns, and using the reverberant clean signals as the training target implies that the filters should not only enhance the signal coming from a certain direction but also reconstruct all the reverberation components. The corresponding beampattern can thus be highly complex, as the reverberation typically covers a much wider range of angles. Although FaSNet applies frame-level beamforming, where infinitely many optimal frame-level filters may exist since the linear system in equation 1 is underdetermined ($M\times(1+2W)$ unknowns and $1+2W$ equations), finding such reverberation-preserving filters for all channels may hurt the generalization of the network and thus degrade the separation performance.

To maintain the end-to-end training configuration, we change the original multi-input-multi-output (MIMO) filter estimation into multi-input-single-output (MISO) estimation, where only the filter for the reference channel is calculated. This reformulates the multi-channel separation problem back into the single-channel separation problem as in the first category discussed in Section 1, while the input to the model still contains the mixtures from all channels. Figure 1 (A) shows the MISO module.

2.2.2 Implicit filtering in latent space

FaSNet explicitly calculates the beamforming filters in the waveform domain. The advantage is that this follows the standard definition of time-domain beamforming, and it is easy to analyze the behaviors of the filters, such as their beampatterns. However, most of the literature on multi-channel separation focuses on learning a mapping in a latent space where the signal can be better represented. Not only are the existing neural beamformers mainly designed in the time-frequency domain, but time-domain systems also utilize a learnable latent space for better signal representation and separation performance. Motivated by this recent progress, we adopt the standard setting in time-domain systems where a pair of learnable encoder and decoder is used to represent the signal in the latent space. The filter is thus similar to the "mask" in single-channel separation systems, defined as a multiplicative function on the encoder output of the reference channel.

According to equation 2, the encoder in the original FaSNet maps the context frame $\hat{\mathbf{y}}^{i}$ to a feature vector $\mathbf{s}^{i}$ through the encoder weight $\hat{\mathbf{W}}$. $\mathbf{s}^{i}$ is only used for the estimation of the beamforming filters and is not involved in the beamforming operation itself. The encoder in iFaSNet is applied to the center frame $\mathbf{y}^{i}$ instead of the context frame $\hat{\mathbf{y}}^{i}$, with a corresponding encoder weight $\mathbf{W}\in\mathbb{R}^{L\times N}$. The resulting feature is denoted as $\mathbf{f}^{i}\in\mathbb{R}^{1\times N}$. A decoder with weight $\mathbf{U}\in\mathbb{R}^{N\times L}$ is also applied to transform the latent feature back to the waveform. Figure 1 (B) shows the newly-added decoder module.
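As a sketch of this design, the encoder and decoder can be implemented as a pair of bias-free linear layers with a multiplicative filter applied in between; the latent size N below is a hypothetical value and the class is illustrative rather than the released implementation.

import torch
import torch.nn as nn

class LatentCodec(nn.Module):
    # Linear encoder/decoder pair used for implicit filtering (Section 2.2.2).
    def __init__(self, L=256, N=64):                 # N is a hypothetical latent size
        super().__init__()
        self.encoder = nn.Linear(L, N, bias=False)   # weight W in R^{L x N}
        self.decoder = nn.Linear(N, L, bias=False)   # weight U in R^{N x L}

    def forward(self, center_frames, filters):
        # center_frames: (..., L) center frames of the reference channel
        # filters:       (..., N) multiplicative filters ("masks") in the latent space
        f = self.encoder(center_frames)              # latent features f
        return self.decoder(f * filters)             # filtered frames back in the waveform domain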

2.2.3 Feature-level normalized cross correlation

The cross-channel feature in the original FaSNet is calculated by the time-domain normalized cross correlation (tNCC) defined in equation 3. The rationale behind tNCC is to capture both the delay information across channels and the source-dependent information for different targets. However, the tNCC feature requires a sample-level convolution operation, which involves $L(1+2W)M$ floating-point multiplications. Moreover, when the sample rate of the signals is low (e.g., in certain telecommunication systems), such sample-level correlation may fail to capture the cross-channel delay information. To save the number of operations and accelerate the feature calculation, we modify the tNCC into a feature-level NCC (fNCC). Denote the context feature $[\mathbf{f}^{i}_{t-C},\ldots,\mathbf{f}^{i}_{t},\ldots,\mathbf{f}^{i}_{t+C}]$ as $\mathbf{F}^{i}_{t}\in\mathbb{R}^{(1+2C)\times N}$, where $C$ denotes the context size. fNCC calculates the cosine similarity between $\mathbf{F}^{1}_{t}$ and $\{\mathbf{F}^{i}_{t}\}_{i=1}^{M}$:

\hat{\mathbf{q}}^{i}_{t} = \bar{\mathbf{F}}_{t}^{1}\bar{\mathbf{F}}^{iT}_{t}    (4)

where $\bar{\mathbf{F}}^{i}_{t}$ denotes the column-normalized version of $\mathbf{F}^{i}_{t}$ in which each column has unit length, and $\hat{\mathbf{q}}^{i}_{t}\in\mathbb{R}^{(1+2C)\times(1+2C)}$ denotes the fNCC feature at time $t$ for channel $i$. $\hat{\mathbf{q}}^{i}_{t}$ is then flattened to a vector of shape $1\times(1+2C)^{2}$.
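A minimal PyTorch sketch of equation 4 is given below; it assumes each frame feature vector is scaled to unit norm before the matrix product, so that every entry of the output is a cosine similarity between one reference-channel frame and one frame of channel i. The function name is illustrative.

import torch
import torch.nn.functional as F

def fncc(ref_ctx, mic_ctx, eps=1e-8):
    # ref_ctx, mic_ctx: (1 + 2C, N) context features of the reference channel
    # and of channel i at time t.
    ref_n = F.normalize(ref_ctx, dim=-1, eps=eps)    # unit-norm frame features
    mic_n = F.normalize(mic_ctx, dim=-1, eps=eps)
    q = ref_n @ mic_n.T                              # (1 + 2C, 1 + 2C) similarity matrix
    return q.reshape(-1)                             # flattened to 1 x (1 + 2C)^2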

fNCC only involves $N(1+2C)^{2}M$ floating-point multiplication operations. For the default setting in FaSNet where $W=L=256$ with a 50% overlap between frames, we have $C=2$ and $(1+2C)^{2}=25\ll 1+2W=513$. By properly setting the value of $N$, fNCC can greatly reduce the computation needed for the cross-channel feature extraction. Figure 1 (C) shows the fNCC calculation module.
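As a worked instance of this comparison (the value of $N$ here is a hypothetical choice, not a setting specified above):

L(1+2W)M = 256 \times 513 \times M \approx 1.3\times 10^{5}\,M \quad \text{(tNCC)}
N(1+2C)^{2}M = 64 \times 25 \times M = 1600\,M \quad \text{(fNCC with } N=64\text{)}

i.e., roughly an 80-fold reduction in the multiplications needed for the cross-channel feature.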

2.2.4 Context-aware filtering

Existing systems for both single- and multi-channel end-to-end separation only make use of the center feature $\mathbf{f}^{i}_{t}$ at time $t$. On the other hand, utilizing context information to improve the modeling of a local frame is common in various systems [24, 25]. iFaSNet uses a context encoder to squeeze the context feature $\mathbf{F}^{i}_{t}$ into a single feature vector $\hat{\mathbf{f}}^{i}_{t}\in\mathbb{R}^{1\times N}$, and $\hat{\mathbf{f}}^{i}_{t}$ is concatenated with the fNCC feature $\hat{\mathbf{q}}^{i}_{t}$ and used as the input to the separation module. The MISO separation module generates $\mathbf{g}^{1}_{t}\in\mathbb{R}^{1\times N}$, a feature vector for the reference channel, and a context decoder receives the concatenation of $\mathbf{F}^{i}_{t}$ and $\mathbf{g}^{1}_{t}$ to decode the filters $[\hat{\mathbf{h}}^{1}_{t-C},\ldots,\hat{\mathbf{h}}^{1}_{t},\ldots,\hat{\mathbf{h}}^{1}_{t+C}]$ for the context, where each estimated filter $\hat{\mathbf{h}}^{1}_{t}$ has the same shape as $\mathbf{f}^{1}_{t}$. The filters are then applied to the encoder outputs, and mean-pooling is applied across the time dimension:

\mathbf{z}^{1}_{t} = \frac{1}{1+2C}\sum_{j=0}^{2C}\mathbf{f}^{1}_{t-C+j}\odot\hat{\mathbf{h}}^{1}_{t-C+j}    (5)

where $\odot$ denotes the Hadamard product. The implicit "filter-and-sum" operation is thus defined over the context. Figure 1 (D) shows the context encoder and decoder.
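A short PyTorch sketch of equation 5 follows; the names are illustrative.

import torch

def context_filter_and_sum(ctx_feats, ctx_filters):
    # ctx_feats:   (1 + 2C, N) encoder outputs f^1 of the reference-channel context
    # ctx_filters: (1 + 2C, N) filters h^1 decoded by the context decoder
    # returns:     (N,) feature z^1: element-wise filtering followed by
    #              mean-pooling over the context frames.
    return (ctx_feats * ctx_filters).mean(dim=0)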

3 Experiment configurations

Table 1: Experiment results with various model configurations. SI-SDRi is reported in dB; each cell lists results for 2 / 4 / 6 microphones.

Model                         | # of param. | Overlap <25%       | 25-50%             | 50-75%            | >75%            | Average
FaSNet                        | 2.9M        | 15.0 / 15.3 / 14.8 | 10.7 / 11.1 / 11.6 | 8.6 / 9.2 / 9.3   | 5.4 / 7.0 / 7.0 | 9.7 / 10.8 / 10.9
+MISO                         | 2.9M        | 14.8 / 15.5 / 15.7 | 10.4 / 11.3 / 11.9 | 8.5 / 9.0 / 9.4   | 5.0 / 6.8 / 7.1 | 9.5 / 10.8 / 11.2
+fNCC                         | 3.0M        | 14.5 / 14.8 / 14.4 | 10.1 / 11.0 / 11.3 | 8.3 / 8.9 / 9.0   | 4.9 / 6.7 / 6.9 | 9.3 / 10.4 / 10.6
+MISO+fNCC                    | 2.9M        | 15.0 / 15.7 / 15.7 | 10.6 / 11.4 / 12.2 | 8.4 / 9.4 / 9.6   | 5.3 / 7.4 / 8.0 | 9.7 / 11.1 / 11.6
+MISO+implicit                | 2.9M        | 14.2 / 14.9 / 15.2 | 9.8 / 10.9 / 11.3  | 7.7 / 8.1 / 8.7   | 4.6 / 5.7 / 6.1 | 8.9 / 10.0 / 10.6
+MISO+implicit+fNCC           | 3.0M        | 15.3 / 16.0 / 16.1 | 10.9 / 11.8 / 12.5 | 8.5 / 9.6 / 10.1  | 5.7 / 7.6 / 8.3 | 9.9 / 11.4 / 12.0
+MISO+implicit+fNCC+context   | 3.0M        | 15.6 / 16.4 / 16.5 | 11.2 / 12.4 / 12.9 | 9.0 / 10.1 / 10.3 | 5.8 / 7.9 / 8.8 | 10.2 / 11.8 / 12.3

3.1 Dataset

We evaluate our approach on a simulated ad-hoc multi-channel two-speaker noisy speech dataset. 20000, 5000, and 3000 4-second utterances are simulated for the training, validation, and test sets, respectively. For each utterance, two speech signals and one noise signal are randomly selected from the 100-hour Librispeech subset [26] and the 100 Nonspeech Corpus [27], respectively. The overlap ratio between the two speakers is uniformly sampled between 0% and 100%, and the two speech signals are shifted accordingly and rescaled to a random relative SNR between 0 and 5 dB. The relative SNR between the sum of the two clean speech signals and the noise is randomly sampled between 10 and 20 dB. The signals are then convolved with room impulse responses simulated by the image method [28] using the gpuRIR toolbox [29]. The length and width of each room are randomly sampled between 3 and 10 meters, and the height is randomly sampled between 2.5 and 4 meters. The reverberation time (T60) is randomly sampled between 0.1 and 0.5 seconds. After convolution, the echoic signals are summed to create the mixture for each microphone. The ad-hoc array dataset contains utterances with 2 to 6 microphones, with an equal number of utterances for each microphone configuration.
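The following is a sketch of the relative-SNR rescaling used when mixing the sources; the helper name and the exact order of operations are illustrative, and the RIR simulation itself is left to gpuRIR as described above.

import numpy as np

def rescale_to_snr(signal, reference, snr_db, eps=1e-8):
    # Scale `signal` so that `reference` is `snr_db` decibels above it in power.
    p_ref = np.mean(reference ** 2) + eps
    p_sig = np.mean(signal ** 2) + eps
    target_p = p_ref / (10 ** (snr_db / 10.0))
    return signal * np.sqrt(target_p / p_sig)

# Illustrative mixing of one utterance (signals already shifted for the overlap ratio):
# spk2 = rescale_to_snr(spk2, spk1, np.random.uniform(0, 5))            # relative SNR of 0-5 dB
# noise = rescale_to_snr(noise, spk1 + spk2, np.random.uniform(10, 20)) # speech-to-noise SNR of 10-20 dB
# the rescaled sources are then convolved with the simulated RIRs and summed per microphone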

3.2 Model configurations

The original FaSNet with the transform-average-concatenate (TAC) module [22] is used as the backbone architecture as well as the baseline. The frame size $L$ and the context size $W$ are both set to 16 ms (256 points), and the overlap ratio between frames is set to 50%. In iFaSNet, we use the same configuration of $L$ and $W$ (with $C=2$ as discussed in Section 2). The architecture for the context encoding and decoding modules has two choices:

1. MLP: $\mathbf{F}^{i}_{t}$ is flattened to a vector of shape $1\times N(1+2C)$, and a multi-layer perceptron (MLP) is used to map the vector into $\hat{\mathbf{f}}^{i}_{t}$. Another MLP is used to map the frame-level concatenation of $\mathbf{g}^{1}_{t}$, the output of the MISO separation module, and each feature in $\mathbf{F}^{i}_{t}$ to estimate the filters for the entire context.

2. RNN: $\mathbf{F}^{i}_{t}$ is passed to an RNN layer followed by a mean-pooling operation across time to generate $\hat{\mathbf{f}}^{i}_{t}$. The output of the MISO separation module $\mathbf{g}^{1}_{t}$ is concatenated to each feature in $\mathbf{F}^{i}_{t}$ and passed to another RNN layer to decode the filters.

We use the RNN configuration, as empirically it leads to better performance than the MLP configuration with the same model size. We use two BLSTM layers for the context encoding and decoding modules, respectively. The implementations of both the original FaSNet and iFaSNet are available online (https://github.com/yluo42/TAC).
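A sketch of the RNN option is given below; the hidden size, the projection layers, and the single-layer depth are hypothetical choices made for illustration rather than the released configuration.

import torch
import torch.nn as nn

class ContextCodec(nn.Module):
    # RNN-based context encoder/decoder (option 2 above), shown with BLSTMs.
    def __init__(self, N=64, hidden=128):
        super().__init__()
        self.enc_rnn = nn.LSTM(N, hidden, batch_first=True, bidirectional=True)
        self.enc_proj = nn.Linear(2 * hidden, N)
        self.dec_rnn = nn.LSTM(2 * N, hidden, batch_first=True, bidirectional=True)
        self.dec_proj = nn.Linear(2 * hidden, N)

    def encode(self, ctx):
        # ctx: (B, 1 + 2C, N) context features -> (B, N) summary feature
        out, _ = self.enc_rnn(ctx)
        return self.enc_proj(out).mean(dim=1)        # mean-pooling across time

    def decode(self, ctx, g):
        # g: (B, N) output of the MISO separation module, concatenated to each
        # frame of the context before decoding the per-frame filters.
        g_rep = g.unsqueeze(1).expand(-1, ctx.size(1), -1)
        out, _ = self.dec_rnn(torch.cat([ctx, g_rep], dim=-1))
        return self.dec_proj(out)                    # (B, 1 + 2C, N) filters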

3.3 Training configurations

All models are trained for 100 epochs with the Adam optimizer [30] with an initial learning rate of 0.001. The learning rate is decayed by 0.98 every two epochs. Signal-to-noise ratio (SNR) is used as the training objective for all models. Gradient clipping with a maximum gradient norm of 5 is always applied for proper convergence of DPRNN-based models. Early stopping is applied when no better validation model is found for 10 consecutive epochs. Auxiliary autoencoding training (A2T) is applied to enhance robustness on this reverberant separation task [31]. No other training tricks or regularization techniques are used.
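For reference, a minimal sketch of these optimization settings follows; the SNR loss and the loop are written from the description above, while `model` and `train_loader` are placeholders and the A2T auxiliary loss is omitted.

import torch

def neg_snr(estimate, target, eps=1e-8):
    # Negative signal-to-noise ratio in dB, minimized during training.
    noise = estimate - target
    snr = 10 * torch.log10((target.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -snr.mean()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.98)
# for epoch in range(100):
#     for mixture, target in train_loader:
#         loss = neg_snr(model(mixture), target)
#         optimizer.zero_grad()
#         loss.backward()
#         torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
#         optimizer.step()
#     scheduler.step()   # decays the learning rate by 0.98 every two epochs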

4 Results and discussions

Table 1 presents the experiment results of the baseline FaSNet and various configurations of iFaSNet. We first note that the results for FaSNet are lower than those reported in [22], which is due to the use of A2T during training. We suspect that this is mainly because the filter-and-sum problem formulation makes A2T training harder and affects the separation performance. We do not dive deeper into the role of the filter-and-sum operation in A2T, as it is beyond the scope of this paper. For the ablation experiments, we first explore the effect of introducing the MISO configuration or replacing the original tNCC with fNCC. We observe that changing the MIMO configuration into the MISO configuration does not harm the overall performance, while using fNCC instead of tNCC in the MIMO setting leads to worse performance. A possible explanation is that fNCC cannot explicitly capture cross-channel delay information, which is important for the MIMO time-domain filter-and-sum operation.

The second set of ablation experiments takes the MISO configuration as the default. Unlike the MIMO+fNCC configuration, the MISO+fNCC configuration is able to improve the performance, especially when the number of available channels is large. Compared with the MISO+tNCC configuration, the improvement is more significant at higher overlap ratios. However, applying implicit filtering together with the tNCC feature leads to a drastic performance degradation. Since implicit filtering without context is equivalent to the "masking" configuration in various existing multi-channel end-to-end systems [14, 15, 16], this result implies that the cross-channel feature needs to be carefully selected to align with the formulation of the separation.

The third ablation experiment combines MISO, implicit filtering, and the fNCC feature, and achieves better performance than any previous configuration. Compared with the MISO+implicit configuration, this result indicates that fNCC is a suitable cross-channel feature for the implicit filtering approach. The ablation experiments end with adding context-aware filtering to the system, which results in our final design for iFaSNet. Note that the context size is set to $C=2$ as mentioned in Section 2.2.4, and the effect of other context sizes and window sizes is left for future work.

5 Conclusion

In this paper, we explored ways to improve a previously proposed end-to-end multi-channel speech separation system, the filter-and-sum network (FaSNet). We considered four aspects to modify the original model: the multi-input-single-output (MISO) problem formulation, implicit filtering in a latent space, the feature-level normalized cross correlation feature for cross-channel information, and the context-aware filtering operation. We named our modified FaSNet the implicit FaSNet (iFaSNet). Ablation experiment results showed that iFaSNet combining these four modifications leads to significant performance improvements across various configurations.

6 Acknowledgments

This work was funded by a grant from the National Institute of Health, NIDCD, DC014279; a National Science Foundation CAREER Award; and the Pew Charitable Trusts.

References

  • [1] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on.   IEEE, 2016, pp. 196–200.
  • [2] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks.” in Proc. Interspeech, 2016, pp. 1981–1985.
  • [3] X. Xiao, S. Zhao, D. L. Jones, E. S. Chng, and H. Li, “On time-frequency mask estimation for mvdr beamforming with application in robust speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.   IEEE, 2017, pp. 3246–3250.
  • [4] T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, and X. Xiao, “Unified architecture for multichannel end-to-end speech recognition with neural beamforming,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1274–1288, 2017.
  • [5] X. Zhang, Z.-Q. Wang, and D. Wang, “A speech enhancement algorithm by iterating single-and multi-microphone processing and its application to robust ASR,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.   IEEE, 2017, pp. 276–280.
  • [6] J. Heymann, M. Bacchiani, and T. N. Sainath, “Performance of mask based statistical beamforming in a smart home scenario,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on.   IEEE, 2018, pp. 6722–6726.
  • [7] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florencio, and M. Hasegawa-Johnson, “Deep learning based speech beamforming,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on.   IEEE, 2018, pp. 5389–5393.
  • [8] Y. Xu, C. Weng, L. Hui, J. Liu, M. Yu, D. Su, and D. Yu, “Joint training of complex ratio mask based beamformer and acoustic model for noise robust asr,” in Acoustics, Speech and Signal Processing (ICASSP), 2019 IEEE International Conference on.   IEEE, 2019, pp. 6745–6749.
  • [9] T. Ochiai, M. Delcroix, R. Ikeshita, K. Kinoshita, T. Nakatani, and S. Araki, “Beam-TasNet: Time-domain audio separation network meets frequency-domain beamformer,” in Acoustics, Speech and Signal Processing (ICASSP), 2020 IEEE International Conference on.   IEEE, 2020, pp. 6384–6388.
  • [10] Y. Xu, M. Yu, S.-X. Zhang, L. Chen, C. Weng, J. Liu, and D. Yu, “Neural spatio-temporal beamformer for target speech separation,” arXiv preprint arXiv:2005.03889, 2020.
  • [11] Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, “ADL-MVDR: All deep learning mvdr beamformer for target speech separation,” arXiv preprint arXiv:2008.06994, 2020.
  • [12] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, “Online mvdr beamformer based on complex gaussian mixture model with spatial prior for noise robust asr,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 25, no. 4, pp. 780–793, 2017.
  • [13] T. Higuchi, K. Kinoshita, N. Ito, S. Karita, and T. Nakatani, “Frame-by-frame closed-form update for mask-based adaptive mvdr beamforming,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on.   IEEE, 2018, pp. 531–535.
  • [14] R. Gu, J. Wu, S.-X. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “End-to-end multi-channel speech separation,” arXiv preprint arXiv:1905.06286, 2019.
  • [15] R. Gu and Y. Zou, “Temporal-spatial neural filter: Direction informed end-to-end multi-channel target speech separation,” arXiv preprint arXiv:2001.00391, 2020.
  • [16] R. Gu, S.-X. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “Enhancing end-to-end multi-channel speech separation via spatial feature learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2020 IEEE International Conference on.   IEEE, 2020, pp. 7319–7323.
  • [17] Y. Koyama, O. Azeez, and B. Raj, “Efficient integration of multi-channel information for speaker-independent speech separation,” arXiv preprint arXiv:2005.11612, 2020.
  • [18] C. Fan, B. Liu, J. Tao, J. Yi, and Z. Wen, “Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features,” arXiv preprint arXiv:2002.01626, 2020.
  • [19] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
  • [20] C.-L. Liu, S.-W. Fu, Y.-J. Li, J.-W. Huang, H.-M. Wang, and Y. Tsao, “Multichannel speech enhancement by raw waveform-mapping using fully convolutional networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 28, pp. 1888–1900, 2020.
  • [21] Y. Luo, C. Han, N. Mesgarani, E. Ceolini, and S.-C. Liu, “FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing,” in Automatic Speech Recognition and Understanding (ASRU), 2019 IEEE Workshop on.   IEEE, 2019, pp. 260–267.
  • [22] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2020 IEEE International Conference on.   IEEE, 2020, pp. 6394–6398.
  • [23] Y. Luo, Z. Chen, and T. Yoshioka, “Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2020 IEEE International Conference on.   IEEE, 2020, pp. 46–50.
  • [24] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19, 2014.
  • [25] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.   IEEE, 2017, pp. 241–245.
  • [26] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on.   IEEE, 2015, pp. 5206–5210.
  • [27] G. Hu, “100 Nonspeech Sounds,” http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html.
  • [28] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
  • [29] D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpuRIR: A python library for room impulse response simulation with gpu acceleration,” Multimedia Tools and Applications, pp. 1–19, 2020.
  • [30] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [31] Y. Luo, C. Han, and N. Mesgarani, “Distortion-controlled training for end-to-end reverberant speech separation with auxiliary autoencoding loss,” in 2021 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2021.