All-neural beamformer for continuous speech separation
Abstract
Continuous speech separation (CSS) aims to separate overlapping voices from a continuous stream of conversational audio containing an unknown number of utterances spoken by an unknown number of speakers. A common application scenario is transcribing a meeting conversation recorded by a microphone array. Prior studies explored various deep learning models for time-frequency mask estimation, followed by a minimum variance distortionless response (MVDR) filter to improve the automatic speech recognition (ASR) accuracy. The performance of these methods is fundamentally upper-bounded by MVDR’s spatial selectivity. Recently, the all deep learning MVDR (ADL-MVDR) model was proposed for neural beamforming and demonstrated superior performance in a target speech extraction task using pre-segmented input. In this paper, we further adapt ADL-MVDR to the CSS task with several enhancements to enable end-to-end neural beamforming. The proposed system achieves significant word error rate reduction over a baseline spectral masking system on the LibriCSS dataset. Moreover, the proposed neural beamformer is shown to be comparable to a state-of-the-art MVDR-based system in real meeting transcription tasks, including AMI, while showing potential to further simplify the runtime implementation and reduce the system latency with frame-wise processing.
Index Terms— Continuous speech separation, LibriCSS, AMI, automatic speech recognition, ADL-MVDR
1 Introduction
Undesirable background noises or interfering speakers often contaminate speech in daily communications. This poses a significant challenge for current automatic speech recognition (ASR) systems, as they are typically designed for scenarios where at most one person is speaking at a time. Speech separation algorithms have been proposed to address this issue by separating different speaker sources from a mixture signal. They serve as important front-ends for various speech communication systems, including ASR [1, 2], meeting transcription [3], and digital hearing-aid devices [4].
With the recent advancements in deep learning, several data-driven speech separation algorithms have been proposed [5, 6, 7], yielding improved speech quality and intelligibility. These include time-frequency (T-F) mask-based systems [8, 9] as well as time-domain end-to-end systems such as TasNet [10], Conv-TasNet [11], and Wave-U-Net [12]. However, these purely deep learning-based systems focus on removing undesired interfering sources without constraints that limit the solution space. This often results in non-linear distortions in the separated speech that are harmful to current ASR systems [13].
The minimum variance distortionless response (MVDR) filter has been widely adopted to address this non-linear distortion issue. It is often combined with a neural network that estimates the speech and noise components for deriving the beamforming filter weights. However, the mathematically derived MVDR solution is not straightforward to optimize end-to-end due to numerical instability [14]. For the same reason, it is also challenging to perform adaptive beamforming stably on a frame-by-frame basis, so the MVDR filter is usually applied on a per-segment basis. Recently, all deep learning MVDR (ADL-MVDR) [15, 16] was proposed for neural frame-adaptive beamforming and demonstrated superior separated-signal quality and ASR accuracy in a target speech extraction task on pre-segmented speech mixtures. It incorporates two separate gated recurrent unit (GRU) [17] based networks to replace the matrix operations (e.g., matrix inversion) involved in the conventional MVDR solution, which bypasses the numerical instability issue and makes end-to-end training more feasible.
In this paper, we extend ADL-MVDR to continuous speech separation (CSS) to enable frame-wise neural beamforming. Unlike most existing studies, which convert a pre-segmented audio mixture into per-speaker separated speech, a CSS system converts long-form unsegmented audio containing an unknown number of speakers into a few (two in our experiments) audio streams, each of which contains overlap-free signals [18]. This CSS design enables us to handle overlapping speech of an unknown number of speakers with low latency, which is preferable for many real applications such as meeting transcription. We introduce several enhancements to make ADL-MVDR effective in the CSS setting, including steering vector normalization, the use of a voice activity detection (VAD) network, a positive semi-definite constraint on the estimated matrix inverse, and a residual connection. The proposed neural beamformer is first evaluated on the LibriCSS [19] dataset, which consists of long-form multi-talker real recordings generated by concatenating and mixing utterances from the LibriSpeech dataset [20]. We further compare our systems on several real meeting recordings (including the AMI corpus [21] and Microsoft internal meetings), where the proposed neural beamformer is shown to be comparable to a state-of-the-art MVDR-based system and demonstrates potential for reducing system latency with frame-wise beamforming.
2 Technical Background
2.1 Continuous Speech Separation
Most of the existing speech separation algorithms operate on pre-segmented mixtures by assuming an ideal overlap detector to be available. However, in real scenarios, speech separation systems need to deal with a continuous audio stream consisting of multiple speakers, which can be hours long and include both overlapped and non-overlapped utterances. More recently, the CSS scheme [22] has been proposed, which is defined as the process of generating a limited number of overlap-free signals from the continuous audio stream. To deal with the long input signals, we adopt the chunk-wise CSS scheme proposed in [23]. As illustrated in Fig. 1, a sliding window is used which contains three sub-windows: the history sub-window ($N_h$ frames), the current sub-window ($N_c$ frames), and the future sub-window ($N_f$ frames). At each step, the window is moved forward by $N_c$ frames. During test-time evaluation, the speech separation algorithm takes in the entire chunk (i.e., $N_h + N_c + N_f$ frames) to estimate overlap-free signals for the current $N_c$ frames. The estimated signals of consecutive chunks are then aligned via block stitching [19].
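As an illustration of this chunk-wise processing, the following is a minimal single-channel NumPy sketch (not the authors' implementation). The `separate_chunk` callable is a hypothetical stand-in for the separation network, and the 1.2 s / 0.8 s / 0.4 s sub-window lengths follow the setting reported in Section 4.2; a full system would additionally resolve the output-stream permutation between adjacent chunks before stitching.

```python
import numpy as np

def chunkwise_css(mixture, separate_chunk, sr=16000,
                  hist_s=1.2, curr_s=0.8, futr_s=0.4):
    """Slide a [history | current | future] window over a long mixture and
    stitch the per-chunk estimates for the current sub-window back together.
    `separate_chunk` is a hypothetical callable: (chunk,) -> (n_out, chunk_len)."""
    n_h, n_c, n_f = int(hist_s * sr), int(curr_s * sr), int(futr_s * sr)
    win = n_h + n_c + n_f
    # Pad so that every current sub-window is fully covered by a window.
    padded = np.pad(mixture, (n_h, win))
    outputs = None
    for start in range(0, len(mixture), n_c):
        chunk = padded[start:start + win]
        est = separate_chunk(chunk)                  # (n_out, win)
        if outputs is None:
            outputs = np.zeros((est.shape[0], len(mixture)))
        # Keep only the current sub-window; in practice the output streams of
        # adjacent chunks must be aligned (block stitching) before this step.
        cur = est[:, n_h:n_h + n_c]
        end = min(start + n_c, len(mixture))
        outputs[:, start:end] = cur[:, :end - start]
    return outputs

# Toy usage with an identity "separator" that returns two copies of the input.
if __name__ == "__main__":
    mix = np.random.randn(16000 * 5)                 # 5 s of noise as a stand-in
    outs = chunkwise_css(mix, lambda c: np.stack([c, c]))
    print(outs.shape)                                # (2, 80000)
```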

2.2 ADL-MVDR for Target Speech Separation
The MVDR filter aims to preserve the information from the target direction while minimizing the power of interfering sources. Specifically, the MVDR filter is defined as
$$\mathbf{w}_{\mathrm{MVDR}} = \operatorname*{arg\,min}_{\mathbf{w}} \; \mathbf{w}^{\mathsf{H}} \mathbf{\Phi}_{\mathbf{NN}} \mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{\mathsf{H}} \mathbf{v} = 1, \qquad (1)$$

where $\mathbf{w}$ is the vector of MVDR filter weights, superscript $\mathsf{H}$ denotes the Hermitian transpose, and $\mathbf{\Phi}_{\mathbf{NN}}$ is the covariance matrix of the interfering sources (background noise and/or interfering speakers). The steering vector $\mathbf{v}$ can be approximated by extracting the principal eigenvector of the speech covariance matrix $\mathbf{\Phi}_{\mathbf{SS}}$ [24], i.e., $\mathbf{v} = \mathcal{P}\{\mathbf{\Phi}_{\mathbf{SS}}\}$. Solving Eq. (1) yields [25, 26]

$$\mathbf{w}_{\mathrm{MVDR}} = \frac{\mathbf{\Phi}_{\mathbf{NN}}^{-1} \mathbf{v}}{\mathbf{v}^{\mathsf{H}} \mathbf{\Phi}_{\mathbf{NN}}^{-1} \mathbf{v}}. \qquad (2)$$
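For reference, here is a minimal NumPy sketch of the conventional chunk-wise MVDR computation of Eq. (2), with the steering vector taken as the principal eigenvector of the speech covariance matrix. The variable names are ours, and the diagonal loading term is our addition for numerical stability rather than part of Eq. (2).

```python
import numpy as np

def mvdr_weights(phi_ss, phi_nn, load=1e-6):
    """Per-frequency MVDR weights from chunk-level covariance matrices.

    phi_ss, phi_nn: (F, C, C) complex speech / noise spatial covariance matrices.
    Returns w: (F, C) complex beamforming weights.
    """
    n_freq, n_ch, _ = phi_ss.shape
    w = np.zeros((n_freq, n_ch), dtype=complex)
    eye = np.eye(n_ch)
    for f in range(n_freq):
        # Steering vector: principal eigenvector of the speech covariance.
        _, eigvecs = np.linalg.eigh(phi_ss[f])
        v = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
        # Diagonal loading keeps the matrix inverse well conditioned.
        phi_nn_inv = np.linalg.inv(phi_nn[f] + load * eye)
        num = phi_nn_inv @ v
        w[f] = num / (np.conj(v) @ num + 1e-10)   # Eq. (2): Phi^{-1} v / (v^H Phi^{-1} v)
    return w

def apply_beamformer(w, Y):
    """Beamformed STFT: S_hat(t, f) = w(f)^H Y(t, f), with Y of shape (T, F, C)."""
    return np.einsum("fc,tfc->tf", np.conj(w), Y)
```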
Most existing studies estimate the covariance matrices in a chunk-wise fashion, which limits the flexibility and adaptability of the MVDR filter.
ADL-MVDR has been recently proposed as a fully neural network-based beamformer and has demonstrated superior performance in a target speech extraction task [15, 16]. The core idea of ADL-MVDR is to use two separate GRU-based networks (denoted as GRU-Nets) to replace the matrix inversion and principal eigenvector extraction involved in the conventional MVDR solution. Each GRU-Net takes in the speech or noise covariance matrix and estimates the steering vector or the inverse covariance matrix on a per-frame basis. Note that, unlike conventional MVDR systems, which sum or average over an entire chunk to derive the covariance matrices, ADL-MVDR can leverage the temporal information contained in the frame-wise covariance matrices.
3 Proposed System

The overall framework of our proposed system is depicted in Fig. 2. The T-F mask estimator first predicts three T-F masks for the CSS task: two speaker masks and an isotropic noise mask. ADL-MVDR then estimates time-varying beamforming weights from the estimated masks. In parallel, a GRU-Net-based VAD network estimates frame-level gains that are multiplied with the ADL-MVDR output. Finally, the output audio is obtained with an additional residual connection from the T-F masked speech signal. During training, the entire system is optimized with the permutation invariant training (PIT) scheme [27].
3.1 ADL-MVDR for CSS
In the CSS scheme, at most $I$ speech sources are assumed to be active within each chunk. Here, we describe the proposed method with $I = 2$. The normalized time-varying input covariance matrices for the speaker and interfering noise sources can be derived as

$$\mathbf{\Phi}_{\mathbf{SS}_i}(t,f) = \frac{\hat{\mathbf{S}}_i(t,f)\,\hat{\mathbf{S}}_i^{\mathsf{H}}(t,f)}{\sum_{t'=1}^{T}\hat{M}_{S_i}^{\mathsf{H}}(t',f)\,\hat{M}_{S_i}(t',f)}, \qquad \mathbf{\Phi}_{\mathbf{NN}_i}(t,f) = \frac{\hat{\mathbf{N}}_i(t,f)\,\hat{\mathbf{N}}_i^{\mathsf{H}}(t,f)}{\sum_{t'=1}^{T}\hat{M}_{N_i}^{\mathsf{H}}(t',f)\,\hat{M}_{N_i}(t',f)}, \qquad (3)$$

where $\mathbf{\Phi}_{\mathbf{SS}_i}(t,f)$ is the instantaneous speech covariance matrix for speaker source $i$, and $t$ and $f$ represent the time and frequency indices. $\hat{\mathbf{S}}_i(t,f) = \hat{M}_{S_i}(t,f)\,\mathbf{Y}(t,f)$ is the masked speech for source $i$, where $\mathbf{Y}(t,f)$ is the multi-channel noisy speech and $\hat{M}_{S_i}(t,f)$ denotes the T-F speech mask for source $i$. $T$ is the total number of frames. Similarly, $\mathbf{\Phi}_{\mathbf{NN}_i}(t,f)$ is the interfering source covariance matrix for source $i$. The interfering source consists of two parts (i.e., ambient noise and the interfering speaker) as $\hat{\mathbf{N}}_i(t,f) = \hat{M}_{N}(t,f)\,\mathbf{Y}(t,f) + \hat{\mathbf{S}}_j(t,f)$ with $j \neq i$, where $\hat{M}_{N}(t,f)$ is the isotropic noise mask. Similarly we have $\hat{M}_{N_i}(t,f) = \hat{M}_{N}(t,f) + \hat{M}_{S_j}(t,f)$. Next, the time-varying variables corresponding to the steering vector and the inverse noise covariance matrix are estimated using two separate GRU-Nets as

$$\hat{\mathbf{v}}_i(t,f) = \text{GRU-Net}_{\mathbf{v}}\!\left(\mathbf{\Phi}_{\mathbf{SS}_i}(t,f)\right), \qquad \widehat{\mathbf{\Phi}_{\mathbf{NN}_i}^{-1}}(t,f) = \text{GRU-Net}_{\mathbf{\Phi}}\!\left(\mathbf{\Phi}_{\mathbf{NN}_i}(t,f)\right). \qquad (4)$$

Once these time-varying coefficients are obtained, the beamforming weights $\mathbf{w}_i(t) \in \mathbb{C}^{F \times C}$ ($F$ and $C$ represent the frequency and channel dimensions, respectively) can be derived on a per-frame basis by plugging the estimated terms into Eq. (2). Finally, the ADL-MVDR filtered speech for source $i$ can be obtained as

$$\hat{S}_i^{\text{ADL}}(t,f) = \mathbf{w}_i^{\mathsf{H}}(t,f)\,\mathbf{Y}(t,f). \qquad (5)$$
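To make the frame-wise quantities concrete, here is an illustrative NumPy sketch of Eqs. (3) and (5) under the notation above. The GRU-Nets of Eq. (4) are represented by placeholder callables, and the tensor layout is our assumption rather than the authors' exact implementation.

```python
import numpy as np

def instantaneous_covariances(Y, mask, eps=1e-10):
    """Eq. (3): per-frame spatial covariance of the masked speech.

    Y:    (T, F, C) multi-channel STFT of the mixture.
    mask: (T, F) T-F mask for one source.
    Returns phi: (T, F, C, C) normalized instantaneous covariance matrices.
    """
    S = mask[..., None] * Y                                  # masked speech (T, F, C)
    phi = S[..., :, None] * np.conj(S[..., None, :])         # outer product per (t, f)
    norm = np.sum(np.abs(mask) ** 2, axis=0, keepdims=True)  # sum over the T frames
    return phi / (norm[..., None, None] + eps)

def adl_mvdr_frame(phi_ss, phi_nn, gru_net_v, gru_net_phi, Y):
    """Eqs. (4)-(5): per-frame steering vector / inverse covariance from the
    (placeholder) GRU-Nets, plugged into Eq. (2) and applied to the mixture."""
    v = gru_net_v(phi_ss)                                    # (T, F, C) complex
    phi_nn_inv = gru_net_phi(phi_nn)                         # (T, F, C, C) complex
    num = np.einsum("tfcd,tfd->tfc", phi_nn_inv, v)          # Phi^{-1} v
    den = np.einsum("tfc,tfc->tf", np.conj(v), num)[..., None]  # v^H Phi^{-1} v
    w = num / (den + 1e-10)                                  # (T, F, C) weights
    return np.einsum("tfc,tfc->tf", np.conj(w), Y)           # Eq. (5): w^H Y
```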

3.2 Enhancements to ADL-MVDR
To further enhance the performance of ADL-MVDR in the CSS scheme, we propose several techniques. Firstly, we introduce constraints on the estimated steering vector and the estimated inverse covariance matrix. Specifically, we normalize the estimated time-varying steering vector as $\bar{\mathbf{v}}_i(t,f) = \hat{\mathbf{v}}_i(t,f) / \lVert \hat{\mathbf{v}}_i(t,f) \rVert$. A positive semi-definite constraint is also imposed on the estimated inverse of the interfering noise covariance matrix. This is done by modifying the GRU-Net to estimate an upper triangular matrix $\mathbf{U}_i(t,f)$, from which the inverse matrix is calculated as $\widehat{\mathbf{\Phi}_{\mathbf{NN}_i}^{-1}}(t,f) = \mathbf{U}_i^{\mathsf{H}}(t,f)\,\mathbf{U}_i(t,f)$.
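Both constraints are cheap to apply at inference time. The minimal sketch below mirrors the description above; the unit-norm normalization and the tensor shapes are our assumptions.

```python
import numpy as np

def normalize_steering(v, eps=1e-10):
    """Unit-norm constraint on the estimated steering vector v: (T, F, C)."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def psd_inverse_from_triangular(u):
    """Positive semi-definite inverse covariance built from an upper triangular
    matrix U of shape (T, F, C, C): Phi_inv = U^H U per (t, f)."""
    u = np.triu(u)                                         # keep only the upper triangle
    return np.einsum("tfij,tfik->tfjk", np.conj(u), u)     # U^H U
```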
Secondly, as mentioned at the beginning of this section, we introduce the VAD network to control the gain of the output:

$$\hat{g}_i(t) = \text{GRU-Net}_{\text{VAD}}\!\left(\mathbf{\Phi}_{\mathbf{SS}_i}(t)\right), \qquad \hat{S}_i^{\text{VAD}}(t,f) = \hat{g}_i(t)\,\hat{S}_i^{\text{ADL}}(t,f), \qquad (6)$$

where $\hat{S}_i^{\text{VAD}}(t,f)$ is the VAD filtered speech for source $i$ and $\hat{g}_i(t)$ denotes the estimated frame-level VAD weight. Finally, the separated speech is obtained with a residual connection as

$$\hat{S}_i(t,f) = \hat{S}_i^{\text{VAD}}(t,f) + \alpha\,\hat{M}_{S_i}(t,f)\,Y_1(t,f), \qquad (7)$$

where $\alpha$ is a weighting factor, and the first channel of the masked speech, $\hat{M}_{S_i}(t,f)\,Y_1(t,f)$, is used for the residual connection.
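The gain control and residual path reduce to an elementwise combination. The sketch below assumes the shapes used in the earlier sketches, with $\alpha = 0.5$ as reported in Section 4.2.

```python
import numpy as np

def apply_vad_and_residual(s_bf, vad_gain, mask, Y_ref, alpha=0.5):
    """Eqs. (6)-(7): scale the beamformed output by the frame-level VAD gain and
    add a scaled residual from the masked reference-channel signal.

    s_bf:     (T, F) ADL-MVDR output for one source.
    vad_gain: (T,)   frame-level VAD weights.
    mask:     (T, F) T-F mask for the same source.
    Y_ref:    (T, F) first (reference) channel of the noisy mixture STFT.
    """
    s_vad = vad_gain[:, None] * s_bf              # Eq. (6)
    return s_vad + alpha * mask * Y_ref           # Eq. (7)
```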
3.3 Mask Estimator and GRU-Nets
We use the T-F mask estimation model proposed in [28]. The multi-channel input speech is first encoded by a shared encoding block (based on conformer layers [29, 7]) that extracts intra-channel features independently for each channel, followed by a stack of geometry-agnostic modules. Within each geometry-agnostic module, a transform-average-concatenate (TAC) block [6] and another shared conformer block alternately encode the inter- and intra-channel information. Finally, the encoded features are pooled by averaging across the channel dimension and fed into another conformer block that estimates the three T-F masks (i.e., two speakers and one noise [23]).
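The PyTorch sketch below illustrates the overall data flow of such a geometry-agnostic mask estimator. It is a simplified stand-in rather than the model of [28]: standard transformer encoder layers replace the conformer layers, the TAC block is reduced to its basic transform-average-concatenate form, and the layer counts and dimensions are illustrative only.

```python
import torch
import torch.nn as nn

class TAC(nn.Module):
    """Simplified transform-average-concatenate block: each channel is
    transformed, the channel average is concatenated back, then projected."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(dim, dim), nn.PReLU())
        self.merge = nn.Sequential(nn.Linear(2 * dim, dim), nn.PReLU())

    def forward(self, x):                        # x: (B, C, T, D)
        z = self.transform(x)
        avg = z.mean(dim=1, keepdim=True).expand_as(z)
        return x + self.merge(torch.cat([z, avg], dim=-1))

class GeometryAgnosticMaskEstimator(nn.Module):
    """Shared per-channel encoder, alternating TAC / shared encoder blocks,
    channel-average pooling, and a final block emitting three real-valued
    (sigmoid) masks: two speakers plus noise."""
    def __init__(self, n_freq=257, dim=256, n_masks=3):
        super().__init__()
        self.proj = nn.Linear(n_freq, dim)
        make_layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.intra = nn.ModuleList([make_layer() for _ in range(3)])
        self.tac = nn.ModuleList([TAC(dim) for _ in range(2)])
        self.post = make_layer()
        self.head = nn.Linear(dim, n_freq * n_masks)
        self.n_masks, self.n_freq = n_masks, n_freq

    def forward(self, feat):                     # feat: (B, C, T, F) magnitude features
        B, C, T, _ = feat.shape
        x = self.proj(feat)                      # (B, C, T, D)
        x = self.intra[0](x.reshape(B * C, T, -1)).reshape(B, C, T, -1)
        for tac, enc in zip(self.tac, self.intra[1:]):
            x = tac(x)                           # inter-channel modeling
            x = enc(x.reshape(B * C, T, -1)).reshape(B, C, T, -1)  # intra-channel
        x = x.mean(dim=1)                        # pool across channels -> (B, T, D)
        x = self.post(x)
        masks = torch.sigmoid(self.head(x))      # (B, T, F * n_masks)
        return masks.reshape(B, T, self.n_masks, self.n_freq)
```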
Each GRU-Net consists of two unidirectional GRU layers followed by a feed-forward layer (FFL), as illustrated in Fig. 3. For frame-wise coefficient estimation, the GRU-Net takes in the covariance matrix derived following Eq. (3), with the real and imaginary parts concatenated as input. We use the same architecture for both the ADL-MVDR and VAD networks, differing only in the number of units in each layer.
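A PyTorch sketch of this GRU-Net is given below (our own re-implementation, not the authors' code). The layer sizes in the usage comments follow the 7-channel configuration in Section 4.2, and the handling of the frequency dimension (folded into the batch dimension) and the complex output reshaping are our assumptions.

```python
import torch
import torch.nn as nn

class GRUNet(nn.Module):
    """Two unidirectional GRU layers followed by a feed-forward layer (FFL).
    The input is the flattened instantaneous covariance matrix with real and
    imaginary parts concatenated; the output is reinterpreted as a complex
    tensor (steering vector, inverse covariance) or a scalar VAD gain."""
    def __init__(self, n_ch=7, hidden=(200, 100), out_units=14, activation=None):
        super().__init__()
        in_dim = 2 * n_ch * n_ch                       # Re/Im of a C x C matrix
        self.gru1 = nn.GRU(in_dim, hidden[0], batch_first=True)
        self.gru2 = nn.GRU(hidden[0], hidden[1], batch_first=True)
        self.ffl = nn.Linear(hidden[1], out_units)
        self.activation = activation                   # e.g. nn.ReLU() for VAD

    def forward(self, phi):                            # phi: (B, T, C, C) complex;
        # frequencies are assumed to be folded into the batch dimension B.
        x = torch.cat([phi.real, phi.imag], dim=-1)    # (B, T, C, 2C)
        x = x.flatten(start_dim=2)                     # (B, T, 2*C*C)
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)
        x = self.ffl(x)                                # (B, T, out_units)
        if self.activation is not None:
            return self.activation(x)
        half = x.shape[-1] // 2                        # interpret as complex output
        return torch.complex(x[..., :half], x[..., half:])

# Example configurations following Section 4.2 (7 channels):
#   steering vector : GRUNet(hidden=(200, 100), out_units=14)  -> (B, T, 7) complex
#   inverse cov.    : GRUNet(hidden=(200, 200), out_units=98)  -> (B, T, 49) complex
#   VAD gain        : GRUNet(hidden=(200, 200), out_units=1, activation=nn.ReLU())
```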
4 Experimental Setup
4.1 Datasets
Our training set contains 219 hours of randomly mixed and reverberated utterances from WSJ1 SI-284 [30]. We simulated the multi-channel mixtures by randomly picking audio from one or two speakers and convolving each with a 7-channel room impulse response simulated with the image method [31]. The reverberated signals were then mixed with a source energy ratio between -5 and 5 dB. Simulated isotropic noise was added at a 0-10 dB signal-to-noise ratio (SNR). Noise samples from MUSAN [32] were also reverberated and added at an SNR between -5 and 10 dB. The average overlap ratio of the training set was about 50%.
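A rough sketch of this simulation recipe is given below (our own illustration, not the actual data pipeline). The room impulse responses and isotropic noise are assumed to be precomputed arrays, and the MUSAN noise addition step is omitted for brevity.

```python
import numpy as np

def scale_to_ratio(target, reference, ratio_db):
    """Scale `target` so that `reference` is ratio_db louder than the scaled target."""
    gain = np.sqrt(np.sum(reference ** 2) / (np.sum(target ** 2) + 1e-10)
                   * 10 ** (-ratio_db / 10))
    return target * gain

def simulate_mixture(sources, rirs, iso_noise, rng=None):
    """Reverberate 1-2 dry sources with multi-channel RIRs and add isotropic noise.

    sources:   list of 1-D dry utterances (1 or 2 entries).
    rirs:      list of (C, L) room impulse responses, one per source.
    iso_noise: (C, N) simulated isotropic noise, N >= mixture length.
    """
    rng = rng or np.random.default_rng()
    reverbed = []
    for src, rir in zip(sources, rirs):
        # Convolve the dry source with each channel's impulse response.
        reverbed.append(np.stack([np.convolve(src, h)[:len(src)] for h in rir]))
    n = min(r.shape[1] for r in reverbed)            # truncate to the shorter source
    mix = reverbed[0][:, :n]
    if len(reverbed) > 1:
        ser = rng.uniform(-5, 5)                     # source energy ratio in [-5, 5] dB
        mix = mix + scale_to_ratio(reverbed[1][:, :n], mix, ser)
    snr = rng.uniform(0, 10)                         # isotropic noise SNR in [0, 10] dB
    return mix + scale_to_ratio(iso_noise[:, :n], mix, snr)
```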
LibriCSS [19] was used for the first experiment, which consists of 10 hours of 7-channel recordings. Sound sources were generated by mixing LibriSpeech utterances [20]. They were played back in a real meeting room and recorded by a 7-channel microphone array. Several overlap ratios were included, ranging from 0 to 40%. There were two subsets for the 0% overlap condition: one with short (S) inter-utterance silence and one with long (L) inter-utterance silence.
To evaluate the performance of the proposed neural beamformer in more realistic and challenging environments, we also carried out experiments using real meeting datasets, namely the AMI corpus [21] and a Microsoft internal meeting dataset, dubbed MS. The AMI recordings were made with an 8-channel circular microphone array, while the MS data were collected with the 7-channel array used in [19]. MS contained 60 sessions in total with various numbers of speakers per session. To obtain the transcriptions, we adopted a modified version of the conversation transcription system described in [3] with a hybrid ASR model.
Table 1: WER (%) on LibriCSS for different overlap ratios (0L and 0S denote the 0% overlap condition with long and short inter-utterance silence). Each cell shows the result with the real-valued / complex-valued mask estimator.

| Sys. ID | Loss type | Beamformer | Norm. | Pos. semi-definite | VAD | Res. conn. | 0L | 0S | 10 | 20 | 30 | 40 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Baseline spectral masking systems* | | | | | | | | | | | | | |
| 0 | Mag. | N/A | N/A | N/A | N/A | N/A | 6.1/6.6 | 6.9/7.3 | 8.6/9.8 | 11.4/12.0 | 14.7/15.3 | 15.8/16.2 | 11.1/11.7 |
| 1 | Log-mel | N/A | N/A | N/A | N/A | N/A | 6.4/6.8 | 6.9/7.2 | 9.2/9.6 | 11.9/12.1 | 15.1/15.1 | 16.7/17.1 | 11.6/11.9 |
| *Proposed neural beamformer systems* | | | | | | | | | | | | | |
| 2 | Mag. | ADL-MVDR | ✗ | ✗ | ✗ | ✗ | 9.1/9.4 | 9.6/8.9 | 11.5/11.3 | 13.5/14.4 | 16.7/16.3 | 18.5/17.8 | 13.7/13.5 |
| 3 | Mag. | ADL-MVDR | ✗ | ✗ | ✓ | ✗ | 6.1/6.3 | 6.8/6.6 | 9.1/9.0 | 11.5/11.8 | 14.6/14.1 | 16.1/15.7 | 11.3/11.1 |
| 4 | Mag. | ADL-MVDR | ✓ | ✗ | ✓ | ✗ | 5.9/6.1 | 6.5/6.7 | 9.0/9.0 | 11.3/11.1 | 13.9/13.6 | 15.4/15.0 | 10.9/10.7 |
| 5 | Mag. | ADL-MVDR | ✓ | ✗ | ✓ | ✓ | 6.1/6.1 | 6.7/6.4 | 8.7/9.3 | 10.5/11.8 | 13.2/14.2 | 14.7/15.3 | 10.5/11.0 |
| 6 | Mag. | ADL-MVDR | ✓ | ✓ | ✓ | ✓ | 6.3/6.1 | 6.5/6.4 | 8.8/9.2 | 11.2/11.6 | 13.4/14.0 | 15.1/15.4 | 10.7/11.0 |
| 7 | Log-mel | ADL-MVDR | ✗ | ✗ | ✓ | ✗ | 6.0/5.8 | 6.5/6.4 | 9.3/8.6 | 11.5/10.8 | 14.1/13.2 | 15.6/15.1 | 11.0/10.5 |
| 8 | Log-mel | ADL-MVDR | ✓ | ✗ | ✓ | ✗ | 6.0/6.4 | 6.3/6.4 | 8.7/8.4 | 10.7/11.1 | 12.7/13.1 | 14.5/15.0 | 10.3/10.5 |
| 9 | Log-mel | ADL-MVDR | ✓ | ✗ | ✓ | ✓ | 6.0/6.1 | 6.6/6.4 | 8.6/8.8 | 10.9/11.4 | 12.7/13.8 | 14.0/15.1 | 10.2/10.8 |
| 10 | Log-mel | ADL-MVDR | ✓ | ✓ | ✓ | ✓ | 6.0/6.0 | 6.3/6.0 | 8.5/8.6 | 10.5/11.1 | 12.5/13.0 | 14.1/14.3 | 10.1/10.3 |
4.2 System Configurations
We considered two different geometry agnostic T-F mask estimation models as the baseline spectral masking systems: one predicting real-valued masks using sigmoid activation and the other estimating complex-valued masks. We used the mask estimation model from our recent paper [28]. The model consists of nine consecutive conformer layers interleaved with two TAC layers, followed by mean pooling and another stack of six conformer layers. Each conformer layer used four attention heads with 64 dimensions and a convolution kernel size of 33. See [28] for further details. (Note that our model setting differs from the one used in [28], and thus the numbers cannot be directly compared.)
For the ADL-MVDR component, we used 200 and 100 units in the two recurrent layers of GRU-Net$_{\mathbf{v}}$, followed by a 14-unit FFL with linear activation. For GRU-Net$_{\mathbf{\Phi}}$, there were 200 units in both GRU layers, followed by a 98-unit linear FFL. For VAD weight estimation, the corresponding GRU-Net featured a two-layer GRU with 200 units each, followed by a 1-unit (i.e., frequency-independent) FFL with ReLU activation.
Two loss functions were examined: the mean squared error (MSE) of the magnitude spectra [27] and the MSE of the log-mel features [33] obtained with 80 mel filter banks. We did not use the scale-invariant signal-to-distortion ratio (Si-SDR) loss previously used for target speech extraction [15, 16], because it was found to be unstable when one source was silent, as is common in the CSS task. The weighting factor $\alpha$ for the residual connection was set to 0.5. The AdamW optimizer [34] was used with weight decay. A warm-up learning rate schedule was used, followed by an exponential decay after the peak. The baseline systems were trained for 50 epochs, and the ADL-MVDR systems were trained jointly with the baseline systems for another 100 epochs. For CSS chunk-wise processing, $N_h$, $N_c$, and $N_f$ were set to 1.2 s, 0.8 s, and 0.4 s, respectively.
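For concreteness, here is a sketch of the log-mel MSE loss and the two-source PIT combination under our assumed tensor shapes; the mel filter bank matrix is taken as a precomputed input rather than tied to a specific library call.

```python
import torch

def log_mel_mse_loss(est_mag, ref_mag, mel_fb, eps=1e-8):
    """MSE between log-mel features of estimated and reference magnitude spectra.

    est_mag, ref_mag: (B, T, F) magnitude spectrograms.
    mel_fb:           (F, n_mels) mel filter bank matrix (e.g. 80 mel bands),
                      assumed precomputed for the STFT resolution in use.
    """
    est_mel = torch.log(est_mag @ mel_fb + eps)
    ref_mel = torch.log(ref_mag @ mel_fb + eps)
    return torch.mean((est_mel - ref_mel) ** 2)

def pit_loss(est_mags, ref_mags, mel_fb):
    """Two-source permutation invariant training (PIT): keep the permutation of
    (estimate, reference) pairs with the smaller total loss."""
    perm1 = log_mel_mse_loss(est_mags[0], ref_mags[0], mel_fb) + \
            log_mel_mse_loss(est_mags[1], ref_mags[1], mel_fb)
    perm2 = log_mel_mse_loss(est_mags[0], ref_mags[1], mel_fb) + \
            log_mel_mse_loss(est_mags[1], ref_mags[0], mel_fb)
    return torch.minimum(perm1, perm2)
```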
5 Experimental Results
5.1 Results on LibriCSS
The experimental results with different system configurations are shown in Table 1. Among the systems implementing some or all of the proposed enhancements, the best performing configuration was system 10 with the real-valued mask estimator. It achieved an approximately 9% relative WER reduction on average (i.e., 10.1% vs. 11.1%) over the baseline system (system 0 with real-valued masks). The proposed end-to-end neural beamforming was more advantageous in more challenging conditions, yielding a 15% relative gain for the 30% overlap condition (i.e., 12.5% vs. 14.7%).
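The relative WER reduction figures quoted throughout this section follow the standard definition; e.g., for the averages of systems 0 and 10 and for the 30% overlap condition:

```latex
\Delta_{\text{rel}}
= \frac{\text{WER}_{\text{base}} - \text{WER}_{\text{prop}}}{\text{WER}_{\text{base}}}
= \frac{11.1 - 10.1}{11.1} \approx 9\,\%,
\qquad
\frac{14.7 - 12.5}{14.7} \approx 15\,\% \;\; \text{(30\% overlap)}.
```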
It is noteworthy that system 3 significantly outperformed system 2 (11.3% vs. 13.7% with real-valued masks), and the latter even underperformed the baseline system. This indicates that the beamformer alone cannot fully suppress a speech source even when its estimated T-F masks are almost zero, especially in reverberant conditions [5]. The VAD network helped the neural beamformer handle the case where an output source is occasionally silent, as happens in the CSS task.
The results in Table 1 also show the impact of the steering vector normalization and the positive semi-definite constraint. Normalizing the estimated steering vector improved the ASR accuracy in most conditions. In particular, comparing systems 3 and 4, we can see 3.5% (i.e., 11.3% vs. 10.9%) and 3.6% (i.e., 11.1% vs. 10.7%) relative WER gains for the real- and complex-valued mask configurations, respectively. Similar trends were observed for systems 7 and 8. Also, the positive semi-definite constraint on the estimate of $\widehat{\mathbf{\Phi}_{\mathbf{NN}}^{-1}}$ was found to maintain the performance with a reduced number of parameters.
For the proposed end-to-end neural beamformers, the log-mel scale magnitude loss consistently yielded better ASR accuracy. For instance, between systems 10 and 6, there were 5.6% and 6.4% relative improvements for the real- and complex-valued mask configurations, respectively.
5.2 Results on Real Meeting Recordings
Table 2: WER (%) on the real meeting recordings.

| System | MS | AMI-dev | AMI-eval |
|---|---|---|---|
| BF | 17.8 | 25.0 | 27.3 |
| CSS w/ MVDR | 16.6 | 17.9 | 21.1 |
| CSS w/ ADL-MVDR | 16.3 | 18.1 | 20.6 |
For the real meeting experiments, we used a larger training set consisting of 438 hours of mixtures. Table 2 compares the proposed all-neural beamformer (i.e., system 10 with real-valued masks) with two baseline systems: a conventional chunk-wise MVDR system as in [28], and a super-directive beamformer with real-time beam steering (denoted as BF), which only removes ambient noise and does not perform speech separation.
For the MS dataset, the proposed ADL-MVDR achieved the best ASR accuracy (i.e., 16.3%), corresponding to about 8.4% and 1.8% relative improvements over the BF and conventional MVDR systems, respectively. While the two MVDR systems performed comparably on the AMI development set, the proposed ADL-MVDR slightly outperformed the conventional MVDR by 2.4% relative (i.e., 20.6% vs. 21.1%) on the AMI evaluation set. These results suggest that the proposed neural beamformer achieves ASR accuracy comparable to a state-of-the-art MVDR-based system while also enabling frame-wise beamforming.
6 Conclusions
We developed an all-neural beamformer that enables frame-wise beamforming for the CSS task. The proposed neural beamformer achieved significant ASR accuracy improvements over baseline spectral masking systems, especially in challenging overlap conditions. On real conversation recordings, it achieved performance comparable to a conventional MVDR-based system with a simplified runtime implementation. The experimental results suggest that the VAD network is needed for the neural beamformer to work effectively for CSS, and that additional enhancements such as steering vector normalization further improve the performance.
References
- [1] X. Zhang, Z.-Q. Wang, and D. Wang, “A speech enhancement algorithm by iterating single-and multi-microphone processing and its application to robust ASR,” in ICASSP. IEEE, 2017, pp. 276–280.
- [2] Z.-Q. Wang and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust ASR,” in ICASSP. IEEE, 2018, pp. 5709–5713.
- [3] T. Yoshioka et al., “Advances in online audio-visual meeting transcription,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 276–283.
- [4] S. Doclo, S. Gannot, M. Moonen, A. Spriet, S. Haykin, and K. R. Liu, “Acoustic beamforming for hearing aid applications,” Handbook on array processing and sensor networks, pp. 269–302, 2010.
- [5] T. Yoshioka, H. Erdogan, Z. Chen, and F. Alleva, “Multi-microphone neural speech separation for far-field multi-talker speech recognition,” in ICASSP. IEEE, 2018, pp. 5739–5743.
- [6] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in ICASSP. IEEE, 2020, pp. 6394–6398.
- [7] S. Chen, Y. Wu, Z. Chen, J. Wu, J. Li, T. Yoshioka, C. Wang, S. Liu, and M. Zhou, “Continuous speech separation with conformer,” in ICASSP. IEEE, 2021, pp. 5749–5753.
- [8] Z. Zhang et al., “On loss functions and recurrency training for GAN-based speech enhancement systems,” Interspeech, pp. 3266–3270, 2020.
- [9] H. Yu, W.-P. Zhu, and Y. Yang, “Constrained ratio mask for speech enhancement using DNN.” in INTERSPEECH, 2020, pp. 2427–2431.
- [10] Y. Luo and N. Mesgarani, “TasNet: time-domain audio separation network for real-time, single-channel speech separation,” in ICASSP. IEEE, 2018, pp. 696–700.
- [11] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
- [12] D. Stoller, S. Ewert, and S. Dixon, “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” 2018, pp. 334–340.
- [13] Y. Xu, Z. Zhang, M. Yu, S.-X. Zhang, and D. Yu, “Generalized spatio-temporal RNN beamformer for target speech separation,” INTERSPEECH, pp. 3076–3080, 2021.
- [14] W. Zhang, C. Boeddeker, S. Watanabe, T. Nakatani, M. Delcroix, K. Kinoshita, T. Ochiai, N. Kamo, R. Haeb-Umbach, and Y. Qian, “End-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend,” in ICASSP. IEEE, 2021, pp. 6898–6902.
- [15] Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, “ADL-MVDR: All deep learning MVDR beamformer for target speech separation,” in ICASSP. IEEE, 2021, pp. 6089–6093.
- [16] Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, D. S. Williamson, and D. Yu, “Multi-channel multi-frame ADL-MVDR for target speech separation,” arXiv preprint arXiv:2012.13442, 2020.
- [17] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
- [18] Ö. Çetin and E. Shriberg, “Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: Insights for automatic speech recognition,” in Ninth international conference on spoken language processing, 2006.
- [19] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, and J. Li, “Continuous speech separation: Dataset and analysis,” in ICASSP. IEEE, 2020, pp. 7284–7288.
- [20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP. IEEE, 2015, pp. 5206–5210.
- [21] J. Carletta et al., “The AMI meeting corpus: A pre-announcement,” in International workshop on machine learning for multimodal interaction. Springer, 2005, pp. 28–39.
- [22] T. Yoshioka, Z. Chen, C. Liu, X. Xiao, H. Erdogan, and D. Dimitriadis, “Low-latency speaker-independent continuous speech separation,” in ICASSP. IEEE, 2019, pp. 6980–6984.
- [23] T. Yoshioka et al., “Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks,” Interspeech, pp. 3038–3042, 2018.
- [24] Y. Liu et al., “Neural network based time-frequency masking and steering vector estimation for two-channel MVDR beamforming,” in ICASSP. IEEE, 2018, pp. 6717–6721.
- [25] T. Higuchi et al., “Online MVDR beamformer based on complex gaussian mixture model with spatial prior for noise robust ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780–793, 2017.
- [26] K. Shimada et al., “Unsupervised beamforming based on multichannel nonnegative matrix factorization for noisy speech recognition,” in ICASSP. IEEE, 2018, pp. 5734–5738.
- [27] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in ICASSP. IEEE, 2017, pp. 241–245.
- [28] T. Yoshioka, X. Wang, D. Wang, M. Tang, Z. Zhu, Z. Chen, and N. Kanda, “VarArray: Array-Geometry-Agnostic continuous speech separation,” arXiv preprint arXiv:2110.05745, 2021.
- [29] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
- [30] L. D. Consortium et al., “CSR-II (wsj1) complete,” Linguistic Data Consortium, Philadelphia, vol. LDC94S13A, 1994.
- [31] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
- [32] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1.
- [33] C. Boeddeker, H. Erdogan, T. Yoshioka, and R. Haeb-Umbach, “Exploring practical aspects of neural mask-based beamforming for far-field speech recognition,” in Proc. ICASSP, 2018, pp. 6697–6701.
- [34] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in ICLR, 2019.