
Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals

Abstract

Speaker extraction requires a sample speech from the target speaker as the reference. However, enrolling a speaker with a long speech sample is not practical. We propose a speaker extraction technique that performs in multiple stages to take full advantage of a short reference speech sample. The speech extracted in early stages is used as the reference speech for later stages. For the first time, we use a frame-level sequential speech embedding as the reference for the target speaker, a departure from the traditional utterance-based speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the decoded signals at multiple scales with automatically learned weights. Experiments on WSJ0-2mix and its noisy versions (WHAM! and WHAMR!) show that SpEx++ consistently outperforms other state-of-the-art baselines.

Index Terms—  multi-stage, time-domain, speaker extraction, signal fusion, speaker embedding

1 Introduction

Real-world speech communication usually takes place in complex auditory scenes in the presence of multiple speakers. Blind speech separation is one of the solutions to isolate one source from others, such as DPCL [1, 2, 3], PIT [4, 5, 6], and TasNet [7, 8, 9]. However, it requires that the number of sources is known in advance, and assumes the permutation of source labels is unchanged during training, which greatly limits its scope of applications.

Unlike blind speech separation, speaker extraction extracts only the target speech from a speech mixture, given a reference utterance of the target speaker [10, 11, 12, 13, 14, 15]. Thus, it naturally avoids the problems of arbitrary source permutation and an unknown number of sources. Previous studies [14, 15] show that a longer reference speech always leads to better speaker extraction performance. However, real applications usually only permit a short reference utterance, e.g., a wake-up word for a mobile device. This prompts us to study how to reduce the required duration of the reference utterance.

Fig. 1: The diagram of the SpEx++ system. Each stage consists of speech encoders, a speaker encoder, a speaker extractor, a speech decoder and signal fusion. $y(t)$, $x(t)$, $s(t)$, and $I$ are the mixture speech, reference speech, clean speech and true speaker label, respectively. $\hat{s}(t)$, $\hat{I}$, $v_{\text{utt}}$ and $v_{\text{frame}}$ represent the extracted target voice, the predicted probability of the target speaker, the utterance-level speaker embedding and the newly introduced frame-level speaker embedding at different stages, respectively.

Psychoacoustic studies [16] suggest that brain circuits create perceptual attractors, or magnets, that warp the stimulus space so as to draw in the sound closest to them. Human auditory attention uses the currently attended acoustic stimulus to reinforce the attractor [17] in a continuous auditory process. In speaker extraction, the reference speech that is encoded as a speaker embedding can be seen as such a perceptual attractor. Motivated by these psychoacoustic studies, we propose a speaker extraction architecture that reinforces the attractor for the target speaker over multiple stages.

Specifically, we take advantage of the multi-stage architecture to reuse the extracted target voice from previous stages to strengthen the attractor. The extracted target speech is combined with the original reference utterance to strengthen the utterance-level speaker embedding. At the same time, the extracted speech is used to derive a frame-level speaker embedding that is aligned with the mixture speech frame by frame. In this way, the extracted speech provides a second reference signal. Note that the first reference signal is an utterance-level speaker embedding, which represents the overall characteristics of the target speaker, while the second reference signal is a sequential reference that describes the temporal dynamics of exactly what the target speaker says. Such information is hard to capture directly in a single stage. In addition, we propose a signal fusion strategy that combines the decoded signals at multiple scales with automatically learned weights to produce a higher-quality target speech.

2 SpEx++ Architecture

SpEx++ is a multi-stage SpEx+ [15] pipeline. Without loss of generality, we study a 3-stage architecture in this work, as shown in Fig. 1. However, SpEx++ has an entirely different target speaker referencing mechanism from SpEx+. We begin with a brief review of the SpEx+ system, and then discuss the signal fusion strategy and the multi-stage system.

2.1 Review of SpEx+ system

The SpEx+ system is a complete time-domain speaker extraction solution that accepts a time-domain mixture speech and a reference speech as inputs to extract the target speech. SpEx+ mainly consists of four modules: a twin speech encoder, a speaker encoder, a speaker extractor, and a speech decoder. The twin speech encoder shares its network structure and parameters, projecting the mixture speech and the reference speech of the target speaker into a common latent space. The speaker encoder produces a discriminative utterance-level speaker embedding from the reference speech. The speaker extractor takes the utterance-level speaker embedding from the speaker encoder and the transformed features from the twin speech encoder to estimate masks at various scales, which then go through the speech decoder to reconstruct the target signals at various scales.

The SpEx+ network is trained with a multi-task learning objective that measures both target speaker prediction and signal reconstruction. The former minimizes the cross-entropy (CE) loss between the predicted and actual speaker labels. The latter minimizes the reconstruction error (multi-scale SI-SDR) between the extracted signals at different scales and the clean target speech.
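To make the objective concrete, below is a minimal PyTorch sketch of such a multi-task loss. The tensor names, the cross-entropy scale `gamma`, and the default per-scale weights are our own illustrative assumptions rather than the released SpEx+ implementation.

```python
# A minimal sketch of a multi-task loss combining multi-scale SI-SDR and
# speaker cross-entropy, under the assumptions stated above.
import torch
import torch.nn.functional as F


def neg_si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR in dB for signals of shape (batch, time)."""
    scale = torch.sum(est * ref, -1, keepdim=True) / (torch.sum(ref * ref, -1, keepdim=True) + eps)
    proj = scale * ref                      # scaled projection onto the clean target
    noise = proj - est
    ratio = torch.sum(proj ** 2, -1) / (torch.sum(noise ** 2, -1) + eps)
    return -10.0 * torch.log10(ratio + eps)


def multi_task_loss(ests, clean, spk_logits, spk_label,
                    scale_weights=(0.8, 0.1, 0.1), gamma=0.5):
    """Weighted multi-scale SI-SDR reconstruction term plus speaker CE term."""
    recon = sum(w * neg_si_sdr(e, clean).mean() for w, e in zip(scale_weights, ests))
    ce = F.cross_entropy(spk_logits, spk_label)   # target-speaker prediction term
    return recon + gamma * ce
```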

SpEx+ processes speech at multiple scales. However, at run-time inference, only the extracted target signal at the smallest scale (highest temporal resolution) is chosen as the final output [15]. As a result, the decoded signals from the other scales do not contribute to target speech generation. In addition, the weights that balance the multi-scale SI-SDR loss between the extracted signals at different scales and the clean target speech are tuned on the development set, which adds complexity to the training process.

2.2 Fusion of multi-scale signals

To take advantage of the complementary decoded signals from multiple scales during both training and inference, we propose a signal fusion strategy applied before the loss calculation. First, we combine the multi-scale decoded signals with automatically learned weights, and then calculate the reconstruction error (SI-SDR) between the fused signal and the clean signal.

Let $\hat{s}_{1}^{(k)}$, $\hat{s}_{2}^{(k)}$, and $\hat{s}_{3}^{(k)}$ denote the multi-scale decoded signals at stage $k$. We use learnable weights $w_{1}$, $w_{2}$, and $w_{3}$ to obtain the fused signal as follows:

\hat{s}^{(k)} = w_{1}\hat{s}_{1}^{(k)} + w_{2}\hat{s}_{2}^{(k)} + w_{3}\hat{s}_{3}^{(k)}, \quad k=1,2,3 \qquad (1)

It was reported [14] that the decoded signal with the highest temporal resolution achieves the best performance, and the weights that balance the multi-scale SI-SDR loss were tuned to 0.8, 0.1, and 0.1 for the high, middle, and low temporal resolutions, respectively [14, 15]. We adopt the same values as the initial weights for combining the signals at the 3 temporal resolutions, and then calculate the SI-SDR between the fused signal and the clean signal as,

\mathcal{L}_{\text{SI-SDR}}^{(k)} = -\rho(\hat{s}^{(k)}, s), \qquad (2)
\rho(\hat{s}, s) = 20\log_{10}\frac{\|(\hat{s}^{T}s/s^{T}s)\cdot s\|}{\|(\hat{s}^{T}s/s^{T}s)\cdot s - \hat{s}\|} \qquad (3)

where $\hat{s}$ and $s$ are the estimated signal and the clean target signal, respectively.
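A minimal sketch of the proposed signal fusion, assuming a PyTorch implementation, is given below: the three fusion weights are trainable parameters (initialized as described in Section 3.1 and not constrained to sum to one), and the loss follows Eqs. (2)-(3) applied to the fused signal. The class and function names are illustrative, not the released code.

```python
# Learnable fusion of the three decoded signals (Eq. 1) and the SI-SDR loss
# on the fused signal (Eqs. 2-3); a sketch under the assumptions stated above.
import torch
import torch.nn as nn


class SignalFusion(nn.Module):
    """Weighted sum of the three decoded signals with learnable weights."""

    def __init__(self, init_weights=(0.8, 0.1, 0.1)):
        super().__init__()
        self.weights = nn.Parameter(torch.tensor(init_weights, dtype=torch.float32))

    def forward(self, s1: torch.Tensor, s2: torch.Tensor, s3: torch.Tensor) -> torch.Tensor:
        w = self.weights
        return w[0] * s1 + w[1] * s2 + w[2] * s3


def fused_si_sdr_loss(fusion: SignalFusion, s1, s2, s3, clean, eps: float = 1e-8):
    """Negative SI-SDR between the fused estimate and the clean target."""
    est = fusion(s1, s2, s3)
    scale = torch.sum(est * clean, -1, keepdim=True) / (torch.sum(clean * clean, -1, keepdim=True) + eps)
    proj = scale * clean                    # projection of the estimate onto the clean signal
    noise = proj - est
    ratio = torch.sum(proj ** 2, -1) / (torch.sum(noise ** 2, -1) + eps)
    return (-10.0 * torch.log10(ratio + eps)).mean()
```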

2.3 Multi-stage speaker extraction

The idea of the multi-stage architecture is to pipeline several single-stage speaker extraction modules sequentially, such that a later module operates on the extracted target speech of an earlier one. The effect of such a composition is an incremental refinement of the reference speech of the target speaker. As shown in Fig. 1, each stage takes the extracted target voice from the previous stage as an extra input to refine the reference speech of the target speaker. We implement two mechanisms in SpEx++ to do so.

One is to combine the extracted target voice $\hat{s}^{(k-1)}(t)$ with the original, short reference utterance $x(t)$ to extract the utterance-level speaker embedding $v_{\text{utt}}^{(k)}$. This speaker embedding is expected to represent the characteristics of the target speaker, and is referred to as the first reference signal,

v_{\text{utt}}^{(k)} = \text{Enc}_{\text{speaker}}(\text{Enc}_{\text{speech}}(\text{Concat}(x, \hat{s}^{(k-1)}))), \qquad (4)

where $k=2,3$, and $\text{Enc}_{\text{speech}}(\cdot)$ and $\text{Enc}_{\text{speaker}}(\cdot)$ represent the speech encoder and the speaker encoder, respectively.

The other is to pass the extracted speech $\hat{s}^{(k-1)}(t)$ through the speech encoder to obtain a frame-level speech embedding. Since $\hat{s}^{(k-1)}(t)$ is extracted from the mixture speech $y(t)$, it is aligned with the mixture speech frame by frame. Such a frame-level embedding differs from the utterance-level speaker embedding, yet provides a strong signal about the target speaker and his/her speech content. For the first time, we introduce the frame-level speech embedding $v_{\text{frame}}^{(k)}(t)$ as a second reference signal in a speaker extraction study,

v_{\text{frame}}^{(k)}(t) = \text{Enc}_{\text{speech}}(\hat{s}^{(k-1)}(t)), \quad k=2,3 \qquad (5)

Now we formulate the way to use two reference signals for speaker extraction,

M_{i}^{(k)} = \text{Ext}(v_{\text{utt}}^{(k)}, \text{Concat}(v_{\text{frame}}^{(k)}, \text{Enc}_{\text{speech}}(y))), \qquad (6)
\hat{s}_{i}^{(k)} = \text{Dec}(M_{i}^{(k)} \otimes \text{Enc}_{\text{speech}}(y)) \qquad (7)

where $i=1,2,3$ indexes the three different scales, and $\otimes$ denotes element-wise multiplication. $\text{Ext}(\cdot)$ and $\text{Dec}(\cdot)$ represent the speaker extractor and the speech decoder, respectively. Finally, the reconstruction error is calculated by applying the operations in Section 2.2.
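The sketch below illustrates how one refinement stage (k=2,3) could be wired, assuming PyTorch-style callables for the shared speech encoder, speaker encoder, speaker extractor, and speech decoder. The simplified module interfaces and tensor layouts are our assumptions; the concatenation points and the element-wise masking follow Eqs. (4)-(7).

```python
# Schematic forward pass of one refinement stage, under the assumptions above.
from typing import Callable, List, Tuple
import torch


def refinement_stage(
    y: torch.Tensor,       # mixture speech, shape (batch, time)
    x: torch.Tensor,       # short reference utterance, shape (batch, time_ref)
    s_prev: torch.Tensor,  # extracted speech from stage k-1, shape (batch, time)
    speech_enc: Callable[[torch.Tensor], torch.Tensor],
    speaker_enc: Callable[[torch.Tensor], torch.Tensor],
    extractor: Callable[[torch.Tensor, torch.Tensor], List[torch.Tensor]],
    decoder: Callable[[torch.Tensor], torch.Tensor],
) -> Tuple[List[torch.Tensor], torch.Tensor, torch.Tensor]:
    # Eq. (4): first reference signal, an utterance-level speaker embedding from
    # the original reference concatenated in time with the previous estimate.
    v_utt = speaker_enc(speech_enc(torch.cat([x, s_prev], dim=-1)))

    # Eq. (5): second reference signal, a frame-level embedding of the previous
    # estimate, time-aligned with the encoded mixture.
    v_frame = speech_enc(s_prev)

    # Eq. (6): estimate masks from the utterance embedding and the frame-level
    # embedding stacked with the encoded mixture along the channel axis.
    enc_y = speech_enc(y)                             # (batch, channels, frames)
    masks = extractor(v_utt, torch.cat([v_frame, enc_y], dim=1))

    # Eq. (7): apply each mask to the encoded mixture and decode to waveforms.
    ests = [decoder(m * enc_y) for m in masks]
    return ests, v_utt, v_frame
```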

Table 1: A comparative study on the WSJ0-2mix dataset under the open condition. Given Ref. denotes the duration of the given test reference speech in seconds (s). Target Ref. shows the reference signals used: Utt denotes the utterance-level speaker embedding $v_{\text{utt}}^{(k)}$, and Fm denotes the frame-level speech embedding $v_{\text{frame}}^{(k)}$. #stages denotes the number of SpEx decoding stages.
| Given Ref. | Methods | Signal Fusion | #stages | Target Ref. | SDRi (dB) | SI-SDRi (dB) | PESQ |
|---|---|---|---|---|---|---|---|
| 7.3s (avg) | SpEx+ [15] | × | 1 | Utt | 17.2 | 16.9 | 3.43 |
| | SpEx++ | √ | 1 | Utt | 17.5 | 17.2 | 3.46 |
| | SpEx++ | √ | 2 | Utt | 17.7 | 17.3 | 3.47 |
| | SpEx++ | √ | 2 | Fm | 18.2 | 17.8 | 3.51 |
| | SpEx++ | √ | 2 | Utt+Fm | 18.3 | 17.9 | 3.52 |
| | SpEx++ | √ | 3 | Utt+Fm | 18.4 | 18.0 | 3.53 |
| 2s | SpEx+ [15] | × | 1 | Utt | 16.6 | 16.2 | 3.37 |
| | SpEx++ | √ | 1 | Utt | 16.9 | 16.4 | 3.40 |
| | SpEx++ | √ | 2 | Utt+Fm | 17.6 | 17.1 | 3.46 |
| | SpEx++ | √ | 3 | Utt+Fm | 17.6 | 17.2 | 3.46 |

3 Experiments and Discussion

We evaluated our system on the noise-free two-speaker mixture database WSJ0-2mix. The database was derived from the WSJ0 corpus at a sampling rate of 8 kHz. The simulated database contained 101 speakers and was divided into three sets: a training set (20,000 utterances), a development set (5,000 utterances), and a test set (3,000 utterances). Specifically, utterances from two speakers in the WSJ0 “si_tr_s” corpus were randomly selected to generate the training and development sets at various SNRs between 0 dB and 5 dB. Similarly, the test set was generated by randomly mixing utterances from two speakers in the WSJ0 “si_dt_05” and “si_et_05” sets. Since the speakers in the test set were unseen during training, the test set was considered an open-condition evaluation.

The noisy versions of WSJ0-2mix, called WHAM! [18] and WHAMR! [19], were also used to verify the robustness of our system. WHAM! paired each two-speaker mixture in WSJ0-2mix with a non-speech ambient noise sample recorded in real environments such as coffee shops, restaurants, and bars. WHAMR! extended WHAM! by introducing reverberation to the speech sources in addition to the existing noise.

Unlike blind speech separation, speaker extraction requires a reference speech of the target speaker as input. For each mixture in WSJ0-2mix, each speaker in the mixed speech acted as the target speaker in turn, and the corresponding reference speech was randomly selected from the original WSJ0 corpus. The reference speeches in WHAM! and WHAMR! were the same as those in WSJ0-2mix.
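For illustration, the sketch below shows how a two-speaker mixture and its reference pair could be simulated, assuming in-memory 8 kHz waveforms as NumPy arrays. The 0-5 dB SNR range follows the recipe described above, while the function names are our own assumptions rather than the official WSJ0-2mix simulation scripts.

```python
# Illustrative simulation of (mixture, reference) pairs under the stated assumptions.
import numpy as np


def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the interferer so the target-to-interferer ratio equals snr_db, then sum."""
    length = min(len(target), len(interferer))
    target, interferer = target[:length], interferer[:length]
    p_t = np.mean(target ** 2) + 1e-8
    p_i = np.mean(interferer ** 2) + 1e-8
    gain = np.sqrt(p_t / (p_i * 10 ** (snr_db / 10.0)))
    return target + gain * interferer


def make_example(target, interferer, reference_pool, rng=None):
    """Mix at a random SNR in [0, 5] dB and draw a random reference utterance
    of the target speaker from the original corpus."""
    rng = rng or np.random.default_rng()
    mixture = mix_at_snr(target, interferer, rng.uniform(0.0, 5.0))
    reference = reference_pool[rng.integers(len(reference_pool))]
    return mixture, reference
```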

3.1 Experimental setup

We trained all systems for 100 epochs on 4-second mixture segments and their corresponding reference utterances, which have an average duration of 7.3 seconds, as in [15]. The learning rate was initialized to $10^{-3}$ and halved if the validation accuracy did not improve for 2 consecutive epochs. Early stopping was applied if no better model was found on the validation set for 6 consecutive epochs. Adam was used as the optimizer. The network structure follows SpEx+ [15].
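A sketch of this training schedule, assuming a PyTorch setup, is shown below; `ReduceLROnPlateau` with factor 0.5 and patience 2 mirrors the learning-rate rule, and the 6-epoch early-stopping counter is handled manually. The placeholder model and validation metric are illustrative only.

```python
# Training-schedule sketch: Adam, LR halving on plateau, and early stopping.
import torch

model = torch.nn.Linear(1, 1)            # placeholder for the SpEx++ network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

best_acc, epochs_since_best = float("-inf"), 0
for epoch in range(100):
    # ... one epoch of training on 4-second mixture segments ...
    val_acc = 0.0                        # placeholder: the epoch's validation accuracy
    scheduler.step(val_acc)              # halve LR after 2 epochs without improvement
    if val_acc > best_acc:
        best_acc, epochs_since_best = val_acc, 0
    else:
        epochs_since_best += 1
    if epochs_since_best >= 6:           # early stopping after 6 stagnant epochs
        break
```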

The speech encoders at each stage shared the same network structure and weights. The filter lengths of the convolutions in the speech encoder and decoder were 2.5 ms, 10 ms, and 20 ms for the 3 time scales at the 8 kHz sampling rate, respectively. The speaker extractor repeated a stack of 8 temporal convolutional network (TCN) blocks 4 times. The speaker encoder included 3 ResNet blocks with 256, 256, and 512 filters, respectively. The utterance-level speaker embedding dimension was set to 256. For the signal fusion configuration, the weights were initialized as $w_{1}=0.8$, $w_{2}=0.1$, $w_{3}=0.1$, and then learned automatically during training.
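For reference, the sketch below converts the three filter lengths to samples at 8 kHz and builds three parallel encoder convolutions in PyTorch. The number of filters and the stride are illustrative assumptions, and in the full model the inputs are padded so that the three feature maps stay time-aligned.

```python
# Multi-scale speech encoder sketch: 2.5/10/20 ms filters at 8 kHz = 20/80/160 samples.
import torch
import torch.nn as nn

SAMPLE_RATE = 8000
FILTER_MS = (2.5, 10.0, 20.0)
FILTER_SAMPLES = [int(SAMPLE_RATE * ms / 1000) for ms in FILTER_MS]   # [20, 80, 160]

# One 1-D convolution per time scale; 256 filters and a half-window stride of the
# shortest filter are assumptions for illustration.
encoders = nn.ModuleList([
    nn.Conv1d(1, 256, kernel_size=L, stride=FILTER_SAMPLES[0] // 2)
    for L in FILTER_SAMPLES
])

waveform = torch.randn(1, 1, SAMPLE_RATE * 4)        # a 4-second mixture segment
features = [enc(waveform) for enc in encoders]        # three time-scale representations
```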

Table 2: SDRi (dB) and SI-SDRi (dB) in a comparative study on the WHAM! dataset under the open condition. For blind speech separation (BSS) task, we report the results evaluated on the oracle-selected streams. For speaker extraction (SE) task, we report the results on the extracted target streams.
| Task | Methods | Given Ref. | SDRi (dB) | SI-SDRi (dB) |
|---|---|---|---|---|
| BSS | BLSTM-TasNet [18] | - | - | 9.8 |
| | chimera++ [18] | - | - | 9.9 |
| SE | SpEx [14] | 7.3s (avg) | 13.0 | 12.2 |
| | SpEx+ [15] | 7.3s (avg) | 13.6 | 13.1 |
| | SpEx++ (2-stages) | 7.3s (avg) | 14.4 | 14.0 |
| | | 60s | 14.7 | 14.3 |
Table 3: SDRi (dB) and SI-SDRi (dB) in a comparative study on the WHAMR! dataset under the open condition.
| Cond. | Task | Methods | Given Ref. | SDRi (dB) | SI-SDRi (dB) |
|---|---|---|---|---|---|
| Noise | BSS | Conv-TasNet [19] | - | - | 11.5 |
| | | BLSTM-TasNet [19] | - | - | 12.0 |
| | | Cascaded System [19] | - | - | 12.6 |
| | SE | SpEx [14] | 7.3s (avg) | 13.2 | 12.4 |
| | | SpEx+ [15] | 7.3s (avg) | 14.0 | 13.5 |
| | | SpEx++ (2-stages) | 7.3s (avg) | 14.5 | 14.0 |
| | | | 60s | 14.8 | 14.3 |
| Reverb | BSS | Conv-TasNet [19] | - | - | 7.6 |
| | | BLSTM-TasNet [19] | - | - | 8.9 |
| | | Cascaded System [19] | - | - | 9.9 |
| | SE | SpEx [14] | 7.3s (avg) | 8.7 | 9.7 |
| | | SpEx+ [15] | 7.3s (avg) | 9.3 | 10.6 |
| | | SpEx++ (2-stages) | 7.3s (avg) | 9.7 | 11.0 |
| | | | 60s | 10.0 | 11.4 |
| Noise + Reverb | BSS | Conv-TasNet [19] | - | - | 8.3 |
| | | BLSTM-TasNet [19] | - | - | 9.2 |
| | | Cascaded System [19] | - | - | 10.1 |
| | SE | SpEx [14] | 7.3s (avg) | 9.5 | 10.3 |
| | | SpEx+ [15] | 7.3s (avg) | 10.0 | 10.9 |
| | | SpEx++ (2-stages) | 7.3s (avg) | 10.4 | 11.4 |
| | | | 60s | 10.7 | 11.7 |

3.2 Results on noise-free condition

We first compare SpEx++ with the SpEx+ baseline system on WSJ0-2mix in terms of SDRi, SI-SDRi and PESQ, with both 7.3-second (on average) and 2-second reference speech.

From Table 1, we conclude: 1) The multi-stage framework significantly outperforms the single-stage framework. The improvements mainly come from the frame-level speech embedding introduced through the multi-stage pipeline. As evidence, the 2-stage SpEx++ substantially outperforms its 1-stage counterpart, benefiting from the extra frame-level speaker embedding. However, the 3-stage SpEx++ does not improve much further, because the 3-stage and 2-stage solutions use similar speaker information overall (i.e., Utt + Fm). 2) The signal fusion strategy in SpEx++ (1-stage) achieves a 0.3 dB gain over the SpEx+ baseline in terms of SDRi and SI-SDRi, which validates the effectiveness of the proposed signal fusion strategy. Moreover, the learned weights $w_{1}$, $w_{2}$, $w_{3}$ were 1.18, 0.32, and 0.11, respectively, suggesting that decoded signals with higher temporal resolution contribute more. Note that we do not constrain the weights to sum to one. 3) The experiments with 2-second reference speech further show that the SpEx++ system works well with much shorter reference speech, achieving a 1.0 dB SI-SDRi improvement over the SpEx+ baseline system.

Table 4: SDRi (dB) and SI-SDRi (dB) in a study of the universal SpEx++ system on WHAMR!. The system is trained on 4 conditions and tested on each condition individually.
| Given Ref. | Methods | Training Condition | Test Condition | SDRi (dB) | SI-SDRi (dB) |
|---|---|---|---|---|---|
| 7.3s (avg) | SpEx [14] | 4 conditions | Noise | 13.5 | 12.7 |
| | | | Reverb | 8.8 | 9.8 |
| | | | Noise + Reverb | 10.1 | 10.8 |
| | SpEx++ (2-stages) | 4 conditions | Noise | 14.6 | 14.2 |
| | | | Reverb | 9.9 | 11.2 |
| | | | Noise + Reverb | 10.7 | 11.6 |
| 60s | SpEx [14] | 4 conditions | Noise | 14.3 | 13.8 |
| | | | Reverb | 9.7 | 10.8 |
| | | | Noise + Reverb | 10.8 | 11.7 |
| | SpEx++ (2-stages) | 4 conditions | Noise | 14.9 | 14.6 |
| | | | Reverb | 10.3 | 11.5 |
| | | | Noise + Reverb | 11.1 | 12.0 |

3.3 Results on noisy conditions

We further verify the robustness of the SpEx++ system against other baselines under noisy conditions, as reported in Table 2 and Table 3. Compared with the results under the noise-free condition in Table 1, we observe that additive noise and reverberation in the mixture significantly degrade the performance of both blind speech separation and speaker extraction systems, which shows that extracting the target speaker's voice from a noisy mixture is a challenging task. Despite this, our SpEx++ system remains effective under noisy conditions. Specifically, SpEx++ achieves about 0.9 dB and 0.5 dB absolute improvement over the SpEx+ baseline in terms of SI-SDRi on WHAM! and WHAMR!, respectively.

We finally study a universal SpEx++ system that is trained on four mixture conditions (clean, noise only, reverberation only, and noise plus reverberation). The performance of the universal system is further evaluated on each noisy mixture condition individually, as shown in Table 4. Comparing the results in Table 3 and Table 4, we observe that the universal SpEx++ trained on the four conditions achieves better performance than the system trained on a single condition.

4 Conclusions

In this paper, we proposed a multi-stage time-domain speaker extraction technique, called SpEx++, to alleviate the problem of insufficient reference material from the target speaker. We took advantage of the multi-stage architecture to refine the speaker characteristics of the target speaker using the voice extracted in previous stages. Experiments showed that the proposed frame-level sequential speech embedding serves as an effective second reference signal and significantly improves speaker extraction without additional reference speech.

References

  • [1] John R Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
  • [2] Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey, “Single-channel multi-speaker separation using deep clustering,” arXiv preprint arXiv:1607.02173, 2016.
  • [3] Zhongqiu Wang, Jonathan Le Roux, and John R Hershey, “Alternative objective functions for deep clustering,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 686–690.
  • [4] Dong Yu, Morten Kolbæk, Zhenghua Tan, and Jesper Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 241–245.
  • [5] Morten Kolbæk, Dong Yu, Zhenghua Tan, and Jesper Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
  • [6] Chenglin Xu, Wei Rao, Xiong Xiao, Eng Siong Chng, and Haizhou Li, “Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6–10.
  • [7] Yi Luo and Nima Mesgarani, “Real-time single-channel dereverberation and separation with time-domain audio separation network.,” in Proc. Interspeech, 2018, pp. 342–346.
  • [8] Yi Luo and Nima Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  • [9] Yi Luo, Zhuo Chen, and Takuya Yoshioka, “Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 46–50.
  • [10] Kateřina Žmolíková, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Tomohiro Nakatani, Lukáš Burget, and Jan Černockỳ, “SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, 2019.
  • [11] Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A Saurous, Ron J Weiss, Ye Jia, and Ignacio Lopez Moreno, “VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking,” arXiv preprint arXiv:1810.04826, 2018.
  • [12] Chenglin Xu, Wei Rao, Eng Siong Chng, and Haizhou Li, “Optimization of speaker extraction neural network with magnitude and temporal spectrum approximation loss,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6990–6994.
  • [13] Chenglin Xu, Wei Rao, Eng Siong Chng, and Haizhou Li, “Time-domain speaker extraction network,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 327–334.
  • [14] Chenglin Xu, Wei Rao, Eng Siong Chng, and Haizhou Li, “SpEx: Multi-scale time domain speaker extraction network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1370–1384, 2020.
  • [15] Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, and Haizhou Li, “SpEx+: A complete time domain speaker extraction network,” in Proc. Interspeech, 2020, pp. 1406–1410.
  • [16] Patricia Kuhl, “Human adults and human infants show a ‘perceptual magnet effect’ for the prototypes of speech categories, monkeys do not,” Perception & Psychophysics, vol. 50, pp. 93–107, 1991.
  • [17] Adelbert W Bronkhorst, “The cocktail-party problem revisited: early processing and selection of multi-talker speech,” Attention, Perception, & Psychophysics, vol. 77, no. 5, pp. 1465–1487, 2015.
  • [18] Gordon Wichern, Joe Antognini, Michael Flynn, Licheng Richard Zhu, Emmett McQuinn, Dwight Crow, Ethan Manilow, and Jonathan Le Roux, “WHAM!: Extending speech separation to noisy environments,” in Proc. Interspeech, 2019, pp. 1368–1372.
  • [19] Matthew Maciejewski, Gordon Wichern, and Jonathan Le Roux, “WHAMR!: Noisy and reverberant single-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 696–700.