MULTI-CHANNEL TARGET SPEECH EXTRACTION WITH CHANNEL DECORRELATION AND TARGET SPEAKER ADAPTATION
Abstract
The end-to-end approaches for single-channel target speech extraction have attracted widespread attention. However, studies on end-to-end multi-channel target speech extraction are still relatively limited. In this work, we propose two methods for exploiting the multi-channel spatial information to extract the target speech. The first is to use a target speech adaptation layer in a parallel encoder architecture. The second is to design a channel decorrelation mechanism that extracts the inter-channel differential information to enhance the multi-channel encoder representation. We compare the proposed methods with two strong state-of-the-art baselines. Experimental results on the multi-channel reverberant WSJ0 2-mix dataset demonstrate that our proposed methods achieve up to 11.2% and 11.5% relative improvements in SDR and SiSDR respectively, which are the best reported results on this task to the best of our knowledge.
Index Terms— Target speech extraction, multi-channel, speaker embedding vector, channel decorrelation
1 Introduction
Speech separation (SS) is a task that aims to separate each source signal from mixed speech. Many effective methods have been proposed to perform the separation in the time-frequency domain, such as deep clustering [1], the deep attractor network [2], and permutation invariant training (PIT) [3]. More recently, a convolutional time-domain audio separation network (Conv-TasNet) [4] has been proposed and achieved significant separation performance improvements over these time-frequency based techniques. Conv-TasNet has attracted widespread attention and has been further improved in many recent works, for both single-channel and multi-channel speech separation tasks [5, 6, 7].
Our work also builds on Conv-TasNet; however, instead of targeting pure speech separation, we aim to generalize this idea to the target speech extraction (TSE) task. Generally, compared with pure speech separation techniques, most TSE approaches require additional target speaker clues to drive the network towards extracting the target speech.
Many previous works have focused on the TSE task, such as VoiceFilter [8], SBF-MTSAL-Concat [9], and SpEx [10]. Although these close-talk TSE approaches have achieved great progress, the performance of far-field speech extraction is still far from satisfactory due to reverberation. When microphone arrays are available, the additional multi-channel spatial information can usually be helpful for the TSE task. Such benefits have motivated many studies to exploit the multi-channel information. For example, the direction-aware SpeakerBeam [11] combines an attention mechanism with beamforming to enhance the signal from the target direction; the neural spatial filter [12] uses the directional information of the target speaker to extract the corresponding speech; and the time-domain SpeakerBeam (TD-SpeakerBeam) [13] incorporates the inter-microphone phase difference (IPD) [14] as additional input features to further improve the speaker discrimination capability. All of these multi-channel TSE approaches have shown promising results, which indicates that the multi-channel information can provide an alternative guide to better discriminate the target speaker. At its core, TSE still relies on speech separation. To further enhance the separation ability, many strategies for exploiting the multi-channel information have recently been proposed, such as normalized cross-correlation (NCC) [15], transform-average-concatenate (TAC) [16], and inter-channel convolution difference (ICD) [17]. Therefore, how to effectively exploit the multi-channel spatial information for TSE is crucial.
In this study, we also focus on exploiting the multi-channel spatial information for the target speech extraction task. Two approaches are proposed. First, we integrate a target speech adaptation layer into a parallel encoder architecture; this adaptation enhances the target speaker clues of the multi-channel encoder output by weighting the mixture embeddings. Second, unlike creating hand-crafted spatial features, we design a channel decorrelation mechanism to extract the inter-channel differential spatial information automatically. This decorrelation is performed on each dimension of all the multi-channel encoder representations of the input mixtures. Furthermore, together with the same target speech adaptation layer, the proposed decorrelation mechanism can provide another effective representation of the target speaker to enhance the whole end-to-end TSE network. To validate the effectiveness of our proposed approaches, we choose two state-of-the-art TSE systems as baselines: the parallel encoder proposed in [18] and the TD-SpeakerBeam with IPD features in [13]. All of our experiments are performed on the publicly available multi-channel reverberant WSJ0 2-mix dataset. Results show that our proposed methods improve the performance of multi-channel target speech extraction significantly. Moreover, to make the research reproducible, we release our source code at https://github.com/jyhan03/channel-decorrelation.
2 TIME-DOMAIN SPEAKERBEAM
TD-SpeakerBeam is a very effective target speech extraction approach that has recently been proposed in [13]. The structure of TD-SpeakerBeam is shown in Fig. 1 without the IPD concatenation block. It contains three main parts: an encoder (1-d convolution layer), a mask estimator (several convolution blocks), and a decoder (1-d deconvolution layer). Here, $y$, $\hat{x}_s$, and $a_s$ denote the mixture waveform of the first (reference) channel, the extracted target speech waveform, and the adaptation utterance of the target speaker, respectively. The TD-SpeakerBeam network follows a similar configuration as Conv-TasNet [4], except for inserting a multiplicative adaptation layer [19] between the first and second convolution blocks to drive the network towards extracting the target speech. The adaptation layer accepts the mixture embedding matrix and the target speaker embedding vector $e_s$ as inputs; $e_s$ is repeated along the time axis to perform element-wise multiplication with the mixture embedding. The embedding $e_s$ is computed by a time-domain convolutional auxiliary network, as shown at the bottom of Fig. 1.
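To make the adaptation operation concrete, a minimal PyTorch sketch of such a multiplicative adaptation layer is given below (the class name and tensor layout are our assumptions for illustration, not the implementation of [13]):

```python
import torch
import torch.nn as nn

class MultiplicativeAdaptation(nn.Module):
    """SpeakerBeam-style multiplicative adaptation: scale every feature
    dimension of the mixture embedding by the target speaker embedding,
    repeated (broadcast) over all time frames."""

    def forward(self, mixture_emb: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # mixture_emb: (batch, feat_dim, frames) -- output of the first conv block
        # speaker_emb: (batch, feat_dim)         -- e_s from the auxiliary network
        return mixture_emb * speaker_emb.unsqueeze(-1)
```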

In TD-SpeakerBeam, the network accepts the time-domain signal of the mixture and directly outputs the time-domain signal of the target speaker. Moreover, as shown in Fig. 1, the authors of [13] also extended TD-SpeakerBeam to the multi-channel TSE task; however, they simply concatenated the hand-crafted IPD features (processed with a 1-d convolutional encoder, upsampling, and a convolution block) with the adapted encoder representation to exploit the multi-channel spatial information.
The whole network architecture of TD-SpeakerBeam is trained jointly in an end-to-end multi-task way. The multi-task loss combines the scale-invariant signal-to-distortion ratio (SiSDR)[20] as the signal reconstruction loss and cross-entropy as the speaker identification loss. The overall loss function is defined as,
$\mathcal{L}(\theta \mid y, a_s, x_s, i_s) = -\,\mathrm{SiSDR}(x_s, \hat{x}_s) + \alpha\,\mathrm{CE}\big(\mathrm{softmax}(W e_s),\, i_s\big),$  (1)

where $\theta$ represents the model parameters, $y$ is the input mixture, $x_s$ is the target speech signal, $\hat{x}_s$ is the extracted speech, $i_s$ is a one-hot vector representing the target speaker identity, $\alpha$ is a scaling parameter, $W$ is a weight matrix, and $\mathrm{softmax}(\cdot)$ is the softmax operation. More details can be found in [13].
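For illustration, the multi-task objective in Equation (1) could be sketched in PyTorch as follows (a minimal sketch under our assumptions; the SiSDR implementation details and the value of $\alpha$ are illustrative, not taken from the released code of [13]):

```python
import torch
import torch.nn.functional as F

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB for (batch, samples) waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (optimal scaling of the target).
    proj = (torch.sum(est * ref, dim=-1, keepdim=True)
            / (torch.sum(ref * ref, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    return 10 * torch.log10(torch.sum(proj ** 2, dim=-1)
                            / (torch.sum(noise ** 2, dim=-1) + eps) + eps)

def multitask_loss(est_speech, ref_speech, spk_logits, spk_id, alpha=0.5):
    """Negative SiSDR reconstruction loss plus alpha-weighted speaker CE.
    alpha=0.5 is only a placeholder value for illustration."""
    loss_signal = -si_sdr(est_speech, ref_speech).mean()
    # spk_logits = W e_s; spk_id is the class index of the one-hot identity i_s.
    loss_speaker = F.cross_entropy(spk_logits, spk_id)
    return loss_signal + alpha * loss_speaker
```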
3 PROPOSED METHODS
3.1 Parallel encoder with speaker adaptation
Instead of using the IPD features, we first extend TD-SpeakerBeam to the multi-channel case by introducing a target speaker adaptation layer into the parallel encoder architecture proposed in [18]. The original parallel encoder, shown in Fig. 2 (a) without the adaptation block, directly sums the waveform encodings of each input channel to form the final mixture representation. To enhance the target speaker clues in the multi-channel encoder output, besides the adaptation layer already inserted in the mask estimator of TD-SpeakerBeam, we integrate the same adaptation layer into the parallel encoder. The diagram is shown in Fig. 2 (a).
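A minimal PyTorch sketch of this parallel encoder with target speaker adaptation is given below (module names, kernel size, stride, and the placement of the adaptation after the channel sum are our assumptions for illustration):

```python
import torch
import torch.nn as nn

class ParallelEncoderWithAdaptation(nn.Module):
    """Encode each microphone channel with its own 1-d conv encoder,
    sum the per-channel encodings, then adapt the summed representation
    with the target speaker embedding (element-wise multiplication)."""

    def __init__(self, num_channels: int = 2, enc_dim: int = 256,
                 kernel_size: int = 20, stride: int = 10):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Conv1d(1, enc_dim, kernel_size, stride=stride)
             for _ in range(num_channels)]
        )

    def forward(self, mixtures: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # mixtures: (batch, num_channels, samples); speaker_emb: (batch, enc_dim)
        encoded = [torch.relu(enc(mixtures[:, c:c + 1]))
                   for c, enc in enumerate(self.encoders)]
        summed = torch.stack(encoded, dim=0).sum(dim=0)   # (batch, enc_dim, frames)
        return summed * speaker_emb.unsqueeze(-1)          # target speaker adaptation
```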

3.2 Channel decorrelation
Unlike the original parallel encoder, which exploits the cross-correlations between multi-channel speech, in this paper we design a channel decorrelation (CD) mechanism to improve the multi-channel TSE performance of TD-SpeakerBeam. The proposed network structure is shown in Fig. 2 (b). This decorrelation aims to extract the inter-channel differential spatial information automatically, and it is performed on each dimension of all the multi-channel encoder representations of the input mixtures. Taking the two-channel case as an example, as shown in Fig. 2 (b), the CD block accepts two encoded mixture representations $\mathbf{E}_1$ and $\mathbf{E}_2$ and outputs the differential spatial information between the two channels. The specific procedure is as follows:
First, compute the cosine correlation between $\mathbf{E}_1$ and $\mathbf{E}_2$ in each corresponding dimension (row). Note that $\mathbf{E}_1$ and $\mathbf{E}_2$ are two matrices, i.e.,

$\mathbf{E}_i = [\mathbf{e}_i^1, \mathbf{e}_i^2, \ldots, \mathbf{e}_i^N]^{\mathrm{T}} \in \mathbb{R}^{N \times K}, \quad i \in \{1, 2\},$  (2)

where $\mathbf{E}_i$ is the input of the $i$-th channel of the CD block, $N$ is the output dimension of the convolutional encoder, $K$ is the number of frames of the encoder output, $\mathbf{e}_i^n \in \mathbb{R}^{K}$ is the $n$-th dimension (row) vector of the $i$-th channel, and $(\cdot)^{\mathrm{T}}$ is the transpose operation.
The cosine correlation of the $n$-th dimensional vectors between the first and second channels is calculated as

$c^n = \dfrac{\langle \mathbf{e}_1^n, \mathbf{e}_2^n \rangle}{\|\mathbf{e}_1^n\|\,\|\mathbf{e}_2^n\|},$  (3)

where $\langle \cdot, \cdot \rangle$ is the inner product of two vectors and $\|\cdot\|$ represents the Euclidean norm. The vectors involved in the operation are normalized to zero mean prior to the calculation.
Then, the cosine correlation between the vectors of each dimension of $\mathbf{E}_1$ and $\mathbf{E}_2$ is calculated in turn, and the results are concatenated to obtain a similarity vector $\mathbf{c}$,

$\mathbf{c} = [c^1, c^2, \ldots, c^N]^{\mathrm{T}} \in \mathbb{R}^{N},$  (4)

where $\mathbf{c}$ represents the similarity of the two encoded mixture representations in each dimension of the latent space.
Next, to convert each dimension of the similarity vector $\mathbf{c}$ into a probability, we introduce an auxiliary vector $\mathbf{a}$ with the same size as $\mathbf{c}$ whose values are all 1, i.e.,

$\mathbf{a} = [1, 1, \ldots, 1]^{\mathrm{T}} \in \mathbb{R}^{N},$  (5)

where $\mathbf{a}$ can be regarded as the cosine correlation between the first channel and itself, or as a probability vector whose values are all 1. Then, a softmax operation is applied to each pair of corresponding elements in $\mathbf{c}$ and $\mathbf{a}$ to obtain the final similarity probability vector $\mathbf{p}$, i.e.,

$p^n = \dfrac{e^{c^n}}{e^{c^n} + e^{a^n}}, \quad \mathbf{p} = [p^1, p^2, \ldots, p^N]^{\mathrm{T}} \in \mathbb{R}^{N}.$  (6)
Next, subtract $\mathbf{p}$ from $\mathbf{a}$ to obtain a vector of differentiated scores between channels, and upsample it to the same size as $\mathbf{E}_2$ to obtain the differentiated score matrix $\mathbf{D}$, i.e.,

$\mathbf{D} = \mathrm{upsample}(\mathbf{a} - \mathbf{p}) \in \mathbb{R}^{N \times K}.$  (7)
Finally, the differential spatial information between channels is extracted by multiplying $\mathbf{D}$ with $\mathbf{E}_2$,

$\mathbf{E}_{cd} = \mathbf{D} \odot \mathbf{E}_2,$  (8)

where $\mathbf{E}_{cd}$ represents how much differential spatial information the second channel can provide over the first channel, and $\odot$ denotes element-wise multiplication. Besides, to better guide the network towards extracting the speech of the target speaker, we also perform the target speaker adaptation on $\mathbf{E}_{cd}$ to exploit the target speaker-dependent spatial information, as shown in Fig. 2 (b).
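A minimal PyTorch sketch of the whole CD block, following Equations (2)-(8), is given below (variable names and the broadcast-based upsampling are our assumptions; the adaptation on $\mathbf{E}_{cd}$ is omitted for brevity):

```python
import torch

def channel_decorrelation(E1: torch.Tensor, E2: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Channel decorrelation sketch following Eqs. (2)-(8).

    E1, E2: encoder outputs of the two channels, each of shape (N, K).
    Returns E_cd of shape (N, K), the differential spatial information
    provided by the second channel over the first.
    """
    # Eq. (3): zero-mean each row, then compute the row-wise cosine correlation.
    E1_zm = E1 - E1.mean(dim=1, keepdim=True)
    E2_zm = E2 - E2.mean(dim=1, keepdim=True)
    c = (E1_zm * E2_zm).sum(dim=1) / (E1_zm.norm(dim=1) * E2_zm.norm(dim=1) + eps)  # (N,)

    # Eq. (5): auxiliary all-ones vector (cosine correlation of channel 1 with itself).
    a = torch.ones_like(c)

    # Eq. (6): softmax over each pair (c_n, a_n) gives the similarity probability p_n.
    p = torch.exp(c) / (torch.exp(c) + torch.exp(a))

    # Eq. (7): differentiated scores a - p, "upsampled" to the size of E2 by broadcasting.
    D = (a - p).unsqueeze(1).expand_as(E2)

    # Eq. (8): element-wise weighting of the second channel.
    return D * E2
```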
4 EXPERIMENTS AND RESULTS
4.1 Dataset
Our experiments are performed on the publicly available multi-channel reverberant WSJ0 2-mix corpus [21]. The multi-channel recordings are generated by convolving clean speech signals with room impulse responses simulated with the image method, with reverberation times of up to about 600 ms [13]. The dataset consists of 8-channel recordings, but to allow a fair comparison with the state-of-the-art baselines, we use only two channels in our experiments.
We use the same procedure as in [22] to generate the adaptation utterances of the target speaker. The adaptation utterance is selected randomly and differs from the target speaker's utterance in the mixture. The adaptation recordings used in our experiments are anechoic. The sizes of the training, validation, and test sets are 20k, 5k, and 3k utterances, respectively. All of the data are resampled to 8 kHz from the original 16 kHz sampling rate.
4.2 Configurations
Our experiments are performed based on the open source Conv-TasNet implementation [23]. We use the same hyper-parameters as our baseline TD-SpeakerBeam in [13].
The same multi-task loss function in Equation (1) of TD-SpeakerBeam is used in our experiments, and the scaling parameter $\alpha$ is set to balance the trade-off between the SiSDR reconstruction loss and the cross-entropy speaker identification loss. For the experiments with IPD combination, the IPD features are extracted using an STFT window of 32 ms and a hop size of 16 ms. For the performance evaluation, both the signal-to-distortion ratio (SDR) of BSS Eval [24] and the SiSDR [20] are used. For more details of the experimental configurations, please refer to our released source code on GitHub mentioned in the introduction.
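For reference, a common recipe for computing cosine/sine IPD features with a 32 ms window and 16 ms hop can be sketched as follows (our own illustration; the exact IPD feature layout used in [13] may differ):

```python
import torch

def ipd_features(ch1: torch.Tensor, ch2: torch.Tensor,
                 sr: int = 8000, win_ms: int = 32, hop_ms: int = 16) -> torch.Tensor:
    """Inter-microphone phase difference (IPD) features between two channels.

    ch1, ch2: 1-D waveforms of the reference and second microphone.
    Returns a (2 * freq_bins, frames) tensor of cosIPD and sinIPD features.
    """
    n_fft = int(sr * win_ms / 1000)   # 256 samples at 8 kHz
    hop = int(sr * hop_ms / 1000)     # 128 samples
    window = torch.hann_window(n_fft)
    spec1 = torch.stft(ch1, n_fft, hop_length=hop, window=window, return_complex=True)
    spec2 = torch.stft(ch2, n_fft, hop_length=hop, window=window, return_complex=True)
    phase_diff = torch.angle(spec2) - torch.angle(spec1)   # (freq_bins, frames)
    # Concatenate cosIPD and sinIPD along the frequency axis.
    return torch.cat([torch.cos(phase_diff), torch.sin(phase_diff)], dim=0)
```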
4.3 Results and Discussion
4.3.1 Baselines
Two state-of-the-art multi-channel TSE systems are taken as our baselines: one is TD-SpeakerBeam with IPD [13] and the other is the parallel encoder proposed in [18]. Results are shown in Table 1. Systems (1) and (2) are the results reported in [13]. As the source code of TD-SpeakerBeam and the parallel encoder is not publicly available, we implemented them ourselves and obtained the results of systems (3) to (6) on the same WSJ0 2-mix corpus. It is clear that our reproduced results are slightly better than those in [13], and that both the multi-task loss and the IPD features are effective in improving multi-channel TSE; interestingly, only slight improvements are obtained when using IPD together with the multi-task loss.
System | Multi-task | IPD | SDR (dB) | SiSDR (dB)
---|---|---|---|---
(1) TD-SpkBeam [13] | - | - | 11.17 | -
(2) TD-SpkBeam [13] | - | ✓ | 11.45 | -
(3) TD-SpkBeam (ours) | - | - | 11.26 | 10.76
(4) TD-SpkBeam (ours) | ✓ | - | 11.51 | 11.00
(5) TD-SpkBeam (ours) | ✓ | ✓ | 11.57 | 11.07
(6) Parallel (ours) | ✓ | - | 12.43 | 11.91
Furthermore, the results of system (6) are much better than those of the other systems. Actually, the parallel encoder in [18] was originally proposed for multi-channel speech separation; here we extend it with TD-SpeakerBeam to improve the target speech extraction task. This means that the parallel encoder is more effective than the IPD features at capturing the spatial information between channels, possibly because the parallel encoder is a purely data-driven method and is therefore more suitable for the time-domain TSE architecture. We take the best results, from systems (5) and (6) in Table 1, as our baselines.
4.3.2 Results of the proposed methods
Table 2 shows the performance comparisons between TD-SpeakerBeam based TSE systems with different techniques for utilizing the multi-channel spatial information. Systems (1) and (2) in this table are the best baselines. System (3) achieves 2.4% relative improvements in both SDR and SiSDR over system (2). This indicates that performing a target speaker adaptation on the multi-channel encoded representation can provide more effective target speaker-dependent spatial information than directly summing the multi-channel encoder outputs.
System | IPD | Adapt | SDR (dB) | SiSDR (dB)
---|---|---|---|---
(1) TD-SpkBeam (ours) | ✓ | - | 11.57 | 11.07
(2) Parallel (ours) | - | - | 12.43 | 11.91
(3) Parallel (ours) | - | ✓ | 12.73 | 12.20
(4) CD | - | - | 12.87 | 12.34
(5) CD | - | ✓ | 12.87 | 12.35
(6) CD | ✓ | ✓ | 12.55 | 12.01
(7) CC | - | ✓ | 12.66 | 12.13
Moreover, comparing the results of systems (2) and (4), the inter-channel differential spatial information extracted by the proposed channel decorrelation is clearly better than the cross-channel correlation information captured by the parallel encoder: further relative gains of 3.5% in SDR and 3.6% in SiSDR are obtained. However, unlike the target speaker adaptation in system (3), adapting the decorrelated spatial information brings almost no additional performance improvement. Interestingly, we also find that incorporating the hand-crafted IPD spatial features slightly degrades the performance when the CD mechanism is used. This may be due to a mismatch between the IPD features and the inter-channel differential spatial information extracted by CD, since the IPD is computed in the frequency domain while the channel decorrelation is performed in the time domain.
In addition, instead of the proposed CD for inter-channel differential information, we also tried to exploit the inter-channel correlation (CC) information, which is achieved by replacing $\mathbf{a} - \mathbf{p}$ with $\mathbf{p}$ in Equation (7) of Section 3.2. The results are shown as system (7) and are much worse than those of system (5). This indicates that the extracted inter-channel differential spatial information is more effective than the correlation information for multi-channel end-to-end TSE systems.
Actually, all of systems (2) to (7) can be regarded as extensions of TD-SpeakerBeam. Therefore, we can conclude that the best result of our proposed method (system (5)) significantly outperforms the multi-channel TD-SpeakerBeam baseline by 11.2% and 11.5% relative improvements in SDR and SiSDR, respectively.
4.3.3 Visualization
To better understand the role of each part during the channel decorrelation, we further investigate in Fig. 3 the distribution differences between the CD output $\mathbf{E}_{cd}$ and its two encoded mixture representation inputs $\mathbf{E}_1$ and $\mathbf{E}_2$ for one mixture utterance with two overlapped speakers. As expected, $\mathbf{E}_1$, $\mathbf{E}_2$, and $\mathbf{E}_{cd}$ have a similar pattern, and they all focus on the red dashed areas. Given the nature of our task, we believe that this area is strongly related to the target speaker. This means that after training the whole TSE system, the network can automatically focus on the contents related to the target speaker and ignore the others.
Comparing the plot of $\mathbf{E}_1$ with that of $\mathbf{E}_{cd}$, their distributions are significantly different: $\mathbf{E}_1$ is more densely distributed, while $\mathbf{E}_{cd}$ is sparser. This indicates that $\mathbf{E}_1$ plays the main role during the target speech extraction, while $\mathbf{E}_{cd}$ plays an auxiliary role, providing the complementary spatial information between channels. Furthermore, comparing the distribution of $\mathbf{E}_{cd}$ with that of $\mathbf{E}_2$, some content is removed. Based on the calculation process described in Section 3.2, we believe that the removed information is the correlation between $\mathbf{E}_1$ and $\mathbf{E}_2$, so that only the inter-channel differential information is emphasized in $\mathbf{E}_{cd}$.

5 CONCLUSION
In this work, we propose two novel methods to exploit the multi-channel spatial information for the target speech extraction task. Both methods operate in the time domain with end-to-end neural networks and are extensions of the time-domain SpeakerBeam. The first designs a parallel encoder with a target speaker adaptation layer to capture target speaker-dependent spatial information. The second proposes a channel decorrelation mechanism to effectively exploit the inter-channel differential spatial information. Experiments on the reverberant WSJ0 2-mix corpus demonstrate that our proposed methods significantly improve the multi-channel TD-SpeakerBeam for target speech extraction. Our future work will focus on how to combine hand-crafted and data-driven spatial features in an effective way.
References
- [1] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP. IEEE, 2016, pp. 31–35.
- [2] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in Proc. ICASSP. IEEE, 2017, pp. 246–250.
- [3] M. Kolbæk, D. Yu, Z. H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
- [4] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
- [5] D. Ditter and T. Gerkmann, “A multi-phase gammatone filterbank for speech separation via TasNet,” in Proc. ICASSP. IEEE, 2020, pp. 36–40.
- [6] J. Heitkaemper, D. Jakobeit, C. Boeddeker, L. Drude, and R. Haeb-Umbach, “Demystifying TasNet: A dissecting approach,” in Proc. ICASSP. IEEE, 2020, pp. 6359–6363.
- [7] T. Ochiai, M. Delcroix, R. Ikeshita, K. Kinoshita, T. Nakatani, and S. Araki, “Beam-TasNet: Time-domain audio separation network meets frequency-domain beamformer,” in Proc. ICASSP. IEEE, 2020, pp. 6384–6388.
- [8] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. R. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking,” in Proc. Interspeech, 2019, pp. 2728–2732.
- [9] C. Xu, W. Rao, E. S. Chng, and H. Li, “Optimization of speaker extraction neural network with magnitude and temporal spectrum approximation loss,” in Proc. ICASSP. IEEE, 2019, pp. 6990–6994.
- [10] C. Xu, W. Rao, E. S. Chng, and H. Li, “SpEx: Multi-scale time domain speaker extraction network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. PP, no. 99, pp. 1–1, 2020.
- [11] G. Li, S. Liang, S. Nie, W. Liu, M. Yu, L. Chen, S. Peng, and C. Li, “Direction-aware speaker beam for multi-channel speaker extraction.,” in Proc. Interspeech, 2019, pp. 2713–2717.
- [12] R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “Neural spatial filter: Target speaker speech separation assisted with directional information.,” in Proc. Interspeech, 2019, pp. 4290–4294.
- [13] M. Delcroix, T. Ochiai, K. Zmolikova, K. Kinoshita, N. Tawara, T. Nakatani, and S. Araki, “Improving speaker discrimination of target speech extraction with time-domain speakerbeam,” in Proc. ICASSP. IEEE, 2020, pp. 691–695.
- [14] Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li, and Y. Gong, “Multi-channel overlapped speech recognition with location guided speech extraction network,” in Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 558–565.
- [15] Y. Luo, C. Han, N. Mesgarani, E. Ceolini, and S.-C. Liu, “Fasnet: Low-latency adaptive beamforming for multi-microphone audio processing,” in Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 260–267.
- [16] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in Proc. ICASSP. IEEE, 2020, pp. 6394–6398.
- [17] R. Gu, S.-X. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “Enhancing end-to-end multi-channel speech separation via spatial feature learning,” in Proc. ICASSP. IEEE, 2020, pp. 7319–7323.
- [18] R. Gu, J. Wu, S.-X. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, “End-to-end multi-channel speech separation,” arXiv preprint arXiv:1905.06286, 2019.
- [19] M. Delcroix, K. Zmolikova, T. Ochiai, K. Kinoshita, S. Araki, and T. Nakatani, “Compact network for speakerbeam target speaker extraction,” in Proc. ICASSP. IEEE, 2019, pp. 6965–6969.
- [20] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?,” in Proc. ICASSP. IEEE, 2019, pp. 626–630.
- [21] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in Proc. ICASSP. IEEE, 2018, pp. 1–5.
- [22] https://github.com/xuchenglin28/speaker_extraction/tree/master/simulation.
- [23] https://github.com/funcwj/conv-tasnet.
- [24] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.