
A comparison of handcrafted, parameterized, and learnable features for speech separation

Abstract

The design of acoustic features is important for speech separation. Acoustic features can be roughly categorized into three classes: handcrafted, parameterized, and learnable features. Among them, learnable features, which are trained jointly with separation networks in an end-to-end fashion, have become a new trend of modern speech separation research, e.g. the convolutional time-domain audio separation network (Conv-Tasnet), while handcrafted and parameterized features have also been shown to be competitive in recent studies. However, a systematic comparison across the three kinds of acoustic features has not been conducted yet. In this paper, we compare them in the framework of Conv-Tasnet by setting its encoder and decoder to different acoustic features. We also generalize the handcrafted multi-phase gammatone filterbank (MPGTF) to a new parameterized multi-phase gammatone filterbank (ParaMPGTF). Experimental results on the WSJ0-2mix corpus show that (i) if the decoder is learnable, then setting the encoder to STFT, MPGTF, ParaMPGTF, or learnable features leads to similar performance; and (ii) when the (pseudo) inverse transforms of STFT, MPGTF, and ParaMPGTF are used as the decoders, the proposed ParaMPGTF performs better than the other two handcrafted features.

Index Terms—  Speech separation, handcrafted features, learnable features, parameterized features, multi-phase gammatone filterbank.

1 Introduction

Speech separation aims to separate a mixture of multiple speech sources into its components. In this paper, we study deep-learning-based speaker-independent speech separation, which does not require the training and test speakers to be the same [1]. Hershey et al. first addressed this problem with deep clustering [2]. Since then, several methods have been proposed, such as permutation invariant training [3, 4] and deep attractor networks [5], which aim to estimate a time-frequency mask for each speaker. Among these methods, the magnitude spectrogram of the short-time Fourier transform (STFT) is the most widely used acoustic feature. However, when recovering the time-domain speech from the separated magnitude spectrograms, the noisy phase has to be used, which results in suboptimal performance.

To remedy this weakness, learnable features, which train a network to perform the transforms between the time-domain signal and its time-frequency representation, are becoming a new trend. Representative ones include one-dimensional convolution (1D-conv) filters [6, 7, 8, 9]. Because the transforms are jointly trained with the separation network, and because they need no additional handcrafted operations, they lead to improved performance over STFT. Among the time-domain speech separation methods, the convolutional time-domain audio separation network (Conv-Tasnet), which reaches outstanding separation performance with a frame length of only 2 ms, has received much attention.

Several recent works have studied acoustic features within Conv-Tasnet. For example, Ditter and Gerkmann [10] used a handcrafted feature, named the multi-phase gammatone filterbank (MPGTF), to replace the 1D-conv learnable feature of the encoder, which leads to an improvement over the original Conv-Tasnet in terms of the scale-invariant source-to-noise ratio (SI-SNR). Pariente et al. [11] extended the parameterized filters introduced in [12] to complex-valued analytic filters, and then proposed a similar analytic extension for the 1D-conv filter, which also improves the performance. These positive results demonstrate that handcrafted and parameterized features are competitive with the state-of-the-art learnable features.

However, a systematic comparison between handcrafted, parameterized, and learnable features is still lacking. Motivated by the above studies that replace the encoder or decoder with handcrafted features, in this paper we compare the three kinds of features in the framework of Conv-Tasnet. To better understand the connection between them, we propose a parameterized extension of MPGTF, named parameterized MPGTF (ParaMPGTF), whose centre frequencies and bandwidths are jointly trained with the separation network. We conducted an experimental comparison between STFT, MPGTF, ParaMPGTF, and learnable features on WSJ0-2mix [2]. Experimental results show that, if the decoder is learnable, then setting the encoder to any of the compared features leads to similar performance. We have also compared STFT, MPGTF, and ParaMPGTF when their (pseudo) inverse transforms are used as the decoders. Results show that the proposed ParaMPGTF performs better than the other two handcrafted features.

This paper is organized as follows. Section 2 presents the comparison framework and the proposed ParaMPGTF. Section 3 presents the experimental results. Finally, we conclude our findings in Section 4.

2 Methods

Fig. 1: The building blocks of Conv-Tasnet

2.1 Preliminary

Given $C$ speech sources $\{\mathbf{s}_{c}(t)\}_{c=1}^{C}$, where $t$ is the index of time samples, their mixed signal is

$\mathbf{x}(t)=\sum_{c=1}^{C}\mathbf{s}_{c}(t)$   (1)

The problem of speech separation can be described as producing an accurate estimate $\hat{\mathbf{s}}_{c}(t)$ of $\mathbf{s}_{c}(t)$ from $\mathbf{x}(t)$.

The framework in this study is Conv-Tasnet [7]. As shown in Fig. 1, it consists of three main parts: an encoder, a separation network, and a decoder. It uses a small frame size in the encoder and decoder, which significantly reduces the time delay. The encoder and decoder are learnable 1D-conv filters, which act as transforms between the time-domain signal and the time-frequency features. The separation network is a fully-convolutional separation module that consists of stacked one-dimensional dilated convolutional blocks [13, 14]. It is optimized with the scale-invariant source-to-noise ratio (SI-SNR) loss [5] and produces a mask for each speech source.
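The following minimal sketch (not the authors' implementation) illustrates this encoder-mask-decoder signal flow in PyTorch: a 1D-conv encoder, a placeholder random mask standing in for the separation network, and a 1D transposed-conv decoder. A filter length of 16 samples corresponds to 2 ms at 8 kHz.

```python
import torch
import torch.nn as nn

N, L, D = 512, 16, 8                        # number of filters, filter length, frame shift

encoder = nn.Conv1d(1, N, kernel_size=L, stride=D, bias=False)
decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=D, bias=False)

x = torch.randn(1, 1, 32000)                # 4 s of an 8 kHz mixture (batch, channel, time)
X = torch.relu(encoder(x))                  # non-negative time-frequency representation
mask = torch.sigmoid(torch.randn_like(X))   # placeholder for the separation network output
s_hat = decoder(X * mask)                   # waveform estimate for one speaker
print(s_hat.shape)                          # torch.Size([1, 1, 32000])
```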

2.2 Comparison framework

The comparison uses handcrafted transforms, parameterized transforms, and learnable filters as the encoder and decoder. The encoder can be thought of as a set of $N$ filters of length $L$. The output of the encoder is a time-frequency representation produced by convolving the mixed input signal with the filters:

$\mathbf{X}(n,i)=\mathcal{H}\Big(\sum_{l=0}^{L-1}\mathbf{x}(iD+l)\,\mathbf{h}_{n}^{\mathrm{Enc}}(L-l)\Big)$   (2)

where $n$ is the filter index, $i$ is the frame index, $D$ is the frame shift, $\mathbf{h}_{n}^{\mathrm{Enc}}(\cdot)$ is the $n$-th filter of the filterbank, $l$ denotes the sample index in a frame, and $\mathcal{H}(\cdot)$ is the rectified linear unit (ReLU), which ensures that the representation is non-negative. In the comparison, $\mathbf{h}_{n}^{\mathrm{Enc}}(\cdot)$ can be any of the three kinds of feature transforms.
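For clarity, a direct (unoptimized) rendering of (2) is sketched below; `h_enc` is a hypothetical (N, L) array holding the encoder filters.

```python
import numpy as np

def encode(x, h_enc, D):
    """Eq. (2): frame the mixture, correlate with the time-reversed filters, apply ReLU."""
    N, L = h_enc.shape
    num_frames = (len(x) - L) // D + 1
    X = np.zeros((N, num_frames))
    for i in range(num_frames):
        frame = x[i * D:i * D + L]
        # sum_l x(iD + l) * h_n(L - l) is a correlation with the flipped filter
        X[:, i] = h_enc[:, ::-1] @ frame
    return np.maximum(X, 0.0)               # the ReLU H(.) keeps the representation non-negative
```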

The decoder reconstructs the time-domain signal of the $c$-th speaker $\hat{\mathbf{s}}_{c}\in\mathbb{R}^{T}$. The output of the decoder is:

$\hat{\mathbf{s}}_{c}(k,i)=\sum_{n=0}^{N-1}\hat{\mathbf{S}}_{c}(n,i)\,\mathbf{h}_{N-n}^{\mathrm{Dec}}(k)$   (3)

where $\hat{\mathbf{S}}_{c}(n,i)$ is the output of the separation network for the $c$-th speaker, $k$ is the index of the filter weight, $\mathbf{h}_{n}^{\mathrm{Dec}}(\cdot)$ is the $n$-th filter of the decoder, and $\hat{\mathbf{s}}_{c}(k,i)$ is the estimate of the $c$-th speech source at the $i$-th frame. To undo the frame-shift operation between speech frames, the decoder further performs overlap-add: $\hat{\mathbf{s}}_{c}(t)=\sum_{i=-\infty}^{\infty}\hat{\mathbf{s}}_{c}(t-iD,i)$.
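The decoding step (3) together with the overlap-add over frame shifts can be sketched as follows, again with a hypothetical (N, L) decoder filter array `h_dec`.

```python
import numpy as np

def decode(S_hat, h_dec, D):
    """Eq. (3) plus overlap-add: map each masked frame back to L samples and sum at hop positions."""
    N, num_frames = S_hat.shape
    L = h_dec.shape[1]
    s_hat = np.zeros((num_frames - 1) * D + L)
    for i in range(num_frames):
        frame = h_dec.T @ S_hat[:, i]       # length-L time-domain frame estimate
        s_hat[i * D:i * D + L] += frame     # overlap-add across the frame shift D
    return s_hat
```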

The comparison uses STFT, MPGTF, ParaMPGTF, and learnable filters as $\mathbf{h}_{n}^{\mathrm{Enc}}(\cdot)$, with their inverse transforms as $\mathbf{h}_{N-n}^{\mathrm{Dec}}(\cdot)$, where the proposed ParaMPGTF is presented in the next subsection.

2.3 Parameterized multi-phase gammatone filterbank

The gammatone filterbank, which mimics the masking effect of the human auditory system, provides good features for speech separation [15]. The impulse response $\gamma(t)$ of a gammatone filter is

$\gamma(t)=\alpha t^{n-1}\exp(-2\pi bt)\cos(2\pi f_{c}t+\phi)$   (4)

where $n$ is the order, $b$ is a bandwidth parameter, $f_{c}$ is the centre frequency of the filter, $t>0$ is the time in seconds, $\alpha$ is the amplitude, and $\phi$ is the phase shift. Ditter and Gerkmann [10] extended the classical gammatone filterbank to MPGTF in three aspects. First, the length of the filters is set to 2 ms, which keeps the latency of the system low. Second, for each filter $h_{n}^{\mathrm{Enc}}(\cdot)$, MPGTF introduces $-h_{n}^{\mathrm{Enc}}(\cdot)$ to ensure that, for each centre frequency, at least one filter contains energy. Third, the phase shift $\phi$ varies across filters sharing the same centre frequency. The details of MPGTF can be found in [10].
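As an illustration, the impulse response in (4) can be sampled over a 2 ms window at 8 kHz as in the following sketch, using order $n=2$ and unit amplitude (the values used later in Section 3.2).

```python
import numpy as np

def gammatone(fc, b, phi, fs=8000, length=16, n=2, alpha=1.0):
    """Eq. (4) sampled at fs over `length` samples (about 2 ms at 8 kHz)."""
    t = (np.arange(length) + 1) / fs        # t > 0, in seconds
    return alpha * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)
```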

From (4), we observe that the bandwidth parameter $b$ and the centre frequency $f_{c}$ are two important parameters. They are determined by the equivalent rectangular bandwidth (ERB) [16], defined using a rectangular band-pass filter:

$\mathrm{ERB}(f_{c},c_{1},c_{2})=c_{1}+\dfrac{f_{c}}{c_{2}}$   (5)

$f_{c}=c_{2}(\mathrm{ERB}-c_{1})$   (6)

$b=\dfrac{\mathrm{ERB}\sqrt{(n-1)!}}{\pi\left((2n-2)!\right)2^{2-2n}}$   (7)

where $c_{1}$ and $c_{2}$ are two parameters. Traditionally, $c_{1}$ and $c_{2}$ are empirically set to 24.7 and 9.265, respectively [16]. This empirical setting may not be accurate enough, which may lead to suboptimal performance.

To overcome this issue, we propose ParaMPGTF, which trains the filterbank parameters $c_{1}$ and $c_{2}$ of MPGTF jointly with the network. In each iteration, we update the bandwidth parameter $b$ by (7) and the centre frequencies $f_{c_{1}},f_{c_{2}},\dots,f_{c_{M}}$ by:

$f_{c_{j}}=\mathrm{ERB}_{\mathrm{scale}}^{-1}\big(\mathrm{ERB}_{\mathrm{scale}}(f_{c_{j-1}})+1\big)$   (8)

according to the updated $c_{1}$ and $c_{2}$, where $f_{c_{j}}$ denotes the centre frequency of the $j$-th filter, $M$ is the number of filters in the filterbank, $\mathrm{ERB}_{\mathrm{scale}}$ denotes the ERB scale obtained by integrating $1/\mathrm{ERB}(f_{c})$ over frequency, and $\mathrm{ERB}_{\mathrm{scale}}^{-1}$ is the inverse of $\mathrm{ERB}_{\mathrm{scale}}$. In practice, $\mathrm{ERB}_{\mathrm{scale}}$ and $\mathrm{ERB}_{\mathrm{scale}}^{-1}$ are calculated by:

$\mathrm{ERB}_{\mathrm{scale}}(f_{\mathrm{Hz}})=c_{2}\log\Big(1+\dfrac{f_{\mathrm{Hz}}}{c_{1}c_{2}}\Big)$   (9)

$\mathrm{ERB}_{\mathrm{scale}}^{-1}(\mathrm{ERB}_{\mathrm{scale}})=c_{1}c_{2}\Big(e^{\frac{\mathrm{ERB}_{\mathrm{scale}}}{c_{2}}}-1\Big)$   (10)

where $f_{\mathrm{Hz}}$ denotes a frequency variable. After obtaining $f_{c_{1}},\dots,f_{c_{M}}$ and $b$, we obtain the updated filterbank according to (4). To make ParaMPGTF a meaningful filterbank, $f_{c_{1}},f_{c_{2}},\dots,f_{c_{M}}$ should be constrained between 100 Hz and 4000 Hz. To satisfy this constraint, we fix $f_{c_{1}}$ to 100 Hz during the entire training process.
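The following sketch summarizes how, given the current values of $c_{1}$ and $c_{2}$, the centre frequencies and the bandwidth parameters can be recomputed from (5) and (7)-(10). It is an illustrative rendering of the update rule, not the authors' implementation.

```python
import numpy as np
from math import factorial

def erb(fc, c1, c2):
    return c1 + fc / c2                     # Eq. (5)

def erb_scale(f_hz, c1, c2):
    return c2 * np.log(1 + f_hz / (c1 * c2))          # Eq. (9)

def erb_scale_inv(e, c1, c2):
    return c1 * c2 * (np.exp(e / c2) - 1)             # Eq. (10)

def centre_freqs_and_b(c1, c2, M, n=2, f0=100.0):
    fcs = [f0]                              # f_{c_1} is fixed to 100 Hz during training
    for _ in range(M - 1):                  # Eq. (8): step one unit on the ERB scale
        fcs.append(erb_scale_inv(erb_scale(fcs[-1], c1, c2) + 1, c1, c2))
    fcs = np.array(fcs)
    b = erb(fcs, c1, c2) * np.sqrt(factorial(n - 1)) / (
        np.pi * factorial(2 * n - 2) * 2 ** (2 - 2 * n))   # Eq. (7)
    return fcs, b

# With the empirical initialization c1 = 24.7, c2 = 9.265:
fcs, b = centre_freqs_and_b(24.7, 9.265, M=8)
```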

To summarize, ParaMPGTF combines a data-driven training scheme with MPGTF [10], and it inherits the three modifications of MPGTF described above.

3 EXPERIMENTS AND RESULTS

3.1 Dataset

We conducted the comparison on two-speaker speech separation using the WSJ0-2mix dataset [2]. It contains 30 hours of training data, 10 hours of development data, and 5 hours of test data. The mixtures in WSJ0-2mix were generated by first randomly selecting different speakers and utterances from the Wall Street Journal (WSJ0) training set si_tr_s, and then mixing them at a random signal-to-noise ratio (SNR) between -5 dB and 5 dB [7]. The utterances in the test set were from 16 unseen speakers in the si_dt_05 and si_et_05 directories of the WSJ0 dataset. All waveforms were resampled to 8 kHz.

3.2 Experimental setup

The network was trained for 200 epochs on 4-second long segments. Adam was used as the optimizer with an initial learning rate of 0.001. The learning rate was halved if the performance on the development set did not improve in 5 consecutive epochs. Training was stopped early when the performance on the development set had not improved within the last 10 epochs. The hyperparameters of the network followed the setting in [10], where the number of filters $N$ is 512. The mask activation function of the temporal convolutional network (TCN) was set to either the sigmoid function or the rectified linear unit (ReLU), as specified in the result tables.
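The learning-rate and early-stopping schedule described above can be sketched roughly as follows, assuming PyTorch; the small convolutional layer and the constant dev metric are placeholders for the real separation network and for an actual SI-SNR evaluation on the development set.

```python
import torch
import torch.nn as nn

model = nn.Conv1d(1, 512, kernel_size=16, stride=8)   # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate when the dev-set metric stops improving for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max',
                                                       factor=0.5, patience=5)
best, stale = float('-inf'), 0
for epoch in range(200):
    dev_sisnr = 0.0          # placeholder: train for one epoch, then measure dev-set SI-SNR
    scheduler.step(dev_sisnr)
    if dev_sisnr > best:
        best, stale = dev_sisnr, 0
    else:
        stale += 1
        if stale >= 10:      # early stopping after 10 epochs without improvement
            break
```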

For ParaMPGTF, we set the order $n$ and the amplitude $\alpha$ to 2 and 1, respectively. We initialized $c_{1}$ and $c_{2}$ to their empirical values, i.e. $c_{1}=24.7$ and $c_{2}=9.265$.

We used SI-SNR as the evaluation metric [5]. We reported the average results over all 3000 test mixtures.
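For reference, a minimal sketch of the SI-SNR computation (with mean removal for scale invariance) is given below; it is our own rendering rather than the evaluation code used in [5].

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between a separated signal and its reference."""
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    s_target = np.dot(estimate, target) * target / (np.dot(target, target) + eps)
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```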

Fig. 2: Visualization of the FFT magnitudes of different encoder configurations and their learned decoders: (a) MPGTF-Learned, (b) ParaMPGTF-Learned, (c) STFT-Learned.

3.3 Results with learnable decoders

We first conducted a comparison between STFT, MPGTF, ParaMPGTF, and learnable features when the decoders were set to the learnable features. The comparison results are listed in Table 1. From the table, we observe that the four features do not yield fundamentally different performance. If we look at the details, we find that STFT reaches the highest SI-SNR in both the development set and the test set. MPGTF and ParaMPGTF show competitive performance, where ParaMPGTF performs slightly better than MPGTF on the development set, and slightly worse than the latter on the test set.

Table 1: Comparison of different encoders when the decoders are set to learnable filters.

  Encoder     Decoder   Mask activation   SI-SNR (dB)
                                          Dev     Test
  Learned     Learned   Sigmoid           17.61   16.92
  Learned     Learned   ReLU              17.45   16.89
  MPGTF       Learned   ReLU              17.66   17.20
  ParaMPGTF   Learned   ReLU              17.71   17.06
  STFT        Learned   ReLU              17.96   17.28

Table 2: Comparison of $c_{1}$ and $c_{2}$ between MPGTF and ParaMPGTF when the decoders are set to learnable features.

            MPGTF    ParaMPGTF
  $c_{1}$   24.7     25.09
  $c_{2}$   9.265    9.198

Fig. 2 shows the magnitude spectrograms of the MPGTF, ParaMPGTF, and STFT encoders with their corresponding learnable decoders, where we only plot the STFT bins with indices from 1 to 256 [17, 18], since the real and imaginary parts share similar patterns. The filters are uniformly distributed in the frequency range from 0 Hz to 4000 Hz. From the figure, we see that the magnitude spectrograms of ParaMPGTF and MPGTF are similar. This phenomenon not only accounts for their similar performance, but also demonstrates that the parameterized feature can be optimized successfully. As a byproduct, it shows that (i) MPGTF is a well-designed handcrafted feature; and (ii) the learnable decoders are able to learn effective inverse transforms of their encoders.

Table 2 lists the comparison between the handcrafted $c_{1}$ and $c_{2}$ in MPGTF and the optimized $c_{1}$ and $c_{2}$ in ParaMPGTF. From the table, we see that the two groups of parameters are similar, which further accounts for the similar performance of MPGTF and ParaMPGTF.

3.4 Results with (pseudo) inverse transform decoders

In this experiment, we set the encoder to STFT, MPGTF, and ParaMPGTF respectively, and set the decoder to the corresponding (pseudo) inverse transform.
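One simple way to construct such a decoder, sketched below under the assumption that the encoder is available as a fixed (N, L) filterbank matrix, is to take the Moore-Penrose pseudo-inverse of that matrix as the synthesis filterbank; the result can be used directly with the decode() sketch in Section 2.2.

```python
import numpy as np

def pseudo_inverse_decoder(h_enc):
    """Return an (N, L) synthesis filterbank as the pseudo-inverse of the (N, L) analysis filterbank."""
    return np.linalg.pinv(h_enc).T          # pinv gives an (L, N) matrix; transpose back to (N, L)
```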

Table 3 lists the performance of MPGTF, ParaMPGTF, and STFT with their (pseudo) inverse transforms. From the table, we see that the performance of the three comparison methods is similar in general. If we look into the details, we see that the proposed ParaMPGTF reaches the best performance among the comparison methods on both the development set and the test set, which demonstrates the potential of the parameterized training strategy in improving conventional handcrafted features.

Table 3: Comparison of encoders and decoders with different features. The mask activation function is ReLU.

  Encoder     Decoder                 SI-SNR (dB)
                                      Dev     Test
  MPGTF       MPGTF Pseudo Inv.       16.32   15.73
  ParaMPGTF   ParaMPGTF Pseudo Inv.   16.64   16.04
  STFT        ISTFT                   16.31   15.82
Fig. 3: Convergence curves of different encoder-decoder pairs in the training process.

Fig. 3 shows the convergence curves of the deep models on the development set when the decoders are set to the (pseudo) inverse transforms of their encoders. From the figure, we find that the learnable feature converges faster than the handcrafted and parameterized features. Although the handcrafted features and ParaMPGTF converge at a similar rate in the early training stage, ParaMPGTF converges faster in the late training stage.

4 CONCLUSIONS

In this paper, we have proposed a parameterized multi-phase gammatone filterbank (ParaMPGTF), which jointly learns the core parameters of MPGTF with the separation network. We have also compared handcrafted, parameterized, and learnable features, namely STFT, MPGTF, ParaMPGTF, and learnable filters, in the same experimental framework, which is, to our knowledge, the first time that the three kinds of features have been compared together. Experimental results show that, when the decoders are set to learnable features, the four features behave similarly, with STFT performing slightly better than the others. When the decoders are set to the (pseudo) inverse transforms of the encoders, ParaMPGTF performs better than the handcrafted features.

References

  • [1] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
  • [2] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
  • [3] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 241–245.
  • [4] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
  • [5] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 246–250.
  • [6] Yi Luo and Nima Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” 2017.
  • [7] Yi Luo and Nima Mesgarani, “Conv-tasnet: Surpassing ideal time-frequency magnitude masking for speech separation,” 2018.
  • [8] A. Pandey and D. Wang, “Tcnn: Temporal convolutional neural network for real-time speech enhancement in the time domain,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6875–6879.
  • [9] Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, Jiqing Han, and Anyan Shi, “Furcanext: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks,” 2019.
  • [10] D. Ditter and T. Gerkmann, “A multi-phase gammatone filterbank for speech separation via tasnet,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 36–40.
  • [11] Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent, “Filterbank design for end-to-end speech separation,” 2019.
  • [12] Mirco Ravanelli and Yoshua Bengio, “Speaker recognition from raw waveform with sincnet,” 2018.
  • [13] Colin Lea, René Vidal, Austin Reiter, and Gregory D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in Computer Vision – ECCV 2016 Workshops, Gang Hua and Hervé Jégou, Eds., Cham, 2016, pp. 47–54, Springer International Publishing.
  • [14] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1003–1012.
  • [15] R. D. Patterson, K. Robinson, J. Holdsworth, D. Mckeown, C. Zhang, and M. Allerhand, “Complex sounds and auditory images,” Auditory Physiology and Perception, pp. 429–446, 1992.
  • [16] V Hohmann, “Frequency analysis and synthesis using a gammatone filterbank,” Acta Acustica United with Acustica, vol. 88, no. 3, pp. 433–442, 2002.
  • [17] A. Pandey and D. Wang, “A new framework for cnn-based speech enhancement in the time domain,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1179–1188, 2019.
  • [18] Ashutosh Pandey and DeLiang Wang, “A new framework for cnn-based speech enhancement in the time domain,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1179–1188, 2019.