
A comparison of handcrafted, parameterized, and learnable features for speech separation

Abstract

The design of acoustic features is important for speech separation. Acoustic features can be roughly categorized into three classes: handcrafted, parameterized, and learnable features. Among them, learnable features, which are trained jointly with separation networks in an end-to-end fashion, have become a new trend of modern speech separation research, e.g. the convolutional time-domain audio separation network (Conv-Tasnet), while handcrafted and parameterized features have also been shown to be competitive in recent studies. However, a systematic comparison across the three kinds of acoustic features has not been conducted yet. In this paper, we compare them in the framework of Conv-Tasnet by setting its encoder and decoder to different acoustic features. We also generalize the handcrafted multi-phase gammatone filterbank (MPGTF) to a new parameterized multi-phase gammatone filterbank (ParaMPGTF). Experimental results on the WSJ0-2mix corpus show that (i) if the decoder is learnable, then setting the encoder to STFT, MPGTF, ParaMPGTF, or learnable features leads to similar performance; and (ii) when the (pseudo) inverse transforms of STFT, MPGTF, and ParaMPGTF are used as the decoders, the proposed ParaMPGTF performs better than the other two handcrafted features.

Index Terms—  Speech separation, handcrafted features, learnable features, parameterized features, multi-phase gammatone filterbank.

1 Introduction

Speech separation aims to separate a mixture of multiple speech sources into its components. In this paper, we study deep-learning-based speaker-independent speech separation, which does not require the training and test speakers to be the same [1]. Hershey et al. first addressed this problem with deep clustering [2]. Since then, several methods have been proposed, such as permutation invariant training [3, 4] and deep attractor networks [5], which aim to estimate a time-frequency mask for each speaker. Among these methods, the magnitude spectrogram of the short-time Fourier transform (STFT) is the most widely used acoustic feature. However, when recovering the time-domain speech from the separated magnitude spectrograms, the noisy phase has to be used, which results in suboptimal performance.

To remedy this weakness, learnable features, which train a network to perform the transforms between the time-domain signal and its time-frequency representation, are becoming a new trend. Representative ones include one-dimensional convolution (1D-conv) filters [6, 7, 8, 9]. Because the transforms are jointly trained with the separation network, and because they need no additional handcrafted operations, they lead to improved performance over STFT. Among the time-domain speech separation methods, the convolutional time-domain audio separation network (Conv-Tasnet), which reaches outstanding separation performance with a frame length of only 2 ms, has received much attention.

Several recent works have studied acoustic features within Conv-Tasnet. For example, Ditter and Gerkmann [10] used a handcrafted feature, named the multi-phase gammatone filterbank (MPGTF), to replace the 1D-conv learnable feature of the encoder, which leads to an improvement over the original Conv-Tasnet in terms of the scale-invariant source-to-noise ratio (SI-SNR). Pariente et al. [11] extended the parameterized filters introduced in [12] to complex-valued analytic filters, and then proposed a similar analytic extension for the 1D-conv filter, which also improves the performance. These positive results demonstrate that handcrafted and parameterized features are competitive with the state-of-the-art learnable features.

However, a systematic comparison between handcrafted, parameterized, and learnable features is still lacking. Motivated by the above studies that replace the encoder or decoder with handcrafted features, in this paper we compare the three kinds of features in the framework of Conv-Tasnet. To better understand the connection between them, we propose a parameterized extension of MPGTF, named parameterized MPGTF (ParaMPGTF), whose centre frequencies and bandwidths are jointly trained with the separation network. We conducted an experimental comparison between STFT, MPGTF, ParaMPGTF, and learnable features on WSJ0-2mix [2]. Experimental results show that, if the decoder is learnable, then setting the encoder to any of the compared features leads to similar performance. We have also compared STFT, MPGTF, and ParaMPGTF when their (pseudo) inverse transforms are used as the decoders. Results show that the proposed ParaMPGTF performs better than the other two handcrafted features.

This paper is organized as follows. Section 2 presents the comparison framework and the proposed ParaMPGTF. Section 3 presents the experimental results. Finally, we conclude our findings in Section 4.

2 Methods

Fig. 1: The building blocks of Conv-Tasnet

2.1 Preliminary

Given $C$ speech sources $\{\mathbf{s}_{c}(t)\}_{c=1}^{C}$, where $t$ is the index of time samples, their mixed signal is

$\mathbf{x}(t)=\sum_{c=1}^{C}\mathbf{s}_{c}(t)$   (1)

The problem of speech separation can be described as producing an accurate estimate $\hat{\mathbf{s}}_{c}(t)$ of $\mathbf{s}_{c}(t)$ from $\mathbf{x}(t)$.

The framework in this study is Conv-Tasnet [7]. As shown in Fig. 1, it consists of three main parts: an encoder, a separation network, and a decoder. It uses a small frame size in the encoder and decoder, which significantly reduces the time delay. The encoder and decoder are learnable 1D-conv filters, which act as transforms between the time-domain signal and the time-frequency features. The separation network is a fully-convolutional separation module that consists of stacked one-dimensional dilated convolutional blocks [13, 14]. It is optimized with the scale-invariant source-to-noise ratio (SI-SNR) loss [5] and produces a mask for each speech source.
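The following minimal sketch (not the authors' implementation) illustrates this encoder-mask-decoder signal flow in PyTorch: a 1D-conv encoder, a placeholder random mask standing in for the separation network, and a 1D transposed-conv decoder. A filter length of 16 samples corresponds to 2 ms at 8 kHz.

```python
import torch
import torch.nn as nn

N, L, D = 512, 16, 8                        # number of filters, filter length, frame shift

encoder = nn.Conv1d(1, N, kernel_size=L, stride=D, bias=False)
decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=D, bias=False)

x = torch.randn(1, 1, 32000)                # 4 s of an 8 kHz mixture (batch, channel, time)
X = torch.relu(encoder(x))                  # non-negative time-frequency representation
mask = torch.sigmoid(torch.randn_like(X))   # placeholder for the separation network output
s_hat = decoder(X * mask)                   # waveform estimate for one speaker
print(s_hat.shape)                          # torch.Size([1, 1, 32000])
```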

2.2 Comparison framework

The comparison uses handcrafted transforms, parameterized transforms, and learnable filters as the encoder and decoder. The encoder can be thought of as a set of $N$ filters of length $L$. The output of the encoder is a time-frequency representation produced by convolving the mixed input signal with the filters:

$\mathbf{X}(n,i)=\mathcal{H}\Big(\sum_{l=0}^{L-1}\mathbf{x}(iD+l)\,\mathbf{h}_{n}^{\mathrm{Enc}}(L-l)\Big)$   (2)

where $n$ is the filter index, $i$ is the frame index, $D$ is the frame shift, $\mathbf{h}_{n}^{\mathrm{Enc}}(\cdot)$ is the $n$-th filter of the filterbank, $l$ denotes the sample index in a frame, and $\mathcal{H}(\cdot)$ is the rectified linear unit (ReLU), which ensures that the representation is non-negative. In the comparison, $\mathbf{h}_{n}^{\mathrm{Enc}}(\cdot)$ can be any of the three kinds of feature transforms.
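For clarity, a direct (unoptimized) rendering of (2) is sketched below; `h_enc` is a hypothetical (N, L) array holding the encoder filters.

```python
import numpy as np

def encode(x, h_enc, D):
    """Eq. (2): frame the mixture, correlate with the time-reversed filters, apply ReLU."""
    N, L = h_enc.shape
    num_frames = (len(x) - L) // D + 1
    X = np.zeros((N, num_frames))
    for i in range(num_frames):
        frame = x[i * D:i * D + L]
        # sum_l x(iD + l) * h_n(L - l) is a correlation with the flipped filter
        X[:, i] = h_enc[:, ::-1] @ frame
    return np.maximum(X, 0.0)               # the ReLU H(.) keeps the representation non-negative
```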

The decoder reconstructs the time-domain signal of the $c$-th speaker $\hat{\mathbf{s}}_{c}\in\mathbb{R}^{T}$. The output of the decoder is:

$\hat{\mathbf{s}}_{c}(k,i)=\sum_{n=0}^{N-1}\hat{\mathbf{S}}_{c}(n,i)\,\mathbf{h}_{N-n}^{\mathrm{Dec}}(k)$   (3)

where $\hat{\mathbf{S}}_{c}(n,i)$ is the output of the separation network for the $c$-th speaker, $k$ is the index of the filter weight, $\mathbf{h}_{n}^{\mathrm{Dec}}(\cdot)$ is the $n$-th filter of the decoder, and $\hat{\mathbf{s}}_{c}(k,i)$ is the estimate of the $c$-th speech source at the $i$-th frame. To undo the frame-shift operation between speech frames, the decoder further performs overlap-add: $\hat{\mathbf{s}}_{c}(t)=\sum_{i=-\infty}^{\infty}\hat{\mathbf{s}}_{c}(t-iD,i)$.
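The decoding step (3) together with the overlap-add over frame shifts can be sketched as follows, again with a hypothetical (N, L) decoder filter array `h_dec`.

```python
import numpy as np

def decode(S_hat, h_dec, D):
    """Eq. (3) plus overlap-add: map each masked frame back to L samples and sum at hop positions."""
    N, num_frames = S_hat.shape
    L = h_dec.shape[1]
    s_hat = np.zeros((num_frames - 1) * D + L)
    for i in range(num_frames):
        frame = h_dec.T @ S_hat[:, i]       # length-L time-domain frame estimate
        s_hat[i * D:i * D + L] += frame     # overlap-add across the frame shift D
    return s_hat
```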

The comparison uses STFT, MPGTF, ParaMPGTF, and learnable filters as $\mathbf{h}_{n}^{\mathrm{Enc}}(\cdot)$, with their inverse transforms as $\mathbf{h}_{N-n}^{\mathrm{Dec}}(\cdot)$, where the proposed ParaMPGTF is presented in the next subsection.

2.3 Parameterized multi-phase gammatone filterbank

The gammatone filterbank, which mimics the masking effect of the human auditory system, provides good features for speech separation [15]. The impulse response $\gamma(t)$ of a gammatone filter is

$\gamma(t)=\alpha t^{n-1}\exp(-2\pi bt)\cos(2\pi f_{c}t+\phi)$   (4)

where $n$ is the order, $b$ is a bandwidth parameter, $f_{c}$ is the centre frequency of the filter, $t>0$ is the time in seconds, $\alpha$ is the amplitude, and $\phi$ is the phase shift. Ditter and Gerkmann [10] extended the classical gammatone filterbank to MPGTF in three aspects. First, the length of the filters is set to 2 ms, which keeps the latency of the system low. Second, for each filter $h_{n}^{\mathrm{Enc}}(\cdot)$, MPGTF introduces $-h_{n}^{\mathrm{Enc}}(\cdot)$ to ensure that, for each centre frequency, at least one filter contains energy. Third, the phase shift $\phi$ varies across filters sharing the same centre frequency. The details of MPGTF can be found in [10].
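As an illustration, the impulse response in (4) can be sampled over a 2 ms window at 8 kHz as in the following sketch, using order $n=2$ and unit amplitude (the values used later in Section 3.2).

```python
import numpy as np

def gammatone(fc, b, phi, fs=8000, length=16, n=2, alpha=1.0):
    """Eq. (4) sampled at fs over `length` samples (about 2 ms at 8 kHz)."""
    t = (np.arange(length) + 1) / fs        # t > 0, in seconds
    return alpha * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)
```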

From (4), we observe that the bandwidth parameter $b$ and the centre frequency $f_{c}$ are two important parameters. They are determined by the equivalent rectangular bandwidth (ERB) [16], defined using a rectangular band-pass filter:

$\mathrm{ERB}(f_{c},c_{1},c_{2})=c_{1}+\dfrac{f_{c}}{c_{2}}$   (5)

$f_{c}=c_{2}(\mathrm{ERB}-c_{1})$   (6)

$b=\dfrac{\mathrm{ERB}\sqrt{(n-1)!}}{\pi\left((2n-2)!\right)2^{2-2n}}$   (7)

where $c_{1}$ and $c_{2}$ are two parameters. Traditionally, $c_{1}$ and $c_{2}$ are empirically set to 24.7 and 9.265, respectively [16]. This empirical setting may not be accurate enough, which may lead to suboptimal performance.

To overcome this issue, we propose ParaMPGTF, which trains the filterbank parameters $c_{1}$ and $c_{2}$ of MPGTF jointly with the network. In each iteration, we update the bandwidth parameter $b$ by (7) and the centre frequencies $f_{c_{1}},f_{c_{2}},\dots,f_{c_{M}}$ by:

$f_{c_{j}}=\mathrm{ERB}_{\mathrm{scale}}^{-1}\big(\mathrm{ERB}_{\mathrm{scale}}(f_{c_{j-1}})+1\big)$   (8)

according to the updated $c_{1}$ and $c_{2}$, where $f_{c_{j}}$ denotes the centre frequency of the $j$-th filter, $M$ is the number of filters in the filterbank, $\mathrm{ERB}_{\mathrm{scale}}$ denotes the ERB scale obtained by integrating $1/\mathrm{ERB}(f_{c})$ over frequency, and $\mathrm{ERB}_{\mathrm{scale}}^{-1}$ is the inverse of $\mathrm{ERB}_{\mathrm{scale}}$. In practice, $\mathrm{ERB}_{\mathrm{scale}}$ and $\mathrm{ERB}_{\mathrm{scale}}^{-1}$ are calculated by:

$\mathrm{ERB}_{\mathrm{scale}}(f_{\mathrm{Hz}})=c_{2}\log\Big(1+\dfrac{f_{\mathrm{Hz}}}{c_{1}c_{2}}\Big)$   (9)

$\mathrm{ERB}_{\mathrm{scale}}^{-1}(\mathrm{ERB}_{\mathrm{scale}})=c_{1}c_{2}\Big(e^{\frac{\mathrm{ERB}_{\mathrm{scale}}}{c_{2}}}-1\Big)$   (10)

where $f_{\mathrm{Hz}}$ denotes a frequency variable. After obtaining $f_{c_{1}},\dots,f_{c_{M}}$ and $b$, we obtain the updated filterbank according to (4). To make ParaMPGTF a meaningful filterbank, $f_{c_{1}},f_{c_{2}},\dots,f_{c_{M}}$ should be constrained between 100 Hz and 4000 Hz. To satisfy this constraint, we fix $f_{c_{1}}$ to 100 Hz during the entire training process.
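The following sketch summarizes how, given the current values of $c_{1}$ and $c_{2}$, the centre frequencies and the bandwidth parameters can be recomputed from (5) and (7)-(10). It is an illustrative rendering of the update rule, not the authors' implementation.

```python
import numpy as np
from math import factorial

def erb(fc, c1, c2):
    return c1 + fc / c2                     # Eq. (5)

def erb_scale(f_hz, c1, c2):
    return c2 * np.log(1 + f_hz / (c1 * c2))          # Eq. (9)

def erb_scale_inv(e, c1, c2):
    return c1 * c2 * (np.exp(e / c2) - 1)             # Eq. (10)

def centre_freqs_and_b(c1, c2, M, n=2, f0=100.0):
    fcs = [f0]                              # f_{c_1} is fixed to 100 Hz during training
    for _ in range(M - 1):                  # Eq. (8): step one unit on the ERB scale
        fcs.append(erb_scale_inv(erb_scale(fcs[-1], c1, c2) + 1, c1, c2))
    fcs = np.array(fcs)
    b = erb(fcs, c1, c2) * np.sqrt(factorial(n - 1)) / (
        np.pi * factorial(2 * n - 2) * 2 ** (2 - 2 * n))   # Eq. (7)
    return fcs, b

# With the empirical initialization c1 = 24.7, c2 = 9.265:
fcs, b = centre_freqs_and_b(24.7, 9.265, M=8)
```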

To summarize, ParaMPGTF combines a data-driven training scheme with MPGTF [10], and it inherits the three modifications of MPGTF described above.

3 EXPERIMENTS AND RESULTS

3.1 Dataset

We conducted the comparison on two-speaker speech separation using the WSJ0-2mix dataset [2]. It contains 30 hours of training data, 10 hours of development data, and 5 hours of test data. The mixtures in WSJ0-2mix were generated by first randomly selecting different speakers and utterances from the Wall Street Journal (WSJ0) training set si_tr_s, and then mixing them at a random signal-to-noise ratio (SNR) between -5 dB and 5 dB [7]. The utterances in the test set were from 16 unseen speakers in the si_dt_05 and si_et_05 directories of the WSJ0 dataset. All waveforms were resampled to 8 kHz.

3.2 Experimental setup

The network was trained for 200 epochs on 4-second long segments. Adam was used as the optimizer with an initial learning rate of 0.001. The learning rate was halved if the performance on the development set did not improve in 5 consecutive epochs. Training was stopped early when the performance on the development set had not improved within the last 10 epochs. The hyperparameters of the network followed the setting in [10], where the number of filters $N$ is 512. The mask activation function of the temporal convolutional network (TCN) was set to either the sigmoid function or the rectified linear unit (ReLU), as specified in the result tables.
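The learning-rate and early-stopping schedule described above can be sketched roughly as follows, assuming PyTorch; the small convolutional layer and the constant dev metric are placeholders for the real separation network and for an actual SI-SNR evaluation on the development set.

```python
import torch
import torch.nn as nn

model = nn.Conv1d(1, 512, kernel_size=16, stride=8)   # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate when the dev-set metric stops improving for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max',
                                                       factor=0.5, patience=5)
best, stale = float('-inf'), 0
for epoch in range(200):
    dev_sisnr = 0.0          # placeholder: train for one epoch, then measure dev-set SI-SNR
    scheduler.step(dev_sisnr)
    if dev_sisnr > best:
        best, stale = dev_sisnr, 0
    else:
        stale += 1
        if stale >= 10:      # early stopping after 10 epochs without improvement
            break
```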

For ParaMPGTF, we set the order $n$ and the amplitude $\alpha$ to 2 and 1, respectively. We initialized $c_{1}$ and $c_{2}$ to their empirical values, i.e. $c_{1}=24.7$ and $c_{2}=9.265$.

We used SI-SNR as the evaluation metric [5]. We reported the average results over all 3000 test mixtures.
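For reference, a minimal sketch of the SI-SNR computation (with mean removal for scale invariance) is given below; it is our own rendering rather than the evaluation code used in [5].

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between a separated signal and its reference."""
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    s_target = np.dot(estimate, target) * target / (np.dot(target, target) + eps)
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```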

Fig. 2: Visualization of the FFT magnitudes of different encoder configurations and their learned decoders: (a) MPGTF-Learned, (b) ParaMPGTF-Learned, (c) STFT-Learned.

3.3 Results with learnable decoders

We first conducted a comparison between STFT, MPGTF, ParaMPGTF, and learnable features when the decoders were set to the learnable features. The comparison results are listed in Table 1. From the table, we observe that the four features do not yield fundamentally different performance. If we look at the details, we find that STFT reaches the highest SI-SNR in both the development set and the test set. MPGTF and ParaMPGTF show competitive performance, where ParaMPGTF performs slightly better than MPGTF on the development set, and slightly worse than the latter on the test set.

Table 1: Comparison of different encoders when the decoders are set to learnable filters.

  Encoder     Decoder   Mask activation   SI-SNR (dB)
                                          Dev     Test
  Learned     Learned   Sigmoid           17.61   16.92
  Learned     Learned   ReLU              17.45   16.89
  MPGTF       Learned   ReLU              17.66   17.20
  ParaMPGTF   Learned   ReLU              17.71   17.06
  STFT        Learned   ReLU              17.96   17.28

Table 2: Comparison of $c_{1}$ and $c_{2}$ between MPGTF and ParaMPGTF when the decoders are set to learnable features.

            MPGTF    ParaMPGTF
  $c_{1}$   24.7     25.09
  $c_{2}$   9.265    9.198

Fig. 2 shows the magnitude spectrograms of the MPGTF, ParaMPGTF, and STFT encoders with their corresponding learnable decoders, where we only plot the STFT bins with indices from 1 to 256 [17, 18], since the real and imaginary parts share similar patterns. The filters are uniformly distributed in the frequency range from 0 Hz to 4000 Hz. From the figure, we see that the magnitude spectrograms of ParaMPGTF and MPGTF are similar. This phenomenon not only accounts for their similar performance, but also demonstrates that the parameterized feature can be optimized successfully. As a byproduct, it shows that (i) MPGTF is a well-designed handcrafted feature; and (ii) the learnable decoders are able to learn effective inverse transforms of their encoders.

Table 2 lists the comparison between the handcrafted $c_{1}$ and $c_{2}$ in MPGTF and the optimized $c_{1}$ and $c_{2}$ in ParaMPGTF. From the table, we see that the two groups of parameters are similar, which further accounts for the similar performance of MPGTF and ParaMPGTF.

3.4 Results with (pseudo) inverse transform decoders

In this experiment, we set the encoder to STFT, MPGTF, and ParaMPGTF respectively, and set the decoder to the corresponding (pseudo) inverse transform.
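One simple way to construct such a decoder, sketched below under the assumption that the encoder is available as a fixed (N, L) filterbank matrix, is to take the Moore-Penrose pseudo-inverse of that matrix as the synthesis filterbank; the result can be used directly with the decode() sketch in Section 2.2.

```python
import numpy as np

def pseudo_inverse_decoder(h_enc):
    """Return an (N, L) synthesis filterbank as the pseudo-inverse of the (N, L) analysis filterbank."""
    return np.linalg.pinv(h_enc).T          # pinv gives an (L, N) matrix; transpose back to (N, L)
```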

Table 3 lists the performance of MPGTF, ParaMPGTF, and STFT with their (pseudo) inverse transforms. From the table, we see that the performance of the three comparison methods is similar in general. If we look into the details, we see that the proposed ParaMPGTF reaches the best performance among the comparison methods on both the development set and the test set, which demonstrates the potential of the parameterized training strategy in improving conventional handcrafted features.

Table 3: Comparison of encoders and decoders with different features. The mask activation function is ReLU.

  Encoder     Decoder                 SI-SNR (dB)
                                      Dev     Test
  MPGTF       MPGTF Pseudo Inv.       16.32   15.73
  ParaMPGTF   ParaMPGTF Pseudo Inv.   16.64   16.04
  STFT        ISTFT                   16.31   15.82
Fig. 3: Convergence curves of different encoder-decoder pairs in the training process.

Fig. 3 shows the convergence curves of the deep models on the development set when the decoders are set to the (pseudo) inverse transforms of their encoders. From the figure, we find that the learnable feature converges faster than the handcrafted and parameterized features. Although the handcrafted features and ParaMPGTF converge at a similar rate in the early training stage, ParaMPGTF converges faster in the late training stage.

4 CONCLUSIONS

In this paper, we have proposed a parameterized multi-phase gammatone filterbank (ParaMPGTF), which jointly learns the core parameters of MPGTF with the separation network. We have also compared handcrafted, parameterized, and learnable features, namely STFT, MPGTF, ParaMPGTF, and learnable filters, in the same experimental framework, which is, to our knowledge, the first time that the three kinds of features have been compared together. Experimental results show that, when the decoders are set to learnable features, the four features behave similarly, with STFT performing slightly better than the others. When the decoders are set to the (pseudo) inverse transforms of the encoders, ParaMPGTF performs better than the handcrafted features.

References

  • [1] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
  • [2] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
  • [3] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 241–245.
  • [4] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
  • [5] Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 246–250.
  • [6] Yi Luo and Nima Mesgarani, “Tasnet: time-domain audio separation network for real-time, single-channel speech separation,” 2017.
  • [7] Yi Luo and Nima Mesgarani, “Conv-tasnet: Surpassing ideal time-frequency magnitude masking for speech separation,” 2018.
  • [8] A. Pandey and D. Wang, “Tcnn: Temporal convolutional neural network for real-time speech enhancement in the time domain,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6875–6879.
  • [9] Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, Jiqing Han, and Anyan Shi, “Furcanext: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks,” 2019.
  • [10] D. Ditter and T. Gerkmann, “A multi-phase gammatone filterbank for speech separation via tasnet,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 36–40.
  • [11] Manuel Pariente, Samuele Cornell, Antoine Deleforge, and Emmanuel Vincent, “Filterbank design for end-to-end speech separation,” 2019.
  • [12] Mirco Ravanelli and Yoshua Bengio, “Speaker recognition from raw waveform with sincnet,” 2018.
  • [13] Colin Lea, René Vidal, Austin Reiter, and Gregory D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in Computer Vision – ECCV 2016 Workshops, Gang Hua and Hervé Jégou, Eds., Cham, 2016, pp. 47–54, Springer International Publishing.
  • [14] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1003–1012.
  • [15] R. D. Patterson, K. Robinson, J. Holdsworth, D. Mckeown, C. Zhang, and M. Allerhand, “Complex sounds and auditory images,” Auditory Physiology and Perception, pp. 429–446, 1992.
  • [16] V Hohmann, “Frequency analysis and synthesis using a gammatone filterbank,” Acta Acustica United with Acustica, vol. 88, no. 3, pp. 433–442, 2002.
  • [17] A. Pandey and D. Wang, “A new framework for cnn-based speech enhancement in the time domain,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1179–1188, 2019.
  • [18] Ashutosh Pandey and DeLiang Wang, “A new framework for cnn-based speech enhancement in the time domain,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1179–1188, 2019.