Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method
This work was supported by JSPS KAKENHI Grant Number JP20K19818.
Abstract
Audio source separation is often used as preprocessing for various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the wide variety of audio signals. Since the sampling frequency of an audio signal is usually application specific, an audio source separation model used as a preprocessor should be able to handle audio signals at all sampling frequencies specified by the target applications. However, conventional models based on deep neural networks (DNNs) are trained only at the sampling frequency specified by the training data, and there is no guarantee that they work at unseen sampling frequencies. In this paper, we propose a convolution layer capable of handling arbitrary sampling frequencies with a single DNN. Through music source separation experiments, we show that introducing the proposed layer enables a conventional audio source separation model to work consistently even at unseen sampling frequencies.
Index Terms:
Audio source separation, analog-to-digital filter conversion, deep neural networks

I Introduction

Audio source separation is a technique for extracting individual sources from a mixture signal. It is one of the fundamental techniques for various audio applications including music remixing, automatic music transcription, and automatic speech recognition. The recent development of source separation has been built upon machine learning techniques using deep neural networks (DNNs) [1, 2, 3, 4, 5, 6, 7, 8, 9]. Since source separation is often utilized as a preprocessing step for another task, one of the ultimate development goals is to construct a single universal DNN that can be utilized as the preprocessor for any application. To realize such a universal source separator, every variety of applications and conditions must be handled by a single DNN.
One important but often overlooked source of variety in audio signals is the sampling frequency. It is usually application specific, and hence a preprocessor must be designed for the sampling frequency specified by the subsequent application. For example, music remixing and editing usually use 44.1 or 48 kHz to cover the entire human audible range [5, 8] because these applications are aimed at human listeners. In contrast, applications aimed at recognizing the contents of audio signals do not require such full-band information: beat tracking [10], automatic music transcription [11, 12], and automatic speech recognition [13, 14, 15] often operate at lower sampling frequencies such as 8 or 16 kHz. A versatile preprocessor must be able to handle signals sampled at all of these sampling frequencies.
However, ordinary DNNs cannot handle audio signals sampled at various sampling frequencies. Conventional DNN-based models work well only at the sampling frequency specified by the training data [1, 2, 3, 4, 5, 6, 7, 8, 9]. The parameters of a DNN are trained to fit the training dataset, and thus there is no guarantee that the model is applicable to signals sampled at other (unseen) sampling frequencies. This is because the layers used in a DNN are not designed for multiple sampling frequencies; in fact, the sampling frequency has not been treated as a parameter of a DNN but is implicitly given by the training dataset. To realize a DNN that works consistently at any sampling frequency, the network must be designed to be sampling-frequency-independent (SFI).
In this paper, we propose an SFI convolution layer that enables a single DNN to handle arbitrary sampling frequencies. The key idea behind the proposed layer is the connection between a digital filter and a convolution layer. From a signal processing viewpoint, we can interpret a convolution layer as a collection of time-reversed digital finite impulse response (FIR) filters. Therefore, a filter design technique can be utilized to design a convolution layer. In this paper, we consider the impulse invariant method (see Chap. 7 in [16]), in which a digital filter is designed by sampling an analog filter. On the basis of this analog-to-digital filter conversion, we introduce latent analog filters into a convolution layer. Since an analog filter is independent of the sampling frequency, we can construct an SFI convolution layer via the analog representation of a filter, where the impulse invariant method determines its sampling frequency afterward. The proposed SFI layer can be trained by parametrizing the analog filter as a differentiable function. By incorporating the proposed layer into one of the state-of-the-art source separation models, we also propose an SFI audio source separation model.
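As a concrete illustration of this interpretation (not part of the proposed method), the following minimal PyTorch/SciPy check verifies that a 1D convolution layer produces the same output as FIR filtering with the time-reversed kernel; all sizes are arbitrary placeholders.

```python
import numpy as np
import torch
from scipy.signal import lfilter

torch.manual_seed(0)
conv = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=16, bias=False)
x = torch.randn(1, 1, 128)

y_conv = conv(x).detach().numpy().ravel()          # output of the convolution layer

w = conv.weight.detach().numpy().ravel()
fir = w[::-1]                                      # time-reversed kernel = FIR impulse response
y_fir = lfilter(fir, [1.0], x.numpy().ravel())[len(fir) - 1:]  # drop the transient to align

print(np.allclose(y_conv, y_fir, atol=1e-4))       # True: the layer is a bank of FIR filters
```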
II Conventional Models
II-A Conv-TasNet [3]
Conv-TasNet is a recent time-domain DNN for audio source separation that works well for speech [3] and music source separation [17, 5]. Since the Conv-TasNet architectures in these three papers differ slightly, we adopt the architecture for music source separation defined in [17], as illustrated in Fig. 1(a). Conv-TasNet consists of an encoder-decoder pair and one masking module per source. The encoder and decoder imitate a traditional time-frequency transform (e.g., the short-time Fourier transform) and its inverse. The encoder transforms a monaural time-domain signal into a multichannel latent representation by a one-dimensional (1D) convolution layer (with a given kernel size and stride) followed by the rectified linear unit (ReLU). Each masking module estimates a mask for its target source from the latent representation. It comprises stacked convolution blocks, each consisting of a 1D dilated convolution layer with an exponentially increasing dilation factor; the details of the convolution block are given in [3]. The decoder converts the masked latent representations into the separated time-domain signals by a 1D transposed convolution layer with the same kernel size and stride.
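For concreteness, the following is a rough PyTorch sketch of the encoder/decoder pair described above; the masking modules are replaced with a random stand-in, and the channel count N, kernel size L, and stride are placeholder values, not the hyperparameters used in the paper.

```python
import torch
import torch.nn as nn

N, L = 256, 20  # placeholder latent channel count and kernel size

encoder = nn.Sequential(
    nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False),  # analysis ("STFT-like")
    nn.ReLU(),
)
decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)  # synthesis

x = torch.randn(1, 1, 32000)                      # monaural waveform
latent = encoder(x)                               # (1, N, frames)
mask = torch.sigmoid(torch.randn_like(latent))    # stand-in for one masking module
y = decoder(latent * mask)                        # separated waveform estimate
```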
II-B Multi-phase Gammatone Filter [7]
In [7], the multi-phase gammatone filter (MP-GTF) was introduced to design the weights of the encoder convolution layer of Conv-TasNet, which improved the speech separation performance. The impulse response of the MP-GTF is given by

$$g(t) = a\,t^{\,p-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi), \quad t \ge 0, \tag{1}$$

and $g(t) = 0$ otherwise, where $a$ denotes the amplitude, $p$ the filter order, $b$ the bandwidth, $f_c$ the center frequency, and $\phi$ the phase shift. The bandwidth $b$ is determined from the center frequency $f_c$ on the basis of the equivalent rectangular bandwidth (ERB). By sampling $g(t)$ for various $f_c$ and $\phi$, we obtain discrete-time impulse responses of a fixed length and concatenate them along the channel axis to form the weight tensor. The convolution layer of the encoder is followed by the ReLU nonlinearity, which blocks the negative values of the MP-GTF output and thus discards part of the information of the input signal. To avoid this loss of information, each filter is used together with its phase-reversed version, i.e., the filter with phase shift $\phi + \pi$ [7].
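As a rough sketch, the following function samples one such impulse response and its phase-reversed counterpart; the order p = 2 and the bandwidth rule b = 1.019 ERB(fc) are common gammatone conventions assumed here and may differ in detail from the choices in [7].

```python
import numpy as np

def sampled_gammatone(fs, length, fc, phi, a=1.0, p=2):
    """Sample a gammatone impulse response of the form of Eq. (1) at rate fs."""
    b = 1.019 * (24.7 + fc / 9.265)       # Glasberg-Moore ERB of fc, in Hz (assumed rule)
    t = np.arange(length) / fs            # sampled instants
    return a * t ** (p - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)

# Each filter is paired with its phase-reversed version (phase shift + pi).
h = sampled_gammatone(fs=16000, length=320, fc=440.0, phi=0.0)
h_rev = sampled_gammatone(fs=16000, length=320, fc=440.0, phi=np.pi)
```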
II-C Multiple-sampling-frequency Training [13, 14, 15, 17]
There exist a few methods for training DNNs with audio signals sampled at multiple sampling frequencies [13, 14, 15, 17]. In [13, 14], an automatic speech recognition (ASR) model was trained using audio signals sampled at 8 and 16 kHz, where the part of the input features corresponding to the missing frequency band was padded with zeros. In [15], to compensate for the missing frequency band, an ASR model was jointly trained with a bandwidth expansion model. The music source separation model presented in [17] was constructed by stacking three Conv-TasNets that account for sampling frequencies of 8, 16, and 32 kHz, where the Conv-TasNets for the higher two sampling frequencies estimate the source signals at their target sampling frequencies by referring to the masked latent representations obtained at the lower sampling frequencies.
While these training methods are valid for the trained sampling frequencies, they are not guaranteed to work with unseen sampling frequencies. In contrast, we explicitly define an SFI structure to handle any sampling frequency without retraining as shown later in Section III.
III Proposed Model
III-A Sampling-frequency-independent (SFI) Convolution Layer
To realize an SFI network, we introduce latent analog filters and analog-to-digital filter conversion into a convolution layer. Interpreting the weights of a convolution layer as a collection of time-reversed digital FIR filters allows us to treat them from a signal processing viewpoint. Digital filters are inherently sampling frequency dependent, whereas analog filters are SFI because they are defined in the continuous time domain. Focusing on this fact, we place latent analog filters behind a convolution layer so that its weights can be adjusted to the sampling frequency of an input signal.
As shown in Fig. 1(c), the proposed layer consists of a usual 1D convolution layer and the impulse responses of analog filters defined in the continuous time domain, one for each pair of input and output channels. The weight generation process of the proposed layer consists of three steps. Given the sampling frequency of an input signal, the proposed layer (i) generates a discrete-time impulse response of the required length from each analog filter, (ii) stacks the time-reversed versions of these discrete-time impulse responses to form the weight tensor (output channels by input channels by filter length), and (iii) works as the usual convolution layer using these weights. Since steps (i) and (ii) depend only on the sampling frequency and the continuous-time impulse responses, they need to be performed only once (before the features are input) whenever the sampling frequency changes.
For step (i), we employ the impulse invariant method to generate digital FIR filters from their analog counterparts. Note that although this method was originally developed for designing infinite impulse response filters, we can use it for digital FIR filter design. Let us denote the sampling period by $T$, a discrete time index by $n$, and continuous time by $t$. The impulse invariant method generates a discrete-time impulse response $h[n]$ from an analog filter $h_{\mathrm{a}}(t)$ so that the two coincide at the sampled instants:

$$h[n] = h_{\mathrm{a}}(nT). \tag{2}$$

Changing $T$ yields impulse responses for different sampling frequencies $1/T$. By stacking the generated impulse responses, the weights of the convolution layer are obtained in step (ii). Similarly, an SFI version of a transposed convolution layer (SFI transposed convolution layer) is obtained by replacing the convolution layer in the SFI convolution layer with a transposed convolution layer.
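A minimal sketch of steps (i)-(iii), assuming a single input channel and one continuous-time impulse response per output channel passed as Python callables, could look as follows:

```python
import math
import torch
import torch.nn.functional as F

def sfi_conv1d(x, analog_irs, fs, length):
    # (i) sample each analog impulse response at the instants nT = n / fs (Eq. (2))
    t = torch.arange(length, dtype=torch.float32) / fs
    h = torch.stack([h_a(t) for h_a in analog_irs])     # (C_out, length)
    # (ii) time-reverse and stack into a weight tensor of shape (C_out, C_in=1, length)
    w = torch.flip(h, dims=[-1]).unsqueeze(1)
    # (iii) apply the usual 1D convolution with the generated weights
    return F.conv1d(x, w)

# Example with two decaying sinusoids as stand-in analog filters.
irs = [lambda t, fc=fc: torch.exp(-200.0 * t) * torch.cos(2 * math.pi * fc * t)
       for fc in (200.0, 400.0)]
y = sfi_conv1d(torch.randn(1, 1, 1600), irs, fs=8000, length=64)   # -> (1, 2, 1537)
```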
For the analog filter, we can use the MP-GTF given by Eq. (1). The continuous-time impulse responses can differ from channel to channel, and hence, hereafter, a channel index is added to the filter and its parameters where needed. Whereas all parameters of the MP-GTF were fixed in [7], we propose to train $f_c$ and $\phi$ jointly with the other DNN components by the commonly used backpropagation algorithm.
The gradients with respect to the generated weights can be computed in the same manner as for the usual convolution layer. Since the gradient of $h[n]$ equals that of $h_{\mathrm{a}}(nT)$ owing to Eq. (2) and $h_{\mathrm{a}}(t)$ is differentiable with respect to $f_c$ and $\phi$, the gradients of the trainable parameters can be computed by the chain rule. These computations can be easily implemented by defining only the forward computation process of the proposed layer, owing to the automatic differentiation mechanisms of modern deep learning frameworks (e.g., PyTorch and TensorFlow).
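The following nn.Module sketch combines these pieces: gammatone-like analog filters parametrized by trainable center frequencies and phase shifts, sampled by the impulse invariant method at whatever sampling frequency is passed to forward. The names, the parametrization details, and the filter-length rule are illustrative assumptions, not the paper's implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFIConv1d(nn.Module):
    """Illustrative trainable SFI convolution layer with gammatone-like analog filters."""

    def __init__(self, out_channels, length_ms=2.0, order=2):
        super().__init__()
        # Trainable analog-filter parameters: center frequencies and phase shifts.
        self.fc = nn.Parameter(torch.linspace(100.0, 4000.0, out_channels))
        self.phi = nn.Parameter(torch.zeros(out_channels))
        self.length_ms = length_ms
        self.order = order

    def generate_weights(self, fs):
        length = int(round(self.length_ms * 1e-3 * fs))
        t = torch.arange(length, dtype=self.fc.dtype, device=self.fc.device) / fs  # nT
        b = 1.019 * (24.7 + self.fc / 9.265)          # assumed ERB-based bandwidth (Hz)
        h = (t ** (self.order - 1)
             * torch.exp(-2 * math.pi * b[:, None] * t)
             * torch.cos(2 * math.pi * self.fc[:, None] * t + self.phi[:, None]))
        return torch.flip(h, dims=[-1]).unsqueeze(1)  # (C_out, 1, length)

    def forward(self, x, fs):
        # Gradients flow from the generated weights back to fc and phi via autograd.
        return F.conv1d(x, self.generate_weights(fs))
```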
III-B Aliasing Reduction Technique
Since the impulse invariant method simply samples an analog filter, aliasing occurs in the obtained digital filters. As reported in [18, 19, 6], aliasing degrades DNN performance, and we therefore introduce an aliasing reduction technique. Since the energy of the aliased components originates from the frequency band above the Nyquist frequency, we propose to set the weights of a channel to zero whenever the center frequency of the corresponding analog filter exceeds the Nyquist frequency. This aliasing reduction technique is important when the proposed layer is used at low sampling frequencies, as shown later in Section IV.
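In a sketch like the module above, this rule amounts to a per-channel mask on the generated weights (assuming a tensor fc of analog center frequencies):

```python
import torch

def apply_aliasing_reduction(weights, fc, fs):
    # Zero the weights of every channel whose analog center frequency exceeds fs / 2.
    keep = (fc <= fs / 2).to(weights.dtype)      # (C_out,) 0/1 mask
    return weights * keep[:, None, None]         # broadcast over (C_out, C_in, length)
```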
III-C Application of Proposed Layers to Conv-TasNet
As shown in Fig. 1(b), we build an SFI audio source separation model by incorporating the proposed layers into Conv-TasNet [3]. The convolution layer of the encoder and the transposed convolution layer of the decoder are respectively replaced with the SFI convolution and transposed convolution layers. The masking modules are the same as in [3].
For our model, the kernel size and stride should be modified in accordance with the sampling frequency during inference. As described in Section II-A, the encoder and decoder can be interpreted as a time-frequency transform and its inverse. Under this interpretation, the kernel size and stride correspond to the frame length and frame shift, respectively. Hence, as the sampling frequency doubles, both should double to keep the representation consistent for the masking modules. For this reason, we determine the kernel size and stride for each target sampling frequency so that the frame length and shift remain unchanged in the continuous time domain. This issue might also be resolved by replacing all convolution layers in the masking modules with SFI convolution layers; however, we leave this as future work because additional care regarding the combination with other layers (e.g., group normalization [20]) is required.
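A small helper makes this concrete: the frame length and shift are fixed in milliseconds (placeholder values below) and converted to a kernel size and stride in samples for each target sampling frequency.

```python
def kernel_and_stride(fs, frame_ms=2.5, shift_ms=1.25):
    # Convert a fixed continuous-time frame length/shift to samples at rate fs.
    kernel_size = int(round(frame_ms * 1e-3 * fs))
    stride = int(round(shift_ms * 1e-3 * fs))
    return kernel_size, stride

print(kernel_and_stride(16000))   # (40, 20)
print(kernel_and_stride(32000))   # (80, 40) -> doubles with the sampling frequency
```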
IV Experimental Evaluation
TABLE I: Features of the compared models.

| Method | Samp. freq. adapt. | Aliasing reduction |
|---|---|---|
| Conv-TasNet [3] | No | No |
| T-MP-GTF | No | No |
| Proposed | Yes | No |
| Proposed+ | Yes | Yes |
TABLE II: Hyperparameters of the masking modules.

| Symbol | Description | Value |
|---|---|---|
| N | # of channels of latent representation | |
| B | # of channels in bottleneck and residual paths' convolution blocks | |
| Sc | # of channels in skip-connection paths' convolution blocks | |
| H | # of channels in convolution blocks | |
| P | Kernel size in convolution blocks | |


IV-A Experimental Settings
To evaluate the efficacy of the proposed method, we conducted music source separation experiments on the MUSDB18-HQ dataset [21], which consists of 86 training, 14 validation, and 50 test tracks. Each track contains separate recordings of four musical instruments (vocals, bass, drums, and other), i.e., four sources. The training and validation tracks were down-sampled to the training sampling frequency, and we created the test data by down- and up-sampling the test tracks to several target sampling frequencies. As an evaluation metric, we used the median signal-to-distortion ratio (SDR) computed with the BSSEval v4 toolkit [22].
We used the same data augmentation techniques as in [17]: random cropping of training audio segments of a fixed duration, random amplification, random selection of the left or right channel, and random intertrack shuffling of the instruments in half of the minibatch. We also applied standardization (zero mean and unit variance) to the tracks.
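A minimal sketch of the per-track part of this augmentation is shown below; the segment length and gain range are placeholders, and the intertrack shuffling across the minibatch is omitted.

```python
import torch

def augment(stems, seg_len, gain_range=(0.75, 1.25)):
    # stems: (num_sources, 2, T) stereo source recordings of one track.
    start = torch.randint(0, stems.shape[-1] - seg_len + 1, (1,)).item()
    seg = stems[..., start:start + seg_len]                   # random cropping
    gains = torch.empty(stems.shape[0], 1, 1).uniform_(*gain_range)
    seg = seg * gains                                         # random amplification
    ch = torch.randint(0, 2, (1,)).item()
    return seg[:, ch, :]                                      # random left/right channel
```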
We compared the proposed model (Proposed) and its variant with the aliasing reduction technique (Proposed+) with Conv-TasNet and a Conv-TasNet variant (T-MP-GTF) whose encoder and decoder use the trainable extension of the MP-GTF as the weights of their convolution and transposed convolution layers, respectively. T-MP-GTF was included to evaluate the trainable extension of the MP-GTF separately from the sampling frequency adaptation. We applied all models to the audio signals of the unseen sampling frequencies without resampling them to the trained sampling frequency, in order to examine the effects of the sampling frequency mismatch and of the proposed sampling frequency adaptation. Table I summarizes the features of these models. For Proposed and Proposed+, we determined the kernel size and stride from the frame length and shift fixed in milliseconds at the trained sampling frequency, as described in Section III-C, whereas we fixed the kernel size and stride in samples for the other models. The remaining hyperparameters were shared by all models. The hyperparameters of the masking modules are shown in Table II, where the symbols correspond to those used in the Conv-TasNet literature (see Table 1 in [3]).
We trained the center frequencies $f_c$ and phase shifts $\phi$ jointly with the entire network, as described in Section III-A, and constrained the corresponding parameters of the remaining filters accordingly. We initialized these parameters as in [7]: the center frequencies were distributed uniformly on the equivalent rectangular bandwidth (ERB) scale [23] over the target frequency range, and the phase shifts of the filters sharing the same center frequency were initialized to be uniformly distributed. The other MP-GTF parameters were fixed, and, as in [7], the filters were normalized so that they have the same norm.
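For illustration, center frequencies uniformly spaced on the ERB scale can be generated with the standard Glasberg-Moore ERB-number formulas; the frequency range and the number of filters below are placeholders, not the values used in the paper.

```python
import numpy as np

def erb_spaced_frequencies(f_lo, f_hi, num):
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)        # Hz -> ERB number
    erb_inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437  # ERB number -> Hz
    return erb_inv(np.linspace(erb(f_lo), erb(f_hi), num))

fc_init = erb_spaced_frequencies(50.0, 8000.0, 128)
# Phase shifts of filters sharing a center frequency are then spread uniformly
# over the chosen phase interval.
```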
For training, we used the RAdam optimizer [24] with weight decay and the Lookahead mechanism [25]. Gradient clipping of the gradient norm was applied, and the learning rate scheduler with warm restarts presented in [26] was employed. We trained each model using the negative scale-invariant source-to-noise ratio as the loss function and selected the model with the lowest validation loss. We applied the trained models to the left and right channels of the test tracks separately and scaled the source estimates with instrument-wise factors that minimize the mean squared error between the input mixture and the sum of all instrument estimates, which is necessary owing to the scale invariance of the loss function [17].
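For reference, a common definition of the negative scale-invariant source-to-noise ratio loss is sketched below; details such as zero-mean normalization may differ from the implementation used in the paper.

```python
import torch

def neg_si_snr(est, ref, eps=1e-8):
    # Remove the mean so that the measure is invariant to DC offsets.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the scaled target.
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()
```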
IV-B Results
Fig. 2 shows the separation performance for the test data over the tested range of sampling frequencies. At the trained sampling frequency, the proposed models, including T-MP-GTF, achieved higher SDRs than Conv-TasNet for all instruments, showing the effectiveness of the proposed trainable extension of the MP-GTF. Interestingly, the proposed models also gave much lower standard errors than Conv-TasNet, which was not reported in [7]. This observation suggests that the use of the trainable MP-GTF makes Conv-TasNet robust to the initialization of the DNN parameters.
As the sampling frequency moved away from the trained one, the SDRs of Conv-TasNet and T-MP-GTF decreased greatly and these models failed to separate the sources. By contrast, the proposed models with the sampling frequency adaptation, Proposed and Proposed+, provided similar SDRs over a wide range of sampling frequencies and outperformed the other models by a large margin, even though they were trained at only a single sampling frequency. This result clearly shows that the proposed sampling frequency adaptation plays a crucial role in achieving consistent performance.
Fig. 3 shows the magnitudes of the frequency responses of the filters of the trained SFI convolution layer generated for two different sampling frequencies. The filters generated for the two sampling frequencies exhibited consistent frequency responses. Importantly, the filters block the frequency components above the Nyquist frequency of the trained sampling frequency; nevertheless, the proposed model achieved consistent performance at sampling frequencies higher than the trained one, presumably because the dominant frequency components of the music signals were distributed below that limit.
At the lowest sampling frequency, aliasing occurred in the generated filters (see Fig. 3(a)), which degraded the performance of Proposed (the proposed model without the aliasing reduction technique). By contrast, Proposed+ showed consistent performance at all sampling frequencies, demonstrating the effectiveness of the proposed aliasing reduction technique when the sampling frequency is reduced. For drums and other, Proposed+ gave slightly lower SDRs than Proposed at the lowest sampling frequency, which might be because filters with center frequencies near the Nyquist frequency are helpful for the separation. A further investigation of this observation remains as future work.

V Conclusion
We proposed an SFI convolution layer that can be adjusted to an arbitrary sampling frequency. We focused on the fact that the weights of a convolution layer can be seen as a collection of time-reversed digital FIR filters and explicitly defined the weight generation process of the convolution layer from latent analog filters on the basis of the impulse invariant method. Since the analog filters do not depend on the sampling frequency, the proposed layer can generate consistent weights for arbitrary sampling frequencies. Furthermore, we built an SFI audio source separation model by incorporating the proposed layers into the encoder and decoder of Conv-TasNet. Through music source separation experiments, we showed that even when trained only with audio signals sampled at a single sampling frequency, the proposed model worked consistently well not only at the trained sampling frequency but also at unseen ones. Since the proposed layer is a general component for audio processing, it should also be useful for various audio applications such as speech separation [27, 1, 2, 3, 7].
References
- [1] Z. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 686–690.
- [2] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 241–245.
- [3] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
- [4] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A Multi-Scale neural network for end-to-end audio source separation,” in Proceedings of International Society for Music Information Retrieval Conference, 2018, pp. 334–340.
- [5] A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
- [6] T. Nakamura and H. Saruwatari, “Time-domain audio source separation based on Wave-U-Net combined with discrete wavelet transform,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 386–390.
- [7] D. Ditter and T. Gerkmann, “A Multi-Phase Gammatone Filterbank for speech separation via TasNet,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 36–40.
- [8] H. Liu, L. Xie, J. Wu, and G. Yang, “Channel-wise subband input for better voice and accompaniment separation on high resolution music,” in Proceedings of INTERSPEECH, 2020.
- [9] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Data-driven design of perfect reconstruction filterbank for DNN-based sound source enhancement,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 596–600.
- [10] F. Krebs, S. Böck, M. Dorfer, and G. Widmer, “Downbeat tracking using beat synchronous features with recurrent neural networks,” in Proceedings of International Society for Music Information Retrieval Conference, 2016, pp. 129–135.
- [11] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
- [12] F. Pedersoli, G. Tzanetakis, and K. M. Yi, “Improving music transcription by pre-stacking a U-Net,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 506–510.
- [13] D. Yu, M. Seltzer, J. Li, J. Huang, and F. Seide, “Feature learning in deep neural networks - studies on speech recognition,” in Proceedings of International Conference on Learning Representations, 2013.
- [14] A. Narayanan, A. Misra, K. C. Sim, G. Pundak, A. Tripathi, M. Elfeky, P. Haghani, T. Strohman, and M. Bacchiani, “Toward domain-invariant speech recognition via large scale training,” in IEEE Spoken Language Technology Workshop, 2018, pp. 441–447.
- [15] J. Gao, J. Du, and E. Chen, “Mixed-bandwidth cross-channel speech recognition via joint optimization of DNN-based bandwidth expansion and acoustic modeling,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 559–571, 2019.
- [16] A. V. Oppenheim, J. R. Buck, and R. W. Schafer, Discrete-time signal processing, Prentice Hall, Upper Saddle River, NJ, 2001.
- [17] D. Samuel, A. Ganeshan, and J. Naradowsky, “Meta-learning extractors for music source separation,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 816–820.
- [18] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of European Conference on Computer Vision, 2014, pp. 818–833.
- [19] Y. Gong and C. Poellabauer, “Impact of aliasing on deep CNN-based end-to-end acoustic models,” in Proceedings of INTERSPEECH, 2018, pp. 2698–2702.
- [20] Y. Wu and K. He, “Group normalization,” in Proceedings of European Conference on Computer Vision, 2018.
- [21] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “MUSDB18-HQ - an uncompressed version of MUSDB18,” 2019.
- [22] F.-R. Stöter, A. Liutkus, and N. Ito, “The 2018 signal separation evaluation campaign,” in Proceedings of International Conference on Latent Variable Analysis and Signal Separation, 2018, pp. 293–305.
- [23] V. Hohmann, “Frequency analysis and synthesis using a gammatone filterbank,” Acta Acustica united with Acustica, vol. 88, no. 03, pp. 433–442, 2002.
- [24] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” in Proceedings of International Conference on Learning Representations, 2020.
- [25] M. Zhang, J. Lucas, J. Ba, and G. Hinton, “Lookahead Optimizer: k steps forward, 1 step back,” in Proceedings of Advances in Neural Information Processing Systems, 2019, pp. 9597–9608.
- [26] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in Proceedings of International Conference on Learning Representations, 2017.
- [27] H. Sawada, N. Ono, H. Kameoka, D. Kitamura, and H. Saruwatari, “A review of blind source separation methods: Two converging routes to ILRMA originating from ICA and NMF,” APSIPA Transactions on Signal and Information Processing, vol. 8, no. e12, 2019.