
Sampling-Frequency-Independent Audio Source Separation Using Convolution Layer Based on Impulse Invariant Method

This work was supported by JSPS KAKENHI Grant Number JP20K19818.

Koichi Saito, Tomohiko Nakamura, Kohei Yatabe, Yuma Koizumi, Hiroshi Saruwatari
Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Department of Intermedia Art and Science, Waseda University, Tokyo, Japan
NTT Corporation, Tokyo, Japan
Abstract

Audio source separation is often used as preprocessing for various applications, and one of its ultimate goals is to construct a single versatile model capable of dealing with the varieties of audio signals. Since the sampling frequency, one such variety, is usually application specific, an audio source separation model used as a preprocessor should be able to deal with audio signals of all sampling frequencies specified in the target applications. However, conventional models based on deep neural networks (DNNs) are trained only at the sampling frequency specified by the training data, and there is no guarantee that they work with unseen sampling frequencies. In this paper, we propose a convolution layer capable of handling arbitrary sampling frequencies by a single DNN. Through music source separation experiments, we show that the introduction of the proposed layer enables a conventional audio source separation model to work consistently even with unseen sampling frequencies.

Index Terms:
Audio source separation, analog-to-digital filter conversion, deep neural networks

I Introduction

Figure 1: Architectures of (a) Conv-TasNet [3] and (b) proposed model, and (c) illustration of proposed SFI convolution layer.

Audio source separation is a technique for extracting individual sources from a mixture signal. It is one of the fundamental techniques for various audio applications including music remixing, automatic music transcription, and automatic speech recognition. The recent development of source separation has been built upon machine learning techniques using deep neural networks (DNNs) [1, 2, 3, 4, 5, 6, 7, 8, 9]. Since source separation is often used as a preprocessing step for another task, one of the ultimate developmental goals is to construct a single universal DNN that can be utilized as the preprocessor for any application. To realize such a versatile source separator, every variety of applications and conditions must be handled by a single DNN.

One important but often overlooked property of audio signals is the sampling frequency. It is usually application specific, and hence a preprocessor must be designed for the sampling frequency specified by the downstream application. For example, for music remixing and editing, 44.1 and 48 kHz are usually used as sampling frequencies to cover the entire human audible range [5, 8], because these applications are aimed at human listeners. In contrast, applications aimed at recognizing the contents of audio signals do not require such full-band information. For example, beat tracking may use 16 kHz [10], automatic music transcription may use 11.025 and 22.05 kHz [11, 12], and automatic speech recognition may use 8 and 16 kHz [13, 14, 15]. A versatile preprocessor must be able to handle signals sampled at all of these sampling frequencies.

However, ordinary DNNs cannot handle audio signals sampled at various sampling frequencies. Conventional DNN-based models work well only for the sampling frequency specified by the training data [1, 2, 3, 4, 5, 6, 7, 8, 9]. The parameters of a DNN are trained to adapt to the training dataset, and thus there is no guarantee of applicability to signals sampled at other (unseen) sampling frequencies. This is because the layers used in a DNN are not designed for multiple sampling frequencies. In fact, the sampling frequency has not been treated as a parameter of a DNN; it is implicitly given by the training dataset. To realize a DNN that consistently works for any sampling frequency, it must be designed as a sampling-frequency-independent (SFI) network.

In this paper, we propose an SFI convolution layer for the handling of arbitrary sampling frequencies by a single DNN. The key idea behind the proposed layer is to consider the connection between a digital filter and a convolution layer. From a signal processing viewpoint, we can interpret a convolution layer as a collection of time-reversed digital finite impulse response (FIR) filters. Therefore, a filter design technique can be utilized to design a convolution layer. In this paper, we consider the impulse invariant method (see Chap. 7 in [16]), in which a digital filter is designed by sampling an analog filter. On the basis of this analog-to-digital filter conversion, we introduce latent analog filters into a convolution layer. Since an analog filter is independent of sampling frequency, we can construct an SFI convolution layer via the analog representation of a filter, where the impulse invariant method determines its sampling frequency afterward. The proposed SFI layer can be trained by parametrizing the analog filter as a differentiable function. By incorporating the proposed layer into one of the state-of-the-art source separation models, we also propose an SFI audio source separation model.

II Conventional Models

II-A Conv-TasNet [3]

Conv-TasNet is a recent time-domain DNN for audio source separation that works well for speech [3] and music source separation [17, 5]. Since the architectures of Conv-TasNet differ slightly among these three papers, we adopt the architecture for music source separation defined in [17], as illustrated in Fig. 1(a). Conv-TasNet consists of an encoder, a decoder, and C source-specific masking modules, where C denotes the number of sources. The encoder and decoder imitate a traditional time-frequency transform (e.g., the short-time Fourier transform) and its inverse transform. The encoder transforms a monaural time-domain signal into an N-channel latent representation by a one-dimensional (1D) convolution layer (with kernel size L and stride W) followed by the rectified linear unit (ReLU). Each masking module estimates a mask for its target source from the latent representation. It comprises R convolution blocks, each of which consists of X 1D dilated convolution layers with an exponentially increasing dilation factor. The details of the convolution block are given in [3]. The decoder converts the masked latent representations into the separated time-domain signals by a 1D transposed convolution layer with kernel size L and stride W.
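As a concrete reference for the above description, the following is a minimal PyTorch sketch of the encoder and decoder only (the masking modules are omitted); the class and argument names are ours, not taken from [3] or [17].

```python
import torch
import torch.nn as nn

class TasNetEncoder(nn.Module):
    """Encoder: 1D convolution (kernel size L, stride W) followed by ReLU."""
    def __init__(self, num_channels: int, kernel_size: int, stride: int):
        super().__init__()
        self.conv = nn.Conv1d(1, num_channels, kernel_size, stride=stride, bias=False)

    def forward(self, mixture):                    # mixture: (batch, 1, time)
        return torch.relu(self.conv(mixture))      # latent representation: (batch, N, frames)

class TasNetDecoder(nn.Module):
    """Decoder: 1D transposed convolution with the same kernel size and stride."""
    def __init__(self, num_channels: int, kernel_size: int, stride: int):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(num_channels, 1, kernel_size, stride=stride, bias=False)

    def forward(self, masked_latent):              # (batch, N, frames)
        return self.deconv(masked_latent)          # separated signal: (batch, 1, time)
```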

II-B Multi-phase Gammatone Filter [7]

In [7], the multi-phase gammatone filter (MP-GTF) was introduced to design the weights of the convolution layer of the encoder of Conv-TasNet, which improved the speech separation performance. The impulse response of the MP-GTF is given by

g^(MP-GTF)(t) = a t^(p−1) e^(−2πbt) cos(2πft + ϕ),   (1)

where a denotes the amplitude, p the filter order, b the bandwidth, f the center frequency, and ϕ the phase shift. The parameter b is given by b = ERB(f)/1.57, where ERB(f) = 24.7 + f/9.265. By sampling L points from g^(MP-GTF)(t) for various f and ϕ, we obtain N discrete-time impulse responses of length L and concatenate them along the channel axis to form the weights as a 1 × N × L tensor. The convolution layer of the encoder is followed by the ReLU nonlinearity, which blocks the negative values of the MP-GTF output and thus discards part of the information in the input signal. To avoid this loss of information, g^(MP-GTF)(t) is used together with its phase-reversed version, i.e., with the phase shift ϕ + π [7].
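As an illustration of Eq. (1) and the sampling described above, here is a small Python sketch; the function name and the example values of f, ϕ, and the sampling frequency are ours, while the constants for b follow the text.

```python
import numpy as np

def mp_gtf(t, f, phi, a=1.0, p=2):
    """Multi-phase gammatone impulse response of Eq. (1)."""
    erb = 24.7 + f / 9.265          # equivalent rectangular bandwidth ERB(f)
    b = erb / 1.57                  # bandwidth parameter b = ERB(f) / 1.57
    return a * t ** (p - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t + phi)

# Sample L points for one (f, phi) pair and its phase-reversed counterpart (phi + pi).
fs, L = 16000, 80                   # example sampling frequency and kernel size
t = np.arange(1, L + 1) / fs
h = mp_gtf(t, f=440.0, phi=0.0)
h_reversed_phase = mp_gtf(t, f=440.0, phi=np.pi)
```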

II-C Multiple-sampling-frequency Training [13, 14, 15, 17]

There exist a few methods of training DNNs using audio signals sampled at multiple sampling frequencies [13, 14, 15, 17]. In [13, 14], an automatic speech recognition (ASR) model was trained using audio signals sampled at 8 and 16 kHz, where the part of the input features corresponding to the missing frequency band was padded with zeros. In [15], to compensate for the missing frequency band, an ASR model was jointly trained with a bandwidth expansion model. The music source separation model presented in [17] was constructed by stacking three Conv-TasNets that account for sampling frequencies of 8, 16, and 32 kHz. The Conv-TasNets for 16 and 32 kHz estimate the source signals at the target sampling frequencies, referring to the masked latent representations obtained at the lower sampling frequencies.

While these training methods are valid for the trained sampling frequencies, they are not guaranteed to work with unseen sampling frequencies. In contrast, we explicitly define an SFI structure to handle any sampling frequency without retraining as shown later in Section III.

III Proposed Model

III-A Sampling-frequency-independent (SFI) Convolution Layer

To realize an SFI network, we introduce latent analog filters and analog-to-digital filter conversion into a convolution layer. We view the weights of a convolution layer from a signal processing perspective by interpreting them as a collection of time-reversed digital FIR filters. Digital filters are inherently sampling frequency dependent, whereas analog filters are SFI owing to their definition in the continuous time domain. Exploiting this fact, we introduce latent analog filters behind a convolution layer so that its weights can be adjusted according to the sampling frequency of an input signal.

As shown in Fig. 1(c), the proposed layer consists of a usual 1D convolution layer and the impulse responses of M^(in) M^(out) analog filters defined in the continuous time domain, where M^(in) and M^(out) are the input and output channel sizes, respectively. The weight generation process of the proposed layer consists of three steps. Given the sampling frequency of an input signal, the proposed layer (i) generates a discrete-time impulse response of length L from each analog filter, (ii) stacks the time-reversed versions of these discrete-time impulse responses to form the weights as an M^(in) × M^(out) × L tensor, and (iii) works as the usual convolution layer using these weights. Since steps (i) and (ii) depend only on the sampling frequency and the continuous-time impulse responses, we need to perform them only once (before the features are input) whenever the sampling frequency changes.

For step (i), we employ the impulse invariant method to generate digital FIR filters from their analog counterparts. Note that while this method was originally developed for designing infinite impulse response filters, we can use it for digital FIR filter design. Let us denote the sampling period by T, a discrete time index by l = 1, …, L, and the continuous time by t ∈ ℝ. The impulse invariant method generates a discrete-time impulse response h[l] from an analog filter g(t) so that the sampled instants coincide:

h[l] = T g(lT).   (2)

Changing T yields an impulse response for a different sampling frequency 1/T. By stacking the generated impulse responses, the weights of the convolution layer are obtained in step (ii). Similarly, an SFI version of a transposed convolution layer (SFI transposed convolution layer) is obtained by replacing the convolution layer in the SFI convolution layer with a transposed convolution layer.
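The following short Python sketch illustrates Eq. (2), reusing the hypothetical mp_gtf function from the earlier sketch; the specific sampling frequencies and filter lengths are only examples.

```python
import numpy as np

def impulse_invariant(g, fs, length):
    """Eq. (2): h[l] = T * g(lT) for l = 1, ..., L, with T = 1 / fs."""
    T = 1.0 / fs
    l = np.arange(1, length + 1)
    return T * g(l * T)

g = lambda t: mp_gtf(t, f=440.0, phi=0.0)             # latent analog filter (previous sketch)
h_16k = impulse_invariant(g, fs=16000, length=80)     # discrete-time filter for 16 kHz
h_32k = impulse_invariant(g, fs=32000, length=160)    # same analog filter, regenerated for 32 kHz
```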

For the analog filter g(t), we can use the MP-GTF in Eq. (1). The continuous-time impulse responses can differ across channels, and hence, hereafter, a channel subscript m is added to g^(MP-GTF)(t) and its parameters: g_m^(MP-GTF)(t), a_m, p_m, b_m, f_m, and ϕ_m. Whereas all parameters of g_m^(MP-GTF)(t) were fixed in [7], we propose to train f_m and ϕ_m jointly with the other DNN components by the commonly used backpropagation algorithm.

The gradient with respect to h[l] can be computed in the same manner as for the usual convolution layer. Since h[l] is related to g(lT) through Eq. (2) and g_m^(MP-GTF)(t) is differentiable with respect to f_m and ϕ_m, the gradients of the trainable parameters of g_m^(MP-GTF)(t) can be computed by the chain rule. These computations can easily be implemented by defining only the forward computation of the proposed layer, owing to the automatic differentiation mechanisms of modern deep learning frameworks (e.g., PyTorch and TensorFlow).
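To make steps (i)-(iii) and the trainable parameters concrete, here is a hedged PyTorch sketch of an SFI convolution layer with a single input channel. The class name, the tensor ordering (PyTorch stores conv1d weights as (output channels, input channels, kernel size)), and the absence of caching are our implementation choices, not details from the paper. In practice, generate_weight would be called once per sampling frequency and the result cached, as noted in the text.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFIConv1d(nn.Module):
    """Sketch of an SFI convolution layer with latent MP-GTF analog filters
    whose center frequencies f_m and phase shifts phi_m are trainable."""
    def __init__(self, kernel_size, stride, f_init, phi_init, p=2, a=1.0):
        super().__init__()
        self.kernel_size, self.stride, self.p, self.a = kernel_size, stride, p, a
        self.f = nn.Parameter(torch.as_tensor(f_init, dtype=torch.float32))      # (M_out,)
        self.phi = nn.Parameter(torch.as_tensor(phi_init, dtype=torch.float32))  # (M_out,)

    def generate_weight(self, fs):
        # Step (i): impulse invariant method, h_m[l] = T * g_m(lT), l = 1, ..., L.
        T = 1.0 / fs
        t = torch.arange(1, self.kernel_size + 1, dtype=torch.float32,
                         device=self.f.device) * T                               # (L,)
        b = (24.7 + self.f / 9.265) / 1.57                                        # (M_out,)
        g = self.a * t ** (self.p - 1) \
            * torch.exp(-2 * math.pi * b[:, None] * t) \
            * torch.cos(2 * math.pi * self.f[:, None] * t + self.phi[:, None])   # (M_out, L)
        h = T * g
        # Step (ii): time-reverse and stack into a (M_out, M_in = 1, L) weight tensor.
        return h.flip(-1).unsqueeze(1)

    def forward(self, x, fs):
        # Step (iii): behave as a usual 1D convolution with the generated weights.
        return F.conv1d(x, self.generate_weight(fs), stride=self.stride)
```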

III-B Aliasing Reduction Technique

Since the impulse invariant method simply samples an analog filter, aliasing occurs in the obtained digital filters. As reported in [18, 19, 6], aliasing degrades DNN performance, and thus we introduce an aliasing reduction technique. Since aliasing is caused by the filter energy lying above the Nyquist frequency, we propose to set the weights of the m-th channel to zero whenever the center frequency f_m of the corresponding analog filter is above the Nyquist frequency. This aliasing reduction technique is important when the proposed layer is used with low sampling frequencies, as shown later in Section IV.
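Building on the hypothetical SFIConv1d sketch above, the aliasing reduction could look like the following; the function is ours and simply zeroes the channels whose center frequency exceeds the Nyquist frequency.

```python
def generate_weight_with_aliasing_reduction(layer, fs):
    """Zero the weights of channels whose center frequency f_m is above fs / 2
    (uses the SFIConv1d sketch defined earlier)."""
    weight = layer.generate_weight(fs)             # (M_out, 1, L)
    keep = (layer.f < fs / 2).to(weight.dtype)     # 1 if f_m is below the Nyquist frequency
    return weight * keep[:, None, None]
```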

III-C Application of Proposed Layers to Conv-TasNet

As shown in Fig. 1(b), we build an SFI audio source separation model by incorporating the proposed layers into Conv-TasNet [3]. The convolution layer of the encoder and the transposed convolution layer of the decoder are respectively replaced with the SFI convolution and transposed convolution layers. The masking modules are the same as in [3].

For our model, the kernel size L and stride W should be modified in accordance with the sampling frequency during inference. As described in Section II-A, the encoder and decoder can be interpreted as a time-frequency transform and its inverse transform. Under this interpretation, L and W correspond to the frame length and the frame shift, respectively. Hence, as the sampling frequency doubles, L and W should also double to keep the latent representation consistent for the masking modules. For this reason, we determine L and W for each target sampling frequency so that the frame length and shift remain unchanged in the continuous time domain. This issue might be resolved by replacing all convolution layers in the masking modules with SFI convolution layers; however, we leave this as future work because the combination with other layers (e.g., group normalization [20]) requires additional care.
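For concreteness, a tiny helper of ours that keeps the frame length and shift fixed in continuous time (5.0 ms and 2.5 ms, the values used in Section IV) would be:

```python
def kernel_and_stride(fs, frame_ms=5.0, shift_ms=2.5):
    """Kernel size L and stride W for a given sampling frequency, keeping the
    frame length and shift fixed in continuous time.
    Example: 16 kHz -> (80, 40); 32 kHz -> (160, 80)."""
    return round(fs * frame_ms / 1000), round(fs * shift_ms / 1000)
```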

IV Experimental Evaluation

TABLE I: Features of proposed methods and Conv-TasNet

Method          | g_m(t)           | Samp. freq. adapt. | Aliasing reduction
Conv-TasNet [3] | -                | No                 | -
T-MP-GTF        | g_m^(MP-GTF)(t)  | No                 | No
Proposed        | g_m^(MP-GTF)(t)  | Yes                | No
Proposed+       | g_m^(MP-GTF)(t)  | Yes                | Yes
TABLE II: Hyperparameters of masking modules used in experiments

Symbol | Description                                                             | Value
N      | # of channels of latent representation                                  | 440
B      | # of channels in bottleneck and residual paths' 1×1 convolution blocks  | 160
Sc     | # of channels in skip-connection paths' 1×1 convolution blocks          | 160
H      | # of channels in convolution blocks                                     | 160
P      | Kernel size in convolution blocks                                       | 3
Figure 2: SDRs of Conv-TasNet and proposed models for test data at various sampling frequencies. These SDRs and error bars respectively denote averages and standard errors over results obtained with four random seeds. Red line shows trained sampling frequency.

IV-A Experimental Settings

To evaluate the efficacy of the proposed method, we conducted music source separation experiments on the MUSDB18-HQ dataset [21], which consists of 86 training, 14 validation, and 50 test tracks. Each track contains separate recordings of four musical instruments (vocals, bass, drums, and other), i.e., C = 4. The training and validation tracks were down-sampled to 16 kHz, and we created the test data by down- and up-sampling the test tracks to several target sampling frequencies, F_s^(target) = 8, 12, …, 48 kHz. As an evaluation metric, we used the median signal-to-distortion ratios (SDRs) computed with the BSSEval v4 toolkit [22].

We used the same data augmentation techniques as in [17]: random cropping of 8 s training audio segments, random amplification within [0.75, 1.25], random selection of the left or right channel, and random intertrack shuffling of the instruments in half of the minibatch. We also applied standardization (zero mean and unit variance) to the tracks.
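A hedged sketch of the per-track part of these augmentations is given below; the function is ours, the exact order of operations is an assumption, and the batch-level intertrack shuffling and track standardization are not shown.

```python
import torch

def augment_track(track, fs, segment_seconds=8.0):
    """Random 8 s crop, random left/right channel, and random gain in [0.75, 1.25]
    for a (2, time) stereo track; intertrack shuffling is applied at the batch level."""
    seg_len = int(segment_seconds * fs)
    start = torch.randint(0, track.shape[-1] - seg_len + 1, (1,)).item()
    segment = track[:, start:start + seg_len]
    segment = segment[torch.randint(0, 2, (1,)).item()]          # pick left or right channel
    return segment * (0.75 + 0.5 * torch.rand(1).item())         # random amplification
```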

We compared the proposed model (Proposed) and its variant with the aliasing reduction technique (Proposed+) with Conv-TasNet and its variant (T-MP-GTF), whose encoder and decoder use the trainable extension of the MP-GTF for the weights of their convolution and transposed convolution layers, respectively. T-MP-GTF was included to evaluate the trainable extension of the MP-GTF separately from the sampling frequency adaptation. We applied all models to the audio signals of the unseen sampling frequencies without resampling them to the trained sampling frequency, in order to examine the effects of the sampling frequency mismatch and the proposed sampling frequency adaptation. Table I summarizes the features of these models. For Proposed and Proposed+, we determined L and W so that they correspond to 5.0 and 2.5 ms, respectively, at the sampling frequency of 16 kHz, as described in Section III-C, whereas we set L = 80 and W = 40 for the other models. For all models, we set X = 6 and R = 2. The hyperparameters of the masking modules are shown in Table II, where the symbols correspond to those used in the Conv-TasNet literature (see Table 1 in [3]).

For g_m^(MP-GTF)(t), we trained f_m and ϕ_m for m = 1, …, 220 jointly with the entire network, and constrained these parameters for the other m's so that f_{m+220} = f_m and ϕ_{m+220} = ϕ_m + π, as described in Section III-A. We initialized f_m and ϕ_m as in [7]: let f_i^(center) denote 48 frequencies distributed uniformly on the equivalent rectangular bandwidth (ERB) scale [23] from 50 to 8000 Hz, where i = 1, …, 48 is the center frequency index and f_i < f_{i+1} for all i. We initialized f_m as f_m = f^(center)_{⌊m/K⌋+1} with K = 5 for m = 1, …, 140 and K = 4 for m = 141, …, 220. The phase shifts ϕ_m of the filters sharing the same f_i^(center) were initialized to be uniformly distributed in [0, π). The other parameters were set as a_m = 1 and p_m = 2. As in [7], these filters were normalized so that they have the same ℓ² norm.
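A sketch of one possible realization of this initialization is given below; the ERB-number conversion constants are chosen to be consistent with ERB(f) = 24.7 + f/9.265, and the split into 28 center frequencies with five phase variants and 20 with four (28·5 + 20·4 = 220) is our reading of the indexing above, so both are assumptions.

```python
import numpy as np

def erb_number(f):
    """Hz -> ERB-number scale (constants consistent with ERB(f) = 24.7 + f / 9.265; assumption)."""
    return 9.265 * np.log(1.0 + f / (24.7 * 9.265))

def erb_number_inv(e):
    return (np.exp(e / 9.265) - 1.0) * 24.7 * 9.265

# 48 center frequencies uniformly spaced on the ERB scale between 50 and 8000 Hz.
f_center = erb_number_inv(np.linspace(erb_number(50.0), erb_number(8000.0), 48))

# Assumed split: first 28 center frequencies get K = 5 phase variants (140 filters),
# the remaining 20 get K = 4 (80 filters), 220 filters in total.
f_init = np.concatenate([np.repeat(f_center[:28], 5), np.repeat(f_center[28:], 4)])
phi_init = np.concatenate([np.arange(k) * np.pi / k for k in [5] * 28 + [4] * 20])
```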

For training, we used the RAdam optimizer [24] with a weight decay rate of 5.0 × 10⁻⁴ and the Lookahead mechanism [25] with a synchronization period of 6 and a slow-weights step size of 0.5. Gradient clipping with a maximum L2 norm of 5 was applied. The learning rate scheduler presented in [26] was employed with an initial learning rate of 1.0 × 10⁻³ and a restart period of 200,000 iterations. We trained each model with a batch size of 12 for 250 epochs, using the negative scale-invariant source-to-noise ratio as the loss function, and selected the model with the lowest validation loss. We applied the trained models to the left and right channels of the test tracks separately, and scaled the source estimates with instrument-wise factors that minimize the mean squared error between the input mixture and the sum of all instrument estimates; this rescaling is needed because of the scale invariance of the loss function [17].
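For reference, the negative scale-invariant source-to-noise ratio used as the loss could be implemented as follows; the zero-mean normalization and the small epsilon follow common practice and are our assumptions, not details taken from the paper.

```python
import torch

def negative_si_snr(estimate, target, eps=1e-8):
    """Negative scale-invariant source-to-noise ratio, averaged over the batch."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)   # zero-mean (assumption)
    target = target - target.mean(dim=-1, keepdim=True)
    s_target = (estimate * target).sum(-1, keepdim=True) * target \
        / (target.pow(2).sum(-1, keepdim=True) + eps)           # projection onto the target
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()
```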

IV-B Results

Fig. 2 shows the separation performance for the test data with sampling frequencies from 8 to 48 kHz. At the trained sampling frequency of 16 kHz, the proposed models, including T-MP-GTF, achieved higher SDRs than Conv-TasNet for all instruments, showing the effectiveness of the proposed trainable extension of the MP-GTF. Interestingly, the proposed models gave much lower standard errors than Conv-TasNet, which was not reported in [7]. This observation reveals that the use of the trainable MP-GTF makes Conv-TasNet robust to the initialization of the DNN parameters.

As the sampling frequency moved away from 16 kHz, the SDRs of Conv-TasNet and T-MP-GTF decreased greatly and these models failed to separate the sources. By contrast, the proposed models with the sampling frequency adaptation, Proposed and Proposed+, provided similar SDRs for the 12- to 48-kHz-sampled data and outperformed the other models by a large margin, particularly at 20 kHz and above, even though they were trained only with the 16-kHz-sampled data. This result clearly shows that the proposed sampling frequency adaptation plays a crucial role in achieving consistent performance.

Fig. 3 shows the magnitudes of the frequency responses of the filters of the trained SFI convolution layer at sampling frequencies of 8, 16, and 32 kHz. The filters at the 16 and 32 kHz sampling frequencies exhibit consistent frequency responses. Importantly, the filters at the 32 kHz sampling frequency block the frequency components above around 8 kHz. Nevertheless, the proposed model achieved consistent performance at sampling frequencies higher than 8 kHz, presumably because the dominant frequency components of the music signals were distributed below 8 kHz.

For the sampling frequency of 8 kHz, aliasing occurred from 4 kHz (see Fig. 3(a)). This resulted in the performance degradation of Proposed (the proposed method without the aliasing reduction technique). By contrast, Proposed+ showed consistent performance at all sampling frequencies, demonstrating the effectiveness of the proposed aliasing reduction technique when the sampling frequency is reduced. For drums and other, Proposed+ gave slightly lower SDRs than Proposed at the sampling frequency of 12 kHz, which might be because the filters with center frequencies near the Nyquist frequency are helpful for the separation. A further investigation of this observation remains future work.

Figure 3: Magnitudes of frequency responses of first 220 filters of trained SFI convolution layer at sampling frequencies of 8, 16, and 32 kHz.

V Conclusion

We proposed an SFI convolution layer that can be adjusted to an arbitrary sampling frequency. We focused on the fact that the weights of a convolution layer can be viewed as a collection of digital FIR filters, and explicitly defined the weight generation process of the convolution layer from latent analog filters on the basis of the impulse invariant method. Since the analog filters do not depend on the sampling frequency, the proposed layer can generate consistent weights for arbitrary sampling frequencies. Furthermore, we built an SFI audio source separation model by incorporating the proposed layers into the encoder and decoder of Conv-TasNet. Through music source separation experiments, we showed that even when trained only with audio signals sampled at a specific sampling frequency, the proposed model worked consistently well not only at the trained sampling frequency but also at unseen ones. Since the proposed layer is a general component for audio processing, it should also be useful for various other audio applications such as speech separation [27, 1, 2, 3, 7].

References

  • [1] Z. Wang, J. Le Roux, and J. R. Hershey, “Alternative objective functions for deep clustering,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 686–690.
  • [2] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 241–245.
  • [3] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  • [4] D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: A Multi-Scale neural network for end-to-end audio source separation,” in Proceedings of International Society for Music Information Retrieval Conference, 2018, pp. 334–340.
  • [5] A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
  • [6] T. Nakamura and H. Saruwatari, “Time-domain audio source separation based on Wave-U-Net combined with discrete wavelet transform,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 386–390.
  • [7] D. Ditter and T. Gerkmann, “A Multi-Phase Gammatone Filterbank for speech separation via TasNet,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 36–40.
  • [8] H. Liu, L. Xie, J. Wu, and G. Yang, “Channel-wise subband input for better voice and accompaniment separation on high resolution music,” in Proceedings of INTERSPEECH, 2020.
  • [9] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Data-driven design of perfect reconstruction filterbank for DNN-based sound source enhancement,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 596–600.
  • [10] F. Krebs, S. Böck, M. Dorfer, and G. Widmer, “Downbeat tracking using beat synchronous features with recurrent neural networks,” in Proceedings of International Society for Music Information Retrieval Conference, 2016, pp. 129–135.
  • [11] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
  • [12] F. Pedersoli, G. Tzanetakis, and K. M. Yi, “Improving music transcription by pre-stacking a U-Net,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 506–510.
  • [13] D. Yu, M. Seltzer, J. Li, J. Huang, and F. Seide, “Feature learning in deep neural networks - studies on speech recognition,” in Proceedings of International Conference on Learning Representations, 2013.
  • [14] A. Narayanan, A. Misra, K. C. Sim, G. Pundak, A. Tripathi, M. Elfeky, P. Haghani, T. Strohman, and M. Bacchiani, “Toward domain-invariant speech recognition via large scale training,” in IEEE Spoken Language Technology Workshop, 2018, pp. 441–447.
  • [15] J. Gao, J. Du, and E. Chen, “Mixed-bandwidth cross-channel speech recognition via joint optimization of DNN-based bandwidth expansion and acoustic modeling,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 559–571, 2019.
  • [16] A. V. Oppenheim, J. R. Buck, and R. W. Schafer, Discrete-time signal processing, Prentice Hall, Upper Saddle River, NJ, 2001.
  • [17] D. Samuel, A. Ganeshan, and J. Naradowsky, “Meta-learning extractors for music source separation,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 816–820.
  • [18] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proceedings of European Conference on Computer Vision, 2014, pp. 818–833.
  • [19] Y. Gong and C. Poellabauer, “Impact of aliasing on deep CNN-based end-to-end acoustic models,” in Proceedings of INTERSPEECH, 2018, pp. 2698–2702.
  • [20] Y. Wu and K. He, “Group normalization,” in Proceedings of European Conference on Computer Vision, 2018.
  • [21] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “MUSDB18-HQ - an uncompressed version of MUSDB18,” 2019.
  • [22] F.-R. Stöter, A. Liutkus, and N. Ito, “The 2018 signal separation evaluation campaign,” in Proceedings of International Conference on Latent Variable Analysis and Signal Separation, 2018, pp. 293–305.
  • [23] V. Hohmann, “Frequency analysis and synthesis using a gammatone filterbank,” Acta Acustica united with Acustica, vol. 88, no. 03, pp. 433–442, 2002.
  • [24] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” in Proceedings of International Conference on Learning Representations, 2020.
  • [25] M. Zhang, J. Lucas, J. Ba, and G. Hinton, “Lookahead Optimizer: k steps forward, 1 step back,” in Proceedings of Advances in Neural Information Processing Systems, 2019, pp. 9597–9608.
  • [26] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in Proceedings of International Conference on Learning Representations, 2017.
  • [27] H. Sawada, N. Ono, H. Kameoka, D. Kitamura, and H. Saruwatari, “A review of blind source separation methods: two converging routes to ILRMA originating from ICA and NMF,” APSIPA Transactions on Signal and Information Processing, vol. 8, no. e12, 14 pages, 2019.