
MULTI-SCALE TEMPORAL-FREQUENCY ATTENTION FOR MUSIC SOURCE SEPARATION

Abstract

In recent years, approaches based on deep neural networks (DNNs) have achieved state-of-the-art performance for music source separation (MSS). Although previous methods have addressed large-receptive-field modeling in various ways, the temporal and frequency correlations of the music spectrogram, which contains repeated patterns, have not been explicitly explored for the MSS task. In this paper, a temporal-frequency attention module is proposed to model the spectrogram correlations along both the temporal and frequency dimensions. Moreover, a multi-scale attention is proposed to effectively capture these correlations for music signals. Experimental results on the MUSDB18 dataset show that the proposed method outperforms existing state-of-the-art systems with a signal-to-distortion ratio (SDR) of 9.51 dB on the vocal stems, which is the primary practical application of MSS.

Index Terms—  Music source separation, deep neural network, attention, multi-scale

1 Introduction

During music production, recordings of vocals and individual instruments, called stems, are mixed together into the final song. Music source separation (MSS) aims to separate the mixed signal into the individual stems. Since the separated stems can be used in various applications such as Karaoke systems [1] or music up-mixing [2], MSS has received increasing interest in recent years. As a subtask of the Signal Separation Evaluation Campaign (SiSEC), the separated stems of MSS are categorized into vocals, bass, drums and other [3].

While traditional approaches have been proposed in [4, 5], methods based on deep neural networks (DNNs) have outperformed them in recent years. In [6, 7], neural networks with several fully connected layers were utilized to separate the audio sources; to capture the temporal context, features of multiple frames were concatenated as the network input. In [8, 9, 10], recurrent neural networks were used to capture longer temporal contexts. In most recent works [11, 12, 13, 14, 15], Convolutional Neural Network (CNN) based encoder-decoder architectures were employed and achieved state-of-the-art performance. By stacking several 2-dimensional CNN layers, the model can capture both temporal and frequency context. To obtain a large temporal-frequency receptive field efficiently, one popular operation is repeatedly resampling the feature maps [16, 17, 18]. More specifically, the feature maps are downsampled repeatedly in the encoder, so that the CNN layers operating on the lower-resolution representation obtain a larger receptive field. These low-resolution feature maps are then upsampled repeatedly in the decoder to recover the resolution of the input feature.

Besides, several additional modules have been proposed to further capture the temporal and frequency context within the encoder-decoder architecture. In [11, 13], LSTM layers were added between the encoder and the decoder to efficiently model long-term musical structures. In [19], a time-distributed fully-connected network was proposed to extract the long-range correlations along the frequency axis. In [12], multi-dilated convolution with different dilation factors was utilized to model different resolutions and obtain a larger temporal and frequency receptive field. In [15], a sufficiently large receptive field was obtained by a residual UNet architecture with up to 143 layers, achieving state-of-the-art MSS performance.

Fig. 1: (a) The proposed system architecture (Separator* is shown in Fig.2 and Fig.3). (b) The structure of a DenseNet block. (c) The structure of a Conv2D block. (d) The structure of a gated block in the decoder.

While most existing MSS systems can model a large receptive field, the correlations along the temporal or frequency dimension have not been explicitly exploited. This is especially crucial for the MSS task [20], for example, the temporal correlation in beat and downbeat patterns, and the frequency correlation in chorus, harmony and chords. In [21, 22] self-attention was used to exploit the long-term dependencies of music signals, but only the attention along the temporal dimension was considered. Motivated by the success of the temporal and frequency self-attention mechanism in the speech enhancement task [23], a new separation module with temporal and frequency self-attention layers is proposed to capture the spectrogram correlations within the encoder-decoder based MSS architecture. Moreover, considering the different frequency ranges of various instruments and the rapid changes of music content, a multi-scale mechanism is introduced to capture the correlations, improving the robustness of the proposed method on various music styles. Compared to mainstream MSS systems, the proposed method also provides a new way to obtain a large receptive field without many repeated resampling layers or other additional modules.

The contributions of this paper can be summarized as follows: 1) We propose temporal-frequency attention layers in an encoder-decoder based architecture to capture the spectrum correlations for MSS. 2) We further introduce a multi-scale mechanism to effectively model spectrum correlations over different temporal and frequency ranges. 3) We experimentally show the effectiveness of the proposed systems, which achieve state-of-the-art results on the MUSDB18 dataset [3].

2 PROPOSED SYSTEM

2.1 Overview

As shown in Fig.1(a), the neural network consists of an encoder, a separator and a decoder. It takes a discrete stereo signal with $N$ samples per channel, $\boldsymbol{y}\in\mathbb{R}^{N\times 2}$, as the input. The input signal is transformed into a time-frequency domain representation $Y_{C}\in\mathbb{C}^{T\times F\times 2}$ via the STFT, where $T$ is the number of frames and $F$ is the number of frequency bins of the complex spectrogram; 2 refers to the two-channel stereo input. To form the input of the neural network, the real and imaginary components are concatenated as $Y_{R}=[\mathrm{real}(Y_{C});\mathrm{imag}(Y_{C})]\in\mathbb{R}^{T\times F\times 4}$.

To reduce the computational cost for the high-resolution input, following the subband mechanism proposed in [24], the fullband spectrogram is sliced into $K$ subbands to form the channel-wise subband signal $Y^{\prime}_{R}\in\mathbb{R}^{T\times(F/K)\times(4\times K)}$ that serves as the encoder input, where $K$ is set to 4 according to our preliminary results. The neural network estimates a channel-wise subband mask $M^{\prime}_{R}\in\mathbb{R}^{T\times(F/K)\times(4\times K)}$, which is then reshaped to the fullband mask $M_{C}\in\mathbb{C}^{T\times F\times 2}$ for the stereo complex spectrogram. Finally, the target spectrogram is estimated by multiplying the mixture spectrogram $Y_{C}$ with the estimated mask $M_{C}$, and the time-domain target signal $\hat{\boldsymbol{s}}$ is obtained via the iSTFT.
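For concreteness, the following PyTorch sketch illustrates the front-end described above (STFT, real/imaginary concatenation, channel-wise subband slicing, and mask application). It is not the authors' implementation; the subband slicing order and the handling of the extra STFT bin (the paper reports $F=4096$) are assumptions.

```python
# A minimal sketch of the front-end described above (assumptions noted inline).
import torch

def preprocess(y, n_fft=8192, hop=1024, K=4):
    """y: (N, 2) stereo waveform -> (complex spectrogram Y_C, channel-wise subband input)."""
    Y = torch.stft(y.T, n_fft=n_fft, hop_length=hop,
                   window=torch.hann_window(n_fft), return_complex=True)
    Y = Y.permute(2, 1, 0)                      # complex spectrogram Y_C: (T, F, 2)
    Y = Y[:, :Y.shape[1] // K * K]              # keep F divisible by K (paper: F = 4096)
    Y_R = torch.cat([Y.real, Y.imag], dim=-1)   # real-valued input Y_R: (T, F, 4)
    T, F, C = Y_R.shape
    # slice the full band into K contiguous subbands and stack them along channels
    Y_sub = Y_R.reshape(T, K, F // K, C).permute(0, 2, 3, 1).reshape(T, F // K, C * K)
    return Y, Y_sub                             # Y_sub: (T, F/K, 4*K)

def separate(Y_C, M_C):
    """Multiply the mixture spectrogram with the estimated complex mask (iSTFT follows)."""
    return Y_C * M_C
```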

2.2 Encoder and Decoder

As shown in Fig.1(a), the encoder has three encoder blocks (EBs), each consisting of a DenseNet block (detailed in Fig.1(b)) and a Conv2D block (detailed in Fig.1(c)). The DenseNet block consists of four Conv2D blocks with concatenation operations. It learns explicit cross-layer interactions and reuses features computed in preceding layers, which yields efficient parameter utilization and suits the MSS problem, as discussed in [18]. The Conv2D block consists of a Conv2D layer, a Batch Normalization layer and an ELU activation layer.
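As a reference for how these blocks compose, below is a plausible PyTorch rendering of the Conv2D block (Conv2D, BatchNorm, ELU) and the DenseNet block (four Conv2D blocks with feature concatenation). The growth pattern inside the DenseNet block, i.e. which concatenated features each Conv2D block sees and what the block finally outputs, is our assumption and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class Conv2DBlock(nn.Module):
    """Conv2D -> BatchNorm -> ELU, as in Fig.1(c)."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3), stride=(1, 1)):
        super().__init__()
        pad = (kernel[0] // 2, kernel[1] // 2)
        self.net = nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel, stride, padding=pad),
                                 nn.BatchNorm2d(out_ch), nn.ELU())

    def forward(self, x):
        return self.net(x)

class DenseNetBlock(nn.Module):
    """Four Conv2D blocks; each sees the concatenation of the input and all previous outputs."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = in_ch
        for _ in range(4):
            self.blocks.append(Conv2DBlock(ch, out_ch))
            ch += out_ch                       # features are reused via concatenation

    def forward(self, x):
        feats = [x]
        for blk in self.blocks:
            feats.append(blk(torch.cat(feats, dim=1)))
        return feats[-1]                       # assumption: the last block's output is returned
```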

The decoder consists of three decoder blocks (DBs) followed by a Conv2D block and a Dense layer. Each decoder block consists of a gated block [23] and a DenseNet block. The gated block consists of a Conv2DTranspose layer and two Conv2D blocks (detailed in Fig.1(c)) as shown in Fig.1(d); it learns a multiplicative mask for the feature from the encoder and suppresses its undesired parts. The structure of the DenseNet block in the decoder is identical to the one used in the encoder. After a Conv2D block, a Dense layer with tanh activation is utilized to generate the real and imaginary components of the complex ideal ratio mask (cIRM), bounded to [-1, 1]. The cIRM is further extended to [-2, 2] by multiplying by an expansion factor of 2, which increases the upper bound of the oracle SDR as discussed in [15]. The expansion factor is chosen to maximize the separation SDR according to our preliminary experiments. The detailed hyper-parameters of the encoder and decoder are listed in Table 1.
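The gated block can be read roughly as follows; the sigmoid at the end of the gate path and the way the upsampled feature and the gated skip connection are combined are assumptions, since the paper only states that a multiplicative mask suppresses the undesired parts of the encoder feature.

```python
import torch
import torch.nn as nn

def conv_bn_elu(in_ch, out_ch, kernel=(3, 3)):
    pad = (kernel[0] // 2, kernel[1] // 2)
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel, padding=pad),
                         nn.BatchNorm2d(out_ch), nn.ELU())

class GatedBlock(nn.Module):
    """Conv2DTranspose upsampling plus two Conv2D blocks gating the encoder skip feature."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3), stride=(1, 1)):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel, stride)
        # the two Conv2D blocks of Fig.1(d) produce the mask; the final sigmoid is an assumption
        self.gate = nn.Sequential(conv_bn_elu(out_ch, out_ch),
                                  conv_bn_elu(out_ch, out_ch), nn.Sigmoid())

    def forward(self, x, skip):
        x = self.up(x)
        x = x[..., :skip.shape[-2], :skip.shape[-1]]   # crop to the encoder feature size
        return x + skip * self.gate(x)                 # suppress undesired parts of the skip

def expand_cirm(mask_tanh, factor=2.0):
    """Extend the tanh-bounded cIRM from [-1, 1] to [-2, 2], as described above."""
    return factor * mask_tanh
```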

Fig. 2: (a) Separator with temporal-frequency attention. (b) The structure of the residual attention (RA) block. (c) The structure of temporal self-attention and frequency self-attention.
Table 1: The configurations of the encoder and decoder
Layer  Sub-layer        Channel  Kernel  Stride
EB1    DenseNet         32       (3,3)   (1,1)
       Conv2D           32       (3,3)   (1,1)
EB2    DenseNet         64       (3,3)   (1,1)
       Conv2D           64       (3,3)   (2,2)
EB3    DenseNet         64       (3,3)   (1,1)
       Conv2D           64       (3,3)   (1,2)
DB1    Conv2DTranspose  64       (3,3)   (1,2)
       Conv2D           64       (1,1)   (1,1)
       DenseNet         64       (3,3)   (1,1)
DB2    Conv2DTranspose  64       (3,3)   (2,2)
       Conv2D           64       (1,1)   (1,1)
       DenseNet         64       (3,3)   (1,1)
DB3    Conv2DTranspose  32       (3,3)   (1,1)
       Conv2D           32       (1,1)   (1,1)
       DenseNet         32       (3,3)   (1,1)
Out    Conv2D           4×K      (1,1)   (1,1)

2.3 Temporal-Frequency Attention based Separator

As shown in Fig.2(a), the temporal-frequency attention based separator consists of four residual attention (RA) blocks (detailed in Fig.2(b)). The input feature map of an RA block is $\mathcal{F}^{In}\in\mathbb{R}^{T^{\prime}\times F^{\prime}\times C}$, generated by the encoder or the previous RA block, where $T^{\prime}$ is the number of time steps, $F^{\prime}$ is the feature dimension and $C$ is equal to 64. The input feature map is fed into two residual blocks, each consisting of two Conv2D blocks with a kernel size of (3,3) and a stride of (1,1). The output of the residual blocks $\mathcal{F}^{Res}\in\mathbb{R}^{T^{\prime}\times F^{\prime}\times C}$ is then fed in parallel into the temporal self-attention (TSA) and the frequency self-attention (FSA) blocks to capture the global dependencies along the temporal and frequency dimensions, respectively. The outputs of the two self-attention blocks, $\mathcal{F}^{Temp}\in\mathbb{R}^{T^{\prime}\times F^{\prime}\times C}$ and $\mathcal{F}^{Freq}\in\mathbb{R}^{T^{\prime}\times F^{\prime}\times C}$, are concatenated with $\mathcal{F}^{Res}$ and fed into a Conv2D block to generate the output of the RA block $\mathcal{F}^{RA}\in\mathbb{R}^{T^{\prime}\times F^{\prime}\times C}$; the kernel size and stride of this output Conv2D are both (1,1).
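A compact sketch of the RA block wiring is given below; `tsa` and `fsa` stand for the temporal and frequency self-attention modules of Fig.2(c) (a possible implementation is sketched after Eq. (2)), and the skip connection inside each residual block is our reading of the term "residual block".

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two (3,3) Conv2D blocks with a skip connection (the skip is our assumption)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ELU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ELU())

    def forward(self, x):
        return x + self.body(x)

class RABlock(nn.Module):
    """Residual attention block of Fig.2(b)."""
    def __init__(self, ch, tsa, fsa):
        super().__init__()
        self.res = nn.Sequential(ResidualBlock(ch), ResidualBlock(ch))
        self.tsa, self.fsa = tsa, fsa          # temporal / frequency self-attention modules
        self.out = nn.Sequential(nn.Conv2d(3 * ch, ch, 1), nn.BatchNorm2d(ch), nn.ELU())

    def forward(self, x):                      # x: (B, C, T', F')
        r = self.res(x)
        y = torch.cat([r, self.tsa(r), self.fsa(r)], dim=1)   # [F^Res; F^Temp; F^Freq]
        return self.out(y)                     # (1,1) Conv2D block back to C channels
```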

The TSA and FSA blocks share the same structure with different reshape operations, as shown in Fig.2(c). The input $\mathcal{F}^{Res}\in\mathbb{R}^{T^{\prime}\times F^{\prime}\times C}$ is fed in parallel into Conv2D blocks. The kernel size and stride of these Conv2D blocks are both (1,1), and the channel number is halved to $\frac{C}{2}$ to reduce the computational complexity. The output feature maps of the Conv2D blocks in $\mathbb{R}^{T^{\prime}\times F^{\prime}\times\frac{C}{2}}$ are then reshaped to $\mathcal{F}_{t}^{k}\in\mathbb{R}^{T^{\prime}\times(\frac{C}{2}\times F^{\prime})}$ for TSA or $\mathcal{F}_{f}^{k}\in\mathbb{R}^{F^{\prime}\times(\frac{C}{2}\times T^{\prime})}$ for FSA, respectively, where $k\in\{K,Q,V\}$ and $K$, $Q$, $V$ indicate the key, query and value of the scaled dot-product self-attention [25]. For TSA, the self-attention is formulated as:

$SA^{t}=\mathrm{Softmax}\left(\mathcal{F}_{t}^{Q}\cdot(\mathcal{F}_{t}^{K})^{H}\Big/\sqrt{\tfrac{C}{2}\times F^{\prime}}\right)\cdot\mathcal{F}_{t}^{V}$   (1)

where $SA^{t}\in\mathbb{R}^{T^{\prime}\times(\frac{C}{2}\times F^{\prime})}$, $(\cdot)^{H}$ denotes the matrix transpose and $\cdot$ denotes matrix multiplication. For FSA, the self-attention is formulated as:

$SA^{f}=\mathrm{Softmax}\left(\mathcal{F}_{f}^{Q}\cdot(\mathcal{F}_{f}^{K})^{H}\Big/\sqrt{\tfrac{C}{2}\times T^{\prime}}\right)\cdot\mathcal{F}_{f}^{V}$   (2)

where $SA^{f}\in\mathbb{R}^{F^{\prime}\times(\frac{C}{2}\times T^{\prime})}$. $SA^{t}$ and $SA^{f}$ are further reshaped to $\mathbb{R}^{T^{\prime}\times F^{\prime}\times\frac{C}{2}}$ and then fed into a Conv2D block with channel number $C$, kernel size (1,1) and stride (1,1). The input $\mathcal{F}^{Res}$ is added to the output of this Conv2D block to obtain the final temporal self-attention output $\mathcal{F}^{Temp}$ or frequency self-attention output $\mathcal{F}^{Freq}$.
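The TSA/FSA computation of Eq. (1) and Eq. (2) can be transcribed almost literally; the sketch below folds both into one module and, for brevity, uses plain 1x1 convolutions for the $K$, $Q$, $V$ projections instead of full Conv2D blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxisSelfAttention(nn.Module):
    """Scaled dot-product self-attention along time (TSA) or frequency (FSA)."""
    def __init__(self, ch, axis="time"):
        super().__init__()
        self.axis = axis                                  # "time" -> TSA, "freq" -> FSA
        self.q = nn.Conv2d(ch, ch // 2, 1)                # C/2 channels for lower complexity
        self.k = nn.Conv2d(ch, ch // 2, 1)
        self.v = nn.Conv2d(ch, ch // 2, 1)
        self.out = nn.Sequential(nn.Conv2d(ch // 2, ch, 1), nn.BatchNorm2d(ch), nn.ELU())

    def _flatten(self, x):                                # x: (B, C/2, T', F')
        if self.axis == "time":
            return x.permute(0, 2, 1, 3).flatten(2)       # (B, T', C/2 * F')
        return x.permute(0, 3, 1, 2).flatten(2)           # (B, F', C/2 * T')

    def forward(self, x):                                 # x: (B, C, T', F')
        B, C, T, Fr = x.shape
        q, k, v = (self._flatten(m(x)) for m in (self.q, self.k, self.v))
        attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        sa = attn @ v                                     # Eq.(1) for TSA, Eq.(2) for FSA
        if self.axis == "time":                           # reshape back to (B, C/2, T', F')
            sa = sa.view(B, T, C // 2, Fr).permute(0, 2, 1, 3)
        else:
            sa = sa.view(B, Fr, C // 2, T).permute(0, 2, 3, 1)
        return x + self.out(sa)                           # residual connection to F^Res
```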

2.4 Multi-scale Temporal-Frequency Attention

In Section 2.3, the temporal attention is calculated over all frequency bins, and the frequency attention is calculated over all input time steps. However, considering the different frequency ranges of various instruments and the rapid changes of music content, using all frequency or temporal features might not be optimal for the attention calculation. Therefore, a multi-scale segment-wise attention is proposed, which calculates attention over different frequency or temporal ranges. More specifically, the input of TSA and FSA is first sliced into $P$ segments along the frequency and temporal dimensions, respectively. The attention is then calculated for each segment individually, and the results are combined into the final output. The segment-wise TSA is formulated as:

Fig. 3: Separator with multi-scale temporal-frequency attention.
$\mathcal{F}^{Temp}=\mathrm{Concat}\left(\mathcal{F}_{FP}^{Temp}(1),\dots,\mathcal{F}_{FP}^{Temp}(P)\right)$   (3)
$\mathcal{F}_{FP}^{Temp}(i)=\mathrm{TSA}\left(\mathcal{F}_{FP}^{Res}(i)\right)$   (4)

where $\{\mathcal{F}_{FP}^{Temp}(i),\mathcal{F}_{FP}^{Res}(i)\}\in\mathbb{R}^{T^{\prime}\times\frac{F^{\prime}}{P}\times C}$ are the $i$-th segments of $\mathcal{F}_{FP}^{Temp}$ and $\mathcal{F}_{FP}^{Res}$, respectively. The subscript $FP$ indicates that $\mathcal{F}^{Temp}$ or $\mathcal{F}^{Res}$ is sliced into $P$ segments along the frequency dimension. $\mathrm{TSA}(\cdot)$ is the temporal self-attention module shown in Fig.2(c). The segment-wise FSA is formulated as:

$\mathcal{F}^{Freq}=\mathrm{Concat}\left(\mathcal{F}_{TP}^{Freq}(1),\dots,\mathcal{F}_{TP}^{Freq}(P)\right)$   (5)
$\mathcal{F}_{TP}^{Freq}(i)=\mathrm{FSA}\left(\mathcal{F}_{TP}^{Res}(i)\right)$   (6)

where $\{\mathcal{F}_{TP}^{Freq}(i),\mathcal{F}_{TP}^{Res}(i)\}\in\mathbb{R}^{\frac{T^{\prime}}{P}\times F^{\prime}\times C}$ are the $i$-th segments of $\mathcal{F}_{TP}^{Freq}$ and $\mathcal{F}_{TP}^{Res}$, respectively. The subscript $TP$ indicates that $\mathcal{F}^{Freq}$ or $\mathcal{F}^{Res}$ is sliced into $P$ segments along the temporal dimension.
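In code, the segment-wise attention of Eq. (3)-(6) amounts to slicing, applying the single-scale attention per segment, and concatenating; a minimal sketch (assuming $T^{\prime}$ and $F^{\prime}$ are divisible by $P$):

```python
import torch

def segmented_attention(attn, x, P, axis):
    """x: (B, C, T', F'); axis=3 slices frequency for TSA, axis=2 slices time for FSA."""
    segments = torch.chunk(x, P, dim=axis)        # P equal slices along the chosen axis
    outputs = [attn(seg) for seg in segments]     # attention within each segment only
    return torch.cat(outputs, dim=axis)           # Eq.(3)/(5): concatenate the results
```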

RA blocks with different values of $P$ can be combined to form a multi-scale mechanism. Fig.2(a) shows the single-scale attention, which consists of four RA blocks with $P=1$. Fig.3 shows the separator with multi-scale attention, which consists of RA blocks with $P=1,2,4,8$. To obtain both small-scale and large-scale attention at the same layer, a two-branch structure with parallel RA blocks is introduced, where one branch contains RA blocks with increasing $P$ values while the other contains RA blocks with decreasing $P$ values.
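One possible way to assemble the two-branch multi-scale separator of Fig.3 is sketched below; `make_ra_block(ch, P)` is a hypothetical factory returning an RA block whose attention uses $P$ segments, and the 1x1 fusion of the two branches is our assumption since the paper does not detail how the branches are merged.

```python
import torch
import torch.nn as nn

class MultiScaleSeparator(nn.Module):
    def __init__(self, make_ra_block, ch, scales=(1, 2, 4, 8)):
        super().__init__()
        # one branch with increasing P, the other with decreasing P (Fig.3)
        self.up_branch = nn.ModuleList([make_ra_block(ch, P) for P in scales])
        self.down_branch = nn.ModuleList([make_ra_block(ch, P) for P in reversed(scales)])
        self.fuse = nn.Conv2d(2 * ch, ch, 1)      # fusion of the two branches (assumption)

    def forward(self, x):
        a, b = x, x
        for ra_up, ra_down in zip(self.up_branch, self.down_branch):
            a, b = ra_up(a), ra_down(b)
        return self.fuse(torch.cat([a, b], dim=1))
```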

2.5 Loss Function

A joint loss function combining time-domain and frequency-domain losses is employed to train the network. The time-domain loss is defined as the mean absolute error (MAE) between the target signal $\boldsymbol{s}$ and the estimated signal $\hat{\boldsymbol{s}}$,

$\mathcal{L}_{time}=||\boldsymbol{s}-\hat{\boldsymbol{s}}||_{1}$   (7)

where $||\cdot||_{1}$ denotes the L1 norm. The frequency-domain loss is defined as the MAE between the target complex spectrum $\boldsymbol{S}$ and the estimated complex spectrum $\hat{\boldsymbol{S}}$,

$\mathcal{L}_{freq}=||\mathrm{real}(\boldsymbol{S}-\hat{\boldsymbol{S}})||_{1}+||\mathrm{imag}(\boldsymbol{S}-\hat{\boldsymbol{S}})||_{1}$   (8)

The overall loss is defined as,

$\mathcal{L}=\mathcal{L}_{time}+\alpha\cdot\mathcal{L}_{freq}$   (9)

where $\alpha$ is set to 0.1 based on preliminary experimental results.
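The joint loss of Eq. (7)-(9) can be written directly; the sketch below uses per-element means for the MAE terms and the 8192/1024 STFT from the experimental setup in Section 3, with the normalization of the L1 terms being an implementation choice.

```python
import torch

def mss_loss(s_hat, s, n_fft=8192, hop=1024, alpha=0.1):
    """s_hat, s: (B, channels, N) estimated and target waveforms."""
    l_time = (s - s_hat).abs().mean()                                 # Eq.(7)
    win = torch.hann_window(n_fft, device=s.device)
    S = torch.stft(s.flatten(0, 1), n_fft, hop, window=win, return_complex=True)
    S_hat = torch.stft(s_hat.flatten(0, 1), n_fft, hop, window=win, return_complex=True)
    l_freq = ((S.real - S_hat.real).abs().mean()
              + (S.imag - S_hat.imag).abs().mean())                   # Eq.(8)
    return l_time + alpha * l_freq                                    # Eq.(9)
```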

3 Experiments

3.1 Dataset and setup

The proposed system is evaluated on the MUSDB18 dataset [3], which consists of 150 songs in stereo format with a 44.1 kHz sampling rate. For each song, the final mixture signal is provided together with its four audio stems, namely vocals, bass, drums and other. We adopt the official split of 86, 14 and 50 songs for training, development and evaluation, respectively. The audio recordings are split into segments of around 5.6 seconds (240 frames) with a 2-second (86-frame) shift. The time-domain segments are transformed to the time-frequency domain using an 8192-sample STFT with a 1024-sample hop size. The complex spectrum $Y_{C}\in\mathbb{C}^{T\times F}$ of the mixture clip is used as the system input, where $T=240$ and $F=4096$. Data augmentation with channel swapping and remixing [26] is applied on the fly during model training.

For each audio source, we train a dedicated model. The Adam optimizer is employed with an initial learning rate of 0.001, which decays by a factor of 0.8 when the validation loss does not decrease for 10 epochs. All models are trained for 300 epochs.
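The optimizer and learning-rate schedule described above map onto standard PyTorch components; a minimal sketch (`model` is any of the per-source networks):

```python
import torch

def make_optimizer(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # decay the learning rate by 0.8 when the validation loss has not improved for 10 epochs
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min",
                                                       factor=0.8, patience=10)
    return opt, sched

# per epoch: opt.step() during training, then sched.step(validation_loss)
```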

3.2 Comparison with the existing systems

We compare the proposed Multi-scale Temporal-Frequency Attention Network (MTFAttNet) with existing state-of-the-art systems on the MUSDB18 dataset in Table 2. The signal-to-distortion ratio (SDR) [27] computed by the museval toolbox [3] is used as the evaluation metric. The upper half of Table 2 compares the proposed MTFAttNet with existing single-domain systems, including Spleeter [14], D3Net [12], Demucs [13] and ResUNetDecouple+ [15]; W and S indicate the waveform domain and the spectrogram domain, respectively. The bottom half of Table 2 lists the results of the top-ranked hybrid-domain systems in the Music Demixing (MDX) challenge at ISMIR 2021 [28], namely KUIELab-MDX-Net [29] and Hybrid Demucs [30]; W+S indicates that the method works in the hybrid domain.

As shown in Table 2, compared to the existing single-domain methods, the proposed MTFAttNet achieves significant improvements in separating the vocals, drums and other stems. The bass SDR of the waveform-domain method Demucs is higher than that of all spectrogram-domain methods, which might be caused by the limited frequency resolution of the spectrum for bass. The overall SDR of the proposed system is 7.26 dB, which outperforms the best existing single-domain method (ResUNetDecouple+, with an SDR of 6.73 dB). Although the overall SDR is lower than that of the state-of-the-art hybrid-domain systems, the proposed method still achieves the best vocal separation performance with an SDR of 9.51 dB, outperforming the best existing hybrid-domain method (KUIELab-MDX-Net, with a vocal SDR of 9.00 dB).
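For reference, the SDR values in Table 2 are the BSSEval metrics computed by the museval toolbox; a minimal per-track sketch is shown below, where aggregating by the median over frames (and then over tracks) follows the SiSEC/MDX convention and is our assumption about how the table was produced.

```python
import numpy as np
import museval

def track_sdr(references, estimates):
    """references/estimates: (n_sources, n_samples, 2) arrays for one track."""
    sdr, isr, sir, sar = museval.evaluate(references, estimates)
    return np.nanmedian(sdr, axis=1)   # per-source SDR, median over 1-second frames
```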

Table 2: SDR comparison for the proposed and existing MSS systems.
Method Domain Vocals Bass Drums Other All
Spleeter S 6.86 5.51 6.71 4.55 5.91
D3Net S 7.24 5.25 7.01 4.53 6.01
Demucs W 6.84 7.01 6.86 4.42 6.28
ResUNetDecouple+ S 8.98 6.04 6.62 5.29 6.73
MTFAttNet (proposed) S 9.51 6.43 7.39 5.69 7.26
KUIELab-MDX-Net W+S 9.00 7.86 7.33 5.95 7.54
Hybrid Demucs W+S 8.04 8.67 8.58 5.59 7.72
Table 3: SDR comparison for different attention mechanisms.
Method Vocals Bass Drums Other All
noAttNet 7.17 6.11 5.52 4.82 5.90
FAttNet 8.42 6.19 6.44 5.56 6.65
TAttNet 8.34 6.09 7.29 5.43 6.79
TFAttNet 9.23 6.31 7.34 5.49 7.09
MTFAttNet 9.51 6.43 7.39 5.69 7.26

3.3 Attention mechanism study

In this section, to better understand the benefit of the proposed MTFAttNet for the MSS task, we further evaluate systems in which the separator of MTFAttNet is replaced with different attention mechanisms, as summarized in Table 3. Condition TFAttNet employs the temporal-frequency attention with the single-scale attention structure illustrated in Fig.2(a). FAttNet and TAttNet use only the frequency attention module (with the temporal attention removed) and only the temporal attention module (with the frequency attention removed), respectively. Condition noAttNet removes both the temporal and frequency attention modules from TFAttNet.

We first discuss the overall performance of the different attention systems. As shown in Table 3, noAttNet achieves an overall SDR (5.90 dB) similar to Spleeter (5.91 dB, listed in Table 2), indicating the effectiveness of the proposed generic network structure without attention. With the frequency attention (FAttNet) and the temporal attention (TAttNet), the overall SDR is further improved to 6.65 dB and 6.79 dB respectively, which indicates the effectiveness of attention in capturing both temporal and frequency correlations. Combining both temporal and frequency attention, TFAttNet achieves an SDR of 7.09 dB. With the multi-scale attention mechanism, MTFAttNet captures the spectrogram correlations more effectively and achieves the highest SDR of 7.26 dB.

The effect of the attention mechanisms varies across stem types. For the bass stems, applying attention mechanisms introduces smaller improvements than for the other stem types in Table 3. This is potentially caused by the limited performance of spectrogram-domain methods for the bass stems, as shown in Table 2: on top of noAttNet, the additional attention modules cannot effectively capture the spectrogram correlations for bass given the limited frequency resolution. For drums, FAttNet achieves an SDR of 6.44 dB while TAttNet achieves an SDR of 7.29 dB. The temporal attention outperforms the frequency attention for drums, which might be due to the fact that drums contain repeated beats in the temporal domain. For vocals, compared to noAttNet, both temporal and frequency attention significantly improve the performance. Combining both temporal and frequency attention increases the SDR to 9.23 dB, and the multi-scale attention further increases it to 9.51 dB. The success of attention on vocals might benefit from the relatively long duration of pitches and the harmonic structure of vocals, which lead to high correlations in both the temporal and frequency domains. The results show that such temporal and frequency correlations can be effectively modeled by the proposed MTFAttNet system.

4 CONCLUSION

In this paper, we proposed a novel neural network architecture called MTFAttNet for music source separation. MTFAttNet employs a temporal-frequency attention module to exploit the spectrogram correlations along the temporal and frequency dimensions. A multi-scale mechanism is also proposed to make the attention calculation more effective. The experimental results show that the proposed method achieves state-of-the-art performance on the MUSDB18 dataset. In future work, we will explore the combination of waveform-domain and spectrogram-domain methods to further improve the separation results for instruments.

References

  • [1] Zafar Rafii and Bryan Pardo, “Repeating pattern extraction technique (REPET): A simple method for music/voice separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 1, pp. 73–84, 2012.
  • [2] Derry Fitzgerald, “Upmixing from mono - a source separation approach,” in 2011 International Conference on Digital Signal Processing (DSP). IEEE, 2011, pp. 1–7.
  • [3] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito, “The 2018 signal separation evaluation campaign,” in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2018, pp. 293–305.
  • [4] Daniel D Lee and H Sebastian Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
  • [5] Mike E Davies and Christopher J James, “Source separation using single-channel ICA,” Signal Processing, vol. 87, no. 8, pp. 1819–1832, 2007.
  • [6] Stefan Uhlich, Franck Giron, and Yuki Mitsufuji, “Deep neural network based instrument extraction from music,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 2135–2139.
  • [7] Aditya Arie Nugraha, Antoine Liutkus, and Emmanuel Vincent, “Multichannel music separation with deep neural networks,” in 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 1748–1752.
  • [8] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, “Improving music source separation based on deep neural networks through data augmentation and network blending,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 261–265.
  • [9] Fabian-Robert Stöter, Stefan Uhlich, Antoine Liutkus, and Yuki Mitsufuji, “Open-Unmix - a reference implementation for music source separation,” Journal of Open Source Software, vol. 4, no. 41, pp. 1667, 2019.
  • [10] Jen-Yu Liu and Yi-Hsuan Yang, “Dilated convolution with dilated GRU for music source separation,” International Joint Conferences on Artificial Intelligence Organization (IJCAI), 2019.
  • [11] Naoya Takahashi, Nabarun Goswami, and Yuki Mitsufuji, “Mmdenselstm: An efficient combination of convolutional and recurrent neural networks for audio source separation,” in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2018, pp. 106–110.
  • [12] Naoya Takahashi and Yuki Mitsufuji, “D3net: Densely connected multidilated densenet for music source separation,” arXiv preprint arXiv:2010.01733, 2020.
  • [13] Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach, “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
  • [14] Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam, “Spleeter: a fast and efficient music source separation tool with pre-trained models,” Journal of Open Source Software, vol. 5, no. 50, pp. 2154, 2020.
  • [15] Qiuqiang Kong, Yin Cao, Haohe Liu, Keunwoo Choi, and Yuxuan Wang, “Decoupling magnitude and phase estimation with deep resunet for music source separation,” in Proceedings of the ISMIR 2021 Workshop on Music Source Separation, 2021.
  • [16] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde, “Singing voice separation with deep u-net convolutional networks,” Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), p. 323–332, 2017.
  • [17] Daniel Stoller, Sebastian Ewert, and Simon Dixon, “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” International Society for Music Information Retrieval Conference (ISMIR), 2018.
  • [18] Naoya Takahashi and Yuki Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 21–25.
  • [19] Woosung Choi, Minseok Kim, Jaehwa Chung, Daewon Lee, and Soonyoung Jung, “Investigating u-nets with various intermediate blocks for spectrogram-based singing voice separation,” 21st International Society for Music Information Retrieval Conference, 2020.
  • [20] Jouni Paulus, Meinard Müller, and Anssi Klapuri, “State of the art report: Audio-based music structure analysis,” in ISMIR, Utrecht, 2010, pp. 625–636.
  • [21] Yuzhou Liu, Balaji Thoshkahna, Ali Milani, and Trausti Kristjansson, “Voice and accompaniment separation in music using self-attention convolutional neural network,” arXiv preprint arXiv:2003.08954, 2020.
  • [22] Tingle Li, Jiawei Chen, Haowen Hou, and Ming Li, “Sams-net: A sliced attention-based neural network for music source separation,” in 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2021, pp. 1–5.
  • [23] C. Zheng, X. Peng, Y. Zhang, S. Srinivasan, and Y. Lu, “Interactive speech and noise modeling for speech enhancement,” in AAAI, 2021, pp. 14549–14557.
  • [24] Haohe Liu, Lei Xie, Jian Wu, and Geng Yang, “Channel-wise subband input for better voice and accompaniment separation on high resolution music,” INTERSPEECH, 2020.
  • [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 5998–6008.
  • [26] Laure Prétet, Romain Hennequin, Jimena Royo-Letelier, and Andrea Vaglio, “Singing voice separation: A study on training data,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 506–510.
  • [27] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, “Performance measurement in blind audio source separation,” IEEE transactions on audio, speech, and language processing, vol. 14, no. 4, pp. 1462–1469, 2006.
  • [28] Yuki Mitsufuji, Giorgio Fabbro, Stefan Uhlich, Fabian-Robert Stöter, Alexandre Défossez, Minseok Kim, Woosung Choi, Chin-Yun Yu, and Kin-Wai Cheuk, “Music demixing challenge 2021,” Frontiers in Signal Processing, vol. 1, 2022.
  • [29] Minseok Kim, Woosung Choi, Jaehwa Chung, Daewon Lee, and Soonyoung Jung, “Kuielab-mdx-net: A two-stream neural network for music demixing,” Proceedings of the MDX Workshop, 2021.
  • [30] Alexandre Défossez, “Hybrid spectrogram and waveform source separation,” Proceedings of the ISMIR 2021 Workshop on Music Source Separation, 2021.