Music Source Separation with Deep Equilibrium Models
Abstract
Although deep neural network-based music source separation (MSS) achieves high performance, its model size is often a problem for practical deployment. Recently proposed deep implicit architectures such as deep equilibrium models (DEQ) can outperform explicit counterparts of limited depth while keeping the number of parameters small. This makes DEQ attractive for MSS, especially as it was originally applied to sequence modeling tasks in natural language processing and should therefore, in principle, also suit MSS. However, a suitable architecture and training scheme for MSS with DEQ still need to be investigated, as the characteristics of acoustic signals differ from those of natural language data. Hence, in this paper we propose an architecture and training scheme for MSS with DEQ. Starting with the architecture of Open-Unmix (UMX), we replace its sequence model with DEQ. We refer to our proposed method as DEQ-based UMX (DEQ-UMX). Experimental results show that DEQ-UMX performs better than the original UMX while reducing its number of parameters by 30%.
Index Terms— Music source separation, deep neural networks, deep implicit layers, deep equilibrium models
1 Introduction
Deep neural network (DNN)-based approaches achieve high performance in acoustic signal processing tasks such as music source separation (MSS) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. However, their model size is often a problem for practical deployment. For example, downloading a software package that includes large DNN models can take a considerable amount of time and frustrate users. Also, larger DNN models require more random access memory and read-only memory, which can be a problem when the available on-device memory is limited. Therefore, methods for model size reduction that preserve the performance of the original network are worth exploring.
Deep implicit layers [11, 12, 13, 14, 15, 16] define layers implicitly such that input and output have to satisfy some joint conditions, e.g., that they are the equilibrium points of an equation. They were recently proposed in the field of machine learning and achieved higher performance than typical networks whose architecture is explicitly defined. Deep equilibrium models (DEQ) [13] are an instance of deep implicit approaches, and their layer output is calculated by finding a root of an equation (i.e., an equilibrium point) parameterized by the layer input. DEQ can also be regarded as a weight-tied network [17] with effectively infinitely many layers. It was shown that DEQ is especially suited for sequence modeling tasks and outperformed other existing methods with a number of parameters comparable to that of only a few explicit layers.
For example, DEQ-TrellisNet and DEQ-Transformer achieved state-of-the-art performance in large-scale language modeling tasks using WikiText-103 [13]. Multi-scale DEQ, which computes several equilibrium points at different scales to deal with high-resolution images, also outperformed other existing methods in some image-processing tasks such as ImageNet classification and semantic segmentation [14]. However, such DEQ-based approaches have not been explored in audio source separation. In [18], equilibriated recurrent neural networks (ERNN) [19] were applied to speech enhancement. ERNN is a similar approach to DEQ as it approximately estimates equilibrium points of ordinary differential equations using the implicit Euler method. However, its motivation is to avoid gradient problems and stabilize the training procedure. Also, ERNN replaces one recurrent layer with ERNN layers whereas DEQ replaces any repeating layers with a DEQ-based layer.
Audio source separation requires sequence modeling as acoustic signals have a sequential structure. A network architecture for source separation is typically composed of an encoder block, a sequence modeling block, and a decoder block [20]. The sequence modeling block is usually implemented with recurrent layers or 1-D or 2-D convolutional layers. For instance, Open-Unmix (UMX) [4] employs a long short-term memory (LSTM) network, while Conv-TasNet [21] employs a temporal convolutional network (TCN). Therefore, DEQ is expected to be well suited as the sequence modeling block of a source separation architecture and to improve performance while reducing the number of parameters. However, since the characteristics of acoustic signals are different from those of natural language data, an architecture and training scheme appropriate for source separation need to be designed.
In this paper, we propose to apply DEQ to MSS. Using the architecture of Open-Unmix (UMX) [4] as a starting point, we replace its sequence model with DEQ. We refer to our proposed method as DEQ-based UMX (DEQ-UMX). DEQ-based approaches usually tend to require a high computational cost due to the number of iterations needed to find the equilibrium point and to obtain a sufficiently large receptive field. Nevertheless, we expect that DEQ-UMX does not require an excessive computational cost and can achieve sufficient performance with a few iterations by employing a bidirectional LSTM (BLSTM) as the core function of DEQ. This is because the BLSTM layer utilizes all information of the input sequence even if the number of iterations is small.
As mentioned earlier, DEQ can be interpreted as an infinite stack of weight-tied networks. We therefore start by experimenting with a weight-tied network for MSS. After confirming its effectiveness, we move to DEQ for further performance improvement. Experimental results show that DEQ-UMX performs better than the original UMX while reducing the number of parameters.


2 Related Work
In this section we review weight-tied networks and DEQ to support our contributions, and Open-Unmix to provide information about a general architecture for DNN-based source separation.
2.1 Weight-tied network
A weight-tied network is realized by employing the same transformation $f_\theta$, parameterized by $\theta$, in each layer of a typical skip-connection-based architecture as
$$\mathbf{h}^{[i+1]} = f_\theta(\mathbf{h}^{[i]}; \mathbf{x}) \tag{1}$$
where $\mathbf{x}$ is the input sequence and $\mathbf{h}^{[i]}$ is the hidden sequence in the $i$-th layer [17]. Such a transformation can also be interpreted as a sequence of identical transformations, where each iteration is identified by the index $i$. This sequence is typically terminated after a fixed number of iterations (i.e., $L$). The number of iterations is equivalent to the number of function evaluations (NFE), thus $L$ is also called NFE. TrellisNet [17], which is implemented by using TCN as $f_\theta$, outperformed other typical DNN-based approaches on some natural language processing (NLP) tasks without a noticeable increase of the number of parameters.
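As a minimal sketch of the recursion in (1), the following toy example applies one shared transformation a fixed number of times. The element-wise map $\tanh(w h + u x)$, the weights `w` and `u`, and the input values are hypothetical stand-ins chosen only for illustration; they are not the transformation used in [17] or in this paper.

```python
import math

def f_theta(h, x, w=0.5, u=1.0):
    """Toy shared transformation (an assumption): tanh(w*h + u*x), element-wise."""
    return [math.tanh(w * hi + u * xi) for hi, xi in zip(h, x)]

def weight_tied_forward(x, num_layers):
    """Apply the same f_theta num_layers times, starting from an all-zero hidden sequence."""
    h = [0.0] * len(x)
    for _ in range(num_layers):
        h = f_theta(h, x)  # every layer reuses the same parameters
    return h

x = [0.3, -1.2, 0.8]
h4 = weight_tied_forward(x, num_layers=4)
h5 = weight_tied_forward(x, num_layers=5)
# Largest change caused by adding one more weight-tied layer.
gap = max(abs(a - b) for a, b in zip(h4, h5))
```

In this toy setting, adding a fifth layer changes the output only slightly, which hints at why the number of function evaluations can be treated as a tunable quantity rather than a fixed architectural depth.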
2.2 Deep Equilibrium Model
In [13], it was empirically shown that (1) tends to converge to an equilibrium point as the layer index $i$ increases. The equilibrium point $\mathbf{h}^{\star}$ satisfies the following condition:
$$\mathbf{h}^{\star} = f_\theta(\mathbf{h}^{\star}; \mathbf{x}) \tag{2}$$
and the output of DEQ is the equilibrium point itself. DEQ showed further improvement over weight-tied networks on some NLP tasks [13].
The forward pass of DEQ is performed by finding the equilibrium point $\mathbf{h}^{\star}$. Some root-finding solvers for nonlinear equations such as Newton’s method or quasi-Newton methods can be utilized to find the equilibrium point as follows:
$$\mathbf{h}^{[i+1]} = \mathbf{h}^{[i]} - \alpha B \, g_\theta(\mathbf{h}^{[i]}; \mathbf{x}) \tag{3}$$
where $g_\theta(\mathbf{h}; \mathbf{x}) = f_\theta(\mathbf{h}; \mathbf{x}) - \mathbf{h}$, $B$ is the inverse Jacobian of $g_\theta$ (or its low-rank approximation) evaluated at $\mathbf{h}^{[i]}$, which is obtained by a root-finding solver, $\alpha$ is the step size, and $N$ is the NFE when the iteration is stopped. The iteration stops if the norm of $g_\theta(\mathbf{h}^{[i]}; \mathbf{x})$ becomes smaller than a tolerance $\varepsilon$ or the maximum number of function evaluations $N_{\max}$ is reached. Broyden’s method [22], which is a quasi-Newton method, is often used for DEQ as it computes the inverse Jacobian efficiently.
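The solver loop described above can be sketched as follows. This is a small dense version of Broyden's method written for illustration only: the toy map `f_theta`, the initial guess of $-I$ for the inverse Jacobian, and the default values of `tol`, `max_nfe`, and `alpha` are assumptions, and practical DEQ implementations typically use a low-rank variant rather than an explicit matrix.

```python
import math

def f_theta(h, x):
    """Toy contractive map (an assumption); each output mixes two neighbouring entries."""
    n = len(h)
    return [math.tanh(0.3 * h[k] + 0.2 * h[(k + 1) % n] + x[k]) for k in range(n)]

def g_theta(h, x):
    """Residual g(h; x) = f(h; x) - h; its root is the equilibrium point."""
    return [fk - hk for fk, hk in zip(f_theta(h, x), h)]

def norm(v):
    return math.sqrt(sum(vk * vk for vk in v))

def matvec(B, v):
    return [sum(b * a for b, a in zip(row, v)) for row in B]

def broyden_solve(x, tol=1e-8, max_nfe=50, alpha=1.0):
    """Return (h, nfe): the approximate equilibrium and the number of evaluations used."""
    n = len(x)
    h = [0.0] * n
    # B approximates the inverse Jacobian of g; -I is exact when f ignores h.
    B = [[-1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    g = g_theta(h, x)
    nfe = 1
    while norm(g) > tol and nfe < max_nfe:
        step = matvec(B, g)
        h_new = [hk - alpha * sk for hk, sk in zip(h, step)]
        g_new = g_theta(h_new, x)
        nfe += 1
        dh = [a - b for a, b in zip(h_new, h)]
        dg = [a - b for a, b in zip(g_new, g)]
        B_dg = matvec(B, dg)
        dh_B = [sum(dh[r] * B[r][c] for r in range(n)) for c in range(n)]  # dh^T B
        denom = sum(a * b for a, b in zip(dh, B_dg))                        # dh^T B dg
        if abs(denom) > 1e-12:
            # Rank-one update of the inverse Jacobian estimate ("good" Broyden).
            for r in range(n):
                coeff = (dh[r] - B_dg[r]) / denom
                for c in range(n):
                    B[r][c] += coeff * dh_B[c]
        h, g = h_new, g_new
    return h, nfe

x = [0.3, -1.2, 0.8]
h_star, nfe = broyden_solve(x)
residual = norm(g_theta(h_star, x))
```

The two stopping conditions in the `while` loop mirror the tolerance and the maximum number of function evaluations mentioned in the text.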
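In DEQ, the backpropagation is performed on the basis of implicit differentiation [13], which does not need to store the intermediate values of the solver. Let $\ell$ be the loss function for optimizing the parameter $\theta$. The derivative of $\ell$ with respect to $\theta$ can be calculated based on implicit differentiation as follows:
$$\frac{\partial \ell}{\partial \theta} = -\frac{\partial \ell}{\partial \mathbf{h}^{\star}} \left( J_{g_\theta}^{-1} \big|_{\mathbf{h}^{\star}} \right) \frac{\partial f_\theta(\mathbf{h}^{\star}; \mathbf{x})}{\partial \theta} \tag{4}$$
where $J_{g_\theta}^{-1} \big|_{\mathbf{h}^{\star}}$ is the inverse Jacobian of $g_\theta(\mathbf{h}; \mathbf{x}) = f_\theta(\mathbf{h}; \mathbf{x}) - \mathbf{h}$ evaluated at the equilibrium point $\mathbf{h}^{\star}$, which appears in the implicit differentiation theorem. The operation $-\frac{\partial \ell}{\partial \mathbf{h}^{\star}} \left( J_{g_\theta}^{-1} \big|_{\mathbf{h}^{\star}} \right)$ can be performed by solving the equation in terms of $\mathbf{y}$:
$$\left( J_{g_\theta}^{\top} \big|_{\mathbf{h}^{\star}} \right) \mathbf{y}^{\top} + \left( \frac{\partial \ell}{\partial \mathbf{h}^{\star}} \right)^{\top} = \mathbf{0} \tag{5}$$
with solvers such as Broyden’s method.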
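To make the implicit gradient concrete, the sketch below evaluates it for a scalar toy model and compares it with a central finite difference. The model $f_\theta(h; x) = \tanh(\theta h + x)$, the squared-error loss, and all numeric values are assumptions for illustration. In the scalar case the linear system in (5) reduces to a single division.

```python
import math

def solve_equilibrium(theta, x, iters=200):
    """Fixed-point iteration for h = tanh(theta * h + x) (toy scalar model)."""
    h = 0.0
    for _ in range(iters):
        h = math.tanh(theta * h + x)
    return h

def loss(theta, x, target):
    """Squared error between the equilibrium point and a target value."""
    return 0.5 * (solve_equilibrium(theta, x) - target) ** 2

def implicit_grad(theta, x, target):
    """d loss / d theta via implicit differentiation at the equilibrium point."""
    h_star = solve_equilibrium(theta, x)
    sech2 = 1.0 - h_star ** 2      # tanh'(theta*h + x) at equilibrium, since tanh(.) = h*
    dl_dh = h_star - target
    df_dh = theta * sech2
    df_dtheta = h_star * sech2
    jac_g = df_dh - 1.0            # Jacobian of g = f - h (a scalar here)
    return -dl_dh * (1.0 / jac_g) * df_dtheta

theta, x, target, eps = 0.5, 0.3, 1.0, 1e-5
grad_implicit = implicit_grad(theta, x, target)
grad_fd = (loss(theta + eps, x, target) - loss(theta - eps, x, target)) / (2 * eps)
```

Note that `implicit_grad` only needs the equilibrium point itself, not the intermediate iterates of the solver, which is the memory advantage referred to above.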
Jacobian-Free backpropagation (JFB) [15] was proposed as a more efficient technique for gradient calculation in DEQ. This method approximates the inverse Jacobian in (4) with an identity matrix, avoiding the Jacobian-based computation from (5) and reducing computations. In [15] it was shown that the gradient obtained by JFB is valid (i.e., a descent direction for the loss function) under certain conditions. Furthermore, since this method does not impose a condition on how close the solver’s output is to the equilibrium point, it is expected to be robust to estimation errors by the solver.
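The difference between the two gradient computations can be illustrated on a scalar toy model. In the sketch below, the exact implicit gradient involves the inverse of $1 - \partial f_\theta / \partial h$, while the JFB-style gradient replaces that inverse by 1. The model $\tanh(\theta h + x)$, the loss, and the numeric values are assumptions for illustration only and do not reproduce the conditions analyzed in [15].

```python
import math

def solve_equilibrium(theta, x, iters=200):
    """Fixed-point iteration for h = tanh(theta * h + x) (toy scalar model)."""
    h = 0.0
    for _ in range(iters):
        h = math.tanh(theta * h + x)
    return h

def gradients(theta, x, target):
    """Return (exact implicit gradient, JFB-style gradient) of 0.5 * (h* - target)^2."""
    h_star = solve_equilibrium(theta, x)
    sech2 = 1.0 - h_star ** 2
    dl_dh = h_star - target
    df_dh = theta * sech2
    df_dtheta = h_star * sech2
    exact = dl_dh * df_dtheta / (1.0 - df_dh)  # keeps the inverse of (1 - df/dh)
    jfb = dl_dh * df_dtheta                    # inverse replaced by the identity
    return exact, jfb

exact, jfb = gradients(theta=0.5, x=0.3, target=1.0)
same_direction = exact * jfb > 0.0
```

In this toy case the two gradients differ in magnitude but share the same sign, so a step along the negative JFB gradient still decreases the loss.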


2.3 Open-Unmix
Fig. 1 is an example of a general network architecture for DNN-based audio source separation that receives a multi-channel (typically stereo in MSS) spectrogram obtained by a short-time Fourier transform (STFT) and outputs the spectrogram for the target audio source. UMX is a typical example of such an architecture, which shows performance close to state-of-the-art [4].
In the pre-processing block of UMX, the input spectrogram is first cropped to a maximum frequency (typically 16 kHz). Then the cropped spectrogram is fed into frame-wise linear transformations whose parameters are set to normalize the spectrogram to have zero mean and unit standard deviation. The encoder block is composed of a fully-connected (FC) layer with batch normalization (BN). The sequence modeling block is a key part of source separation as it deals with contextual information necessary to separate the mixed sources. Fig. 2 shows in detail the sequence model of UMX. The output of the encoder block is first normalized with a hyperbolic tangent function such that its range is between $-1$ and $1$, then fed into 3 consecutive BLSTM layers. The normalized output of the encoder block and the output of the BLSTM block are concatenated and fed into another FC layer with BN and rectified linear unit (ReLU) activation. The decoder block is composed of an FC layer with BN. The post-processing block is a linear transformation whose parameters are initialized to 1. The output of the post-processing block is used as a multiplicative mask on the input of the pre-processing module. Such an architecture is trained to separate a specific instrument. Finally, the separated signals for each target instrument are refined by applying a multi-channel Wiener filter (MWF) [23, 2] to the output of each network.
3 Proposed method
Although DEQ has already shown excellent performance in some NLP tasks, the characteristics of acoustic signals are different from those of natural language data. Therefore, we devise in this section a suitable architecture and a training scheme for DEQ-based source separation.
3.1 Forward pass
We replace the sequence model of UMX with a DEQ-based sequence model while keeping the other blocks as they are. Fig. 3 gives an overview of the sequence model of the proposed method. In the sequence model, the input sequence normalized by the hyperbolic tangent function is fed into the DEQ block followed by a ReLU as shown in Fig. 3(a). Fig. 3(b) shows our proposed architecture for the function $f_\theta$ of the DEQ block, which is inspired by the original UMX. Specifically, we adopt a concatenation block to merge the output of the BLSTM and the input sequence. We define the size of the hidden sequence $\mathbf{h}^{[i]}$ as equal to that of the input sequence $\mathbf{x}$, since the output of the original UMX sequence model has the same size as its input. Then the $i$-th concatenated sequence is fed into an FC layer such that its dimension is the same as the original input of the DEQ block,
$$\mathbf{u}^{[i]} = \mathrm{FC}\left(\mathrm{Concat}(\mathbf{h}^{[i]}, \mathbf{x})\right) \tag{6}$$
where $\mathbf{u}^{[i]}$ is the $i$-th output sequence of the FC block. Then the output sequence is normalized with a group normalization (GN) block whose number of groups is one and fed into a hyperbolic tangent function followed by a BLSTM block with one layer as follows,
$$\mathbf{h}^{[i+1]} = \mathrm{BLSTM}\left(\tanh\left(\mathrm{GN}(\mathbf{u}^{[i]})\right)\right) \tag{7}$$
The choice of GN is a deviation from the original UMX, which uses BN instead. This modification is inspired by [14] as BN often causes a poor conditioning of the Jacobian matrix. We use Broyden’s method to compute Eq. (3) in the forward pass and obtain the hidden sequence $\mathbf{h}^{[N]}$. Note that the tolerance $\varepsilon$ and the maximum number of function evaluations $N_{\max}$ during inference need to be the same as those used in training.
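The data flow of the function inside the DEQ block (concatenation, FC layer, GN with one group, hyperbolic tangent, bidirectional recurrence) can be sketched in plain Python as below. Several simplifications are assumptions made to keep the example short: a simple bidirectional tanh recurrence is swapped in for the BLSTM, the FC layer has no bias, GN statistics are taken over all frames and features of the sequence, the sizes `T` and `D` and the random weights are arbitrary, and the function is unrolled four times instead of being passed to Broyden's method.

```python
import math
import random

T, D = 5, 4                      # toy sizes: frames and feature dimension (D even)
rng = random.Random(0)

def rand_matrix(rows, cols, scale=0.3):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

W_fc = rand_matrix(D, 2 * D)                                  # FC: 2D -> D
W_in = {d: rand_matrix(D // 2, D) for d in ("fwd", "bwd")}    # input weights per direction
W_rec = {d: rand_matrix(D // 2, D // 2) for d in ("fwd", "bwd")}

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def group_norm_single_group(seq, eps=1e-5):
    """Normalize with one mean and variance over the whole sequence (one group)."""
    flat = [v for frame in seq for v in frame]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return [[(v - mean) / math.sqrt(var + eps) for v in frame] for frame in seq]

def recurrent_pass(seq, direction):
    """Simple tanh recurrence over time; a stand-in for one LSTM direction."""
    state = [0.0] * (D // 2)
    outputs = []
    frames = seq if direction == "fwd" else seq[::-1]
    for frame in frames:
        pre = [a + b for a, b in zip(matvec(W_in[direction], frame),
                                     matvec(W_rec[direction], state))]
        state = [math.tanh(p) for p in pre]
        outputs.append(state)
    return outputs if direction == "fwd" else outputs[::-1]

def f_theta(h, x):
    concat = [h_t + x_t for h_t, x_t in zip(h, x)]            # T x 2D
    u = [matvec(W_fc, frame) for frame in concat]             # T x D
    v = [[math.tanh(a) for a in frame] for frame in group_norm_single_group(u)]
    fwd, bwd = recurrent_pass(v, "fwd"), recurrent_pass(v, "bwd")
    return [f + b for f, b in zip(fwd, bwd)]                  # T x D

x = [[math.tanh(rng.uniform(-2, 2)) for _ in range(D)] for _ in range(T)]
h = [[0.0] * D for _ in range(T)]                             # all-zero initial hidden sequence
for _ in range(4):
    h = f_theta(h, x)
```

Because the hidden sequence keeps the same shape as the input sequence, the output of one evaluation can be fed back directly as the input of the next.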
Aside from the architecture shown in Fig. 3(b), we preliminarily evaluated two different architectures, in which concatenation is replaced with weight-averaging or with masking. We found that the architecture with the concatenation block achieved the best performance, which agrees with the tendency highlighted by [24]. Therefore, we adopt the architecture shown in Fig. 3(b).
To provide further insights into our architecture, an analogy with guided source separation (GSS) [25, 26, 27] may be helpful. In the first iteration of the solver (i.e., $i = 0$), $\mathbf{h}^{[0]}$ (which is an all-zero sequence) is fed into the function $f_\theta$ and concatenated with the input $\mathbf{x}$. This process can be interpreted as if the function attempts to extract the desired sequence without any guide information. In the first few iterations, although the calculated $\mathbf{h}^{[i]}$ is still insufficient for the separation performance, it is used as the updated guide information for the next iteration. The guide information gradually becomes more accurate with each iteration and finally a sufficiently informative hidden sequence can be obtained.
Using BLSTM in the function $f_\theta$ is a key point. If a convolutional neural network (CNN)-based architecture such as TrellisNet [17] were assigned to $f_\theta$, it would require many NFE to obtain a sufficiently large receptive field. In fact, TrellisNet requires hundreds of iterations to converge, and dozens of iterations are performed in practice [17, 13]. In contrast, we can expect that using BLSTM in the function $f_\theta$ enables us to utilize all information of the input sequence even if the NFE is small.
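The receptive-field argument can be made concrete with a rough calculation. The sketch below assumes an undilated 1-D convolution with a hypothetical kernel size of 3 and a hypothetical sequence of 256 frames; neither value is taken from TrellisNet or from this paper, and dilation would change the numbers.

```python
import math

def receptive_field(nfe, kernel_size):
    """Frames seen by one output frame after nfe weight-tied undilated convolutions."""
    return 1 + nfe * (kernel_size - 1)

def nfe_to_cover(num_frames, kernel_size):
    """Smallest NFE whose receptive field is at least num_frames wide."""
    return math.ceil((num_frames - 1) / (kernel_size - 1))

frames = 256                      # hypothetical sequence length
needed = nfe_to_cover(frames, kernel_size=3)
after_four = receptive_field(4, kernel_size=3)
```

Under these assumptions, four evaluations of the convolution see only nine frames, whereas a single evaluation of a bidirectional recurrent layer already depends on every frame of the sequence.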
3.2 Backward pass and training
Although a small NFE contributes to reducing the computational cost, it could be a problem when using implicit differentiation for the backward pass because it is not guaranteed that the obtained hidden sequence satisfies the equilibrium condition sufficiently. Therefore, since JFB is expected to be more robust to the estimation error of the solver, we use JFB for an efficient gradient calculation instead of using implicit differentiation-based backpropagation.
We use the same training setup (i.e., loss function, learning rate scheduler, etc.) as in the original UMX. Following [13], the parameters of DEQ-UMX are pre-trained as a weight-tied network using typical backpropagation for a pre-defined number of epochs to stabilize the training. The NFE of the weight-tied network and the number of epochs for pre-training are determined on the basis of the validation loss.
4 Experiment
Table 1: Number of parameters, MACs for a 6-second input, and SDR of each method.
| Method | # Param. (M) | MACs (G) / 6 s | SDR Vocals [dB] | SDR Drums [dB] | SDR Bass [dB] | SDR Other [dB] | SDR Avg. [dB] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Open-Unmix (UMX) [4] | 35.55 | 9.08 | 6.32 | 5.73 | 5.23 | 4.02 | 5.33 |
| UMX large (4 layers in BLSTM) | 41.85 | 10.69 | 6.41 | 5.94 | 4.87 | 4.21 | 5.36 |
| UMX large (5 layers in BLSTM) | 48.16 | 12.30 | 6.22 | 5.75 | 5.16 | 4.03 | 5.29 |
| UMX small | 25.15 | 6.42 | 6.15 | 5.78 | 4.87 | 4.16 | 5.24 |
| WT-UMX ($L = 4$) | 25.06 | 12.29 | 6.37 | 6.12 | 5.20 | 3.94 | 5.41 |
| DEQ-UMX (with implicit diff.) | 25.06 | 18.74 | 6.32 | 5.94 | 5.05 | 4.07 | 5.34 |
| DEQ-UMX (with JFB, proposed) | 25.06 | 18.74 | 6.60 | 6.17 | 5.14 | 4.20 | 5.53 |
4.1 Experimental settings
We use the MUSDB18 dataset, which consists of 150 professionally recorded songs [28]. For each song, the clean waveforms of vocals, drums, bass, and other, together with their mixture, are available. Each audio track is stereo with a sampling frequency of 44.1 kHz. Following the official split, we used 86 songs for training, 14 songs for validation, and 50 songs for testing. All networks in this experiment were trained on 6-second segments using the Adam [29] optimizer with weight decay. The learning rate was initially set to 0.001 and decreased by 70% if the average of the loss function on the validation set did not improve in 80 consecutive epochs. The training was stopped when the average of the loss function on the validation set did not improve in 300 consecutive epochs. For inference, the full-length signals were processed with the trained networks. MWF was also applied to the full-length signals. Signal-to-distortion ratio (SDR) [30] was used for the evaluation. We compute the SDR by taking the median over all frames of a song and then calculating the median over all songs, which is the standard way of SDR computation on MUSDB18.
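The median-of-medians aggregation described above can be sketched as follows. The song names and frame-wise SDR values are made-up placeholders, and the handling of frames for which SDR is undefined (e.g., silent target frames) is omitted in this simplified version.

```python
from statistics import median

def aggregate_sdr(frame_sdrs_per_song):
    """Median over frames within each song, then median over songs."""
    per_song = [median(frames) for frames in frame_sdrs_per_song.values()]
    return median(per_song)

# Hypothetical frame-wise SDR values in dB for three songs.
scores = {
    "song_a": [4.0, 6.0, 5.0],
    "song_b": [7.0, 8.0, 3.0],
    "song_c": [1.0, 2.0, 9.0],
}
overall = aggregate_sdr(scores)  # per-song medians are 5.0, 7.0, 2.0
```

Using medians at both levels makes the reported score less sensitive to a few frames or songs with extreme SDR values than a mean would be.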
4.2 Preliminary study: WT-UMX

We first investigated the feasibility of DEQ-based source separation using the weight-tied network with the architecture described in Fig. 3, which we refer to as WT-UMX hereafter. Fig. 4 shows the relationship between the NFE of WT-UMX and the average SDR. The average SDR is highest when the NFE, $L$, is set to four, and it gets worse as the NFE increases further. This tendency is similar to previous work in other tasks [17, 13, 16]. Although weight-tied networks and DEQ usually require hundreds of NFE to converge, the models evaluated in practice are trained with dozens of NFE [13]. The relationship between NFE and performance, which clearly shows a maximum around a specific NFE value, is compatible with another approach [16], which uses a closed-form representation. This shows that the best NFE depends on the task and the network architecture. In our case, the model tends to reach the best performance with a relatively small NFE. This implies that using BLSTM as $f_\theta$ helps the network to utilize all the information of the input sequence early in the iterations, as hypothesized in Sec. 3.1. We can conclude that it is possible to apply weight-tying to UMX and that the required NFE is relatively small compared to the experiments conducted in other works.
4.3 Evaluation of the proposed DEQ-UMX
The evaluation results are shown in Table 1. The number of parameters, the number of multiply-and-accumulate operations (MACs), the SDR values for each instrument, and the average SDR values over all the instruments are compared among several variants of UMX, WT-UMX (from the preliminary study), and our proposed DEQ-UMX. The MACs do not include the computations for the STFT and MWF. “UMX large” is a variant of UMX with more layers in the BLSTM block than the original UMX. “UMX small” is a variant with a smaller hidden size (410) than that of UMX (512).
First, we can observe that simply increasing the number of layers of UMX yields almost no improvement in terms of the average SDR. Also, if we reduce the hidden size of UMX, the number of parameters decreases while the average SDR is penalized by 0.09 dB. WT-UMX with $L = 4$ reduces the number of parameters by 30% and improves the average SDR, with a particular improvement of the SDR for drums. The proposed DEQ-UMX achieved the best average SDR. We can also see that JFB improves the SDR values for all instruments with respect to the model trained with implicit differentiation. In particular, the proposed DEQ-UMX improved the average SDR by 0.29 dB with respect to “UMX small”, which has a comparable number of parameters. We also compared the SDR distributions of both methods and observed that the distributions have a similar shape, i.e., there are no extreme outliers for DEQ-UMX but its distribution is shifted to a better SDR score. These results suggest that the equilibrium point obtained by our proposed method allows for a better separation performance than the ones obtained by the weight-tied network with typical backpropagation and the DEQ-UMX trained by backpropagation with implicit differentiation. In our proposed method, $L$ used in pre-training was 4 for all instruments, and the number of epochs for pre-training was 400 for vocals, 0 for drums, and 300 for bass and other. After pre-training, $N_{\max}$ was set to 6 for all instruments. During inference, the NFE always reached $N_{\max}$ in the test set. Although the MACs of DEQ-UMX are higher than those of UMX, the number of parameters of DEQ-UMX is lower than that of UMX by around 30%, which enables the implementation of DEQ-UMX in devices with limited memory.
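The headline figures quoted above follow directly from the values in Table 1. The short calculation below recomputes them; only the table entries are used, and the variable names are chosen for this illustration.

```python
# Values copied from Table 1.
params_m = {"UMX": 35.55, "UMX small": 25.15, "DEQ-UMX": 25.06}
avg_sdr_db = {"UMX": 5.33, "UMX small": 5.24, "DEQ-UMX": 5.53}
macs_g = {"UMX": 9.08, "DEQ-UMX": 18.74}

# Relative parameter reduction of DEQ-UMX with respect to UMX, in percent.
param_reduction_pct = round(100 * (1 - params_m["DEQ-UMX"] / params_m["UMX"]), 1)

# Average SDR differences in dB.
gain_vs_small = round(avg_sdr_db["DEQ-UMX"] - avg_sdr_db["UMX small"], 2)
gain_vs_umx = round(avg_sdr_db["DEQ-UMX"] - avg_sdr_db["UMX"], 2)
loss_small_vs_umx = round(avg_sdr_db["UMX"] - avg_sdr_db["UMX small"], 2)

# Computational cost of DEQ-UMX relative to UMX.
macs_ratio = round(macs_g["DEQ-UMX"] / macs_g["UMX"], 2)
```

The parameter reduction comes out at 29.5%, which the text rounds to 30%, while the MACs roughly double, which quantifies the trade-off between memory footprint and computation.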
5 Conclusion
We proposed a DEQ-based music source separation algorithm called DEQ-UMX. Inspired by the architecture of UMX, we replaced its BLSTM layers with a DEQ-based BLSTM layer. The DEQ-UMX trained with JFB achieved higher average SDR than the original UMX on the MSS task, while reducing the number of parameters. In future work, we plan to apply the DEQ framework to other state-of-the-art architectures and show its potential for generalization. We also plan to explore more efficient root-finding schemes to reduce the current computational complexity.
References
- [1] S. Uhlich, F. Giron, and Y. Mitsufuji, “Deep neural network based instrument extraction from music,” in Proc. of IEEE ICASSP, 2015, pp. 2135–2139.
- [2] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, “Improving music source separation based on deep neural networks through data augmentation and network blending,” in Proc. of IEEE ICASSP, 2017, pp. 261–265.
- [3] N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 21–25.
- [4] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, “Open-unmix - a reference implementation for music source separation,” Journal of Open Source Software, vol. 4, no. 41, p. 1667, 2019.
- [5] A. Défossez, N. Usunier, L. Bottou, and F. Bach, “Music source separation in the waveform domain,” arXiv preprint arXiv:1911.13254, 2019.
- [6] W. Choi, M. Kim, J. Chung, D. Lee, and S. Jung, “Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation,” in Proc. of 21st International Society for Music Information Retrieval Conference (ISMIR), 2020, pp. 192–198.
- [7] N. Takahashi and Y. Mitsufuji, “Densely connected multi-dilated convolutional networks for dense prediction tasks,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 993–1002.
- [8] R. Sawata, S. Uhlich, S. Takahashi, and Y. Mitsufuji, “All for one and one for all: Improving music separation by bridging networks,” in Proc. of IEEE ICASSP, 2021, pp. 51–55.
- [9] Y. Mitsufuji, G. Fabbro, S. Uhlich, F.-R. Stöter, A. Défossez, M. Kim, W. Choi, C.-Y. Yu, and K.-W. Cheuk, “Music demixing challenge 2021,” Frontiers in Signal Processing, vol. 1, 2022. [Online]. Available: https://www.frontiersin.org/article/10.3389/frsip.2021.808395
- [10] Q. Kong, Y. Cao, H. Liu, K. Choi, and Y. Wang, “Decoupling magnitude and phase estimation with deep resunet for music source separation,” arXiv preprint arXiv:2109.05418, 2021.
- [11] B. Amos and J. Z. Kolter, “OptNet: Differentiable optimization as a layer in neural networks,” in Proc. of International Conference on Machine Learning (ICML), 2017, pp. 136–145.
- [12] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, “Neural ordinary differential equations,” in Proc. of Neural Information Processing Systems (NeurIPS), 2018, pp. 6572–6583.
- [13] S. Bai, J. Z. Kolter, and V. Koltun, “Deep equilibrium models,” in Proc. of Neural Information Processing Systems (NeurIPS), 2019, pp. 690–701.
- [14] S. Bai, V. Koltun, and J. Z. Kolter, “Multiscale deep equilibrium models,” in Proc. of Neural Information Processing Systems (NeurIPS), 2020, pp. 5238–5250.
- [15] S. W. Fung, H. Heaton, Q. Li, D. McKenzie, S. Osher, and W. Yin, “Fixed point networks: Implicit depth models with Jacobian-free backprop,” arXiv preprint arXiv:2103.12803, 2021.
- [16] Z. Wang and A. Ragni, “Approximate fixed-points in recurrent neural networks,” arXiv preprint arXiv:2106.02417, 2021.
- [17] S. Bai, J. Z. Kolter, and V. Koltun, “Trellis networks for sequence modeling,” in Proc. of International Conference on Learning Representations (ICLR), 2019.
- [18] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, “Real-time speech enhancement using equilibriated RNN,” in Proc. of IEEE ICASSP, 2020, pp. 851–855.
- [19] A. Kag, Z. Zhang, and V. Saligrama, “RNNs evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients?” arXiv preprint arXiv:1908.08574, 2019.
- [20] M. Pariente, S. Cornell, J. Cosentino, S. Sivasankaran, E. Tzinis, J. Heitkaemper, M. Olvera, F.-R. Stöter, M. Hu, J. M. Martín-Doñas, D. Ditter, A. Frank, A. Deleforge, and E. Vincent, “Asteroid: the PyTorch-based audio source separation toolkit for researchers,” in Proc. of INTERSPEECH, 2020, pp. 2637–2641.
- [21] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
- [22] C. G. Broyden, “A class of methods for solving nonlinear simultaneous equations,” Mathematics of computation, vol. 19, no. 92, pp. 577–593, 1965.
- [23] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel music separation with deep neural networks,” in Proc. of 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1748–1752.
- [24] S. Braun, H. Gamper, C. K. Reddy, and I. Tashev, “Towards efficient models for real-time deep noise suppression,” in Proc. of IEEE ICASSP, 2021, pp. 656–660.
- [25] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, and R. Haeb-Umbach, “Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party scenario,” in Proc. of INTERSPEECH, 2019, pp. 1248–1252.
- [26] K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Černocký, “SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, 2019.
- [27] C. Xu, W. Rao, E. S. Chng, and H. Li, “SpEx: Multi-scale time domain speaker extraction network,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1370–1384, 2020.
- [28] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “MUSDB18 - a corpus for music separation,” 2017.
- [29] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. of International Conference on Learning Representations (ICLR), 2015, pp. 1–15.
- [30] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.