
Source Separation with Deep Generative Priors

Vivek Jayaram    John Thickstun
Abstract

Despite substantial progress in signal source separation, results for richly structured data continue to contain perceptible artifacts. In contrast, recent deep generative models can produce authentic samples in a variety of domains that are indistinguishable from samples of the data distribution. This paper introduces a Bayesian approach to source separation that uses generative models as priors over the components of a mixture of sources, and noise-annealed Langevin dynamics to sample from the posterior distribution of sources given a mixture. This decouples the source separation problem from generative modeling, enabling us to directly use cutting-edge generative models as priors. The method achieves state-of-the-art performance for MNIST digit separation. We introduce new methodology for evaluating separation quality on richer datasets, providing quantitative evaluation of separation results on CIFAR-10. We also provide qualitative results on LSUN.


1 Introduction

The single-channel source separation problem (Davies & James, 2007) asks us to decompose a mixed signal $\mathbf{m} \in \mathcal{X}$ into a linear combination of $k$ components $\mathbf{x}_1, \dots, \mathbf{x}_k \in \mathcal{X}$ with scalar mixing coefficients $\alpha_i \in \mathbb{R}$:

$\mathbf{m} = g(\mathbf{x}) \equiv \sum_{i=1}^{k} \alpha_i \mathbf{x}_i.$ (1)

This is motivated by, for example, the “cocktail party problem” of isolating the utterances of individual speakers $\mathbf{x}_i$ from an audio mixture $\mathbf{m}$ captured at a busy party, where multiple speakers are talking simultaneously.

With no further constraints or regularization, solving Equation (1) for $\mathbf{x}$ is highly underdetermined. Classical “blind” approaches to single-channel source separation resolve this ambiguity by privileging solutions to (1) that satisfy mathematical constraints on the components $\mathbf{x}$, such as statistical independence (Davies & James, 2007), sparsity (Lee et al., 1999), or non-negativity (Lee & Seung, 1999). These constraints can be viewed as weak priors on the structure of sources, but the approaches are blind in the sense that they do not require adaptation to a particular dataset.

Recently, most work has taken a data-driven approach. To separate a mixture of sources, it is natural to suppose that we have access to samples $\mathbf{x}$ of individual sources, which can be used as a reference for what the source components of a mixture are supposed to look like. This data can be used to regularize solutions of Equation (1) towards structurally plausible solutions. The prevailing way to do this is to construct a supervised regression model that maps an input mixture $\mathbf{m}$ to components $\mathbf{x}_i$ (Huang et al., 2014; Halperin et al., 2019). Paired training data $(\mathbf{m}, \mathbf{x})$ can be constructed by summing randomly chosen component samples $\mathbf{x}_i$ and labeling these mixtures with the ground truth components.
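To make the supervised setup concrete, the following sketch constructs paired training data by summing randomly chosen component samples. The function and variable names are illustrative, not taken from any published codebase.

```python
# A minimal sketch of constructing paired training data (m, x) for a
# supervised separation model. `sources` is assumed to be a tensor of
# individual-source images of shape [N, C, H, W].
import torch

def make_mixture_pairs(sources: torch.Tensor, batch_size: int, k: int = 2):
    """Sample k components per mixture and sum them with equal weights."""
    n = sources.shape[0]
    idx = torch.randint(n, (batch_size, k))          # random component indices
    x = sources[idx]                                 # [batch, k, C, H, W]
    alpha = torch.full((k,), 1.0 / k)                # equal mixing coefficients
    m = (alpha.view(1, k, 1, 1, 1) * x).sum(dim=1)   # mixtures, [batch, C, H, W]
    return m, x                                      # mixture and ground-truth components

# Example: mixtures of two 32x32 RGB images from a toy dataset.
sources = torch.rand(100, 3, 32, 32)
m, x = make_mixture_pairs(sources, batch_size=8)
```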

Instead of regressing against components $\mathbf{x}$, we use samples to train a generative prior $p(\mathbf{x})$; we separate a mixed signal $\mathbf{m}$ by sampling from the posterior distribution $p(\mathbf{x}|\mathbf{m})$. For some mixtures this posterior is quite peaked, and sampling from $p(\mathbf{x}|\mathbf{m})$ recovers the only plausible separation of $\mathbf{m}$ into likely components. But many mixtures are highly ambiguous: see, for example, the orange-highlighted MNIST images in Figure 1. This motivates our interest in sampling, which explores the space of plausible separations. In Section 3 we introduce a procedure for sampling from the posterior, an extension of the noise-annealed Langevin dynamics introduced by Song & Ermon (2019), which we call Bayesian Annealed SIgnal Source separation: “BASIS” separation.

Refer to caption
Figure 1: Separation results for mixtures of four images from the MNIST dataset (Left) and two images from the CIFAR-10 dataset (Right), using BASIS with the NCSN (Song & Ermon, 2019) generative model as a prior over images. We draw attention to the central panel of the MNIST results (highlighted in orange), which shows how a mixture can be separated in multiple ways.

Ambiguous mixtures pose a challenge for traditional source separation metrics, which presume that the original mixture components are identifiable and compare the separated components to ground truth. For ambiguous mixtures of rich data, we argue that recovery of the original mixture components is not a well-posed problem. Instead, the problem we aim to solve is finding components of a mixture that are consistent with a particular data distribution. Motivated by this perspective, we discuss evaluation metrics in Section 4.

Formulating the source separation problem in a Bayesian framework decouples the problem of source generation from source separation. This allows us to leverage pre-trained, state-of-the-art, likelihood-based generative models as prior distributions, without requiring architectural modifications to adapt these models for source separation. Examples of source separation using noise-conditioned score networks (NCSN) (Song & Ermon, 2019) as a prior are presented in Figure 1. Further separation results using NCSN and Glow (Kingma & Dhariwal, 2018) are presented in Section 5.

2 Related Work

Blind separation. Work on blind source separation is data-agnostic, relying on generic mathematical properties to privilege particular solutions to (1) (Comon, 1994; Bell & Sejnowski, 1995; Davies & James, 2007; Huang et al., 2012). Because blind methods have no access to sample components, they face the challenging task of modeling the distribution over unobserved components while simultaneously decomposing mixtures into likely components. It is difficult to fit a rich model to latent components, so blind methods often rely on simple models such as dictionaries to capture the structure of these components.

One promising recent work in the blind setting is Double-DIP (Gandelsman et al., 2019). This work leverages the unsupervised Deep Image Prior (Ulyanov et al., 2018) as a prior over signal components, similar to our use of a trained generative model. But the authors of this work document fundamental obstructions to applying their method to single-channel source separation; they propose using multiple image frames from a video, or multiple mixtures of the same components with different mixing coefficients $\alpha$. This multiple-mixture approach is common to much of the work on blind separation. In contrast, our approach is able to separate components from a single mixture.

Supervised regression. Regression models for source separation learn to predict components for a mixture using a dataset of mixed signals labeled with ground truth components. This approach has been extensively studied for separation of images (Halperin et al., 2019), audio spectrograms (Huang et al., 2014, 2015; Nugraha et al., 2016; Jansson et al., 2017), and raw audio (Lluis et al., 2019; Stoller et al., 2018b; Défossez et al., 2019), as well as more exotic data domains, e.g. medical imaging (Nishida et al., 1999). By learning to predict components (or equivalently, masks on a mixture) this approach implicitly builds a generative model of the signal components. This connection is made more explicit in recent work that uses GANs to force components emitted by a regression model to match the distribution of a given dataset (Zhang et al., 2018; Stoller et al., 2018a).

The supervised approach takes advantage of expressive deep models to capture a strong prior over signal components. But it requires specialized model architectures trained specifically for the source separation task. In contrast, our approach leverages standard, pre-trained generative models for source separation. Furthermore, our approach can directly exploit ongoing advances in likelihood-based generative modeling to improve separation results.

Signal Dictionaries. Much work on source separation is based on the concept of a signal dictionary, most notably the line of work based on non-negative matrix factorization (NMF) (Lee & Seung, 2001). These approaches model signals as combinations of elements in a latent dictionary. Decomposing a mixture into dictionary elements can be used for source separation by (1) clustering the elements of the dictionary and (2) reconstituting a source using elements of the decomposition associated with a particular cluster.

Dictionaries are typically learned from data of each source type and combined into a joint dictionary, clustered by source type (Schmidt & Olsson, 2006; Virtanen, 2007). The blind setting has also been explored, where the clustering is obtained without labels by e.g. k-means (Spiertz & Gnann, 2009). Recent work explores more expressive decomposition models, replacing the linear decompositions used in NMF with expressive neural autoencoders (Smaragdis & Venkataramani, 2017; Venkataramani et al., 2017).

When the dictionary is learned with supervision from labeled sources, dictionary clusters can be interpreted as implicit priors on the distributions over components. Our approach makes these priors explicit and works with generic priors that are not tied to the dictionary model. Furthermore, our method can separate mixed sources of the same type, whereas mixtures of sources with similar structure present a conceptual difficulty for dictionary-based methods.

Generative adversarial separation. Recent work by Subakan & Smaragdis (2018) and Kong et al. (2019) explores the intriguing possibility of optimizing $\mathbf{x}$ given a mixture $\mathbf{m}$ to satisfy (1), where components $\mathbf{x}_i$ are constrained to the manifold learned by a GAN. The GAN is pre-trained to model a distribution over components. Like our method, this approach leverages modern deep generative models in a way that decouples generation from source separation. We view this work as a natural analog to our likelihood-based approach in the GAN setting.

Likelihood-based approaches. Our approach is similar in spirit to older ideas based on maximum a posteriori estimation (Geman & Geman, 1984), likelihood maximization (Pearlmutter & Parra, 1997; Roweis, 2001), and Bayesian source separation (Benaroya et al., 2005). We build upon their insights, with the advantage of increased computational resources and modern expressive generative models.

3 BASIS Separation

We consider the following generative model of a mixed signal $\mathbf{m}$, relaxing the mixture constraint $g(\mathbf{x}) = \mathbf{m}$ to a soft Gaussian approximation:

$\mathbf{x} \sim p,$ (2)
$\mathbf{m} \sim \mathcal{N}\left(g(\mathbf{x}), \gamma^2 I\right).$ (3)

This defines a joint distribution $p_\gamma(\mathbf{x}, \mathbf{m}) = p(\mathbf{x})\, p_\gamma(\mathbf{m}|\mathbf{x})$ over signal components $\mathbf{x}$ and mixtures $\mathbf{m}$, and a corresponding posterior distribution

$p_\gamma(\mathbf{x}|\mathbf{m}) = p(\mathbf{x})\, p_\gamma(\mathbf{m}|\mathbf{x}) / p_\gamma(\mathbf{m}).$ (4)

In the limit as $\gamma^2 \to 0$, we recover the hard constraint on the mixture $\mathbf{m}$ given by Equation (1).

BASIS separation (Algorithm 1) presents an approach to sampling from (4) based on the discussion in Sections 3.1 and 3.2. In Section 3.3 we discuss the behavior of the gradients $\nabla_\mathbf{x} \log p(\mathbf{x})$, which motivates some of the hyper-parameter choices in Section 3.4. We describe a procedure to construct the noisy models $p_{\sigma_i}$ required for BASIS in Section 3.5.

3.1 Langevin dynamics

Algorithm 1 BASIS Separation
  Input: $\mathbf{m} \in \mathcal{X}$, $\{\sigma_i\}_{i=1}^L$, $\delta$, $T$
  Sample $\mathbf{x}_1, \dots, \mathbf{x}_k \sim \text{Uniform}(\mathcal{X})$
  for $i \leftarrow 1$ to $L$ do
     $\eta_i \leftarrow \delta \cdot \sigma_i^2 / \sigma_L^2$
     for $t \leftarrow 1$ to $T$ do
        Sample $\varepsilon_t \sim \mathcal{N}(0, I)$
        $\mathbf{u}^{(t)} \leftarrow \mathbf{x}^{(t)} + \eta_i \nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x}^{(t)}) + \sqrt{2\eta_i}\, \varepsilon_t$
        $\mathbf{x}^{(t+1)} \leftarrow \mathbf{u}^{(t)} + \frac{\eta_i}{\sigma_i^2} \operatorname{Diag}(\alpha)\left(\mathbf{m} - g(\mathbf{x}^{(t)})\right)$
     end for
  end for

Sampling from the posterior distribution $p_\gamma(\mathbf{x}|\mathbf{m})$ looks formidable; just computing Equation (4) requires evaluation of the partition function $p_\gamma(\mathbf{m})$. But using Langevin dynamics (Neal et al., 2011; Welling & Teh, 2011) we can sample $\mathbf{x} \sim p_\gamma(\cdot|\mathbf{m})$ while avoiding explicit computation of $p_\gamma(\mathbf{x}|\mathbf{m})$. Let $\mathbf{x}^{(0)} \sim \text{Uniform}(\mathcal{X})$, $\varepsilon_t \sim \mathcal{N}(0, I)$, and define a sequence

$\mathbf{x}^{(t+1)} \equiv \mathbf{x}^{(t)} + \eta \nabla_\mathbf{x} \log p_\gamma(\mathbf{x}^{(t)}|\mathbf{m}) + \sqrt{2\eta}\, \varepsilon_t$ (5)
$\quad = \mathbf{x}^{(t)} + \eta \nabla_\mathbf{x} \left( \log p(\mathbf{x}^{(t)}) - \tfrac{1}{2\gamma^2} \|\mathbf{m} - g(\mathbf{x}^{(t)})\|^2 \right) + \sqrt{2\eta}\, \varepsilon_t.$

Observe that $\nabla_\mathbf{x} \log p_\gamma(\mathbf{m}) = 0$, so this term is not required to compute (5). By standard analysis of Langevin dynamics, as the step size $\eta \to 0$, $\lim_{t\to\infty} D_{KL}\left(\mathbf{x}^{(t)} \,\|\, \mathbf{x}|\mathbf{m}\right) = 0$, under regularity conditions on the distribution $p_\gamma(\mathbf{x}|\mathbf{m})$.

If the prior $p(\mathbf{x})$ is parameterized by a neural model, then gradients $\nabla_\mathbf{x} \log p(\mathbf{x})$ can be computed by automatic differentiation with respect to the inputs of the generator network. This family of likelihood-based models includes autoregressive models (Salimans et al., 2017; Parmar et al., 2018), variational autoencoders (Kingma & Welling, 2014; van den Oord et al., 2017), and flow-based models (Dinh et al., 2017; Kingma & Dhariwal, 2018). Alternatively, if gradients of the distribution are modeled (Song & Ermon, 2019), then $\nabla_\mathbf{x} \log p(\mathbf{x})$ can be used directly.
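As a minimal illustration, the sketch below computes $\nabla_\mathbf{x} \log p(\mathbf{x})$ by automatic differentiation for an arbitrary differentiable log-density. The stand-in Gaussian prior is purely illustrative; for a model like Glow one would substitute the network's log-likelihood.

```python
# A sketch of computing grad_x log p(x) via autograd, assuming `log_prob`
# is any differentiable log-density mapping [B, ...] inputs to [B] outputs.
import torch

def grad_log_p(x: torch.Tensor, log_prob) -> torch.Tensor:
    x = x.detach().requires_grad_(True)
    logp = log_prob(x).sum()           # sum over the batch; gradients stay per-example
    (grad,) = torch.autograd.grad(logp, x)
    return grad

# Stand-in prior: standard Gaussian, log p(x) = -||x||^2 / 2 + const.
log_prob = lambda x: -0.5 * (x ** 2).flatten(1).sum(dim=1)
x = torch.randn(4, 3, 32, 32)
score = grad_log_p(x, log_prob)        # equals -x for this toy prior
```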

3.2 Accelerated mixing

To accelerate mixing of (5) we adopt a simulated annealing schedule over noisy approximations to the model $p(\mathbf{x})$, extending the unconditional sampling algorithm proposed by Song & Ermon (2019) to sampling from the posterior distribution $p_\gamma(\mathbf{x}|\mathbf{m})$. Let $p_\sigma(\mathbf{x})$ denote the distribution of $\mathbf{x} + \epsilon_\sigma$ for $\mathbf{x} \sim p$ and $\epsilon_\sigma \sim \mathcal{N}(0, \sigma^2 I)$. We define the noisy joint likelihood $p_{\sigma,\gamma}(\mathbf{x}, \mathbf{m}) \equiv p_\sigma(\mathbf{x})\, p_\gamma(\mathbf{m}|\mathbf{x})$, which induces a noisy posterior approximation $p_{\sigma,\gamma}(\mathbf{x}|\mathbf{m})$. At high noise levels $\sigma$, $p_\sigma(\mathbf{x})$ is approximately Gaussian and irreducible, so the Langevin dynamics (5) will mix quickly. And as $\sigma \to 0$, $D_{KL}(p_\sigma \,\|\, p) \to 0$. This motivates defining the modified Langevin dynamics

$\mathbf{x}^{(t+1)} \equiv \mathbf{x}^{(t)} + \eta \nabla_\mathbf{x} \log p_{\sigma,\gamma}(\mathbf{x}^{(t)}|\mathbf{m}) + \sqrt{2\eta}\, \varepsilon_t.$ (6)

The dynamics (6) approximate samples from $p(\mathbf{x} \,|\, g(\mathbf{x}) = \mathbf{m})$ as $\eta \to 0$, $\gamma^2 \to 0$, $\sigma^2 \to 0$, and $t \to \infty$. An implementation of these dynamics, annealing $\eta$, $\gamma^2$, and $\sigma^2$ as $t \to \infty$ according to the hyper-parameter settings presented in Section 3.4, is given in Algorithm 1.
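For concreteness, here is a minimal sketch of Algorithm 1 in PyTorch-style Python. It assumes a function score(x, sigma) that returns $\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})$ (e.g. an NCSN score network, or the autograd gradient of a fine-tuned Glow model); this is our own illustrative reimplementation, not the authors' reference code.

```python
# A minimal sketch of Algorithm 1 (BASIS). All names are illustrative.
import torch

def basis_separate(m, k, score, sigmas, delta=2e-5, T=100, alpha=None):
    if alpha is None:
        alpha = torch.ones(k)                     # mixing coefficients alpha_i
    x = torch.rand(k, *m.shape)                   # x_i ~ Uniform(X)
    for sigma in sigmas:                          # annealed noise levels sigma_1 > ... > sigma_L
        eta = delta * sigma ** 2 / sigmas[-1] ** 2
        for _ in range(T):
            eps = torch.randn_like(x)
            # Langevin step under the noisy prior p_sigma ...
            u = x + eta * score(x, sigma) + (2 * eta) ** 0.5 * eps
            # ... plus the gradient of the Gaussian mixture likelihood (gamma^2 = sigma^2)
            residual = m - (alpha.view(-1, 1, 1, 1) * x).sum(dim=0)
            x = u + (eta / sigma ** 2) * alpha.view(-1, 1, 1, 1) * residual
    return x

# Example with a toy Gaussian score, grad log p_sigma(x) = -x / (1 + sigma^2):
score = lambda x, s: -x / (1 + s ** 2)
sigmas = torch.logspace(0, -2, 10)                # geometric schedule from 1.0 to 0.01
m = torch.rand(3, 32, 32)
components = basis_separate(m, k=2, score=score, sigmas=sigmas)
```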

We anneal $\eta$, $\gamma^2$, and $\sigma^2$ using a heuristic introduced by Song & Ermon (2019): the idea is to maintain a constant signal-to-noise ratio (SNR) between the expected size of the posterior log-likelihood gradient term $\eta \nabla_\mathbf{x} \log p_{\sigma,\gamma}(\mathbf{x}|\mathbf{m})$ and the expected size of the Langevin noise $\sqrt{2\eta}\, \varepsilon$:

$\mathbb{E}_{\mathbf{x} \sim p_\sigma}\left[\left\| \frac{\eta \nabla_\mathbf{x} \log p_{\sigma,\gamma}(\mathbf{x}|\mathbf{m})}{\sqrt{2\eta}} \right\|^2\right] = \frac{\eta}{4}\, \mathbb{E}_{\mathbf{x} \sim p_\sigma}\left[\left\| \nabla_\mathbf{x} \log p_\gamma(\mathbf{m}|\mathbf{x}) + \nabla_\mathbf{x} \log p_\sigma(\mathbf{x}) \right\|^2\right].$ (7)

Assuming that gradients with respect to the likelihood and the prior are uncorrelated, the SNR is approximately

$\frac{\eta}{4}\, \mathbb{E}_{\mathbf{x} \sim p_\sigma}\left[\left\| \nabla_\mathbf{x} \log p_\gamma(\mathbf{m}|\mathbf{x}) \right\|^2\right] + \frac{\eta}{4}\, \mathbb{E}_{\mathbf{x} \sim p_\sigma}\left[\left\| \nabla_\mathbf{x} \log p_\sigma(\mathbf{x}) \right\|^2\right].$ (8)

Observe that $\log p_\gamma(\mathbf{m}|\mathbf{x})$ is a concave quadratic with smoothness proportional to $1/\gamma^2$; it follows analytically that $\mathbb{E}\left[\|\nabla_\mathbf{x} \log p_\gamma(\mathbf{m}|\mathbf{x})\|^2\right] \propto 1/\gamma^2$. Song & Ermon (2019) found empirically that $\mathbb{E}\|\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})\|^2 \propto 1/\sigma^2$ for the NCSN model; we observe similar behavior for the flow-based Glow model (Kingma & Dhariwal, 2018), and in Section 3.3 we propose a possible explanation for this behavior. Therefore, to maintain a constant SNR, it suffices to set both $\gamma^2$ and $\sigma^2$ proportional to $\eta$.

3.3 The gradients of the noisy prior

We remark that the empirical finding $\mathbb{E}\|\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})\|^2 \propto 1/\sigma^2$ discussed in Section 3.2, and the consistency of this observation across models and datasets, could be surprising. Gradients of the noisy densities $p_\sigma$ can be described by convolution of $p$ with a Gaussian kernel:

$\nabla_\mathbf{x} \log p_\sigma(\mathbf{x}) = \nabla_\mathbf{x} \log \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[p(\mathbf{x} - \sigma\epsilon)\right].$ (9)

From this expression, assuming $p$ is continuous, we clearly see that the gradients are asymptotically independent of $\sigma$:

$\lim_{\sigma \to 0} \nabla_\mathbf{x} \log p_\sigma(\mathbf{x}) = \nabla_\mathbf{x} \log p(\mathbf{x}).$ (10)

Maintaining the proportionality $\mathbb{E}\|\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})\|^2 \propto 1/\sigma^2$ requires the gradients to grow unbounded as $\sigma \to 0$, but the gradients of the noiseless distribution $\log p(\mathbf{x})$ are finite. Therefore, the proportionality must break down asymptotically, and we conclude that, even though we turn the noise $\sigma^2$ down to visually imperceptible levels, we have not reached the asymptotic regime.

Refer to caption
Figure 2: The behavior of $\sigma \times \|\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})\|$ in expectation for the NCSN (orange) and Glow (blue) models trained on CIFAR-10 at each of 10 noise levels, as $\sigma$ decays geometrically from $1.0$ to $0.01$. For large $\sigma$, $\|\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})\| \approx 50/\sigma$. This proportional relationship breaks down for smaller $\sigma$. Because the expected gradient of the noiseless density $\log p(\mathbf{x})$ is finite, its product with $\sigma$ must asymptotically approach zero as $\sigma \to 0$.
Refer to caption
Figure 3: Non-stochastic gradient ascent produces sub-par results. Annealing over smoothed-out distributions (Noise Conditioning) guides the optimization towards likely regions of pixel space, but gets stuck at sub-optimal solutions. Adding Gaussian noise to the gradients (Langevin dynamics) shakes the optimization trajectory out of bad local optima.

We conjecture that the proportionality between the gradients and the noise is a consequence of severe non-smoothness in the noiseless model $p(\mathbf{x})$. The probability mass of this distribution is peaked around plausible images $\mathbf{x}$, and decays rapidly away from these points in most directions. Consider the extreme case where the prior contains a Dirac delta point mass. The convolution of a Dirac delta with a Gaussian is itself Gaussian, so near the point mass the noisy distribution $p_\sigma$ will be proportional to a Gaussian density with variance $\sigma^2$. If $p_\sigma$ were exactly Gaussian then, analytically,

$\mathbb{E}_{\mathbf{x} \sim p_\sigma}\left[\|\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})\|^2\right] = \frac{1}{\sigma^4}\, \mathbb{E}_{\mathbf{x} \sim p_\sigma}\left[\mathbf{x}^2\right] = \frac{1}{\sigma^2}.$ (11)

Because the distribution $p(\mathbf{x})$ does not contain actual delta spikes, only approximations thereof, we would expect this proportionality to eventually break down as $\sigma \to 0$. Indeed, Figure 2 shows that for both the NCSN and Glow models of CIFAR-10, after maintaining a very consistent proportionality $\mathbb{E}\left[\|\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})\|^2\right] \propto 1/\sigma^2$ at the higher noise levels, the decay of $\sigma^2$ to zero eventually outpaces the growth of the gradients.
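The Gaussian calculation in Equation (11) is easy to verify numerically. The toy check below uses a point-mass prior at the origin, for which $p_\sigma = \mathcal{N}(0, \sigma^2 I)$ and $\nabla_\mathbf{x} \log p_\sigma(\mathbf{x}) = -\mathbf{x}/\sigma^2$, so $\mathbb{E}\|\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})\|^2 = d/\sigma^2$ (with the dimension factor $d$ that Equation (11) elides).

```python
# A toy check of the 1/sigma^2 scaling for a Gaussian p_sigma = N(0, sigma^2 I).
import torch

d = 3 * 32 * 32                                  # dimension of a CIFAR-10 image
for sigma in [1.0, 0.3, 0.1, 0.03]:
    x = sigma * torch.randn(10000, d)            # x ~ p_sigma
    grad = -x / sigma ** 2                       # exact score of the Gaussian
    ratio = (grad ** 2).sum(dim=1).mean() * sigma ** 2 / d
    print(sigma, ratio.item())                   # prints approximately 1.0 for every sigma
```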

3.4 Hyper-parameter settings

We adopt the hyper-parameters proposed by Song & Ermon (2019) for annealing $\sigma^2$, the proportionality constant $\delta$, and the iteration count $T$. The noise $\sigma$ is geometrically annealed from $\sigma_1 = 1.0$ to $\sigma_L = 0.01$ with $L = 10$. We set $\delta = 2 \times 10^{-5}$ and $T = 100$. We find that the same proportionality constant between $\sigma^2$ and $\eta$ also works well for $\gamma^2$ and $\eta$, allowing us to set $\gamma^2 = \sigma^2$. We use these hyper-parameters for both the NCSN and Glow models, applied to each of the three datasets MNIST, CIFAR-10, and LSUN.
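Written out explicitly, this schedule amounts to a few lines (a sketch; all values are taken from this section):

```python
# The annealing schedule of Section 3.4: ten noise levels decaying
# geometrically from 1.0 to 0.01, step size eta_i = delta * sigma_i^2 / sigma_L^2,
# and gamma^2 tracking sigma^2 throughout annealing.
import numpy as np

L, delta, T = 10, 2e-5, 100
sigmas = np.geomspace(1.0, 0.01, L)              # sigma_1, ..., sigma_L
etas = delta * sigmas ** 2 / sigmas[-1] ** 2     # per-level Langevin step sizes
gammas_sq = sigmas ** 2                          # gamma^2 = sigma^2
```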

3.5 Constructing noise-conditioned models

For noise-conditioned score networks, we can directly compute $\nabla_\mathbf{x} \log p_\sigma(\mathbf{x})$ by evaluating the score network at the desired noise level. For generative flow models like Glow, these noisy distributions are not directly accessible. We could estimate the distributions $p_\sigma(\mathbf{x})$ by training Glow from scratch on datasets perturbed by each of the required noise levels $\sigma^2$. But this is not practical; Glow is expensive to train, requiring thousands of epochs to converge and consuming hundreds of GPU-hours to obtain good models even for small, low-resolution datasets.

Instead of training models $p_\sigma(\mathbf{x})$ from scratch, we apply the concept of fine-tuning from transfer learning (Yosinski et al., 2014). Using pre-trained models of $p(\mathbf{x})$ published by the Glow authors, we fine-tune these models on noise-perturbed data $\mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Empirically, this procedure quickly converges to an estimate of $p_\sigma(\mathbf{x})$, within about 10 epochs.
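A sketch of this fine-tuning loop is shown below. It assumes a pre-trained flow with a differentiable log_prob method and a data loader yielding clean images; both names are placeholders, not the Glow codebase's API.

```python
# A sketch of the fine-tuning procedure of Section 3.5: maximize the
# likelihood of noise-perturbed data under a pre-trained flow model.
import torch

def finetune_noisy(model, loader, sigma, epochs=10, lr=1e-5):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x in loader:
            x_noisy = x + sigma * torch.randn_like(x)   # x + eps, eps ~ N(0, sigma^2 I)
            loss = -model.log_prob(x_noisy).mean()      # negative log-likelihood
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model                                        # an estimate of p_sigma
```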

3.6 The importance of stochasticity

We remark that adding Gaussian noise to the gradients in the BASIS algorithm is essential. If we set aside the Bayesian perspective, it is tempting to simply run gradient ascent on the pixels of the components to maximize the likelihood of these components under the prior, with a Lagrangian term to enforce the mixture constraint $g(\mathbf{x}) = \mathbf{m}$:

$\mathbf{x} \leftarrow \mathbf{x} + \eta \nabla_\mathbf{x} \left[\log p(\mathbf{x}) - \lambda \|g(\mathbf{x}) - \mathbf{m}\|^2\right].$ (12)

But this does not work. As demonstrated in Figure 3, there are many local optima in the loss surface of $p(\mathbf{x})$, and a greedy ascent procedure simply gets stuck. Pragmatically, the noise term in Langevin dynamics can be seen as a way to knock the greedy optimization (12) out of local maxima.

In the recent literature, pixel-space optimizations following gradients $\nabla_\mathbf{x}$ of some objective are perhaps associated more with adversarial examples than with desirable results (Goodfellow et al., 2015; Nguyen et al., 2015). We note that there have been some successes of pixel-wise optimization in texture synthesis (Gatys et al., 2015) and style transfer (Gatys et al., 2016). But broadly speaking, pixel-space optimization procedures often seem to go wrong. We speculate that noisy optimization (6) of smoothed-out objectives like $p_\sigma$ could be a widely applicable method for making pixel-space optimizations more robust.

4 Evaluation Methodology

Many previous works on source separation evaluate their results using peak signal-to-noise ratio (PSNR) or the structural similarity index (SSIM) (Wang et al., 2004). These metrics assume that the original sources are identifiable; in probabilistic terms, the true posterior distribution $p(\mathbf{x}|\mathbf{m})$ is presumed to have a unique global maximum achieved by the ground truth sources (up to permutation of the sources). Under the identifiability assumption, it is reasonable to measure the quality of a separation algorithm by comparing separated sources to ground truth mixture components. PSNR, for example, evaluates separations by computing the mean squared distance between pixel values of the ground truth and separated sources on a logarithmic scale.

For CIFAR-10 source separation, the ground truth source components of a mixture are not identifiable. As evidence for this claim, we call the reader's attention to Figure 4. For each mixture depicted in Figure 4, we present separation results that sum to the mixture and (to our eyes) look plausibly like CIFAR-10 images. However, in each case the separated images exhibit high deviation from the ground truth. This phenomenon is not unusual; Figure 5 shows an un-curated collection of samples from $p(\mathbf{x}|\mathbf{m})$ using BASIS, illustrating a variety of plausible separation results for each given mixture. We will see evidence of non-identifiability again in Figure 7. If we accept that the separations presented in Figures 4, 5, and 7 are reasonable, then source separation on this dataset is fundamentally underdetermined; we cannot measure success using metrics like PSNR that compare separation results to ground truth.

Instead of comparing separations to ground truth, we propose to quantify the extent to which the results of a source separation algorithm look like samples from the data distribution. If a pair of images sum to the given mixture and look like samples from the data distribution, we deem the separation to be a success. This shift in perspective from identifiability of the latent components to the quality of the separated components is analogous to the classical distinction in the statistical literature between estimation and prediction (Shmueli et al., 2010; Bellec et al., 2018). To this end, we borrow the Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017) metrics from the generative modeling literature to evaluate CIFAR-10 separation results. These metrics attempt to quantify the similarity between two distributions given samples. We use them to compare the distribution of components produced by a separation algorithm to the distribution of ground truth images.
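As a sketch of this evaluation protocol, the snippet below scores separated components distributionally against real images. It assumes the third-party torchmetrics package and uint8 image tensors of shape [N, 3, H, W]; this is one possible implementation of IS/FID, not the exact evaluation code behind the tables below.

```python
# A sketch of distributional evaluation of separation results with IS and FID.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

def evaluate_separations(separated_images, real_images):
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)           # ground-truth distribution
    fid.update(separated_images, real=False)     # separated components
    inception = InceptionScore()
    inception.update(separated_images)
    is_mean, is_std = inception.compute()
    return fid.compute().item(), is_mean.item(), is_std.item()
```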

In contrast to CIFAR-10, the posterior distribution $p(\mathbf{x}|\mathbf{m})$ for an MNIST model is demonstrably peaked. Moreover, BASIS is able to consistently identify these peaks. This constitutes a constructive proof that components of MNIST mixtures are identifiable, and therefore comparisons to the ground truth components make sense. We report PSNR results for MNIST, which allows us to compare the results of BASIS to other recent work on MNIST image separation (Halperin et al., 2019; Kong et al., 2019).

5 Experiments

We evaluate results of BASIS on three datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and LSUN (Yu et al., 2015). For MNIST and CIFAR-10, we consider both NCSN (Song & Ermon, 2019) and Glow (Kingma & Dhariwal, 2018) models as priors, using pre-trained weights published by the authors of these models. For LSUN there is no pre-trained NCSN model, so we consider results only with Glow. For Glow, we fine-tune the weights of the pre-trained models to construct noisy models $p_\sigma$ using the procedure described in Section 3.5. Code and instructions for reproducing these experiments are available online: https://github.com/jthickstun/basis-separation

Refer to caption
Figure 4: A curated collection of examples demonstrating color and structural ambiguities in CIFAR-10 mixtures. In each case, the original components differ substantially from the components separated by BASIS using NCSN as a prior. But in each case, the separation results also look like plausible CIFAR-10 images.
Refer to caption
Figure 5: Repeated sampling using BASIS with NCSN as a prior for several mixtures of CIFAR-10 images. While most separations look reasonable, variation in color and lighting makes comparative metrics like PSNR unreliable. This challenges the notion that the ground truth components are identifiable.

Baselines. On MNIST we compare to results reported for the GAN-based “S-D” method (Kong et al., 2019) and the fully supervised version of Neural Egg separation, “NES” (Halperin et al., 2019). Results for MNIST are presented in Section 5.1. To the best of our knowledge there are no previously reported quantitative metrics for CIFAR-10 separation, so as a baseline we ran Neural Egg separation on CIFAR-10 using the authors' published code. CIFAR-10 results are presented in Section 5.2. We present additional qualitative results for $64 \times 64$ LSUN in Section 5.3, which demonstrate that BASIS scales to larger images.

We also consider results for a simple baseline, “Average,” that separates a mixture $\mathbf{m}$ into two 50% masks $\mathbf{x}_1 = \mathbf{x}_2 = \mathbf{m}/2$. This is a surprisingly competitive baseline. Observe that if we had no prior information about the distribution of components and we measure separation quality by PSNR, then by a symmetry argument setting $\mathbf{x}_1 = \mathbf{x}_2$ is the optimal separation strategy in expectation. In principle, we would expect Average to perform very poorly under IS/FID, because these metrics purport to measure similarity of distributions, and mixtures should have little or no support under the data distribution. But we find that IS and FID both assign reasonably good scores to Average, presumably because mixtures exhibit many features that are well supported by the data distribution. This speaks to well-known difficulties in evaluating generative models (Theis et al., 2016) and could explain the strength of Average as a baseline.

We remark that we cannot compare our algorithm to the separation-like task reported for CapsuleNets (Sabour et al., 2017). The segmentation task discussed in that work is similar to source separation, but the mixtures used for the segmentation task are constructed using the non-linear threshold function $h(\mathbf{x}) = \min(\mathbf{x}_1 + \mathbf{x}_2, 1)$, in contrast to our linear function $g$. While extending the techniques of this paper to non-linear relationships between $\mathbf{x}$ and $\mathbf{m}$ is intriguing, we leave this to future work.

Class conditional separation. The Neural Egg separation algorithm is designed with the assumption that the components $\mathbf{x}_i$ are drawn from different distributions. For quantitative results on MNIST and CIFAR-10, we therefore consider two slightly different tasks. The first is class-agnostic, where we construct mixtures by summing randomly selected images from the test set. The second is class-conditional, where we partition the test set into two groupings: digits 0-4 and 5-9 for MNIST, animals and machines for CIFAR-10. The former task allows us to compare to S-D results on MNIST, and the latter task allows us to compare to Neural Egg separation on MNIST and CIFAR-10.

There are two different ways to apply a prior for class-conditional separation. First observe that, because $\mathbf{x}_1$ and $\mathbf{x}_2$ are chosen independently,

$p(\mathbf{x}) = p(\mathbf{x}_1, \mathbf{x}_2) = p_1(\mathbf{x}_1)\, p_2(\mathbf{x}_2).$ (13)

In the class-agnostic setting, $\mathbf{x}_1$ and $\mathbf{x}_2$ are drawn from the same distribution (the empirical distribution of the test set), so it makes sense to use a single prior $p = p_1 = p_2$. In the class-conditional setting, we could potentially use separate priors over components $\mathbf{x}_1$ and $\mathbf{x}_2$. For the MNIST and CIFAR-10 experiments in this paper, we use pre-trained models of the unconditional distribution of the training data in both the class-agnostic and class-conditional settings. It is possible that better results could be achieved in the class-conditional setting by re-training the models on class-conditional training data. For LSUN, the authors of Glow provide separate pre-trained models for the Church and Bedroom categories, so we are able to demonstrate class-conditional LSUN separations using distinct priors in Section 5.3.

Sample Likelihoods. Although we do not directly model the posterior likelihood $p(\mathbf{x}|\mathbf{m})$, we can compute the log-likelihood of the output samples $\mathbf{x}$. The log-likelihood is a function of the artificial variance hyper-parameter $\gamma$, so it is more informative to look at the unweighted square error $\|\mathbf{m} - g(\mathbf{x})\|^2$; this quantity can be interpreted as a reconstruction error, and measures how well we approximate the hard mixture constraint. Because we geometrically anneal the variance $\gamma$, by the end of optimization the mixture constraint is rigorously enforced: the per-pixel reconstruction error is smaller than the quantization level of 8-bit color, resulting in pixel-perfect visual reconstructions.

For Glow, we can also compute the log-probability of samples under the prior. How do the probabilities of sources $\mathbf{x}_{\text{BASIS}}$ constructed by BASIS separation compare to the probabilities of data $\mathbf{x}_{\text{test}}$ taken directly from a dataset's test set? Because we anneal the noise to a fixed level $\sigma_L > 0$, we find it most informative to ask this question using the minimal-noise, fine-tuned prior $p_{\sigma_L}(\mathbf{x})$. As seen in Table 1, the outputs of BASIS separation are generally comparable in log-likelihood to test set images; BASIS separation recovers sources deemed typical by the prior.

Table 1: The mean log-likelihood under the minimal-noise Glow prior $p_{\sigma_L}(\mathbf{x})$ for the test set $\mathbf{x}_{\text{test}}$, and for samples of 100 BASIS separations $\mathbf{x}_{\text{BASIS}}$. The log-likelihood of each test set under the noiseless prior $p(\mathbf{x}_{\text{test}})$ is reported for reference.

Dataset      $p(\mathbf{x}_{\text{test}})$      $p_{\sigma_L}(\mathbf{x}_{\text{test}})$      $p_{\sigma_L}(\mathbf{x}_{\text{BASIS}})$
MNIST 0.5 3.6 3.6
CIFAR-10 3.4 4.5 4.7
LSUN (bed) 2.4 4.2 4.4
LSUN (crh) 2.7 4.4 4.4

5.1 MNIST separation

Quantitative results for MNIST image separation are reported in Table 2, and a panel of visual separation results is presented in Figure 1. For quantitative results, we report mean PSNR over 12,000 separated components. The distribution of PSNR for class-agnostic MNIST separation is visualized in Figure 6. We observe that approximately 2/3 of results exceed the mean PSNR of 29.5, which to our eyes is visually indistinguishable from ground truth.

Refer to caption
Figure 6: The empirical distribution of PSNR for 5,000 class-agnostic MNIST digit separations using BASIS with the NCSN prior (see Table 2 for a comparison of the central tendencies of this and other separation methods).
Table 2: PSNR results for separating 6,000 pairs of equally mixed MNIST images. For class split results, one image comes from labels 0-4 and the other from labels 5-9. We compare to S-D (Kong et al., 2019), NES (Halperin et al., 2019), convolutional NMF (class split) (Halperin et al., 2019), and standard NMF (class agnostic) (Kong et al., 2019).

Algorithm Class Split Class Agnostic
Average 14.8 14.9
NMF 16.0 9.4
S-D - 18.5
BASIS (Glow) 22.9 22.7
NES 24.3 -
BASIS (Glow, 10x) 27.7 27.1
BASIS (NCSN) 29.5 29.3
Refer to caption
Figure 7: Colorizing CIFAR-10 images. Left: original CIFAR-10 images. Middle: greyscale conversions of the images on the left. Right: imputed colors for the greyscale images, found by BASIS using NCSN as a prior.

A natural approach to improve separation performance is to sample multiple $\mathbf{x} \sim p(\cdot|\mathbf{m})$ for a given mixture $\mathbf{m}$. A major advantage of models like Glow, which explicitly parameterize the prior $p(\mathbf{x})$, is that we can approximate the maximum of the posterior distribution with the maximum over multiple samples. By construction, samples from BASIS approximately satisfy $g(\mathbf{x}) = \mathbf{m}$, so for the noiseless model we simply declare $p(\mathbf{m}|\mathbf{x}) = 1$ and therefore $p(\mathbf{x}|\mathbf{m}) \propto p(\mathbf{x})$. We demonstrate the effectiveness of resampling in Table 2 (Glow, 10x) by comparing the expected PSNR of $\mathbf{x} \sim p(\cdot|\mathbf{m})$ to the expected PSNR of $\operatorname{arg\,max}_i p(\mathbf{x}_i)$ over 10 samples $\mathbf{x}_1, \dots, \mathbf{x}_{10} \sim p(\cdot|\mathbf{m})$. Even moderate resampling dramatically improves separation performance. Unfortunately, this approach cannot be applied to the otherwise superior NCSN model, which does not model explicit likelihoods $p(\mathbf{x})$.
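A sketch of this resampling strategy, written in terms of the hypothetical basis_separate and log_prob routines sketched earlier (both illustrative placeholders):

```python
# A sketch of best-of-n resampling: draw several posterior samples and keep
# the one with the highest prior likelihood. Since every sample already
# satisfies g(x) ~ m, p(x|m) is proportional to p(x) and we can rank by the prior.
import torch

def best_of_n(m, k, score, sigmas, log_prob, separate, n=10):
    """separate: a posterior sampler such as the basis_separate sketch above."""
    candidates = [separate(m, k, score, sigmas) for _ in range(n)]
    scores = torch.stack([log_prob(x).sum() for x in candidates])
    return candidates[int(scores.argmax())]
```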

Without any modification, we can apply BASIS to separate mixtures of $k > 2$ images. We contrast this with regression-based methods, which require re-training to target varying numbers of components. Figure 1 shows the results of BASIS with the NCSN prior applied to mixtures of four randomly selected images. With more mixture components, we observe that identifiability of the ground truth sources begins to break down. This is illustrated by the central item in each panel of Figure 1 (highlighted in orange).

5.2 CIFAR-10

Table 3: Inception Score / FID score of 25,000 separations (50,000 separated images) of two overlapping CIFAR-10 images using NCSN as a prior. In Class Split, one image comes from the category of animals and the other from the category of vehicles. NES results use published code from Halperin et al. (2019).

Algorithm      Inception Score      FID
Class Split
NES            5.29 ± 0.08      51.39
BASIS (Glow)   5.74 ± 0.05      40.21
Average        6.14 ± 0.11      39.49
BASIS (NCSN)   7.83 ± 0.15      29.92
Class Agnostic
BASIS (Glow)   6.10 ± 0.07      37.09
Average        7.18 ± 0.08      28.02
BASIS (NCSN)   8.29 ± 0.16      22.12

Quantitative results for CIFAR-10 image separation are presented in Table 3, and visual separation results are presented in Figure 1.

We can also view image colorization (Levin et al., 2004; Zhang et al., 2016) as a source separation problem by interpreting a grayscale image as a mixture of the three color channels of an image $\mathbf{x} = (\mathbf{x}_r, \mathbf{x}_g, \mathbf{x}_b)$ with

$g(\mathbf{x}) = (\mathbf{x}_r + \mathbf{x}_g + \mathbf{x}_b) / 3.$ (14)

Unlike our previous separation problems, the channels of an image are clearly not independent, and the factorization of $p$ given by Equation (13) is unwarranted. But conveniently, a generative model trained on color CIFAR-10 images itself models the joint distribution $p(\mathbf{x}) = p(\mathbf{x}_r, \mathbf{x}_g, \mathbf{x}_b)$. Therefore, the same pre-trained generative model that we use to separate images can also be used to color them.
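Concretely, colorization only changes the mixture function $g$ and the corresponding likelihood gradient; a sketch (with illustrative names) follows.

```python
# A sketch of colorization as separation under Equation (14): the "components"
# are the three color channels of one image, and each channel receives 1/3 of
# the grayscale residual in the likelihood gradient.
import torch

def g_gray(x):                       # x: [3, H, W] color image
    return x.mean(dim=0)             # (x_r + x_g + x_b) / 3, a grayscale "mixture"

def likelihood_grad(x, m, gamma_sq):
    # grad_x log p_gamma(m|x) = (m - g(x)) / (3 * gamma^2), broadcast to all channels
    residual = m - g_gray(x)         # [H, W]
    return residual.unsqueeze(0).expand_as(x) / (3 * gamma_sq)

m = torch.rand(32, 32)               # a grayscale CIFAR-10-sized image
x = torch.rand(3, 32, 32)            # current colorization iterate
step = likelihood_grad(x, m, gamma_sq=0.01)
```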

Qualitative colorization results are visualized in Figure 7. The non-identifiability of ground truth is profound for this task (see Section 4 for discussion of identifiability). We draw attention to the two cars in the middle of the panel: the white car that is colored yellow by the algorithm, and the blue car that is colored red. The colors of these specific cars cannot be inferred from a greyscale image; the best an algorithm can do is to choose a reasonable color, based on prior information about the colors of cars.

Quantitative coloring results for CIFAR-10 are presented in Table 4. We remark that the IS and FID scores for coloring are substantially better than the IS and FID scores of 8.87 and 25.32 respectively reported for unconditional samples from the NCSN model; conditioning on a greyscale image is enormously informative. Indeed, the Inception Score of NCSN-colorized CIFAR-10 is close to the Inception Score of the CIFAR-10 dataset itself.

Table 4: Inception Score / FID Score of 50,000 colorized CIFAR-10 images. As measured by IS/FID, the quality of NCSN colorizations nearly matches CIFAR-10 itself.

Data Distribution Inception Score FID Score
Input Grayscale      8.01 ± 0.10      68.52
BASIS (Glow)         8.69 ± 0.15      28.70
BASIS (NCSN)         10.53 ± 0.17     11.58
CIFAR-10 Original    11.24 ± 0.12     0.00

5.3 LSUN separation

Qualitative results for LSUN separations are visualized in Figure 8. While these separation results are imperfect, Table 1 shows that the mean log-likelihood of the separated components is comparable to the mean log-likelihood that the model assigns to images in the test set. This suggests that the model is incapable of distinguishing these separations from better results: the artifacts are attributable to the Glow prior rather than to the BASIS separation algorithm, and better separation results should be achievable with improved generative models.

Refer to caption
Figure 8: $64 \times 64$ LSUN separation results using Glow as a prior. One mixture component is sampled from the LSUN churches category, and the other component is sampled from LSUN bedrooms.

6 Conclusion

In this paper, we introduced a new approach to source separation that makes use of a likelihood-based generative model as a prior. We demonstrated the ability to swap in different generative models for this purpose, presenting results of our algorithm using both NCSN and Glow. We proposed new methodology for evaluating source separation on richer datasets, demonstrating strong performance on MNIST and CIFAR-10. Finally, we presented qualitative results on LSUN that point the way towards scaling this method to practical tasks such as speech separation, using generative audio models like WaveNets (Oord et al., 2016).

Acknowledgements

We thank Zaid Harchaoui, Sham M. Kakade, Steven Seitz, and Ira Kemelmacher-Shlizerman for valuable discussion and computing resources. This work was supported by the National Science Foundation Grant DGE-1256082.

References

  • Bell & Sejnowski (1995) Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
  • Bellec et al. (2018) Bellec, P. C., Lecué, G., Tsybakov, A. B., et al. Slope meets lasso: improved oracle bounds and optimality. The Annals of Statistics, 46(6B):3603–3642, 2018.
  • Benaroya et al. (2005) Benaroya, L., Bimbot, F., and Gribonval, R. Audio source separation with a single sensor. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):191–199, 2005.
  • Comon (1994) Comon, P. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
  • Davies & James (2007) Davies, M. E. and James, C. J. Source separation using single channel ica. Signal Processing, 87(8):1819–1832, 2007.
  • Défossez et al. (2019) Défossez, A., Usunier, N., Bottou, L., and Bach, F. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.
  • Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. International Conference on Learning Representations, 2017.
  • Gandelsman et al. (2019) Gandelsman, Y., Shocher, A., and Irani, M. “Double-DIP”: Unsupervised image decomposition via coupled deep-image-priors. In The IEEE Conference on Computer Vision and Pattern Recognition, volume 6, pp. 2, 2019.
  • Gatys et al. (2015) Gatys, L., Ecker, A. S., and Bethge, M. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 262–270, 2015.
  • Gatys et al. (2016) Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, pp. 2414–2423, 2016.
  • Geman & Geman (1984) Geman, S. and Geman, D. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
  • Goodfellow et al. (2015) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.
  • Halperin et al. (2019) Halperin, T., Ephrat, A., and Hoshen, Y. Neural separation of observed and unobserved distributions. Advances in Neural Information Processing Systems, 2019.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
  • Huang et al. (2012) Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. Singing-voice separation from monaural recordings using robust principal component analysis. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.  57–60. IEEE, 2012.
  • Huang et al. (2014) Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Singing-voice separation from monaural recordings using deep recurrent neural networks. In International Symposium on Music Information Retrieval, pp.  477–482, 2014.
  • Huang et al. (2015) Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12):2136–2147, 2015.
  • Jansson et al. (2017) Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. Singing voice separation with deep u-net convolutional networks. 2017.
  • Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. International Conference on Learning Representations, 2014.
  • Kong et al. (2019) Kong, Q., Xu, Y., Jackson, P. J. B., Wang, W., and Plumbley, M. D. Single-channel signal separation and deconvolution with generative adversarial networks. In International Joint Conference on Artificial Intelligence, 2019.
  • Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Lee & Seung (1999) Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
  • Lee & Seung (2001) Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pp. 556–562, 2001.
  • Lee et al. (1999) Lee, T.-W., Lewicki, M. S., Girolami, M., and Sejnowski, T. J. Blind source separation of more sources than mixtures using overcomplete representations. IEEE signal processing letters, 6(4):87–90, 1999.
  • Levin et al. (2004) Levin, A., Lischinski, D., and Weiss, Y. Colorization using optimization. In ACM SIGGRAPH 2004 Papers, pp.  689–694. 2004.
  • Lluis et al. (2019) Lluis, F., Pons, J., and Serra, X. End-to-end music source separation: is it possible in the waveform domain? Interspeech, 2019.
  • Neal et al. (2011) Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
  • Nguyen et al. (2015) Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.
  • Nishida et al. (1999) Nishida, S., Nakamura, M., Ikeda, A., and Shibasaki, H. Signal separation of background eeg and spike by using morphological filter. Medical engineering & physics, 21(9):601–608, 1999.
  • Nugraha et al. (2016) Nugraha, A. A., Liutkus, A., and Vincent, E. Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1652–1664, 2016.
  • Oord et al. (2016) Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • Parmar et al. (2018) Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image transformer. International Conference on Machine Learning, 2018.
  • Pearlmutter & Parra (1997) Pearlmutter, B. A. and Parra, L. C. Maximum likelihood blind source separation: A context-sensitive generalization of ica. In Advances in Neural Information Processing Systems, pp. 613–619, 1997.
  • Roweis (2001) Roweis, S. T. One microphone source separation. In Advances in Neural Information Processing Systems, pp. 793–799, 2001.
  • Sabour et al. (2017) Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866, 2017.
  • Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
  • Salimans et al. (2017) Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. International Conference on Learning Representations, 2017.
  • Schmidt & Olsson (2006) Schmidt, M. N. and Olsson, R. K. Single-channel speech separation using sparse non-negative matrix factorization. In International Conference on Spoken Language Processing, 2006.
  • Shmueli et al. (2010) Shmueli, G. et al. To explain or to predict? Statistical Science, 25(3):289–310, 2010.
  • Smaragdis & Venkataramani (2017) Smaragdis, P. and Venkataramani, S. A neural network alternative to non-negative audio models. In International Conference on Acoustics, Speech and Signal Processing, pp.  86–90. IEEE, 2017.
  • Song & Ermon (2019) Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907, 2019.
  • Spiertz & Gnann (2009) Spiertz, M. and Gnann, V. Source-filter based clustering for monaural blind source separation. In International Conference on Digital Audio Effects, 2009.
  • Stoller et al. (2018a) Stoller, D., Ewert, S., and Dixon, S. Adversarial semi-supervised audio source separation applied to singing voice extraction. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.  2391–2395. IEEE, 2018a.
  • Stoller et al. (2018b) Stoller, D., Ewert, S., and Dixon, S. Wave-u-net: A multi-scale neural network for end-to-end audio source separation. International Symposium on Music Information Retrieval, 2018b.
  • Subakan & Smaragdis (2018) Subakan, Y. C. and Smaragdis, P. Generative adversarial source separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.  26–30. IEEE, 2018.
  • Theis et al. (2016) Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. International Conference on Learning Representations, 2016.
  • Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In IEEE Conference on Computer Vision and Pattern Recognition, pp.  9446–9454, 2018.
  • van den Oord et al. (2017) van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.
  • Venkataramani et al. (2017) Venkataramani, S., Subakan, C., and Smaragdis, P. Neural network alternatives to convolutive audio models for source separation. In International Workshop on Machine Learning for Signal Processing, pp. 1–6. IEEE, 2017.
  • Virtanen (2007) Virtanen, T. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE transactions on audio, speech, and language processing, 15(3):1066–1074, 2007.
  • Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, pp. 681–688, 2011.
  • Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.
  • Yu et al. (2015) Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • Zhang et al. (2016) Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In European conference on computer vision, pp.  649–666. Springer, 2016.
  • Zhang et al. (2018) Zhang, X., Ng, R., and Chen, Q. Single image reflection separation with perceptual losses. In IEEE Conference on Computer Vision and Pattern Recognition, pp.  4786–4794, 2018.

Appendix A Experimental Details

A.1 Fine-tuning

We fine-tuned the MNIST, CIFAR-10, and LSUN Glow models at 10 noise levels $\sigma^2$ (see Section 3.4) for 50 epochs each on clusters of four 1080Ti GPUs. This procedure converges rapidly, with no further decrease of the negative log-likelihood after the first 10 epochs. Although Glow models theoretically have full support, the noiseless pre-trained models assign vanishing probability to highly noisy images. In practice, this can cause invertibility assertion failures when fine-tuning directly from the noiseless model. To avoid this we took an iterative approach: first fine-tune the lowest noise level $\sigma = .01$ from the noiseless model, then fine-tune the $\sigma = .016$ model from the $\sigma = .01$ model, and so on.

A.2 Scaling and Resources

Scaling Algorithm 1 to richer datasets is constrained primarily by the limited availability of strong, likelihood-based generative models for these datasets. For high-resolution images, the running time of Algorithm 1 can also become substantial. Assuming the hyper-parameters $T$ and $L$ discussed in Section 3.4 remain valid at higher resolutions, the computational complexity of BASIS scales linearly with the cost of evaluating gradients of the model (albeit with a large multiplicative constant $T \times L$). Therefore, if a generative model is tractable to train, then it should also be tractable to use for BASIS separation.

In concrete detail, we observe that a batch of 50 BASIS separation results for MNIST or CIFAR-10 using NCSN takes less than 5 minutes on a single 1080Ti GPU. Running BASIS with Glow is much slower. We observe that substantial time is spent loading and unloading the noisy models $p_\sigma$ from memory (in contrast to NCSN, which uses a single noise-conditioned model). A batch of 50 BASIS separation results on MNIST or CIFAR-10 using Glow takes about 30 minutes on a 1080Ti. A batch of 9 BASIS separation results on LSUN using Glow takes 2-3 hours on a 1080Ti.

A.3 Visual Comparisons

When using class-agnostic priors, BASIS separation is symmetric in its output components. To facilitate visual comparisons between original images and separated components, we sort the BASIS-separated components to maximize PSNR against the original images. This usually results in the separated components being visually paired with the most similar original components. But due to the deficiencies of PSNR as a comparative metric this is not always the case; the alert reader may have noticed that the yellow and silver car mixture in Figure 1 appears to be displayed in reverse order. This happens because the separated yellow car component takes the light sky from the original silver car component, and the lightness of the sky dominates the PSNR metric.
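A sketch of this sorting step for two-component separations (all names are illustrative):

```python
# A sketch of sorting separated components for visual comparison: try both
# orderings of a two-component separation and keep the one that maximizes
# total PSNR against the original components.
import torch

def psnr(a, b, max_val=1.0):
    mse = ((a - b) ** 2).mean()
    return 10 * torch.log10(max_val ** 2 / mse)

def sort_components(separated, originals):
    # separated, originals: tensors of shape [2, C, H, W]
    straight = psnr(separated[0], originals[0]) + psnr(separated[1], originals[1])
    swapped = psnr(separated[0], originals[1]) + psnr(separated[1], originals[0])
    return separated if straight >= swapped else separated.flip(0)
```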

For the LSUN separation results, where we use a church model for the first component and a bedroom model for the second, the symmetry is broken. For these results, components naturally sort themselves into church and bedroom components, which can be compared directly to the original images.

Appendix B Intermediate Samples During the Annealing Process

Refer to caption
Figure 9: Intermediate CIFAR-10 separation results taken at noise levels σ\sigma during the annealing process of BASIS separation.

Appendix C MNIST Separation Results Under Different Models and Sampling Procedures

Refer to caption
Figure 10: Uncurated class-agnostic separation results using: (1) samples from the posterior with Glow as a prior (2) an approximate MAP estimate using the maximum over 10 samples from the posterior with Glow as a prior (3) samples from the posterior with NCSN as a prior.

Appendix D Extended CIFAR-10 Separation Results

D.1 NCSN Prior

Refer to caption
Figure 11: Uncurated class-agnostic CIFAR-10 separation results using NCSN as a prior.

D.2 Glow Prior

Refer to caption
Figure 12: Uncurated class-agnostic CIFAR-10 separation results using Glow as a prior.

Appendix E Extended CIFAR-10 Colorization Results

E.1 NCSN Prior

Refer to caption
Figure 13: Uncurated CIFAR-10 colorization results using NCSN as a prior.

E.2 Glow Prior

Refer to caption
Figure 14: Uncurated CIFAR-10 colorization results using Glow as a prior.

Appendix F Extended LSUN Separation Results

Refer to caption
Figure 15: Uncurated church/bedroom LSUN separation results using Glow as a prior.