
Self Normalizing Flows

T. Anderson Keller    Jorn W.T. Peters    Priyank Jaini    Emiel Hoogeboom    Patrick Forré    Max Welling
Abstract

Efficient gradient computation of the Jacobian determinant term is a core problem in many machine learning settings, and especially so in the normalizing flow framework. Most proposed flow models therefore either restrict to a function class with an easily evaluated Jacobian determinant, or to one admitting an efficient estimator thereof. However, these restrictions limit the performance of such density models, frequently requiring significant depth to reach desired performance levels. In this work, we propose Self Normalizing Flows, a flexible framework for training normalizing flows by replacing expensive terms in the gradient with learned approximate inverses at each layer. This reduces the computational complexity of each layer's exact update from $\mathcal{O}(D^{3})$ to $\mathcal{O}(D^{2})$, allowing for the training of flow architectures which were otherwise computationally infeasible, while also providing efficient sampling. We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts, while training more quickly and surpassing the performance of functionally constrained counterparts.

Machine Learning, ICML, Normalizing Flows

1 Introduction

The framework of normalizing flows (Tabak & Turner, 2013) allows for powerful exact density estimation through the change of variables formula (Rudin, 1987). A significant challenge with this approach is the Jacobian determinant in the objective, which is generally expensive to compute. A large body of work has therefore focused on methods to evaluate the Jacobian determinant efficiently, usually by limiting the expressivity of the transformation. Two classes of functions have been proposed to achieve this: i) those with triangular Jacobians, such that the determinant only depends on the diagonal (Bogachev et al., 2005; Marzouk et al., 2016; Jaini et al., 2019), and ii) those which are Lipschitz continuous such that the Jacobian determinant can be approximated at each iteration through an infinite series (Behrmann et al., 2019; Grathwohl et al., 2019). The drawback of both of these approaches is that they rely on strong functional constraints.

Figure 1: Overview of self normalizing flows. A matrix ${\bm{W}}$ transforms ${\bm{x}}$ to ${\bm{z}}$. The matrix ${\bm{R}}$ is constrained to approximate the inverse of ${\bm{W}}$ with a reconstruction loss $\mathcal{E}$. The likelihood is efficiently optimized by approximating the gradient of the log Jacobian determinant with the learned inverse.

Recently, a number of works have pursued an alternative approach to training unconstrained normalizing flows by avoiding the expensive computation of the Jacobian determinant altogether during training (Gresele et al., 2020; Krämer et al., 2020). However, these works restrict the parametrization of the transformation, or the parameter updates. As a result, it can be difficult to scale these to higher dimensional data such as images.

In this work we introduce a new framework to avoid the determinant computation during training, which we name the Self Normalizing Framework. Instead of computing the log Jacobian determinant, we approximate its gradient directly. This can be achieved through the insight that the derivative of the log Jacobian determinant is given by the inverse of the Jacobian itself. In this framework, flow components learn to approximate their own inverse through a self-supervised layer-wise reconstruction loss. We then define the overall density model as a mixture of the probabilities induced by the forward and inverse transformations, and show how both transformations can be updated symmetrically using their respective learned inverses directly in the gradient. Ultimately, this avoids the $\mathcal{O}(D^{3})$ complexity required for computing the determinant of each layer at each training iteration, substituting it with an additional backwards pass of order $\mathcal{O}(D^{2})$ required to propagate the reconstruction error gradients.

2 Related Work

The field of normalizing flows can be broadly divided into linear and non-linear flows (Papamakarios et al., 2019). Non-linear flows are generally constructed either by constraining the Jacobian to be triangular (Kingma et al., 2016; Papamakarios et al., 2017; van den Berg et al., 2018; Huang et al., 2018; Jaini et al., 2019) or by constraining a residual function to be Lipschitz (Grathwohl et al., 2019; Behrmann et al., 2019; Chen et al., 2019; Perugachi-Diaz et al., 2020). Linear flows are constructed using a variety of methods. Examples of linear flows are $1\times 1$ convolutions (Kingma & Dhariwal, 2018) which have block-diagonal structure, periodic convolutions (Karami et al., 2019; Hoogeboom et al., 2019) which leverage the frequency domain, Woodbury flows (Lu & Huang, 2020) that use low-rank transformations, and relative gradient-based flows (Gresele et al., 2020) that re-frame optimization of fully connected linear flows. The disadvantage of these methods is that they are either constrained to a subset of transformations, or based on matrix decomposition structures that cannot straightforwardly be extended to convolutional weight sharing.

The work of Gresele et al. (2020) is most similar to ours in the goal of training unconstrained normalizing flows through the use of efficient gradient computations. In Gresele et al. (2020) this is achieved by applying a carefully constructed post-conditioner to the gradient of each layer, transforming it into the natural gradient (Cardoso & Laheld, 1996; Amari, 1998), thereby selectively canceling out the inverse normally required during training. However, since parameters are extensively shared for convolutions, natural gradient methods cannot be straightforwardly applied to convolutional layers. In contrast, the framework proposed in this paper makes use of the traditional gradient, allowing for more flexibility in parameterization, requiring only that an inverse function can be learned and maintained throughout training.

Related to our framework, the idea of using learned inverse functions in the setting of density estimation was proposed in early work on invertible neural networks (Rippel & Adams, 2013). In that work, similar to ours, both directions of the density model are parameterized and constrained to be approximate inverses through a reconstruction loss. The learned ‘encoder’ is then used to approximate the marginal distribution of the data in latent space by finding the best fit of a tractable parametric family (such as the Beta family), and the divergence of the approximate latent distribution from the target latent distribution is then minimized. Important differences from our work are that we compute the likelihood exactly through the change of variables formula, and use the Jacobian of our learned ‘decoder’ as an approximation to the gradient of the intractable Jacobian determinant term which arises. An advantage of this approach is that we no longer need to explicitly constrain our decoder to be well conditioned, or invertible, since this constraint is implicitly imposed through the maximization of the Jacobian determinant. Additionally, a novel aspect of our framework is that it models the density of observations as a mixture of the densities under the forward and inverse transformations, taking advantage of all learned parameters.

More broadly, the idea of greedy layer-wise learning has been explored in many forms for training neural networks (Bengio et al., 2006; Hinton et al., 2006; Löwe et al., 2019). One influential class of work uses stacked auto-encoders, or deep belief networks, for pre-training or representation learning (Bengio et al., 2006; Hinton et al., 2006; Vincent et al., 2008; Kosiorek et al., 2019). Our work leverages similar models and training paradigms, but introduces them into a modern flow framework, demonstrating additional potential uses for the learned feedback connections.

Figure 2: MNIST (left) and CIFAR-10 (right) samples generated from a self normalizing Glow model using the exact inverse (top) vs. the approximate learned inverse (bottom). From these samples we observe that the self normalizing flow models have learned to become good generative models of the data. Additionally, comparing the top and bottom rows, we see the inverse approximation is nearly exact.

Another related class of work addresses the biological implausibility of backpropagation, likewise through learned layer-wise autoencoders. Target propagation (Lecun, 1986; Bengio, 2014; Lee et al., 2015; Bengio, 2020; Meulemans et al., 2020) addresses the so-called weight transport problem by training auto-encoders at each layer of a network and using these learned feedback weights to propagate ‘targets’ to previous layers. Our method takes inspiration from this approach. Specifically, our method can be viewed as a hybrid of target propagation and backpropagation (Linnainmaa, 1976; Werbos, 1982; Rumelhart et al., 1986) particularly suited to unsupervised density estimation in the normalizing flow framework. The novelty of our approach in this regard lies in the use of the inverse weights directly in the update, rather than in the backward propagation of updates.

3 A General Framework for Self Normalizing Flows

3.1 Preliminaries

Given an observation ${\bm{x}}\in\mathbb{R}^{D}$, it is assumed that ${\bm{x}}$ is generated from an underlying real vector ${\bm{z}}\in\mathbb{R}^{D}$ through an invertible and differentiable transformation $g$, and that $f=g^{-1}$ is also differentiable (i.e. $g$ is a diffeomorphism). It is further assumed that ${\bm{z}}$ is a sample from a simple known underlying distribution $p_{{\bm{Z}}}$ such as a standard Gaussian. Then, the probability density $p_{{\bm{X}}}$ can be computed exactly using the change of variables formula:

$$p_{{\bm{X}}}({\bm{x}}) = p_{{\bm{Z}}}({\bm{z}})\left|\frac{\partial{\bm{z}}}{\partial{\bm{x}}}\right| = p_{{\bm{Z}}}\left(g^{-1}({\bm{x}})\right)\left|{\bm{J}}_{g^{-1}}\right| = p_{{\bm{Z}}}\big(f({\bm{x}})\big)\left|{\bm{J}}_{f}\right| \qquad (1)$$

where the change of volume term $\left|{\bm{J}}_{f}\right| = \left|\frac{\partial f({\bm{x}})}{\partial{\bm{x}}}\right|$ is the determinant of the Jacobian of the transformation between ${\bm{z}}$ and ${\bm{x}}$, evaluated at ${\bm{x}}$.

Typically, the functions $f$ and $g$ are defined as compositions of diffeomorphisms themselves, i.e. $g = g_{0}\circ g_{1}\circ\dots\circ g_{K}$ and $f = f_{K}\circ f_{K-1}\circ\dots\circ f_{0}$ where ${g_{k}}^{-1} = f_{k}$. This formulation takes advantage of the fact that a composition of diffeomorphic functions is also a diffeomorphism, meaning that if each $f_{k}$ is invertible and differentiable, then so is the composition $f$, and the change of variables formula in Equation 1 still holds.
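
As a concrete illustration, the sketch below evaluates Equation 1 exactly for a single invertible linear transformation under a standard Gaussian base density (a minimal sketch; the function name and shapes are illustrative, and the $\mathcal{O}(D^{3})$ `slogdet` call is exactly the cost the remainder of the paper seeks to avoid during training).

```python
import math
import torch

def exact_log_likelihood(x, W):
    """Exact log p_X(x) for a single linear flow z = W x (Eq. 1),
    with a standard Gaussian base density p_Z."""
    z = x @ W.T                                     # forward transformation f(x) = W x
    D = z.shape[1]
    log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * D * math.log(2 * math.pi)
    _, logabsdet = torch.linalg.slogdet(W)          # log |J_f| = log |det W|, O(D^3)
    return log_pz + logabsdet                       # one log-likelihood per sample

# x = torch.randn(8, 4); W = torch.randn(4, 4); print(exact_log_likelihood(x, W))
```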

Most approaches then propose defining and parameterizing only one direction of the flow, for example the ‘forward’ functions $f_{k}$, and computing the inverses $g_{k}$ exactly when needed. The log-likelihood of the observations is then maximized simultaneously with respect to each $f_{k}$'s vector of parameters $\bm{\theta}_{k}$, for all $k$, which requires the gradient. Using the identity $\frac{\partial}{\partial{\bm{J}}}\log|{\bm{J}}| = {\bm{J}}^{-T}$, we obtain the following gradient of the log-likelihood with respect to a given layer $k$'s parameters:

$$\frac{\partial}{\partial\bm{\theta}_{k}}\log p_{{\bm{X}}}({\bm{x}}) = \frac{\partial}{\partial\bm{\theta}_{k}}\log p_{{\bm{Z}}}(f({\bm{x}})) + \frac{\partial\left(\mathrm{vec}\,{\bm{J}}_{f}\right)^{T}}{\partial\bm{\theta}_{k}}\left(\mathrm{vec}\,{\bm{J}}_{f}^{-T}\right) \qquad (2)$$

Following the conventions of Magnus (2010) for matrix derivatives, we make use of the vectorization operator $\mathrm{vec}$, which maps $m\times n$ matrices to $mn\times 1$ column vectors by stacking the columns of the matrix sequentially, and we formulate the parameters $\bm{\theta}_{k}$ as column vectors.
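
The identity used above can be verified numerically with automatic differentiation; the short sketch below (illustrative only) confirms that the gradient of $\log|\det {\bm{J}}|$ with respect to ${\bm{J}}$ is ${\bm{J}}^{-T}$.

```python
import torch

# Numerical check of the identity d/dJ log|det J| = J^{-T} used in Eq. 2.
D = 5
J = torch.randn(D, D, requires_grad=True)
logabsdet = torch.linalg.slogdet(J).logabsdet
logabsdet.backward()
print(torch.allclose(J.grad, torch.linalg.inv(J).T.detach(), atol=1e-5))  # True
```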

3.2 Self Normalizing Framework

In order to avoid the inverse Jacobian in the gradient, we instead propose to define and parameterize both the forward and inverse functions $f_{k}$ and $g_{k}$, with parameters $\bm{\theta}_{k}$ and $\bm{\gamma}_{k}$ respectively. We then constrain the parameterized inverse $g_{k}$ to be approximately equal to the true inverse $f^{-1}_{k}$ through a layer-wise reconstruction loss. We can thus define our maximization objective as the mixture of the log-likelihoods induced by both models minus the reconstruction penalty:

$$\mathcal{L}({\bm{x}}) = \frac{1}{2}\log p^{f}_{{\bm{X}}}({\bm{x}}) + \frac{1}{2}\log p^{g}_{{\bm{X}}}({\bm{x}}) - \lambda\sum_{k=0}^{K}\left\|g_{k}\left(f_{k}({\bm{h}}_{k})\right) - {\bm{h}}_{k}\right\|^{2}_{2} \qquad (3)$$

where $p^{f}_{{\bm{X}}}$ and $p^{g}_{{\bm{X}}}$ now denote the densities induced separately by the forward and inverse transformations, and ${\bm{h}}_{k} = \mathrm{gradient\_stop}(f_{k-1}\circ\dots\circ f_{0}({\bm{x}}))$ is the output of function $f_{k-1}$ with the gradients blocked, such that only $g_{k}$ and $f_{k}$ receive gradients from the reconstruction loss at layer $k$. We see that when $f = g^{-1}$ exactly, this is equivalent to the traditional normalizing flow framework.
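
A minimal sketch of how the reconstruction penalty and its gradient stop might be implemented in practice (`layers` and `inverses` stand for lists of paired modules $f_k$ and $g_k$; the function name is illustrative):

```python
import torch

def layerwise_reconstruction(x, layers, inverses):
    """Reconstruction term of Eq. 3, with the gradient_stop on h_k implemented
    via detach() so that only f_k and g_k receive gradients from layer k's loss."""
    h, recon = x, 0.0
    for f_k, g_k in zip(layers, inverses):
        h_k = h.detach()                        # gradient_stop(f_{k-1} o ... o f_0(x))
        recon = recon + ((g_k(f_k(h_k)) - h_k) ** 2).flatten(1).sum(dim=1)
        h = f_k(h)                              # the forward pass itself keeps its graph
    return h, recon                             # z = f(x) and the summed penalty
```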

By the inverse function theorem, we know that the inverse of the Jacobian of an invertible function is given by the Jacobian of the inverse function, i.e. ${\bm{J}}_{f}^{-1}({\bm{x}}) = {\bm{J}}_{f^{-1}}({\bm{z}})$. Therefore, we see that with the above parameterization and constraint, we can approximate both the change of variables formula and the gradients for both functions in terms of the Jacobians of the respective inverse functions. Explicitly:

$$\frac{\partial}{\partial\bm{\theta}_{k}}\log p^{f}_{{\bm{X}}}({\bm{x}}) \approx \frac{\partial}{\partial\bm{\theta}_{k}}\log p_{{\bm{Z}}}(f({\bm{x}})) + \frac{\partial\left(\mathrm{vec}\,{\bm{J}}_{f}\right)^{T}}{\partial\bm{\theta}_{k}}\left(\mathrm{vec}\,{\bm{J}}_{g}^{T}\right) \qquad (4)$$
$$\frac{\partial}{\partial\bm{\gamma}_{k}}\log p^{g}_{{\bm{X}}}({\bm{x}}) \approx \frac{\partial}{\partial\bm{\gamma}_{k}}\log p_{{\bm{Z}}}(g^{-1}({\bm{x}})) - \frac{\partial\left(\mathrm{vec}\,{\bm{J}}_{g}\right)^{T}}{\partial\bm{\gamma}_{k}}\left(\mathrm{vec}\,{\bm{J}}_{f}^{T}\right) \qquad (5)$$

where Equation 5 follows from the derivation of Equation 4 and the application of the derivative of the inverse. We note that the above approximation requires that the Jacobians of the functions are approximately inverses in addition to the functions themselves being approximate inverses. For the models presented in this work, this property is obtained for free since the Jacobian of a linear mapping is the matrix representation of the map itself. However, for more complex mappings, this may not be exactly the case and should be constrained explicitly. We suggest this could be efficiently implemented with an additional loss analogous to a reconstruction loss but employing Jacobian vector products instead of matrix vector products (see Section A.4).
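
One possible realization of such a Jacobian-level constraint, sketched below under the assumption that $f_k$ and $g_k$ are arbitrary differentiable modules, penalizes $\|{\bm{J}}_{g_k}(f_k({\bm{h}})){\bm{J}}_{f_k}({\bm{h}}){\bm{v}} - {\bm{v}}\|^2$ for random probe vectors ${\bm{v}}$ using only Jacobian vector products (the helper name and sampling scheme are illustrative, not the procedure used in our experiments):

```python
import torch
from torch.autograd.functional import jvp

def jacobian_inverse_penalty(f_k, g_k, h, n_probes=1):
    """Penalize || J_{g_k}(f_k(h)) J_{f_k}(h) v - v ||^2 with random probes v,
    using two Jacobian-vector products instead of explicit Jacobians."""
    penalty = 0.0
    for _ in range(n_probes):
        v = torch.randn_like(h)
        z, Jf_v = jvp(f_k, (h,), (v,), create_graph=True)        # J_{f_k}(h) v
        _, JgJf_v = jvp(g_k, (z,), (Jf_v,), create_graph=True)   # J_{g_k}(z) J_{f_k}(h) v
        penalty = penalty + ((JgJf_v - v) ** 2).mean()
    return penalty / n_probes
```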

Figure 3: Angle (in degrees) between the exact gradient and the approximate gradient at each step of training. We see that the angle is close to 0.1 degrees for the fully connected network (FC), and less than 1 degree for the convolutional network (CNN), suggesting the approximation is close throughout training.

Although there are no known convergence guarantees for such a method, we observe in practice that, with sufficiently large values of $\lambda$, most models quickly converge to solutions which maintain the desired constraint. This phenomenon is demonstrated in Figure 3, where it is shown that the angle between the true gradient and the approximate gradient at each training iteration decreases to less than 1 degree over the course of training for both models tested. As a visual example of the quality of the inverse approximation, Figure 2 shows samples from the base distribution $p_{{\bm{Z}}}$ passed through both the true inverse $f^{-1}$ (top) and the learned approximate inverse $g$ (bottom) to generate samples from $p_{{\bm{X}}}$. As demonstrated by the nearly identical samples, the approximate inverse appears to be a very close match to the true inverse. The details of the models which generated these samples are given in Section 5. Further mitigation strategies for potential optimization difficulties are discussed in Sections 6 and A.4. Finally, in Figure 4 we see the efficiency improvements of our method compared with the exact gradient. As expected, the self normalizing method scales much more favorably with input dimensionality than the exact gradient method.

Figure 4: Time per batch vs. input dimensionality for a single fully connected layer, comparing the self normalizing and exact gradients. We see the self normalizing flow model scales much more favorably with input dimension.

3.3 Inference with Self Normalizing Flows

The above framework proposes approximate gradients which allow for the training of normalizing flow architectures without having to compute expensive determinants or matrix inverses. However, to compute the exact log-likelihood of an observation (i.e. perform inference), the exact log Jacobian determinant still has to be computed for each input, which may be expensive.

Figure 5: Negative log-likelihood on the MNIST validation set for a 2-layer fully connected flow trained with exact vs. self normalizing (SNF) gradients. Shown vs. training time (left) and vs. epochs (right). We see that both models converge to similar optima while the SNF model trains much more quickly.

To alleviate this limitation we take advantage of the compositional formulation of $f$, and the multiplicative property of the determinant. In detail, the determinant of the product of square matrices is equal to the product of their determinants. Therefore, assuming our flow is composed of square transformations $f_{k}$, the Jacobian is given by ${\bm{J}}_{f} = \prod_{k}{\bm{J}}_{f_{k}}$, and the log determinant of the Jacobian is given by:

$$\log|{\bm{J}}_{f}| = \sum_{k}^{K}\log|{\bm{J}}_{f_{k}}| \qquad (6)$$

For neural network architectures composed of sequential linear and nonlinear transformations, this allows us to separate the components of the Jacobian which are data-independent (e.g. those of linear layers) from those that are data-dependent (e.g. those of the activations). For example, given a network composed of $K$ layers of the form ${\bm{h}}_{k} = \sigma_{k}({\bm{W}}_{k}{\bm{h}}_{k-1})$, where $\sigma_{k}$ is a point-wise non-linearity, combining Equations 1 and 6 yields:

$$\log p_{{\bm{X}}}({\bm{x}}) = \log p_{{\bm{Z}}}({\bm{z}}) + \sum_{k}^{K}\log|{\bm{W}}_{k}| + \sum_{k}^{K}\log|{\bm{\Sigma}}_{k}({\bm{x}})| \qquad (7)$$

where ${\bm{\Sigma}}_{k}({\bm{x}})$ is the Jacobian of the activation function at layer $k$, evaluated at the sample ${\bm{x}}$. Importantly, we observe that the data-independent terms, $\sum_{k}\log|{\bm{W}}_{k}|$, are the expensive part of inference, while the data-dependent terms, $\sum_{k}\log|{\bm{\Sigma}}_{k}({\bm{x}})|$, are frequently cheap to compute analytically given the derivative of the activation. Therefore, once trained, the expensive data-independent terms must only be computed once, and their values can then be reused for all future examples, effectively amortizing the cost of the determinant computation.
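
A minimal sketch of this amortization for a fully connected flow under a standard Gaussian base density (the function names and the `activation_logdet` helper are illustrative assumptions, not part of our released implementation):

```python
import math
import torch

def cache_weight_logdets(weights):
    """Data-independent part of Eq. 7: computed once after training, then reused."""
    return sum(torch.linalg.slogdet(W).logabsdet for W in weights)

def amortized_log_likelihood(x, weights, activations, activation_logdet, cached_logdets):
    """Data-dependent part of Eq. 7 plus the cached log|W_k| terms.
    activation_logdet(sigma, pre) should return sum_i log|sigma'(pre_i)| per sample."""
    h, data_dependent = x, 0.0
    for W, sigma in zip(weights, activations):
        pre = h @ W.T
        data_dependent = data_dependent + activation_logdet(sigma, pre)  # cheap, per sample
        h = sigma(pre)
    D = h.shape[1]
    log_pz = -0.5 * (h ** 2).sum(dim=1) - 0.5 * D * math.log(2 * math.pi)
    return log_pz + cached_logdets + data_dependent
```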

4 Self Normalizing Flows

In this section we introduce two simple applications of the self normalizing framework: fully-connected and convolutional self normalizing flows.

4.1 Self Normalizing Fully Connected Layer

As a specific case of the self normalizing framework, we introduce a single fully connected self normalizing layer, as exemplified in Figure 1. Let ${\bm{W}}, {\bm{R}}\in\mathbb{R}^{D\times D}$, with $f({\bm{x}}) = {\bm{W}}{\bm{x}} = {\bm{z}}$ and $g({\bm{z}}) = {\bm{R}}{\bm{z}}$, such that ${\bm{W}}^{-1}\approx{\bm{R}}$. We additionally denote the layer-wise reconstruction loss as $\mathcal{E}({\bm{x}}) = \left\|{\bm{R}}{\bm{W}}{\bm{x}} - {\bm{x}}\right\|^{2}_{2}$, but leave the derivation of the gradients of this term for the appendix (see Section A.1), since these are efficiently computed by standard deep learning frameworks and require no approximations.

Taking the gradients of Equation 3 with respect to all parameters ${\bm{W}}$ and ${\bm{R}}$, we get the following exact gradients:

$$\begin{aligned}
\frac{\partial}{\partial{\bm{W}}}\mathcal{L}({\bm{x}}) &= \frac{1}{2}\frac{\partial}{\partial{\bm{W}}}\log p^{f}_{{\bm{X}}}({\bm{x}}) + \frac{1}{2}\frac{\partial}{\partial{\bm{W}}}\log p^{g}_{{\bm{X}}}({\bm{x}}) - \lambda\frac{\partial}{\partial{\bm{W}}}\mathcal{E} \\
&= \frac{1}{2}\left(\frac{\partial\log p_{{\bm{Z}}}({\bm{W}}{\bm{x}})}{\partial{\bm{W}}{\bm{x}}}\frac{\partial{\bm{W}}{\bm{x}}}{\partial{\bm{W}}} + \frac{\partial\log|{\bm{W}}|}{\partial{\bm{W}}}\right) - \lambda\frac{\partial}{\partial{\bm{W}}}\mathcal{E} \\
&= \frac{1}{2}\left(\bm{\delta}_{{\bm{z}}}^{f}{\bm{x}}^{T} + {\bm{W}}^{-T}\right) - \lambda\frac{\partial}{\partial{\bm{W}}}\mathcal{E} \qquad (8)
\end{aligned}$$
$$\begin{aligned}
\frac{\partial}{\partial{\bm{R}}}\mathcal{L}({\bm{x}}) &= \frac{1}{2}\frac{\partial}{\partial{\bm{R}}}\log p^{f}_{{\bm{X}}}({\bm{x}}) + \frac{1}{2}\frac{\partial}{\partial{\bm{R}}}\log p^{g}_{{\bm{X}}}({\bm{x}}) - \lambda\frac{\partial}{\partial{\bm{R}}}\mathcal{E} \\
&= \frac{1}{2}\left(\frac{\partial\log p_{{\bm{Z}}}({\bm{R}}^{-1}{\bm{x}})}{\partial{\bm{R}}^{-1}{\bm{x}}}\frac{\partial{\bm{R}}^{-1}{\bm{x}}}{\partial{\bm{R}}} + \frac{\partial\log|{\bm{R}}^{-1}|}{\partial{\bm{R}}}\right) - \lambda\frac{\partial}{\partial{\bm{R}}}\mathcal{E} \\
&= \frac{1}{2}\left(-{\bm{R}}^{-T}\bm{\delta}_{{\bm{z}}}^{g}{\bm{x}}^{T}{\bm{R}}^{-T} - {\bm{R}}^{-T}\right) - \lambda\frac{\partial}{\partial{\bm{R}}}\mathcal{E} \qquad (9)
\end{aligned}$$

where $\bm{\delta}^{f}_{{\bm{z}}} = \frac{\partial\log p_{{\bm{Z}}}({\bm{W}}{\bm{x}})}{\partial{\bm{W}}{\bm{x}}}$ and $\bm{\delta}^{g}_{{\bm{z}}} = \frac{\partial\log p_{{\bm{Z}}}({\bm{R}}^{-1}{\bm{x}})}{\partial{\bm{R}}^{-1}{\bm{x}}}$ can be computed by standard backpropagation. To avoid computing matrix inverses, using our framework we substitute ${\bm{W}}^{-1}$ with ${\bm{R}}$ in Equation 8, and symmetrically, all instances of ${\bm{R}}^{-1}$ with ${\bm{W}}$ in Equation 9. Additionally, we approximate $\bm{\delta}^{f}_{{\bm{z}}}\approx\bm{\delta}^{g}_{{\bm{z}}}$ in Equation 9 such that we do not need to ‘forward propagate’ through the inverse model, and can re-use the backpropagated error from the forward model. This corresponds to assuming that ${\bm{R}}^{-1}{\bm{x}}\approx{\bm{W}}{\bm{x}}$ for all ${\bm{x}}$, which follows directly from ${\bm{R}}^{-1}\approx{\bm{W}}$. Ultimately this results in the following approximations to the gradients:

$$\frac{\partial}{\partial{\bm{W}}}\mathcal{L}({\bm{x}}) \approx \frac{1}{2}\left(\bm{\delta}_{{\bm{z}}}^{f}{\bm{x}}^{T} + {\bm{R}}^{T}\right) - \lambda\frac{\partial}{\partial{\bm{W}}}\mathcal{E} \qquad (10)$$
$$\frac{\partial}{\partial{\bm{R}}}\mathcal{L}({\bm{x}}) \approx \frac{1}{2}\left(-\bm{\delta}^{f}_{{\bm{x}}}{\bm{z}}^{T} - {\bm{W}}^{T}\right) - \lambda\frac{\partial}{\partial{\bm{R}}}\mathcal{E} \qquad (11)$$

where $\bm{\delta}^{f}_{{\bm{x}}} = \frac{\partial\log p_{{\bm{Z}}}({\bm{W}}{\bm{x}})}{\partial{\bm{x}}} = {\bm{W}}^{T}\bm{\delta}^{f}_{{\bm{z}}}$. We see that by using such a self normalizing layer, the gradient of the log-determinant of the Jacobian term, which originally required an expensive matrix inversion at each iteration, is approximately given by the weights of the inverse transformation, sidestepping computation of both the Jacobian and the inverse. Additionally, we see that the $\bm{\delta}^{f}_{{\bm{x}}}$ term required for Equation 11 is already computed by traditional backpropagation, making the update efficient. Finally, we note that the above gradients can trivially be extended to compositions of such layers, combined with non-linearities, by substituting the appropriate deltas for each layer, and the corresponding layer inputs and outputs for ${\bm{x}}$ and ${\bm{z}}$ respectively.
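
For concreteness, a per-sample sketch of these approximate updates, including the reconstruction gradients derived in Section A.1 (the function name is illustrative; the returned quantities are gradients of $\mathcal{L}$, which is maximized, so they would be ascended):

```python
import torch

def snf_fc_grads(x, W, R, delta_z_f, lam):
    """Approximate per-sample gradients of Eqs. 10 and 11, with the reconstruction
    terms of Eqs. 17 and 20. x and delta_z_f = d log p_Z(Wx)/d(Wx) are 1-D vectors."""
    z = W @ x                                    # forward output z = W x
    delta_x_f = W.T @ delta_z_f                  # backpropagated delta at the layer input
    recon_err = R @ z - x                        # R W x - x

    grad_W = 0.5 * (torch.outer(delta_z_f, x) + R.T) \
             - 2 * lam * torch.outer(R.T @ recon_err, x)
    grad_R = 0.5 * (-torch.outer(delta_x_f, z) - W.T) \
             - 2 * lam * torch.outer(recon_err, z)
    return grad_W, grad_R                        # no matrix inverses required
```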

4.2 Self Normalizing Convolutional Layer

To construct a self normalizing convolutional layer, we consider the setting where both the forward transformation $f$ and the inverse transformation $g$ are convolutional with the same kernel size. We note, importantly, that the inverse of a convolution operation is not necessarily another convolution. However, for sufficiently large $\lambda$, we observe that $f$ is regularized such that it is restricted to the class of convolutions which are approximately invertible by a convolution.

We define the parameters of $f$ to be the kernel ${\bm{w}}$ and, similarly, the parameters of $g$ to be the kernel ${\bm{r}}$. Then, letting $f({\bm{x}}) = {\bm{w}}\star{\bm{x}} = {\bm{z}}$ and $g({\bm{z}}) = {\bm{r}}\star{\bm{z}}$, such that $f^{-1}\approx g$ with ${\bm{x}}, {\bm{z}}\in\mathbb{R}^{D}$, we proceed to derive the exact gradients of the log-likelihood. Again, we ignore the reconstruction term for simplicity as it requires no approximations.

To make the derivation easier, we note that the convolution operation is a linear operation, and can therefore be represented in matrix form. We define a transformation ${\bm{W}} = \mathcal{T}({\bm{w}})$, which maps between the convolutional kernel ${\bm{w}}$ and the corresponding matrix form of the convolution:

$${\bm{z}} = \mathcal{T}({\bm{w}}){\bm{x}} = {\bm{w}}\star{\bm{x}} \qquad (12)$$
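
For intuition, $\mathcal{T}({\bm{w}})$ can be materialized explicitly (for small inputs only) by pushing each standard basis vector of the flattened input through the convolution; a minimal sketch, assuming a spatial-size-preserving convolution and an illustrative input shape:

```python
import torch
import torch.nn.functional as F

def conv_matrix(w, in_shape, padding=1):
    """Build the matrix form T(w) of a convolution with kernel w, column by column,
    by convolving the standard basis 'images' of the flattened input space."""
    D = int(torch.tensor(in_shape).prod())
    basis = torch.eye(D).reshape(D, *in_shape)        # D one-hot input images
    cols = F.conv2d(basis, w, padding=padding)        # each batch entry is T(w) e_j
    return cols.reshape(D, -1).T                      # stack as columns of T(w)

# e.g. w = torch.randn(3, 3, 3, 3); T = conv_matrix(w, (3, 8, 8)); print(T.shape)
```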

Letting ${\bm{w}}$ be a column vector and again making use of the vectorization operator $\mathrm{vec}$, we compute the exact gradient of the log-likelihood with respect to the kernel weights ${\bm{w}}$:

$$\begin{aligned}
\frac{\partial}{\partial{\bm{w}}}\log p^{f}_{{\bm{X}}}({\bm{x}}) &= \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{w}}))^{T}}{\partial{\bm{w}}}\left(\frac{\partial\log p_{{\bm{Z}}}\left(\mathcal{T}({\bm{w}}){\bm{x}}\right)}{\partial\,\mathrm{vec}\,\mathcal{T}({\bm{w}})} + \frac{\partial\log|\mathcal{T}({\bm{w}})|}{\partial\,\mathrm{vec}\,\mathcal{T}({\bm{w}})}\right) \\
&= \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{w}}))^{T}}{\partial{\bm{w}}}\left(\mathrm{vec}\left[\bm{\delta}^{f}_{{\bm{z}}}{\bm{x}}^{T}\right] + \mathrm{vec}\,\mathcal{T}({\bm{w}})^{-T}\right) \\
&= \bm{\delta}^{f}_{{\bm{z}}}\star{\bm{x}} + \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{w}}))^{T}}{\partial{\bm{w}}}\left(\mathrm{vec}\,\mathcal{T}({\bm{w}})^{-T}\right) \qquad (13)
\end{aligned}$$

where we see that the first term, $\frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{w}}))^{T}}{\partial{\bm{w}}}\left(\mathrm{vec}\left[\bm{\delta}^{f}_{{\bm{z}}}{\bm{x}}^{T}\right]\right)$, is given by the convolution $\bm{\delta}^{f}_{{\bm{z}}}\star{\bm{x}}$, as is usually done with backpropagation in convolutional neural networks. Then, given our soft constraint $f^{-1}\approx g$, we can approximate $\mathcal{T}({\bm{w}})^{-T}$ with $\mathcal{T}({\bm{r}})^{T}$, giving us:

$$\frac{\partial}{\partial{\bm{w}}}\log p^{f}_{{\bm{X}}}({\bm{x}}) \approx \bm{\delta}^{f}_{{\bm{z}}}\star{\bm{x}} + \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{w}}))^{T}}{\partial{\bm{w}}}\left(\mathrm{vec}\,\mathcal{T}({\bm{r}})^{T}\right) \qquad (14)$$

To simplify the second term, we note two points. First, the transpose of a convolution can similarly be achieved by standard convolution with a transformed kernel. We call this transformed kernel $\mathrm{flip}({\bm{r}})$, such that $\mathcal{T}(\mathrm{flip}({\bm{r}})) = \mathcal{T}({\bm{r}})^{T}$. Succinctly, $\mathrm{flip}(\cdot)$ is implemented by swapping the input and output axes, and mirroring the spatial (height and width) dimensions of the kernel. Second, the partial derivative $\frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{w}}))^{T}}{\partial{\bm{w}}}$ is given by a rectangular matrix which, when multiplied by the vectorized form of the convolution matrix $\mathcal{T}({\bm{w}})$, yields a constant multiple ${\bm{m}}$ element-wise multiplied with the kernel ${\bm{w}}$. Each multiple element $m_{i}$ is given by the number of times the associated kernel element $w_{i}$ is shared across the matrix $\mathcal{T}({\bm{w}})$. We provide a more thorough derivation of this in the appendix, as well as an efficient method for calculating ${\bm{m}}$ for arbitrary convolutions (see Section A.2). In combination, we arrive at the approximate gradient of the likelihood with respect to the kernel ${\bm{w}}$:

$$\frac{\partial}{\partial{\bm{w}}}\log p^{f}_{{\bm{X}}}({\bm{x}}) \approx \bm{\delta}^{f}_{{\bm{z}}}\star{\bm{x}} + \mathrm{flip}({\bm{r}})\odot{\bm{m}} \qquad (15)$$
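
A hedged sketch of how this approximate kernel gradient might be assembled, noting that the first term is exactly the ordinary backpropagated weight gradient of $\log p_{{\bm{Z}}}(f({\bm{x}}))$ (the `base_log_prob` argument and padding value are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def snf_conv_kernel_grad(x, w, r, m, base_log_prob, padding=1):
    """Approximate gradient of Eq. 15: the data term comes from standard backprop,
    while the log-determinant gradient is replaced by flip(r) elementwise-times m."""
    w = w.detach().clone().requires_grad_(True)
    z = F.conv2d(x, w, padding=padding)              # f(x) = w * x
    base_log_prob(z).sum().backward()                # w.grad now holds the batch-summed data term
    flip_r = r.transpose(0, 1).flip((2, 3))          # T(flip(r)) = T(r)^T, see Eq. 26
    return w.grad + x.shape[0] * flip_r * m          # log-det term appears once per sample
```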

We see that a symmetric derivation can be obtained for the gradient with respect to the kernel ${\bm{r}}$, as outlined below:

$$\begin{aligned}
\frac{\partial}{\partial{\bm{r}}}\log p^{g}_{{\bm{X}}}({\bm{x}}) &= \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{r}}))^{T}}{\partial{\bm{r}}}\left(\frac{\partial\log p_{{\bm{Z}}}\left(\mathcal{T}({\bm{r}})^{-1}{\bm{x}}\right)}{\partial\,\mathrm{vec}\,\mathcal{T}({\bm{r}})} + \frac{\partial\log|\mathcal{T}({\bm{r}})^{-1}|}{\partial\,\mathrm{vec}\,\mathcal{T}({\bm{r}})}\right) \\
&= \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{r}}))^{T}}{\partial{\bm{r}}}\left(\mathrm{vec}\left[-\mathcal{T}({\bm{r}})^{-T}\bm{\delta}^{g}_{{\bm{z}}}{\bm{x}}^{T}\mathcal{T}({\bm{r}})^{-T}\right] - \mathrm{vec}\,\mathcal{T}({\bm{r}})^{-T}\right) \\
&\approx \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{r}}))^{T}}{\partial{\bm{r}}}\left(\mathrm{vec}\left[-\mathcal{T}({\bm{w}})^{T}\bm{\delta}^{f}_{{\bm{z}}}{\bm{x}}^{T}\mathcal{T}({\bm{w}})^{T}\right] - \mathrm{vec}\,\mathcal{T}({\bm{w}})^{T}\right) \\
&= -\bm{\delta}^{f}_{{\bm{x}}}\star{\bm{z}} - \mathrm{flip}({\bm{w}})\odot{\bm{m}} \qquad (16)
\end{aligned}$$

5 Experiments

In our first set of experiments, we train simple flows composed of the above layers and invertible non-linearities on the MNIST dataset. To evaluate our proposed approximate gradients, we compare to baseline models of the same architectures trained with the exact gradient. These architectures are designed to be small, so that it is still possible to compute the exact gradients quickly. We additionally compare with similar recent approaches to training normalizing flows with linear and convolutional layers, namely the relative gradient method of Gresele et al. (2020) and the convolution parametrizations of Hoogeboom et al. (2019, 2020).

To evaluate the scalability of our method, we perform a second set of experiments where we integrate self normalizing flows into the Glow framework (Kingma & Dhariwal, 2018) as a replacement for the 1x1 convolutional mixing layers. In this framework we train models on MNIST, CIFAR-10, and the downsized Imagenet 32x32 dataset. All experimental results are from our re-implementations for consistency. In some cases, due to differing hyper-parameters or errors in prior work, this yielded slightly different results than those published. We provide extended explanations for these discrepancies, as well as a link to our code repository, in the appendix (See Section A.3).

5.1 MNIST

On the MNIST dataset we train three classes of models: 2-layer fully connected (FC) models with smooth-leaky-ReLU activations (Gresele et al., 2020), 9-layer convolutional (CNN) models with spline activations (Durkan et al., 2019), and 32-layer Glow models composed of affine coupling layers and 1x1 convolutional mixing layers (Kingma & Dhariwal, 2018). In all cases, we compare a self normalizing version with its exact gradient baseline.

Table 1: Negative log-likelihood in nats on the MNIST test set. Mean ± std. over 3 runs. Self normalizing flows (SNF) achieve comparable performance to their exact counterparts.

Model | $-\log p_{{\bm{X}}}({\bm{x}})$
Relative Grad. FC 2-Layer | 1096.5 ± 0.5
Exact Gradient FC 2-Layer | 947.6 ± 0.2
SNF FC 2-Layer (ours) | 947.1 ± 0.2
Emerging Conv. 9-Layer | 645.7 ± 3.6
SNF Conv. 9-Layer (ours) | 638.6 ± 0.9
Conv. Exponential 9-Layer | 638.1 ± 1.0
Exact Gradient Conv. 9-Layer | 637.4 ± 0.2
Glow 2L-16K | 575.7 ± 0.8
SNF Glow 2L-16K (ours) | 575.4 ± 1.4

As can be seen in Table 1 the models composed of self normalizing flow layers are nearly identical in performance to their exact gradient counterparts on the MNIST dataset. We see that the fully connected self normalizing model drastically outperforms the relative gradient method of Gresele et al. (2020), reaching the same performance as the exact gradient method. Additionally, we see that the self normalizing convolutional layer outperforms its constrained counterpart from Hoogeboom et al. (2019), likely due to the fact that the emerging convolution is unable to represent the 1x1 convolution explicitly. We hypothesize that the convolutional self normalizing flow model slightly under-performs the exact gradient method due to the convolutional-inverse constraint. We propose this constraint can be relaxed by using more complex inverse functions, potentially composed of multiple layers and non-linearities (see Section A.4), but leave this to future work. All training details can be found in the appendix (see Section A.3).

In Figure 5, the plot on the right shows that the qualitative convergence properties of the approximate gradient method are very similar to those of the exact gradient, eventually converging to nearly the same validation likelihood. However, as can be seen in the plot on the left, due to the reduced computational complexity, the approximate gradient method trains in less than half the time, and reaches approximate convergence even more quickly. This timing comparison is demonstrated more exactly in Table 3, where the time per training batch and time per sample is computed for all models presented in this work. As can be seen, the self normalizing flow models are the fastest of all presented models, with the exception of the relative gradient method, which appears to lag behind in likelihood performance. From this table we also see that the relative improvements in speed are directly related to the portions of the network which are replaced with self normalizing components. Since only the 1x1 convolution is replaced in the Glow framework, there is only a slight speed increase to be had.

Finally, we quantitatively measure the quality of the gradient approximation by measuring the alignment of the directions of the approximate gradient and the exact gradient in Figure 3. Specifically, for models trained in the self normalizing framework, we measure the ‘angular error’ between the approximate gradient and the true gradient at each training iteration, for each layer, and plot the average angular error over all layers for each training epoch. The shaded area denotes the standard deviation of the angles. We make two observations from this figure. First, both the 2-layer FC model and the 9-layer CNN appear to have an initial slight divergence from the true gradient, but then quickly align to less than 1 degree of error, suggesting they are close approximations to the true gradient direction. Second, we observe that the average angular error of the approximate gradient is significantly larger for the convolutional model than it is for the fully connected model. Comparing this with the results in Table 1, we see that this could be contributing to the slightly lower performance of the self normalizing model relative to the exact gradient methods. We again hypothesize this could be due to the convolutional inverse constraint, which makes the inverse approximation more challenging. We leave further exploration of this topic to future work, but believe that the performance could be improved with better inverse approximations, potentially achieved through more complex constrained optimization techniques.

Table 2: Bits per dimension for large scale Glow models on two larger scale natural image datasets. Mean ± std. over 3 runs. Self normalizing flows (SNF) achieve comparable performance to their exact counterparts, demonstrating that this method scales to large models ($>100$ steps of flow). See Section A.3 for details.

Model | CIFAR-10 | ImageNet32
Glow | 3.36 ± 0.002 | 4.12 ± 0.002
SNF Glow | 3.37 ± 0.004 | 4.14 ± 0.007

Table 3: Runtime comparison for the models presented in Tables 1 and 2. Hardware and implementation details are in Section A.3.

Model | Dataset | Time / batch (ms) | Time / sample (ms)
Exact FC 2-Layer | MNIST | 44.9 ± 4.4 | 61.5 ± 5.8
Relative Gradient FC 2-Layer | MNIST | 7.0 ± 0.4 | 69.2 ± 5.6
Self Normalizing FC 2-Layer | MNIST | 18.7 ± 0.8 | 38.6 ± 3.1
Exact Conv. 9-Layer | MNIST | 372.2 ± 24.5 | 241.6 ± 12.9
Emerging Conv. 9-Layer | MNIST | 305.0 ± 14.5 | 71.7 ± 8.8
Conv. Exponential 9-Layer | MNIST | 304.4 ± 11.9 | 84.2 ± 9.2
Self Normalizing Conv. 9-Layer | MNIST | 212.5 ± 37.3 | 29.9 ± 6.3
Glow 2L-16K | MNIST | 583.4 ± 21.2 | 163.1 ± 21.8
Self Normalizing Glow 2L-16K | MNIST | 476.4 ± 16.7 | 30.6 ± 2.2
Glow 3L-32K | CIFAR-10 | 1841.3 ± 85.4 | 126.3 ± 13.9
Self Normalizing Glow 3L-32K | CIFAR-10 | 1761.2 ± 104.5 | 97.8 ± 12.9
Glow 3L-48K | ImageNet 32x32 | 2397.2 ± 204.0 | 174.8 ± 16.7
Self Normalizing Glow 3L-48K | ImageNet 32x32 | 2047.9 ± 152.8 | 150.7 ± 20.8

5.2 CIFAR-10 and ImageNet 32x32

For large scale experiments, we incorporate self normalizing flow layers into the Glow framework and train the models on CIFAR-10 and Imagenet 32x32. Specifically, we use the same model architectures as those proposed in Kingma & Dhariwal (2018), with some slight changes to the optimization parameters as detailed in Section A.3.

We observe that the self normalizing flow models are able to achieve competitive bits per dimension on both datasets (as seen in Table 2), while simultaneously training slightly faster than their exact gradient counterparts (as seen in Table 3). Importantly, we experimented with values of $\lambda$ in the set $\{1, 10, 100, 1000\}$, and chose $\lambda = 1000$ for all CIFAR-10 and ImageNet models due to increased stability during training. We observed a slight reduction in final likelihood performance as a result of such a large reconstruction weight, and believe this is a factor in the performance gap between the self normalizing and exact gradient models. We believe that with more tuning, or with a dynamically weighted constrained optimization method such as that presented in Platt & Barr (1988), the self normalizing model is likely to match the exact gradient model even more closely. Preliminary experiments in this direction are shown in Section A.5.

Although it is clear that the Glow framework is not the optimal setting for the application of self normalizing layers, given that the determinant calculation of the 1x1 convolution is already relatively quick to evaluate, we present this work as a proof of the scalability of our framework to models with greater than 100 steps of flow (as in the ImageNet 32x32 case), and to larger scale images.

6 Discussion

We see that the above framework yields an efficient update rule for flow-based models which appears to perform similarly to the exact gradient while taking significantly less time to train. However, in its current form, this approach has a number of limitations of its own.

First, as described in Section 3.3, the evaluation of the exact log-likelihood is still expensive, requiring the computation of the exact log Jacobian determinant of a general function class. However, once a model is trained, the Jacobian determinants of the linear transformations only need to be computed once, and can then be re-used for all future likelihood evaluation, effectively amortizing the cost.

Second, there are no known optimization guarantees for our proposed model. Therefore, the model could converge to a sub-optimal trade-off between the negative log-likelihood and the reconstruction error, or even diverge if the inverse approximation is very poor. In practice, we observe that the reconstruction error stays very low for most models when initialized properly, and final likelihood values are only marginally impacted by the choice of $\lambda$. In future work, we intend to explore the possibility of augmented Lagrangian methods, the Modified Differential Method of Multipliers (Platt & Barr, 1988), and other constrained optimization techniques, which could provide better convergence guarantees.

We note that one of the biggest constraints of the models presented here is that the inverse of the forward function may not always be given by a function of the same class, as is the case for convolutions. Although not evaluated in this work, we note that the general framework proposed here would allow for more complex asymmetric self normalizing flow components, and we intend to evaluate these in future work.

Finally, as with all current exact likelihood flow methods, the dimensionality of the representation must stay consistent throughout the depth of the flow. Currently, this problem has been approached with flow components such as coupling layers and variational data augmentation; however, these methods are either restrictive in their design or add significant noise. This is clearly one of the greatest architectural constraints for existing normalizing flows and remains so in this work.

7 Conclusion

In summary, we introduce Self Normalizing Flows, a new method to efficiently optimize normalizing flow layers. The method approximates the gradient of the log Jacobian determinant using learned inverses, allowing for the training of otherwise intractable normalizing flow architectures. We demonstrate that our method performs competitively with other models in the literature while simultaneously providing faster training and sampling.

References

  • Amari (1998) Amari, S.-i. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • Behrmann et al. (2019) Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J. Invertible residual networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  573–582. PMLR, 2019.
  • Bengio (2014) Bengio, Y. How auto-encoders could provide credit assignment in deep networks via target propagation. CoRR, abs/1407.7906, 2014.
  • Bengio (2020) Bengio, Y. Deriving differential target propagation from iterating approximate inverses, 2020.
  • Bengio et al. (2006) Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19: Annual Conference on Neural Information Processing Systems 2006, NeurIPS 2006 Montréal, Canada, pp.  153–160, 2006.
  • Biewald (2020) Biewald, L. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.
  • Bogachev et al. (2005) Bogachev, V. I., Kolesnikov, A. V., and Medvedev, K. V. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.
  • Cardoso & Laheld (1996) Cardoso, J.-F. and Laheld, B. H. Equivariant adaptive source separation. IEEE Transactions on signal processing, 44(21):3017–3030, 1996.
  • Chen et al. (2019) Chen, T. Q., Behrmann, J., Duvenaud, D., and Jacobsen, J. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 2019.
  • Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. CoRR, abs/1605.08803, 2016.
  • Durkan et al. (2019) Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Advances in Neural Information Processing Systems 32, pp. 7511–7522. 2019.
  • Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. volume 9 of Proceedings of Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. JMLR Workshop and Conference Proceedings.
  • Grathwohl et al. (2019) Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: free-form continuous dynamics for scalable reversible generative models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  • Gresele et al. (2020) Gresele, L., Fissore, G., Javaloy, A., Schölkopf, B., and Hyvärinen, A. Relative gradient optimization of the jacobian term in unsupervised deep learning. CoRR, abs/2006.15090, 2020.
  • Gritsenko et al. (2019) Gritsenko, A. A., Snoek, J., and Salimans, T. On the relationship between normalising flows and variational- and denoising autoencoders. In Workshop on Deep Generative Models for Highly Structured Data, at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  • Hinton et al. (2006) Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
  • Hoogeboom et al. (2019) Hoogeboom, E., van den Berg, R., and Welling, M. Emerging convolutions for generative normalizing flows. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2771–2780. PMLR, 2019.
  • Hoogeboom et al. (2020) Hoogeboom, E., Satorras, V. G., Tomczak, J. M., and Welling, M. The convolution exponential and generalized sylvester flows. CoRR, abs/2006.01910, 2020.
  • Huang et al. (2018) Huang, C., Krueger, D., Lacoste, A., and Courville, A. C. Neural autoregressive flows. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML, 2018.
  • Jaini et al. (2019) Jaini, P., Selby, K. A., and Yu, Y. Sum-of-squares polynomial flow. arXiv preprint arXiv:1905.02325, 2019.
  • Karami et al. (2019) Karami, M., Schuurmans, D., Sohl-Dickstein, J., Dinh, L., and Duckworth, D. Invertible convolutional flow. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, 2019.
  • Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.
  • Kingma & Dhariwal (2018) Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp.  10236–10245, 2018.
  • Kingma et al. (2016) Kingma, D. P., Salimans, T., Józefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving variational autoencoders with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, 2016.
  • Kosiorek et al. (2019) Kosiorek, A., Sabour, S., Teh, Y. W., and Hinton, G. E. Stacked capsule autoencoders. In Advances in Neural Information Processing Systems 32, pp. 15512–15522. 2019.
  • Krämer et al. (2020) Krämer, A., Köhler, J., and Noé, F. Training invertible linear layers through rank-one perturbations, 2020.
  • Lecun (1986) Lecun, Y. Learning processes in an asymmetric threshold network. In Disordered systems and biological organization, Les Houches, France, pp.  233–240. Springer-Verlag, 1986.
  • Lee et al. (2015) Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. In Proceedings of the 2015th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECMLPKDD’15, pp.  498–515, 2015.
  • Linnainmaa (1976) Linnainmaa, S. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976. ISSN 1572-9125. doi: 10.1007/BF01931367.
  • Löwe et al. (2019) Löwe, S., O‘Connor, P., and Veeling, B. Putting an end to end-to-end: Gradient-isolated learning of representations. In Advances in Neural Information Processing Systems 32, pp. 3039–3051. 2019.
  • Lu & Huang (2020) Lu, Y. and Huang, B. Woodbury transformations for deep generative flows. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
  • Magnus (2010) Magnus, J. R. On the concept of matrix derivative. Journal of Multivariate Analysis, 101(9):2200–2206, 2010.
  • Marzouk et al. (2016) Marzouk, Y., Moselhy, T., Parno, M., and Spantini, A. An introduction to sampling via measure transport. arXiv preprint arXiv:1602.05023, 2016.
  • Meulemans et al. (2020) Meulemans, A., Carzaniga, F. S., Suykens, J. A. K., Sacramento, J., and Grewe, B. F. A theoretical framework for target propagation, 2020.
  • Papamakarios et al. (2017) Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2017.
  • Papamakarios et al. (2019) Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference, 2019.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. 2019.
  • Perugachi-Diaz et al. (2020) Perugachi-Diaz, Y., Tomczak, J. M., and Bhulai, S. Invertible densenets. CoRR, abs/2010.02125, 2020.
  • Platt & Barr (1988) Platt, J. and Barr, A. Constrained differential optimization. In Neural Information Processing Systems, pp.  612–621. American Institute of Physics, 1988.
  • Rezende & Viola (2018) Rezende, D. J. and Viola, F. Taming vaes, 2018.
  • Rippel & Adams (2013) Rippel, O. and Adams, R. P. High-dimensional probability estimation with deep density models, 2013.
  • Rudin (1987) Rudin, W. Real and Complex Analysis. McGraw-Hill Book Company, 1987.
  • Rumelhart et al. (1986) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • Tabak & Turner (2013) Tabak, E. G. and Turner, C. V. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
  • van den Berg et al. (2018) van den Berg, R., Hasenclever, L., Tomczak, J. M., and Welling, M. Sylvester normalizing flows for variational inference. In Globerson, A. and Silva, R. (eds.), Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI, pp.  393–402. AUAI Press, 2018.
  • Vincent et al. (2008) Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pp.  1096–1103, New York, NY, USA, 2008. Association for Computing Machinery.
  • Werbos (1982) Werbos, P. J. Applications of advances in nonlinear sensitivity analysis. In Drenick, R. F. and Kozin, F. (eds.), System Modeling and Optimization, pp.  762–770, Berlin, Heidelberg, 1982. Springer Berlin Heidelberg. ISBN 978-3-540-39459-4.

Appendix A Appendix

A.1 Self Normalizing Fully Connected Gradients

In this section we provide an extended derivation of Equations 10 and 11, including the derivation of the reconstruction gradients. For the gradient with respect to ${\bm{W}}$, as given in Equation 10, we see that we only need to approximate the gradient of the Jacobian determinant term to achieve an efficient update, yielding:

$$\begin{aligned}
\frac{\partial}{\partial{\bm{W}}}\mathcal{L}({\bm{x}}) &\approx \frac{1}{2}\left(\bm{\delta}_{{\bm{z}}}^{f}{\bm{x}}^{T} + {\bm{R}}^{T}\right) - \lambda\frac{\partial}{\partial{\bm{W}}}\left\|{\bm{R}}{\bm{W}}{\bm{x}} - {\bm{x}}\right\|^{2}_{2} \\
&= \frac{1}{2}\left(\bm{\delta}_{{\bm{z}}}^{f}{\bm{x}}^{T} + {\bm{R}}^{T}\right) - 2\lambda{\bm{R}}^{T}({\bm{R}}{\bm{W}}{\bm{x}} - {\bm{x}}){\bm{x}}^{T} \qquad (17)
\end{aligned}$$

For the gradient with respect to ${\bm{R}}$, we start with the exact gradient as given in Equation 9:

$$\frac{\partial}{\partial{\bm{R}}}\mathcal{L}({\bm{x}}) = \frac{1}{2}\left(-{\bm{R}}^{-T}\bm{\delta}_{{\bm{z}}}^{g}{\bm{x}}^{T}{\bm{R}}^{-T} - {\bm{R}}^{-T}\right) - \lambda\frac{\partial}{\partial{\bm{R}}}\left\|{\bm{R}}{\bm{W}}{\bm{x}} - {\bm{x}}\right\|^{2}_{2} \qquad (18)$$

To find an efficient approximation of this update without having to compute matrix inverses, we note that in addition to substituting ${\bm{W}}$ in place of ${\bm{R}}^{-1}$, we also must approximate $\bm{\delta}_{{\bm{z}}}^{g}$ in order to avoid having to compute $\log p_{{\bm{Z}}}({\bm{R}}^{-1}{\bm{x}})$. We propose this can similarly be approximated as $\bm{\delta}_{{\bm{z}}}^{g}\approx\bm{\delta}_{{\bm{z}}}^{f} = \frac{\partial\log p_{{\bm{Z}}}({\bm{W}}{\bm{x}})}{\partial{\bm{W}}{\bm{x}}}$ under the same constraint that ${\bm{R}}^{-1}\approx{\bm{W}}$. Together this yields:

$$\frac{\partial}{\partial{\bm{R}}}\mathcal{L}({\bm{x}}) \approx \frac{1}{2}\left(-{\bm{W}}^{T}\bm{\delta}_{{\bm{z}}}^{f}({\bm{W}}{\bm{x}})^{T} - {\bm{W}}^{T}\right) - 2\lambda({\bm{R}}{\bm{W}}{\bm{x}} - {\bm{x}})({\bm{W}}{\bm{x}})^{T} \qquad (19)$$

Conveniently, ${\bm{W}}^{T}\bm{\delta}^{f}_{{\bm{z}}} = \frac{\partial\log p_{{\bm{Z}}}({\bm{W}}{\bm{x}})}{\partial{\bm{x}}}$ is then the delta from the output backpropagated to the input of the layer, and is computed by standard backpropagation. We call this term $\bm{\delta}^{f}_{{\bm{x}}}$, giving a simplified gradient:

$$\frac{\partial}{\partial{\bm{R}}}\mathcal{L}({\bm{x}}) \approx \frac{1}{2}\left(-\bm{\delta}^{f}_{{\bm{x}}}{\bm{z}}^{T} - {\bm{W}}^{T}\right) - 2\lambda({\bm{R}}{\bm{W}}{\bm{x}} - {\bm{x}}){\bm{z}}^{T} \qquad (20)$$

We then see that both Equations 17 and 20 require no matrix inverses, and furthermore require no additional terms beyond those computed by standard backpropagation.

A.2 Self Normalizing Convolution Gradients

In this section we provide a detailed derivation of the approximate gradient of the log Jacobian determinant term for convolutional layers. From Equations 14 and 16, we see that after substitution of the approximate inverse we have:

$$\frac{\partial}{\partial{\bm{w}}}\log p^{f}_{{\bm{X}}}({\bm{x}}) \approx \bm{\delta}^{f}_{{\bm{z}}}\star{\bm{x}} + \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{w}}))^{T}}{\partial{\bm{w}}}\left(\mathrm{vec}\,\mathcal{T}({\bm{r}})^{T}\right) \qquad (21)$$
$$\frac{\partial}{\partial{\bm{r}}}\log p^{g}_{{\bm{X}}}({\bm{x}}) \approx -\bm{\delta}^{f}_{{\bm{x}}}\star{\bm{z}} - \frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{r}}))^{T}}{\partial{\bm{r}}}\left(\mathrm{vec}\,\mathcal{T}({\bm{w}})^{T}\right) \qquad (22)$$

Focusing on the second term, we consider what $\frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{k}}))^{T}}{\partial{\bm{k}}}\left(\mathrm{vec}\,\mathcal{T}({\bm{p}})^{T}\right)$ is for arbitrary kernels ${\bm{k}}, {\bm{p}}\in\mathbb{R}^{m}$ and matrices $\mathcal{T}({\bm{k}}), \mathcal{T}({\bm{p}})\in\mathbb{R}^{D\times D}$. First, we see that $\frac{\partial(\mathrm{vec}\,\mathcal{T}({\bm{k}}))^{T}}{\partial{\bm{k}}}\in\mathbb{R}^{m\times D^{2}}$, meaning that it is a rectangular matrix where each row corresponds to one dimension of the kernel, and each column corresponds to one element of the matrix $\mathcal{T}({\bm{k}})$. Writing out the partial derivative element-wise, we get:

[(vec𝒯(𝒌))T𝒌]i,j=[vec𝒯(𝒌)]jki\displaystyle\left[\frac{\partial(\mathrm{vec}\ \mathcal{T}({\bm{k}}))^{T}}{\partial{\bm{k}}}\right]_{i,j}=\frac{\partial\left[\mathrm{vec}\ \mathcal{T}({\bm{k}})\right]_{j}}{\partial k_{i}} (23)

From this it is clear that the entry in the ithi^{th} row and jthj^{th} column will be 11 if [vec𝒯(𝒌)]j=ki\left[\mathrm{vec}\ \mathcal{T}({\bm{k}})\right]_{j}=k_{i} and 0 otherwise. Intuitively, given the definition of 𝒯(𝒌)\mathcal{T}({\bm{k}}), this corresponds to an indicator matrix, indicating whether kernel element ii is shared at location jj of the vectorized convolution matrix 𝒯(𝒌)\mathcal{T}({\bm{k}}).

Given this intuition, we can see that the inner product of the ithi^{th} row of this matrix with the same vectorized convolution matrix vec𝒯(𝒌)\mathrm{vec}\ \mathcal{T}({\bm{k}}) will yield a sum over the elements of 𝒯(𝒌)\mathcal{T}({\bm{k}}) which are equal to kik_{i}. Formally:

(vec𝒯(𝒌))Tkivec𝒯(𝒌)\displaystyle\frac{\partial(\mathrm{vec}\ \mathcal{T}({\bm{k}}))^{T}}{\partial k_{i}}\mathrm{vec}\ \mathcal{T}({\bm{k}}) =jD×Dki𝟙[[vec𝒯(𝒌)]j=ki]\displaystyle=\sum_{j}^{D\times D}k_{i}*\mathbbm{1}\left[\left[\mathrm{vec}\ \mathcal{T}({\bm{k}})\right]_{j}=k_{i}\right]
=miki\displaystyle=m_{i}*k_{i} (24)

where mim_{i} is the number of times the kernel element kik_{i} is shared across the matrix 𝒯(𝒌)\mathcal{T}({\bm{k}}).

Similarly, for the inner product between a row of this indicator matrix and a vectorized convolution matrix formed from a different kernel (i.e. 𝒯(𝒑)\mathcal{T}({\bm{p}})), we see that the result is again the corresponding element of the new kernel scaled by the same multiplicity:

(vec𝒯(𝒌))Tkivec𝒯(𝒑)\displaystyle\frac{\partial(\mathrm{vec}\ \mathcal{T}({\bm{k}}))^{T}}{\partial k_{i}}\mathrm{vec}\ \mathcal{T}({\bm{p}}) =jD×Dpi𝟙[[vec𝒯(𝒌)]j=ki]\displaystyle=\sum_{j}^{D\times D}p_{i}*\mathbbm{1}\left[\left[\mathrm{vec}\ \mathcal{T}({\bm{k}})\right]_{j}=k_{i}\right]
=mipi\displaystyle=m_{i}*p_{i} (25)

This can be understood to be due to the fact that the mapping 𝒯()\mathcal{T}(\cdot) which generates the weight sharing structure is the same for both the partial derivative matrix and the vectorized convolution matrix.
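As a sanity check, the following self-contained NumPy sketch (our own construction, assuming a 1-D convolution with kernel size 3, stride 1, and zero padding of 1) verifies the identity of Equation 25 numerically.

```python
import numpy as np

def conv_matrix(k, D):
    """Dense matrix T(k) of a 1-D convolution (kernel size 3, zero padding 1,
    stride 1) acting on signals of length D, plus the multiplicity of each tap."""
    T = np.zeros((D, D))
    counts = np.zeros(len(k))             # m_i: how often tap i appears in T(k)
    for i in range(D):                    # output position
        for j, kj in enumerate(k):        # kernel tap
            col = i + j - 1               # input position (padding = 1)
            if 0 <= col < D:
                T[i, col] = kj
                counts[j] += 1
    return T, counts

k = np.array([0.3, -1.2, 0.7])
p = np.array([2.0,  0.5, -0.4])
T_k, m = conv_matrix(k, D=5)
T_p, _ = conv_matrix(p, D=5)

# d vec(T(k))^T / dk_i is an indicator over the positions where tap i appears,
# so its product with vec T(p) equals m_i * p_i (Equation 25).
for i in range(len(k)):
    indicator = conv_matrix(np.eye(3)[i], D=5)[0].flatten()
    assert np.isclose(indicator @ T_p.flatten(), m[i] * p[i])
```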

Computing the Transposed Convolution Kernel flip()\mathrm{flip}(\cdot)

In Equations 21 and 22 we see the partial derivative matrix is multiplied with a transposed convolution matrix. For the convolution operations proposed here, the transpose convolution matrix can also be written as a standard convolution matrix with a transformed kernel. We denote this transformation as flip()\mathrm{flip(\cdot)} such that 𝒯(flip(𝒌))=𝒯(𝒌)T\mathcal{T}(\mathrm{flip}({\bm{k}}))=\mathcal{T}({\bm{k}})^{T}, or equivalently flip(𝒌)=𝒯1(𝒯(𝒌)T)\mathrm{flip}({\bm{k}})=\mathcal{T}^{-1}(\mathcal{T}({\bm{k}})^{T}). This transformation can easily be implemented in deep learning frameworks through index manipulation of the kernel. Informally, this operation is achieved by swapping the input and output axes, and mirroring the spatial dimensions. Explicitly, given a four dimensional kernel 𝒌O×I×H×W{\bm{k}}\in\mathbb{R}^{O\times I\times H\times W} where O,I,H,WO,I,H,W are the number of output channels, number of input channels, kernel height, and kernel width respectively, the flip operation can be defined as:

flip(𝒌)o,i,h,w=ki,o,Hh,Ww\displaystyle\mathrm{flip}({\bm{k}})_{o,i,h,w}=k_{i,o,H-h,W-w} (26)
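As an illustration, a minimal PyTorch version of this index manipulation (our own helper, assuming a kernel stored with shape (O, I, H, W)) could be written as:

```python
import torch

def flip(k: torch.Tensor) -> torch.Tensor:
    # k has shape (O, I, H, W): swap the input/output channel axes and
    # mirror both spatial axes, per Equation 26.
    return torch.flip(k.transpose(0, 1), dims=(-2, -1))
```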

Computing the Multiple 𝒎{\bm{m}}

The constant 𝒎{\bm{m}}, which has the same shape as the kernel and enters the gradient through an element-wise multiplication, is given by the number of times each element kik_{i} of the kernel 𝒌{\bm{k}} is present in the matrix 𝒯(𝒌)\mathcal{T}({\bm{k}}). It can easily be computed as a convolution of two images filled entirely with ones, the first with the shape of the output, and the second with the shape of the input. Using syntax from the PyTorch framework, we can write this as:

𝒎=ones_like(𝒛)ones_like(𝒙){\bm{m}}=\mathrm{ones\_like}({\bm{z}})\star\mathrm{ones\_like}({\bm{x}}) (27)

Note this convolution must be performed with the same parameters as the main convolution (e.g. padding, stride, grouping, dilation, etc.).
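For example, the same quantity can be obtained with automatic differentiation as the weight-gradient of a convolution whose input and incoming output gradients are all ones. The sketch below assumes a 3×3 convolution with zero padding of 1, and the helper name is ours:

```python
import torch
import torch.nn.functional as F

def count_multiplicity(kernel_shape, input_shape, padding=1):
    # m has the same shape as the kernel; entry i counts how often kernel
    # element k_i appears in the convolution matrix T(k).
    w_ones = torch.ones(kernel_shape, requires_grad=True)
    x_ones = torch.ones(input_shape)
    z = F.conv2d(x_ones, w_ones, padding=padding)   # same parameters as the main layer
    z.backward(torch.ones_like(z))                  # all-ones output gradients
    return w_ones.grad

m = count_multiplicity(kernel_shape=(16, 16, 3, 3), input_shape=(1, 16, 28, 28))
```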

Combining the multiplicity result above with the flip\mathrm{flip} operation, for all kernel elements ii, we arrive at Equations 15 and 16.

A.3 Experiment Details

All code for this paper can be found at the following repository: https://github.com/AKAndykeller/SelfNormalizingFlows

Training & Evaluation Details

In Tables 1 and 2, all log-likelihood (in nats) and bits per dimension values are computed using the exact log Jacobian determinant of the full transformation. They are reported on a held-out test set, using the saved model parameters from the best epoch as determined by performance on a separate validation set. Each value is given as a mean ±\pm standard deviation as computed over 3 runs with different random initializations.

For MNIST, the first 50,000 training images were used for training, and the last 10,000 were used for validation. The plots in Figure 5 show the negative log-likelihood on the MNIST validation set for the 2-layer fully connected (FC) models.

For CIFAR-10, the first 40,000 training images were used for training and the last 10,000 were used as a held out validation set. Data augmentation including random horizontal flipping with probability 0.5 and random jitter by 1 pixel was performed to prevent overfitting.

For ImageNet 32x32, a random subset of 20,000 images was used for validation, and the remaining 1,261,149 images were used for training. The dataset was constructed using the same methodology as Kingma & Dhariwal (2018), and can be downloaded from http://image-net.org/small/download.php. The values reported in Table 2 were computed on the provided 50,000-image test set. No data augmentation was performed.

Optimization Details

All fully connected (FC) models were trained for 6000 epochs using the Adam optimizer (Kingma & Ba, 2014) with a batch size of 100, a learning rate of 1×1041\times 10^{-4}, β1=0.9,β2=0.999\beta_{1}=0.9,\beta_{2}=0.999, and reconstruction weight λ=1\lambda=1. These parameters were chosen to match the training methodology of Gresele et al. (2020). The loss was computed as the average over the batch. All convolutional (Conv.) models were trained for 1000 epochs with the same optimization parameters, but with a learning rate of 1×1031\times 10^{-3} since faster convergence was observed.

The MNIST Glow models were trained for 250 epochs using the same optimizer settings as the Conv. models. We experimented with values of λ\lambda in the set {1,10,100,1000}\{1,10,100,1000\}, and chose λ=100\lambda=100 based on the best validation performance.

The CIFAR-10 Glow models were trained for 1000 epochs with the same optimizer settings, but with λ=1000\lambda=1000. Additionally, the norm of the gradients was clipped at 10,00010,000 for the self normalizing CIFAR-10 models for improved stability during training.

The ImageNet 32x32 Glow models were trained for 15 epochs, with the same optimization parameters and again with λ=1000\lambda=1000. The norm of the gradients was clipped at 10,00010,000 for all ImageNet models.

All models except for the ImageNet models were trained using a learning-rate warm-up schedule where the learning rate is linearly increased from 0 to its full value over the course of the first 10 epochs.
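For reference, a minimal sketch of such a warm-up schedule in PyTorch is given below; the Linear module is only a stand-in for the actual flow model, and the loop body is a placeholder:

```python
import torch

model = torch.nn.Linear(784, 784)   # stand-in for the actual flow model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Scale the learning rate linearly from 0 to its full value over the first
# 10 epochs, then keep it constant.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, epoch / 10.0))

for epoch in range(20):
    # ... one training epoch would go here ...
    optimizer.step()                # placeholder for the per-batch updates
    scheduler.step()
```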

Discrepancies with Published Results

We note that our value for the Relative Gradient model (Gresele et al., 2020) differs from the published result of 1375.2±1.4-1375.2\pm 1.4. We found experimentally that, when using the same settings as published in (Gresele et al., 2020), our re-implementation achieved approximately 1102-1102. We found the discrepancy to be due almost exactly to the log Jacobian determinant of the data-preprocessing steps (such as dequantization, normalization, and the logit transform), which we measure to sum to 272.7±0.3272.7\pm 0.3. We discussed this with the authors of (Gresele et al., 2020) and concluded that they likely did not include the log Jacobian determinant of these steps in their reported values. We further note that the numbers in Table 1 use slightly different parameters than (Gresele et al., 2020), such as α=0.3\alpha=0.3 in the activation function, a batch size of 100, and a significantly longer training duration.

We additionally see that our values in Table 2 are slightly worse than those reported in (Kingma & Dhariwal, 2018) (i.e. 3.36 vs. 3.35 on CIFAR-10, and 4.12 vs. 4.09 on ImageNet 32x32). We hypothesize that this is likely due to our use of an explicit validation set, which reduces the effective size of the training set; to the best of our knowledge, such a held-out split was not used in (Kingma & Dhariwal, 2018). Additional factors could be a shorter training duration, the lack of learning rate warm-up (for ImageNet), or the imposed gradient clipping.

Timing details

Figure 4 was created by running a single fully connected layer (with no activation function) on a machine with an NVIDIA GeForce 1080Ti GPU and Intel Xeon E5-2630 v3 CPU. Each datapoint was computed by taking the mean and standard deviation of the time required per batch over 4,000 batches, with a batch size of 128, on synthetically generated random data at integer multiples of 96 dimensions starting at 32.

The values reported for MNIST in Table 3 were computed on the same machine with an NVIDIA GeForce 1080Ti GPU and Intel Xeon E5-2630 v3 CPU, with a batch size of 100. The values for CIFAR-10 and ImageNet were computed on a machine with an NVIDIA Titan X GPU and Intel Xeon E5-2640 v4 CPU, with batch sizes of 100 and 64 respectively. The discrepancy between training and sampling time for the FC models is due to the iterative optimization required to invert the Smooth Leaky ReLU activation. Figure 5 was created using the time results from Table 3. For all times reported, the times of the first and last 100 batches per epoch were ignored to reduce variance.

Architectures

All models were trained using data pre-processed in the same manner as (Gresele et al., 2020; Papamakarios et al., 2017; Dinh et al., 2016). This includes uniform dequantization, normalization, and a logit transformation. We additionally use a standard Gaussian as our base distribution p𝒁p_{{\bm{Z}}} for all models.
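For concreteness, a sketch of one standard version of this preprocessing and its per-example log Jacobian determinant is given below; the constant ALPHA and other details are illustrative assumptions and may differ from the released implementation:

```python
import torch

ALPHA = 1e-6   # logit-transform constant; illustrative value only

def preprocess(x_int: torch.Tensor):
    """Dequantize, rescale, and logit-transform 8-bit images.

    Returns the transformed tensor and the log Jacobian determinant of the
    preprocessing for each example in the batch.
    """
    D = x_int[0].numel()
    x = (x_int.float() + torch.rand_like(x_int.float())) / 256.0   # uniform dequantization
    x = ALPHA + (1.0 - 2.0 * ALPHA) * x                            # keep values away from {0, 1}
    z = torch.log(x) - torch.log1p(-x)                             # logit transform
    logdet = (-D * torch.log(torch.tensor(256.0))
              + D * torch.log(torch.tensor(1.0 - 2.0 * ALPHA))
              - (torch.log(x) + torch.log1p(-x)).flatten(1).sum(-1))
    return z, logdet
```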

All 2-layer fully connected (FC) models use the Smooth Leaky ReLU (with α=0.3\alpha=0.3) activation (as in (Gresele et al., 2020)). Weights of the forward model (𝑾{\bm{W}}’s) are initialized to identity plus noise drawn from a Xavier Normal (Glorot & Bengio, 2010) with gain 0.010.01. Weights of the inverse model (𝑹{\bm{R}}’s) are initialized to the transpose of the forward weights.
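A minimal sketch of this initialization (with our own variable names) is:

```python
import torch

D = 784   # input dimensionality for MNIST
W = torch.empty(D, D)
torch.nn.init.xavier_normal_(W, gain=0.01)   # small Xavier-normal noise
W += torch.eye(D)                            # identity plus noise
R = W.t().clone()                            # inverse weights start at W^T
```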

All 9-layer convolutional models are trained with spline (Durkan et al., 2019) activations with individual parameters per pixel and 5 knots, kernels of size (3×3)(3\times 3), and zero-padding of 1 on all sides. The convolutional models are additionally divided into three blocks of 3 layers each, with 2 ‘squeeze’ layers in between the blocks. The squeeze layers move feature map activations from the spatial dimensions into the channel dimension, halving each spatial dimension and increasing the number of channels by a factor of 4 (as in (Hoogeboom et al., 2019)). Weights of the forward model (𝒘{\bm{w}}’s) are initialized with the Dirac delta function (preserving the identity of the inputs) plus noise drawn from a Xavier Normal (Glorot & Bengio, 2010) with gain 0.010.01. Weights of the inverse model (𝒓{\bm{r}}’s) are initialized to flip(𝒘)\mathrm{flip}({\bm{w}}).

The Glow models for MNIST were constructed of L=2L=2 blocks of K=16K=16 steps each (as specified in (Kingma & Dhariwal, 2018)), where each block is composed of a squeeze layer and K-steps of flow. A split layer is placed between the two blocks. Each step of flow is composed of an act-norm layer, a (1×1)(1\times 1) convolution, and an affine coupling layer. The coupling layers are constructed as in (Kingma & Dhariwal, 2018). All convolutional weights were initialized to random orthogonal matrices. For CIFAR-10 and ImageNet 32x32, the Glow models were composed of L=3,K=32L=3,K=32 and L=3,K=48L=3,K=48 respectively, matching those in (Kingma & Dhariwal, 2018).

A.4 Proposed Model Extensions

Asymmetric Convolutions

As mentioned in Section 6, the inverse of the forward function may not always be given by a function of the same class (e.g. for convolutional layers). To partially alleviate this constraint, we propose that the forward and inverse functions may be asymmetric, and derive a simple case of this below for convolutions with different kernel sizes (but identical output sizes). Initial experiments with such an asymmetric model (i.e. 3×\times3 conv. ff with 7×\times7 conv. gg) have shown promising results – improving over models with 3×\times3 convolutions in both directions.

For a function ff with a (column vector) kernel 𝒘m{\bm{w}}\in\mathbb{R}^{m}, and an inverse gg with a larger kernel 𝒓n{\bm{r}}\in\mathbb{R}^{n} (n>mn>m), the approximate gradient with respect to 𝒘{\bm{w}} can be obtained from the central elements of 𝒓{\bm{r}}, and similarly the gradient for 𝒓{\bm{r}} is given by a zero-padded version of 𝒘{\bm{w}}. Formally, we can write the central-indexing and padding operations as multiplication by the rectangular matrices 𝑷𝒓𝒘{\bm{P}}_{{\bm{r}}\rightarrow{\bm{w}}} and 𝑷𝒘𝒓{\bm{P}}_{{\bm{w}}\rightarrow{\bm{r}}} respectively.

𝒘logp𝑿f(𝒙)𝜹𝒛f𝒙+flip(𝑷𝒓𝒘𝒓)𝒎\begin{split}\frac{\partial}{\partial{\bm{w}}}\log p^{f}_{{\bm{X}}}({\bm{x}})\approx\bm{\delta}^{f}_{{\bm{z}}}\star{\bm{x}}+\mathrm{flip}({\bm{P}}_{{\bm{r}}\rightarrow{\bm{w}}}{\bm{r}})\odot{\bm{m}}\\ \end{split} (28)
𝒓logp𝑿g(𝒙)𝜹𝒙f𝒛flip(𝑷𝒘𝒓𝒘)𝒎\begin{split}\frac{\partial}{\partial{\bm{r}}}\log p^{g}_{{\bm{X}}}({\bm{x}})&\approx-\bm{\delta}^{f}_{{\bm{x}}}\star{\bm{z}}-\mathrm{flip}({\bm{P}}_{{\bm{w}}\rightarrow{\bm{r}}}{\bm{w}})\odot{\bm{m}}\end{split} (29)

Where 𝑷𝒓𝒘{\bm{P}}_{{\bm{r}}\rightarrow{\bm{w}}} and 𝑷𝒘𝒓{\bm{P}}_{{\bm{w}}\rightarrow{\bm{r}}} are defined as:

𝑷𝒓𝒘=[𝟎𝐈m𝟎]&𝑷𝒘𝒓=[𝟎𝐈m𝟎]{\bm{P}}_{{\bm{r}}\rightarrow{\bm{w}}}=\left[\begin{matrix}\mathbf{0}&\mathbf{I}_{m}&\mathbf{0}\end{matrix}\right]\ \ \ \ \ \&\ \ \ \ \ \ {\bm{P}}_{{\bm{w}}\rightarrow{\bm{r}}}=\left[\begin{matrix}\mathbf{0}\\ \mathbf{I}_{m}\\ \mathbf{0}\end{matrix}\right] (30)

Where 𝐈D\mathbf{I}_{D} refers to a D×DD\times D identity matrix, and 𝟎\mathbf{0} is a matrix of zeros such that the dimensions of 𝑷𝒓𝒘{\bm{P}}_{{\bm{r}}\rightarrow{\bm{w}}} are m×nm\times n and those of 𝑷𝒘𝒓{\bm{P}}_{{\bm{w}}\rightarrow{\bm{r}}} are n×mn\times m.
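In practice these rectangular maps need not be materialized explicitly; for square spatial kernels they amount to a center crop and a symmetric zero pad. A sketch with hypothetical helper names:

```python
import torch
import torch.nn.functional as F

def center_crop(r: torch.Tensor, size: int) -> torch.Tensor:
    # P_{r->w}: keep the central (size x size) window of the larger kernel
    start = (r.shape[-1] - size) // 2
    return r[..., start:start + size, start:start + size]

def center_pad(w: torch.Tensor, size: int) -> torch.Tensor:
    # P_{w->r}: embed the smaller kernel at the center of a (size x size) kernel
    pad = (size - w.shape[-1]) // 2
    return F.pad(w, (pad, pad, pad, pad))

r = torch.randn(8, 8, 7, 7)   # e.g. 7x7 inverse kernel
w = torch.randn(8, 8, 3, 3)   # e.g. 3x3 forward kernel
assert center_crop(r, 3).shape == w.shape
assert center_pad(w, 7).shape == r.shape
```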

Jacobian Vector Product Inverse Constraint

As noted in Section 3, the approximations made assume that the Jacobians of the functions ff and gg are approximately inverses in addition to the functions themselves being approximate inverses. As stated, for the models presented in this work, this property is obtained for free since the Jacobian of a linear mapping is the matrix representation of the map itself. However, since this property may not hold in general, we propose the following additional constraint could be added to the objective:

JVP(f,g)=𝔼ν𝒰(0,1)[𝑱g𝑱fνν22]\mathcal{E}_{JVP}(f,g)=-\mathbb{E}_{\mathbf{\nu}\sim\mathcal{U}(0,1)}\left[||{\bm{J}}_{g}{\bm{J}}_{f}\mathbf{\nu}-\mathbf{\nu}||^{2}_{2}\right] (31)

where 𝑱g,𝑱f{\bm{J}}_{g},{\bm{J}}_{f} are the Jacobians of gg and ff respectively, evaluated at 𝒛{\bm{z}} and 𝒙{\bm{x}} respectively. The expectation can additionally be approximated by Monte Carlo methods through a finite number of samples.

We see that the magnitude of this penalty reaches zero exactly when the Jacobians are exact inverses. However, it remains unclear from which distribution the points ν\mathbf{\nu} should be sampled to achieve the best approximate inverse. Since Jacobian vector products are efficiently implemented in most deep learning frameworks (and are thus much faster than naive matrix-matrix multiplication), this constraint could be added to the overall objective while still avoiding 𝒪(D3)\mathcal{O}(D^{3}) computational complexity.
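A single-sample Monte Carlo estimate of this penalty can be written with the Jacobian-vector-product utility in PyTorch; the sketch below uses placeholder linear layers for f and g and reports the (unsigned) squared error:

```python
import torch
from torch.autograd.functional import jvp

f = torch.nn.Linear(10, 10)   # placeholder forward map
g = torch.nn.Linear(10, 10)   # placeholder (approximate) inverse map

x = torch.randn(10)
v = torch.rand(10)                        # nu sampled uniformly, as in the text
_, Jf_v = jvp(f, (x,), (v,))              # J_f(x) v
_, JgJf_v = jvp(g, (f(x),), (Jf_v,))      # J_g(z) J_f(x) v, with z = f(x)
penalty = ((JgJf_v - v) ** 2).sum()       # ||J_g J_f nu - nu||^2
# (pass create_graph=True to jvp if the penalty should be backpropagated)
```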

Variational Interpretation

One limitation of the self normalizing flow framework becomes apparent mainly in the presence of data-dependent transformations, as would be contained in the general framework outlined in Section 3. Specifically, in this setting the invertibility of the transformation is more difficult to guarantee, since the value of the Jacobian determinant is then a function of the input, and therefore ensuring invertibility is not as simple as verifying non-zero determinants on the training data.

To alleviate this difficulty, we believe a variational interpretation of the self normalizing framework, as afforded by Gritsenko et al. (2019), would relax the global invertibility constraint to a local invertibility constraint. To achieve this, the encoder and decoder become stochastic with a very small noise level σ2\sigma^{2}, and in the limit of σ20\sigma^{2}\rightarrow 0, the variational lower bound becomes equal to the change of variables formula. In such a model, the inverse approximation error would contribute to the decoder likelihood directly, and the ‘self normalizing’ inverse approximation would become useful in computing the gradient of the entropy of the encoder. As the authors note, such an interpretation allows for non-invertible encoders, and we thus believe it is a fruitful direction for future research.

A.5 Extended Results

Choice of λ\lambda

As noted in the discussion, we observe that final likelihood values are only marginally impacted by the choice of λ\lambda. To quantitatively demonstrate this, we present the performance for different reconstruction weights λ\lambda in Table 4 below. As can be seen, for values of λ\lambda greater than a minimum threshold, the likelihood performance decreases slightly as λ\lambda increases. For values of λ\lambda which are too small, training is initially stable but eventually diverges due to numerical instability (denoted ‘-’).

Table 4: NLL in nats on MNIST for SNF models with different values of λ\lambda. ‘-’ implies the model training diverged.
Model        | λ ≤ 0.1 | λ = 1.0 | λ = 10.0 | λ = 100.0 | λ = 1000.0
2L FC        |    -    |  946.3  |   954.6  |   1007.4  |   1039.8
9L Conv.     |    -    |  639.6  |   642.9  |    646.3  |    660.6
2L-16K Glow  |    -    |    -    |     -    |    574.1  |    574.8

Improved Constrained Optimization

In this work, to enforce the approximate-inverse constraint f1gf^{-1}\approx g, we make use of a penalty method with parameter λ\lambda (the ‘reconstruction weight’). The downside of this method is that it requires manual tuning of λ\lambda, which can lead to sub-optimal local minima (as seen above). A powerful alternative to the basic penalty method is the method of Lagrange multipliers, whereby λ\lambda is simultaneously optimized with the model parameters by a min-max optimization scheme. One related implementation of such a method is given by the GECO algorithm from Rezende & Viola (2018). In brief, the update equation of the algorithm is given by λtλt1exp(Ct)\lambda^{t}\leftarrow\lambda^{t-1}\mathrm{exp}(\propto C^{t}) for each iteration tt, where CtC^{t} is derived from an exponential moving average of the constraint (i.e. the reconstruction loss). In initial experiments, we have found that such a method works well when combined with the self normalizing framework, reducing the need for tuning and enabling the stable training of more complex functions (see Table 5 below). The implementation of this method is additionally provided in the code repository.

Table 5: NLL in nats on MNIST comparing the impact of larger kernel sizes incorporated into the Glow framework (2L-4K, width=16). SNF models are trained with the method of Lagrange multipliers. Mean ±\pm std over 3 random initializations.
Model        | Glow 1x1    | SNF 1x1     | SNF 5x5
−log p_X(x)  | 678.3 ± 2.0 | 678.3 ± 9.7 | 669.9 ± 2.8
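For reference, a simplified sketch of the multiplier update described above is given below, with a dummy training step and illustrative hyperparameter values standing in for the real training loop:

```python
import math
import random

def train_step(lam):
    # placeholder for one optimization step; returns the reconstruction loss
    return random.uniform(0.0, 2e-3)

lam, ema = 1.0, None
alpha, kappa, rate = 0.99, 1e-3, 0.01   # EMA decay, tolerance, update rate
for step in range(1000):
    recon = train_step(lam)
    ema = recon if ema is None else alpha * ema + (1.0 - alpha) * recon
    C = ema - kappa                     # constraint violation C^t
    lam *= math.exp(rate * C)           # lambda^t <- lambda^{t-1} exp(prop. C^t)
```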

A.6 Novelty Comparison

We provide Table 6 (adapted from (Behrmann et al., 2019)) to facilitate comparison of the self normalizing framework with existing normalizing flow methods.

Table 6: High level comparison of Self Normalizing Flows (SNF) with existing normalizing flow frameworks.
Method NICE/ i-RevNet Real-NVP Glow FFJORD i-ResNet SNF
Free-form
Analytic Forward
Analytic Inverse
Non-volume Preserving
Exact Likelihood
Unbiased Stochastic Log-Det Estimator N/A N/A N/A
Unconstrained Lipschitz Constant

A.7 Acknowledgements

We would like to thank the creators of Weights & Biases (Biewald, 2020) and PyTorch (Paszke et al., 2019). Without these tools our work would not have been possible. We thank the Bosch Center for Artificial Intelligence for funding, and Anna Khoreva and Karen Ullrich for guidance. Finally, we thank the reviewers for their proposed extensions and constructive comments.