
Improved Training of Sparse Coding Variational Autoencoder via Weight Normalization

Linxing Preston Jiang, Luciano de la Iglesia
Paul G. Allen School of Computer Science & Engineering
University of Washington
{prestonj,lucianod}@cs.washington.edu
Abstract

Learning a generative model of visual information with sparse and compositional features has been a challenge for both the theoretical neuroscience and machine learning communities. Sparse coding models have achieved great success in explaining the receptive fields of mammalian primary visual cortex with sparsely activated latent representations. In this paper, we focus on a recently proposed model, the sparse coding variational autoencoder (SVAE) (Barello et al., 2018), and show that the end-to-end training scheme of SVAE leaves a large group of decoding filters not fully optimized, with noise-like receptive fields. We propose a few heuristics to improve the training of SVAE and show that a unit $L_2$ norm constraint on the decoder is critical for producing sparse coding filters. Such normalization can be considered as local lateral inhibition in the cortex. We verify this claim empirically on both natural image patches and the MNIST dataset and show that projecting the filters onto the unit norm drastically increases the number of active filters. Our results highlight the importance of weight normalization for learning sparse representations from data and suggest a new way of reducing the number of inactive latent components in VAE learning.

1 Introduction

One key challenge in theoretical neuroscience is to understand the computation carried out along the visual pathways. Hubel and Wiesel first showed that neurons in the mammalian primary visual cortex (V1) have spatially localized and orientation-selective receptive fields [1] that interestingly resemble edge detectors or “parts” of objects. This connection between visual neurons’ neurophysiological properties and the statistics of the environment was successfully explored by Olshausen & Field [2]. In their model, they proposed that the primary visual cortex learns a generative model of the visual world with sparsely activated neural activities. Remarkably, such a model produces Gabor-like spatial filters, resembling edge detectors, that are similar to V1 receptive fields. This is known as the “sparse coding” model of V1.

Sparse coding (or sparse dictionary learning) has been extensively studied in the machine learning community as an unsupervised generative model of images [3, 4, 5]. It has also been shown to be robust to adversarial attacks [6, 7]. More recently, Barello et al. [8] proposed a sparse coding variational autoencoder (SVAE) model that combines a variational autoencoder (VAE) [9] with a sparse coding decoder for learning sparse structure in data. Compared to the original sparse coding model, SVAE shows better reconstruction performance and allows stochastic latent representations, which is more neurally plausible than the deterministic maximum a posteriori (MAP) estimate used in traditional sparse coding.

Our project focused on improving the quality of the learned decoder in SVAE. We show empirically that the formulation and training of SVAE leave a large portion of the decoder filters unoptimized, commonly known as the “over-pruning” problem in VAEs [10, 11, 12]. We propose three heuristics to improve the training of SVAE. First, we weighed the Kullback–Leibler (KL) divergence term in the loss function by $\beta$, a technique proposed by the $\beta$-VAE model [13]. Using a $\beta$ less than 1 smooths the effect of the sparsity prior, which otherwise places large gradients on only a few filters. Second, we used a more expressive encoder architecture with ResNet blocks [14] to replace the linear layers in the original SVAE for better posterior approximation. Most importantly, we applied the same projected gradient descent step used in the original sparse coding model to constrain the decoder filters to unit length. This constraint drastically increases the number of filters that are optimized and that have receptive fields of similar quality to the sparse coding model. We validate our claims by comparing the performance of the original SVAE and our training procedure on natural image patches. SVAE trained with our approach shows reconstruction error similar to the original training and produces qualitatively better filters that resemble parts of images. We further show that the unit-length constraint on the filters (which can be viewed as lateral inhibition in the cortex) is critical for the formation of Gabor-like filters on both natural images and the MNIST dataset.

2 Background

We introduce the formulation of sparse coding and sparse coding variational autoencoders (SVAEs) in this section.

2.1 Sparse coding

The sparse coding model minimizes the following energy function

$$\min_{\mathbf{U},\,\mathbf{z}} E = \|\mathbf{x}-\mathbf{U}\mathbf{z}\|_{2}^{2}+\lambda\|\mathbf{z}\|_{1}
\qquad \text{s.t. } \|\mathbf{U}_{i}\|_{2}\leq 1 \quad \forall i=1,2,\dots,N$$

where $\mathbf{x}\in\mathbb{R}^{D}$ denotes the input, $\mathbf{U}\in\mathbb{R}^{D\times N}$ represents the receptive fields (RFs) or filters of the model, and $\mathbf{z}\in\mathbb{R}^{N}$ represents the neural activations (latent variables). The $L_1$ penalty on $\mathbf{z}$ is a relaxation of the $L_0$ penalty that promotes sparsity in $\mathbf{z}$ (only a small subset of the components is nonzero). $\lambda$ is a scalar that controls the degree of the sparsity penalty. In addition, each filter (column of $\mathbf{U}$) is constrained to have unit $L_2$ norm to prevent a few filters with large weights from dominating image reconstruction. We will show in later sections that this is a key constraint for promoting filter quality and increasing the number of active latent variables.

Inference

A common practice is to use proximal gradient descent rather than vanilla gradient descent for faster convergence of the latent code. We use the iterative shrinkage-thresholding algorithm (ISTA) [15], which takes a shrinkage step after each gradient update. The gradient update is defined as:

$$\frac{\partial E}{\partial\mathbf{z}} = -2\,\mathbf{U}^{\intercal}(\mathbf{x}-\mathbf{U}\mathbf{z})$$

and the shrinkage update is defined as

$$\mathbf{z}' = \operatorname{Shrinkage}_{\lambda}(\mathbf{z}) = \operatorname{sign}(\mathbf{z})\max(|\mathbf{z}|-\lambda,\,0)$$

We consider $\mathbf{z}$ as converged if the change of its $L_2$ norm before and after one update is less than 1%.
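For concreteness, a minimal PyTorch-style sketch of this ISTA loop is shown below; the step size `lr`, the maximum iteration count, and the shapes are illustrative assumptions consistent with the description above.

```python
import torch

def ista_inference(x, U, lam, lr=0.01, max_iter=1000, tol=0.01):
    """Infer a sparse code z for input x (D,) given dictionary U (D, N)."""
    z = torch.zeros(U.shape[1])
    for _ in range(max_iter):
        z_old = z.clone()
        # gradient step on the reconstruction term: dE/dz = -2 U^T (x - U z)
        grad = -2 * U.t() @ (x - U @ z)
        z = z - lr * grad
        # shrinkage (soft-thresholding) step for the L1 penalty
        z = torch.sign(z) * torch.clamp(z.abs() - lam, min=0)
        # converged when the L2 norm changes by less than 1% between updates
        if torch.abs(z.norm() - z_old.norm()) < tol * (z_old.norm() + 1e-8):
            break
    return z
```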

Learning

After $\mathbf{z}$ converges for the current input $\mathbf{x}$, we update $\mathbf{U}$ using projected gradient descent. The gradient is

$$\frac{\partial E}{\partial\mathbf{U}} = -2\,(\mathbf{x}-\mathbf{U}\mathbf{z})\,\mathbf{z}^{\intercal}$$

and we take the step $\mathbf{U} \leftarrow \mathbf{U} - \eta\,\frac{\partial E}{\partial\mathbf{U}}$, where $\eta$ is the learning rate. After each gradient update, we project each column of $\mathbf{U}$ back onto unit $L_2$ norm to satisfy the constraint.
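A minimal sketch of this learning step (the learning rate `eta` and the single-patch update are assumptions; in practice the gradient would typically be averaged over a batch of patches):

```python
def dictionary_update(x, z, U, eta=0.01):
    """One projected gradient step on the dictionary U after z has converged."""
    # gradient of the reconstruction error: dE/dU = -2 (x - U z) z^T
    grad = -2 * torch.outer(x - U @ z, z)
    U = U - eta * grad
    # project each column (filter) back onto unit L2 norm
    return U / U.norm(dim=0, keepdim=True).clamp(min=1e-8)
```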

2.2 VAE and SVAE

The inference step of sparse coding can be seen as computing a MAP estimate through gradient updates. Variational autoencoders (VAEs), on the other hand, perform inference using a feedforward mapping from the observation to the latent posterior, with parameters shared across observations. Such a mapping is then learned by minimizing the Kullback–Leibler (KL) divergence between the approximating distribution and the true posterior

$$D_{KL}\left(q(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z}|\mathbf{x})\right)$$

Since we cannot tractably compute the true posterior, VAEs instead optimize the evidence lower bound (ELBO) $\mathcal{L}$, which satisfies

$$D_{KL}\left(q(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z}|\mathbf{x})\right) = \log p_{\theta}(\mathbf{x}) - \mathcal{L}$$

$$\begin{aligned}
\mathcal{L} &= \mathbb{E}_{\mathbf{z}\sim q(\cdot|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x})\right]\\
&= \mathbb{E}_{\mathbf{z}\sim q(\cdot|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right] - D_{KL}\left(q(\mathbf{z}|\mathbf{x})\,\|\,p_{\theta}(\mathbf{z})\right)
\end{aligned}$$

In the original VAE formulation [9], the proposal distribution $q(\mathbf{z}|\mathbf{x})$, the latent prior $p_{\theta}(\mathbf{z})$, and the likelihood term $p_{\theta}(\mathbf{x}|\mathbf{z})$ are all chosen to be Gaussian. The dimension of the latent code $\mathbf{z}$ is typically much smaller than the data dimension, forcing the model to learn low-dimensional structures that generate the true data distribution.
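As a reference point, a minimal sketch of this standard Gaussian-VAE ELBO with the reparameterization trick; the decoder interface, the observation noise $\sigma_x$, and the single-sample Monte Carlo estimate are assumptions on our part.

```python
import torch

def gaussian_vae_elbo(x, mu, logvar, decoder, sigma_x=1.0):
    """Single-sample ELBO for q(z|x) = N(mu, diag(exp(logvar))) and prior N(0, I)."""
    # reparameterized sample z ~ q(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_hat = decoder(z)
    # Gaussian log-likelihood log p(x|z), up to an additive constant
    log_lik = -0.5 * ((x - x_hat) ** 2).sum(dim=-1) / sigma_x**2
    # analytic KL(q(z|x) || N(0, I))
    kl = 0.5 * (mu**2 + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    return (log_lik - kl).mean()
```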

SVAE

In the work of Barello et al. [8], the authors proposed three modifications to the original VAE: (1) make the dimension of the latent variables overcomplete (larger than the input dimension); (2) use a sparsity-inducing prior (e.g. $p_{\theta}(\mathbf{z})\sim\text{Laplace}(0,1)$) instead of the Gaussian prior; (3) parameterize the decoder with a single linear layer rather than a deep neural network. In SVAE, the encoder replaces the iterative inference (ISTA), generating a full posterior $q(\mathbf{z}|\mathbf{x})$ instead of a single MAP estimate. The decoder behaves like the sparse coding filters $\mathbf{U}$, taking a sample $\mathbf{z}\sim q(\mathbf{z}|\mathbf{x})$ and reconstructing the input as $\hat{\mathbf{x}}=\mathbf{U}\mathbf{z}$. The proposal distribution and the likelihood term remain Gaussian, and the encoder is parameterized with two linear layers followed by a ReLU nonlinearity, plus two separate linear layers that generate the mean and the log variance of the proposal distribution, respectively.
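A minimal PyTorch-style sketch of this SVAE architecture as described; the hidden width, the exact placement of the nonlinearities, and the default dimensions (16x16 patches, 450 latent units) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SVAE(nn.Module):
    def __init__(self, input_dim=256, latent_dim=450, hidden_dim=512):
        super().__init__()
        # encoder: linear layers with ReLU, then separate heads for mean and log-variance
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # decoder: a single linear map whose weight plays the role of the dictionary U
        self.decoder = nn.Linear(latent_dim, input_dim, bias=False)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.decoder(z), mu, logvar, z
```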

However, in practice, we found that this SVAE formulation leads to a large number of noise filters in the learned decoder. Figure 2 shows the decoder filters learned by sparse coding (left) and SVAE (middle). Only a small subset of filters in the SVAE decoder resemble the oriented bandpass Gabor filters characteristic of V1 RFs, while most of the filters are not optimized to represent sparse structure in natural image data. In the next section, we present a few heuristics to increase the number of active filters learned in SVAE (Figure 2, right).

3 Improving SVAE training

To tackle the issue of under-optimized filters in the SVAE decoder, we propose the following training heuristics.

Balancing between reconstruction and KL divergence

Following the idea proposed in $\beta$-VAE [13], we weigh the KL divergence term in the ELBO by a factor less than 1 in order to smooth the effect of the sparse prior, which otherwise places heavy gradients on only a few filters. The new ELBO then becomes

$$\mathcal{L} = \mathbb{E}_{\mathbf{z}\sim q(\cdot|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right] - \beta\, D_{KL}\left(q(\mathbf{z}|\mathbf{x})\,\|\,p_{\theta}(\mathbf{z})\right)$$
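A sketch of the resulting training loss; because the KL divergence between a Gaussian posterior and a Laplace prior has no simple closed form, this sketch uses a single-sample Monte Carlo estimate, which is an assumption on our part. The $\beta$ value shown is illustrative; the prior scale of 0.1 matches the experiments below.

```python
import torch
from torch.distributions import Laplace, Normal

def svae_loss(x, x_hat, mu, logvar, z, beta=0.5, prior_scale=0.1, sigma_x=1.0):
    """Negative beta-weighted ELBO with a factorized Laplace(0, prior_scale) prior."""
    recon = 0.5 * ((x - x_hat) ** 2).sum(dim=-1) / sigma_x**2
    q = Normal(mu, torch.exp(0.5 * logvar))
    p = Laplace(torch.zeros_like(mu), prior_scale * torch.ones_like(mu))
    # single-sample Monte Carlo estimate of KL(q || p) using the sampled z
    kl = (q.log_prob(z) - p.log_prob(z)).sum(dim=-1)
    return (recon + beta * kl).mean()
```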

More expressive encoder

To improve the quality of the approximating posterior, we replace the linear layers in the original SVAE with ResNet blocks [14].
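A minimal sketch of one fully connected residual block; the exact block design (widths, normalization layers) is not specified here, so this is illustrative.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Fully connected residual block: y = ReLU(x + MLP(x))."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.net(x))
```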

SVAE decoder with unit norm constraint

Most importantly, we noticed that in the end-to-end training of SVAE, the decoder filters are no longer constrained to have unit $L_2$ norm. We suspect that the over-pruning issue of VAE training [12] and the sparsity-inducing prior together exacerbate this imbalance during training, causing only a few filters to receive dominating gradients and leaving most of the filters not fully optimized. Therefore, we propose to apply the same projected gradient descent step used in the original sparse coding model to SVAE, so that each decoder filter is constrained to unit $L_2$ norm. We show that this drastically increases the number of optimized filters (see Results).
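In practice this amounts to renormalizing the decoder columns after every optimizer step; a minimal sketch, assuming the decoder is the single linear layer from the SVAE sketch above:

```python
import torch

def normalize_decoder_(decoder):
    """Project each decoder filter (column of U) onto unit L2 norm, in place."""
    with torch.no_grad():
        # nn.Linear(latent_dim, input_dim) stores its weight with shape (D, N),
        # so each column corresponds to one filter U_i
        w = decoder.weight
        w.div_(w.norm(dim=0, keepdim=True).clamp(min=1e-8))

# typical use inside the training loop:
#   loss.backward(); optimizer.step(); normalize_decoder_(model.decoder)
```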

Figure 1: Reconstructions of 16x16 natural image patches in the held-out test set using the different models.

4 Results

Here, we present a performance comparison among the traditional sparse coding model, SVAE, and SVAE with our training heuristics. To further investigate the effect of decoder weight normalization, we also trained two SVAEs, both with the first two proposed heuristics ($\beta$ weighting, ResNet block encoders), but one with weight normalization ("SVAE-Norm") and the other without. We experimented with natural image patches and MNIST handwritten digits. The natural image patches (http://www.rctn.org/bruno/sparsenet/) are spatially whitened with a low-pass filter $R(f) = f e^{-(f/f_{0})^{4}}$, $f_{0}=200$ cycles/image. We used $z_{i}\sim\text{Laplace}(0, 0.1)\ \forall i=1,\dots,N$ as the factorized sparsity-inducing prior for all experiments.
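A sketch of this frequency-domain whitening step, assuming square images and FFT-based filtering; the exact preprocessing in the sparsenet distribution may differ in detail.

```python
import numpy as np

def whiten(img, f0=200.0):
    """Whiten a square image with the filter R(f) = f * exp(-(f / f0)^4)."""
    n = img.shape[0]
    freqs = np.fft.fftfreq(n) * n                # spatial frequencies in cycles/image
    fx, fy = np.meshgrid(freqs, freqs)
    f = np.sqrt(fx**2 + fy**2)
    R = f * np.exp(-((f / f0) ** 4))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * R))
```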

4.1 Reconstruction

We first examined the performance of the models on reconstruction. Figure 1 shows 10 reconstructed patches, while Table 1 shows the pixelwise mean squared error (MSE) of each model’s reconstructions on the entire test set. Our normalized SVAE model (SVAE-Norm) achieves the lowest MSE, followed by SVAE. The traditional sparse coding model performs worse than both.

Model           MSE        STD (over Monte Carlo samples)
Sparse Coding   0.0150     N/A
SVAE            0.00875    1.631e-5
SVAE-Norm       0.00769    2.329e-5
Table 1: Mean squared error and standard deviation for reconstructions of natural images in the held-out test set over 50 trials.

4.2 Neural representation and activity

Next, we examined the receptive fields of the neurons in each model's decoder, shown in Figure 2. In the SVAE formulation, these filters should behave similarly to the traditional sparse coding filters. SVAE-Norm exhibits the clearest Gabors, followed by the sparse coding model. SVAE shows some Gabor-like structure, but most of its neurons have either PCA-like receptive fields [2] (e.g. Figure 2, middle panel, first row, middle column) or are unoptimized (gray filters).

Figure 2: Receptive fields for 100 neurons in the representation layer of each of the models.
Figure 3: Activity of all the neurons in the representation layer of our networks for 3 different natural image patch inputs shown in each row.

When presented with a natural image patch, only a few neurons have large activations (Figure 3; each row shows a different input patch). This is consistent with biological and computational findings in the sparse coding literature. However, the sparse coding model shows a clearer sparse activation pattern across the 3 images (left column), while the VAEs visually appear noisier (middle and right columns). This is expected given the different inference procedures (ISTA vs. amortized inference plus sampling), although it is interesting that a strictly sparse code (with true zero activations) is not required for the “parts of images” structure to emerge in the SVAE filters.

To evaluate the conditional distribution $p_{\theta}(\mathbf{x}|\mathbf{z})$ learned by the decoder, we generated images by sampling from the prior Laplace distribution (Figure 4). Although these images are noisy, they look similar to the natural image patches of Figure 1, suggesting the SVAEs learned a meaningful hidden representation that generates the input data.
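A minimal sketch of this generation procedure, reusing the Laplace prior scale and latent dimensionality assumed above:

```python
import torch
from torch.distributions import Laplace

def sample_from_prior(decoder, n_samples=16, latent_dim=450, prior_scale=0.1):
    """Decode latent codes drawn from the factorized Laplace prior."""
    prior = Laplace(torch.zeros(latent_dim), prior_scale * torch.ones(latent_dim))
    z = prior.sample((n_samples,))           # shape (n_samples, latent_dim)
    with torch.no_grad():
        return decoder(z)                    # shape (n_samples, input_dim)
```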

Figure 4: Images generated by random sampling from the prior distribution of our VAE models.
(a) The receptive fields of neurons in the SVAE-Norm model (right) and the original SVAE (left) are shown. The top shows those which did not achieve activity greater than 0.5 for any image in the test set, and the bottom shows the rest. This separates the filters into noise (top) and Gabor (bottom) groups, with up to 100 randomly selected neurons shown. The SVAE has 154 Gabor and 296 noise neurons, while the SVAE-Norm has 374 Gabor and 76 noise neurons.
(b) The top plot shows the variance of activation across the test set for the two groups of neurons in both models. The bottom plot shows the $L_2$ norm for both groups and models. Note that in the SVAE-Norm model the filters are normalized to unit norm.
Figure 5: Effect of weight normalization on the SVAE decoder. (a) Examples of Gabor-like vs. noisy filters; (b) noisy filters show smaller activation variance on the test set and extremely small norms compared to the Gabor-like filters.

4.3 Noise filters

We observed that in the SVAE model many neurons are under-optimized and learn white-noise-like filters rather than Gabor-like filters (Figure 5(a), top left). This suggests that a majority of the filters were not learning meaningful latent structure of the data. We were able to isolate these neurons in the SVAE model by thresholding on the standard deviation of the latent code activation over the test set. Using 0.5 as the threshold for both the SVAE and SVAE-Norm models, we found that 296 out of 450 filters in SVAE fall below the threshold. We visualize the variance distribution of these filters in the top panel of Figure 5(b). Figure 5(b) also shows that these "noise" filters have extremely small $L_2$ norms compared to the filters with clear Gabor-like structure (Figure 5(a), bottom left). With weight normalization applied to SVAE during training, however, only 76 out of 450 filters fall below the threshold, and these are mainly highly sparse filters that encode single pixels of the images (Figure 5(a), top right). The majority of the filters in the SVAE-Norm model now show Gabor-like structure similar to the traditional sparse coding filters and V1 RFs (Figure 5(a), bottom right).
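A sketch of this separation, assuming the latent activations on the test set are collected in a tensor `Z` of shape (num_images, N) and the decoder filters in `U` of shape (D, N); the exact activity statistic used for thresholding is described above.

```python
import torch

def split_filters_by_activity(Z, U, threshold=0.5):
    """Split decoder filters into 'noise' and 'Gabor' groups by activation spread."""
    std = Z.std(dim=0)                 # per-filter standard deviation over the test set
    noise_mask = std < threshold
    return U[:, noise_mask], U[:, ~noise_mask]
```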

To validate the effect of weight normalization on learning quality filters, we ran the same two SVAE models, with and without weight normalization, on the MNIST dataset. Figure 6 shows 100 randomly sampled decoder filters from the two models. The original SVAE model yields only 7 filters with large norms, and even these have noisy structures, while all filters in the SVAE-Norm model show stroke-like structures that resemble parts of MNIST digits. For a visualization of all of the filters learned by the two models on the two datasets, see the Appendix figures.

Figure 6: 100 randomly sampled receptive fields trained from MNIST data. Left: SVAE; Right: SVAE with weight normalization

5 Discussion

In this paper, we showed that a weight normalization scheme is critical for decoder optimization when training sparse coding variational autoencoders. To gain some insight into the efficacy of the approach, note that the projected gradient descent on the decoder can be thought of as a special case of the weight normalization proposed by Salimans et al. [16], which reparameterizes neural network weights $\mathbf{w}$ as

$$\mathbf{w} = \frac{g}{\|\mathbf{v}\|_{2}}\,\mathbf{v}$$

where the scalar $g$ sets the length of the weight vector and $\frac{\mathbf{v}}{\|\mathbf{v}\|_{2}}$ sets its direction. The SVAE decoder filters can be thought of as reparameterized weights with $g=1$ fixed. Weight normalization has been shown to accelerate model training and encourage disentangled representation learning. We expect imposing a unit norm constraint on the SVAE decoder to have similar effects. Future work could focus on verifying the effect of normalization in hierarchical latent variable models, e.g. extending hierarchical sparse coding [17, 18] to a VAE setting.
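A minimal sketch of this connection; PyTorch's `torch.nn.utils.weight_norm` implements the general reparameterization with a learnable gain, and fixing the gain to 1 recovers the unit-norm decoder constraint (the dimensions are the ones assumed earlier).

```python
import torch
import torch.nn as nn

latent_dim, input_dim = 450, 256  # overcomplete latent code, 16x16 patches

# general weight normalization: weight = (g / ||v||) * v, with one gain per filter.
# dim=1 normalizes over the output dimension, i.e. each column (filter) of U.
decoder = nn.utils.weight_norm(
    nn.Linear(latent_dim, input_dim, bias=False), name="weight", dim=1
)

# fixing g = 1 and freezing it recovers the hard unit-norm constraint of SVAE-Norm
with torch.no_grad():
    decoder.weight_g.fill_(1.0)
decoder.weight_g.requires_grad_(False)
```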

6 Acknowledgements

This paper was written as a final project for a course on generative models at the University of Washington. We thank the course instructor John Thickstun for his feedback.

References

  • [1] D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat’s striate cortex. The Journal of Physiology, 148(3):574–591, October 1959.
  • [2] Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, June 1996. Number: 6583 Publisher: Nature Publishing Group.
  • [3] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In NIPS, 2006.
  • [4] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML, 2010.
  • [5] Y. Chen, Dylan M. Paiton, and B. Olshausen. The sparse manifold transform. ArXiv, abs/1806.08887, 2018.
  • [6] Dylan M. Paiton, Charles G. Frye, Sheng Y. Lundquist, Joel D Bowen, R. Zarcone, and B. Olshausen. Selectivity and robustness of sparse coding networks. Journal of Vision, 20, 2020.
  • [7] Jeremias Sulam, Ramchandran Muthumukar, and R. Arora. Adversarial robustness of supervised sparse coding. ArXiv, abs/2010.12088, 2020.
  • [8] G. Barello, A. Charles, and Jonathan W. Pillow. Sparse-coding variational auto-encoders. bioRxiv, 2018.
  • [9] Diederik P. Kingma and M. Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2014.
  • [10] Samuel R. Bowman, L. Vilnis, Oriol Vinyals, Andrew M. Dai, R. Józefowicz, and S. Bengio. Generating sentences from a continuous space. ArXiv, abs/1511.06349, 2016.
  • [11] Diederik P. Kingma, Tim Salimans, and M. Welling. Improved variational inference with inverse autoregressive flow. ArXiv, abs/1606.04934, 2017.
  • [12] Serena Yeung, A. Kannan, Yann Dauphin, and Li Fei-Fei. Tackling over-pruning in variational autoencoders. ArXiv, abs/1706.03643, 2017.
  • [13] I. Higgins, Loïc Matthey, A. Pal, C. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • [14] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [15] Stephen P. Boyd and L. Vandenberghe. Convex optimization. IEEE Transactions on Automatic Control, 51:1859–1859, 2006.
  • [16] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
  • [17] Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and R. Fergus. Deconvolutional networks. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2528–2535, 2010.
  • [18] Matthew D. Zeiler, Graham W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. 2011 International Conference on Computer Vision, pages 2018–2025, 2011.

Appendix

Figure 7: All SVAE decoder filters learned on natural image patches
Figure 8: All SVAE decoder filters with weight normalization learned on natural image patches
Figure 9: All SVAE decoder filters learned on MNIST
Figure 10: All SVAE decoder filters with weight normalization learned on MNIST