
StarNet: Gradient-free Training of Deep Generative Models using Determined System of Linear Equations

Amir Zadeh, Santiago Benoit, Louis-Philippe Morency
LTI, Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
{abagherz,sbenoit,morency}@cs.cmu.edu
Abstract

In this paper we present an approach for training deep generative models solely based on solving determined systems of linear equations. A network that uses this approach, called a StarNet, has the following desirable properties: 1) training requires no gradients, since the solution to the system of linear equations is not stochastic, 2) it is highly scalable when solving the system of linear equations w.r.t the latent codes, and similarly for the parameters of the model, and 3) it gives desirable least-squares bounds for the estimation of latent codes and network parameters within each layer.

1 Introduction

Generative modeling requires learning a latent space, and a decoder that maps the latent samples to an output manifold. We study the two cases of using inverse-funnel feedforward decoders and convolutional decoders through the lens of a system of linear equations. Our proposed algorithm is called StarNet.

A system of linear equations is a well-studied area in linear algebra, with various algorithms ranging from elimination methods to solutions based on the matrix pseudoinverse. The benefits of the StarNet learning process are as follows:

  • No gradients are required to solve a system of linear equations. Therefore, training can be done at a lower computational cost and with fewer hyperparameters. Training deep neural networks often requires calculating the gradient of an objective function (e.g. log likelihood) w.r.t the parameters of a model. Gradient calculation is computationally heavy, and often scales poorly since batch updates are serial throughout an epoch.

  • Certain algorithms to solve a system of linear equations can be highly scalable. Unlike batch learning, where each datapoint waits its turn within the epoch, a system of linear equations offers a framework that is more scalable. Certain steps of the StarNet learning algorithm can linearly scale up to as many computational nodes as there are datapoints within a dataset.

  • There are strong theoretical backings for a solution to a system of linear equations. It is imperative that deep models have explainable behavior, during both learning and application. Learning via a system of linear equations offers residuals for each layer, which can be used to identify and better approach multiple important issues in machine learning such as mode-collapse. Furthermore, for each datapoint, an estimate of how well each layer fits the datapoint can be measured, thus better understanding the intrinsic success or failure cases of deep learning.

A system of linear equations offers unique (exact/approximate) solutions if it is determined (well-determined/over-determined). This paper presents such conditions for two commonly used decoder types: feedforward and convolutional. These conditions are not strong, and a significant portion of the currently used decoders fit within them. Our StarNet learning algorithm progressively trains a deep neural network from the last layer to the first. It follows a coordinate descent method to reach a plateau in solving the system of linear equations w.r.t both the latent codes and the parameters of the model, at each layer. Most of the trained models in this paper reach convergence in fewer than 3 epochs, a significant boost over gradient-based models. Because the latents of StarNet are not restricted (e.g. by an encoder, as in an Auto-Encoder or PCA), even a single-layer StarNet offers remarkable generation capabilities.

2 Related Work

Generative modeling is a fundamental research area in machine learning. Notable related works are outlined below.

In a neural framework, Auto-Encoders (AE) and their variational implementation (VAE [4]) have been commonly used for learning based on a reconstruction objective. GANs [3] deviate from this framework by optimizing an objective that pushes two networks towards an equilibrium where a generator's output is indistinguishable from the real generative distribution. An encoderless application of variational inference to generative modeling, Variational Auto-Decoders (VAD [10]) use only a generator but with an objective similar to VAE, with the added benefit of robustness to noise and missing data. All approaches rely on the computation of gradients during train time (with GAN and VAD commonly requiring the same during test time to invert the generative model, i.e. posterior inference). Flow-based models use invertible mappings between input and output of the same dimensions. Convolutional variants are limited to $1\times 1$ convolutions [5]. StarNet does not require the same dimensions for input and output, and is not restricted to $1\times 1$ convolutions. However, similar to the latter, StarNet does exhibit certain mild limitations when solving for CNNs.

Aside from deep techniques, Principal Component Analysis (PCA [9]) is a popular method for learning low-dimensional representations of data. PCA is identical to a linear auto-encoder with a least-squares reconstruction objective. The PCA algorithm does not require gradient computations. It differs from StarNet in that the feedforward operation of PCA is restricted by a linear encoder, while StarNet imposes no restrictions on the latent space. In simple terms, StarNet has no direct mapping between input and latents; the latents are recovered using the pseudoinverse of the model weights.

3 Model

StarNet is a method for training certain neural decoders without gradient computations. We cover two major types in this paper: 1) feedforward StarNet, and 2) convolutional StarNet. For each type, we outline the necessary conditions required for successful training and inference.

3.1 Feedforward StarNet

We start with the feedforward StarNet, which uses a feedforward decoder. The following are the operations in the $i$-th layer of a neural network:

$h_{i}=a(W_{i}\cdot h_{i-1})$ (1)

Here $h_{i}\in\mathbb{R}^{d_{i}}$ is the output, $h_{i-1}\in\mathbb{R}^{d_{i-1}}$ the input, $W_{i}\in\mathbb{R}^{d_{i}\times d_{i-1}}$ an affine map, and $a$ a non-linear activation function. Note, for simplicity of the equations later, the formula accounts for the bias as a constant dimension added to all $h_{i-1}$ and a corresponding column added to $W_{i}$. The operation above is unfolded $K$ times for a deep neural decoder with $K$ layers.

Let $X=\{x^{(n)}\}_{n=1}^{N}$ be a given i.i.d dataset. Essentially, in any generative or unsupervised framework, the hope is that $h^{(n)}_{K}$ is as close as possible to the respective $x^{(n)}$.

Theorem 1

$\forall i$, Equation 1 is solvable w.r.t $h_{i-1}$ as long as $d_{i-1}\leq d_{i}$ (i.e. the decoder is an inverse funnel) and the inverse $a^{-1}$ of the activation exists.

Proof 1

The condition is required for a well-determined or over-determined system of linear equations w.r.t $h_{i-1}$. Let $\hat{h}_{i}$ be the non-activated neural responses in Equation 1. Naturally, $\hat{h}_{i}=a^{-1}(h_{i})$. For each $x^{(n)}$, we have the following:

$W_{i}\cdot h^{(n)}_{i-1}=\hat{h}_{i}^{(n)}$ (2)

The above is a system of linear equations with a unique solution in the case of $d_{i-1}=d_{i}$ and a least-squares solution in the case of $d_{i-1}<d_{i}$. The solution is given by:

$\hat{h}^{*}_{i-1}=(W_{i}^{T}W_{i})^{-1}W_{i}^{T}\hat{h}_{i}$ (3)

$\hat{h}^{*}_{i-1}$ is the least-squares solution (hence the name StarNet), and $(W_{i}^{T}W_{i})^{-1}W_{i}^{T}$ is the pseudoinverse of the weights of the $i$-th layer. It is important that the initialization of $W_{i}$ does not produce rank deficiency; therefore, random initialization, or its orthogonal projection (a factor of a decomposition), may be used. Subsequently:

$h^{*}_{i-1}=a^{-1}(\hat{h}^{*}_{i-1})$ (4)

Commonly used activation functions are invertible, except for ReLU in the negative part of its domain. Alternatives such as leaky ReLU can be used instead.
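As a concrete illustration of Equations 2 and 3, below is a minimal NumPy sketch of the per-layer latent solve, assuming a leaky ReLU activation with slope 0.5; the function names, shapes, and toy data are illustrative and not taken from the released implementation. Equation 4 (the inverse activation) would then be applied to the returned latents before recursing to the layer below.

```python
import numpy as np

def leaky_relu_inv(h, slope=0.5):
    # Inverse of leaky ReLU: divide the negative part by the slope.
    return np.where(h >= 0, h, h / slope)

def solve_latents(W_i, h_i, slope=0.5):
    """Recover h_{i-1} for a batch of datapoints (Equations 2-3).

    W_i : (d_i, d_{i-1}) weight matrix with d_{i-1} <= d_i (inverse funnel).
    h_i : (N, d_i) activated outputs of layer i for N datapoints.
    Returns the least-squares latents of shape (N, d_{i-1}).
    """
    h_hat_i = leaky_relu_inv(h_i, slope)                 # non-activated responses
    # Least-squares solution of W_i @ h_{i-1} = h_hat_i for every datapoint.
    h_star, *_ = np.linalg.lstsq(W_i, h_hat_i.T, rcond=None)
    return h_star.T

# Toy example: d_{i-1} = 8 <= d_i = 16, N = 32 datapoints.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))
h_prev = rng.standard_normal((32, 8))
pre = W @ h_prev.T                                       # pre-activations, (16, 32)
h = np.maximum(pre, 0.5 * pre).T                         # leaky ReLU forward, (32, 16)
print(np.allclose(solve_latents(W, h), h_prev, atol=1e-6))  # True
```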

Lemma 1

The above will not have a unique solution if $d_{i-1}>d_{i}$, since $\mathrm{nullity}(W_{i})>0$. Therefore, the architecture cannot go lower in dimensions without the risk of rank deficiency. If it does, strong assumptions have to be made on the observations $x^{(n)}$, such as the presence of a particular structure or certain simplicities [1], often covered in compressed sensing [2]. In the scope of this paper, since the majority of neural decoders are inverse funnels, we leave the study of rank-deficient weights for the next edition of this preprint.

Lemma 2

Since samples drawn from $p(x)$ are assumed i.i.d, the solution to Equation 2 can be distributed across $N$ independent computations, allowing for linear job distribution over computational nodes.

Lemma 3

The solution of Equation 1 for a datapoint can be seen as the latent inference process (e.g. similar to an encoder feedforward pass), which is part of both training and testing. It can be applied recursively from higher to lower layers, by successively solving Equation 1 for layers $K-1,K-2,\cdots,1$ and applying the inverse of the activation $a$ at each layer. Therefore, the complexity of inference in StarNet remains linear w.r.t the number of layers $K$. This is an improvement over encoderless gradient-based approaches that require a large number of epochs (e.g. approximate posterior inference in VAD).

Lemma 4

If Equation 2 is over-determined, then a norm of the residuals of each equation (i.e. datapoint when solving for $h^{(\cdot)}$) can be used as a measure for confidence, detecting outliers, or mode-collapse. An in-distribution datapoint with a high residual norm would signal inference suboptimality given the layer parameters. Similarly, a datapoint may be judged to belong to a certain distribution using a probability field defined over such a measure.
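A hedged illustration of this measure: given the least-squares latents from Equation 3, per-datapoint residual norms can be computed directly; the NumPy snippet below uses illustrative names and is not part of the released code.

```python
import numpy as np

def layer_residual_norms(W_i, h_hat_i, h_star_prev):
    """Per-datapoint residual norms ||W_i h*_{i-1} - h_hat_i|| (Lemma 4).

    W_i         : (d_i, d_{i-1}) layer weights.
    h_hat_i     : (N, d_i) non-activated responses (right-hand sides).
    h_star_prev : (N, d_{i-1}) least-squares latents from Equation 3.
    """
    residuals = h_star_prev @ W_i.T - h_hat_i      # (N, d_i)
    return np.linalg.norm(residuals, axis=1)       # one score per datapoint
```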

Lemma 5

A layer with $d_{i}=d_{i-1}$ is not a useful neural construct for StarNet. Such a layer will collapse the affine transformation to a degenerate identity projection. If a solution lies within $\mathbb{R}^{d_{i-1}}$, then Equation 4 will find it, regardless of the depth of similarly shaped layers projecting into that same space.

Theorem 2

Equation 1 can also be solved w.r.t $W_{i}$ iff $d_{i-1}\leq N$; essentially, there need to be more datapoints than the number of input neurons in each individual layer of the network.

Proof 2

Let $W_{i,j}$ be the $j$-th row of the weights of the $i$-th layer (the row that produces the $j$-th dimension in the output of the layer). Let $H_{i-1}$ be the matrix of inputs (over the entire dataset) to the $i$-th layer, and $H_{i,j}$ the $j$-th row of the corresponding output matrix $H_{i}$. Thus Equation 1 can be written as:

$a^{-1}(H_{i,j}^{T})=H_{i-1}^{T}\cdot W_{i,j}^{T}$ (5)

This is a new system of linear equations with exactly $N$ equations and $d_{i-1}$ unknowns. The same equation can be written for each row of $W_{i}$ individually, leading to a solution for $W_{i}$ that scales across as many nodes as there are rows in $W_{i}$. The system can similarly be solved by calculating the pseudoinverse of the matrix of knowns, or by elimination approaches.
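A minimal NumPy sketch of this row-wise solve is given below, under the assumption that the non-activated targets $a^{-1}(H_{i})$ have already been computed; names and shapes are illustrative. In practice a single call to np.linalg.lstsq with the full target matrix solves all rows at once; the explicit loop only makes the row-wise distribution across nodes apparent.

```python
import numpy as np

def solve_weights(H_prev, H_hat_i):
    """Solve for W_i row by row (Equation 5).

    H_prev  : (N, d_{i-1}) inputs to layer i over the dataset (d_{i-1} <= N).
    H_hat_i : (N, d_i) non-activated targets a^{-1}(H_i) of layer i.
    Returns W_i of shape (d_i, d_{i-1}).
    """
    d_i = H_hat_i.shape[1]
    rows = []
    for j in range(d_i):                       # each row is an independent solve
        w_j, *_ = np.linalg.lstsq(H_prev, H_hat_i[:, j], rcond=None)
        rows.append(w_j)
    return np.stack(rows)                      # (d_i, d_{i-1})
```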

Lemma 6

Since the network is required to follow an inverse funnel (see Proof 1), $d_{K-1}<N$ is the only condition that needs to be checked aside from the previous conditions.

3.2 Convolutional StarNet

Convolution is essentially a local affine map applied along patches of an input. Similar to the feedforward StarNet, a system of linear equations can be solved for convolutional decoders. In order to keep the system of linear equations determined w.r.t the latents, we also discuss combining the convolution and pooling operations together into a conv-unpool layer. A conv-unpool layer generates the neighboring pixels of a larger response map as channels within a smaller response map. Figure 1 shows the operations of a conv-unpool layer.

Refer to caption
Figure 1: The operations of a conv-unpool layer with an unpooling of $2\times 2$. Essentially, a local patch (right) is created by reshaping the channel responses of the previous layer (left).
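The unpooling step itself is a deterministic rearrangement of channels into spatial positions (akin to a pixel shuffle). A minimal NumPy sketch under that assumption follows; the exact channel ordering in the released code may differ.

```python
import numpy as np

def unpool(x, u):
    """Rearrange a (C*u*u, H, W) response map into a (C, H*u, W*u) map,
    i.e. each group of u*u channels becomes a u x u spatial patch."""
    c_uu, H, W = x.shape
    C = c_uu // (u * u)
    x = x.reshape(C, u, u, H, W)
    x = x.transpose(0, 3, 1, 4, 2)       # (C, H, u, W, u)
    return x.reshape(C, H * u, W * u)
```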
Theorem 3

The system of linear equations for a conv-unpool layer with a stride of $1$ remains determined w.r.t the latents of the $(i-1)$-th layer as long as $c_{i}\leq c_{i-1}\cdot u_{i-1}^{2}$, with $c_{i}$, $c_{i-1}$ being the number of channels in the $i$-th and $(i-1)$-th layers of conv-unpool operations, and $u_{i}$ the unpooling shape of the $i$-th layer in both image directions (here assumed square, but it can also be rectangular).

Lemma 7

The above condition can be restrictive for the border convolutions. Performing full convolutions helps the solution remain determined for all inputs (even the very borders of the input response map).

Theorem 4

The same system of linear equations can be solved w.r.t the convolutional parameters as long as there are fewer parameters in a single convolutional kernel (e.g. 49 for a $7\times 7$ convolution) than the total number of convolutions performed with that kernel over the entire dataset. This is a relatively relaxed condition, since layers tend to be much larger than the shape of a single convolutional kernel.

Lemma 8

When solving for the convolutional parameters, the number of equations is extremely large compared to the number of unknowns (e.g. 2,880,000 equations for a single $7\times 7$ convolution on the MNIST dataset, with 49 unknowns). This is due to the fact that the same kernels are applied over different patches. Naturally, calculation of the pseudoinverse may become computationally exhaustive very quickly as the size of the input image increases. We provide two practical solutions for this: 1) we can solve the system over a subset of the datapoints or equations. Naturally, this puts an emphasis on which datapoints are chosen; the subset should be representative of the larger system of equations. In practice, random datapoint or patch selection across the entire dataset works well, as shown in Figure 8. 2) To get a more precise approximation of the convolutional weights, we can distribute the overdetermined equations across multiple computational nodes. Each computational node still solves an overdetermined system, but a smaller one than the full system defined over the entire dataset. The final weights are subsequently the average of the solutions found by the nodes.
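A sketch of the first mitigation in NumPy: the rows of the assembled (im2col-style) system are randomly subsampled before the solve. The paper samples whole datapoints rather than individual rows; the helper name, the subsampling granularity, and the defaults here are illustrative assumptions.

```python
import numpy as np

def solve_kernel_subsampled(A, b, n_rows=5000, seed=0):
    """Least-squares solve of the overdetermined system A w ~= b on a random
    subset of its rows (equations).

    A : (M, k) matrix of knowns, one row per convolution location in the dataset.
    b : (M,)   corresponding non-activated targets.
    Returns the kernel weights w of shape (k,).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(A.shape[0], size=min(n_rows, A.shape[0]), replace=False)
    w, *_ = np.linalg.lstsq(A[idx], b[idx], rcond=None)
    return w
```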

3.3 Training and Inference Algorithm

Training StarNet follows a coordinate descent framework. We start from the last layer of the network, where $h^{(n)}_{K}=x^{(n)}$. We iteratively solve the system of linear equations w.r.t the latents, and subsequently w.r.t the weights of the model, until convergence is reached. The inverse activation is applied after training the layer to reach $h^{*(n)}_{K-1}$. The same approach is repeated for each layer, until we reach the first-layer latents $h^{*(n)}_{1}$. Test-time inference is identical to the layerwise training, except that the system of equations in each layer is only solved for the latents, and never for the weights (unless the purpose is finetuning).

Code is provided under https://github.com/A2Zadeh/StarNet.
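To make the layer-wise coordinate descent concrete, the following self-contained NumPy sketch alternates the two solves (SL and SW) for a single feedforward layer and then recurses downward; it assumes a leaky ReLU activation with slope 0.5 and is an illustrative simplification, not the released implementation.

```python
import numpy as np

def act_inv(h, slope=0.5):
    # Inverse of leaky ReLU.
    return np.where(h >= 0, h, h / slope)

def train_layer(targets, d_in, n_epochs=5, slope=0.5, seed=0):
    """Coordinate descent for one layer: alternately solve the system
    W @ H_in^T = a^{-1}(targets^T) for the latents (SL) and the weights (SW).

    targets : (N, d_out) activated outputs of this layer (the data x for layer K).
    d_in    : input width of the layer, with d_in <= d_out and d_in <= N.
    Returns (W, H_in) of shapes (d_out, d_in) and (N, d_in).
    """
    N, d_out = targets.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d_out, d_in))      # random (full-rank w.h.p.) init
    Z = act_inv(targets, slope)                 # non-activated targets
    for _ in range(n_epochs):
        # SL: least-squares latents given the current weights.
        H_in, *_ = np.linalg.lstsq(W, Z.T, rcond=None)
        H_in = H_in.T                           # (N, d_in)
        # SW: least-squares weights given the current latents.
        W_T, *_ = np.linalg.lstsq(H_in, Z, rcond=None)
        W = W_T.T                               # (d_out, d_in)
    return W, H_in

# Layer-wise training of a two-layer decoder (e.g. 128 -> 256 -> data):
# X = ...                                      # (N, 784) flattened images
# W2, H2 = train_layer(X, d_in=256)            # last layer (closest to the data)
# W1, H1 = train_layer(H2, d_in=128)           # next layer down
```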

4 Experiments

In this section we discuss the experiments and the methodology. We start by outlining the datasets, and subsequently discuss the training procedure and model architectures.

4.1 Datasets

We experiment with the following four datasets.

MNIST, Fashion-MNIST: MNIST [7] is a collection of handwritten digits between $[0,9]$. Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist) is a similar dataset to MNIST, except it contains fashion items. Both datasets are baselines for image generation. The images are monochromatic, with a shape of $28\times 28$.

SVHN: SVHN (Street View House Numbers) is a collection of real-world images of house numbers, obtained from Google Street View images [8]. One variant of this dataset includes bounding boxes around each digit in a house number, while the other variant (used in this paper) separates all digits individually into $32\times 32$ images, with digits $[0,9]$ similarly to MNIST.

CIFAR10: CIFAR10 is a collection of $32\times 32$ color images belonging to 10 different classes, corresponding to different animals and vehicles [6]. The full dataset consists of 6,000 images per class, 60,000 images in total.

Refer to caption
Figure 2: Best viewed zoomed in and in color. Generation results for MNIST, Fashion-MNIST, SVHN and CIFAR10 using feedforward StarNet with two layers. (a) denotes the target, (b) is the reconstruction after solving for the second layer of 256 neurons, (c) is the reconstruction after solving for the first layer of 128 neurons.
Refer to caption
Figure 3: Best viewed zoomed in and in color. Generation results for MNIST, Fashion-MNIST, SVHN and CIFAR10 using convolutional StarNet with two layers. (a) denotes the target, (b) is the reconstruction after solving for the second layer conv-unpool, (c) is the reconstruction after solving for the first layer conv-unpool.

4.2 Training Methodology

The training methodologies for the feedforward StarNet and the convolutional StarNet are as follows:

Feedforward StarNet: We train feedforward networks with two layers of $128$ and $256$ neurons. The final dimension for MNIST and Fashion-MNIST is $784$ ($28\times 28$), while for SVHN and CIFAR10 it is $3072$ ($3\times 32\times 32$). The activation is leaky ReLU, with the negative slope as a hyperparameter discussed in Section 5.

Convolutional StarNet: We use a conv-unpool operation, with an unpooling of $2$ in each direction. For all trained models, there are two layers of convolution, with shape $7\times 7$. For MNIST and Fashion-MNIST, the first layer has $6$ kernels and the second layer has $2$. For CIFAR10 and SVHN, we have $10$ kernels in the first layer and $4$ kernels in the second. All trained models use full convolution to ensure that the system remains determined w.r.t the border latents (see Lemma 7).

The dimensions chosen for the above feedforward and convolutional StarNet models follow the conditions required in Section 3. Therefore, no degenerate identity solution can be learned.

The comparison Auto-Encoder (AE) baseline uses a decoder with the same architecture. The encoder mirrors the decoder architecture (no weights are shared between the encoder and decoder). Default Adam parameters are used for training, with a learning rate of $10^{-3}$. The batch size is set to $64$.

5 Results

The generations for the MNIST, Fashion-MNIST, SVHN and CIFAR10 datasets are shown in Figure 2 for feedforward StarNet, and in Figure 3 for convolutional StarNet. While the trained models use no gradients, the generations show high fidelity to the ground truth (a relatively high negative slope of 0.5 was found to work best overall). We show the generations based on each layer, and further discuss the results of the single-layer generation in Section 6.

Refer to caption
Figure 4: Best viewed zoomed in and in color. Progressive learning of StarNet for the feedforward architecture on the MNIST and CIFAR10 datasets. Each epoch has two steps: solve w.r.t the latents (denoted as SL) and solve w.r.t the weights (SW). “Init” denotes the initial reconstruction given untrained latents and untrained weights. The images start showing high fidelity to the input after the initial epoch.
Refer to caption
Figure 5: Best viewed zoomed in and in color. Progressive learning of StarNet for the conv-unpool architecture on the MNIST and CIFAR10 datasets. Each epoch has two steps: solve w.r.t the latents (denoted as SL) and solve w.r.t the weights (SW). “Init” denotes the initial reconstruction given untrained latents and untrained weights. The images start showing high fidelity to the input after the initial epoch.

6 Discussion

We discuss some of the important findings of our experiments in this section.

Performance of a Single Layer Network: A unique distinction between methods that use a free-form parameterization of the latents (i.e. StarNet, VAD) and methods that use deep layers for inference is that a single-layer network is the most successful in generation. This is due to the fact that the latents are not controlled by an underlying neural structure during the optimization. We posit that a neural architecture imposes a distribution over its range, whereas free-form optimization allows latents to travel anywhere within the latent manifold. Therefore, free-form latents can learn the same outputs as a deep network (or better ones), provided they satisfy the training objective.

Convergence Rate: On the experimental datasets, StarNet reaches a performance plateau mostly within 5 epochs. Figure 6 shows the elastic (i.e. $|\cdot|+\|\cdot\|$) loss over different epochs for both MNIST and CIFAR10. Note, the loss is only reported as an error measure and is never used for training StarNet. For reference, we include the convergence rate of an auto-encoder trained with the Adam optimizer with a default learning rate of $10^{-3}$. The Auto-Encoder is trained on the elastic objective.
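For clarity, the elastic measure reported here is simply the sum of the L1 and L2 reconstruction errors; a short NumPy helper (illustrative, not from the released code):

```python
import numpy as np

def elastic_error(x, x_hat):
    # L1 + L2 reconstruction error per datapoint, for (N, d)-shaped arrays.
    diff = x - x_hat
    return np.abs(diff).sum(axis=-1) + np.linalg.norm(diff, axis=-1)
```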

Learning Progress over Different Epochs: We also study the progress of the generations over different epochs of training. We do this for single-layer feedforward and conv-unpool networks (for MNIST and CIFAR10). The layers are the second layers ($256$ neurons) from Section 4.2. The results are shown in Figure 4 for the feedforward case and Figure 5 for the convolutional case. Both architectures show that generations beyond 2 epochs are already very close to the ground truth.

Refer to caption
Figure 6: Convergence comparison between feedforward StarNet and an Auto-Encoder (AE) on MNIST and CIFAR10. Both decoders are feedforward architectures with two layers of $128$ and $256$ neurons (AE uses the inverse architecture for its encoder). AE uses default Adam parameters with a learning rate of $10^{-3}$. Convergence of StarNet is reached within 5 epochs.

Importance of Both Steps in Learning: The learning progress in Figures 4 and 5 demonstrates that, while convergence is fast, the initial solution (i.e. SL in the first epoch) is not completely successful due to the random weights. We also experiment with initiating training by solving w.r.t the weights first, and show the results for the MNIST dataset in Figure 7. Therefore, one cannot train a StarNet by only solving w.r.t either the latents or the weights; both steps have to be present.

Refer to caption
Figure 7: Best viewed zoomed in. MNIST training for feedforward StarNet. In this case, training starts with solving w.r.t the weights first.

Sampling for Convolutional StarNet Equations: The system of linear equations in a convolutional StarNet can have too many equations when solving for the parameters of the network. Naturally, this makes calculation of the pseudoinverse (or any other approach for solving the system of equations) computationally exhaustive. To mitigate this, we study the effect of random datapoint sampling on the learning process of the model parameters. For MNIST, Figure 8 shows that around 500 datapoints provide enough statistics to solve for the weights. Going below that, to 100 datapoints per epoch, seems to make convergence too slow and the coordinate descent process of learning StarNet unstable. Still, around 500 datapoints is a significant reduction from the full system of 2,880,000 equations for full convolutions over MNIST. Note, all models still use all the datapoints for solving w.r.t the latents, since datapoints and latents are assumed i.i.d.

Refer to caption
Figure 8: Best viewed zoomed in. On the left side, we show the number of samples used for solving the system of linear equations in convolutional StarNet w.r.t the convolution weights. We observe that 500 images randomly chosen in each epoch are sufficient for training the weights. This sampling mitigates the otherwise extremely heavy pseudoinverse computation in convolutional StarNet.

Final Remarks on Scalability: There are various ways to maximize computational node utilization when scaling up the learning process of StarNet. First and foremost, when solving both feedforward and convolutional StarNet for the latents, the relation between the number of datapoints and computational nodes remains linear; i.e. 100 i.i.d datapoints can go to 100 computational nodes. Furthermore, for both the feedforward and convolutional cases of StarNet, the solution for the weights can be scaled in two ways: 1) independent weights (rows of the weight matrix in the feedforward case, and kernels in the convolutional case) can be mapped to different computational nodes, and 2) in case there are still many more equations than unknowns when solving for the weights (i.e. due to the large size of the dataset), a sampling approach can reduce the number of equations to a representative subset, as studied in Figure 8.

7 Conclusion

In this paper we presented a new training framework for certain generative architectures. StarNet uses no gradients during training and inference. It relies solely on layer-wise solving of systems of linear equations. This method of training converges fast, and allows for scalability across computational nodes. We present generative experiments over 4 publicly available and well-studied datasets. The generation results show the capabilities of StarNet.

References

  • [1] David Donoho, Hossein Kakavand, and James Mammen. The simplest solution to an underdetermined system of linear equations. In 2006 IEEE International Symposium on Information Theory, pages 1924–1928. IEEE, 2006.
  • [2] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
  • [3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27:2672–2680, 2014.
  • [4] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [5] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in neural information processing systems, pages 10215–10224, 2018.
  • [6] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [7] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [8] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • [9] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.
  • [10] Amir Zadeh, Yao-Chong Lim, Paul Pu Liang, and Louis-Philippe Morency. Variational auto-decoder. arXiv preprint arXiv:1903.00840, 2019.