Variance Reduction with Sparse Gradients
Abstract
Variance reduction methods which use a mixture of large and small batch gradients, such as SVRG (svrg) and SpiderBoost (spiderboost), require significantly more computational resources per update than SGD (sgd). We reduce the computational cost per update of variance reduction methods by introducing a sparse gradient operator (sparse; sparsememory). While the computational cost of computing the derivative of a model parameter is constant, we observe that the gain in variance reduction is proportional to the magnitude of the derivative. In this paper, we show that a sparse large batch gradient based on the magnitude of smaller batch gradients reduces the computational cost of model updates without a significant loss in variance reduction.
1 Introduction
Variance reduction methods which use a mixture of large and small batch gradients, such as SVRG (svrg) and SpiderBoost (spiderboost), require significantly more computational resources per update than SGD (sgd). We reduce the computational cost per update of variance reduction methods by introducing a sparse gradient operator. While the computational cost of computing the derivative of a model parameter is constant, the gains in variance reduction are proportional to the magnitude of the derivative. In this paper, we claim that a sparse large batch gradient based on the magnitude of smaller batch gradients reduces the computational cost of model updates without a significant loss in variance reduction.
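To make the idea concrete, the sketch below shows one way such a sparse operator could work: keep only the coordinates of the large-batch gradient whose small-batch derivative magnitudes are largest, and zero the rest. This is an illustrative sketch rather than the paper's implementation; the function name, the top-k fraction `k`, and the use of NumPy are assumptions, and in practice the large-batch gradient would only be computed on the selected coordinates rather than masked after the fact.

```python
import numpy as np

def sparse_large_batch_grad(small_grad, large_grad, k=0.1):
    """Zero all but the top-k fraction of large-batch gradient coordinates,
    selected by the magnitude of the small-batch gradient (illustrative only)."""
    num_keep = max(1, int(k * small_grad.size))
    # Coordinates whose small-batch derivatives have the largest magnitude.
    top_idx = np.argpartition(np.abs(small_grad), -num_keep)[-num_keep:]
    sparse_grad = np.zeros_like(large_grad)
    sparse_grad[top_idx] = large_grad[top_idx]
    return sparse_grad
```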
We begin with a description of the sparse variance reduction algorithm, as well as a concrete example of our algorithm applied to SpiderBoost (spiderboost) with a sparse forward pass and backpropagation on a simple two-layer neural network with non-linear activation. We follow this with an analysis of the computational complexity of the gradient update for SpiderBoost with sparsity. Finally, we demonstrate the performance of SpiderBoost with sparsity on a synthetic regression problem, as well as MNIST and CIFAR-10, and conclude that sparsity can improve the gradient query complexity of variance reduction methods without substantially increasing the variance of stochastic gradients.
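For orientation before the formal description, the following is a minimal sketch of how the SpiderBoost update could be combined with such a sparse operator. It assumes the standard SpiderBoost recursion (a large-batch gradient every q steps, a small-batch correction otherwise) and sparsifies the large-batch gradient by the magnitudes of the most recent small-batch gradient; `grad_batch`, the hyperparameter defaults, and the magnitude-tracking rule are placeholders, not the paper's exact algorithm.

```python
import numpy as np

def spiderboost_sparse(x, grad_batch, n, lr=0.1, q=10, b_small=32,
                       b_large=1024, k=0.1, steps=100, rng=None):
    """Sketch of SpiderBoost with a sparsified large-batch gradient.
    grad_batch(x, idx) should return the average gradient of f_i over idx."""
    rng = rng or np.random.default_rng(0)
    v, x_prev = np.zeros_like(x), x.copy()
    mag = np.zeros_like(x)          # magnitudes of the last small-batch gradient
    for t in range(steps):
        if t % q == 0:
            # Large-batch step: keep only coordinates that were large in recent
            # small-batch gradients (masked here for illustration; in practice
            # only these coordinates would be computed).
            idx = rng.choice(n, size=min(b_large, n), replace=False)
            g_large = grad_batch(x, idx)
            keep = max(1, int(k * x.size))
            top = np.arange(x.size) if t == 0 else np.argpartition(mag, -keep)[-keep:]
            v = np.zeros_like(x)
            v[top] = g_large[top]
        else:
            # Small-batch SPIDER-style correction term.
            idx = rng.choice(n, size=b_small, replace=False)
            g_new, g_old = grad_batch(x, idx), grad_batch(x_prev, idx)
            v = g_new - g_old + v
            mag = np.abs(g_new)
        x_prev = x.copy()
        x = x - lr * v
    return x
```

For a least-squares loss, for instance, `grad_batch` could be defined as `lambda x, idx: A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)`.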
2 Related Work
In practice, an interpolation between SGD (sgd) and GD, called Batch SGD, is used to minimize the finite-sum objective
\[
f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}
\]
where $x \in \mathbb{R}^d$ is a vector of weights, and $f_i$ is the loss associated with sample $i$. GD solves this objective by applying the update rule
\[
x_{t+1} = x_t - \eta \nabla f(x_t), \tag{2}
\]
where $\eta$ is some constant learning rate, and $\nabla f(x_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_t)$. Batch SGD minimizes Equation 1 by sampling a batch index $j_t$ uniformly from $\{1, \dots, m\}$ and applying the update rule
\[
x_{t+1} = x_t - \eta \nabla f_{\mathcal{I}_{j_t}}(x_t), \tag{3}
\]
where $m$ is the number of batches, the batches $\mathcal{I}_1, \dots, \mathcal{I}_m$ partition $\{1, \dots, n\}$, and $\nabla f_{\mathcal{I}}(x) = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \nabla f_i(x)$. The variance of the batch gradient $\nabla f_{\mathcal{I}_{j_t}}(x)$ at $x$ is
\[
\mathbb{E}_{j_t}\left[ \left\| \nabla f_{\mathcal{I}_{j_t}}(x) - \nabla f(x) \right\|^2 \right] = \frac{1}{m} \sum_{j=1}^{m} \left\| \nabla f_{\mathcal{I}_j}(x) - \nabla f(x) \right\|^2. \tag{4}
\]
Note that when $|\mathcal{I}_j| = n$, i.e. when a single batch contains all $n$ samples, the variance of batch SGD is $0$.
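As a quick numerical illustration of Equation 4, the snippet below computes the batch-gradient variance for a toy least-squares loss; the problem sizes and the use of NumPy are arbitrary choices for the sketch, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 120, 5, 6                      # samples, dimensions, number of batches
A, y = rng.normal(size=(n, d)), rng.normal(size=n)
x = rng.normal(size=d)

# Per-sample gradients of f_i(x) = 0.5 * (a_i^T x - y_i)^2.
per_sample = (A @ x - y)[:, None] * A    # shape (n, d)
full = per_sample.mean(axis=0)           # gradient of f

# Partition {1, ..., n} into m equally sized batches.
batches = np.array_split(rng.permutation(n), m)
batch_grads = np.stack([per_sample[idx].mean(axis=0) for idx in batches])

variance = np.mean(np.sum((batch_grads - full) ** 2, axis=1))
print(variance)                                       # > 0 for m > 1
print(np.sum((per_sample.mean(axis=0) - full) ** 2))  # 0 when one batch holds all n samples
```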
Through a combination of large and small batch gradients, variance reduction methods (svrg; scsg; spiderboost) also reduce the variance of stochastic gradients. However, variance reduction methods have failed to outperform SGD, let alone other optimization algorithms, in practice. This is primarily because the variance reduction they achieve is poor relative to their significant resource consumption, which makes them ineffective.