
Variance Reduction with Sparse Gradients

Melih Elibol, Michael I. Jordan
University of California, Berkeley
{elibol,jordan}@cs.berkeley.edu
Lihua Lei
Stanford
[email protected]
(September 2019)
Abstract

Variance reduction methods which use a mixture of large and small batch gradients, such as SVRG (svrg) and SpiderBoost (spiderboost), require significantly more computational resources per update than SGD (sgd). We reduce the computational cost per update of variance reduction methods by introducing a sparse gradient operator (sparse; sparsememory). While the computational cost of computing the derivative of a model parameter is constant, we make the observation that the gains in variance reduction are proportional to the magnitude of the derivative. In this paper, we show that a sparse large batch gradient based on the magnitude of smaller batch gradients reduces the computational cost of model updates without a significant loss in variance reduction.

1 Introduction

Variance reduction methods which use a mixture of large and small batch gradients, such as SVRG (svrg) and SpiderBoost (spiderboost), require significantly more computational resources per update than SGD (sgd). We reduce the computational cost per update of variance reduction methods by introducing a sparse gradient operator. While the computational cost of computing the derivative of a model parameter is constant, the gains in variance reduction are proportional to the magnitude of the derivative. In this paper, we claim that a sparse large batch gradient based on the magnitude of smaller batch gradients reduces the computational cost of model updates without a significant loss in variance reduction.
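To make the idea concrete, the following is a minimal NumPy sketch of a magnitude-based sparse gradient operator: it keeps the top-$k$ coordinates of a small-batch gradient by magnitude and evaluates the large-batch gradient only on that support. The helper names, the dense `grad_fn`, and the sparsity level `k` are illustrative assumptions rather than the paper's exact operator.

```python
import numpy as np

def topk_mask(g_small, k):
    """Boolean mask selecting the k largest-magnitude coordinates of a flat gradient."""
    idx = np.argpartition(np.abs(g_small), -k)[-k:]
    mask = np.zeros(g_small.shape, dtype=bool)
    mask[idx] = True
    return mask

def sparse_large_batch_gradient(grad_fn, x, large_batch, mask):
    """Restrict a large-batch gradient to the coordinates picked by the small batch.

    `grad_fn(x, batch)` is a hypothetical helper returning a dense gradient; an
    actual implementation would sparsify the forward and backward passes so that
    the masked-out coordinates are never computed at all.
    """
    g = np.zeros_like(x)
    g[mask] = grad_fn(x, large_batch)[mask]
    return g
```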

We begin with a description of the sparse variance reduction algorithm, as well as a concrete example of our algorithm applied to SpiderBoost (spiderboost) with sparse forward pass and back propagation on a simple two-layer neural network with non-linear activation. We follow this with an analysis of the computational complexity of the gradient update for SpiderBoost with sparsity. Finally, we demonstrate the performance of SpiderBoost with sparsity on a synthetic regression problem, as well as MNIST and CIFAR-10, and conclude that sparsity can improve the gradient query complexity of variance reduction methods without substantially increasing the variance of stochastic gradients.

2 Related Work

In practice, an interpolation between SGD (sgd) and GD, called Batch SGD, is used to minimize the finite-sum objective

$$\min_{x\in\mathbb{R}^{d}}\frac{1}{n}\sum_{i=1}^{n}f_{i}(x), \qquad (1)$$

where $x$ is a vector of weights, and $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is the loss associated with sample $i$. GD solves this objective by applying the update rule

$$x_{t+1}=x_{t}-\eta\mu(x_{t}), \qquad (2)$$

where $\eta$ is some constant learning rate, and $\mu(x)=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(x)$. Batch SGD minimizes Equation 1 by sampling $\mathcal{I}\sim U(1,\dots,k)$ and applying the update rule

$$x_{t+1}=x_{t}-\eta\mu_{\mathcal{I}}(x_{t}), \qquad (3)$$

where $k$ is the number of batches, and $\mu_{\mathcal{I}}(x)=\frac{1}{|b_{\mathcal{I}}|}\sum_{j\in b_{\mathcal{I}}}\nabla f_{j}(x)$. The variance of batch $b_{\mathcal{I}}$ at $x$ is

$$\mathbb{E}_{\mathcal{I}}\left[(\mu_{\mathcal{I}}(x)-\mu(x))^{2}\right]=\mathbb{E}_{\mathcal{I}}\left[\mu_{\mathcal{I}}^{2}(x)\right]-\mu^{2}(x). \qquad (4)$$

Note that when $|b_{\mathcal{I}}|=n$, the variance of batch SGD is 0.
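As a minimal numerical illustration of these definitions (assuming a synthetic least-squares objective; the dimensions, batch size, and step size below are arbitrary), the following sketch forms the full gradient $\mu(x)$ and the batch gradients $\mu_{\mathcal{I}}(x)$, and checks the identity in Equation 4 coordinate-wise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 1000, 10, 50
A = rng.normal(size=(n, d))
y = rng.normal(size=n)
x = rng.normal(size=d)

def grad(indices, x):
    """Average gradient of f_i(x) = 0.5 * (a_i^T x - y_i)^2 over the given indices."""
    Ai, yi = A[indices], y[indices]
    return Ai.T @ (Ai @ x - yi) / len(indices)

mu = grad(np.arange(n), x)                        # full gradient mu(x)
batches = np.split(rng.permutation(n), n // batch_size)
mus = np.stack([grad(b, x) for b in batches])     # batch gradients mu_I(x)

# Coordinate-wise check of Equation 4: E[(mu_I - mu)^2] = E[mu_I^2] - mu^2.
lhs = ((mus - mu) ** 2).mean(axis=0)
rhs = (mus ** 2).mean(axis=0) - mu ** 2
assert np.allclose(lhs, rhs)

x_next = x - 0.1 * grad(batches[0], x)            # one Batch SGD step (Equation 3)
```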

Through a combination of large and small batch gradients, variance reduction methods (svrg; scsg; spiderboost) also reduce the variance of stochastic gradients. However, variance reduction methods have failed to outperform SGD, let alone other optimization algorithms, in practice. This is primarily because the variance reduction they achieve comes at the cost of significant resource consumption, rendering them ineffective.
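For reference, the large/small-batch structure shared by these methods can be sketched as follows (a simplified SpiderBoost-style single loop on the same synthetic least-squares problem as above; the epoch length $q$, batch sizes, and step size are assumptions, and the large batch is simply the full data set here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad(indices, x):
    Ai, yi = A[indices], y[indices]
    return Ai.T @ (Ai @ x - yi) / len(indices)

eta, q, small = 0.05, 20, 10                      # step size, epoch length, small-batch size
x = x_prev = rng.normal(size=d)
for t in range(200):
    if t % q == 0:
        v = grad(np.arange(n), x)                 # large-batch gradient (here: full)
    else:
        b = rng.choice(n, size=small, replace=False)
        v = grad(b, x) - grad(b, x_prev) + v      # small-batch correction keeps v current
    x_prev, x = x, x - eta * v
```

In the approach described in the abstract, it is the large-batch term that gets sparsified, with the magnitudes of the small-batch gradients determining which coordinates are evaluated.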
