
Quasi-Monte Carlo Quasi-Newton for variational Bayes

Sifan Liu and Art B. Owen, Department of Statistics, Stanford University
(April 2021)
Abstract

Many machine learning problems optimize an objective that must be measured with noise. The primary method is a first order stochastic gradient descent using one or more Monte Carlo (MC) samples at each step. There are settings where ill-conditioning makes second order methods such as L-BFGS more effective. We study the use of randomized quasi-Monte Carlo (RQMC) sampling for such problems. When MC sampling has a root mean squared error (RMSE) of $O(n^{-1/2})$ then RQMC has an RMSE of $o(n^{-1/2})$ that can be close to $O(n^{-3/2})$ in favorable settings. We prove that improved sampling accuracy translates directly to improved optimization. In our empirical investigations for variational Bayes, using RQMC with stochastic L-BFGS greatly speeds up the optimization, and sometimes finds a better parameter value than MC does.

1 Introduction

Many practical problems take the form

\min_{\theta\in\Theta\subseteq\mathbb{R}^d} F(\theta)\quad\text{where}\quad F(\theta)=\mathbb{E}(f(\boldsymbol{z};\theta)) \qquad (1)

and $\boldsymbol{z}$ is a random vector with a known distribution $p$. Classic problems of simulation-optimization (Andradóttir, 1998) take this form, and recently it has become very important in machine learning, with variational Bayes (VB) (Blei et al., 2017) and Bayesian optimization (Frazier, 2018) being prominent examples.

First order optimizers (Beck, 2017) use a sequence of steps $\theta_{k+1}\leftarrow\theta_k-\alpha_k\nabla F(\theta_k)$ for step sizes $\alpha_k>0$ and an operator $\nabla$ that we always take to be the gradient with respect to $\theta$. Very commonly, neither $F$ nor its gradient is available at reasonable computational cost and a Monte Carlo (MC) approximation is used instead. The update

\theta_{k+1}\leftarrow\theta_k-\alpha_k\nabla\hat{F}(\theta_k)\quad\text{for}\quad \nabla\hat{F}(\theta_k)=\frac{1}{n}\sum_{i=1}^{n}g(\boldsymbol{z}_i;\theta_k)\quad\text{and}\quad g(\boldsymbol{z};\theta)=\nabla f(\boldsymbol{z};\theta) \qquad (2)

for $\boldsymbol{z}_i\stackrel{\text{iid}}{\sim}p$ is a simple form of stochastic gradient descent (SGD) (Duchi, 2018). There are many versions of SGD, notably AdaGrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2014), which are prominent in the optimization of neural networks among other learning algorithms. Very often SGD involves sampling a mini-batch of observations. In this paper we consider SGD that instead samples some random quantities from a continuous distribution.
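To make the update (2) concrete, here is a minimal sketch in Python/NumPy. It is an illustration only: the gradient function grad_f and the choice of $p$ as a standard Gaussian are placeholders of ours, not part of any method above.

import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_f, theta0, dim_z, n=8, n_iters=1000, alpha=0.1):
    # Plain SGD as in (2), averaging n fresh MC draws per step.
    # grad_f(Z, theta) returns the n gradients g(z_i; theta) stacked
    # in an (n, dim_theta) array; here p is taken to be N(0, I).
    theta = theta0.copy()
    for _ in range(n_iters):
        Z = rng.standard_normal((n, dim_z))   # z_1, ..., z_n drawn iid from p
        theta = theta - alpha * grad_f(Z, theta).mean(axis=0)
    return theta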

While a simple SGD is often very useful, there are settings where it can be improved. SGD is known to have slow convergence when the Hessian of $F$ is ill-conditioned (Bottou et al., 2018). When $\theta$ is not very high dimensional, second order methods such as Newton iteration using exact or approximate Hessians can perform better. Quasi-Newton methods such as BFGS and L-BFGS (Nocedal and Wright, 2006), which we describe in more detail below, can handle much higher dimensional parameters than Newton methods can while still improving upon first order methods. We note that quasi second-order methods cannot be proved to have a better convergence rate than SGD; see Agarwal et al. (2012). They can however have a better implied constant.

A second difficulty with SGD is that MC sampling to estimate the gradient can be error prone or inefficient. For a survey of MC methods to estimate a gradient see Mohamed et al. (2020). Variance reduction methods have been adopted to improve SGD. For instance, Paisley et al. (2012) and Miller et al. (2017) both employ control variates in VB. Recently, randomized quasi-Monte Carlo (RQMC) methods, which we describe below, have been used in place of MC to improve upon SGD. Notable examples are Balandat et al. (2020) for Bayesian optimization and Buchholz et al. (2018) for VB. RQMC can greatly improve the accuracy with which integrals are estimated. The theory in Buchholz et al. (2018) shows how the improved integration accuracy from RQMC translates into faster optimization for SGD. The primary benefit is that RQMC can use a much smaller value of $n$. This is also seen empirically in Balandat et al. (2020) for Bayesian optimization.

Our contribution is to combine RQMC with a second order limited memory method known as L-BFGS. We show theoretically that improved integration leads to improved optimization. We show empirically for some VB examples that the optimization is improved. We find that RQMC regularly allows one to use fewer samples per iteration and in a crossed random effects example it found a better solution than we got with MC.

At step $k$ of our stochastic optimization we will use some number $n$ of sample values, $\boldsymbol{z}_{k,1},\dots,\boldsymbol{z}_{k,n}$. It is notationally convenient to group these together into $\mathcal{Z}_k=(\boldsymbol{z}_{k,1},\dots,\boldsymbol{z}_{k,n})$. We also write

\bar{g}(\mathcal{Z}_k;\theta)=\frac{1}{n}\sum_{i=1}^{n}g(\boldsymbol{z}_{k,i};\theta) \qquad (3)

for gradient estimation at step $k$.

The closest works to ours are Balandat et al. (2020) and Buchholz et al. (2018). Like Balandat et al. (2020), we incorporate RQMC into L-BFGS. Our algorithm differs in that we take fresh RQMC samples at each iteration, whereas they had a fixed sample of size $n$ that they used in a 'common random numbers' approach. They prove consistency as $n\to\infty$ for both MC and RQMC sampling. Their proof for RQMC required the recent strong law of large numbers for RQMC from Owen and Rudolf (2021). Our analysis incorporates the geometric decay of estimation error as the number $K$ of iterations increases, similar to that used by Buchholz et al. (2018) for SGD. Our error bounds include sampling variances, through which RQMC brings an advantage.

An outline of this paper is as follows. Section 2 reviews the optimization methods we need. Section 3 gives basic properties of scrambled net sampling, a form of RQMC. Section 4 presents our main theoretical findings. The optimality gap after $K$ steps of stochastic quasi-Newton is $F(\theta_K)-F(\theta^*)$ where $\theta^*$ is the optimal value. Theorem 4.1 bounds the expected optimality gap by a term that decays exponentially in $K$ plus a second term that is linear in a measure of sampling variance. RQMC greatly reduces that second, non-exponentially decaying term, which will often dominate. A similar bound holds for tail probabilities of the optimality gap. Theorem 4.2 obtains a comparable bound for $\mathbb{E}(\|\theta_K-\theta^*\|^2)$. Section 5 gives numerical examples on some VB problems. In a linear regression problem where the optimal parameters are known, we verify that RQMC converges to them at an improved rate in $n$. In logistic regression, crossed random effects and variational autoencoder examples we see the second order methods greatly outperform SGD in terms of wall clock time to improve $F$. Since the true parameters are unknown we cannot compare the accuracy of MC and RQMC sampling algorithms for those examples. In some examples RQMC finds a better ELBO than MC does when both use SGD, but the BFGS algorithms find yet better ELBOs. Section 6 gives some conclusions. The proofs of our main results are in an appendix.

2 Quasi-Newton optimization

We write $\nabla^2 F(\theta)$ for the Hessian matrix of $F(\theta)$. The classic Newton update is

\theta_{k+1}\leftarrow\theta_k-(\nabla^2 F(\theta_k))^{-1}\nabla F(\theta_k). \qquad (4)

Under ideal circumstances it converges quadratically to the optimal parameter value $\theta^*$, that is, $\|\theta_{k+1}-\theta^*\|=O(\|\theta_k-\theta^*\|^2)$. Newton's method is unsuitable for the problems we consider here because forming $\nabla^2 F\in\mathbb{R}^{d\times d}$ may take too much space and solving the equation in (4) can cost $O(d^3)$, which is prohibitively expensive. The quasi-Newton methods we consider do not form explicit Hessians. We also need to consider stochastic versions of them.

2.1 BFGS and L-BFGS

The BFGS method is named after four independent discoverers: Broyden, Fletcher, Goldfarb and Shanno. See Nocedal and Wright (2006, Chapter 6). BFGS avoids explicitly computing and inverting the Hessian of $F$. Instead it maintains at step $k$ an approximation $H_k$ to the inverse of $\nabla^2 F(\theta_k)$. After an initialization such as setting $H_1$ to the identity matrix, the algorithm updates $\theta$ and $H$ via

\theta_{k+1} \leftarrow \theta_k-\alpha_k H_k\nabla F(\theta_k),\quad\text{and}
H_{k+1} \leftarrow \Bigl(I-\frac{s_k y_k^\intercal}{s_k^\intercal y_k}\Bigr)H_k\Bigl(I-\frac{y_k s_k^\intercal}{s_k^\intercal y_k}\Bigr)+\frac{s_k s_k^\intercal}{s_k^\intercal y_k},

respectively, where

s_k=\theta_{k+1}-\theta_k\quad\text{and}\quad y_k=\nabla F(\theta_{k+1})-\nabla F(\theta_k).

The stepsize $\alpha_k$ is found by a line search.

Storing $H_k$ is a burden and the limited-memory BFGS (L-BFGS) algorithm of Nocedal (1980) avoids forming $H_k$ explicitly. Instead it computes $H_k\nabla F(\theta_k)$ using a recursion based on the $m$ most recent $(s_k,y_k)$ pairs. See Nocedal and Wright (2006, Algorithm 7.4).
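For concreteness, a minimal Python sketch of the two-loop recursion follows. It computes $H_k\nabla F(\theta_k)$ from the stored pairs; the initial scaling $H_k^0=\gamma_k I$ with $\gamma_k=s_k^\intercal y_k/y_k^\intercal y_k$ is a common choice that we assume here, not something fixed by the text above.

import numpy as np

def two_loop_recursion(grad, s_list, y_list):
    # L-BFGS two-loop recursion (cf. Nocedal and Wright, 2006, Algorithm 7.4).
    # s_list and y_list hold the m most recent pairs, oldest first.
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    s, y = s_list[-1], y_list[-1]
    r = ((s @ y) / (y @ y)) * q           # initial scaling H_k^0 = gamma_k I
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return r                              # approximates H_k @ grad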

2.2 Stochastic quasi-Newton

Ordinarily in quasi-Newton algorithms the objective function remains constant through all the iterations. In stochastic quasi-Newton algorithms the sample points change at each iteration, which is like having the objective function $F$ change in a random way at iteration $k$. Despite this, Bottou et al. (2018) find that quasi-Newton can work well in simulation-optimization with these random changes.

We will need to use sampling methods to evaluate gradients. Given $\mathcal{Z}=(\boldsymbol{z}_1,\dots,\boldsymbol{z}_n)$, the sample gradient is $\bar{g}(\mathcal{Z};\theta)$ as given at (3). Stochastic quasi-Newton algorithms like the one we study also require randomized Hessian information.

Byrd et al. (2016) develop a stochastic L-BFGS algorithm in the mini-batch setting. Instead of adding a correction pair $(s_k,y_k)$ to the buffer after every iteration, their algorithm updates the buffer every $B$ steps using averaged correction pairs. Specifically, after every $B$ iterations, for $k=t\times B$, it computes the average parameter value $\bar\theta_t=B^{-1}\sum_{j=k-B+1}^{k}\theta_j$ over the most recent $B$ steps. It then computes the correction pairs by $s_t=\bar\theta_t-\bar\theta_{t-1}$ and

y_t=\bar{g}(\widetilde{\mathcal{Z}}_t;\bar\theta_t)-\bar{g}(\widetilde{\mathcal{Z}}_t;\bar\theta_{t-1})=\frac{1}{n}\sum_{i=1}^{n}\bigl(g(\tilde{\boldsymbol{z}}_{t,i};\bar\theta_t)-g(\tilde{\boldsymbol{z}}_{t,i};\bar\theta_{t-1})\bigr),

and adds the pair $(s_t,y_t)$ to the buffer. Here $\widetilde{\mathcal{Z}}_t=(\tilde{\boldsymbol{z}}_{t,1},\dots,\tilde{\boldsymbol{z}}_{t,n_h})$ is a random sample of size $n_h$ of Hessian update samples. The samples $\widetilde{\mathcal{Z}}_t$ for $t\geqslant1$ are completely different from and independent of the $\mathcal{Z}_k$ for $k\geqslant1$ used to update gradient estimates at step $k$. This method is called SQN (stochastic quasi-Newton). As suggested by the authors, $B$ is often taken to be 10 or 20, so one can afford to use a relatively large $n_h$ because the amortized average number of gradient evaluations per iteration is $n_g+2n_h/B$.

The objective function from Byrd et al. (2016) has the form $F(\theta)=(1/N)\sum_{i=1}^{N}f(\boldsymbol{x}_i;\theta)$, where $\{\boldsymbol{x}_1,\dots,\boldsymbol{x}_N\}$ is a fixed training set. At each iteration, their random sample is a subset of $n\ll N$ points in the dataset, not a sample generated from some continuous distribution. However, it is straightforward to adapt their algorithm to our setting. The details are in Algorithm 1 in Section 3.

2.3 Literature review

Many authors have studied how to use L-BFGS in stochastic settings. Bollapragada et al. (2018) propose several techniques for stochastic L-BFGS, including increasing the sample size with iterations (progressive batching), choosing the initial step length for backtracking line search so that the expected value of the objective function decreases, and computing the correction pairs using overlapping samples in consecutive steps (Berahas et al., 2016).

Another way to prevent noisy updates is a lengthening strategy from Xie et al. (2020) and Shi et al. (2020). Classical BFGS would use the correction pair $(\alpha_k p_k,\, g(\theta_k+\alpha_k p_k,\boldsymbol{z})-g(\theta_k,\boldsymbol{z}))$, where $p_k=-H_k g(\theta_k,\boldsymbol{z})$ is the update direction and $\boldsymbol{z}$ encodes the randomness in estimating the gradients. Because the correction pairs may be dominated by noise, Xie et al. (2020) suggest the lengthened correction pairs

(s_k,\,y_k)=(\beta_k p_k,\; g(\theta_k+\beta_k p_k)-g(\theta_k)),\quad\text{where }\beta_k\geqslant\alpha_k.

Shi et al. (2020) propose to choose $\alpha_k$ by the Armijo-Wolfe condition, while choosing $\beta_k$ large enough that $[g(\theta_k+\beta_k p_k)-g(\theta_k)]^\intercal p_k/\|p_k\|$ is sufficiently large.

Gower et al. (2016) use sketching strategies to update the inverse Hessian approximations by compressed Hessians. Moritz et al. (2016) combine the stochastic quasi-Newton algorithm of Byrd et al. (2016) with stochastic variance reduced gradients (SVRG) (Johnson and Zhang, 2013) by occasionally computing the gradient using a full batch.

Balandat et al. (2020) also applied RQMC with L-BFGS in Bayesian optimization. At each step of Bayesian optimization, one needs to maximize an acquisition function of the form $\alpha(\theta):=\mathbb{E}[\ell(g(\theta))]$, where $g$ is a Gaussian process and $\ell$ is a loss function. They use the sample average approximation $\hat\alpha(\theta):=(1/n)\sum_{i=1}^{n}\ell(\xi_i(\theta))$, where $\xi_i(\theta)\sim g(\theta)$. They prove that the maximizer of $\hat\alpha$ converges to that of $\alpha$ as $n\to\infty$ for both MC and RQMC sampling under certain conditions. The RQMC result relies on the recent discovery of a strong law of large numbers for RQMC (Owen and Rudolf, 2021).

3 Scrambled net sampling

Scrambled nets are a form of RQMC sampling. We begin by briefly describing plain quasi-Monte Carlo (QMC) sampling. QMC is most easily described for computing expectations of $f(\boldsymbol{u})$ for $\boldsymbol{u}\sim\mathbf{U}[0,1]^s$. In our present context $f(\cdot)$ will be a component of $g(\cdot;\theta)$ and not the same as the $f$ in equation (1). In practice we must ordinarily transform the random variables $\boldsymbol{u}$ to some other distribution, such as a Gaussian, via $\boldsymbol{x}=\psi(\boldsymbol{u})$. We suppose that $\psi(\boldsymbol{u})\sim p$ for some transformation $\psi(\cdot)$. The text by Devroye (1986) has many examples of such transformations. Here we subsume any such transformation $\psi(\cdot)$ into the definition of $f$.

In QMC sampling, we estimate $\mu=\int_{[0,1]^s}f(\boldsymbol{u})\,\mathrm{d}\boldsymbol{u}$ by $\hat\mu=(1/n)\sum_{i=1}^{n}f(\boldsymbol{u}_i)$, just as in MC sampling, except that the distinct points $\boldsymbol{u}_i$ are chosen so that the discrete uniform distribution $\mathbf{U}\{\boldsymbol{u}_1,\dots,\boldsymbol{u}_n\}$ is very close to the continuous $\mathbf{U}[0,1]^s$ distribution. The difference between these two distributions can be quantified in many ways, called discrepancies (Chen et al., 2014). For a comprehensive treatment of QMC see Dick and Pillichshammer (2010), Niederreiter (1992) or Dick et al. (2013).

When $f$ is of bounded variation in the sense of Hardy and Krause (BVHK) (see Owen, 2005), QMC attains the asymptotic error rate $|\hat\mu-\mu|=O(n^{-1}\log(n)^{s-1})=O(n^{-1+\epsilon})$ for any $\epsilon>0$. QMC is deterministic, and RQMC methods were introduced to get practical error estimates. In RQMC, each individual $\boldsymbol{u}_i\sim\mathbf{U}[0,1]^s$ while collectively $\boldsymbol{u}_1,\dots,\boldsymbol{u}_n$ still retain the low discrepancy property. Uniformity of the $\boldsymbol{u}_i$ makes RQMC unbiased: $\mathbb{E}(\hat\mu)=\mu$. Then if $f\in\mathrm{BVHK}$ we get an RMSE of $O(n^{-1+\epsilon})$. The whole RQMC process can then be replicated independently to quantify uncertainty. See Cranley and Patterson (1976) and Owen (1995) for methods, and L'Ecuyer and Lemieux (2002) for a survey.
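The rate difference is easy to observe numerically. The sketch below uses the scipy.stats.qmc module mentioned in Section 6 to compare MC and scrambled Sobol' RMSEs on a smooth toy integrand; the integrand, sample size and replicate count are our own illustrative choices.

import numpy as np
from scipy.stats import qmc   # scipy >= 1.7

def f(u):
    return np.prod(u, axis=1)          # smooth integrand on [0,1]^2; integral is 1/4

n, d, reps, mu = 1024, 2, 20, 0.25
rng = np.random.default_rng(0)
mc_est = [f(rng.random((n, d))).mean() for _ in range(reps)]
rqmc_est = [f(qmc.Sobol(d, scramble=True, seed=r).random(n)).mean()
            for r in range(reps)]
print("MC   RMSE:", np.sqrt(np.mean((np.array(mc_est) - mu) ** 2)))
print("RQMC RMSE:", np.sqrt(np.mean((np.array(rqmc_est) - mu) ** 2)))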

Scrambled net sampling (Owen, 1995) is a form of RQMC that operates by randomly permuting the bits (more generally, digits) of QMC methods called digital nets. The best known are those of Sobol' (1969) and Faure (1982). In addition to supporting error estimation, scrambled nets give the user some control over the powers of $\log(n)$ in the QMC rate, and they extend the domain of QMC from Riemann integrable functions, of which BVHK is a subset, to much more general functions, including some with integrable singularities. For any integrand $f\in L^2[0,1]^s$, MC has RMSE $O(n^{-1/2})$ while scrambled nets have RMSE $o(n^{-1/2})$, without requiring $f\in\mathrm{BVHK}$ or even that $f$ is Riemann integrable (Owen, 1997a). For fixed $n$, each construction of scrambled nets has a 'gain constant' $\Gamma<\infty$ such that the RMSE is below $\sqrt{\Gamma}\,n^{-1/2}$ for any $f\in L^2[0,1]^s$. This effectively counters the powers of $\log(n)$. For smooth enough $f$, an error cancellation phenomenon for scrambled nets yields an RMSE of $O(n^{-3/2}\log(n)^{(s-1)/2})=O(n^{-3/2+\epsilon})$ (Owen, 1997b; Yue and Mao, 1999; Owen, 2008). The logarithmic powers here cannot 'set in' until they are small enough to obey the $\Gamma^{1/2}n^{-1/2}$ upper bound. Some forms of scrambled net sampling satisfy a central limit theorem (Loh, 2003; Basu and Mukherjee, 2017).

Very recently, a strong law of large numbers

\Pr\Bigl(\lim_{n\to\infty}\hat\mu_n=\mu\Bigr)=1

has been proved for scrambled net sampling assuming only that $f\in L^{1+\delta}[0,1]^s$ for some $\delta>0$ (Owen and Rudolf, 2021). The motivation for this result was that Balandat et al. (2020) needed a strong law of large numbers to prove consistency for their use of scrambled nets in Bayesian optimization.

Figure 1 graphically compares MC, QMC and RQMC points for $s=2$. The underlying QMC method is a Sobol' sequence using 'direction numbers' from Joe and Kuo (2008). We can see that MC points leave voids and create clumps. The QMC points are spread out more evenly but show a strong diagonal structure. The RQMC points satisfy the same discrepancy bounds as the QMC points do, but have broken up some of that structure.

Figure 1: Each panel shows $512$ points in $[0,1]^2$. From left to right they are plain MC, Sobol' points and scrambled Sobol' points. From Owen and Rudolf (2021). Copyright © 2021 Society for Industrial and Applied Mathematics. Reprinted with permission. All rights reserved.

In favorable settings the empirical behaviour of QMC and RQMC for realistic $n$ can be as good as their asymptotic rates. In less favorable settings RQMC can be like MC with somewhat reduced variance. The favorable integrands are those where $f$ is nearly additive, or at least dominated by sums over only a few of its inputs at a time. See Caflisch et al. (1997) for a definition of functions of 'low effective dimension' and Dick et al. (2013) for a survey of work on reproducing kernel Hilbert spaces of favorable integrands.

We propose to combine the stochastic quasi-Newton method with RQMC samples to create a randomized quasi-stochastic quasi-Newton (RQSQN) algorithm. At the $k$'th iteration, we draw a fresh, independently scrambled sample $\mathcal{Z}_k=(\boldsymbol{z}_{k,1},\dots,\boldsymbol{z}_{k,n_g})$ of size $n_g$ via RQMC to compute the gradient estimator $\bar{g}(\mathcal{Z}_k;\theta_k)$. We then find the descent direction $H_k\bar{g}(\mathcal{Z}_k;\theta_k)$ using the L-BFGS two-loop recursion, and update the solution by

\theta_{k+1}\leftarrow\theta_k-\alpha_k H_k\bar{g}(\mathcal{Z}_k;\theta_k).

Here $\alpha_k$ may be found by a line search with the Wolfe conditions when using L-BFGS. See Chapter 3.1 of Nocedal and Wright (2006).

Algorithm 1 shows pseudo-code for an RQMC version of SQN based on L-BFGS. It resembles the SQN algorithm in Byrd et al. (2016), except that the random samples are drawn using RQMC instead of being sampled without replacement from a finite data set. Note that we do not compute the Hessian directly. We either compute the Hessian-vector products $\frac{1}{n_h}\sum_{i=1}^{n_h}\nabla^2 f(\tilde{\boldsymbol{z}}_{t,i};\bar\theta_t)\,s_t$ or use the gradient differences $\bar{g}(\widetilde{\mathcal{Z}}_t;\bar\theta_t)-\bar{g}(\widetilde{\mathcal{Z}}_t;\bar\theta_{t-1})$.

Input: Initialization $\theta_1$, buffer size $m$, Hessian update interval $B$, sample sizes $n_g$ and $n_h$ for estimating the gradient and updating the buffer
Output: Solution $\theta$
$t\leftarrow-1$;
for $k=1,2,\ldots$ do
       Take an RQMC sample $\boldsymbol{z}_{k,1},\ldots,\boldsymbol{z}_{k,n_g}\sim p$;
       Calculate the gradient estimator $g_k\leftarrow\bar{g}(\mathcal{Z}_k;\theta_k)=\frac{1}{n_g}\sum_{i=1}^{n_g}g(\boldsymbol{z}_{k,i};\theta_k)$;
       if $t<1$ then
             $\theta_{k+1}\leftarrow\theta_k-\alpha_k g_k$;
       else
             Find $H_t g_k$ by the two-loop recursion with memory size $m$;
             Find $\alpha_k$ by line search;
             Update $\theta_{k+1}\leftarrow\theta_k-\alpha_k H_t g_k$;

       if $\mathrm{mod}(k,B)=0$ then
             $t\leftarrow t+1$;
             $\bar\theta_t\leftarrow B^{-1}\sum_{j=k-B+1}^{k}\theta_j$;
             if $t>0$ then
                   Take an RQMC sample $\tilde{\boldsymbol{z}}_{t,1},\ldots,\tilde{\boldsymbol{z}}_{t,n_h}\sim p$;
                   Add $s_t=\bar\theta_t-\bar\theta_{t-1}$ and $y_t=\frac{1}{n_h}\sum_{i=1}^{n_h}\nabla^2 f(\tilde{\boldsymbol{z}}_{t,i};\bar\theta_t)\,s_t$ to the buffer;

Algorithm 1: RQMC-SQN
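A Python sketch of Algorithm 1 follows, reusing two_loop_recursion from Section 2.1. It uses the gradient-difference form of $y_t$ and a constant step size in place of the line search; the transformation psi (mapping uniforms to $p$), the curvature check $s_t^\intercal y_t>0$, and the seeding scheme are implementation choices of this sketch rather than parts of the algorithm above.

import numpy as np
from collections import deque
from scipy.stats import qmc

def rqmc_sqn(grad, psi, dim_z, theta0, n_g=64, n_h=1024, B=20, m=10,
             alpha=0.01, n_iters=500, seed=0):
    # grad(Z, theta): average of g(z; theta) over the rows of Z.
    # psi: maps uniform points in [0,1)^dim_z to draws from p.
    theta = theta0.copy()
    buf = deque(maxlen=m)                 # the m most recent (s_t, y_t) pairs
    theta_sum = np.zeros_like(theta)
    theta_bar_prev = None
    t = -1
    for k in range(1, n_iters + 1):
        sob = qmc.Sobol(dim_z, scramble=True, seed=seed + k)
        Z = psi(sob.random(n_g))          # fresh RQMC sample for the gradient
        g = grad(Z, theta)
        if t < 1 or not buf:
            step = g                      # plain gradient steps until t >= 1
        else:
            step = two_loop_recursion(g, [s for s, _ in buf],
                                      [y for _, y in buf])
        theta = theta - alpha * step
        theta_sum += theta
        if k % B == 0:                    # Hessian update epoch
            t += 1
            theta_bar = theta_sum / B
            theta_sum[:] = 0.0
            if t > 0:
                sob_h = qmc.Sobol(dim_z, scramble=True,
                                  seed=seed + 10**6 + t)
                Zt = psi(sob_h.random(n_h))
                s_t = theta_bar - theta_bar_prev
                y_t = grad(Zt, theta_bar) - grad(Zt, theta_bar_prev)
                if s_t @ y_t > 0:         # keep the implied H_t positive definite
                    buf.append((s_t, y_t))
            theta_bar_prev = theta_bar
    return theta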

4 Theoretical guarantees

In this section, we study the convergence rate of a general quasi-Newton iteration based on $n$ sample points $\boldsymbol{z}_{k,1},\dots,\boldsymbol{z}_{k,n}\sim p$ at stage $k$. The algorithm iterates as follows:

\theta_{k+1}\leftarrow\theta_k-\alpha_k H_k\nabla\bar{f}(\mathcal{Z}_k;\theta_k),\quad\text{where}\qquad (5)
\nabla\bar{f}(\mathcal{Z}_k;\theta_k)=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\boldsymbol{z}_{k,i};\theta_k)=\bar{g}(\mathcal{Z}_k;\theta_k).

Here $H_k$ is an approximate inverse Hessian. We let $F(\theta)=\mathbb{E}(f(\boldsymbol{z};\theta))$. We assume that the gradient estimator $\bar{g}(\mathcal{Z}_k;\theta_k)$ is unbiased conditionally on $\theta_k$, i.e., $\mathbb{E}(\bar{g}(\mathcal{Z}_k;\theta_k)\mid\theta_k)=\nabla F(\theta_k)$. The $\theta_k$ are random because they depend on $\mathcal{Z}_{k'}$ for $k'<k$.

The Hessian estimates $H_k$ for $k>1$ are also random because they depend directly on some additional inputs $\widetilde{\mathcal{Z}}_t$ for those Hessian update epochs $t$ which occur prior to step $k$. They also depend on $\mathcal{Z}_{k'}$ for $k'<k$ because those random variables affect $\theta_{k'}$. In our algorithms $H_k$ is independent of $\theta_k$. For RQMC this involves fresh rerandomizations of the underlying QMC points at step $k$. The alternative of 'going deeper' into a given RQMC sequence, using points $(k-1)n+1$ through $kn$, would not satisfy independence. While it might possibly perform better, it is harder to analyze, and we saw little difference empirically when we tried it. Our theory requires regularity of the problem as follows.

Assumption 1

We impose these three conditions:

(a) Strong convexity. For some $c>0$,

    F(\theta')\geqslant F(\theta)+\nabla F(\theta)^\intercal(\theta'-\theta)+\frac{c}{2}\|\theta'-\theta\|_2^2\quad\text{for all}\quad\theta,\theta'\in\Theta.

(b) Lipschitz continuous objective gradients. For some $L<\infty$,

    \|\nabla F(\theta)-\nabla F(\theta')\|\leqslant L\|\theta-\theta'\|\quad\text{for all}\quad\theta,\theta'\in\Theta.

(c) Bounded variance of the gradient estimator. For some $M<\infty$,

    \mathrm{tr}\bigl(\mathrm{Var}(\bar{g}(\mathcal{Z}_k;\theta_k)\mid\theta_k)\bigr)\leqslant M\quad\text{for all}\quad k\geqslant 1.

These are standard assumptions in the study of SGD. For example, see Assumptions 4.1 and 4.5 of Bottou et al. (2018). Strong convexity implies that there exists a unique minimizer $\theta^*$ to estimate. We write $F^*=F(\theta^*)$ for the best possible value of $F$. We must have some smoothness, and Lipschitz continuity is a mild assumption. The quantity $M$ will prove to be important below. We can get a better $M$ from RQMC than from MC. The other way to reduce $M$ is to increase $n$. When RQMC has an advantage it is because it gets a smaller $M$ for the same $n$, or to put it another way, it can get a comparable $M$ with a smaller $n$.

For two symmetric matrices $A$ and $B$, $A\preccurlyeq B$ means that $B-A$ is positive semi-definite. We denote the spectral norm of $A$ by $\|A\|$.

Theorem 4.1 (Convergence of optimality gap)

Suppose that our simulation-optimization problem satisfies the regularity conditions in Assumption 1. Assume that we run updates as in equation (5), where the approximate inverse Hessian matrices $H_k$ satisfy $h_1 I\preccurlyeq H_k\preccurlyeq h_2 I$ for some $0<h_1\leqslant h_2$ and all $k\geqslant1$. Next, assume constant step sizes $\alpha_k=\alpha$ with $0<\alpha\leqslant h_1/(Lh_2^2)$. Then for every $K\geqslant1$,

\mathbb{E}(F(\theta_K)-F^*)\leqslant(1-\alpha ch_1)^K(F(\theta_0)-F^*)+\frac{\alpha Lh_2^2}{2ch_1}M. \qquad (6)

Furthermore, if $\|g(\boldsymbol{z};\theta)\|\leqslant C$ for some constant $C$, for all $\theta$ and $\boldsymbol{z}$, then for any $\varepsilon>0$,

F(\theta_K)-F^*\leqslant(1-\alpha ch_1)^K(F(\theta_0)-F^*)+\frac{\alpha Lh_2^2}{2ch_1}M+C^2\sqrt{\frac{2\alpha}{ch_1}}\bigl(h_2-L\alpha h_1^2+h_1\bigr)\varepsilon \qquad (7)

holds with probability at least $1-e^{-\varepsilon^2}$.

Proof  See Section A.1 in the Appendix.  

Remark 4.1

As $K\rightarrow\infty$, the expected optimality gap is no larger than $[(\alpha Lh_2^2)/(2ch_1)]\times M$. The variance bound $M=M(n)$ (for $n$ points $\boldsymbol{z}_{k,i}$) depends on the sampling method we choose. From the results in Section 3, scrambled nets can reduce $M$ from $O(1/n)$ for MC to $o(1/n)$, attaining $O(n^{-2+\epsilon})$ or even $O(n^{-3+\epsilon})$ in favorable cases. When $M<\infty$ for MC, the corresponding bound for scrambled net RQMC is at most $\Gamma M$, which limits the harm, if any, that could come from RQMC.

Remark 4.2

By Lemma 3.1 of Byrd et al. (2016), the L-BFGS iteration satisfies $h_1 I\preccurlyeq H_k\preccurlyeq h_2 I$ under weaker conditions than we have in Theorem 4.1. We can replace the bound $\|g(\boldsymbol{z};\theta)\|\leqslant C$ by $\mathbb{E}(\|g(\boldsymbol{z};\theta)\|^2)\leqslant C^2<\infty$.

The following theorem states the convergence rate of $\|\theta_k-\theta^*\|$.

Theorem 4.2 (Convergence of variables)

Under the conditions of Theorem 4.1, suppose that

\frac{c}{L}>\frac{h_2-h_1}{h_2+h_1}\quad\text{and}\quad 0<\alpha_k<\frac{(h_1+h_2)c-(h_2-h_1)L}{2L^2h_2^2}.

Then for every $K\geqslant1$,

\mathbb{E}(\|\theta_K-\theta^*\|^2)\leqslant\bigl(1-\alpha^2h_2^2L^2\bigr)^K\|\theta_0-\theta^*\|^2+M/L^2.

Proof  See Section A.2 in the Appendix.  

Remark 4.3

When $K\rightarrow\infty$, the limiting expected squared error is bounded by $\alpha^2h_2^2M$. Once again the potential gain from RQMC is in reducing $M$ compared to MC, or in getting a good $M$ with a smaller $n$.

5 Variational Bayes

In this section we investigate quasi-Newton quasi-Monte Carlo optimization for some VB problems. Variational Bayes begins with a posterior distribution $p(\boldsymbol{z}\mid\boldsymbol{x})$ that is too inconvenient to work with. This usually means that we cannot readily sample $\boldsymbol{z}$. We turn instead to a family of distributions $q_\theta$ for $\theta\in\Theta$ from which we can easily sample $\boldsymbol{z}\sim q_\theta$. We then want to make a good choice of $\theta$, and in VB the optimal value $\theta^*$ is taken to be the minimizer of the KL divergence between the distributions:

\theta^*=\operatorname*{argmin}_{\theta\in\Theta}\,\mathcal{D}_{\mathrm{KL}}(q_\theta(\boldsymbol{z})\,\|\,p(\boldsymbol{z}\mid\boldsymbol{x})).

In this section we use the symbol $p(\cdot)$ as a generic probability distribution: it is the distribution of whatever random variable appears inside the parentheses, and not necessarily the same $p$ from the introduction. We use $\mathbb{E}_\theta(\cdot)$ to denote expectation with respect to $\boldsymbol{z}\sim q_\theta$.

We suppose that $\boldsymbol{z}$ has a prior distribution $p(\boldsymbol{z})$. Then Bayes' rule gives

\mathcal{D}_{\mathrm{KL}}(q_\theta(\boldsymbol{z})\,\|\,p(\boldsymbol{z}\mid\boldsymbol{x}))=\mathbb{E}_\theta\bigl(\log q_\theta(\boldsymbol{z})-\log p(\boldsymbol{z}\mid\boldsymbol{x})\bigr)
=\mathbb{E}_\theta\Bigl(\log q_\theta(\boldsymbol{z})-\log\frac{p(\boldsymbol{x}\mid\boldsymbol{z})p(\boldsymbol{z})}{p(\boldsymbol{x})}\Bigr)
=\mathcal{D}_{\mathrm{KL}}(q_\theta(\boldsymbol{z})\,\|\,p(\boldsymbol{z}))-\mathbb{E}_\theta(\log p(\boldsymbol{x}\mid\boldsymbol{z}))+\log p(\boldsymbol{x}).

The last term does not depend on $\theta$, so we may minimize $\mathcal{D}_{\mathrm{KL}}(\cdot\,\|\,\cdot)$ by maximizing

\mathcal{L}(\theta)=\mathbb{E}_\theta(\log p(\boldsymbol{x}\mid\boldsymbol{z}))-\mathcal{D}_{\mathrm{KL}}(q_\theta(\boldsymbol{z})\,\|\,p(\boldsymbol{z})).

This $\mathcal{L}(\cdot)$ is known as the evidence lower bound (ELBO). The first term $\mathbb{E}_\theta(\log p(\boldsymbol{x}\mid\boldsymbol{z}))$ expresses a preference for $\theta$ having a large value of the likelihood of the observed data $\boldsymbol{x}$ given the latent data $\boldsymbol{z}$. The second term $-\mathcal{D}_{\mathrm{KL}}(q_\theta(\boldsymbol{z})\,\|\,p(\boldsymbol{z}))$ can be regarded as a regularization, penalizing parameter values for which $q_\theta(\boldsymbol{z})$ is too far from the prior distribution $p(\boldsymbol{z})$.

To optimize $\mathcal{L}(\theta)$ we need $\nabla\mathcal{L}(\theta)$. It is usual to choose a family $q_\theta$ for which $\mathcal{D}_{\mathrm{KL}}(\cdot\,\|\,\cdot)$ and its gradient are analytically tractable. We still need to estimate the gradient of the first term, i.e.,

\nabla\,\mathbb{E}_\theta(\log p(\boldsymbol{x}\mid\boldsymbol{z})).

One method is to use the score function $\nabla\log q_\theta(\boldsymbol{z})$ and Fisher's identity

\nabla\,\mathbb{E}_\theta(f(\boldsymbol{z}))=\mathbb{E}_\theta(f(\boldsymbol{z})\nabla\log q_\theta(\boldsymbol{z}))

with $f(\cdot)=\log p(\boldsymbol{x}\mid\cdot)$. Unfortunately, an MC strategy based on this approach can suffer from large variance.

The most commonly used method is to write the random variable $\boldsymbol{z}\sim q_\theta$ as a function of some underlying common random variables. This is known as the reparameterization trick (Kingma and Welling, 2014). Suppose that there is a base distribution $p_0$ and a transformation $T(\cdot;\theta)$ such that if $\boldsymbol{z}\sim p_0$, then $T(\boldsymbol{z};\theta)\sim q_\theta$. Then

\nabla\,\mathbb{E}_\theta(f(\boldsymbol{z}))=\mathbb{E}_{p_0}(\nabla f(T(\boldsymbol{z};\theta))).

It is often easy to sample from the base distribution $p_0$, and thus to approximate the expectation by MC or RQMC samples. This is the method we use in our examples.
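As a minimal PyTorch illustration, the sketch below differentiates an MC estimate of $\mathbb{E}_\theta(f(\boldsymbol{z}))$ through the transformation $T(\boldsymbol{z};\theta)=\mu+\sigma\odot\boldsymbol{z}$ for a factorized Gaussian $q_\theta$; the toy log-likelihood log_lik is our own stand-in for $\log p(\boldsymbol{x}\mid\cdot)$.

import torch

def log_lik(x, z):
    # stand-in for log p(x | z); here a toy Gaussian likelihood
    return -0.5 * ((x - z) ** 2).sum(dim=-1)

d = 5
x = torch.randn(d)
mu = torch.zeros(d, requires_grad=True)
log_sigma = torch.zeros(d, requires_grad=True)

eps = torch.randn(64, d)                 # base draws z ~ p_0 = N(0, I)
z = mu + torch.exp(log_sigma) * eps      # T(eps; theta), distributed as q_theta
loss = -log_lik(x, z).mean()             # negative MC estimate of the expectation
loss.backward()                          # gradients flow through T to theta
print(mu.grad, log_sigma.grad)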

5.1 Bayesian linear regression

We start with a toy example where we can find $\theta^*$ analytically. This will let us study $\|\theta_k-\theta^*\|$ empirically. We consider the hierarchical linear model

\boldsymbol{y}\mid\beta\sim\mathcal{N}(X\beta,\gamma^2 I_N)\quad\text{for}\quad\beta\sim\mathcal{N}(0,I_d),

where $X\in\mathbb{R}^{N\times d}$ is a given matrix of full rank $d\leqslant N$ and $\gamma^2\in(0,\infty)$ is a known error variance. Here $N$ is the number of data points in our simulation and not the number $n$ of MC or RQMC gradient evaluations. The entries in $X$ are IID $\mathcal{N}(0,1)$ random variables and we used $\gamma=0.5$.

Translating this problem into the VB setup, we make our latent variable $\boldsymbol{z}$ the unknown parameter vector $(\beta_1,\dots,\beta_d)$, and we choose a very convenient variational distribution $q_\theta$ with $\beta_j\stackrel{\mathrm{ind}}{\sim}\mathcal{N}(\mu_j,\sigma_j^2)$ for $j=1,\dots,d$. Now $\theta=(\mu_1,\dots,\mu_d,\sigma_1,\dots,\sigma_d)$, and $\boldsymbol{y}$ plays the role of the observations $\boldsymbol{x}$. We also write $\mu=(\mu_1,\dots,\mu_d)$ and $\sigma=(\sigma_1,\dots,\sigma_d)$ for the parts of $\theta$.

The ELBO has the expression

\mathcal{L}(\theta)=\mathbb{E}_\theta(\log p(\boldsymbol{y}\mid\beta))-\mathcal{D}_{\mathrm{KL}}(q_\theta(\beta)\,\|\,p(\beta))
=\mathbb{E}_\theta(\log p(\boldsymbol{y}\mid\beta))-\sum_{j=1}^{d}\Bigl(\frac{\sigma_j^2+\mu_j^2-1}{2}-\log\sigma_j\Bigr),

where $\log p(\boldsymbol{y}\mid\beta)=-(N/2)\log(2\pi\gamma^2)-\|\boldsymbol{y}-X\beta\|_2^2/(2\gamma^2)$. In this example, the ELBO has a closed form and the optimal variational parameters are given by

\mu^*=\Bigl(\frac{X^\intercal X}{\gamma^2}+I_d\Bigr)^{-1}\frac{X^\intercal\boldsymbol{y}}{\gamma^2}\quad\text{and}\quad\sigma_j^*=\Bigl(1+\frac{\|X_{\bullet j}\|^2}{\gamma^2}\Bigr)^{-1/2},

where $X_{\bullet j}\in\mathbb{R}^N$ is the $j$'th column of $X$.
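For reference, the closed form optimum is immediate to compute. The sketch below generates one dataset as described above ($N=300$, $d=100$, $\gamma=0.5$) and evaluates $\mu^*$ and $\sigma^*$; the seeding is our own.

import numpy as np

rng = np.random.default_rng(0)
N, d, gamma = 300, 100, 0.5
X = rng.standard_normal((N, d))          # iid N(0,1) entries
beta = rng.standard_normal(d)            # beta ~ N(0, I_d)
y = X @ beta + gamma * rng.standard_normal(N)

# closed-form optimum of the ELBO for the factorized Gaussian family
A = X.T @ X / gamma**2 + np.eye(d)
mu_star = np.linalg.solve(A, X.T @ y / gamma**2)
sigma_star = (1.0 + (X**2).sum(axis=0) / gamma**2) ** -0.5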

In this setting the Hessian is simply $-X^\intercal X/\gamma^2$ and so stochastic quasi-Newton Hessian approximations are not needed. We can however compare the effectiveness of MC and RQMC in SGD. We estimate the gradient by MC or RQMC samples and use SGD via AdaGrad (Duchi et al., 2011) to maximize the ELBO.

Our computations used one example data set with $N=300$ data points, $d=100$ variables and $K=1000$ iterations. At each iteration, we draw a new sample of size $n$ from the $d$-dimensional Gaussian used to sample $\beta$. The sample size $n$ is fixed in each run, but we vary it over the range $8\leqslant n\leqslant 8192$ through powers of $2$ in order to explore how quickly MC and RQMC converge.

For RQMC, we use the scrambled Sobol' points implemented in PyTorch (Balandat et al., 2020), using the inverse Gaussian CDF $\psi(\cdot)=\Phi^{-1}(\cdot)$ to translate uniform random variables into standard Gaussians that are then multiplied by $\sigma_j$ and shifted by $\mu_j$ to get the random $\beta_j$ that we need. We compute the log errors $\log_2|\mathcal{L}_k-\mathcal{L}^*|$ and $\log_2\|\theta_k-\theta^*\|$ and average these over the last 50 iterations. The learning rate in AdaGrad was taken to be $1$.
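In PyTorch, the scrambled Sobol' draws and the transformation $\psi=\Phi^{-1}$ can be written in a few lines. This is a sketch: torch.quasirandom.SobolEngine is the scrambled Sobol' implementation referenced above, while the values of mu and sigma are placeholders.

import torch
from torch.quasirandom import SobolEngine

d, n = 100, 64
mu, sigma = torch.zeros(d), torch.ones(d)            # placeholder parameters

engine = SobolEngine(dimension=d, scramble=True, seed=0)
u = engine.draw(n)                                   # scrambled Sobol' points in [0,1)^d
eps = torch.distributions.Normal(0.0, 1.0).icdf(u)   # psi = Phi^{-1}: standard Gaussians
beta = mu + sigma * eps                              # draws from N(mu, diag(sigma^2))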

The results are shown in Figure 2. We see there that RQMC achieves higher accuracy than plain MC. This happens because RQMC estimates the gradient with lower variance. In this simple setting the rate of convergence is improved. Balandat et al. (2020) report similar rate improvements in Bayesian optimization.

Figure 2: The left panel has the average of $\log_2|\mathcal{L}(\theta_k)-\mathcal{L}(\theta^*)|$ over the last $50$ values of $k$, versus $n$. The middle panel has that average for $\log_2\|\theta_k-\theta^*\|$. The right panel has the average of $\log_2\|\bar{g}(\mathcal{Z}_k;\theta_k)-\nabla F(\theta_k)\|$ versus $n$, where $\bar{g}(\mathcal{Z}_k;\theta_k)=(1/n)\sum_{i=1}^{n}g(\boldsymbol{z}_{k,i};\theta_k)$. The straight lines are least squares fits with their slopes written above them.

5.2 Bayesian logistic regression

Next we consider another simple example, though it is one with no closed form expression for $\theta^*$. We use it to compare first and second order methods. The Bayesian logistic regression model is defined by

\Pr(y_i=\pm1\mid\boldsymbol{x}_i,\beta)=\frac{1}{1+\exp(\mp\boldsymbol{x}_i^\intercal\beta)},\quad i=1,\dots,N,\quad\text{where}\quad\beta\sim\mathcal{N}(0,I_d).

As before, $\beta$ is the unknown $\boldsymbol{z}$, and $q_\theta$ has $\beta_j\stackrel{\mathrm{ind}}{\sim}\mathcal{N}(\mu_j,\sigma_j^2)$, for $\theta=(\mu_1,\dots,\mu_d,\sigma_1,\dots,\sigma_d)$. The ELBO has the form

\mathcal{L}(\theta)=\mathbb{E}_\theta\biggl(\,\sum_{i=1}^{N}\log S(y_i\boldsymbol{x}_i^\intercal\beta)\biggr)-\sum_{j=1}^{d}\Bigl(\frac{\sigma_j^2+\mu_j^2-1}{2}-\log\sigma_j\Bigr),

where $S$ denotes the sigmoid function $S(x)=(1+e^{-x})^{-1}$.
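A reparameterized sampling estimate of this ELBO takes only a few lines. The sketch below uses plain Gaussian draws; swapping eps for inverse-CDF transformed scrambled Sobol' points, as in Section 5.1, gives the RQMC version.

import torch

def elbo_logistic(y, X, mu, log_sigma, n=128):
    # y in {+1,-1}^N, X is N x d; q_theta = N(mu, diag(sigma^2))
    sigma = torch.exp(log_sigma)
    eps = torch.randn(n, mu.shape[0])            # replace with Sobol' draws for RQMC
    beta = mu + sigma * eps                      # (n, d) reparameterized draws
    logits = (X @ beta.T) * y[:, None]           # y_i x_i' beta, shape (N, n)
    loglik = torch.nn.functional.logsigmoid(logits).sum(dim=0).mean()
    kl = 0.5 * (sigma**2 + mu**2 - 1).sum() - log_sigma.sum()
    return loglik - kl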

In our experiments, we take $N=30$ and $d=100$. With $d>N$ it is very likely that the data can be perfectly linearly separated, and then a Bayesian approach provides a form of regularization. The integral to be computed is in $100$ dimensions, and the parameter to be optimized is in $200$ dimensions. We generate the data by sampling $\beta\sim\mathcal{N}(0,I_d/N)$, then $\boldsymbol{x}_i\stackrel{\text{iid}}{\sim}\mathcal{N}(0,I_d)$, and finally $y_i=1$ with probability $1/(1+e^{-\boldsymbol{x}_i^\intercal\beta})$, independently for $i=1,\dots,N$.

In Figure 3, we show the convergence of the ELBO versus wall clock time for different combinations of sampling methods (MC, RQMC) and optimization methods (AdaGrad, L-BFGS). The left panel draws $n_g=8$ samples at each optimization iteration, while the right panel takes $n_g=128$. The initial learning rate for AdaGrad is 0.01. The L-BFGS is as described in Algorithm 1, with $n_h=1024$ Hessian evaluations every $B=20$ steps, memory size $m=50$ and $\alpha=0.01$. Because L-BFGS uses some additional gradient evaluations to update the Hessian information at every $B$'th iteration that the first order methods do not use, we compare wall clock times. The maximum iteration count in the line search was 20. We used the Wolfe conditions (Condition 3.6 in Nocedal and Wright (2006)) with $c_1=0.001$ and $c_2=0.01$.

For this problem, L-BFGS always converges faster than AdaGrad. We can also see that plain MC is noisier than RQMC. The ELBOs for AdaGrad still seem to be increasing slowly even at the end of the time interval shown. For AdaGrad, RQMC consistently has a slightly higher ELBO than MC does.

Figure 3: ELBO versus wall clock time in VB for Bayesian logistic regression. The methods and setup are described in the text. There are $n_g\in\{128,256\}$ gradient samples at each iteration and the second order methods use $n_h=1024$ Hessian samples at every $B=20$'th iteration.

5.3 Crossed random effects

In this section, we consider a crossed random effects model. Both Bayesian and frequentist approaches to crossed random effects can be a challenge, with costs scaling like $N^{3/2}$ or worse. See Papaspiliopoulos et al. (2020) and Ghosh et al. (2020) for Bayesian and frequentist approaches, and also the dissertation of Gao (2017).

An intercept only version of this model has

Y_{ij}\stackrel{\mathrm{ind}}{\sim}\mathcal{N}(\mu+a_i+b_j,1),\quad 1\leqslant i\leqslant I,\quad 1\leqslant j\leqslant J,

given $\mu\sim\mathcal{N}(0,1)$, $a_i\stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma_a^2)$, and $b_j\stackrel{\text{iid}}{\sim}\mathcal{N}(0,\sigma_b^2)$, where $\log\sigma_a$ and $\log\sigma_b$ are both $\mathcal{N}(0,1)$. All of $\mu$, $a_i$, $b_j$ and the log standard deviations are independent.

We use VB to approximate the posterior distribution of the $d=I+J+3$ dimensional parameter $\boldsymbol{z}=(\mu,\log\sigma_a,\log\sigma_b,\mathbf{a},\mathbf{b})$. In our example, we take $I=10$ and $J=5$. As before, $q(\boldsymbol{z}\mid\theta)$ is chosen to be Gaussian with independent coordinates, and $\theta$ consists of their means and standard deviations. In Figure 4, we plot the convergence of the ELBO for different combinations of sampling methods and optimization methods. The BFGS method takes $B=20$, $m=30$ and $n_h=512$. We used a learning rate of $0.01$ in AdaGrad. We observe that when the sample size is 8 (left panel), plain Monte Carlo has large fluctuations even after convergence, especially for BFGS. When the sample size is 128 (right panel), the fluctuations disappear, but RQMC still achieves a higher ELBO than plain Monte Carlo for BFGS. In both cases, BFGS finds the optimum faster than AdaGrad.

Figure 4: ELBO versus wall clock time in VB for crossed random effects. The methods and setup are described in the text. There are $n_g\in\{8,128\}$ gradient samples at each iteration.

5.4 Variational autoencoder

A variational autoencoder (VAE; Kingma and Welling, 2014) learns a generative model for a dataset. A VAE has a probabilistic encoder and a probabilistic decoder. The encoder produces a distribution $q_\phi(\boldsymbol{z}\mid\boldsymbol{x})$ over the latent variable $\boldsymbol{z}$ given a data point $\boldsymbol{x}$; the decoder then reconstructs a distribution $p_\theta(\boldsymbol{x}\mid\boldsymbol{z})$ over the corresponding $\boldsymbol{x}$ from the latent variable $\boldsymbol{z}$. The goal is to maximize the marginal probability $p_\theta(\boldsymbol{x})$. Observe that the ELBO provides a lower bound on $\log p_\theta(\boldsymbol{x})$:

\log p_\theta(\boldsymbol{x})-\mathcal{D}_{\mathrm{KL}}(q_\phi(\boldsymbol{z}\mid\boldsymbol{x})\,\|\,p_\theta(\boldsymbol{z}\mid\boldsymbol{x}))=\mathbb{E}_\phi(\log p_\theta(\boldsymbol{x}\mid\boldsymbol{z})\mid\boldsymbol{x})-\mathcal{D}_{\mathrm{KL}}(q_\phi(\boldsymbol{z}\mid\boldsymbol{x})\,\|\,p_\theta(\boldsymbol{z}))=:\mathcal{L}(\theta,\phi\mid\boldsymbol{x}),

where $\mathbb{E}_\phi(\cdot\mid\boldsymbol{x})$ denotes expectation for the random $\boldsymbol{z}$ given $\boldsymbol{x}$ with parameter $\phi$. In this section $\boldsymbol{z}$ is the latent variable, and not a part of the $\mathcal{Z}$ that we use in our MC or RQMC algorithms. We do not refer to those variables in our VAE description below.

The usual objective is to maximize the ELBO $\sum_{i=1}^{N}\mathcal{L}(\theta,\phi\mid\boldsymbol{x}_i)$ for a sample of $N$ IID $\boldsymbol{x}_i$, and now we have to optimize over $\phi$ as well as $\theta$. The first term $\mathbb{E}_\phi(\log p_\theta(\boldsymbol{x}\mid\boldsymbol{z})\mid\boldsymbol{x})$ in the ELBO is the reconstruction error, while the second term $\mathcal{D}_{\mathrm{KL}}(q_\phi(\boldsymbol{z}\mid\boldsymbol{x})\,\|\,p_\theta(\boldsymbol{z}))$ penalizes parameters $\phi$ that give a posterior $q_\phi(\boldsymbol{z}\mid\boldsymbol{x})$ too different from the prior $p_\theta(\boldsymbol{z})$. Most commonly, $q_\phi(\boldsymbol{z}\mid\boldsymbol{x})=\mathcal{N}(\mu(\boldsymbol{x};\phi),\Sigma(\boldsymbol{x};\phi))$ and $p_\theta(\boldsymbol{z})=\mathcal{N}(0,I)$, so that the KL-divergence term has a closed form. The decoding distribution $p_\theta(\boldsymbol{x}\mid\boldsymbol{z})$ is usually chosen to be Gaussian or Bernoulli, depending on the data type of $\boldsymbol{x}$. The expectation $\mathbb{E}_\phi(\log p_\theta(\boldsymbol{x}\mid\boldsymbol{z})\mid\boldsymbol{x})$ is ordinarily estimated by MC. We implement both plain MC and RQMC in our experiments. To maximize the ELBO, the easiest way is to use SGD or its variants. We also compare SGD with L-BFGS in the experiments.

The experiment uses the MNIST dataset in PyTorch. It has 60,000 gray scale images of size $28\times28$, so the dimension is 784. All experiments were conducted on a cluster node with 2 CPUs and 4GB of memory. The training was conducted in mini-batches of size 128. The encoder has a hidden layer with 800 nodes and an output layer with 40 nodes: 20 for $\mu$ and 20 for $\sigma$. Our $\Sigma(\boldsymbol{x};\phi)$ takes the form $\mathrm{diag}(\sigma(\boldsymbol{x};\phi))$. The decoder has one hidden layer with 400 nodes.
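A minimal PyTorch sketch of this architecture and its per-batch ELBO is below; details the text leaves open, such as the ReLU activations and a Bernoulli decoder parameterized by logits, are our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    # encoder 784 -> 800 -> 40 (20 for mu, 20 for log sigma); decoder 20 -> 400 -> 784
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(784, 800), nn.ReLU(), nn.Linear(800, 40))
        self.dec = nn.Sequential(nn.Linear(20, 400), nn.ReLU(), nn.Linear(400, 784))

    def elbo(self, x, eps=None):
        mu, log_sigma = self.enc(x).chunk(2, dim=-1)
        if eps is None:                       # plain MC; pass Sobol'-based eps for RQMC
            eps = torch.randn_like(mu)
        z = mu + torch.exp(log_sigma) * eps   # reparameterized latent draw
        logits = self.dec(z)
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction='none').sum(-1)
        kl = 0.5 * (torch.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma).sum(-1)
        return (rec - kl).mean()              # average ELBO over the mini-batch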

In Figure 5(a), we plot the ELBO versus wall clock time for different combinations of sampling methods (MC, RQMC) and optimization methods (Adam, BFGS). The learning rate for Adam is 0.0001. For BFGS, the memory size is $m=20$. The other tuning parameters are set to the defaults from PyTorch. We observe that BFGS converged faster than Adam. For BFGS, we can also see that RQMC achieves a slightly higher ELBO than MC. Figures 5(b) through 5(e) show some reconstructed images from the four algorithms.

Figure 5: Plot (a) shows the ELBO versus wall clock time for MC and RQMC used in both L-BFGS and Adam. Plots (b) through (e) show example reconstructed images for BFGS RQMC, BFGS MC, Adam RQMC and Adam MC, respectively.

6 Discussion

RQMC methods are finding uses in simulation optimization problems in machine learning, especially in first order SGD algorithms. We have looked at their use in a second order L-BFGS algorithm. RQMC is known, theoretically and empirically, to improve the accuracy of integration compared to both MC and QMC. We have shown that improved estimation of expected gradients translates directly into improved optimization for quasi-Newton methods. There is a small burden in reprogramming algorithms to use RQMC instead of MC, but that is greatly mitigated by the appearance of RQMC algorithms in tools such as BoTorch (Balandat et al., 2020), the forthcoming scipy 1.7 (scipy.stats.qmc.Sobol) and QMCPy at https://pypi.org/project/qmcpy/.

Our empirical examples have used VB. The approach has potential value in Bayesian optimization (Frazier, 2018) and optimal transport (El Moselhy and Marzouk, 2012; Bigoni et al., 2016) as well.

The examples we chose were of modest scale, where both first and second order methods could be used. In these settings, we saw that second order methods improve upon first order ones. For the autoencoder problem the second order methods converged faster than the first order ones did. This also happened for the crossed random effects problem, where the second order methods found better ELBOs than the first order ones did, and the RQMC-based quasi-Newton algorithm found a better ELBO than the MC-based quasi-Newton did, without increasing the wall clock time.

It is possible that RQMC will bring an advantage to conjugate gradient approaches as they have some similarities to L-BFGS. We have not investigated them.

Acknowledgments

This work was supported by the National Science Foundation under grant IIS-1837931.

References

  • Agarwal et al., (2012) Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249.
  • Andradóttir, (1998) Andradóttir, S. (1998). A review of simulation optimization techniques. In 1998 winter simulation conference, volume 1, pages 151–158. IEEE.
  • Azuma, (1967) Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367.
  • Balandat et al., (2020) Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. (2020). BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Advances in Neural Information Processing Systems 33.
  • Basu and Mukherjee, (2017) Basu, K. and Mukherjee, R. (2017). Asymptotic normality of scrambled geometric net quadrature. Annals of Statistics, 45(4):1759–1788.
  • Beck, (2017) Beck, A. (2017). First-order methods in optimization. SIAM, Philadelphia.
  • Berahas et al., (2016) Berahas, A. S., Nocedal, J., and Takáč, M. (2016). A multi-batch L-BFGS method for machine learning. Advances in Neural Information Processing Systems, pages 1063–1071.
  • Bigoni et al., (2016) Bigoni, D., Spantini, A., and Marzouk, Y. (2016). Adaptive construction of measure transports for Bayesian inference. In NIPS workshop on Approximate Inference.
  • Blei et al., (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
  • Bollapragada et al., (2018) Bollapragada, R., Nocedal, J., Mudigere, D., Shi, H. J., and Tang, P. T. P. (2018). A progressive batching L-BFGS method for machine learning. In International Conference on Machine Learning, pages 620–629. PMLR.
  • Bottou et al., (2018) Bottou, L., Curtis, F. E., and Nocedal, J. (2018). Optimization methods for large-scale machine learning. Siam Review, 60(2):223–311.
  • Buchholz et al., (2018) Buchholz, A., Wenzel, F., and Mandt, S. (2018). Quasi-Monte Carlo variational inference. In International Conference on Machine Learning, pages 668–677. PMLR.
  • Byrd et al., (2016) Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. (2016). A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031.
  • Caflisch et al., (1997) Caflisch, R. E., Morokoff, W., and Owen, A. B. (1997). Valuation of mortgage backed securities using Brownian bridges to reduce effective dimension. Journal of Computational Finance, 1(1):27–46.
  • Chen et al., (2014) Chen, W., Srivastav, A., and Travaglini, G., editors (2014). A panorama of discrepancy theory. Springer, Cham, Switzerland.
  • Cranley and Patterson, (1976) Cranley, R. and Patterson, T. N. L. (1976). Randomization of number theoretic methods for multiple integration. SIAM Journal of Numerical Analysis, 13(6):904–914.
  • Devroye, (1986) Devroye, L. (1986). Non-uniform Random Variate Generation. Springer, New York.
  • Dick et al., (2013) Dick, J., Kuo, F. Y., and Sloan, I. H. (2013). High-dimensional integration: the quasi-Monte Carlo way. Acta Numerica, 22:133–288.
  • Dick and Pillichshammer, (2010) Dick, J. and Pillichshammer, F. (2010). Digital sequences, discrepancy and quasi-Monte Carlo integration. Cambridge University Press, Cambridge.
  • Duchi et al., (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7).
  • Duchi, (2018) Duchi, J. C. (2018). Introductory lectures on stochastic optimization. In Mahoney, M. W., Duchi, J. C., and Gilbert, A. C., editors, The mathematics of data, volume 25, pages 99–186. American Mathematical Society, Providence, RI.
  • El Moselhy and Marzouk, (2012) El Moselhy, T. A. and Marzouk, Y. M. (2012). Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850.
  • Faure, (1982) Faure, H. (1982). Discrépance de suites associées à un système de numération (en dimension $s$). Acta Arithmetica, 41:337–351.
  • Frazier, (2018) Frazier, P. I. (2018). A tutorial on Bayesian optimization. Technical report, arXiv:1807.02811.
  • Gao, (2017) Gao, K. (2017). Scalable Estimation and Inference for Massive Linear Mixed Models with Crossed Random Effects. PhD thesis, Stanford University.
  • Ghosh et al., (2020) Ghosh, S., Hastie, T., and Owen, A. B. (2020). Backfitting for large scale crossed random effects regressions. Technical report, arXiv:2007.10612.
  • Gower et al., (2016) Gower, R., Goldfarb, D., and Richtárik, P. (2016). Stochastic block BFGS: Squeezing more curvature out of data. In International Conference on Machine Learning, pages 1869–1878.
  • Joe and Kuo, (2008) Joe, S. and Kuo, F. Y. (2008). Constructing Sobol’ sequences with better two-dimensional projections. SIAM Journal on Scientific Computing, 30(5):2635–2654.
  • Johnson and Zhang, (2013) Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, pages 315–323.
  • Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. Technical report, arXiv:1412.6980.
  • Kingma and Welling, (2014) Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. stat, 1050:1.
  • L’Ecuyer and Lemieux, (2002) L’Ecuyer, P. and Lemieux, C. (2002). A survey of randomized quasi-Monte Carlo methods. In Dror, M., L’Ecuyer, P., and Szidarovszki, F., editors, Modeling Uncertainty: An Examination of Stochastic Theory, Methods, and Applications, pages 419–474. Kluwer Academic Publishers.
  • Loh, (2003) Loh, W.-L. (2003). On the asymptotic distribution of scrambled net quadrature. Annals of Statistics, 31(4):1282–1324.
  • Miller et al., (2017) Miller, A. C., Foti, N. J., D'Amour, A., and Adams, R. P. (2017). Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems.
  • Mohamed et al., (2020) Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2020). Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62.
  • Moritz et al., (2016) Moritz, P., Nishihara, R., and Jordan, M. (2016). A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pages 249–258.
  • Niederreiter, (1992) Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. SIAM, Philadelphia, PA.
  • Nocedal, (1980) Nocedal, J. (1980). Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782.
  • Nocedal and Wright, (2006) Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer Science & Business Media, New York, second edition.
  • Owen, (1995) Owen, A. B. (1995). Randomly permuted (t,m,s)(t,m,s)-nets and (t,s)(t,s)-sequences. In Niederreiter, H. and Shiue, P. J.-S., editors, Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, pages 299–317, New York. Springer-Verlag.
  • Owen, (1997a) Owen, A. B. (1997a). Monte Carlo variance of scrambled net quadrature. SIAM Journal on Numerical Analysis, 34(5):1884–1910.
  • Owen, (1997b) Owen, A. B. (1997b). Scrambled net variance for integrals of smooth functions. Annals of Statistics, 25(4):1541–1562.
  • Owen, (2005) Owen, A. B. (2005). Multidimensional variation for quasi-Monte Carlo. In Fan, J. and Li, G., editors, International Conference on Statistics in honour of Professor Kai-Tai Fang’s 65th birthday.
  • Owen, (2008) Owen, A. B. (2008). Local antithetic sampling with scrambled nets. Annals of Statistics, 36(5):2319–2343.
  • Owen and Rudolf, (2021) Owen, A. B. and Rudolf, D. (2021). A strong law of large numbers for scrambled net integration. SIAM Review, 63(2):360–372.
  • Paisley et al., (2012) Paisley, J., Blei, D. M., and Jordan, M. I. (2012). Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1363–1370.
  • Papaspiliopoulos et al., (2020) Papaspiliopoulos, O., Roberts, G. O., and Zanella, G. (2020). Scalable inference for crossed random effects models. Biometrika, 107(1):25–40.
  • Shi et al., (2020) Shi, H. J., Xie, Y., Byrd, R., and Nocedal, J. (2020). A noise-tolerant quasi-Newton algorithm for unconstrained optimization. Technical report, arXiv:2010.04352.
  • Sobol’, (1969) Sobol’, I. M. (1969). Multidimensional Quadrature Formulas and Haar Functions. Nauka, Moscow. (In Russian).
  • Xie et al., (2020) Xie, Y., Byrd, R. H., and Nocedal, J. (2020). Analysis of the BFGS method with errors. SIAM Journal on Optimization, 30(1):182–209.
  • Yue and Mao, (1999) Yue, R.-X. and Mao, S.-S. (1999). On the variance of quadrature over scrambled nets and sequences. Statistics & Probability Letters, 44(3):267–280.

Appendix A Proof of main theorems

Our approach is similar to that of Buchholz et al., (2018), who studied RQMC for SGD, whereas we consider L-BFGS, a quasi-Newton method.

A.1 Proof of Theorem 4.1

Proof  Let $e_{k}=\bar{g}({\mathcal{Z}}_{k};\theta_{k})-\nabla F(\theta_{k})$ be the error in estimating the gradient at step $k$. By the unbiasedness assumption, ${\mathbb{E}}(e_{k}\mid\theta_{k})=0$. Starting from the Lipschitz condition, we have

\begin{align*}
F(\theta_{k+1})-F(\theta_{k})
&\leqslant \nabla F(\theta_{k})^{\intercal}(\theta_{k+1}-\theta_{k})+\frac{L}{2}\|\theta_{k+1}-\theta_{k}\|^{2}\\
&= -\alpha_{k}\nabla F(\theta_{k})^{\intercal}H_{k}\bar{g}({\mathcal{Z}}_{k};\theta_{k})+\frac{L\alpha_{k}^{2}}{2}\|H_{k}\bar{g}({\mathcal{Z}}_{k};\theta_{k})\|^{2}\\
&= -\alpha_{k}\nabla F(\theta_{k})^{\intercal}H_{k}e_{k}-\alpha_{k}\nabla F(\theta_{k})^{\intercal}H_{k}\nabla F(\theta_{k})\\
&\qquad+\frac{L\alpha_{k}^{2}}{2}\bigl(\|H_{k}e_{k}\|^{2}+\|H_{k}\nabla F(\theta_{k})\|^{2}+2\nabla F(\theta_{k})^{\intercal}H_{k}^{2}e_{k}\bigr)\\
&\leqslant -\alpha_{k}\nabla F(\theta_{k})^{\intercal}H_{k}e_{k}-\alpha_{k}h_{1}\|\nabla F(\theta_{k})\|^{2}\\
&\qquad+\frac{L\alpha_{k}^{2}}{2}\bigl(h_{2}^{2}\|e_{k}\|^{2}+h_{2}^{2}\|\nabla F(\theta_{k})\|^{2}+2\nabla F(\theta_{k})^{\intercal}H_{k}^{2}e_{k}\bigr)\\
&= -\alpha_{k}\nabla F(\theta_{k})^{\intercal}H_{k}e_{k}+L\alpha_{k}^{2}\nabla F(\theta_{k})^{\intercal}H_{k}^{2}e_{k}\\
&\qquad-\alpha_{k}h_{1}\Bigl(1-\frac{L\alpha_{k}h_{2}^{2}}{2h_{1}}\Bigr)\|\nabla F(\theta_{k})\|^{2}+\frac{L\alpha_{k}^{2}h_{2}^{2}}{2}\|e_{k}\|^{2}.
\end{align*}

Because $\alpha_{k}=\alpha\leqslant h_{1}/(Lh_{2}^{2})$, we have $1-L\alpha_{k}h_{2}^{2}/(2h_{1})\geqslant 1/2$. Because strong convexity implies

\[
\|\nabla F(\theta)\|^{2}\geqslant 2c\bigl(F(\theta)-F^{*}\bigr)\quad\text{for all }\theta,
\]

we have

\[
F(\theta_{k+1})-F(\theta_{k})\leqslant-\alpha_{k}\nabla F(\theta_{k})^{\intercal}(H_{k}-L\alpha_{k}H_{k}^{2})e_{k}-\alpha_{k}h_{1}c\bigl(F(\theta_{k})-F^{*}\bigr)+\frac{L\alpha_{k}^{2}h_{2}^{2}}{2}\|e_{k}\|^{2}.
\]

Adding $F(\theta_{k})-F^{*}$ to both sides gives

\[
F(\theta_{k+1})-F^{*}\leqslant(1-\alpha_{k}h_{1}c)\bigl(F(\theta_{k})-F^{*}\bigr)+R_{k},\tag{8}
\]

where

\[
R_{k}=-\alpha_{k}\nabla F(\theta_{k})^{\intercal}(H_{k}-L\alpha_{k}H_{k}^{2})e_{k}+\frac{L\alpha_{k}^{2}h_{2}^{2}}{2}\|e_{k}\|^{2}.
\]

Let ${\mathcal{F}}_{k}=\sigma({\mathcal{Z}}_{i},\,1\leqslant i\leqslant k)$ be the filtration generated by the random inputs $\{{\mathcal{Z}}_{k}\}$ of our sampling process. Because the ${\mathcal{Z}}_{k}$ are mutually independent and $H_{k}$ is independent of ${\mathcal{Z}}_{k}$, we have $\theta_{k},H_{k}\in{\mathcal{F}}_{k-1}$ and ${\mathbb{E}}(e_{k}\mid{\mathcal{F}}_{k-1})=0$. Then

\[
{\mathbb{E}}\bigl(\nabla F(\theta_{k})^{\intercal}(H_{k}-L\alpha_{k}H_{k}^{2})e_{k}\mid{\mathcal{F}}_{k-1}\bigr)=\nabla F(\theta_{k})^{\intercal}(H_{k}-L\alpha_{k}H_{k}^{2})\,{\mathbb{E}}(e_{k}\mid{\mathcal{F}}_{k-1})=0.
\]

Therefore, $\nabla F(\theta_{k})^{\intercal}(H_{k}-L\alpha_{k}H_{k}^{2})e_{k}$ is a martingale difference sequence with respect to ${\mathcal{F}}_{k}$. Let

\[
V_{k}={\mathbb{E}}\bigl(\|e_{k}\|^{2}\mid\theta_{k}\bigr)={\mathrm{tr}}\bigl(\operatorname{Var}\bigl[\bar{g}({\mathcal{Z}}_{k};\theta_{k})\mid\theta_{k}\bigr]\bigr).
\]

Then $\|e_{k}\|^{2}-V_{k}$ is also a martingale difference sequence with respect to ${\mathcal{F}}_{k}$, so we can write

\[
R_{k}=\nu_{k}+\frac{L\alpha_{k}^{2}h_{2}^{2}}{2}V_{k},
\]

where

\[
\nu_{k}=-\alpha_{k}\nabla F(\theta_{k})^{\intercal}(H_{k}-L\alpha_{k}H_{k}^{2})e_{k}+\frac{L\alpha_{k}^{2}h_{2}^{2}}{2}\bigl(\|e_{k}\|^{2}-V_{k}\bigr)
\]

is a martingale difference sequence with respect to ${\mathcal{F}}_{k}$, and $L\alpha_{k}^{2}h_{2}^{2}V_{k}/2$ is ${\mathcal{F}}_{k-1}$-measurable. Recursively applying equation (8) gives

\begin{align*}
F(\theta_{K})-F^{*}
&\leqslant(1-\alpha ch_{1})^{K}\bigl(F(\theta_{0})-F^{*}\bigr)+\sum_{k=0}^{K-1}(1-\alpha ch_{1})^{K-k-1}R_{k}\\
&=(1-\alpha ch_{1})^{K}\bigl(F(\theta_{0})-F^{*}\bigr)+\sum_{k=0}^{K-1}(1-\alpha ch_{1})^{K-k-1}\Bigl(\nu_{k}+\frac{L\alpha^{2}h_{2}^{2}}{2}V_{k}\Bigr).
\end{align*}

By the bounded variance assumption, $V_{k}\leqslant M$ for all $k\geqslant 0$. Summing the geometric series, $\sum_{k=0}^{K-1}(1-\alpha ch_{1})^{K-k-1}\leqslant 1/(\alpha ch_{1})$, and hence

\[
F(\theta_{K})-F^{*}\leqslant(1-\alpha ch_{1})^{K}\bigl(F(\theta_{0})-F^{*}\bigr)+\frac{\alpha Lh_{2}^{2}}{2ch_{1}}M+\sum_{k=0}^{K-1}(1-\alpha ch_{1})^{K-k-1}\nu_{k}.
\]

Taking expectations on both sides proves (6).

To prove the finite sample guarantee (7), it remains to bound the martingale $\sum_{k=0}^{K-1}(1-\alpha ch_{1})^{K-k-1}\nu_{k}$ with high probability. We assumed a bound on $\|g(\boldsymbol{z};\theta)\|$, which implies one for $\|\bar{g}({\mathcal{Z}};\theta)\|$ for any fixed $n$. When the norms of the gradient $\nabla F(\theta)$ and the gradient estimate $\bar{g}({\mathcal{Z}};\theta)$ are bounded by a constant $C$ for all $\theta$ and ${\mathcal{Z}}$, then $\|e_{k}\|=\|\bar{g}({\mathcal{Z}}_{k};\theta_{k})-\nabla F(\theta_{k})\|\leqslant 2C$, so that $\|e_{k}\|^{2}$ and $V_{k}$ both lie in $[0,4C^{2}]$, and we have the bound

\[
|\nu_{k}|\leqslant\alpha\bigl|\nabla F(\theta_{k})^{\intercal}(H_{k}-L\alpha H_{k}^{2})e_{k}\bigr|+\frac{L\alpha^{2}h_{2}^{2}}{2}\bigl|\|e_{k}\|^{2}-V_{k}\bigr|\leqslant 2\alpha\bigl(h_{2}-L\alpha h_{1}^{2}+L\alpha h_{2}^{2}\bigr)C^{2}=:C^{\prime},
\]

where the second inequality uses $\|\nabla F(\theta_{k})\|\leqslant C$, $\|e_{k}\|\leqslant 2C$, $\bigl|\|e_{k}\|^{2}-V_{k}\bigr|\leqslant 4C^{2}$, and the fact that the largest eigenvalue of $H_{k}-L\alpha H_{k}^{2}$ is at most $h_{2}-L\alpha h_{1}^{2}$. By the Azuma–Hoeffding inequality (Azuma,, 1967), for all $t\geqslant 0$,

\[
\mathbb{P}\Bigl(\sum_{k=0}^{K-1}(1-\alpha ch_{1})^{K-k-1}\nu_{k}\geqslant t\Bigr)\leqslant\exp\Bigl(-\frac{2t^{2}}{\sum_{k=0}^{K-1}(1-\alpha ch_{1})^{2(K-k-1)}C^{\prime 2}}\Bigr)\leqslant\exp\Bigl(-\frac{2t^{2}\alpha ch_{1}}{C^{\prime 2}}\Bigr),
\]
where the last step uses $\sum_{k=0}^{K-1}(1-\alpha ch_{1})^{2(K-k-1)}\leqslant 1/(\alpha ch_{1})$.

Setting $\varepsilon^{2}=2t^{2}\alpha ch_{1}/C^{\prime 2}$ gives

\[
t=\frac{C^{\prime}}{\sqrt{2\alpha ch_{1}}}\,\varepsilon=\frac{2\alpha C^{2}\bigl(h_{2}-L\alpha h_{1}^{2}+L\alpha h_{2}^{2}\bigr)}{\sqrt{2\alpha ch_{1}}}\,\varepsilon\leqslant C^{2}\sqrt{\frac{2\alpha}{ch_{1}}}\,\bigl(h_{2}-L\alpha h_{1}^{2}+h_{1}\bigr)\,\varepsilon.
\]

The last inequality uses $L\alpha h_{2}^{2}\leqslant h_{1}$, which follows from $\alpha\leqslant h_{1}/(Lh_{2}^{2})$. So we have proved that with probability at least $1-e^{-\varepsilon^{2}}$,

\[
F(\theta_{K})-F^{*}\leqslant(1-\alpha ch_{1})^{K}\bigl(F(\theta_{0})-F^{*}\bigr)+\frac{\alpha Lh_{2}^{2}}{2ch_{1}}M+C^{2}\sqrt{\frac{2\alpha}{ch_{1}}}\,\bigl(h_{2}-L\alpha h_{1}^{2}+h_{1}\bigr)\,\varepsilon.
\]
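As a numerical illustration of Theorem 4.1 (our own minimal sketch, not part of the proof), the Python snippet below runs the noisy quasi-Newton iteration on a strongly convex quadratic. The objective, the idealized choice $H_{k}=(\nabla^{2}F)^{-1}$, the step size, and the noise scale sigma are illustrative assumptions; shrinking the variance of the gradient error, as RQMC does relative to MC, lowers the error floor in (6) without changing the linear convergence factor.

```python
import numpy as np

# Noisy quasi-Newton on F(theta) = 0.5 theta' A theta, with F* = 0 at theta* = 0.
# All constants here are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(1)
d, K, alpha, sigma = 10, 300, 0.2, 0.1

A = np.diag(np.linspace(1.0, 10.0, d))  # ill-conditioned Hessian
H = np.linalg.inv(A)                    # idealized H_k with h1 I <= H <= h2 I

for n in (16, 256):                     # per-step sample size; MC error variance ~ 1/n
    theta = np.ones(d)
    for _ in range(K):
        e = (sigma / np.sqrt(n)) * rng.standard_normal(d)  # mean-zero gradient error e_k
        theta = theta - alpha * H @ (A @ theta + e)        # theta_{k+1} = theta_k - alpha H gbar
    print(f"n = {n:4d}: F(theta_K) - F* = {0.5 * theta @ A @ theta:.3e}")
```

Running this, the larger $n$ (smaller gradient-error variance, mimicking the faster RMSE decay of RQMC) reaches a markedly smaller optimality gap floor.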

 

A.2 Proof of Theorem 4.2

Our proof uses the following lemma.

Lemma 1

Let $u,v\in\mathbb{R}^{n}$ satisfy $u^{\intercal}v\geqslant A$ and let $D\in\mathbb{R}^{n\times n}$ be symmetric with $h_{1}I\preccurlyeq D\preccurlyeq h_{2}I$ where $0<h_{1}\leqslant h_{2}$. Then

\[
u^{\intercal}Dv\geqslant\frac{h_{1}+h_{2}}{2}A-\frac{h_{2}-h_{1}}{2}\|u\|\|v\|.
\]

Proof  Without loss of generality, we can assume that $D$ is diagonal. Otherwise, let $D=U\Lambda U^{\intercal}$ be the eigendecomposition of $D$. Then $u^{\intercal}Dv=(U^{\intercal}u)^{\intercal}\Lambda(U^{\intercal}v)$, $\|U^{\intercal}u\|=\|u\|$, $\|U^{\intercal}v\|=\|v\|$, and $(U^{\intercal}u)^{\intercal}(U^{\intercal}v)=u^{\intercal}v$, so we can replace $D$ by $\Lambda$.

Let $d_{1},\ldots,d_{n}$ be the diagonal entries of $D$, so that $u^{\intercal}Dv=\sum_{i=1}^{n}u_{i}v_{i}d_{i}$. Let $s_{+}=\sum_{i:u_{i}v_{i}\geqslant 0}u_{i}v_{i}$ and $s_{-}=\sum_{i:u_{i}v_{i}<0}u_{i}v_{i}$. Note that $s_{+}+s_{-}=u^{\intercal}v\geqslant A$ and, by the Cauchy–Schwarz inequality, $s_{+}-s_{-}=\sum_{i=1}^{n}|u_{i}v_{i}|\leqslant\|u\|\|v\|$. Then for any such $D$,

\[
u^{\intercal}Dv\geqslant h_{1}s_{+}+h_{2}s_{-}=\frac{h_{1}+h_{2}}{2}(s_{+}+s_{-})+\frac{h_{2}-h_{1}}{2}(s_{-}-s_{+})\geqslant\frac{h_{1}+h_{2}}{2}A-\frac{h_{2}-h_{1}}{2}\|u\|\|v\|.
\]
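As a quick numerical sanity check of Lemma 1 (our addition, not part of the proof), the following Python snippet draws random $u$, $v$, and a random symmetric $D$ with spectrum in $[h_{1},h_{2}]$, takes $A=u^{\intercal}v$, and verifies the inequality; the dimensions and constants are illustrative choices.

```python
import numpy as np

# Numerical check of Lemma 1 on random instances (illustrative constants).
rng = np.random.default_rng(0)
n, h1, h2 = 20, 0.5, 2.0

for _ in range(1000):
    u, v = rng.standard_normal(n), rng.standard_normal(n)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
    D = Q @ np.diag(rng.uniform(h1, h2, n)) @ Q.T     # symmetric, h1 I <= D <= h2 I
    A = u @ v                                         # largest A allowed by u'v >= A
    rhs = 0.5 * (h1 + h2) * A - 0.5 * (h2 - h1) * np.linalg.norm(u) * np.linalg.norm(v)
    assert u @ D @ v >= rhs - 1e-9                    # the bound of Lemma 1
```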

 

Now we are ready to prove Theorem 4.2.

Proof  We start by decomposing

\begin{align*}
\|\theta_{k+1}-\theta^{*}\|^{2}&=\|\theta_{k+1}-\theta_{k}+\theta_{k}-\theta^{*}\|^{2}\\
&=\|\theta_{k}-\theta^{*}\|^{2}+\alpha^{2}\|H_{k}\bar{g}({\mathcal{Z}}_{k};\theta_{k})\|^{2}-2\alpha(\theta_{k}-\theta^{*})^{\intercal}H_{k}\bar{g}({\mathcal{Z}}_{k};\theta_{k}),
\end{align*}
using the update $\theta_{k+1}-\theta_{k}=-\alpha H_{k}\bar{g}({\mathcal{Z}}_{k};\theta_{k})$.

Note that only $\bar{g}({\mathcal{Z}}_{k};\theta_{k})$ and $\theta_{k+1}$ depend on ${\mathcal{Z}}_{k}$. Taking the expectation with respect to ${\mathcal{Z}}_{k}$ on both sides gives

\begin{align}
{\mathbb{E}}\bigl(\|\theta_{k+1}-\theta^{*}\|^{2}\bigr)
&\leqslant\|\theta_{k}-\theta^{*}\|^{2}+\alpha^{2}h_{2}^{2}\bigl(M+\|\nabla F(\theta_{k})\|^{2}\bigr)-2\alpha(\theta_{k}-\theta^{*})^{\intercal}H_{k}\nabla F(\theta_{k})\nonumber\\
&\leqslant\|\theta_{k}-\theta^{*}\|^{2}+\alpha^{2}h_{2}^{2}\bigl(M+L^{2}\|\theta_{k}-\theta^{*}\|^{2}\bigr)-2\alpha(\theta_{k}-\theta^{*})^{\intercal}H_{k}\nabla F(\theta_{k}).\tag{9}
\end{align}

By strong convexity of $F(\cdot)$,

\[
(\theta_{k}-\theta^{*})^{\intercal}\nabla F(\theta_{k})\geqslant F(\theta_{k})-F^{*}+\frac{c}{2}\|\theta_{k}-\theta^{*}\|^{2}\geqslant c\|\theta_{k}-\theta^{*}\|^{2}.
\]

Using Lemma 1 with $u=\theta_{k}-\theta^{*}$, $v=\nabla F(\theta_{k})$, and $A=c\|\theta_{k}-\theta^{*}\|^{2}$ from the strong convexity inequality above, we have

\begin{align*}
(\theta_{k}-\theta^{*})^{\intercal}H_{k}\nabla F(\theta_{k})&\geqslant\frac{h_{1}+h_{2}}{2}c\|\theta_{k}-\theta^{*}\|^{2}-\frac{h_{2}-h_{1}}{2}\|\theta_{k}-\theta^{*}\|\,\|\nabla F(\theta_{k})\|\\
&\geqslant\Bigl(\frac{h_{1}+h_{2}}{2}c-\frac{h_{2}-h_{1}}{2}L\Bigr)\|\theta_{k}-\theta^{*}\|^{2},
\end{align*}

where the last inequality is due to $\|\nabla F(\theta_{k})\|\leqslant L\|\theta_{k}-\theta^{*}\|$. Combining this with (9) gives

\[
{\mathbb{E}}\bigl(\|\theta_{k+1}-\theta^{*}\|^{2}\bigr)\leqslant\bigl(1+\alpha^{2}L^{2}h_{2}^{2}-\alpha[(h_{1}+h_{2})c-(h_{2}-h_{1})L]\bigr)\|\theta_{k}-\theta^{*}\|^{2}+\alpha^{2}h_{2}^{2}M.
\]

We have assumed that

\[
0<\alpha<\frac{(h_{1}+h_{2})c-(h_{2}-h_{1})L}{2L^{2}h_{2}^{2}},
\]

and it then follows that

\[
1+\alpha^{2}L^{2}h_{2}^{2}-\alpha\bigl[(h_{1}+h_{2})c-(h_{2}-h_{1})L\bigr]\leqslant 1-\alpha^{2}h_{2}^{2}L^{2},
\]

from which

\[
{\mathbb{E}}\bigl(\|\theta_{k+1}-\theta^{*}\|^{2}\bigr)\leqslant\bigl(1-\alpha^{2}h_{2}^{2}L^{2}\bigr)\|\theta_{k}-\theta^{*}\|^{2}+\alpha^{2}h_{2}^{2}M.\tag{11}
\]

Recursively applying the error formula (11) and bounding the resulting geometric series, $\sum_{j\geqslant 0}(1-\alpha^{2}h_{2}^{2}L^{2})^{j}\,\alpha^{2}h_{2}^{2}M=M/L^{2}$, we get

\[
{\mathbb{E}}\bigl(\|\theta_{k}-\theta^{*}\|^{2}\bigr)\leqslant(1-\alpha^{2}h_{2}^{2}L^{2})^{k}\|\theta_{0}-\theta^{*}\|^{2}+\frac{M}{L^{2}},
\]
where the expectation is over ${\mathcal{Z}}_{1},\dots,{\mathcal{Z}}_{k}$.
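The final unrolling step can also be checked numerically. In the minimal sketch below (ours, with illustrative constants of our own choosing), $\rho$ and $b$ stand in for $1-\alpha^{2}h_{2}^{2}L^{2}$ and $\alpha^{2}h_{2}^{2}M$; the worst case of recursion (11) stays below $\rho^{k}x_{0}+b/(1-\rho)$, and $b/(1-\rho)=M/L^{2}$.

```python
# Sanity check (ours): unrolling x_{k+1} = rho * x_k + b, the worst case of (11).
rho, b, x0 = 0.9, 0.02, 5.0  # illustrative stand-ins for 1 - alpha^2 h2^2 L^2 and alpha^2 h2^2 M

x = x0
for k in range(1, 101):
    x = rho * x + b                                   # recursion (11) with equality
    assert x <= rho**k * x0 + b / (1 - rho) + 1e-12   # closed-form geometric-series bound
print("error floor b/(1 - rho) =", b / (1 - rho))     # plays the role of M / L^2
```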