This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Convergence Analysis of Stochastic Gradient Descent with MCMC Estimators

Tianyou Li School of Mathematical Sciences, Peking University, CHINA ([email protected]).  Fan Chen School of Mathematical Sciences, Peking University, CHINA ([email protected]).  Huajie Chen School of Mathematical Sciences, Beijing Normal University, CHINA ([email protected]).  Zaiwen Wen Beijing International Center for Mathematical Research, Peking University, CHINA ([email protected]).
Abstract

Understanding stochastic gradient descent (SGD) and its variants is essential for machine learning. However, most of the preceding analyses are conducted under amenable conditions such as unbiased gradient estimator and bounded objective functions, which does not encompass many sophisticated applications, such as variational Monte Carlo, entropy-regularized reinforcement learning and variational inference. In this paper, we consider the SGD algorithm that employ the Markov Chain Monte Carlo (MCMC) estimator to compute the gradient, called MCMC-SGD. Since MCMC reduces the sampling complexity significantly, it is an asymptotically convergent biased estimator in practice. Moreover, by incorporating a general class of unbounded functions, it is much more difficult to analyze the MCMC sampling error. Therefore, we assume that the function is sub-exponential and use the Bernstein inequality for non-stationary Markov chains to derive error bounds of the MCMC estimator. Consequently, MCMC-SGD is proven to have a first order convergence rate O(logK/nK)O(\log K/\sqrt{nK}) with KK iterations and a sample size nn. It partially explains how MCMC influences the behavior of SGD. Furthermore, we verify the correlated negative curvature condition under reasonable assumptions. It is shown that MCMC-SGD escapes from saddle points and reaches (ϵ,ϵ1/4)(\epsilon,\epsilon^{1/4}) approximate second order stationary points or ϵ1/2\epsilon^{1/2}-variance points at least O(ϵ11/2log2(1/ϵ))O(\epsilon^{-11/2}\log^{2}(1/\epsilon)) steps with high probability. Our analysis unveils the convergence pattern of MCMC-SGD across a broad class of stochastic optimization problems, and interprets the convergence phenomena observed in practical applications.

1 Introduction

We consider a general class of stochastic optimization problems. Given a measurable space 𝒳\mathcal{X}, let πθ(x)\pi_{\theta}(x) be a family of probability distributions on 𝒳\mathcal{X} and fθ(x)f_{\theta}(x) be a function for x𝒳x\in\mathcal{X}. A general form of stochastic optimization problems is stated as

minθΘ(θ)=𝔼xπθ[fθ(x)],\displaystyle\min_{\theta\in\Theta}~{}\mathcal{L}(\theta)=\mathbb{E}_{x\sim\pi_{\theta}}\left[f_{\theta}(x)\right], (1.1)

where Θ\Theta denotes the space of parameters, which is typically a subset of a Euclidean space d\mathbb{R}^{d}. In the field of stochastic optimization, the most common form is typically expressed by a given distribution π\pi and a parameterized function fθf_{\theta}. Also, there are numerous tasks, such as reinforcement learning (RL), that aim to find a suitable distribution πθ\pi_{\theta} under a given function ff. Compared to the two aforementioned problems, (1.1) represents a broader framework, wherein both the distribution πθ\pi_{\theta} and the function fθf_{\theta} are dependent on parameters θ\theta.

A main issue in solving the stochastic optimization problem (1.1) is that the difficulty of directly computing integrals over high-dimensional spaces. The Monte Carlo method offers a strategy for estimating expectations using samples, thereby facilitating the computation of stochastic gradients. However, for certain complex variational distributions, it is prohibitive to obtain their exact probability density functions which impedes directly sampling. In order to circumvent the intractable normalization constants in probabilities, Markov chain Monte Carlo (MCMC) is employed to sample from the unnormalized probability. In contrast to the ordinary unbiased Monte Carlo methods, the MCMC method requires some time for mixing and produces the desired samples biasedly and non-independently. There are several common MCMC algorithms, such as Metropolis-Hastings algorithm [29, 16], Hamiltonian Monte Carlo [11], Stochastic Gradient Langevin Dynamics [45], etc. The efficiency of the optimization algorithm relies on the error of the MCMC sampling, which has not been adequately investigated in the previous literatures. Previous studies [30, 19, 13, 37] on the concentration inequality of MCMC sampling typically assume a more idealized setting where the random variables are uniformly bounded.

With the stochastic gradient estimated by MCMC algorithms, the objective function is minimized using stochastic gradient descent (SGD) methods. The vanilla SGD method is to sample independently from a uniform distribution, and has been extensively studied. Moulines & Bach first show linear convergence of SGD non-asymptotically for strongly convex functions [33]. Needell et al. improve these results by removing the quadratic dependency on the condition number in the iteration complexity results [34]. Among these convergence results, the gradient noise assumptions for i.i.d samples are of vital importance to establish an O(1/nK)O(1/\sqrt{nK}) convergence rate for general non-convex cost functions, where nn is the sample size per iteration and KK is the total number of iterations. However, stochastic sampling is not always independent or unbiased. The convergence of SGD with Markovian gradients has been studied in [12, 44], and SGD with biased gradient estimators is considered in [2]. Moreover, there are also some research about SGD with MCMC noise, which we call MCMC-SGD. Atchade et al. consider stochastic proximal gradient algorithms with MCMC sampling on convex objective [3]. The uniform-in-θ\theta ergodic assumption is proposed to analyze the Markov noise for convex bounded functions in their stochastic proximal gradient algorithm. Karimi et al. study SGD with Markovian noise [22]. They analyze the convergence by controlling the noise through the Poisson equation. But strong assumptions are given and the convergence result has an extra term that depends on the bias of the estimation scheme.

The contributions of this paper are multifaceted and useful in the advancement of stochastic optimization, particularly in the context of machine learning and scientific computing. Three main contributions are delineated as follows in a structured manner:

  1. 1.

    Error analysis of MCMC estimators: A novel aspect of this paper is the application of concentration inequalities to estimate the upper bounds of bias and variance associated with MCMC sampling. Compared with the conventional boundedness assumption in [30, 19, 13, 37], we adopt a broader assumption of unbounded functions, recognizing the more complex and realistic scenarios often encountered in MCMC sampling. Based on concentration inequalities, our approach investigates the MCMC error for a class of sub-exponential functions from the perspective of spectral gap. This analysis is universal and non-asymptotic, entailing a meticulous examination of the non-stationary Markov chain on unbounded functions.

  2. 2.

    First-order convergence of MCMC-SGD: We demonstrate that the biased stochastic gradient, a result of the MCMC sampling, achieves first-order stationary convergence. In comparison with [3], we conduct a detailed analysis on the MCMC error and extend its assumption of convex bounded functions. Meanwhile, our analysis does not need assumptions with the Poisson equation in [22]. The convergence rate is quantified as O(logKnK)O\left(\frac{\log K}{\sqrt{nK}}\right), which is established after KK iterations, given a sufficiently large sample size NN. This result is instrumental in validating the effectiveness of the MCMC-SGD algorithm and provides a theoretical viewpoint that can be used to guide the design and tuning of the algorithm.

  3. 3.

    Escaping from saddle points and second-order convergence: Our investigation extends into the realm of second-order convergence properties. The SGD escaping from saddle points with unbiased gradient estimators has been studied in [14, 8, 20]. Under the influence of biased MCMC noise, it is imperative to examine the second-order convergence properties of MCMC-SGD for the problem (1.1). By substantiating the correlated negative curvature (CNC) condition under errors, we generalize the analysis of unbiased SGD escaping from saddle points in [8] to MCMC-SGD, where the gradient estimator is biased and requires more carefully analysis. A second-order convergence guarantee characterized by a rate of O(ϵ11/2log2(1ϵ))O\left(\epsilon^{-11/2}\log^{2}\left(\frac{1}{\epsilon}\right)\right). This guarantee not only provides theoretical assurance of the convergence speed but also serves specifically as a testament to the algorithm’s ability to efficiently approximate eigenvalues for variational eigenvalue problems in quantum mechanics.

The rest of this paper is organized as follows. In Section 2, we describe the applications of the general optimization problem and introduce the SGD algorithm with MCMC sampling. In Section 3, we first give our assumptions for the function and the variational distribution. Then, the sampling error is analyzed asymptotically by the concentration inequality for Markov chains. We prove that the MCMC-SGD algorithm converges to stationary points and provide the convergence rates. In Section 4, the correlated negative curvature condition of the MCMC-SGD algorithm is verified in empirical sampling under reasonable assumptions and the second-order convergence characteristics are further investigated. We establish the convergence guarantee to avoid saddle points with high probability, based on the stochastic perturbations of MCMC algorithms.

2 SGD with MCMC estimators

For the problem (1.1), we notice that parameter dependency exists in both the function and distribution. It is regarded as a more generalized form than purely optimizing distributions or functions in stochastic optimization problems.

To begin with, we formally present the assumptions regarding fθf_{\theta} and πθ\pi_{\theta} in problem (1.1).

Assumption 1.

For the stochastic optimization problem (1.1), the following two properties holds correspondingly.

  1. (1)

    The function fθ(x)f_{\theta}(x) satisfies

    𝔼xπθ[θfθ(x)]=0.\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}f_{\theta}(x)\right]=0. (2.1)
  2. (2)

    There exists an energy function ϕθ(x)\phi_{\theta}(x) such that the distribution πθ(x)\pi_{\theta}(x) is parameterized as

    πθ(x)=eϕθ(x)x𝒳eϕθ(x)𝑑xeϕθ(x),\pi_{\theta}(x)=\frac{e^{\phi_{\theta}(x)}}{\int_{x\in\mathcal{X}}e^{\phi_{\theta}(x)}dx}\propto e^{\phi_{\theta}(x)}, (2.2)

    where ϕθ(x)\phi_{\theta}(x) is a known parameterization.

The function fθ(x)f_{\theta}(x) need not be assumed bounded on the space 𝒳\mathcal{X}, as numerous examples have demonstrated that such an assumption is often difficult to satisfy. Assumption 1(1) constitutes an extension to the case where ff is independent of the parameter θ\theta, that is, θfθ(x)=0\nabla_{\theta}f_{\theta}(x)=0 for all x𝒳x\in\mathcal{X}. We naturally generalize the case where ff is parameter-independent. The function fθf_{\theta} with restriction (2.1) can not only cover a wider range of stochastic optimization problems but also retains the form of policy gradients, facilitating our optimization efforts.

The distribution πθ\pi_{\theta} is typically viewed as an artificially designed parametric distribution, encompassing these distributions from the exponential family (e.g., Gaussian, Bernoulli, Poisson distribution), structured probabilistic models such as graphical models (e.g., Bayesian networks, Markov random fields), and a series of deep learning architectures that serve as function approximators. In most scenarios, we may not have direct access to the value of πθ\pi_{\theta}, but instead πθ\pi_{\theta} is parameterized by the energy function ϕθ\phi_{\theta}. Assumption 1(2) expounds that πθ\pi_{\theta} possesses a unnormalized form with the energy function ϕθ\phi_{\theta}, which frequently occurs in distributions expressed by neural networks. The relation (2.2) is a weak assumption and often appears in the design of complex distributions. Since the energy function ϕθ\phi_{\theta} is typically chosen to be tractable and differentiable with respect to the parameters θ\theta, the computation of x𝒳eϕθ(x)𝑑x\int_{x\in\mathcal{X}}e^{\phi_{\theta}(x)}dx is often infeasible for complex models or high-dimensional spaces, as it involves an integral over all possible states of x𝒳x\in\mathcal{X}. However, MCMC techniques, such as Metropolis-Hastings algorithm and Stochastic Gradient Langevin Dynamics and so on, can sample from unnormalized distributions, thereby circumventing the need for complex integral computations.

To substantiate the validity of this framework, let us consider the following three problems as exemplary cases, where it can be verified that fθf_{\theta} is unbounded and satisfies (2.1).

2.1 Applications

2.1.1 Variational Monte Carlo for many-body quantum problems

Consider a many-body quantum system with NN particles. We denote the NN-particle configuration of this system by x:=(x1,,xN)𝒳:=𝒜Nx:=(x_{1},\cdots,x_{N})\in\mathcal{X}:=\mathcal{A}^{N}, with 𝒜\mathcal{A} being the one-particle configuration space, which can be continuous or discrete. The wavefunction Ψ:𝒳(or)\Psi:\mathcal{X}\rightarrow\mathbb{C}~{}({\rm or}~{}\mathbb{R}) describes the quantum state of the many-body system, and is often required to satisfy some symmetric/anti-symmetric conditions. The Hamiltonian \mathcal{H} is a self-adjoint operator acting on the wavefunction, which determines the dynamics and the ground state of the quantum system.

A central issue in quantum physics is to compute the ground state energy and wavefunction of the system, which corresponds to the lowest eigenvalue E0E_{0} and its corresponding eigenfunction of

Ψ0=E0Ψ0.\displaystyle\mathcal{H}\Psi_{0}=E_{0}\Psi_{0}. (2.1)

Since it is challenge to solve the very high dimensional eigenvalue problem (2.1) directly, an alternative way is to minimize the following Rayleigh quotient

E0minθΘx𝒳Ψθ(x)(Ψθ)(x)𝑑xx𝒳Ψθ(x)Ψθ(x)𝑑x,\displaystyle E_{0}\approx\min_{\theta\in\Theta}\frac{\int_{x\in\mathcal{X}}\Psi_{\theta}^{*}(x)\cdot\big{(}\mathcal{H}\Psi_{\theta}\big{)}(x)dx}{\int_{x\in\mathcal{X}}\Psi_{\theta}^{*}(x)\cdot\Psi_{\theta}(x)dx}, (2.2)

where the wavefunction is parameterized by a suitable ansatz Ψθ,θΘ\Psi_{\theta},~{}\theta\in\Theta within a finite parameter space. When 𝒳\mathcal{X} is a discrete configuration space, the above integrals in (2.2) is regarded as summations over all configurations. We can regard (2.2) as a stochastic optimization problem by the following reformulation:

minθΘx𝒳Ψθ(x)Ψθ(x)x𝒳Ψθ(x)Ψθ(x)𝑑xπθ(x)Ψθ(x)Ψθ(x)fθ(x)𝑑x=𝔼xπθ[fθ(x)].\displaystyle\min_{\theta\in\Theta}\int_{x\in\mathcal{X}}\underbrace{\frac{\Psi_{\theta}^{*}(x)\cdot\Psi_{\theta}(x)}{\int_{x\in\mathcal{X}}\Psi_{\theta}^{*}(x)\cdot\Psi_{\theta}(x)dx}}_{\pi_{\theta}(x)}\cdot\underbrace{\frac{\mathcal{H}\Psi_{\theta}(x)}{\Psi_{\theta}(x)}}_{f_{\theta}(x)}dx=\mathbb{E}_{x\sim\pi_{\theta}}\left[f_{\theta}(x)\right]. (2.3)

As the Hamiltonian \mathcal{H} is a self-adjoint linear operator, the following holds for fθ(x)f_{\theta}(x):

𝔼xπθ[θfθ(x)]=x𝒳Ψθ(x)θΨθ(x)θΨθ(x)Ψθ(x)dxx𝒳Ψθ(x)Ψθ(x)𝑑x=0.\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}f_{\theta}(x)\right]=\frac{\int_{x\in\mathcal{X}}\Psi_{\theta}^{*}(x)\cdot\mathcal{H}\nabla_{\theta}\Psi_{\theta}(x)-\nabla_{\theta}\Psi_{\theta}^{*}(x)\cdot\mathcal{H}\Psi_{\theta}(x)dx}{\int_{x\in\mathcal{X}}\Psi_{\theta}^{*}(x)\cdot\Psi_{\theta}(x)dx}=0.

For examples, the NN-electron configuration for a many-electron system in 33 dimension is x=(x1,,xN)x=(x_{1},\cdots,x_{N}) with xi=(ri,σi)𝒜=3×2x_{i}=(r_{i},\sigma_{i})\in\mathcal{A}=\mathbb{R}^{3}\times\mathbb{Z}_{2}, where rir_{i} represents the spatial coordinate and σi\sigma_{i} is the spin coordinate. The Hamiltonian of the electron system is given by

=12i=1NΔri+i=1Nvext(ri)+1i<jNvee(ri,rj),\displaystyle\mathcal{H}=-\frac{1}{2}\sum_{i=1}^{N}\Delta_{r_{i}}+\sum_{i=1}^{N}v_{\mathrm{ext}}(r_{i})+\sum_{1\leq i<j\leq N}v_{\mathrm{ee}}(r_{i},r_{j}), (2.4)

where vext:3v_{\mathrm{ext}}:\mathbb{R}^{3}\rightarrow\mathbb{R} is the ionic potential and vee:3×3v_{\mathrm{ee}}:\mathbb{R}^{3}\times\mathbb{R}^{3}\rightarrow\mathbb{R} represents the interaction between electrons

vext(r)=IZI|rRI|,vee(r,r)=1|rr|,r,r3,\displaystyle v_{\mathrm{ext}}(r)=-\sum_{I}\frac{Z_{I}}{|r-R_{I}|},~{}~{}v_{\mathrm{ee}}(r,r^{\prime})=\frac{1}{|r-r^{\prime}|},\quad r,r^{\prime}\in\mathbb{R}^{3}, (2.5)

with RI3R_{I}\in\mathbb{R}^{3} and ZIZ_{I}\in\mathbb{N} being the position and atomic number of the II-th nuclear.

In quantum many-body systems, the classical VMC employs MCMC algorithms to compute approximate gradients, which garnered renewed interest due to the rise of modern neural network ansatzes. These techniques typically rely on the ability of artificial neural networks to represent complex high-dimensional wavefunction, which have already been explored in many fields of physics and chemistry. The RBM is first proposed by Carleo and Troyer [6] as a variational ansatz for many-body quantum problems. Furthermore, a large number of deep neural networks, such as FFNN [41, 5], deep RBMs [9, 15, 36, 23], convolutional neural networks [27, 7], variational autoencoders [40], have been applied to capture the physical features and improve the accuracy of the ground state. Motivated by the traditional Slater-Jastrow-backflow ansatzes, PauliNet [17] and FermiNet [38] use the permutation equivariant and invariant construction of networks and many determinants to approximate a general antisymmetric wavefunction. However, the convergence properties of VMC methods have remained underexplored in a long time. A recent study [1] by Abrahamsen et al. provides a proof of convergence for VMC methods with an unbiased gradient estimator.

2.1.2 Entropy-regularized reinforcement learning

In RL, exploration in actions is critical to finding good policies during optimization. Without a large number of diverse state-action pairs, the algorithm might settle for a weak policy. To prevent the concentration in policy, the entropy regularization is used to encourage random exploration [46, 32]. Let x:=(s0,a0,s1,a1,)x:=(s_{0},a_{0},s_{1},a_{1},\dots) be the trajectory and r(x)r(x) be the cumulative reward over the trajectory xx. Compared to the classical reinforcement learning, the introduction of entropy regularization for the parameterized policy πθ(x)\pi_{\theta}(x) yields a distinct objective function, that is,

minθΘ𝔼xπθ[r(x)]+β𝔼xπθ[logπθ(x)]=𝔼xπθ[r(x)+βlogπθ(x)fθ(x)].\min_{\theta\in\Theta}~{}\mathbb{E}_{x\sim\pi_{\theta}}\left[-r(x)\right]+\beta\mathbb{E}_{x\sim\pi_{\theta}}\left[\log\pi_{\theta}(x)\right]=\mathbb{E}_{x\sim\pi_{\theta}}\Big{[}\underbrace{-r(x)+\beta\log\pi_{\theta}(x)}_{f_{\theta}(x)}\Big{]}. (2.6)

Since x𝒳πθ(x)𝑑x=1\int_{x\in\mathcal{X}}\pi_{\theta}(x)dx=1, it holds that

βx𝒳θπθ(x)=β𝔼xπθ[θlogπθ(x)]=𝔼xπθ[θfθ(x)]=0.\beta\int_{x\in\mathcal{X}}\nabla_{\theta}\pi_{\theta}(x)=\beta\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(x)\right]=\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}f_{\theta}(x)\right]=0. (2.7)

2.1.3 Variational Inference

Approximating complex distributions constitutes a significant facet of the Bayesian statistics. Compared to directly handling distributions within high-dimensional spaces, variational inference (VI) seeks to circumvent the computational complexities by approximating the target distribution with a parameterized simpler distribution [21, 39, 25, 4]. This approach is particularly advantageous when the exact posterior distribution is either too complex to be analytically solvable or when it is computationally prohibitive to sample from, as is often the case in high-dimensional Bayesian inference problems. VI operates on the principle of optimization, where the objective is to minimize the discrepancy between the approximating variational distribution πθ(x|y)\pi_{\theta}(x|y) and the true posterior p(x,y)p(x,y). This discrepancy is frequently quantified using the Kullback-Leibler (KL) divergence, a non-symmetric measure of the difference between two probability distributions:

minθΘKL(p(x,y)||πθ(x|y)):=𝔼xπθ(|y)[log(p(x,y)πθ(x|y))fθ(x)].\min_{\theta\in\Theta}~{}\mathrm{KL}(p(x,y)||\pi_{\theta}(x|y)):=\mathbb{E}_{x\sim\pi_{\theta}(\cdot|y)}\Bigg{[}\underbrace{\log\left(\frac{p(x,y)}{\pi_{\theta}(x|y)}\right)}_{f_{\theta}(x)}\Bigg{]}. (2.8)

Analogous to (2.7), we have that 𝔼xπθ[θfθ(x)]=0\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}f_{\theta}(x)\right]=0 for fθ(x)f_{\theta}(x) defined in (2.8).

There are also a lot of attention to improving the explicit variational distribution in VI by applying MCMC sampling. Salimans et al. (2015) propose a hybrid approach that utilizes MCMC within the variational framework to refine the variational distribution [42]. By employing MCMC transitions that leave the target posterior distribution invariant, they introduce a rich class of variational distributions that can potentially capture complex dependencies and multimodal structures. A continuous relaxation is introduced by Maddison et al. (2017) for discrete random variables [28], which facilitates the application of gradient-based optimization techniques. Their paper is particularly relevant when dealing with discrete latent variables in VI, where MCMC sampling can be leveraged to approximate the gradients of the objective function more effectively. Li et al. (2017) propose utilizing MCMC sampling not just as a means of refining a given variational distribution, but also as a mechanism for learning an efficient proposal distribution for variational inference [26]. By amortizing the cost of MCMC sampling across different iterations of the inference process, they aim to reduce the computational overhead while maintaining the benefits of MCMC sampling efficacy.

2.2 Gradient approximation using MCMC

To solve the stochastic optimization problem (1.1), a common approach is to compute the stochastic gradient for iterations. As the Assumption 1 holds, the gradient of objective function (θ)\mathcal{L}(\theta) is derived by

g(θ)\displaystyle g(\theta) =𝔼xπθ[fθ(x)θlogπθ(x)]+𝔼xπθ[θfθ(x)]\displaystyle=\mathbb{E}_{x\sim\pi_{\theta}}\left[f_{\theta}(x)\nabla_{\theta}\log\pi_{\theta}(x)\right]+\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}f_{\theta}(x)\right] (2.9)
=𝔼xπθ[(fθ(x)𝔼xπθ[fθ(x)])θϕθ(x)].\displaystyle=\mathbb{E}_{x\sim\pi_{\theta}}\left[\left(f_{\theta}(x)-\mathbb{E}_{x\sim\pi_{\theta}}\left[f_{\theta}(x)\right]\right)\nabla_{\theta}\phi_{\theta}(x)\right].

The second equality is due to 𝔼xπθ[θfθ(x)]=0\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}f_{\theta}(x)\right]=0 and θlogπθ(x)=θϕθ(x)𝔼xπθ[θϕθ(x)]\nabla_{\theta}\log\pi_{\theta}(x)=\nabla_{\theta}\phi_{\theta}(x)-\mathbb{E}_{x\sim\pi_{\theta}}\left[\nabla_{\theta}\phi_{\theta}(x)\right], implied by Assumption 1. We notice that the gradient can be expressed in the form of an expectation, thus allowing the approximation of the gradient through the Monte Carlo method. As for the unnormalized distribution πθ(x)ϕθ(x)\pi_{\theta}(x)\propto\phi_{\theta}(x), it is natural to employ MCMC methods to estimate the expectations presented within gradients.

Let us consider a Markov chain {Xi}i=1\{X_{i}\}_{i=1}^{\infty} with a stationary distribution πθ(x)eϕθ(x)\pi_{\theta}(x)\propto e^{\phi_{\theta}(x)}. We have obtained a set of samples S={xi}i=n0+1n0+nS=\{x_{i}\}_{i=n_{0}+1}^{n_{0}+n} following n0n_{0} burn-in steps. These samples are generated to approximate expectations with respect to πθ(x)\pi_{\theta}(x), which is typically intractable. The objective function and its gradient is estimated by the MCMC samples SS:

^(θ;S)\displaystyle\hat{\mathcal{L}}(\theta;S) =1|S|xSfθ(x),\displaystyle=\frac{1}{|S|}\sum_{x\in S}f_{\theta}(x), (2.10)
g^(θ;S)\displaystyle\hat{g}(\theta;S) =2|S|xS(fθ(x)1|S|xSfθ(x))θϕθ(x).\displaystyle=\frac{2}{|S|}\sum_{x\in S}\big{(}f_{\theta}(x)-\frac{1}{|S|}\sum_{x\in S}f_{\theta}(x)\big{)}\nabla_{\theta}\phi_{\theta}(x). (2.11)

While MCMC methods offer a tractable and efficient strategy for approximating expectations, it must be noted that estimators derived from finite MCMC samples are generally biased. Furthermore, the samples are correlated, which affects the variance of the estimator. To mitigate bias, one must ensure a sufficiently large sample size. However, an excessively large sample size may increase computational burden and reduce the stochasticity for escaping local regions. In light of these considerations, the practitioner must carefully balance the sample size to minimize bias while preserving randomness and computational tractability. A rigorous approach may involve adaptive MCMC methods that adjust the burn-in period and sample size based on convergence diagnostics of Markov chains. It suggests that we should pay attention to the influence of sampling methods in optimization.

The SGD algorithm is utilized to update the parameters with MCMC samples:

θk+1=θkαkg^(θk,Sk),\theta_{k+1}=\theta_{k}-\alpha_{k}\hat{g}(\theta_{k},S_{k}), (2.12)

where αk\alpha_{k} are chosen stepsizes and SkS_{k} is the obtained samples with the stationary distribution πθk\pi_{\theta_{k}}. MCMC-SGD provides a powerful framework for optimization in settings where the objective function is defined in terms of expectations over complex distributions. A detailed analysis on the interplay between SGD and MCMC sampling will aid us in deepening our understanding of convergence and refining the algorithms.

Algorithm 1 MCMC-SGD
0:  The initialized parameter θ0\theta_{0}, the sample size nn and the length of burn-in periods n0n_{0}.
1:  for  k=0,1,2,k=0,1,2,\dots do
2:     Draw nn samples Sk={𝒙n0+1(k),,𝒙n0+n(k)}S_{k}=\{\bm{x}_{n_{0}+1}^{(k)},\dots,\bm{x}^{(k)}_{n_{0}+n}\} by the MCMC algorithm after a burn-in period of n0n_{0}.
3:     Compute the estimated gradient g^(θk,Sk)\hat{g}(\theta_{k},S_{k}) by (2.11).
4:     Update the parameter by (2.12) with the stepsize αk\alpha_{k}.
5:  end for

3 Convergence analysis of MCMC-SGD

In this section, we explore how SGD performs when it is coupled with MCMC samples. We begin by stating a number of assumptions that are crucial to guaranteeing the necessary characteristics of both the objective function and the sampling process. Then, the MCMC estimator is analyzed by the concentration inequality specific to Markov chains. This approach is fundamentally distinct from that employed in traditional SGD settings. Upon establishing this analytical framework, we present our convergence theorem for the MCMC-SGD. This theorem reveals that, within a practical and flexible situation, the expected value of the gradient norm is assured to converge to zero. This convergence is noteworthy since it persists despite the bias inherent in MCMC sampling methods, highlighting the effectiveness of the algorithm under a variety of conditions. It enhances our theoretical understanding of the integration of MCMC-SGD and offers guidance for improving algorithmic methodologies for various stochastic optimization problems.

3.1 Assumptions

To lay the groundwork for our subsequent assumptions, we first expound upon the definition of sub-exponential random variables. This class of random variables is characterized by their tail behavior, which, in a certain sense, is less extreme than that of heavy-tailed distributions. A precise mathematical definition is as follows:

Definition 1.

The sub-exponential norm of a random variable XX, which assumes values in a space 𝒳\mathcal{X} and follows a distribution π\pi, is defined by the expression

Xψ1(π;𝒳)=inf{t>0:𝔼π[exp(|X|t)]2}.\left\|X\right\|_{\psi_{1}(\pi;\mathcal{X})}=\inf\left\{t>0:\mathbb{E}_{\pi}\left[\exp\left(\frac{|X|}{t}\right)\right]\leq 2\right\}.

In the event that Xψ1(π;𝒳)\left\|X\right\|_{\psi_{1}(\pi;\mathcal{X})} is bounded above by a finite value, XX is called sub-exponential with parameter Xψ1(π;𝒳)\left\|X\right\|_{\psi_{1}(\pi;\mathcal{X})}. Moreover, the sub-exponential norm of a measurable function ff with respect to a distribution π\pi is defined by

fψ1(π;𝒳)=inf{t>0:𝔼π[exp(|f(X)|t)]2}.\left\|f\right\|_{\psi_{1}(\pi;\mathcal{X})}=\inf\left\{t>0:\mathbb{E}_{\pi}\left[\exp\left(\frac{|f(X)|}{t}\right)\right]\leq 2\right\}.

We introduce some regularity conditions on fθf_{\theta}. Noticing that fθf_{\theta} may be unbounded for a general class problems, we give the sub-exponential assumption of fθf_{\theta} under the distribution πθ\pi_{\theta}.

Assumption 2.

Let fθ(x)f_{\theta}(x) given in (1.1) satisfy the following conditions. There exist constants M,L2>0M,L_{2}>0 such that

  1. (1)

    supθΘfθψ1(πθ;𝒳)M\sup\limits_{\theta\in\Theta}\left\|f_{\theta}\right\|_{\psi_{1}(\pi_{\theta};\mathcal{X})}\leq M,

  2. (2)

    supθΘ𝔼πθ[θfθ(x)2]L22\sup\limits_{\theta\in\Theta}\mathbb{E}_{\pi_{\theta}}\left[\left\|\nabla_{\theta}f_{\theta}(x)\right\|^{2}\right]\leq L_{2}^{2}.

The mentioned two assumptions are both reasonable and encompass a broad class of scenarios. A common example is fθ=r(x)+βlogπθ(x)f_{\theta}=-r(x)+\beta\log\pi_{\theta}(x) with |r(x)|R|r(x)|\leq R for all x𝒳x\in\mathcal{X}, which often appears in the optimization of entropy regularization and KL divergence. In this situation, fθψ1(πθ;𝒳)t\left\|f_{\theta}\right\|_{\psi_{1}(\pi_{\theta};\mathcal{X})}\leq t is equivalent to

𝔼πθ[exp(|r(x)+βlogπθ(x)|t)]\displaystyle\mathbb{E}_{\pi_{\theta}}\left[\exp\left(\frac{|-r(x)+\beta\log\pi_{\theta}(x)|}{t}\right)\right] 𝔼πθ[exp(Rβlogπθ(x)t)]\displaystyle\leq\mathbb{E}_{\pi_{\theta}}\left[\exp\left(\frac{R-\beta\log\pi_{\theta}(x)}{t}\right)\right]
=x𝒳eR/tπθ(x)1β/t𝑑x2.\displaystyle=\int_{x\in\mathcal{X}}e^{R/t}\pi_{\theta}(x)^{1-\beta/t}dx\leq 2.

Since x𝒳πθ𝑑x=1\int_{x\in\mathcal{X}}\pi_{\theta}dx=1 holds for the distribution πθ\pi_{\theta}, there must exist a constant tθt_{\theta} relying θ\theta such that x𝒳eR/tθπθ(x)1β/tθ𝑑x2\int_{x\in\mathcal{X}}e^{R/t_{\theta}}\pi_{\theta}(x)^{1-\beta/t_{\theta}}dx\leq 2 and then Assumption 2 (1) is easily satisfied for a compact parameter space Θ\Theta. Meanwhile, 𝔼πθ[θfθ(x)2]=β2𝔼πθ[θϕθ(x)𝔼πθ[θϕθ(x)]2]\mathbb{E}_{\pi_{\theta}}\left[\left\|\nabla_{\theta}f_{\theta}(x)\right\|^{2}\right]=\beta^{2}\mathbb{E}_{\pi_{\theta}}\left[\left\|\nabla_{\theta}\phi_{\theta}(x)-\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\phi_{\theta}(x)\right]\right\|^{2}\right], which can be bounded through Assumption 3 (1) on the variational distribution πθ\pi_{\theta}. By establishing the reasonableness of the Assumption 3 (1) on the variational distribution, one can deduce the validity of the Assumption 2 (2) in this setting. For another instances, when solving the many-electron Schrödinger equations, people often use the Jastrow factor [18] to enable the network to efficiently capture the cusps and decay of the wavefunction. By exploiting physical knowledge in the construction of the wavefunction ansatz can make the function fθf_{\theta} smooth with respect to the parameters θ\theta, such that Assumption 2 is satisfied in practical VMC calculations.

There are also some regularity conditions on the variational distribution πθ(x)\pi_{\theta}(x), in order to guarantee Lipschitz continuity of the gradient g(θ)g(\theta). Typically, it is more practical to apply these smoothing measures to the energy function ϕθ(x)\phi_{\theta}(x) rather than directly to the distribution πθ(x)\pi_{\theta}(x). This kind of smoothing helps us control how the gradient changes, which is important for making sure our optimization methods work efficiently and reliably.

Assumption 3.

Let ϕθ(x)\phi_{\theta}(x) be differentiable with respect to the parameters θΘ\theta\in\Theta for any x𝒳x\in\mathcal{X}. There exist constants B,L1>0B,L_{1}>0 such that

  1. (1)

    supθΘsupx𝒳θϕθ(x)B\sup\limits_{\theta\in\Theta}\sup\limits_{x\in\mathcal{X}}\left\|\nabla_{\theta}\phi_{\theta}(x)\right\|\leq B,

  2. (2)

    supθΘ𝔼πθ[θ2ϕθ(x)2]L12\sup\limits_{\theta\in\Theta}~{}\mathbb{E}_{\pi_{\theta}}\left[\left\|\nabla_{\theta}^{2}\phi_{\theta}(x)\right\|^{2}\right]\leq L_{1}^{2}.

The reasonableness of these two assumptions is solely based on the choice of parameterization. For instance, πθ\pi_{\theta} is taken as a canonical form of exponential family, i.e., ϕθ(x)=θTT(x)\phi_{\theta}(x)=\theta^{\mathrm{T}}T(x), where θ\theta represents the natural parameter vector and T(x)T(x) is the sufficient statistic. As θϕθ(x)=T(x)\nabla_{\theta}\phi_{\theta}(x)=T(x) and θ2ϕθ(x)=0\nabla_{\theta}^{2}\phi_{\theta}(x)=0, the Assumption 3 holds when T(x)T(x) is uniformly bounded for all x𝒳x\in\mathcal{X}. If ϕθ\phi_{\theta} is parameterized by a complex network, it is equally imperative to ensure that its gradients with respect to any input and parameters are bounded.

3.2 Analysis of the MCMC error

We introduce following notations which are used frequently throughout this paper. For any function f:𝒳f:\mathcal{X}\rightarrow\mathbb{R} and any distribution π\pi on 𝒳\mathcal{X}, we write its expectation 𝔼π[f]:=f(x)π(dx)\mathbb{E}_{\pi}[f]:=\int f(x)\pi(dx) and pp-th central moment σpp[f]:=𝔼[(f𝔼π[f])p]\sigma^{p}_{p}[f]:=\mathbb{E}[(f-\mathbb{E}_{\pi}[f])^{p}]. The second central moment, called the variance, is denoted by Varπ[f]=σ22[f]:=𝔼[(f𝔼π[f])2]\mathrm{Var}_{\pi}[f]=\sigma_{2}^{2}[f]:=\mathbb{E}[(f-\mathbb{E}_{\pi}[f])^{2}].

Before the analysis of MCMC methods, we provide several general definitions about Markov chains. Within our consideration, the state space 𝒳\mathcal{X} is Polish and equipped with its σ\sigma-algebra \mathcal{B}. Let {Xi}i=1n\{X_{i}\}_{i=1}^{n} be a time-homogeneous Markov chain defined on 𝒳\mathcal{X}. The distribution of the Markov chain is uniquely determined by its initial distribution ν\nu and its transition kernel PP. For any Borel set AA\in\mathcal{B}, let

ν(A)=(X1A),P(Xi,A)=(Xi+1A|Xi).\displaystyle\nu(A)=\mathbb{P}(X_{1}\in A),\quad P(X_{i},A)=\mathbb{P}(X_{i+1}\in A|X_{i}).

A distribution π\pi is called stationary with respect to a transition kernel PP if

π(A)=P(x,A)π(dx),A.\pi(A)=\int P(x,A)\pi(dx),~{}\forall A\in\mathcal{B}.

When the initial distribution ν=π\nu=\pi, we call the Markov chain stationary.

Our analysis starts from the perspective of operator theory on Hilbert spaces. Let π\pi be the stationary distribution of a Markov chain and L2(π)={f:𝔼π[f2]<}L^{2}(\pi)=\{f:\mathbb{E}_{\pi}[f^{2}]<\infty\} be the Hilbert space equipped with the norm fπ=(𝔼π[f2])1/2\left\|f\right\|_{\pi}=(\mathbb{E}_{\pi}[f^{2}])^{1/2}. Each transition kernel can be viewed as a Markov operator on the Hilbert space L2(π)L^{2}(\pi). The Markov operator P:L2(π)L2(π)\operatorname{P}:L^{2}(\pi)\rightarrow L^{2}(\pi) is defined by

Pf(x)=f(y)P(x,dy),x𝒳,fL2(π).\displaystyle\operatorname{P}f(x)=\int f(y)P(x,dy),~{}\forall x\in\mathcal{X},~{}\forall f\in L^{2}(\pi).

It is easy to show that P\operatorname{P} has the largest eigenvalue 11. Intuitively but not strictly, the gap between 11 and other eigenvalues matters to the Markov chain from non-stationarity towards stationarity. Hence, we introduce the definition of the absolute spectral gap.

Definition 2 (absolute spectral gap).

A Markov operator P:L2(π)L2(π)\operatorname{P}:L^{2}(\pi)\rightarrow L^{2}(\pi) admits an absolute spectral gap γ\gamma if

γ(P)=1λ(P):=1PΠπ>0,\gamma(\operatorname{P})=1-\lambda(\operatorname{P}):=1-\interleave\operatorname{P}-\Pi\interleave_{\pi}>0,

where Π:fL2(π)𝔼π[f]1\Pi:f\in L^{2}(\pi)\rightarrow\mathbb{E}_{\pi}[f]\mathit{1} is the projection operator with 1\mathit{1} denoting the identity operator and π\interleave\cdot\interleave_{\pi} is the operator norm induced by π\left\|\cdot\right\|_{\pi} on L2(π)L^{2}(\pi).

We consider the sample set S={xi}i=n0+1n0+nS=\{x_{i}\}_{i=n_{0}+1}^{n_{0}+n} generated from the desired distribution πθ\pi_{\theta} by the MCMC algorithm . The Markov operator, denoted by Pθ\operatorname{P}_{\theta}, determines the convergence rate of the Markov chain. We assume that there exists a uniform lower bound of the spectral gap.

Assumption 4.

Let Pθ\operatorname{P}_{\theta} be the Markov operator induced by the MCMC algorithm with the absolute spectral gap γ(Pθ)\gamma(\operatorname{P}_{\theta}). For any θΘ\theta\in\Theta, there is a positive lower bound of absolute spectral gaps, that is,

γ:=infθΘγ(Pθ)>0.\gamma:=\inf_{\theta\in\Theta}\gamma(\operatorname{P}_{\theta})>0.

This assumption excludes the situation that infθΘγ(Pθ)=0\inf_{\theta\in\Theta}\gamma(\operatorname{P}_{\theta})=0, which ensures the uniform convergence of the Markov chain.. The spectral gap γ\gamma measures the lower bound on the mixing rate of the Markov chain, thereby guaranteeing the ergodicity of the MCMC algorithm. Compared to the uniformly geometric ergodicity assumptions commonly found in other articles [22, 3], the spectral gap assumption is significantly tighter, intuitively reflecting the factor that influence the convergence rate of a Markov chain. The spectral gap of the MCMC algorithm has been studied for some specific examples. On finite-state spaces, P\operatorname{P} becomes a transition matrix while 1γ(P)1-\gamma(\operatorname{P}) relates to its eigenvalue. A survey of spectrums for Markov chains on discrete state-spaces is in [43]. This develops amounts of analytic techniques and [10] has further applications to the Metropolis-Hastings algorithm. For the continuous spaces, there are few examples of sharp rates of convergence for the Metropolis-Hastings algorithm. In [24, 31], it is claimed that the spectral gap γO(ϵ2)\gamma\sim O(\epsilon^{2}) for the Gaussian proposal Q(x|x)exp(12ϵ2xx2)Q(x^{\prime}|x)\sim\exp\left(-\tfrac{1}{2\epsilon^{2}}\left\|x^{\prime}-x\right\|^{2}\right).

We commence by stating Theorem 1, which articulates the core result that analyzes the error of MCMC gradient estimators. The proof of this theorem will rely on a series of auxiliary propositions, including Lemmas 1, 2 and 3. Lemma 1 introduces a Bernstein inequality for general Markov chains. Using the Bernstein inequality, we obtain the tail bound of the MCMC estimators for sub-exponential functions in Lemma 2. Then, a detailed and intuitive error bound for the MCMC estimators is of consideration in Lemma 3. Finally, through Lemma 3, we succinctly capture the sampling error inherent in MCMC algorithms. In the following theorem, we delineate the error bounds of this stochastic gradient, which is one of main results in this paper.

Theorem 1.

Let Assumptions 1, 2, 3 and 4 hold. For a fixed parameter θΘ\theta\in\Theta, we generate the MCMC samples S={xi}i=n0+1n0+nS=\{x_{i}\}_{i=n_{0}+1}^{n_{0}+n} with the stationary distribution πθ\pi_{\theta} by the MCMC algorithm. Suppose we start from the initial distribution ν\nu and let χ=χ(ν,πθ)<+{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}={\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu,\pi_{\theta})<+\infty. The stochastic gradient g^(θ;S)\hat{g}(\theta;S), defined by (2.11) has the following error bounds that

𝔼[g^(θ;S)]g(θ)\displaystyle\left\|\mathbb{E}[\hat{g}(\theta;S)]-g(\theta)\right\| Bn,n0:=4c1Bnγ,\displaystyle\leq B_{n,n_{0}}:=\frac{4c_{1}B}{n\gamma}, (3.1)
𝔼[g^(θ;S)g(θ)2]\displaystyle\mathbb{E}\left[\left\|\hat{g}(\theta;S)-g(\theta)\right\|^{2}\right] Vn,n0:=16c2B2σ22[fθ]nγ+40(c3+c4log4n)B2M2n2γ2,\displaystyle\leq V_{n,n_{0}}:=\frac{16c_{2}B^{2}\sigma^{2}_{2}[f_{\theta}]}{n\gamma}+\frac{40(c_{3}+c_{4}\log^{4}n)B^{2}M^{2}}{n^{2}\gamma^{2}},

where with χn0=(1γ)n0χ{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}_{n_{0}}=(1-\gamma)^{n_{0}}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}} and C=(1+χn0)12C=(1+{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}_{n_{0}})^{\frac{1}{2}}, these factors are defined by c1=χn0σ2[fθ]+4M[logχn0]+2+4M[logχn0]+c_{1}={\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}_{n_{0}}\sigma_{2}[f_{\theta}]+4M[\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}_{n_{0}}]^{2}_{+}+4M[\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}_{n_{0}}]_{+}, c2=64(1+log2C)c_{2}=64(1+\log 2C) and c3=100(16+4log2C)4c_{3}=100(16+4\log 2C)^{4}, c4=200c_{4}=200.

Proof.

The error of the stochastic gradient can be rewritten as

𝔼[g^(θ;S)]g(θ)\displaystyle\left\|\mathbb{E}[\hat{g}(\theta;S)]-g(\theta)\right\| =supv=1|𝔼[vTg^(θ;S)]vTg(θ)|,\displaystyle=\sup_{\left\|v\right\|=1}\left|\mathbb{E}[v^{\mathrm{T}}\hat{g}(\theta;S)]-v^{\mathrm{T}}g(\theta)\right|,
𝔼[g^(θ;S)g(θ)2]\displaystyle\mathbb{E}\left[\left\|\hat{g}(\theta;S)-g(\theta)\right\|^{2}\right] =tr(𝔼[(g^(θ;S)g(θ))(g^(θ;S)g(θ))T]),\displaystyle=\operatorname{tr}\left(\mathbb{E}\left[\left(\hat{g}(\theta;S)-g(\theta)\right)\left(\hat{g}(\theta;S)-g(\theta)\right)^{\mathrm{T}}\right]\right),
dsupv=1𝔼[(vTg^(θ;S)vTg(θ))2].\displaystyle\leq d\sup_{\left\|v\right\|=1}\mathbb{E}[\left(v^{\mathrm{T}}\hat{g}(\theta;S)-v^{\mathrm{T}}g(\theta)\right)^{2}].

For any given vv such that v=1\left\|v\right\|=1, vTg(θ)v^{\mathrm{T}}g(\theta) is approximated by vTg^(θ;S)v^{\mathrm{T}}\hat{g}(\theta;S). We denote the stationary variables F=fθ(X)F=f_{\theta}(X), Y=vTθϕθ(X)Y=v^{\mathrm{T}}\nabla_{\theta}\phi_{\theta}(X), F¯=fθ(X)𝔼xπθ[fθ(x)]\bar{F}=f_{\theta}(X)-\mathbb{E}_{x\sim\pi_{\theta}}[f_{\theta}(x)] with XπθX\sim\pi_{\theta} and the empirical variables Fi=fθ(xi)F_{i}=f_{\theta}(x_{i}), Yi=vTθϕθ(xi)Y_{i}=v^{\mathrm{T}}\nabla_{\theta}\phi_{\theta}(x_{i}), F¯i=Fi𝔼xπθ[fθ(x)]\bar{F}_{i}=F_{i}-\mathbb{E}_{x\sim\pi_{\theta}}[f_{\theta}(x)]. Then it holds

12(vTg^(θ;S)vTg(θ))\displaystyle\frac{1}{2}\left(v^{\mathrm{T}}\hat{g}(\theta;S)-v^{\mathrm{T}}g(\theta)\right) =1ni=n0+1n+n0FiYi(1nj=n0+1n+n0Fj)(1ni=n0+1n+n0Yi)𝔼πθ[F¯Y]\displaystyle=\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}F_{i}Y_{i}-\left(\frac{1}{n}\sum_{j=n_{0}+1}^{n+n_{0}}F_{j}\right)\left(\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}Y_{i}\right)-\mathbb{E}_{\pi_{\theta}}[\bar{F}Y] (3.2)
=1ni=n0+1n+n0F¯iYi𝔼πθ[F¯Y]I1(1nj=n0+1n+n0F¯i)(1ni=n0+1n+n0Yi)I2.\displaystyle=\underbrace{\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}\bar{F}_{i}Y_{i}-\mathbb{E}_{\pi_{\theta}}[\bar{F}Y]}_{I_{1}}-\underbrace{\left(\frac{1}{n}\sum_{j=n_{0}+1}^{n+n_{0}}\bar{F}_{i}\right)\left(\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}Y_{i}\right)}_{I_{2}}.

With Assumptions 3 and 2, we have F¯ψ1(πθ,𝒳)M\left\|\bar{F}\right\|_{\psi_{1}(\pi_{\theta},\mathcal{X})}\leq M and YvθlogΨθ(x)B\left\|Y\right\|\leq\left\|v\right\|\left\|\nabla_{\theta}\log\Psi_{\theta}(x)\right\|\leq B. Then the variance is bounded by σ22[F¯Y]𝔼[(F¯Y)2]σ22[F]B2\sigma_{2}^{2}[\bar{F}Y]\leq\mathbb{E}[(\bar{F}Y)^{2}]\leq\sigma_{2}^{2}[F]B^{2}. It also holds

𝔼[exp(|F¯Y𝔼[F¯Y]|2MB)]𝔼[exp(|F¯|B+MB2MB)]2,\displaystyle\mathbb{E}\left[\exp\left(\frac{\left|\bar{F}Y-\mathbb{E}[\bar{F}Y]\right|}{2MB}\right)\right]\leq\mathbb{E}\left[\exp\left(\frac{\left|\bar{F}\right|B+MB}{2MB}\right)\right]\leq 2,

which implies F¯Yψ1(πθ,𝒳)2MB\left\|\bar{F}Y\right\|_{\psi_{1}(\pi_{\theta},\mathcal{X})}\leq 2MB. Applying Lemma 3 to {F¯iYi}i=n0+1n0+n\{\bar{F}_{i}Y_{i}\}_{i=n_{0}+1}^{n_{0}+n}, we have

𝔼[I1]c1Bnγ,𝔼[I12]c2dB2σ22[Fθ]nγ+4(c3d+c4dlog4n)M2B2n2γ2.\displaystyle\left\|\mathbb{E}[I_{1}]\right\|\leq\frac{c_{1}B}{n\gamma},\quad\mathbb{E}\left[\left\|I_{1}\right\|^{2}\right]\leq\frac{c_{2}dB^{2}\sigma^{2}_{2}[F_{\theta}]}{n\gamma}+\frac{4(c_{3}d+c_{4}d\log^{4}n)M^{2}B^{2}}{n^{2}\gamma^{2}}. (3.3)

As F¯ψ1(πθ,𝒳)M\left\|\bar{F}\right\|_{\psi_{1}(\pi_{\theta},\mathcal{X})}\leq M, 𝔼πθ[F¯]=0\mathbb{E}_{\pi_{\theta}}[\bar{F}]=0 and 1ni=n0+1n+n0YiB\left\|\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}Y_{i}\right\|\leq B, Lemma 3 implies that

𝔼[I2]c1Bnγ,𝔼[I22]c2dB2σ22[Fθ]nγ+(c3d+c4dlog4n)M2B2n2γ2.\displaystyle\left\|\mathbb{E}[I_{2}]\right\|\leq\frac{c_{1}B}{n\gamma},\quad\mathbb{E}\left[\left\|I_{2}\right\|^{2}\right]\leq\frac{c_{2}dB^{2}\sigma^{2}_{2}[F_{\theta}]}{n\gamma}+\frac{(c_{3}d+c_{4}d\log^{4}n)M^{2}B^{2}}{n^{2}\gamma^{2}}. (3.4)

Combining (3.3) and (3.4), we finally obtain

𝔼[g^(θ;S)]g(θ)\displaystyle\left\|\mathbb{E}[\hat{g}(\theta;S)]-g(\theta)\right\| =2𝔼[I1I2]2𝔼[I1]+2𝔼[I2]Bn,n0,\displaystyle=2\left\|\mathbb{E}[I_{1}-I_{2}]\right\|\leq 2\left\|\mathbb{E}[I_{1}]\right\|+2\left\|\mathbb{E}[I_{2}]\right\|\leq B_{n,n_{0}},
𝔼[g^(θ;S)g(θ)2]\displaystyle\mathbb{E}\left[\left\|\hat{g}(\theta;S)-g(\theta)\right\|^{2}\right] =4𝔼[I1I22]8𝔼[I12]+8𝔼[I22]Vn,n0,\displaystyle=4\mathbb{E}[\left\|I_{1}-I_{2}\right\|^{2}]\leq 8\mathbb{E}[\left\|I_{1}\right\|^{2}]+8\mathbb{E}[\left\|I_{2}\right\|^{2}]\leq V_{n,n_{0}},

where Bn,n0,Vn,n0B_{n,n_{0}},V_{n,n_{0}} are given in (3.1). ∎

To refine the proof of the Theorem 1, we first introduce a pivotal tool in this paper, concentration inequalities, to analyze the error of the MCMC estimators in details. The essence of concentration inequalities lies in their capacity to quantify how a random variable can concentrate around its expected value. A Bernstein inequality for general Markov chains is proposed by Jiang and Fan [19] and beneficial to our analysis. The following lemma is a direct corollary of Theorem 2 in [19].

Lemma 1.

Let {Xi}i=1n\{X_{i}\}_{i=1}^{n} be a Markov chain with stationary distribution π\pi and absolute spectral gap γ\gamma. Suppose the initial distribution ν\nu is absolute continuous with respect to the stationary distribution π\pi and its derivative dνdπL2(π)\tfrac{d\nu}{d\pi}\in L^{2}(\pi). Consider a bounded function h:𝒳[c,c]h:\mathcal{X}\rightarrow[-c,c] with 𝔼π[h]=0\mathbb{E}_{\pi}[h]=0 and variance σ22[h]\sigma_{2}^{2}[h]. Then, when ν=π\nu=\pi , that is, {Xi}i=1n\{X_{i}\}_{i=1}^{n} is stationary, it holds that

π(1ni=1nh(Xi)s)exp(γns24σ2+5cs),s0.\mathbb{P}_{\pi}\left(\frac{1}{n}\sum_{i=1}^{n}h\left(X_{i}\right)\geq s\right)\leq\exp\left(-\frac{\gamma ns^{2}}{4\sigma^{2}+5cs}\right),\qquad\forall s\geq 0. (3.5)

The Bernstein inequality (3.5) shows how the average of MCMC samples concentrates at the expectation. However, the Markov chain is not always non-stationary. We define χ2(ν,π):=dνdπ1π2{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}^{2}(\nu,\pi):=\left\|\tfrac{d\nu}{d\pi}-1\right\|^{2}_{\pi} represents the chi-squared divergence between ν\nu and π\pi, by which the Bernstein inequality can be extended into a non-stationary one. We abbreviate χ=χ(ν,π){\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}={\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu,\pi) and C=(1+χ2)12C=(1+{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}^{2})^{\frac{1}{2}}. Then, we derive the tail bound of the MCMC estimator for sub-exponential functions in the following lemma.

Lemma 2.

Let {Xi}i=1n\{X_{i}\}_{i=1}^{n} be a non-stationary Markov chain with stationary distribution π\pi and absolute spectral gap γ\gamma. Suppose the initial distribution ν\nu is absolute continuous with respect to the stationary distribution π\pi and its derivative dνdπL2(π)\tfrac{d\nu}{d\pi}\in L^{2}(\pi). We consider a function hL2(π)h\in L^{2}(\pi) satisfying hψ1(π,𝒳)M\left\|h\right\|_{\psi_{1}(\pi,\mathcal{X})}\leq M. If s10M(logn)2ns\geq\frac{10M(\log n)^{2}}{n}, the following tail bound holds,

ν(|1ni=1nh(Xi)𝔼π[h]|s)2Cexp(γns264σ22[h])+2Cexp(γns160M).\mathbb{P}_{\nu}\left(\left|\frac{1}{n}\sum_{i=1}^{n}h\left(X_{i}\right)-\mathbb{E}_{\pi}[h]\right|\geq s\right)\leq 2C\exp\left(-\frac{\gamma ns^{2}}{64\sigma^{2}_{2}[h]}\right)+2C\exp\left(-\sqrt{\frac{\gamma ns}{160M}}\right). (3.6)

In other words, for δ>0\delta>0, with probability at least 1δ1-\delta, it holds that

|1ni=1nh(Xi)𝔼π[h]|8σ2[h]log(4C/δ)nγ+160M[log(4Cn/δ)]2nγ.\left|\frac{1}{n}\sum_{i=1}^{n}h(X_{i})-\mathbb{E}_{\pi}[h]\right|\leq 8\sigma_{2}[h]\sqrt{\frac{\log(4C/\delta)}{n\gamma}}+160M\frac{[\log(4Cn/\delta)]^{2}}{n\gamma}. (3.7)
Proof.

Without loss of generality, we assume that 𝔼π[h]=0\mathbb{E}_{\pi}[h]=0 in the following proof. To deal with a possibly unbounded hh, we firstly fix a M>0M^{\prime}>0 and consider the truncation function h¯=max{min{h,M},M}\bar{h}=\max\{\min\{h,M^{\prime}\},-M^{\prime}\} and h^=hh¯\hat{h}=h-\bar{h}. When hh is a sub-exponential random variable, its tail decays at least as fast as an exponential function. As hψ1(π,𝒳)M\left\|h\right\|_{\psi_{1}(\pi,\mathcal{X})}\leq M, the Markov’s inequality implies

π(|h|>s)exp(sM)𝔼π[exp(|h|M)]2exp(sM).\mathbb{P}_{\pi}(\left|h\right|>s)\leq\exp\left(-\frac{s}{M}\right)\mathbb{E}_{\pi}\left[\exp\left(\frac{|h|}{M}\right)\right]\leq 2\exp\left(-\frac{s}{M}\right). (3.8)

Notice that for any event Aσ(X1,,Xn)A\in\sigma(X_{1},\cdots,X_{n}), it holds from the Cauchy-Schwarz inequality

ν(A)=\displaystyle\mathbb{P}_{\nu}(A)= 𝒳(A|X1=x)ν(dx)=A(A|X1=x)dνdππ(dx)\displaystyle~{}\int_{\mathcal{X}}\mathbb{P}(A|X_{1}=x)\nu(dx)=\int_{A}\mathbb{P}(A|X_{1}=x)\frac{d\nu}{d\pi}\pi(dx) (3.9)
\displaystyle\leq 𝒳(dνdπ)2(A|X1=x)π(dx)𝒳(A|X1=x)π(dx)\displaystyle~{}\sqrt{\int_{\mathcal{X}}\left(\frac{d\nu}{d\pi}\right)^{2}\mathbb{P}(A|X_{1}=x)\pi(dx)\int_{\mathcal{X}}\mathbb{P}(A|X_{1}=x)\pi(dx)}
\displaystyle\leq Cπ(A).\displaystyle~{}C\sqrt{\mathbb{P}_{\pi}(A)}.

Hence, we only consider the case ν=π\nu=\pi, i.e., the case {Xi}i=1n\{X_{i}\}_{i=1}^{n} is stationary. We first fix a large M>0M^{\prime}>0. For any 1in1\leq i\leq n, we have

(|h(Xi)|>M)2exp(MM).\displaystyle\mathbb{P}\left(\left|h(X_{i})\right|>M^{\prime}\right)\leq 2\exp\left(-\frac{M^{\prime}}{M}\right). (3.10)

It follows from the inclusion of events that

(|1ni=1nh(Xi)|s)\displaystyle\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}h\left(X_{i}\right)\right|\geq s\right)\leq (|1ni=1nh¯(Xi)|s)+(1in,|h(Xi)|>M)\displaystyle~{}\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\bar{h}\left(X_{i}\right)\right|\geq s\right)+\mathbb{P}\left(\exists~{}1\leq i\leq n,~{}\left|h(X_{i})\right|>M^{\prime}\right)
\displaystyle\leq (|1ni=1nh¯(Xi)|s)++2nexp(MM).\displaystyle~{}\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\bar{h}\left(X_{i}\right)\right|\geq s\right)++2n\exp\left(-\frac{M^{\prime}}{M}\right).

Notice that |𝔼π[h¯]|=|𝔼π[h^]|2Mexp(MM)\left|\mathbb{E}_{\pi}[\bar{h}]\right|=\left|\mathbb{E}_{\pi}[\hat{h}]\right|\leq 2M\exp\left(-\frac{M^{\prime}}{M}\right). Therefore, when s2|𝔼π[h^]|s\geq 2\left|\mathbb{E}_{\pi}[\hat{h}]\right|, it holds that

(|1ni=1nh¯(Xi)|s)\displaystyle\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\bar{h}\left(X_{i}\right)\right|\geq s\right)
(|1ni=1nh¯(Xi)𝔼π[h¯]|s|𝔼π[h¯]|)\displaystyle\leq~{}\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\bar{h}\left(X_{i}\right)-\mathbb{E}_{\pi}[\bar{h}]\right|\geq s-\left|\mathbb{E}_{\pi}[\bar{h}]\right|\right)
(|1ni=1nh¯(Xi)𝔼π[h¯]|s2)\displaystyle\leq~{}\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\bar{h}\left(X_{i}\right)-\mathbb{E}_{\pi}[\bar{h}]\right|\geq\frac{s}{2}\right)
2exp(γns216σ22[h¯]+10Ms)2exp(γns216σ22[h]+10Ms),\displaystyle\leq~{}2\exp\left(-\frac{\gamma ns^{2}}{16\sigma^{2}_{2}[\bar{h}]+10M^{\prime}s}\right)\leq~{}2\exp\left(-\frac{\gamma ns^{2}}{16\sigma^{2}_{2}[h]+10M^{\prime}s}\right),

where the third inequality is due to (3.5), and the last inequality uses σ22[h¯]σ22[h]\sigma^{2}_{2}[\bar{h}]\leq\sigma^{2}_{2}[h]. Finally, for any fixed s10M(logn)2ns\geq\frac{10M(\log n)^{2}}{n}, we take M=γnMs/10M^{\prime}=\sqrt{\gamma nMs/10} and then obtain

(|1ni=1nh(Xi)|s)\displaystyle\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}h\left(X_{i}\right)\right|\geq s\right)
2exp(γns216σ22[h]+10Ms)+2nexp(MM)\displaystyle\leq~{}2\exp\left(-\frac{\gamma ns^{2}}{16\sigma^{2}_{2}[h]+10M^{\prime}s}\right)+2n\exp\left(-\frac{M^{\prime}}{M}\right)
2exp(γns22max{16σ22[h],10sγnMs/10})+2nexp(γns10M)\displaystyle\leq~{}2\exp\left(-\frac{\gamma ns^{2}}{2\max\{16\sigma^{2}_{2}[h],10s\sqrt{\gamma nMs/10}\}}\right)+2n\exp\left(-\sqrt{\frac{\gamma ns}{10M}}\right)
2exp(γns232σ22[h])+2exp(γns40M)+2nexp(γns10M)\displaystyle\leq~{}2\exp\left(-\frac{\gamma ns^{2}}{32\sigma^{2}_{2}[h]}\right)+2\exp\left(-\sqrt{\frac{\gamma ns}{40M}}\right)+2n\exp\left(-\sqrt{\frac{\gamma ns}{10M}}\right)
2exp(γns232σ22[h])+4exp(γns40M),\displaystyle\leq~{}2\exp\left(-\frac{\gamma ns^{2}}{32\sigma^{2}_{2}[h]}\right)+4\exp\left(-\sqrt{\frac{\gamma ns}{40M}}\right),

where the last inequality holds from nexp(γns10M)exp(γns40M)n\exp\left(-\sqrt{\frac{\gamma ns}{10M}}\right)\leq\exp\left(-\sqrt{\frac{\gamma ns}{40M}}\right) as long as s10M(logn)2ns\geq\frac{10M(\log n)^{2}}{n}. Using the property (3.9), it follows that the tail bound (3.6) holds. This completes the proof. ∎

In the above lemma, we establish the tail bound of the MCMC estimator for unbounded functions. To the best of our knowledge, this result has not appeared in the literatures of Bernstein inequality for Markov chains. The novelty of our method is manifested in the sub-exponential assumption tailored for unbounded functions and the utilization of spectral gap analysis within concentration inequalities. Consequently, a comprehensive and intuitive error bound for the MCMC estimator is provided as follows.

Lemma 3.

Suppose that the condition of Lemma 2 holds. Then, error bounds for the MCMC estimator are given as follows.

(1) The bias satisfies that

|𝔼ν[1ni=1nh(Xi)]𝔼π[h]|c1nγ,\left|\mathbb{E}_{\nu}\left[\frac{1}{n}\sum_{i=1}^{n}h(X_{i})\right]-\mathbb{E}_{\pi}[h]\right|\leq\frac{c_{1}}{n\gamma}, (3.11)

where c1=σ2[h]min{1,χ}+4M[logχ]+2+4M[logχ]+c_{1}=\sigma_{2}[h]\min\left\{1,{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\right\}+4M\left[\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\right]_{+}^{2}+4M\left[\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\right]_{+} and χ=χ(ν,π){\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}={\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu,\pi). In particular, when χ1{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\leq 1, c1=σ2[h]χc_{1}=\sigma_{2}[h]{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}.

(2) The variance satisfies that

𝔼ν[|1ni=1nh(Xi)𝔼π[h]|2]c2nγσ22[h]+c3+c4log4nn2γ2M2,\mathbb{E}_{\nu}\left[\left|\frac{1}{n}\sum_{i=1}^{n}h(X_{i})-\mathbb{E}_{\pi}[h]\right|^{2}\right]\leq\frac{c_{2}}{n\gamma}\sigma_{2}^{2}[h]+\frac{c_{3}+c_{4}\log^{4}n}{n^{2}\gamma^{2}}M^{2}, (3.12)

where c2=64(1+log2C)c_{2}=64(1+\log 2C), c3=100(16+4log2C)4c_{3}=100(16+4\log 2C)^{4}, c4=200c_{4}=200, and C=(1+χ2)12C=(1+{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}^{2})^{\frac{1}{2}}.

Proof.

We follow the notations in the proof of Lemma 2.

Bias bound.

Clearly, we only need to bound 1ni=1n|𝔼νi[h]𝔼π[h]|\frac{1}{n}\sum_{i=1}^{n}\left|\mathbb{E}_{\nu_{i}}[h]-\mathbb{E}_{\pi}[h]\right|, where νi=νPi1\nu_{i}=\nu\operatorname{P}^{i-1} is the distribution of XiX_{i}. For the distribution π\pi^{\prime} whose derivative dπdπL2(π)\frac{d\pi^{\prime}}{d\pi}\in L^{2}(\pi), it holds from the Cauchy-Schwarz inequality that

|𝔼π[h]𝔼π[h]|\displaystyle\left|\mathbb{E}_{\pi^{\prime}}[h]-\mathbb{E}_{\pi}[h]\right| =|𝔼π[(dπdπ1)(h𝔼π[h])]|\displaystyle=\left|\mathbb{E}_{\pi}\left[\left(\frac{d\pi^{\prime}}{d\pi}-1\right)(h-\mathbb{E}_{\pi}[h])\right]\right| (3.13)
[𝔼π(dπdπ1)2]1/2σ2[h]=χ(π,π)σ2[h].\displaystyle\leq\left[\mathbb{E}_{\pi}\left(\frac{d\pi^{\prime}}{d\pi}-1\right)^{2}\right]^{1/2}\sigma_{2}[h]={\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\pi^{\prime},\pi)\sigma_{2}[h].

Besides, noticing that (𝔼π|h^|2)122Mexp(M2M)\left(\mathbb{E}_{\pi}\left|\hat{h}\right|^{2}\right)^{\frac{1}{2}}\leq 2M\exp\left(-\frac{M^{\prime}}{2M}\right), we have

|𝔼π[h^]𝔼π[h^]|χ(π,π)(𝔼π|h^|2)122Mχ(π,π)exp(M2M).\displaystyle\left|\mathbb{E}_{\pi^{\prime}}[\hat{h}]-\mathbb{E}_{\pi}[\hat{h}]\right|\leq{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\pi^{\prime},\pi)\left(\mathbb{E}_{\pi}\left|\hat{h}\right|^{2}\right)^{\frac{1}{2}}\leq 2M{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\pi^{\prime},\pi)\exp\left(-\frac{M^{\prime}}{2M}\right).

Then (3.13) implies that

|𝔼νi[h]𝔼π[h]|\displaystyle\left|\mathbb{E}_{\nu_{i}}[h]-\mathbb{E}_{\pi}[h]\right|\leq |𝔼νi[h¯]𝔼π[h¯]|+|𝔼νi[h^]𝔼π[h^]|\displaystyle~{}\left|\mathbb{E}_{\nu_{i}}[\bar{h}]-\mathbb{E}_{\pi}[\bar{h}]\right|+\left|\mathbb{E}_{\nu_{i}}[\hat{h}]-\mathbb{E}_{\pi}[\hat{h}]\right| (3.14)
\displaystyle\leq 2M+2Mχ(νi,π)exp(M2M),\displaystyle~{}2M^{\prime}+2M{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\exp\left(-\frac{M^{\prime}}{2M}\right),

and as long as χ(νi,π)1{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\geq 1, we can choose M=2Mlogχ(νi,π)M^{\prime}=2M\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi) in (3.14) to derive |𝔼νi[h]𝔼π[h]|4M(1+logχ(νi,π))\left|\mathbb{E}_{\nu_{i}}[h]-\mathbb{E}_{\pi}[h]\right|\leq 4M(1+\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)). Combining (3.13) and (3.14) yields that

|𝔼νi[h]𝔼π[h]|{χ(νi,π)σ2[h],always,4M(1+logχ(νi,π)),if χ(νi,π)1.\displaystyle\left|\mathbb{E}_{\nu_{i}}[h]-\mathbb{E}_{\pi}[h]\right|\leq\begin{cases}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\sigma_{2}[h],&\text{always},\\ 4M\left(1+\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\right),&\text{if }{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\geq 1.\end{cases} (3.15)

Now, by the definition of γ\gamma, it holds that χ(νi,π)(1γ)i1χ(ν1,π)=(1γ)i1χ{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\leq(1-\gamma)^{i-1}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{1},\pi)=(1-\gamma)^{i-1}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}. Let us consider the smallest k1k\geq 1 such that χ(νk,π)1{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{k},\pi)\leq 1. Then clearly k1+[logχ]+γk\leq 1+\left\lceil\frac{[\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}]_{+}}{\gamma}\right\rceil, and for iki\geq k, χ(νi,π)(1γ)i1min{1,χ}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\leq(1-\gamma)^{i-1}\min\{1,{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\}. Hence,

i=1n|𝔼νi[h]𝔼π[h]|\displaystyle\sum_{i=1}^{n}\left|\mathbb{E}_{\nu_{i}}[h]-\mathbb{E}_{\pi}[h]\right|\leq i=1k1|𝔼νi[h]𝔼π[h]|+i=kn|𝔼νi[h]𝔼π[h]|\displaystyle\sum_{i=1}^{k-1}\left|\mathbb{E}_{\nu_{i}}[h]-\mathbb{E}_{\pi}[h]\right|+\sum_{i=k}^{n}\left|\mathbb{E}_{\nu_{i}}[h]-\mathbb{E}_{\pi}[h]\right|
\displaystyle\leq i=1k14M(1+logχ(νi,π))+i=knχ(νi,π)σ2[h]\displaystyle\sum_{i=1}^{k-1}4M\left(1+\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\right)+\sum_{i=k}^{n}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu_{i},\pi)\sigma_{2}[h]
\displaystyle\leq 4M(k1)(1+logχ)+σ2[h]min{1,χ}1γ,\displaystyle 4M(k-1)\left(1+\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\right)+\sigma_{2}[h]\min\{1,{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\}\cdot\frac{1}{\gamma},

where the first inequality is the triangle inequality and the second uses (3.15). Therefore, it holds that

1ni=1n|𝔼νi[h]𝔼π[h]|1nγ(σ2[h]min{1,χ}+4M[logχ]+2+4M[logχ]+).\frac{1}{n}\sum_{i=1}^{n}\left|\mathbb{E}_{\nu_{i}}[h]-\mathbb{E}_{\pi}[h]\right|\leq\frac{1}{n\gamma}\left(\sigma_{2}[h]\min\left\{1,{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\right\}+4M\left[\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\right]_{+}^{2}+4M\left[\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\right]_{+}\right).
Variance bound.

In the following proof, we provide an upper bound on all higher-order moments of Δ\Delta simultaneously. Denote A=log(2C),R1=8σ22[h]γn,R2=160Mγn,s0=10M(logn)2nA=\log(2C),R_{1}=8\sqrt{\frac{\sigma_{2}^{2}[h]}{\gamma n}},R_{2}=\frac{160M}{\gamma n},s_{0}=\frac{10M(\log n)^{2}}{n}. Then, we only need to bound 𝔼[|Δ|m]\mathbb{E}[\left|\Delta\right|^{m}] for even mm under the condition

(|Δ|s)exp(As2/R12)+exp(As/R2),ss0.\displaystyle\mathbb{P}\left(\left|\Delta\right|\geq s\right)\leq\exp(A-s^{2}/R_{1}^{2})+\exp(A-\sqrt{s/R_{2}}),\qquad\forall s\geq s_{0}.

Notice that

1m𝔼[|Δ|m]=\displaystyle\frac{1}{m}\mathbb{E}[\left|\Delta\right|^{m}]= 0sm1(|Δ|s)𝑑s\displaystyle~{}\int_{0}^{\infty}s^{m-1}\mathbb{P}(\left|\Delta\right|\geq s)ds
\displaystyle\leq 0sm1min{1,exp(As2/R12)+exp(As/R2)}𝑑s\displaystyle~{}\int_{0}^{\infty}s^{m-1}\min\{1,\exp(A-s^{2}/R_{1}^{2})+\exp(A-\sqrt{s/R_{2}})\}ds
\displaystyle\leq I1+I2.\displaystyle~{}I_{1}+I_{2}.

where I1=0sm1min{1,exp(As2/R12)}𝑑sI_{1}=\int_{0}^{\infty}s^{m-1}\min\{1,\exp(A-s^{2}/R_{1}^{2})\}ds and I2=0sm1min{1,exp(As/R2)}𝑑sI_{2}=\int_{0}^{\infty}s^{m-1}\min\{1,\exp(A-\sqrt{s/R_{2}})\}ds. Thus, we further denote s1=max{s0,R1A},s2=max{s0,R2A2}s_{1}=\max\{s_{0},R_{1}\sqrt{A}\},s_{2}=\max\{s_{0},R_{2}A^{2}\}. Then

I1=\displaystyle I_{1}= 0s1sm1𝑑s+s1sm1exp(As2/R12)𝑑s\displaystyle~{}\int_{0}^{s_{1}}s^{m-1}ds+\int_{s_{1}}^{\infty}s^{m-1}\exp(A-s^{2}/R_{1}^{2})ds
=\displaystyle= s1mm+R1m2s1(sR1)m2exp(As2/R12)d(s2R12)\displaystyle~{}\frac{s_{1}^{m}}{m}+\frac{R_{1}^{m}}{2}\int_{s_{1}}^{\infty}\left(\frac{s}{R_{1}}\right)^{m-2}\exp(A-s^{2}/R_{1}^{2})d\left(\frac{s^{2}}{R_{1}^{2}}\right)
=\displaystyle= s1mm+R1m2s12/R12t(m2)/2exp(At)𝑑t\displaystyle~{}\frac{s_{1}^{m}}{m}+\frac{R_{1}^{m}}{2}\int_{s_{1}^{2}/R_{1}^{2}}^{\infty}t^{(m-2)/2}\exp(A-t)dt
\displaystyle\leq s1mm+R1mm((A+m/2)m/2Am/2),\displaystyle~{}\frac{s_{1}^{m}}{m}+\frac{R_{1}^{m}}{m}\left((A+m/2)^{m/2}-A^{m/2}\right),

where the last inequality is due to the fact Atk1exp(At)𝑑t(A+k1)k1((A+k)kAk)/k\int_{A}^{\infty}t^{k-1}\exp(A-t)dt\leq(A+k-1)^{k-1}\leq((A+k)^{k}-A^{k})/k for any integer k1k\geq 1. Similarly,

I2=\displaystyle I_{2}= 0sm1min{1,exp(As/R2)}𝑑s\displaystyle~{}\int_{0}^{\infty}s^{m-1}\min\{1,\exp(A-\sqrt{s/R_{2}})\}ds
=\displaystyle= 0s2sm1𝑑s+s2sm1exp(As/R2)𝑑s\displaystyle~{}\int_{0}^{s_{2}}s^{m-1}ds+\int_{s_{2}}^{\infty}s^{m-1}\exp(A-\sqrt{s/R_{2}})ds
=\displaystyle= s2mm+s2/R22R2mt2m1exp(At)𝑑t\displaystyle~{}\frac{s_{2}^{m}}{m}+\int_{\sqrt{s_{2}/R_{2}}}^{\infty}2R_{2}^{m}t^{2m-1}\exp(A-t)dt
\displaystyle\leq s2mm+R2mm((A+2m)2mA2m).\displaystyle~{}\frac{s_{2}^{m}}{m}+\frac{R_{2}^{m}}{m}\left((A+2m)^{2m}-A^{2m}\right).

Combining these two cases, we obtain

𝔼[|Δ|m]\displaystyle\mathbb{E}[\left|\Delta\right|^{m}]\leq mI1+mI2\displaystyle~{}mI_{1}+mI_{2}
\displaystyle\leq s0m+s1m+R1m((A+m/2)m/2Am/2)+R2m((A+2m)2mA2m)\displaystyle~{}s_{0}^{m}+s_{1}^{m}+R_{1}^{m}\left((A+m/2)^{m/2}-A^{m/2}\right)+R_{2}^{m}\left((A+2m)^{2m}-A^{2m}\right)
\displaystyle\leq 2s0m+R1m(A+m/2)m/2+R2m(A+2m)2m.\displaystyle~{}2s_{0}^{m}+R_{1}^{m}(A+m/2)^{m/2}+R_{2}^{m}(A+2m)^{2m}.

This completes the proof. ∎

3.3 First order convergence

In this subsection, we will first analyze the decrease for a single iteration based on the MCMC error bounds. Then, our main result delineates the convergence of the expected norm of the gradient throughout the iterations of SGD, when cooperating with the MCMC estimator. This convergence is established through a rigorous theoretical framework, underscoring the efficacy of the SGD algorithm in the presence of sampling randomness.

Under our assumptions in subsection 3.1, we can affirm the LL-smoothness of the objective function (θ)\mathcal{L}(\theta) to facilitate a conventional optimization analysis. This is a common condition for functions to undergo expected descent. We prove the following lemma by giving the bound of the Hessian.

Lemma 4.

Let Assumptions 1, 2 and 3 hold. There exists a constant L>0L>0 such that (θ)\mathcal{L}(\theta) is LL-smooth, that is

g(θ1)g(θ2)Lθ1θ2,θ1,θ2Θ.\|g(\theta_{1})-g(\theta_{2})\|\leq L\|\theta_{1}-\theta_{2}\|,~{}\forall\theta_{1},\theta_{2}\in\Theta. (3.16)
Proof.

For any θΘ\theta\in\Theta, fθψ1(πθ;𝒳)M\left\|f_{\theta}\right\|_{\psi_{1}(\pi_{\theta};\mathcal{X})}\leq M implies that

𝔼πθ[|fθ|2]1/2𝔼πθ[|fθ|]M.\mathbb{E}_{\pi_{\theta}}\left[|f_{\theta}|^{2}\right]^{1/2}\leq\mathbb{E}_{\pi_{\theta}}\left[|f_{\theta}|\right]\leq M.

The Hessian H(θ):=θ2(θ)H(\theta):=\nabla_{\theta}^{2}\mathcal{L}(\theta) can be derived by

H(θ)=\displaystyle H(\theta)= 𝔼πθ[θlogπθθfθT]+𝔼πθ[fθθ2logπθ]+𝔼πθ[fθθlogπθθlogπθT],\displaystyle\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}\nabla_{\theta}f_{\theta}^{\mathrm{T}}\right]+\mathbb{E}_{\pi_{\theta}}\left[f_{\theta}\nabla_{\theta}^{2}\log\pi_{\theta}\right]+\mathbb{E}_{\pi_{\theta}}\left[f_{\theta}\nabla_{\theta}\log\pi_{\theta}\nabla_{\theta}\log\pi_{\theta}^{\mathrm{T}}\right], (3.17)

where θlogπθ=θϕθ𝔼πθ[θϕθ]\nabla_{\theta}\log\pi_{\theta}=\nabla_{\theta}\phi_{\theta}-\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\phi_{\theta}\right] and θ2logπθ=θ2ϕθ𝔼πθ[θ2ϕθ]𝔼πθ[θϕθθϕθT]+𝔼πθ[θϕθ]𝔼πθ[θϕθT]\nabla_{\theta}^{2}\log\pi_{\theta}=\nabla_{\theta}^{2}\phi_{\theta}-\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}^{2}\phi_{\theta}\right]-\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\phi_{\theta}\nabla_{\theta}\phi_{\theta}^{\mathrm{T}}\right]+\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\phi_{\theta}\right]\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\phi_{\theta}^{\mathrm{T}}\right]. Then, we have

H(θ)\displaystyle\left\|H(\theta)\right\|\leq 𝔼πθ[θfθθϕθT]+2𝔼πθ[|fθ|2]1/2𝔼πθ[θ2ϕθ2]1/2+6MB2\displaystyle\mathbb{E}_{\pi_{\theta}}\left[\left\|\nabla_{\theta}f_{\theta}\nabla_{\theta}\phi_{\theta}^{\mathrm{T}}\right\|\right]+2\mathbb{E}_{\pi_{\theta}}\left[|f_{\theta}|^{2}\right]^{1/2}\mathbb{E}_{\pi_{\theta}}\left[\left\|\nabla^{2}_{\theta}\phi_{\theta}\right\|^{2}\right]^{1/2}+6MB^{2}
\displaystyle\leq BL1+2ML2+6MB2,\displaystyle BL_{1}+2ML_{2}+6MB^{2},

where the last inequality is due to Assumptions 2 and 3. Let L=BL1+2ML2+6MB2L=BL_{1}+2ML_{2}+6MB^{2}, then (θ)\mathcal{L}(\theta) is LL-smooth. ∎

According to the book [47], the LL-smoothness (3.16) is also equivalent to

(θ2)(θ1)+g(θ1),θ2θ1+L2θ1θ22.\mathcal{L}(\theta_{2})\leq\mathcal{L}(\theta_{1})+\langle g(\theta_{1}),\theta_{2}-\theta_{1}\rangle+\frac{L}{2}\|\theta_{1}-\theta_{2}\|^{2}. (3.18)

It is already known that the stochastic gradient g^\hat{g} is biased, unlike the unbiased estimator in the classical SGD. However, the bias decays with increasing burn-in time n0n_{0} or the sample size nn, which can be regarded sufficiently small. We discuss the expected descent of the function value within each iteration in the following lemma.

Lemma 5.

If the stepsize αk12L\alpha_{k}\leq\frac{1}{2L}, for any given θk\theta_{k} the objective function (θk+1)\mathcal{L}(\theta_{k+1}) decreases in expectation as

(θk)𝔼[(θk+1)|θk]αk2(θk)2αk2Bn,n02αk2L2Vn,n0.\mathcal{L}(\theta_{k})-\mathbb{E}\left[\mathcal{L}(\theta_{k+1})|\theta_{k}\right]\geq\frac{\alpha_{k}}{2}\|\nabla\mathcal{L}(\theta_{k})\|^{2}-\frac{\alpha_{k}}{2}\cdot B_{n,n_{0}}^{2}-\frac{\alpha_{k}^{2}L}{2}\cdot V_{n,n_{0}}. (3.19)

where Bn,n0B_{n,n_{0}} and Vn,n0V_{n,n_{0}} are defined in Theorem 1.

Proof.

For convenience, we simplify the notations with k:=(θk)\mathcal{L}_{k}:=\mathcal{L}(\theta_{k}), gk:=g(θk)g_{k}:=g(\theta_{k}) and g^k:=g^(θk,Sk)\hat{g}_{k}:=\hat{g}(\theta_{k},S_{k}) for any k1k\geq 1. By the Lemma 4 and the property (3.18), we perform a gradient descent analysis of the iteration (2.12),

k+1\displaystyle\mathcal{L}_{k+1}\leq k+gk,θk+1θk+L2θk+1θk2\displaystyle\mathcal{L}_{k}+\langle g_{k},\theta_{k+1}-\theta_{k}\rangle+\frac{L}{2}\|\theta_{k+1}-\theta_{k}\|^{2} (3.20)
=\displaystyle= kαkgk,g^kgkαkgk2+αk2L2g^k2\displaystyle\mathcal{L}_{k}-\alpha_{k}\langle g_{k},\hat{g}_{k}-g_{k}\rangle-\alpha_{k}\|g_{k}\|^{2}+\frac{\alpha_{k}^{2}L}{2}\|\hat{g}_{k}\|^{2}
=\displaystyle= k(αkαk2L)gk,g^kgk(αkαk2L2)gk2+αk2L2g^kgk2.\displaystyle\mathcal{L}_{k}-(\alpha_{k}-\alpha_{k}^{2}L)\langle g_{k},\hat{g}_{k}-g_{k}\rangle-\left(\alpha_{k}-\frac{\alpha_{k}^{2}L}{2}\right)\|g_{k}\|^{2}+\frac{\alpha_{k}^{2}L}{2}\|\hat{g}_{k}-g_{k}\|^{2}.

Taking the conditional expectation of (3.20) on θk\theta_{k} gives

𝔼[k+1|θk]\displaystyle\mathbb{E}\left[\left.\mathcal{L}_{k+1}\right|\theta_{k}\right]\leq k(αkαk2L2)gk2\displaystyle\mathcal{L}_{k}-\left(\alpha_{k}-\frac{\alpha_{k}^{2}L}{2}\right)\|g_{k}\|^{2} (3.21)
(αkαk2L)gk,𝔼[g^k|θk]gk+αk2L2𝔼[g^kgk2|θk].\displaystyle-(\alpha_{k}-\alpha_{k}^{2}L)\langle g_{k},\mathbb{E}\left[\left.\hat{g}_{k}\right|\theta_{k}\right]-g_{k}\rangle+\frac{\alpha_{k}^{2}L}{2}\mathbb{E}\left[\left.\left\|\hat{g}_{k}-g_{k}\right\|^{2}\right|\theta_{k}\right].

Through the fact |x,y|(x2+y2)/2|\langle x,y\rangle|\leq(\left\|x\right\|^{2}+\left\|y\right\|^{2})/2, it holds

(αkαk2L)gk,𝔼[g^k|θk]gk\displaystyle-(\alpha_{k}-\alpha_{k}^{2}L)\left\langle g_{k},\mathbb{E}\left[\left.\hat{g}_{k}\right|\theta_{k}\right]-g_{k}\right\rangle\leq αkαk2L2gk2+αk2𝔼[g^k|θk]gk2.\displaystyle\frac{\alpha_{k}-\alpha_{k}^{2}L}{2}\left\|g_{k}\right\|^{2}+\frac{\alpha_{k}}{2}\left\|\mathbb{E}\left[\left.\hat{g}_{k}\right|\theta_{k}\right]-g_{k}\right\|^{2}. (3.22)

Therefore, we can plug in (3.22) to (3.21) and rearrange to get

2(k𝔼[k+1|θk])αkgk2αk𝔼[g^k|θk]gk2Lαk2𝔼[g^kgk2|θk].2\left(\mathcal{L}_{k}-\mathbb{E}\left[\left.\mathcal{L}_{k+1}\right|\theta_{k}\right]\right)\geq\alpha_{k}\left\|g_{k}\right\|^{2}-\alpha_{k}\left\|\mathbb{E}\left[\left.\hat{g}_{k}\right|\theta_{k}\right]-g_{k}\right\|^{2}-L\alpha_{k}^{2}\mathbb{E}\left[\left.\left\|\hat{g}_{k}-g_{k}\right\|^{2}\right|\theta_{k}\right]. (3.23)

Thus, (3.19) holds when we substitute the error bounds in Theorem 1 into (3.23). ∎

As nn goes towards positive infinity, the objective function decreases as an exact gradient descent. However, nn and n0n_{0} are not too large because of the high sampling cost in practice. The error bounds studied in Section 3.2 are valuable for our analysis of the stochastic optimization. Since (θ)\mathcal{L}(\theta) is non-convex, we consider the convergence rate in terms of the expected norm of the gradient 𝔼θ(θ)2\mathbb{E}\left\|\nabla_{\theta}\mathcal{L}(\theta)\right\|^{2}. Finally, we establish our first-order convergence results as follows.

Theorem 2.

Let Assumptions 2, 2, 3 and 4 hold and {θk}\{\theta_{k}\} be generated by Algorithm 1. If the stepsize satisfies αk12L\alpha_{k}\leq\frac{1}{2L}, then for any KK, we have

min1kK𝔼g(θk)2O(1k=1Kαk)+O(k=1Kαk2nk=1Kαk)+O((1γ)2n0n2),\min_{1\leq k\leq K}\mathbb{E}\|g(\theta_{k})\|^{2}\leq O\left(\frac{1}{\sum_{k=1}^{K}\alpha_{k}}\right)+O\left(\frac{\sum_{k=1}^{K}\alpha_{k}^{2}}{n\sum_{k=1}^{K}\alpha_{k}}\right)+O\left(\frac{(1-\gamma)^{2n_{0}}}{n^{2}}\right), (3.24)

where O()O(\cdot) hides constants γ,M,B,C\gamma,M,B,C and retains n,n0n,n_{0} and KK. In particular, if the stepsize is chosen as αk=cnk\alpha_{k}=\frac{c\sqrt{n}}{\sqrt{k}} where c12Lc\leq\frac{1}{2L}, then we have

min1kK𝔼g(θk)2O(logKnK)+O((1γ)2n0n2).\min_{1\leq k\leq K}\mathbb{E}\|g(\theta_{k})\|^{2}\leq O\left(\frac{\log K}{\sqrt{nK}}\right)+O\left(\frac{(1-\gamma)^{2n_{0}}}{n^{2}}\right). (3.25)
Proof.

Lemma 5 suggests that, for any k1k\geq 1,

αkg(θk)2\displaystyle\alpha_{k}\left\|g(\theta_{k})\right\|^{2}\leq 2((θk)𝔼[(θk+1)|θk])+αkBn,n02+αk2LVn,n0.\displaystyle 2\left(\mathcal{L}(\theta_{k})-\mathbb{E}\left[\left.\mathcal{L}(\theta_{k+1})\right|\theta_{k}\right]\right)+\alpha_{k}\cdot B_{n,n_{0}}^{2}+\alpha_{k}^{2}\cdot LV_{n,n_{0}}.

Thus, taking total expectation and summing over k=1,,Kk=1,\cdots,K yields

(k=1Kαk)min1kK𝔼[g(θk)2]k=1Kαk𝔼[g(θk)2]\displaystyle\left(\sum_{k=1}^{K}\alpha_{k}\right)\min_{1\leq k\leq K}\mathbb{E}\left[\left\|g(\theta_{k})\right\|^{2}\right]\leq~{}\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|g(\theta_{k})\right\|^{2}\right]
2((θ1)𝔼[(θK+1)])+k=1Kαk2LVn,n0+k=1KαkBn,n02\displaystyle\leq~{}2\left(\mathcal{L}(\theta_{1})-\mathbb{E}\left[\mathcal{L}(\theta_{K+1})\right]\right)+\sum_{k=1}^{K}\alpha_{k}^{2}\cdot LV_{n,n_{0}}+\sum_{k=1}^{K}\alpha_{k}\cdot B_{n,n_{0}}^{2}
=O(1)+k=1Kαk2O(1n)+k=1KαkO((1γ)2n0n2).\displaystyle=~{}O(1)+\sum_{k=1}^{K}\alpha_{k}^{2}\cdot O\left(\frac{1}{n}\right)+\sum_{k=1}^{K}\alpha_{k}\cdot O\left(\frac{(1-\gamma)^{2n_{0}}}{n^{2}}\right).

Divide both sides by k=1Kαk\sum_{k=1}^{K}\alpha_{k}, then (3.24) holds. If we take αk=cnk\alpha_{k}=\frac{c\sqrt{n}}{\sqrt{k}}, (3.25) is implied by

k=1Kαk=k=1Kcnk=O(nK),k=1Kαk2=k=1Kc2nk=O(nlogK).\sum_{k=1}^{K}\alpha_{k}=\sum_{k=1}^{K}\frac{c\sqrt{n}}{\sqrt{k}}=O(\sqrt{nK}),\quad\sum_{k=1}^{K}\alpha_{k}^{2}=\sum_{k=1}^{K}\frac{c^{2}n}{k}=O(n\log K).

This completes the proof. ∎

Theorem 2 shows the convergence rate of the MCMC-SGD with certain choices of stepsizes. The convergence rate is related to the bias and variance produced by the MCMC estimators. A trade-off involves balancing the convergence rate of the algorithm with the sample size. As the bias term depends on n2n^{2}, we can select an appropriate sample size nn to ensure that the bias term has few impact on convergence. Also, in practical applications, we often choose a long burn-in period at the start of the algorithm to reduce the bias in sampling.

4 Escaping from saddle points

In this section, we discuss how MCMC-SGD escapes from saddle points and converges to second-order stationarity. Daneshmand et al. first propose the CNC assumption, by which the simple SGD without explicit improvement is proven to have second-order convergence [8]. Inspired by the ideas of this study, we proceed to analyze the escape from saddle points under biased gradient estimators. To begin with, a second moment lower bound for non-stationary Markov chains is developed to guarantee the sufficient noise of the biased gradient. Then, we verify that the CNC condition under errors is satisfied when our assumptions hold. Eventually, we demonstrate our convergence analysis of Algorithm 1 to reach approximate second-order stationary points with high probability.

To deal with the second-order structure, the following assumption is needed.

Assumption 5.

(1) The Hessian matrix H(θ)H(\theta) defined in (3.17) is ρ\rho-Lipschitz continuous, i.e.,

H(θ1)H(θ2)ρθ1θ2,θ1,θ2Θ.\left\|H(\theta_{1})-H(\theta_{2})\right\|\leq\rho\left\|\theta_{1}-\theta_{2}\right\|,~{}\forall\theta_{1},\theta_{2}\in\Theta.

(2) Let vv be the unit eigenvector with respect to the minimum eigenvalue of the Hessian matrix H(θ)H(\theta). There exists a constant η>0\eta>0 such that

𝔼πθ[(fθ(x)vT(θϕθ(x)𝔼πθ[θϕθ(x)]))2]ησ22[fθ],θΘ.\mathbb{E}_{\pi_{\theta}}\left[\left(f_{\theta}(x)v^{T}\left(\nabla_{\theta}\phi_{\theta}(x)-\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\phi_{\theta}(x)\right]\right)\right)^{2}\right]\geq\eta\sigma_{2}^{2}[f_{\theta}],~{}\forall\theta\in\Theta.

(3) For any θΘ\theta\in\Theta, the ratio of the sub-exponential norm of fθf_{\theta} to its variance has an upper bound κ>0\kappa>0, that is,

supθΘ(fθψ1(πθ,𝒳)/σ2[fθ])κ.\sup_{\theta\in\Theta}\left(\left\|f_{\theta}\right\|_{\psi_{1}(\pi_{\theta},\mathcal{X})}/\sigma_{2}[f_{\theta}]\right)\leq\kappa.

The first assumption is common to perform a second-order convergence analysis. The second assumption refers to the Correlated negative curvature (CNC) condition introduced by Daneshmand et al. [8], which assumes that the second moment of the projection of stochastic gradients along the negative curvature direction is uniformly bounded away from zero. By this way, the simple SGD without explicit improvement is proven to have second-order convergence. However, the CNC condition is difficult to satisfy when the stochastic gradient is almost surely zero. We modify their CNC condition into the more precise one by introducing the variance of the function σ22(fθ)\sigma_{2}^{2}(f_{\theta}). The third one guarantees that the fθf_{\theta} will have a respectively light-tailed distribution. It holds for most of distributions and especially near the ground state of quantum many-body problems.

Besides, over Assumptions 2 and 3, we can assume an upper bound of the stochastic gradient to simply our analysis:

lg:=supθΘsupx𝒳(fθ(x)𝔼xπθ[fθ(x)])θϕθ(x)<+.\displaystyle l_{g}:=\sup_{\theta\in\Theta}\sup_{x\in\mathcal{X}}\left\|\left(f_{\theta}(x)-\mathbb{E}_{x\sim\pi_{\theta}}\left[f_{\theta}(x)\right]\right)\nabla_{\theta}\phi_{\theta}(x)\right\|<+\infty. (4.1)

This treatment is legal because fθf_{\theta} is assumed to sub-exponential. Otherwise, the proof can be also established with similar techniques in section 3.2. The upper bound will simplify our proof in Lemma 9.

4.1 The correlated negative curvature condition under errors

We first prove a lemma about the second moment for non-stationary Markov chains. This lemma establishes that the second moment of the MCMC estimator is bounded from below by a positive quantity relying on the sample size nn, the spectral gap γ\gamma and 𝔼π[h2]\mathbb{E}_{\pi}[h^{2}].

Lemma 6.

Let {Xi}i=1n0+n\{X_{i}\}_{i=1}^{n_{0}+n} be the Markov chain with the stationary distribution π\pi and the initial distribution ν\nu. Suppose it admits a absolute spectral gap γ\gamma and χ=χ(ν,π)<+{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}={\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}(\nu,\pi)<+\infty. The function hh has a finite fourth central moment σ44[h]:=𝔼π[(h𝔼π[h])4]\sigma_{4}^{4}[h]:=\mathbb{E}_{\pi}[\left(h-\mathbb{E}_{\pi}[h]\right)^{4}]. If n32γ3n\geq\frac{32}{\gamma^{3}} and n02γ(logχ+log(σ4[h]/σ2[h])+logn)n_{0}\geq\frac{2}{\gamma}(\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}+\log(\sigma_{4}[h]/\sigma_{2}[h])+\log n), it holds that

𝔼ν[(1ni=n0+1n+n0h(Xi))2]γ4n𝔼π[h2].\mathbb{E}_{\nu}\left[\left(\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}h(X_{i})\right)^{2}\right]\geq\frac{\gamma}{4n}\mathbb{E}_{\pi}[h^{2}]. (4.2)
Proof.

We denote Z=1ni=n0+1n+n0h(Xi)Z=\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}h(X_{i}) and c=𝔼π[h]c=\mathbb{E}_{\pi}[h]. Then it holds

𝔼ν[Z2]=c2+𝔼ν[(Zc)2]+2c(𝔼ν[Z]c)c22+𝔼ν[(Zc)2]2(𝔼ν[Z]c)2.\mathbb{E}_{\nu}[Z^{2}]=c^{2}+\mathbb{E}_{\nu}[(Z-c)^{2}]+2c(\mathbb{E}_{\nu}[Z]-c)\geq\frac{c^{2}}{2}+\mathbb{E}_{\nu}[(Z-c)^{2}]-2(\mathbb{E}_{\nu}[Z]-c)^{2}. (4.3)

Notice the difference

|𝔼ν[(Zc)2]𝔼π[(Zc)2]|(1γ)n0χ(𝔼π[(Zc)4])12\displaystyle\left|\mathbb{E}_{\nu}\left[(Z-c)^{2}\right]-\mathbb{E}_{\pi}\left[(Z-c)^{2}\right]\right|\leq(1-\gamma)^{n_{0}}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\cdot\left(\mathbb{E}_{\pi}\left[(Z-c)^{4}\right]\right)^{\frac{1}{2}} (4.4)

where the inequality changes the measure as in (3.13). It follows from the Cauchy-Schwarz inequality that

𝔼π[(Zc)4]=\displaystyle\mathbb{E}_{\pi}\left[(Z-c)^{4}\right]= 𝔼π[(1ni=n0+1n+n0h(Xi)𝔼π[h])4]\displaystyle~{}\mathbb{E}_{\pi}\left[\left(\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}h(X_{i})-\mathbb{E}_{\pi}[h]\right)^{4}\right]
\displaystyle\leq 𝔼π[1ni=n0+1n+n0(h(Xi)𝔼π[h])4]=𝔼Xπ[(f(X)𝔼π[h])4]=σ44[h],\displaystyle~{}\mathbb{E}_{\pi}\left[\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}\left(h(X_{i})-\mathbb{E}_{\pi}[h]\right)^{4}\right]=\mathbb{E}_{X\sim\pi}\left[\left(f(X)-\mathbb{E}_{\pi}[h]\right)^{4}\right]=\sigma_{4}^{4}[h],

where the second equality is because π\pi is the stationary distribution of the Markov chain. Therefore, we derive that

𝔼ν[(Zc)2]𝔼π[(1ni=n0+1n+n0h(Xi)𝔼π[h])2](1γ)n0χσ42[h].\mathbb{E}_{\nu}\left[(Z-c)^{2}\right]\geq\mathbb{E}_{\pi}\left[\left(\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}h(X_{i})-\mathbb{E}_{\pi}[h]\right)^{2}\right]-(1-\gamma)^{n_{0}}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\sigma_{4}^{2}[h]. (4.5)

It holds from Theorem 3.1 in [37] that

|𝔼π[(1ni=n0+1n+n0h(Xi)𝔼π[h])2]σasy2[h]n|16σ22[h]γ2n2,\displaystyle\left|\mathbb{E}_{\pi}\left[\left(\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}h(X_{i})-\mathbb{E}_{\pi}[h]\right)^{2}\right]-\frac{\sigma^{2}_{asy}[h]}{n}\right|\leq\frac{16\sigma^{2}_{2}[h]}{\gamma^{2}n^{2}}, (4.6)

where σasy2[h]:=h,[2(I(Pπ))1I]hπ\sigma^{2}_{asy}[h]:=\langle h,[2(I-(\operatorname{P}-\pi))^{-1}-I]h\rangle_{\pi} on page 24 of [37]. Using the spectral method therein, we obtain that

σasy2[h]\displaystyle\sigma^{2}_{asy}[h] =h,[2(I(Pπ))1I]hπ=(IP)1hπ2P(IP)1hπ2\displaystyle=\langle h,[2(I-(\operatorname{P}-\pi))^{-1}-I]h\rangle_{\pi}=\left\|(I-\operatorname{P})^{-1}h\right\|_{\pi}^{2}-\left\|P(I-P)^{-1}h\right\|_{\pi}^{2} (4.7)
(1(1γ)2)(IP)1hπ2γσ22[h]2.\displaystyle\geq(1-(1-\gamma)^{2})\left\|(I-\operatorname{P})^{-1}h\right\|_{\pi}^{2}\geq\frac{\gamma\sigma^{2}_{2}[h]}{2}.

Next, we bound the term (𝔼ν[Z]c)2(\mathbb{E}_{\nu}[Z]-c)^{2} in (4.3). By Lemma 3, it holds that

|𝔼ν[Z]c|=|1ni=n0+1n+n0𝔼π[h(Xi)]𝔼π[h]|(1γ)n0χσ2[h]γn.\left|\mathbb{E}_{\nu}[Z]-c\right|=\left|\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}\mathbb{E}_{\pi}\left[h(X_{i})\right]-\mathbb{E}_{\pi}[h]\right|\leq\frac{(1-\gamma)^{n_{0}}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\sigma_{2}[h]}{\gamma n}. (4.8)

Finally, combining (4.3), (4.5), (4.6), (4.7) and (4.8) yields

𝔼ν[(1ni=n0+1n+n0h(Xi))2]\displaystyle~{}\mathbb{E}_{\nu}\left[\left(\frac{1}{n}\sum_{i=n_{0}+1}^{n+n_{0}}h(X_{i})\right)^{2}\right]
12(𝔼π[h])2+γσ22[h]2n16σ22[h]γ2n2(1γ)n0χσ42[h]2(1γ)2n0χ2σ22[h]γ2n2\displaystyle\geq\frac{1}{2}\left(\mathbb{E}_{\pi}[h]\right)^{2}+\frac{\gamma\sigma^{2}_{2}[h]}{2n}-\frac{16\sigma^{2}_{2}[h]}{\gamma^{2}n^{2}}-(1-\gamma)^{n_{0}}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\sigma_{4}^{2}[h]-\frac{2(1-\gamma)^{2n_{0}}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}^{2}\sigma^{2}_{2}[h]}{\gamma^{2}n^{2}}
σ22[h](γn16γ2n22(1γ)2n0χ2γ2n2(1γ)n0χσ42[h]σ22[h])+12(𝔼π[h])2\displaystyle\geq\sigma^{2}_{2}[h]\left(\frac{\gamma}{n}-\frac{16}{\gamma^{2}n^{2}}-\frac{2(1-\gamma)^{2n_{0}}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}^{2}}{\gamma^{2}n^{2}}-(1-\gamma)^{n_{0}}{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}\cdot\frac{\sigma^{2}_{4}[h]}{\sigma^{2}_{2}[h]}\right)+\frac{1}{2}\left(\mathbb{E}_{\pi}[h]\right)^{2}
γ4nσ22[h]+12(𝔼π[h])2γ4n𝔼π[h2],\displaystyle\geq\frac{\gamma}{4n}\sigma^{2}_{2}[h]+\frac{1}{2}\left(\mathbb{E}_{\pi}[h]\right)^{2}\geq\frac{\gamma}{4n}\mathbb{E}_{\pi}[h^{2}],

where the third inequality holds when n32γ3n\geq\frac{32}{\gamma^{3}} and n02γ(logχ+log(σ4[h]/σ2[h])+logn)n_{0}\geq\frac{2}{\gamma}(\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}+\log(\sigma_{4}[h]/\sigma_{2}[h])+\log n). ∎

The CNC condition under errors (4.9) in the following is of vital importance to analyze the saddle escaping property in SGD. That is one of the reason why stochastic optimization algorithms can escape from saddle points. If the CNC condition does not hold, the noise may not be strong enough to escape along the descent direction. The following lemma shows that the empirical stochastic gradient g^(θ;S)\hat{g}(\theta;S) defined in (2.11) satisfies the CNC condition under errors.

Lemma 7.

Let Assumptions 3,2, 4 and 5 hold. Then, for the unit eigenvector vv with respect to the minimum eigenvalue of the Hessian, there exists μ=ηγ16n\mu=\frac{\eta\gamma}{16n} such that it holds

𝔼ν[(vTg^(θ;S))2]μσ22[fθ],θΘ,\mathbb{E}_{\nu}\left[\left(v^{\mathrm{T}}\hat{g}(\theta;S)\right)^{2}\right]\geq\mu\sigma_{2}^{2}[f_{\theta}],\quad\forall\theta\in\Theta, (4.9)

as long as nΩ(1ηγ3)+Ω~(κBηγ2)n\geq\Omega\left(\frac{1}{\eta\gamma^{3}}\right)+\tilde{\Omega}\left(\frac{\kappa B}{\sqrt{\eta}\gamma^{2}}\right) and n02γ(logχ+log(2κ)+logn)n_{0}\geq\frac{2}{\gamma}(\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}+\log(2\kappa)+\log n) where Ω()\Omega(\cdot) and Ω~()\tilde{\Omega}(\cdot) hide constants M,B,C,χM,B,C,{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}.

Proof.

The MCMC estimate of the gradient g^(θ,S)\hat{g}(\theta,S) is computed by (2.11). For convenience, fix a unit vector vv and we denote empirical variables by Fi=fθ(xi),Yi=vTθϕθ(xi)F_{i}=f_{\theta}(x_{i}),Y_{i}=v^{\mathrm{T}}\nabla_{\theta}\phi_{\theta}(x_{i}) and F¯,Y¯\overline{F},\overline{Y} represents the average of Ei,YiE_{i},Y_{i}. Let stationary variables F=fθ(x)F=f_{\theta}(x) and Y=vTθϕθ(X)Y=v^{\mathrm{T}}\nabla_{\theta}\phi_{\theta}(X) with a dependent variable XπθX\sim\pi_{\theta}, then it holds that

vTg^(θ,S)\displaystyle v^{\mathrm{T}}\hat{g}(\theta,S) =2ni=n0+1n0+n(fθ(xi)1ni=n0+1n0+nfθ(xi))vTϕθ(xi)\displaystyle=\frac{2}{n}\sum_{i=n_{0}+1}^{n_{0}+n}\big{(}f_{\theta}(x_{i})-\frac{1}{n}\sum_{i=n_{0}+1}^{n_{0}+n}f_{\theta}(x_{i})\big{)}v^{\mathrm{T}}\nabla\phi_{\theta}(x_{i})
=2ni=n0+1n0+nFiYi2(F¯𝔼F+𝔼F)(Y¯𝔼Y+𝔼Y)\displaystyle=\frac{2}{n}\sum_{i=n_{0}+1}^{n_{0}+n}F_{i}Y_{i}-2\left(\overline{F}-\mathbb{E}F+\mathbb{E}F\right)\left(\overline{Y}-\mathbb{E}Y+\mathbb{E}Y\right)
=2ni=n0+1n0+n(Fi𝔼F)(Yi𝔼Y)Z12(F¯𝔼F)(Y¯𝔼Y)Z2.\displaystyle=\underbrace{\frac{2}{n}\sum_{i=n_{0}+1}^{n_{0}+n}(F_{i}-\mathbb{E}F)(Y_{i}-\mathbb{E}Y)}_{Z_{1}}-\underbrace{2\left(\overline{F}-\mathbb{E}F\right)\left(\overline{Y}-\mathbb{E}Y\right)}_{Z_{2}}.

To obtain the lower bound, we have

𝔼ν[(vTg^(θ,S))2]=𝔼ν[(Z1Z2)2]12𝔼ν[Z12]𝔼ν[Z22].\displaystyle\mathbb{E}_{\nu}\left[(v^{\mathrm{T}}\hat{g}(\theta,S))^{2}\right]=\mathbb{E}_{\nu}\left[(Z_{1}-Z_{2})^{2}\right]\geq\frac{1}{2}\mathbb{E}_{\nu}\left[Z_{1}^{2}\right]-\mathbb{E}_{\nu}\left[Z_{2}^{2}\right].

Since ex>x4/8e^{x}>x^{4}/8 when x0x\geq 0, it holds that σ4[F]2Fψ1(πθ,𝒳)\sigma_{4}[F]\leq 2\left\|F\right\|_{\psi_{1}(\pi_{\theta},\mathcal{X})}. It implies the kurtosis σ4[F]/σ2[F]\sigma_{4}[F]/\sigma_{2}[F] is less than 2κ2\kappa. Then, when n32γ3n\geq\frac{32}{\gamma^{3}} and n02γ(logχ+log(2κ)+logn)n_{0}\geq\frac{2}{\gamma}(\log{\mathchoice{\raisebox{0.0pt}{$\displaystyle\chi$}}{\raisebox{0.0pt}{$\textstyle\chi$}}{\raisebox{0.0pt}{$\scriptstyle\chi$}}{\raisebox{0.0pt}{$\scriptscriptstyle\chi$}}}+\log(2\kappa)+\log n), it follows from Lemma 6 that

𝔼ν[Z12]\displaystyle\mathbb{E}_{\nu}\left[Z_{1}^{2}\right] γ4n𝔼π[(F𝔼F)2(Y𝔼Y)2]ηγσ22[F]4n.\displaystyle\geq\frac{\gamma}{4n}\mathbb{E}_{\pi}[(F-\mathbb{E}F)^{2}(Y-\mathbb{E}Y)^{2}]{\geq}\frac{\eta\gamma\sigma_{2}^{2}[F]}{4n}. (4.10)

Besides, the upper bound of 𝔼ν[Z22]\mathbb{E}_{\nu}\left[Z_{2}^{2}\right] can be obtained by the Cauchy-Schwarz inequality that

𝔼ν[Z22]\displaystyle\mathbb{E}_{\nu}[Z_{2}^{2}] =4𝔼ν[(F¯𝔼F)2(Y¯𝔼Y)2]4𝔼ν[(F¯𝔼F)4]𝔼ν[(Y¯𝔼Y)4].\displaystyle=4\mathbb{E}_{\nu}[(\bar{F}-\mathbb{E}F)^{2}(\bar{Y}-\mathbb{E}Y)^{2}]\leq 4\sqrt{\mathbb{E}_{\nu}[(\bar{F}-\mathbb{E}F)^{4}]\cdot\mathbb{E}_{\nu}[(\bar{Y}-\mathbb{E}Y)^{4}]}. (4.11)

By the result in the proof of Lemma 3, we have

𝔼ν[|E¯𝔼F|4]𝒪(σ24[Eθ]n2γ2+M4(logn)8n4γ4).\mathbb{E}_{\nu}[\left|\bar{E}-\mathbb{E}F\right|^{4}]\leq\mathcal{O}\left(\frac{\sigma_{2}^{4}[E_{\theta}]}{n^{2}\gamma^{2}}+\frac{M^{4}(\log n)^{8}}{n^{4}\gamma^{4}}\right). (4.12)

Similarly, notice that |Y|=|vTθϕθ(x)|vθϕθ(x)B\left|Y\right|=\left|v^{\mathrm{T}}\nabla_{\theta}\phi_{\theta}(x)\right|\leq\left\|v\right\|\left\|\nabla_{\theta}\phi_{\theta}(x)\right\|\leq B, we can apply [13, Theorem 12] to obtain

(|Y¯𝔼Y|t)2Cexp(nγt24B2).\mathbb{P}\left(|\bar{Y}-\mathbb{E}Y|\geq t\right)\leq 2C\exp\left(-\frac{n\gamma t^{2}}{4B^{2}}\right).

Using the similar technique in the proof of Lemma 2, we derive

𝔼ν[|Y¯𝔼Y|4]𝒪(B4n2γ2).\mathbb{E}_{\nu}[|\bar{Y}-\mathbb{E}Y|^{4}]\leq\mathcal{O}\left(\frac{B^{4}}{n^{2}\gamma^{2}}\right). (4.13)

Plugging (4.12) and (4.13) into (4.11) yields

𝔼ν[Z22]4𝔼ν[(F¯𝔼F)4]𝔼ν[(Y¯𝔼Y)4]𝒪(σ22[F]B2n2γ2+(logn)4M2B2n3γ3).\mathbb{E}_{\nu}[Z_{2}^{2}]\leq 4\sqrt{\mathbb{E}_{\nu}[(\bar{F}-\mathbb{E}F)^{4}]\cdot\mathbb{E}_{\nu}[(\bar{Y}-\mathbb{E}Y)^{4}]}\leq\mathcal{O}\left(\frac{\sigma_{2}^{2}[F]B^{2}}{n^{2}\gamma^{2}}+\frac{(\log n)^{4}M^{2}B^{2}}{n^{3}\gamma^{3}}\right).

Therefore, 𝔼ν[Z22]ηγσ22[F]32n\mathbb{E}_{\nu}[Z_{2}^{2}]\leq\frac{\eta\gamma\sigma_{2}^{2}[F]}{32n} as long as nΩ(1ηγ3)+Ω~(κBηγ2)n\geq\Omega\left(\frac{1}{\eta\gamma^{3}}\right)+\tilde{\Omega}\left(\frac{\kappa B}{\sqrt{\eta}\gamma^{2}}\right). Finally, combining our discussion of Z1Z_{1} and Z2Z_{2}, we derive that

𝔼ν[(vTg^(θ,S))2]12𝔼ν[Z12]𝔼ν[Z22]ηγσ22[F]32n.\mathbb{E}_{\nu}\left[(v^{\mathrm{T}}\hat{g}(\theta,S))^{2}\right]\geq\frac{1}{2}\mathbb{E}_{\nu}\left[Z_{1}^{2}\right]-\mathbb{E}_{\nu}\left[Z_{2}^{2}\right]\geq\frac{\eta\gamma\sigma_{2}^{2}[F]}{32n}.

This completes the proof. ∎

4.2 Second order stationarity

As discussed in Section 3, MCMC-SGD converges to first order stationary points. While the convergence of SGD is well-understood for convex functions, the existence of saddle points and local minimum poses challenges for non-convex optimization. Since the ordinary GD often stucks near the saddle points, the additional noise within SGD allows it to escape from saddle points. We have verified the CNC condition for biased gradients, by which we are able to analyze the behavior of MCMC-SGD near saddle points. Compared to previous studies [8, 14, 20], the presence of bias in the stochastic gradient needs a more nuanced analysis.

We first give the definition of approximate second-order stationary points.

Definition 3 (Approximate second-order stationary point).

Given a function \mathcal{L}, an (ϵg,ϵh)(\epsilon_{g},\epsilon_{h}) approximate second-order stationary point θ\theta of \mathcal{L} is defined as

g(θ)ϵg,λmin(H(θ))ϵh,\left\|g(\theta)\right\|\leq\epsilon_{g},\quad\lambda_{\min}\left(H(\theta)\right)\geq-\epsilon_{h},

where gg and HH denote the gradient and Hessian of \mathcal{L} respectively.

If ϵg=ϵh=0\epsilon_{g}=\epsilon_{h}=0, the point θ\theta is a second-order stationary point. The second order analysis contributes to understand how MCMC-SGD leaves from these saddle points and finally reach approximate second-order stationary points.

An observation lies that when the variance of the function fθ(x)f_{\theta}(x) approaches zero, the efficacy of MCMC-SGD is compromised, as the stochastic gradient tends to zero. Notably, in the context of variational eigenvalue problems, the specific situation of interest arises when an eigenvalue is obtained as σ22[fθ]=0\sigma_{2}^{2}[f_{\theta}]=0. This scenario is precisely what is required for our analysis. Hence, we define ϵ\epsilon-variance points as another criteria for the algorithm 1.

Definition 4 (ϵ\epsilon-variance point).

For the optimization problem (1.1), we call θ\theta an ϵ\epsilon-variance point if the function fθf_{\theta} satisfies σ22[fθ]<ϵ\sigma_{2}^{2}[f_{\theta}]<\epsilon.

This definition may differentiate the regions where modeling and parameterization cause the algorithm to get stuck. For notational simplification, we take an abbreviation σ22=σ22[fθ]\sigma_{2}^{2}=\sigma_{2}^{2}[f_{\theta}] in this subsection.

To escape from saddle points, a new stepsize schedule is adopted for Algorithm 1. Given a period TT, note that α\alpha and β\beta are the constant stepsizes with β>α>0\beta>\alpha>0, with values given in Table 1. Within one iteration period of TT steps, we adopt a large stepsize β\beta at the beginning of the period and a small one α\alpha at the other T1T-1 iterations, that is,

αk={α,k(modT)0,β,k(modT)=0.\alpha_{k}=\begin{cases}\alpha,\quad k(\mathrm{mod}T)\not=0,\\ \beta,\quad k(\mathrm{mod}T)=0.\end{cases} (4.14)

It will be shown that the schedule can be suitably designed to achieve sufficient descent in one period.

Suppose the total number of iterations KK is a multiple of TT and there are K/TK/T periods. We denote θ~m=θmT\tilde{\theta}_{m}=\theta_{m\cdot T} for m=0,1,,K/Tm=0,1,\dots,K/T in each period. For some given ϵ>0\epsilon>0, we consider four regimes of the iterates {θ~m}\{\tilde{\theta}_{m}\} as follow,

1\displaystyle\mathcal{R}_{1} :={θ|g(θ)ϵ},\displaystyle:=\left\{\theta\Big{|}\left\|g(\theta)\right\|\geq\epsilon\right\}, (4.15)
2\displaystyle\mathcal{R}_{2} :={θ|g(θ)<ϵ,λmin(H(θ))ϵ14,σ22[fθ]ϵ12},\displaystyle:=\left\{\theta\Big{|}\left\|g(\theta)\right\|<\epsilon,~{}\lambda_{\min}\left(H(\theta)\right)\leq-\epsilon^{\frac{1}{4}},~{}~{}\sigma_{2}^{2}[f_{\theta}]\geq\epsilon^{\frac{1}{2}}\right\}, (4.16)
3\displaystyle\mathcal{R}_{3} :={θ|g(θ)<ϵ,λmin(H(θ))>ϵ14,orσ22[fθ]<ϵ12},\displaystyle:=\left\{\theta\Big{|}\left\|g(\theta)\right\|<\epsilon,~{}\lambda_{\min}\left(H(\theta)\right)>-\epsilon^{\frac{1}{4}},~{}\mathrm{or}~{}\sigma_{2}^{2}[f_{\theta}]<\epsilon^{\frac{1}{2}}\right\}, (4.17)

1\mathcal{R}_{1} stands for the regime with a large gradient, where the stochastic gradient works effectively. When the iterate lies in 2\mathcal{R}_{2}, despite being close to a first-order stationary point, the CNC condition mentioned in section 4.1 guarantees a decrease after TT iterations under our schedule. 3\mathcal{R}_{3} is a regime of (ϵ,ϵ1/4)(\epsilon,\epsilon^{1/4}) approximate second order stationary points or ϵ1/2\epsilon^{1/2} variance points. We need to show Algorithm 1 will reach 3\mathcal{R}_{3} with high probability, that is, converge to approximate second order stationary points.

The analysis below relies on a particular choice of parameters, whose values satisfy the following lemmas and the main theorem. For ease of verification, the choice of parameters is collected in Table 1.

Parameter Value Order Conditions
β\beta δϵ2192lgρLVn,n0\frac{\delta\epsilon^{2}}{192l_{g}\rho LV_{n,n_{0}}} O(ϵ)O(\epsilon) (4.20), (4.34)
α\alpha βT\frac{\beta}{\sqrt{T}} O(ϵ9/4log1(1ϵ))O\left(\epsilon^{9/4}\log^{-1}\left(\frac{1}{\epsilon}\right)\right) (4.34)
𝒟\mathcal{D} βϵ2192ρlg\frac{\beta\epsilon^{2}}{192\rho l_{g}} O(ϵ3)O(\epsilon^{3}) (4.20),(4.34)
nn ηγ64ϵ\frac{\eta\gamma}{64\epsilon} O(ϵ1)O\left(\epsilon^{-1}\right) Lemma 7
n0n_{0} - O(log(1ϵ))O\left(\log\left(\frac{1}{\epsilon}\right)\right) (3.1), Lemma 7
Bn,n0B_{n,n_{0}} ϵ296T3/2lgρ\sqrt{\frac{\epsilon^{2}}{96T^{3/2}l_{g}\rho}} O(ϵ3)O(\epsilon^{3}) (4.20), (4.34)
TT 1β2ϵ1/2log2(ρlgLVn,n0μδϵ)\frac{1}{\beta^{2}\epsilon^{1/2}}\log^{2}\left(\frac{\rho l_{g}LV_{n,n_{0}}}{\mu\delta\epsilon}\right) O(ϵ5/2log2ϵ)O\left(\epsilon^{-5/2}\log^{2}\epsilon\right) (4.38)
KK 2[(θ0)]Tδ𝒟\frac{2[\mathcal{L}(\theta_{0})-\mathcal{L}^{*}]T}{\delta\mathcal{D}} O(ϵ11/2log2ϵ)O\left(\epsilon^{-11/2}\log^{2}\epsilon\right) (4.42)
Table 1: List of parameters.

We shall briefly introduce each parameter in Table 1 to facilitate the understanding of the following proof. β\beta is the larger stepsize that assists SGD in achieving sufficient descent when the gradient norm is large. α\alpha is the smaller stepsize that helps SGD escaping saddle points when the algorithm approaches first-order stationary points. 𝒟\mathcal{D} stands for the sufficient decrease in the expected function value near the large gradient region and saddle points. nn and n0n_{0} are the sample size and the length of burn-in period of MCMC algorithms. Bn,n0B_{n,n_{0}} is the bias of MCMC algorithms, depending on nn and n0n_{0}. TT represents the period length for switching stepsize β\beta and α\alpha, in which we adopt the large stepsize β\beta at the beginning and the small one α\alpha at the other T1T-1 iterations. Finally, KK is the total number of iterations of MCMC-SGD. These parameters are chosen such that all the conditions are satisfied. We presuppose the validity of these conditions and then deduce the values of the parameters from them. Although the choice of each parameter may not appear intuitive, it is not necessary to substitute these parameters in their entirety during computation. One only needs to verify the fulfillment of the conditions listed in Table 1.

With a large gradient, it is easy to show a sufficient decrease of the objective function value. We have analyzed the decrease for each iteration in Lemma 5, and it follows a similar argument for the MCMC-SGD.

Lemma 8.

Suppose that θ~m\tilde{\theta}_{m} lies in 1\mathcal{R}_{1} defined in (4.15) and Algorithm 1 updates with the schedule (4.14) and parameters in Table 1, then the expected value of (θ~m+1)\mathcal{L}(\tilde{\theta}_{m+1}) taken over the randomness of {θk}k=mT+1(m+1)T\{\theta_{k}\}_{k=m\cdot T+1}^{(m+1)\cdot T} decreases as

(θ~m)𝔼[(θ~m+1)|θ~m]𝒟.\mathcal{L}(\tilde{\theta}_{m})-\mathbb{E}[\mathcal{L}(\tilde{\theta}_{m+1})|\tilde{\theta}_{m}]\geq\mathcal{D}.
Proof.

We first decompose the difference of the expected function value into each iteration,

(θ~m)𝔼[(θ~m+1)|θ~m]=p=0T1𝔼[(θmT+p)𝔼[(θmT+p+1)]|θmT+p],\mathcal{L}(\tilde{\theta}_{m})-\mathbb{E}[\mathcal{L}(\tilde{\theta}_{m+1})|\tilde{\theta}_{m}]=\sum_{p=0}^{T-1}\mathbb{E}\left[\mathcal{L}(\theta_{m\cdot T+p})-\mathbb{E}\left[\mathcal{L}(\theta_{m\cdot T+p+1})\right]\bigg{|}\theta_{m\cdot T+p}\right], (4.18)

where 𝔼[(θmT)|θmT]=(θ~m)\mathbb{E}\left[\mathcal{L}(\theta_{m\cdot T})|\theta_{m\cdot T}\right]=\mathcal{L}(\tilde{\theta}_{m}) due to the definition θ~m=θmT\tilde{\theta}_{m}=\theta_{m\cdot T}. Using the choice β2=Tα2\beta^{2}=T\alpha^{2} in Table 1, it follows from (4.18) that

2((θ~m)𝔼[(θ~m+1)|θ~m])\displaystyle 2\left(\mathcal{L}(\tilde{\theta}_{m})-\mathbb{E}[\mathcal{L}(\tilde{\theta}_{m+1})|\tilde{\theta}_{m}]\right)\geq βg(θ~m)2βBn,n02β2LVn,n0\displaystyle\beta\left\|g(\tilde{\theta}_{m})\right\|^{2}-\beta B_{n,n_{0}}^{2}-\beta^{2}LV_{n,n_{0}} (4.19)
(T1)(αBn,n02+α2LVn,n0)\displaystyle\quad-(T-1)\left(\alpha B_{n,n_{0}}^{2}+\alpha^{2}LV_{n,n_{0}}\right)
\displaystyle\geq βg(θ~m)22TαBn,n022β2LVn,n0,\displaystyle\beta\left\|g(\tilde{\theta}_{m})\right\|^{2}-2T\alpha B_{n,n_{0}}^{2}-2\beta^{2}LV_{n,n_{0}},

where the first inequality is by Lemma 5 for p=0,,T1p=0,\dots,T-1 and the second is by the direct substitution. Then, by the choice of β,n0,𝒟\beta,n_{0},\mathcal{D} in Table 1, these conditions hold that

βϵ28LVn,n0,Bn,n02ϵ28T,𝒟βϵ24.\beta\leq\frac{\epsilon^{2}}{8LV_{n,n_{0}}},\quad B_{n,n_{0}}^{2}\leq\frac{\epsilon^{2}}{8\sqrt{T}},\quad\mathcal{D}\leq\frac{\beta\epsilon^{2}}{4}. (4.20)

As θ~m\tilde{\theta}_{m} lies in 1\mathcal{R}_{1}, which means θ(θ~m)ϵ\left\|\nabla_{\theta}\mathcal{L}(\tilde{\theta}_{m})\right\|\geq\epsilon, we plug (4.20) into (4.19) and obtain that

2((θ~m)𝔼[(θ~m+1)|θ~m])\displaystyle 2\left(\mathcal{L}(\tilde{\theta}_{m})-\mathbb{E}[\mathcal{L}(\tilde{\theta}_{m+1})|\tilde{\theta}_{m}]\right)\geq βϵ22βTBn,n022βLVn,n0ϵ28LVn,n0\displaystyle\beta\epsilon^{2}-2\beta\sqrt{T}B_{n,n_{0}}^{2}-2\beta LV_{n,n_{0}}\cdot\frac{\epsilon^{2}}{8LV_{n,n_{0}}}
\displaystyle\geq βϵ2βϵ24βϵ24=βϵ222𝒟.\displaystyle\beta\epsilon^{2}-\frac{\beta\epsilon^{2}}{4}-\frac{\beta\epsilon^{2}}{4}=\frac{\beta\epsilon^{2}}{2}\geq 2\mathcal{D}.

This completes the proof. ∎

Near the saddle points, the classical GD gets stuck if the gradient is orthogonal to the negative curvature direction. However, the stochastic gradient with the CNC condition has inherent noise along the negative curvature direction. Under our stepsize schedule (4.14), the objective function value can have sufficient decrease after a period.

Lemma 9.

Suppose θ~m\tilde{\theta}_{m} lies in 2\mathcal{R}_{2} defined in (4.16) and Algorithm 1 updates with the schedule (4.14) and parameters in Table 1, then the expected value of (θ~m+1)\mathcal{L}(\tilde{\theta}_{m+1}) taken over the randomness of {θk}k=mT+1(m+1)T\{\theta_{k}\}_{k=m\cdot T+1}^{(m+1)\cdot T} decreases as

(θ~m)𝔼[(θ~m+1)|θ~m]𝒟.\mathcal{L}(\tilde{\theta}_{m})-\mathbb{E}[\mathcal{L}(\tilde{\theta}_{m+1})|\tilde{\theta}_{m}]\geq\mathcal{D}.
Proof.

The proof is by contradiction. Without loss of generality, we suppose that m=0m=0 in this proof and denote

p=(θp),gp=θ(θp),g^p=g^(θp,Sp),Hp=θ2(θp)\mathcal{L}_{p}=\mathcal{L}\left(\theta_{p}\right),~{}~{}g_{p}=\nabla_{\theta}\mathcal{L}\left(\theta_{p}\right),~{}~{}\hat{g}_{p}=\hat{g}\left(\theta_{p},S_{p}\right),~{}~{}H_{p}=\nabla^{2}_{\theta}\mathcal{L}\left(\theta_{p}\right)

for p=0,,T1p=0,\dots,T-1. Every expectation in this proof is taken over all existed θp\theta_{p} in every formula. We assume the expected function value decreases by no more than 𝒟\mathcal{D}, i.e.,

0𝔼[T]<𝒟.\mathcal{L}_{0}-\mathbb{E}\left[\mathcal{L}_{T}\right]<\mathcal{D}. (4.21)

We proceed to show that the assumption (4.21) is invalid.

We start with estimating the expected distance between θ0\theta_{0} and θp\theta_{p} for p=1,,T1p=1,\dots,T-1. By Lemma 5, it holds

2(0𝔼[T])\displaystyle~{}2\left(\mathcal{L}_{0}-\mathbb{E}\left[\mathcal{L}_{T}\right]\right) (4.22)
βg02βBn,n02β2LVn,n0+h=1T1(α𝔼gh2αBn,n02α2LVn,n0)\displaystyle\geq~{}\beta\left\|g_{0}\right\|^{2}-\beta B^{2}_{n,n_{0}}-\beta^{2}LV_{n,n_{0}}+\sum_{h=1}^{T-1}\left(\alpha\mathbb{E}\left\|g_{h}\right\|^{2}-\alpha B^{2}_{n,n_{0}}-\alpha^{2}LV_{n,n_{0}}\right)
βg02+αh=1T1𝔼gh22TαBn,n022β2LVn,n0,\displaystyle\geq~{}\beta\left\|g_{0}\right\|^{2}+\alpha\sum_{h=1}^{T-1}\mathbb{E}\left\|g_{h}\right\|^{2}-2T\alpha B^{2}_{n,n_{0}}-2\beta^{2}LV_{n,n_{0}},

where the first inequality is derived similarly to (4.19), and the second inequality is due to β=Tα\beta=\sqrt{T}\alpha. Together with the assumption (4.21), it follows that

βg02+αh=1T1𝔼gh22𝒟+2TαBn,n02+2β2LVn,n0=:A1.\beta\left\|g_{0}\right\|^{2}+\alpha\sum_{h=1}^{T-1}\mathbb{E}\left\|g_{h}\right\|^{2}\leq 2\mathcal{D}+2T\alpha B^{2}_{n,n_{0}}+2\beta^{2}LV_{n,n_{0}}=:A_{1}. (4.23)

A direct implication of (4.23) is that

βg02A1,α𝔼h=1pgh2pαh=1p𝔼gh2pA1.\beta\left\|g_{0}\right\|^{2}\leq A_{1},\qquad\alpha\mathbb{E}\left\|\sum_{h=1}^{p}g_{h}\right\|^{2}\leq p\alpha\sum_{h=1}^{p}\mathbb{E}\left\|g_{h}\right\|^{2}\leq pA_{1}. (4.24)

We proceed to bound θp+1θ0=βg^0+αh=1pg^h\theta_{p+1}-\theta_{0}=\beta\hat{g}_{0}+\alpha\sum_{h=1}^{p}\hat{g}_{h} as follows. Firstly,

𝔼h=1p(g^hgh)2\displaystyle\mathbb{E}\left\|\sum_{h=1}^{p}(\hat{g}_{h}-g_{h})\right\|^{2} =h=1p𝔼g^hgh2+21h<lp𝔼g^hgh,g^lgl\displaystyle=\sum_{h=1}^{p}\mathbb{E}\left\|\hat{g}_{h}-g_{h}\right\|^{2}+2\sum_{1\leq h<l\leq p}\mathbb{E}\left\langle\hat{g}_{h}-g_{h},\hat{g}_{l}-g_{l}\right\rangle (4.25)
=h=1p𝔼g^hgh2+21h<lp𝔼g^hgh,𝔼[g^l|θl]gl\displaystyle=\sum_{h=1}^{p}\mathbb{E}\left\|\hat{g}_{h}-g_{h}\right\|^{2}+2\sum_{1\leq h<l\leq p}\mathbb{E}\left\langle\hat{g}_{h}-g_{h},\mathbb{E}[\hat{g}_{l}|\theta_{l}]-g_{l}\right\rangle
pVn,n0+p(p1)Bn,n0Vn,n02pVn,n0+p3Bn,n02,\displaystyle\leq pV_{n,n_{0}}+p(p-1)B_{n,n_{0}}\sqrt{V_{n,n_{0}}}\leq 2pV_{n,n_{0}}+p^{3}B_{n,n_{0}}^{2},

where the second equality is because we can take conditional expectation on θl\theta_{l} first, the following inequality holds from the result 𝔼g^hgh2Vn,n0\mathbb{E}\left\|\hat{g}_{h}-g_{h}\right\|^{2}\leq V_{n,n_{0}} and 𝔼[g^l|θl]glBn,n0\left\|\mathbb{E}[\hat{g}_{l}|\theta_{l}]-g_{l}\right\|\leq B_{n,n_{0}} in Theorem 1, and the last inequality is due to AM-GM inequality. Therefore, we can bound

𝔼θp+1θ02=𝔼βg^0+αh=1pg^h2\displaystyle~{}\mathbb{E}\left\|\theta_{p+1}-\theta_{0}\right\|^{2}=\mathbb{E}\left\|\beta\hat{g}_{0}+\alpha\sum_{h=1}^{p}\hat{g}_{h}\right\|^{2} (4.26)
=𝔼βg0+β(g^0g0)+αh=1pgh+αh=1p(g^hgh)2\displaystyle=\mathbb{E}\left\|\beta g_{0}+\beta(\hat{g}_{0}-g_{0})+\alpha\sum_{h=1}^{p}g_{h}+\alpha\sum_{h=1}^{p}(\hat{g}_{h}-g_{h})\right\|^{2}
4β2g02(4.24)+4β2𝔼g^0g02Theorem 1+4α2𝔼h=1pgh2(4.24)+4α2𝔼h=1p(g^hgh)2(4.25)\displaystyle\leq\underbrace{4\beta^{2}\left\|g_{0}\right\|^{2}}_{\eqref{eq:R2dec3}}+\underbrace{4\beta^{2}\mathbb{E}\left\|\hat{g}_{0}-g_{0}\right\|^{2}}_{\text{Theorem \ref{thm:gbiasVar}}}+\underbrace{4\alpha^{2}\mathbb{E}\left\|\sum_{h=1}^{p}g_{h}\right\|^{2}}_{\eqref{eq:R2dec3}}+\underbrace{4\alpha^{2}\mathbb{E}\left\|\sum_{h=1}^{p}(\hat{g}_{h}-g_{h})\right\|^{2}}_{\eqref{eq:distdecomp2}}
4βA1+4β2Vn,n0+4αpA1+2α2(2pVn,n0+p3Bn,n02)\displaystyle\leq 4\beta A_{1}+4\beta^{2}V_{n,n_{0}}+4\alpha pA_{1}+2\alpha^{2}(2pV_{n,n_{0}}+p^{3}B_{n,n_{0}}^{2})
4βA1+8β2Vn,n0+4p(αA1+α2T2Bn,n02),\displaystyle\leq 4\beta A_{1}+8\beta^{2}V_{n,n_{0}}+4p(\alpha A_{1}+\alpha^{2}T^{2}B_{n,n_{0}}^{2}),

where the second inequality is due to (4.24), Theorem 1 and (4.25), and the final follows from pα2Tα2=β2p\alpha^{2}\leq T\alpha^{2}=\beta^{2}. Thus, we can take

A2\displaystyle A_{2} :=8α𝒟+16α2T2Bn,n02+8αβ2LVn,n0,\displaystyle:=8\alpha\mathcal{D}+16\alpha^{2}T^{2}B_{n,n_{0}}^{2}+8\alpha\beta^{2}LV_{n,n_{0}},
A3\displaystyle A_{3} :=8β𝒟+8αβTBn,n02+16β2Vn,n0,\displaystyle:=8\beta\mathcal{D}+8\alpha\beta TB_{n,n_{0}}^{2}+16\beta^{2}V_{n,n_{0}},

and then

𝔼θpθ02(p1)A2+A3.\displaystyle\mathbb{E}\left\|\theta_{p}-\theta_{0}\right\|^{2}\leq(p-1)A_{2}+A_{3}. (4.27)

When we take p=T1p=T-1, (4.26) shows that the expected distance between θ0\theta_{0} and θT\theta_{T} is bounded by a quadratic function of TT.

We further prove the expected distance between θ0\theta_{0} and θT\theta_{T} grows at least exponentially for TT, leading to a contradiction. Since θp\theta_{p} stays close to θ0\theta_{0}, the quadratic Taylor approximation of the function \mathcal{L} at θ0\theta_{0} is introduced as

Q(θ):=0+g0T(θθ0)+12(θθ0)TH0(θθ0).Q(\theta):=\mathcal{L}_{0}+g_{0}^{\mathrm{T}}(\theta-\theta_{0})+\frac{1}{2}(\theta-\theta_{0})^{\mathrm{T}}H_{0}(\theta-\theta_{0}).

We denote Qp=Q(θp)Q_{p}=Q(\theta_{p}) and qp=θQ(θp)=g0+H0(θpθ0)q_{p}=\nabla_{\theta}Q(\theta_{p})=g_{0}+H_{0}(\theta_{p}-\theta_{0}) for p=0,,T1p=0,\dots,T-1. Using the Taylor approximation is firstly proposed in [14]. As \mathcal{L} is twice-differentiable with a ρ\rho-Lipschitz Hessian, [35, Lemma 1.2.4] gives that

θ(θ)θQ(θ)ρ2θθ02.\left\|\nabla_{\theta}\mathcal{L}(\theta)-\nabla_{\theta}Q(\theta)\right\|\leq\frac{\rho}{2}\left\|\theta-\theta_{0}\right\|^{2}. (4.28)

Thus, qhgh2ρ2θhθ02\left\|q_{h}-g_{h}\right\|^{2}\leq\frac{\rho}{2}\left\|\theta_{h}-\theta_{0}\right\|^{2}. To derive the lower bound, θp+1θ0\theta_{p+1}-\theta_{0} is decomposed as

θp+1θ0\displaystyle\theta_{p+1}-\theta_{0} =θpθ0αg^p\displaystyle=\theta_{p}-\theta_{0}-\alpha\hat{g}_{p}
=θpθ0αqpα(g^pgp+gpqp)\displaystyle=\theta_{p}-\theta_{0}-\alpha q_{p}-\alpha(\hat{g}_{p}-g_{p}+g_{p}-q_{p})
=(IαH0)(θpθ0)αg0α(g^pgp+gpqp).\displaystyle=(I-\alpha H_{0})(\theta_{p}-\theta_{0})-\alpha g_{0}-\alpha(\hat{g}_{p}-g_{p}+g_{p}-q_{p}).

Let λ0<0-\lambda_{0}<0 be the minimum eigenvalue of the Hessian H0=H(θ0)H_{0}=H(\theta_{0}), and let vv be the unit eigenvector with respect to λ0-\lambda_{0} (which is deterministic conditional on θ0\theta_{0}). Then (IαH0)v=(1+αλ0)v=κv(I-\alpha H_{0})v=(1+\alpha\lambda_{0})v=\kappa v, and hence

v,θp+1θ0=κv,θp+1θ0αv,g0αv,g^pqp.\displaystyle\left\langle v,\theta_{p+1}-\theta_{0}\right\rangle=\kappa\left\langle v,\theta_{p+1}-\theta_{0}\right\rangle-\alpha\left\langle v,g_{0}\right\rangle-\alpha\left\langle v,\hat{g}_{p}-q_{p}\right\rangle.

Recursively expanding this equality out, we finally obtain

v,θp+1θ0=κpv,θ1θ0αv,g0h=1pκphαh=1pκphv,g^hqh\displaystyle\left\langle v,\theta_{p+1}-\theta_{0}\right\rangle=\kappa^{p}\left\langle v,\theta_{1}-\theta_{0}\right\rangle-\alpha\left\langle v,g_{0}\right\rangle\sum_{h=1}^{p}\kappa^{p-h}-\alpha\sum_{h=1}^{p}\kappa^{p-h}\left\langle v,\hat{g}_{h}-q_{h}\right\rangle
=κp[βv,g^0uα1κpκ1v,g0dpαh=1pκhv,g^hghξpαh=1pκhv,ghqhδp].\displaystyle=\kappa^{p}\bigg{[}\beta\underbrace{\left\langle v,-\hat{g}_{0}\right\rangle}_{u}-\alpha\underbrace{\frac{1-\kappa^{-p}}{\kappa-1}\left\langle v,g_{0}\right\rangle}_{d_{p}}-\alpha\underbrace{\sum_{h=1}^{p}\kappa^{-h}\left\langle v,\hat{g}_{h}-g_{h}\right\rangle}_{\xi_{p}}-\alpha\underbrace{\sum_{h=1}^{p}\kappa^{-h}\left\langle v,g_{h}-q_{h}\right\rangle}_{\delta_{p}}\bigg{]}.

Therefore,

𝔼θp+1θ02\displaystyle\mathbb{E}\left\|\theta_{p+1}-\theta_{0}\right\|^{2}\geq 𝔼[v,θp+1θ02]=κ2p𝔼[(βuαdpαξpαδp)2]\displaystyle\mathbb{E}[\left\langle v,\theta_{p+1}-\theta_{0}\right\rangle^{2}]=\kappa^{2p}\mathbb{E}[\left(\beta u-\alpha d_{p}-\alpha\xi_{p}-\alpha\delta_{p}\right)^{2}] (4.29)
\displaystyle\geq κ2p(β2𝔼[u2]2αβ𝔼[udp]2αβ𝔼[uξp]2αβ𝔼[uδp]).\displaystyle\kappa^{2p}\left(\beta^{2}\mathbb{E}[u^{2}]-2\alpha\beta\mathbb{E}\left[ud_{p}\right]-2\alpha\beta\mathbb{E}\left[u\xi_{p}\right]-2\alpha\beta\mathbb{E}\left[u\delta_{p}\right]\right).

By the Cauchy-Schwarz inequality, Lemma 7 implies that

𝔼[u2]\displaystyle\mathbb{E}[u^{2}] =𝔼[(vTg^0)2]μσ22.\displaystyle=\mathbb{E}\left[(v^{\mathrm{T}}\hat{g}_{0})^{2}\right]\geq\mu\sigma_{2}^{2}. (4.30)

where μ=ηγ16n\mu=\frac{\eta\gamma}{16n}. Next, because dpd_{p} is deterministic, the term 𝔼[udp]\mathbb{E}\left[ud_{p}\right] can be bounded as

𝔼[udp]=dp𝔼[v,g^0]=\displaystyle\mathbb{E}\left[ud_{p}\right]=~{}-d_{p}\mathbb{E}[\left\langle v,\hat{g}_{0}\right\rangle]= dpv,g0+dp𝔼[v,g0g^0]\displaystyle~{}-d_{p}\left\langle v,g_{0}\right\rangle+d_{p}\mathbb{E}[\left\langle v,g_{0}-\hat{g}_{0}\right\rangle] (4.31)
\displaystyle\leq dp𝔼[v,g0g^0]lgBn,n0κ1,\displaystyle~{}d_{p}\mathbb{E}[\left\langle v,g_{0}-\hat{g}_{0}\right\rangle]\leq\frac{l_{g}B_{n,n_{0}}}{\kappa-1},

where the first inequality is due to dpv,g0=1κpκ1v,g020-d_{p}\left\langle v,g_{0}\right\rangle=-\frac{1-\kappa^{-p}}{\kappa-1}\left\langle v,g_{0}\right\rangle^{2}\leq 0, and the second inequality uses Theorem 1.

We next upper bound the term 𝔼[uξp]\mathbb{E}\left[u\xi_{p}\right] as follows.

𝔼[uξp]=\displaystyle\mathbb{E}\left[u\xi_{p}\right]= 𝔼[uh=1pκhv,g^hgh]=𝔼[uh=1pκhv,𝔼[g^h|θh]gh]\displaystyle~{}\mathbb{E}\left[u\sum_{h=1}^{p}\kappa^{-h}\left\langle v,\hat{g}_{h}-g_{h}\right\rangle\right]=\mathbb{E}\left[u\sum_{h=1}^{p}\kappa^{-h}\left\langle v,\mathbb{E}[\hat{g}_{h}|\theta_{h}]-g_{h}\right\rangle\right] (4.32)
\displaystyle\leq 𝔼[|u|h=1pκh𝔼[g^h|θh]gh]𝔼[lgh=1pκhBn,n0]lgBn,n0κ1.\displaystyle~{}\mathbb{E}\left[|u|\sum_{h=1}^{p}\kappa^{-h}\left\|\mathbb{E}[\hat{g}_{h}|\theta_{h}]-g_{h}\right\|\right]\leq\mathbb{E}\left[l_{g}\sum_{h=1}^{p}\kappa^{-h}B_{n,n_{0}}\right]\leq\frac{l_{g}B_{n,n_{0}}}{\kappa-1}.

where the second inequality is due to the Cauchy-Schwarz inequality, and in the last inequality we use Theorem 1.

Finally, we bound the term 𝔼[uδp]\mathbb{E}\left[u\delta_{p}\right]:

𝔼[uδp]=\displaystyle\mathbb{E}\left[u\delta_{p}\right]= 𝔼[uh=1pκhv,ghqh]𝔼[lgh=1pκhghqh]\displaystyle\mathbb{E}\left[u\sum_{h=1}^{p}\kappa^{-h}\left\langle v,g_{h}-q_{h}\right\rangle\right]\leq\mathbb{E}\left[l_{g}\sum_{h=1}^{p}\kappa^{-h}\left\|g_{h}-q_{h}\right\|\right] (4.33)
\displaystyle\leq lgρ2𝔼[h=1pκhθhθ02]lgρ2h=1pκh(A2(h1)+A3)\displaystyle\frac{l_{g}\rho}{2}\mathbb{E}\left[\sum_{h=1}^{p}\kappa^{-h}\left\|\theta_{h}-\theta_{0}\right\|^{2}\right]\leq\frac{l_{g}\rho}{2}\sum_{h=1}^{p}\kappa^{-h}\left(A_{2}(h-1)+A_{3}\right)
\displaystyle\leq lgρ2[A2(κ1)2+A3κ1],\displaystyle\frac{l_{g}\rho}{2}\left[\frac{A_{2}}{(\kappa-1)^{2}}+\frac{A_{3}}{\kappa-1}\right],

where the second inequality is due to (4.1), the third inequality uses (4.27), and the last inequality is because h=1pκh(h1)1(κ1)2\sum_{h=1}^{p}\kappa^{-h}(h-1)\leq\frac{1}{(\kappa-1)^{2}}.

Substituting the four inequalities (4.30), (4.31), (4.32) and (4.33) into (4.29), we obtain the lower bound as

𝔼θp+1θ02κ2p(β2𝔼[u2]2αβ𝔼[udp]2αβ𝔼[uξp]2αβ𝔼[uδp])\displaystyle~{}\mathbb{E}\left\|\theta_{p+1}-\theta_{0}\right\|^{2}\geq\kappa^{2p}\left(\beta^{2}\mathbb{E}[u^{2}]-2\alpha\beta\mathbb{E}\left[ud_{p}\right]-2\alpha\beta\mathbb{E}\left[u\xi_{p}\right]-2\alpha\beta\mathbb{E}\left[u\delta_{p}\right]\right)
κ2p(β2μσ222αβlgBn,n0κ12αβlgBn,n0κ12αβlgρ2[A2(κ1)2+A3κ1]).\displaystyle\geq\kappa^{2p}\left(\beta^{2}\mu\sigma_{2}^{2}-2\alpha\beta\frac{l_{g}B_{n,n_{0}}}{\kappa-1}-2\alpha\beta\frac{l_{g}B_{n,n_{0}}}{\kappa-1}-2\alpha\beta\frac{l_{g}\rho}{2}\left[\frac{A_{2}}{(\kappa-1)^{2}}+\frac{A_{3}}{\kappa-1}\right]\right).

According to our settings in Table 1, these conditions are satisfied:

𝒟μσ22λ0min{βλ0,1}192lgρ,Bn,n02μβσ22λ0min{βTλ0,1}384αT2lgρ,βμσ22λ02L384lgρLVn,n0.\displaystyle\mathcal{D}\leq\frac{\mu\sigma_{2}^{2}\lambda_{0}\cdot\min\{\beta\lambda_{0},1\}}{192l_{g}\rho},~{}B_{n,n_{0}}^{2}\leq\frac{\mu\beta\sigma_{2}^{2}\lambda_{0}\cdot\min\{\beta T\lambda_{0},1\}}{384\alpha T^{2}l_{g}\rho},~{}\beta\leq\frac{\mu\sigma_{2}^{2}\lambda_{0}^{2}L}{384l_{g}\rho LV_{n,n_{0}}}. (4.34)

It follows that

A2μαβσ22λ028lgρ,A3μβσ22λ08lgρ.A_{2}\leq\frac{\mu\alpha\beta\sigma_{2}^{2}\lambda_{0}^{2}}{8l_{g}\rho},\quad A_{3}\leq\frac{\mu\beta\sigma_{2}^{2}\lambda_{0}}{8l_{g}\rho}. (4.35)

Therefore, we can conclude that

𝔼θp+1θ02\displaystyle\mathbb{E}\left\|\theta_{p+1}-\theta_{0}\right\|^{2} κ2p(β2μσ2214β2μσ2214β2μσ2214β2μσ22)=14β2κ2pμσ22.\displaystyle\geq\kappa^{2p}\left(\beta^{2}\mu\sigma_{2}^{2}-\frac{1}{4}\beta^{2}\mu\sigma_{2}^{2}-\frac{1}{4}\beta^{2}\mu\sigma_{2}^{2}-\frac{1}{4}\beta^{2}\mu\sigma_{2}^{2}\right)=\frac{1}{4}\beta^{2}\kappa^{2p}\mu\sigma_{2}^{2}. (4.36)

In other word, (4.36) shows that the expected distance between θ0\theta_{0} and θp+1\theta_{p+1} grows at least exponentially for pp. Substituting p=T1p=T-1, it leads to a contradiction if the lower bound of 𝔼θTθ02\mathbb{E}\left\|\theta_{T}-\theta_{0}\right\|^{2} is greater than the upper bound, i.e.,

14β2κ2Tμσ22(T1)A2+A3.\frac{1}{4}\beta^{2}\kappa^{2T}\mu\sigma_{2}^{2}\geq(T-1)A_{2}+A_{3}. (4.37)

We choose

T=cαλ0log(LVn,n0nηαβσ2)T=\frac{c}{\alpha\lambda_{0}}\log\left(\frac{LV_{n,n_{0}}n}{\eta\alpha\beta\sigma_{2}}\right) (4.38)

in Table 1 with taking a sufficiently large cc and then (4.37) implies a contradiction. This completes the proof. ∎

Finally, we establish the main theorem in this section. It is shown that MCMC-SGD returns approximate second-order points or ϵ\epsilon-variance points with high probability. This result may explain the observation of convergence to eigenvalues in solving variational eigenvalue problems, where ϵ\epsilon-variance points means that an eigenvalue is obtained even if the objective function does not approach the global minimum. When an ϵ\epsilon-variance point is desired in this kind of problems, we should design better neural network architectures to reduce those meaningless second order stability points.

Theorem 3.

Let Assumptions 3 ,2 ,4 and 5 hold. For any δ(0,1)\delta\in(0,1), with the stepsizes (4.14) and parameters in Table 1, Algorithm 1 returns an (ϵ,ϵ1/4)(\epsilon,\epsilon^{1/4}) approximate second-order stationary point or an ϵ1/2\epsilon^{1/2}-variance point with probability at least 1δ1-\delta after the following steps

O(δ4ϵ11/2log2(1ϵδ)).O\left(\delta^{-4}\epsilon^{-11/2}\log^{2}\left(\frac{1}{\epsilon\delta}\right)\right).
Proof.

Suppose m\mathcal{E}_{m} is the event

m:={θ~m12},\mathcal{E}_{m}:=\left\{\tilde{\theta}_{m}\in\mathcal{R}_{1}\cup\mathcal{R}_{2}\right\},

and its complement is mc={θ~m3}\mathcal{E}_{m}^{c}=\left\{\tilde{\theta}_{m}\in\mathcal{R}_{3}\right\}. Let 𝒫m\mathcal{P}_{m} denote the probability of the occurrence of the event m\mathcal{E}_{m}.

When m\mathcal{E}_{m} occurs, by Lemmas 8 and 9, we have

𝔼[(θ~m)(θ~m+1)|m]𝒟.\mathbb{E}\left[\mathcal{L}(\tilde{\theta}_{m})-\mathcal{L}(\tilde{\theta}_{m+1})\Big{|}\mathcal{E}_{m}\right]\geq\mathcal{D}. (4.39)

On the other hand, when mc\mathcal{E}_{m}^{c} occurs, it follows from (4.19) that

2𝔼[(θ~m)(θ~m+1)|mc]2TαBn,n022β2LVn,n0δ𝒟,2\mathbb{E}\left[\mathcal{L}(\tilde{\theta}_{m})-\mathcal{L}(\tilde{\theta}_{m+1})\Big{|}\mathcal{E}_{m}^{c}\right]\geq-2T\alpha B_{n,n_{0}}^{2}-2\beta^{2}LV_{n,n_{0}}\geq-\delta\mathcal{D}, (4.40)

where the first inequality is by discarding positive terms in (4.19) and the second inequality is due to the choice 𝒟(2TαBn,n02+2β2LVn,n0)/δ\mathcal{D}\geq\left(2T\alpha B_{n,n_{0}}^{2}+2\beta^{2}LV_{n,n_{0}}\right)/\delta in Table 1. It means that the function value may increase by no more than δ𝒟/2\delta\mathcal{D}/2. When the expectation is taken overall, (4.39) and (4.40) imply that

𝔼[(θ~m)(θ~m+1)](1𝒫m)(δ𝒟2)+𝒫m𝒟.\mathbb{E}\left[\mathcal{L}(\tilde{\theta}_{m})-\mathcal{L}(\tilde{\theta}_{m+1})\right]\geq(1-\mathcal{P}_{m})\cdot\left(-\frac{\delta\mathcal{D}}{2}\right)+\mathcal{P}_{m}\cdot\mathcal{D}. (4.41)

Suppose Algorithm 1 runs for KK steps starting from θ0\theta_{0} and there are M=K/TM=K/T of θ~m\tilde{\theta}_{m}. Let \mathcal{L}^{*} be the global minimum of (θ)\mathcal{L}(\theta). Summing (4.41) for m=1,,Mm=1,\dots,M yields that

(θ0)δ𝒟M2+m=1M𝒫m𝒟1Mm=1M𝒫m(θ0)M𝒟+δ2δ,\mathcal{L}(\theta_{0})-\mathcal{L}^{*}\geq-\frac{\delta\mathcal{D}M}{2}+\sum_{m=1}^{M}\mathcal{P}_{m}\cdot\mathcal{D}\Rightarrow\frac{1}{M}\sum_{m=1}^{M}\mathcal{P}_{m}\leq\frac{\mathcal{L}(\theta_{0})-\mathcal{L}^{*}}{M\mathcal{D}}+\frac{\delta}{2}\leq\delta,

where the last inequality holds if KK satisfies

K2[(θ0)]Tδ𝒟=O(δ4ϵ11/2log2(1ϵδ)).K\geq\frac{2[\mathcal{L}(\theta_{0})-\mathcal{L}^{*}]T}{\delta\mathcal{D}}=O\left(\delta^{-4}\epsilon^{-11/2}\log^{2}\left(\frac{1}{\epsilon\delta}\right)\right). (4.42)

Hence, the probability of the event mc\mathcal{E}_{m}^{c} occurs can be bounded by

11Mm=1M𝒫m1δ.1-\frac{1}{M}\sum_{m=1}^{M}\mathcal{P}_{m}\geq 1-\delta.

This proves the statement in Theorem 3. ∎

5 Conclusions

In this paper, we explore the convergence properties of the SGD algorithm coupled with MCMC sampling. The upper bound of the bias and variance corresponding to the MCMC is estimated by concentration inequalities without resorting to the conventional bounded assumption. Then, we show that MCMC-SGD achieves first order stationary convergence based on the error analysis of MCMC. It has O(logKnK)O\left(\frac{\log K}{\sqrt{nK}}\right) convergence rate with a sufficiently large sample size nn after KK iterations. Moreover, we conduct a study on the second-order convergence properties of MCMC-SGD under reasonable assumptions. By verifying the CNC condition under errors, we discuss how MCMC-SGD escapes from saddle points and establish the second order convergence guarantee of O(ϵ11/2log2(1ϵ))O\left(\epsilon^{-11/2}\log^{2}\left(\frac{1}{\epsilon}\right)\right) with high probability. Our result explains the observation of convergence to eigenvalues in solving variational eigenvalue problems, thereby demonstrating the favorable second-order convergence properties of the MCMC-SGD algorithm.

There are some potential directions to be concerned for future research. (1) The relationship between non minimum eigenvalues in variational eigenvalue problems and optimization algorithms remains an area ripe for exploration. (2) Our convergence analysis suggests that improving the efficiency of sampling methods is of vital importance for the better performance of stochastic algorithms. (3) The convergence of the natural gradient method and the KFAC method with MCMC samples is also interesting.

References

  • [1] Nilin Abrahamsen, Zhiyan Ding, Gil Goldshlager, and Lin Lin. Convergence of stochastic gradient descent on parameterized sphere with applications to variational Monte carlo simulation. arXiv preprint arXiv:2303.11602, 2023.
  • [2] Ahmad Ajalloeian and Sebastian U Stich. On the convergence of SGD with biased gradients. arXiv preprint arXiv:2008.00051, 2020.
  • [3] Yves F Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms. The Journal of Machine Learning Research, 18(1):310–342, 2017.
  • [4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
  • [5] Z. Cai. Approximating quantum many-body wave-functions using artificial neural networks. Phys. Rev. B, 97:035116, 2018.
  • [6] G. Carleo and M. Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355:602–606, 2017.
  • [7] Kenny Choo, Giuseppe Carleo, Nicolas Regnault, and Titus Neupert. Symmetries and many-body excitations with neural-network quantum states. Physical review letters, 121(16):167204, 2018.
  • [8] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In International Conference on Machine Learning, pages 1155–1164. PMLR, 2018.
  • [9] D. Deng, X. Li, and S. Das Sarma. Quantum entanglement in neural network states. Phys. Rev. X, 7:021021, 2017.
  • [10] Persi Diaconis and Laurent Saloff-Coste. What do we know about the Metropolis algorithm? Journal of Computer and System Sciences, 57(1):20–36, 1998.
  • [11] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid monte carlo. Physics letters B, 195(2):216–222, 1987.
  • [12] John C Duchi, Alekh Agarwal, Mikael Johansson, and Michael I Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012.
  • [13] Jianqing Fan, Bai Jiang, and Qiang Sun. Hoeffding’s inequality for general Markov chains and its applications to statistical learning. Journal of Machine Learning Research, 22(139):1–35, 2021.
  • [14] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on learning theory, pages 797–842. PMLR, 2015.
  • [15] I. Glasser, N. Pancotti, M. August, I. Rodriguez, and J. Cirac. Neural networks quantum states, string-bond states and chiral topological states. Phys. Rev. X, 8:011006, 2018.
  • [16] WK HASTINGS. Monte carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • [17] Jan Hermann, Zeno Schätzle, and Frank Noé. Deep-neural-network solution of the electronic Schrödinger equation. Nature Chemistry, 12(10):891–897, 2020.
  • [18] Robert Jastrow. Many-body problem with strong forces. Physical Review, 98(5):1479, 1955.
  • [19] Bai Jiang, Qiang Sun, and Jianqing Fan. Bernstein’s inequality for general Markov chains. arXiv preprint arXiv:1805.10721, 2018.
  • [20] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pages 1724–1732. PMLR, 2017.
  • [21] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37:183–233, 1999.
  • [22] Belhal Karimi, Blazej Miasojedow, Éric Moulines, and Hoi-To Wai. Non-asymptotic analysis of biased stochastic approximation scheme. Proceedings of Machine Learning Research vol XX, 1:33, 2019.
  • [23] Raphael Kaubruegger, Lorenzo Pastori, and Jan Carl Budich. Chiral topological phases from artificial neural networks. Physical Review B, 97(19):195136, 2018.
  • [24] Jörg Kienitz. Convergence of Markov chains via analytic and isoperimetric inequalities. PhD thesis, Bielefeld University, 2000.
  • [25] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [26] Yingzhen Li, Richard E Turner, and Qiang Liu. Approximate inference with amortised mcmc. arXiv preprint arXiv:1702.08343, 2017.
  • [27] Xiao Liang, Wen-Yuan Liu, Pei-Ze Lin, Guang-Can Guo, Yong-Sheng Zhang, and Lixin He. Solving frustrated quantum many-particle models with convolutional neural networks. Physical Review B, 98(10):104426, 2018.
  • [28] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. Advances in Neural Information Processing Systems, 30, 2017.
  • [29] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
  • [30] Błażej Miasojedow. Hoeffding’s inequalities for geometrically ergodic markov chains on general state space. Statistics & Probability Letters, 87:115–120, 2014.
  • [31] Laurent Miclo and Cyril Roberto. Trous spectraux pour certains algorithmes de Métropolis sur r. Séminaire de Probabilités XXXIV, pages 336–352, 2000.
  • [32] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016.
  • [33] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011.
  • [34] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Advances in neural information processing systems, 27, 2014.
  • [35] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, Louvain-la-Neuve, 2003.
  • [36] Y. Nomura, A. Darmawan, Y. Yamajji, and M. Imada. Restricted Boltzmann machine learning for solving strongly correlated quantum systems. Phys. Rev. B, 96:205152, 2017.
  • [37] Daniel Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20:1–32, 2015.
  • [38] David Pfau, James S Spencer, Alexander GDG Matthews, and W Matthew C Foulkes. Ab initio solution of the many-electron Schrödinger equation with deep neural networks. Physical Review Research, 2(3):033429, 2020.
  • [39] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial intelligence and statistics, pages 814–822. PMLR, 2014.
  • [40] A. Rocchetto, E. Grant, S. Strelchuk, G. Carleo, and S. Severini. Learning hard quantum distributions with variational autoencoders. Quantum Inf., 4:28, 2018.
  • [41] H. Saito and M. Kato. Machine learning technique to find quantum many-body ground states of Bosons on a lattice. J. Phys. Soc. Jpn., 87:014001, 2017.
  • [42] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. In International conference on machine learning, pages 1218–1226. PMLR, 2015.
  • [43] Laurent Saloff-Coste. Lectures on finite Markov chains. Lectures on probability theory and statistics, pages 301–413, 1997.
  • [44] Tao Sun, Yuejiao Sun, and Wotao Yin. On Markov chain gradient descent. Advances in neural information processing systems, 31, 2018.
  • [45] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688. Citeseer, 2011.
  • [46] Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
  • [47] Stephen Wright, Jorge Nocedal, et al. Numerical optimization. Springer Science, 35(67-68):7, 1999.