Convergence of variational Monte Carlo simulation and scale-invariant pre-training

Nilin Abrahamsen [email protected] Zhiyan Ding [email protected] Gil Goldshlager [email protected] Lin Lin [email protected] The Simons Institute for the Theory of Computing, Berkeley, CA 94720 USA Department of Mathematics, University of California, Berkeley, CA 94720 USA Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Abstract

We provide theoretical convergence bounds for the variational Monte Carlo (VMC) method as applied to optimize neural network wave functions for the electronic structure problem. We study both the energy minimization phase and the supervised pre-training phase that is commonly used prior to energy minimization. For the energy minimization phase, the standard algorithm is scale-invariant by design, and we provide a proof of convergence for this algorithm without modifications. The pre-training stage typically does not feature such scale-invariance. We propose using a scale-invariant loss for the pre-training phase and demonstrate empirically that it leads to faster pre-training.

1 Introduction

A central goal of computational physics and chemistry is to find a ground state which minimizes the energy of a system. Electronic structure theory studies the behavior of electrons that govern the chemical properties of molecules and materials. The energy in the electronic structure is a functional defined on the set of quantum wave functions $\psi$, which are normalized $\mathscr{L}^2$-integrable functions determined up to an arbitrary scalar multiplication factor, known as a phase. Equivalently, the wave function can be viewed as an element of the projective space rather than a normalized function.

The variational Monte Carlo (VMC) algorithm [1, 2, 3] is a powerful method for optimizing the energy of a parameterized wave function through stochastic gradient estimates. The method relies on Monte Carlo samples from the probability distribution defined by the Born rule, which views the normalized squared wave function as a probability density function.

Historically, the number of parameters used in VMC simulations was relatively small, and methods such as stochastic reconfiguration [4, 5] and the linear method [6, 7] were preferred for parameter optimization. Recently, neural networks have been used to parameterize the wave function in an approach known as neural quantum states [8]. Subsequent work has shown that VMC with neural quantum states can match or exceed state-of-the-art high-accuracy quantum simulation methods for a variety of electronic structure problems, including molecules in both first quantization [9, 10, 11, 12] and second quantization [13, 14], electron gas [15, 16, 17], and solids [18].

Due to the complexity of quantum wave functions parameterized by neural networks, it is in general not possible to explicitly normalize the parameterized wave function. One of the essential properties of the VMC procedure is thus its ability to use an unnormalized function $\psi_\theta$ to represent the corresponding element in the projective space. This is achieved by pulling back the energy functional to be minimized from the projective space to a scale-invariant functional on the space of unnormalized wave functions. This scale-invariant functional is the same as the Rayleigh quotient $\bra{\psi}H\ket{\psi}/\braket{\psi}{\psi}$, where the Hermitian operator $H$ is known as the Hamiltonian. The scale-invariance of the energy functional is reflected in the VMC algorithm, where a scale-invariant local energy function corresponding to the wave function is evaluated at MCMC-sampled electron configurations. For the ground state problem considered here it suffices to restrict to real-valued wave functions, so we will take all scalars to be real.

The wave function in the VMC algorithm is often initialized using a supervised pre-training stage [11] where it is fitted to a less expressive approximation to the ground state, for example given as a Slater determinant. In contrast to the main VMC algorithm which is trained with an inherently scale-invariant objective, this supervised pre-training typically does not encode the scale-invariant property [11, 19]. We propose using a scale-invariant loss function for this supervised stage and demonstrate numerically that this modification leads to faster convergence towards the target in the supervised pre-training setting. We further give a theoretical convergence bound for the proposed scale-invariant supervised optimization algorithm. As an important ingredient in this analysis, we introduce the concept of a directionally unbiased stochastic estimator for the gradient.

1.1 Related works

Using scale-invariance in the parameter vector to stabilize SGD training is a well-studied technique. Many structures and methods have been proposed along these lines, such as normalized NN [20, 21, 22], $\mathcal{G}$-SGD [23], SGD+WD [24, 25, 26, 27], and projected/normalized (S)GD [28, 29]. In these works, the parameter $\mathbf{v}$ is stored explicitly and the loss function $\mathcal{L}$ is scale-invariant in the parameter $\mathbf{v}$, which means $\mathcal{L}(\lambda\mathbf{v})=\mathcal{L}(\mathbf{v})$ for any $\lambda>0$.

The setting of neural wave functions differs from typical scale-invariant loss functions in that the scale-invariant functional $\mathcal{L}$ is applied after a nonlinear parameterization $\theta\mapsto\psi_\theta$, and the function $\psi_\theta$ can be queried but not accessed as an explicit vector. The composed loss function $L(\theta)=\mathcal{L}(\psi_\theta)$ is not scale-invariant in its parameters. We can obtain a relation between the two settings by applying the chain rule involving a functional derivative, but the resulting expression involves factors that we need to estimate by sampling, in contrast with the typical setting of scale-invariance in the parameter space.

1.1.1 SGD on Riemannian manifolds

The training of a scale-invariant objective can be seen as a training process on projective space, which is a special case of learning on Riemannian manifolds. The convergence of SGD on Riemannian manifolds is well-studied; for example, [30] shows the asymptotic convergence of SGD on Riemannian manifolds. Further variations of the SGD algorithm on Riemannian manifolds have been proposed and studied, such as SVRG on Riemannian manifolds [31, 32] and ASGD on Riemannian manifolds [33]. To the best of our knowledge, all previous works ensure the parameter stays on the manifold by either taking stochastic gradient steps on the manifold or projecting them back onto it. This is in contrast to our setting where no projection onto a submanifold is needed.

1.1.2 Concurrent theoretical analysis

The concurrent work of [34] proves a similar convergence result for VMC and also takes into account the effect of Markov chain Monte Carlo sampling. However, this work does not consider the setting of supervised pre-training.

2 Variational Monte Carlo

A quantum wave function is a square-integrable function $\psi$ on a space of configurations $\Omega$. A typical configuration space is $\Omega=\mathbb{R}^{3N}$ for a system of $N$ electrons (the spin degrees of freedom are considered as classical and are expressed in the symmetry type of $\psi_\theta$). The term "square-integrable function" refers to a function $\psi$ with a finite $\mathscr{L}^2$-norm $\|\psi\|=\sqrt{\int_\Omega|\psi|^2\,\mathrm{d}x}<\infty$ with respect to the Lebesgue measure. For any scalar $\lambda\in\mathbb{R}$ with $\lambda\neq 0$, $\lambda\psi$ represents the same physical state as $\psi$. In the variational Monte Carlo algorithm, $\psi_\theta$ is an Ansatz parameterized by $\theta$, and the objective is to find the parameter vector $\theta$ such that $\psi_\theta$ minimizes the energy

\mathcal{L}(\psi)=\frac{\bra{\psi}H\ket{\psi}}{\braket{\psi}{\psi}}.

The Born rule associates to a quantum wave function $\psi$ a probability distribution on $\Omega$ with density $p_\psi$ given by

p_\psi(x)=|\psi(x)|^2/\|\psi\|^2\,. (1)

The norm $\|\psi\|$ is unknown to the optimization algorithm, and samples $\{X_i\}\sim p_\psi$ are generated using Markov chain Monte Carlo (MCMC). We will assume that these samples are independent and sampled from the exact distribution $p_\psi$. In practice, the independence assumption is tested by evaluating the autocorrelation of the MCMC samples.
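To make the sampling step concrete, the following is a minimal sketch (our illustration, not code from the paper) of a random-walk Metropolis chain targeting $p_\psi$. The key point is that the acceptance ratio involves only the unnormalized wave function, so the norm $\|\psi\|$ is never needed; `psi` is a placeholder callable standing in for the Ansatz.

import numpy as np

def metropolis_samples(psi, x0, n_samples, step=0.2, burn_in=500, thin=10, seed=0):
    """Draw approximate samples from p(x) = |psi(x)|^2 / ||psi||^2."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = []
    for t in range(burn_in + n_samples * thin):
        proposal = x + step * rng.standard_normal(x.shape)
        # Accept with probability min(1, |psi(x')|^2 / |psi(x)|^2);
        # the unknown normalization ||psi||^2 cancels in this ratio.
        if rng.random() < (psi(proposal) ** 2) / (psi(x) ** 2):
            x = proposal
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append(x.copy())
    return np.array(samples)

# Example: psi(x) = exp(-x^2/2), so p_psi is the standard normal density.
xs = metropolis_samples(lambda x: np.exp(-0.5 * np.sum(x * x)), x0=np.zeros(1), n_samples=1000)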

The energy of $\psi$ can then be expressed as an expectation with respect to $p_\psi$: $\mathcal{L}(\psi)=\mathbb{E}_{X\sim p_\psi}\,\mathcal{E}_\psi(X)$, where

\mathcal{E}_\psi(x)=\frac{(H\psi)(x)}{\psi(x)} (2)

is called the local energy. We note that the local energy is not well-defined in the case where $\psi(x)=0$, but this is not a practical issue since $p_\psi(x)=0$ by definition at such points and thus the expectation value is still well-defined.
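As a concrete illustration (ours, under the assumption of a one-dimensional Hamiltonian $H=-\frac{1}{2}\frac{d^2}{dx^2}+V(x)$), the local energy can be evaluated pointwise with a finite-difference Laplacian; note that it is unchanged if $\psi$ is multiplied by a constant, reflecting the scale-invariance discussed above.

import numpy as np

def local_energy(psi, V, x, h=1e-4):
    # E_psi(x) = (H psi)(x) / psi(x) with H = -1/2 d^2/dx^2 + V(x);
    # the Laplacian is approximated by a central finite difference.
    lap = (psi(x + h) - 2.0 * psi(x) + psi(x - h)) / h**2
    return -0.5 * lap / psi(x) + V(x)

# Check: psi(x) = exp(-x^2/2) is the exact ground state of the harmonic
# oscillator V(x) = x^2/2, so the local energy is the constant 1/2.
psi = lambda x: np.exp(-0.5 * x**2)
V = lambda x: 0.5 * x**2
print(local_energy(psi, V, np.linspace(-2.0, 2.0, 5)))  # ~[0.5 0.5 0.5 0.5 0.5]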

We use $L(\theta)=\mathcal{L}(\psi_\theta)$ to denote the loss as a function of the parameters.

Problem 2.1 (VMC).

Minimize

L(\theta)=\mathbb{E}_{X\sim p_\theta}\,\mathcal{E}_\theta(X), (3)

over parameter vectors $\theta\in\mathbb{R}^d$, where $\mathcal{E}_\theta(x)=H\psi_\theta(x)/\psi_\theta(x)$ and $p_\theta(x)=|\psi_\theta(x)|^2/\|\psi_\theta\|^2$. We assume access to samples $\{X_i\}\sim p_\theta$ and query access to $\psi_\theta$, $\nabla_\theta\psi_\theta$, and $\mathcal{E}_\theta$, which are assumed to be $C^\infty$ with respect to $\theta$ for any $x\in\Omega$.

When $\|\psi_\theta\|<\infty$, the gradient of the VMC energy functional takes the form (see, e.g., equation (9) of [11])

\nabla_\theta L(\theta)=2\,\mathbb{E}_{X\sim p_\theta}\Big[\big(\mathcal{E}_\theta(X)-L(\theta)\big)\nabla_\theta\log|\psi_\theta(X)|\Big]. (4)

Notably, the change in local energy does not contribute to this formula, that is, $\mathbb{E}_{X\sim p_\theta}[\nabla_\theta\mathcal{E}_\theta(X)]=0$. Instead, the gradient of $L$ is entirely due to the perturbation of the sampling distribution as $\theta$ varies. For completeness, we include the derivation of Equation 4 in Appendix A. The condition that $\psi_\theta$ is $C^\infty$ with respect to $\theta$ is typically satisfied by using the logistic sigmoid activation function.


2.1 Relation to the policy gradient method

In the reinforcement learning framework [polgrad], an agent decides on an action $a$ with probability $\pi_\theta(a|s)$, where $s$ is the state of the environment and $\pi_\theta$ is the policy, parameterized by $\theta$. Let $J(\pi)$ be the expected reward $R(\tau)$ of a trajectory sampled under the policy $\pi$. The policy gradient theorem [polgrad] exhibits an expression for the gradient of the reward:

Theorem 2.1 ([polgrad]).

The gradient of the expected reward is

\nabla_\theta J(\pi_\theta)=\sum_{t=1}^T\mathbb{E}_\pi\big[R(\tau)\nabla_\theta\log\pi_\theta(a_t|s_t)\big], (5)

where $R(\tau)$ is the reward of trajectory $\tau$, and $s_t,a_t$ are the states and actions in the trajectory.

To see the connection to the gradient estimate in VMC, it suffices to consider the policy gradient formula with a single timestep. Then we can ignore the state $s$ and the expected reward becomes $J(\pi)=\mathbb{E}[Q(a)]$, where $Q(a)$ is the value of action $a$. Equation 5 then states that

\nabla_\theta J(\pi_\theta)=\mathbb{E}\big[Q(a)\nabla_\theta\log\pi_\theta(a)\big].
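A small numerical sketch (ours) of this single-timestep identity for a softmax policy over three actions: the Monte Carlo estimate of $\mathbb{E}[Q(a)\nabla_\theta\log\pi_\theta(a)]$ is compared against the exact gradient, mirroring how the VMC estimator approximates Equation 4.

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.1, -0.3, 0.2])   # softmax logits (the policy parameters)
Q = np.array([1.0, 2.0, 0.5])        # fixed value Q(a) of each action
pi = np.exp(theta) / np.exp(theta).sum()

# Exact gradient: for a softmax policy, grad_theta log pi(a) = e_a - pi,
# so grad J = sum_a pi(a) Q(a) (e_a - pi).
exact = pi * Q - (pi @ Q) * pi

# Monte Carlo estimate of E[Q(a) grad_theta log pi(a)] from sampled actions.
actions = rng.choice(3, size=100_000, p=pi)
onehot = np.eye(3)[actions]
estimate = (Q[actions][:, None] * (onehot - pi)).mean(axis=0)

print(exact, estimate)  # the two gradients agree up to sampling noise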

3 Supervised Pre-training

In order to stabilize the VMC training, the neural quantum state $\psi_\theta$ in FermiNet [11] and subsequent works [35, 36, 37] is initialized by fitting the Ansatz to an initial guess $\varphi$ at an approximate ground state [11, 19]. This target initial state can be taken to be a Slater determinant optimized using the self-consistent field (SCF) method [38]. We therefore consider the problem of minimizing the distance from the line $\{\lambda\psi\,|\,\lambda\in\mathbb{R}\}$ to a target function $\varphi\in\mathscr{L}^2(\Omega)$.

That is,

\mathcal{L}(\psi)=\min_{\lambda\in\mathbb{R}}\|\lambda\psi-\varphi\|_\rho^2\,, (6)

where $\|\phi\|_\rho=\sqrt{\int_\Omega|\phi(x)|^2\,\mathrm{d}\rho(x)}$ is the norm in the $\mathscr{L}^2$ sense on a probability space $(\Omega,\rho)$ with probability measure $\rho$. In practice the pre-training may fit intermediate vector-valued features such as orbitals [19] instead of the scalar-valued wave function. As we note in Section 5.1, this is formally equivalent to fitting a scalar wave function. Moreover, as we show in Section 5.1, exploiting scale-invariance when fitting the orbitals can improve the convergence of the wave function itself as measured by Equation 6.

Here, $\rho$ is typically the Lebesgue measure, but in Section 5.1 we will also consider a case where it is defined by the target function. This loss function is scale-invariant by construction and has a closed-form expression

\mathcal{L}(\psi)=\|\varphi\|_\rho^2-\big(|\langle\varphi,\psi\rangle_\rho|/\|\psi\|_\rho\big)^2\,. (7)

Note that for a normalized target $\varphi$, Equation 7 gives the squared sine of the angle between the vectors, $\mathcal{L}(\psi)=\sin^2\angle(\varphi,\psi)$. Minimizing $\mathcal{L}(\psi)$ is equivalent to minimizing $\mathcal{L}(\psi)=-\langle\varphi,\psi\rangle_\rho/\|\psi\|_\rho$. Again we write the loss as $L(\theta)=\mathcal{L}(\psi_\theta)$ as a function of the variational parameters:

Problem 3.1 (Supervised learning pre-training).

Given a target function $\varphi\in\mathscr{L}^2(\Omega,\rho)$, access to samples $\{X_i\}\sim\rho$, and query access to $\varphi$, $\psi_\theta$, and $\nabla_\theta\psi_\theta$, minimize

L(\theta)=-\frac{\langle\varphi,\psi_\theta\rangle_\rho}{\|\psi_\theta\|_\rho} (8)

over parameter vectors $\theta\in\mathbb{R}^d$.
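A finite-dimensional sketch (ours) of the closed-form loss in Equation 7: with explicit vectors in place of functions and a normalized target, the loss equals $\sin^2\angle(\varphi,\psi)$ and is unchanged when $\psi$ is rescaled.

import numpy as np

def scale_invariant_loss(psi, phi):
    # L(psi) = ||phi||^2 - (<phi, psi> / ||psi||)^2, cf. Equation 7.
    return phi @ phi - (phi @ psi / np.linalg.norm(psi)) ** 2

rng = np.random.default_rng(1)
phi = rng.standard_normal(10)
phi /= np.linalg.norm(phi)          # normalized target
psi = rng.standard_normal(10)

print(scale_invariant_loss(psi, phi))        # sin^2 of the angle(phi, psi)
print(scale_invariant_loss(7.3 * psi, phi))  # unchanged by rescaling psi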

3.1 Directionally unbiased gradient estimator

By applying the chain rule we can write the gradient of $L(\theta)$ as

\nabla_\theta L(\theta)=[\partial_\psi\mathcal{L}](\psi_\theta)\cdot\nabla_\theta\psi_\theta, (9)

where $[\partial_\psi\mathcal{L}](\psi_\theta)$ is the functional derivative of the scale-invariant loss. This can be evaluated analogously to the gradient derived in [28] for a scale-invariant loss function. The resulting gradient expression is

\nabla_\theta L(\theta)=-\frac{\braket{\varphi,\nabla_\theta\psi_\theta}_\rho}{\|\psi_\theta\|_\rho}+\frac{\braket{\varphi,\psi_\theta}_\rho\braket{\psi_\theta,\nabla_\theta\psi_\theta}_\rho}{\|\psi_\theta\|_\rho^3}\,. (10)

We cannot compute the infinite-dimensional $\mathscr{L}^2$ norms in the denominator of Equation 10 and do not have access to an unbiased estimator for Equation 10. However, in Lemma 4.5, we propose a directionally unbiased gradient estimator $G$, which means that $\mathbb{E}[G]$ is in the positive span of $\nabla_\theta L$. Additionally, we will demonstrate that the directionally unbiased gradient estimator is sufficient for achieving rapid convergence of SGD as long as the norm $\|\psi_\theta\|_\rho$ can be approximately estimated; see Corollary 4.10 for details.

4 Theoretical results

4.1 Convergence result for VMC

In the VMC setting, we assume that the MCMC subroutine is able to sample exactly from the distribution $p_\theta(x)=|\psi_\theta(x)|^2/\|\psi_\theta\|^2$. This sampling oracle makes it possible to construct a cheap and unbiased gradient estimate for the energy functional (3). For simplicity we use exact real-valued arithmetic.

From the formula (4) for the gradient, we derive the following unbiased estimator of $\nabla_\theta L(\theta)$:

Lemma 4.1.

Given $n\geq 2$ i.i.d. samples $\{X_i\}\sim p_\theta$, let $\hat{L}(\theta)=\frac{1}{n}\sum_{i=1}^n\mathcal{E}_\theta(X_i)$ and define the gradient estimate

G(\theta,X)=\frac{2}{n-1}\sum_{i=1}^n\big(\mathcal{E}_\theta(X_i)-\hat{L}(\theta)\big)\nabla_\theta\log|\psi_\theta(X_i)|. (11)

Then $G(\theta,X)$ is an unbiased estimator of the gradient of the population energy.

We prove Lemma 4.1 in Appendix A.

Using the unbiased gradient estimator (11) from Lemma 4.1, the VMC algorithm can be formulated as a variant of SGD:

Algorithm 1 Variational Monte Carlo

1:  Preparation: number of iterations $M$; learning rates $\eta_m$; initial parameter $\theta_0$; number of samples per iteration $n$;
2:  Running:
3:  $m\leftarrow 0$;
4:  while $m\leq M$ do
5:     Sample $\{X_i^m\}_{i=1}^n$ independently according to the density $p_{\theta_m}\propto|\psi_{\theta_m}|^2$;
6:     $G_m\leftarrow G(\theta_m,X^m)=\frac{2}{n-1}\sum_{i=1}^n\big(\mathcal{E}_{\theta_m}(X_i^m)-\hat{L}(\theta_m)\big)\nabla_\theta\log|\psi_{\theta_m}(X_i^m)|$;
7:     $\theta_{m+1}\leftarrow\theta_m-\eta_m G_m$; $m\leftarrow m+1$;
8:  end while
9:  Output: $\theta_M$;
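The following toy example (ours, not the FermiNet code) runs Algorithm 1 end to end for the one-dimensional harmonic oscillator $H=-\frac{1}{2}\frac{d^2}{dx^2}+\frac{1}{2}x^2$ with the one-parameter Ansatz $\psi_\theta(x)=e^{-\theta x^2/2}$. Exact Gaussian sampling stands in for MCMC, and the update uses the unbiased estimator (11); the iterates approach the exact ground-state parameter $\theta=1$.

import numpy as np

rng = np.random.default_rng(0)

def local_energy(theta, x):
    # E_theta(x) = (H psi_theta)(x) / psi_theta(x), in closed form here.
    return 0.5 * theta + 0.5 * (1.0 - theta**2) * x**2

def grad_log_psi(theta, x):
    # d/dtheta log|psi_theta(x)| = -x^2/2.
    return -0.5 * x**2

theta, n, eta = 0.5, 64, 0.02
for m in range(2000):
    # |psi_theta|^2 is a Gaussian density with variance 1/(2 theta).
    x = rng.normal(0.0, 1.0 / np.sqrt(2.0 * theta), size=n)
    e = local_energy(theta, x)
    g = grad_log_psi(theta, x)
    G = (2.0 / (n - 1)) * np.sum((e - e.mean()) * g)  # estimator (11)
    theta -= eta * G

print(theta)  # -> close to 1.0, the exact ground-state parameter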

Given any $\theta$, we define $\|\cdot\|_{q,p_\theta}$ as the $q$-norm under the measure $p_\theta(x)\,\mathrm{d}x$. The convergence bound for the VMC algorithm assumes a uniform bound on the following quantities:

Condition 4.2.

There exists a constant $C_\psi$ such that for any $\theta\in\mathbb{R}^d$,

  • 1. (Value bound):

    \|H\psi_\theta/\psi_\theta\|_{4,p_\theta}\leq C_\psi\,. (12)

  • 2. (Gradient bound):

    \|\nabla_\theta H\psi_\theta/\psi_\theta\|_{2,p_\theta}\leq C_\psi\,, (13)
    \|\nabla_\theta\psi_\theta/\psi_\theta\|_{4,p_\theta}\leq C_\psi\,.

  • 3. (Hessian bound):

    \|\mathsf{H}_\theta\psi_\theta/\psi_\theta\|_{2,p_\theta}\leq C_\psi\,, (14)

    where $\mathsf{H}_\theta\psi_\theta$ is the Hessian of $\psi_\theta$ in $\theta$.

Condition 4.2 is scale-invariant in $\psi_\theta$, i.e., if $\psi_\theta(x)$ satisfies Condition 4.2, then $\lambda\psi_\theta(x)$ also satisfies the condition with the same constant $C_\psi$ for any $\lambda\neq 0$. We establish new bounds on the Lipschitz constant of $\nabla L$ and the variance of the gradient estimator (see Lemma C.1), subject to the constraints specified in Condition 4.2. These bounds are crucial in demonstrating the convergence of our method.

Now, we are ready to give the convergence result for Algorithm 1:

Theorem 4.3.

Under Condition 4.2 there exists a constant $C>0$ that only depends on $C_\psi$ such that, when $\eta_m<\frac{1}{C}$ for any $m\geq 0$,

\sum_{m=0}^M\eta_m\,\mathbb{E}\big(|\nabla_\theta L(\theta_m)|^2\big)\leq 2L(\theta_0)+C\sum_{m=0}^M\frac{\eta_m^2}{n} (15)

for any $M>0$.

The proof of the above theorem is given in Appendix C. The upper bound (15) allows us to choose $\eta_m$ so that the parameter $\theta_m$ converges to a first-order stationary point in the expectation sense when $m$ is moderately large. Theorem 4.3 implies:

Corollary 4.4.

Under Condition 4.2:

  • 1. Choosing $\eta_m=\Theta(n\epsilon^2)$ and $M=\Theta(1/(n\epsilon^4))$, we have

    \min_{0\leq m\leq M}\mathbb{E}\big(|\nabla_\theta L(\theta_m)|\big)\leq\epsilon\,.

  • 2. Choosing $\eta_m=\Theta\big(\sqrt{\frac{n}{m+1}}\big)$, we have

    \min_{0\leq m\leq M}\mathbb{E}\big(|\nabla_\theta L(\theta_m)|\big)=O\big((nM)^{-1/4}\big)\,.

This bound shows that the convergence of Algorithm 1 matches the convergence rate of SGD to a first-order stationary point for nonconvex functions, see [39, Theorem 2] or [40, Theorem 5.2.1].

4.2 Scale-invariant supervised pre-training

The following lemma shows how we can choose a directionally unbiased gradient estimator. Given samples $\{X_i\}$, write $\langle\psi,\varphi\rangle_n=\frac{1}{n}\sum_{i=1}^n\psi(X_i)\varphi(X_i)$ and $\|\psi\|_n^2=\frac{1}{n}\sum_{i=1}^n\psi(X_i)^2$.

Lemma 4.5.

Given $n\geq 2$ and i.i.d. samples $X_1,\ldots,X_n\sim\rho$, define coefficients

a_j=-\|\psi_\theta\|_n^2\,\varphi(X_j)+\langle\varphi,\psi_\theta\rangle_n\,\psi_\theta(X_j)

and let

G=\frac{1}{\|\psi_\theta\|_\rho^3}\frac{1}{n-1}\sum_{j=1}^n a_j\nabla_\theta\psi_\theta(X_j)\,. (16)

Then $\mathbb{E}_X(G)=\nabla_\theta L(\theta)$.

We give the derivation of Lemma 4.5 in Appendix B. Given an estimator $\tilde{Z}$ for $\|\psi_\theta\|_\rho$ which is independent of $X_1,\ldots,X_n$, we then choose the following as our directionally unbiased gradient estimator:

G(\theta,X)=\frac{1}{\tilde{Z}^3}\frac{1}{n-1}\sum_{j=1}^n a_j\nabla_\theta\psi_\theta(X_j)\,, (17)

where $\{a_j\}_{j=1}^n$ are defined in Lemma 4.5.

Remark 4.6.

To obtain the independent norm estimate $\tilde{Z}$ we can take $2n$ samples $X_1,\ldots,X_{2n}$ at each iteration and let $\tilde{Z}=(\frac{1}{n}\sum_{i=n+1}^{2n}|\psi_\theta(X_i)|^2)^{1/2}$ in (17). However, our numerics in Section 5.1 suggest it is sufficient in practice to estimate $\|\psi_\theta\|_\rho$ as $\tilde{Z}=(\frac{1}{n}\sum_{i=1}^n|\psi_\theta(X_i)|^2)^{1/2}$ without sampling an additional independent minibatch.
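A sketch (ours) of the estimator (17) with the split-batch norm estimate of Remark 4.6; `psi`, `grad_psi`, and `phi` are placeholder callables for the Ansatz, its parameter gradient, and the target.

import numpy as np

def pretraining_gradient(psi, grad_psi, phi, X, X_norm):
    """Directionally unbiased estimator (17) on a minibatch X, with the norm
    ||psi_theta||_rho estimated from an independent minibatch X_norm."""
    n = len(X)
    psi_x = np.array([psi(x) for x in X])         # psi_theta(X_i)
    phi_x = np.array([phi(x) for x in X])         # phi(X_i)
    gpsi_x = np.array([grad_psi(x) for x in X])   # grad_theta psi_theta(X_i), shape (n, d)
    ip = np.mean(phi_x * psi_x)                   # <phi, psi_theta>_n
    sq = np.mean(psi_x**2)                        # ||psi_theta||_n^2
    a = -sq * phi_x + ip * psi_x                  # coefficients a_j of Lemma 4.5
    Z = np.sqrt(np.mean([psi(x)**2 for x in X_norm]))  # independent estimate of ||psi_theta||_rho
    return (a @ gpsi_x) / ((n - 1) * Z**3)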

Remark 4.7.

Some care must be taken when constructing a plug-in estimator for the gradient. For example, given an estimate $\tilde{Z}$ of $\|\psi_\theta\|_\rho$ and samples $X_1,X_2\sim\rho$, Equation 10 suggests a plug-in estimate of $\nabla_\theta L(\theta)$ given by

-\frac{\varphi(X_1)\nabla_\theta\psi(X_1)}{\tilde{Z}}+\frac{\varphi(X_2)\psi(X_2)\psi(X_1)\nabla_\theta\psi(X_1)}{\tilde{Z}^3}\,. (18)

However, this estimator is not directionally unbiased and does not achieve the convergence shown in this paper. This is due to the fact that it is unbalanced in the sense that the exponent of $\tilde{Z}$ is different for each term. More specifically, taking the expectation of (18) gives

-\frac{\langle\varphi,\nabla\psi_\theta\rangle}{\tilde{Z}}+\frac{\langle\varphi,\psi_\theta\rangle\langle\psi_\theta,\nabla\psi_\theta\rangle}{\tilde{Z}^3}\neq\lambda\nabla_\theta L(\theta),\quad\forall\lambda\in\mathbb{R}\,.

The detailed algorithm is summarized in Algorithm 2.

Algorithm 2 Stochastic gradient descent algorithm for supervised learning

1:  Preparation: target $\varphi$; Ansatz $\psi_\theta(x)$; number of iterations $M$; learning rates $\eta_m$; initial parameter $\theta_0$; number of samples per iteration $n$;
2:  Running:
3:  $m\leftarrow 0$;
4:  while $m\leq M$ do
5:     Sample $\{X_i^m\}_{i=1}^n$ independently from $\rho$;
6:     $\widetilde{Z}_m\leftarrow$ an estimate of $\|\psi_{\theta_m}\|_\rho$;
7:     $a_i\leftarrow-\|\psi_{\theta_m}\|_n^2\,\varphi(X_i^m)+\langle\varphi,\psi_{\theta_m}\rangle_n\,\psi_{\theta_m}(X_i^m)$;
8:     $G_m\leftarrow G(\theta_m,X^m)=\frac{1}{\widetilde{Z}_m^3}\frac{1}{n-1}\sum_{i=1}^n a_i\nabla_\theta\psi_{\theta_m}(X_i^m)$;
9:     $\theta_{m+1}\leftarrow\theta_m-\eta_m G_m$; $m\leftarrow m+1$;
10:  end while
11:  Output: $\theta_M$;

The choice of learning rate $\eta_m$ in the pre-training setting and the decay rate of the loss function depend on the ratio $\|\psi_{\theta_m}\|_\rho/\widetilde{Z}$. However, for our results it suffices that this ratio is bounded above and below by fixed constants. As mentioned above, we may estimate $\tilde{Z}=(\frac{1}{n}\sum_{i=1}^n|\psi_\theta(X_i)|^2)^{1/2}$ at each iteration using an independent sample $X_{n+1},\ldots,X_{2n}\perp\!\!\!\perp X_1,\ldots,X_n$. Alternatively, we may take inspiration from variance reduction techniques [41, 42, 43] and update $\tilde{Z}$ using a large batch once every $K$ steps, letting $\tilde{Z}_m=(\frac{1}{K}\sum_{i=1}^K|\psi_\theta(X_i)|^2)^{1/2}$ when $m\equiv 0\ (\text{mod }K)$ and $\widetilde{Z}_m=\widetilde{Z}_{m-1}$ otherwise.
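A minimal sketch (ours) of the latter variant: the norm estimate is refreshed from a large batch only once every $K$ steps and reused in between; `sample_rho` is a placeholder for the sampler.

import numpy as np

def refreshed_norm_estimate(psi, sample_rho, m, K, Z_prev, rng):
    if m % K == 0:
        # refresh Z from a large batch of K fresh samples from rho
        X_big = sample_rho(K, rng)
        return np.sqrt(np.mean([psi(x)**2 for x in X_big]))
    return Z_prev  # otherwise reuse the previous estimate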

Our convergence bound for the supervised pre-training assumes uniform bounds similar to Condition 4.2.

Condition 4.8.

There exists a constant $C_\psi$ such that for any $\theta\in\mathbb{R}^d$,

  • 1. (Value bound):

    \|\psi_\theta^2\|_\rho^{1/2}/\|\psi_\theta\|_\rho\leq C_\psi\,. (19)

  • 2. (Gradient bound):

    \||\nabla_\theta\psi_\theta|^2\|_\rho^{1/2}/\|\psi_\theta\|_\rho\leq C_\psi\,. (20)

  • 3. (Hessian bound):

    \|\,\|\mathsf{H}_\theta\psi_\theta\|_2^2\,\|_\rho^{1/2}/\|\psi_\theta\|_\rho\leq C_\psi\,, (21)

    where $\mathsf{H}_\theta\psi$ is the Hessian of $\psi$.

Under Condition 4.8, we establish the Lipschitz property of $\nabla_\theta L$ and provide bounds for the variance of the gradient estimator (see Lemma D.1). These results play a crucial role in demonstrating the convergence of our method. Condition 4.8 is scale-invariant in $\psi_\theta$, meaning that the same condition holds with the same constant for any $\lambda\psi_\theta$, where $\lambda\neq 0$.

Theorem 4.9.

Assume Condition 4.8, $|\varphi|\leq C_\varphi$, and

\frac{1}{C_r}\leq\frac{\|\psi_{\theta_m}\|_\rho}{\widetilde{Z}_m}\leq C_r\,,\quad\mathrm{a.s.} (22)

for all $m\in\mathbb{N}$, where $\theta_m$ comes from Algorithm 2. Then there exists a constant $C$ that only depends on $C_r,C_\psi,C_\varphi$ such that, when $\eta_m<\frac{1}{C}$ for any $m\geq 0$,

\sum_{m=0}^M\eta_m\,\mathbb{E}\big(|\nabla_\theta L(\theta_m)|^2\big)\leq 2C_r^2 L(\theta_0)+C\sum_{m=0}^M\frac{\eta_m^2}{n} (23)

for all $M>0$.

The above theorem is proved in Appendix D. In the proof, we first bound $\mathbb{E}|G_m|^2$ and show that $\nabla_\theta L$ has a uniform Lipschitz constant. Then, the decay rate of the loss function in each step can be lower bounded by $|\nabla_\theta L|^2$, which finally gives us an upper bound on $|\nabla_\theta L|^2$ as shown in (23).

Using Theorem 4.9 it is straightforward to show the following corollary:

Corollary 4.10.

Under Condition 4.8 and (22):

  • 1. Choosing $\eta_m=\Theta(n\epsilon^2)$ and $M=\Theta(1/(n\epsilon^4))$, we have

    \min_{0\leq m\leq M}\mathbb{E}\big(|\nabla_\theta L(\theta_m)|\big)\leq\epsilon\,.

  • 2. Choosing $\eta_m=\Theta\big(\sqrt{\frac{n}{m+1}}\big)$, we have

    \min_{0\leq m\leq M}\mathbb{E}\big(|\nabla_\theta L(\theta_m)|\big)=O\big((nM)^{-1/4}\big)\,.

We note that this bound matches the standard $O(M^{-1/4})$ convergence rate of SGD to a first-order stationary point for non-convex functions.

5 Numerical results

5.1 Pre-training with scale-invariant loss

To evaluate the use of a scale-invariant loss during pre-training we modify the training loss used in the pre-training stage of [19] to be scale-invariant with respect to the trained Ansatz. We tested the scale-invariant loss in a version of the FermiNet code [44] where the electron configurations were sampled from the target state as in [19]. Specifically, the wave function is given by a sum of determinants and the target state is a Slater determinant:

\psi_\theta(x)=\sum_{k=1}^d\Big[\det\big(y_{ij}^{(k)}\big)_{i,j=1}^N\Big],\qquad\varphi(x)=\det\Big[\big(\phi_j(x_i)\big)_{i,j=1}^N\Big],

where $N$ is the number of electrons and the tensor $y$ is the output of the parameterized permutation-equivariant neural network. In our experiment we use the standard FermiNet Ansatz [11] for the network architecture. The pre-training of [19] uses the loss function

\sum_{k=1}^d\sum_{i,j=1}^N|y_{i,j}^{(k)}-\phi_j(x_i)|^2 (24)

on the matrices. To test the use of a scale-invariant loss function during training we replace the training loss of Equation 24 with the sum of column-wise scale-invariant $\sin^2$ distances, given in Equation 25 below. We note that in practice we did not find it necessary to estimate the denominator using an independent sample as in Remark 4.6. Note also that we can view the fitting of a (column) vector-valued function $\Omega\ni x\mapsto(y_i)_{i=1}^N$ as fitting a scalar-valued function $y(x,i)$ defined on $\Omega\times\{1,\ldots,N\}$. The modified loss is

L(\theta)=\sum_{k=1}^d\sum_{j=1}^N\Big(1-\frac{|y_{\cdot,j}^{(k)}\cdot\phi_j(\cdot)|^2}{\|y_{\cdot,j}^{(k)}\|^2}\Big), (25)

where $\phi_j(\cdot)=(\phi_j(x_1),\ldots,\phi_j(x_N))^T$, that is,

L(\theta)=\sum_{k=1}^d\sum_{j=1}^N\Big(1-\frac{|\sum_{i=1}^N y_{i,j}^{(k)}\phi_j(x_i)|^2}{\sum_{i=1}^N|y_{i,j}^{(k)}|^2}\Big). (26)
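A simplified sketch (ours, not the code of the fork [48]) of the column-wise loss (26), assuming the target orbital columns are normalized; `y` holds the backflow matrices with shape $(d,N,N)$ and `phi` the matrix $\phi_j(x_i)$.

import numpy as np

def scale_invariant_pretraining_loss(y, phi):
    """Column-wise sin^2 loss (26); y has shape (d, N, N), phi shape (N, N)
    with columns phi[:, j] = (phi_j(x_1), ..., phi_j(x_N)) assumed normalized."""
    d, N, _ = y.shape
    loss = 0.0
    for k in range(d):
        for j in range(N):
            col = y[k][:, j]              # column y_{.,j}^{(k)}
            overlap = col @ phi[:, j]     # sum_i y_{i,j}^{(k)} phi_j(x_i)
            loss += 1.0 - overlap**2 / (col @ col)
    return loss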
Figure 1: Convergence of supervised pre-training using the scale-invariant training loss (orange) vs. the training loss of [19] (blue). For both optimizers the plotted quantity is the sine of the angle between the target state and the trained state, where the angle is defined in $\mathscr{L}^2(\mathbb{R}^{3N},\rho=|\varphi|^2)$ with respect to the measure induced by the target state density.

To evaluate the performance of the modified pre-training procedure we plot the angle between the target Slater determinant-based wave function and the neural network wave function during each training procedure (Figure 1). The system shown is the lithium atom (details in Appendix F).

5.2 Empirical convergence of VMC algorithm

As an example, we demonstrate the convergence of the VMC method (Algorithm 1) on the hydrogen square (H4) model depicted in Figure 2. This system involves only four particles but is strongly correlated and known to be difficult to simulate accurately. We define $\psi_\theta$ using the FermiNet Ansatz [11, 35] and run 200,000 training steps using stochastic gradient descent with the learning rate schedule $\eta_m=\frac{0.05}{\sqrt{1+m/10000}}$.

Figure 2: Atomic configuration for the square H4 model (four hydrogen atoms with side length 1.0 Bohr).

Figure 3: Convergence of VMC run on the H4 square. The running minimum is taken to smooth out the data and match the form of Corollary 4.4.

With this learning rate schedule, Corollary 4.4 implies that the running minimum of $\mathbb{E}(|\nabla_\theta L(\theta_m)|)$ should converge as $O(M^{-1/4})$ as the optimization progresses. To verify this bound, we plot in Figure 3 the running minimum of $|G_m|$, which is our numerical proxy for $|\nabla_\theta L(\theta)|$ as per Lemma 4.1. We compare two distinct VMC runs using 10 and 1000 walkers, respectively, to explore how the convergence varies with the accuracy of the gradient estimate. We estimate the convergence rate by measuring the overall slope of the log-log plot, starting from step 200 to avoid the initial pre-asymptotic period. We find that the gradient of the loss converges roughly as $M^{-0.27}$ when using 10 walkers, closely matching the expected bound. With 1000 walkers the convergence goes as $M^{-0.38}$, which surpasses the theoretical bound given in Corollary 4.4. Although a rigorous theoretical understanding of this phenomenon is currently lacking, we hypothesize that with this number of walkers, the relatively accurate gradient estimates lead to training dynamics similar to (non-stochastic) gradient descent, giving a faster convergence rate than our theoretical bound.
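A short sketch (ours) of this measurement: take the running minimum of the recorded gradient-norm proxy $|G_m|$ and fit the slope of its log-log plot, skipping the pre-asymptotic steps.

import numpy as np

def loglog_slope(grad_norms, skip=200):
    running_min = np.minimum.accumulate(grad_norms)
    steps = np.arange(1, len(grad_norms) + 1)
    # least-squares slope of log(running min) against log(step)
    slope, _ = np.polyfit(np.log(steps[skip:]), np.log(running_min[skip:]), 1)
    return slope  # e.g. about -0.27 (10 walkers) or -0.38 (1000 walkers) here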

Since bounding the Lipschitz constant of the gradient of the loss function is critical to our convergence proof, we supply in Figure 4 a numerical estimate of this quantity during the VMC run. Here we show data only for the case of 1000 walkers, as the Lipschitz constant estimate is extremely noisy with 10 walkers. The data suggest that the value of the Lipschitz constant is well below 1000 for this simulation.

Figure 4: Lipschitz constant for VMC run on the H4 square with 1000 walkers. The constant is numerically approximated using the formula $|G(\theta_{m+1})-G(\theta_m)|/|\theta_{m+1}-\theta_m|$.

6 Conclusion

We consider two optimization problems that arise in neural network variational Monte Carlo simulations of electronic structure: energy minimization and supervised pre-training. We provide theoretical convergence bounds for both cases. For the setting of supervised pre-training, we note that the standard algorithms do not incorporate the property of scale-invariance. We propose using a scale-invariant loss function for the supervised pre-training and demonstrate numerically that incorporating scale-invariance accelerates the process of pre-training.

SGD is only the simplest of the available stochastic optimization methods. Over the last two decades, there has been increasing interest in developing more efficient and scalable optimization methods for VMC simulations [45, 46, 47], and our analysis may be a starting point for analyzing such methods. It may also be possible to generalize this work from a high-dimensional sphere to a Grassmann manifold, parameterized by a neural network up to a gauge matrix. This setting could be applicable to VMC simulations of excited states of quantum systems.

Acknowledgement

This work was supported by the Simons Foundation under Award No. 825053 (N. A.), and by the Challenge Institute for Quantum Computation (CIQC) funded by National Science Foundation (NSF) through grant number OMA-2016245 (Z. D.). This work is supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-SC0023112 (G. G.). This material is also based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research and Office of Basic Energy Sciences, Scientific Discovery through Advanced Computing (SciDAC) program under Award Number DE-SC0022364 (L.L.). L.L. is a Simons Investigator in Mathematics. This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley (supported by the UC Berkeley Chancellor, Vice Chancellor for Research, and Chief Information Officer).

Disclaimer

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

References

Appendix A Proofs of gradient estimators

Proof of (4).

Let $q_\theta(x)=|\psi_\theta(x)|^2$ and $Q_\theta=\|\psi_\theta\|^2$.

We write (3) as

L(\theta)=\langle q_\theta,\mathcal{E}_\theta\rangle/Q_\theta\,.

Then by the product formula,

\nabla_\theta L(\theta)=\frac{\nabla_\theta\langle q_\theta,\mathcal{E}_\theta\rangle}{Q_\theta}-\frac{\langle q_\theta,\mathcal{E}_\theta\rangle\nabla_\theta Q_\theta}{Q_\theta^2}=:A-B. (27)

Continuing, we have

A=\langle\mathcal{E}_\theta,\nabla_\theta q_\theta\rangle/Q_\theta+\langle q_\theta,\nabla_\theta\mathcal{E}_\theta\rangle/Q_\theta (28)
 =\Big\langle\mathcal{E}_\theta,\frac{\nabla_\theta q_\theta}{q_\theta}\frac{q_\theta}{Q_\theta}\Big\rangle+\Big\langle\frac{q_\theta}{Q_\theta},\nabla_\theta\mathcal{E}_\theta\Big\rangle
 =\mathbb{E}_{p_\theta}[\mathcal{E}_\theta\nabla_\theta\log q_\theta]+\mathbb{E}_{p_\theta}[\nabla_\theta\mathcal{E}_\theta],

and

B=\Big\langle\frac{q_\theta}{Q_\theta},\mathcal{E}_\theta\Big\rangle\frac{\nabla_\theta Q_\theta}{Q_\theta} (29)
 =L(\theta)\Big\langle\frac{q_\theta}{Q_\theta},\frac{\nabla_\theta q_\theta}{q_\theta}\Big\rangle=L(\theta)\,\mathbb{E}_{p_\theta}[\nabla_\theta\log q_\theta].

Substitute (28), (29) into (27) to obtain

\nabla_\theta L(\theta)=\mathbb{E}_{X\sim p_\theta}\big[(\mathcal{E}_\theta(X)-L(\theta))\nabla_\theta\log q_\theta(X)+\nabla_\theta\mathcal{E}_\theta(X)\big] (30)
 =2\,\mathbb{E}_{X\sim p_\theta}\Big[\big(\mathcal{E}_\theta(X)-L(\theta)\big)\nabla_\theta\log|\psi_\theta(X)|+\tfrac{1}{2}\nabla_\theta\mathcal{E}_\theta(X)\Big]\,.

It remains to show that $\mathbb{E}_{X\sim p_\theta}[\partial_i\mathcal{E}_\theta(X)]=0$, where $\partial_i$ is differentiation with respect to $\theta_i$. Write

\partial_i\mathcal{E}_\theta=\partial_i\frac{H\psi_\theta}{\psi_\theta}=\frac{\partial_i H\psi_\theta\cdot\psi_\theta-H\psi_\theta\cdot\partial_i\psi_\theta}{\psi_\theta^2}\,. (31)

So, because $H$ is symmetric:

\mathbb{E}_{X\sim p_\theta}[\partial_i\mathcal{E}_\theta]=\frac{\langle H\partial_i\psi_\theta,\psi_\theta\rangle-\langle H\psi_\theta,\partial_i\psi_\theta\rangle}{\|\psi_\theta\|^2}=0\,. ∎

Proof of Lemma 4.1.

G(\theta,X)=\frac{2}{n-1}\sum_{i=1}^n\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_i)|-\frac{2}{n(n-1)}\sum_{j=1}^n\sum_{i=1}^n\mathcal{E}_\theta(X_j)\nabla_\theta\log|\psi_\theta(X_i)|
 =\frac{2}{n}\sum_{i=1}^n\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_i)|-\frac{2}{n^2-n}\sum_{i\neq j}\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_j)|.

Here, the terms $i=j$ in the expansion of the rightmost term cancel with the first sum. But now each sum is the average of terms with the correct expectation ($2\,\mathbb{E}_{X\sim p_\theta}[\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|]$ and $2L(\theta)\,\mathbb{E}_{X\sim p_\theta}\nabla_\theta\log|\psi_\theta(X)|$, respectively), and their difference is exactly the right-hand side of (4). ∎

Appendix B The directionally unbiased gradient estimator for supervised learning

Given $n\geq 2$ i.i.d. samples $\{X_i\}\sim\rho$, we write $\psi_i=\psi_\theta(X_i)$, $\varphi_i=\varphi(X_i)$, and $\nabla\psi_i=\nabla_\theta\psi_\theta(X_i)$, and set

\langle\psi,\varphi\rangle_n=\frac{1}{n}\sum_{i=1}^n\psi_i\varphi_i,\qquad\|\psi\|_n^2=\frac{1}{n}\sum_{i=1}^n\psi_i^2. (32)
Lemma B.1.

Given samples $X_1,\ldots,X_n$ from $\rho$, let

G_n=\frac{1}{n-1}\sum_{j=1}^n a_j\nabla\psi_j, (33)

a_j=-\|\psi\|_n^2\,\varphi_j+\langle\varphi,\psi\rangle_n\,\psi_j. (34)

Then $G_n$ is an unbiased estimator for $G=\|\psi_\theta\|_\rho^3\nabla_\theta L(\theta)$.

Proof of Lemma B.1.

We notice

G=-\|\psi_\theta\|_\rho^2\langle\varphi,\nabla\psi_\theta\rangle+\langle\varphi,\psi_\theta\rangle\langle\psi_\theta,\nabla\psi_\theta\rangle.

For each $i\neq j$ we have an unbiased estimator for $G$ given by

-|\psi_i|^2\varphi_j\nabla\psi_j+\varphi_i\psi_i\psi_j\nabla\psi_j.

Taking the average over all pairs $i\neq j$ gives another unbiased estimator for $G$:

-\frac{1}{n(n-1)}\sum_{i\neq j}(\psi_i^2\varphi_j\nabla\psi_j-\varphi_i\psi_i\psi_j\nabla\psi_j).

We may add the terms $i=j$ without changing the value because all such terms evaluate to 0. ∎

Appendix C Proof of Theorem 4.3

To prove Theorem 4.3, we first show the following lemma:

Lemma C.1.

Under Condition 4.2 there exists a uniform constant $C'$ such that for any $\theta,\widetilde{\theta}\in\mathbb{R}^d$,

|\nabla_\theta L(\theta)|\leq 4C_\psi^2\,, (35)

|\nabla_\theta L(\theta)-\nabla_\theta L(\widetilde{\theta})|\leq C'(C_\psi^4+1)|\theta-\widetilde{\theta}|\,, (36)

and

\mathbb{E}_{\{X_i\}_{i=1}^n}\big(|G(\theta,X)|^2\big)\leq|\nabla_\theta L(\theta)|^2+\frac{C'(C_\psi^4+1)}{n}\,, (37)

where $G(\theta,X)$ is defined in (11).

The above lemma is important for the proof of Theorem 4.3. First, inequality (37) gives an upper bound for the gradient estimator. Using this upper bound, we can show that each iteration of Algorithm 1 is close to classical gradient descent when the learning rate $\eta$ is small enough. Second, (36) means that the Hessian of $L$ is bounded. This ensures that classical gradient descent converges quickly to a first-order stationary point, which further implies the fast convergence of Algorithm 1.

Proof of Lemma C.1.

Define

Z_\theta=\sqrt{\int_\Omega|\psi_\theta(x)|^2\,dx}\,.

First, to prove (35), using (4) we have

|\nabla_\theta L(\theta)|\leq 2\,\mathbb{E}_{X\sim p_\theta}\Big|\big(\mathcal{E}_\theta(X)-L(\theta)\big)\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|
\leq 2\Big(\mathbb{E}_{X\sim p_\theta}|\mathcal{E}_\theta(X)-L(\theta)|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\leq 4C_\psi^2\,,

where we use Hölder's inequality in the second inequality, and (12) together with the second inequality of (13) in the last inequality.

Next, we prove (36). Using the formula (2) for the local energy, we write

\|\mathsf{H}_\theta L(\theta)\|\leq\underbrace{\frac{2|\nabla_\theta Z_\theta|}{Z_\theta^3}\int_\Omega|H\psi_\theta(x)\nabla_\theta\psi_\theta(x)|\,dx}_{\mathrm{(I)}}+\underbrace{\frac{1}{Z_\theta^2}\int_\Omega\|\nabla_\theta H\psi_\theta(x)\nabla_\theta\psi_\theta(x)\|\,dx}_{\mathrm{(II)}}
+\underbrace{\frac{1}{Z_\theta^2}\int_\Omega|H\psi_\theta(x)|\,\|\mathsf{H}_\theta\psi_\theta(x)\|\,dx}_{\mathrm{(III)}}+\Big\|\nabla_\theta\Big(\frac{1}{Z_\theta^2}\int_\Omega L(\theta)\psi_\theta(x)\nabla_\theta\psi_\theta(x)\,dx\Big)\Big\|\,.

We first deal with Terms (I), (II), (III) separately:

  • 1. For Term (I), we notice

    |\nabla_\theta Z_\theta|=\Big|\frac{\int\nabla_\theta\psi_\theta(x)\psi_\theta(x)\,dx}{\sqrt{\int|\psi_\theta(x)|^2\,dx}}\Big|\leq\sqrt{\int|\nabla_\theta\psi_\theta(x)|^2\,dx}\,,

    where we use Hölder's inequality to bound the numerator. This implies

    \mathrm{Term\ (I)}\leq\frac{2\sqrt{\int|\nabla_\theta\psi_\theta(x)|^2\,dx}}{\sqrt{\int|\psi_\theta(x)|^2\,dx}}\int_\Omega\frac{|H\psi_\theta(x)\nabla_\theta\psi_\theta(x)|}{Z_\theta^2}\,dx
    \leq 2\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\frac{\sqrt{\int_\Omega|H\psi_\theta(x)|^2\,dx}\sqrt{\int_\Omega|\nabla_\theta\psi_\theta(x)|^2\,dx}}{Z_\theta^2}
    =2\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{H\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}
    \leq 2C_\psi^3\,.

    Here, we use Hölder's inequality in the second inequality, and (12)–(14) in the last inequality.

  • 2. For Term (II),

    \frac{1}{Z_\theta^2}\int_\Omega\|\nabla_\theta H\psi_\theta(x)\nabla_\theta\psi_\theta(x)\|\,dx\leq\mathbb{E}_{X\sim p_\theta}\Big(\frac{|\nabla_\theta H\psi_\theta(X)|\,|\nabla_\theta\psi_\theta(X)|}{|\psi_\theta(X)|^2}\Big)
    \leq\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta H\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\leq C_\psi^2\,,

    where we use Hölder's inequality and (12)–(14) in the last two inequalities.

  • 3. For Term (III),

    \frac{1}{Z_\theta^2}\int_\Omega\|H\psi_\theta(x)\,\mathsf{H}_\theta\psi_\theta(x)\|\,dx\leq\mathbb{E}_{X\sim p_\theta}\Big(\frac{|H\psi_\theta(X)|\,\|\mathsf{H}_\theta\psi_\theta(X)\|}{|\psi_\theta(X)|^2}\Big)
    \leq\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{H\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big\|\frac{\mathsf{H}_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big\|^2\Big)^{1/2}\leq C_\psi^2\,,

    where we use Hölder's inequality and (12)–(14) in the last two inequalities.

Combining the above three inequalities, we have

\mathrm{Term\ (I)}+\mathrm{Term\ (II)}+\mathrm{Term\ (III)}\leq 2C_\psi^2(C_\psi+1)\,.

Using a similar calculation, we can also show

\Big\|\nabla_\theta\Big(\frac{1}{Z_\theta^2}\int_\Omega L(\theta)\psi_\theta(x)\nabla_\theta\psi_\theta(x)\,dx\Big)\Big\|\leq C(C_\psi^4+1)\,,

where $C$ is a uniform constant. Thus, the Hessian of $L$ is bounded,

\|\mathsf{H}_\theta L(\theta)\|\leq C(C_\psi^4+1)\,,

which proves (36) by the mean-value theorem.

Finally, to prove (37), we notice

G(\theta,X)=\frac{2}{n}\sum_{i=1}^n\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_i)|-\frac{2}{n^2-n}\sum_{i\neq j}\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_j)|\,,

and

\mathbb{E}_{\{X_i\}_{i=1}^n}\big(G(\theta,X)\big)=\nabla_\theta L(\theta)\,,

so that

\mathbb{E}_{\{X_i\}_{i=1}^n}\big(|G(\theta,X)|^2\big)
\leq|\nabla_\theta L(\theta)|^2+\frac{8}{n^2}\sum_{i=1}^n\mathbb{E}_{X_i\sim p_\theta}\Big(\big|\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_i)|-\mathbb{E}_{X\sim p_\theta}[\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|]\big|^2\Big)
\quad+\frac{8}{(n^2-n)^2}\,\mathbb{E}\Big(\sum_{i\neq j}\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_j)|-L(\theta)\,\mathbb{E}_{X\sim p_\theta}\nabla_\theta\log|\psi_\theta(X)|\Big)^2
\leq|\nabla_\theta L(\theta)|^2+\frac{8}{n}\,\mathbb{E}_{X\sim p_\theta}\big|\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|-\mathbb{E}_{X\sim p_\theta}[\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|]\big|^2
\quad+\frac{8}{(n^2-n)^2}\Big(\sum_{i_1=i_2\,\text{or}\,j_1=j_2}1\Big)\,\mathbb{E}_{(X_1,X_2)\sim p_\theta^{\otimes 2}}\big|\mathcal{E}_\theta(X_1)\nabla_\theta\log|\psi_\theta(X_2)|-L(\theta)\,\mathbb{E}_{X\sim p_\theta}\nabla_\theta\log|\psi_\theta(X)|\big|^2
\leq|\nabla_\theta L(\theta)|^2+\frac{8}{n}\,\mathbb{E}_{X\sim p_\theta}\big|\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|\big|^2+\frac{C}{n}\,\mathbb{E}_{(X_1,X_2)\sim p_\theta^{\otimes 2}}\big|\mathcal{E}_\theta(X_1)\nabla_\theta\log|\psi_\theta(X_2)|\big|^2\,,

where the sum in the penultimate line counts pairs of index pairs $(i_1,j_1)$, $(i_2,j_2)$ sharing an index, and we use that $X_i$ and $X_j$ are independent in the first two inequalities. Using (12) and the second inequality of (13), it is straightforward to show:

\mathbb{E}_{X\sim p_\theta}\big|\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|\big|^2\leq\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{H\psi_\theta(X)}{\psi_\theta(X)}\Big|^4\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^4\Big)^{1/2}\leq C_\psi^4\,,

and

\mathbb{E}_{(X_1,X_2)\sim p_\theta^{\otimes 2}}\big|\mathcal{E}_\theta(X_1)\nabla_\theta\log|\psi_\theta(X_2)|\big|^2\leq\mathbb{E}_{X\sim p_\theta}\Big(\Big|\frac{H\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)\,\mathbb{E}_{X\sim p_\theta}\Big(\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)\leq C_\psi^4\,.

This concludes the proof of (37). ∎

Now, we are ready to prove Theorem 4.3:

Proof of Theorem 4.3.

Denote the probability filtration

\mathcal{F}_m=\sigma(X_i^j,\,1\leq i\leq n,\,1\leq j\leq m)\,.

According to the algorithm, we have

\theta_{m+1}=\theta_m-\eta_m G_m\,.

Plugging $\theta_{m+1}$ into $L(\theta)$ and using (36), we have

L(\theta_{m+1})\leq L(\theta_m)-\eta_m\nabla_\theta L(\theta_m)\cdot G_m+\frac{C\eta_m^2}{2}|G_m|^2\,,

where $C$ is a constant that only depends on $C_\psi$. This implies

\mathbb{E}\big(L(\theta_{m+1})\,|\,\mathcal{F}_{m-1}\big)\leq L(\theta_m)-\eta_m|\nabla_\theta L(\theta_m)|^2+\frac{C\eta_m^2}{2}\,\mathbb{E}\big(|G_m|^2\,|\,\mathcal{F}_{m-1}\big)\,,

where we use $\mathbb{E}_{\{X_i\}_{i=1}^n}(G(\theta,X))=\nabla_\theta L(\theta)$. Furthermore, using (37) and $\eta_m<\frac{1}{C}$, we have

\mathbb{E}\big(L(\theta_{m+1})\,|\,\mathcal{F}_{m-1}\big)\leq L(\theta_m)-\big(\eta_m-C\eta_m^2/2\big)|\nabla_\theta L(\theta_m)|^2+\frac{C\eta_m^2}{2n}
\leq L(\theta_m)-\frac{\eta_m}{2}|\nabla_\theta L(\theta_m)|^2+\frac{C\eta_m^2}{2n}\,.

Taking the total expectation and summing over $m=0,1,\ldots,M$, we prove (15). ∎

Appendix D Proof of Theorem 4.9

In this section, we prove Theorem 4.9. We first show the following lemma:

Lemma D.1.

Under Condition 4.8 there exists a constant $C'$ that only depends on $C_\psi,C_r,C_\varphi$ such that for any $\theta,\widetilde{\theta}\in\mathbb{R}^d$,

|\nabla_\theta L(\theta)-\nabla_\theta L(\widetilde{\theta})|\leq C'|\theta-\widetilde{\theta}|\,, (38)

and

\mathbb{E}_{\{X_i\}_{i=1}^n}\big(|G(\theta,X)|^2\big)\leq\frac{Z_\theta^6}{\widetilde{Z}^6}\Big(|\nabla_\theta L(\theta)|^2+\frac{C'}{n}\Big)\,, (39)

where $G(\theta,X)$ is defined in (17).

Proof of Lemma D.1.

The proof is very similar to the proof of Lemma C.1 after setting $H\psi_\theta(x)=g(x)\big(\braket{g(x),\psi_\theta(x)}\big)$. ∎

Now, we are ready to prove Theorem 4.9.

Proof of Theorem 4.9.

According to the algorithm, we have

\theta_{m+1}=\theta_m-\eta_m G_m\,.

Plugging $\theta_{m+1}$ into $L(\theta)$ and using (38), we have

L(\theta_{m+1})\leq L(\theta_m)-\eta_m\nabla_\theta L(\theta_m)\cdot G_m+\frac{C\eta_m^2}{2}|G_m|^2\,,

where $C$ is a constant that only depends on $C_\psi$. Because

\mathbb{E}_{X_1^m,\cdots,X_n^m}(G_m)=\frac{Z_m^3}{\widetilde{Z}_m^3}\nabla_\theta L(\theta_m)\,, (40)

we obtain

\mathbb{E}\big(L(\theta_{m+1})\,|\,\mathcal{F}_{m-1}\big)\leq L(\theta_m)-\eta_m|\nabla_\theta L(\theta_m)|^2\frac{Z_m^3}{\widetilde{Z}_m^3}+\frac{C\eta_m^2}{2}\,\mathbb{E}\big(|G_m|^2\,|\,\mathcal{F}_{m-1}\big)\,. (41)

Using (39) and (22),

\mathbb{E}\big(|G_m|^2\,|\,\mathcal{F}_{m-1}\big)\leq C\Big(|\nabla_\theta L(\theta_m)|^2+\frac{C'}{n}\Big)\,, (42)

where $C$ is a constant that depends on $C_\psi,C_r,C_\varphi$.

Plugging (42) into (41), using (22), and choosing $\eta_m$ small enough, we have

\mathbb{E}\big(L(\theta_{m+1})\big)\leq L(\theta_m)-\frac{\eta_m}{2C_r^2}|\nabla_\theta L(\theta_m)|^2+\frac{C\eta_m^2}{n}\,. (43)

Taking the total expectation and summing over $m=0,1,\ldots,M$, we prove (23). ∎

Appendix E Lipschitz continuity of $Z_\theta$

Lemma E.1.

Assume Condition 4.8, let $\theta,\widetilde{\theta}\in\mathbb{R}^d$, and suppose

|\widetilde{\theta}-\theta|\leq C_r\,.

Then there exists a constant $C'$ that only depends on $C_\psi,C_r,C_\varphi$ such that

\frac{\|\psi_{\widetilde{\theta}}\|_\rho}{\|\psi_\theta\|_\rho}\leq\exp(C'C_r)\,. (44)
Proof of Lemma E.1.

Fixing $\theta,\widetilde{\theta}$, we define

R(t)=\frac{\|\psi_{\theta+t(\widetilde{\theta}-\theta)}\|_\rho^2}{\|\psi_\theta\|_\rho^2}\,.

Then

|R'(t)|\leq\Big|\frac{2\braket{\nabla_\theta\psi_{\theta+t(\widetilde{\theta}-\theta)},\psi_{\theta+t(\widetilde{\theta}-\theta)}}_\rho}{\|\psi_\theta\|_\rho^2}\Big|\,|\widetilde{\theta}-\theta|
\leq\Big|\frac{2\braket{\nabla_\theta\psi_{\theta+t(\widetilde{\theta}-\theta)},\psi_{\theta+t(\widetilde{\theta}-\theta)}}_\rho}{\|\psi_{\theta+t(\widetilde{\theta}-\theta)}\|_\rho^2}\Big|\,\frac{\|\psi_{\theta+t(\widetilde{\theta}-\theta)}\|_\rho^2}{\|\psi_\theta\|_\rho^2}\,|\widetilde{\theta}-\theta|
\leq C|\widetilde{\theta}-\theta|\,R(t)\,,

where we use (19) and (20) in the last inequality. Because $R(0)=1$, Grönwall's inequality gives

\frac{\|\psi_{\widetilde{\theta}}\|_\rho}{\|\psi_\theta\|_\rho}=R^{1/2}(1)\leq\exp(C|\theta-\widetilde{\theta}|)\leq\exp(CC_r)\,. ∎

Appendix F Additional details on scale-invariant supervised training

We created a fork [48] of the FermiNet repository [44] and implemented the scale-invariant loss function of Equation 25 in the columns of the backflow tensors. We used the straight-through gradient estimator [49] to train the network using either the standard loss or our modified scale-invariant loss, while plotting the angle between the wave functions in each case (Figure 1).

The commands for each (using the fork [48] of FermiNet) were:

  • 1.

    Standard (blue)

    ferminet --config ferminet/configs/atom.py \
    --config.system.atom Li \
    --config.batch_size 256 \
    --config.pretrain.iterations 10
    
  • 2.

    Scale-invariant (orange)

    ferminet --config ferminet/configs/atom.py \
    --config.system.atom Li \
    --config.batch_size 256 \
    --config.pretrain.iterations 10 \
    --config.pretrain.SI