Convergence of variational Monte Carlo simulation and scale-invariant pre-training

Nilin Abrahamsen [email protected] Zhiyan Ding [email protected] Gil Goldshlager [email protected] Lin Lin [email protected] The Simons Institute for the Theory of Computing, Berkeley, CA 94720 USA Department of Mathematics, University of California, Berkeley, CA 94720 USA Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Abstract

We provide theoretical convergence bounds for the variational Monte Carlo (VMC) method as applied to optimize neural network wave functions for the electronic structure problem. We study both the energy minimization phase and the supervised pre-training phase that is commonly used prior to energy minimization. For the energy minimization phase, the standard algorithm is scale-invariant by design, and we provide a proof of convergence for this algorithm without modifications. The pre-training stage typically does not feature such scale-invariance. We propose using a scale-invariant loss for the pre-training phase and demonstrate empirically that it leads to faster pre-training.

1 Introduction

A central goal of computational physics and chemistry is to find a ground state which minimizes the energy of a system. Electronic structure theory studies the behavior of electrons that govern the chemical properties of molecules and materials. The energy in the electronic structure is a functional defined on the set of quantum wave functions $\psi$, which are normalized $\mathscr{L}^2$-integrable functions determined up to an arbitrary scalar multiplication factor, known as a phase. Equivalently, the wave function can be viewed as an element of the projective space rather than a normalized function.

The variational Monte Carlo (VMC) algorithm [1, 2, 3] is a powerful method for optimizing the energy of a parameterized wave function through stochastic gradient estimates. The method relies on Monte Carlo samples from the probability distribution defined by the Born rule, which views the normalized squared wave function as a probability density function.

Historically, the number of parameters used in VMC simulations was relatively small, and methods such as stochastic reconfiguration [4, 5] and the linear method [6, 7] were preferred for parameter optimization. Recently, neural networks have been used to parameterize the wave function in an approach known as neural quantum states [8]. Subsequent work has shown that VMC with neural quantum states can match or exceed state-of-the-art high-accuracy quantum simulation methods for a variety of electronic structure problems, including molecules in both first quantization [9, 10, 11, 12] and second quantization [13, 14], electron gas [15, 16, 17], and solids [18].

Due to the complexity of quantum wave functions parameterized by neural networks, it is in general not possible to explicitly normalize the parameterized wave function. One of the essential properties of the VMC procedure is thus its ability to use an unnormalized function $\psi_\theta$ to represent the corresponding element in the projective space. This is achieved by pulling back the energy functional to be minimized from the projective space to a scale-invariant functional on the space of unnormalized wave functions. This scale-invariant functional is the same as the Rayleigh quotient $\bra{\psi}H\ket{\psi}/\braket{\psi}{\psi}$, where the Hermitian operator $H$ is known as the Hamiltonian. The scale-invariance of the energy functional is reflected in the VMC algorithm, where a scale-invariant local energy function corresponding to the wave function is evaluated at MCMC-sampled electron configurations. For the ground state problem considered here it suffices to restrict to real-valued wave functions, so we will take all scalars to be real.

The wave function in the VMC algorithm is often initialized using a supervised pre-training stage [11] where it is fitted to a less expressive approximation to the ground state, for example given as a Slater determinant. In contrast to the main VMC algorithm which is trained with an inherently scale-invariant objective, this supervised pre-training typically does not encode the scale-invariant property [11, 19]. We propose using a scale-invariant loss function for this supervised stage and demonstrate numerically that this modification leads to faster convergence towards the target in the supervised pre-training setting. We further give a theoretical convergence bound for the proposed scale-invariant supervised optimization algorithm. As an important ingredient in this analysis, we introduce the concept of a directionally unbiased stochastic estimator for the gradient.

1.1 Related works

Using scale-invariance in the parameter vector to stabilize SGD training is a well-studied technique. Many structures and methods have been proposed along these lines, such as normalized NN [20, 21, 22], $\mathcal{G}$-SGD [23], SGD+WD [24, 25, 26, 27], and projected/normalized (S)GD [28, 29]. In these works, the parameter $\mathbf{v}$ is stored explicitly and the loss function $\mathcal{L}$ is scale-invariant in the parameter $\mathbf{v}$, which means $\mathcal{L}(\lambda\mathbf{v})=\mathcal{L}(\mathbf{v})$ for any $\lambda>0$.

The setting of neural wave functions differs from typical scale-invariant loss functions in that the scale-invariant functional $\mathcal{L}$ is applied after a nonlinear parameterization $\theta\mapsto\psi_\theta$, and the function $\psi_\theta$ can be queried but not accessed as an explicit vector. The composed loss function $L(\theta)=\mathcal{L}(\psi_\theta)$ is not scale-invariant in its parameters. We can obtain a relation between the two settings by applying the chain rule involving a functional derivative, but the resulting expression involves factors that we need to estimate by sampling, in contrast with the typical setting of scale-invariance in the parameter space.

1.1.1 SGD on Riemannian manifolds

The training of a scale-invariant objective can be seen as a training process on projective space, which is a special case of learning on Riemannian manifolds. The convergence of SGD on Riemannian manifolds is well-studied; for example, [30] shows the asymptotic convergence of SGD on Riemannian manifolds. Further variations of the SGD algorithm on Riemannian manifolds have been proposed and studied, such as SVRG on Riemannian manifolds [31, 32] and ASGD on Riemannian manifolds [33]. To the best of our knowledge, all previous works ensure the parameter stays on the manifold by either taking stochastic gradient steps on the manifold or projecting them back onto it. This is in contrast to our setting where no projection onto a submanifold is needed.

1.1.2 Concurrent theoretical analysis

The concurrent work of [34] proves a similar convergence result for VMC and also takes into account the effect of Markov chain Monte Carlo sampling. However, this work does not consider the setting of supervised pre-training.

2 Variational Monte Carlo

A quantum wave function is a square-integrable function $\psi$ on a space of configurations $\Omega$. A typical configuration space is $\Omega=\mathbb{R}^{3N}$ for a system of $N$ electrons (the spin degrees of freedom are considered as classical and are expressed in the symmetry type of $\psi_\theta$). The term "square-integrable function" refers to a function $\psi$ with a finite $\mathscr{L}^2$-norm $\|\psi\|=\sqrt{\int_\Omega|\psi|^2\,\mathrm{d}x}<\infty$ with respect to the Lebesgue measure. For any scalar $\lambda\in\mathbb{R}$ with $\lambda\neq 0$, $\lambda\psi$ represents the same physical state as $\psi$. In the variational Monte Carlo algorithm, $\psi_\theta$ is an Ansatz parameterized by $\theta$, and the objective is to find the parameter vector $\theta$ such that $\psi_\theta$ minimizes the energy

\mathcal{L}(\psi)=\frac{\bra{\psi}H\ket{\psi}}{\braket{\psi}{\psi}}.

The Born rule associates to a quantum wave function $\psi$ a probability distribution on $\Omega$ with density $p_\psi$ given by

p_\psi(x)=|\psi(x)|^2/\|\psi\|^2\,. (1)

The norm $\|\psi\|$ is unknown to the optimization algorithm, and samples $\{X_i\}\sim p_\psi$ are generated using Markov chain Monte Carlo (MCMC). We will assume that these samples are independent and sampled from the exact distribution $p_\psi$. In practice, the independence assumption is tested by evaluating the autocorrelation of the MCMC samples.
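To make the sampling step concrete, the following is a minimal sketch (our illustration, not code from the paper) of a random-walk Metropolis chain targeting $p_\psi$. The key point is that the acceptance ratio involves only the unnormalized wave function, so the norm $\|\psi\|$ is never needed; `psi` is a placeholder callable standing in for the Ansatz.

import numpy as np

def metropolis_samples(psi, x0, n_samples, step=0.2, burn_in=500, thin=10, seed=0):
    """Draw approximate samples from p(x) = |psi(x)|^2 / ||psi||^2."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = []
    for t in range(burn_in + n_samples * thin):
        proposal = x + step * rng.standard_normal(x.shape)
        # Accept with probability min(1, |psi(x')|^2 / |psi(x)|^2);
        # the unknown normalization ||psi||^2 cancels in this ratio.
        if rng.random() < (psi(proposal) ** 2) / (psi(x) ** 2):
            x = proposal
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append(x.copy())
    return np.array(samples)

# Example: psi(x) = exp(-x^2/2), so p_psi is the standard normal density.
xs = metropolis_samples(lambda x: np.exp(-0.5 * np.sum(x * x)), x0=np.zeros(1), n_samples=1000)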

The energy of $\psi$ can then be expressed as an expectation with respect to $p_\psi$: $\mathcal{L}(\psi)=\mathbb{E}_{X\sim p_\psi}\,\mathcal{E}_\psi(X)$, where

\mathcal{E}_\psi(x)=\frac{(H\psi)(x)}{\psi(x)} (2)

is called the local energy. We note that the local energy is not well-defined in the case where $\psi(x)=0$, but this is not a practical issue since $p_\psi(x)=0$ by definition at such points and thus the expectation value is still well-defined.
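As a concrete illustration (ours, under the assumption of a one-dimensional Hamiltonian $H=-\frac{1}{2}\frac{d^2}{dx^2}+V(x)$), the local energy can be evaluated pointwise with a finite-difference Laplacian; note that it is unchanged if $\psi$ is multiplied by a constant, reflecting the scale-invariance discussed above.

import numpy as np

def local_energy(psi, V, x, h=1e-4):
    # E_psi(x) = (H psi)(x) / psi(x) with H = -1/2 d^2/dx^2 + V(x);
    # the Laplacian is approximated by a central finite difference.
    lap = (psi(x + h) - 2.0 * psi(x) + psi(x - h)) / h**2
    return -0.5 * lap / psi(x) + V(x)

# Check: psi(x) = exp(-x^2/2) is the exact ground state of the harmonic
# oscillator V(x) = x^2/2, so the local energy is the constant 1/2.
psi = lambda x: np.exp(-0.5 * x**2)
V = lambda x: 0.5 * x**2
print(local_energy(psi, V, np.linspace(-2.0, 2.0, 5)))  # ~[0.5 0.5 0.5 0.5 0.5]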

We use $L(\theta)=\mathcal{L}(\psi_\theta)$ to denote the loss as a function of the parameters.

Problem 2.1 (VMC).

Minimize

L(\theta)=\mathbb{E}_{X\sim p_\theta}\,\mathcal{E}_\theta(X), (3)

over parameter vectors $\theta\in\mathbb{R}^d$, where $\mathcal{E}_\theta(x)=H\psi_\theta(x)/\psi_\theta(x)$ and $p_\theta(x)=|\psi_\theta(x)|^2/\|\psi_\theta\|^2$. We assume access to samples $\{X_i\}\sim p_\theta$ and query access to $\psi_\theta$, $\nabla_\theta\psi_\theta$, and $\mathcal{E}_\theta$, which are assumed to be $C^\infty$ with respect to $\theta$ for any $x\in\Omega$.

When $\|\psi_\theta\|<\infty$, the gradient of the VMC energy functional takes the form (see, e.g., equation (9) of [11])

\nabla_\theta L(\theta)=2\,\mathbb{E}_{X\sim p_\theta}\Big[\big(\mathcal{E}_\theta(X)-L(\theta)\big)\nabla_\theta\log|\psi_\theta(X)|\Big]. (4)

Notably, the change in local energy does not contribute to this formula, that is, $\mathbb{E}_{X\sim p_\theta}[\nabla_\theta\mathcal{E}_\theta(X)]=0$. Instead, the gradient of $L$ is entirely due to the perturbation of the sampling distribution as $\theta$ varies. For completeness, we include the derivation of Equation 4 in Appendix A. The condition that $\psi_\theta$ is $C^\infty$ with respect to $\theta$ is typically satisfied by using the logistic sigmoid activation function.


2.1 Relation to the policy gradient method

In the reinforcement learning framework [polgrad], an agent decides on an action $a$ with probability $\pi_\theta(a|s)$, where $s$ is the state of the environment and $\pi_\theta$ is the policy, parameterized by $\theta$. Let $J(\pi)$ be the expected reward $R(\tau)$ of a trajectory sampled under the policy $\pi$. The policy gradient theorem [polgrad] exhibits an expression for the gradient of the reward:

Theorem 2.1 ([polgrad]).

The gradient of the expected reward is

\nabla_\theta J(\pi_\theta)=\sum_{t=1}^T\mathbb{E}_\pi\big[R(\tau)\nabla_\theta\log\pi_\theta(a_t|s_t)\big], (5)

where $R(\tau)$ is the reward of trajectory $\tau$, and $s_t,a_t$ are the states and actions in the trajectory.

To see the connection to the gradient estimate in VMC, it suffices to consider the policy gradient formula with a single timestep. Then we can ignore the state $s$ and the expected reward becomes $J(\pi)=\mathbb{E}[Q(a)]$, where $Q(a)$ is the value of action $a$. Equation 5 then states that

\nabla_\theta J(\pi_\theta)=\mathbb{E}\big[Q(a)\nabla_\theta\log\pi_\theta(a)\big].
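A small numerical sketch (ours) of this single-timestep identity for a softmax policy over three actions: the Monte Carlo estimate of $\mathbb{E}[Q(a)\nabla_\theta\log\pi_\theta(a)]$ is compared against the exact gradient, mirroring how the VMC estimator approximates Equation 4.

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.1, -0.3, 0.2])   # softmax logits (the policy parameters)
Q = np.array([1.0, 2.0, 0.5])        # fixed value Q(a) of each action
pi = np.exp(theta) / np.exp(theta).sum()

# Exact gradient: for a softmax policy, grad_theta log pi(a) = e_a - pi,
# so grad J = sum_a pi(a) Q(a) (e_a - pi).
exact = pi * Q - (pi @ Q) * pi

# Monte Carlo estimate of E[Q(a) grad_theta log pi(a)] from sampled actions.
actions = rng.choice(3, size=100_000, p=pi)
onehot = np.eye(3)[actions]
estimate = (Q[actions][:, None] * (onehot - pi)).mean(axis=0)

print(exact, estimate)  # the two gradients agree up to sampling noise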

3 Supervised Pre-training

In order to stabilize the VMC training, the neural quantum state $\psi_\theta$ in FermiNet [11] and subsequent works [35, 36, 37] is initialized by fitting the Ansatz to an initial guess $\varphi$ at an approximate ground state [11, 19]. This target initial state can be taken to be a Slater determinant optimized using the self-consistent field (SCF) method [38]. We therefore consider the problem of minimizing the distance from the line $\{\lambda\psi\,|\,\lambda\in\mathbb{R}\}$ to a target function $\varphi\in\mathscr{L}^2(\Omega)$.

That is,

\mathcal{L}(\psi)=\min_{\lambda\in\mathbb{R}}\|\lambda\psi-\varphi\|_\rho^2\,, (6)

where $\|\phi\|_\rho=\sqrt{\int_\Omega|\phi(x)|^2\,\mathrm{d}\rho(x)}$ is the norm in the $\mathscr{L}^2$ sense on a probability space $(\Omega,\rho)$ with probability measure $\rho$. In practice the pre-training may fit intermediate vector-valued features such as orbitals [19] instead of the scalar-valued wave function. As we note in Section 5.1, this is formally equivalent to fitting a scalar wave function. Moreover, as we show in Section 5.1, exploiting scale-invariance when fitting the orbitals can improve the convergence of the wave function itself as measured by Equation 6.

Here, $\rho$ is typically the Lebesgue measure, but in Section 5.1 we will also consider a case where it is defined by the target function. This loss function is scale-invariant by construction and has a closed-form expression

\mathcal{L}(\psi)=\|\varphi\|_\rho^2-\big(|\langle\varphi,\psi\rangle_\rho|/\|\psi\|_\rho\big)^2\,. (7)

Note that for a normalized target $\varphi$, Equation 7 gives the squared sine of the angle between the vectors, $\mathcal{L}(\psi)=\sin^2\angle(\varphi,\psi)$. Minimizing $\mathcal{L}(\psi)$ is equivalent to minimizing $\mathcal{L}(\psi)=-\langle\varphi,\psi\rangle_\rho/\|\psi\|_\rho$. Again we write the loss as $L(\theta)=\mathcal{L}(\psi_\theta)$ as a function of the variational parameters:

Problem 3.1 (Supervised learning pre-training).

Given a target function $\varphi\in\mathscr{L}^2(\Omega,\rho)$, access to samples $\{X_i\}\sim\rho$, and query access to $\varphi$, $\psi_\theta$, and $\nabla_\theta\psi_\theta$, minimize

L(\theta)=-\frac{\langle\varphi,\psi_\theta\rangle_\rho}{\|\psi_\theta\|_\rho} (8)

over parameter vectors $\theta\in\mathbb{R}^d$.
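A finite-dimensional sketch (ours) of the closed-form loss in Equation 7: with explicit vectors in place of functions and a normalized target, the loss equals $\sin^2\angle(\varphi,\psi)$ and is unchanged when $\psi$ is rescaled.

import numpy as np

def scale_invariant_loss(psi, phi):
    # L(psi) = ||phi||^2 - (<phi, psi> / ||psi||)^2, cf. Equation 7.
    return phi @ phi - (phi @ psi / np.linalg.norm(psi)) ** 2

rng = np.random.default_rng(1)
phi = rng.standard_normal(10)
phi /= np.linalg.norm(phi)          # normalized target
psi = rng.standard_normal(10)

print(scale_invariant_loss(psi, phi))        # sin^2 of the angle(phi, psi)
print(scale_invariant_loss(7.3 * psi, phi))  # unchanged by rescaling psi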

3.1 Directionally unbiased gradient estimator

By applying the chain rule we can write the gradient of $L(\theta)$ as

\nabla_\theta L(\theta)=[\partial_\psi\mathcal{L}](\psi_\theta)\cdot\nabla_\theta\psi_\theta, (9)

where $[\partial_\psi\mathcal{L}](\psi_\theta)$ is the functional derivative of the scale-invariant loss. This can be evaluated analogously to the gradient derived in [28] for a scale-invariant loss function. The resulting gradient expression is

\nabla_\theta L(\theta)=-\frac{\braket{\varphi,\nabla_\theta\psi_\theta}_\rho}{\|\psi_\theta\|_\rho}+\frac{\braket{\varphi,\psi_\theta}_\rho\braket{\psi_\theta,\nabla_\theta\psi_\theta}_\rho}{\|\psi_\theta\|_\rho^3}\,. (10)

We cannot compute the infinite-dimensional $\mathscr{L}^2$ norms in the denominator of Equation 10 and do not have access to an unbiased estimator for Equation 10. However, in Lemma 4.5, we propose a directionally unbiased gradient estimator $G$, which means that $\mathbb{E}[G]$ is in the positive span of $\nabla_\theta L$. Additionally, we will demonstrate that the directionally unbiased gradient estimator is sufficient for achieving rapid convergence of SGD as long as the norm $\|\psi_\theta\|_\rho$ can be approximately estimated; see Corollary 4.10 for details.

4 Theoretical results

4.1 Convergence result for VMC

In the VMC setting, we assume that the MCMC subroutine is able to sample exactly from the distribution $p_\theta(x)=|\psi_\theta(x)|^2/\|\psi_\theta\|^2$. This sampling oracle makes it possible to construct a cheap and unbiased gradient estimate for the energy functional (3). For simplicity we use exact real-valued arithmetic.

From the formula (4) for the gradient, we derive the following unbiased estimator of $\nabla_\theta L(\theta)$:

Lemma 4.1.

Given $n\geq 2$ i.i.d. samples $\{X_i\}\sim p_\theta$, let $\hat{L}(\theta)=\frac{1}{n}\sum_{i=1}^n\mathcal{E}_\theta(X_i)$ and define the gradient estimate

G(\theta,X)=\frac{2}{n-1}\sum_{i=1}^n\big(\mathcal{E}_\theta(X_i)-\hat{L}(\theta)\big)\nabla_\theta\log|\psi_\theta(X_i)|. (11)

Then $G(\theta,X)$ is an unbiased estimator of the gradient of the population energy.

We prove Lemma 4.1 in Appendix A.

Using the unbiased gradient estimator (11) from Lemma 4.1, the VMC algorithm can be formulated as a variant of SGD:

Algorithm 1 Variational Monte Carlo

1:  Preparation: number of iterations $M$; learning rates $\eta_m$; initial parameter $\theta_0$; number of samples per iteration $n$;
2:  Running:
3:  $m\leftarrow 0$;
4:  while $m\leq M$ do
5:     Sample $\{X_i^m\}_{i=1}^n$ independently according to the density $p_{\theta_m}\propto|\psi_{\theta_m}|^2$;
6:     $G_m\leftarrow G(\theta_m,X^m)=\frac{2}{n-1}\sum_{i=1}^n\big(\mathcal{E}_{\theta_m}(X_i^m)-\hat{L}(\theta_m)\big)\nabla_\theta\log|\psi_{\theta_m}(X_i^m)|$;
7:     $\theta_{m+1}\leftarrow\theta_m-\eta_m G_m$; $m\leftarrow m+1$;
8:  end while
9:  Output: $\theta_M$;
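The following toy example (ours, not the FermiNet code) runs Algorithm 1 end to end for the one-dimensional harmonic oscillator $H=-\frac{1}{2}\frac{d^2}{dx^2}+\frac{1}{2}x^2$ with the one-parameter Ansatz $\psi_\theta(x)=e^{-\theta x^2/2}$. Exact Gaussian sampling stands in for MCMC, and the update uses the unbiased estimator (11); the iterates approach the exact ground-state parameter $\theta=1$.

import numpy as np

rng = np.random.default_rng(0)

def local_energy(theta, x):
    # E_theta(x) = (H psi_theta)(x) / psi_theta(x), in closed form here.
    return 0.5 * theta + 0.5 * (1.0 - theta**2) * x**2

def grad_log_psi(theta, x):
    # d/dtheta log|psi_theta(x)| = -x^2/2.
    return -0.5 * x**2

theta, n, eta = 0.5, 64, 0.02
for m in range(2000):
    # |psi_theta|^2 is a Gaussian density with variance 1/(2 theta).
    x = rng.normal(0.0, 1.0 / np.sqrt(2.0 * theta), size=n)
    e = local_energy(theta, x)
    g = grad_log_psi(theta, x)
    G = (2.0 / (n - 1)) * np.sum((e - e.mean()) * g)  # estimator (11)
    theta -= eta * G

print(theta)  # -> close to 1.0, the exact ground-state parameter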

Given any $\theta$, we define $\|\cdot\|_{q,p_\theta}$ as the $q$-norm under the measure $p_\theta(x)\,\mathrm{d}x$. The convergence bound for the VMC algorithm assumes a uniform bound on the following quantities:

Condition 4.2.

There exists a constant $C_\psi$ such that for any $\theta\in\mathbb{R}^d$,

  • 1. (Value bound):

    \|H\psi_\theta/\psi_\theta\|_{4,p_\theta}\leq C_\psi\,. (12)

  • 2. (Gradient bound):

    \|\nabla_\theta H\psi_\theta/\psi_\theta\|_{2,p_\theta}\leq C_\psi\,, (13)
    \|\nabla_\theta\psi_\theta/\psi_\theta\|_{4,p_\theta}\leq C_\psi\,.

  • 3. (Hessian bound):

    \|\mathsf{H}_\theta\psi_\theta/\psi_\theta\|_{2,p_\theta}\leq C_\psi\,, (14)

    where $\mathsf{H}_\theta\psi_\theta$ is the Hessian of $\psi_\theta$ in $\theta$.

Condition 4.2 is scale-invariant in $\psi_\theta$, i.e., if $\psi_\theta(x)$ satisfies Condition 4.2, then $\lambda\psi_\theta(x)$ also satisfies the condition with the same constant $C_\psi$ for any $\lambda\neq 0$. We establish new bounds on the Lipschitz constant of $\nabla L$ and the variance of the gradient estimator (see Lemma C.1), subject to the constraints specified in Condition 4.2. These bounds are crucial in demonstrating the convergence of our method.

Now, we are ready to give the convergence result for Algorithm 1:

Theorem 4.3.

Under Condition 4.2 there exists a constant $C>0$ that only depends on $C_\psi$ such that, when $\eta_m<\frac{1}{C}$ for any $m\geq 0$,

\sum_{m=0}^M\eta_m\,\mathbb{E}\big(|\nabla_\theta L(\theta_m)|^2\big)\leq 2L(\theta_0)+C\sum_{m=0}^M\frac{\eta_m^2}{n} (15)

for any $M>0$.

The proof of the above theorem is given in Appendix C. The upper bound (15) allows us to choose $\eta_m$ so that the parameter $\theta_m$ converges to a first-order stationary point in the expectation sense when $m$ is moderately large. Theorem 4.3 implies:

Corollary 4.4.

Under Condition 4.2:

  • 1. Choosing $\eta_m=\Theta(n\epsilon^2)$ and $M=\Theta(1/(n\epsilon^4))$, we have

    \min_{0\leq m\leq M}\mathbb{E}\big(|\nabla_\theta L(\theta_m)|\big)\leq\epsilon\,.

  • 2. Choosing $\eta_m=\Theta\big(\sqrt{\frac{n}{m+1}}\big)$, we have

    \min_{0\leq m\leq M}\mathbb{E}\big(|\nabla_\theta L(\theta_m)|\big)=O\big((nM)^{-1/4}\big)\,.

This bound shows that the convergence of Algorithm 1 matches the convergence rate of SGD to a first-order stationary point for nonconvex functions, see [39, Theorem 2] or [40, Theorem 5.2.1].

4.2 Scale-invariant supervised pre-training

The following lemma shows how we can choose a directionally unbiased gradient estimator. Given samples $\{X_i\}$, write $\langle\psi,\varphi\rangle_n=\frac{1}{n}\sum_{i=1}^n\psi(X_i)\varphi(X_i)$ and $\|\psi\|_n^2=\frac{1}{n}\sum_{i=1}^n\psi(X_i)^2$.

Lemma 4.5.

Given $n\geq 2$ and i.i.d. samples $X_1,\ldots,X_n\sim\rho$, define coefficients

a_j=-\|\psi_\theta\|_n^2\,\varphi(X_j)+\langle\varphi,\psi_\theta\rangle_n\,\psi_\theta(X_j)

and let

G=\frac{1}{\|\psi_\theta\|_\rho^3}\frac{1}{n-1}\sum_{j=1}^n a_j\nabla_\theta\psi_\theta(X_j)\,. (16)

Then $\mathbb{E}_X(G)=\nabla_\theta L(\theta)$.

We give the derivation of Lemma 4.5 in Appendix B. Given an estimator $\tilde{Z}$ for $\|\psi_\theta\|_\rho$ which is independent of $X_1,\ldots,X_n$, we then choose the following as our directionally unbiased gradient estimator:

G(\theta,X)=\frac{1}{\tilde{Z}^3}\frac{1}{n-1}\sum_{j=1}^n a_j\nabla_\theta\psi_\theta(X_j)\,, (17)

where $\{a_j\}_{j=1}^n$ are defined in Lemma 4.5.

Remark 4.6.

To obtain the independent norm estimate $\tilde{Z}$ we can take $2n$ samples $X_1,\ldots,X_{2n}$ at each iteration and let $\tilde{Z}=(\frac{1}{n}\sum_{i=n+1}^{2n}|\psi_\theta(X_i)|^2)^{1/2}$ in (17). However, our numerics in Section 5.1 suggest it is sufficient in practice to estimate $\|\psi_\theta\|_\rho$ as $\tilde{Z}=(\frac{1}{n}\sum_{i=1}^n|\psi_\theta(X_i)|^2)^{1/2}$ without sampling an additional independent minibatch.
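A sketch (ours) of the estimator (17) with the split-batch norm estimate of Remark 4.6; `psi`, `grad_psi`, and `phi` are placeholder callables for the Ansatz, its parameter gradient, and the target.

import numpy as np

def pretraining_gradient(psi, grad_psi, phi, X, X_norm):
    """Directionally unbiased estimator (17) on a minibatch X, with the norm
    ||psi_theta||_rho estimated from an independent minibatch X_norm."""
    n = len(X)
    psi_x = np.array([psi(x) for x in X])         # psi_theta(X_i)
    phi_x = np.array([phi(x) for x in X])         # phi(X_i)
    gpsi_x = np.array([grad_psi(x) for x in X])   # grad_theta psi_theta(X_i), shape (n, d)
    ip = np.mean(phi_x * psi_x)                   # <phi, psi_theta>_n
    sq = np.mean(psi_x**2)                        # ||psi_theta||_n^2
    a = -sq * phi_x + ip * psi_x                  # coefficients a_j of Lemma 4.5
    Z = np.sqrt(np.mean([psi(x)**2 for x in X_norm]))  # independent estimate of ||psi_theta||_rho
    return (a @ gpsi_x) / ((n - 1) * Z**3)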

Remark 4.7.

Some care must be taken when constructing a plug-in estimator for the gradient. For example, given an estimate $\tilde{Z}$ of $\|\psi_\theta\|_\rho$ and samples $X_1,X_2\sim\rho$, Equation 10 suggests a plug-in estimate of $\nabla_\theta L(\theta)$ given by

-\frac{\varphi(X_1)\nabla_\theta\psi(X_1)}{\tilde{Z}}+\frac{\varphi(X_2)\psi(X_2)\psi(X_1)\nabla_\theta\psi(X_1)}{\tilde{Z}^3}\,. (18)

However, this estimator is not directionally unbiased and does not achieve the convergence shown in this paper. This is due to the fact that it is unbalanced in the sense that the exponent of $\tilde{Z}$ is different for each term. More specifically, taking the expectation of (18) gives

-\frac{\langle\varphi,\nabla\psi_\theta\rangle}{\tilde{Z}}+\frac{\langle\varphi,\psi_\theta\rangle\langle\psi_\theta,\nabla\psi_\theta\rangle}{\tilde{Z}^3}\neq\lambda\nabla_\theta L(\theta),\quad\forall\lambda\in\mathbb{R}\,.

The detailed algorithm is summarized in Algorithm 2.

Algorithm 2 Stochastic gradient descent algorithm for supervised learning

1:  Preparation: target $\varphi$; Ansatz $\psi_\theta(x)$; number of iterations $M$; learning rates $\eta_m$; initial parameter $\theta_0$; number of samples per iteration $n$;
2:  Running:
3:  $m\leftarrow 0$;
4:  while $m\leq M$ do
5:     Sample $\{X_i^m\}_{i=1}^n$ independently from $\rho$;
6:     $\widetilde{Z}_m\leftarrow$ an estimate of $\|\psi_{\theta_m}\|_\rho$;
7:     $a_i\leftarrow-\|\psi_{\theta_m}\|_n^2\,\varphi(X_i^m)+\langle\varphi,\psi_{\theta_m}\rangle_n\,\psi_{\theta_m}(X_i^m)$;
8:     $G_m\leftarrow G(\theta_m,X^m)=\frac{1}{\widetilde{Z}_m^3}\frac{1}{n-1}\sum_{i=1}^n a_i\nabla_\theta\psi_{\theta_m}(X_i^m)$;
9:     $\theta_{m+1}\leftarrow\theta_m-\eta_m G_m$; $m\leftarrow m+1$;
10:  end while
11:  Output: $\theta_M$;

The choice of learning rate $\eta_m$ in the pre-training setting and the decay rate of the loss function depend on the ratio $\|\psi_{\theta_m}\|_\rho/\widetilde{Z}$. However, for our results it suffices that this ratio is bounded above and below by fixed constants. As mentioned above, we may estimate $\tilde{Z}=(\frac{1}{n}\sum_{i=1}^n|\psi_\theta(X_i)|^2)^{1/2}$ at each iteration using an independent sample $X_{n+1},\ldots,X_{2n}\perp\!\!\!\perp X_1,\ldots,X_n$. Alternatively, we may take inspiration from variance reduction techniques [41, 42, 43] and update $\tilde{Z}$ using a large batch once every $K$ steps, letting $\tilde{Z}_m=(\frac{1}{K}\sum_{i=1}^K|\psi_\theta(X_i)|^2)^{1/2}$ when $m\equiv 0\ (\text{mod }K)$ and $\widetilde{Z}_m=\widetilde{Z}_{m-1}$ otherwise.
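A minimal sketch (ours) of the latter variant: the norm estimate is refreshed from a large batch only once every $K$ steps and reused in between; `sample_rho` is a placeholder for the sampler.

import numpy as np

def refreshed_norm_estimate(psi, sample_rho, m, K, Z_prev, rng):
    if m % K == 0:
        # refresh Z from a large batch of K fresh samples from rho
        X_big = sample_rho(K, rng)
        return np.sqrt(np.mean([psi(x)**2 for x in X_big]))
    return Z_prev  # otherwise reuse the previous estimate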

Our convergence bound for the supervised pre-training assumes uniform bounds similar to Condition 4.2.

Condition 4.8.

There exists a constant $C_\psi$ such that for any $\theta\in\mathbb{R}^d$,

  • 1. (Value bound):

    \|\psi_\theta^2\|_\rho^{1/2}/\|\psi_\theta\|_\rho\leq C_\psi\,. (19)

  • 2. (Gradient bound):

    \||\nabla_\theta\psi_\theta|^2\|_\rho^{1/2}/\|\psi_\theta\|_\rho\leq C_\psi\,. (20)

  • 3. (Hessian bound):

    \|\,\|\mathsf{H}_\theta\psi_\theta\|_2^2\,\|_\rho^{1/2}/\|\psi_\theta\|_\rho\leq C_\psi\,, (21)

    where $\mathsf{H}_\theta\psi$ is the Hessian of $\psi$.

Under Condition 4.8, we establish the Lipschitz property of $\nabla_\theta L$ and provide bounds for the variance of the gradient estimator (see Lemma D.1). These results play a crucial role in demonstrating the convergence of our method. Condition 4.8 is scale-invariant in $\psi_\theta$, meaning that the same condition holds with the same constant for any $\lambda\psi_\theta$, where $\lambda\neq 0$.

Theorem 4.9.

Assume Condition 4.8, $|\varphi|\leq C_\varphi$, and

\frac{1}{C_r}\leq\frac{\|\psi_{\theta_m}\|_\rho}{\widetilde{Z}_m}\leq C_r\,,\quad\mathrm{a.s.} (22)

for all $m\in\mathbb{N}$, where $\theta_m$ comes from Algorithm 2. Then there exists a constant $C$ that only depends on $C_r,C_\psi,C_\varphi$ such that, when $\eta_m<\frac{1}{C}$ for any $m\geq 0$,

\sum_{m=0}^M\eta_m\,\mathbb{E}\big(|\nabla_\theta L(\theta_m)|^2\big)\leq 2C_r^2 L(\theta_0)+C\sum_{m=0}^M\frac{\eta_m^2}{n} (23)

for all $M>0$.

The above theorem is proved in Appendix D. In the proof, we first bound $\mathbb{E}|G_m|^2$ and show that $\nabla_\theta L$ has a uniform Lipschitz constant. Then, the decay rate of the loss function in each step can be lower bounded by $|\nabla_\theta L|^2$, which finally gives us an upper bound on $|\nabla_\theta L|^2$ as shown in (23).

Using Theorem 4.9 it is straightforward to show the following corollary:

Corollary 4.10.

Under Condition 4.8 and (22):

  • 1. Choosing $\eta_m=\Theta(n\epsilon^2)$ and $M=\Theta(1/(n\epsilon^4))$, we have

    \min_{0\leq m\leq M}\mathbb{E}\big(|\nabla_\theta L(\theta_m)|\big)\leq\epsilon\,.

  • 2. Choosing $\eta_m=\Theta\big(\sqrt{\frac{n}{m+1}}\big)$, we have

    \min_{0\leq m\leq M}\mathbb{E}\big(|\nabla_\theta L(\theta_m)|\big)=O\big((nM)^{-1/4}\big)\,.

We note that this bound matches the standard $O(M^{-1/4})$ convergence rate of SGD to a first-order stationary point for non-convex functions.

5 Numerical results

5.1 Pre-training with scale-invariant loss

To evaluate the use of a scale-invariant loss during pre-training we modify the training loss used in the pre-training stage of [19] to be scale-invariant with respect to the trained Ansatz. We tested the scale-invariant loss in a version of the FermiNet code [44] where the electron configurations were sampled from the target state as in [19]. Specifically, the wave function is given by a sum of determinants and the target state is a Slater determinant:

\psi_\theta(x)=\sum_{k=1}^d\Big[\det\big(y_{ij}^{(k)}\big)_{i,j=1}^N\Big],\qquad\varphi(x)=\det\Big[\big(\phi_j(x_i)\big)_{i,j=1}^N\Big],

where $N$ is the number of electrons and the tensor $y$ is the output of the parameterized permutation-equivariant neural network. In our experiment we use the standard FermiNet Ansatz [11] for the network architecture. The pre-training of [19] uses the loss function

\sum_{k=1}^d\sum_{i,j=1}^N|y_{i,j}^{(k)}-\phi_j(x_i)|^2 (24)

on the matrices. To test the use of a scale-invariant loss function during training we replace the training loss of Equation 24 with the sum of column-wise scale-invariant $\sin^2$ distances, given in Equation 25 below. We note that in practice we did not find it necessary to estimate the denominator using an independent sample as in Remark 4.6. Note also that we can view the fitting of a (column) vector-valued function $\Omega\ni x\mapsto(y_i)_{i=1}^N$ as fitting a scalar-valued function $y(x,i)$ defined on $\Omega\times\{1,\ldots,N\}$. The modified loss is

L(\theta)=\sum_{k=1}^d\sum_{j=1}^N\Big(1-\frac{|y_{\cdot,j}^{(k)}\cdot\phi_j(\cdot)|^2}{\|y_{\cdot,j}^{(k)}\|^2}\Big), (25)

where $\phi_j(\cdot)=(\phi_j(x_1),\ldots,\phi_j(x_N))^T$, that is,

L(\theta)=\sum_{k=1}^d\sum_{j=1}^N\Big(1-\frac{|\sum_{i=1}^N y_{i,j}^{(k)}\phi_j(x_i)|^2}{\sum_{i=1}^N|y_{i,j}^{(k)}|^2}\Big). (26)
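A simplified sketch (ours, not the code of the fork [48]) of the column-wise loss (26), assuming the target orbital columns are normalized; `y` holds the backflow matrices with shape $(d,N,N)$ and `phi` the matrix $\phi_j(x_i)$.

import numpy as np

def scale_invariant_pretraining_loss(y, phi):
    """Column-wise sin^2 loss (26); y has shape (d, N, N), phi shape (N, N)
    with columns phi[:, j] = (phi_j(x_1), ..., phi_j(x_N)) assumed normalized."""
    d, N, _ = y.shape
    loss = 0.0
    for k in range(d):
        for j in range(N):
            col = y[k][:, j]              # column y_{.,j}^{(k)}
            overlap = col @ phi[:, j]     # sum_i y_{i,j}^{(k)} phi_j(x_i)
            loss += 1.0 - overlap**2 / (col @ col)
    return loss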
Figure 1: Convergence of supervised pre-training using the scale-invariant training loss (orange) vs. the training loss of [19] (blue). For both optimizers the plotted quantity is the sine of the angle between the target state and the trained state, where the angle is defined in $\mathscr{L}^2(\mathbb{R}^{3N},\rho=|\varphi|^2)$ with respect to the measure induced by the target state density.

To evaluate the performance of the modified pre-training procedure we plot the angle between the target Slater determinant-based wave function and the neural network wave function during each training procedure (Figure 1). The system shown is the lithium atom (details in Appendix F).

5.2 Empirical convergence of VMC algorithm

As an example, we demonstrate the convergence of the VMC method (Algorithm 1) on the hydrogen square (H4) model depicted in Figure 2. This system involves only four particles but is strongly correlated and known to be difficult to simulate accurately. We define $\psi_\theta$ using the FermiNet Ansatz [11, 35] and run 200,000 training steps using stochastic gradient descent with the learning rate schedule $\eta_m=\frac{0.05}{\sqrt{1+m/10000}}$.

Figure 2: Atomic configuration for the square H4 model (four hydrogen atoms with side length 1.0 Bohr).

Figure 3: Convergence of VMC run on the H4 square. The running minimum is taken to smooth out the data and match the form of Corollary 4.4.

With this learning rate schedule, Corollary 4.4 implies that the running minimum of $\mathbb{E}(|\nabla_\theta L(\theta_m)|)$ should converge as $O(M^{-1/4})$ as the optimization progresses. To verify this bound, we plot in Figure 3 the running minimum of $|G_m|$, which is our numerical proxy for $|\nabla_\theta L(\theta)|$ as per Lemma 4.1. We compare two distinct VMC runs using 10 and 1000 walkers, respectively, to explore how the convergence varies with the accuracy of the gradient estimate. We estimate the convergence rate by measuring the overall slope of the log-log plot, starting from step 200 to avoid the initial pre-asymptotic period. We find that the gradient of the loss converges roughly as $M^{-0.27}$ when using 10 walkers, closely matching the expected bound. With 1000 walkers the convergence goes as $M^{-0.38}$, which surpasses the theoretical bound given in Corollary 4.4. Although a rigorous theoretical understanding of this phenomenon is currently lacking, we hypothesize that with this number of walkers, the relatively accurate gradient estimates lead to training dynamics similar to (non-stochastic) gradient descent, giving a faster convergence rate than our theoretical bound.
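A short sketch (ours) of this measurement: take the running minimum of the recorded gradient-norm proxy $|G_m|$ and fit the slope of its log-log plot, skipping the pre-asymptotic steps.

import numpy as np

def loglog_slope(grad_norms, skip=200):
    running_min = np.minimum.accumulate(grad_norms)
    steps = np.arange(1, len(grad_norms) + 1)
    # least-squares slope of log(running min) against log(step)
    slope, _ = np.polyfit(np.log(steps[skip:]), np.log(running_min[skip:]), 1)
    return slope  # e.g. about -0.27 (10 walkers) or -0.38 (1000 walkers) here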

Since bounding the Lipschitz constant of the gradient of the loss function is critical to our convergence proof, we supply in Figure 4 a numerical estimate of this quantity during the VMC run. Here we show data only for the case of 1000 walkers, as the Lipschitz constant estimate is extremely noisy with 10 walkers. The data suggest that the value of the Lipschitz constant is well below 1000 for this simulation.

Figure 4: Lipschitz constant for VMC run on the H4 square with 1000 walkers. The constant is numerically approximated using the formula $|G(\theta_{m+1})-G(\theta_m)|/|\theta_{m+1}-\theta_m|$.

6 Conclusion

We consider two optimization problems that arise in neural network variational Monte Carlo simulations of electronic structure: energy minimization and supervised pre-training. We provide theoretical convergence bounds for both cases. For the setting of supervised pre-training, we note that the standard algorithms do not incorporate the property of scale-invariance. We propose using a scale-invariant loss function for the supervised pre-training and demonstrate numerically that incorporating scale-invariance accelerates the process of pre-training.

SGD is only the simplest of the available stochastic optimization methods. Over the last two decades, there has been increasing interest in developing more efficient and scalable optimization methods for VMC simulations [45, 46, 47], and our analysis may be a starting point for analyzing such methods. It may also be possible to generalize this work from a high-dimensional sphere to a Grassmann manifold, parameterized by a neural network up to a gauge matrix. This setting could be applicable to VMC simulations of excited states of quantum systems.

Acknowledgement

This work was supported by the Simons Foundation under Award No. 825053 (N. A.), and by the Challenge Institute for Quantum Computation (CIQC) funded by National Science Foundation (NSF) through grant number OMA-2016245 (Z. D.). This work is supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Department of Energy Computational Science Graduate Fellowship under Award Number DE-SC0023112 (G. G.). This material is also based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research and Office of Basic Energy Sciences, Scientific Discovery through Advanced Computing (SciDAC) program under Award Number DE-SC0022364 (L.L.). L.L. is a Simons Investigator in Mathematics. This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley (supported by the UC Berkeley Chancellor, Vice Chancellor for Research, and Chief Information Officer).

Disclaimer

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

References

Appendix A Proofs of gradient estimators

Proof of (4).

Let $q_\theta(x)=|\psi_\theta(x)|^2$ and $Q_\theta=\|\psi_\theta\|^2$.

We write (3) as

L(\theta)=\langle q_\theta,\mathcal{E}_\theta\rangle/Q_\theta\,.

Then by the product formula,

\nabla_\theta L(\theta)=\frac{\nabla_\theta\langle q_\theta,\mathcal{E}_\theta\rangle}{Q_\theta}-\frac{\langle q_\theta,\mathcal{E}_\theta\rangle\nabla_\theta Q_\theta}{Q_\theta^2}=:A-B. (27)

Continuing, we have

A=\langle\mathcal{E}_\theta,\nabla_\theta q_\theta\rangle/Q_\theta+\langle q_\theta,\nabla_\theta\mathcal{E}_\theta\rangle/Q_\theta (28)
 =\Big\langle\mathcal{E}_\theta,\frac{\nabla_\theta q_\theta}{q_\theta}\frac{q_\theta}{Q_\theta}\Big\rangle+\Big\langle\frac{q_\theta}{Q_\theta},\nabla_\theta\mathcal{E}_\theta\Big\rangle
 =\mathbb{E}_{p_\theta}[\mathcal{E}_\theta\nabla_\theta\log q_\theta]+\mathbb{E}_{p_\theta}[\nabla_\theta\mathcal{E}_\theta],

and

B=\Big\langle\frac{q_\theta}{Q_\theta},\mathcal{E}_\theta\Big\rangle\frac{\nabla_\theta Q_\theta}{Q_\theta} (29)
 =L(\theta)\Big\langle\frac{q_\theta}{Q_\theta},\frac{\nabla_\theta q_\theta}{q_\theta}\Big\rangle=L(\theta)\,\mathbb{E}_{p_\theta}[\nabla_\theta\log q_\theta].

Substitute (28), (29) into (27) to obtain

\nabla_\theta L(\theta)=\mathbb{E}_{X\sim p_\theta}\big[(\mathcal{E}_\theta(X)-L(\theta))\nabla_\theta\log q_\theta(X)+\nabla_\theta\mathcal{E}_\theta(X)\big] (30)
 =2\,\mathbb{E}_{X\sim p_\theta}\Big[\big(\mathcal{E}_\theta(X)-L(\theta)\big)\nabla_\theta\log|\psi_\theta(X)|+\tfrac{1}{2}\nabla_\theta\mathcal{E}_\theta(X)\Big]\,.

It remains to show that $\mathbb{E}_{X\sim p_\theta}[\partial_i\mathcal{E}_\theta(X)]=0$, where $\partial_i$ is differentiation with respect to $\theta_i$. Write

\partial_i\mathcal{E}_\theta=\partial_i\frac{H\psi_\theta}{\psi_\theta}=\frac{\partial_i H\psi_\theta\cdot\psi_\theta-H\psi_\theta\cdot\partial_i\psi_\theta}{\psi_\theta^2}\,. (31)

So, because $H$ is symmetric:

\mathbb{E}_{X\sim p_\theta}[\partial_i\mathcal{E}_\theta]=\frac{\langle H\partial_i\psi_\theta,\psi_\theta\rangle-\langle H\psi_\theta,\partial_i\psi_\theta\rangle}{\|\psi_\theta\|^2}=0\,. ∎

Proof of Lemma 4.1.

G(\theta,X)=\frac{2}{n-1}\sum_{i=1}^n\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_i)|-\frac{2}{n(n-1)}\sum_{j=1}^n\sum_{i=1}^n\mathcal{E}_\theta(X_j)\nabla_\theta\log|\psi_\theta(X_i)|
 =\frac{2}{n}\sum_{i=1}^n\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_i)|-\frac{2}{n^2-n}\sum_{i\neq j}\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_j)|.

Here, the terms $i=j$ in the expansion of the rightmost term cancel with the first sum. But now each sum is the average of terms with the correct expectation ($2\,\mathbb{E}_{X\sim p_\theta}[\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|]$ and $2L(\theta)\,\mathbb{E}_{X\sim p_\theta}\nabla_\theta\log|\psi_\theta(X)|$, respectively), and their difference is exactly the right-hand side of (4). ∎

Appendix B The directionally unbiased gradient estimator for supervised learning

Given $n\geq 2$ i.i.d. samples $\{X_i\}\sim\rho$, we write $\psi_i=\psi_\theta(X_i)$, $\varphi_i=\varphi(X_i)$, and $\nabla\psi_i=\nabla_\theta\psi_\theta(X_i)$, and set

\langle\psi,\varphi\rangle_n=\frac{1}{n}\sum_{i=1}^n\psi_i\varphi_i,\qquad\|\psi\|_n^2=\frac{1}{n}\sum_{i=1}^n\psi_i^2. (32)
Lemma B.1.

Given samples $X_1,\ldots,X_n$ from $\rho$, let

G_n=\frac{1}{n-1}\sum_{j=1}^n a_j\nabla\psi_j, (33)

a_j=-\|\psi\|_n^2\,\varphi_j+\langle\varphi,\psi\rangle_n\,\psi_j. (34)

Then $G_n$ is an unbiased estimator for $G=\|\psi_\theta\|_\rho^3\nabla_\theta L(\theta)$.

Proof of Lemma B.1.

We notice

G=-\|\psi_\theta\|_\rho^2\langle\varphi,\nabla\psi_\theta\rangle+\langle\varphi,\psi_\theta\rangle\langle\psi_\theta,\nabla\psi_\theta\rangle.

For each $i\neq j$ we have an unbiased estimator for $G$ given by

-|\psi_i|^2\varphi_j\nabla\psi_j+\varphi_i\psi_i\psi_j\nabla\psi_j.

Taking the average over all pairs $i\neq j$ gives another unbiased estimator for $G$:

-\frac{1}{n(n-1)}\sum_{i\neq j}(\psi_i^2\varphi_j\nabla\psi_j-\varphi_i\psi_i\psi_j\nabla\psi_j).

We may add the terms $i=j$ without changing the value because all such terms evaluate to 0. ∎

Appendix C Proof of Theorem 4.3

To prove Theorem 4.3, we first show the following lemma:

Lemma C.1.

Under Condition 4.2 there exists a uniform constant $C'$ such that for any $\theta,\widetilde{\theta}\in\mathbb{R}^d$,

|\nabla_\theta L(\theta)|\leq 4C_\psi^2\,, (35)

|\nabla_\theta L(\theta)-\nabla_\theta L(\widetilde{\theta})|\leq C'(C_\psi^4+1)|\theta-\widetilde{\theta}|\,, (36)

and

\mathbb{E}_{\{X_i\}_{i=1}^n}\big(|G(\theta,X)|^2\big)\leq|\nabla_\theta L(\theta)|^2+\frac{C'(C_\psi^4+1)}{n}\,, (37)

where $G(\theta,X)$ is defined in (11).

The above lemma is important for the proof of Theorem 4.3. First, inequality (37) gives an upper bound for the gradient estimator. Using this upper bound, we can show that each iteration of Algorithm 1 is close to classical gradient descent when the learning rate $\eta$ is small enough. Second, (36) means that the Hessian of $L$ is bounded. This ensures that classical gradient descent converges quickly to a first-order stationary point, which further implies the fast convergence of Algorithm 1.

Proof of Lemma C.1.

Define

Z_\theta=\sqrt{\int_\Omega|\psi_\theta(x)|^2\,dx}\,.

First, to prove (35), using (4) we have

|\nabla_\theta L(\theta)|\leq 2\,\mathbb{E}_{X\sim p_\theta}\Big|\big(\mathcal{E}_\theta(X)-L(\theta)\big)\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|
\leq 2\Big(\mathbb{E}_{X\sim p_\theta}|\mathcal{E}_\theta(X)-L(\theta)|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\leq 4C_\psi^2\,,

where we use Hölder's inequality in the second inequality, and (12) together with the second inequality of (13) in the last inequality.

Next, we prove (36). Using the formula (2) for the local energy, we write

\|\mathsf{H}_\theta L(\theta)\|\leq\underbrace{\frac{2|\nabla_\theta Z_\theta|}{Z_\theta^3}\int_\Omega|H\psi_\theta(x)\nabla_\theta\psi_\theta(x)|\,dx}_{\mathrm{(I)}}+\underbrace{\frac{1}{Z_\theta^2}\int_\Omega\|\nabla_\theta H\psi_\theta(x)\nabla_\theta\psi_\theta(x)\|\,dx}_{\mathrm{(II)}}
+\underbrace{\frac{1}{Z_\theta^2}\int_\Omega|H\psi_\theta(x)|\,\|\mathsf{H}_\theta\psi_\theta(x)\|\,dx}_{\mathrm{(III)}}+\Big\|\nabla_\theta\Big(\frac{1}{Z_\theta^2}\int_\Omega L(\theta)\psi_\theta(x)\nabla_\theta\psi_\theta(x)\,dx\Big)\Big\|\,.

We first deal with Terms (I), (II), (III) separately:

  • 1. For Term (I), we notice

    |\nabla_\theta Z_\theta|=\Big|\frac{\int\nabla_\theta\psi_\theta(x)\psi_\theta(x)\,dx}{\sqrt{\int|\psi_\theta(x)|^2\,dx}}\Big|\leq\sqrt{\int|\nabla_\theta\psi_\theta(x)|^2\,dx}\,,

    where we use Hölder's inequality to bound the numerator. This implies

    \mathrm{Term\ (I)}\leq\frac{2\sqrt{\int|\nabla_\theta\psi_\theta(x)|^2\,dx}}{\sqrt{\int|\psi_\theta(x)|^2\,dx}}\int_\Omega\frac{|H\psi_\theta(x)\nabla_\theta\psi_\theta(x)|}{Z_\theta^2}\,dx
    \leq 2\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\frac{\sqrt{\int_\Omega|H\psi_\theta(x)|^2\,dx}\sqrt{\int_\Omega|\nabla_\theta\psi_\theta(x)|^2\,dx}}{Z_\theta^2}
    =2\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{H\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}
    \leq 2C_\psi^3\,.

    Here, we use Hölder's inequality in the second inequality, and (12)–(14) in the last inequality.

  • 2. For Term (II),

    \frac{1}{Z_\theta^2}\int_\Omega\|\nabla_\theta H\psi_\theta(x)\nabla_\theta\psi_\theta(x)\|\,dx\leq\mathbb{E}_{X\sim p_\theta}\Big(\frac{|\nabla_\theta H\psi_\theta(X)|\,|\nabla_\theta\psi_\theta(X)|}{|\psi_\theta(X)|^2}\Big)
    \leq\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta H\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\leq C_\psi^2\,,

    where we use Hölder's inequality and (12)–(14) in the last two inequalities.

  • 3. For Term (III),

    \frac{1}{Z_\theta^2}\int_\Omega\|H\psi_\theta(x)\,\mathsf{H}_\theta\psi_\theta(x)\|\,dx\leq\mathbb{E}_{X\sim p_\theta}\Big(\frac{|H\psi_\theta(X)|\,\|\mathsf{H}_\theta\psi_\theta(X)\|}{|\psi_\theta(X)|^2}\Big)
    \leq\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{H\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big\|\frac{\mathsf{H}_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big\|^2\Big)^{1/2}\leq C_\psi^2\,,

    where we use Hölder's inequality and (12)–(14) in the last two inequalities.

Combining the above three inequalities, we have

\mathrm{Term\ (I)}+\mathrm{Term\ (II)}+\mathrm{Term\ (III)}\leq 2C_\psi^2(C_\psi+1)\,.

Using a similar calculation, we can also show

\Big\|\nabla_\theta\Big(\frac{1}{Z_\theta^2}\int_\Omega L(\theta)\psi_\theta(x)\nabla_\theta\psi_\theta(x)\,dx\Big)\Big\|\leq C(C_\psi^4+1)\,,

where $C$ is a uniform constant. Thus, the Hessian of $L$ is bounded,

\|\mathsf{H}_\theta L(\theta)\|\leq C(C_\psi^4+1)\,,

which proves (36) by the mean-value theorem.

Finally, to prove (37), we notice

G(\theta,X)=\frac{2}{n}\sum_{i=1}^n\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_i)|-\frac{2}{n^2-n}\sum_{i\neq j}\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_j)|\,,

and

\mathbb{E}_{\{X_i\}_{i=1}^n}\big(G(\theta,X)\big)=\nabla_\theta L(\theta)\,,

so that

\mathbb{E}_{\{X_i\}_{i=1}^n}\big(|G(\theta,X)|^2\big)
\leq|\nabla_\theta L(\theta)|^2+\frac{8}{n^2}\sum_{i=1}^n\mathbb{E}_{X_i\sim p_\theta}\Big(\big|\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_i)|-\mathbb{E}_{X\sim p_\theta}[\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|]\big|^2\Big)
\quad+\frac{8}{(n^2-n)^2}\,\mathbb{E}\Big(\sum_{i\neq j}\mathcal{E}_\theta(X_i)\nabla_\theta\log|\psi_\theta(X_j)|-L(\theta)\,\mathbb{E}_{X\sim p_\theta}\nabla_\theta\log|\psi_\theta(X)|\Big)^2
\leq|\nabla_\theta L(\theta)|^2+\frac{8}{n}\,\mathbb{E}_{X\sim p_\theta}\big|\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|-\mathbb{E}_{X\sim p_\theta}[\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|]\big|^2
\quad+\frac{8}{(n^2-n)^2}\Big(\sum_{i_1=i_2\,\text{or}\,j_1=j_2}1\Big)\,\mathbb{E}_{(X_1,X_2)\sim p_\theta^{\otimes 2}}\big|\mathcal{E}_\theta(X_1)\nabla_\theta\log|\psi_\theta(X_2)|-L(\theta)\,\mathbb{E}_{X\sim p_\theta}\nabla_\theta\log|\psi_\theta(X)|\big|^2
\leq|\nabla_\theta L(\theta)|^2+\frac{8}{n}\,\mathbb{E}_{X\sim p_\theta}\big|\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|\big|^2+\frac{C}{n}\,\mathbb{E}_{(X_1,X_2)\sim p_\theta^{\otimes 2}}\big|\mathcal{E}_\theta(X_1)\nabla_\theta\log|\psi_\theta(X_2)|\big|^2\,,

where the sum in the penultimate line counts pairs of index pairs $(i_1,j_1)$, $(i_2,j_2)$ sharing an index, and we use that $X_i$ and $X_j$ are independent in the first two inequalities. Using (12) and the second inequality of (13), it is straightforward to show:

\mathbb{E}_{X\sim p_\theta}\big|\mathcal{E}_\theta(X)\nabla_\theta\log|\psi_\theta(X)|\big|^2\leq\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{H\psi_\theta(X)}{\psi_\theta(X)}\Big|^4\Big)^{1/2}\Big(\mathbb{E}_{X\sim p_\theta}\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^4\Big)^{1/2}\leq C_\psi^4\,,

and

\mathbb{E}_{(X_1,X_2)\sim p_\theta^{\otimes 2}}\big|\mathcal{E}_\theta(X_1)\nabla_\theta\log|\psi_\theta(X_2)|\big|^2\leq\mathbb{E}_{X\sim p_\theta}\Big(\Big|\frac{H\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)\,\mathbb{E}_{X\sim p_\theta}\Big(\Big|\frac{\nabla_\theta\psi_\theta(X)}{\psi_\theta(X)}\Big|^2\Big)\leq C_\psi^4\,.

This concludes the proof of (37). ∎

Now, we are ready to prove Theorem 4.3:

Proof of Theorem 4.3.

Denote the probability filtration

\mathcal{F}_m=\sigma(X_i^j,\,1\leq i\leq n,\,1\leq j\leq m)\,.

According to the algorithm, we have

\theta_{m+1}=\theta_m-\eta_m G_m\,.

Plugging $\theta_{m+1}$ into $L(\theta)$ and using (36), we have

L(\theta_{m+1})\leq L(\theta_m)-\eta_m\nabla_\theta L(\theta_m)\cdot G_m+\frac{C\eta_m^2}{2}|G_m|^2\,,

where $C$ is a constant that only depends on $C_\psi$. This implies

\mathbb{E}\big(L(\theta_{m+1})\,|\,\mathcal{F}_{m-1}\big)\leq L(\theta_m)-\eta_m|\nabla_\theta L(\theta_m)|^2+\frac{C\eta_m^2}{2}\,\mathbb{E}\big(|G_m|^2\,|\,\mathcal{F}_{m-1}\big)\,,

where we use $\mathbb{E}_{\{X_i\}_{i=1}^n}(G(\theta,X))=\nabla_\theta L(\theta)$. Furthermore, using (37) and $\eta_m<\frac{1}{C}$, we have

\mathbb{E}\big(L(\theta_{m+1})\,|\,\mathcal{F}_{m-1}\big)\leq L(\theta_m)-\big(\eta_m-C\eta_m^2/2\big)|\nabla_\theta L(\theta_m)|^2+\frac{C\eta_m^2}{2n}
\leq L(\theta_m)-\frac{\eta_m}{2}|\nabla_\theta L(\theta_m)|^2+\frac{C\eta_m^2}{2n}\,.

Taking the total expectation and summing over $m=0,1,\ldots,M$, we prove (15). ∎

Appendix D Proof of Theorem 4.9

In this section, we prove Theorem 4.9. We first show the following lemma:

Lemma D.1.

Under Condition 4.8 there exists a constant $C'$ that only depends on $C_\psi,C_r,C_\varphi$ such that for any $\theta,\widetilde{\theta}\in\mathbb{R}^d$,

|\nabla_\theta L(\theta)-\nabla_\theta L(\widetilde{\theta})|\leq C'|\theta-\widetilde{\theta}|\,, (38)

and

\mathbb{E}_{\{X_i\}_{i=1}^n}\big(|G(\theta,X)|^2\big)\leq\frac{Z_\theta^6}{\widetilde{Z}^6}\Big(|\nabla_\theta L(\theta)|^2+\frac{C'}{n}\Big)\,, (39)

where $G(\theta,X)$ is defined in (17).

Proof of Lemma D.1.

The proof is very similar to the proof of Lemma C.1 after setting $H\psi_\theta(x)=g(x)\big(\braket{g(x),\psi_\theta(x)}\big)$. ∎

Now, we are ready to prove Theorem 4.9.

Proof of Theorem 4.9.

According to the algorithm, we have

\theta_{m+1}=\theta_m-\eta_m G_m\,.

Plugging $\theta_{m+1}$ into $L(\theta)$ and using (38), we have

L(\theta_{m+1})\leq L(\theta_m)-\eta_m\nabla_\theta L(\theta_m)\cdot G_m+\frac{C\eta_m^2}{2}|G_m|^2\,,

where $C$ is a constant that only depends on $C_\psi$. Because

\mathbb{E}_{X_1^m,\cdots,X_n^m}(G_m)=\frac{Z_m^3}{\widetilde{Z}_m^3}\nabla_\theta L(\theta_m)\,, (40)

we obtain

\mathbb{E}\big(L(\theta_{m+1})\,|\,\mathcal{F}_{m-1}\big)\leq L(\theta_m)-\eta_m|\nabla_\theta L(\theta_m)|^2\frac{Z_m^3}{\widetilde{Z}_m^3}+\frac{C\eta_m^2}{2}\,\mathbb{E}\big(|G_m|^2\,|\,\mathcal{F}_{m-1}\big)\,. (41)

Using (39) and (22),

\mathbb{E}\big(|G_m|^2\,|\,\mathcal{F}_{m-1}\big)\leq C\Big(|\nabla_\theta L(\theta_m)|^2+\frac{C'}{n}\Big)\,, (42)

where $C$ is a constant that depends on $C_\psi,C_r,C_\varphi$.

Plugging (42) into (41), using (22), and choosing $\eta_m$ small enough, we have

\mathbb{E}\big(L(\theta_{m+1})\big)\leq L(\theta_m)-\frac{\eta_m}{2C_r^2}|\nabla_\theta L(\theta_m)|^2+\frac{C\eta_m^2}{n}\,. (43)

Taking the total expectation and summing over $m=0,1,\ldots,M$, we prove (23). ∎

Appendix E Lipschitz continuity of $Z_\theta$

Lemma E.1.

Assume Condition 4.8, let $\theta,\widetilde{\theta}\in\mathbb{R}^d$, and suppose

|\widetilde{\theta}-\theta|\leq C_r\,.

Then there exists a constant $C'$ that only depends on $C_\psi,C_r,C_\varphi$ such that

\frac{\|\psi_{\widetilde{\theta}}\|_\rho}{\|\psi_\theta\|_\rho}\leq\exp(C'C_r)\,. (44)
Proof of Lemma E.1.

Fixing $\theta,\widetilde{\theta}$, we define

R(t)=\frac{\|\psi_{\theta+t(\widetilde{\theta}-\theta)}\|_\rho^2}{\|\psi_\theta\|_\rho^2}\,.

Then

|R'(t)|\leq\Big|\frac{2\braket{\nabla_\theta\psi_{\theta+t(\widetilde{\theta}-\theta)},\psi_{\theta+t(\widetilde{\theta}-\theta)}}_\rho}{\|\psi_\theta\|_\rho^2}\Big|\,|\widetilde{\theta}-\theta|
\leq\Big|\frac{2\braket{\nabla_\theta\psi_{\theta+t(\widetilde{\theta}-\theta)},\psi_{\theta+t(\widetilde{\theta}-\theta)}}_\rho}{\|\psi_{\theta+t(\widetilde{\theta}-\theta)}\|_\rho^2}\Big|\,\frac{\|\psi_{\theta+t(\widetilde{\theta}-\theta)}\|_\rho^2}{\|\psi_\theta\|_\rho^2}\,|\widetilde{\theta}-\theta|
\leq C|\widetilde{\theta}-\theta|\,R(t)\,,

where we use (19) and (20) in the last inequality. Because $R(0)=1$, Grönwall's inequality gives

\frac{\|\psi_{\widetilde{\theta}}\|_\rho}{\|\psi_\theta\|_\rho}=R^{1/2}(1)\leq\exp(C|\theta-\widetilde{\theta}|)\leq\exp(CC_r)\,. ∎

Appendix F Additional details on scale-invariant supervised training

We created a fork [48] of the FermiNet repository [44] and implemented the scale-invariant loss function of Equation 25 in the columns of the backflow tensors. We used the straight-through gradient estimator [49] to train the network using either the standard loss or our modified scale-invariant loss, while plotting the angle between the wave functions in each case (Figure 1).

The commands for each (using the fork [48] of FermiNet) were:

  • 1.

    Standard (blue)

    ferminet --config ferminet/configs/atom.py \
    --config.system.atom Li \
    --config.batch_size 256 \
    --config.pretrain.iterations 10
    
  • 2.

    Scale-invariant (orange)

    ferminet --config ferminet/configs/atom.py \
    --config.system.atom Li \
    --config.batch_size 256 \
    --config.pretrain.iterations 10 \
    --config.pretrain.SI