This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

thanks: All the authors contributed equallythanks: All the authors contributed equallythanks: All the authors contributed equally

Universal Effectiveness of High-Depth Circuits in Variational Eigenproblems

Joonho Kim School of Natural Sciences, Institute for Advanced Study, Princeton, NJ 08540, USA [email protected]    Jaedeok Kim AI Center, Samsung Research, Seoul 06765, Republic of Korea    Dario Rosa School of Physics, Korea Institute for Advanced Study, 85 Hoegi-ro, Dongdaemun-gu, Seoul 02455, Republic of Korea Department of Physics, Korea Advanced Institute of Science and Technology,
291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
Abstract

We explore the effectiveness of variational quantum circuits in simulating the ground states of quantum many-body Hamiltonians. We show that generic high-depth circuits, performing a sequence of layer unitaries of the same form, can accurately approximate the desired states. We demonstrate their universal success by using two Hamiltonian systems with very different properties: the transverse field Ising model and the Sachdev-Ye-Kitaev model. The energy landscape of the high-depth circuits has a proper structure for the gradient-based optimization, i.e. the presence of local extrema – near any random initial points – reaching the ground level energy. We further test the circuit’s capability of replicating random quantum states by minimizing the Euclidean distance.

I Introduction

Variational Quantum Eigensolver (VQE) Peruzzo et al. (2014); McClean et al. (2016) is one of the most promising hybrid quantum-classical (HQC) algorithms, which may offer a precise approximation of the ground state of quantum systems. It is based on the iterative application of the following three steps: state preparation, measurement and optimization. Let us briefly describe each step. First, the preparation of a trial state |ψ(𝜽)|\psi(\bm{\theta})\rangle is carried out by successive application of unitary quantum gates that depend on variational parameters 𝜽\bm{\theta}. Second, the measurement step estimates the trial state mean energy,

E(𝜽)ψ(𝜽)||ψ(𝜽),\displaystyle E(\bm{\theta})\equiv\langle\psi(\bm{\theta})|\mathcal{H}|\psi(\bm{\theta})\rangle, (1)

by taking the expectation value of Hamiltonian \mathcal{H} of the target system over the trial state. Third, the optimization step adjusts the variational parameters 𝜽\bm{\theta} of the trial wavefunction to minimize the mean state energy E(𝜽)E(\bm{\theta}) by applying a classical optimization algorithm. See Cao et al. (2019) and references therein for more details. After a sufficient number of iterations, the variational state |ψ(𝜽)|\psi(\bm{\theta}^{*})\rangle at a convergence point 𝜽=𝜽\bm{\theta}=\bm{\theta}^{*} is expected to reproduce well the ground state of the target Hamiltonian \mathcal{H}, under the assumption that the state Ansatz |ψ(𝜽)|\psi(\bm{\theta})\rangle is parametrically expressible and well-trainable under the gradient based optimization. It is then a crucial question to find such an Ansatz.

Ideally, we are looking for a universally effective Ansatz capable of solving the VQE problem associated with an arbitrary target Hamiltonian. Note that it is generically a very challenging task: unless the target Hamiltonian is local, the ground states of non-local interacting systems are relatively close to typical quantum states that constitute most of the Hilbert space. Universality is thus equivalent to demand a circuit Ansatz to approximate any random state |ϕ|\phi\rangle at certain values of the circuit parameters.

The existence of a parameter set 𝝋{\bm{\varphi}}, such that |ψ(𝝋)|ϕ|\psi({\bm{\varphi}})\rangle\simeq|\phi\rangle, is therefore a necessary condition for the effectiveness of the variational circuit. On top of that, we must concern if the gradient descent optimization can actually reach the parameters 𝝋{\bm{\varphi}} from randomly initialized ones. Suppose that we use a layered quantum circuit composed of repetitive application of variational layers with the same architecture. Since increasing the number of layers extends the dimension of the parameter space, it can affect the circuit’s expressibility only positively. However, there has been a reported tension between increasing depth of layers and trainability of the circuit via minimizing the mean energy, (1), known as the barren plateau phenomenon McClean et al. (2018).

When the layered circuit reaches a certain depth such that it evolves to an approximate 2-design, a numerical experiment McClean et al. (2018) has shown the exponential decay of the variance of energy gradients 𝜽E(𝜽)\nabla_{\bm{\theta}}E(\bm{\theta}) with respect to the number nn of qubits — for random circuit states obtained by uniformly sampling from the parameter space. Combined with another fact that the random energy gradient vanishes on average, Chebyshev’s inequality implies that the initial energy gradient can exponentially rarely deviate from zero. Such diminishing gradients hinder the beginning of efficient energy minimization, possibly causing the variational state to be stuck on non-optimal plateaus.

There have been several proposals to overcome the vanishing gradient problem and optimize the variational circuit. The most obvious approach is to incorporate some physics information on the target Hamiltonian Liu et al. (2019); Gard et al. (2020); Seki et al. (2020); Wiersema et al. (2020), e.g. the ground state symmetry, for designing a less generic problem-tailored Ansatz. As an alternative direction towards the universal applicability of the variational algorithms, novel initialization Grant et al. (2019); Verdon et al. (2019), architecture Cerezo et al. (2020); Volkoff and Coles (2021), and optimization Stokes et al. (2020); Skolik et al. (2020); Koczor and Benjamin (2020) of generic purpose circuits have been developed to enhance the circuit performance.

The main goal of this paper is to demonstrate the inherent capability of variational circuits as universal and accurate eigensolvers for generic Hamiltonians – if they can even approximate close-to-random states. To this end, we will simply put aside the barren plateau problem by sufficiently increasing the classical computation capacity. Specifically, we will consider the high-depth regime of the layered quantum circuit, thus building an exponentially high-dimensional parameter space. In this regime, the variational circuit has sufficiently many layers and consequently many parameters. Hence, the norm of the gradient vector 𝜽E(𝜽)\|\nabla_{\bm{\theta}}E(\bm{\theta})\| can still grow to a finite size, capable of moving the circuit parameter 𝜽{\bm{\theta}} from initial points, even though the magnitudes of the individual components |iE(𝜽)||\partial_{i}E(\bm{\theta})| are exponentially suppressed for the number nn of qubits. We will illustrate the effectiveness of the high-depth circuits by solving two concrete VQE problems for the following quantum many-body Hamiltonians: the 1d Ising model in a constant transverse magnetic field, which is a prototypical model of locally interacting spin-chain systems, and the Sachdev-Ye-Kitaev (SYK) model, which is a strongly interacting quantum mechanical system of Majorana fermions Sachdev and Ye (1993); Kitaev (2015); Maldacena and Stanford (2016). Despite the striking contrast in these two Hamiltonians’ ground state properties, we will find that the high-depth circuit achieves a very high fidelity in approximating both ground states. Moreover, we will see that the high-depth circuits can narrow the Euclidean distance from random states up to arbitrary precision, indicating outstanding expressibility and trainability that stems from the high-dimensionality of 𝜽{\bm{\theta}}.

The remarkable efficiency of the gradient descent optimization in the over-parameterized regime was also reported for approximating random unitary matrices Kiani et al. (2020) and ground states Wiersema et al. (2020) with certain variational Ansatzes. More generally, over-parameterization is an active research topic that underlies the success of deep learning. It makes large neural networks capable to reach a global minimum during the optimization process, despite the non-convexity of the energy landscape Belkin et al. (2019); Liu et al. (2020). While searching the ground state via the energy minimization, we will observe some phenomena similar to what happens during the training of the neural networks with large parameter spaces, summarized as follows:

First, the energy landscape looks fairly simple in the local vicinity of randomly chosen points Koczor and Benjamin (2020). Generically, an initial point is already confined in a certain basin of attraction, such that an emanating optimization trajectory can quickly arrive at a nearby local extremum. Especially if the circuit has enough layers to become an approximate 2-design, almost all uniformly sampled initial states end up with rather homogeneous energy levels.

Second, for high-depth circuits, all the local extrema in the energy landscape, reachable by the VQE optimization from randomly initialized points, are substantially close to the exact ground energy, i.e., the value of the global minimum. These minima are not isolated individual points but develop multiple flat directions. It explains the robust success of the high-depth circuits in solving the VQE problems.

The rest of this paper is organized as follows. Section II introduces the architecture of the variational circuits used throughout this paper. Section III first reviews the occurrence of the barren plateau phenomenon, then examines the optimization of the variational circuits using the VQE example of the 1d Ising model coupled to a uniform transverse magnetic field. Section IV studies the same VQE problem for the SYK model, thus showing the universal effectiveness of the high-depth circuit. Section V addresses the ability of the variational circuits to reproduce typical states by minimizing the Euclidean distance. In particular, it shows that the high-depth Ansatz is highly expressible and trainable, being able to reach any random state. Finally, Section VI concludes with discussions.

II Circuit Ansatzes

Our focus in this work is to demonstrate the efficiency of high-depth layered circuits in a typical VQE problem, i.e. to approximate the ground state of a given Hamiltonian. To this purpose, here we specify the architecture of the variational circuit used in our numerical experiments. Our circuit state |ψ(𝜽)|\psi(\bm{\theta})\rangle is composed of LL unitary layers acting sequentially on the initial product state |0|0\rangle, i.e.,

|ψ(𝜽)=UL(𝜽L)UL1(𝜽L1)U2(𝜽2)U1(𝜽1)|0,\displaystyle|\psi(\bm{\theta})\rangle=U_{L}(\bm{\theta}_{L})\,U_{L-1}(\bm{\theta}_{L-1})\cdots U_{2}(\bm{\theta}_{2})\,U_{1}(\bm{\theta}_{1})|0\rangle, (2)

where each layer Ui(𝜽i)U_{i}(\bm{\theta}_{i}) has single-qubit yy-rotation gates, parameterized by nn periodic variables 𝜽i\bm{\theta}_{i}, and controlled-zz gates operating on all pairs of nn qubits. More precisely, once the single-qubit RYRY gates have acted upon all individual qubits, the entangling CZCZ gates acting on aa-th controlling and bb-th targeting qubits are arranged for every integer pair (a,b)(a,b) satisfying 1a<bn1\leq a<b\leq n. We have drawn the circuit state (2) with n=4n=4 qubits in Figure 1.

Note that we have deliberately chosen an unbiased circuit architecture, instead of embedding known properties of the ground states that are under search. Nonetheless, as we will see in Sections IIIIV and V, the above circuit can solve the ground states of distinct Hamiltonians by minimizing E(𝜽)E(\bm{\theta}) and accurately approximate random quantum states by minimizing the Euclidean distance, as long as the number LL of layers is sufficiently large.

Refer to caption
(a) Circuit Ansatz
Refer to caption
(b) Layer
Figure 1: The layered circuit Ansatz |ψ(𝜽)|\psi(\bm{\theta})\rangle used in this paper.

III Looking into VQE Trajectories

This section is devoted to the detailed investigation of the VQE optimization procedure, with a particular focus on the effectiveness of the high-depth circuit Ansatz (2). Our exploration will be based on a concrete Hamiltonian system, commonly used in measuring the performance of variational circuits, i.e. the 1d Ising model in a transverse and uniform magnetic field Nakaji and Yamamoto (2020); Wiersema et al. (2020).

The 1d transverse field Ising model is defined over a spin lattice of length nn, consisting of the spin-spin coupling between nearest neighbors in the zz direction, as well as the spin interaction with a background uniform magnetic field along the transverse xx direction. Assuming periodic boundary conditions, σn+1zσ1z\sigma_{n+1}^{z}\equiv\sigma_{1}^{z}, the ferromagnetic Ising Hamiltonian reads

i=1nσizσi+1zgiσix,\displaystyle\mathcal{H}\equiv-\sum_{i=1}^{n}\sigma_{i}^{z}\sigma_{i+1}^{z}-g\sum_{i}\sigma_{i}^{x}\ , (3)

where σix,y,z\sigma_{i}^{x,y,z} is the Pauli operator acting on the ii’th spin, and gg denotes the strength of the uniform magnetic field.

The physics of this model has been well-studied Sachdev (2011). When the lattice size scales up to infinity, nn\to\infty, the system undergoes a quantum phase transition at |g|=1|g|=1 between the ordered (|g|<1|g|<1) and disordered (|g|>1|g|>1) phases. The former phase has the spin-flip 2\mathbb{Z}_{2} symmetry that connects the two opposite ferromagnetic ground states, while the latter phase has a unique ground state, with all the spins aligned along the xx direction.

We will apply the VQE algorithm to the finite nn Ising system, which exhibits some differences from the thermodynamic limit. Specifically, the 2\mathbb{Z}_{2} degeneracy in the 0<|g|<10<|g|<1 phase is broken at finite nn. Our target state will be always non-degenerate and gapped at |g|0|g|\neq 0. All concrete calculations presented in this section have been done for g=2g=2.

1 Barren Plateaus and Classical Resolution

Refer to caption
Figure 2: Tr(2)\mathrm{Tr}(\mathcal{H}^{2}) of the Ising model (3) and the SYK model (20) as a function of the system size nn. The dashed lines are the regression lines with the functional form of anb 2na\cdot n^{b}\,2^{n}, based on the numerical data denoted as small circles.

It was argued in McClean et al. (2018) that optimization of the quantum variational circuits under the random parameter initialization comes with an inherent difficulty, given by the problem of vanishing gradients. For the circuit unitary ensemble that is quantum 2-design, random initial parameters are typically located on a plateau in the energy landscape. It means that the gradient-based optimization cannot even roll out. The odds of having non-vanishing gradients at random points decays exponentially with the system size nn, impeding a large-scale application of the variational circuits. Its consequence would be even more detrimental on actual quantum devices, where the sampling noise must be taken into account McClean et al. (2018); Skolik et al. (2020). Given its prominence, we begin by reviewing the vanishing gradient problem in the VQE setting.

Consider the ensemble of the energy gradients over the parameter space of the variational circuit in Figure 1. For analyzing the kk’th component, kE(𝜽)\partial_{k}E(\bm{\theta}), that belongs to the \ell’th variational layer, it is convenient to group the LL layer unitaries (2) into the following two blocks:

|ψ(𝜽)=U(𝜽)U+(𝜽+)|0,\displaystyle|\psi(\bm{\theta})\rangle=U_{-}(\bm{\theta}_{-})\,U_{+}(\bm{\theta}_{+})\,|0\rangle, (4)

where 𝜽q={θp|pn=q}\bm{\theta}_{q}=\{\theta_{p}\,|\,\lfloor\frac{p}{n}\rfloor=q\}, 𝜽+=a𝜽a\bm{\theta}_{+}=\bigcup_{a\leq\ell}\bm{\theta}_{a}, 𝜽=a>𝜽a\bm{\theta}_{-}=\bigcup_{a>\ell}\bm{\theta}_{a},

U+(𝜽+)U(𝜽)U1(𝜽1)U2(𝜽2)U1(𝜽1),U(𝜽)UL(𝜽L)U+2(𝜽+2)U+1(𝜽+1).\begin{split}U_{+}(\bm{\theta}_{+})&\equiv U_{\ell}(\bm{\theta}_{\ell})U_{\ell-1}(\bm{\theta}_{\ell-1})\cdots U_{2}(\bm{\theta}_{2})\,U_{1}(\bm{\theta}_{1}),\\ U_{-}(\bm{\theta}_{-})&\equiv U_{L}(\bm{\theta}_{L})\cdots U_{\ell+2}(\bm{\theta}_{\ell+2})\,U_{\ell+1}(\bm{\theta}_{\ell+1}).\end{split} (5)

As the partial derivative k\partial_{k} acts only on U(𝜽)U_{\ell}(\bm{\theta}_{\ell}), the variance of the kk’th gradient component, Var𝜽[kE(𝜽)]\text{Var}_{\bm{\theta}}[\partial_{k}E(\bm{\theta})], is

d𝜽(2π)nLψ(𝜽)|[U(𝜽)VkU(𝜽),]|ψ(𝜽)2\displaystyle-\int\frac{d\bm{\theta}}{(2\pi)^{nL}}\langle\psi(\bm{\theta})|[U_{-}(\bm{\theta}_{-})V_{k}U_{-}(\bm{\theta}_{-})^{\dagger},\mathcal{H}]|\psi(\bm{\theta})\rangle^{2} (6)

where VkV_{k} denotes the Pauli operator σy\sigma^{y} acting on the kk’th spin variable, such that Tr(Vk)=0\text{Tr}(V_{k})=0 and Tr(Vk2)=2n\text{Tr}(V_{k}^{2})=2^{n}.

Refer to caption
(a) Sample variance of the energy derivative kE(𝜽)\partial_{k}E(\bm{\theta})
Refer to caption
(b) Sample average of the Euclidean norm of 𝜽E(𝜽)\nabla_{\bm{\theta}}E(\bm{\theta})
Figure 3: The barren plateau experiment for the Ising Hamiltonian (3). The shades indicate (a) the variance across gradient components {kE(𝜽)}k=1nL\{\partial_{k}E(\bm{\theta})\}_{k=1}^{nL}, (b) the first and third sample quantiles.

If we assume the quantum 2-design property of U+(𝜽+)U_{+}(\bm{\theta}_{+}) and/or U(𝜽)U_{-}(\bm{\theta}_{-}), the above integral (6) can be replaced with the unitary matrix integral. In that case, the matrix integral can be handled exactly and simplified to

2Tr(2)22n.\displaystyle\frac{2\text{Tr}(\mathcal{H}^{2})}{2^{2n}}. (7)

Such simplification (7) happens when both U±(𝜽±)U_{\pm}(\bm{\theta}_{\pm}) are 2-designs. Instead, if the 2-design condition does not hold for either of U±(𝜽±)U_{\pm}(\bm{\theta}_{\pm}), we have the following expression:

122nd𝜽(2π)n(L)Tr([Vk,U(𝜽)U(𝜽)]2)\displaystyle-\frac{1}{2^{2n}}\int\frac{d\bm{\theta}_{-}}{(2\pi)^{n(L-\ell)}}\,\text{Tr}([V_{k},U_{-}(\bm{\theta}_{-})^{\dagger}\mathcal{H}U_{-}(\bm{\theta}_{-})]^{2}) (8)

if only U+(𝜽+)U_{+}(\bm{\theta}_{+}) is a 2-design while U(𝜽)U_{-}(\bm{\theta}_{-}) is not, or

Tr(2)22nd𝜽+(2π)nTr([Vk,U+(𝜽+)ρU+(𝜽+)]2)\displaystyle-\frac{\text{Tr}(\mathcal{H}^{2})}{2^{2n}}\int\frac{d\bm{\theta}_{+}}{(2\pi)^{n\ell}}\text{Tr}([V_{k},U_{+}(\bm{\theta}_{+})^{\dagger}\rho\,U_{+}(\bm{\theta}_{+})]^{2}) (9)

where ρ=|00|\rho=|0\rangle\langle 0|, if not U+(𝜽+)U_{+}(\bm{\theta}_{+}) but only U(𝜽)U_{-}(\bm{\theta}_{-}) is a 2-design. Asymptotically in the system size nn, the above expressions (7)–(9) are all bounded as

0Var𝜽[kE(𝜽)]4Tr(2)22n,\displaystyle 0\leq\text{Var}_{\bm{\theta}}[\partial_{k}E(\bm{\theta})]\leq\frac{4\text{Tr}(\mathcal{H}^{2})}{2^{2n}}\ , (10)

where we find the upper bound by expanding the commutator inside the integral of (8) and (9), then applying the following trace inequality that holds for two Hermitian matrices AA, BB Yang and Feng (2002):

|Tr(AB)2m|Tr(A2mB2m)for m.\displaystyle|\text{Tr}(AB)^{2m}|\leq\text{Tr}(A^{2m}B^{2m})\quad\text{for }m\in\mathbb{N}. (11)

To determine the scaling behavior of the upper bound (10) with respect to the system size nn, we examine how Tr(2)\text{Tr}(\mathcal{H}^{2}) scales with nn by extrapolating the values obtained from exact diagonalization of the Hamiltonian (3) up to n17n\leq 17. The result is presented in Figure 2. We see that the term Tr(2)\text{Tr}(\mathcal{H}^{2}) scales as 2n2^{n}, such that the exponential factor in the denominator of the upper bound (10) cannot fully be balanced by Tr(2)\text{Tr}(\mathcal{H}^{2}). Then, inserting the upper bound (10) into Chebyshev’s inequality, which states

Pr(|kE(𝜽)|>ϵ)<Var𝜽[kE(𝜽)]ϵ2\displaystyle\text{Pr}\left(|\partial_{k}E(\bm{\theta})|>\epsilon\right)<\frac{\text{Var}_{\bm{\theta}}[\partial_{k}E(\bm{\theta})]}{\epsilon^{2}} (12)

with kE(𝜽)\partial_{k}E(\bm{\theta}) being a random variable of zero mean, the probability of having a non-zero and finite derivative is exponentially suppressed with growing nn. Therefore, the large-scale VQE problems are expected to suffer from the vanishing gradients.

We have not yet justified the assumption that either of the unitaries U±(𝜽±)U_{\pm}(\bm{\theta}_{\pm}) is quantum 2-design. To see if the problem of exponentially vanishing gradients happens in the circuit unitaries of Figure 1, we numerically estimate the variance of initial gradients by collecting 10310^{3} random energy gradients of the Ising model. Figure 3a clearly exhibits an exponential decay of the partial derivatives with the increasing number nn of qubits, where the shaded region displays component-wise fluctuations of the variance for all 1knL1\leq k\leq nL. It shows that the energy landscape for the circuit in Figure 1 indeed contain the barren plateaus, which hinders the circuit optimization towards the ground state.

On the other hand, Figure 3a shows that the variance, at a fixed value of nn, converges to a constant by increasing the number LL of layers. The variance is independent of LL beyond a transition point L0L_{0}, where the circuit unitaries with LL0L\geq L_{0} evolve to approximate 2-design. Due to the saturation, having exponentially many parameters can compensate for the exponential decay of individual components when calculating the gradient norm 𝜽E(𝜽)\|\nabla_{\bm{\theta}}E(\bm{\theta})\|, which appears in the evolution of the circuit energy under the gradient descent with an infinitesimal rate α\alpha:

dE(𝜽)dτ=α𝜽E(𝜽)2\displaystyle\frac{dE(\bm{\theta})}{d\tau}=-\alpha\|\nabla_{\bm{\theta}}E(\bm{\theta})\|^{2} (13)

Therefore, the vanishing gradient problem inherent to the quantum Hilbert space can be resolved in a classical way, i.e. by using the exponentially high-dimensional 𝜽\bm{\theta}-space.

In agreement with the reasoning above, Figure 3b exhibits a small initial drop dominated by the transient decrease of Var𝜽[kE(𝜽)]\text{Var}_{\bm{\theta}}[\partial_{k}E(\bm{\theta})] for LL0L\leq L_{0} layers, then a steady increase driven by the linear growth in the number of circuit parameters. One can approximate the asymptotic increase rate of the norm 𝜽E(𝜽)\|\nabla_{\bm{\theta}}E(\bm{\theta})\| as

𝜽E(𝜽)nL×Var𝜽[kE(𝜽)]\displaystyle\|\nabla_{\bm{\theta}}E(\bm{\theta})\|\sim\sqrt{nL\times\text{Var}_{\bm{\theta}}[\partial_{k}E(\bm{\theta})]} (14)

which agrees well with the numerically estimated growth rates between log𝜽E(𝜽)\log\|\nabla_{\bm{\theta}}E(\bm{\theta})\| and logL\log L in Figure 3b:

We remark that the barren plateau phenomenon concerns only the initial steps of the gradient descent update. In almost all physical systems of relevance, the Hamiltonian spectrum is symmetrically distributed around a certain value, which we canonically set to zero. A random state is therefore in a superposition of multiple Hamiltonian eigenstates, whose mean energy is almost certainly zero. On the other hand, the variational mean energy quickly becomes a negative value after a few steps of the VQE optimization, implying that the circuit state is no longer represented by sample statistics of random states.

Having found that the vanishing gradient problem can be trivially avoided at the cost of having exponentially many parameters, we turn to answer if the variational circuit with a sufficiently many layers can actually solve the VQE problems. We will examine the optimization error, the training curve, and the trajectory in the parameter space at different values of the circuit depth LL.

2 Optimizing the Circuit

We now come back to original task of approximating the ground state of the Ising model, Eq. (3).

The VQE optimization results of the circuit states (2), with different numbers LL of layers and under the random initialization of the circuit parameters 𝜽\bm{\theta}, can be summarized to the following two features:

  1. 1.

    For small enough LL, the minimized energies E(𝜽)E(\bm{\theta^{*}}) that the circuit states can reach are highly variable.

  2. 2.

    For larger LL, the E(𝜽)E(\bm{\theta^{*}}) distribution turns concentrated around a mean value which gets smaller.

We illustrate them with the outcomes of the VQE algorithm for the case of a chain having n=10n=10 sites and for L{8,10,12,14,16,18,20}L\in\{8,10,12,14,16,18,20\} layers. To find the optimal parameters 𝜽\bm{\theta}^{*} that minimize the energy, we use the Adam optimization algorithm Kingma and Ba (2017), which iteratively updates the variational parameters 𝜽\bm{\theta} by the exponential moving averages of gradients and their squares. It has a clear advantage in convergence speed, being widely used in a variety of deep learning models. The parameter update rule is decided by a choice of three hyperparameters (α,β1,β2)(\alpha,\beta_{1},\beta_{2}). We have collected 35 independent VQE runs for the Ising Hamiltonian (3), whose initial parameters are randomly sampled from the uniform distribution 𝒰(0,2π)nL\mathcal{U}(0,2\pi)^{\otimes nL}, after 500 parameter updates with the hyperparameters (α,β1,β2)=(0.05,0.9,0.999)(\alpha,\beta_{1},\beta_{2})=(0.05,0.9,0.999).

Refer to caption
Figure 4: Optimized VQE energy (E(𝜽)E0E(\bm{\theta^{*}})-E_{0}) density for the Ising model over 35 distinct runs with random initialization.

The sample distribution of the final VQE energy E(𝜽)E(\bm{\theta^{*}}) is visualized in Figure 4 for different numbers LL of layers. One can characterize it as follows. First, the energy distribution at the local extremum 𝜽\bm{\theta^{*}} clearly exhibits the widespread spectrum for a shallow depth, e.g.,

3.52E(𝜽)E012.6for L=8,4.67E(𝜽)E016.9for L=10.\displaystyle\begin{array}[]{cl}3.52\leq E(\bm{\theta^{*}})-E_{0}\leq 12.6&\quad\text{for $L=8$},\\ 4.67\leq E(\bm{\theta^{*}})-E_{0}\leq 16.9&\quad\text{for $L=10$}.\end{array}

sometimes achieving a relatively good energy level while the gap always persists. Second, by stacking more layers, i.e., L12L\geq 12, deeper than the 2-design transition point detected in Figure 3a, the energy distribution starts to concentrate around a value, which is far from the ground energy E0E_{0}. Third, the average value of E(𝜽)E(\bm{\theta^{*}}) continues to decrease for the growing number LL of layers, suggesting that the high-depth variational circuits can possibly simulate the ground state using the VQE optimization.

Refer to caption
(a) The VQE optimization error E(𝜽)E0E(\bm{\theta^{*}})-E_{0}
Refer to caption
(b) Fidelity between the optimized and true ground states
Figure 5: Single-run VQE outcomes for the Ising model (3) using the layered circuit Ansatze (2) at different depths LL.
Refer to caption
(a) With the constant hyperparameters
Refer to caption
(b) With the exponential decay (17) of hyperparameters
Figure 6: Optimization curves of the circuit state on the Ising model at 88 qubits. The upper and lower plots denote the VQE error E(𝜽τ)E0E(\bm{\theta}_{\tau})-E_{0} and the fidelity |ψ(𝜽)|ϕ|2|\langle\psi({\bm{\theta}^{*}})|\phi\rangle|^{2} between the circuit and true ground states, respectively, as a function of the optimization steps τ\tau. Notice the late-time fluctuation has been alleviated with the learning rate scheduling.

Encouraged by the observed shrinkage of the mean and variance of E(𝜽)E(\bm{\theta^{*}}), we also have done the single VQE tryouts for a broader span of (n,L)(n,L), as summarized in Figure 5. It is clear that being deeper enhances the variational circuit’s capability to replicate the ground state |ψ0|\psi_{0}\rangle, reaching the ground state energy E0E_{0} and achieving a high level of fidelity with |ψ0|\psi_{0}\rangle. The high-depth circuits can achieve a zero error with the following precision:

|E(𝜽)E0|105ΔE\displaystyle|E(\bm{\theta})-E_{0}|\leq 10^{-5}\cdot\Delta E (15)

where ΔE\Delta E denotes the bandwidth of the target Hamiltonian \mathcal{H}, defined as the difference between the largest and smallest eigenvalues of \mathcal{H}. We note that such precision can be achieved only with an appropriately chosen learning rate α\alpha, in order to avoid too-large parameter updates that prevent the fine-level optimization. Here we refrain from the systematic hyperparameter search, which may be more relevant for the case where the average gap between nearby energy levels shrinks, but simply stick to α=0.05\alpha=0.05 (L=4,6,8L=4,6,8) and α=0.01\alpha=0.01 (L=10L=10). We have found that (15) can be achieved when the circuit depth LL passes through the following threshold value:

n qubits46810Lv layers102468250\displaystyle\begin{array}[]{>{\raggedleft}p{1.65cm}|*{4}{>{\centering}p{0.75cm}}}\hline\cr\text{$n$ qubits }\@add@raggedleft&4\@add@centering&6\@add@centering&8\@add@centering&10\@add@centering\cr\hline\cr\text{$L_{v}$ layers }\@add@raggedleft&10\@add@centering&24\@add@centering&68\@add@centering&250\@add@centering\cr\hline\cr\end{array}

We also remark that deeper circuits do not necessarily lead to better performance with VQE, as displayed by the gentle ramp after passing the threshold point LvL_{v}. This can be understood as follows: As the space of variational parameters 𝜽\bm{\theta} has more dimensions, the basin of attractor to local extrema becomes narrower Cerezo et al. (2020), giving a larger value of the estimated inverse volume Wu et al. (2017)

Vk1=i=1klogλi(H)\displaystyle V_{k}^{-1}=\sum_{i=1}^{k}\log{\lambda_{i}(H)} (16)

where the summation is taken over the top-kk eigenvalues {λi(H)}i=1k\{{\lambda_{i}(H)}\}_{i=1}^{k} of the Hessian matrix HijijE(𝜽)H_{ij}\equiv\partial_{i}\partial_{j}E({\bm{\theta})}. For instance, the positive correlation between LL and Vk1(L)V_{k}^{-1}(L) can clearly be identified in Figure 7a, drawn for k=100k=100. As a result, the VQE trajectory is unlikely to land at an exact extremum but wander around nearby points, whose deviation gets larger as the attractor basin becomes narrower and steeper.

Even with the optimal number of layers, such that the circuit state can reach an exact extremum and accurately represent the ground state |ψ0|\psi_{0}\rangle of the Ising Hamiltonian (3), the narrowness of the attractor basin still makes the VQE trajectory somewhat unstoppable, passing through the best point 𝜽\bm{\theta^{*}} and then hopping around in the local neighborhood. Figure 6a shows that the VQE error E(𝜽τ)E0E({\bm{\theta}_{\tau}})-E_{0} slightly increases on average and mildly fluctuates after achieving the minimum error E(𝜽)E0E(\bm{\theta^{*}})-E_{0}. This residual error can be reduced by making use of popular optimization tricks, such as early stopping or learning rate scheduling, which causes the VQE optimization to stop at the optimal point. For instance, by introducing the exponential decay of the learning rate α\alpha, i.e.

ατ=α0cτ/500,τ0\displaystyle\alpha_{\tau}=\alpha_{0}c^{\tau/500},\quad\tau\geq 0 (17)

at the optimization step τ\tau with a constant value c=0.3c=0.3, we could reduce the late time fluctuations as in Figure 6b.

So far, we have discussed some important aspects of the VQE optimization. The main observation is that, when supported by the high-dimensional parameter space, randomly initialized variational circuits can approximate the Ising ground state with remarkably high accuracy. The energy gradient neither vanishes nor randomly fluctuates along the optimization trajectory, making it quickly converge to a local minimum that exhibits a small energy gap from the ground energy E0E_{0} and high fidelity with the exact ground state |ψ0|\psi_{0}\rangle. We will explore this efficiency of the high-depth circuit in some details by visualizing the VQE trajectory on the energy landscape.

3 Visualizing the Trajectory

Refer to caption
(a) The mean of the top 100100 Hessian eigenvalues
Refer to caption
(b) The horizontal and vertical axes denote the 𝒮100(𝜽)\mathcal{S}_{100}(\bm{\theta^{*}})-projected Euclidean distance and the VQE error E(𝜽τ)E0E(\bm{\theta}_{\tau})-E_{0}, respectively. The bottom boxes magnify the trajectory in the vicinity of the optimal point 𝜽\bm{\theta}^{*}.
Figure 7: The visualization of the optimization trajectory in the 100 steepest directions for the Ising model with n=8n=8 qubits.

A key characteristic that contributes to the success of the high-depth circuit is the fact that randomly initialized points are likely to be already confined in the basin of attraction of a good attractor, i.e. a local extremum that is close enough to the ground state energy. We illustrate now this point by looking at the actual optimization trajectories under the energy minimization.

For a given initial point 𝜽0\bm{\theta}_{0} and the trajectory 𝒯(𝜽0)={𝜽τ}\mathcal{T}(\bm{\theta}_{0})=\{\bm{\theta}_{\tau}\} thereafter, we identify the optimal parameter 𝜽\bm{\theta^{*}} as the point in the trajectory 𝒯(𝜽0)\mathcal{T}(\bm{\theta}_{0}) having the minimum energy expectation value,

E(𝜽)E(𝜽τ)for any τ0,\displaystyle E(\bm{\theta^{*}})\leq E(\bm{\theta}_{\tau})\quad\text{for any }\tau\geq 0, (18)

which is the best possible representative point of the local extrema that the trajectory 𝒯(𝜽0)\mathcal{T}(\bm{\theta}_{0}) converges around. The associated basin of attractor can be estimated by calculating the Hessian HijijE(𝜽)H_{ij}\equiv\partial_{i}\partial_{j}E(\bm{\theta^{*}}) at the optimum 𝜽\bm{\theta^{*}}. We are interested in the eigenvalue spectrum, {λi(H)}\{\lambda_{i}(H)\}, from which we distinguish steep and flat directions and calculate the degree of steepness.

Similar to the case of deep neural networks Sagun et al. (2017); Chaudhari et al. (2017); Wu et al. (2017), the local extrema in the VQE energy landscape of high-depth circuits are often interconnected in multiple flat directions, whose corresponding Hessian eigenvalues are zero, looking like valleys rather than isolated singular points. Success of the VQE algorithm is not affected by a specific position in flat directions, but only concerned with if a ball, initially at a considerable height, can roll off in steep directions and reach a sufficiently deep gorge. It motivates us to consider the kk-dimensional hypersurface 𝒮k(𝜽)\mathcal{S}_{k}(\bm{\theta^{*}}) along the kk steepest directions, i.e., spanned by the top-kk Hessian eigenvectors. More precisely, we want to examine if the 𝒮k(𝜽)\mathcal{S}_{k}(\bm{\theta^{*}})-projected Euclidean distance between 𝜽\bm{\theta^{*}} and 𝜽τ\bm{\theta}_{\tau} decreases along the trajectory 𝒯(𝜽0)\mathcal{T}(\bm{\theta}_{0}) as the optimization step τ\tau progresses, until E(𝜽τ)=E(𝜽)E(\bm{\theta}_{\tau})=E(\bm{\theta^{*}}). This will tell us if the entire trajectory 𝒯(𝜽0)\mathcal{T}(\bm{\theta}_{0}) is confined in the kk-dimensional basin of attraction, while ignoring the movement in the directions of less or zero attraction.

Figure 7b displays the VQE error E(𝜽τ)E0E(\bm{\theta}_{\tau})-E_{0} and the projection distance 𝜽τ𝜽𝒮100(𝜽)\|\bm{\theta}_{\tau}-\bm{\theta^{*}}\|_{\mathcal{S}_{100}(\bm{\theta^{*}})} along the actual optimization trajectories, whose step number τ\tau is indicated by color. We have made the following observations: First, when an appropriate value of kk is selected, both the distance 𝜽τ𝜽𝒮k(𝜽)\|\bm{\theta}_{\tau}-\bm{\theta^{*}}\|_{\mathcal{S}_{k}(\bm{\theta^{*}})} and the loss value E(𝜽τ)E0E(\bm{\theta}_{\tau})-E_{0} continuously decrease on a macroscopic scale. It exhibits that the trajectory 𝒯(𝜽0)\mathcal{T}(\bm{\theta}_{0}) converges without escaping from a specific basin of the attractor that encloses a randomly initialized point 𝜽0\bm{\theta}_{0}. Second, for a shallow circuit state, the optimization trajectory often makes a slight detour in some orthogonal directions. In contrast, the steady convergence occurs typically for large LL. It implies that the vicinity of any randomly initialized parameters is effectively convex. Finally, we find from Figure 7a that the attractor basin in the direction of 𝒮k(𝜽)\mathcal{S}_{k}(\bm{\theta^{*}}) evolves rapidly steeper and narrower as the depth LL increases, thereby causing the rapid convergence and substantial late-time fluctuation around 𝜽\bm{\theta^{*}}. It is noticeable that the initial convergence follows a milder path than later fluctuations in the L=64L=64 case, while the convergence in the L=80L=80 case happens along a much steeper route. Taken together, these observations show the quick convergence of high-depth circuits Kiani et al. (2020); Wiersema et al. (2020) under the VQE algorithm.

IV Solving the SYK model

Our discussions so far have been based on just a particular Hamiltonian, (3), the Ising model in a transverse uniform magnetic field. Since the variational circuit that we use has no features particularly tailored for the Ising model, we expect it to closely replicate the ground states of other Hamiltonians in the high-depth regime. To check the generality of the prior discussions on the efficiency of the high-depth circuits, we will now solve the VQE problem, defined for another Hamiltonian of very distinct nature: the Sachdev-Ye-Kitaev (SYK) model.

1 The SYK Model

The SYK model Sachdev and Ye (1993); Kitaev (2015); Maldacena and Stanford (2016) is built out of 2n2n Majorana fermions in 1d1d, i.e. the operators γi\gamma_{i}, with i=1,,2ni=1,\,\dots,2n, satisfying the following anti-commutation relations

{γi,γj}=δij,\left\{\gamma_{i},\gamma_{j}\right\}=\delta_{ij}\ , (19)

where δij\delta_{ij} denotes the Kronecker delta. The SYK Hamiltonian is an all-to-all Hamiltonian, which couples all the Majorana fermions together in a fully non-local fashion, consisting of the following qq-body interaction terms with q2q\geq 2 being an even integer:

(i)q/2i1<<iqJi1iqγi1γiq,\mathcal{H}\equiv(i)^{q/2}\sum_{i_{1}<\dots<i_{q}}J_{i_{1}\dots i_{q}}\gamma_{i_{1}}\cdots\gamma_{i_{q}}\ , (20)

where the coupling constants Ji1iqJ_{i_{1}\dots i_{q}} are randomly sampled from the Gaussian distribution of mean 0 and variance

Ji1iq2J2(q1)!(2n)q1,\langle J_{i_{1}\dots i_{q}}^{2}\rangle\equiv\frac{J^{2}(q-1)!}{(2n)^{q-1}}\ , (21)

where J2J^{2} is a constant which we set to be equal to one. The model has recently attracted a widespread attention from different communities, due to some peculiar features it enjoys. It has been shown that when q4q\geq 4 the model is highly chaotic Kitaev (2015); Maldacena and Stanford (2016); García-García and Verbaarschot (2016); Cotler et al. (2017), although solvable in the large nn limit Kitaev (2015); Maldacena and Stanford (2016), thus creating a perfect situation to study relevant questions on quantum chaos which are usually out of reach for other chaotic models. Moreover, the SYK model has intriguing connections with the physics of black holes and quantum gravity, promoting itself as an ideal candidate to address new questions on holography and the AdS/CFT correspondence.

We will focus our attention on the SYK model with q=4q=4. It has two notable features that can be a source of trouble for any eigensolver algorithms, regardless of being classical or quantum mechanical, whose goal is to reach the ground state. First, the energy spectrum of the SYK model is very dense, especially having a small energy gap between the ground state E0E_{0} and all other excited states. Second, the SYK ground state is much less distinguishable from generic quantum states in the Hilbert space, supporting the volume law scaling of the entanglement entropy. Therefore, the VQE computation with the SYK model must be seen as a highly challenging benchmark Kim and Oz (2021).

As another characteristic, the SYK model is known to have two-fold degenerate eigenstates, if and only if n=4k+2n=4k+2 for any positive integer kk García-García and Verbaarschot (2016); Cotler et al. (2017). Denoting the two-fold degenerate ground states by |ϕ1|\phi_{1}\rangle and |ϕ2|\phi_{2}\rangle, that we assume to be normalized and orthogonal to each other, the VQE target states for the SYK model are then given by all the possible linear combinations of the form:

|ϕ0=α|α|2+|β|2|ϕ1+β|α|2+|β|2|ϕ2\displaystyle|\phi_{0}\rangle=\frac{\alpha}{\sqrt{|\alpha|^{2}+|\beta|^{2}}}|\phi_{1}\rangle+\frac{\beta}{\sqrt{|\alpha|^{2}+|\beta|^{2}}}|\phi_{2}\rangle (22)

with α\alpha and β\beta being complex numbers. Therefore, the distance between the circuit state |ψ(𝜽)|\psi(\bm{\theta})\rangle and the closest state of the form |ϕ0|\phi_{0}\rangle can be measured by computing

|ψ(𝜽)|ϕ1|2+|ψ(𝜽)|ϕ2|21\displaystyle|\langle\psi(\bm{\theta})|\phi_{1}\rangle|^{2}+|\langle\psi(\bm{\theta})|\phi_{2}\rangle|^{2}\leq 1 (23)

with the inequality which is saturated whenever |ψ(𝜽)|\psi(\bm{\theta})\rangle takes exactly the form (22).

We will numerically show that the high-depth quantum circuit can effectively learn the ground state of the SYK model. It will imply that even complicated systems, involving non-local interactions and a high level of entanglement, can be universally simulated through the variational circuits.

2 Optimizing the Circuit

Refer to caption
(a) Sample variance of the energy derivative kE(𝜽)\partial_{k}E(\bm{\theta})
Refer to caption
(b) Sample average of the Euclidean norm of 𝜽E(𝜽)\nabla_{\bm{\theta}}E(\bm{\theta})
Figure 8: The barren plateau experiment for the SYK Hamiltonian (20). The shades indicate (a) the variance across gradient components {kE(𝜽)}k=1nL\{\partial_{k}E(\bm{\theta})\}_{k=1}^{nL}, (b) the first and third sample quantiles.

We start by discussing if the SYK Hamiltonian causes the vanishing gradient problem for the variational circuit in Figure 1. When at least one of U±(𝜽±)U_{\pm}(\bm{\theta}_{\pm}) is a 22-design, the scaling behavior of Tr(2)\text{Tr}(\mathcal{H}^{2}) will determine if the SYK energy gradient will be exponentially suppressed or not. The Hamiltonian (20) has been exactly diagonalized up to n=15n=15. Moreover, an approximate analytical formula of the spectral density is at disposal in García-García and Verbaarschot (2017), which we use to extrapolate Tr(2)\text{Tr}(\mathcal{H}^{2}) for the large system size nn. The results are presented in Figure 2. We clearly see, as in the Ising model, that Tr(2)\mathrm{Tr}(\mathcal{H}^{2}) scales like 2n2^{n}, which indicates the abundance of the barren plateaus in the VQE energy landscape of the SYK model.

The vanishing gradient problem has also been observed numerically by computing the sample variance of kE(𝜽)\partial_{k}E(\bm{\theta}) over a collection of 10001000 random parameters, as displayed in Figure 8a. Clearly, the energy gradient kE(𝜽)\partial_{k}E(\bm{\theta}) is exponentially suppressed by increasing nn. We also see that for the SYK model, contrary to the Ising Hamiltonian, there is almost no transient regime where kE(𝜽)\partial_{k}E(\bm{\theta}) is still large, yet decreasing for the growing number LL of layers. The lack of the transient regime, which is a consequence of the non-local nature of the SYK interactions Cerezo et al. (2020), is a clear obstacle in using the variational circuit to approximate the SYK ground state in the low-depth regime.

On the other hand, as we can see from Figure 8b, the norm of the gradient vector is again increasing for the growing number LL of layers, due to the saturation of the Var𝜽[kE(𝜽)]\text{Var}_{\bm{\theta}}[\partial_{k}E(\bm{\theta})] with respect to LL. The empirically measured growth rates between log𝜽E(𝜽)\log\|\nabla_{\bm{\theta}}E(\bm{\theta})\| and logL\log L are

matching the simple estimation formula (14). It tells that the gradient-based optimization can at least launch for high-depth quantum circuits, which bypass the vanishing gradient problem using an exponentially high-dimensional parameter space.

Refer to caption
Figure 9: Optimized VQE energy (E(𝜽)E0E(\bm{\theta^{*}})-E_{0}) density for the SYK model over 35 distinct runs with random initialization.

As the next step, we examine the performance of the variational circuit Ansatz (2) in reaching the ground state of the SYK Hamiltonian, by repeating the same numerical experiments conducted in Section 2 under the same choice of hyperparameters (α,β1,β2)=(0.05,0.9,0.999)(\alpha,\beta_{1},\beta_{2})=(0.05,0.9,0.999).

First, the sample distribution of the minimized energy E(𝜽)E(\bm{\theta}^{*}) for randomly initialized circuits with 4L164\leq L\leq 16 layers is illustrated in Figure 9, based on 35 sample VQE runs at each LL. A notable observation is the small variance of the final energy, compared to the Ising VQE energy distribution in Figure 4, even at very low depth. We interpret it as another manifestation of the fact that the ground state that we aim to approximate is less distinguishable from generic quantum states. Hence, developing an optimization trajectory demands more classical computing power through the high-dimensional 𝜽\bm{\theta}-space Kim and Oz (2021). Another – and perhaps the most important – point to stress is that, the mean value of the minimized circuit energy E(𝜽)E(\bm{\theta}^{*}) decreases by stacking more layers, just like what has happened to the Ising VQE problem. It suggests that the high-depth circuits can reach a very good approximation of the SYK ground states.

Refer to caption
(a) The VQE optimization error E(𝜽)E0E(\bm{\theta^{*}})-E_{0}
Refer to caption
(b) Fidelity between the optimized and true ground states. As explained near (22)–(23), for the two-fold degenerate case n=4k+2n=4k+2, the sum |ψ(𝜽)|ϕ1|2+|ψ(𝜽)|ϕ2|2|\langle\psi({\bm{\theta}^{*}})|\phi_{1}\rangle|^{2}+|\langle\psi({\bm{\theta}^{*}})|\phi_{2}\rangle|^{2} is drawn.
Figure 10: Single-run VQE outcomes for the SYK model (20) using the layered circuit Ansatze (2) at different depths LL.

Second, we have measured the performance of the layered circuit Ansatz (2) in approximating a ground state of the SYK model (20), as a function of the circuit depth. Specifically, the VQE single-run error E(𝜽)E0E({\bm{\theta}^{*}})-E_{0} is drawn in Figure 10. The high-depth circuit performs very well, reaching a zero error with the following accuracy,

|E(𝜽)E0|105ΔE,\displaystyle|E(\bm{\theta})-E_{0}|\leq 10^{-5}\cdot\Delta E,

when the depth LL arrives at the following values.

n qubits46810Lv layers123096220\displaystyle\begin{array}[]{>{\raggedleft}p{1.65cm}|*{4}{>{\centering}p{0.75cm}}}\hline\cr\text{$n$ qubits }\@add@raggedleft&4\@add@centering&6\@add@centering&8\@add@centering&10\@add@centering\cr\hline\cr\text{$L_{v}$ layers }\@add@raggedleft&12\@add@centering&30\@add@centering&96\@add@centering&220\@add@centering\cr\hline\cr\end{array}

We also note from Figure 10b that the fidelity between the optimized and ground states tends to decrease at an intermediate scale of depth, while the VQE error continues to reduce without temporary increase. Such contrasting behavior is due to the dense energy spectrum of the SYK model near the ground energy level E0E_{0}. With an intermediate-depth circuit, the VQE algorithm is going to approximate not the exact ground states, but some low-lying excited states. In this way, the energy continues to decrease but the fidelity does not improve. However, for sufficiently deep circuits, the VQE algorithm can overcome the difficulty and reach an excellent agreement with the ground state.

Interestingly, we have seen that the necessary number LvL_{v} of layers to reach the high precision (15) is roughly in the same order, both for the SYK and Ising models. We will also see that the circuit (2) with LLvL\geq L_{v} achieves high precision in replicating random states by minimizing the Euclidean distance, as shown in Figure 11. This compatibility highlights the universal effectiveness of the high-depth circuit in approximating any quantum state in the Hilbert space, both generic and non-generic ones.

V Approximating Random States Using Euclidean Loss Function

Taking one step further, we will look into the high-depth circuit’s capability to approximate generic quantum states |ϕ|\phi\rangle by minimizing the Euclidean distance.

Recall that the variational circuit |ψ(𝜽)|\psi(\bm{\theta})\rangle with LL layers depends on nLnL parameters, which are periodic over the finite interval [0,2π)\left[0,2\pi\right) up to an overall sign. It is a mapping from the nLnL-dimensional torus to the Hilbert space of nn qubits. As a practical measure for the expressibility and trainability of the circuits in simulating quantum states, we consider whether, for a given quantum state |ϕ|\phi\rangle, there exist a point 𝜽\bm{\theta}^{*} in the parameter space, reachable by the gradient-based optimization such that |ψ(𝜽)|ϕ|\psi(\bm{\theta}^{*})\rangle\simeq|\phi\rangle. Such measure of expressibility is motivated by Sim et al. (2019) but tailored for the hybrid algorithms. Its definition follows:

Let us apply the gradient descent to find the minimum distance at the closest point

𝜽=argmin𝜽|ψ(𝜽)|ϕ\displaystyle\bm{\theta}^{*}=\arg\min_{\bm{\theta}}\||\psi({\bm{\theta}})\rangle-|\phi\rangle\| (24)

between the circuit and target states, where \|\cdot\| is the Euclidean norm of a complex vector. The (in)expressibility of the variational circuit is written as the average of the minimum distances over all quantum states:

ε=1Vol(U(2n))𝑑μϕmin𝜽|ψ(𝜽)|ϕ.\displaystyle\varepsilon=\frac{1}{\text{Vol($U(2^{n})$)}}\int d\mu_{\phi}\,\min_{\bm{\theta}}\||\psi({\bm{\theta}})\rangle-|\phi\rangle\|. (25)

Notice that the closeness between two quantum states is usually defined by using the trace distance, which is highly sensitive to small parameter changes. Instead, we have adopted the Euclidean distance to improve the convexity of the optimization landscape, such that the closest point 𝜽\bm{\theta}^{*} can be found as easily as possible. The Euclidean norm defines trivially a convex landscape. Hence, any non-convexity of the optimization landscape is inherited from the variational circuit itself that fixes how to embed the nLnL-dimensional torus into the Hilbert space.

For an actual estimation of ε\varepsilon, we substitute the Haar unitary integral (25) with the sample mean over mm states:

εm=1mi=1mmin𝜽|ψ(𝜽)|ϕi\displaystyle\varepsilon_{m}=\frac{1}{m}\sum_{i=1}^{m}\min_{\bm{\theta}}\||\psi({\bm{\theta}})\rangle-|\phi_{i}\rangle\| (26)

Given a target state |ϕi|\phi_{i}\rangle, we start with an Ansatz state |ψ(𝜽)|\psi({\bm{\theta}})\rangle with nLnL initial parameters 𝜽\bm{\theta}, randomly sampled from the uniform distribution 𝒰(0,2π)nL\mathcal{U}(0,2\pi)^{\otimes nL}. To reach the optimal parameters 𝜽\bm{\theta}^{*} that minimize the distance, we use the Adam optimization algorithm with the hyperparameters (α,β1,β2)=(0.05,0.9,0.999)(\alpha,\beta_{1},\beta_{2})=(0.05,0.9,0.999). Note that the overwhelming majority of Hilbert space is filled with generic states with a high degree of entanglement, so the sample states |ϕi|\phi_{i}\rangle will almost never be non-generic, low-entangling states similar to the Ising ground state.

Refer to caption
Figure 11: The parametric (in)expressibility of the layered circuit Ansatz over m=10m=10 random target samples, εm=10/2n\varepsilon_{m=10}/2^{n}, normalized by the number 2n2^{n} of state components. The narrow shade denotes the fluctuation across m=10m=10 target states.

Figure 11 shows the sample mean εm\varepsilon_{m} over m=10m=10 random target states, divided by the number 2n2^{n} of state components. The narrow shade displays the fluctuation range of the component-averaged distance |ψ(𝜽)|ϕi 2n\||\psi(\bm{\theta^{*}})\rangle-|\phi_{i}\rangle\|\,\cdot\,2^{-n} across different target states indexed by i=1,,10i=1,\ldots,10. We have selected the minimum distance among the first τ500\tau\leq 500 optimization steps, which is sufficient since the optimal parameter 𝜽\bm{\theta}^{*} is empirically always found in an early stage of the optimization.

We find that the (in)expressibility approaches to 0 as the circuit depth LL grows. To achieve the following high precision,

εm=101052n,\displaystyle\varepsilon_{m=10}\leq 10^{-5}\cdot 2^{n}, (27)

the depth LL has to be greater than the threshold values:

n qubits46810Lε layers102456150\displaystyle\begin{array}[]{>{\raggedleft}p{1.65cm}|*{4}{>{\centering}p{0.75cm}}}\hline\cr\text{$n$ qubits }\@add@raggedleft&4\@add@centering&6\@add@centering&8\@add@centering&10\@add@centering\cr\hline\cr\text{$L_{\varepsilon}$ layers }\@add@raggedleft&10\@add@centering&24\@add@centering&56\@add@centering&150\@add@centering\cr\hline\cr\end{array}

Note that the slight ramp-up after the steep fall of εm=10\varepsilon_{m=10} is caused by the fluctuation of 𝜽\bm{\theta} around the optimal point 𝜽\bm{\theta^{*}}, since the fluctuation size gets amplified for bigger LL. It is a common phenomenon in gradient-based optimization, where a suitable reduction of the learning rate can turn it to the stable convergence 𝜽𝜽\bm{\theta}\rightarrow\bm{\theta^{*}}.

In Figure 11, we have found the following overall trend: Deeper circuits are not only a superset of shallow circuits but also behave more effectively in the gradient-based optimization. Though the functional form of the variational circuit in Figure 1 is composed of only two types of gates, it can accurately express and reach most quantum states in Hilbert space in the high-depth regime.

VI Discussion

From the viewpoint of the vanishing gradient problem, adding more parametric gates to the circuit architecture brings both positive and negative effects on trainability. On the negative side, it makes the circuit unitary ensemble closer to quantum 2-design, whose one- and two-point correlators are equivalent to those of Haar random unitaries, so that initial random gradients may decay exponentially with the system size nn McClean et al. (2018). On the positive side, however, it enlarges the dimensionality of the parameter space and maintains the norm of random gradients to be finite.

Until now, we have explored one limiting regime, where the impact of exponentially many parameters dominates over the barren plateau phenomenon. It turns out that the high-depth circuit has been very effective in solving the ground states of the Ising and SYK Hamiltonians, as well as in simulating generic random states. One may note a certain degree of qualitative similarity between the VQE optimization trajectory of high-depth circuits, demonstrated in Section 2, and the lazy learning Du et al. (2019) in over-parameterized neural networks. We speculate it as naturally emerging in any systems involving the high-dimensional parameter space. It would be interesting to know why local extrema on the energy landscape of the high-depth circuits can reliably reach a zero VQE error, as studied in Soudry and Carmon (2016) for the neural networks.

Some interesting phenomena that we found during the optimization of low- and intermediate-depth circuits need to be further understood. For low-depth circuits, it is important to characterize what features of the initial points lead to the observed big difference in their minimized energy levels. We are also curious to know what makes the local minima on the energy landscape of intermediate-depth circuits so homogeneous. More generally, we want to understand how the circuit states evolve along a gradient optimization trajectory on average, in terms of various quantum information measures.

We need to concern two types of errors for the actual use of the variational circuits on the near-term quantum devices Preskill (2018). Firstly, for accurate estimation of the energy gradient, it is necessary to sample the variational wavefunction repeatedly, ideally exponentially many times in the system size nn. Under the assumption that we can collect only a limited number of samples, the energy gradient estimation will be inevitably noisy. Therefore, the expressibility and trainability of the variational circuit needs to be addressed with noisy gradients, or even without direct gradient computation as in Maheswaranathan et al. (2019); Welleck and Cho (2020). We expect the simplicity of the energy landscape in the high-depth regime may bring robustness against the sampling noise. Secondly, and more severely, quantum hardware noise restricts our ability to maintain quantum states when a sequence of layer unitary acts on them. A noise-induced mechanism that causes the vanishing gradient problem has also been studied in Wang et al. (2020). We would like to investigate if the variational circuit approach can be successful under decoherence of quantum states in the near future.

Acknowledgements.
We thank Boris Hanin and Yaron Oz for helpful discussions. Joonho Kim acknowledges the support from the NSF grant PHY-1911298 and the Sivian fund. The work of DR is supported by the National Research Foundation of Korea (NRF) grant NRF-2020R1C1C1007591. Our Python scripts, which are available https://github.com/phyjoon/high-depth-vqe for the numerical experiments written in JAX Bradbury et al. (2018), QuTip Johansson et al. (2013). The experimental data are managed in Weights & Biases Biewald (2020).

References