

Latency-aware adaptive shot allocation for run-time efficient variational quantum algorithms

Kosuke Ito Center for Quantum Information and Quantum Biology, International Advanced Research Institute, Osaka University, Osaka 560-8531, Japan
Abstract

Efficient classical optimizers are crucial in practical implementations of Variational Quantum Algorithms (VQAs). In particular, to make Stochastic Gradient Descent (SGD) resource efficient, adaptive strategies have been proposed to determine the number of measurement shots used to estimate the gradient. However, existing strategies overlook the overhead that occurs in each iteration. In terms of wall-clock runtime, significant circuit-switching and communication latency can slow the optimization process when a large number of iterations is used. In terms of economic cost when using cloud services, per-task prices can become significant. To address these issues, we present an adaptive strategy that balances the number of shots in each iteration so as to maximize the expected gain per unit time or cost by explicitly taking the overhead into account. Our approach can be applied not only to simple SGD but also to its variants, including Adam. Numerical simulations show that our adaptive shot strategy is indeed efficient for Adam, outperforming many existing state-of-the-art adaptive-shot optimizers; however, this is not the case for simple SGD. When the number of shots is taken as the resource, our adaptive-shots Adam with zero overhead also outperforms existing optimizers.

I Introduction

Variational quantum algorithms (VQAs) have attracted much attention as a promising candidate for the first practical use of noisy intermediate-scale quantum (NISQ) devices Preskill (2018); Cerezo et al. (2021). In VQAs, the cost function of a variational problem is computed using relatively shallow quantum circuits, and the optimization of the variational parameters is done on a classical computer. Several paradigms of VQAs have been proposed, such as the variational quantum eigensolver (VQE) Peruzzo et al. (2014); O’Malley et al. (2016); Bauer et al. (2016); Kandala et al. (2017) to obtain an approximation of the ground-state energy of a Hamiltonian and beyond McClean et al. (2017); Santagati et al. (2018); Heya et al. (2018); Colless et al. (2018); McArdle et al. (2019); Jones et al. (2019); Parrish et al. (2019a); Higgott et al. (2019); Nakanishi et al. (2019); Tilly et al. (2020); Ollitrault et al. (2020), the quantum approximate optimization algorithm (QAOA) Farhi et al. (2014); Farhi and Harrow (2016); Otterbach et al. (2017) for combinatorial optimization problems, quantum machine learning Mitarai et al. (2018); Benedetti et al. (2019); Otten et al. (2020), and many others Cerezo et al. (2021).

One of the keys to successful VQAs is an efficient optimization strategy for the classical parameters. To go beyond simple applications of existing non-quantum-specific classical optimizers such as Adam, Nelder-Mead, Powell, etc. Lavrijsen et al. (2020); Kübler et al. (2020); LaRose et al. (2019), several quantum-aware optimization algorithms have been proposed Kübler et al. (2020); Arrasmith et al. (2020); Stokes et al. (2020); Koczor and Benjamin (2022a); Nakanishi et al. (2020); Parrish et al. (2019b); Sweke et al. (2020); Koczor and Benjamin (2022b); van Straaten and Koczor (2021); Gu et al. (2021); Menickelly et al. (2022); Tamiya and Yamasaki (2022); Bittel et al. (2022); Moussa et al. (2022); Boyd and Koczor (2022). In particular, it has been observed that optimizers using gradient information offer improved convergence over those without it Harrow and Napp (2021). A problem of gradient-based optimizers is that the calculation of the gradient is expensive in VQAs, since each component of the gradient must be evaluated individually. In fact, the choice of the number of measurement shots used to estimate the gradient severely affects the performance of gradient-based optimizers Sweke et al. (2020). Adaptive strategies to assign appropriate numbers of shots for efficient optimization have therefore been proposed Kübler et al. (2020); Arrasmith et al. (2020); Gu et al. (2021). These strategies indeed achieve fast convergence with respect to the total number of shots.

However, in practice, the total time consumption is more important than the total number of shots. The latency that occurs with circuit switching and communication is usually much larger than the single-shot acquisition time Karalekas et al. (2020); Menickelly et al. (2022). The situation is similar for the economic cost of using a cloud quantum computing service, where per-task prices are usually much higher than single-shot prices aws . To the best of our knowledge, no optimizer has been proposed in which the value of the latency is explicitly reflected in the optimization strategy. In this paper, we propose an approach for improving the efficiency of shot-adaptive gradient-based optimizers in terms of the total time consumption or the cost of using a cloud service. The approach, referred to as Waiting-time Evaluating Coupled Adaptive Number of Shots (weCANS), determines the number of shots by evaluating the waiting time or per-task cost so as to optimize overall efficiency. Our proposal includes a latency-aware adaptive-shots Adam, which we call we-AdamCANS, as well as simple generalizations of iCANS and gCANS. Our research aims to address the potential improvement in the performance of gradient-based optimizers through explicit consideration of latency values in their design.

This paper is organized as follows. Sec. II provides an overview of the background related to the topic. Sec. III introduces the weCANS algorithm. The section begins with a simple generalization of the iCANS and gCANS algorithms (Sec. III.1), followed by a generalization of weCANS to a wider class of stochastic gradient descent methods, including the popular Adam algorithm (Sec. III.2). In Sec. III.3, we describe how to incorporate random operator sampling techniques for measuring Hamiltonians that consist of multiple noncommuting terms into the weCANS algorithm. Sec. IV presents numerical simulations of the proposed strategies on various tasks, including a compilation task, VQE tasks of quantum chemistry, and a VQE task for the 1-dimensional transverse-field Ising model. Finally, the conclusion of the paper is presented in Sec. V.

II Background

II.1 Variational quantum eigensolver

As a typical VQA, we focus on problems formulated as the minimization of the cost function

\displaystyle f(\bm{\theta}):=\bra{\psi(\bm{\theta})}H\ket{\psi(\bm{\theta})} (1)

given by the expectation value of a Hermitian operator H. In VQE Peruzzo et al. (2014); Bauer et al. (2016); Kandala et al. (2017), we attempt to find a good approximation of the ground-state energy of a quantum mechanical Hamiltonian H by minimizing the cost function (1). It is expected that a quantum computer outperforms classical computers in computing the expectation value ⟨ψ(𝜽)|H|ψ(𝜽)⟩ of a quantum mechanical Hamiltonian of some complex systems. Hence, a quantum advantage might be achieved by using the quantum device only to compute the expectation value, while the parameter vector 𝜽 is optimized classically. That is why the VQE is a candidate for a practical application of NISQ devices. For a successful VQE, it is important to find an efficient classical optimizer as well as an expressive and trainable ansatz |ψ(𝜽)⟩ which is realizable on a quantum computer. Our focus in this paper is to find an efficient optimizer with respect to the total wall-clock running time and the economic cost as practical figures of merit. In particular, we focus on gradient descent optimizers.

One possible way to approximately estimate the partial derivative is the numerical differentiation:

\displaystyle\frac{\partial f}{\partial\theta_{i}}(\bm{\theta})\approx\frac{f(\bm{\theta}+\delta\bm{e}_{i})-f(\bm{\theta})}{\delta}, (2)

where 𝒆_i is the unit vector in the i-th direction and δ is a small number. However, since the finite number of measurements used to estimate the cost function values introduces statistical error, this estimate is poor Harrow and Napp (2021). Instead, we can use an analytic expression for the gradient of the cost function in VQAs. If the ansatz is given by parametric gates of the form U_i(θ_i) = exp[−iθ_iA_i/2] with A_i² = I together with fixed gates, we can use the following so-called parameter shift rule Mitarai et al. (2018); Schuld et al. (2019):

\displaystyle\frac{\partial f}{\partial\theta_{i}}(\bm{\theta})=\frac{1}{2}\left(f(\bm{\theta}+\frac{\pi}{2}\bm{e}_{i})-f(\bm{\theta}-\frac{\pi}{2}\bm{e}_{i})\right). (3)

For example, the hardware-efficient ansatz fits this form. More generally, the partial derivatives may be calculated by general parameter shift rules with some number R_i of parameter shifts Wierichs et al. (2022):

\displaystyle\frac{\partial f}{\partial\theta_{i}}(\bm{\theta})=\sum_{j=1}^{R_{i}}a_{j}f(\bm{\theta}+x_{i,j}\bm{e}_{i}). (4)

Hence, the gradient can be computed by evaluating the cost function at shifted parameters and taking their linear combination. We remark that the statistical error in the estimation of the cost function still remains due to the finiteness of the number of quantum measurements even if we use the analytic form.
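To make the gradient evaluation concrete, the following minimal Python sketch estimates each partial derivative with the two-point rule of Eq. (3). The routine estimate_cost is a placeholder (not defined in this paper) that is assumed to return a sample-mean estimate of the cost function from a given number of measurement shots.

import numpy as np

def parameter_shift_gradient(estimate_cost, theta, shots_per_term):
    """Estimate the gradient of f via the two-point parameter-shift rule, Eq. (3).

    estimate_cost(theta, shots): assumed to return an unbiased sample-mean
    estimate of the cost function from `shots` measurement shots.
    """
    d = len(theta)
    grad = np.zeros(d)
    for i in range(d):
        shift = np.zeros(d)
        shift[i] = np.pi / 2
        f_plus = estimate_cost(theta + shift, shots_per_term)
        f_minus = estimate_cost(theta - shift, shots_per_term)
        grad[i] = 0.5 * (f_plus - f_minus)  # statistical error still remains
    return grad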

II.2 Operator sampling

The target Hamiltonian HH often cannot be measured directly in a quantum computer (e. g. Hamiltonians in quantum chemistry). In such a case, we decompose HH into directly measurable operators such as Pauli operators PiP_{i} as

\displaystyle H=\sum_{i=1}^{N}c_{i}P_{i}. (5)

In general, we cannot measure all the terms simultaneously because this decomposition includes noncommuting Pauli operators. To efficiently measure the observables P_i, we need to group simultaneously measurable observables. Finding an efficient strategy to choose appropriate measurements is an important problem which has been intensively studied O’Malley et al. (2016); Kandala et al. (2017); Hamamura and Imamichi (2020); Crawford et al. (2021); Hempel et al. (2018); Izmaylov et al. (2020); Vallury et al. (2020); Verteletskyi et al. (2020); Zhao et al. (2020); Wu et al. (2023). We do not go into details regarding the grouping problem in this paper. Once the groups G_j of commuting operators are fixed, another issue is how to allocate the number of shots to each group under a given shot budget s_tot. Although this is also a nontrivial problem, we take the weighted random sampling (WRS) strategy Arrasmith et al. (2020) in this paper. In WRS, the j-th group is randomly selected with probability Σ_{i∈G_j}|c_i| / Σ_{k=1}^{N}|c_k| in each measurement. We can implement this procedure by sampling the number s^{(j)} of shots for the j-th group from the multinomial distribution for s_tot independent trials with probability distribution p_j = Σ_{i∈G_j}|c_i| / Σ_{k=1}^{N}|c_k|. If we allocate a deterministic number of shots, some groups may never be measured for small s_tot, which results in a bias in the estimation of the expectation value of H. With WRS, we obtain an unbiased estimator of the expectation value of H for any small number of shots s_tot.
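As an illustration, a minimal Python sketch of this WRS allocation is given below; coeffs_by_group is a hypothetical list holding the coefficients c_i of the Pauli terms within each simultaneously measurable group.

import numpy as np

def wrs_shot_allocation(coeffs_by_group, s_tot, rng=None):
    """Allocate s_tot shots to the measurement groups by weighted random sampling.

    Each shot is assigned to group j with probability
    p_j = sum_{i in G_j} |c_i| / sum_k |c_k|, so the resulting estimator of <H>
    is unbiased even for a small total shot budget s_tot.
    """
    rng = rng or np.random.default_rng()
    weights = np.array([np.sum(np.abs(c)) for c in coeffs_by_group])
    p = weights / weights.sum()
    # per-group shot numbers sampled from the multinomial distribution
    return rng.multinomial(s_tot, p)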

II.3 Stochastic gradient descent

Gradient descent is a very standard approach to optimization problems. In the gradient descent, we minimize the cost function f(𝜽)f(\bm{\theta}) by iteratively updating the parameter vector 𝜽d\bm{\theta}\in\mathbb{R}^{d} as

\displaystyle\bm{\theta}^{(k+1)}=\bm{\theta}^{(k)}-\alpha\bm{\nabla}f(\bm{\theta}^{(k)}), (6)

where α is the learning rate, a hyperparameter. In some problems such as machine learning tasks, ∇f(𝜽^(k)) in (6) can be replaced with an estimator 𝒈(𝜽^(k)) of it, which is a random variable. Gradient descent using a random estimator of the gradient is called stochastic gradient descent (SGD). When the estimator is unbiased, rigorous convergence guarantees are available under appropriate conditions Shalev-Shwartz and Ben-David (2014); Karimi et al. (2016); Sweke et al. (2020). Moreover, the stochasticity can even help to avoid local minima and saddle points Bottou (1991). Thus, SGD is often efficient not only because it saves the resources needed to compute the gradient.

In VQAs, we have no access to the full gradient ∇f due to the essentially random nature of quantum measurement. Hence, gradient descent in VQAs is inevitably stochastic to some extent, depending on the number of measurements used to estimate the gradient. Although using a large number of measurements results in an accurate estimate of the full gradient, such accuracy is not necessarily beneficial, as the stochasticity can be advantageous in SGD. In fact, it has been reported that using modest numbers of measurements for gradient evaluations effectively results in an implementation of SGD Harrow and Napp (2021), and is actually beneficial in the optimization Sweke et al. (2020); Liu et al. (2022). However, the accuracy of the gradient estimates also restricts the final precision of the optimization. Therefore, the number of measurements used to evaluate the gradients during the optimization is an important hyperparameter which determines the effectiveness of the optimizer. While the numbers of measurements as well as the learning rate are usually tuned heuristically, some papers propose algorithms to adaptively set the numbers of measurements in SGD, as we review later in more detail (Sec. II.5) Kübler et al. (2020); Arrasmith et al. (2020); Gu et al. (2021). The adaptive allocation of appropriate numbers of measurement shots for practically efficient SGD is the main focus of our study.

For the selection of the learning rate α\alpha, we can make use of the Lipschitz continuity of the gradient of the cost function of the VQE:

\displaystyle\|\bm{\nabla}f(\bm{\theta}^{(k+1)})-\bm{\nabla}f(\bm{\theta}^{(k)})\|\leq L\|\bm{\theta}^{(k+1)}-\bm{\theta}^{(k)}\|, (7)

where LL is called a Lipschitz constant. According to Eq. (7), we can make the change in the gradient per single descent step small by taking small enough α\alpha in accordance with the size of LL. In this way, it is expected that the course of the optimization well follows the gradient. In fact, for the exact gradient descent, the convergence is guaranteed if α<2/L\alpha<2/L Balles et al. (2017). As seen later, the Lipschitz constant is also used to set the number of shots in our algorithm in a similar manner to Kübler et al. (2020). For the cost function of the VQE, we have access to a Lipschitz constant. Especially, for the case with the parameter shift rule Eq. (3), we see that any order of the partial derivatives of ff is upper bounded by the matrix norm H\|H\| of HH by recursively applying Eq. (3). Then, if the parameter vector is dd-dimensional, we have the following upper bound of the Lipschitz constant Sweke et al. (2020); Gu et al. (2021)

\displaystyle L\leq d\|H\|. (8)

When H is decomposed as (5), the norm is bounded as ‖H‖ ≤ Σ_{i=1}^{N}|c_i|. We remark that a smaller estimate of L generally results in better performance of optimizers that use L. If we need a tighter estimate of L than (8), we can search for a smaller L in a manner similar to hyperparameter tuning.
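For instance, a one-line sketch of this readily computable bound (assuming the Pauli coefficients c_i are available as a plain list) is:

def lipschitz_bound(pauli_coeffs, num_parameters):
    """Upper bound of Eq. (8): L <= d * ||H|| <= d * sum_i |c_i|."""
    return num_parameters * sum(abs(c) for c in pauli_coeffs)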

II.4 Adam

Adaptive moment estimation (Adam) Kingma and Ba (2014) is a variant of SGD which adaptively modifies the descent direction and the step size using estimates of the first and second moments of the gradient. In Adam, we use the exponential moving averages

\displaystyle\bm{m}_{k}=\beta_{1}\bm{m}_{k-1}+(1-\beta_{1})\bm{g}(\bm{\theta}^{(k)}) (9)
\displaystyle\bm{v}_{k}=\beta_{2}\bm{v}_{k-1}+(1-\beta_{2})\bm{g}(\bm{\theta}^{(k)})^{2} (10)

as the estimates of the moments, where the multiplication of vectors is taken component-wise, and β₁, β₂ are hyperparameters which determine how quickly changes in the gradient are reflected. Then, we update the parameter according to

\displaystyle\bm{\theta}^{(k+1)}=\bm{\theta}^{(k)}-\alpha_{k}\frac{\bm{m}_{k}}{\sqrt{\bm{v}_{k}}+\epsilon}, (11)

where α_k is the step size, which depends on k in general, and ϵ is a small constant for numerical stability. The bias-corrected versions of 𝒎_k and 𝒗_k and a constant α_k are used in the original proposal of Adam, which is equivalent to using α_k = α√(1−β₂^k)/(1−β₁^k) with a constant α if we neglect ϵ. A possible advantage of Adam is that the step size for each component is adjusted component-wise, which gives additional flexibility. In fact, Adam is known to work well for many applications. In this paper, we investigate how to exploit Adam's power in VQAs by adaptively determining appropriate numbers of shots in Sec. III.2.
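A minimal sketch of one Adam update, Eqs. (9)-(11), with the bias-corrected step size described above, reads as follows; the gradient estimator g is assumed to be supplied from outside, and shot adaptivity is added only later, in Sec. III.2.

import numpy as np

def adam_step(theta, g, m, v, k, alpha=0.01, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update, Eqs. (9)-(11); k is the 1-based iteration count."""
    m = beta1 * m + (1 - beta1) * g          # first-moment moving average, Eq. (9)
    v = beta2 * v + (1 - beta2) * g**2       # second-moment moving average, Eq. (10)
    alpha_k = alpha * np.sqrt(1 - beta2**k) / (1 - beta1**k)  # bias correction
    theta = theta - alpha_k * m / (np.sqrt(v) + eps)          # Eq. (11)
    return theta, m, v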

II.5 Shot adaptive optimizers

As explained in Sec. II.3, the number of shots used to estimate the gradient is an important hyperparameter of SGD for VQE. Using too large a number of shots not only makes the optimization too expensive, but also loses the benefit of stochasticity. On the other hand, too small a number of shots makes accurate optimization fail due to statistical errors. In general, the required number of shots tends to be small in the early steps of SGD, and it increases as the optimization progresses. Thus, it is hard for SGD to be efficient if a fixed number of shots is used throughout the optimization process. One option to circumvent this problem is to dynamically increase the number s of shots following a fixed schedule, e.g. s = ⌊s₀r^k⌋ with an initial number s₀ and a hyperparameter r. However, very careful hyperparameter tuning is needed for such a strategy to be efficient Gu et al. (2021). Hence, an adaptive method that determines appropriate numbers of shots step by step during SGD optimization is desired.

Balles et al. Balles et al. (2017) introduced Coupled Adaptive Batch Size (CABS) method for choosing the sample size in the context of optimizing neural networks by SGD. Inspired by CABS, Kübler et al. Kübler et al. (2020) proposed individual-Coupled Adaptive Number of Shots (iCANS) for choosing the number of shots for SGD optimizations in VQAs. Both algorithms assign the sample size so that the expected gain per single sample is maximized. To do so, the expectation value of a lower bound of the gain is considered based on the following inequality:

\displaystyle f(\bm{\theta}^{(k)})-f(\bm{\theta}^{(k+1)})\geq\alpha\bm{\nabla}f(\bm{\theta}^{(k)})^{\mathrm{T}}\bm{g}(\bm{\theta}^{(k)})-\frac{L\alpha^{2}}{2}\|\bm{g}(\bm{\theta}^{(k)})\|^{2}=:\mathcal{G}, (12)

where 𝒈(𝜽(k))\bm{g}(\bm{\theta}^{(k)}) is an estimator of the gradient. For VQAs, when each ii-th component of the gradient is estimated by sis_{i} shots, the expectation value of this lower bound 𝒢(𝒔)\mathcal{G}(\bm{s}) reads

\displaystyle\mathbb{E}[\mathcal{G}(\bm{s})]=\left(\alpha-\frac{L\alpha^{2}}{2}\right)\|\bm{\nabla}f(\bm{\theta}^{(k)})\|^{2}-\frac{L\alpha^{2}}{2}\sum_{i}\frac{\sigma_{i}^{2}}{s_{i}}, (13)

where 𝒔\bm{s} denotes (s1,s2,,sd)(s_{1},s_{2},\cdots,s_{d}), and σi\sigma_{i} is the standard deviation of ii-th component gi(𝜽(k))g_{i}(\bm{\theta}^{(k)}) of the random estimator of the gradient. We remark that to guarantee that 𝔼[𝒢]\mathbb{E}[\mathcal{G}] is positive, we need Balles et al. (2017)

\displaystyle\alpha\leq\frac{2\|\bm{\nabla}f(\bm{\theta}^{(k)})\|^{2}}{L(\|\bm{\nabla}f(\bm{\theta}^{(k)})\|^{2}+\sum_{i}\frac{\sigma_{i}^{2}}{s_{i}})}. (14)

The gain 𝒢(𝒔)\mathcal{G}(\bm{s}) can be divided into a sum of the individual contributions 𝒢i(si)\mathcal{G}_{i}(s_{i}) from each ii-th component given as

\displaystyle\mathbb{E}[\mathcal{G}_{i}(s_{i})]=\left(\alpha-\frac{L\alpha^{2}}{2}\right)\partial_{i}f(\bm{\theta}^{(k)})^{2}-\frac{L\alpha^{2}}{2}\frac{\sigma_{i}^{2}}{s_{i}}, (15)

where if:=fθi\partial_{i}f:=\frac{\partial f}{\partial\theta_{i}}. In iCANS, we assign a different number sis_{i} of shots for estimating each component of the gradient by individually maximizing γi:=𝔼[𝒢i(si)]/si\gamma_{i}:=\mathbb{E}[\mathcal{G}_{i}(s_{i})]/s_{i}. Then, the maximization of γi\gamma_{i} yields the suggested number of shots Kübler et al. (2020)

\displaystyle s_{i}=\frac{2L\alpha}{2-L\alpha}\frac{\sigma_{i}^{2}}{\partial_{i}f(\bm{\theta}^{(k)})^{2}}. (16)

We have to note that this formalism is only valid when α < 2/L. The suggested number of shots implies that a larger number of shots should be assigned as the statistical noise σ_i² gets larger compared to the signal ∂_i f(𝜽^(k)). Hence, s_i tends to become large as the optimization progresses. Moreover, a restriction that the number of shots should not exceed that of the component with the highest expected gain per shot is introduced in iCANS for further shot frugality. That is, we impose s_i ≤ s_max with the number s_max given by

\displaystyle i_{\max}=\arg\max_{i}(\gamma_{i}) (17)
\displaystyle s_{\max}=s_{i_{\max}}. (18)

Fortunately, a Lipschitz constant LL to calculate sis_{i} is accessible in VQAs from Eq. (8). However, the quantities σi2\sigma_{i}^{2} and if(𝜽(k))\partial_{i}f(\bm{\theta}^{(k)}) are not accessible because sis_{i} must be calculated before we estimate the gradient. Then, in order to implement iCANS, we replace these quantities with accessible quantities calculated from the previous estimations since it is expected that the change in these quantities along the course of SGD is not so large. For increased stability, it is suggested to use bias-corrected exponential moving averages of the previous iterations in place of σi2\sigma_{i}^{2} and if(𝜽(k))\partial_{i}f(\bm{\theta}^{(k)}). We also employ this strategy in our algorithm.
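A minimal sketch of the iCANS allocation of Eq. (16), including the cap of Eqs. (17)-(18), is shown below; grad_est and var_est stand for the bias-corrected moving averages that replace ∂_i f(𝜽^(k)) and σ_i² in practice, and the small guard constants against division by zero are our own choices.

import numpy as np

def icans_shots(grad_est, var_est, lipschitz, alpha, s_min=2):
    """Shot allocation of iCANS, Eq. (16), capped by Eqs. (17)-(18).

    Requires alpha < 2 / lipschitz so that the expected gain can be positive.
    """
    grad_sq = np.maximum(grad_est**2, 1e-30)
    s = 2 * lipschitz * alpha / (2 - lipschitz * alpha) * var_est / grad_sq
    s = np.maximum(np.ceil(s), s_min)
    # expected gain per shot of each component: Eq. (15) divided by s_i
    gain = ((alpha - lipschitz * alpha**2 / 2) * grad_sq
            - lipschitz * alpha**2 / 2 * var_est / s) / s
    s_max = s[np.argmax(gain)]               # Eqs. (17)-(18)
    return np.minimum(s, s_max).astype(int)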

Gu et al. Gu et al. (2021) proposed another algorithm global-CANS (gCANS) with a different rule to allocate the number of shots as a variant of CANS. In gCANS, the rate of the expected gain is maximized globally, whereas each individual component is maximized in iCANS. In other words, the number of shots sis_{i} is determined so that

\displaystyle\gamma=\frac{\mathbb{E}[\mathcal{G}(\bm{s})]}{\sum_{i=1}^{d}s_{i}} (19)

is maximized. Then, the suggested rule is

\displaystyle s_{i}=\frac{2L\alpha}{2-L\alpha}\frac{\sigma_{i}\sum_{j=1}^{d}\sigma_{j}}{\|\bm{\nabla}f(\bm{\theta}^{(k)})\|^{2}}. (20)

Inaccessible quantities are also replaced by the exponential moving averages in gCANS. Both iCANS and gCANS have shown their efficiency in terms of the total number of shots used as a figure of merit Kübler et al. (2020); Gu et al. (2021), and according to Ref. Gu et al. (2021), gCANS performs better than iCANS. However, in practice, the total time consumption is more important than the total number of shots. The latency that occurs with circuit switching and communication is usually much larger than the single-shot acquisition time Karalekas et al. (2020); Menickelly et al. (2022). As for the economic cost of using a cloud quantum computing service, the task cost is usually much more expensive than a single-shot cost aws . Although Ref. Gu et al. (2021) also showed that gCANS has high frugality in terms of the task cost under Amazon Braket pricing and the number of iterations, it does not explicitly consider the performance with respect to the total time or the total economic cost taking into account the latency or task cost.

On the other hand, Menickelly et al. Menickelly et al. (2022) investigated the performance of optimizers of VQAs in terms of the total time including the latency. They proposed a new optimizer, SHOt-Adaptive Line Search (SHOALS). SHOALS makes use of stochastic line search based on Paquette and Scheinberg (2020); Berahas et al. (2021); Jin et al. (2021) for better iteration complexity. In SHOALS, the number of shots used for estimating the gradient is adaptively determined so that the error in the estimates is bounded in such a way that it does not interfere with the gradient descent dynamics. The number of shots used for the cost function evaluations in the line search is also adaptively assigned in a similar manner. The suggested number of shots is also basically proportional to the ratio of the noise to the signal, though the gain is not taken into account. They have shown that SHOALS actually achieves a better iteration complexity scaling O(ϵ^{-2}) compared to that of simple SGD, O(ϵ^{-4}), where ϵ is the precision to be achieved in expectation. In that sense, they suggest SHOALS as a high-performing optimizer in terms of the total time in the presence of large latency. However, SHOALS does not take into account the explicit value of the latency. To the best of our knowledge, no optimizer has been proposed in which the value of the latency is explicitly reflected in the optimization strategy. Therefore, a question arises as to what extent the performance of an optimizer in terms of the total time consumption (or the economic cost) can be improved by designing it with an explicit reflection of the latency value.

III Waiting-time evaluating coupled adaptive number of shots

In this section, we propose adaptive shot allocation strategies taking into account the latency or per-task cost as the overhead. We follow Ref. Menickelly et al. (2022) for the treatment of the latency. We denote the single-shot acquisition time by c₁, the circuit-switching latency by c₂, and the communication latency by c₃. The single-shot acquisition time includes the time for the initialization, running a given circuit, and a single measurement. The values of c₁ are, for example, around 100 μs in superconducting qubits Karalekas et al. (2020) and around 200 ms in neutral atom systems Briegel et al. (2000). The circuit-switching latency includes the time for compiling a circuit and loading it onto the controller of the quantum device. For superconducting qubits, the compilation task takes 200 ms, and interaction with the arbitrary waveform generator takes around 25 ms Karalekas et al. (2020); Menickelly et al. (2022). The communication latency occurs in the communication between the classical computer and the quantum device side. Its value depends on the environment in which the quantum computer is used, e.g. very small when we directly use a quantum device in the same laboratory, or many seconds when we communicate with a quantum device through the internet Sung et al. (2020); Menickelly et al. (2022). The per-task overhead in the economic cost of using a cloud quantum computing service can also be treated in the same way, where c₁ is the per-shot price, c₂ is the per-task price, and c₃ = 0. For example, c₁ = $3.5×10⁻⁴ and c₂ = $0.3 for the pricing of the Rigetti's devices in Amazon Braket aws .

III.1 Simple latency-aware generalization of iCANS and gCANS

The simplest idea to take into account the latency is to modify the shot allocation rules of iCANS and gCANS in such a way that we maximize the gain per unit time (or economic cost) including the latency instead of the gain per single shot. In other words, to modify iCANS to take into account the latency overhead, we assign the number of shots sis_{i} to estimate ii-th component of the gradient such that

\displaystyle r_{\mathrm{i}}(s_{i})=\frac{\mathbb{E}[\mathcal{G}_{i}(s_{i})]}{c_{1}s_{i}+c_{2}m_{i}+c_{3}/d} (21)

is maximized, where mim_{i} is the number of circuits used to estimate ii-th component of the gradient. For the case where mim_{i} is a random variable as is the case with WRS, we need to replace mim_{i} with some available estimated quantity. We deal with this problem in detail in Sec. III.3. Here, we assume that a single round of the communication is enough to estimate all the components of the gradient. This maximization yields the optimal number of shots as

\displaystyle s_{i}=\frac{L\alpha}{2-L\alpha}\left(1+\sqrt{1+R_{i}\frac{2-L\alpha}{L\alpha}\frac{\partial_{i}f(\bm{\theta}^{(k)})^{2}}{\sigma_{i}^{2}}}\right)\frac{\sigma_{i}^{2}}{\partial_{i}f(\bm{\theta}^{(k)})^{2}}, (22)

where R_i := (c₂m_i + c₃/d)/c₁ is the ratio of the overhead per evaluation of the i-th component of the gradient to the single-shot acquisition time. All the steps other than the choice of the number of shots are the same as in the original iCANS. In particular, the inaccessible quantities σ_i² and ∂_i f(𝜽^(k)) are replaced with the bias-corrected exponential moving averages of the previous iterations. We call this shot allocation strategy Waiting-time Evaluating Coupled Adaptive Number of Shots (weCANS). To distinguish the iCANS-based version from the gCANS-based version (introduced below), we call them weCANS(i) and weCANS(g), respectively. In this way, we take the explicit value of the latency into account. In particular, the term proportional to the ratio R_i represents the increase in the number of shots needed to deal with the latency. Indeed, as the latency gets larger, taking a large number of iterations becomes more expensive. Hence, it can be more efficient to take a larger number of shots in a single iteration as the latency gets larger compared to c₁. The shot number (22) concretely suggests how large a number of shots is appropriate based on the estimate of the gain. For a large latency ratio R_i, the increase in the number of shots reflecting the latency roughly scales as √(R_i Lα/(2−Lα)) σ_i/∂_i f(𝜽^(k)), which implies that the effect of the latency tends to be reflected more in later optimization steps, as the ratio of the statistical noise to the signal generally gets larger.
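A minimal sketch of the weCANS(i) rule of Eq. (22) is given below; as above, grad_est and var_est denote the moving-average estimates, m[i] is the number of circuits used for the i-th derivative, and the guard constants are our own additions.

import numpy as np

def wecans_i_shots(grad_est, var_est, lipschitz, alpha, c1, c2, c3, m):
    """weCANS(i) allocation, Eq. (22), with R_i = (c2*m_i + c3/d)/c1."""
    d = len(grad_est)
    la = lipschitz * alpha
    grad_sq = np.maximum(grad_est**2, 1e-30)
    var = np.maximum(var_est, 1e-30)
    R = (c2 * np.asarray(m) + c3 / d) / c1
    s = (la / (2 - la)
         * (1 + np.sqrt(1 + R * (2 - la) / la * grad_sq / var))
         * var / grad_sq)
    return np.maximum(np.ceil(s), 1).astype(int)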

A waiting time-evaluating version of gCANS can also be obtained by maximizing

\displaystyle r_{\mathrm{g}}(\bm{s})=\frac{\mathbb{E}[\mathcal{G}(\bm{s})]}{c_{1}\sum_{i=1}^{d}s_{i}+c_{2}\sum_{i=1}^{d}m_{i}+c_{3}}. (23)

This yields the weCANS(g) shot allocation rule

\displaystyle s_{i}=\frac{\sigma_{i}R}{-\sum_{j=1}^{d}\sigma_{j}+\sqrt{(\sum_{j=1}^{d}\sigma_{j})^{2}+R\frac{2-L\alpha}{L\alpha}\|\bm{\nabla}f(\bm{\theta}^{(k)})\|^{2}}}, (24)

where R := (c₂ Σ_{i=1}^{d} m_i + c₃)/c₁ is the ratio of the overhead per iteration to the single-shot acquisition time. For weCANS(g), the increase in the number of shots reflecting the latency scales similarly to weCANS(i), as ∼√(RLα/(2−Lα)) σ_i/‖∇f(𝜽^(k))‖. Obviously, the leading order of the gain per single-shot acquisition time, c₁r_g(𝒔), is at most O(1/R) for large R. Without the latency-aware shot allocation rule, i.e., in gCANS, the leading-order term of the gain per single-shot acquisition time is (α−Lα²/2)‖∇f(𝜽^(k))‖²/(2R), whereas using the weCANS number of shots yields the leading-order term (α−Lα²/2)‖∇f(𝜽^(k))‖²/R, which is optimal. A similar estimate also holds for iCANS. Hence, roughly speaking, a twofold acceleration is expected in the large-R limit. We remark that the subleading-order terms are also maximized in weCANS, although the same leading-order scaling can be obtained by any s_i ∝ R^a (0 < a < 1). Nevertheless, it turns out that such an improvement by weCANS(i) and weCANS(g) was not observed in our numerical simulations, as shown in Sec. IV. That might be due to the limitation of estimating the gain by the expectation value of its lower bound, even without considering the statistical error. On the other hand, incorporating the latency-aware adaptive shot strategy into Adam, as introduced in the next subsection, actually works well, as shown in Sec. IV.
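Analogously, a sketch of the weCANS(g) rule of Eq. (24) reads as follows (assuming R > 0; for R = 0 one falls back to the gCANS rule of Eq. (20)).

import numpy as np

def wecans_g_shots(grad_est, var_est, lipschitz, alpha, c1, c2, c3, m):
    """weCANS(g) allocation, Eq. (24), with R = (c2*sum_i m_i + c3)/c1 > 0."""
    la = lipschitz * alpha
    sigma = np.sqrt(np.maximum(var_est, 1e-30))
    sigma_sum = sigma.sum()
    grad_norm_sq = np.maximum(np.sum(grad_est**2), 1e-30)
    R = (c2 * np.sum(m) + c3) / c1
    denom = -sigma_sum + np.sqrt(sigma_sum**2 + R * (2 - la) / la * grad_norm_sq)
    s = sigma * R / denom
    return np.maximum(np.ceil(s), 1).astype(int)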

III.2 Waiting time-evaluating CANS for a wider class of SGD including Adam

In this section, we propose how to incorporate the weCANS adaptive shot allocation strategy into more general stochastic gradient descent optimizers, including Adam. We consider the class of optimizers in which the parameter is updated according to

\displaystyle\bm{\theta}^{(k+1)}=\bm{\theta}^{(k)}-\alpha_{k}\bm{X}_{k}(\bm{g}(\bm{\theta}^{(k)})), (25)

where 𝑿k(𝒈(𝜽(k)))\bm{X}_{k}(\bm{g}(\bm{\theta}^{(k)})) is some analytic function of the gradient estimator. In Adam, this function is given as

\displaystyle\bm{X}_{k}(\bm{g}(\bm{\theta}^{(k)}))=\bm{m}_{k}(\bm{g}(\bm{\theta}^{(k)}))/\sqrt{\bm{v}_{k}(\bm{g}(\bm{\theta}^{(k)}))}, (26)

where 𝒎_k(𝒈(𝜽^(k))) and 𝒗_k(𝒈(𝜽^(k))) are given by Eqs. (9) and (10) respectively, and we neglect ϵ in Eq. (11). Here, we only explicitly consider the dependency of 𝑿_k on the gradient estimator of the current step. In particular, 𝑿_k can depend on quantities calculated in the previous iterations up to k−1, as in Adam. Any dependency of 𝑿_k other than on 𝒈(𝜽^(k)) is absorbed into the k-dependence of the form of the function. Under this parameter update rule, we have the following lower bound on the absolute value of the gain

\displaystyle|f(\bm{\theta}^{(k)})-f(\bm{\theta}^{(k+1)})|\geq|\alpha_{k}\bm{\nabla}f(\bm{\theta}^{(k)})^{\mathrm{T}}\bm{X}_{k}(\bm{g}(\bm{\theta}^{(k)})[\bm{s}])|-\frac{L\alpha_{k}^{2}}{2}\|\bm{X}_{k}(\bm{g}(\bm{\theta}^{(k)})[\bm{s}])\|^{2}=:\varphi(\bm{g}(\bm{\theta}^{(k)})[\bm{s}]) (27)

in a similar manner to CANS, where, for clarity, we explicitly denote the dependency on the numbers of shots 𝒔\bm{s} of the estimator of the gradient

\displaystyle g_{i}(\bm{\theta}^{(k)})[\bm{s}]=\frac{1}{s_{i}}\sum_{k=1}^{s_{i}}\gamma_{i,k} (28)

with single-shot estimates γ_{i,k}. Hence, if we have an estimate of the lower bound φ(𝒈(𝜽^(k))[𝒔]), we can determine an appropriate number of shots to maximize it per unit time. Here, we consider the absolute value of the gain because the expected update direction given by 𝑿_k(𝒈(𝜽^(k))) can be opposite to the direction of the gradient ∇f(𝜽^(k)). We assume that as large an update as possible is beneficial for a given update rule even in such a case.

We remark that the expectation is taken with respect to each single-shot realization γi,k\gamma_{i,k} as a random variable. Hence, we can not obtain an analytic expression for the expectation value of this lower bound in general. Then, we approximate it via the Taylor expansion of φ\varphi around the expectation value 𝔼[𝒈(𝜽(k))[𝒔]]=f(𝜽(k))\mathbb{E}[\bm{g}(\bm{\theta}^{(k)})[\bm{s}]]=\bm{\nabla}f(\bm{\theta}^{(k)}) as

\displaystyle\mathbb{E}[\varphi(\bm{g}(\bm{\theta}^{(k)})[\bm{s}])]\approx\varphi(\bm{\nabla}f(\bm{\theta}^{(k)}))+\sum_{i=1}^{d}\frac{\sigma_{i}^{2}}{2s_{i}}\frac{\partial^{2}\varphi}{\partial g_{i}^{2}}(\bm{\nabla}f(\bm{\theta}^{(k)}))=A-\sum_{i=1}^{d}\frac{B_{i}}{s_{i}}, (29)

where

\displaystyle A:=\varphi(\bm{\nabla}f(\bm{\theta}^{(k)})) (30)
\displaystyle B_{i}:=-\frac{\sigma_{i}^{2}}{2}\frac{\partial^{2}\varphi}{\partial g_{i}^{2}}(\bm{\nabla}f(\bm{\theta}^{(k)})). (31)

Eq. (29) has the same form as the objective function for gCANS, except that B_i can be negative. If B_i is negative, a value of s_i as small as possible gives the largest value of our objective function

\displaystyle r_{\mathrm{X}}(\bm{s})=\frac{\mathbb{E}[\varphi(\bm{g}(\bm{\theta}^{(k)})[\bm{s}])]}{c_{1}\sum_{i=1}^{d}s_{i}+c_{2}\sum_{i=1}^{d}m_{i}+c_{3}} (32)

to be maximized based on the expression Eq. (29). However, choosing s_i = 1 is not appropriate in this case because Eq. (32) is based on the Taylor expansion, which is not accurate for small values of s_i. Hence, B_i ≤ 0 implies that the optimal s_i lies in a region where the Taylor expansion is not accurate. We circumvent this problem by setting a floor value s_min, and we substitute s_min for s_i whenever B_i ≤ 0. In other words, we use 𝒔 given by

\displaystyle\bm{s}=\arg\max\{r_{\mathrm{X}}(\bm{s})\,|\,s_{i}\geq s_{\min}\}, (33)

which yields the shot allocation rule

\displaystyle s_{i}=\begin{cases}\frac{(R+|J_{-}|s_{\min})\sqrt{B_{i}}}{\sqrt{b_{+}^{2}+(R+|J_{-}|s_{\min})(A-\sum_{j\in J_{-}}\frac{B_{j}}{s_{\min}})}-b_{+}}&\text{if $B_{i}>0$,}\\ s_{\min}&\text{if $B_{i}\leq 0$,}\end{cases} (34)

where R=(c2i=1dmi+c3)/c1R=(c_{2}\sum_{i=1}^{d}m_{i}+c_{3})/c_{1} is the ratio of the overhead per iteration to the single-shot acquisition time and

\displaystyle b_{+}:=\sum_{j\in J_{+}}\sqrt{B_{j}} (35)

with J+:={j|Bj0}J_{+}:=\{j|B_{j}\geq 0\} and J:={j|Bj<0}J_{-}:=\{j|B_{j}<0\}. All the inaccessible quantities are also replaced with the exponential moving average calculated from the previous iterations in the same way as iCANS and gCANS. For Eq. (34) to give valid numbers of shots, A>0A>0 is required, which is equivalent to imposing

\displaystyle\alpha_{k}<\frac{2|\bm{\nabla}f(\bm{\theta}^{(k)})^{\mathrm{T}}\bm{X}(\bm{\nabla}f(\bm{\theta}^{(k)}))|}{L\|\bm{X}(\bm{\nabla}f(\bm{\theta}^{(k)}))\|^{2}} (36)

in accordance with the condition for the lower bound (27) to be positive. We can forcibly impose Eq. (36) by clipping α_k at r times this upper limit with a fixed constant 0 < r < 1. However, in practice, clipping α_k may delay the optimization. Instead, we can use the heuristic strategy of using the original α_k for the parameter update while the clipped value is used to calculate the number of shots. We take this strategy in the numerical simulations in Sec. IV.
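A minimal sketch of the resulting ShotsEvaluate routine used in Algorithm 1 below is given here. The lower-bound function φ of Eq. (27) is passed in as a callable (for we-AdamCANS, 𝑿_k is given by Eq. (26)), and the diagonal second derivatives in Eq. (31) are approximated by central finite differences purely for illustration; analytic expressions can be used instead. The step h and the guards are our own choices.

import numpy as np

def shots_evaluate(phi, chi, var_est, R, s_min, h=1e-4):
    """Shot allocation of Eq. (34) for a generic update rule.

    phi(g): lower bound of the gain, Eq. (27), as a function of the gradient
      vector g; the running average chi is plugged in for grad f.
    var_est: running estimates of the single-shot variances sigma_i^2.
    Assumes A = phi(chi) > 0, cf. Eq. (36).
    """
    d = len(chi)
    A = phi(chi)
    B = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        d2phi = (phi(chi + e) - 2.0 * A + phi(chi - e)) / h**2
        B[i] = -0.5 * var_est[i] * d2phi        # Eq. (31)
    pos = B > 0
    b_plus = np.sqrt(B[pos]).sum()              # Eq. (35)
    R_eff = R + np.count_nonzero(~pos) * s_min  # R + |J_-| * s_min
    A_eff = A - B[~pos].sum() / s_min           # A - sum_{j in J_-} B_j / s_min
    s = np.full(d, float(s_min))
    denom = np.sqrt(b_plus**2 + R_eff * A_eff) - b_plus
    s[pos] = R_eff * np.sqrt(B[pos]) / denom    # Eq. (34), B_i > 0 branch
    return np.maximum(np.ceil(s), s_min).astype(int)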

Algorithm 1 Adam with the weCANS adaptive shot strategy (we-AdamCANS). The function iEvaluate(𝜽,𝒔) evaluates the gradient at the parameter 𝜽 using s_i shots to estimate the i-th partial derivative via the parameter-shift rule (3) or (4). This function returns the vectors 𝒈 and 𝑺 whose components are the estimated partial derivatives and their variances, respectively. The function ShotsEvaluate(A,𝑩,R,s_min) evaluates the number of shots via Eq. (34) using the given A, 𝑩, R, and s_min, and returns the vector 𝒔 of integers.
0:  Step size α\alpha, initial parameter 𝜽0\bm{\theta}_{0}, minimum number of shots smins_{\min}, total wall-clock time TT available for the optimization, hyperparameters β1\beta_{1}, β2\beta_{2}, and ϵ\epsilon, Lipschitz constant LL, running average constant μ\mu, ratio RR of the overhead to the single-shot acquisition time, step-size clipping rate rr, Boolean flag ClippingClipping
1:  Initialize: 𝜽𝜽0\bm{\theta}\leftarrow\bm{\theta}_{0}, t0t\leftarrow 0, tstartt_{\mathrm{start}}\leftarrow current wall-clock time, 𝒔(smin,,smin)T\bm{s}\leftarrow(s_{\min},\cdots,s_{\min})^{\mathrm{T}}, 𝝌(0,,0)T\bm{\chi}^{\prime}\leftarrow(0,\cdots,0)^{\mathrm{T}}, 𝝃(0,,0)T\bm{\xi}^{\prime}\leftarrow(0,\cdots,0)^{\mathrm{T}}, 𝒎(0,,0)T\bm{m}^{\prime}\leftarrow(0,\cdots,0)^{\mathrm{T}}, 𝒗(0,,0)T\bm{v}^{\prime}\leftarrow(0,\cdots,0)^{\mathrm{T}}, k0k\leftarrow 0, αα\alpha^{\prime}\leftarrow\alpha
2:  while tTt\leq T do
3:     Evaluate the current wall-clock time tstartt_{\mathrm{start}}
4:     𝒈,𝑺iEvaluate(𝜽,𝒔)\bm{g},\bm{S}\leftarrow iEvaluate(\bm{\theta},\bm{s})
5:     𝒎β1𝒎+(1β1)𝒈\bm{m}^{\prime}\leftarrow\beta_{1}\bm{m}^{\prime}+(1-\beta_{1})\bm{g}
6:     𝒗β2𝒗+(1β2)𝒈2\bm{v}^{\prime}\leftarrow\beta_{2}\bm{v}^{\prime}+(1-\beta_{2})\bm{g}^{2}
7:     𝒎𝒎/(1β1k)\bm{m}\leftarrow\bm{m}^{\prime}/(1-\beta_{1}^{k})
8:     𝒗𝒗/(1β2k)\bm{v}\leftarrow\bm{v}^{\prime}/(1-\beta_{2}^{k})
9:     𝜽𝜽α𝒎/(𝒗+ϵ)\bm{\theta}\leftarrow\bm{\theta}-\alpha\bm{m}/(\sqrt{\bm{v}}+\epsilon)
10:     𝝌μ𝝌+(1μ)𝒈\bm{\chi}^{\prime}\leftarrow\mu\bm{\chi}^{\prime}+(1-\mu)\bm{g}
11:     𝝃μ𝝃+(1μ)𝑺\bm{\xi}^{\prime}\leftarrow\mu\bm{\xi}^{\prime}+(1-\mu)\bm{S}
12:     \bm{\chi}\leftarrow\bm{\chi}^{\prime}/(1-\mu^{k+1}),\quad\bm{\xi}\leftarrow\bm{\xi}^{\prime}/(1-\mu^{k+1})
13:     𝑴β1𝒎+(1β1)𝝌\bm{M}\leftarrow\beta_{1}\bm{m}^{\prime}+(1-\beta_{1})\bm{\chi}
14:     𝑽β2𝒗+(1β2)𝝌2\bm{V}\leftarrow\beta_{2}\bm{v}^{\prime}+(1-\beta_{2})\bm{\chi}^{2}
15:     𝑴𝑴/(1β1k)\bm{M}\leftarrow\bm{M}/(1-\beta_{1}^{k})
16:     𝑽𝑽/(1β2k)\bm{V}\leftarrow\bm{V}/(1-\beta_{2}^{k})
17:     𝑿𝑴/(𝑽+ϵ)\bm{X}\leftarrow\bm{M}/(\sqrt{\bm{V}}+\epsilon)
18:     αmin{α,r2|𝝌T𝑿|L𝑿2}\alpha\leftarrow\min\{\alpha^{\prime},r\frac{2|\bm{\chi}^{\mathrm{T}}\bm{X}|}{L\|\bm{X}\|^{2}}\}
19:     Aφ(𝝌)A\leftarrow\varphi(\bm{\chi}) using α\alpha as the step size in Eq. (27)
20:     for i[1,,d]i\in[1,\cdots,d] do
21:        Biξi22φgi2(𝝌)B_{i}\leftarrow-\frac{\xi_{i}}{2}\frac{\partial^{2}\varphi}{\partial g_{i}^{2}}(\bm{\chi})
22:     end for
23:     𝒔ShotsEvaluate(A,𝑩,R,smin)\bm{s}\leftarrow ShotsEvaluate(A,\bm{B},R,s_{\min})
24:     if not ClippingClipping then
25:        αα\alpha\leftarrow\alpha^{\prime}
26:     end if
27:     tendt_{\mathrm{end}}\leftarrow current wall-clock time
28:     tt+tendtstartt\leftarrow t+t_{\mathrm{end}}-t_{\mathrm{start}}
29:     tstarttendt_{\mathrm{start}}\leftarrow t_{\mathrm{end}}
30:     kk+1k\leftarrow k+1
31:  end while

We also call this adaptive shot allocation strategy weCANS. In particular, we call the Adam optimizer using the weCANS strategy we-AdamCANS. For clarity, we describe the details of we-AdamCANS in Algorithm 1.

We also obtain the shot allocation rule without considering the latency by taking the limit R0R\rightarrow 0. If |J|=0|J_{-}|=0, we have

\displaystyle s_{i}=2\frac{\sqrt{B_{i}}\sum_{j=1}^{d}\sqrt{B_{j}}}{A}. (37)

Especially, we call the Adam optimizer using this adaptive number of shots AdamCANS.

The best expected gain per single-shot acquisition time of the above generalized weCANS is also twice as large as that of the generalized CANS without latency consideration in the large-R limit, in the same way as weCANS(g) (weCANS(i)) vs. gCANS (iCANS). As shown in the numerical simulations in Sec. IV, we-AdamCANS actually achieves fast convergence in wall-clock time, although its acceleration compared with AdamCANS is not always twofold (sometimes less, sometimes even more).

III.3 Incorporating WRS into weCANS

For H composed of multiple noncommuting observables, we allocate the number of shots for each simultaneously measurable group according to WRS, as explained in Sec. II.2. In this case, the number of circuits m_i used to estimate the gradient is also random because the latency only occurs for the groups that are actually measured. Hence, we cannot access m_i before we decide the number of shots. We therefore take the following heuristic approach to obtain an approximate maximization of the gain per unit time r_o (o = i, g, X) by replacing m_i with an accessible quantity. For simplicity, we consider the case where the two-point parameter shift rule (3) holds and s_i/2 shots are used for the evaluation of the cost function at each shifted parameter. It is straightforward to generalize the following prescription to the more general case with Eq. (4). We assume that we have obtained a grouping of the Pauli observables in Eq. (5) into M groups of simultaneously measurable observables as

\displaystyle H=\sum_{j=1}^{M}\sum_{i\in G_{j}}c_{i}P_{i}. (38)

Then, each group GjG_{j} is measured with probability pj=iGj|ci|/k=1N|ck|p_{j}=\sum_{i\in G_{j}}|c_{i}|/\sum_{k=1}^{N}|c_{k}| in WRS. We remark that we can use any probability distribution pjp_{j} in general. Hence, group GjG_{j} is not measured with probability (1pj)si/2(1-p_{j})^{s_{i}/2} in each evaluation of the cost function. Thus, the expectation value of mi(si)m_{i}(s_{i}) when sis_{i} shots are used is

\displaystyle\mathbb{E}[m_{i}(s_{i})]=2\left(M-\sum_{j=1}^{M}(1-p_{j})^{s_{i}/2}\right). (39)

However, this value still depends on s_i. Although it may be possible to numerically solve the maximization problem of r_o, with m_i replaced by its expectation value (39), which we denote by r̃_o, from scratch, it might be expensive. In such a case, we instead use the exponential moving average s̃_i of s_i calculated from the previous iterations in place of s_i. We can further neglect the terms (1−p_j)^{s̃_i/2} below some small value, since the probability that such a group is never measured is then very small, so it is better to count its latency. In this way, we obtain the analytic expressions of s_i for weCANS, (22) and (24), in the same way as before. Moreover, we can further implement a full maximization of r̃_o by using the obtained analytic expressions as the initial guess.
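For reference, the expectation value of Eq. (39) can be evaluated as in the following sketch, where p is the list of group probabilities p_j:

import numpy as np

def expected_num_circuits(p, s_i):
    """Expected number of circuits for the i-th derivative under WRS, Eq. (39).

    Group j is missed with probability (1 - p_j)^(s_i/2) in each of the two
    cost evaluations of the two-point parameter-shift rule.
    """
    p = np.asarray(p)
    return 2 * (len(p) - np.sum((1 - p) ** (s_i / 2)))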

IV Numerical simulations

In this section, we demonstrate the performance of the weCANS optimizers in comparison with other state-of-the-art optimizers through numerical simulations of three kinds of benchmark VQA problems. We compare weCANS(i), weCANS(g), we-AdamCANS, and AdamCANS with the original iCANS, gCANS, Adam with some fixed number of shots, and SHOALS. For iCANS and gCANS, including their weCANS versions, 1/L is used as the step size, and the other hyperparameters are the same as the default values of iCANS and gCANS. β₁ = 0.9, β₂ = 0.99, ϵ = 10⁻⁸, and α = 1/L are used in we-AdamCANS, AdamCANS, and Adam. As for the other hyperparameters specific to we-AdamCANS and AdamCANS, s_min = 100, μ = 0.99, and r = 0.75 are used. We employ Clipping = False in Algorithm 1.

For the other hyperparameters, the default values of each optimizer given in its proposal paper are used. SHOALS sometimes gets stuck at too small a value of the step size in the line search. To avoid this, we introduced a minimum step size of 10⁻⁴. We implement all the simulations via Qulacs Suzuki et al. (2021).

IV.1 Compiling task

Figure 1: The parameterized quantum circuit used in the compilation task. Each layer of the circuit is composed of a single-qubit Pauli rotation with a randomly chosen axis on each qubit, followed by controlled-Z gates on the adjacent qubits.
Figure 2: Comparison of performance of the optimizers for the compilation benchmark. The dashed line indicates the precision 10⁻³. Adam-s means Adam using s shots for every evaluation of the cost function. The median of the exact values of the cost function at the parameters obtained in 30 trials of the optimization is shown vs the elapsed time with the single-shot time c₁ = 1.0×10⁻⁵ s, circuit-switching latency c₂ = 0.1 s, and communication latency c₃ = 4.0 s.

As a first benchmark problem, we consider a variational compilation task Nakanishi et al. (2020); Kübler et al. (2020). Our task is to minimize, as the cost function, the infidelity of the state U(𝜽)|𝟎⟩ with respect to the target state U(𝜽*)|𝟎⟩ given by a randomly selected parameter vector 𝜽*:

\displaystyle f(\bm{\theta})=1-|\bra{\bm{0}}U(\bm{\theta}^{*})^{\dagger}U(\bm{\theta})\ket{\bm{0}}|^{2} (40)

In this way, the ansatz always has the exact solution at the target parameter.

The parameterized quantum circuit U(𝜽) used in the numerical simulation is shown in Fig. 1, where R_{P_{i,j}}(θ_{i,j}) = exp[−iθ_{i,j}P_{i,j}/2] with P_{i,j} uniformly and randomly selected from the Pauli operators X, Y, Z. In particular, we simulate the optimization with n = 3 qubits and depth D = 3.

The result of our simulation is shown in Fig. 2. We use the latency values c₁ = 1.0×10⁻⁵ s, c₂ = 0.1 s, and c₃ = 4.0 s following Sung et al. (2020); Menickelly et al. (2022). Adam-s means Adam using s shots for every evaluation of the cost function. The median of the exact values of the cost function f(𝜽) at the optimized parameters over 30 runs is shown as the performance metric. Indeed, we-AdamCANS outperforms the other optimizers. In particular, we-AdamCANS attains the precision 1.0×10⁻³ in 773 s, whereas Adam-10000 attains the same precision at 985 s but the error then increases again due to the statistical error. The next fastest is iCANS, which attains the same precision in 1463 s. Hence, we-AdamCANS is about twice as fast as the other optimizers in this problem. However, no improvement is seen in the median performance of the simple strategies weCANS(i) and weCANS(g). That might be because our estimates are based on the maximization of the expectation value of a lower bound of the gain rather than the gain itself. Hence, the performance may be improved by using a tighter estimate of the Lipschitz constant L, which yields a tighter estimate of the lower bound of the gain.

IV.2 VQE tasks of quantum chemistry

Figure 3: The hardware-efficient ansatz used in the numerical simulation of the VQE tasks of quantum chemistry. Each layer of the circuit is composed of a single-qubit X rotation and Z rotation on each qubit, followed by controlled-NOT gates on the adjacent qubits.
Figure 4: Comparison of performance of the optimizers for the VQE benchmark. Every graph shows the median of 30 trials of the optimization. The dashed line indicates the chemical accuracy 1.6×103\times 10^{-3} Ha. (a) ΔE\Delta E vs the elapsed time for H2\mathrm{H}_{2} with c1=105c_{1}=10^{-5} s, c2=0.1c_{2}=0.1 s and c3=4.0c_{3}=4.0 s. (b) ΔE\Delta E vs the economic cost for H2\mathrm{H}_{2} with the pricing of the Rigetti’s devices in Amazon Braket where c1=$3.5×104c_{1}=\text{\$}3.5\times 10^{-4}, c2=$0.3c_{2}=\text{\$}0.3. (c) ΔE\Delta E vs the elapsed time for H3+\mathrm{H}_{3}^{+} with c1=105c_{1}=10^{-5} s, c2=0.1c_{2}=0.1 s and c3=4.0c_{3}=4.0 s (d) ΔE\Delta E vs the economic cost for H3+\mathrm{H}_{3}^{+} with the pricing of the Rigetti’s devices in Amazon Braket where c1=$3.5×104c_{1}=\text{\$}3.5\times 10^{-4}, c2=$0.3c_{2}=\text{\$}0.3.
Figure 5: Comparison of performance in a single run of the optimizers for the VQE benchmark for H3+\mathrm{H}_{3}^{+} in terms of the economic cost with the pricing of the Rigetti’s devices in Amazon Braket. (a) and (b) show two examples of a single run of the optimization out of 30 trials.
Figure 6: Comparison of performance of the optimizers for the VQE benchmark in terms of the number of shots. Both graphs show the median of 10 trials of the optimization. The dashed line indicates the chemical accuracy 1.6×103\times 10^{-3} Ha. (a) ΔE\Delta E vs the number of shots used for H2\mathrm{H}_{2} (b) ΔE\Delta E vs the number of shots used for H3+\mathrm{H}_{3}^{+}. For weCANS optimizers, we input the latency values c1=3.0×104c_{1}=3.0\times 10^{-4}, c2=0.3c_{2}=0.3 and c3=0.0c_{3}=0.0 for (a), and c1=105c_{1}=10^{-5}, c2=0.1c_{2}=0.1 and c3=4.0c_{3}=4.0 for (b) to calculate the number of shots.

As the next benchmark problem, we consider VQE problems of quantum chemistry. We consider the H₂ molecule and the H₃⁺ chain as examples. The distance between H atoms is set to 0.74 Å for both molecules. For the H₂ molecule, we use the STO-3G basis to encode the molecular Hamiltonian into the qubit system via the Jordan-Wigner mapping. For benchmark purposes, we consider the VQE task of H₂ encoded in 4 qubits in this way without any qubit reduction. For the H₃⁺ molecule, we also use the STO-3G basis, but apply the parity transform Seeley et al. (2012) as the fermion-to-qubit mapping. We further reduce two qubits via ℤ₂ symmetry. Then, we obtain a 4-qubit Hamiltonian for the H₃⁺ molecule. We use a greedy algorithm to group the Pauli terms into qubit-wise commuting terms, where we use the coefficient of each term as the weight. We use the hardware-efficient ansatz shown in Fig. 3 with D = 3.

Fig. 4 shows the numerical results for the simulation of VQE. Figs. 4(a) and (c) respectively show the performance of the optimizers for the VQE tasks of H₂ and H₃⁺ in terms of the elapsed time under the single-shot time c₁ = 10⁻⁵ s, circuit-switching latency c₂ = 0.1 s, and communication latency c₃ = 4.0 s, where ΔE denotes the difference between the exact energy expectation value evaluated at the parameters obtained by the optimization and the ground-state energy. On the other hand, Figs. 4(b) and (d) respectively show the performance of the optimizers for the VQE tasks of H₂ and H₃⁺ in terms of the economic cost of using the Rigetti's devices in Amazon Braket, where c₁ = $3.5×10⁻⁴, c₂ = $0.3, and c₃ = $0.0. We remark that the economic cost can be relabeled as the elapsed time, which shows the performance without the communication latency. In all the cases, we-AdamCANS considerably outperforms the other optimizers except for AdamCANS. In particular, only we-AdamCANS and AdamCANS attain the chemical accuracy 1.6×10⁻³ Ha within the range of the simulation (2×10⁴ s and $5×10⁴ for H₂, 5×10⁴ s and $4×10⁵ for H₃⁺). On the other hand, the simple strategies weCANS(i) and weCANS(g) again do not work well in this case.

Except for the VQE performance for H₃⁺ in terms of the economic cost, the latency-aware shot allocation strategy of we-AdamCANS actually yields faster convergence than AdamCANS without latency consideration. However, in that exceptional case of the H₃⁺ VQE vs the economic cost, we-AdamCANS and AdamCANS attain the chemical accuracy at almost the same cost, though the convergence of we-AdamCANS is clearer and further precision is achieved. That seems to be because we-AdamCANS sometimes happens to be delayed by being trapped on a plateau in this problem. In fact, we can verify this observation by looking at some examples of a single run of the optimization shown in Fig. 5. In the case of Fig. 5(a), we-AdamCANS actually achieves considerably faster convergence than AdamCANS. On the other hand, in the case of Fig. 5(b), both we-AdamCANS and AdamCANS seem trapped on a plateau. In this case, AdamCANS escapes from the plateau faster than we-AdamCANS. This phenomenon can be understood from the fact that fluctuations can assist escape from a plateau. Because we-AdamCANS uses a larger number of shots to take the latency into account, AdamCANS has more chance to escape from a plateau thanks to its larger fluctuations. In the case of the larger overhead ratio in Fig. 4(c), the delay of we-AdamCANS due to the plateau might be compensated by the delay of AdamCANS due to the large latency. It is left for future work to further improve the we-AdamCANS shot allocation strategy in such a way that the pros and cons of fluctuations are appropriately balanced to avoid such a plateau problem.

Next, let us also examine the performance in terms of the number of shots. Fig. 6 shows the performance of the optimizers with respect to the number of shots used; Figs. 6(a) and (b) are the results for $\mathrm{H}_2$ and $\mathrm{H}_3^{+}$, respectively. For the weCANS optimizers, we used the latency values $c_1=3.0\times 10^{-4}$, $c_2=0.3$, and $c_3=0.0$ for (a), and $c_1=10^{-5}$, $c_2=0.1$, and $c_3=4.0$ for (b) to calculate the number of shots. When the performance is measured in terms of the total expended shots, AdamCANS indeed outperforms the other optimizers in both problems.

IV.3 VQE of 1-dimensional Transverse field Ising model

Figure 7: The parameterized quantum circuit used in the numerical simulation of the VQE task of a 1-dimensional transverse-field Ising spin chain.
Figure 8: Comparison of the performance of the optimizers for the VQE tasks of the 1-dimensional transverse-field Ising spin chain with open boundary conditions. The median over 10 optimization runs of the exact energy difference per site $\Delta E/(JN)$ at the parameters obtained by each optimizer is shown versus the elapsed time with the single-shot time $c_1=1.0\times 10^{-5}$ s, circuit-switching latency $c_2=0.1$ s, and communication latency $c_3=4.0$ s, for (a) $N=6$ and (b) $N=12$. The dashed line indicates the precision $10^{-2}$.

Finally, in order to demonstrate the performance of our optimizers for larger system sizes, we consider the VQE task of the 1-dimensional transverse-field Ising spin chain with open boundary conditions for system sizes $N=6$ and $12$, where the Hamiltonian is

H = -J\sum_{i=1}^{N-1} Z_i Z_{i+1} - g\sum_{i=1}^{N} X_i.   (41)

We consider the case $g/J=1.5$. For this problem, we used the ansatz shown in Fig. 7 with $D=3$, following Kübler et al. (2020). The ratio of the number of shots allocated to measuring the interaction terms to that for the transverse-field terms is deterministically fixed at $J$ to $g$.
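For readers who want to reproduce the reference energies and the fixed shot split, the following minimal sketch builds the open-chain Hamiltonian of Eq. (41) for a small $N$ with dense matrices, computes the exact per-site ground-state energy used in $\Delta E/(JN)$, and divides a shot budget between the two measurement groups in the ratio $J:g$. The function names are ours, and this is only an illustration, not the simulation code used for the figures.

import numpy as np
from functools import reduce

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=float)
Z = np.array([[1, 0], [0, -1]], dtype=float)

def op_on(site_op, site, n):
    """Tensor product placing `site_op` on qubit `site` of an n-qubit chain."""
    ops = [site_op if i == site else I2 for i in range(n)]
    return reduce(np.kron, ops)

def tfim_hamiltonian(n, J, g):
    """Open-boundary transverse-field Ising Hamiltonian of Eq. (41)."""
    H = np.zeros((2**n, 2**n))
    for i in range(n - 1):                       # -J sum_i Z_i Z_{i+1}
        H -= J * op_on(Z, i, n) @ op_on(Z, i + 1, n)
    for i in range(n):                           # -g sum_i X_i
        H -= g * op_on(X, i, n)
    return H

def split_shots(total, J, g):
    """Deterministic J:g split between the ZZ group and the X group."""
    zz = int(round(total * J / (J + g)))
    return zz, total - zz

J, g, N = 1.0, 1.5, 6
E0 = np.linalg.eigvalsh(tfim_hamiltonian(N, J, g))[0]
print(E0 / (J * N))             # exact per-site ground-state energy
print(split_shots(1000, J, g))  # e.g. (400, 600) shots for (ZZ, X) groups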

Figs. 8(a) and (b) show the performance of the optimizers for the VQE tasks for $N=6$ and $N=12$, respectively, in terms of the elapsed time under the single-shot time $c_1=10^{-5}$ s, circuit-switching latency $c_2=0.1$ s, and communication latency $c_3=4.0$ s. The results are shown in terms of the precision of the per-site energy $\Delta E/(JN)$, where $\Delta E$ denotes the difference between the exact energy expectation value evaluated at the parameters obtained by the optimization and the ground-state energy. Indeed, we-AdamCANS outperforms the other implemented optimizers for both $N=6$ and $N=12$. In particular, we-AdamCANS attains the precision $10^{-2}$ considerably faster than the other optimizers, except for Adam-10000 at $N=12$. However, the performance of Adam with a fixed number of shots is sensitive to the choice of that number; in fact, Adam-10000 is too slow for $N=6$. In contrast, we emphasize that there is no need to tune the number of shots in we-AdamCANS, which robustly achieves fast convergence for both system sizes. In particular, for $N=6$, we-AdamCANS attains $10^{-2}$ in 4044 s, whereas the next fastest optimizer, iCANS, attains the same precision in 17455 s. Hence, we-AdamCANS is more than about four times faster than the others in reaching this precision in this case.

V Conclusion

In this study, we proposed a waiting-time-evaluating strategy, weCANS, for determining the number of shots used by SGD-type optimizers in variational quantum algorithms, with the aim of enhancing convergence in terms of wall-clock time or the cost of using a cloud service. Our weCANS strategy is applicable to a wide range of SGD variants, including Adam. As shown by the numerical simulations, Adam with weCANS (we-AdamCANS) outperforms existing shot-adaptive optimizers in terms of convergence speed and cost, as well as in terms of the shot count in the absence of latency. On the other hand, we have also shown that simply applying weCANS to plain SGD (weCANS(i) and weCANS(g)) is less effective, possibly because the gain estimation is too rough in those cases.

While our results demonstrate the effectiveness of we-AdamCANS, there is still room for improvement. For example, taking into account the positive impact of statistical fluctuations on escaping plateaus may overcome a potential drawback of weCANS: its increased number of shots raises the risk of being trapped on a plateau. Optimizing the shot allocation for each term in the Hamiltonian may lead to further performance enhancements. Generalizing the adaptive shot strategy to a wider range of optimizers and exploring its practical application in VQAs are also open avenues for future research. Finally, we remark that reducing the latency itself is crucial for improving the overall performance of variational quantum algorithms, since the maximum expected gain per single-shot acquisition time inevitably scales as $O(1/R)$ with the overhead ratio $R$, regardless of the shot-allocation strategy.

Acknowledgements.
The author would like to thank Keisuke Fujii and Wataru Mizukami for very helpful comments, and Kosuke Mitarai for his useful python modules. This work is supported by MEXT Quantum Leap Flagship Program (MEXT QLEAP) Grant Numbers JPMXS0118067394 and JPMXS0120319794.

References

  • Preskill (2018) J. Preskill, “Quantum Computing in the NISQ era and beyond,” Quantum 2, 79 (2018).
  • Cerezo et al. (2021) M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio,  and P. J. Coles, “Variational quantum algorithms,” Nature Reviews Physics 3, 625 (2021).
  • Peruzzo et al. (2014) A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik,  and J. L. O’Brien, “A variational eigenvalue solver on a photonic quantum processor,” Nature Communications 5, 4213 (2014).
  • O’Malley et al. (2016) P. J. J. O’Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding, B. Campbell, Y. Chen, Z. Chen, B. Chiaro, A. Dunsworth, A. G. Fowler, E. Jeffrey, E. Lucero, A. Megrant, J. Y. Mutus, M. Neeley, C. Neill, C. Quintana, D. Sank, A. Vainsencher, J. Wenner, T. C. White, P. V. Coveney, P. J. Love, H. Neven, A. Aspuru-Guzik,  and J. M. Martinis, “Scalable quantum simulation of molecular energies,” Phys. Rev. X 6, 031007 (2016).
  • Bauer et al. (2016) B. Bauer, D. Wecker, A. J. Millis, M. B. Hastings,  and M. Troyer, “Hybrid quantum-classical approach to correlated materials,” Phys. Rev. X 6, 031045 (2016).
  • Kandala et al. (2017) A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow,  and J. M. Gambetta, “Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets,” Nature 549, 242 (2017).
  • McClean et al. (2017) J. R. McClean, M. E. Kimchi-Schwartz, J. Carter,  and W. A. de Jong, “Hybrid quantum-classical hierarchy for mitigation of decoherence and determination of excited states,” Phys. Rev. A 95, 042308 (2017).
  • Santagati et al. (2018) R. Santagati, J. Wang, A. A. Gentile, S. Paesani, N. Wiebe, J. R. McClean, S. Morley-Short, P. J. Shadbolt, D. Bonneau, J. W. Silverstone, D. P. Tew, X. Zhou, J. L. O’Brien,  and M. G. Thompson, “Witnessing eigenstates for quantum simulation of hamiltonian spectra,” Science Advances 4 (2018).
  • Heya et al. (2018) K. Heya, Y. Suzuki, Y. Nakamura,  and K. Fujii, “Variational Quantum Gate Optimization,” arXiv:1810.12745 (2018).
  • Colless et al. (2018) J. I. Colless, V. V. Ramasesh, D. Dahlen, M. S. Blok, M. E. Kimchi-Schwartz, J. R. McClean, J. Carter, W. A. de Jong,  and I. Siddiqi, “Computation of molecular spectra on a quantum processor with an error-resilient algorithm,” Phys. Rev. X 8, 011021 (2018).
  • McArdle et al. (2019) S. McArdle, T. Jones, S. Endo, Y. Li, S. C. Benjamin,  and X. Yuan, “Variational ansatz-based quantum simulation of imaginary time evolution,” npj Quantum Information 5, 75 (2019).
  • Jones et al. (2019) T. Jones, S. Endo, S. McArdle, X. Yuan,  and S. C. Benjamin, “Variational quantum algorithms for discovering hamiltonian spectra,” Phys. Rev. A 99, 062304 (2019).
  • Parrish et al. (2019a) R. M. Parrish, E. G. Hohenstein, P. L. McMahon,  and T. J. Martínez, “Quantum computation of electronic transitions using a variational quantum eigensolver,” Phys. Rev. Lett. 122, 230401 (2019a).
  • Higgott et al. (2019) O. Higgott, D. Wang,  and S. Brierley, “Variational quantum computation of excited states,” Quantum 3, 156 (2019).
  • Nakanishi et al. (2019) K. M. Nakanishi, K. Mitarai,  and K. Fujii, “Subspace-search variational quantum eigensolver for excited states,” Phys. Rev. Research 1, 033062 (2019).
  • Tilly et al. (2020) J. Tilly, G. Jones, H. Chen, L. Wossnig,  and E. Grant, “Computation of molecular excited states on ibm quantum computers using a discriminative variational quantum eigensolver,” Physical Review A 102, 062425 (2020).
  • Ollitrault et al. (2020) P. J. Ollitrault, A. Kandala, C.-F. Chen, P. K. Barkoutsos, A. Mezzacapo, M. Pistoia, S. Sheldon, S. Woerner, J. M. Gambetta,  and I. Tavernelli, “Quantum equation of motion for computing molecular excitation energies on a noisy quantum processor,” Phys. Rev. Research 2, 043140 (2020).
  • Farhi et al. (2014) E. Farhi, J. Goldstone,  and S. Gutmann, “A Quantum Approximate Optimization Algorithm,” arXiv:1411.4028 (2014).
  • Farhi and Harrow (2016) E. Farhi and A. W. Harrow, “Quantum Supremacy through the Quantum Approximate Optimization Algorithm,” arXiv:1602.07674 (2016).
  • Otterbach et al. (2017) J. S. Otterbach, R. Manenti, N. Alidoust, A. Bestwick, M. Block, B. Bloom, S. Caldwell, N. Didier, E. S. Fried, S. Hong, P. Karalekas, C. B. Osborn, A. Papageorge, E. C. Peterson, G. Prawiroatmodjo, N. Rubin, C. A. Ryan, D. Scarabelli, M. Scheer, E. A. Sete, P. Sivarajah, R. S. Smith, A. Staley, N. Tezak, W. J. Zeng, A. Hudson, B. R. Johnson, M. Reagor, M. P. da Silva,  and C. Rigetti, “Unsupervised Machine Learning on a Hybrid Quantum Computer,” arXiv:1712.05771 (2017).
  • Mitarai et al. (2018) K. Mitarai, M. Negoro, M. Kitagawa,  and K. Fujii, “Quantum circuit learning,” Phys. Rev. A 98, 032309 (2018).
  • Benedetti et al. (2019) M. Benedetti, E. Lloyd, S. Sack,  and M. Fiorentini, “Parameterized quantum circuits as machine learning models,” Quantum Science and Technology 4, 043001 (2019).
  • Otten et al. (2020) M. Otten, I. R. Goumiri, B. W. Priest, G. F. Chapline,  and M. D. Schneider, “Quantum Machine Learning using Gaussian Processes with Performant Quantum Kernels,” arXiv:2004.11280 (2020).
  • Lavrijsen et al. (2020) W. Lavrijsen, A. Tudor, J. Müller, C. Iancu,  and W. de Jong, “Classical optimizers for noisy intermediate-scale quantum devices,” in 2020 IEEE International Conference on Quantum Computing and Engineering (QCE) (2020) pp. 267–277.
  • Kübler et al. (2020) J. M. Kübler, A. Arrasmith, L. Cincio,  and P. J. Coles, “An Adaptive Optimizer for Measurement-Frugal Variational Algorithms,” Quantum 4, 263 (2020).
  • LaRose et al. (2019) R. LaRose, A. Tikku, É. O’Neel-Judy, L. Cincio,  and P. J. Coles, “Variational quantum state diagonalization,” npj Quantum Information 5, 57 (2019).
  • Arrasmith et al. (2020) A. Arrasmith, L. Cincio, R. D. Somma,  and P. J. Coles, “Operator Sampling for Shot-frugal Optimization in Variational Algorithms,” arXiv:2004.06252 (2020).
  • Stokes et al. (2020) J. Stokes, J. Izaac, N. Killoran,  and G. Carleo, “Quantum Natural Gradient,” Quantum 4, 269 (2020).
  • Koczor and Benjamin (2022a) B. Koczor and S. C. Benjamin, “Quantum natural gradient generalized to noisy and nonunitary circuits,” Phys. Rev. A 106, 062416 (2022a).
  • Nakanishi et al. (2020) K. M. Nakanishi, K. Fujii,  and S. Todo, “Sequential minimal optimization for quantum-classical hybrid algorithms,” Physical Review Research 2, 043158 (2020).
  • Parrish et al. (2019b) R. M. Parrish, J. T. Iosue, A. Ozaeta,  and P. L. McMahon, “A Jacobi Diagonalization and Anderson Acceleration Algorithm For Variational Quantum Algorithm Parameter Optimization,” arXiv:1904.03206 (2019b).
  • Sweke et al. (2020) R. Sweke, F. Wilde, J. Meyer, M. Schuld, P. K. Faehrmann, B. Meynard-Piganeau,  and J. Eisert, “Stochastic gradient descent for hybrid quantum-classical optimization,” Quantum 4, 314 (2020).
  • Koczor and Benjamin (2022b) B. Koczor and S. C. Benjamin, “Quantum analytic descent,” Phys. Rev. Research 4, 023017 (2022b).
  • van Straaten and Koczor (2021) B. van Straaten and B. Koczor, “Measurement cost of metric-aware variational quantum algorithms,” PRX Quantum 2, 030324 (2021).
  • Gu et al. (2021) A. Gu, A. Lowe, P. A. Dub, P. J. Coles,  and A. Arrasmith, “Adaptive shot allocation for fast convergence in variational quantum algorithms,” arXiv:2108.10434 (2021).
  • Menickelly et al. (2022) M. Menickelly, Y. Ha,  and M. Otten, “Latency considerations for stochastic optimizers in variational quantum algorithms,” arXiv:2201.13438 (2022).
  • Tamiya and Yamasaki (2022) S. Tamiya and H. Yamasaki, “Stochastic gradient line bayesian optimization for efficient noise-robust optimization of parameterized quantum circuits,” npj Quantum Information 8, 90 (2022).
  • Bittel et al. (2022) L. Bittel, J. Watty,  and M. Kliesch, “Fast gradient estimation for variational quantum algorithms,” arXiv:2210.06484 (2022).
  • Moussa et al. (2022) C. Moussa, M. H. Gordon, M. Baczyk, M. Cerezo, L. Cincio,  and P. J. Coles, “Resource frugal optimizer for quantum machine learning,” arXiv:2211.04965 (2022).
  • Boyd and Koczor (2022) G. Boyd and B. Koczor, “Training variational quantum circuits with covar: Covariance root finding with classical shadows,” Phys. Rev. X 12, 041022 (2022).
  • Harrow and Napp (2021) A. W. Harrow and J. C. Napp, “Low-depth gradient measurements can improve convergence in variational hybrid quantum-classical algorithms,” Phys. Rev. Lett. 126, 140502 (2021).
  • Karalekas et al. (2020) P. J. Karalekas, N. A. Tezak, E. C. Peterson, C. A. Ryan, M. P. da Silva,  and R. S. Smith, “A quantum-classical cloud platform optimized for variational hybrid algorithms,” Quantum Science and Technology 5, 024003 (2020).
  • (43) Amazon Braket pricing, accessed 2/9/2023, https://aws.amazon.com/jp/braket/pricing/ .
  • Schuld et al. (2019) M. Schuld, V. Bergholm, C. Gogolin, J. Izaac,  and N. Killoran, “Evaluating analytic gradients on quantum hardware,” Phys. Rev. A 99, 032331 (2019).
  • Wierichs et al. (2022) D. Wierichs, J. Izaac, C. Wang,  and C. Y.-Y. Lin, “General parameter-shift rules for quantum gradients,” Quantum 6, 677 (2022).
  • Hamamura and Imamichi (2020) I. Hamamura and T. Imamichi, “Efficient evaluation of quantum observables using entangled measurements,” npj Quantum Information 6, 56 (2020).
  • Crawford et al. (2021) O. Crawford, B. v. Straaten, D. Wang, T. Parks, E. Campbell,  and S. Brierley, “Efficient quantum measurement of Pauli operators in the presence of finite sampling error,” Quantum 5, 385 (2021).
  • Hempel et al. (2018) C. Hempel, C. Maier, J. Romero, J. McClean, T. Monz, H. Shen, P. Jurcevic, B. P. Lanyon, P. Love, R. Babbush, A. Aspuru-Guzik, R. Blatt,  and C. F. Roos, “Quantum chemistry calculations on a trapped-ion quantum simulator,” Phys. Rev. X 8, 031022 (2018).
  • Izmaylov et al. (2020) A. F. Izmaylov, T.-C. Yen, R. A. Lang,  and V. Verteletskyi, “Unitary partitioning approach to the measurement problem in the variational quantum eigensolver method,” Journal of Chemical Theory and Computation 16, 190 (2020).
  • Vallury et al. (2020) H. J. Vallury, M. A. Jones, C. D. Hill,  and L. C. L. Hollenberg, “Quantum computed moments correction to variational estimates,” Quantum 4, 373 (2020).
  • Verteletskyi et al. (2020) V. Verteletskyi, T.-C. Yen,  and A. F. Izmaylov, “Measurement optimization in the variational quantum eigensolver using a minimum clique cover,” The Journal of Chemical Physics 152, 124114 (2020), https://doi.org/10.1063/1.5141458 .
  • Zhao et al. (2020) A. Zhao, A. Tranter, W. M. Kirby, S. F. Ung, A. Miyake,  and P. J. Love, “Measurement reduction in variational quantum algorithms,” Phys. Rev. A 101, 062322 (2020).
  • Wu et al. (2023) B. Wu, J. Sun, Q. Huang,  and X. Yuan, “Overlapped grouping measurement: A unified framework for measuring quantum states,” Quantum 7, 896 (2023).
  • Shalev-Shwartz and Ben-David (2014) S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).
  • Karimi et al. (2016) H. Karimi, J. Nutini,  and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Machine Learning and Knowledge Discovery in Databases, edited by P. Frasconi, N. Landwehr, G. Manco,  and J. Vreeken (Springer International Publishing, Cham, 2016) pp. 795–811.
  • Bottou (1991) L. Bottou, “Stochastic gradient learning in neural networks,” in Proceedings of Neuro-Nimes, Vol. 91 (1991) p. 12.
  • Liu et al. (2022) J. Liu, F. Wilde, A. A. Mele, L. Jiang,  and J. Eisert, “Noise can be helpful for variational quantum algorithms,” arXiv:2210.06723 (2022).
  • Balles et al. (2017) L. Balles, J. Romero,  and P. Hennig, “Coupling adaptive batch sizes with learning rates,” Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI) , 410 (2017).
  • Kingma and Ba (2014) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” Proceedings of the 3rd International Conference on Learning Representations  (2014).
  • Paquette and Scheinberg (2020) C. Paquette and K. Scheinberg, “A stochastic line search method with expected complexity analysis,” SIAM Journal on Optimization 30, 349 (2020).
  • Berahas et al. (2021) A. S. Berahas, L. Cao,  and K. Scheinberg, “Global convergence rate analysis of a generic line search algorithm with noise,” SIAM Journal on Optimization 31, 1489 (2021).
  • Jin et al. (2021) B. Jin, K. Scheinberg,  and M. Xie, “High probability complexity bounds for line search based on stochastic oracles,” in Advances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang,  and J. W. Vaughan (Curran Associates, Inc., 2021) pp. 9193–9203.
  • Briegel et al. (2000) H.-J. Briegel, T. Calarco, D. Jaksch, J. I. Cirac,  and P. Zoller, “Quantum computing with neutral atoms,” Journal of Modern Optics 47, 415 (2000).
  • Sung et al. (2020) K. J. Sung, J. Yao, M. P. Harrigan, N. C. Rubin, Z. Jiang, L. Lin, R. Babbush,  and J. R. McClean, “Using models to improve optimizers for variational quantum algorithms,” Quantum Science and Technology 5, 044008 (2020).
  • Suzuki et al. (2021) Y. Suzuki, Y. Kawase, Y. Masumura, Y. Hiraga, M. Nakadai, J. Chen, K. M. Nakanishi, K. Mitarai, R. Imai, S. Tamiya, T. Yamamoto, T. Yan, T. Kawakubo, Y. O. Nakagawa, Y. Ibe, Y. Zhang, H. Yamashita, H. Yoshimura, A. Hayashi,  and K. Fujii, “Qulacs: a fast and versatile quantum circuit simulator for research purpose,” Quantum 5, 559 (2021).
  • Seeley et al. (2012) J. T. Seeley, M. J. Richard,  and P. J. Love, “The Bravyi-Kitaev transformation for quantum computation of electronic structure,” arXiv:1208.5986 (2012).