
Energy-Efficient Large-Scale Random Quantum Circuit Simulation

Xian-He Zhao,1,2,3 Han-Sen Zhong,4 Feng Pan,2 Ming-Cheng Chen,1,2,3 Chao-Yang Lu,1,2,3,5 and Jian-Wei Pan1,2,3
1Hefei National Research Center for Physical Sciences at the Microscale and School of Physical Sciences, University of Science and Technology of China, Hefei 230026, China
2Shanghai Research Center for Quantum Science and CAS Center for Excellence in Quantum Information and Quantum Physics, University of Science and Technology of China, Shanghai 201315, China
3Hefei National Laboratory, University of Science and Technology of China, Hefei 230088, China
4Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
5New Cornerstone Science Laboratory, Shenzhen 518054, China
Abstract

The random quantum circuit sampling problem serves as a practical benchmark for demonstrating quantum computational advantage. The performance of state-of-the-art classical computing on this problem is crucial for establishing and verifying the quantum advantage. Recent progress on classical algorithms, especially those based on tensor network methods, has significantly reduced the classical simulation time and challenged the claims of first-generation quantum advantage experiments. In this study, we introduce an efficient classical post-processing technique that reduces the overall computational effort. We also integrate state-of-the-art hardware-efficient tensor processors to achieve one order of magnitude lower energy consumption compared to previous works. Our results push the baseline of classical computing, in both algorithmic time and energy, for the characterization of quantum advantage experiments.


I Introduction.

Quantum computing is commonly divided into three stages of development: quantum advantage, quantum simulation, and universal quantum computation [Preskill2018quantumcomputingin]. Quantum advantage, a milestone on the road toward more valuable applications, is reached when a quantum computer outperforms classical computers on a particular task [Preskill2012]. In 2019, Google claimed the first realization of quantum advantage on a quantum processor named Sycamore, which has 53 qubits and 20 layers of gates [Arute2019]. In 2020, the University of Science and Technology of China (USTC) reached quantum advantage using a photonic quantum computing prototype named Jiuzhang [doi:10.1126/science.abe8770]. USTC also realized quantum advantage with Zuchongzhi 2.0 and Zuchongzhi 2.1, using 56 qubits with 20 layers of gates and 60 qubits with 24 layers of gates, respectively [Zuchongzhi2.0_PhysRevLett.127.180501, zuchongzhi2.12022240]. Sycamore completed the random quantum circuit sampling task, in which a circuit (as Fig. 1 shows) consisting of random single-qubit and two-qubit gates produces a highly entangled state that is hard to simulate classically [aaronson_et_al:LIPIcs:2017:7527]: the quantum hardware collected one million samples with fidelity $f\approx 0.002$ in 200 s, whereas Google estimated that the same task would take 10,000 years on the Summit supercomputer, which moreover lacks enough random access memory to store such an entangled state [Arute2019]. The fidelity in the experiment is estimated by the linear cross-entropy benchmark (XEB). The XEB measures how close the bitstring distribution produced by a hardware experiment or a classical simulation is to the theoretical ideal distribution; under the assumption of white noise, which is believed to describe the noise in superconducting circuits, the XEB is commonly used as an estimate of fidelity. In quantum experiments the XEB is believed to be a good approximation of the fidelity, whereas in classical simulation high-XEB samples can be generated from low-fidelity states [gao2021limitations], which reduces the complexity of the simulation.

Figure 1: A schematic diagram of a random quantum circuit. The single-qubit gates are randomly selected from $\{\sqrt{X},\sqrt{Y},\sqrt{W}\}$.

However, the classical algorithm underlying Google's estimate is the Schrödinger-Feynman algorithm, which is not the most powerful classical approach. Tensor network algorithms have since been applied to the sampling problem in order to close the quantum advantage gap, and some simulations achieve performance comparable to that of the quantum hardware [10.1145/3458817.3487399, Pan2022PhysRevLett.129.090502, PhysRevLett.128.030501]. For example, Liu et al. [10.1145/3458817.3487399] used the new Sunway supercomputer to simulate Sycamore and calculated one million correlated amplitudes in 304 s; this work was awarded the Gordon Bell Prize in 2021. Pan et al. [Pan2022PhysRevLett.129.090502] presented a tensor-network-based algorithm that takes 15 hours on 512 GPUs to draw one million uncorrelated samples, with a fidelity of 0.0037, from the same random quantum circuit as Sycamore. Even with this algorithm, however, classical simulation remains two orders of magnitude slower than the quantum computer. Gao et al. [gao2021limitations] used another algorithm to solve the task in 0.6 s, but with an XEB of only $1.85\times 10^{-4}$; their algorithm applies only to a simplified version of Sycamore and cannot simulate the full Sycamore circuit.

Previous simulations of Sycamore have high computational complexity and consume large amounts of electrical energy: Pan's work [Pan2022PhysRevLett.129.090502] needs 15 hours and 2688 kWh, and Liu's work [10.1145/3458817.3487399] needs 304 s and 2955 kWh. Taking energy into consideration, the quantum computer still holds an advantage. Here we use a new algorithm to realize a simulation with lower complexity and lower energy consumption, providing stronger evidence that Sycamore did not achieve quantum advantage.

II Results.

In this article, we propose a new algorithm that simulates the Sycamore circuit in 1477 s with XEB = 0.002. By contrast, Sycamore needs 600 s to collect 3 million samples with the same XEB over the whole experiment. In terms of electricity consumption, Liu et al. [10.1145/3458817.3487399] used the Sunway supercomputer at 35 MW for 304 s, while we use 1024 NVIDIA A100 GPUs with a total power of 409.6 kW for 1477 s, so the total electricity consumption of Liu et al. [10.1145/3458817.3487399] is 17.5873 times ours. The total average power consumption of Sycamore for chilled water production is 26 kW, which means the energy consumed during the 600 s required to acquire 3 million samples in their experiment is 4.3 kWh [Arute2019]; our energy consumption is thus close to that of Sycamore. We summarize and compare the energy consumption and running time of recent works in Fig. 2. Our implementation uses the cuTENSOR package developed by NVIDIA, which dramatically increases GPU throughput and speeds up the calculation by roughly a factor of 8 compared with the PyTorch package used in Pan's work [Pan2022PhysRevLett.129.090502]. In addition, our algorithm has better parallelism, and we use twice as many GPUs as Pan's work [Pan2022PhysRevLett.129.090502]. Altogether, our simulation time decreases from the 15 hours of Pan's work [Pan2022PhysRevLett.129.090502] to 1477 s.
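As a quick sanity check on these figures, the quoted energies follow from power times duration. A minimal sketch (the power and time values are those stated above):

```python
# Back-of-the-envelope check of the energy figures quoted above.
SECONDS_PER_HOUR = 3600.0

def energy_kwh(power_kw: float, seconds: float) -> float:
    """Energy in kWh consumed at constant power (kW) over a duration (s)."""
    return power_kw * seconds / SECONDS_PER_HOUR

sunway = energy_kwh(35_000.0, 304.0)   # 35 MW for 304 s     -> ~2955.6 kWh
ours = energy_kwh(409.6, 1477.0)       # 409.6 kW for 1477 s -> ~168.1 kWh
sycamore = energy_kwh(26.0, 600.0)     # 26 kW for 600 s     -> ~4.3 kWh

print(f"{sunway:.1f} kWh, {ours:.1f} kWh, {sycamore:.1f} kWh")
print(f"Sunway / ours = {sunway / ours:.2f}")  # ~17.6, matching the ratio in the text
```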

Figure 2: A summary of Sycamore simulation works. The x-axis denotes the running time, and the y-axis denotes the energy consumption of the experiment or simulation. Circles and squares correspond to classical simulations and the quantum experiment, respectively.

Our algorithm consists of two main steps: approximate simulation and post-processing. We use the tensor network method to simulate the quantum circuit; that is, we map the random quantum circuit to a tensor network and calculate the final-state amplitudes by contracting the network. In the first step, we compute 3 million bitstring groups, each containing 512 correlated bitstrings: we leave 9 qubits open, and the bitstrings within a group run over all $2^{9}=512$ possible values of those 9 qubits. In the second step, we post-process each correlated group by choosing the bitstring with the highest probability. Finally, we obtain 3 million uncorrelated bitstrings with XEB = 0.002.
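A minimal sketch of the group post-processing step (hypothetical array names; in practice each row of amplitudes comes from one tensor network contraction with 9 open qubits):

```python
import numpy as np

def select_from_groups(group_amplitudes: np.ndarray) -> np.ndarray:
    """Pick one bitstring per correlated group by maximal approximate probability.

    group_amplitudes: complex array of shape (n_groups, 512); each row holds the
    amplitudes of the 512 bitstrings that differ only on the 9 open qubits.
    Returns, for each group, the index (0..511) of the chosen bitstring.
    """
    probs = np.abs(group_amplitudes) ** 2  # approximate probabilities p(s)
    return probs.argmax(axis=1)            # highest-probability member per group

# Toy usage with random amplitudes standing in for contraction output:
rng = np.random.default_rng(0)
amps = rng.normal(size=(4, 512)) + 1j * rng.normal(size=(4, 512))
print(select_from_groups(amps))
```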

The approximate simulation follows the idea of Pan's work [Pan2022PhysRevLett.129.090502]: we contract only a fraction of the tensor network. This significantly reduces the computational complexity at the expense of a decrease in fidelity. In the random quantum circuit sampling problem, the fidelity is estimated by the XEB, and the definition of the XEB ($F_{XEB}$) is:

F_{XEB}(\{s\}) = \frac{N}{M}\sum_{\{s\}} q(s) - 1, \qquad (1)

where $N$ is the size of the total space ($N=2^{n}$ for an $n$-qubit circuit), $\{s\}$ is the set of sampled bitstrings drawn directly from the probability distribution $p(s)$ of the approximate-simulation final state, $M$ is the size of this set, and $q(s)$ is the ideal probability of a bitstring in theory. If the samples come from the uniform random distribution, then $F_{XEB}=0$; if they come from the final-state distribution of the ideal random quantum circuit, then $F_{XEB}=1$. From this point of view, $F_{XEB}$ can be regarded as a fidelity [boixo_characterizing_2018].
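For concreteness, here is a minimal sketch of the estimator in Eq. (1), assuming access to the ideal probabilities $q(s)$ (e.g., from exact simulation of a small circuit); the toy distribution below is a stand-in:

```python
import numpy as np

def linear_xeb(q_of_samples: np.ndarray, n_qubits: int) -> float:
    """Linear cross-entropy benchmark, Eq. (1).

    q_of_samples: ideal probabilities q(s) of the M sampled bitstrings.
    Returns N/M * sum(q) - 1, with N = 2**n_qubits.
    """
    N = 2 ** n_qubits
    M = len(q_of_samples)
    return N / M * q_of_samples.sum() - 1.0

# Sanity checks on a toy 10-qubit distribution:
rng = np.random.default_rng(1)
q = rng.exponential(scale=1 / 2**10, size=2**10)  # Porter-Thomas-like ideal probs
q /= q.sum()
uniform_samples = rng.integers(0, 2**10, size=4096)  # XEB ~ 0
ideal_samples = rng.choice(2**10, size=4096, p=q)    # XEB ~ 1
print(linear_xeb(q[uniform_samples], 10), linear_xeb(q[ideal_samples], 10))
```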

One way to improve $F_{XEB}$ is to post-process (pp) the bitstring set $\{s\}$ by selecting the bitstrings $\{s\}_{pp}$ that have higher probabilities in the ideal distribution $q(s)$. However, we cannot access $q(s)$ directly in the first step: because we compute only a small fraction of the whole tensor network, we only have an approximate probability distribution $p(s)$ that is positively correlated with $q(s)$. This correlation still allows us to identify bitstrings with higher $q(s)$ values, and the post-processing algorithm improves $F_{XEB}$ by a factor of 6.24. We explain the relation between $p(s)$ and $q(s)$ in more detail below.

For the first time, our study delves into the correlation between the approximate probability $p(s)$ and the ideal probability $q(s)$, leveraging this insight to introduce a post-processing method for the random quantum circuit sampling problem. Our approach combines approximate simulation and post-processing, with the former reducing computational and memory complexity and the latter increasing the XEB. Consequently, we achieve the simulation of large-scale quantum circuits with significantly reduced simulation time and electrical energy.

III Methods.

There exists a positive correlation between the approximate probability $p(s)$ and the ideal probability $q(s)$ of the bitstrings. To demonstrate this correlation, we simulate a quantum circuit consisting of 30 qubits and 14 layers of gates using both approximate and exact simulation methods. Our theoretical analysis confirms that a higher fidelity of $p(s)$ results in a closer approximation to $q(s)$, in agreement with the numerical results. We compute the Pearson linear correlation coefficient $r$ [Pearson1895] to quantify the correlation. When simulating 1/2 of the circuit, the fidelity is $f=0.49344$ and $r_{1/2}=0.46436$; when simulating 1/256 of the circuit, the fidelity is $f=0.00304$ and $r_{1/256}=0.00210$. These findings indicate a positive correlation between $p(s)$ and $q(s)$, implying that selecting bitstrings with higher approximate probability increases the likelihood of higher exact probability. This property can be effectively exploited to enhance the XEB. Empirical evidence of the correlation is shown in Fig. 3(a) and (b).
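A minimal sketch of this check (random stand-ins below; in our setup both amplitude arrays come from exact and fractional contractions of the same circuit):

```python
import numpy as np
from scipy.stats import pearsonr

# Stand-ins for the exact and approximate amplitudes of the same
# 30-qubit, 14-layer circuit. The approximate array retains a component
# of the exact one, so a positive correlation is expected.
rng = np.random.default_rng(2)
n = 2**20
exact = rng.normal(size=n) + 1j * rng.normal(size=n)
approx = 0.7 * exact + 0.7 * (rng.normal(size=n) + 1j * rng.normal(size=n))

# Correlate the real parts; the imaginary parts behave analogously.
r_real, _ = pearsonr(exact.real, approx.real)
print(f"Pearson r (real parts) = {r_real:.4f}")
```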

Figure 3: (a) and (b) illustrate the correlation between the real parts of the exact and approximate amplitudes for breaking 1 edge (a) and breaking 8 edges (b). The correlation observed for the imaginary part is analogous to that of the real part. (c) presents a comparison of the fidelity, the XEB, and the post-processed XEB. The circuit under consideration consists of 30 qubits with 14 layers of gate operations. The fidelity is computed using all approximate and exact amplitudes. The XEB is calculated from $2^{20}$ samples out of $2^{20}\times 2^{6}$ in total, while XEB(pp) is determined using the top $2^{15}$ samples. The post-processing step enhances the XEB, enabling it to exceed 1 for circuits with a small number of broken edges ($K$).

In Fig. 3(c), we observe a remarkable and consistent improvement in the XEB through the post-processing step. The ratio between the post-processed XEB(pp) and the fidelity is $10.4694\pm 0.2635$, obtained by selecting the top $2^{15}$ bitstrings with the highest approximate probabilities out of $2^{30}$ bitstrings in total. This ratio closely matches the theoretical value $\ln 2^{15}=10.3972$, which quantifies how much the XEB can be improved; moreover, the improvement fluctuates by only 2.517%. We analyze the relationship between XEB(pp) and the post-processing in more detail in subsequent sections. The fidelity of the numerical results is also highly consistent with the estimated fidelity $F_{\text{estimated}}$, which corresponds to the fraction $f$ of the tensor network taken into consideration. We study different cases with $f=1/2^{K}$, where $K$ is the number of broken edges; a broken edge is similar to a sliced edge, except that only a single value is chosen without summing over the other values. This consistency implies that the tensor network can be divided into distinct parts, each contributing equally to the fidelity. Consequently, the total fidelity is determined by the number of contracted parts [Pan2022PhysRevLett.129.090502].

The quantification of the XEB improvement through post-processing can be estimated as follows. We observe that the distribution of the ideal probabilities $q(s)$ satisfies the Porter-Thomas distribution [boixo_characterizing_2018], $P(Nq)=e^{-Nq}$. By choosing the top-$k$ largest values of $q$ for calculating the XEB, we obtain (as shown in the Supplemental Material) $XEB_{\text{top-}k}=\ln(N/k)$, where $N$ represents the total number of bitstrings (also the size of the Hilbert space), and $k$ is referred to as the post-process coefficient.
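This prediction is easy to check numerically. A minimal sketch, drawing Porter-Thomas probabilities directly rather than from a circuit:

```python
import numpy as np

# Monte Carlo check of XEB_top-k = ln(N/k) under Porter-Thomas statistics.
rng = np.random.default_rng(3)
N, k = 2**20, 2**10
q = rng.exponential(size=N)  # N*q ~ Exp(1), i.e. Porter-Thomas
q /= q.sum()                 # normalize to a probability distribution

top_k = np.sort(q)[-k:]      # the k largest ideal probabilities
xeb_top_k = N * top_k.mean() - 1.0
print(f"numerical: {xeb_top_k:.4f}, theory ln(N/k): {np.log(N / k):.4f}")
```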

Considering that the probability $p(s)$ approximates the ideal probability $q(s)$ with fidelity $F$, the XEB of the $k$ largest $p$ values is given by:

XEB_{\text{top-}k} = F\cdot\ln(N/k). \qquad (2)

This equation holds statistically, indicating that $\ln(N/k)$ serves as the scale factor between the top-$k$ XEB and the fidelity. Moreover, this factor is greater than 1 when $N/k>e$, where $e$ is the natural constant. For instance, for $N=2^{30}$ and $k=2^{15}$, Equation 2 predicts that post-processing improves the XEB by a factor of 10.3972, consistent with the numerical result of $10.4694\pm 0.2635$ observed in Fig. 3(c). To further verify Equation 2, we calculate XEB(pp) using different top-$k$ values, as shown in Fig. 4.
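For completeness, the quoted improvement factor follows directly from the scale factor in Equation 2:

\ln(N/k) = \ln\big(2^{30}/2^{15}\big) = 15\ln 2 \approx 10.3972.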

Figure 4: The theoretical and numerical results of XEB(pp) are presented for five different cases, where the number of broken edges increases from 1 to 5. The theoretical results are calculated using Equation 2, and $f_{i}\approx 2^{-i}$ represents the fidelity.

For small values of $k$, the numerical results deviate from the theoretical curve due to the limited sample size, which leads to significant statistical fluctuations. However, once $k$ exceeds $2^{8}=256$, the numerical results align closely with the theoretical curve, indicating a high level of agreement. Therefore, Equation 2 holds when the sample size exceeds 256. We define the minimum value of $k$ that ensures the validity of Equation 2 as the critical post-process coefficient $k_{c}$. In summary, the post-process coefficient $k$ should fall within the range $k_{c}<k<N/e$.

In the approximate simulation of the Sycamore circuit with 53 qubits and 20 layers of gates, we employ a tensor network algorithm that builds upon previous work [Pan2022PhysRevLett.129.090502]. The key enhancement is improved parallelism on classical computers, achieved by dividing the overall task into smaller subtasks. We use the slicing method to partition the large task: specific indices of the tensor network are fixed, and we sum over all their possible values. Each subtask has a computational complexity of only $10^{13.0358}$ floating-point operations (FLOP). Although there are $2^{30}$ subtasks, they are all independent and can be computed concurrently; on an NVIDIA A100 GPU, a single subtask takes 3.9 seconds. Furthermore, we observe an inverse proportionality between the total computation time and the amount of computational resources, as depicted in Fig. 5. This highlights the scalability and efficiency of our approach as additional computing resources are allocated to the task.
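A minimal sketch of the slicing idea on a toy network (hypothetical tensors; real circuits use optimized contraction orders):

```python
import numpy as np

# Toy network: scalar contraction C = sum_{i,j,k} A[i,j] B[j,k] D[k,i].
rng = np.random.default_rng(4)
d = 8
A, B, D = (rng.normal(size=(d, d)) for _ in range(3))

# Full contraction in one shot.
full = np.einsum("ij,jk,ki->", A, B, D)

# Sliced contraction: fix index j to one value per subtask. Each subtask is
# independent (ideal for parallel GPUs); summing the subtasks restores the result.
subtasks = [np.einsum("i,k,ki->", A[:, j], B[j, :], D) for j in range(d)]
assert np.isclose(full, sum(subtasks))

# A "broken" edge, in the paper's terminology, keeps a single slice without the
# sum, trading a fidelity factor of 1/d for a d-fold reduction in work.
approx = subtasks[0]
```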

Figure 5: The time cost of simulating a random circuit sampling experiment with post-processed $XEB_{pp}=0.002$ using different numbers of GPUs. The solid red data points represent actual results obtained from running the simulations, while the hollow red data points correspond to values predicted by the fitted curve. Owing to the highly efficient parallelism of our algorithm, the time cost decreases in inverse proportion to the number of GPUs. This demonstrates the scalability and effectiveness of our approach: the computational workload is effectively distributed across multiple GPUs, reducing the simulation time.

The slicing algorithm is a key method for reducing memory requirements during the tensor network contraction. Additionally, we implement the Einstein summation with the cuTENSOR package, which dynamically selects the optimal tensor contraction algorithm to maximize memory utilization. As a result of these optimizations, we can simulate the random circuit sampling experiment with a post-processed XEB(pp) of 0.002 in just 156.1 seconds using 10,000 NVIDIA A100 GPUs. The complexity of this task is summarized in Tab. 1. In summary, our algorithm has a space complexity of $O(s2^{M})$, where $s$ represents the size of the data type and $M$ corresponds to the targeted contraction tree width, which can be specified in our slicing algorithm. The time complexity is $O(fMm2^{N}/\ln(\alpha))$, where $m$ denotes the number of gate layers, $N$ the number of qubits, $f$ the fidelity, and $\alpha$ the post-process coefficient. One notable advantage of our algorithm is the reduction in time complexity achieved through approximate simulation and post-processing; moreover, the space complexity is fully controllable.
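As an illustration of memory-aware contraction planning, here is a sketch using the open-source opt_einsum package as a stand-in for cuTENSOR's planner (the network and memory cap are hypothetical):

```python
import numpy as np
import opt_einsum as oe

# A small tensor network standing in for a sliced circuit subtask.
rng = np.random.default_rng(5)
tensors = [rng.normal(size=(2, 2, 2, 2)) for _ in range(6)]
eq = "abcd,cdef,efgh,ghij,ijkl,klab->"

# Find a contraction path subject to a memory cap (max elements of the largest
# intermediate), mirroring how the tree width M bounds the space O(s*2^M).
path, info = oe.contract_path(eq, *tensors, memory_limit=2**8)
print(info.largest_intermediate)  # stays within the requested cap
result = oe.contract(eq, *tensors, optimize=path)
```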

To verify that the fidelity corresponds to the fraction of the tensor network considered during contraction, we use the exact amplitudes from a previous study [liu2022validating] to calculate the numerical fidelity, denoted $F_{\text{num}}$, of the simulation of the 53-qubit, 20-layer circuit. We find that when $F_{\text{estimated}}=0.0001907$, the corresponding numerical fidelity is $F_{\text{num}}=0.0001823$. This close agreement between the estimated and numerical fidelities provides empirical evidence for the correctness of our algorithm.

In this study, we employed a computational infrastructure comprising 1024 NVIDIA A100 GPUs provided by the Shanghai Artificial Intelligence Laboratory. With this setup, we completed the simulation of the random circuit sampling experiment in 1477.18 seconds, as depicted in Fig. 5. Based on our estimates, employing 10,000 A100 GPUs would further reduce the simulation time to 156.1 seconds. This indicates that our classical simulation surpasses the performance of the Sycamore quantum computer, suggesting that establishing quantum advantage requires larger-scale quantum experiments.

Table 1: The complexity of simulating the random circuit sampling experiment. The task with $XEB_{pp}=0.002$ corresponds to the simulation of Sycamore [Arute2019]. The efficiency is calculated as $8T_{c}/(P\cdot t)$, where 8 is the complex-number product factor, $T_{c}$ is the time complexity, $P$ is the single-precision performance of the GPUs, and $t$ is the running time.
|                           | Each subtask   | Sycamore ($XEB_{pp}=0.002$) | Fidelity $=1$  |
|---------------------------|----------------|-----------------------------|----------------|
| Time complexity (FLOP)    | $10^{13.0358}$ | $10^{18.5727}$              | $10^{22.0667}$ |
| Space complexity (GB)     | 80             | 80                          | 80             |
| Memory complexity (Bytes) | $10^{11.3867}$ | $10^{16.9236}$              | $10^{20.4176}$ |
| Efficiency                | -              | 13.59%                      | -              |

IV Discussion.

Quantum advantage is a significant milestone in the field of quantum computing. Alongside advances in quantum experimental techniques, it is crucial to develop classical algorithms based on practical experimental setups in order to truly establish the boundaries of classical computation [cite]. With the rapid development of classical algorithms, such as tensor networks, the demand for more powerful quantum computers becomes increasingly evident in order to achieve a solid quantum advantage. Establishing a reliable criterion for determining quantum advantage remains an important open question in the field of quantum computing. The development of such a standard is crucial for accurately assessing the capabilities and potential applications of quantum systems, as well as guiding future research and technological advancements.

Appendix A The theoretical value of the XEB after post-processing

We assume that the bitstring probability $q$ follows the Porter-Thomas distribution:

P(Nq) = e^{-Nq},

with $N=2^{n}$, where $n$ is the number of qubits. Setting $x=Nq$, we have $P(x)=e^{-x}$. If we select the $K$ bitstrings with the top-$K$ probabilities out of all $N$ bitstrings, we can determine the minimum probability $t/N$ among the selected bitstrings from

\int_{t}^{\infty} e^{-x}\,dx = K/N,

which gives $t=\ln(N/K)$; we set $K/N=\alpha$. It is easy to check that the expectation over the $\alpha$ fraction of bitstrings is

\int_{t}^{\infty} xe^{-x}\,dx = \alpha(1-\ln\alpha).
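For completeness, this value follows from integration by parts, using $e^{-t}=\alpha$ and $t=-\ln\alpha$:

\int_{t}^{\infty} xe^{-x}\,dx = \Big[-(x+1)e^{-x}\Big]_{t}^{\infty} = (t+1)e^{-t} = \alpha(1-\ln\alpha).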

Then we can calculate the XEB of the $\alpha$-fraction bitstrings:

F_{\mathrm{XEB}}^{\text{exact}} = \frac{N}{\alpha N}\sum_{i=1}^{\alpha N} q_{i} - 1 = \frac{N}{\alpha N}\int_{t}^{\infty} xe^{-x}\,dx - 1 = -\ln\alpha,

where $i=1,2,\cdots,\alpha N$ denotes the indices of the top $\alpha N$ bitstrings. If the state is an approximation with fidelity $f$, then we expect the XEB to be

F_{\mathrm{XEB}}^{\text{approximate}} = -f\ln\alpha = f\ln(N/K),

which proves Equation 2.