Large-Scale Simulation of Quantum Computational Chemistry on a New Sunway Supercomputer

Honghui Shang261, Li Shen36, Yi Fan 3, Zhiqian Xu2, Chu Guo 13141, Jie Liu 101, Wenhao Zhou 4,
Huan Ma 3, Rongfen Lin 11, Yuling Yang 4, Fang Li 4, Zhuoya Wang 12, Yunquan Zhang2, Zhenyu Li3 2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China 3 University of Science and Technology of China, Hefei, China 10 Hefei National Laboratory, University of Science and Technology of China, Hefei, Anhui 230088, China 13 Henan Key Laboratory of Quantum Information and Cryptography, Zhengzhou, Henan 450000, China 14 Hunan Normal University, Changsha 410081, China 4National Supercomputing Center in Wuxi, Wuxi, China 11 Department of Computer Science and Technology, Tsinghua University, Beijing, China 12 Pilot National Laboratory for Marine Science and Technology, Qingdao, China 6These authors contribute equally to this work; 1Corresponding authors:[email protected]; [email protected]; [email protected]

Abstract

Quantum computational chemistry (QCC) is the use of quantum computers to solve problems in computational quantum chemistry. We develop a high performance variational quantum eigensolver (VQE) simulator for simulating quantum computational chemistry problems on a new Sunway supercomputer. The major innovations include: (1) a Matrix Product State (MPS) based VQE simulator to reduce the amount of memory needed and increase the simulation efficiency; (2) a combination of the Density Matrix Embedding Theory with the MPS-based VQE simulator to further extend the simulation range; (3) A three-level parallelization scheme to scale up to 20 million cores; (4) Usage of the Julia script language as the main programming language, which both makes the programming easier and enables cutting edge performance as native C or Fortran; (5) Study of real chemistry systems based on the VQE simulator, achieving nearly linearly strong and weak scaling. Our simulation demonstrates the power of VQE for large quantum chemistry systems, thus paves the way for large-scale VQE experiments on near-term quantum computers.

I Introduction

Our world is fundamentally quantum mechanical. However, the application of the quantum mechanics to chemical problem “leads to equations much too complicated to be soluble”, as Paul Dirac noted. Directly solving the many-electron Schrödinger equation, such as the full configuration interaction (FCI) method, has a complexity that grows exponentially with the problem size. Thus it would soon fail for moderate-size molecules with about $24$ electrons (24 orbitals), which corresponds to a diagonalization problem of size $7.3$ trillion [1].

The advent of quantum computing technologies allows a new approach to solve quantum chemistry problems, referred to as quantum computational chemistry [2, 3]. In the long term, the quantum phase estimation algorithm can efficiently solve quantum chemistry problems with fault-tolerant quantum computers. In the short term, variational quantum eigensolver (VQE) provides a promising heuristic algorithm on noisy quantum computers [4, 5]. VQE is believed to be friendly to near-term quantum computers in that it adapts to the qubit counts, the connectivity and the coherence time on noisy intermediate-scale quantum (NISQ) devices, while maintaining a shallow circuit to mitigate the effects of noise. Recently, quantum computational advantages for the problem of random circuit sampling have been demonstrated on noisy quantum computers with between $50$ and $100$ qubits [6, 7, 8]. However, demonstrating practical quantum advantages on real world problems such as quantum chemistry problems remains a great challenge for noisy quantum computing, for which VQE based quantum chemistry solver is a promising candidate. A variety of electronic structure problems to achieve an industrially relevant computational advantage on contemporary and near-future quantum computers have been discussed recently [9, 10].

At the current stage, classical simulation of quantum computation is crucial for the study of quantum algorithms and quantum computing architectures, especially for heuristic quantum algorithms as VQE. The largest VQE experiment performed on a quantum computer up to date has used $12$ qubits [11]. An industrially relevant quantum computational advantage in quantum chemistry is expected to appear at around $38\leq N\leq 68$ qubits (under the assumption of error-corrected qubits), which is related to an electronic structure problem including $19\leq N\leq 34$ electrons [9]. However, as a heuristic algorithm, it is a priori not clear that the given ansatz for the parametric quantum circuit is expressive enough to represent the ground state of the problem and that even it is, whether the iterative algorithm can converge to the ground state. Moreover, current quantum computers are noisy and the errors of quantum gate operations are often dependent on the types of the gates as well as the qubits that they act on. As a result the accuracy and performance scaling of VQE with the problem size and the complexity of the parametric circuit ansatz may be hardware specific and also problem specific. Therefore before running a full fledged VQE application on current quantum computers with possibly hundreds of qubits, the accuracy and performance scaling, robustness as well as limitations of VQE should first be explored using a classical simulator. This situation is similar to the random quantum circuit (RQC) sampling task used to demonstrate quantum supremacy, where a benchmark baseline from noiseless classical simulation is in need to fully characterize its precision [12, 13, 14, 15, 16, 17, 18]. Compared to RQC, VQE is much more demanding for both quantum and classical computers, for example, the number of CNOT gates involved in a typical quantum computational chemistry simulation quickly goes beyond $10^{6}$ with commonly used physically motivated ansatz such as unitary coupled-cluster (UCC). Moreover, the parametric quantum circuit has to be executed many times as is typical for variational algorithms. These effects limit most of the current investigations of VQE using classical computers to very small problems (less than $20$ qubits).

In view of these challenges for practical application of VQE in the age of noisy quantum computing, we believe a highly efficient and large-scale classical simulator of VQE on a leading supercomputer is currently in need, in that, 1) it allows to analyze the scaling of VQE for practical quantum chemistry systems; 2) it allows a step by step cross verification of near-term VQE algorithms running on noisy quantum computers with around $100$ qubits, which is essential, for example, to quantify the quantum computational advantage; 3) this algorithm itself is also a highly parallelizable classical algorithm to solve large-scale quantum chemistry problems and thus would have practical importance on its own.

Based on the above considerations, we develop a highly efficient VQE simulator for the new generation Sunway supercomputer. We use the physically motivated UCC ansatz for the parametric quantum circuit of VQE [19]. In view of the nearest-neighbour nature of such parametric quantum circuit, we use the Matrix Product State (MPS) to efficiently represent the underlying quantum state [20]. Our MPS-VQE simulator can efficiently simulate around $100$ -qubit quantum circuit with moderate circuit depth. Since the majority of real-world molecules have a very large number of active electrons, which would often require more than $1000$ qubits in VQE, we further extend our capability of our MPS-VQE simulator using the Density Matrix Embedding Theory (DMET) [21]. The combination of these techniques allow us to address quantum chemistry problems with up to $1000$ qubits with descent accuracy. The major innovations of the current work are summarized as follows:

•

The MPS representation of the quantum state is used to explore the low-entanglement nature for relatively shallow quantum circuits, which significantly reduces the memory requirement and the computation time for VQE (the largest VQE simulation in this work uses $200$ qubits).
•

The divide-and-conquer technique for quantum many-body systems, DMET, is used in combination with the MPS-VQE simulator to further extend the simulation scale.
•

A three-level parallelization scheme from DMET to MPS and then down to elementary tensor operations to scale up to $20$ million cores.
•

The Julia script language is used as the main programing language, which allows to achieve high performance without the need to perform any low-level operations using systems languages such as C and Fortran. The fast prototyping nature of the Julia language as well as its rich scientific libraries and benchmarking tools make optimizations much easier.
•

Real chemistry systems have been studied, and nearly linearly strong and weak scaling have been achieved.

These innovations enable us to perform quantum computational chemistry simulations on the scale of hundreds of atoms. It should be noted that although the ingredients used in this work, namely MPS, DMET, VQE, Julia, exist before this work, a full integration of those methods to simulate the realistic large chemical systems is nontrivial and has not been done before. For example, MPS has not been combined with DMET or VQE before. A highly parallelizable design of the simulation scheme and an efficient implementation of it to scale up to over 20 million cores is also original in this work. Beside, unleashing the Julia programming language on Sunway architectures and running it efficiently over 20 million cores is also an extremely challenging task. Briefly speaking, our work has set the standard for large-scale classical simulation of quantum computational chemistry, and paves the way for benchmarking VQE applications on near-term noisy quantum computers.

II Background

II-A Current state of the art

VQE approximates the exact eigenstate with a unitary transformation acting on a reference state. The reference state is often chosen to be some state that can be easily prepared on a quantum computer and the unitary transformation is represented by a parametric quantum circuit, with the parameters updated iteratively based on the output energy of each minimization step. A classical simulator for VQE simulates this process on a classical computer. The brute-force approach to simulate a quantum circuit is the state vector simulator, which exactly stores the underlying $n$ -qubit quantum state. The memory cost of the state vector simulator scales as $2^{n}$ for pure quantum states (implemented in ProjectQ , [22] Qiskit, [23] QuEST [24] and Qulacs [25]) and $2^{2n}$ for mixed quantum states (implemented in DM-Sim , Qiskit, [23] QuEST [24] and Qulacs [25]). The largest quantum circuit simulation with this approach has reached $45$ qubits, but it would be extremely difficult to go further beyond due to the exponential memory cost [26].

Tensor network based algorithms are important alternatives that reduce the memory requirement by representing the quantum state as a tensor network of low-rank tensors, which have been widely applied to simulate RQCs [12, 13, 14, 15, 16, 17, 18]. The tensor network methods used for RQCs are mostly based on high-dimensional tensor network contractions since the RQCs under study usually have well defined two-dimensional geometries. However, the parametric quantum circuits used in VQE would not have such clear geometrical features in general, especially for those physical motivated ansatz such as unitary coupled-cluster as used in this work. In such instances the MPS (a very special type of one-dimensional tensor network that can efficiently simulate weakly entangled quantum states) based simulator is often preferable, especially for quantum chemistry problems. We also note that MPS has already been widely applied to solve quantum chemistry problems by directly computing the ground states in the context of Density Matrix Renormalization Group (DMRG) algorithm [27, 28]. MPS has also been used to directly simulate Shor’s algorithm up to $60$ qubits [29]. MPS-based simulator has already been implemented, for example in QISKIT. However, large-scale implementation of it for VQE on a top supercomputer has not been demonstrated.

The electronic Hamiltonian of a chemical system can be written in a second-quantized formulation:

\displaystyle\hat{H}

\displaystyle=\sum_{p,q}{h_{q}^{p}a_{p}^{\dagger}a_{q}}+\frac{1}{2}{\sum_{\begin{subarray}{c}p,q\\ r,s\end{subarray}}}{g_{rs}^{pq}a_{p}^{\dagger}a_{q}^{\dagger}a_{r}a_{s}}

(1)

where $h_{q}^{p}$ and $g_{rs}^{pq}$ are one- and two-electron integrals in molecular orbital basis. In a VQE procedure, the creation and annihilation operators $\{{a_{p}^{\dagger}},a_{q}\}$ in the Hamiltonian should be mapped to weighted Pauli strings using the Jordan-Wigner or Bravyi-Kitaev transformation, therefore the energy can be obtained by measuring and summing the expectation values of Pauli strings $\{P\}\in\{I,\sigma_{x},\sigma_{y},\sigma_{z}\}^{\otimes n}$ as

$\displaystyle E$	$\displaystyle=\langle{\hat{H}}\rangle=\langle\Psi(\theta)\|\hat{H}\|\Psi(\theta)\rangle$	(2)
	$\displaystyle=\langle\Psi(\theta)\|\sum_{p,q}{\tilde{h}_{q}^{p}P_{q}^{p}}\|\Psi(\theta)\rangle$
	$\displaystyle+\langle\Psi(\theta)\|\sum_{p,q,r,s}{\tilde{g}_{rs}^{pq}P_{rs}^{pq}}\|\Psi(\theta)\rangle.$

The accuracy of VQE heavily depends on the wave function ansatz (or the structure of the parametric quantum circuit) used to approximate the unknown ground state of the problem Hamiltonian. The mostly used form of the wave function ansatz is the physically motivated unitary coupled-cluster (UCC)[19], where the wave function takes the form

|\Psi(\theta)\rangle=e^{\hat{T}(\theta)-\hat{T}^{\dagger}(\theta)}|\Phi_{0}\rangle.

(3)

Here $|\Phi_{0}\rangle$ is an initial state in the form of a single Slater determinant. When truncated to the single and double excitations (UCCSD), the cluster operators are

\hat{T}(\theta)=\sum_{a,i}\theta_{i}^{a}\hat{a}^{\dagger}_{a}\hat{a}_{i}+\frac{1}{4}{\sum_{a,b,i,j}}\theta_{ij}^{ab}\hat{a}_{a}^{\dagger}\hat{a}_{b}^{\dagger}\hat{a}_{i}\hat{a}_{j},

(4)

where $\{i,j,k,\dots\}$ , $\{a,b,c,\dots\}$ and $\{p,q,r,\dots\}$ indicate the occupied, virtual, and general orbitals, respectively. The UCCSD method has no exact finite truncation for the Baker-Campbell-Hausdorff expansion, which indicates that it can not be effectively implemented on a classical computer using polynomial expansion methods or the explicit matrix-vector method. UCCSD can be implemented as a parametric quantum circuit by Suzuki-Trotter decomposition of the unitary evolution operator into one- and two-qubit gates [30].

Many contributions have been devoted to reducing the complexity of the quantum circuit ansatz for quantum chemistry problems. Recently, the excited states of Ir(ppy)₃ have been computed with up to $56$ qubits [31]. This is the largest classical simulation of VQE for quantum computational chemistry in terms of the number of qubits up to date. However, to achieve this simulation a very shallow parametric quantum circuit ansatz (with around $60$ two-qubit gates) was employed to reduce the computational complexity.

Following the philosophy that ”divide each of the difficulties under examination into as many parts as possible, and as might be necessary for its adequate solution”, a general VQE framework combined with the divide-and-conquer method (DMET) [32] for large-scale simulation of quantum chemistry has been proposed [33]. Using this approach, the C₁₈ molecule with $144$ qubits is studied classically, where the VQE in each fragment uses at most $16$ qubits [34]. A similar idea has been adopted in a hybrid tensor network approach, in which a bottom-layer tensor network is introduced to describe the strongly correlated intrasubsystem interactions and a top-layer tensor network is used to describe the intersubsystem interactions [35]. This embedding method can also be applied to excited state calculation [36, 37].

Overall, the current state-of-the-art quantum computational chemistry simulators for large-scale systems are limited with respect to computational efficiency and scalability. Thus, the classical simulation of large-scale quantum chemistry systems are severely limited. This represents the key bottleneck in simulating the physical properties of real systems.

The goal we want to achieve is to faithfully simulate the large-scale VQE experiments for practical Chemical systems running on quantum computers, using a world-leading supercomputer. Since we are aiming for a parallelization scale of up to tens of millions of cores, the algorithm-wise challenge is to identify a suitable algorithm which could map the computation onto all the available cores of the underlying supercomputer, thus extending the simulation range of VQE to quantum chemistry systems which are of practical relevance.

The state vector simulator is out of choice due to the huge memory requirement. We thus take the tensor network based simulator which reduces the exponential memory growth. In practice we find that MPS based simulator gives the best performance when used in combination with the UCC ansatz. To spread the computations of the MPS-VQE simulator over the new Sunway supercomputer, there are three implementation-wise challenges:

•

To parallelize over the $20$ million cores, we need a scheme which can first efficiently decompose the VQE simulation into sub-tasks (independent quantum circuit evolution and measurements) with little communication overhead and of similar complexity, and then distribute the tensor operations involved in a single circuit evolution in each process.
•

To achieve maximal efficiency for the major tensor operations, which are SVD and tensor contraction, we need a well tuned map of these tensor operations to the underlying CPE mesh.
•

Given the highly complex nature of the program including a hybridization of DMET, MPS and VQE as well as the application for practical problems, a variety of scientific libraries in high-level programming languages such as Python needs to be used to avoid reinventing the wheels. To build such a complex application with the simplest coding and also maintain a computational efficiency close to the low level C language, we pay a huge effort to develop the Julia compilation framework with SACA runtime support on heterogeneous processors to achieve functional implementation and performance improvement.

II-B HPC System and Environment

The new-generation Sunway supercomputer is used for performance assessment in this work, which is the successor of the Sunway TaihuLight supercomputer. Similar to the Sunway TaihuLight system, the new Sunway supercomputer adopts a new generation of domestic high-performance heterogeneous many-core processors (SW26010Pro) and interconnection network chips in China.

The new SW processor is designed for massive thread and data parallelism and to deliver high performance on parallel workloads. The architecture of the SW26010Pro processor is shown in Fig. 1. Each processor contains 6 core-groups (CGs), with 65 cores in each CG, and in total 390 cores. Each CG contains one management processing element (MPE), one cluster of computing processing elements (CPEs) and one memory controller (MC). The MPE within each CG is used for computations, management and communication. The CPEs are organized as an $8\times 8$ mesh (64 cores) and are designed to maximize the aggregated computing power and to minimize the complexity of the micro-architecture. The CPEs are organized with a mesh network to achieve high-bandwidth data communication (P2P and collective communications) among the CPEs in one CG, which is referred to as the remote scratchpad memory access (RMA).

Each SW26010Pro processor contains 96 GB memory, with 16 GB memory in each CG. The MPE and the CPEs within the same CG share the same memory which is controlled by the MC. Each CPE has a 32 KB L1 instruction cache, and a 256 KB scratch pad memory (SPM, also called the Local Data Memory (LDM)), which serves the same functionality as the L1 cache. The data storage space can also be configured as a local data cache (LDCache) which is automatically managed by the hardware. Data transfer between LDM and main memory can be realized by direct memory access (DMA), and data transfer between LDCache and main memory can also be realized by conventional load/storage instructions.

Refer to caption — Figure 1: The architecture of the SW26010Pro processor.

III Innovations

III-A The MPS-VQE simulator

In quantum chemistry, the FCI generates a correlated wave function by considering all the excitations above the reference Hartree-Fock state. The FCI wave function $\left|\Psi\right\rangle$ can be written as:

\left|\Psi\right\rangle=\sum_{i_{1}...i_{N}}c_{i_{1}i_{2}i_{3}\ldots i_{N}}|i_{1}i_{2}i_{3}\ldots i_{N}\rangle

(5)

where $|i_{1}i_{2}i_{3}\ldots i_{N}\rangle$ is the computation basis used, the coefficient $c_{i_{1}i_{2}i_{3}\ldots i_{N}}$ is a rank- $N$ tensor of $2^{N}$ complex numbers. Finding the tensor $c$ corresponding to the ground state is in generally exponentially hard. Various simplified ansatz has been used instead of the exact form in Eq.(5) to reduce the exponential scaling, among which the tensor network states (TNSs) constitute an outstanding class [20, 38]. TNS represents the underlying high-rank tensor as the product of a tensor network made of low-rank tensors, and the number of parameters in a TNS typically scales linearly with the system size (number of qubits). MPS is a special class of one-dimensional TNS which not only allows efficient representation, but also enables efficiently manipulations, such as the local gate operations and evaluating local observables, on it. In the context of random circuit sampling, it has been shown that the two-dimensional variant of TNS, namely the Projected Entangled Pair States (PEPS), would have better performance than MPS since current random circuits are mostly two-dimensional [12]. Here we will use the MPS ansatz for the quantum state due to the complex geometry of our parametric circuit ansatz, as shown in Fig. 2(a), which can be written as

c_{i_{1}i_{2}\ldots i_{N}}=\sum_{a_{1},\dots,a_{N-1}}B^{i_{1}}_{a_{1}}B^{i_{2}}_{a_{1},a_{2}}\dots B^{i_{N}}_{a_{N-1}},

(6)

with $i_{n}$ the ‘physical’ index and $a_{n}$ the ‘auxiliary’ index. The largest size of the auxiliary indices is referred to as the bond dimension of MPS, denoted as $D=\dim_{1\leq n<N}(a_{n})$ . Eq.(6) can represent arbitrary quantum states if $D$ is allowed to grow exponentially. A quantum state is said to be efficiently representable as an MPS if $D$ remains almost constant when $N$ grows. An MPS based algorithm in general has a complexity of $O(D^{3})$ . In practice, it is often advantageous to keep MPS in a canonical form, for example, the right-canonical form by ensuring that $\sum_{i_{n},a_{n}}(B^{i_{n}}_{a_{n-1},a_{n}})^{\ast}B^{i_{n}}_{a_{n-1}^{\prime},a_{n}}=\delta_{a_{n-1},a_{n-1}^{\prime}}$ (for one thing, the algorithm would be numerically more stable since the tensors on each site are well normalized).

VQE requires to apply single-qubit and two-qubit gate operations sequentially onto the quantum state and then compute the expectation value of a (local) observable on the quantum state, which is similar to the time-evolving block decimation algorithm used for time evolution of one-dimensional Hamiltonians [39]. Assuming that the initial MPS is prepared in the right-canonical form, a nearest-neighbour two-qubit gate operation, denoted as $O^{i_{n},i_{n+1}}_{i_{n}^{\prime},i_{n+1}^{\prime}}$ , can be applied onto a nearest-pair of tensors on sites $n$ and $n+1$ (also referred to as the $n$ -th bond) as follows [40] (The implementation of single-qubit gate operation is straightforward, but it is not necessary since single-qubits gate can be absorbed into two-qubits gate using gate fusion). As shown in Fig. 2(b), first, one obtains the rank- $4$ tensor $M$ by the contraction of the two-qubit gate with the tensors $B^{i_{n}}_{a_{n-1},a_{n}}$ and $B^{i_{n+1}}_{a_{n},a_{n+1}}$

M_{a_{n-1},i_{n},i_{n+1},a_{n+1}}=\sum_{a_{n}}O^{i_{n},i_{n+1}}_{i_{n}^{\prime},_{n+1}^{\prime}}B^{i_{n}^{\prime}}_{a_{n-1},a_{n}}B^{i_{n+1}^{\prime}}_{a_{n},a_{n+1}}.

(7)

Now to restore the MPS form, one could perform a singular value decomposition (SVD) on $M$ to obtain two rank- $3$ site tensors, however one of them will no longer be right-canonical. To preserve the right canonical form, one could perform SVD on another rank- $4$ tensor

M^{\prime}_{a_{n-1},i_{n},i_{n+1},a_{n+1}}=\sum_{a_{n-1}}\lambda_{a_{n-1}}M_{a_{n-1},i_{n},i_{n+1},a_{n+1}}

(8)

instead, where $\lambda_{a_{n-1}}$ are the Schmidt numbers on the $n-1$ -th bond saved from the previous steps, that is,

{\rm SVD}(M^{\prime}_{a_{n-1},i_{n},i_{n+1},a_{n+1}})=\sum_{a_{n}}U^{i_{n}}_{a_{n-1},a_{n}}\lambda_{a_{n}}^{\prime}V^{i_{n+1}}_{a_{n},a_{n+1}},

(9)

where $\lambda_{a_{n}}^{\prime}$ are the new Schmidt numbers on the $n$ -th bond which will be used to replace the previous ones. In case the number of nonzero Schmidt numbers is larger than $D$ , we will also truncate them back to $D$ by only reserving the $D$ largest Schmidt numbers. This procedure will produce truncation errors if the maximum $D$ we choose is too small or the circuit is too deep (Nevertheless this error can be monitored as an indication of the imprecision of the computing). $V^{i_{n+1}}_{a_{n},a_{n+1}}$ is already right-canonical and will be used to replace $B^{i_{n+1}}_{a_{n},a_{n+1}}$ , the new site tensor on site $n$ can be updated as

B^{i_{n}}_{a_{n-1},a_{n}}=\sum_{i_{n+1},a_{n+1}}M_{a_{n-1},i_{n},i_{n+1},a_{n+1}}(V^{i_{n+1}}_{a_{n},a_{n+1}})^{\ast}

(10)

which can be verified to also be right-canonical. With a right-canonical form of MPS, the expectation value of a local observable $\hat{O}_{n}$ (with its matrix representation $O_{i_{n}^{\prime}}^{i_{n}}$ in the computational basis) on the MPS can be evaluated as

\langle\Psi|\hat{O}|\Psi\rangle=\sum_{a_{n-1},a_{n},i_{n},i_{n}^{\prime}}\lambda_{a_{n-1}}^{2}O_{i_{n}^{\prime}}^{i_{n}}B^{i_{n}^{\prime}}_{a_{n-1},a_{n}}(B^{i_{n}}_{a_{n-1},a_{n}})^{\ast}.

(11)

The expectation value of two-qubit or higher observables can be efficiently computed accordingly. In this work, the expectation values of Pauli strings are calculated using Hadamard test, which is equivalent to evaluating the expectation value of the local projector $|0\rangle\langle 0|$ or $|1\rangle\langle 1|$ on the ancillary qubit.

To this end, we note that since our VQE is based on the MPS representation, the expressiveness of the VQE ansatz used in our classical simulator will not exceed the expressiveness of the underlying MPS. As a result one may well substitute the VQE simulator by another MPS based optimization algorithm such as DMRG and a similar or even higher precision would be expected if the same $D$ is used. However, the MPS based VQE simulator, as will be shown later, allows straightforward parallelization to millions of cores with almost no data communications, in comparison in current parallel algorithms for DMRG data communication can not be avoided and the parallelizability is limited [41].

The MPS based simulator takes advantage of the low entanglement nature of the underlying quantum state to dramatically reduce the required computational resources, which is often true for NISQ devices for which decoherence limits the amount of entanglement that can be generated (except for RQC which quickly entangles the quantum states but usually has no clear physical meaning). An illustrative comparison between the performances of state vector (SV), density matrix (DM) and MPS simulators is shown in Fig.2(c), with a quantum circuit that generates a quantum state which can be written as an MPS with $D=8$ . This circuit implements unitaries each of which entangles 4 consecutive qubits, corresponding to the correlations between neighbouring orbitals in a chemical system. It is also used as the MPS-inspired ansatz[42, 43] for slightly correlated molecules. For the SV or DM method, the exponentially growing memory storage and computation complexity strictly limited the simulation scale, while the MPS method approximates the low-entangled quantum states with smaller tensors, that only spans a very tiny fraction of the overall Hilbert space, so the required computational resources is dramatically reduced. However, faithfully simulating of the whole VQE calculation is still numerically demanding since a huge number of quantum circuits ( $\mathcal{O}(N_{q}^{4})$ for $N_{q}$ ) has to be evaluated. Nevertheless, this is an ideal situation for massive parallelization, which will be discussed in Sec. III-C.

III-B The combination of DMET method with MPS-VQE Simulator

To facilitate the quantum simulation of the large systems, we use density matrix embedding theory (DMET) to reduce large systems into smaller fragments and use MPS-VQE as a fragment solver, as shown in Fig. 3. The total energy of the system is the sum of the energies of the fragments, each of which is embedded in an effective environment (quantum bath) compressed from the rest of the whole system. DMET provides an economic and highly accurate way to describe the interaction between the fragment and the effective environment. In this way, the electronic structure of the system can be computed by iteratively solving a bunch of smaller problems, which can be done using the state vector or MPS simulators (or ultimately using a quantum computer). In practice, the standard procedures of DMET-MPS-VQE method are listed as follows:

1.

Perform low-level calculation for the whole system.
2.

Divide the whole system into fragments with reasonable number of orbitals for each fragment.
3.

For every fragment, construct bath orbitals and the reduced Hamiltonian by Schmidt decomposition and projection.
4.

Get the fragment energy and the wave function with VQE, and then compute the number of electrons with 1-RDM (reduced density matrix).
5.

Check if the sum of the number of electrons in the fragments agrees with the number of electrons for the entire system. If not, modify the global chemical potential and go back the step 3.

In DMET, a high-level calculation for each fragment is carried out individually until the self-consistency criterion has been met: the sum of the number of electrons of all of the fragments (the number of fragments are around 10 to 200) agrees with the number of electrons for the entire system.

III-C The Customized Parallelization Strategy

Three levels of parallelization are used for the DMET-MPS-VQE simulator. As shown in Fig. 4, the calculation of different fragments can be performed in an embarrassingly parallel manner, that is, the first level of parallelization. As no communication between different fragment calculations is required, we split the whole CPU pool into different sub-groups and sub-communicators. Within each sub-group, the total energy is calculated with the VQE method. Within each VQE calculation, another two levels of parallelization have been used, that is, the second level over circuits and the third level over linear algebra operations. In the VQE calculation, the electronic molecular Hamiltonian $H$ can be written as a sum of the Pauli strings: $H=\sum_{i}{a_{i}P_{i}}$ , where each Pauli string $P_{i}\in\{I,\sigma_{x},\sigma_{y},\sigma_{z}\}^{\otimes n}$ is a tensor product of $n$ -qubit Pauli operators. We can estimate each $\langle\psi|P_{i}|\psi\rangle$ independently and then sum over all those results with corresponding coefficients ( $a_{i}$ ) to get $\langle\psi|H|\psi\rangle$ . This enables high parallel scalability with adapted dynamical load balancing algorithm. The parallel simulation algorithm based on distributed memory over the circuits, just “mimic” the actual quantum computers (for quantum computers each term $\langle\psi|P_{i}|\psi\rangle$ also has to be evaluated individually), so our method can offer a good reference for VQE running on the quantum computers. For tensor operations including tensor contraction and singular value decomposition (SVD), a low-level multi-threaded parallelism on the CPE is used to further boost the performance.

III-D The memory-efficient scheme to store the circuits

The second level of parallelism is based on the fact that the electronic structure Hamiltonian can be expressed as the summation of a polynomial number of mutually uncorrelated Pauli strings. Expectation values of each Pauli string can thus be calculated independently. Fig.5 gives an example for the hydrogen molecule. The $4$ -qubit Hamiltonian of hydrogen molecule contains $15$ Pauli strings, therefore $15$ circuits are required for each of them. For each circuit, the first $2$ $X$ gates prepare a Hartree-Fock reference state. The next $120$ gates (with variational parameters in the $RZ$ gates) on q₀, q₁, q₂ and q₃ constitute the ansatz circuit, which corresponds to the first-order Trotterized UCC ansatz and is identical for each Pauli string. $q_{4}$ is the ancillary qubit for Hadamard test used to compute the expectation value, and the gates on $q_{4}$ in different circuits corresponding to the measurement part are constructed according to each Pauli string. As shown in Fig.4, during the VQE optimization, circuits corresponding to mutually exclusive subsets of Pauli strings in Eq.2 are mapped to different processes. The energy $E(\theta)={\langle\psi(\theta)|\hat{H}|\psi(\theta)\rangle}$ under current parameters $\{\theta\}$ is obtained by reducing the results on different processes and then provided to the optimizer.

It should be noted that the number of Pauli strings in the electronic Hamiltonian scales as $\mathcal{O}(N_{q}^{4})$ , where $N_{q}$ is the number of qubits. Even for a medium-size molecule such as benzene (72 qubits, 330816 Pauli strings), storing all the $330816$ circuits on each process will bring a lot of pressure on the memory space of CGs and synchronizing the circuits after each optimization step will also result in remarkable overhead on communication. Noticing that the ansatz part of different circuits are identical and the measurement part can be constructed on-the-fly in the first energy evaluation and then kept constant during the VQE optimization, a straightforward optimization for memory usage can be achieved by keeping only one replica of the ansatz circuit on each process. The speedup of such a memory-optimized simulation is given in Sec. IV-B.

III-E Using Julia Programming Language on Sunway

Julia is designed to take advantage of modern techniques for executing dynamic languages efficiently. It has the performance of a statically compiled language while providing interactive dynamic behavior and productivity[44]. The key ingredients of performance are:

•

Rich type information, provided naturally by multiple dispatch;
•

Aggressive code specialization against run-time types;
•

JIT compilation using the LLVM compiler framework[45];

We build the core components of Q²Chemistry based on the Julia language to take advantage of the above advantages. Julia codes can be highly extensible thanks to its type system and multiple dispatch mechanism, and the meta-programming ability makes developing customized syntax and device-specific programs simple[46]. Generic programming in Julia helps us to reach optimized performance while still keeping the code general and concise. Julia integrates well with other programming languages. It is generally straightforward to use external libraries written in other languages in Q²Chemistry. We have also a python version of Q²Chemistry, which will be reported elsewhere. In this work, the electronic structure code PySCF[47] and openfermion[48] are used to calculate chemical quantities such as the orbital occupations and qubit Hamiltonian for quantum state evolution and measurements.

We unleash the Julia compiler on the new Sunway supercomputer and deploy Q²Chemistry on it. Fig. 6 shows the schematic overview of our Sunway Julia compiler. The MPE compiler is the main part which serves as a JIT compiler for MPEs. The SACA.jl library focuses on providing definitions for CPE operations that are required for writing effective applications based on the SACA programming model[49]. The main interface for calling functions on CPEs resembles a call to an ordinary Julia function: @saca (config...) function(args...), where the config tuple indicates the spawn configuration similar to the arguments in SACA APIs. In addition, Julia features powerful meta-programming and reflection capabilities. For example, the high-level Julia IR is accessible with the code_lowered and code_typede reflections, and can be modified with generated functions. The CPE Julia Compiler implements multiple types of extension based on SW26010Pro architecture features. The SIMD extensions support the definition of non-uniform width short vector data types and the built-in operations in Julia language layer. The multi-level storage extensions make it easy for CPE Julia code to migrate date between multi-level storage, such as LDM, TLS, CROSS, etc. The async-operation extensions provide flexible interfaces for DMA, RMA and other asynchronous components. To facilitate interactions with LLVM, we adopt LLVM.jl[50] package which provides a high-level wrapper to the LLVM C API, using Julia’s powerful Foreign Function Interface (FFI) to interact efficiently with the underlying libraries. To meet the application requirements of Q²Chemistry, we port a number of Julia standard and third-party libraries, such as MPI.jl[51], Distributed.jl, PyCall.jl, etc. MPI.jl and Distributed.jl are two commonly used distribution modes of Julia. Distributed.jl is the method to spawn new processes in separate memory spaces. Like how multi-threading is set up, its master-slave distributed architecture makes it nice and convenient for smallish parallelization. MPI.jl is a basic Julia wrapper for the Message Passing Interface (MPI)[52]. The MPI libraries are versatile and highly optimized for SW26010Pro, so that the MPI.jl could be easy to use it from Julia by preparing functions with almost the same name as C functions and get similar performance. Given the performance and scalability of Q²Chemistry on the new Sunway supercomputer, we finally chose MPI.jl as the main way for distributed computing. We use swBLAS as the backend for matrix-vector and matrix-matrix multiplications. PyCall.jl is used to directly call python packages from Julia.

When compiling Q²Chemistry, each computing node has a separate JIT compilation engine, and there is no data interaction during JIT compilation, so it can be easily scaled up to more than 20M cores. For more than 20M cores, the simultaneous loading of the Julia dynamic libraries on computing nodes can become a performance bottleneck. We avoid this problem by using the ROFS (Read Only File System). Compared with the GFS (Global File System), The ROFS has smaller capacity but faster access speed, which significantly reduces the loading time of Julia dynamic libraries.

When running Q²Chemistry, the hotspots are mainly tensor contraction and the SVD functions in the Julia’s basic library LinearAlgebra. In the tensor contraction, the first step is the index permutation of the tensors, followed by the matrix multiplication (ZGEMM) to accomplish the calculation. Here we use the fused permutation and multiplication technique. In the SVD calculation, the matrix is transformed into a bidiagonal matrix using an orthogonal transformation, then the bidiagonal matrix is diagonalized using BDC (bidiagonal divide-and-conquer) or QR decomposition. Thus the core calculations of SVD are mainly the matrix vector multiplication, matrix multiplication, vector multiplication and matrix transpose, which is done with the swBLAS library.

We also note that running Q²Chemistry on non-Sunway architectures needs to address the following issues:

1.

The compiling and running environment for Julia program on computing nodes: The Julia compilation framework needs to be supported on non-Sunway architectures to meet the compilation and running requirements of the Julia language. Currently the latest Julia version only supports i686 / x86_64 / arm / aarch64 / powerpc64le. Other architectures require porting of the Julia compilation framework;
2.

MPI.jl distributed environment: The underlying MPI protocol and the MPI.jl library need to be supported on non-Sunway architectures to meet Q²Chemistry’s massively scalable communication requirements. All of the mainstream supercomputer systems support the MPI protocol, and some architectures need to transplant the MPI.jl library;
3.

Architecture specific optimization: Optimized third-party computing libraries, including BLAS and LAPACK, need to be supported on non-Sunway architectures to meet Q²Chemistry’s computing performance requirements. The current mainstream non-Sunway Architectures basically meet this requirement.

Therefore, in general Q²Chemistry has good portability and can be rapidly and efficiently transplanted to the current mainstream non-Sunway architectures.

IV EVALUATION

IV-A Simulation Validation

We firstly validate our method by comparing with the FCI results as shown in Fig. 7(a). We start with the calculation of the hydrogen ring with 10 atoms to show the effectiveness of the DMET-MPS-VQE method. Restricted Hartree-Fock method is used as the low-level calculation of the whole system, with the STO-3G basis. In the DMET-MPS-VQE calculation, the hydrogen atoms are divided into fragments with two atoms. Here we can see the potential energy curve with the DMET-MPS-VQE method, which shows very good agreement with those from FCI, and the relative errors are within 0.5%. We also compare the simulation results of MPS-VQE with FCI, and the results of the ground state energy are very accurate for the chemical systems of H₂, LiH, and H₂O, with the relative errors at the level of 0.01%.

We further examine the carbon ring molecule (C18) with a large basis set (cc-pVDZ) as shown in Fig. 7(b). The equilibrium geometry of this molecule is the bond-length alternated structure as confirmed in experiment[53]. Here we scan the energy of this molecule with the change of the bond length alternation (BLA), defined as the lengths of the two sets of C-C bonds in the molecule. We show results with cc-pVDZ basis set (for carbon, the 1s orbital is frozen, only the 2s and 2p basis orbitals of carbon atom are considered in the VQE calculation, while the high angular momentum orbitals are treated at the mean-field level). We can see that, similar to the CCSD results, our DMET-VQE results predict that the energy of the bond-length alternating structure is lower, which in good agreement with experimental observations.

IV-B Speedup

In this section, we compare two other libraries with our proposed algorithm and analyze the test results. The two libraries are qiskit (using the state vector and MPS algorithm) and quimb (using the MPS algorithm). We evaluate the four methods by comparing the simulating time of one circuit for three molecule systems using one process, as shown in Fig. 8. The computing time of quimb (MPS) is on average $3$ times slower than qiskit (state-vector). Our method (Q²Chemistry) is on average around $7$ times faster than quimb and around $2$ times faster than qiskit (both MPS and state vector).

Figure 9 shows the result for the memory-efficient optimisation in the circuit simulation. Here we choose three systems as examples. The (H₂)3, LiH and H₂O molecules have $919$ , $630$ and $1085$ circuits respectively. The number of circuits per process is $18$ , $19$ , $17$ . We can see the speedup is around 15x and the memory reduction is around 20x. This indicates that the memory-optimization scheme is effective and efficient.

Figure 10 shows the simulation time with the MPS-VQE simulator for the VQE circuits of the hydrogen chain. The number of the electrons/atoms changes from 6 to 100, while the corresponding number of the qubits changes from 12 to 200. It is clearly shown that the computing time scales linearly with the number of qubits, which demonstrates the power of Q²Chemistry. There are two major bottlenecks for further scaling up to larger systems: 1) the memory requirement, which grows linearly with the number of qubits ; 2) the quantum circuit corresponding to more complex systems could be significantly deeper, for which one needs a larger bond dimension.

In the MPS-VQE simulation, the percentages of time spent in tensor contraction and SVD are around 15% and 82% when the number of qubits changes from 33 to 129. Fig. 11 give the evaluation of the tensor contraction and SVD performance improvements when using different bond dimensions. We observe that compared to the original version that only uses the MPE, the optimized version use both the MPE and the 64 CPEs produces an overall speedup in the tensor contraction calculation ranging from 2.3x to 46.5x and the SVD calculation ranging from 1.04× to 15.5×, and the speedup is more significant when increasing the bond dimension from 256 to 1024.

We also evaluate the performance of Q²Chemistry on the x86 (AMD Ryzen EPYC 7452) CPU, using a quantum circuit similar to that used in Fig. 2(c), with unitaries replaced by 2-qubit gates acting on neighbouring qubits. The initial quantum state is generated randomly according to a bond dimension threshold (which is 512 in this test). We observe that the SW version is approximately $1.1$ times faster than the x86 version based on OpenBLAS and 16.6 times faster than the x86 version based on LAPACK-3.2.

IV-C Scalability Results

Figure 12 shows the strong scaling performance of the quantum computational chemistry simulation for the hydrogen chain with 1280 atoms (1280 electrons). In all calculations, each MPI sub-group is mapped to 2048 processes. The code shows good strong scaling performance: as the number of Sunway processes increases from $10,240$ to $327,680$ for the calculation (the number of cores increases from $665,600$ to $21,299,200$ ), the parallelization efficiency exceeds 92% and 30 $\times$ speedup with respect to the $10,240$ processes is achieved.

Fig. 13 shows the weak scaling of the quantum computational chemistry simulation for the hydrogen chain. To obtain the weak scaling data, we time the MPS-VQE calculations with an increasing number of fragments ( $2048$ processes per fragment). The number of atoms (electrons) are (40, 80, 320, 1280) respectively in the weak scaling calculation. The calculation achieves excellent weak scaling, exhibiting high parallelization efficiency as the number of cores grows. Taking $10,240$ processes as a reference, the code achieves a parallelization efficiency of around 92% with $327,680$ processes ( $21,299,200$ cores).

The reason for the high parallelization efficiency is due to the low data communication overhead in our algorithm. There is almost no data communication on the DMET level parallelization except the final summation over all fragments (a scalar from each process) to get the total DMET energy. On the MPS-VQE level parallelization over 2048 processes, we first need to broadcast the parameters from the root process to all processes using MPI_Bcast, and after the computation, we need to collect all the results to the root process using MPI_Reduce, which are shown in Fig. 4. We use the Julia performance profiler to measure the communication costs, which shows that the amount of data communication per process is around $15.6$ KB and the time spent on data communication is less than $0.001$ second per VQE iteration.

V Implications

Here we apply our Q²Chemistry method in real chemical systems for the quantification of protein-ligand interactions. Compared to the empirical force fields method, the quantum mechanical calculations can automatically include effects of polarization, charge transfer, charge penetration, and the coupling of the various terms. Therefore, the current quantum mechanical approaches could offer more accurate and detailed information on the nature of protein–ligand interaction, which is valuable in high-accuracy binding affinity prediction and so does in drug design. The SARS-CoV-2 main protease (M^pro) is an enzyme that cleaves the viral polyproteins into individual proteins required for viral replication, so it is the important target to develop drugs for SARS-CoV-2. The structure of the M^pro complex originates from Ref.[54](PDB 6lu7), as shown in Fig. 14. Here by using the ”frozen protein” approximation[55] to remove the protein-protein interactions, we use $E_{\rm b}=E_{\rm ligand\_in\_protein}-E_{\rm ligand}$ to get the binding energy $E_{\rm b}$ , which is then used as the metric for ranking. We use the geometries of the 13 neutral ligands from Ref.[56], which has been optimized with density functional theory method. Then we use Q²Chemistry method to get the $E_{\rm b}$ , the geometries of $E_{\rm ligand}$ are optimized at Hartree-Fock level to account for some amount of the geometric distortion needed for the ligand to occupy the active site. We examine the binding energy of 13 ligands (The largest molecule is Atazanavir contains 103 atoms and 378 electrons), we find Candesartan cilexetil binds best ( $E_{\rm b}$ is -6.8 eV), which is in good agreement with the experimental observation [57]. We also examine the Nirmatrelvir within Paxlovid (Pfizer’s novel anti-viral Covid-19 drug), which is an orally bioavailable protease inhibitor that is active against M^pro, the binding energy $E_{\rm b}$ is -7.3 eV, even lower than that of Candesartan cilexetil, which indicates that the performance of Nirmatrelvir is better than Candesartan cilexetil.

VI CONCLUSION

The innovations realized in this study demonstrate that Q²Chemistry is suitable for large-scale simulation of quantum computational chemistry, based on a combination of the Density Matrix Embedding Theory and the Matrix Product States to reduce the exponentially memory scaling against the system size; a customized three-level parallelization scheme has been implemented according to the nature of the physical problem and the many-core architecture; Julia is used as the primary language which both makes programming easier and enables cutting edge performance close to native C or Fortran; Real chemical systems have been studied to demonstrate the power of Q²Chemistry in computational quantification of protein-ligand interactions. To the best of our knowledge, this is the first reported quantum computational chemistry simulation calculation for real chemical system with as many as $100$ atoms and $1000$ qubits using DMET-MPS-VQE (and $200$ qubits using MPS-VQE), and scales to around $20$ million cores. This paves the way for benchmarking with near term VQE experiments on quantum computers with around $100$ qubits.

In future work we will also attempt to port Q²Chemistry to other computing architectures, for which the following issues need to be addressed: First, the Julia compilation framework needs to be supported to meet the compilation and running requirements of the Julia language; Second, the underlying MPI protocol and the MPI.jl library need to be supported to meet Q²Chemistry’s massively scalable communication requirements; Finally, optimized third-party computing libraries, including BLAS and LAPACK, need to be supported to meet Q²Chemistry’s computing performance requirements.

References

[1] K. D. Vogiatzis, D. Ma, J. Olsen, L. Gagliardi, and W. A. de Jong, “Pushing configuration-interaction to the limit: Towards massively parallel mcscf calculations,” vol. 147, no. 18, p. 184111, 2017. [Online]. Available: https://doi.org/10.1063/1.4989858
[2] Y. Cao, J. Romero, J. P. Olson, M. Degroote, P. D. Johnson, M. Kieferová, I. D. Kivlichan, T. Menke, B. Peropadre, N. P. Sawaya et al., “Quantum chemistry in the age of quantum computing,” Chemical reviews, vol. 119, no. 19, pp. 10 856–10 915, 2019.
[3] S. McArdle, S. Endo, A. Aspuru-Guzik, S. C. Benjamin, and X. Yuan, “Quantum computational chemistry,” Reviews of Modern Physics, vol. 92, no. 1, p. 015003, 2020.
[4] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio et al., “Variational quantum algorithms,” Nature Reviews Physics, vol. 3, no. 9, pp. 625–644, 2021.
[5] K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke et al., “Noisy intermediate-scale quantum algorithms,” Reviews of Modern Physics, vol. 94, no. 1, p. 015004, 2022.
[6] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.
[7] Y. Wu, W.-S. Bao, S. Cao, F. Chen, M.-C. Chen, X. Chen, T.-H. Chung, H. Deng, Y. Du, D. Fan et al., “Strong quantum computational advantage using a superconducting quantum processor,” Physical review letters, vol. 127, no. 18, p. 180501, 2021.
[8] Q. Zhu, S. Cao, F. Chen, M.-C. Chen, X. Chen, T.-H. Chung, H. Deng, Y. Du, D. Fan, M. Gong et al., “Quantum computational advantage via 60-qubit 24-cycle random circuit sampling,” Science Bulletin, vol. 67, no. 3, pp. 240–245, 2022.
[9] V. E. Elfving, B. W. Broer, M. Webber, J. Gavartin, M. D. Halls, K. P. Lorton, and A. Bochevarov, “How will quantum computers provide an industrially relevant computational advantage in quantum chemistry?” 2020.
[10] B. Bauer, S. Bravyi, M. Motta, and G. Kin-Lic Chan, “Quantum algorithms for quantum chemistry and quantum materials science,” vol. 120, no. 22, pp. 12 685–12 717, 2020.
[11] G. A. Quantum, Collaborators*†, F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, S. Boixo, M. Broughton, B. B. Buckley et al., “Hartree-fock on a superconducting qubit quantum computer,” Science, vol. 369, no. 6507, pp. 1084–1089, 2020.
[12] C. Guo, Y. Liu, M. Xiong, S. Xue, X. Fu, A. Huang, X. Qiang, P. Xu, J. Liu, S. Zheng et al., “General-purpose quantum circuit simulator with projected entangled-pair states and the quantum supremacy frontier,” Physical review letters, vol. 123, no. 19, p. 190501, 2019.
[13] B. Villalonga, S. Boixo, B. Nelson, C. Henze, E. Rieffel, R. Biswas, and S. Mandra, “A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware,” npj Quantum Information, vol. 5, no. 1, pp. 1–16, 2019.
[14] C. Guo, Y. Zhao, and H.-L. Huang, “Verifying random quantum circuits with arbitrary geometry using tensor network states algorithm,” Physical Review Letters, vol. 126, no. 7, p. 070502, 2021.
[15] F. Pan and P. Zhang, “Simulation of quantum circuits using the big-batch tensor network method,” Physical Review Letters, vol. 128, no. 3, p. 030501, 2022.
[16] F. Pan, K. Chen, and P. Zhang, “Solving the sampling problem of the sycamore quantum supremacy circuits,” arXiv preprint arXiv:2111.03011, 2021.
[17] C. Huang, F. Zhang, M. Newman, X. Ni, D. Ding, J. Cai, X. Gao, T. Wang, F. Wu, G. Zhang et al., “Efficient parallelization of tensor network contraction for simulating quantum computation,” Nature Computational Science, vol. 1, no. 9, pp. 578–587, 2021.
[18] Y. Liu, X. Liu, F. Li, H. Fu, Y. Yang, J. Song, P. Zhao, Z. Wang, D. Peng, H. Chen et al., “Closing the” quantum supremacy” gap: achieving real-time simulation of a random quantum circuit using a new sunway supercomputer,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–12.
[19] J. Romero, R. Babbush, J. R. McClean, C. Hempel, P. J. Love, and A. Aspuru-Guzik, “Strategies for quantum computing molecular energies using the unitary coupled cluster ansatz,” Quantum Science and Technology, vol. 4, no. 1, p. 014008, oct 2018. [Online]. Available: https://doi.org/10.1088/2058-9565/aad3e4
[20] U. Schollwöck, “The density-matrix renormalization group in the age of matrix product states,” Annals of Physics, vol. 326, no. 1, p. 96–192, Jan 2011. [Online]. Available: http://dx.doi.org/10.1016/j.aop.2010.09.012
[21] W. Li, Z. Huang, C. Cao, Y. Huang, Z. Shuai, X. Sun, J. Sun, X. Yuan, and D. Lv, “Toward practical quantum embedding simulation of realistic chemical systems on near-term quantum computers,” arXiv preprint arXiv:2109.08062, 2021.
[22] D. S. Steiger, T. Häner, and M. Troyer, “ProjectQ: an open source software framework for quantum computing,” Quantum, vol. 2, p. 49, Jan. 2018. [Online]. Available: https://doi.org/10.22331/q-2018-01-31-49
[23] G. Aleksandrowicz, T. Alexander, P. Barkoutsos, L. Bello, Y. Ben-Haim, D. Bucher, F. J. Cabrera-Hernández, J. Carballo-Franquis, A. Chen, C.-F. Chen, J. M. Chow, A. D. Córcoles-Gonzales, A. J. Cross, A. Cross, J. Cruz-Benito, C. Culver, S. D. L. P. González, E. D. L. Torre, D. Ding, E. Dumitrescu, I. Duran, P. Eendebak, M. Everitt, I. F. Sertage, A. Frisch, A. Fuhrer, J. Gambetta, B. G. Gago, J. Gomez-Mosquera, D. Greenberg, I. Hamamura, V. Havlicek, J. Hellmers, Łukasz Herok, H. Horii, S. Hu, T. Imamichi, T. Itoko, A. Javadi-Abhari, N. Kanazawa, A. Karazeev, K. Krsulich, P. Liu, Y. Luh, Y. Maeng, M. Marques, F. J. Martín-Fernández, D. T. McClure, D. McKay, S. Meesala, A. Mezzacapo, N. Moll, D. M. Rodríguez, G. Nannicini, P. Nation, P. Ollitrault, L. J. O’Riordan, H. Paik, J. Pérez, A. Phan, M. Pistoia, V. Prutyanov, M. Reuter, J. Rice, A. R. Davila, R. H. P. Rudy, M. Ryu, N. Sathaye, C. Schnabel, E. Schoute, K. Setia, Y. Shi, A. Silva, Y. Siraichi, S. Sivarajah, J. A. Smolin, M. Soeken, H. Takahashi, I. Tavernelli, C. Taylor, P. Taylour, K. Trabing, M. Treinish, W. Turner, D. Vogt-Lee, C. Vuillot, J. A. Wildstrom, J. Wilson, E. Winston, C. Wood, S. Wood, S. Wörner, I. Y. Akhalwaya, and C. Zoufal, “Qiskit: An Open-source Framework for Quantum Computing,” Jan. 2019. [Online]. Available: https://doi.org/10.5281/zenodo.2562111
[24] T. Jones, A. Brown, I. Bush, and S. C. Benjamin, “Quest and high performance simulation of quantum computers,” Sci. Rep., vol. 9, p. 10736, 2019.
[25] Y. Suzuki, Y. Kawase, Y. Masumura, Y. Hiraga, M. Nakadai, J. Chen, K. M. Nakanishi, K. Mitarai, R. Imai, S. Tamiya, T. Yamamoto, T. Yan, T. Kawakubo, Y. O. Nakagawa, Y. Ibe, Y. Zhang, H. Yamashita, H. Yoshimura, A. Hayashi, and K. Fujii, “Qulacs: a fast and versatile quantum circuit simulator for research purpose,” Quantum, vol. 5, p. 559, Oct. 2021. [Online]. Available: https://doi.org/10.22331/q-2021-10-06-559
[26] T. Häner and D. S. Steiger, “0.5 petabyte simulation of a 45-qubit quantum circuit,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–10.
[27] S. R. White and R. L. Martin, “Ab initio quantum chemistry using the density matrix renormalization group,” The Journal of Chemical Physics, vol. 110, no. 9, pp. 4127–4130, 1999. [Online]. Available: https://doi.org/10.1063/1.478295
[28] G. K.-L. Chan and S. Sharma, “The density matrix renormalization group in quantum chemistry,” Annual review of physical chemistry, vol. 62, pp. 465–481, 2011.
[29] A. Dang, C. D. Hill, and L. C. Hollenberg, “Optimising matrix product state simulations of shor’s algorithm,” Quantum, vol. 3, p. 116, 2019.
[30] D. Poulin, M. B. Hastings, D. Wecker, N. Wiebe, A. C. Doberty, and M. Troyer, “The trotter step size required for accurate quantum simulation of quantum chemistry,” Quantum Info. Comput., vol. 15, p. 361, 2015.
[31] I. G. Ryabinkin, A. F. Izmaylov, and S. N. Genin, “A posteriori corrections to the iterative qubit coupled cluster method to minimize the use of quantum resources in large-scale calculations,” Quantum Sci. Technol., vol. 6, no. 2, p. 024012, mar 2021. [Online]. Available: https://doi.org/10.1088/2058-9565/abda8e
[32] G. Knizia and G. K. L. Chan, “Density matrix embedding: A simple alternative to dynamical mean-field theory,” Physical Review Letters, vol. 109, no. 18, pp. 1–5, 2012.
[33] T. Yamazaki, S. Matsuura, A. Narimani, A. Saidmuradov, and A. Zaribafiyan, “Towards the practical application of near-term quantum computers in quantum chemistry simulations: A problem decomposition approach,” 2018.
[34] W. Li, Z. Huang, C. Cao, Y. Huang, Z. Shuai, X. Sun, J. Sun, X. Yuan, and D. Lv, “Toward practical quantum embedding simulation of realistic chemical systems on near-term quantum computers,” 2021.
[35] X. Yuan, J. Sun, J. Liu, Q. Zhao, and Y. Zhou, “Quantum simulation with hybrid tensor networks,” Phys. Rev. Lett., vol. 127, p. 040501, Jul 2021. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.127.040501
[36] S. J. Bennie, B. F. E. Curchod, F. R. Manby, and D. R. Glowacki, “Pushing the limits of eom-ccsd with projector-based embedding for excitation energies,” The Journal of Physical Chemistry Letters, vol. 8, no. 22, pp. 5559–5565, 2017, pMID: 29076727. [Online]. Available: https://doi.org/10.1021/acs.jpclett.7b02500
[37] J. Qiao and Q. Jie, “Density matrix embedding theory of excited states for spin systems,” Computer Physics Communications, vol. 261, p. 107712, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010465520303519
[38] R. Orús, “A practical introduction to tensor networks: Matrix product states and projected entangled pair states,” Annals of physics, vol. 349, pp. 117–158, 2014.
[39] G. Vidal, “Efficient classical simulation of slightly entangled quantum computations,” Physical review letters, vol. 91, no. 14, p. 147902, 2003.
[40] M. B. Hastings, “Light-cone matrix product,” Journal of mathematical physics, vol. 50, no. 9, p. 095207, 2009.
[41] E. Stoudenmire and S. R. White, “Real-space parallel density matrix renormalization group,” Physical review B, vol. 87, no. 15, p. 155137, 2013.
[42] J.-G. Liu, Y.-H. Zhang, Y. Wan, and L. Wang, “Variational quantum eigensolver with fewer qubits,” Phys. Rev. Research, vol. 1, p. 023025, Sep 2019. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevResearch.1.023025
[43] J. Y. Araz and M. Spannowsky, “Classical versus Quantum: comparing Tensor Network-based Quantum Circuits on LHC data,” 2022.
[44] J. Bezanson, S. Karpinski, V. B. Shah, and A. Edelman, “Julia: A fast dynamic language for technical computing,” arXiv preprint arXiv:1209.5145, 2012.
[45] C. Lattner and V. Adve, “Llvm: A compilation framework for lifelong program analysis & transformation,” in International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, 2004, pp. 75–86.
[46] X. Z. Luo, J. G. Liu, P. Zhang, and L. Wang, “Yao.jl: Extensible, efficient framework for quantum algorithm design,” 2019.
[47] J. R. McClean, K. J. Sung, I. D. Kivlichan, Y. Cao, C. Dai, E. S. Fried, C. Gidney, B. Gimby, P. Gokhale, T. Häner, T. Hardikar, V. Havlíček, O. Higgott, C. Huang, J. Izaac et al., “Openfermion: The electronic structure package for quantum computers,” 2019. [Online]. Available: https://arxiv.org/abs/1710.07629
[48] Q. Sun, T. C. Berkelbach, N. S. Blunt, G. H. Booth, S. Guo, Z. Li, J. Liu, J. D. McClain, E. R. Sayfutyarova, S. Sharma, S. Wouters, and G. K.-L. Chan, “Pyscf: the python-based simulations of chemistry framework,” Wiley Interdiscip. Rev. Comput. Mol. Sci., vol. 8, p. e1340, 2018. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/wcms.1340
[49] F. Li, X. Liu, Y. Liu, P. Zhao, Y. Yang, H. Shang, W. Sun, Z. Wang, E. Dong, and D. Chen, “Sw_qsim: a minimize-memory quantum simulator with high-performance on a new sunway supercomputer,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–13.
[50] T. Besard, C. Foket, and B. De Sutter, “Effective extensible programming: Unleashing Julia on GPUs,” IEEE Transactions on Parallel and Distributed Systems, 2018.
[51] S. Byrne, L. C. Wilcox, and V. Churavy, “Mpi. jl: Julia bindings for the message passing interface,” in Proceedings of the JuliaCon Conferences, vol. 1, no. 1, 2021, p. 68.
[52] W. Gropp, W. D. Gropp, E. Lusk, A. Skjellum, and A. D. F. E. E. Lusk, Using MPI: portable parallel programming with the message-passing interface. MIT press, 1999, vol. 1.
[53] K. Kaiser, L. M. Scriven, F. Schulz, P. Gawel, L. Gross, and H. L. Anderson, “An sp-hybridized molecular carbon allotrope, cyclo[18]carbon,” Science, vol. 365, no. 6459, pp. 1299–1301, 2019. [Online]. Available: https://www.science.org/doi/abs/10.1126/science.aay1914
[54] Z. R. X. Liu, B. Zhang, Z. Jin, H. Yang, “Crystal structure of COVID-19 main protease in complex with an inhibitor N3,” 2020. [Online]. Available: https://www.wwpdb.org/pdb?id=pdb_00006lu7
[55] J. J. M. Kirsopp, C. D. Paola, D. Z. Manrique, M. Krompiec, W. Guba, A. Meyder, D. Wolf, M. Strahm, and D. Mu, “Quantum Computational Quantification of Protein-Ligand Interactions,” 2021.
[56] Y. Wang, S. Murlidaran, and D. A. Pearlman, “Quantum simulations of SARS-CoV-2 main protease Mpro enable high-quality scoring of diverse ligands,” Journal of Computer-Aided Molecular Design, vol. 35, no. 9, pp. 963–971, sep 2021. [Online]. Available: https://link.springer.com/10.1007/s10822-021-00412-7
[57] Z. Li, X. Li, Y.-Y. Huang, Y. Wu, R. Liu, L. Zhou, Y. Lin, D. Wu, L. Zhang, H. Liu, X. Xu, K. Yu, Y. Zhang, J. Cui, C.-G. Zhan, X. Wang, and H.-B. Luo, “Identify potent sars-cov-2 main protease inhibitors via accelerated free energy perturbation-based virtual screening of existing drugs,” Proceedings of the National Academy of Sciences, vol. 117, no. 44, pp. 27 381–27 387, 2020. [Online]. Available: https://www.pnas.org/doi/abs/10.1073/pnas.2010470117

$\displaystyle E$	$\displaystyle=\langle{\hat{H}}\rangle=\langle\Psi(\theta)\|\hat{H}\|\Psi(\theta)\rangle$	(2)
	$\displaystyle=\langle\Psi(\theta)\|\sum_{p,q}{\tilde{h}_{q}^{p}P_{q}^{p}}\|\Psi(\theta)\rangle$
	$\displaystyle+\langle\Psi(\theta)\|\sum_{p,q,r,s}{\tilde{g}_{rs}^{pq}P_{rs}^{pq}}\|\Psi(\theta)\rangle.$