
Performance Evaluation and Acceleration of the QTensor Quantum Circuit Simulator on GPUs

Danylo Lykov^{1,5}, Angela Chen^{2,5}, Huaxuan Chen^{3,5}, Kristopher Keipert^{4}, Zheng Zhang^{2}, Tom Gibbs^{4}, and Yuri Alexeev^{1}
^{1} Computational Science Division, Argonne National Laboratory, Lemont, IL 60439
^{2} Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106
^{3} Northwestern University, Evanston, IL 60208
^{4} NVIDIA Corporation, Santa Clara, CA 95050
^{5} These authors contributed equally to the paper
Abstract

This work studies the porting and optimization of the tensor network simulator QTensor on GPUs, with the ultimate goal of simulating quantum circuits efficiently at scale on large GPU supercomputers. We implement NumPy, PyTorch, and CuPy backends and benchmark them to find the optimal allocation of tensor contractions to either a CPU or a GPU. We also present a dynamic mixed backend that achieves the best overall performance. To demonstrate the performance, we simulate QAOA circuits for computing the MaxCut energy expectation. Our method achieves a 140× speedup on a GPU over the NumPy baseline on a CPU for the benchmarked QAOA circuits.

I Introduction

Quantum information science (QIS) has great potential to speed up certain computing problems such as combinatorial optimization and quantum simulation [alexeev2019quantumworkshop]. The development of fast and resource-efficient quantum simulators to classically simulate quantum circuits is key to the advancement of the QIS field. There are many types of quantum simulators [wu2019full, wu2018amplitude, wu2018memory, quac, markov2008simulating, pednault2017breaking, boixo2017simulation, lykov2021large], and tensor network simulators have shown state-of-the-art performance. In this work, we port and optimize the tensor network quantum simulator QTensor, which is designed to simulate Quantum Approximate Optimization Algorithm (QAOA) [farhi2014quantum] and quantum supremacy circuits efficiently.

There are two main modes of quantum circuit simulation in QTensor: probability amplitude and energy expectation. The energy expectation mode is more relevant for QAOA simulation, since it provides the expectation value that measures QAOA performance. We use MaxCut QAOA, for which the cost Hamiltonian on a graph G=(V,E) has |E| energy terms. Each term generates a lightcone: a subset of the problem that yields a tensor network representing that term's contribution to the cost expectation value. A tensor network is a collection of tensors, each carrying a collection of indices, with some indices shared between tensors. To contract a tensor network, we create an ordered list of tensor buckets. Each bucket corresponds to a tensor index, called the bucket index, and collects the tensors that contain that index. Buckets are then contracted one by one. The contraction of a bucket is performed by summing over the bucket index, and the resulting tensor is appended to the appropriate bucket. The number of dimensions of the resulting tensor is called the bucket width. The memory and computational cost of a bucket contraction scale exponentially with the bucket width. For more information on tensor network contraction, see the following papers: [lykov2021importance, lykov2021large, shutski2019adaptive].
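To make the bucket procedure concrete, the sketch below contracts a single bucket with numpy.einsum. This is a minimal illustration under assumed conventions, not QTensor's actual API: the (indices, array) tensor representation and the contract_bucket helper are hypothetical names for this example.

import numpy as np

def contract_bucket(bucket, bucket_index):
    # Contract all tensors in `bucket` and sum over `bucket_index`.
    # Each tensor is an (indices, ndarray) pair; `indices` holds one
    # integer label per tensor dimension. The sketch supports up to 26
    # distinct indices, since each label maps to a single letter.
    all_indices = sorted({i for inds, _ in bucket for i in inds})
    letter = {idx: chr(ord("a") + n) for n, idx in enumerate(all_indices)}
    inputs = ",".join("".join(letter[i] for i in inds) for inds, _ in bucket)
    out_indices = [i for i in all_indices if i != bucket_index]
    output = "".join(letter[i] for i in out_indices)
    # The bucket index is omitted from the output, so einsum sums over it.
    result = np.einsum(inputs + "->" + output, *(t for _, t in bucket))
    # The bucket width of `result` is len(out_indices).
    return tuple(out_indices), result

# Example: two 2x2 tensors sharing index 1; summing index 1 out is an
# ordinary matrix product, and the result carries indices (0, 2).
a = ((0, 1), np.random.rand(2, 2))
b = ((1, 2), np.random.rand(2, 2))
indices, tensor = contract_bucket([a, b], bucket_index=1)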

Backend Name            Device   Time (s)   Speedup
Torch_CPU               CPU      310        0.94×
NumPy (baseline)        CPU      294        1.00×
CuPy                    GPU      6.6        44.5×
Torch_GPU               GPU      3.5        84.0×
Torch_CPU + Torch_GPU   Mixed    2.9        101×
NumPy + CuPy            Mixed    2.1        140×
TABLE I: Time to contract a single tensor network corresponding to the calculation of a single lightcone for a QAOA energy calculation. Speedup shows the overall runtime improvement relative to the baseline CPU backend "NumPy".

II CPU and GPU performance results

For numerical evaluation, we measure the performance of QTensor on a single lightcone of a QAOA MaxCut problem for a 3-regular graph of size 30 and QAOA depth p=4. The experiment is performed on an NVIDIA DGX-2 server (provided by NVIDIA Corporation) with a V100 accelerator, using CUDA version 11.0.

We present the aggregate time required for contracting a single lightcone to simulate the QAOA expectation value in Table I. The baseline NumPy backend is executed only on a CPU and is labeled "einsum" in Fig. 1 and Fig. 3, as we use numpy.einsum() for tensor computation. We also benchmark the GPU library CuPy (GPU only) and PyTorch (both CPU and GPU). Our GPU implementation of the simulator using PyTorch (labeled "torch_gpu") achieves an 84× speedup over the CPU baseline and a 1.88× speedup over CuPy.
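All of these backends expose an einsum-style contraction interface, which is what makes them interchangeable in the simulator. The sketch below shows the same contraction expressed on each backend; it is illustrative only, and the actual QTensor backend classes wrap additional bookkeeping.

import numpy as np
import torch
import cupy as cp

expr = "abc,cd->abd"
a = np.random.rand(2, 2, 2).astype(np.complex128)
b = np.random.rand(2, 2).astype(np.complex128)

out_np = np.einsum(expr, a, b)                          # "einsum": NumPy on CPU
out_tc = torch.einsum(expr, torch.as_tensor(a),
                      torch.as_tensor(b))               # "torch_cpu"
out_tg = torch.einsum(expr, torch.as_tensor(a).cuda(),
                      torch.as_tensor(b).cuda())        # "torch_gpu"
out_cp = cp.einsum(expr, cp.asarray(a), cp.asarray(b))  # "cupy"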

Figure 1: Breakdown of mean contraction time by bucket width. CPU backends are faster for buckets of width ≤ 11, and GPU backends are faster for larger buckets.

Single Backend Benchmarking Results: Figure 1 shows the mean contraction time across bucket widths for the different backends. The "einsum" backend spends less mean time on buckets of width below nine, while the GPU backends ("torch_gpu" and "cupy") have similar and better performance for larger bucket widths. Figure 3 provides a breakdown of total contraction time by bucket width. This distribution is multi-modal: a large portion of time is spent on buckets of width 3, while for CPU backends the bulk of the simulation time is spent contracting large buckets. Figure 2 shows the distribution of bucket widths, where 85% of buckets have width less than 5. This means the simulation incurs an overhead from contracting a large number of small buckets.

This is particularly noticeable in the total contraction time for each bucket width. Figure 3 shows that the distribution of time versus bucket width has two modes: large buckets, which dominate the contraction time for CPU backends, and small buckets, where most of the time is spent on I/O and other code overhead.

Figure 2: Distribution of bucket widths in the contraction of a single QAOA lightcone. The y-axis is log-scale. 85% of buckets have width < 5, which causes a simulation overhead.
Figure 3: Breakdown of time spent on buckets of each width in the simulation of a single lightcone. The y-value of this plot is effectively the y-value of Figure 1 multiplied by that of Figure 2.

Mixed Backends Benchmarking Results: From Figure 1 it is apparent that GPU backends perform much better for buckets of large width, while CPU backends are better for smaller buckets. We therefore implemented a MixedBackend approach, which dynamically selects the device (CPU or GPU) on which each bucket is contracted, as sketched below. We set a bucket-width threshold of 11; any bucket wider than 11 is contracted on the GPU. Figure 3 shows that for GPU backends, small buckets occupy approximately 90% of the total simulation time. The results for this approach are shown in Table I under the backend names "Torch_CPU + Torch_GPU" and "NumPy + CuPy". Combining a CPU backend with Torch_GPU improves performance by 1.2× over Torch_GPU alone, and combining NumPy with CuPy improves performance by 3× over CuPy alone. This suggests that the combination NumPy + Torch_GPU has the potential to give the best results.
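A minimal sketch of this dispatch rule, reusing the hypothetical (indices, array) bucket representation from the earlier contract_bucket sketch (the backend objects and their method names are assumptions for illustration):

WIDTH_THRESHOLD = 11  # CPU/GPU crossover observed in Figure 1

def bucket_width(bucket, bucket_index):
    # Width of the contraction result: the number of distinct indices
    # across the bucket, minus the bucket index that gets summed out.
    return len({i for inds, _ in bucket for i in inds}) - 1

def contract_mixed(bucket, bucket_index, cpu_backend, gpu_backend):
    # Wide buckets go to the GPU backend; narrow buckets stay on the CPU,
    # avoiding transfer and kernel-launch overhead on small tensors.
    if bucket_width(bucket, bucket_index) > WIDTH_THRESHOLD:
        return gpu_backend.contract_bucket(bucket, bucket_index)
    return cpu_backend.contract_bucket(bucket, bucket_index)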

Results: We have evaluated the GPU performance of tensor network contraction for the energy calculation of QAOA. The workload is highly inhomogeneous, with many small buckets and a few very large ones. Most of the improvement comes from using the GPU on large buckets, with up to a 300× speedup. On the other hand, the contraction of smaller tensors is faster on a CPU. In general, if the maximum bucket width of a lightcone is less than ∼17, the improvement from using a GPU is marginal. In addition, large buckets require a lot of memory. For example, a bucket of width 27 produces a tensor with 27 dimensions of size 2, and the memory requirement for the complex128 data type is 2 GB. In practice, these calculations are feasible up to width ∼29.
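The 2 GB figure follows directly from the tensor shape: a width-27 tensor holds 2^27 entries, and each complex128 entry takes 16 bytes, so the total is 2^27 × 16 bytes = 2^31 bytes = 2 GiB. By the same arithmetic, a width-29 tensor takes 8 GiB before counting the working memory of the contraction itself, which is consistent with the practical limit of ∼29.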

III Conclusions

This work has demonstrated that GPUs can significantly speed up quantum circuit simulation based on tensor network contraction. We show that a GPU is best for contracting large tensors, while a CPU is slightly better for small tensors. Moving the contraction of large buckets to the GPU can dramatically speed up the computation. We propose a contraction backend that dynamically assigns a CPU or GPU device to each tensor based on its size. This mixed backend approach demonstrated a 140× improvement in time-to-solution.

Acknowledgments

Danylo Lykov and Yuri Alexeev are supported by a Defense Advanced Research Projects Agency (DARPA) grant. Yuri Alexeev and Angela Chen are also supported in part by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative. Huaxuan Chen is supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Science Undergraduate Laboratory Internships (SULI) program. This work used the resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.