Focus beyond quadratic speedups for error-corrected quantum advantage

Ryan Babbush [email protected] Google Quantum AI, Venice, CA 90291, United States of America Jarrod R. McClean [email protected] Google Quantum AI, Venice, CA 90291, United States of America Michael Newman Google Quantum AI, Venice, CA 90291, United States of America Craig Gidney Google Quantum AI, Venice, CA 90291, United States of America Sergio Boixo Google Quantum AI, Venice, CA 90291, United States of America Hartmut Neven Google Quantum AI, Venice, CA 90291, United States of America

Abstract

In this perspective, we discuss conditions under which it would be possible for a modest fault-tolerant quantum computer to realize a runtime advantage by executing a quantum algorithm with only a small polynomial speedup over the best classical alternative. The challenge is that the computation must finish within a reasonable amount of time while being difficult enough that the small quantum scaling advantage would compensate for the large constant factor overheads associated with error-correction. We compute several examples of such runtimes using state-of-the-art surface code constructions under a variety of assumptions. We conclude that quadratic speedups will not enable quantum advantage on early generations of such fault-tolerant devices unless there is a significant improvement in how we would realize quantum error-correction. While this conclusion persists even if we were to increase the rate of logical gates in the surface code by more than an order of magnitude, we also repeat this analysis for speedups by other polynomial degrees and find that quartic speedups look significantly more practical.

Introduction

One of the most important goals of the field of quantum computing is to eventually build a fault-tolerant quantum computer. But what valuable and classically challenging problems could we actually solve on such a device? Among the most compelling applications are quantum simulation Feynman (1982); Lloyd (1996) and prime factoring Shor (1994). Quantum algorithms for these tasks give exponential speedups over known classical alternatives but would have limited impact compared to significant improvements in our ability to address problems in broad areas of industrial relevance such as optimization and machine learning. However, while quantum algorithms exist for these applications, the most rigorous results have only been able to show a large speedup in contrived settings or a smaller speedup across a broad range of problems. For example, many quantum algorithms (often based on amplitude amplification Brassard et al. (2002)) give quadratic speedups for tasks such as search Grover (1996), optimization Grover (1996); Somma et al. (2008a); Campbell et al. (2019), Monte Carlo Brassard et al. (2002); Montanaro (2015); Rebentrost et al. (2018), various areas of machine learning Aïmeur et al. (2006); Wiebe et al. (2015) and more. However, attempts Sanders et al. (2020); Campbell et al. (2019) to assess the overheads of some such applications within fault-tolerance have come up with discouraging predictions for what would be required to achieve practical advantage against classical algorithms.

Refer to caption — (a) “Quantum nand” $>10\,\,{\rm qubitseconds}$

The central issue is that quantum error-correction and device operation time introduce significant constant factor slowdowns to the algorithm runtime (see Figure 1). These large overheads present many challenges for the practical realization of useful fault-tolerant devices. However, for applications that benefit from an exponential speedup relative to classical algorithms, the exponential scaling of the classical approach quickly catches up to the large constant factors of the quantum approach so that one can achieve a practical runtime advantage for even modest problem sizes. This is borne out through numerous studies on the cost of error-correcting applications with exponential scaling advantage in areas such as quantum chemistry Kivlichan et al. (2020); von Burg et al. (2020); Lee et al. (2020), quantum simulation of lattice models Lemieux et al. (2020a); Childs et al. (2018) and prime factoring Gidney and Ekerå (2019).

In this perspective we discuss when it would be practical for a modest fault-tolerant quantum computer to realize a quantum advantage with quantum algorithms giving only a small polynomial speedup over their classical competition. We will see that with only a low order (e.g., quadratic) speedup, exorbitantly long runtimes are sometimes required in order for the slightly worse scaling of the classical algorithm to catch up to the slightly better scaling (but worse constant factors) of the quantum algorithm. We will argue that the problem is especially pronounced when the best classical algorithms for a problem can also be easily parallelized.

Our analysis will emphasize current projections within the surface code Kitaev (2003) since it has the highest threshold error rate for a two-dimensional quantum computing architecture and is generally regarded as the most practical quantum error-correcting code Fowler et al. (2012). We will focus on a modest realization of the surface code that would involve enough resources to perform classically intractable calculations but only support a few state distillation factories. Our analysis differs from results analyzing the viability of error-correcting quadratic speedups for combinatorial optimization such as Campbell et al. (2019); Sanders et al. (2020), by addressing the prospects for achieving quantum advantage via polynomial speedup for a broad class of algorithms, rather than for specific problems. Note that Campbell et al. (2019) was the first study to detail poor prospects for error-correcting an algorithm achieving a quadratic speedup with a small fault-tolerant processor.

Here we will assume that there is some problem which can be solved by a classical computer that makes $M^{d}$ calls to a “classical primitive” circuit or by a quantum computer which makes $M$ calls to a “quantum primitive” circuit (which is often, but not always, related to the classical primitive circuit). This corresponds to an order $d$ polynomial quantum speedup in the number of queries to these subroutines. For $d=2$ , this is especially evocative of a common class of quantum algorithms leveraging amplitude amplification. This generously assumes no prefactor overhead in a quantum implementation of an algorithm with respect to the number of calls required and along with other crude assumptions, allows us to bound the crossover time.

Our back-of-the-envelope analysis makes many assumptions which are overly optimistic towards the quantum computer and yet we still conclude that the prospects look poor for quadratic speedups with current error-correcting codes and architectures to outperform classical computers in time-to-solution. It seems that to realize a quantum advantage with reasonable fault-tolerant resources, one must either focus beyond quadratic speedups, dramatically improve techniques for error-correction, or do both. Our conclusion is already “folk wisdom” among some in the small community that studies quantum algorithms and error-correction with an eye towards practical realization; however, this reality is not widely appreciated in the broader community that studies algorithms and applications of quantum computers more generally and there is value in presenting a specific argument to this effect in written form. An encouraging finding is that the prospects for error-corrected quantum advantage look significantly better with quartic speedups. Of course, there might exist use cases involving quadratic speedups that defy the framework of this analysis. Either way, we hope this perspective will encourage the field to critically examine the prospects for quantum advantage with error-corrected quadratic speedups and either produce examples where it is feasible, or focus more effort on algorithms with larger speedups.

Relationship between primitive times and runtime

Many quantum algorithms are built on coherent access to primitives implemented with classical logic. For example, this classical logic might be required to compute the value of a classical cost function for optimization Sanders et al. (2020), to evaluate a function of a trajectory of some security that one is pricing with Monte Carlo Rebentrost et al. (2018), or to compute some classical criteria that flags a marked state for which one might be searching Grover (1996). We will define the runtime of the quantum and classical algorithms as

{\cal T}_{Q}=M\,t_{Q}\qquad\qquad{\cal T}_{C}=M^{d}\,t_{C}

(1)

where ${\cal T}$ gives the total runtime of the algorithm, $M$ is the number of primitive calls required, $d$ is the order of the polynomial speedup the quantum computer achieves and $t$ is the time required to perform a call. Throughout this perspective the subscripts $Q$ and $C$ will denote “quantum” and “classical” implementations.

The condition for quantum advantage is

{\cal T}_{Q}<{\cal T}_{C}\qquad{\rm and\,\,\,thus,}\qquad M>\left(\frac{t_{Q}}{t_{C}}\right)^{\frac{1}{d-1}}\,.

(2)

We see then that whenever a problem will require enough calls $M$ that a quantum advantage is possible,

{\cal T}_{Q}>{\cal T}^{\star}\equiv t_{Q}\left(\frac{t_{Q}}{t_{C}}\right)^{\frac{1}{d-1}}

(3)

where ${\cal T}^{\star}$ is the “breakeven time” which occurs when ${\cal T}_{Q}={\cal T}_{C}$ , corresponding to onset of quantum advantage. As emphasized in Figure 1, we will see that the fundamental challenge in realizing this runtime advantage against classical computers (for small $d$ ) is that $t_{Q}\gg t_{C}$ in error-corrected contexts, making ${\cal T}^{\star}$ very large.

Rather than use a single CPU for the classical approach, one might instead parallelize the algorithm using $P$ classical CPUs. This will reduce the total classical runtime to

{\cal T}_{C}=\frac{M^{d}\,t_{C}}{S}\quad\qquad S=\left(\alpha+\frac{1-\alpha}{P}\right)^{-1}

(4)

where $\alpha$ is the fraction of the algorithm which must be executed in serial and $S$ is the speedup factor due to parallelization consistent with the “Amdahl’s law” Amdahl (1967). Note that Amdahl’s law scaling is considered somewhat pessimistic as one can often adjust the size of problems to fully exploit the computing power that becomes available with more parallelism (e.g., see “Gustafson’s law” Gustafson (1988) for a more optimistic formula for $S$ ). But it also seems that in most situations where one might hope to find a quadratic speedup with a quantum computer (e.g. applications such as search, optimization, Monte Carlo, regression, etc.) the corresponding classical approach is embarrassingly parallel (suggesting that $\alpha$ is small enough that $S\approx P$ for reasonable values of $P$ ). Regardless of the form of $S$ , classical parallelism leads to the following revised conditions for quantum advantage:

M>\left(\frac{t_{Q}S}{t_{C}}\right)^{\frac{1}{d-1}}\,\quad{\rm and}\!\qquad{\cal T}^{\star}=t_{Q}\left(\frac{t_{Q}S}{t_{C}}\right)^{\frac{1}{d-1}}\;.

(5)

While parallel efficiency might be limited for some applications, any implementation of an error-correcting code will also require substantial classical co-processing in order to perform decoding, and this is likely to require thousands of classical cores. Although many quantum algorithms can also benefit from various forms of parallelism, we are considering an early fault tolerance setting where there is likely an insufficient number of logical qubits to exploit a space-time tradeoff to the same extent.

Implementing error-corrected quantum primitives

We will now explain the principle overheads believed to be required for the best candidate for quantum error-correction on a two-dimensional lattice: the surface code. Toffolis are the most commonly used gate for implementing classical logic on a quantum computer but cannot be implemented transversally within practical implementations of the surface code. Instead, one must implement these gates by first distilling resource states. In particular, to implement a Toffoli gate one requires a CCZ state ( $\mathinner{|{\rm CCZ}\rangle}={\rm CCZ}\mathinner{|{+++}\rangle}$ ) and these states are consumed during the implementation of the gate. Distilling CCZ states requires a substantial amount of both time and hardware and thus, they are usually the bottleneck in realizing quantum algorithms within the surface code.

Here, we will focus on the state-of-the-art Toffoli factory constructions of Gidney and Fowler (2019a) which are based on applying the lattice surgery constructions of Fowler and Gidney (2018) to the fault-tolerant Toffoli protocols of Jones (2013); Eastin (2013). Using that approach one Toffoli gate requires $5.5\times d$ surface code cycles, where $d$ is the code distance. The time per round of the surface code, including decoding time is expected to be around $1\,\mu s$ in superconducting qubits. Our analysis will assume a code distance in the vicinity of $d=30$ . This would be sufficient for an algorithm with billions of gates and physical gate error rates on the order of $10^{-3}$ (as our analysis will reveal, even more than a billion gates would likely be required to obtain quantum advantage with a modest polynomial speedup). With these assumptions, our model predicts a Toffoli gate time of

\displaystyle t_{G}=30\times 5.5\times 1\,\mu\text{s}\approx 170\,\mu{\rm s}\;.

(6)

This rough approximation matches the more detailed resource estimate of Ref. Gidney and Fowler (2019a). We discuss these estimates in more detail in Appendix A.

Under the aforementioned assumptions which are specific to contemporary realizations of the surface code using superconducting qubits we could express the quantum primitive runtime as

\displaystyle t_{Q}=t_{G}\cdot G=170\,\mu{\rm s}\cdot G\;,

(7)

where $G$ is the number of Toffoli gates required to implement the quantum primitive. Although we have focused on superconducting qubits, we can also contextualize the performance of ion traps — another leading architecture for quantum advantage. Ion qubits enjoy hour-long coherence times Wang et al. (2017), but are typically gated by the performance of their two-qubit gate Bruzewicz et al. (2019). Gate times within a single ion crystal can range from hundreds of microseconds to sub-microsecond speeds Mølmer and Sørensen (1999); García-Ripoll et al. (2003); Wong-Campos et al. (2017); Schäfer et al. (2018); Torrontegui et al. (2020), and can be efficiently parallelized Grzesiak et al. (2020); Landsman et al. (2019).

Multiple ion crystals can be connected to form a networked quantum computer, either through a charge-coupled device Kielpinski et al. (2002); Pino et al. (2020) or via photonic interfaces Monroe et al. (2014); Hucul et al. (2015). While a charge-coupled device may support thousands of qubits, millions of qubits will likely require photonic interconnects, although large shuttling-based traps have been proposed Lekitsch et al. (2017). For either architecture, a $\sim 10\,\text{kHz}$ cycle frequency has been identified as an ambitious but attainable goal Brown et al. (2016); Lekitsch et al. (2017); Nickerson et al. (2014). Consequently, we can roughly estimate that such a device will be limited by a clockspeed about $100\times$ slower than the $1\,\mu\text{s}$ decoding throughput limit, commensurate with typical high-fidelity two-qubit gate times Ballance et al. (2016); Gaebler et al. (2016) and corresponding to a $t_{G}\approx 17\,\text{ms}$ . However, in trade, such a device may support the requisite connectivity for non-2D error-corrected codes and fault-tolerant gates. While the advantages of such approaches are speculative, we touch on some of these alternate proposals in Appendix B.

On a very large surface code quantum computer one could instead use multiple Toffoli factories (at a high cost in the number of physical qubits required) in order to reduce $t_{Q}$ by performing state distillation in parallel. However, the Toffoli gates are only about two orders of magnitude slower than the Clifford gates and when using multiple factories one needs to account for routing overhead. Thus, while $t_{Q}$ can be reduced at the cost of using many more qubits, only by a factor that is between about ten and one-hundred.

If $N$ is the number of qubits on which this problem is defined then a sensible lower bound would seem to be $G\geq N$ and thus, $t_{Q}\geq 170\,\mu{\rm s}\cdot N$ . For example, in Grover’s algorithm Grover (1996) one must perform a reflection that requires ${\cal O}(N)$ Toffoli gates. In order to achieve a quantum advantage we would need to focus on problem sizes that are sufficiently large that enough calls can be made so that Eq. (2) is satisfied. We find it difficult to imagine satisfying this condition for problem sizes less than one-hundred qubits. Thus, an approximate “lower bound” (using $N=100$ ) would be

t_{Q}\geq 17\,{\rm ms}\,.

(8)

In addition to this lower bound, we will also consider a specific, realistic example to keep our estimates grounded. We will focus on the quantum accelerated simulated annealing by qubitized quantum walk algorithm studied in Somma et al. (2008b); Lemieux et al. (2020b), which appears to provide a quadratic speedup over classical simulated annealing (at least in terms of the best known bounds) in terms of the mixing time of the Markov chain under certain assumptions Boixo et al. (2009). This is among the most efficient algorithms compiled in Sanders et al. (2020) and for the Sherrington-Kirkpatrick model Kirkpatrick et al. (1983), the implementation complexity is $5N+{\cal O}(\log N)$ (neglecting some subdominant scalings that depend on precision), which is only worse than the scaling of our lower bound by a factor of five. For example, for an $N=512$ qubit instance, the work of Sanders et al. (2020) shows that only about $2.6\!\times\!10^{3}$ Toffoli gates are required to make an update. Thus, for that problem size (which we choose to facilitate a comparison to classical algorithms that we will discuss later) we have that

t_{Q}=440\,{\rm ms}\,.

(9)

Implementing classical primitives

Classical computers are very fast; a typical 3 GHz CPU can perform several billion 64 bit operations (e.g., floating point multiplications) per second. We might crudely write that the classical primitive time is $t_{C}=330\,{\rm ps}\cdot L$ where $L$ is the number of classical clock cycles required to implement the classical primitive. For our first example comparison of quantum and classical primitives we will assume that any classical logic operation that would require one Toffoli in the quantum primitive can be executed during one classical clock cycle in the classical primitive. This seems generous to the quantum computer since many operations that would take a single clock cycle on a classical computer would actually require thousands of Toffolis. (Note that we are not assuming any scaling advantage for the quantum computer in the primitive implementations.) One might worry about memory-bound classical primitives (since calls to main memory can take hundreds of clock cycles) but since problems defined on more than thousands of logical bits would be infeasible to process on a small fault-tolerant quantum computer we expect that the memory required for the corresponding classical primitives can be held in cache.

Thus, a corresponding bound on the time to realize a classical primitive for a problem where a quantum computer could realize a quantum primitive with anywhere near the lower bound time given in the prior section ( $t_{Q}\geq 170\,\mu{\rm s}\cdot N$ ) is $t_{C}\leq 330\,{\rm ps}\cdot N$ , and for $N=100$ ,

t_{C}\leq 33\,{\rm ns}\,.

(10)

Even though the equivalence we make between Toffolis and classical compute cycles is seemingly generous to the quantum computer, the assumption of such a cheap primitive on the quantum side (only 100 Toffolis) results in what appears to be a fairly cheap primitive on the classical side. However, because Eq. (5) scales worse with $t_{Q}$ than with $t_{C}$ , this assumption is ultimately optimistic towards the overall crossover time.

Consistent with the prior section, we will also discuss the classical primitive time required to apply simulated annealing to an instance of the Sherrington-Kirkpatrick model. Using the techniques developed in Isakov et al. (2015), a performant implementation of classical simulated annealing code for an $N=512$ instance of the Sherrington-Kirkpatrick model can perform a simulated annealing step in roughly 7 CPU-nanoseconds Sanders et al. (2020) (this accounts for the fact that most updates for the Sherrington-Kirkpatrick model are rejected); thus in that case,

t_{C}=7\,{\rm ns}.

(11)

But given the high costs of quantum computing it is unclear that we should compare to a single classical core.

Minimum runtime for quadratic quantum advantage

Here we discuss the ramifications that the primitive runtimes discussed in the prior two sections have for the minimum time to achieve advantage according to Eq. (3) in the case of a quadratic quantum speedup. First, we will compare the example of a quantum primitive requiring only $N=100$ Toffolis and $t_{Q}=17\,{\rm ms}$ . We argued that any such primitive could likely be computed in $t_{C}=33\,{\rm ns}$ on a single core. For this example, ${\cal T}^{\star}=t_{Q}^{2}/t_{C}=2.4\,\,{\rm hours}$ . One might object to this minimal example on the grounds that it seems unlikely any interesting primitive would require only 100 Toffolis. While this is true, we point out that because quantum runtime is quadratic in the quantum primitive time and only inversely proportional to the classical primitive time, the overall crossover time can only get worse by assuming that more than 100 Toffolis would be required.

Next, we will compare to the example of quantum accelerated simulated annealing. We focus on this example because the steps of the quantum algorithm have been concretely compiled, appear quite efficient, and have a clear classical analogue. Here, for an $N=512$ qubit instance we have that $t_{Q}^{2}/t_{C}=320\,{\rm days}$ , reproducing the finding in Sanders et al. (2020). We note that quantum advantage in this case would occur when $M>t_{Q}/t_{C}=6.3\times 10^{7}$ . This means that $4.0\times 10^{15}$ calls would need to be required for the classical algorithm. However, most $N=512$ Sherrington-Kirkpatrick model instances would require many fewer calls to solve with classical simulated annealing and so one would need to focus on an even bigger system for which the numbers will look yet worse for the quantum computer. Notice that our simulated annealing example gave a quantum runtime that is much longer than the resources required for the quantum primitive with $N=100$ Toffolis. This is because the notion that it would take a classical computer an entire clock cycle to do what a quantum computer could accomplish with a single Toffoli is very generous to the quantum computer.

At first glance, the quantum runtime of 2.4 hours to achieve advantage for the primitive with just 100 Toffolis seems encouraging. Unfortunately, this was just for a single classical core. Even most laptops have on the order of ten cores these days and again, most of the problems where quantum computers display a quadratic advantage are classically embarrassingly parallel problems. Furthermore, error-corrected quantum computers are likely to use thousands of classical CPUs just for decoding. When using $P$ different classical CPUs in parallel then the breakeven time is given by Eq. (5). Using that equation, if we take $P=3,\!000$ CPUs for the classical task (rather than using them for error-correction), and if the classical algorithm is sufficiently parallelizable ( $\alpha^{-1}\ll P$ so $S\approx P$ ), we see that the breakeven time even in this still quantum-generous example becomes one year. As we discuss in the next section there are also ways of parallelizing the quantum computations; e.g., by using multiple quantum computers or distillation factories.

polynomial degree $d$	parallelism	resource “lower bound”		simulated annealing
polynomial degree $d$	speedup $S$	iterations $M$	runtime ${\cal T}^{\star}$	iterations $M$	runtime ${\cal T}^{\star}$
Quadratic, $d=2$	1	$5.2\times 10^{5}$	$2.4\,\,{\rm hours}$	$6.3\times 10^{7}$	$320\,\,{\rm days}$
	$10^{3}$	$5.2\times 10^{8}$	$100\,\,{\rm days}$	$6.3\times 10^{10}$	$880\,\,{\rm years}$
	$10^{6}$	$5.2\times 10^{11}$	$280\,\,{\rm years}$	$6.3\times 10^{13}$	$880\,\,{\rm millennia}$
Cubic, $d=3$	1	$7.2\times 10^{2}$	$12\,\,{\rm seconds}$	$7.9\times 10^{3}$	$58\,\,{\rm minutes}$
	$10^{3}$	$2.3\times 10^{4}$	$6.4\,\,{\rm minutes}$	$2.5\times 10^{5}$	$1.3\,\,{\rm days}$
	$10^{6}$	$7.2\times 10^{5}$	$3.4\,\,{\rm hours}$	$7.9\times 10^{6}$	$40\,\,{\rm days}$
Quartic, $d=4$	1	$8.0\times 10^{1}$	$1.4\,\,{\rm seconds}$	$4.0\times 10^{2}$	$2.9\,\,{\rm minutes}$
	$10^{3}$	$8.0\times 10^{2}$	$14\,\,{\rm seconds}$	$4.0\times 10^{3}$	$29\,\,{\rm minutes}$
	$10^{6}$	$8.0\times 10^{3}$	$2.3\,\,{\rm minutes}$	$4.0\times 10^{5}$	$4.9\,\,{\rm hours}$

Table 1: Resources required to achieve quantum advantage assuming speedups of various polynomial degrees,

d

. We make this comparison against an adversary using distributed classical computing resources that achieve a speedup factor

S

and report the number of algorithm steps

M

and total runtime

{\cal T}^{\star}

before a quantum speedup is possible. We make this comparison for both the informal resource “lower bound” we argued for in the text (using

t_{Q}\geq 17\,{\rm ms}

and

t_{C}\leq 33\,{\rm ns})

, and for the specific example of quantum simulated annealing applied to the Sherrington-Kirkpatrick model using the quantum and classical implementations discussed in Sanders et al. (2020); Isakov et al. (2015) (giving

t_{Q}=440\,{\rm ms}

and

t_{C}=7\,{\rm ns}

The viability of higher polynomial speedups
and the impact of faster error-correction

We report values of both $M$ and ${\cal T}^{\star}$ assuming quantum speedups by different polynomial degrees under different amounts of classical parallelism in Table 1. While the viability of quantum advantage with cubic speedups is still a bit ambiguous, the prospects of achieving quantum advantage given a quartic speedup are promising. Even the simulated annealing example run with a classical adversary with $S=10^{6}$ parallelism would give quantum advantage after five hours of runtime if we assume a quartic speedup (while we do not expect a quartic speedup in that case, the comparison is still instructive).

It is rather surprising just how much of a difference there is for this example between assuming a quadratic speedup (requiring 880 millennia of runtime for advantage) and a quartic speedup (requiring just 4.9 hours of runtime for advantage). There are not as many examples of quartic speedups in quantum computing but there are a few, such as the tensor principle component analysis algorithm of Hastings Hastings (2020). Another example is the quartic query complexity reductions of Ambainis et al. Ambainis et al. (2015) and Aaronson et al. Aaronson et al. (2020). We also expect that certain applications of quantum algorithms for linear systems Harrow et al. (2009) (such as for solving linear differential equations in high dimension Berry (2014)) might lead to modest polynomial speedups higher than quadratic. It is also possible that some heuristic quantum algorithms for optimization might give larger than quadratic improvements for some class of problems, although this is still speculative.

Another question we might ask is, what happens if we were somehow able to implement Toffoli gates much faster in the surface code? For example, we might achieve this by fanning out and using more physical qubits per factory, more Toffoli factories, by inventing significantly more efficient protocols for Toffoli state distillation, or even by switching to a different technology with an intrinsically faster cycle time. We will perform this analysis for the case of quadratic speedups; there, the quantum runtime is reduced to ${\cal T}_{Q}=M\,t_{Q}/R$ where $R\geq 1$ is a speedup factor corresponding to performing Toffoli distillation in time $170\,\mu{\rm s}/R$ . In analogy to Eq. (5) this leads to the equations for a quadratic quantum speedup

M>\frac{t_{Q}S}{t_{C}R}\qquad{\rm and}\qquad{\cal T}^{\star}=\frac{t_{Q}^{2}S}{t_{C}R^{2}}\,.

(12)

In Table 2 we compute Eq. (12) for our example problems with $R=10$ , $R=10^{2}$ and $R=10^{3}$ , assuming a classical adversary capable of achieving an $S=10^{3}$ parallelism. We restrict ourselves to $S=10^{3}$ due to the general difficulty in achieving high parallel efficiency described by Amdahl’s law. However, note that for simulated annealing we can achieve $S=10^{6}$ in practice (and so these numbers are overly optimistic for that case).

speedup	resource “lower bound”		simulated annealing
factor	iterations $M$	runtime ${\cal T}^{\star}$	iterations $M$	runtime ${\cal T}^{\star}$
$R=10^{1}$	$5.2\times 10^{7}$	$1.0\,\,{\rm day}$	$6.3\times 10^{9}$	$8.8\,\,{\rm years}$
$R=10^{2}$	$5.2\times 10^{6}$	$15\,\,{\rm minutes}$	$6.3\times 10^{8}$	$32\,\,{\rm days}$
$R=10^{3}$	$5.2\times 10^{5}$	$8.8\,\,{\rm seconds}$	$6.3\times 10^{7}$	$7.7\,\,{\rm hours}$

Table 2: Resources required to achieve quantum advantage under a quadratic speedup assuming Toffoli distillation time of

170\,\mu{\rm s}/R

and a classical adversary making use of classical parallelism with

S=10^{3}

. The speedup factor

R

can account for improvements in error-correction implementations or in our estimates of their overheads. For example,

R=10

could be reached by using ten Toffoli factories if routing were very efficient (at the cost of requiring many more qubits).

Unfortunately, even if Toffoli distillation rates improve by an order of magnitude it would not be enough to make quantum advantage with a quadratic speedup viable. If Toffoli distillation rates improve by two orders magnitude (making them essentially as cheap as Clifford gates) then it would still be challenging to obtain quantum advantage with a quadratic speedup (it would take more than a month for the simulated annealing example despite limiting the classical parallelism to $S=10^{3}$ ) but we cannot categorically rule it out for all algorithms. At three orders of magnitude speedup the story would be materially different but this would likely require a significant breakthrough. Even if classical processing and signal propagation were instantaneous, and we could adapt measurements to take advantage feedforward single-qubit gates only being applied half the time, a single layer of non-Clifford gates would still take a hard limit of the measurement time plus half the single qubit gate time.

Conclusion

We have investigated simple conditions that must be satisfied to realize a quantum advantage through polynomial speedups on a small fault-tolerant quantum computer. Our ultimate finding is that the prospects are generally poor for a quadratic speedup, consistent with folk knowledge in the error-correction community and recent work such as Sanders et al. (2020); Campbell et al. (2019). The comparison to parallel classical resources is particularly damning for quantum computing and unfortunately, many quadratic quantum speedups (especially those leveraging amplitude amplification) apply to problems that are highly parallelizeable. The strongest conclusions in this work assume that one can achieve classical parallelism speedups on the order of $10^{3}$ or more. But if one can produce a quadratic speedup for a problem where that is not the case, the prospects of quantum advantage would be improved.

These findings do not apply to all polynomial speedups. We found that while one would need to very significantly improve the rate of an error-corrected processor to help the case of quadratic speedups, having a quartic speedup rather than a quadratic speedup is often sufficient to restore the viability of achieving quantum advantage on a modest processor. Thus, we believe that these results suggest that the field should focus beyond quadratic speedups to find viable applications that might produce a quantum advantage on the first several generations of fault-tolerant quantum computers.

We expect this conclusion will persist under a variety of different cost models (e.g., were we to focus on the energy consumption of a computation rather than the runtime). However, we also expect that the community will make progress on some of the challenges described here, or perhaps identify circumstances under which the assumptions of this analysis do not apply. Either way, we hope that these arguments will foster further discussion about how we might develop broadly applicable algorithms that can achieve quantum advantage on small error-corrected quantum computers.

Acknowledgments

The authors thank Dave Bacon, Dominic Berry, Ken Brown, Eddie Farhi, Austin Fowler, Bill Huggins, Sergei Isakov, Evan Jeffrey, Cody Jones, John Platt, Rolando Somma, Nathan Wiebe and Will Zeng for helpful discussions and feedback on earlier drafts.

References

Feynman (1982) Richard P Feynman, “Simulating physics with computers,” International Journal of Theoretical Physics 21, 467–488 (1982).
Lloyd (1996) Seth Lloyd, “Universal Quantum Simulators,” Science 273, 1073–1078 (1996).
Shor (1994) P W Shor, “Algorithms for quantum computation: discrete logarithms and factoring,” Proceedings 35th Annual Symposium on Foundations of Computer Science , 124–134 (1994).
Brassard et al. (2002) Gilles Brassard, Peter Høyer, Michele Mosca, and Alain Tapp, “Quantum amplitude amplification and estimation,” in Quantum Computation and Information, edited by Vitaly I Voloshin, Samuel J. Lomonaco, and Howard E. Brandt (American Mathematical Society, Washington D.C., 2002) Chap. 3, pp. 53–74.
Grover (1996) Lov K Grover, “A fast quantum mechanical algorithm for database search,” in Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’96 (ACM, New York, NY, USA, 1996) pp. 212–219.
Somma et al. (2008a) R. D. Somma, S. Boixo, H. Barnum, and E. Knill, “Quantum Simulations of Classical Annealing Processes,” Physical Review Letters 101, 130504 (2008a).
Campbell et al. (2019) Earl Campbell, Ankur Khurana, and Ashley Montanaro, “Applying quantum algorithms to constraint satisfaction problems,” Quantum 3, 167–undefined (2019).
Montanaro (2015) Ashley Montanaro, “Quantum speedup of Monte Carlo methods,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 471, 20150301 (2015).
Rebentrost et al. (2018) Patrick Rebentrost, Brajesh Gupt, and Thomas R. Bromley, “Quantum computational finance: Monte Carlo pricing of financial derivatives,” Physical Review A 98, 022321 (2018).
Aïmeur et al. (2006) Esma Aïmeur, Gilles Brassard, and Sébastien Gambs, “Machine learning in a quantum world,” in Advances in Artificial Intelligence, edited by Luc Lamontagne and Mario Marchand (Springer Berlin Heidelberg, Berlin, Heidelberg, 2006) pp. 431–442.
Wiebe et al. (2015) Nathan Wiebe, Ashish Kapoor, and Krysta M. Svore, “Quantum algorithms for nearest-neighbor methods for supervised and unsupervised learning,” Quantum Info. Comput. 15, 316–356 (2015).
Sanders et al. (2020) Yuval R. Sanders, Dominic W. Berry, Pedro C. S. Costa, Louis W. Tessler, Nathan Wiebe, Craig Gidney, Hartmut Neven, and Ryan Babbush, “Compilation of Fault-Tolerant Quantum Heuristics for Combinatorial Optimization,” PRX Quantum 1, 020312–020382 (2020).
Gidney and Fowler (2019a) Craig Gidney and Austin G. Fowler, “Efficient magic state factories with a catalyzed —CCZ¿ to 2—T¿ transformation,” Quantum 3, 135 (2019a).
Kivlichan et al. (2020) Ian D. Kivlichan, Craig Gidney, Dominic W. Berry, Nathan Wiebe, Jarrod McClean, Wei Sun, Zhang Jiang, Nicholas Rubin, Austin Fowler, Alán Aspuru-Guzik, Hartmut Neven, and Ryan Babbush, “Improved Fault-Tolerant Quantum Simulation of Condensed-Phase Correlated Electrons via Trotterization,” Quantum 4, 296 (2020).
von Burg et al. (2020) Vera von Burg, Guang Hao Low, Thomas Haner, Damian Steiger, Markus Reiher, Martin Roetteler, and Matthias Troyer, “Quantum computing enhanced computational catalysis,” arXiv:2007.14460 (2020).
Lee et al. (2020) Joonho Lee, Dominic Berry, Craig Gidney, William Huggins, Jarrod McClean, Nathan Wiebe, and Ryan Babbush, “Even more efficient quantum computations of chemistry through tensor hypercontraction,” arXiv:2011.03494 (2020).
Lemieux et al. (2020a) Jessica Lemieux, Guillaume Duclos-Cianci, David Sénéchal, and David Poulin, “Resource estimate for quantum many-body ground state preparation on a quantum computer,” arXiv:2006.04650 (2020a).
Childs et al. (2018) Andrew M Childs, Dmitri Maslov, Yunseong Nam, Neil J Ross, and Yuan Su, “Toward the first quantum simulation with quantum speedup,” Proceedings of the National Academy of Sciences 115, 9456–9461 (2018).
Gidney and Ekerå (2019) Craig Gidney and Martin Ekerå, “How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits,” arXiv:1905.09749 (2019).
Kitaev (2003) Alexei Kitaev, “Fault-tolerant quantum computation by anyons,” Annals of Physics 303, 2–30 (2003).
Fowler et al. (2012) Austin G Fowler, Matteo Mariantoni, John M Martinis, and Andrew N Cleland, “Surface codes: Towards practical large-scale quantum computation,” Physical Review A 86, 32324 (2012).
Amdahl (1967) Gene M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in AFIPS ’67 (Spring): Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (1967) pp. 483–485.
Gustafson (1988) John L. Gustafson, “Reevaluating Amdahl’s law,” Communications of the ACM 31, 532–533 (1988).
Fowler and Gidney (2018) Austin G. Fowler and Craig Gidney, “Low overhead quantum computation using lattice surgery,” arXiv:1808.06709 (2018).
Jones (2013) Cody Jones, “Low-overhead constructions for the fault-tolerant Toffoli gate,” Physical Review A 87, 22328 (2013).
Eastin (2013) Bryan Eastin, “Distilling one-qubit magic states into Toffoli states,” Physical Review A 87, 032321 (2013).
Wang et al. (2017) Ye Wang, Mark Um, Junhua Zhang, Shuoming An, Ming Lyu, Jing-Ning Zhang, L-M Duan, Dahyun Yum, and Kihwan Kim, “Single-qubit quantum memory exceeding ten-minute coherence time,” Nature Photonics 11, 646–650 (2017).
Bruzewicz et al. (2019) Colin D Bruzewicz, John Chiaverini, Robert McConnell, and Jeremy M Sage, “Trapped-ion quantum computing: Progress and challenges,” Applied Physics Reviews 6, 021314 (2019).
Mølmer and Sørensen (1999) Klaus Mølmer and Anders Sørensen, “Multiparticle entanglement of hot trapped ions,” Physical Review Letters 82, 1835 (1999).
García-Ripoll et al. (2003) Juan José García-Ripoll, Peter Zoller, and J Ignacio Cirac, “Speed optimized two-qubit gates with laser coherent control techniques for ion trap quantum computing,” Physical Review Letters 91, 157901 (2003).
Wong-Campos et al. (2017) JD Wong-Campos, SA Moses, KG Johnson, and C Monroe, “Demonstration of two-atom entanglement with ultrafast optical pulses,” Physical Review Letters 119, 230501 (2017).
Schäfer et al. (2018) VM Schäfer, CJ Ballance, K Thirumalai, LJ Stephenson, TG Ballance, AM Steane, and DM Lucas, “Fast quantum logic gates with trapped-ion qubits,” Nature 555, 75–78 (2018).
Torrontegui et al. (2020) E Torrontegui, D Heinrich, M I Hussain, R Blatt, and J J García-Ripoll, “Ultra-fast two-qubit ion gate using sequences of resonant pulses,” New Journal of Physics 22, 103024 (2020).
Grzesiak et al. (2020) Nikodem Grzesiak, Reinhold Blümel, Kenneth Wright, Kristin M Beck, Neal C Pisenti, Ming Li, Vandiver Chaplin, Jason M Amini, Shantanu Debnath, Jwo-Sy Chen, et al., “Efficient arbitrary simultaneously entangling gates on a trapped-ion quantum computer,” Nature Communications 11, 1–6 (2020).
Landsman et al. (2019) Kevin A Landsman, Yukai Wu, Pak Hong Leung, Daiwei Zhu, Norbert M Linke, Kenneth R Brown, Luming Duan, and C Monroe, “Two-qubit entangling gates within arbitrarily long chains of trapped ions,” Physical Review A 100, 022332 (2019).
Kielpinski et al. (2002) David Kielpinski, Chris Monroe, and David J Wineland, “Architecture for a large-scale ion-trap quantum computer,” Nature 417, 709–711 (2002).
Pino et al. (2020) Juan M Pino, Jennifer M Dreiling, Caroline Figgatt, John P Gaebler, Steven A Moses, CH Baldwin, M Foss-Feig, D Hayes, K Mayer, C Ryan-Anderson, et al., “Demonstration of the qccd trapped-ion quantum computer architecture,” arXiv preprint arXiv:2003.01293 (2020).
Monroe et al. (2014) C Monroe, R Raussendorf, A Ruthven, KR Brown, P Maunz, L-M Duan, and J Kim, “Large-scale modular quantum-computer architecture with atomic memory and photonic interconnects,” Physical Review A 89, 022317 (2014).
Hucul et al. (2015) David Hucul, Ismail V Inlek, Grahame Vittorini, Clayton Crocker, Shantanu Debnath, Susan M Clark, and Christopher Monroe, “Modular entanglement of atomic qubits using photons and phonons,” Nature Physics 11, 37–42 (2015).
Lekitsch et al. (2017) Bjoern Lekitsch, Sebastian Weidt, Austin G Fowler, Klaus Mølmer, Simon J Devitt, Christof Wunderlich, and Winfried K Hensinger, “Blueprint for a microwave trapped ion quantum computer,” Science Advances 3, e1601540 (2017).
Brown et al. (2016) Kenneth R Brown, Jungsang Kim, and Christopher Monroe, “Co-designing a scalable quantum computer with trapped atomic ions,” npj Quantum Information 2, 1–10 (2016).
Nickerson et al. (2014) Naomi H Nickerson, Joseph F Fitzsimons, and Simon C Benjamin, “Freely scalable quantum technologies using cells of 5-to-50 qubits with very lossy and noisy photonic links,” Physical Review X 4, 041041 (2014).
Ballance et al. (2016) CJ Ballance, TP Harty, NM Linke, MA Sepiol, and DM Lucas, “High-fidelity quantum logic gates using trapped-ion hyperfine qubits,” Physical Review Letters 117, 060504 (2016).
Gaebler et al. (2016) John P Gaebler, Ting Rei Tan, Y Lin, Y Wan, R Bowler, Adam C Keith, S Glancy, K Coakley, E Knill, D Leibfried, et al., “High-fidelity universal gate set for be 9+ ion qubits,” Physical Review Letters 117, 060505 (2016).
Somma et al. (2008b) R D Somma, S Boixo, H Barnum, and E Knill, “Quantum Simulations of Classical Annealing Processes,” Physical Review Letters 101, 130504 (2008b).
Lemieux et al. (2020b) Jessica Lemieux, Bettina Heim, David Poulin, Krysta Svore, and Matthias Troyer, “Efficient quantum walk circuits for metropolis-hastings algorithm,” Quantum 4, 287 (2020b).
Boixo et al. (2009) Sergio Boixo, Emanuel Knill, and Rolando Somma, “Quantum state preparation by phase randomization,” Quantum Information & Computation 9, 0833–0855 (2009).
Kirkpatrick et al. (1983) S. Kirkpatrick, C. Gelatt Jr., and M. Vecchi, “Optimization by Simulated Annealing,” Science 220, 671–680 (1983).
Isakov et al. (2015) S V Isakov, I N Zintchenko, T F Ronnow, and M Troyer, “Optimized simulated annealing code for Ising spin glasses,” Computer Physics Communications 192, 265–271 (2015).
Hastings (2020) Matthew B. Hastings, “Classical and Quantum Algorithms for Tensor Principal Component Analysis,” Quantum 4, 237– (2020).
Ambainis et al. (2015) Andris Ambainis, Kaspars Balodis, Aleksandrs Belovs, Troy Lee, Miklos Santha, and Juris Smotrovs, “Separations in Query Complexity Based on Pointer Functions,” arXiv:1506.04719 (2015).
Aaronson et al. (2020) Scott Aaronson, Shalev Ben-David, Robin Kothari, and Avishay Tal, “Quantum Implications of Huang’s Sensitivity Theorem,” arXiv:2004.13231 (2020).
Harrow et al. (2009) Aram W. Harrow, Avinatan Hassidim, and Seth Lloyd, “Quantum Algorithm for Linear Systems of Equations,” Physical Review Letters 103, 150502 (2009).
Berry (2014) Dominic W Berry, “High-order quantum algorithm for solving linear differential equations,” Journal of Physics A: Mathematical and Theoretical 47, 105301– (2014).
Trout et al. (2018) Colin J Trout, Muyuan Li, Mauricio Gutiérrez, Yukai Wu, Sheng-Tao Wang, Luming Duan, and Kenneth R Brown, “Simulating the performance of a distance-3 surface code in a linear ion trap,” New Journal of Physics 20, 043038 (2018).
Crain et al. (2019) Stephen Crain, Clinton Cahall, Geert Vrijsen, Emma E Wollman, Matthew D Shaw, Varun B Verma, Sae Woo Nam, and Jungsang Kim, “High-speed low-crosstalk detection of a 171 yb+ qubit using superconducting nanowire single photon detectors,” Communications Physics 2, 1–6 (2019).
Stephenson et al. (2020) LJ Stephenson, DP Nadlinger, BC Nichol, S An, P Drmota, TG Ballance, K Thirumalai, JF Goodwin, DM Lucas, and CJ Ballance, “High-rate, high-fidelity entanglement of qubits across an elementary quantum network,” Physical Review Letters 124, 110501 (2020).
Tan et al. (2015) Ting Rei Tan, John P Gaebler, Yiheng Lin, Yong Wan, R Bowler, D Leibfried, and David J Wineland, “Multi-element logic gates for trapped-ion qubits,” Nature 528, 380–383 (2015).
Ballance et al. (2015) CJ Ballance, VM Schäfer, Jonathan P Home, DJ Szwer, Scott C Webster, DTC Allcock, Norbert M Linke, TP Harty, DPL Aude Craik, Derek N Stacey, et al., “Hybrid quantum logic and a test of bell’s inequality using two different atomic isotopes,” Nature 528, 384–386 (2015).
Negnevitsky et al. (2018) Vlad Negnevitsky, Matteo Marinelli, Karan K Mehta, H-Y Lo, Christa Flühmann, and Jonathan P Home, “Repeated multi-qubit readout and feedback with a mixed-species trapped-ion register,” Nature 563, 527–531 (2018).
Fowler (2012) Austin G Fowler, “Time-optimal quantum computation,” arXiv:1210.4626 (2012).
Gidney and Fowler (2019b) Craig Gidney and Austin G Fowler, “Flexible layout of surface code computations using autoccz states,” arXiv:1905.08916 (2019b).
Delfosse et al. (2020) Nicolas Delfosse, Ben W Reichardt, and Krysta M Svore, “Beyond single-shot fault-tolerant quantum error correction,” arXiv:2002.05180 (2020).
Eastin and Knill (2009) Bryan Eastin and Emanuel Knill, “Restrictions on Transversal Encoded Quantum Gate Sets,” Physical Review Letters 102, 110502 (2009).
Fowler (2013) Austin G. Fowler, “Optimal complexity correction of correlated errors in the surface code,” arXiv:1310.0863 (2013).
Raussendorf and Harrington (2007) Robert Raussendorf and Jim Harrington, “Fault-Tolerant Quantum Computation with High Threshold in Two Dimensions,” Physical Review Letters 98, 190504 (2007).
Bravyi and König (2013) Sergey Bravyi and Robert König, “Classification of topologically protected gates for local stabilizer codes,” Physical review letters 110, 170503 (2013).
Vasmer and Browne (2019) Michael Vasmer and Dan E Browne, “Three-dimensional surface codes: Transversal gates and fault-tolerant architectures,” Physical Review A 100, 012312 (2019).
Brown (2020) Benjamin J Brown, “A fault-tolerant non-clifford gate for the surface code in two dimensions,” Science advances 6, eaay4929 (2020).
Bombin (2018) Hector Bombin, “2d quantum computation with 3d topological codes,” arXiv preprint arXiv:1810.09571 (2018).
Anderson et al. (2014) Jonas T Anderson, Guillaume Duclos-Cianci, and David Poulin, “Fault-tolerant conversion between the steane and reed-muller quantum codes,” Physical Review Letters 113, 080501 (2014).
Paetznick and Reichardt (2013) Adam Paetznick and Ben W Reichardt, “Universal fault-tolerant quantum computation with only transversal gates and error correction,” Physical Review Letters 111, 090505 (2013).
Bombín (2015) Héctor Bombín, “Gauge color codes: optimal transversal gates and gauge fixing in topological stabilizer codes,” New Journal of Physics 17, 083002 (2015).
Jochym-O’Connor and Laflamme (2014) Tomas Jochym-O’Connor and Raymond Laflamme, “Using concatenated quantum codes for universal fault-tolerant quantum gates,” Physical Review Letters 112, 010505 (2014).
Yoder et al. (2016) Theodore J Yoder, Ryuji Takagi, and Isaac L Chuang, “Universal fault-tolerant gates on concatenated stabilizer codes,” Physical Review X 6, 031039 (2016).
Yoder (2017) Theodore J Yoder, “Universal fault-tolerant quantum computation with bacon-shor codes,” arXiv preprint arXiv:1705.01686 (2017).
Chamberland et al. (2017) Christopher Chamberland, Tomas Jochym-O’Connor, and Raymond Laflamme, “Overhead analysis of universal concatenated quantum codes,” Physical Review A 95, 022313 (2017).
(78) Michael E Beverland, Aleksander Kubica, and Krysta M Svore, “The cost of universality: A comparative study of the overhead of state distillation and code switching with color codes,” arXiv preprint arXiv:2101.02211 .
Bravyi et al. (2010) Sergey Bravyi, David Poulin, and Barbara Terhal, “Tradeoffs for reliable quantum information storage in 2d systems,” Physical Review Letters 104, 050503 (2010).
Bravyi (2011) Sergey Bravyi, “Subsystem codes with spatially local generators,” Physical Review A 83, 012320 (2011).
Tillich and Zémor (2013) Jean-Pierre Tillich and Gilles Zémor, “Quantum ldpc codes with positive rate and minimum distance proportional to the square root of the blocklength,” IEEE Transactions on Information Theory 60, 1193–1202 (2013).
Panteleev and Kalachev (2019) Pavel Panteleev and Gleb Kalachev, “Degenerate quantum ldpc codes with good finite length performance,” arXiv preprint arXiv:1904.02703 (2019).
Grospellier et al. (2020) Antoine Grospellier, Lucien Grouès, Anirudh Krishna, and Anthony Leverrier, “Combining hard and soft decoders for hypergraph product codes,” arXiv preprint arXiv:2004.11199 (2020).
Higgott and Breuckmann (2020) Oscar Higgott and Nikolas P Breuckmann, “Subsystem codes with high thresholds by gauge fixing and reduced qubit overhead,” arXiv preprint arXiv:2010.09626 (2020).
Gottesman (2013) Daniel Gottesman, “Fault-tolerant quantum computation with constant overhead,” arXiv preprint arXiv:1310.2984 (2013).
Krishna and Poulin (2019) Anirudh Krishna and David Poulin, “Fault-tolerant gates on hypergraph product codes,” arXiv preprint arXiv:1909.07424 (2019).

Appendix A Accounting for error-correction costs

In the main text, we provide an estimate for the time that it takes to perform a single Toffoli gate with optimized factories within the surface code. The crux of the argument in the main text, is that this time is so much slower than the classical equivalent, there is a massive overhead which must be first overcome. We believe that it is valuable in directing future research in error-correction and algorithms to break down the origin of this overhead into its contributions from quantum error-correction and the physical device speed itself. Here we do so in some detail for the case of the surface code in superconducting qubits, and in passing for ion traps. We hope that this discussion will elucidate several avenues through which breakthroughs in error-correction might materially change the analysis of the main text.

To begin, we will assume that there is a physical two-qubit operation and syndrome measurement speed, $\tau$ and $\tau_{s}$ , where $\tau_{s}>\tau$ as $\tau$ is used to build measurement circuits along with a base physical measurement time $\tau_{m}$ . Modern fault-tolerant error-correction proceeds via rounds of syndrome extraction, processing, and correction in order to implement gates. The core physical operation of these rounds on the device is measurement of syndromes, and we are hence lower bounded by the measurement time $\tau_{s}$ in realistic settings. For context, estimates of these times for high fidelity superconducting qubits that would be realistic upon improvement are roughly $\tau\approx 10\,{\rm ns}$ and $\tau_{m}\approx 100\,{\rm ns}$ .

For a networked ion trap device, there are extra nuances in estimating a realistic syndrome measurement speed Trout et al. (2018). Currently, high-fidelity two-qubit gates and measurements take approximately $\tau\approx 100\,\mu{\rm s}$ and $\tau_{m}\approx 10\,\mu{\rm s}$ Ballance et al. (2016); Crain et al. (2019), although high-fidelity microsecond gates have also been demonstrated Schäfer et al. (2018). Most proposals are limited by a typical trap frequency of $\sim 1\,\text{MHz}$ , although this limit is not fundamental Wong-Campos et al. (2017) and submicrosecond gate times are possible García-Ripoll et al. (2003).

In addition, communicating between different crystals will likely introduce significant overhead. When using photonic interconnects, the mean connection rate between different modules will be fundamentally limited by the emission rate, which for typical atomic transitions into free space will be $\sim 100\,\text{MHz}$ . However, current state-of-the-art entanglement generation occurs in the $\sim 200\,\text{Hz}$ regime Stephenson et al. (2020). When accounting for fractional light collection and single-photon detector efficiency, we can ambitiously estimate future mean connection rates of $\sim 10\,\text{kHz}$ Brown et al. (2016); Hucul et al. (2015), which may be amplified by generating entanglement in parallel at an additional cost in space. Without photonic interconnects, shuttling and cooling will introduce additional slowdowns Kielpinski et al. (2002), and can currently take hundreds of microseconds Pino et al. (2020). With photonic interconnects, shuttling may still be required to isolate memory ions from light scattered during entanglement generation, although this can be mitigated by using a different atomic species for communication Tan et al. (2015); Ballance et al. (2015); Negnevitsky et al. (2018).

None of these components are fundamentally limited below $\sim 1\,\text{MHz}$ . However, many of them must act several times in concert to measure a single round of syndromes. Consequently, $\tau_{s}\approx 100\,{\rm\mu s}$ seems an ambitious goal, and is commensurate with earlier estimates Brown et al. (2016); Lekitsch et al. (2017); Nickerson et al. (2014).

If one had perfect operations, but still performed gates via a synthesized and fault-tolerant protocol, these would lower-bound the achievable runtime for a gate. As our operations are not perfect, however, we will need to encode in an error-correcting code with some distance $d$ which is chosen based on the error rate in our device, threshold of the code, and total number of operations we expect to perform. If one is allowed to use numerous ancilla qubits, this need not expand the runtime of individual operations by exploiting parallelism through teleportation and spacetime optimization Fowler (2012); Gidney and Fowler (2019b). However, more qubit spartan implementations must use $d$ rounds of measurement and correction to protect against measurement errors in the time direction, adding a factor of ${\cal O}(d)$ in the time cost. Research into one-shot correction techniques hopes to alleviate this time dependence on $d$ without excessive space overhead Delfosse et al. (2020), but current code constructions are not readily implementable.

On top of each round of these measurements, we must account for the time for this information to leave the device, be processed via decoding, and in some cases, implement active recovery after a gate, where this time depends on the hardware and complexity of the decoding. In order for error-correction to be efficient, it must be possible to process the syndrome data without an accumulation of rounds that grows in time. If we denote this processing latency as $l_{r}$ , then the time for processing $d$ rounds is lower bounded approximately by the time it takes to produce those syndrome measurements on the physical device plus this latency, or $(d\tau_{s}+l_{r})$ . Note that depending on the implementation details, $l_{r}$ is likely to depend on $d$ , but with sufficient classical parallelization it may be possible to make it effectively $d$ independent.

On top of these costs, each gate has some associated prefactor in number of rounds that depends on the type of gate and its logical locality, $C_{G}$ . For easy, or Clifford, gates in most codes, $C_{G}$ can be made near $1$ . Unfortunately, in order to perform universal computation, one requires a gate which is not easy to implement Eastin and Knill (2009), and common proposals center around state distillation where the prefactor $C_{G}$ is often on the order of $10$ . Moreover, if one considers synthesis of arbitrary rotations into multiple of these hard gates, $C_{G}$ can multiply by a factor of $10$ or more depending on the precision, leaving $C_{G}$ on the order of $100$ . Putting these together, we can approximate a lower bound on the quantum gate time scaling in terms of error-correction parameters as

\displaystyle t_{G}\propto C_{G}\,(d\tau_{s}+l_{r})

(13)

Now that we have a general picture of how the time overhead enters for quantum error-correction, we examine it in a specific gate and context. In particular, we focus on superconducting qubits with feasible error rates and operation times within the surface code. Toffoli gates are required to implement classical logic on a quantum computer but cannot be implemented transversally within practical implementations of the surface code. Instead, one must implement these gates by first distilling resource states. To implement a Toffoli gate one requires a CCZ state ( $\mathinner{|{\rm CCZ}\rangle}={\rm CCZ}\mathinner{|{+++}\rangle}$ ) and these states are consumed during the implementation of the gate. Distilling CCZ states requires a substantial amount of both time and hardware and thus, they are usually the bottleneck in realizing quantum algorithms within the surface code.

Here, we will focus on the state-of-the-art Toffoli factory constructions of Gidney and Fowler (2019a) which are based on applying the lattice surgery constructions of Fowler and Gidney (2018) to the fault-tolerant Toffoli protocols of Jones (2013); Eastin (2013). Using that approach one can distill one CCZ state using two levels of state distillation with $5.5\,d+{\cal O}(1)$ surface code cycles and a factory with a data qubit footprint of about $12\,d\times 6\,d$ where $d$ is the code distance (the total footprint includes measurement qubits as well, and is thus roughly double this number). Hence for the Toffoli gate, we take $C_{G}\approx 5.5$ .

We will assume a correlated-error minimum weight perfect matching decoder capable of keeping pace with 1 $\mu$ s rounds of surface code error detection Fowler (2013), and capable of performing with a similar latency of feedforward in about 1 $\mu$ s for $d$ around 30, and conservatively lower bound the overall time for $d$ rounds to then be $(d\tau_{s}+l_{r})\leq 30\,\mu s$ . We will also assume physical gate error rates in the vicinity of $10^{-3}$ , which we hope will be achievable at scale in the next decade. Since we expect to require on the order of billions of Toffoli gates to achieve quantum advantage for practical applications (we will see this is actually a significant underestimate for the case of quadratic speedups) we will assume that a code distance in the vicinity of $d=30$ will be sufficient (since errors are suppressed exponentially in code distance this number will be approximately correct).

With these assumptions, our model predicts a Toffoli gate time of $t_{G}=C_{G}(d\tau_{s}+l_{r})\approx 5.5\times 30\,\mu\text{s}\,\approx 170\,\mu$ s. This rough approximation matches the more detailed resource estimate which shows the spacetime volume required to implement one Toffoli gate is approximately 23 qubit seconds Gidney and Fowler (2019a). We discuss the resources required for distillation in terms of qubitseconds because it is generally possible to make tradeoffs between space and time but the critical resource to minimize is actually the product of the two. Under these assumptions we would be able to distill a Toffoli gate in about 170 $\mu{\rm s}$ using around 130,000 physical qubits (see the resource estimation spreadsheet in Gidney and Fowler (2019a) for detailed assumptions). Due to this large overhead we focus on estimates assuming we distill CCZ states in series, which is likely how we would operate early fault-tolerant surface code computers. For comparison, if ion trap devices used a similar surface code implementation and error rates while achieving a syndrome measurement time of $\tau_{s}=100\,\mu{\rm s}$ in parallel, the gate time assuming $C_{G}\approx 5.5$ is $t_{G}\approx 17,\!000\,\mu{\rm s}$ , or roughly a factor of 100 slower.

To make this more concrete, we can convert this to a unitless error-correction overhead for a particular gate of $C_{G}(d\tau_{s}+l_{r})/\tau_{s}$ . If we keep the $30\,\mu{\rm s}$ overall bound for $(d\tau_{s}+l_{r})$ , and make a reasonable estimate for the improvement of physical syndrome measurement times for superconducting qubits to $100\,{\rm ns}$ , then the error-correction overhead at this distance is $1,\!700$ .

This suggests that at present for superconducting qubits, the most fruitful improvements with regards to algorithmic speed are the reduction of decoding time, the minimization of time overheads in distillation factories, and then the reduction of number of measurement rounds required to protect in the time direction, perhaps through improved gate fidelities for equivalent operation times to result in lower required distances or through single shot protocols. If this can be achieved, the next milestones would be the reduction of physical syndrome extraction time. However, even such advances would already make prospects for realizing a quantum advantage with quadratic speedups considerably more enticing.

Appendix B Alternative approaches

Throughout this work, we have focused on the time cost of surface code implementations of non-Clifford gates, as the expense of such gates has been highly optimized given the connectivity constraints of a superconducting quantum device. Furthermore, the surface code is one of the few error-correction schemes that can operate efficiently in the high noise regime, with gate infidelities in the range of $10^{-3}$ Raussendorf and Harrington (2007); Fowler et al. (2012).

However, there are limitations when considering a two-dimensional architecture. We have chosen to report on the time-cost of logical gates (while keeping the space cost low). By this metric, transversal gates are minimally expensive, yet non-Clifford gates cannot be implemented transversally in the surface code. In fact, any constant depth circuit on a 2D-local stabilizer code must be Clifford Bravyi and König (2013), often leading to a time-dependence on $d$ . Additional connectivity in the device — and likely a requirement of lower error rates — opens up many more avenues towards universal fault-tolerant logic. Beyond magic state distillation, contemporary approaches include, but are not limited to:

1.

Computing with 3D local codes, where non-Clifford gates are transversal Vasmer and Browne (2019) (and may be realized dynamically in 2D Brown (2020); Bombin (2018)).
2.

Code switching between codes with complementary transversal gate sets Anderson et al. (2014).
3.

Fixing the gauge of subsystem codes, where different gauges admit different transversal gates Paetznick and Reichardt (2013); Bombín (2015).
4.

Concatenating codes, each supporting a complementary transversal gate set Jochym-O’Connor and Laflamme (2014).
5.

Pieceable fault-tolerance, which breaks non-transversal gates into fault-tolerant pieces Yoder et al. (2016); Yoder (2017).

For non-Clifford gates, there are tradeoffs that can be made to mitigate the distance and routing overhead of our bound at the expense of many more qubits. Whether or not these constructions may yield a lower space and time overhead at reasonable error rates is speculative. Numerical studies of alternative schemes have yet to show a convincing advantage Chamberland et al. (2017); Beverland et al. , and often require $<10^{-3}$ error rates to operate efficiently, although their spacetime footprint can be smaller Brown (2020). Nonetheless, significant optimizations or new approaches are likely needed to recoup a reasonable-sized quadratic quantum advantage.

A separate avenue towards reducing the space overhead of error-correction are block encodings. While surface codes are robust, they require very many qubits per logical qubit. More generally, 2D-local code families are fundamentally restricted to have a vanishing ratio of logical qubits to physical qubits, assuming an underlying growing code distance Bravyi et al. (2010); Bravyi (2011). In an all-to-all connected device, a non-vanishing rate is possible without sacrificing the low stabilizer weight often essential for good performance Tillich and Zémor (2013). In particular, there have been promising numerical studies of high-density memories in idealized noise settings using variations on efficient belief propagation decoding Panteleev and Kalachev (2019); Grospellier et al. (2020).

For the purposes of our bound, we have assumed first generation fault-tolerant devices will support hundreds of logical qubits each with a distance of $d\sim 30$ , and few distillation factories. The total qubit footprint of such a device will remain in the $10^{5}-10^{6}$ qubit range. By comparison, there exist intermediate-size families of block codes that can encode hundreds of logical qubits using only $10^{3}-10^{4}$ qubits. However, it is difficult to predict the consequences of using such encodings in the context of our computation-time bound. The required physical error rates may be untenably low, with relatively few studies predicting performance in circuit-level error models Higgott and Breuckmann (2020). In addition, performing gates efficiently on such codes is a difficult task Gottesman (2013); Krishna and Poulin (2019). As such, while a significant reduction in memory space may result, the role of such codes in future fault-tolerant devices remains unclear.