
Dynamical decoupling for superconducting qubits: a performance survey

Nic Ezzell [email protected] Department of Physics and Astronomy, University of Southern California, Los Angeles, CA, 90089, USA Center for Quantum Information Science & Technology, University of Southern California, Los Angeles, CA, 90089, USA    Bibek Pokharel [email protected] Department of Physics and Astronomy, University of Southern California, Los Angeles, CA, 90089, USA Center for Quantum Information Science & Technology, University of Southern California, Los Angeles, CA 90089, USA    Lina Tewala [email protected] Johns Hopkins University Applied Physics Laboratory, Laurel, MD, 20723, USA Thomas C. Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, MD, 21218, USA    Gregory Quiroz [email protected] Johns Hopkins University Applied Physics Laboratory, Laurel, MD, 20723, USA William H. Miller III Department of Physics & Astronomy, Johns Hopkins University, Baltimore, Maryland 21218, USA    Daniel A. Lidar [email protected] Department of Physics and Astronomy, University of Southern California, Los Angeles, CA, 90089, USA Center for Quantum Information Science & Technology, University of Southern California, Los Angeles, CA 90089, USA Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089, USA Department of Chemistry, University of Southern California, Los Angeles, CA, 90089, USA
Abstract

Dynamical Decoupling (DD) is perhaps the simplest and least resource-intensive error suppression strategy for improving quantum computer performance. Here we report on a large-scale survey of the performance of 60 different DD sequences from 10 families, including basic as well as advanced sequences with high-order error cancellation properties and built-in robustness. The survey is performed using three different superconducting-qubit IBMQ devices, with the goal of assessing the relative performance of the different sequences in the setting of arbitrary quantum state preservation. We find that the high-order universally robust (UR) and quadratic DD (QDD) sequences generally outperform all other sequences across devices and pulse interval settings. Surprisingly, we find that DD performance for basic sequences such as CPMG and XY4 can be made to nearly match that of UR and QDD by optimizing the pulse interval, with the optimal interval being substantially larger than the minimum interval possible on each device.

I Introduction

In the pre-fault-tolerance era, quantum computing research has two main near-term goals: to examine the promise of quantum computers via demonstrations of quantum algorithms [1, 2, 3] and to understand how quantum error correction and other noise mitigation methods can pave a path towards fault-tolerant quantum computers [4, 5]. The last decade has seen the rise of multiple cloud-based quantum computing platforms that allow a community of researchers to test error suppression and correction techniques [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Error suppression using dynamical decoupling (DD) [17, 18, 19, 20, 21] is among the earliest methods to have been experimentally demonstrated, using experimental platforms such as trapped ions [22, 23], photonic qubits [24], electron paramagnetic resonance [25], nuclear magnetic resonance (NMR) [26, 27, 28], trapped atoms [29] and nitrogen vacancies in diamond [30]. It is known that DD can be used to improve the fidelity of quantum computation both without [31, 32, 33, 34, 35, 36, 37, 38] and with quantum error correction [39, 40]. Several recent cloud-based demonstrations have shown that DD can unequivocally improve the performance of superconducting-qubit based devices [41, 42, 43, 44, 45, 46], even leading to algorithmic quantum advantage [47].

In this work, we systematically compare a suite of known and increasingly elaborate DD sequences developed over the past two decades (see Table 1 for a complete list and Ref. [48] for a detailed review). These DD sequences reflect a growing understanding of how to build features that suppress noise to increasingly higher order and with greater robustness to pulse imperfections. Our goal is to study the efficacy of the older and the more recent advanced sequences on currently available quantum computers. To this end, we implement these sequences on three different IBM Quantum Experience (IBMQE) transmon qubit-based platforms: ibmq_armonk (Armonk), ibmq_bogota (Bogota), and ibmq_jakarta (Jakarta). We rely on the open-pulse functionality [49] of IBMQE, which enables us to precisely control the pulses and their timing. The circuit-level implementation of the various sequences can be suboptimal, as we detail in the Appendix.

We assess these DD sequences for their ability to preserve an arbitrary single-qubit state. Previous work, focused on the XY4 sequence, has studied the use of DD to improve two-qubit entanglement [41] and the fidelity of two-qubit gates [44], and we leave a systematic survey of the multi-qubit problem for a future publication, given that the single-qubit case is already a rich and intricate topic, as we discuss below. By and large, we find that all DD sequences outperform the “unprotected” evolution (without DD). The higher-order DD sequences, like concatenated DD (CDD [50]), Uhrig DD (UDD [51]), quadratic DD (QDD [52]), nested UDD (NUDD [53]) and universally robust (UR [54]), perform consistently well across devices and pulse placement settings. While these more elaborate sequences are statistically better than the traditional sequences such as Hahn echo [55], Carr-Purcell-Meiboom-Gill (CPMG), and XY4 [56] for short pulse intervals, their advantage diminishes with sparser pulse placement. As both systematic and random errors, e.g., due to finite pulse-width and limited control, are reduced, advanced sequences will likely provide further performance improvements. Overall, our study indicates that the robust DD sequences can be viewed as the preferred choice over their traditional counterparts.

The structure of this paper is as follows. In Section II we review the pertinent DD background and describe the various pulse sequences we tested. In Section III, we detail the cloud-based demonstration setup, the nuances of DD sequence implementation, and the chosen success metrics. We describe the results and what we learned about the sequences and devices in Section IV. A summary of results and possible future research directions are provided in Section V. Additional details are provided in the Appendix.

II Dynamical decoupling background

For completeness, we first provide a brief review of DD. In this section we focus on a small subset of all sequences studied in this work, primarily to introduce key concepts and notation. The details of all the other sequences are provided in Appendix A. The reader who is already an expert in the theory may wish to skim this section to become familiar with our notation.

II.1 DD with perfect pulses

Consider a time-independent Hamiltonian

H = H_{S} + H_{B} + H_{SB},   (1)

where H_S and H_B contain terms that act, respectively, only on the system or the bath, and H_{SB} contains the system-bath interactions. We write H_S = H_S^0 + H_S^1, where H_S^1 represents an undesired, always-on term (e.g., due to crosstalk), so that

H_{\text{err}} = H_{S}^{1} + H_{SB}   (2)

represents the “error Hamiltonian” we wish to remove using DD. H_S^0 contains all the terms we wish to keep. The corresponding free unitary evolution for duration τ is given by

f_{\tau} \equiv U(\tau) = \exp(-i\tau H).   (3)

DD is generated by an additional, time-dependent control Hamiltonian H_c(t) acting purely on the system, so that the total Hamiltonian is

H(t) = H_{S}^{0} + H_{\text{err}} + H_{B} + H_{c}(t).   (4)

An “ideal”, or “perfect”, pulse sequence is generated by a control Hamiltonian that is a sum of error-free, instantaneous Hamiltonians {Ω_0 H_{P_k}}_{k=1}^{n} that generate the pulses at corresponding intervals {τ_k}_{k=1}^{n}:

\hat{H}_{c}(t) = \Omega_{0} \sum_{k=1}^{n} \delta(t - t_{k}) H_{P_{k}}, \quad t_{k} = \sum_{j=1}^{k} \tau_{j},   (5)

where we use the hat notation to denote ideal conditions and let Ω_0 have units of energy. Choosing Ω_0 such that Ω_0 Δ = π/2, where Δ is the “width” of the Dirac delta function (this is made rigorous when we account for pulse width in Section II.2.1 below), gives rise to instantaneous unitaries or pulses

\hat{P}_{k} = e^{-i\frac{\pi}{2} H_{P_{k}}},   (6)

so that the total evolution is:

\tilde{U}(T) = f_{\tau_{n}} \hat{P}_{n} \cdots f_{\tau_{2}} \hat{P}_{2} f_{\tau_{1}} \hat{P}_{1},   (7)

where T ≡ t_n = Σ_{j=1}^{n} τ_j is the total sequence time. The unitary Ũ(T) = U_0(T)B(T) can be decomposed into the desired error-free evolution U_0(T) = exp(−iT H_S^0) ⊗ I_B and the unitary error B(T). Ideally, B(T) = I_S ⊗ e^{−iT B̃}, where B̃ is an arbitrary Hermitian bath operator. Hence, by applying N repetitions of an ideal DD sequence of duration T, the system stroboscopically decouples from the bath at uniform intervals T_j = jT for j = 1, …, N. In reality, we only achieve approximate decoupling, so that B(T) = I_S ⊗ e^{−iT B̃} + err, and the history of DD design is motivated by making the error term as small as possible under different and increasingly more realistic physical scenarios.

II.1.1 First order protection

Historically, the first observation of stroboscopic decoupling came from nuclear magnetic resonance (NMR) spin echoes observed by Erwin Hahn in 1950 [55] with a single X pulse. (We use X = σ^x interchangeably, and likewise for Y and Z, where σ^α denotes the α’th Pauli matrix, with α ∈ {x, y, z}; see Appendix A for a precise definition of all sequences.) Several years later, Carr & Purcell [57] and Meiboom & Gill [58] independently proposed the improved CPMG sequence with two X pulses. In theory, both sequences are only capable of partial decoupling, even in the ideal pulse limit. In particular, B(T) ≈ I_S ⊗ e^{−iT B̃} only for states near |±⟩ = (|0⟩ ± |1⟩)/√2 (where |0⟩ and |1⟩ are the +1 and −1 eigenstates of σ^z, respectively), as we explain below. Nearly four decades after Hahn’s work, Maudsley proposed the XY4 sequence [56], which is universal in the sense that B(T) ≈ I_S ⊗ e^{−iT B̃} on the full Hilbert space, meaning all states are equally protected. Equivalently, universality means that arbitrary single-qubit interactions with the bath are decoupled to first order in τ.

To make this discussion more precise, we first write H_B + H_{SB} in a generic way for a single qubit:

\overline{H} \equiv H_{B} + H_{SB} = \sum_{\alpha=0}^{3} \gamma_{\alpha} \sigma^{\alpha} \otimes B^{\alpha},   (8)

where the B^α are bath terms and σ^{(0)} = I.

Since distinct Pauli operators anti-commute, i.e., {σ^i, σ^j} = 2Iδ_{ij}, it follows that for k ≠ 0,

\sigma^{k} \overline{H} \sigma^{k} = \gamma_{0} I \otimes B^{0} + \gamma_{k} \sigma^{k} \otimes B^{k} - \sum_{\alpha \neq 0,k} \gamma_{\alpha} \sigma^{\alpha} \otimes B^{\alpha}.   (9)

The minus sign is an effective time-reversal of the terms that anti-commute with σ^k. In the ideal pulse limit, this is enough to show that the “pure-X” sequence, defined as

\text{PX} \equiv X - f_{\tau} - X - f_{\tau},   (10)

induces an effective error Hamiltonian

H^{\text{eff}}_{\text{PX}} = \gamma_{x} \sigma^{x} \otimes B^{x} + I_{S} \otimes \tilde{B} + \mathcal{O}(\tau^{2})   (11)

every 2τ. Note that CPMG is defined similarly:

\text{CPMG} \equiv f_{\tau/2} - X - f_{\tau} - X - f_{\tau/2},   (12)

which is just a symmetrized placement of the pulse intervals; see Section II.3. PX and CPMG have the same properties in the ideal pulse limit, but we choose to begin with PX for simplicity of presentation. Intuitively, the middle X − f_τ − X is a time-reversed evolution of the σ^{(y,z)} terms, followed by a forward evolution; these cancel to first order in τ by the Zassenhaus formula [59], exp(τ(A+B)) = e^{τA} e^{τB} + 𝒪(τ²), an expansion that is closely related to the familiar Baker-Campbell-Hausdorff (BCH) formula. The undesired noise term γ_x σ^x ⊗ B^x does not decohere |±⟩, but all other states are subject to bit-flip noise in the absence of suppression. By adding a second rotation axis, y, the XY4 sequence,

\text{XY4} \equiv Y - f_{\tau} - X - f_{\tau} - Y - f_{\tau} - X - f_{\tau}   (13)

cancels the remaining σ^x term and achieves universal (first-order) decoupling at time 4τ:

H^{\text{eff}}_{\text{XY4}} = I_{S} \otimes \tilde{B} + \mathcal{O}(\tau^{2}).   (14)

Practically, this means that all single-qubit states are equally protected to first order. These results can be generalized by viewing DD as a symmetrization procedure [60], with an intuitive geometrical interpretation wherein the pulses replace the original error Hamiltonian by a sequence of Hamiltonians that are arranged symmetrically so that their average cancels out [61].
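The first-order cancellation in Eq. 14 is easy to check numerically. The following minimal sketch (ours, not from the paper) assumes a toy model with a single bath qubit and randomly chosen Hermitian bath operators B^α; it builds the ideal-pulse evolution of Eq. 7 and verifies that halving τ roughly halves the free-evolution error but quarters the XY4 error:

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])
I2 = np.eye(2, dtype=complex)

def rand_herm():
    # random Hermitian bath operator B^alpha acting on one bath qubit
    A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    return (A + A.conj().T) / 2

# generic single-qubit error Hamiltonian, Eq. (8), with gamma_alpha absorbed into B^alpha
H = sum(np.kron(s, rand_herm()) for s in (I2, X, Y, Z))

def evolve(pulses, tau):
    # ideal-pulse DD evolution of Eq. (7): each pulse is followed by a free period f_tau
    U = np.eye(4, dtype=complex)
    for P in pulses:
        U = expm(-1j * tau * H) @ np.kron(P, I2) @ U
    return U

def eta(U):
    # simple proxy for Eqs. (16)-(17): distance to the closest operator of the form I_S (x) B
    B = np.einsum('ibid->bd', U.reshape(2, 2, 2, 2)) / 2
    return np.linalg.norm(U - np.kron(I2, B), 2)

for tau in (1e-2, 5e-3):
    print(tau, eta(evolve([I2] * 4, tau)), eta(evolve([X, Y, X, Y], tau)))
# halving tau halves the free-evolution error (first order in tau) but roughly
# quarters the XY4 error, consistent with the O(tau^2) remainder in Eq. (14)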

II.1.2 Higher order protection

While the XY4 sequence is universal for qubits, it only provides first-order protection. A great deal of effort has been invested in developing DD sequences that provide higher-order protection. We start with concatenated dynamical decoupling, or CDD_n [50], an n-th-order recursion of XY4. (The construction works for any base sequence, but we specify XY4 here for ease of presentation, since our results labeled CDD_n always assume XY4 as the base sequence.) For example, CDD_1 ≡ XY4 is the base case, and

\text{CDD}_{n} \equiv \text{XY4}([\text{CDD}_{n-1}])   (15a)
= Y - [\text{CDD}_{n-1}] - X - [\text{CDD}_{n-1}] - Y - [\text{CDD}_{n-1}] - X - [\text{CDD}_{n-1}],   (15b)

which is just the definition of XY4 in Eq. 13 with every f_τ replaced by CDD_{n−1}. This recursive structure leads to an improved error term 𝒪(τ^{n+1}), provided τ is “small enough.” To make this point precise, we must define a measure of error under DD. Following Ref. [39], one useful way to do this is to separate the “good” and “bad” parts of the joint system-bath evolution, i.e., to split Ũ(T) [Eq. 7] as

\tilde{U}(T) = \mathcal{G} + \mathcal{B},   (16)

where 𝒢 = U_0(T) ⊗ B′(T), and where, as above, U_0(T) is the ideal operation that would be applied to the system in the absence of noise, and B′(T) is a unitary transformation acting purely on the bath. The operator ℬ is the “bad” part, i.e., the deviation of Ũ(T) from the ideal operation. The error measure is then (throughout, we use the sup operator norm, i.e., the largest singular value: ‖A‖ ≡ sup_{|v⟩} ‖A|v⟩‖/‖|v⟩‖ = sup_{‖|v⟩‖=1} ‖A|v⟩‖)

\eta_{\text{DD}} = \|\mathcal{B}\|.   (17)

Put simply, η_DD measures how far the DD-protected evolution Ũ(T) is from the ideal evolution 𝒢. With this error measure established, we can bound the performance of various DD sequences in terms of the relevant energy scales:

\beta \equiv \|H_{B}\|, \quad J \equiv \|H_{SB}\|, \quad \epsilon \equiv \beta + J.   (18)

Using these definitions, we can replace the coarse 𝒪 estimates with rigorous upper bounds on η_DD. In particular, as shown in Ref. [39],

\eta_{\text{XY4}} = (4J\tau)\left[\frac{1}{2}(4\epsilon\tau) + \frac{2}{9}(4\epsilon\tau)^{2}\right] + \mathcal{O}(\tau^{3})   (19a)
\eta_{\text{CDD}_{n}} = 4^{n(n+3)/2} (c\epsilon\tau)^{n} (J\tau) + \mathcal{O}(\tau^{n+2}),   (19b)

where c is a constant of order 1. This more careful analysis implies that (1) ετ ≲ 1 is sufficient for XY4 to provide error suppression, and (2) CDD_n has an optimal concatenation level, induced by the competition between longer sequence times [the bad ∼4^{n²} scaling] and stronger error suppression [the good (cετ)^n scaling]. The corresponding optimal concatenation level is

n_{\text{opt}} = \lfloor \log_{4}(1/\overline{c}\epsilon\tau) - 1 \rfloor,   (20)

where ⌊·⌋ is the floor function and c̄ is another constant of order 1 (defined in Eq. (165) of Ref. [39]). That such a saturation in performance should occur is fairly intuitive. By adding more layers of recursion, we suppress noise that was previously unsuppressed. At the same time, however, we introduce more periods of free evolution f_τ, which cumulatively add more noise. At some point the noise wins, since there is no active noise removal in an open-loop procedure such as DD.
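The recursive structure of Eq. 15 is compact enough to state in a few lines of code. The sketch below (ours; the pulse labels are symbolic) generates the CDD_n pulse pattern and counts its pulses, illustrating the ∼4^n growth that competes with the (cετ)^n suppression discussed above:

def cdd(n):
    """Pulse pattern of CDD_n, Eq. (15); 'f' marks a free-evolution period f_tau."""
    if n == 0:
        return ['f']                                  # CDD_0: a bare free period
    inner = cdd(n - 1)
    return (['Y'] + inner + ['X'] + inner) * 2        # XY4 with every f -> CDD_{n-1}

for n in range(1, 5):
    seq = cdd(n)
    print(n, sum(p != 'f' for p in seq))
# prints 4, 20, 84, 340 pulses: O(4^n) growth (the commonly quoted ~4^n count
# follows after merging adjacent pulses at the concatenation boundaries)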

Though CDD_n, derived from recursive symmetrization, allows for 𝒪(τ^{n+1}) suppression, it employs ∼4^n pulses. One may ask whether a shorter sequence could achieve the same goal. The answer is provided by the Uhrig DD (UDD) sequence [51]. The idea is to find the DD sequence that acts as an optimal filter function on the noise spectral density of the bath while relaxing the constraint of uniform pulse intervals [51, 62, 63, 64]. For a brief overview, we first assume that a qubit state decoheres as e^{−χ(t)}. For a given noise spectral density S(ω),

\chi(t) = \frac{2}{\pi} \int_{0}^{\infty} \frac{S(\omega)}{\omega^{2}} F(\omega t)\, d\omega,   (21)

where the frequency response of the system to DD is captured by the filter function F(ωt). For example, for n ideal π pulses executed at times {t_j} [65],

F_{n}(\omega\tau) = \left|1 + (-1)^{n+1} e^{i\omega\tau} + 2\sum_{j=1}^{n} (-1)^{j} e^{i\omega t_{j}}\right|^{2},   (22)

which can be substituted into Eq. 21 and optimized over {t_j}; the result is UDD [51]. For a desired total evolution time T, the solution (and the definition of UDD) is simply to place π pulses at the nonuniform times

t_{j} = T \sin^{2}\left(\frac{j\pi}{2(n+1)}\right).   (23)

When we use n X-type pulses in particular, we obtain UDDx_n. It turns out that UDDx_n achieves 𝒪(τ^n) suppression for states near |±⟩ using only n pulses, and this is the minimum number of π pulses needed [51, 66]. It is in this sense that UDD is provably optimal. However, it is important to note that this assumes a fixed total sequence time; only in this case can the optimal sequence be used to make the distance between the protected and unperturbed qubit states arbitrarily small in the number of applied pulses. If instead the minimum pulse interval is fixed and the total sequence time is allowed to scale with the number of pulses, then, as in CDD, longer sequences need not always be advantageous [67].
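Eqs. 22 and 23 are simple to evaluate directly. The following sketch (ours; we read the τ appearing in Eq. 22 as the total sequence time T) computes the UDD pulse times and checks the steep low-frequency rolloff of the resulting filter function:

import numpy as np

def udd_times(n, T):
    """UDD pulse times, Eq. (23)."""
    j = np.arange(1, n + 1)
    return T * np.sin(j * np.pi / (2 * (n + 1))) ** 2

def filter_function(omega, tj, T, n):
    """Filter function of Eq. (22) for n ideal pi pulses at times tj."""
    s = 1 + (-1) ** (n + 1) * np.exp(1j * omega * T)
    for j, t in enumerate(tj, start=1):
        s += 2 * (-1) ** j * np.exp(1j * omega * t)
    return abs(s) ** 2

n, T = 5, 1.0
tj = udd_times(n, T)
print(np.round(tj, 4))
for w in (0.2, 0.1):
    print(w, filter_function(w, tj, T, n))
# the ratio of the two printed values should approach 2**(2*(n + 1)), reflecting
# the F ~ (omega*T)^{2(n+1)} low-frequency suppression that makes UDD optimal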

UDD can be improved from a single-axis sequence to the universal quadratic DD (QDD) sequence [52, 66, 68] using recursive design principles similar to those that lead from PX to XY4 and eventually to CDD_n. Namely, to achieve universal decoupling, we use a recursive embedding of a UDDy_m sequence into a UDDx_n sequence. Each X pulse in UDDx_n is separated by a free evolution period f_{t_{j+1}−t_j}, which can be filled with a UDDy_m sequence. Hence, we can achieve min{𝒪(τ^n), 𝒪(τ^m)} universal decoupling, and when m = n, we obtain universal 𝒪(τ^n) decoupling using only ∼n² pulses instead of the ∼4^n of CDD_n. This is nearly optimal [52] and an exponential improvement over CDD_n. When m ≠ n, the exact decoupling properties are more complicated [66]. Similar comments as for UDDx_n regarding the difference between a fixed total sequence time T and a fixed minimum pulse interval apply to QDD as well [69].
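The nesting just described is also easy to make concrete. A sketch (ours; it ignores the end-of-sequence pulse conventions that some UDD variants use) that returns a time-ordered QDD_{n,m} schedule is:

import numpy as np

def udd_times(n, T):
    j = np.arange(1, n + 1)
    return T * np.sin(j * np.pi / (2 * (n + 1))) ** 2   # Eq. (23)

def qdd_schedule(n, m, T):
    """Time-ordered (time, pulse) list for QDD_{n,m}: UDDy_m nested inside UDDx_n."""
    outer = np.concatenate(([0.0], udd_times(n, T), [T]))
    events = [(t, 'X') for t in outer[1:-1]]
    for a, b in zip(outer[:-1], outer[1:]):             # fill each X-free interval
        events += [(a + t, 'Y') for t in udd_times(m, b - a)]
    return sorted(events)

sched = qdd_schedule(3, 3, 1.0)
print(len(sched))   # n + m*(n+1) = 15 pulses for n = m = 3, i.e., O(n^2) when m = n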

While QDD is universal and near-optimal for single-qubit decoherence, the ultimate recursive generalization of UDD is nested UDD (NUDD) [53], which applies to general multi-qubit decoherence and whose universality and suppression properties have been proven and analyzed in a number of works [70, 69, 71]. In the simplest setting, suppression to N-th order of general decoherence afflicting an m-qubit system requires (N+1)^{2m} pulses under NUDD.

sequence | uniform interval | universal | needs OpenPulse
Hahn echo [55] | Y | N | N
PX / CPMG [57, 72] | Y | N | N
XY4 [56] | Y | Y | Y
CDD_n [50] | Y | Y | Y
EDD [73] | Y | Y & N | Y
RGA_n [74] | Y | Y & N | Y
KDD [75] | Y | Y | Y
UR_n [54] | Y | Y | Y
UDDx_n [51] | N | N | N
QDD_{n,m} [52] | N | Y | Y
Table 1: Summary of the DD sequences surveyed in this work, along with the original references. A sequence has a uniform (pulse) interval provided τ_i = τ_j for all i, j [see Eq. 7] and is nonuniform otherwise. A sequence is universal (in theory) if it cancels an arbitrary H_err [Eq. 2] to first order in the pulse interval; practically, this means it protects all states equally well. Otherwise, it protects only a subset of states (e.g., CPMG, which only protects |±⟩). In practice, this distinction is more subtle due to rotating-frame effects, as discussed in Section II.4. For those listed as both Y & N, such as RGA_n, we mean that not all sequences in the family are universal. For example, RGA_2 is not universal but RGA_n for n ≥ 4 is. The last column lists whether our eventual implementation requires OpenPulse [76] or can be implemented faithfully with the traditional circuit API [77, 78] (see Appendix B).

II.2 DD with imperfect pulses

So far, we have reviewed DD theory with ideal pulses. An ideal pulse is instantaneous and error-free, but in reality, finite bandwidth constraints and control errors matter. Much of the work since CPMG has been concerned with (1) accounting for finite pulse width, (2) mitigating errors induced by finite width, and (3) mitigating systematic errors such as over- or under-rotations. We shall address these concerns in order.

II.2.1 Accounting for finite pulse width

During a finite-width pulse P_j, the effect of H_err + H_B cannot be ignored, so the analysis of Section II.1 needs to be modified correspondingly. Nevertheless, both the symmetrization and filter-function approaches can be augmented to account for finite pulse width.

We may write a realistic DD sequence with Δ-width pulses just as in Eq. 7, but with the ideal control Hamiltonian replaced by

H_{c}(t) = \sum_{k} \Omega(t - \tau_{k}) H_{P_{k}}, \quad \int_{-\Delta/2}^{\Delta/2} \Omega(t)\, dt = \Omega_{0}\Delta = \frac{\pi}{2},   (24)

where Ω(t) is sharply (but not infinitely) peaked at t = 0 and vanishes for |t| > Δ/2. The corresponding DD pulses are of the form

P_{k} \equiv \exp\left(-i \int_{\tau_{k}-\Delta/2}^{\tau_{k}+\Delta/2} dt\, [\Omega(t - \tau_{k}) H_{P_{k}} + H_{\text{err}} + H_{B}]\right).   (25)

Note that the pulse intervals remain τ_k as before, now denoting the peak-to-peak interval; the total sequence time therefore remains T = Σ_{k=1}^{n} τ_k. The ideal pulse limit of Eq. 7 is obtained by taking the pulse width to zero, so that H_err + H_B can be ignored:

\hat{P}_{k} = \lim_{\Delta\to 0} P_{k} = e^{-i\frac{\pi}{2} H_{P_{k}}}, \quad \lim_{\Delta\to 0} \Omega(t) = \Omega_{0}\delta(t).   (26)
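To make Eqs. 24 and 25 concrete, the sketch below (ours; the envelope shape, the stand-in for H_err + H_B, and all numerical values are toy choices) calibrates a truncated-Gaussian envelope to area π/2 and builds the finite-width pulse as a first-order discretization of the time-ordered product:

import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1.0 + 0j, -1.0])

Delta, steps = 0.05, 400                          # pulse width and number of time steps
ts = np.linspace(-Delta / 2, Delta / 2, steps)
dt = ts[1] - ts[0]
env = np.exp(-ts ** 2 / (2 * (Delta / 6) ** 2))   # toy envelope Omega(t)
env *= (np.pi / 2) / (env.sum() * dt)             # enforce the area condition of Eq. (24)

H_static = 0.1 * Z                                # toy stand-in for H_err + H_B
P = np.eye(2, dtype=complex)
for w in env:                                     # discretized time-ordered product, Eq. (25)
    P = expm(-1j * dt * (w * X + H_static)) @ P

print(np.round(P, 3))  # close to the ideal exp(-i pi/2 X) = -iX when Delta*||H_static|| is small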

We can then recover a result similar to Eq. 9 by entering the toggling frame with respect to the control Hamiltonian H_c(t) (see Appendix D) and computing η_DD with a Magnus expansion or Dyson series [39]. Though the analysis is involved, the final result is straightforward: η_DD picks up an additional dependence on the pulse width Δ. For example, Eq. 19a is modified to

\eta_{\text{XY4}}^{(\Delta)} = 4J\Delta + \eta_{\text{XY4}},   (27)

which now has a linear dependence on Δ. This new dependence is fairly generic, i.e., the previously discussed PX, XY4, CDD_n, UDD_n, and QDD_{n,m} all have an error η with an additive 𝒪(Δ) dependence. Nevertheless, (1) DD is still effective provided JΔ ≪ 1, and (2) concatenation to order n is still effective provided the JΔ dependence does not dominate the ετ^n dependence. For CDD_n this amounts to an effective noise-strength floor [39],

\eta_{\text{CDD}_{n}} \geq 16\Delta J,   (28)

which modifies the optimal concatenation level n_opt.

II.2.2 Mitigating errors induced by finite width

A natural question is to what extent we can suppress this first-order 𝒪(Δ) dependence. One solution is Eulerian symmetrization (the terminology arises from the Euler cycles traversed by the control unitary in the Cayley graph of the group generated by the DD pulses), which exhibits robustness to pulse-width errors [73, 79, 80]. For example, the palindromic sequence

\text{EDD} \equiv X f_{\tau} Y f_{\tau} X f_{\tau} Y f_{\tau} Y f_{\tau} X f_{\tau} Y f_{\tau} X f_{\tau},   (29)

which is an example of Eulerian DD (EDD), has the error term [39]

\eta_{\text{EDD}} = (8J\tau)\left[\frac{1}{2}(8\epsilon\tau) + \frac{2}{9}(8\epsilon\tau)^{2} + \mathcal{O}(\tau^{3})\right],   (30)

which contains no first-order 𝒪(Δ) term. Nevertheless, the constant factors are twice as large as for XY4, and it turns out that EDD outperforms XY4 only when Δ/τ ≳ 8ετ (see Fig. 9 in Ref. [39]). (To clarify their relationship, consider the ideal Δ → 0 limit: there, XY4 is strictly better, since EDD uses twice as many pulses, and therefore free periods, to accomplish the same 𝒪(τ) decoupling.) The same Eulerian approach can be used to derive pulse-width-robust versions of the Hahn echo and CPMG, which we refer to with a “super” prefix (derived from the “Eulerian supercycle” terminology of Ref. [80]):

\text{super-Hahn} \equiv X f_{\tau} \overline{X} f_{\tau}   (31a)
\text{super-CPMG} \equiv X f_{\tau} X f_{\tau} \overline{X} f_{\tau} \overline{X} f_{\tau},   (31b)

where

\overline{P}_{k} \equiv \exp\left(-i \int_{\tau_{k}-\Delta/2}^{\tau_{k}+\Delta/2} dt\, [-\Omega(t - \tau_{k}) H_{P_{k}} + H_{\text{err}} + H_{B}]\right)   (32)

[compare to Eq. 25]. Intuitively, if X is a finite pulse that generates a rotation about the x axis of the Bloch sphere, then X̄ is (approximately) a rotation about the −x axis, i.e., with the opposite orientation.

These robust sequences, coupled with concatenation, suggest that we can eliminate the effect of pulse width to arbitrary order 𝒪(Δ^n); up to certain caveats, this indeed holds with concatenated dynamically corrected gates (CDCG) [81] (see also Ref. [82]). The caveats include: (1) the analytical CDCG constructions given in Ref. [81] do not accommodate the setting where always-on terms in the system’s Hamiltonian are needed for universal control (so that they cannot simply be included in H_err); (2) control-induced (often multiplicative) noise can be simultaneously present along with bath-induced noise, and multiplicative noise requires modifications to the formalism of Ref. [81]. Moreover, this approach deviates significantly from the sequences consisting of only π rotations we have considered so far. To our knowledge, no strategy better than EDD exists for sequences consisting of only π pulses [79].

II.2.3 Mitigating systematic errors

In addition to finite-width errors, real pulses are also subject to systematic errors. For example, Ω(t) might be slightly miscalibrated, leading to a systematic over- or under-rotation, and any of the aforementioned gains might be lost due to the accumulation of these errors. A useful model of pulses subject to systematic errors is

P^{r}_{j} = \exp\left(\pm i \frac{\pi}{2} (1 + \epsilon_{r}) \sigma^{\alpha}\right)   (33)

for α ∈ {x, y, z}. This represents instantaneous X, Y, Z and X̄, Ȳ, Z̄ pulses subject to a systematic over- or under-rotation by ε_r, also known as a flip-angle error. Another type of systematic control error is axis misspecification, where instead of the intended σ^α in Eq. 33, a linear combination of the form σ^α + ε_β σ^β + ε_γ σ^γ is implemented, with ε_β, ε_γ ≪ 1 and α ≠ β ≠ γ denoting orthogonal axes [30].

Fortunately, even simple π pulses can mitigate systematic errors if rotation axes other than +x and +y are used. We consider three types of sequences: robust genetic algorithm (RGA) DD [74], Knill DD (KDD) [83, 75], and universally robust (UR) DD [54].

RGA DD.—

The basic idea of RGA is as follows. A universal DD sequence should satisfy ∏_{k=1}^{n} P_k = I up to a global phase, but there is a great deal of freedom in the combination of pulses used to satisfy this constraint. In Ref. [74], this freedom was exploited to find, by numerical optimization with genetic algorithms, a class of sequences robust to over- or under-rotations.

Subject to a generic single-qubit error Hamiltonian as in Eq. 8, optimal DD sequences were then found for a given number of pulses under different parameter regimes (i.e., the relative magnitudes of J, β, etc.). This numerical optimization “rediscovered” CPMG, XY4, and Eulerian DD as base sequences with 𝒪(τ²) errors. Higher-order sequences were then found to be concatenations of the latter. For example,

\text{RGA}_{8c} \equiv \text{EDD}   (34a)
\text{RGA}_{64c} \equiv \text{RGA}_{8c}[\text{RGA}_{8c}].   (34b)

A total of 12 RGA sequences were found; more details are given in Appendix A.

KDD.—

The KDD sequence is similar in its goal to RGA, since it too mitigates systematic over- or under-rotations. In design, however, it uses the principle of composite pulses [84, 85, 86]. The idea is to take a universal sequence such as XY4 and replace each pulse P_j with a composite series of pulses (CP)_j that has the same net effect as P_j but removes pulse imperfections by self-averaging; for details, see Appendix A.

The KDD sequence is robust to flip-angle errors [Eq. 33]. For example, suppose ε_r = π/20, and we apply idealized KDD 10 times [which we denote by (KDD)^{10}]. Then by direct calculation, ‖(KDD)^{10} − I‖ ≈ 7 × 10^{−7}, whereas ‖(XY4)^{10} − I‖ ≈ 3 × 10^{−2}; in fact, KDD is robust up to 𝒪(ε_r^5) [48, 54]. This robustness to over-rotations comes at the cost of 20 free evolution periods instead of 4, so we only expect KDD to work well in an ε_r-dominated parameter regime. As a preview of the results we report in Section IV, KDD is not among the top-performing sequences. Hence it appears reasonable to conclude that our demonstrations are conducted in a regime that is not ε_r-dominated.
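A comparison of this kind is straightforward to set up numerically. The sketch below (ours) uses the flip-angle error model of Eq. 33 and assumes the standard Knill composite-pulse phases (π/6, 0, π/2, 0, π/6) embedded at the XY4 phases; the exact conventions for ε_r, the pulse ordering, and the norm are spelled out in Appendix A and may differ in detail, so the printed numbers should be compared only qualitatively with the values quoted above:

import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def pulse(phi, eps_r):
    """pi rotation about the axis (cos phi, sin phi, 0), over-rotated per Eq. (33)."""
    theta = np.pi * (1 + eps_r)
    n_sigma = np.cos(phi) * X + np.sin(phi) * Y
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * n_sigma

def cycle(phases, eps_r):
    U = I2
    for phi in phases:                 # earliest pulse acts first
        U = pulse(phi, eps_r) @ U
    return U

def dist_to_identity(U, ngrid=4001):
    """Operator-norm distance to I, minimized over a global phase (grid search)."""
    return min(np.linalg.norm(U * np.exp(-1j * g) - I2, 2)
               for g in np.linspace(-np.pi, np.pi, ngrid))

eps_r = np.pi / 20
xy4 = [0, np.pi / 2, 0, np.pi / 2]                                # X, Y, X, Y
knill = lambda phi: [phi + np.pi / 6, phi, phi + np.pi / 2, phi, phi + np.pi / 6]
kdd = knill(0) + knill(np.pi / 2) + knill(0) + knill(np.pi / 2)   # 20 pulses

print(dist_to_identity(np.linalg.matrix_power(cycle(xy4, eps_r), 10)))
print(dist_to_identity(np.linalg.matrix_power(cycle(kdd, eps_r), 10)))
# the text quotes ~3e-2 for (XY4)^10 and ~7e-7 for (KDD)^10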

UR DD.—

An alternative approach to devising robust DD sequences is the UR_n family [54], developed for a semiclassical noise model. In particular, the system Hamiltonian is modified by a random term, instead of including an explicit quantum bath and system-bath interaction as for the other DD sequences we consider in this work. The model is expressed using an arbitrary unitary pulse that includes a fixed systematic error ε_r as in Eq. 33 and reduces to a π pulse in the ideal case. As detailed in Appendix A, this leads to a family of sequences whose error scales as ε_n ∼ ε_r^{n/2} using n pulses.

These sequences recover some known results at low order: UR_2 is CPMG, and UR_4 is XY4. In other words, CPMG and XY4 are also robust to some pulse imperfections, as they cancel flip-angle errors to a certain order. Moreover, by the same recursive arguments, CDD_n can achieve arbitrary flip-angle error suppression while also achieving 𝒪(τ^n) (up to saturation) protection. Still, CDD_n requires exponentially more pulses than UR_n, since UR_n is by design a semiclassical 𝒪(τ²) sequence. Whether UR_n is also an 𝒪(τ²) sequence for a fully quantum bath is an interesting open problem.

II.3 Optimizing the pulse interval

Nonuniform pulse-interval sequences such as UDD and QDD already demonstrate that introducing pulse intervals longer than the minimum possible (τ_min) can be advantageous. In particular, such alterations can reduce the spectral overlap between the filter function and the bath spectral density. A longer pulse interval also results in pulses closer to the ideal limit of a small Δ/τ ratio when Δ is fixed. Empirical studies have also found evidence for better DD performance under longer pulse intervals [64, 75, 48].

We may distinguish two ways to optimize the pulse interval: an asymmetric or symmetric placement of additional delays. For example, the asymmetric and symmetric forms of XY4 we optimize take the form

\text{XY4}_{a}(d) \equiv Y f_{d} X f_{d} Y f_{d} X f_{d}   (35a)
\text{XY4}_{s}(d) \equiv f_{d/2} Y f_{d} X f_{d} Y f_{d} X f_{d/2},   (35b)

where d sets the duration of the pulse interval. The asymmetric form here is consistent with how we defined XY4 in Eq. 13, and except for CPMG, we have tacitly defined every sequence so far in its asymmetric form for simplicity. The symmetric form is a simple alteration that makes the placement of pulse intervals symmetric about the midpoint of the sequence. For a generic sequence whose definition requires nonuniform pulse intervals, like UDD, we can write the two forms as

\text{DD}_{a}(d) \equiv P_{1} f_{\tau_{1}+d} P_{2} f_{\tau_{2}+d} \cdots P_{n} f_{\tau_{n}+d}   (36a)
\text{DD}_{s}(d) \equiv f_{d/2} P_{1} f_{\tau_{1}+d} P_{2} f_{\tau_{2}+d} \cdots P_{n} f_{\tau_{n}+d/2}.   (36b)

Here, the symmetric nature of DD_s is harder to interpret. The key is to define P̃_j ≡ P_j f_{τ_j} as the “effective pulse.” In this case, the delay-pulse motif is f_{d/2} P̃_j f_{d/2} for every pulse in the sequence, exhibiting a reflection symmetry of the pulse interval about the center of the pulse. In the asymmetric version, the motif is P̃_j f_d instead. Note that PX_s(τ) = CPMG, i.e., the symmetric form of the PX sequence is CPMG. As CPMG is the better-known sequence, hereafter we refer to the PX sequence as CPMG throughout our data and analysis, regardless of symmetry.
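As a concrete illustration of Eq. 36, the helper below (ours; the pulses are symbolic labels and times are in arbitrary units) pads a base sequence with the extra delay d in either the asymmetric or the symmetric arrangement:

def pad_intervals(pulses, taus, d, symmetric=False):
    """Eq. (36): time-ordered events as ('delay', duration) / ('pulse', label) pairs."""
    events = [('delay', d / 2)] if symmetric else []        # leading f_{d/2}
    for k, (P, tau) in enumerate(zip(pulses, taus)):
        events.append(('pulse', P))
        last = (k == len(pulses) - 1)
        pad = d / 2 if (symmetric and last) else d          # trailing f_{d/2} on the last slot
        events.append(('delay', tau + pad))
    return events

print(pad_intervals(['Y', 'X', 'Y', 'X'], [0, 0, 0, 0], 1.0, symmetric=True))
# with zero base intervals this reproduces XY4_s(d) of Eq. (35b)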

II.4 Superconducting hardware physics relevant to DD

So far, our account has been abstract and hardware-agnostic. Since our demonstrations involve IBMQ superconducting hardware, we now provide a brief background relevant to the operation of DD in such devices. We closely follow Ref. [44], which derived the effective system Hamiltonian of transmon superconducting qubits from first principles and identified the importance of modeling DD performance within a frame rotating at the qubit drive frequency. It was found that DD is still effective with this added complication, both in theory and in cloud-based demonstrations; in practice, DD performance can be modeled reasonably well by ideal pulses with a minimal pulse spacing of τ_min = Δ. We now address these points in more detail.

The effective system Hamiltonian for two qubits (the generalization to n > 2 is straightforward) is

H_{S} = -\frac{\omega_{q_{1}}}{2} Z_{1} - \frac{\omega_{q_{2}}}{2} Z_{2} + J\, ZZ,   (37)

where the ω_{q_i} are qubit frequencies and J ≠ 0 is an undesired, always-on ZZ crosstalk. The appropriate frame for describing the superconducting qubit dynamics is the one co-rotating with the number operator N̂ = Σ_{k,l∈{0,1}} (k+l)|kl⟩⟨kl| = I − (Z_1 + Z_2)/2. The unitary transformation into this frame is U(t) = e^{−iω_d N̂ t}, where ω_d is the drive frequency used to apply gates. (This frame choice is motivated by the observation that in the IBMQ devices, the drive frequency is calibrated to be resonant with a particular eigenfrequency of the device’s qubits, which depends on the state of the neighboring qubits; transforming into a frame that rotates with one of these frequencies gets one as close as possible to describing the gates and pulses as static in the rotating frame.) In this frame, the effective dynamics are given by the Hamiltonian

\tilde{H}(t) = \sum_{i=1}^{2} \left(\frac{\omega_{d} - \omega_{q_{i}}}{2}\right) Z_{i} + J\, ZZ + \tilde{H}_{SB}(t) + H_{B},   (38)

where H̃_{SB} = U†(t) H_{SB} U(t). To eliminate unwanted interactions, DD must symmetrize J ZZ and H̃_{SB}. The J ZZ term is removed by applying an X-type DD to the first qubit (“the DD qubit”): X₁Z₁Z₂X₁ = −Z₁Z₂, so symmetrization still works as intended. However, the H̃_{SB} term is time-dependent in the rotating frame U(t), which changes the analysis. First, the sign-flipping captured by Eq. 9 no longer holds, due to the time-dependence of H̃_{SB}. Second, some terms in H̃_err self-average and nearly cancel even without applying DD (intuitively, the Z₁Z₂ rotation at frequency ω_d already self-averages the effects of noise around the z-axis and so, in a sense, acts like a DD sequence):

\{\sigma^{x}_{1}, \sigma^{x}_{2}, \sigma^{y}_{1}, \sigma^{y}_{2}, \sigma^{x}\sigma^{z}, \sigma^{z}\sigma^{x}, \sigma^{y}\sigma^{z}, \sigma^{z}\sigma^{y}\}.   (39)

The remaining terms,

\{\sigma^{z}_{2}, \sigma^{x}\sigma^{x}, \sigma^{x}\sigma^{y}, \sigma^{y}\sigma^{x}, \sigma^{y}\sigma^{y}\},   (40)

are not canceled at all. This differs from the naive expectation in that the two terms containing σ^y_1 in Eq. 40 are not canceled, whereas in the ω_d → 0 limit, all terms containing σ^y_1 are fully canceled.

Somewhat surprisingly, a nominally universal sequence such as XY4 is no longer universal in the rotating frame, again due to the time-dependence acquired by H̃_{SB}. In particular, the only terms that cancel perfectly to 𝒪(τ) are the same Z₁ and ZZ terms as with CPMG. However, the list of terms that approximately cancel grows to include

\{\sigma^{x}\sigma^{x}, \sigma^{x}\sigma^{y}, \sigma^{y}\sigma^{x}, \sigma^{y}\sigma^{y}\},   (41)

and when τ is fine-tuned to an integer multiple of 2π/ω_d, XY4 cancels all terms except, of course, those involving I₁, which commute with the DD sequence.

Consequently, without fine-tuning τ, we should expect CPMG and XY4 to behave similarly when the terms in Eq. 41 are not significant. Practically, this occurs when T₁ ≫ T₂ for the qubits coupled to the DD qubit [44]. However, when instead T₁ ≲ T₂ for the coupled qubits, XY4 should prevail. In addition, the analysis in Ref. [44] was carried out under the assumption of ideal π pulses with τ_min = Δ, and yet its specific qualitative and quantitative predictions were borne out in the cloud-based demonstrations. Hence, it is reasonable to model DD sequences on superconducting transmon qubits as

\text{DD}_{\text{sc}} \equiv \hat{P}_{1} f_{\tau_{1}+\Delta_{1}} \cdots \hat{P}_{n} f_{\tau_{n}+\Delta_{n}},   (42)

where P̂_j is once again an ideal pulse with zero width, and the free evolution periods have been incremented by Δ_j, the width of the actual pulse P_j.

II.5 What this theory means in practice

We conclude our discussion of the background theory by describing its practical implications for our cloud-based DD quantum-memory demonstrations. This section motivates many of our design choices and our interpretation of the results.

At a high level, our primary demonstration (whose full details are fleshed out in Section III below) is to prepare an initial state and apply DD for a duration T. We estimate the fidelity overlap of the initial state and the final prepared state at time T by an appropriate measurement at the end. By adjusting T, we map out a fidelity decay curve (see Fig. 2 for examples), which serves as the basis of our performance assessment of the DD sequences.

The first important point from the theory is the presence of ZZ crosstalk, a spurious interaction that is always present in fixed-frequency transmon processors, even when the intentional coupling between qubits is turned off. Without DD, the always-on ZZ term induces coherent oscillations in the fidelity decay curve, such as the Free curve in Fig. 2, depending on the transmon drive frequency, the dressed transmon qubit eigenfrequency, and calibration details [44]. Applying DD via pulses that anti-commute with the ZZ term makes it possible to cancel the corresponding crosstalk term to first order [44, 46]. For some sequences, such as UR_20 in Fig. 2, this first-order cancellation almost entirely dampens the oscillation. One of our goals in this work is to rank DD sequence performance, and ZZ crosstalk cancellation is an important feature of the most performant sequences. The simplest way to cancel ZZ crosstalk is to apply DD to a single qubit (the one we measure) and leave the remaining qubits idle. We choose this simplest strategy in our work, since it accomplishes our goal without adding complications to the analysis.

Second, we must choose which qubit to perform our demonstrations on. As we mentioned in the previous subsection, the XY4 sequence only approximately cancels certain terms. Naively, we expect these remaining terms to be important when T₁ ≲ T₂ and negligible when T₁ ≫ T₂. Put differently, a universal sequence such as XY4 should behave similarly to CPMG when these terms are negligible anyway (T₁ ≫ T₂) and beat CPMG when the terms matter (T₁ ≲ T₂). To test this prediction, we pick three qubits with T₁ > T₂, T₂ > T₁, and T₁ ≲ T₂, respectively, as summarized in Table 2.

Third, we must contend with the presence of systematic gate errors. When applying DD, these systematic errors can manifest as coherent oscillations in the fidelity decay profile [75, 42]. For example, suppose the error is a systematic over-rotation by ε_r within each cycle of DD with N pulses. In that case, we over-rotate by an accumulated error Nε_r, which manifests as fidelity oscillations. This explains the oscillations for CPMG and XY4 in Fig. 2. To stop this accumulation for a given fixed sequence, one strategy is to apply DD sparsely by increasing the pulse interval τ. However, increasing τ increases the strength of the leading error term obtained by symmetrization, which scales as 𝒪(τ^k) (where k is a sequence-dependent power).

Thus, there is a tension between packing pulses together to reduce the leading error term and applying pulses sparsely to reduce the build-up of coherent errors. Here, we probe both regimes (dense and sparse pulses) and discuss how this trade-off affects our results. Sequences that are designed to be robust to pulse imperfections play an important role in this study. As the UR_20 sequence demonstrates in Fig. 2, this design strategy can be effective at suppressing oscillations due to both ZZ crosstalk and the accumulation of coherent gate errors. Our work attempts to empirically disentangle the various trade-offs in DD sequence design.

III Methods

We performed our cloud-based demonstrations on three IBM Quantum superconducting processors [87] with OpenPulse access [78, 76]: ibmq_armonk, ibmq_bogota, and ibmq_jakarta. Key properties of these processors are summarized in Table 2.

Device | ibmq_armonk | ibmq_bogota | ibmq_jakarta
# qubits | 1 | 5 | 7
qubit used | q0 | q2 | q1
T1 (μs) | 140 ± 41 | 105 ± 41 | 149 ± 61
T2 (μs) | 227 ± 71 | 145 ± 63 | 21 ± 3
pulse duration (ns) | 71.1\overline{1} | 35.5\overline{5} | 35.5\overline{5}
Table 2: The three processors used in our cloud-based demonstrations. The total number of qubits varies from 1 to 7, but in all cases we applied DD to just one qubit: qubit 0 for Armonk, 2 for Bogota, and 1 for Jakarta (see the insets of Fig. 3 for the connectivity graph of each device). The choice of qubit is motivated by the prediction that T₁ ≫ T₂ and T₁ ≲ T₂ lead to different DD sequence behavior (see Section II.5 and Ref. [44]). The qubits we have chosen have the highest connectivity on their respective devices and are therefore subject to maximal ZZ crosstalk. The T₁ and T₂ times are the averages of all values reported during data collection for the specified qubit, along with the 2σ sample standard deviation. Data was collected over roughly 20 different calibrations, mostly between Aug. 9-25, 2021, for the Pauli demonstration and Jan. 11-19, 2022, for the Haar-interval demonstration.
[Figure 1 circuit schematic: the qubit starts in |0⟩, is rotated by U, idles for f_{dα/2}, undergoes N repetitions of the DD block P₁ f_{τ₁+d} P₂ f_{τ₂+d} ⋯ P_{n−1} f_{τ_{n−1}+d} P_n f_{τ_n+d(1−α/2)}, is unrotated by U†, and is measured.]

Figure 1: The “quantum memory” circuit that underlies all of our demonstrations. By sampling the circuit using 8192 shots, we estimate the Uhlmann fidelity f_e(t) in Eq. 43 between the prepared initial state and the final state under the DD sequence P₁ f_{τ₁} P₂ f_{τ₂} ⋯ P_n f_{τ_n}. We have included an additional adjustable pulse interval f_d, where we set d = 0 for the Pauli demonstration (Section III.1) and systematically vary d for the Haar demonstration (Section III.2). The choice of α = 0 (1) corresponds to an asymmetric (symmetric) placement of the additional delays.

All our demonstrations follow the same basic structure, summarized in Fig. 1. Namely, we prepare a single-qubit initial state |ψ⟩ = U|0⟩, apply N repetitions of a given DD sequence S lasting a total time t, undo U by applying U†, and finally measure in the computational basis. Note that this is a single-qubit protocol. Even on multi-qubit devices, we intentionally apply DD only to the single qubit we intend to measure, to avoid unwanted ZZ crosstalk effects, as discussed in Section II.5. The qubit used on each device is listed in Table 2.

We empirically estimate the Uhlmann fidelity

f_{e}(t) = |\bra{\psi} \rho_{\text{final}}(t) \ket{\psi}|^{2}   (43)

with 95% confidence intervals by counting the number of 0’s returned out of 8192 shots and bootstrapping. The results we report below were obtained using OpenPulse [76], which allows for refined control over the exact waveforms sent to the chip, instead of the coarser control that the standard Qiskit circuit API provides [77, 78]. Appendix B provides a detailed comparison between the two APIs, highlighting the significant advantage of using OpenPulse.
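A minimal sketch of this estimate follows (ours; the counts are toy values, and we use a simple parametric bootstrap, which may differ in detail from the resampling used in our analysis pipeline):

import numpy as np

rng = np.random.default_rng(1)
shots, n_zeros = 8192, 7800                 # toy counts; in practice read from the device

f_hat = n_zeros / shots                     # empirical estimate of f_e(t), cf. Eq. (43)
boot = rng.binomial(shots, f_hat, size=10_000) / shots   # bootstrap resamples
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"f_e = {f_hat:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")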

We use this simple procedure to perform two types of demonstrations, which we refer to as Pauli and Haar and explain in detail below. Briefly, the Pauli demonstration probes the ability of DD to maintain the fidelity of the six Pauli states (the eigenstates of σ^x, σ^y, and σ^z) over long times, while the Haar demonstration addresses this question for Haar-random states and short times.

Figure 2: Representative samples of our results for three DD sequences and free evolution. The Bloch sphere representation of the quantum states used for each plot is shown on the bottom right. Normalized fidelity f_e(t)/f_e(0) under the three DD sequences and free evolution is shown for the initial state |−i⟩ and a fixed calibration cycle on Bogota. We summarize the result of many such fidelity decay curves using a box plot. Each box shows the maximum (right-most vertical black line), inter-quartile range (the orange box), median (the skinny portion of each orange box), and minimum (left-most vertical black line) of the time-averaged fidelity F(T = 75 μs) [Eq. 44] across the 6 Pauli states and 10 calibration cycles (for a total of 60 data points each) on Bogota. The vertical lines denote the performance of each sequence by its median, colored with the same color as the corresponding sequence. We use this type of box plot to summarize the Pauli demonstration results. We also show average fidelity convergence as a function of the number of Haar-random states: the horizontal lines represent 𝔼_{100-Haar}[f_e(T = 3.27 μs)], whereas each point represents 𝔼_{N-Haar}[f_e(T = 3.27 μs)] for increasing N, i.e., the rolling average fidelity. In all cases, we find that 25 states are sufficient for reasonable convergence, as 𝔼_{25-Haar}[f_e(T)] is within 1% of 𝔼_{100-Haar}[f_e(T)]. The data shown is for Jakarta, but a similar result holds across all devices tested.

III.1 Pauli demonstration for long times

For the Pauli demonstration, we keep the pulse interval τ fixed at the smallest possible value allowed by the relevant device, τ_min = Δ ≈ 36 ns (or ≈ 71 ns on Armonk), the width of the X and Y pulses (see Table 2 for the specific Δ values for each device). Practically, this corresponds to placing all the pulses as close to each other as possible, i.e., the peak-to-peak spacing between pulses is just Δ (except for nonuniformly spaced sequences such as QDD). For ideal pulses and uniformly spaced sequences, this is expected to give the best performance [21, 88], so unless otherwise stated, all DD is implemented with minimal pulse spacing, τ = Δ.

Within this setting, we survey the capacity of each sequence to preserve the six Pauli eigenstates {|0⟩, |1⟩, |+⟩, |−⟩, |+i⟩, |−i⟩} for long times, which we define as T ≥ 75 μs. In particular, we generate fidelity decay curves like those shown in Fig. 2 by incrementing the number of repetitions N of the sequence, thereby sampling f_e(t) [Eq. 43] at increasingly longer times t. Using XY4 as an example, we apply P₁P₂⋯P_n = (XYXY)^N for different values of N while keeping the pulse interval fixed. After generating fidelity decay curves over 10 or more different calibration cycles across a week, we summarize their performance using a box plot like that shown in Fig. 2. For the Pauli demonstrations, the box plot bins the average normalized fidelity,

F(T) \equiv \frac{1}{f_{e}(0)} \langle f_{e}(t) \rangle_{T} = \frac{1}{T} \int_{0}^{T} dt\, \frac{f_{e}(t)}{f_{e}(0)},   (44)

at time T, computed using numerical integration with Hermite polynomial interpolation. Note that no DD is applied at f_e(0), so we account for state preparation and measurement errors by normalizing. (A sense of the value of f_e(0) and its variation can be gleaned from Fig. 7.) The same normalization holds for Fig. 2.

We can estimate the best-performing sequences for a given device by ranking them by their median performance in this data. In Fig. 2, for example, this leads to the fidelity ordering UR_20 > CPMG > XY4 > free evolution on Bogota, which agrees with the impression left by the decay profiles in Fig. 2 generated in a single run. We use F(T) because the fidelity profiles f_e(t) are generally oscillatory and noisy, so fitting f_e(t) to extract a decay constant (as was done in Ref. [41]) does not return reliable results across the many sequences and different devices we tested. We provide a detailed discussion of these two methods in Appendix C.
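For concreteness, a sketch of the F(T) computation follows (ours; the sample times and fidelities are toy values, and we use SciPy’s PCHIP interpolant as one Hermite-type scheme, which may differ in detail from the interpolation used in our pipeline):

import numpy as np
from scipy.interpolate import PchipInterpolator

t = np.array([0.0, 5.0, 15.0, 30.0, 50.0, 75.0])     # sample times (us), toy values
fe = np.array([0.97, 0.90, 0.82, 0.71, 0.62, 0.55])  # sampled f_e(t), toy values

curve = PchipInterpolator(t, fe / fe[0])             # normalize by f_e(0), cf. Eq. (44)
T = 75.0
print(curve.integrate(0.0, T) / T)                   # time-averaged fidelity F(T)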

III.2 Haar interval demonstrations

The Pauli demonstration estimates how well a sequence preserves quantum memory for long times without requiring excessive data, but it leaves several open questions: (1) does DD preserve quantum memory for an arbitrary state? (2) Is τ = τ_min the best choice empirically? And (3) how effective is DD for short times? The Haar interval demonstration addresses all of these questions. This setting, of short times and arbitrary states, is particularly relevant to improving quantum computations with DD [31, 89, 35, 39]. For example, DD pulses have been inserted into idle spaces to improve circuit performance [43, 90, 45, 47, 91].

In contrast to the Pauli demonstration, where we fixed the pulse delay d = 0 and the symmetry ζ = a [see Eqs. 35a and 35b] and varied t, here we fix t = T and vary d and ζ, writing f_e(d, ζ; t). Further, we now sample over a fixed set of 25 Haar-random states instead of the 6 Pauli states. Note that we theoretically expect the empirical Haar average over n states, 𝔼_{n-Haar}[f_e] ≡ (1/n) Σ_{i=1}^{n} f_e(ψ_i) with |ψ_i⟩ ∼ Haar, to converge to the true Haar average ⟨f_e⟩_Haar for sufficiently large n. As shown in Fig. 2, 25 states are enough for reasonable empirical convergence while keeping the total number of circuits to submit and run on IBMQ devices manageable in practice.
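The empirical Haar average is easy to set up. The sketch below (ours; the fidelity function is a toy stand-in for the measured f_e of Eq. 43) samples Haar-random single-qubit states and tracks the rolling mean, mirroring the convergence check of Fig. 2:

import numpy as np

rng = np.random.default_rng(7)

def haar_state():
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    return v / np.linalg.norm(v)   # a normalized complex Gaussian vector is Haar-random

def measured_fidelity(psi):
    return 0.9 + 0.05 * abs(psi[0]) ** 2   # toy stand-in for the device estimate

fes = [measured_fidelity(haar_state()) for _ in range(100)]
rolling = np.cumsum(fes) / np.arange(1, 101)
print(rolling[24], rolling[99])   # 25-state vs 100-state empirical averages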

Figure 3: A summary of the Pauli demonstration results for all three devices. The top ten sequences are ranked from top to bottom by median average fidelity [Eq. 44] for the listed time T = 75 μs. Also displayed in all cases are CPMG and XY4, along with free evolution (Free). Colored vertical lines indicate the median fidelity of the correspondingly colored sequence (best UR, best QDD, CPMG, XY4, and free evolution). Thin white lines through the orange boxes indicate the median fidelities in all other cases. Otherwise, the conventions of Fig. 2 apply. Two main observations emerge: (1) DD systematically outperforms free evolution, as indicated by “Free” appearing at the bottom; the corresponding dot-dashed red vertical line denotes the median average fidelity of Free, F_Free(75 μs), which is below 0.6 on every device. (2) Advanced DD sequences, especially the UDD and QDD families, provide a substantial improvement over both CPMG and XY4. The best median performance of the top sequence is indicated by a solid green line, F_XY4(75 μs) by a cyan line, and F_CPMG(75 μs) by a dark blue line.

The Haar interval demonstration procedure is now as follows. For a given DD sequence and time TT, we sample fe(d,ζ;t)f_{e}(d,\zeta;t) for ζ{a,s}\zeta\in\{a,s\} at 88 equally spaced values from d=0d=0 to d=dmaxd=d_{\max}, across 2525 fixed Haar random states and 1010 calibration cycles (250250 data points for each dd value). Here d=0d=0 and d=dmaxd=d_{\max} correspond to the tightest and sparsest pulse placements, respectively. At dmaxd_{\max}, we consider only a single repetition of the sequence during the time window TT. To make contact with DD in algorithms, we first consider a short-time limit T=T5CNOT4μT=T_{5\text{CNOT}}\approx 4\mus, which is the amount of time it takes to apply 5 CNOTs on the DD-qubit. As shown by the example fidelity decay curves of Fig. 2, we expect similar results for T15μT\lesssim 15\mus before fidelity oscillations begin. To make contact with the Pauli demonstration, we also consider a long-time limit of T=75μT=75\mus. Finally, to keep the number of demonstrations manageable, we only optimize the interval of (i) the best-performing UR sequence, (ii) the best-performing QDD sequence, and (iii) CPMG and XY4, with Free as a constant reference.
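To make the delay grid concrete, the sketch below (ours) computes the 88 equally spaced settings; it assumes, purely for illustration, that a uniform sequence of nn pulses of width Δ\Delta has dmax=(TnΔ)/nd_{\max}=(T-n\Delta)/n, which simplifies the actual device-level timing constraints:

```python
import numpy as np

def delay_settings(T, n_pulses, pulse_width=50e-9, n_points=8):
    """Delay settings from d = 0 (tightest packing) to d = d_max (a single
    sequence repetition fills the window of length T).

    Assumes uniform intervals; d_max = (T - n*Delta)/n is our simplifying
    assumption, not the exact hardware timing rule."""
    d_max = (T - n_pulses * pulse_width) / n_pulses
    return np.linspace(0.0, d_max, n_points)

print(delay_settings(T=75e-6, n_pulses=4))  # e.g., one XY4 cycle over 75 us
```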

IV Results

We split our results into three subsections. The first two summarize the results of demonstrations aimed at preserving Pauli eigenstates (Section III.1) and Haar-random states (Section III.2). In the third subsection, we discuss how theoretical expectations about the saturation of CDDn\text{CDD}_{n} [88] and URn\text{UR}_{n} [54] compare to the demonstration results.

IV.1 The Pauli demonstration result: DD works and advanced DD works even better

(a1) Armonk Short Time, T5CNOT=5μsT_{5\text{CNOT}}=5\mu s
(a2) Armonk Long Time, T=75μsT=75\mu s
(b1) Bogota Short Time, T5CNOT=4.65μsT_{5\text{CNOT}}=4.65\mu s
(b2) Bogota Long Time, T=75μsT=75\mu s
(c1) Jakarta Short Time, T5CNOT=3.27μsT_{5\text{CNOT}}=3.27\mu s
(c2) Jakarta Long Time, T=75μsT=75\mu s
Figure 4: Summary of the Haar interval demonstration across the three devices. We explore the relationship between the median fidelity fe(d,ζ;t)f_{e}(d,\zeta^{*};t) across Haar random states as a function of the relative spacing of the pulse intervals, d/dmaxd/d_{\max}. The value dmaxd_{\max} corresponds to the largest possible pulse spacing where only a single repetition of a sequence fits. While dmaxd_{\max} is a sequence-dependent quantity, we sample d/dmaxd/d_{\max} evenly regardless of the sequence. For a given device and TT, we plot the best robust sequence (URn\text{UR}_{n}) and the best nonuniform sequence QDDn,m\text{QDD}_{n,m} from Fig. 3, as well as CPMG, XY4, and Free evolution as reference sequences. The Free curve has a solid line for its median and a dashed line above and below, representing its upper and lower quartiles. The parameters n,mn,m in QDDn,m\text{QDD}_{n,m} are chosen so that the sequence fits the time window. The left and right columns correspond to short (T=T5CNOTT=T_{5\text{CNOT}}) and long times (T=75μT=75\mus), respectively. The confidence intervals are upper and lower quartiles – 75% and 25% of all fidelities lie below them, respectively. When the interval is not visible, this is because it is smaller than the marker for the median. Both asymmetric (open symbols) and symmetric (closed symbols) sequences were considered, but we display only the better of the two (for some empirically optimal dd^{*}).

In Fig. 3, we summarize the results of the Pauli demonstration. We rank each device’s top 1010 sequences by median performance across the six Pauli states and 1010 or more calibration cycles, followed by CPMG, XY4, and free evolution as standard references. As discussed in Section III.1, the figure of merit is the normalized, time-averaged fidelity at 75μs75\mu s [see Eq. 44], which is a long-time average.

The first significant observation is that DD is better than free evolution, consistent with numerous previous DD studies. This is evidenced by free evolution (Free) being close to the bottom of the ranking for every device.

Secondly, advanced DD sequences outperform Free, CPMG, and XY4 (shown as dark-blue and light-blue vertical lines in Fig. 3). In particular, 29/30 top sequences across all three devices are advanced – the exception being XY4 on ibmq_armonk. These sequences perform so well that there is at least a 50% improvement in the median fidelity of these sequences (0.85-0.95) over Free (0.45-0.55). The best sequences also have a much smaller variance in performance, as evidenced by their smaller interquartile range in FF. For example, on ibmq_armonk, 75% of all demonstration outcomes for FUDDx25(75μs)F_{\text{UDDx}_{25}}(75\mu s) fall between 0.9 and 0.95, whereas for Free, the same range is between 0.55 and 0.8. Similar comparisons show that advanced DD beats CPMG and XY4 on every device, to varying degrees.

Among the top advanced sequences shown, 16/2955%16/29\sim 55\% are UDD or QDD, which use nonuniform pulse intervals. On the one hand, the dominance of advanced DD strategies, especially UDD and QDD, is not surprising. After all, these sequences were designed to beat the simple sequences. On the other hand, as reviewed above, many confounding factors affect how well these sequences perform, such as finite pulse width errors and the effect of the rotating frame. It is remarkable that despite these factors, predictions made based on ideal pulses apply to actual noisy quantum devices.

Finally, we comment more specifically on CPMG and XY4, as these are widely used and well-known sequences. Generally, they do better than free evolution, which is consistent with previous results. On ibmq_armonk, XY4 outperforms CPMG, which outperforms free evolution. On ibmq_bogota, both XY4 and CPMG perform comparably and marginally better than free evolution. Finally, on ibmq_jakarta, XY4 is worse than CPMG – and even free evolution – but the median performance of CPMG is substantially better than that of free evolution. It is tempting to relate these results to the relative values of T1T_{1} and T2T_{2}, as per Table 2, in the sense that CPMG is a single-axis (“pure-X”) type sequence [Eq. 10] which does not suppress the system-bath σx\sigma^{x} coupling term responsible for T1T_{1} relaxation, while XY4 does. Nevertheless, a closer look at Fig. 3 shows such an explanation would be inconsistent with the fact that both single-axis and multi-axis sequences are among the top 1010 performers for ibmq_armonk and ibmq_bogota.

The exception is ibmq_jakarta, for which there are no single-axis sequences in the top 1010. This processor has a much smaller T2T_{2} than T1T_{1} (for ibmq_armonk and ibmq_bogota, T1>T2T_{1}>T_{2}), so one might expect that a single-axis sequence such as UDD or CPMG would be among the top performers, but this is not the case. In the same vein, the top performing asymmetric QDDn,m\text{QDD}_{n,m} sequences all have n<mn<m, despite this opposite ordering of T1T_{1} and T2T_{2}. These results show that the backend values of T1T_{1} and T2T_{2} are not predictive of DD sequence performance.

IV.2 Haar Interval demonstration Results: DD works on arbitrary states, and increasing the pulse interval can help substantially

We summarize the Haar interval demonstration results in Fig. 4. Each plot corresponds to fe(d,ζ;t)f_{e}(d,\zeta^{*};t) as a function of dd, the additional pulse interval spacing. We plot the spacing in relative units of d/dmaxd/d_{\text{max}}, i.e., the additional delay fraction, for each sequence. The value dmaxd_{\max} corresponds to the largest possible pulse spacing where only a single repetition of a sequence fits within a given unit of time. Hence, dmaxd_{\max} depends both on the demonstration time TT and the sequence tested, and plotting d[0,dmax]d\in[0,d_{\max}] directly leads to sequences with different relative scales. Normalizing with respect to dmaxd_{\max} therefore makes the comparison between sequences easier to visualize in a single plot.

For each device, we compare CPMG, XY4, the best robust sequence from the Pauli demonstration, the best nonuniform sequence from the Pauli demonstration, and Free. The best robust and nonuniform sequences correspond to URn\text{UR}_{n} and QDDn,m\text{QDD}_{n,m} for each device. We only display the choice of ζ\zeta with the better optimum, and the error bars correspond to the interquartile range across the 250250 data points. These error bars are similar to the ones reported in the Pauli demonstration.

IV.2.1 d=0d=0: DD continues to outperform Free evolution also over Haar random states

The d=0d=0 (i.e., additional delay fraction d/dmax=0d/d_{\text{max}}=0) limit is identical to the setting of the Pauli demonstration, i.e., it uses the minimum possible pulse spacing. The advanced sequences, URn\text{UR}_{n} and QDDn,m\text{QDD}_{n,m}, outperform Free by a large margin for short times and by a moderate margin for long times. In particular, they have higher median fidelity, a smaller interquartile range than Free, and are consistently above Free’s upper quartile in the short-time limit. But, up to error bars, URn\text{UR}_{n} and QDDn,m\text{QDD}_{n,m} are statistically indistinguishable in terms of their performance, except that URn\text{UR}_{n} has a better median than QDDn,m\text{QDD}_{n,m} for ibmq_jakarta.

Focusing on CPMG: for short times it does slightly worse than the other DD sequences, yet still much better than Free, but for long times it does about as poorly as Free. On ibmq_bogota and ibmq_jakarta, XY4 performs significantly worse than URn\text{UR}_{n} and QDDn,m\text{QDD}_{n,m}, and is always worse than CPMG. The main exception to this rule is ibmq_armonk, where XY4 is comparable to URn\text{UR}_{n} and QDDn,m\text{QDD}_{n,m} in both the short and long time limits.

Overall, while all sequences lose effectiveness in the long-time limit, the advanced sequences perform well when the pulse spacing is d=0d=0.

IV.2.2 d>0d>0: Increasing the pulse interval can improve DD performance

It is clear from Fig. 4 (most notably from Fig. 4(b2)) that increasing the pulse interval can sometimes significantly improve DD performance. For example, considering Fig. 4(a1), at d=0d=0 CPMG is worse than the other sequences, but by increasing dd, CPMG matches the performance of the other sequences. The same qualitative result occurs even more dramatically for long times (Fig. 4(a2)). Here, CPMG goes from a median fidelity around 0.6 – as poor as Free – to around 0.9 at d/dmax0.55d/d_{\max}\approx 0.55. The significant improvement of CPMG with increasing dd is fairly generic across devices and times, the only exception being ibmq_jakarta at long times. Thus, even a simple sequence such as CPMG (or XY4, which behaves similarly) can compete with the highest-ranking advanced sequences if the pulse interval is optimized. That optimizing the pulse interval can help is unsurprising, but the degree of improvement is surprising, particularly the ability of simple sequences to match the performance of advanced ones.

IV.2.3 d=dmaxd=d_{\max}: Performance at the single cycle limit

In a similar vein, it is notable how well sequences do in the d=dmaxd=d_{\max} limit (at the right of each plot in Fig. 4, where d/dmax=1d/d_{\max}=1). For CPMG at T=75μsT=75\mu s, this corresponds to applying a pulse on average every 37.5μs37.5\mu s; this certainly does not obey the principle of making τ\tau small, implied by error terms scaling as 𝒪(τn)\mathcal{O}(\tau^{n}) in the standard theory. On real devices with finite pulse widths and implementation errors, there is always a tradeoff between combating noise with many pulses and the errors those pulses themselves introduce. Incoherent errors result in a well-documented degradation that reduces the cancellation order for most of the sequences we have considered (see Section II.2). In addition, coherent errors build up as oscillations in the fidelity (see Section II.5). Low-order sequences such as CPMG and XY4 are particularly susceptible to both incoherent and coherent errors, and are therefore expected to exhibit an optimal pulse interval. Moreover, the circuit structure will restrict the pulse interval when implementing DD on top of a quantum circuit. The general rule gleaned from Fig. 4 is to err on the side of sparse pulse placement: d=dmaxd=d_{\max} results in performance comparable to d=0d=0 or a potentially significant improvement.

Nevertheless, whether dense DD is better than sparse DD can depend on the specific device characteristics, desired timescale, and relevant metrics. As a case in point, note the Haar demonstration results on ibmq_jakarta in the long-time limit. Here, most sequences – aside from XY4 – deteriorate with increasing dd. Surprisingly, the best strategy in terms of median performance is XY4, though this comes at the cost of a sizeable interquartile range.

URn\text{UR}_{n} for d=0d=0 does substantially better than Free in all six panels, which is an empirical confirmation of its claimed robustness. Even for d>0d>0, URn\text{UR}_{n} remains a high-fidelity, low-variance sequence. Since other sequences only roughly match URn\text{UR}_{n} upon interval optimization, using a robust sequence is a safe default choice.

IV.3 Saturation of CDD and UDD, and an optimum for UR

(a) Comparison of CDD sequences on Bogota
(b) Comparison of UDD sequences on Bogota
(c) Comparison of UR sequences on Bogota
Figure 5: Time-averaged fidelity at 75μ75\mus for (a) CDDn\text{CDD}_{n} for n{1,,5}n\in\{1,\dots,5\}, (b) UDDxn\text{UDDx}_{n} for n{1,2,9,24,25}n\in\{1,2,9,24,25\}, (c) URn\text{UR}_{n} for n{6,10,20,50,100}n\in\{6,10,20,50,100\}. All three sequence families exhibit saturation on ibmq_bogota as we increase nn, as expected from theory [39, 67, 54]. The vertical purple line denotes the median performance of the correspondingly colored sequence in each panel, which is also the top-performing sequence by this metric.

In Fig. 5, we display the time-averaged fidelity from ibmq_bogota for CDDn\text{CDD}_{n}, URn\text{UR}_{n} and UDDxn\text{UDD}_{x_{n}} as a function of nn. Related results for the other DD sequence families are discussed in Appendix E. As discussed in Section II.1.2, CDDn\text{CDD}_{n} performance is expected to saturate at some noptn_{\text{opt}} according to Eq. 20. In Fig. 5(a), we observe evidence of this saturation at nopt=3n_{\text{opt}}=3 on ibmq_bogota. We can use this to provide an estimate of ϵ=HB+HSB\epsilon=\|H_{B}\|+\|H_{SB}\|. Substituting noptn_{\text{opt}} into Eq. 20 we find:

c¯ϵΔ[45,44]=[9.77×104,3.91×103].\overline{c}\epsilon\Delta\in[4^{-5},4^{-4}]=[9.77\times 10^{-4},3.91\times 10^{-3}]. (45)

This means that ϵΔ1\epsilon\Delta\ll 1 (setting c¯1\overline{c}\approx 1), which confirms the assumption required for DD to provide reasonable suppression, given that XY4 yields 𝒪(ϵτ2)\mathcal{O}(\epsilon\tau^{2}) suppression. This provides a level of empirical support for the validity of our assumptions. In addition, Δ51\Delta\approx 51ns on ibmq_bogota, so we conclude that ϵ0.5\epsilon\approx 0.5MHz. Since qubit frequencies are roughly ωq4.55\omega_{q}\approx 4.5-5 GHz on IBM devices, this also confirms that ωqϵ\omega_{q}\gg\epsilon, as required for a DD pulse. We observe a similar saturation in CDDn\text{CDD}_{n} on Jakarta and Armonk as well (Appendix E).

Likewise, for an ideal demonstration with fixed time TT, the performance of UDDxn\text{UDDx}_{n} should scale as 𝒪(τn)\mathcal{O}(\tau^{n}), and hence we expect a performance that increases monotonically with nn. In practice, this performance should saturate once the finite pulse-width error 𝒪(Δ)\mathcal{O}(\Delta) is the dominant noise contribution [67]. Once again, the UDDxn\text{UDDx}_{n} sequence performance on ibmq_bogota is consistent with theory. In particular, we expect (and observe) a consistent increase in performance with increasing nn until performance saturates. While this saturation is also seen on Armonk, on Jakarta UDDxn\text{UDDx}_{n}’s performance differs significantly from theoretical expectations (see Appendix E).

URn\text{UR}_{n} also experiences a tradeoff between error suppression and noise introduced by pulses. After a certain optimal nn, the performance for URn\text{UR}_{n} is expected to drop [54]. In particular, while URn\text{UR}_{n} yields 𝒪(ϵrn/2)\mathcal{O}(\epsilon_{r}^{n/2}) suppression with respect to flip angle errors (Section II.2.3), all URn\text{UR}_{n} provide 𝒪(τ2)\mathcal{O}(\tau^{2}) decoupling, i.e., adding more free evolution periods also means adding more noise. Thus, we expect URn\text{UR}_{n} to improve with increasing nn until performance saturates. On ibmq_bogota, by increasing up to UR20\text{UR}_{20}, we gain a large improvement over UR10\text{UR}_{10}, but increasing further to UR50\text{UR}_{50} or UR100\text{UR}_{100} results in a small degradation in performance; see Fig. 5. A similar saturation occurs with ibmq_jakarta and ibmq_armonk (see Appendix E).

V Summary and Conclusions

We performed an extensive survey of 1010 DD families (a total of 6060 sequences) across three superconducting IBM devices. In the first set of demonstrations (the Pauli demonstration, Section III.1), we tested how well these 60 sequences preserve the six Pauli eigenstates over relatively long times (2575μ25-75\mus). Motivated by theory, we used the smallest possible pulse interval for all sequences. We then chose the top-performing QDD and UR sequences from the Pauli demonstration for each device, along with CPMG, XY4, and free evolution as baselines, and studied them extensively. In this second set of demonstrations (the Haar demonstration, Section III.2), we considered 2525 (fixed) Haar-random states for a wide range of pulse intervals, τ\tau.

In the Pauli demonstration (Section III.1), we ranked sequence performance by the median time-averaged fidelity at T=75μsT=75\mu s. This ranking is consistent with DD theory. The best-performing sequence on each device substantially outperforms free evolution. Moreover, the expected deviation, quantified using the interquartile range of the average fidelity, was much smaller for DD than for free evolution. Finally, 2929 out of the 3030 best performing sequences were “advanced” DD sequences, explicitly designed to achieve high-order cancellation or robustness to control errors.

In the Haar-interval demonstration (Section III.2), we reported the point-wise fidelity rather than the coarse-grained time-averaged fidelity. At τ=0\tau=0, the Haar-interval demonstration is identical to the Pauli demonstration except for the expanded set of states. Indeed, we found the same hierarchy of sequence performance between the two demonstrations. For example, on ibmq_jakarta, we found XY4 < CPMG < QDD1,6\text{QDD}_{1,6} < UR10\text{UR}_{10} in both Fig. 3 and Fig. 4. This suggests that a test over the Pauli states is a good proxy for Haar-random states under our metric.

However, once we allowed the pulse interval (τ\tau) to vary, we found two unexpected results. First, contrary to expectations, advanced sequences, which theoretically provide a better performance, do not retain their performance edge. Second, for most devices and times probed, DD sequence performance improves or stays roughly constant with increasing pulse intervals before decreasing slightly for very long pulse intervals. This effect is particularly significant for the basic CPMG and XY4 sequences. Relating these two results, we found that with pulse-interval optimization, the basic sequences’ performance is statistically indistinguishable from that of the advanced UR and QDD sequences. In stark contrast to the theoretical prediction favoring short intervals, choosing the longest possible pulse interval, with one sequence repetition within a given time, is generally better than the minimum interval. The one exception to these observations is the ibmq_jakarta processor for T=75μsT=75\mu\text{s}, a time larger than its mean T2T_{2} of 20.7μ20.7\mus. Here, the advanced sequences significantly outperform the basic sequences at their respective optimal interval values (τ=0\tau=0 for the advanced sequences), and DD performance degrades with sufficiently large pulse intervals. The short T2T_{2} for ibmq_jakarta is notable, since in contrast, T=75μs<T2T=75\mu\text{s}<\expectationvalue{T_{2}} for both ibmq_armonk (T2=230μ\expectationvalue{T_{2}}=230\mus) and ibmq_bogota (T2=146μ\expectationvalue{T_{2}}=146\mus). We may thus conclude that, overall, sparse DD is preferred to tightly-packed DD, provided decoherence in the free evolution periods between pulses is not too strong.

The UR sequence either matched or nearly matched the best performance of any other tested sequence at any τ\tau for each device. It also achieved near-optimal performance at τ=0\tau=0 in four of the six cases shown in Fig. 4. This is a testament to its robustness and suggests the URn\text{UR}_{n} family is a good generic choice, provided an optimization to find a suitable nn for a given device is performed. In our case, this meant choosing the top performing URn\text{UR}_{n} member from the Pauli demonstration. Alternatively, our results suggest that as long as T<T2T<T_{2}, one can choose a basic sequence and likely achieve comparable performance by optimizing the pulse interval. In other words, optimizing the pulse interval for a given basic DD sequence or optimizing the order of an advanced DD sequence at zero pulse interval are equally effective strategies for using DD in practice. However, the preferred strategy depends on hardware constraints. For example, if OpenPulse access (or a comparable control option) is not available so that a faithful URn\text{UR}_{n} implementation is not possible, one would be constrained to optimizing CPMG or XY4 pulse intervals. Under such circumstances, where the number of DD optimization parameters is restricted, using a variational approach to identify the remaining parameters (e.g., pulse intervals) can be an effective approach [90].

Overall, theoretically promising, advanced DD sequences work well in practice. However, one must fine-tune the sequence to obtain the best DD performance for a given device. A natural and timely extension of our work would be developing a rigorous theoretical understanding of our observations, which do not always conform to previous theoretical expectations. Developing DD sequences for specific hardware derived using the physical model of the system instead of trial-and-error optimization, or using machine learning methods [92], are other interesting directions. A thorough understanding of how to tailor DD sequences to today’s programmable quantum computers could be vital in using DD to accelerate the attainment of fault-tolerant quantum computing [39].

Data and code availability statement.
The code and data that support the findings of this study are openly available at the following URL/DOI: https://doi.org/10.5281/zenodo.7884641. In more detail, this citable Zenodo repository [93] contains (i) all raw and formatted data in this work (including machine calibration specifications), (ii) python code to submit identical demonstrations, (iii) the Mathematica code used to analyze the data, and (iv) a brief tutorial on installing and using the code as part of the GitHub ReadMe.
Acknowledgements.
NE was supported by the U.S. Department of Energy (DOE) Computational Science Graduate Fellowship under Award Number DE-SC0020347. This material is based upon work supported by the National Science Foundation under the Quantum Leap Big Idea, Grant No. OMA-1936388. LT and GQ acknowledge funding from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Accelerated Research in Quantum Computing under Award Number DE-SC0020316. This material is based upon work supported by the U.S. Department of Energy, Office of Science, SC-1 U.S. Department of Energy 1000 Independence Avenue, S.W. Washington DC 20585, under Award Number(s) DE-SC0021661. We acknowledge the use of IBM Quantum services for this work. The views expressed are those of the authors, and do not reflect the official policy or position of IBM or the IBM Quantum team. We acknowledge the access to advanced services provided by the IBM Quantum Researchers Program. We thank Vinay Tripathi for insightful discussions, particularly regarding device-specific effects. D.L. is grateful to Prof. Lorenza Viola for insightful discussions and comments.

Appendix A Summary of the DD sequences benchmarked in this work

Here we provide definitions for all the DD sequences we tested. To clarify what free evolution periods belong between pulses, we treat uniform and nonuniform pulse interval sequences separately. When possible, we define a pulse sequence in terms of another sequence (or entire DD sequence) using the notation [][\cdot]. In addition, several sequences are recursively built from simpler sequences. When this happens, we use the notation S=s1([s2])S=s_{1}([s_{2}]), whose meaning is illustrated by the example of CDDn\text{CDD}_{n} [Eq. 15].

A.1 Uniform pulse interval sequences

All the uniform pulse interval sequences are of the form

fτP1f2τP2f2τf2τPnfτ.f_{\tau}-P_{1}-f_{2\tau}-P_{2}-f_{2\tau}-\ldots-f_{2\tau}-P_{n}-f_{\tau}. (46)

For brevity, we omit the free evolution periods in the following definitions.

We distinguish between single- and multi-axis sequences, by which we mean the number of orthogonal terms in the system-bath interaction (e.g., pure dephasing is a single-axis case), not the number of rotation axes used in the pulse sequence.

First, we list the single-axis DD sequences:

Hahn X\displaystyle\equiv X (47a)
super-Hahn/RGA2x\displaystyle\text{super-Hahn}/\text{RGA}_{2x} XX¯\displaystyle\equiv X-\overline{X} (47b)
RGA2y\displaystyle\text{RGA}_{2y} YY¯\displaystyle\equiv Y-\overline{Y} (47c)
CPMG XX\displaystyle\equiv X-X (47d)
super-CPMG XXX¯X¯.\displaystyle\equiv X-X-\overline{X}-\overline{X}. (47e)

Second, the URn\text{UR}_{n} sequence for n4n\geq 4 and nn even is defined as

URn\displaystyle\text{UR}_{n} =(π)ϕ1(π)ϕ2(π)ϕn\displaystyle=(\pi)_{\phi_{1}}-(\pi)_{\phi_{2}}-\ldots-(\pi)_{\phi_{n}} (48a)
ϕk\displaystyle\phi_{k} =(k1)(k2)2Φ(n)+(k1)ϕ2\displaystyle=\frac{(k-1)(k-2)}{2}\Phi^{(n)}+(k-1)\phi_{2} (48b)
Φ(4m)\displaystyle\Phi^{(4m)} =πmΦ(4m+2)=2mπ2m+1,\displaystyle=\frac{\pi}{m}\ \ \ \Phi^{(4m+2)}=\frac{2m\pi}{2m+1}, (48c)

where (π)ϕ(\pi)_{\phi} is a π\pi rotation about an axis which makes an angle ϕ\phi with the xx-axis, ϕ1\phi_{1} is a free parameter usually set to 0 by convention, and ϕ2=π/2\phi_{2}=\pi/2 is a standard choice we use. This is done so that UR4=XY4\text{UR}_{4}=\text{XY4}, as discussed in Ref. [54] (note that, despite this, URn\text{UR}_{n} was designed for single-axis decoherence).
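Eq. 48 translates directly into code; a short sketch (ours) that also reproduces the UR4=XY4\text{UR}_{4}=\text{XY4} correspondence:

```python
import numpy as np

def ur_phases(n, phi1=0.0, phi2=np.pi / 2):
    """Pulse phases phi_k of UR_n, Eq. (48), for even n >= 4."""
    assert n >= 4 and n % 2 == 0
    if n % 4 == 0:
        Phi = np.pi / (n // 4)                # Phi^(4m) = pi/m
    else:
        m = (n - 2) // 4
        Phi = 2 * m * np.pi / (2 * m + 1)     # Phi^(4m+2) = 2m*pi/(2m+1)
    k = np.arange(1, n + 1)
    return (phi1 + (k - 1) * (k - 2) / 2 * Phi + (k - 1) * phi2) % (2 * np.pi)

print(ur_phases(4) / np.pi)  # [0. 0.5 0. 0.5]: pulses X-Y-X-Y, i.e., XY4
```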

Next, we list the multi-axis sequences. We start with XY4 and all its variations:

XY4/CDD1\displaystyle\text{XY4}/\text{CDD}_{1} YXYX\displaystyle\equiv Y-X-Y-X (49a)
CDDn\displaystyle\text{CDD}_{n} XY4([CDDn1])\displaystyle\equiv\text{XY4}([\text{CDD}_{n-1}]) (49b)
RGA4\displaystyle\text{RGA}_{4} Y¯XY¯X\displaystyle\equiv\overline{Y}-X-\overline{Y}-X (49c)
RGA4p\displaystyle\text{RGA}_{4p} Y¯X¯Y¯X¯\displaystyle\equiv\overline{Y}-\overline{X}-\overline{Y}-\overline{X} (49d)
RGA8c/XY8\displaystyle\text{RGA}_{8c}/\text{XY8} XYXYYXYX\displaystyle\equiv X-Y-X-Y-Y-X-Y-X (49e)
RGA8a\displaystyle\text{RGA}_{8a} XY¯XY¯YX¯YX¯\displaystyle\equiv X-\overline{Y}-X-\overline{Y}-Y-\overline{X}-Y-\overline{X} (49f)
super-Euler\displaystyle\text{super-Euler} XYXYYXYXX¯Y¯X¯Y¯Y¯X¯Y¯X¯\displaystyle\equiv X-Y-X-Y-Y-X-Y-X-\overline{X}-\overline{Y}-\overline{X}-\overline{Y}-\overline{Y}-\overline{X}-\overline{Y}-\overline{X} (49g)
KDD [Kπ/2][K0][Kπ/2][K0],\displaystyle\equiv[{K}_{\pi/2}]-[{K}_{0}]-[{K}_{\pi/2}]-[{K}_{0}], (49h)

where KϕK_{\phi} is a composite of 5 pulses:

Kϕ(π)π/6+ϕ(π)ϕ(π)π/2+ϕ(π)ϕ(π)π/6+ϕ.K_{\phi}\equiv(\pi)_{\pi/6+\phi}-(\pi)_{\phi}-(\pi)_{\pi/2+\phi}-(\pi)_{\phi}-(\pi)_{\pi/6+\phi}. (50)

For example, (π)0=X(\pi)_{0}=X and (π)π/2=Y(\pi)_{\pi/2}=Y. The composite pulse KϕK_{\phi} is itself a π\pi rotation about the ϕ\phi axis followed by a π/3-\pi/3 rotation about the zz-axis. To see this, note that a π\pi rotation about the ϕ\phi axis can be written (π)ϕ=RZ(ϕ)RY(π/2)RZ(π)RY(π/2)RZ(ϕ)(\pi)_{\phi}=R_{Z}(-\phi)R_{Y}(-\pi/2)R_{Z}(\pi)R_{Y}(\pi/2)R_{Z}(\phi), and one can verify the claim by direct matrix multiplication. KDD (49h) is the Knill-composite version of XY4 with a total of 20 pulses [75, 48]. Note that the alternation of ϕ\phi between 0 and π/2\pi/2 means that successive pairs give rise to a π/3+π/3=0-\pi/3+\pi/3=0 zz-rotation at the end.
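The stated decomposition is indeed easy to check by direct matrix multiplication; a minimal numerical verification (ours, using the convention (π)ϕ=ei(π/2)(cosϕX+sinϕY)(\pi)_{\phi}=e^{-i(\pi/2)(\cos\phi\,X+\sin\phi\,Y)}):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)

def pi_pulse(phi):
    """(pi)_phi: a pi rotation about the equatorial axis at angle phi from x."""
    return -1j * (np.cos(phi) * X + np.sin(phi) * Y)

def K(phi):
    """Knill composite pulse, Eq. (50); later pulses multiply on the left."""
    U = np.eye(2, dtype=complex)
    for a in (np.pi / 6 + phi, phi, np.pi / 2 + phi, phi, np.pi / 6 + phi):
        U = pi_pulse(a) @ U
    return U

def rz(theta):
    """Rotation about z by angle theta."""
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

phi = 0.7  # arbitrary test angle
A = K(phi)
B = rz(-np.pi / 3) @ pi_pulse(phi)  # pi rotation about phi, then -pi/3 about z
print(np.isclose(abs(np.trace(A.conj().T @ B)), 2.0))  # True: equal up to phase
```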

Next, we list the remaining multi-axis RGA sequences:

RGA16b\displaystyle\text{RGA}_{16b} RGA4p([RGA4p])\displaystyle\equiv\text{RGA}_{4p}([\text{RGA}_{4p}]) (51a)
RGA32a\displaystyle\text{RGA}_{32a} RGA4([RGA8a])\displaystyle\equiv\text{RGA}_{4}([\text{RGA}_{8a}]) (51b)
RGA32c\displaystyle\text{RGA}_{32c} RGA8c([RGA4])\displaystyle\equiv\text{RGA}_{8c}([\text{RGA}_{4}]) (51c)
RGA64a\displaystyle\text{RGA}_{64a} RGA8a([RGA8a])\displaystyle\equiv\text{RGA}_{8a}([\text{RGA}_{8a}]) (51d)
RGA64c\displaystyle\text{RGA}_{64c} RGA8c([RGA8c])\displaystyle\equiv\text{RGA}_{8c}([\text{RGA}_{8c}]) (51e)
RGA256a\displaystyle\text{RGA}_{256a} RGA4([RGA64a])\displaystyle\equiv\text{RGA}_{4}([\text{RGA}_{64a}]) (51f)
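To illustrate the recursive notation s1([s2])s_{1}([s_{2}]), here is a sketch (ours) that expands CDDn\text{CDD}_{n} into its pulse labels; we adopt the reading that one inner cycle follows each outer pulse and ignore the free-evolution durations:

```python
XY4 = ["Y", "X", "Y", "X"]  # Eq. (49a)

def cdd(n):
    """Pulse labels of CDD_n = XY4([CDD_{n-1}]), Eq. (49b).

    Sketch only: one embedded CDD_{n-1} cycle follows each XY4 pulse,
    and the pulse intervals are omitted."""
    if n == 1:
        return list(XY4)
    inner = cdd(n - 1)
    seq = []
    for pulse in XY4:
        seq += [pulse] + inner
    return seq

print([len(cdd(n)) for n in (1, 2, 3)])  # [4, 20, 84] pulses
```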

A.2 Nonuniform pulse interval sequences

The nonuniform sequences are described by a general DD sequence of the form

fτ1P1fτ2fτmPmfτm+1f_{\tau_{1}}-P_{1}-f_{\tau_{2}}-\ldots-f_{\tau_{m}}-P_{m}-f_{\tau_{m+1}} (52)

for pulses PjP_{j} applied at times tjt_{j} for 1jm1\leq j\leq m. Thus for ideal, zero-width pulses, the interval times are τj=tjtj1\tau_{j}=t_{j}-t_{j-1} with t00t_{0}\equiv 0 and tm+1Tt_{m+1}\equiv T, the total desired duration of the sequence.

A.2.1 Ideal UDD

For ideal pulses, UDDxn\text{UDDx}_{n} is defined as follows. For a desired evolution time TT, apply XX pulses at times tjt_{j} given by

tj=Tsin2(jπ2n+2),t_{j}=T\sin^{2}\left(\frac{j\pi}{2n+2}\right), (53)

for j=1,2,,nj=1,2,\ldots,n if nn is even and j=1,2,,n+1j=1,2,\ldots,n+1 if nn is odd. Hence, UDDxn\text{UDDx}_{n} always uses an even number of pulses – nn when nn is even and n+1n+1 when nn is odd – so that when no noise or errors are present UDDxn\text{UDDx}_{n} faithfully implements a net identity operator.
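Eq. 53, with the even-pulse-count convention just described, reads in code (ours):

```python
import numpy as np

def udd_times(n, T):
    """Ideal UDDx_n pulse times t_j = T sin^2(j pi / (2n+2)), Eq. (53).
    For odd n, an extra pulse (landing at t = T) makes the count even."""
    j_max = n if n % 2 == 0 else n + 1
    j = np.arange(1, j_max + 1)
    return T * np.sin(j * np.pi / (2 * n + 2)) ** 2
```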

A.2.2 Ideal QDD

To define QDD it is useful to instead define UDDxn\text{UDDx}_{n} in terms of the pulse intervals, τj=tjtj1\tau_{j}=t_{j}-t_{j-1}. By defining the normalized pulse interval,

sj=tjtj1t1=sin((2j1)π2n+2)csc(π2n+2),s_{j}=\frac{t_{j}-t_{j-1}}{t_{1}}=\sin\left(\frac{(2j-1)\pi}{2n+2}\right)\csc\left(\frac{\pi}{2n+2}\right), (54)

for j=1,2,,n+1j=1,2,\ldots,n+1, we can define UDDxn\text{UDDx}_{n} over a total time TT,

UDDxn(T)fs1TXfsnTXfsn+1TXn,\text{UDDx}_{n}(T)\equiv f_{s_{1}T}-X-\ldots-f_{s_{n}T}-X-f_{s_{n+1}T}-X^{n}, (55)

where the notation XnX^{n} means that the sequence ends with XX (II) for odd (even) nn. From this, QDDn,m\text{QDD}_{n,m} has the recursive definition

QDDn,m\displaystyle\text{QDD}_{n,m} UDDxm(s1T)YUDDxm(s2T)Y\displaystyle\equiv\text{UDDx}_{m}(s_{1}T)-Y-\text{UDDx}_{m}(s_{2}T)-Y-\ldots
UDDxm(snT)YUDDxm(sn+1T)Yn\displaystyle-\text{UDDx}_{m}(s_{n}T)-Y-\text{UDDx}_{m}(s_{n+1}T)-Y^{n} (56)

This means that we implement UDDyn\text{UDDy}_{n} (the outer YY pulses) and embed an mthm^{\text{th}} order UDDxm\text{UDDx}_{m} sequence within the free evolution periods of this sequence. The inner UDDxm\text{UDDx}_{m} sequences have a rescaled total evolution time skTs_{k}T; since the decoupling properties only depend on τj\tau_{j} (and not the total time), we still obtain the expected inner cancellation. Written in this way, the total evolution time of QDDn,m\text{QDD}_{n,m} is SnSmTS_{n}S_{m}T, where Sk=j=1k+1sjS_{k}=\sum_{j=1}^{k+1}s_{j} (with the sjs_{j} evaluated at order kk).

To match the convention of all other sequences presented, we connect this definition to one in which the total evolution time of QDDn,m\text{QDD}_{n,m} is itself TT. First, we implement the outer UDDyn\text{UDDy}_{n} sequence with YY pulses placed at times tjt_{j} according to Eq. 53. The inner XX pulses must now be applied at times

tj,k=τjsin2(kπ2m+2)+tj1,t_{j,k}=\tau_{j}\sin^{2}\left(\frac{k\pi}{2m+2}\right)+t_{j-1}, (57)

where j=1,2,,nj=1,2,\ldots,n if nn is even (or j=1,2,,n+1j=1,2,\ldots,n+1 if nn is odd), with a similar condition for kk up to mm if mm is even (or m+1m+1 if mm is odd). Note that when mm is odd, we end each inner sequence with an XX, and then the outer sequence starts where a YY must be placed simultaneously. In these cases, we must apply a Z=XYZ=XY pulse (ignoring the global phase, which does not affect DD performance). Hence, we must apply rotations about XX and ZZ when mm is odd and XX and YY when mm is even. This can be avoided by instead re-defining the inner (or outer) sequence as ZZ when mm is odd and then combining terms to get XX and YY again. In this work, we choose the former approach. It would be interesting to compare this with the latter approach in future work.
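Combining Eqs. 53 and 57 with the merging rule just described, a sketch (ours) of the ideal QDDn,m\text{QDD}_{n,m} schedule:

```python
import numpy as np

def qdd_schedule(n, m, T):
    """Ideal QDD_{n,m} schedule over total time T as (time, axis) pairs.

    Outer Y pulses follow Eq. (53); inner X pulses follow Eq. (57). For
    odd m, the last inner X coincides with the outer Y, and the pair is
    merged into a single Z = XY pulse (up to a global phase)."""
    j_max = n if n % 2 == 0 else n + 1
    k_max = m if m % 2 == 0 else m + 1
    t_outer = T * np.sin(np.arange(j_max + 1) * np.pi / (2 * n + 2)) ** 2
    schedule = []
    for j in range(1, j_max + 1):
        tau_j = t_outer[j] - t_outer[j - 1]
        for k in range(1, k_max + 1):
            t_jk = tau_j * np.sin(k * np.pi / (2 * m + 2)) ** 2 + t_outer[j - 1]
            if not np.isclose(t_jk, t_outer[j]):  # interior inner pulse
                schedule.append((t_jk, "X"))
        schedule.append((t_outer[j], "Z" if m % 2 else "Y"))
    return schedule
```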

A.2.3 UDD and QDD with finite-width pulses

For real pulses with finite width Δ\Delta, these formulas must be slightly adjusted. First, defining tjt_{j} is ambiguous since the pulse application cannot be instantaneous at time tjt_{j}. In our implementation, pulses start at time t=tjΔt=t_{j}-\Delta so that they end when they should be applied in the ideal case. Trying two other strategies – placing the center of the pulse at tjt_{j} or the beginning of the pulse at tjt_{j} – did not result in a noticeable difference. Furthermore, XX and YY have finite width (roughly Δ=50\Delta=50 ns). When UDDxn\text{UDDx}_{n} is applied for nn even, we must end on an identity, so the identity must last for a duration Δ\Delta, i.e., I=fΔI=f_{\Delta}. A similar timing constraint detail appears for QDDn,m\text{QDD}_{n,m} when mm is odd. Here, we must apply a ZZ pulse, but on IBM devices, ZZ is virtual and instantaneous (see Appendix B). Thus, we apply ZfΔZ-f_{\Delta} to obtain the expected timings.

Appendix B Circuit vs OpenPulse APIs

We first tried to use the standard Qiskit circuit API [77, 78]. Given a DD sequence, we transpiled the circuit Fig. 1 to the respective device’s native gates. However, as we illustrate in Fig. 6(a), this can lead to many advanced sequences, such as UR20\text{UR}_{20}, behaving worse than expected. Specifically, this figure shows that implementing UR20\text{UR}_{20} in the standard circuit way, denoted UR20c\text{UR}_{20\text{c}} (where c stands for circuit), is substantially worse than an alternative denoted UR20p\text{UR}_{20\text{p}} (where p stands for pulse).

The better alternative is to use OpenPulse [76]. We call this the “pulse” implementation of a DD sequence. The programming specifics are provided in Ref. [93]; here, we focus on the practical difference between the two methods with the illustrative example shown in Fig. 6(b). Specifically, we compare the YfτYfτYf_{\tau}Yf_{\tau} sequence implemented in the circuit API and the OpenPulse API.

(a) Comparison of pulse and circuit implementation of UR20\text{UR}_{20} on ibmq_armonk with respect to Free
(b) Comparison of pulse and circuit implementation of YYY-Y on ibmq_armonk
Figure 6: Comparison of the circuit and OpenPulse API approaches to implementing (a) UR20\text{UR}_{20} and (b) YfτYfτYf_{\tau}Yf_{\tau}. Panel (a) demonstrates that the OpenPulse version is substantially better. This is partially explained by (b). When using OpenPulse, the YfτYfτYf_{\tau}Yf_{\tau} sequence behaves as expected by symmetrically protecting |±i\ket{\pm i}, in stark contrast to the circuit implementation, which uses virtual ZZ gates, denoted ZvZ_{v}.

Under OpenPulse, the decay profiles for |+i\ket{+i} and |i\ket{-i} are roughly identical, as expected for the YfτYfτYf_{\tau}Yf_{\tau} sequence. The slight discrepancy can be understood as arising from coherent errors in state preparation and the subsequent YY pulses, which accumulate over many repetitions. On the other hand, the circuit results exhibit a large asymmetry between |+i\ket{+i} and |i\ket{-i}. The reason is that YY is compiled into ZvXZ_{v}-X, where ZvZ_{v} denotes a virtual ZZ gate [94]. As Fig. 6(b) shows, ZvXZ_{v}-X does not behave like YY. The simplest explanation consistent with the results is to interpret ZvZ_{v} as an instantaneous ZZ. In this case, Z|+i=|iZ\ket{+i}=\ket{-i} and the subsequent XX rotates the state from |i\ket{-i} to |+i\ket{+i} by rotating through the excited state. The initial state |i\ket{-i}, on the other hand, rotates through the ground state. Since the ground state is much more stable than the excited state on IBMQ’s transmon devices, this asymmetry in trajectory on the Bloch sphere is sufficient to explain the asymmetry in fidelity. (This observation and explanation are due to Vinay Tripathi.)
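The logical (but not physical) equivalence of the compiled sequence is simple to verify: applying ZvZ_{v} first and XX second implements the operator XZ=iYXZ=-iY, i.e., YY up to a global phase. A one-line numerical check (ours):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Gate order "Zv then X" corresponds to the operator product X @ Z:
print(np.allclose(X @ Z, -1j * Y))  # True: Zv-X equals Y up to a global phase
```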

Taking a step back, every gate that is not a simple rotation about the xx axis is compiled by the standard circuit approach into a combination of ZvZ_{v}, XX, and X\sqrt{X}. These gates can behave unexpectedly, as shown here. In addition, the transpiler – unless explicitly instructed otherwise – sometimes absorbs a ZvZ_{v} into a global phase instead of implementing it right before an XX gate. Consequently, two circuits can be logically equivalent while implementing different DD sequences. Using OpenPulse, we can ensure the proper implementation of (π)ϕ(\pi)_{\phi}. This allows the fidelity of UR20p\text{UR}_{20p} to exceed that of UR20c\text{UR}_{20c}.

Overall, we found (not shown) that the OpenPulse implementation was almost always better than or comparable to the equivalent circuit implementation, except for XY4 on ibmq_armonk, where XY4c\text{XY4}_{c} was substantially better than XY4p\text{XY4}_{p}. However, XY4c\text{XY4}_{c} was not the top-performing sequence. Hence, it seems reasonable to default to using OpenPulse for DD when available.

Appendix C Methodologies for extraction of fidelity metrics

In the Pauli demonstration, we compare the performance of different DD sequences on the six Pauli eigenstates for long times. More details of the method are discussed in Section III.1 and, in particular, Fig. 1 and Fig. 2. The results are summarized in Fig. 3 and in more detail in App. E.

To summarize the fidelity decay profiles, we have chosen to employ an integrated metric in this work. Namely, we consider a time-averaged (normalized) fidelity,

F(T)1fe(0)fe(t)T=1T0T𝑑tfe(t)fe(0).F(T)\equiv\frac{1}{f_{e}(0)}\expectationvalue{f_{e}(t)}_{T}=\frac{1}{T}\int_{0}^{T}dt\frac{f_{e}(t)}{f_{e}(0)}. (58)

We combine this metric with an interpolation of fidelity curves, which we explain in detail in this section. We call the combined approach “Interpolation with Time-Averaging” (ITA).
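In code, the ITA pipeline amounts to a few lines; a sketch (ours; the analysis in this work used Mathematica) based on SciPy's piecewise-cubic Hermite interpolator:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def time_averaged_fidelity(t, f_e, T):
    """Eq. (58): normalized, time-averaged fidelity F(T), computed by
    Hermite interpolation of the sampled curve f_e(t) followed by
    numerical integration over [0, T]."""
    curve = PchipInterpolator(t, f_e)  # piecewise-cubic Hermite interpolant
    return curve.integrate(0.0, T) / (T * f_e[0])
```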

Past work has employed a different method for comparing DD sequences, obtained by fitting the decay profiles to a modified exponential decay function. That is, to assume that fe(t)eλtf_{e}(t)\sim e^{-\lambda t}, and then perform a fit to determine λ\lambda. For example, in previous work [41], some of us chose to fit the fidelity curves with a function of the following form (we have added a factor of 1/21/2 to Γ(t)\Gamma(t) that was unfortunately omitted in Ref. [41]):

fP(t)\displaystyle f_{P}(t) =fe(Tf)fe(0)Γ(Tf)1[Γ(t)1]+fe(0),\displaystyle=\frac{f_{e}(T_{f})-f_{e}(0)}{\Gamma(T_{f})-1}\left[\Gamma(t)-1\right]+f_{e}(0), (59a)
Γ(t)\displaystyle\Gamma(t) 12(et/λcos(tγ)+et/α),\displaystyle\equiv\frac{1}{2}\left(e^{-t/\lambda}\cos(t\gamma)+e^{-t/\alpha}\right), (59b)

where fe(0)f_{e}(0) is the empirical fidelity at T=0T=0, fe(Tf)f_{e}(T_{f}) is the empirical fidelity at the final sampled time, and Γ(t)\Gamma(t) is the decay function that is the subject of the fitting procedure, with the three free parameters λ\lambda, γ\gamma, and α\alpha. This worked well in the context of the small set of sequences studied in Ref. [41].

As we show below, in the context of our present survey of sequences, fitting to Eq. 59 results in various technical difficulties, and the resulting fit parameters are not straightforward to interpret and rank. We avoid these technical complications by using our integrated (average fidelity) approach, whose interpretation is easier to understand. We devote this section primarily to justifying these statements, culminating in our preference for a methodology based on the use of Eq. 58.

First, we describe how we bootstrap to compute the empirical fidelities (and their errors) that we use at the beginning of our fitting process.

C.1 Point-wise fidelity estimate by bootstrapping

We use a standard bootstrapping technique [95] to calculate the 2σ2\sigma (95% confidence intervals) errors on the empirical Uhlmann fidelities,

fe(t)=|ψ|ρfinal(t)|ψ|2.f_{e}(t)=|\bra{\psi}\rho_{\text{final}}(t)\ket{\psi}|^{2}. (60)

To be explicit, we generate Ns=8192N_{s}=8192 binary samples (aka shots) from our demonstration (see Fig. 1) for a given {\{DD sequence, state preparation unitary UU, total DD time T}T\} 3-tuple. From this, we compute the empirical Uhlmann fidelity as the ratio of counted 0’s normalized by NsN_{s}. We then generate 1000 re-samples (i.e., NsN_{s} shots generated from the empirical distribution) with replacement, calculating fe(T)f_{e}(T) for each re-sample. From this set of 1000 fe(T)f_{e}(T)’s, we compute the sample mean, feT\expectationvalue{f_{e}}_{T}, and sample standard deviation, σT\sigma_{T}, where the TT subscript serves as a reminder that we perform this bootstrapping for each time point. For example, the errors on the fidelities in Fig. 2(a) are 2σ2\sigma errors generated from this procedure.
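As a sketch (ours), the resampling step can exploit the fact that resampling NsN_{s} binary shots with replacement is equivalent to a binomial draw from the empirical rate:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_fidelity(n_zeros, n_shots=8192, n_resamples=1000):
    """Bootstrap mean and 2-sigma error of the empirical fidelity, given
    the number of '0' outcomes among n_shots."""
    p_hat = n_zeros / n_shots  # empirical fidelity from the raw counts
    resamples = rng.binomial(n_shots, p_hat, size=n_resamples) / n_shots
    return resamples.mean(), 2.0 * resamples.std(ddof=1)
```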

C.2 A survey of empirical fidelity decay curves

Given a systematic way to compute empirical fidelities through bootstrapping, we can now discuss the qualitatively different fidelity decay curves we encounter in our demonstrations, as illustrated in Fig. 7. At a high level, the curves are characterized by whether they decay and whether oscillations are present. If decay is present, there is an additional nuance about what fidelity the curve decays to and whether there is evidence of saturation, i.e., reaching a steady state with constant fidelity. Finally, it matters whether an oscillation is simple, characterized by a single frequency, or more complicated. All these features can be seen in the eight examples shown in Fig. 7, which we now discuss.

Figure 7: A sample of fidelity decay curves from ibmq_armonk that showcases the qualitatively different curve types. Each plot shows curves generated from ten different calibration cycles. Note that all Pauli states are sampled within the same job, so data can be compared directly without worrying about system drift across different calibrations.

The first four panels in Fig. 7 correspond to curves dominated by decay but that decay to different final values. For the first two RGA8a\text{RGA}_{8a} plots, the final value seems stable (i.e., a fixed point). For CPMG and Free, the final fidelity reached does not seem to be the projected stable fidelity. The “Free, |0\ket{0}” curve does not decay, consistent with expectations from the stable |0\ket{0} state on a superconducting device. The last three plots show curves with significant oscillations. For the QDD1,2\text{QDD}_{1,2} plot, the oscillations are strong and only weakly damped. For CPMG, the oscillations are strongly damped. Finally, the KDD plot is a pathological case where the oscillations clearly exhibit more than one frequency and are also only weakly damped.

C.3 Interpolation vs curve fitting for time-series data

To obtain meaningful DD sequence performance metrics, it is essential to compress the raw time-series data into a compact form. Given a target protection time TT, the most straightforward metric is the empirical fidelity, fe(T)f_{e}(T). Given a set of initial states relevant to some demonstration (e.g., those prepared in a specific algorithm), one can then bin the state-dependent empirical fidelities across the states in a box plot.

The box plots we present throughout this work are generated using Mathematica’s BoxWhiskerChart command. As a reminder, a box plot is a common method to summarize the quartiles of a finite data set. In particular, let QxQ_{x} represent the xthx^{\text{th}} quantile defined as the smallest value contained in the data set for which x%x\% of the data lies below the value QxQ_{x}. With this defined, the box plot shows Q0Q_{0} as the bottom bar (aka the minimum), Q25Q75Q_{25}-Q_{75} as the orange box (aka the interquartile range), Q50Q_{50} as the small white sliver in the middle of the interquartile range, and Q100Q_{100} as the top bar (aka the maximum). In our box plots, we have also included the mean as the solid black line. A normal distribution is symmetric, so samples collected from a normal distribution should give rise to a box plot that is symmetric about the median and where the mean is approximately equal to the median.
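In Python terms (our sketch; the plots themselves were made with Mathematica's BoxWhiskerChart), the displayed statistics are:

```python
import numpy as np

def box_stats(data):
    """Quantities shown in our box plots: min (Q0), lower quartile (Q25),
    median (Q50), upper quartile (Q75), max (Q100), and the mean."""
    q0, q25, q50, q75, q100 = np.percentile(data, [0, 25, 50, 75, 100])
    return {"min": q0, "Q25": q25, "median": q50,
            "Q75": q75, "max": q100, "mean": float(np.mean(data))}
```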

If there is no pre-defined set of states of interest, it is reasonable to sample states from the Haar distribution as we did for Fig. 4. This is because the sample Haar mean is an unbiased estimator for the mean across the entire Bloch sphere. Indeed, for a large enough set of Haar random states (we found 2525 to be empirically sufficient), the distribution about the mean becomes approximately Gaussian. At this point, the mean and median agree, and the interquartile range becomes symmetric about its mean, so one may choose to report fe(T)±2σ\langle f_{e}(T)\rangle\pm 2\sigma instead.

When there is no target time TT, we would like a statistic that accurately predicts performance across a broad range of relevant times. The empirical fidelity for a given state at any fixed TT is an unreliable predictor of performance due to oscillations in some curves (see, e.g., QDD1,2,CPMG,\text{QDD}_{1,2},\text{CPMG}, and KDD in the last three plots in Fig. 7). To be concrete, consider that for QDD1,2\text{QDD}_{1,2} protecting |+\ket{+}, fe(20μs)0.2\langle f_{e}(20\mu s)\rangle\approx 0.2 whereas fe(50μs)0.8\langle f_{e}(50\mu s)\rangle\approx 0.8. The standard statistic, in this case, is a decay constant obtained by a modified exponential fit [e.g., Eq. 59[41].

So far, our discussion has centered around a statistic that captures the performance of individual curves in Fig. 7, i.e., a curve for a fixed DD sequence, state, and calibration. To summarize further, we can put all found λ\lambda values in a box plot. For simplicity, we call any method which fits individual curves and then averages over these curves a “fit-then-average” (FtA) approach. This is the high-level strategy we advocate for and use in our work. Note that we include interpolation as a possible curve-fitting strategy as part of the FtA approach, as well as fitting to a function with a fixed number of fit-parameters. In contrast, Ref. [41] utilized an “average-then-fit” (AtF) approach, wherein an averaging of many time-series into a single time-series was performed before fitting. (We caution that we use the term “averaging” to indicate both time-averaging [as in Eq. 58] and data-averaging, i.e., averaging fidelity data from different calibration cycles and states.) Only fitting to a function with a fixed number of fit-parameters was used in Ref. [41], but not interpolation. We discuss the differences between the FtA and AtF approaches below, but first, we demonstrate what they mean in practice and show how they can give rise to different quantitative predictions. We begin our discussion by considering Fig. 8.

Figure 8: Illustration of the qualitative difference between different fitting procedures with free evolution (Free) data collected on ibmq_armonk. The empirical data (obtained from six different initial states and 1616 calibration cycles, for a total of 9696 time-series shown) is identical in the top left and bottom left panels (the empty symbols). We employ two different techniques [interpolation (top) and Eq. 59 (bottom)] to fit the data corresponding to individual time-series (left) and averaged time-series (right). Top left: curves are computed by interpolation, separately for each time-series. Bottom left: curves are computed by fitting each time-series to Eq. 59. The fits here must contend with three different types of behaviors: constant (no decay: |0\ket{0}), pure and slow decay (|1\ket{1}), and damped oscillations (the remaining equatorial states). Right: the AtF approach. We first average the fidelity across all the data for each fixed time and then fit it. Top: we utilize a piece-wise cubic spline interpolation between points. Bottom: we use the fit given by Eq. 59. We remark that fitting to Eq. 59 fails for some data sets, and this is why the bottom left panel contains fewer curves than the top left panel. This is explained more in the text and in full detail in Section C.9.

The data in Fig. 8 corresponds to the Pauli demonstration for free evolution (Free). On the left, we display the curves corresponding to each of the 96 demonstrations (six Pauli eigenstates and 16 calibration cycles in this case). The top curves are computed by a piece-wise cubic-spline interpolation between successive points, while the bottom curves are computed using Eq. 59 (additional details regarding how the fits were performed are provided in Section C.9). For a given state, the qualitative form of the fidelity decay curve is consistent from one calibration to the next. For |0\ket{0}, there is no decay. For |1\ket{1}, there is a slow decay induced from T1T_{1} relaxation back to the ground state (hence the fidelity dropping below 1/21/2). For the equatorial states, |±\ket{\pm} and |±i\ket{\pm i}, there is a sharp decay to the fully mixed state (with fidelity 1/21/2) accompanied by oscillations with different amplitude and frequencies depending on the particular state and calibration. The |0\ket{0} and |1\ket{1} profiles are entirely expected and understood, and the equatorial profiles can be interpreted as being due to T1T_{1} (relaxation) and T2T_{2} (dephasing) alongside coherent pulse-control errors which give rise to the oscillations. On the right, we display the fidelity decay curves computed using the AtF approach, i.e., by averaging the fidelity from all demonstrations at each fixed time. The two curves [top: interpolation; bottom: Eq. 59] display a simple decay behavior, which is qualitatively consistent with the curves in [41, Fig. 2].

In Section C.4, we go into more detail about what the approaches in Fig. 8 are in practice and, more importantly, how well they summarize the raw data. Before doing so, we comment on an important difference between the two left panels of Fig. 8. Whereas the interpolation method provides consistent and reasonable results for all fidelity decay curves, the fitting procedure can sometimes fail. A failure is not plotted, and this is why some data in the bottom left panel is missing. For example, most fits (15/16) for |0\ket{0} fail since Eq. 59 is not designed to handle flat “decays”. The state |1\ket{1} also fails 11/1611/16 times, but interestingly, the equatorial states produce a successful fit all 1616 times. The nature of the failure is explained in detail in Section C.9, but as a prelude, a fit fails when it predicts fidelities outside the range [0,1][0,1] or when it predicts a decay constant λ\lambda with an unreasonable uncertainty. We next address the advantages of the interpolation approach from the perspective of extracting quantitative fidelity metrics.

Figure 9: Summary of the integrated and fitting approaches when applied to the Free curves in Fig. 8. For both of the resulting metrics, we display a state-by-state box plot showing the variation in the reported statistic over the 1616 calibration cycles. Top: for each of the 6×166\times 16 interpolated curves, we compute a time-averaged fidelity at T=75μsT=75\mu s, using Eq. 58. These average fidelities are shown in box plot format, separated by state and also combined all together. Bottom: we attempt to extract λ\lambda for each of the 6×166\times 16 (|0\ket{0} fails 15/16 times and |1\ket{1} fails 11/16 times) curves fitted to Eq. 59 and shown in Fig. 8. The results are shown in box plot format, again separated by state and also combined all together. Top and bottom: also shown are the median (red line) and mean (blue dashed line) of these FtA results. The green dashed line is the AtF result (in this case, there is just a single number, hence no variation over calibration cycles or states). The FtA (blue) and AtF (green) curves agree in the top panel, as expected of a reasonable method, whereas they disagree significantly in the bottom panel.

C.4 Interpolation with time-averaging (ITA) vs curve fitting for fidelity metrics

To extract quantitative fidelity metrics from the fitted data, we compute the time-averaged fidelity [Eq. 58] and the decay constant λ\lambda [Eq. 59]. The results are shown in Fig. 9. The box plots shown in this figure are obtained from the individual curve fits in Fig. 8. The top panel is the ITA approach we advocate in this work (recall Appendix C): it corresponds to the time-averaged fidelity computed from the interpolated fidelity curves in the top-left panel of Fig. 8. The bottom panel corresponds to fits computed using Eq. 59, i.e., the bottom-left panel of Fig. 8, from which the decay constant λ\lambda is extracted.

Let us first discuss the bottom panel of Fig. 9. First, we again comment on the effect of fit failures. Since |0\ket{0} does not decay, Eq. 59 is not appropriate (i.e., λ\lambda\to\infty, which is not numerically stable), and this results in only 1/161/16 fits leading to a valid λ\lambda prediction. This is insufficient data to generate a box plot; hence the absence of the |0\ket{0} data at the bottom. Similar numerical instability issues – though less severe – arise for all of the fits. For example, for |1\ket{1}, only 5/165/16 fits succeed. Among those that do succeed, the variation in λ\lambda is quite large, varying in the range [103,346][103,346]. When compared to the raw data in Fig. 8, such a large variation should not be expected. In contrast, all the fits for the equatorial states succeed, but the variation in λ\lambda is again relatively large, in the range [17,146][17,146]. To summarize, fitting Eq. 59 to our data has instability issues which manifest as: (i) failures to fit some data sets and (ii) large variations in the reported λ\lambda which are unphysical. The problem is not specific to Eq. 59 and indeed would likely occur for any function which tries to reasonably model the entire set of curves found in practice as in Fig. 7. Again, this is because (i) not all states decay as an exponential (i.e., the decay of |0\ket{0} is flat and that of |1\ket{1} appears roughly linear), (ii) λ\lambda and γ\gamma are not independent, and (iii) the notion of λ\lambda itself depends on the final expected fidelity, which is state-dependent.

We argue that instead, interpolating and using time-averaged fidelities, i.e., the approach shown in the top panel of Fig. 9, is preferred. The results shown there demonstrate that this method gives consistent and reasonable results for any fixed state. When significant variations are present (as in the combined data set), they are representative of the clear difference between initial states visible in Fig. 8 (top-left). The key is to choose an appropriate value of TT to average over. When TT is too small, it does not capture the difference between sequences, and when TT is too large, the difference between most DD sequences becomes unobservable. As a compromise, we choose TT long enough for oscillatory sequences to undergo at least one but up to a few oscillation periods.

Beyond the box plot variation, it is also worth remarking that the FtA mean F(75μs)F(75\mu s) in Fig. 9 (top; the blue dashed line) is equal to the AtF FF value for the interpolation approach (top; green dot-dashed line), i.e., the averaging and fitting operations commute in this sense. Hence, the ITA approach has the nice property that we can recover the coarse-grained AtF integrated fidelity result from the granular box plot data by taking a mean. The same is not true of fitting, where the two mean values of λ\lambda differ, as in the bottom panel of Fig. 9 (dashed blue and green lines). Given the interpretation issues of fitting explained in Sections C.5 and C.6 below, alongside the practical reasons explained here, we conclude that the ITA method we use in this work is preferred to methods that attempt to directly extract the decay constant from curve fitting to functions such as fP(t)f_{P}(t). The ITA method is more versatile and yields more stable and sensible results. Notably, it is also granular and able to capture the behavior of individual (DD sequence, state) pairs. This granularity allows for finer comparisons to theory (as discussed in Section II.5) and also leads to practical benefits in our DD design, as seen in Appendix B.

C.5 The “ambiguous λ\lambda problem” and its resolution with ITA

In the previous subsection, we established that the technical difficulties encountered when using FtA as compared to AtF are resolved by interpolating and time-averaging the fidelity [ITA, Eq. 58], instead of averaging and then fitting to a decay profile [Eq. 59]. In this subsection, we provide further arguments in favor of the ITA approach.

First, we take a step back and discuss what we call the “ambiguous λ\lambda problem” that arises when attempting to fit our data using standard models such as Eq. 59. Practically, this problem makes λ\lambda an unreliable estimator of DD sequence performance without additional information, and this additional information does not lead to a simple ranking. As we discuss below, this problem is generic and persists whether we consider individual decay curves or averaged curves. We argue that our time-averaged fidelity metric is a reliable estimator of performance for the top sequences and resolves this issue. After this practical ranking consideration, we turn to the statistical meaning of the FtA and AtF approaches in the context of current noisy quantum devices and again argue in favor of the FtA approach.

The crux of the issue lies in the desired interpretation of the final statistic. In the AtF approach, the final decay constant λ\lambda estimates how well DD performs on average over an ensemble of demonstrations. While this may be an appropriate metric for some benchmarking demonstrations, it is not representative of any given memory demonstration or algorithm. Here, we are interested in how well we expect a DD sequence (or free evolution) to preserve an unknown but fixed input state |ψ\ket{\psi} in any given demonstration, and as we show below, ITA is better at addressing this “typical behavior” question.

As we mentioned, there is an “ambiguous λ\lambda problem” when we try to interpret the meaning of the decay constant in Eq. 59. In particular, the notion of decay in this equation is not just a function of λ\lambda but also of fe(Tf)fe(0)f_{e}(T_{f})-f_{e}(0) and γ\gamma, and these fit parameters are not mutually independent. To make this concrete, consider the four (artificial example) curves shown in Fig. 10 (top panel). These curves do not preserve fidelity equally, and yet, by construction, they all have decay constant λ=50μs\lambda=50\mu s. Nevertheless, the time-averaged fidelity, F(75μs)F(75\mu s), does distinguish between the four curves sensibly (middle panel). First, the time-averaged fidelity metric predicts that c1c_{1} is the best curve and c2c_{2} is the second best, which indeed seems reasonable from the curves alone. On the other hand, c3>c4c_{3}>c_{4} is debatable and depends on the desired time scale. Had we chosen, e.g., 120μsT150μs120\mu s\leq T\leq 150\mu s, we would have found c4>c3c_{4}>c_{3}. Such ambiguities are inevitable when oscillations of different frequencies are present with no preferred target time TT. However, the best sequences do not have oscillations across the Pauli states, as we see by example in the bottom panel of Fig. 10. Thus, this time-averaged fidelity ambiguity for oscillatory curves does not affect the ranking of the top sequences, which we present in Fig. 3 as the main result of the Pauli demonstration. We remark that we choose to plot example curves with the same λ\lambda to avoid complications with fitting noisy data, but the real data we have already encountered faces the “ambiguous λ\lambda problem” due to noise as well.

Figure 10: Examples demonstrating the problem with using λ\lambda to rank performance. (Top) Four different artificially generated fidelity curves from Eq. 59 with the same λ\lambda value. The legend labels each curve with cic_{i} and lists the value of γ\gamma and fe(Tf)f_{e}(T_{f}) used to generate it. (Middle) The time-averaged fidelity evaluated at 75μs75\mu s for each curve. This metric differs for each curve. (Bottom) The best sequences – such as rank 1 and 10 in Fig. 3 – do not exhibit oscillations across the six Pauli states.
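To make the construction behind Fig. 10 concrete, the following minimal Mathematica sketch generates several curves sharing a single decay constant but differing in frequency and asymptote, and then evaluates their time-averaged fidelity. The damped-cosine model below is only a simplified stand-in for Eq. 59, and the parameter values are illustrative rather than those used to produce the figure:

  (* stand-in decay model; every curve shares lambda = 50 (microseconds) *)
  fP[t_, gam_, fInf_] := (1. - fInf) Exp[-t/50.] Cos[gam t] + fInf;

  (* four {gamma, fInf} pairs playing the role of the curves c_i in Fig. 10 *)
  params = {{0., 0.50}, {0., 0.25}, {0.10, 0.50}, {0.20, 0.25}};

  (* time-averaged fidelity F(T), cf. Eq. 58 *)
  fAvg[T_, gam_, fInf_] := NIntegrate[fP[t, gam, fInf], {t, 0, T}]/T;

  (* identical lambda, yet F(75) differs from curve to curve *)
  fAvg[75., #[[1]], #[[2]]] & /@ params

Despite the shared decay constant, the four values of F(75μs)F(75\mu s) differ, which is precisely the ranking information that the decay constant alone fails to provide.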

C.6 Job-to-job fluctuations complicate comparing data from different jobs

We conclude our arguments in favor of the ITA approach by raising a subtle problem in the usage of cloud-based quantum devices. At a high level, users of cloud-based devices often implicitly assume that data taken within a fixed calibration cycle all come from roughly the same distribution. Hence, taking data for many states, averaging, and then fitting (the AtF approach) is sensible; in other words, the average over states is the Haar average computed on the device with the given backend properties. However, this is not generally true for all demonstrations. Some relevant quantum memory demonstrations (such as probing Free fidelities) violate this assumption, and data taken from different jobs can behave as if sampled from different distributions if not appropriately handled. In this sense, we argue against naive averaging of data not taken within the same job. Whenever possible, we prefer to make direct comparisons only within a single job unless stability is empirically observed.

To justify this caution, we carefully test the veracity of standard a priori assumptions on a series of identical (procedure-wise) Free demonstrations on ibmq_armonk. The Free demonstration procedure follows the standard DD protocol we established in Fig. 1. Namely, we prepare |+\ket{+}, idle for 75μs\mu s by applying 1000\approx 1000 identity gates, unprepare |+\ket{+}, and then measure in the computational basis. To run statistical tests on the result of this demonstration, we repeat it many times. Because IBM quantum computers operate on a queuing system (similar to a high-performance computing scheduler such as Slurm), there are several different ways to repeat a demonstration, and as we will show, the choice matters. This is the reason for the rather pedantic discussion that follows.

The first notion of repeating a demonstration is to simply sample the same circuit many times by instructing the IBM job scheduler (we define “job” below) to use NsN_{s} shots. For example, the simplest possible job testing the above demonstration consists of sending ibmq_armonk the tuple (C,Ns)(C,N_{s}), where CC is the circuit encoding the above Free demonstration. Upon receiving such a tuple, ibmq_armonk samples CC for NsN_{s} shots once the job reaches position one in the queue. This repetition is the same as discussed in Section C.1 and hence is the basis for estimating the empirical fidelity fef_{e} [Eq. 60] by bootstrapping the NsN_{s} bitstrings. We are interested in the stability of fef_{e} and hence in the stability of repeating this entire procedure. Naively, we could repeatedly send ibmq_armonk the same tuple in MM different jobs, i.e., [J1=(C,Ns),J2=(C,Ns),,JM=(C,Ns)][J_{1}=(C,N_{s}),J_{2}=(C,N_{s}),\ldots,J_{M}=(C,N_{s})], and compute MM estimated empirical fidelities, [f1,f2,,fM][f_{1},f_{2},\ldots,f_{M}], where we drop the ee subscript for notational simplicity. However, submitting demonstrations this way wastes substantial time waiting in the queue, since J1J_{1} and J2J_{2} are generally not run contiguously due to other users also requesting jobs.

More generally, a job consists of a list of circuits that are sampled contiguously, i.e., without interruption by other users. For our purposes, each job JkJ_{k} consists of LkL_{k} identical demonstration tuples, Jk=[Ek,1(C1,Ns)k,1,Ek,2(C2,Ns)k,2,,Ek,Lk(CLk,Ns)k,Lk]J_{k}=[E_{k,1}\equiv(C_{1},N_{s})_{k,1},E_{k,2}\equiv(C_{2},N_{s})_{k,2},\ldots,E_{k,L_{k}}\equiv(C_{L_{k}},N_{s})_{k,L_{k}}], which allows us to compute LkL_{k} empirical fidelities, Rk=[fk,1,fk,2,,fk,Lk]R_{k}=[f_{k,1},f_{k,2},\ldots,f_{k,L_{k}}] (RkR_{k} standing for result kk). Since Cj=CjC_{j}=C\ \forall j, the index on CjC_{j} is technically redundant, but keeping it helps us make our final point on the intricacies of collecting data within a job. In particular, ibmq_armonk samples the Lk×NsL_{k}\times N_{s} bitstrings by running the circuits of JkJ_{k} in the order [C1,C2,,CLk]Ns[C_{1},C_{2},\ldots,C_{L_{k}}]^{N_{s}} as opposed to [C1]Ns[CLk]Ns[C_{1}]^{N_{s}}\ldots[C_{L_{k}}]^{N_{s}}, as one might naively expect. This strategy aims to prevent device drift from biasing the results of Ek,1E_{k,1} relative to, e.g., Ek,LkE_{k,L_{k}}. With the notation defined, we can state exactly what tests we ran.

To generate our data set, we constructed 67 jobs J1,,J67J_{1},\ldots,J_{67} with a uniformly random value of Lk[10,75]L_{k}\in[10,75] (7575 is the maximum demonstration allotment on ibmq_armonk) but an otherwise identical set of demonstrations Ek,jE_{k,j} and submitted them to the ibmq_armonk queue. (We actually sent 100100 jobs, but after 6767 successful completions, our program crashed due to an internet connection issue.) The value in choosing a random LkL_{k} is to check whether the duration of the job itself has any effect on results. We coarsely summarize the set of all job results in Fig. 11, which we call the “stats test data set”. The entire data set was taken over a ten-hour period within a fixed calibration cycle. In fact, 2222 of the 6767 jobs were run contiguously (i.e., one after the other without other users’ jobs being run in between). We deem two jobs contiguous if they began running on the device within 1 minute of each other; this window is reasonable because the full circuit takes 80μs\approx 80\mu s to execute, so 8192 shots take around 0.657s0.657s, and hence a job with 1010 to 7575 demonstrations takes between 6.57s6.57s and 49.3s49.3s to execute, excluding any latency in communicating between our laptop and the ibmq_armonk server. Despite the jobs being localized in time, there are large jumps in average fkf_{k} values from job to job, which violates our a priori assumption of how sampling from a quantum computer should work within a fixed calibration cycle. We make this precise in a series of assumptions and results which support or reject each assumption.

Figure 11: A coarse view of the entire “stats test data set” taken on ibmq_armonk. (Top) The random number of demonstrations, Lk[10,75]L_{k}\in[10,75], chosen for each job. (Bottom) The spread of fidelities for each job, summarized in 67 box plots. In our earlier notation, each is a box plot over the empirical fidelities RkR_{k}. All data was taken within a fixed calibration cycle over a 10-hour period, and many jobs (22 of 67) are contiguous, e.g., jobs 2 and 3 ran back-to-back with no interruption from other users’ jobs. Despite this, there is a large variation in fidelity from job to job, which does not appear correlated with job duration.
Assumption 1.

Fidelities collected within a fixed job are normally distributed. Using the above notation, the samples within Rk=[fk,1,,fk,Lk]R_{k}=[f_{k,1},\ldots,f_{k,L_{k}}] are drawn from a fixed normal distribution, i.e., fk,j𝒩(μk,σk)f_{k,j}\sim\mathcal{N}(\mu_{k},\sigma_{k}) for some unknown mean μk\mu_{k} and standard deviation σk\sigma_{k}.

Result 1.

Assumption 1 is well supported by hypothesis tests (Jarque-Bera, Shapiro-Wilk) and visual tests (box plot, quantile-quantile (QQ) plot).

The Jarque-Bera test checks whether a sample of points has the skewness and kurtosis expected of a normal distribution [96]. We implement it in Mathematica using the JarqueBeraALMTest command. The Shapiro-Wilk test checks whether a set of sample points has the order statistics expected of a normal distribution [97]. We implement it using the Mathematica command ShapiroWilkTest. Both are hypothesis tests in which the null hypothesis H0H_{0} is that the data was drawn from a normal distribution and the alternative hypothesis HaH_{a} is that it was not. Like most hypothesis tests, they are rejection-based methods. Namely, the tests return a pp-value, the probability of observing data at least as extreme as the sample if H0H_{0} holds. If p<0.05p<0.05, we reject the null hypothesis with 95%95\% confidence (this is the default setting for the Mathematica commands we employ). Otherwise, we say that the data does not support rejecting H0H_{0} – namely, it is plausible that a normal distribution could have produced such a sample. We summarize the results of both tests in Table 3.

Test used Jobs kk where H0H_{0} rejected pp-values of rejected jobs
Jarque-Bera {}\{\} {}\{\}
Shapiro-Wilk {17,22,46}\{17,22,46\} {0.028,0.033,0.048}\{0.028,0.033,0.048\}
Table 3: Summary of jobs where the null hypothesis that the data is sampled from a normal distribution is rejected with 95% confidence.

From Table 3, we see that among the 67 jobs, only 3 give us cause for concern. Namely, jobs {17,22,46}\{17,22,46\} reject the null hypothesis under the Shapiro-Wilk test. From the bar chart in Fig. 11, these jobs have {12,59,45}\{12,59,45\} demonstrations, respectively. Since several other jobs of similar length do not reject H0H_{0}, there does not appear to be a correlation between job length and rejection of normality.
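For concreteness, the per-job tests can be reproduced with a few lines of Mathematica. The job data below is synthetic (drawn from an actual normal distribution) purely to illustrate the calls; both test commands return a pp-value by default:

  (* synthetic stand-in for one job's fidelities R_k; the real data are in [93] *)
  rk = RandomVariate[NormalDistribution[0.48, 0.01], 40];

  pJB = JarqueBeraALMTest[rk];  (* tests skewness and kurtosis *)
  pSW = ShapiroWilkTest[rk];    (* tests order statistics *)

  (* flag the job if either test rejects H0 at 95% confidence *)
  rejected = pJB < 0.05 || pSW < 0.05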

We can also check the normality of the data visually using box plots or quantile-quantile (QQ) plots, as we do in Fig. 12. Recall that QxQ_{x} represents the xthx^{\text{th}} quantile, i.e., the smallest value in the data set below which x%x\% of the data lies. The QQ plot is a way to compare the quantiles of two data sets. In our case, we compare the quantiles of the sample data set to those of a normal distribution with the same mean and standard deviation as the data. Namely, given samples Rk=[f1,,fLk]R_{k}=[f_{1},\ldots,f_{L_{k}}], we compare the quantiles of RkR_{k} itself to 𝒩(R¯k,sRk)\mathcal{N}(\overline{R}_{k},s_{R_{k}}), where R¯k\overline{R}_{k} is the mean of the data and sRks_{R_{k}} is the uncorrected sample standard deviation, i.e., sRk=(1Ni(fiR¯k)2)1/2.s_{R_{k}}=(\frac{1}{N}\sum_{i}(f_{i}-\overline{R}_{k})^{2})^{1/2}. Letting QxQ_{x} be the quantiles of our raw data and Q^x\hat{Q}_{x} the quantiles of the corresponding normal distribution, the points on our QQ plots correspond to (Qx,Q^x)(Q_{x},\hat{Q}_{x}). Hence, if the two distributions are the same (or close), the data should fall on the diagonal Qx=Q^xQ_{x}=\hat{Q}_{x}, which indicates normality.
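The QQ-plot construction just described can be written out directly; here we reuse the synthetic rk from above and the uncorrected sample standard deviation, matching the definition in the text:

  m = Mean[rk];
  s = Sqrt[Mean[(rk - m)^2]];   (* uncorrected sample standard deviation *)
  ref = NormalDistribution[m, s];

  (* the points (Q_x, Qhat_x); normal data falls near the diagonal *)
  xs = Range[0.05, 0.95, 0.05];
  qq = Transpose[{Quantile[rk, xs], InverseCDF[ref, xs]}];
  ListPlot[qq]

The built-in QuantilePlot[rk, ref] produces an equivalent comparison in a single call.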

Figure 12: Box plots and quantile-quantile (QQ) plots corresponding to different job data from Fig. 11. The top two rows show the three jobs which failed the Shapiro-Wilk test for normality. The bottom two rows show three jobs that did not fail the test and are otherwise chosen to be the first, middle, and last jobs in the set of data. Normality is characterized visually as (i) a symmetric box plot whose mean and median agree or (ii) a QQ plot whose data falls approximately along the diagonal. This is best exhibited by the k=1k=1 data in the third row. A deviation from normality is the negation of any of the above properties. For example, the k=17k=17 job in the first row departs from normality in all three respects.

We illustrate the use of these plots in Fig. 12. In the top two rows, we exhibit the three example jobs that failed the Shapiro-Wilk test for normality. For Jk=17J_{k=17}, the corresponding box plot is (i) not symmetric and (ii) has a mean that deviates significantly from its median. The QQ plot also deviates from the diagonal, especially at larger fidelities. As we move from k=17k=17 to k=22k=22 and then k=46k=46, the data appears increasingly normal but still fails the Shapiro-Wilk test. The extent to which these jobs fail the visual tests is best seen by comparison with a normal-looking example, which the bottom two rows of plots provide; these were not rejected by either hypothesis test.

For the data shown, Jk=1J_{k=1} is the closest data set to normal. Here, p=0.982p=0.982 for Shapiro-Wilk, the box plot is symmetric, the mean and median agree, and the QQ-plot data hardly deviates from the diagonal. Given this almost ideally normal data set, it is easier to see why the other data sets fail to appear normal. Even those that pass the hypothesis tests exhibit some non-normal features, but among those rejected, we see deviations from the diagonal line, skewness (a lack of symmetry in the box plot), and a discrepancy between the mean and the median.

Since most jobs (64/67) have data that passes the hypothesis and visual normality tests, we conclude that Assumption 1 is reasonable.

Assumption 2.

Any demonstration within a fixed job is representative of any other identical demonstration within the same job. Hence, it is not necessary to repeat an identical demonstration within a fixed job. In other words, fk,1±σk,1f_{k,1}\pm\sigma_{k,1}, where σk,1\sigma_{k,1} is obtained by bootstrapping as in Section C.1, is a sufficient summary of the estimates of μk\mu_{k} and σk\sigma_{k} in fk,j𝒩(μk,σk)f_{k,j}\sim\mathcal{N}(\mu_{k},\sigma_{k}).

Result 2.

Assumption 2 is justified by a simple direct comparison of mean fidelity and standard deviation estimates. In particular, we compare (i) directly computing the sample mean and sample standard deviation while assuming normality, (ii) bootstrapping over the fidelities in RkR_{k}, and (iii) bootstrapping over the demonstration output counts used to compute fk,1f_{k,1}. For all jobs (i.e., k\forall k), the three methods are consistent with each other, since the 2σ2\sigma error bars overlap significantly in all cases. See Fig. 13.

The basic idea is that all three methods are consistent with each other. By consistency, we mean the intuitive notion that the error bars overlap significantly. Further rigor is unnecessary here, given the degree of matching relative to the job-to-job fluctuations we discuss below. This consistency is significant because the final method shown in Fig. 13 bootstraps only over the counts used to compute fk,1f_{k,1}. Hence, it is sufficient to run a demonstration only once within a single job and treat the results as representative of repeating the identical demonstration, thereby showing the claimed result.

Figure 13: We compare different approaches to summarizing the fidelities from running repeated identical demonstrations within a fixed job. In other words, we compare three ways to summarize Rk=[f1,,fLk]R_{k}=[f_{1},\ldots,f_{L_{k}}]. The first method, “assume normal”, is to simply compute the sample average fk¯\overline{f_{k}} and sample standard deviation sks_{k} and then report fk¯±2sk\overline{f_{k}}\pm 2s_{k} as the average fidelity. The second method, “bootstrap RkR_{k}”, is to bootstrap over the fidelities in RkR_{k}. To correctly estimate the population standard deviation, note that we must rescale the bootstrapped mean error by a factor of Lk\sqrt{L_{k}} (see text for more details). Finally, we can instead bootstrap over the counts used to compute fk,1f_{k,1} directly, which we call “bootstrap fk,1f_{k,1}”. This does not use the information from fk,jf_{k,j} for j2j\geq 2 at all. Nevertheless, all three methods yield self-consistent results, as seen visually by the significant overlap in error bars between the three estimates. Note that we show only a subset of the results for all jobs to ensure that the error bars are large enough to be visible. For a sense of the consistency among the approaches, note that Job 31 has the largest deviation. Hence, it is sufficient to use the bootstrap fk,1f_{k,1} method, as purported in Result 2.

We next detail the methods leading to the three summaries in Fig. 13. The first is to summarize RkR_{k} by the sample mean and sample standard deviation, fk¯±2sk\overline{f_{k}}\pm 2s_{k}, where fk¯=1Lki=1Lkfi\overline{f_{k}}=\frac{1}{L_{k}}\sum_{i=1}^{L_{k}}f_{i} and sk=1Lk1i=1Lk(fifk¯)2.s_{k}=\sqrt{\frac{1}{L_{k}-1}\sum_{i=1}^{L_{k}}(f_{i}-\overline{f_{k}})^{2}}. By Result 1, it is appropriate to characterize the sample by its first two moments — hence the name “assume normal.” The second and third methods, called “bootstrap RkR_{k}” and “bootstrap fk,1f_{k,1}”, both utilize the bootstrapping procedure discussed in Section C.1. The difference is in what is treated as the samples. In the RkR_{k} case, we bootstrap over the fk,jf_{k,j} values using ns10,000n_{s}\equiv 10,000 LkL_{k}-sized samples. From these nsn_{s} estimates fk¯l\overline{f_{k}}_{l}, we can estimate Rk=1nsl=1nsfk¯l\langle R_{k}\rangle=\frac{1}{n_{s}}\sum_{l=1}^{n_{s}}\overline{f_{k}}_{l}. We can then estimate the sample standard deviation sRk=1ns1l=1ns(fk¯lRk)2.s_{R_{k}}=\sqrt{\frac{1}{n_{s}-1}\sum_{l=1}^{n_{s}}(\overline{f_{k}}_{l}-\langle R_{k}\rangle)^{2}}. It is important to note that sRks_{R_{k}} here has the meaning of the standard deviation of our estimate of the mean. As LkL_{k}\rightarrow\infty, sRk0s_{R_{k}}\rightarrow 0, so sRks_{R_{k}} is not an estimate of σk\sigma_{k} in a parametric distribution 𝒩(μk,σk)\mathcal{N}(\mu_{k},\sigma_{k}). To match the meaning of the “assume normal” estimate, we report Rk±2LksRk\langle R_{k}\rangle\pm 2\sqrt{L_{k}}s_{R_{k}} as the “bootstrap RkR_{k}” estimate. In the fk,1f_{k,1} case, we bootstrap over the counts used to compute fk,1f_{k,1} in exactly the same way as discussed in Section C.1. Hence, this method does not utilize the demonstrations fk,jf_{k,j} for j2j\geq 2, and the sample standard deviation is an estimate of σk\sigma_{k} consistent with “assume normal.”
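The three summaries can be sketched as follows. The synthetic numbers will not reproduce the empirical agreement of Fig. 13 — that agreement is a property of the device data — but the sketch makes the Lk\sqrt{L_{k}} rescaling and the role of each estimator explicit; the 0/1 shot list in step (iii) is a hypothetical stand-in for the measured bitstrings behind fk,1f_{k,1}:

  Lk = 40; ns = 10000;
  rk = RandomVariate[NormalDistribution[0.48, 0.01], Lk];  (* synthetic R_k *)

  (* (i) "assume normal": sample mean and corrected standard deviation *)
  {Mean[rk], StandardDeviation[rk]}

  (* (ii) "bootstrap R_k": resample with replacement; rescale by Sqrt[Lk]
     to convert the standard error of the mean into a spread estimate *)
  means = Table[Mean[RandomChoice[rk, Lk]], {ns}];
  {Mean[means], Sqrt[Lk] StandardDeviation[means]}

  (* (iii) "bootstrap f_{k,1}": resample the single-shot outcomes behind f_{k,1} *)
  shots = RandomChoice[{0, 1}, 8192];
  fboot = Table[Mean[RandomChoice[shots, Length[shots]]], {ns}];
  {Mean[fboot], StandardDeviation[fboot]}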

Assumption 3.

Given that RkR_{k} can be modeled as samples from 𝒩(μk,σk)\mathcal{N}(\mu_{k},\sigma_{k}), data in a different job RpR_{p} for pkp\neq k but in the same calibration cycle can also be modeled as samples from 𝒩(μk,σk)\mathcal{N}(\mu_{k},\sigma_{k}).

Result 3.

Assumption 3 is not supported by the data we have already seen in Figs. 11, 12, and 13. It is worth reiterating that this assumption fails even when the calibration is fixed and even when the jobs are contiguous.

In Fig. 11, we see box plots scattered wildly, with median fidelities ranging from 0.240.24 to 0.750.75 and with ranges that do not overlap. Put more precisely, recall that when “assuming normality” as in Fig. 13, we can summarize each job’s results by a single estimate f¯k±2sk\overline{f}_{k}\pm 2s_{k}. Having done so, we find minkf¯k=0.250\min_{k}\overline{f}_{k}=0.250 and maxkf¯k=0.734\max_{k}\overline{f}_{k}=0.734, and yet maxksk=0.013\max_{k}s_{k}=0.013, so it is not possible for all the data to be consistent as described in Assumption 3. This discrepancy is not resolved by considering jobs that are time-proximate or contiguous, as discussed above. To see this, note that all jobs were taken within a ten-hour window for a fixed calibration: the first nine jobs were taken within the first hour, and the first three within the first ten minutes. Despite this, there are large variations that violate Assumption 3 even within the first three jobs.

This result is the first major departure from expectations; it tells us that a demonstration must be repeated many times across different jobs to estimate the variance in performance from run to run. One consequence is that if we test the preservation of |ψ\ket{\psi} in job 1 and |ϕ\ket{\phi} in job 2, it is not reliable to compare the fidelities f(|ψ,T)f(\ket{\psi},T) and f(|ϕ,T)f(\ket{\phi},T) and draw conclusions about the results of a generic demonstration testing both. Alternatively, if we test the efficacy of two sequences s1s_{1} and s2s_{2} on the same state but with data taken from different jobs, this is not a reliable indicator of their relative performance either. Instead, many demonstrations must be taken so that the variance can be estimated, or direct comparisons should only be made within a fixed job and then repeated to check the stability of the result.

The next two assumptions and results address to what extent data taken across different jobs, or even calibrations, can be modeled as being normally distributed.

Assumption 4.

It is appropriate to model fk,1𝒩(μC,σC)f_{k,1}\sim\mathcal{N}(\mu_{C},\sigma_{C}) for CC some fixed calibration cycle provided all jobs kk are in CC. In other words, demonstrations taken from different jobs can be viewed as sampled from a fixed normal distribution.

Result 4.

This assumption is justified since it is not rejected by the Jarque-Bera and Shapiro-Wilk hypothesis tests and is supported by visual indicators (box plot, QQ plot, and error bar comparison plot).

Let A={fk,1}k=167A=\{f_{k,1}\}_{k=1}^{67} be the set of the first empirical fidelity from each job. Given this data set, neither the Jarque-Bera nor the Shapiro-Wilk test supports rejecting the hypothesis that fk,1𝒩(μC,σC)f_{k,1}\sim\mathcal{N}(\mu_{C},\sigma_{C}). More precisely, the pp-values are 0.3480.348 and 0.1920.192, respectively, so with 95% confidence we cannot reject the hypothesis.

In addition, we show the standard visual tests — box plot and QQ plot — in Fig. 14, alongside an error bar comparison plot similar to Fig. 13. The standard visual tests also support assuming normality, since the box plot is symmetric with a mean and median that almost agree, and the QQ-plot data lies mostly along the diagonal. However, perhaps the most important evidence is the error bar comparison (bottom panel). For “assume normal”, we compute the sample mean A¯\overline{A} and standard deviation σA\sigma_{A} of AA as the estimates of μC\mu_{C} and σC\sigma_{C}. We find A¯=0.48\overline{A}=0.48 and σA=0.12\sigma_{A}=0.12. Hence, we would report 0.48±0.240.48\pm 0.24 as the average fidelity across the fixed calibration. When instead bootstrapping the samples of AA, we again find 0.48±0.240.48\pm 0.24, which remarkably has symmetric error bars.

Hence, in all, it is reasonable to model the fidelity samples from different jobs as samples of a fixed normal distribution within a fixed calibration cycle. Note that we were careful not to model the entire data set as normal; instead, we held the number of samples from each job constant. Had we modeled the entirety of the data set, the Jarque-Bera and Shapiro-Wilk tests would reject normality. This is because some jobs would be unfairly given more weight, and the resulting weighted mixture is no longer normal. (By taking a different number of samples from each job, it is as if we roll a die but write down the answer a random number of times; this sampling strategy takes longer to converge to the expected result.)

Figure 14: We graph the visual normality tests for the data set consisting of the first fidelity obtained from each job, A={fk,1}k=167A=\{f_{k,1}\}_{k=1}^{67}. Together with the fact that neither the Jarque-Bera nor the Shapiro-Wilk test rejects normality, we conclude that data set AA does indeed appear to have been sampled from a normal distribution 𝒩(μC,σC)\mathcal{N}(\mu_{C},\sigma_{C}), by the same aforementioned box-plot and QQ-plot criteria. In the bottom panel, we show the difference between the “assume normal” procedure for data set AA and bootstrapping. Since both agree, it is not only reasonable to assume normality but also to use either method to make a consistent prediction for curve fitting.
Assumption 5.

Let f^c\hat{f}_{c} be empirical fidelity samples similar to fk,1f_{k,1} from before but not necessarily sampled from a fixed calibration. Then f^c𝒩(μ,σ)\hat{f}_{c}\sim\mathcal{N}(\mu,\sigma) even across different calibrations.

Result 5.

This assumption is justified for the free evolution (Free) data but is violated for some DD sequences, such as KDD. Nevertheless, for average point-wise fidelity estimates of the form fc¯±2σc\overline{f_{c}}\pm 2\sigma_{c}, the difference between assuming normality and bootstrapping is negligible.

Here, we need to consider a different data set for the first time in this series of assumptions and tests. Instead of considering data from a fixed time T=75μsT=75\mu s, we consider fidelity decay data as in Figs. 7 and 8. Namely, we consider the preservation of |+\ket{+} and |1\ket{1} under Free and KDD as a function of time. In each case, we consider the fidelity at twelve roughly equidistant time steps, fc(s)(|l,Tj)f^{(s)}_{c}(\ket{l},T_{j}) with Tj15013(j1)μsT_{j}\approx\frac{150}{13}(j-1)\mu s for j=1,,12j=1,\ldots,12. Here, ss is a placeholder for the sequence (Free or KDD), cc for the calibration at which the fidelity was sampled, and ll for the state. We test normality for each fixed (s,l,j)(s,l,j), which amounts to a set such as f^c\hat{f}_{c} in Assumption 5.

We begin by applying both the Jarque-Bera and Shapiro-Wilk hypothesis tests. We tabulate the cases where the hypothesis of normality is rejected with 95% confidence in Table 4. According to these tests, we can confidently claim that the set of KDD fidelity decay curves for |+\ket{+} does not obey all the properties expected of samples from a series of normal distributions. Namely, the times TjT_{j} for j[3,9]j\in[3,9] do not have the right order statistics, since they fail the Shapiro-Wilk test. This seems reasonable given Fig. 7, in which the KDD data appears bimodal at the intermediate times.

ss ll rejected jj by JB rejected jj by SW
Free |+\ket{+} {6} {6}
Free |1\ket{1} {2} {2}
KDD |+\ket{+} {} {3,4,5,6,7,8,9}
KDD |1\ket{1} {6} {6,7,8}
Table 4: Summary of the Free and KDD decay curve data sets for which the null hypothesis of normality is rejected with 95% confidence according to the Jarque-Bera (JB) or Shapiro-Wilk (SW) tests. The data set consists of fidelities {fc(s)(|l,Tj)}c=1,,14\{f^{(s)}_{c}(\ket{l},T_{j})\}_{c=1,\ldots,14} for ss the sequence, cc the calibration, |l\ket{l} the state, and jj the time point, j=1,,12j=1,\ldots,12. Note that we test normality where samples are drawn over different calibrations cc with the other parameters fixed.

We can also consider box plots of the data, which support the claims made in Table 4. As an example, in Fig. 15 we show the box plots for (Free, |1\ket{1}), which agree with normality at each time point, and (KDD, |+\ket{+}), which fail to appear normal at most time points.

Figure 15: We plot the fidelity decay of |1\ket{1} under Free and |+\ket{+} under KDD. The variation in the box plots is obtained by running identical demonstrations for that time across different calibrations. The Free data appears mostly normal at each time, which is corroborated by the hypothesis tests in Table 4. The KDD data does not appear normal for the intermediate times, in agreement with the hypothesis tests.

Despite the failed normality tests, we can still formally pretend that the data is normal and compute the sample mean and variance as estimates of μ\mu and σ\sigma. When we do this and compare the result to bootstrapping the mean and rescaling the confidence interval, the results are almost identical for both sets of data, as seen in Fig. 16. By rescaling the confidence interval (CI), we mean that we compute the 95% CI of the mean estimate by bootstrapping, (Δl,Δu)(\Delta_{l},\Delta_{u}), and report (CΔl,CΔu)(\sqrt{C}\Delta_{l},\sqrt{C}\Delta_{u}), where CC is the number of calibration cycles (i.e., the number of fidelities) we bootstrap over. The rescaling is done to obtain an estimate of the spread of the underlying distribution rather than of the sample mean. In other words, we observe that averaging via the bootstrap or via the simple sample mean and standard deviation agree even for pathological data of the form we obtain for KDD.

Figure 16: We show that regardless of whether the data formally appears normal, the reported mean and confidence interval are almost identical under the normality assumption and under bootstrapping. In particular, the data in the top panel appears normal to the various normality tests, whereas the bottom data does not. Despite this, reporting the sample mean and standard deviation is almost identical to reporting the bootstrapped mean and confidence interval.

This observation seems to suggest that we can compare DD sequences by first averaging fidelities across calibrations. This would be in line with previous work such as Refs. [41, 98], which did not report on the subtleties of job-to-job variation. As outlined in the main text and implied by the analysis presented in this appendix, we choose not to do this; the next section clarifies why and what we do instead.

C.7 Effect of job-to-job fluctuations

Let fc(s)f^{(s)}_{c} be a fidelity sample for state ss and calibration cc at some fixed time and sequence. We saw in Result 5 that assuming fc(s)𝒩(μs,σs)f^{(s)}_{c}\sim\mathcal{N}(\mu_{s},\sigma_{s}) is not formally justified for all sequences and states. In particular, the |+\ket{+} state decay for KDD does not pass hypothesis tests for normality at intermediate times. Nevertheless, if we are only concerned with the average fidelity of a fixed state ss^{\prime} across calibrations,

f(s)N1Nc=1Nfc(s),\langle f^{(s^{\prime})}\rangle_{N}\equiv\frac{1}{N}\sum_{c=1}^{N}f^{(s^{\prime})}_{c}, (61)

assuming normality or bootstrapping leads to consistent predictions. But related work—such as Refs. [41, 98]—was concerned with the average fidelity across states,

f¯c1|S|sSfc(s),\overline{f}_{c^{\prime}}\equiv\frac{1}{|S|}\sum_{s\in S}f^{(s)}_{c^{\prime}}, (62)

for some set of states SS and fixed calibration cc^{\prime}. If the fc(s)f_{c}^{(s)}\sim\mathcal{N}(\mu_{s},\sigma_{s}) are mutually independent, then f¯c𝒩(μ,σ)\overline{f}_{c}\sim\mathcal{N}(\mu,\sigma), where μ=1|S|sSμs\mu=\frac{1}{|S|}\sum_{s\in S}\mu_{s} and σ2=1|S|2sSσs2.\sigma^{2}=\frac{1}{|S|^{2}}\sum_{s\in S}\sigma_{s}^{2}. Thus, we would expect the state-averaged fidelity to be consistent from job to job. In practice, however, this does not happen. For example, the Pauli-averaged fidelity (i.e., choosing S={|0,|1,|+,|,|+i,|i}S=\{\ket{0},\ket{1},\ket{+},\ket{-},\ket{+i},\ket{-i}\}) across different calibrations for the KDD sequence at time T6T_{6} is shown in Fig. 17.
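For completeness, the variance statement above follows directly from the independence of the fc(s)f^{(s)}_{c}:

Var(f¯c)=1|S|2Var(sSfc(s))=1|S|2sSVar(fc(s))=1|S|2sSσs2,\mathrm{Var}\left(\overline{f}_{c}\right)=\frac{1}{|S|^{2}}\mathrm{Var}\left(\sum_{s\in S}f^{(s)}_{c}\right)=\frac{1}{|S|^{2}}\sum_{s\in S}\mathrm{Var}\left(f^{(s)}_{c}\right)=\frac{1}{|S|^{2}}\sum_{s\in S}\sigma_{s}^{2},

so the spread of the state average should shrink relative to the individual σs\sigma_{s}, which makes the large job-to-job variation seen in Fig. 17 all the more striking.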

Figure 17: The average fidelity over the six Pauli eigenstates across 1414 calibrations under protection by the KDD sequence. Time is fixed to T6T_{6}, as defined in Fig. 15. The red point is the mean fidelity of the entire data set. This total average is not representative of any calibration average.

The inconsistency of the average state fidelity exhibited in Fig. 17 has several consequences. First, data from a single calibration is generally not enough to characterize typical behavior. For example, if we sampled only a single calibration and happened to sample c=6c=6, we would be left with the wrong impression of KDD’s performance. Second, data from a single calibration is also generally not representative of average behavior. Indeed, the red dot in the middle corresponds to the average over the 1414 calibration averages, and no single calibration has a mean consistent with this average. Finally, averaging over states sampled in different jobs can lead to misleading results. For example, suppose we queued up several runs of |0\ket{0} in calibration c=1c=1, of |1\ket{1} in c=2c=2, and so on. Then, according to the average fidelities in Fig. 17, it is likely that a |+\ket{+} run in c=3c=3 would have substantially higher fidelity than a |\ket{-} run in c=4c=4. Yet, had we compared |+\ket{+} to |\ket{-} within the same calibration, they would have roughly the same fidelity.

To clarify this averaging subtlety, we present the raw data for each state as a box plot in Fig. 18. The variation arises from the fidelity fluctuations of each state across calibrations. Interestingly, the polar states, |0\ket{0} and |1\ket{1}, give surprisingly consistent results, while the remaining equatorial states have large variations in performance. This suggests that in some calibrations, the equatorial states were adversely affected, while the polar states were mostly unaffected. By checking the data calibration-by-calibration, we find this to indeed be true. Namely, the equatorial states collectively have a similar performance that is worse in some calibrations and better in others. This could happen, e.g., if, in some calibrations, T2T_{2} dropped while T1T_{1} stayed roughly the same. Thus, to have a direct comparison of state performance, it is imperative to confine the comparison to within a single job. The same can be said about comparing the fidelities of a fixed state generated at different times. Without confining comparisons to a single job, it is unknown if the difference in fidelities is due to drift or other causes.

Figure 18: The variation in fidelity across 1414 calibrations for each Pauli state protected under KDD for a fixed time T6T_{6}. The final box compiles the data for all six Pauli states. The blue point shows the average (with 95% bootstrapped confidence intervals) across the data in each box plot.

On the other hand, it is not possible to test all DD sequences, states, and decay times within a window short enough to avoid significant drift. Even if it were, fair queuing enforces a maximum of 7575 circuits per job, which sets a hard cut-off on what can be compared reliably in a time-proximate way. As a compromise between time-proximate comparisons and the hard 7575-circuits-per-job limit, we collected data as described in Section III. Namely, within a single job, we collect the fidelity decay curves for the six Pauli states. Each curve consists of 1212 time points. In total, this takes up 7272 circuits, and we pad this with an additional set of two measurement error mitigation circuits. In this sense, we guarantee that the fidelities for different states and times can be compared reliably within this data set, i.e., any differences are not due to drift. We then repeat this demonstration over different calibration cycles to obtain an estimate of fidelity differences due to drift. By reporting the median across this data set, we provide an estimate of typical performance — the performance estimate of a hypothetical next demonstration. We emphasize that typical performance is different from average performance.

To further support this point, we have included average fidelities (with bootstrapped 95% confidence intervals) superimposed on the box plots in Fig. 18. For all data sets, the mean and median do not agree. For the polar states (|0\ket{0} and |1\ket{1}), the difference is slight, and their respective median performance lies within the error bar of the mean. We remark that averages and medians generally agree, as they do here, for the best-performing DD sequences (see Fig. 3). For the remaining equatorial states, the mean is heavily biased downward, and the confidence intervals do not include the median. Once we average over the entire data set, the discrepancy is even worse: the mean falls on top of the 25% quantile, and the top of the confidence interval differs from the median by a significant amount. Hence, an averaging method is inappropriate for DD sequences whose performance is highly sensitive to drift effects, and a median method is more appropriate. Again, we remark that this should not affect the top-performing sequences shown in Fig. 3, but it gives us a consistent way to sift through all the sequences at once to arrive at a top ranking in the first place.
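As a toy illustration of why the median is the more robust summary here, consider a synthetic set of calibration fidelities in which a few calibrations are drift-affected; the numbers are hypothetical and chosen only to mimic the pattern in Fig. 18:

  (* 11 "good" calibrations near 0.70 and 3 drift-affected ones near 0.30 *)
  fids = Join[RandomVariate[NormalDistribution[0.70, 0.02], 11],
              RandomVariate[NormalDistribution[0.30, 0.02], 3]];

  Median[fids]  (* stays near 0.70: the typical performance *)
  Mean[fids]    (* dragged down toward 0.61 by the bad calibrations *)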

To summarize, we have shown that averaging fidelities over states is a delicate issue that depends on how the data is averaged. When averaging within a fixed job, the results appear reasonable and simple averaging of the fidelities over states gives an average state fidelity that behaves as expected. When comparing state data across different jobs, a straightforward average can be biased due to a particularly bad calibration, resulting in an average sensitive to drift. A non-parametric comparison in which fidelities are assessed via a box plot, on the other hand, does not suffer from this issue. When confined to a fixed job, the median over states gives a measure of typical performance over the set of states. When fidelities are collected over both states and calibrations, the median is still a measure of typical performance, but now over both states and calibrations. Hence, a measure of typical performance is more robust than an average performance metric when considering drift. In practice, typical performance is more relevant for any given fixed demonstration. These facts, alongside the practical difficulties of averaging and fitting, are why we opted for the fit-then-average (FtA) approach.

C.8 Summary of fitting fidelity

The standard approach to comparing DD sequences is to generate average fidelity decay curves, fit them, and report a summary statistic like a decay constant. For this reason, we call this an “average-then-fit” (AtF) approach. In Ref. [41], for example, the authors generated fidelity decay curves for roughly 40 equally spaced times and 3636 states (3030 Haar-random and six Pauli eigenstates). For each time, they computed a state-average fidelity constituting a single average fidelity decay curve. They then fit this curve to a modified exponential decay [Eq. 59] and reported the decay constant for XY4 compared to Free evolution. This was sufficient to show that, on average, applying XY4 was better than free evolution.

In this work, we aim to go beyond this approach and identify its pitfalls. To do so, we set out to compare a large collection of DD sequences, not just XY4 and Free. Second, we are interested in whether XY4 retains universality for superconducting devices (see Section II.4). This question requires knowing precisely whether, e.g., |+\ket{+} or |1\ket{1} is better protected under XY4. This led us to two methodological observations. (i) Applying a standard fit such as Eq. 59 to individual state decay curves results in several problems: it does not always work in practice, it leads to an ambiguous interpretation of the decay constant, and it requires a complicated fitting procedure. (ii) The fidelity of a given state protected with DD for a fixed time is unstable from job to job due to device drift. This means one must be careful when comparing fidelities not collected within the same job.

We resolved both issues by (i) focusing on typical instead of average performance and (ii) using a “fit-then-average” (FtA) approach in which we interpolate and time-average (ITA). Along the way, the insistence on not averaging first led to the identification of significant and unexpected asymmetries in performance between states (see Appendix B and especially Fig. 6), which in turn led to better DD sequence design. Ultimately, more than the issues with fitting leading to ambiguous or inconsistent results described in Section C.5, it is the core idea of comparing data within a given job that led to our eventual choice of analysis method.

C.9 Details of fitting using Mathematica

C.9.1 Standard fit to exponential details

At a high level, we fit the fidelity decay curves (see Fig. 7 for example curves and Fig. 8 for example curves with fits) to Eq. 59 using Mathematica’s NonlinearModelFit (NLM) [99, 100] which performs weighted least-squares regression. In more detail, NLM finds the parameters {λ^,γ^,α^}\{\hat{\lambda},\hat{\gamma},\hat{\alpha}\} which minimize the weighted sum of squares,

S=i=1N1σTi2(feTifP(Ti;λ,γ,α))2S=\sum_{i=1}^{N}\frac{1}{\sigma_{T_{i}}^{2}}\left(\expectationvalue{f_{e}}_{T_{i}}-f_{P}(T_{i};\lambda,\gamma,\alpha)\right)^{2} (63)

with the assumption that

feTi=fP(Ti;λ,γ,α)+ϵTiϵTi𝒩(0,σTi)\expectationvalue{f_{e}}_{T_{i}}=f_{P}(T_{i};\lambda^{*},\gamma^{*},\alpha^{*})+\epsilon_{T_{i}}\ \ \ \ \ \epsilon_{T_{i}}\sim\mathcal{N}(0,\sigma_{T_{i}}) (64)

for some unknown parameters λ,γ,\lambda^{*},\gamma^{*}, and α\alpha^{*}. Given a good model and fitting procedure, the NLM procedure finds λ^λ,γ^γ\hat{\lambda}\approx\lambda^{*},\hat{\gamma}\approx\gamma^{*}, and α^α\hat{\alpha}\approx\alpha^{*} [101]. Furthermore, NLM computes the covariance matrix, CC, between the estimated parameters,

C\displaystyle C =(XTWX)1\displaystyle=\left(X^{T}WX\right)^{-1} (65a)
X\displaystyle X =[λF(T1)γF(T1)αF(T1)λF(TN)γF(TN)αF(TN)]\displaystyle=\begin{bmatrix}\partial_{\lambda}F(T_{1})&\partial_{\gamma}F(T_{1})&\partial_{\alpha}F(T_{1})\\ \vdots&\vdots&\vdots\\ \partial_{\lambda}F(T_{N})&\partial_{\gamma}F(T_{N})&\partial_{\alpha}F(T_{N})\end{bmatrix} (65b)
W\displaystyle W =diag(1/σT12,,1/σTN2),\displaystyle=\text{diag}(1/\sigma_{T_{1}}^{2},\ldots,1/\sigma_{T_{N}}^{2}), (65c)

where F(T)F(T) denotes the fit model fP(T;λ,γ,α)f_{P}(T;\lambda,\gamma,\alpha) [Eq. 59], and this particular form of CC and WW results from setting the Weights and VarianceEstimatorFunction NLM options appropriately for experimental data. (The choice of weights is exactly as stated in Eq. 65c, but the variance estimator setting is more subtle. We set the variance estimator to the constant 1, which prevents the covariance matrix from being rescaled by a variance estimate of σTi\sigma_{T_{i}} in Eq. 64. We do this since we do not need to estimate σTi\sigma_{T_{i}} – we already did so by performing our demonstration, and the weight matrix WW reflects this knowledge. These settings are discussed in Mathematica’s Fit Models with Measurement Errors tutorial.) As usual, we compute the parameter standard errors (i.e., the standard deviations of the estimates) from the covariance matrix as,

SEλ=C11,SEγ=C22,SEα=C33.SE_{\lambda}=\sqrt{C_{11}},\ SE_{\gamma}=\sqrt{C_{22}},\ SE_{\alpha}=\sqrt{C_{33}}. (66)

From this, we can estimate the 95% confidence intervals of the parameters by calculating a t-score. To do this, we first find the number of degrees of freedom, ν=N#parameters\nu=N-\#\ \text{parameters}. The t-score is then the value t0.95t_{0.95}^{*} such that

P(t0.95Tt0.95)=0.95TStudentTDist(ν).P(-t_{0.95}^{*}\leq T\leq t_{0.95}^{*})=0.95\ \ \ T\sim\text{StudentTDist}(\nu). (67)

In the end, we report our best-fit parameters with 95% confidence using

λ=λ^±SEλ×t0.95λ^±δλ^\lambda=\hat{\lambda}\pm SE_{\lambda}\times t_{0.95}^{*}\equiv\hat{\lambda}\pm\delta_{\hat{\lambda}} (68)

with an exactly analogous formula for γ\gamma and α\alpha (i.e., the same t0.95t^{*}_{0.95} throughout). (We spell out the details for completeness, but the estimate δλ^\delta_{\hat{\lambda}} can be returned directly from NLM using the “ParameterConfidenceIntervals” property, without computing the covariance matrix at all.)
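Schematically, the full weighted fit with the settings above reads as follows. The model is a simplified stand-in for Eq. 59, and times, fids, and sigmas stand for the lists of time points, mean empirical fidelities, and their bootstrapped standard deviations:

  data = Transpose[{times, fids}];
  nlm = NonlinearModelFit[data,
     (a - b) Exp[-t/lam] Cos[gam t] + b,   (* stand-in for Eq. 59 *)
     {{lam, 50.}, {gam, 0.05}, {a, 1.}, {b, 0.5}}, t,
     Weights -> 1/sigmas^2,                (* Eq. 65c *)
     VarianceEstimatorFunction -> (1 &)];  (* do not rescale the covariance *)

  nlm["BestFitParameters"]             (* {lam -> ..., gam -> ..., ...} *)
  nlm["ParameterConfidenceIntervals"]  (* the 95% intervals of Eq. 68 *)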

C.9.2 Consistency checks and selecting appropriate fitting options

The previous section outlined the general methodology and output of Mathematica’s NonlinearModelFit function. One detail we glossed over is the method by which NonlinearModelFit minimizes the weighted sum of squares, SS. By default, SS is minimized using the LevenbergMarquardt method — a damped variant of the Gauss–Newton algorithm. Many of the details and optimization settings are outlined in Mathematica’s “Unconstrained Optimization: Methods for Local Minimization” documentation. An important feature of this method is that it is a local search, so the quality of the fit depends on the particular input settings, such as the initial parameter seeds. Coupled with the fact that λ\lambda, γ\gamma, and α\alpha of Eq. 59 are not independent parameters, it is important in practice to try many different settings to obtain a good fit. A full enumeration of all settings we attempted can be found in the fittingFreeData.nb Mathematica notebook contained in our code repository [93]. For example, we consider fits in which we reduce the number of parameters (e.g., setting et/α=1e^{-t/\alpha}=1) or in which γ=0\gamma=0 is either seeded or unseeded.

Upon trying many possible fits, we then need an objective way to compare their relative quality. To do so from a fit obtained using NonlinearModelFit, one can query its associated (corrected) Akaike information criterion (AICc) [102] via Mathematica’s Fit["AICc"]. The AIC estimates the relative quality of each model in a collection of models for a given data set. In our case, the sample size is relatively small: 1212 points for each fit. To remedy this, we employ the AICc, which corrects the tendency of AIC to favor overfitting for small sample sizes. See Ref. [102] for a summary of the formulas for AIC and AICc. A more detailed introduction and derivations of the formulas can be found in Ref. [103, pp. 51-74].

The lower the AICc, the better the model relative to others. However, we must also ensure that the final fit satisfies consistency conditions. Namely, a fit is considered unreasonable if it (a) predicts a fidelity less than zero or greater than one, (b) predicts a parameter whose error is larger than the value itself, or (c) violates the Nyquist-Shannon sampling theorem [104], i.e., has a frequency larger than half the sampling rate of the data itself. That is, γ\gamma in Eq. 59 must satisfy 0γ2π2ΔtB0\leq\gamma\leq\frac{2\pi}{2\Delta t}\equiv B, where Δt=(150/12)μs\Delta t=(150/12)\mu s is the spacing between data points in our fidelity decay curves. After fitting, we reject fits that violate properties (a) and (b) and correct those that violate (c). We next illustrate what this post-selection or correction looks like for our data.

First, we perform many fits without regard to whether (a)-(c) are satisfied, since enforcing constraints makes the nonlinear fit results unstable; moreover, the constraints are not always enforced even when specified inside NonlinearModelFit. We then post-select the fits that respect (a) and (b) and, among these, select the fit with the lowest AICc. If no fit respecting (a) and (b) is found, we drop that data set. Finally, we rescale the frequency via γmod(γ,B)\gamma\rightarrow\text{mod}(\gamma,B). This procedure — along with the fits along the way — is summarized in Fig. 19. Note that we are forced to post-select fits that are well modeled by Eq. 59 in order to report a reasonable value of λ\lambda for the given data. For this reason, the fitting procedure described here is biased toward data that is appropriately modeled by Eq. 59.
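In pseudocode form, the post-selection reduces to the sketch below, where fits is a list of candidate FittedModel objects obtained from runs with different seeds and reduced-parameter variants, times are the sample times, and BB is the Nyquist bound defined above; the helper reasonableQ is a hypothetical name encoding conditions (a) and (b):

  (* (a) fidelity within [0, 1] at the sample times; (b) errors below values *)
  reasonableQ[nlm_] := And[
     Min[nlm /@ times] >= 0, Max[nlm /@ times] <= 1,
     And @@ Thread[nlm["ParameterErrors"] <
                   Abs[nlm["BestFitParameters"][[All, 2]]]]];

  good = Select[fits, reasonableQ];
  best = If[good === {}, Missing["NoValidFit"],   (* the data set is dropped *)
            First[SortBy[good, #["AICc"] &]]];

  (* (c) fold the fitted frequency back into the Nyquist band *)
  gamFixed = Mod[gam /. best["BestFitParameters"], B];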

Figure 19: A summary of the post-selection process necessary when fitting to Eq. 59 with Mathematica’s NonlinearModelFit. The data is the same Free evolution data displayed in Fig. 8. Left: for each curve, we select the fit with the lowest AICc without regard to whether the fit meets consistency conditions such as having positive fidelity. Middle: the result of first post-selecting those fits which satisfy consistency conditions and then choosing that fit with the lowest AICc. Some data sets (especially the flat |0\ket{0} state decay) have no reasonable fits, so they are dropped. Right: finally, we rescale the frequency in accordance with the Nyquist-Shannon sampling theorem (this does not affect the predicted decay constant λ\lambda). This final plot is displayed as the result in Fig. 8 and the curves for which the λ\lambda’s are reported in Fig. 9. Namely, data that do not have a reasonable fit are not included in the final summary, which biases towards data that does fit Eq. 59.

C.9.3 Time-averaged fidelity approach

We construct a polynomial interpolation of the fidelity decay data using Mathematica’s Interpolation function [99, 105]. We then integrate over the interpolation to compute a time-averaged fidelity according to Eq. 58. By default, Interpolation uses a third-order Hermite method [106], which we found sufficient to model our data adequately. For example, the curves in Fig. 7 consist of Hermite interpolations of the raw data (triangular markers), and they appear reasonable. To compute Eq. 58 from the interpolation, we used Mathematica’s NIntegrate.
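Concretely, the ITA pipeline amounts to two calls, where data is the list of {Tj,fj}\{T_{j},f_{j}\} pairs for one (sequence, state) decay curve:

  (* third-order Hermite interpolation is the default;
     Method -> "Spline" gives the cubic-spline alternative of Fig. 20 *)
  if = Interpolation[data];

  (* time-averaged fidelity, Eq. 58 *)
  fAvg[T_] := NIntegrate[if[t], {t, 0, T}]/T;

  fAvg[75.]  (* the summary statistic reported, e.g., in Fig. 9 (top) *)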

Figure 20: We compare third-order Hermite and cubic spline interpolations on the fidelity decay curve generated by KDD with the initial state |\ket{-}. The resulting interpolations are qualitatively similar, and the resulting average fidelities are within 6×1036\times 10^{-3} of each other for all 0T100μs0\leq T\leq 100\mu s. Hence, they are equally valid interpolations when using time-averaged fidelity as a performance metric.

As a check of stability, we also tried the popular cubic spline interpolation method [107]; this does not affect our results beyond the interpolation error bound itself. For example, consider the complicated interpolation required for KDD using either the Hermite or the cubic spline method in Fig. 20. Either interpolation is equally valid when using the time-averaged fidelity metric.

We remark that polynomial interpolation is much easier than fitting to a model such as Eq. 59, since we need not (i) select reasonable fitting equations, (ii) fit them using nonlinear least-squares regression, or (iii) post-select fits which satisfy the consistency conditions and have low AICc. Additionally, the predictions are less susceptible to the ambiguities described in Section C.5. Finally, we are not forced to discard any data sets for failing the post-selection criteria, so we do not bias the final summary results in any particular way.

Appendix D Toggling frame

We assume that without any external control, the system and bath evolve under the time-independent noise Hamiltonian HH. A DD pulse sequence is realized via a time-dependent control Hamiltonian Hc(t)H_{c}(t) acting only on the system so that the system and bath evolve according to H+Hc(t)H+H_{c}(t) (refer to Section II.1 for definitions of the various Hamiltonian terms).

For understanding the effects of the control Hamiltonian, it is convenient to use the interaction picture defined by Hc(t)H_{c}(t), also known as the toggling frame [108, 21, 31, 109, 34]. The toggling-frame density operator ρ~SB(t)\tilde{\rho}_{SB}(t) is related to the Schrödinger-picture density operator ρSB(t)\rho_{SB}(t) by

ρSB(t)\displaystyle\rho_{SB}(t) =U(t,0)ρSB(0)U(t,0)\displaystyle=U(t,0)\rho_{SB}(0)U^{\dagger}(t,0)
Uc(t)ρ~SB(t)Uc(t),\displaystyle\equiv U_{c}(t)\tilde{\rho}_{SB}(t)U_{c}^{\dagger}(t), (69)

where U(t,0)U(t,0) is the evolution operator generated by the full Hamiltonian H+Hc(t)H+H_{c}(t). Therefore the toggling-frame state evolves according to

ρ~SB(t)=U~(t,0)ρ~SB(0)U~(t,0),\tilde{\rho}_{SB}(t)=\tilde{U}(t,0)\tilde{\rho}_{SB}(0)\tilde{U}^{\dagger}(t,0), (70)

where the toggling-frame time evolution operator

U~(t,0)Uc(t)U(t,0)\tilde{U}(t,0)\equiv U_{c}^{\dagger}(t)U(t,0) (71)

is generated by the toggling-frame Hamiltonian

H~(t)Uc(t)HUc(t).\tilde{H}(t)\equiv U_{c}^{\dagger}(t)HU_{c}(t). (72)
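To verify this, differentiate Eq. 71 and use the Schrödinger equations i∂tU=(H+Hc)Ui\partial_{t}U=(H+H_{c})U and i∂tUc=HcUci\partial_{t}U_{c}=H_{c}U_{c} (so that i∂tUc=UcHci\partial_{t}U_{c}^{\dagger}=-U_{c}^{\dagger}H_{c}), with =1\hbar=1:

i∂tU~=(i∂tUc)U+Uc(i∂tU)=UcHcU+Uc(H+Hc)U=UcHUcUcU=H~(t)U~,i\partial_{t}\tilde{U}=(i\partial_{t}U_{c}^{\dagger})U+U_{c}^{\dagger}(i\partial_{t}U)=-U_{c}^{\dagger}H_{c}U+U_{c}^{\dagger}(H+H_{c})U=U_{c}^{\dagger}HU_{c}\,U_{c}^{\dagger}U=\tilde{H}(t)\tilde{U},

confirming that U~\tilde{U} is generated by H~(t)\tilde{H}(t).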

Since Uc(t)U_{c}(t) acts nontrivially only on the system, H~(t)\tilde{H}(t) can be written as

H~(t)=HB+H~err(t),\tilde{H}(t)=H_{B}+\tilde{H}_{\text{err}}(t), (73)

where H~err(t)Uc(t)HerrUc(t)\tilde{H}_{\text{err}}(t)\equiv U_{c}^{\dagger}(t)H_{\text{err}}U_{c}(t) is the toggling-frame version of HerrH_{\text{err}}. Because the operator norm is unitarily-invariant, we have H~(t)=Hϵ\|\tilde{H}(t)\|=\|H\|\leq\epsilon and H~err(t)=HerrJ\|\tilde{H}_{\text{err}}(t)\|=\|H_{\text{err}}\|\leq J.

Throughout, we consider cyclic DD, where Uc(t)U_{c}(t) returns to the identity (up to a possible irrelevant overall phase) at the end of a cycle taking time tDDt_{\rm DD}:

Uc(tDD)=Uc(0)=I.U_{c}(t_{\text{DD}})=U_{c}(0)=I. (74)

Therefore, at the end of the cycle, the toggling-frame and Schrödinger-picture states coincide.

Appendix E Results of all Pauli demonstrations

In Fig. 3, we summarized the results of the Pauli demonstration for the top 1010 sequences on each device and the baseline of CPMG, XY4, and free evolution. This appendix presents the data for all 6060 tested sequences. In particular, we split the data for each device into a 3×33\times 3 grid of plots shown in Figs. 21-23, separating by family as in Table 1 when possible. For convenience, CPMG, XY4, and Free are still included in all plots along with colored reference lines matching the convention in Fig. 3. The purple reference line denotes the best sequence in the given family plot excluding the baseline sequences placed at the end. For example, in Fig. 21(b) RGA64a\text{RGA}_{64a} is the best RGA sequence even though it does not outperform XY4; it is marked with a purple reference line.

We plot the same families for each device in a 3×33\times 3 grid, labeled (a)-(i). In each caption, we make a few general comments on overall DD performance and then comment on observations specific to each sequence family. Recall that a specific definition of each sequence is given in Appendix A, and a summary of their properties, along with references, in Table 1. To avoid excessive repetition in each caption, we first provide a brief description of each of the cases (a)-(i) shown in Figs. 21-23:

  1. (a)

This “family” serves as a catch-all for the basic sequences Hahn [55], CPMG [57, 72], and XY4 [56], along with sequences derived from them. Namely, we also plot the Eulerian super-cycle versions [80, 73] (denoted by S), KDD [48, 75], which is a composite-pulse version of XY4, and CDDn\text{CDD}_{n} [50, 88], which is a recursive embedding of XY4.

  2. (b)

    The robust genetic algorithm (RGA) family [74].

  3. (c)

    The universally robust (UR) family [54].

  4. (d)

    The Uhrig dynamical decoupling (UDD) family using $X$ pulses, UDDx [51].

  5. (e–i)

    The quadratic dynamical decoupling (QDD) family [52]: with the same inner and outer order, $\text{QDD}_{n,n}$; with fixed outer order 1, $\text{QDD}_{1,m}$; with fixed outer order 2, $\text{QDD}_{2,m}$; with fixed outer order 3, $\text{QDD}_{3,m}$; and with fixed outer order 4, $\text{QDD}_{4,m}$, respectively.
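
To make these family definitions concrete, here is a brief sketch, with our own helper names and one common pulse-ordering convention (conventions differ across Refs. [50, 51, 52], so this should be read as an illustration rather than the definitive construction), of how the CDD recursion, the non-uniform UDD pulse times, and the nested QDD schedule are generated:

```python
# Illustrative constructions of CDD, UDD, and QDD (assumed conventions).
import numpy as np

def cdd(n):
    """CDD_n as a string of pulses and free periods 'f':
    CDD_1 = XY4 = X-f-Y-f-X-f-Y-f (one convention), and CDD_n embeds
    CDD_{n-1} into each free period of XY4."""
    if n == 0:
        return "f"
    inner = cdd(n - 1)
    return "X" + inner + "Y" + inner + "X" + inner + "Y" + inner

def udd_times(n, T=1.0):
    """UDD_n pulse times t_j = T sin^2(j*pi / (2n + 2)), j = 1..n."""
    j = np.arange(1, n + 1)
    return T * np.sin(j * np.pi / (2 * n + 2)) ** 2

def qdd_schedule(n, m, T=1.0):
    """QDD_{n,m}: an outer UDD_n layer (Z-type pulses at Uhrig times) with
    an inner UDD_m layer (X-type pulses) rescaled into each of the n + 1
    outer intervals."""
    edges = np.concatenate(([0.0], udd_times(n, T), [T]))
    events = [("Z", t) for t in udd_times(n, T)]
    for a, b in zip(edges[:-1], edges[1:]):
        events += [("X", t) for t in a + udd_times(m, b - a)]
    return sorted(events, key=lambda e: e[1])

print(cdd(2))              # XY4 with an XY4 nested in each free period
print(udd_times(4))        # non-uniform Uhrig times on [0, T]
print(qdd_schedule(1, 2))  # outer Z layer with nested X layers
```

In an actual implementation, each free period would be realized as idle evolution whose duration is set by the pulse-interval convention of Appendix A.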

For each family, we expect the empirical hierarchy of sequence performance to be a complicated function of device-specific properties. Specifically, actual performance reflects a competition between (i) error cancellation order, (ii) the number of free evolution periods, and (iii) systematic pulse errors due to finite width and miscalibration, among other factors discussed in Section II. For each family, we summarize our expectations regarding these factors below; a toy numerical illustration of the flip-angle-error competition follows the list.

  1. (a)

    For ideal pulses, we expect $\text{CDD}_{n+1}>\text{CDD}_{n}\geq\text{KDD}=\text{S-XY4}=\text{XY4}>\text{S-CPMG}=\text{CPMG}=\text{S-Hahn}>\text{Hahn}$. With a finite bandwidth constraint, we expect $\text{CDD}_{n+1}>\text{CDD}_{n}$ to hold only up to some optimal concatenation level $n_{\text{opt}}$, after which performance saturates. Using finite-width pulses with systematic errors, we expect $\text{S-Hahn}>\text{Hahn}$ (and similarly for the other S sequences) and $\text{KDD}>\text{XY4}$, provided the additional robustness is helpful. That is, if the pulses are extremely well calibrated and errors are dominated by latent bath-induced errors, then we should instead see $\text{Hahn}>\text{S-Hahn}$.

  2. (b)

    The expected performance hierarchy for RGA is rather complicated, as indicated by the labeling, and is best summarized in depth by Table II of Ref. [74]. A quick summary is that if we have strong pulses dominated by miscalibration errors ($\epsilon\Delta\ll\epsilon_{r}$), then we expect $\text{RGA}_{8a}$ and $\text{RGA}_{64a}$ to do well. In the opposite limit, we expect $\text{RGA}_{4}$, $\text{RGA}_{8c}$, $\text{RGA}_{16b}$, and $\text{RGA}_{64c}$ to do well. The subscript indicates the number of pulses; as it increases, so does the decoupling order $\mathcal{O}(\tau^{n})$, and the same competition between order cancellation and free evolution periods as in $\text{CDD}_{n}$ applies.

  3. (c)

    The $\text{UR}_{n}$ sequence provides $\mathcal{O}(\epsilon_{r}^{n/2})$ suppression of flip-angle errors at the expense of using $n$ free evolution periods. The relationship of $n$ to the $\mathcal{O}(\tau)$ decoupling order is not well established in Ref. [54], but by construction it appears to be $\mathcal{O}(\tau^{2})$ for all $n$. Thus our expectation is that $\text{UR}_{n}$ improves with increasing $n$ until performance saturates and the $\mathcal{O}(\tau^{2})$ contribution dominates the $\mathcal{O}(\epsilon_{r}^{n/2})$ contribution. To see this, note that for a fixed time, the number of free evolution periods is roughly the same regardless of $n$.

  4. (d)

    Ideally, for a fixed demonstration duration $T$, the performance of $\text{UDDx}_{n}$ should scale as $\mathcal{O}(\tau^{n})$, and hence improve monotonically with increasing $n$. In practice, this performance should saturate once the finite pulse-width error $\mathcal{O}(\Delta)$ becomes the dominant noise contribution.

  5. (e–i)

    An extensive numerical study of $\text{QDD}_{n,m}$ performance is given in Ref. [66], with corresponding rigorous proofs in Ref. [68]. For ideal pulses, the decoupling order is expected to be at least $\mathcal{O}(t_{s}^{\min\{m,n\}})$, where $t_{s}$ is the total evolution time of a single sequence repetition. Since we are instead interested in a fixed total time $T$ consisting of multiple sequence repetitions with a minimal pulse interval, the interplay of competing factors is quite complicated. Further, we are forced to apply rotations about $X$, $Y$, and $Z$ to implement $\text{QDD}_{n,m}$ when $m$ is odd, but as noted in Appendix B, $Z$ is virtual without OpenPulse. In summary, the naive theoretical expectation is that $\text{QDD}_{n,m}$ should improve with increasing $\min\{n,m\}$, eventually saturating for the same reasons as $\text{UDDx}_{n}$. However, we expect the fixed-$T$ and virtual-$Z$ setup to complicate the actual results.
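
The toy sketch below (our own construction, unrelated to the device data) isolates factor (iii): it composes one cycle of CPMG and of XY4 from over-rotated $\pi$ pulses, with free evolution and decoherence switched off, and reports how far each cycle unitary is from the ideal identity of Eq. (74). In this simple setting, XY4's alternating pulse axes cancel the leading over-rotation error, while CPMG's repeated axis does not.

```python
# Toy flip-angle-error comparison: CPMG (X-X) vs. XY4 (X-Y-X-Y) cycles
# built from pi*(1 + eps_r) over-rotations, no decoherence or delays.
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)

def rot(axis, angle):
    # R_axis(angle) = exp(-i * angle * axis / 2) for a Pauli axis
    return np.cos(angle / 2) * np.eye(2) - 1j * np.sin(angle / 2) * axis

def cycle_error(pulse_axes, eps_r):
    # Compose one cycle of imperfect pi pulses; 1 - |tr(U)|/2 vanishes
    # iff U is proportional to the identity, i.e., the cycle is ideal.
    U = np.eye(2, dtype=complex)
    for axis in pulse_axes:
        U = rot(axis, np.pi * (1 + eps_r)) @ U
    return 1 - abs(np.trace(U)) / 2

for eps_r in [0.01, 0.02, 0.04, 0.08]:
    print(f"eps_r={eps_r:.2f}  CPMG: {cycle_error([X, X], eps_r):.2e}  "
          f"XY4: {cycle_error([X, Y, X, Y], eps_r):.2e}")
```

The printed errors grow much more slowly for XY4 than for CPMG as $\epsilon_{r}$ increases, illustrating why built-in robustness can matter more than nominal decoupling order on a miscalibrated device.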

Figure 21: Collection of all Pauli-demonstration results for ibmq_armonk. The best-performing sequence from each family (solid purple line) substantially outperforms Free (orange) and CPMG (blue) but only marginally differs from XY4 (cyan).
(a) Results are not consistent with ideal-pulse theory but are sensible once realistic competing factors are considered. First, $\text{S-Hahn}>\text{Hahn}$ and $\text{S-CPMG}>\text{CPMG}$ are consistent with the large pulse widths on ibmq_armonk, for which $\Delta\approx 142$ ns. That XY4 works very well is also consistent with its expected approximate universality, given that $T_{1}\approx\frac{1}{2}T_{2}$ on ibmq_armonk. Given that XY4 already performs well, it is not surprising that its robust versions, S-XY4 and KDD, do worse. In particular, they have little room for improvement but add extra free evolution periods that accumulate additional error. Finally, $\text{CDD}_{n}$ is roughly flat for all $n$ tested, which is consistent with an expected saturation that happens to occur at $n_{\text{opt}}=1$.
(b) The performance of RGA does not match theoretical expectations. To see this, note that $\text{RGA}_{4}$ is itself a four-pulse sequence with error scaling $\mathcal{O}(\tau^{2})$, the same as XY4. Hence, it should perform comparably to XY4, but it does significantly worse. Furthermore, it is also unexpected that the best RGA sequences are 8a, 64a, and 64c. Indeed, 8a and 64a are expected to work best in a flip-angle-error dominated regime, but in this regime 64c has a scaling of $\mathcal{O}(\epsilon_{r}^{2})$, so it is not expected to do well.
(c) The $\text{UR}_{n}$ sequence performance is consistent with theory. First note that $\text{UR}_{4}=\text{XY4}$. Thus, we expect $\text{UR}_{n}$ for $n>4$ to improve upon or saturate at the performance of XY4, and the latter is what happens.
(d) The $\text{UDDx}_{n}$ sequence performance is consistent with theory. In particular, we expect (and observe) a consistent increase in performance with increasing $n$ until performance saturates.
(e)-(i) The $\text{QDD}_{n,m}$ results partly match theoretical expectations, since they exhibit a strong even-odd effect, as predicted in Refs. [66, 68, 71]. Nevertheless, the optimal choice of $n$ and $m$ has to be fine-tuned for ibmq_armonk. We note that five of the top ten sequences on ibmq_armonk are from the QDD family.
Figure 22: Collection of all Pauli-demonstration results for ibmq_bogota. The best-performing sequence from each family (solid purple line) substantially outperforms Free (orange), CPMG (blue), and XY4 (cyan).
(a) Results are not fully consistent with ideal-pulse theory. For example, it is unexpected that $\text{Hahn}>\text{CPMG}$ and that $\text{Hahn}<\text{S-Hahn}$ while at the same time $\text{S-XY4}>\text{XY4}$. Nevertheless, some trends match expectations, such as $\text{CDD}_{1}<\text{CDD}_{2}<\text{CDD}_{3}$ followed by saturation.
(b) Results are somewhat consistent with ideal-pulse theory. First, $\text{RGA}_{4}\approx\text{XY4}$ in performance, which is sensible. The first large improvement comes from $\text{RGA}_{8c}$, and then from the sequences numbered 32 and greater. This trend mirrors the improvement of $\text{CDD}_{n}$ with increasing $n$, which is expected since the larger RGA sequences are also recursively defined (e.g., $\text{RGA}_{64c}$ is a recursive embedding of $\text{RGA}_{8c}$ into itself). However, it is again unexpected that both ‘a’ and ‘c’ sequences work well at the same time. For example, $\text{RGA}_{8c}>\text{RGA}_{8a}$ theoretically means that ibmq_bogota has negligible flip-angle error. In such a regime, the decoupling order of $\text{RGA}_{8a}$ is the same as that of $\text{RGA}_{64a}$, since they are designed to cancel flip-angle errors. However, we find $\text{RGA}_{64a}>\text{RGA}_{8a}$ in practice.
(c) The $\text{UR}_{n}$ results are mostly consistent with theory. First, $\text{UR}_{6}$ is an improvement over XY4, and though $\text{UR}_{n}$ performance does increase with larger $n$, it is not simply monotonic as theory would suggest. Instead, we find a generally increasing trend with an empirical optimum at $n=20$.
(d) The $\text{UDDx}_{n}$ results are mostly consistent with expectations. Again, performance mostly increases with increasing $n$, but the increase is not fully monotonic.
(e)-(i) The $\text{QDD}_{n,m}$ results are fairly consistent with theory. In (e), the performance of $\text{QDD}_{n,n}$ increases with $n$ until $n=3$. The degradation at $n=4$ is consistent with expectations in the bandwidth-limited setting [69]. In (f)-(i) the results are again largely as expected: aside from parity effects (odd/even $m$), for $\text{QDD}_{n,m}$ we expect monotonic improvement with increasing $m$ until $n=m$, after which performance should saturate or even slightly improve; this is the general empirical trend.
Figure 23: Collection of all Pauli-demonstration results for ibmq_jakarta. The best-performing sequence from each family (solid purple line) substantially outperforms Free (orange) and CPMG (blue) but only marginally differs from XY4 (cyan).
(a) Results are not fully consistent with ideal-pulse theory. For example, it is unexpected that $\text{Hahn}>\text{CPMG}$ and that $\text{Hahn}>\text{S-Hahn}$ while at the same time $\text{S-XY4}>\text{XY4}$. Nevertheless, some trends match expectations, such as $\text{CDD}_{1}<\text{CDD}_{2}<\text{CDD}_{3}>\text{CDD}_{4}$, followed by saturation.
(b) Results are somewhat consistent with theory. First, $\text{RGA}_{4}>\text{XY4}$ is sensible and implies that we are in a flip-angle-error dominated regime. However, we would then expect $\text{RGA}_{4}>\text{RGA}_{8a}$, which does not occur. Nevertheless, the recursively defined sequences (numbered 32 and above) generally outperform their shorter counterparts, similar to $\text{CDD}_{n}$, as expected.
(c) The $\text{UR}_{n}$ results are consistent with theory. First, $\text{UR}_{6}$ is a large improvement over XY4, and from there $\text{UR}_{10}>\text{UR}_{6}$. After this, the improvement plateaus.
(d) The performance of $\text{UDDx}_{n}$ differs greatly from the theoretical expectation of monotonic improvement with $n$. In fact, the behavior is so erratic that we suspect device calibration errors dominated the demonstration. Nevertheless, the performance of $\text{UDDx}_{1}$ was excellent in this case.
(e)-(i) The $\text{QDD}_{n,m}$ results are fairly consistent with theory and quite similar to Fig. 22; the same comments made there apply here.

References