Universal Sampling Lower Bounds for Quantum Error Mitigation

Ryuji Takagi [email protected] Department of Basic Science, The University of Tokyo, Tokyo 153-8902, Japan Nanyang Quantum Hub, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore Hiroyasu Tajima [email protected] Department of Communication Engineering and Informatics, University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo, 182-8585, Japan JST, PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama, 332-0012, Japan Mile Gu [email protected] Nanyang Quantum Hub, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore Centre for Quantum Technologies, National University of Singapore, 3 Science Drive 2, 117543, Singapore MajuLab, CNRS-UNS-NUS-NTU International Joint Research Unit UMI 3654, Singapore

Abstract

Numerous quantum error-mitigation protocols have been proposed, motivated by the critical need to suppress noise effects on intermediate-scale quantum devices. Yet, their general potential and limitations remain elusive. In particular, to understand the ultimate feasibility of quantum error mitigation, it is crucial to characterize the fundamental sampling cost — how many times an arbitrary mitigation protocol must run a noisy quantum device. Here, we establish universal lower bounds on the sampling cost for quantum error mitigation to achieve the desired accuracy with high probability. Our bounds apply to general mitigation protocols, including the ones involving nonlinear postprocessing and those yet-to-be-discovered. The results imply that the sampling cost required for a wide class of protocols to mitigate errors must grow exponentially with the circuit depth for various noise models, revealing the fundamental obstacles in the scalability of useful noisy near-term quantum devices.

Introduction — As recent technological developments have started to realize controllable small-scale quantum devices, a central problem in quantum information science has been to pin down what can and cannot be accomplished with noisy intermediate-scale quantum (NISQ) devices [1]. One of the most relevant issues in understanding the ultimate capability of quantum hardware is to characterize how well noise effects could be circumvented. This is especially so for NISQ devices, as today’s quantum devices generally cannot accommodate full quantum error correction that requires scalable quantum architecture. As an alternative to quantum error correction, quantum error mitigation has recently attracted much attention as a potential tool to help NISQ devices realize useful applications [2, 3]. It is thus of primary interest from practical and foundational viewpoints to understand the ultimate feasibility of quantum error mitigation.

Quantum error mitigation protocols generally involve running available noisy quantum devices many times. The collected data is then post-processed to infer classical information of interest. While this avoids the engineering challenge in error correction, it comes at the price of sampling cost — computational overhead in having to sample a noisy device many times. This sampling cost represents the crucial quantity determining the feasibility of quantum error mitigation. If the required sampling cost becomes too large, then such quantum error mitigation protocol becomes infeasible under a realistic time constraint. Various prominent quantum error mitigation methods face this problem, where sampling cost grows exponentially with circuit size [4, 5, 6, 7, 8]. The crucial question then is whether there is hope to come up with a new error mitigation strategy that avoids this hurdle or if this is a universal feature shared by all quantum error mitigation protocols. To answer this question, we need a characterization of the sampling cost that is universally required for the general error-mitigation protocols, which has hitherto been unknown.

Here, we provide a solution to this problem. We derive lower bounds for the number of samples fundamentally required for general quantum error mitigations to realize the target performance. We then show that the required samples for a wide class of mitigation protocols to error-mitigate layered circuits under various noise models — including the depolarizing and stochastic Pauli noise — must grow exponentially with the circuit depth to achieve the target performance. This turns the conjecture that quantum error mitigation would generally suffer from the exponential sampling overhead into formal relations, extending the previous results on the exponential resource overhead required for noisy circuits without postprocessing [9, 10, 11]. We accomplish these by employing an information-theoretic approach, which establishes the novel connection between the state distinguishability and operationally motivated error-mitigation performance measures. Our results place the fundamental limitations imposed on the capability of general error-mitigation strategies that include existing protocols [5, 6, 12, 13, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 8, 25, 26, 27, 28, 29, 30, 31] and the ones yet to be discovered, being analogous to the performance converse bounds established in several other disciplines — such as thermodynamics [32, 33, 34], quantum communication [35, 36], and quantum resource theories [37, 38] — that contributed to characterizing the ultimate operational capability allowed in each physical setting.

Our work complements and extends several recent advancements in the field. Ref. [39] introduced a general framework of quantum error mitigation and established lower bounds for the maximum estimator spread, i.e., the range of the outcomes of the estimator, imposed on all error mitigation in the class, which provides a sufficient number of samples to ensure the target accuracy. Those bounds were then employed to show that the maximum spread grows exponentially with the circuit depth to mitigate local depolarizing noise. Ref. [40] showed a related result where for the class of error-mitigation strategies that only involve linear postprocessing, in which the target expectation value can be represented by a linear combination of the actually observed quantities, either the maximum estimator spread or the sample number needs to grow exponentially with the circuit depth to mitigate local depolarizing noise. The severe obstacle induced by noise in showing a quantum advantage for variational quantum algorithms has also recently been studied [41, 42, 43]. Our results lift the observations made in these works to rigorous bounds for the necessary sampling cost required for general error mitigation, including the ones involving nonlinear postprocessing that constitute a large class of protocols [12, 13, 7, 14, 15, 16, 17, 25, 27, 29, 30, 21, 22, 24, 44, 19, 31].

Framework — Suppose we wish to obtain the expectation value of an observable $A\in\mathbb{O}$ for an ideal state $\rho\in\mathbb{S}$ where $\mathbb{O}$ and $\mathbb{S}$ are some sets of observables and states. We assume that the ideal quantum state $\rho$ is produced by a unitary quantum circuit $\mathcal{U}$ applied to the initial state $\rho_{\rm ini}\in\mathbb{S}_{\rm in}$ as $\rho=\mathcal{U}(\rho_{\rm ini})$ where $\mathbb{S}_{\rm in}$ is the set of possible input states. The noise in the circuit, however, prevents us from preparing the state $\rho$ exactly. We consider quantum error mitigation protocols that aim to estimate the true expectation value under the presence of noise in the following manner [39] (see also Fig. 1).

In the mitigation procedure, one can first modify the circuit, e.g., use a different choice of unitary gates with potential circuit simplification, apply nonadaptive operations (enabling, e.g., dynamical decoupling [45, 46] and Pauli twirling [6]), and supply ancillary qubits — the allowed modifications are determined by the capability of the available device. Together with the noise present in the modified circuit, this turns the original unitary $\mathcal{U}$ into some quantum channel $\mathcal{F}$ , which produces a distorted state $\rho^{\prime}$ . The distorted state can be represented in terms of the ideal state $\rho$ by $\rho^{\prime}=\mathcal{E}(\rho)$ where we call $\mathcal{E}\coloneqq\mathcal{F}\circ\mathcal{U}^{\dagger}$ an effective noise channel.

The second step consists of collecting $N$ samples $\{\mathcal{E}_{n}(\rho)\}_{n=1}^{N}$ of distorted states represented by a set of effective noise channels $\mathbb{E}\coloneqq\{\mathcal{E}_{n}\}_{n=1}^{N}$ and applying a trailing quantum process $\mathcal{P}_{A}$ over them. The effective noise channels in $\mathbb{E}$ can be different from each other in general, as noisy hardware could have different noise profiles each time, or could purposely change the noise strength [47, 5]. The trailing process $\mathcal{P}_{A}$ then outputs an estimate represented by a random variable $\hat{E}_{A}(\rho)$ for the true expectation value $\operatorname{Tr}(A\rho)$ . The main focus of our study is the sampling number $N$ , the total number $N$ of distorted states used in the error mitigation process.

We quantify the performance of an error-mitigation protocol by how well the protocol can estimate the expectation values for a given set $\mathbb{O}$ of observables and a set $\mathbb{S}$ of ideal states, which we call target observables and target states respectively. We keep the choices of these sets general, and they can be flexibly chosen depending on one’s interest. For instance, if one is interested in error mitigation protocols designed to estimate the Pauli observables (e.g., virtual distillation [15, 16, 21]), $\mathbb{O}$ can be chosen as the set of Pauli operators. As the trailing process includes a measurement depending on the observable, an error-mitigation strategy with target observables $\mathbb{O}$ is equipped with a family of trailing processes $\{\mathcal{P}_{A}\}_{A\in\mathbb{O}}$ . Similarly, our results hold for an arbitrary choice of $\mathbb{S}$ , where one can, for instance, choose this as the set of all quantum states, which better describes the protocols such as probabilistic error cancellation [47, 5, 48, 49, 50, 51, 52], or as the set of states in a certain subspace, which captures the essence of subspace expansion [12, 14, 17].

This framework includes many error-mitigation protocols proposed so far [5, 6, 12, 13, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 8, 25, 26, 27, 28, 29, 30, 31]. It is worth noting that our framework includes protocols that involve nonlinear postprocessing of the measurement outcomes. Error-mitigation protocols typically work by (1) making some set of (usually Pauli) measurements for observables $\{O_{i}\}_{i}$ , (2) estimating their expectation values $\{\left\langle O_{i}\right\rangle\}_{i}$ for distorted states, and (3) applying a classical postprocessing function $f$ over them. The protocols with linear postprocessing functions, i.e., the ones with the form $f(\left\langle O_{i}\right\rangle_{i})=\sum_{i}c_{i}\left\langle O_{i}\right\rangle$ , are known to admit simpler analysis [40, 39], but numerous protocols — including virtual distillation [15, 16, 21], symmetry verification [13], and subspace expansion [12, 14, 17] — come with nonlinear postprocessing functions. In our framework, the sampling number $N$ is the total number of samples used, where we consider the output represented by $\hat{E}_{A}(\rho)$ as our final guess and thus do not generally assume repeating some procedure many times and take a statistical average. This enables us to have any postprocessing absorbed in the trailing process $\mathcal{P}_{A}$ , making our results valid for the protocols with nonlinear postprocessing functions.

We also remark that our framework includes protocols with much more operational power than existing protocols, as we allow the trailing process to apply any coherent interaction over all distorted states. Our results thus provide fundamental limits on the sampling overhead applicable to an arbitrary protocol in this extended class of error-mitigation protocols.

Refer to caption — Figure 1: Framework of quantum error mitigation. For an ideal state $\rho\in\mathbb{S}$ and an observable $A\in\mathbb{O}$ of interest, we first prepare $N$ copies of distorted states $\{\mathcal{E}_{n}(\rho)\}_{n=1}^{N}$ , where $\mathbb{E}=\{\mathcal{E}_{n}\}_{n=1}^{N}$ is the set of effective noise channels. A trailing quantum process $\mathcal{P}_{A}$ is then applied to $N$ distorted states, producing the final estimation of $\operatorname{Tr}(A\rho)$ represented by a random variable $\hat{E}_{A}(\rho)$ . We quantify the error-mitigation performance in two ways by studying the property of the distribution of $\hat{E}_{A}(\rho)$ ; the first is the combination of the accuracy $\delta$ and the success probability $1-\varepsilon$ , and the second is the combination of the bias $b_{A}(\rho)\coloneqq\left\langle\hat{E}_{A}(\rho)\right\rangle-\operatorname{Tr}(A\rho)$ and the standard deviation $\sigma_{A}^{\operatorname{QEM}}$ of $\hat{E}_{A}(\rho)$ .

Sampling lower bounds — We now consider the required samples to ensure the target performance. The performance of quantum error mitigation can be defined in multiple ways. Here, we consider two possible performance quantifiers that are operationally relevant.

Our first performance measure is the combination of the accuracy of the estimate and the success probability. This closely aligns with the operational motivation, where one would like an error mitigation strategy to be able to provide a good estimate for each observable in $\mathbb{O}$ and an ideal state in $\mathbb{S}$ at a high probability. This can be formalized as a condition

\displaystyle{\rm Prob}(|\operatorname{Tr}(A\rho)-\hat{E}_{A}(\rho)|\leq\delta)\geq 1-\varepsilon,\ \forall\rho\in\mathbb{S},\,\forall A\in\mathbb{O}

(1)

where $\delta$ is the target accuracy and $1-\varepsilon$ is the success probability (see also Fig. 1).

The problem then is to identify lower bounds on the number $N$ of distorted states needed to achieve this condition as a function of $\delta$ and $\varepsilon$ . We address this by observing that the trailing process of quantum error mitigation is represented as an application of a quantum channel and thus can never increase the state distinguishability. To formulate our result, let us define the observable-dependent distinguishability with respect to a set $\mathbb{O}$ of observables as

\displaystyle D_{\mathbb{O}}(\rho,\sigma)\coloneqq\max_{A\in\mathbb{O}}|\operatorname{Tr}[A(\rho-\sigma)]|.

(2)

This quantity can be understood as the resolution in distinguishing two quantum states by using the measurements of the observables in $\mathbb{O}$ . We note that when $\mathbb{O}=\{A\,|0\leq A\leq\mathbb{I}\}$ ¹¹1Here, we consider the standard matrix inequality with respect to the positive semidefinite cone, i.e., $M_{1}\leq M_{2}$ for two matrices $M_{1}$ and $M_{2}$ means that the matrix $M_{2}-M_{1}$ is positive semidefinite., the quantity in (2) becomes the trace distance $D_{\operatorname{tr}}(\rho,\sigma)=\frac{1}{2}\|\rho-\sigma\|_{1}$ [54].

We then obtain the following sampling lower bounds applicable to an arbitrary given set $\mathbb{E}$ of effective noise channels. (Proof in Appendix A ²²2See the Supplemental Material for detailed proofs and discussions of our main results, which includes Refs. [56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80].)

Theorem 1.

Suppose that an error-mitigation strategy achieves (1) with some $\delta\geq 0$ and $0\leq\varepsilon\leq 1/2$ with $N$ distorted states characterized by the effective noise channels $\mathbb{E}=\{\mathcal{E}_{n}\}_{n=1}^{N}$ . Then, the sample number $N$ is lower bounded as

	$\displaystyle N$	$\displaystyle\geq\max_{\begin{subarray}{c}\rho,\sigma\in\mathbb{S}\\ D_{\mathbb{O}}(\rho,\sigma)\geq 2\delta\end{subarray}}\min_{\mathcal{E}\in\mathbb{E}}\frac{\log\left[\frac{1}{4\varepsilon(1-\varepsilon)}\right]}{\log\left[1/F(\mathcal{E}(\rho),\mathcal{E}(\sigma))\right]},$		(3)
	$\displaystyle N$	$\displaystyle\geq\max_{\begin{subarray}{c}\rho,\sigma\in\mathbb{S}\\ D_{\mathbb{O}}(\rho,\sigma)\geq 2\delta\end{subarray}}\min_{\mathcal{E}\in\mathbb{E}}\frac{2(1-2\varepsilon)^{2}}{\ln 2\cdot S(\mathcal{E}(\rho)\\|\mathcal{E}(\sigma))},$		(3)

where $F(\rho,\sigma)\coloneqq\|\sqrt{\rho}\sqrt{\sigma}\|_{1}^{2}$ is the (square) fidelity and $S(\rho\|\sigma)\coloneqq\operatorname{Tr}(\rho\log\rho)-\operatorname{Tr}(\rho\log\sigma)$ is the relative entropy.

This result tells that if the noise effect brings states close to each other, it incurs an unavoidable sampling cost to error mitigation. The minimization over $\mathbb{E}$ chooses the effective noise channel that least reduces the infidelity and the relative entropy respectively. On the other hand, the maximum over the ideal states represents the fact that to mitigate two states $\rho$ and $\sigma$ that are separated further than $2\delta$ in terms of observables in $\mathbb{O}$ , the sample number $N$ that achieves the accuracy $\delta$ and the success probability $1-\varepsilon$ must satisfy the lower bounds with respect to $\rho$ and $\sigma$ . The maximization over such $\rho$ and $\sigma$ then provides the tightest lower bound. This also reflects the observation that error mitigation accommodating a larger set $\mathbb{O}$ of target observables would require a larger number of samples.

We remark that although the set $\mathbb{E}$ — which depends on how one modifies the noisy circuit — ultimately depends on a specific error-mitigation strategy in mind, fixing $\mathbb{E}$ to a certain form already provides useful insights as we see later in the context of noisy layered circuits. We also stress that the above bounds hold for an arbitrary choice of $\mathbb{E}$ , providing the general relation between the error mitigation performance and the information-theoretic quantity.

The bounds in Theorem 1 depend on the accuracy $\delta$ implicitly through the constraints on $\rho$ and $\sigma$ in the maximization. For instance, if one sets $\delta=0$ , one can find that both bounds diverge, as the choice of $\sigma=\rho$ would be allowed in the maximization. In Appendix B, we report an alternative bound that has an explicit dependence on the accuracy $\delta$ .

Let us now consider our second performance measure based on the standard deviation and the bias of the estimate. Let $\sigma_{A}^{\operatorname{QEM}}(\rho)$ be the standard deviation of $\hat{E}_{A}(\rho)$ for an observable $A\in\mathbb{O}$ , which represents the uncertainty of the final estimate of an error mitigation protocol. Since a good error mitigation protocol should come with a small fluctuation in its outcome, the standard deviation of the underlying distribution for the estimate can serve as a performance quantifier. However, the standard deviation itself is not sufficient to characterize the error mitigation performance, as one can easily come up with a useless strategy that always outputs a fixed outcome, which has zero standard deviation. This issue can be addressed by considering the deviation of the expected value of the estimate from the true expectation value called bias, defined as $b_{A}(\rho)\coloneqq\left\langle\hat{E}_{A}(\rho)\right\rangle-\operatorname{Tr}(A\rho)$ for a state $\rho\in\mathbb{S}$ and an observable $A\in\mathbb{O}$ (see also Fig. 1).

To assess the performance of error-mitigation protocols, we consider the worst-case error among possible ideal states and measurements. This motivates us to consider the maximum standard deviation $\sigma_{\max}^{\operatorname{QEM}}\coloneqq\max_{A\in\mathbb{O}}\max_{\rho\in\mathbb{S}}\sigma_{A}^{\operatorname{QEM}}(\rho)$ and the maximum bias $b_{\max}\coloneqq\max_{A\in\mathbb{O}}\max_{\rho\in\mathbb{S}}b_{A}(\rho)$ . Then, we obtain the following sampling lower bound in terms of these performance quantifiers. (Proof in Appendix C.)

Theorem 2.

The sampling cost for an error-mitigation strategy with the maximum standard deviation $\sigma_{\max}^{\operatorname{QEM}}$ and the maximum bias $b_{\max}$ is lower bounded as

\displaystyle N\geq\max_{\begin{subarray}{c}\rho,\sigma\in\mathbb{S}\\ D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0\end{subarray}}\min_{\mathcal{E}\in\mathbb{E}}\frac{\log\left[1-\frac{1}{\left(1+\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}\right)^{2}}\right]^{-1}}{\log\left[1/F(\mathcal{E}(\rho),\mathcal{E}(\sigma))\right]}.

(4)

This result represents the trade-off between the standard deviation, bias, and the required sampling cost. To realize the small standard deviation and bias, error mitigation needs to use many samples; in fact, the lower bound diverges at the limit of $\sigma_{\max}^{\operatorname{QEM}}\to 0$ whenever there exist states $\rho,\sigma\in\mathbb{S}$ such that $D_{\mathbb{O}}(\rho,\sigma)\geq 2b_{\max}$ . On the other hand, a larger bias results in a smaller sampling lower bound, indicating a potential to reduce the sampling cost by giving up some bias.

The bounds in Theorems 1, 2 are universally applicable to arbitrary error mitigation protocols in our framework. Therefore, our bounds are not expected to give good estimates for a given specific error-mitigation protocol in general, just as there is a huge gap between the Carnot efficiency and the efficiency of most of the practical heat engines. Nevertheless, it is still insightful to investigate how our bounds are compared to existing mitigation protocols. In Appendix D, we compare the bound in Theorem 1 to the sampling cost for several error-mitigation methods, showing that our bound can provide nontrivial lower bounds with the gap being the factor of 3 to 6. Although this does not guarantee that our bound behaves similarly for other scenarios in general, this ensures that there is a setting in which the bound in Theorem 1 can provide a nearly tight estimate. We further show in Appendix E that the scaling of the lower bound in Theorem 2 with noise strength can be achieved by the probabilistic error cancellation method in a certain scenario. This shows that probabilistic error cancellation serves as an optimal protocol in this specific sense, complementing the recent observation on the optimality of probabilistic error cancellation established for the maximum estimator spread measure [39].

Noisy layered circuits — The above results clarify the close relation between the sampling cost and state distinguishability. As an application of our general bounds, we study the inevitable sample overhead to mitigate noise in the circuits consisting of multiple layers of unitaries. Although we here focus on the local depolarizing noise, our results can be extended to a number of other noise models as we discuss later.

Suppose that an $M$ -qubit quantum circuit consists of layers of unitaries, each of which is followed by a local depolarizing noise, i.e., a depolarizing noise of the form $\mathcal{D}_{p}=(1-p)\operatorname{id}+p\mathbb{I}/2$ where $p$ is a noise strength, applies to each qubit. We aim to estimate ideal expectation values for the target states $\mathbb{S}$ and observables $\mathbb{O}$ by using $N$ such noisy layered circuits. Although the noise strength can vary for different locations, we suppose that $L$ layers are followed by the local depolarizing noise with noise strength of at least $\gamma$ . We call these layers $U_{1},U_{2},\dots,U_{L}$ and let $\gamma_{n,l,m}$ denote the noise strength of the local depolarizing noise on the $m^{\rm th}$ qubit after the $l^{\rm th}$ unitary layer $U_{l}$ in the $n^{\rm th}$ noisy circuit, where $m\leq M,\,l\leq L,\,n\leq N$ . This gives the expression of the local depolarizing noise after $l^{\rm th}$ layer in the $n^{\rm th}$ noisy circuit as $\otimes_{m=1}^{M}\mathcal{D}_{\gamma_{n,l,m}}$ , where $\gamma_{n,l,m}\geq\gamma\,\forall n,l,m$ .

Here, we focus on the error-mitigation protocols that apply an arbitrary trailing process over $N$ distorted states and any unital operations (i.e., operations that preserve the maximally mixed state) before and after $U_{l}$ (Fig. 2). This structure ensures that error correction does not come into play here, as the size of input and output spaces of the intermediate unital channels is restricted to $M$ qubits, as well as that unital channels do not serve as good decoders for error correction.

We show that the necessary number of samples required to achieve the target performance grows exponentially with the number of layers in both performance quantifiers introduced above.

Theorem 3.

Suppose that an error-mitigation strategy described above is applied to an $M$ -qubit circuit to mitigate local depolarizing channels with strength at least $\gamma$ that follow $L$ layers of unitaries, and achieves (1) with some $\delta\geq 0$ and $0\leq\varepsilon\leq 1/2$ . Then, if there exist at least two states $\rho,\sigma\in\mathbb{S}$ such that $D_{\mathbb{O}}(\rho,\sigma)\geq 2\delta$ , the required sample number $N$ is lower bounded as

\displaystyle N\geq\frac{(1-2\varepsilon)^{2}}{2\ln(2)\,M(1-\gamma)^{2L}}.

(5)

The proof can be found in Appendix F. This result particularly shows that the required number of samples must grow exponentially with the circuit depth $L$ . We remark that the bound always holds under the mild condition, i.e., $D_{\mathbb{O}}(\rho,\sigma)\geq 2\delta$ for some $\rho,\sigma\in\mathbb{S}$ . This reflects that, to achieve the desired accuracy $\delta$ satisfying this condition, error mitigation really needs to extract the expectation values about the observables in $\mathbb{O}$ and the states in $\mathbb{S}$ , prohibiting it from merely making a random guess.

In Appendix G, we obtain a similar exponential growth of the required sample overhead for a fixed target bias and standard deviation. We also obtain in Appendix H alternative bounds that are tighter in the range of small $\varepsilon$ .

With a suitable modification of allowed unitaries and intermediate operations, we extend these results to a wide class of noise models, including stochastic Pauli, global depolarizing, and thermal noise. The case of thermal noise particularly provides an intriguing physical interpretation: the sampling cost $N$ required to mitigate thermal noise after time $t$ is characterized by the loss of free energy $N=\Omega(1/[F(\rho_{t})-F_{\rm eq}])$ where $\rho_{t}$ is the state at time $t$ and $F_{\rm eq}$ is the equilibrium free energy. This in turn shows that the necessary sampling cost grows as $N=\Omega(e^{\alpha_{\rm ent}t})$ where $\alpha_{\rm ent}$ is a constant characterized by the minimum entropy production rate. We provide details on these extensions in Appendix I.

We remark that Theorem 3 (and related results discussed in the Appendices) extends the previous results showing the exponential resource overhead required for noisy circuits without postprocessing [9, 10, 11]. In Appendix J, we provide further clarifications about the differences between the settings considered in the previous works and ours.

Conclusions — We established sampling lower bounds imposed on the general quantum error-mitigation protocols. Our results formalize the idea that the reduction in the state distinguishability caused by noise and error-mitigation processes leads to the unavoidable computational overhead in quantum error mitigation. We then showed that error-mitigation protocols with certain intermediate operations and an arbitrary trailing process require the number of samples that grows exponentially with the circuit depth to mitigate various types of noise. We presented these bounds with respect to multiple performance quantifiers — accuracy and success probability, as well as the standard deviation and bias — each of which has its own operational relevance.

Our bounds provide fundamental limitations that universally apply to general mitigation protocols, clarifying the underlying principle that regulates error-mitigation performance. As a trade-off, they may not give tight estimates for a given specific error-mitigation strategy, analogously to many other converse bounds established in other fields that typically give loose bounds for most specific protocols. A thorough study to identify in what setting our bounds can give good estimates will make an interesting future research direction.

Acknowledgements.

Acknowledgments — We thank Suguru Endo, Shintaro Minagawa, Masato Koashi, and Daniel Stilck França for helpful discussions, and Kento Tsubouchi, Takahiro Sagawa, and Nobuyuki Yoshioka for sharing a preliminary version of their manuscript. We acknowledge the support of the Singapore Ministry of Education Tier 1 Grants RG146/20 and RG77/22 (S), the NRF2021-QEP2- 02-P06 from the Singapore Research Foundation and the Singapore Ministry of Education Tier 2 Grant T2EP50221-0014 and the FQXi R-710-000-146-720 Grant “Are quantum agents more energetically efficient at making predictions?” from the Foundational Questions Institute and Fetzer Franklin Fund (a donor-advised fund of Silicon Valley Community Foundation). R.T. was also supported by the Lee Kuan Yew Postdoctoral Fellowship at Nanyang Technological University Singapore. H.T. is supported by JSPS Grants-in-Aid for Scientific Research No. JP19K14610 and No. JP22H05250, JST PRESTO No. JPMJPR2014, and JST MOONSHOT No. JPMJMS2061.

Note added. — During the completion of our manuscript, we became aware of an independent work by Tsubouchi et al. [81] that obtained a result related to our Theorem S.2 in Appendix G, in which they showed an alternative exponential sample lower bound applicable to error-mitigation protocols that achieve zero bias using quantum estimation theory.

References

Preskill [2018] J. Preskill, Quantum Computing in the NISQ era and beyond, Quantum 2, 79 (2018).
McArdle et al. [2020] S. McArdle, S. Endo, A. Aspuru-Guzik, S. C. Benjamin, and X. Yuan, Quantum computational chemistry, Rev. Mod. Phys. 92, 015003 (2020).
Endo et al. [2021] S. Endo, Z. Cai, S. C. Benjamin, and X. Yuan, Hybrid quantum-classical algorithms and quantum error mitigation, J. Phys. Soc. Japan 90, 032001 (2021).
Yuan et al. [2016] X. Yuan, Z. Zhang, N. Lütkenhaus, and X. Ma, Simulating single photons with realistic photon sources, Phys. Rev. A 94, 062305 (2016).
Temme et al. [2017] K. Temme, S. Bravyi, and J. M. Gambetta, Error mitigation for short-depth quantum circuits, Phys. Rev. Lett. 119, 180509 (2017).
Li and Benjamin [2017] Y. Li and S. C. Benjamin, Efficient variational quantum simulator incorporating active error minimization, Phys. Rev. X 7, 021050 (2017).
Endo et al. [2018] S. Endo, S. C. Benjamin, and Y. Li, Practical quantum error mitigation for near-future applications, Phys. Rev. X 8, 031027 (2018).
Bravyi et al. [2021] S. Bravyi, S. Sheldon, A. Kandala, D. C. Mckay, and J. M. Gambetta, Mitigating measurement errors in multiqubit experiments, Phys. Rev. A 103, 042605 (2021).
Unruh [1995] W. G. Unruh, Maintaining coherence in quantum computers, Phys. Rev. A 51, 992 (1995).
Aharonov et al. [1996] D. Aharonov, M. Ben-Or, R. Impagliazzo, and N. Nisan, Limitations of Noisy Reversible Computation, arXiv:quant-ph/9611028 (1996).
Palma et al. [1996] G. M. Palma, K.-a. Suominen, and A. Ekert, Quantum computers and dissipation, Proc. R. Soc. A: Math. Phys. Eng. Sci. 452, 567 (1996).
McClean et al. [2017] J. R. McClean, M. E. Kimchi-Schwartz, J. Carter, and W. A. de Jong, Hybrid quantum-classical hierarchy for mitigation of decoherence and determination of excited states, Phys. Rev. A 95, 042308 (2017).
Bonet-Monroig et al. [2018] X. Bonet-Monroig, R. Sagastizabal, M. Singh, and T. E. O’Brien, Low-cost error mitigation by symmetry verification, Phys. Rev. A 98, 062339 (2018).
McClean et al. [2020] J. R. McClean, Z. Jiang, N. C. Rubin, R. Babbush, and H. Neven, Decoding quantum errors with subspace expansions, Nat. Commun. 11, 636 (2020).
Koczor [2021] B. Koczor, Exponential error suppression for near-term quantum devices, Phys. Rev. X 11, 031057 (2021).
Huggins et al. [2021] W. J. Huggins, S. McArdle, T. E. O’Brien, J. Lee, N. C. Rubin, S. Boixo, K. B. Whaley, R. Babbush, and J. R. McClean, Virtual distillation for quantum error mitigation, Phys. Rev. X 11, 041036 (2021).
Yoshioka et al. [2022] N. Yoshioka, H. Hakoshima, Y. Matsuzaki, Y. Tokunaga, Y. Suzuki, and S. Endo, Generalized quantum subspace expansion, Phys. Rev. Lett. 129, 020502 (2022).
Czarnik et al. [2021] P. Czarnik, A. Arrasmith, P. J. Coles, and L. Cincio, Error mitigation with Clifford quantum-circuit data, Quantum 5, 592 (2021).
Strikis et al. [2021] A. Strikis, D. Qin, Y. Chen, S. C. Benjamin, and Y. Li, Learning-based quantum error mitigation, PRX Quantum 2, 040330 (2021).
Lowe et al. [2021] A. Lowe, M. H. Gordon, P. Czarnik, A. Arrasmith, P. J. Coles, and L. Cincio, Unified approach to data-driven quantum error mitigation, Phys. Rev. Research 3, 033098 (2021).
Czarnik et al. [2021] P. Czarnik, A. Arrasmith, L. Cincio, and P. J. Coles, Qubit-efficient exponential suppression of errors, arXiv:2102.06056 (2021).
O’Brien et al. [2021] T. E. O’Brien, S. Polla, N. C. Rubin, W. J. Huggins, S. McArdle, S. Boixo, J. R. McClean, and R. Babbush, Error mitigation via verified phase estimation, PRX Quantum 2, 020317 (2021).
Huo and Li [2022] M. Huo and Y. Li, Dual-state purification for practical quantum error mitigation, Phys. Rev. A 105, 022427 (2022).
Bultrini et al. [2023] D. Bultrini, M. H. Gordon, P. Czarnik, A. Arrasmith, M. Cerezo, P. J. Coles, and L. Cincio, Unifying and benchmarking state-of-the-art quantum error mitigation techniques, Quantum 7, 1034 (2023).
Cai [2021a] Z. Cai, Quantum Error Mitigation using Symmetry Expansion, Quantum 5, 548 (2021a).
Wang et al. [2023] K. Wang, Y.-A. Chen, and X. Wang, Mitigating quantum errors via truncated Neumann series, Sci. China Inf. Sci. 66, 180508 (2023).
Cai [2021b] Z. Cai, Multi-exponential error extrapolation and combining error mitigation techniques for NISQ applications, npj Quantum Inf. 7, 80 (2021b).
Mari et al. [2021] A. Mari, N. Shammah, and W. J. Zeng, Extending quantum probabilistic error cancellation by noise scaling, Phys. Rev. A 104, 052607 (2021).
Mezher et al. [2022] R. Mezher, J. Mills, and E. Kashefi, Mitigating errors by quantum verification and postselection, Phys. Rev. A 105, 052608 (2022).
Fontana et al. [2022] E. Fontana, I. Rungger, R. Duncan, and C. Cîrstoiu, Spectral analysis for noise diagnostics and filter-based digital error mitigation, arXiv:2206.08811 (2022).
Xiong et al. [2022] Y. Xiong, S. X. Ng, and L. Hanzo, Quantum error mitigation relying on permutation filtering, IEEE Trans. Commun. 70, 1927 (2022).
Carnot [1824] S. Carnot, Reflections on the motive power of fire, and on machines fitted to develop that power, Paris: Bachelier 108, 1824 (1824).
Landauer [1961] R. Landauer, Irreversibility and heat generation in the computing process, IBM J. Res. Dev. 5, 183 (1961).
Brandão et al. [2015] F. Brandão, M. Horodecki, N. Ng, J. Oppenheim, and S. Wehner, The second laws of quantum thermodynamics, Proc. Natl. Acad. Sci. U.S.A. 112, 3275 (2015).
Bennett et al. [1999] C. H. Bennett, P. W. Shor, J. A. Smolin, and A. V. Thapliyal, Entanglement-assisted classical capacity of noisy quantum channels, Phys. Rev. Lett. 83, 3081 (1999).
Pirandola et al. [2017] S. Pirandola, R. Laurenza, C. Ottaviani, and L. Banchi, Fundamental limits of repeaterless quantum communications, Nat. Commun. 8, 15043 (2017).
Regula and Takagi [2021] B. Regula and R. Takagi, Fundamental limitations on distillation of quantum channel resources, Nat. Commun. 12, 4411 (2021).
Fang and Liu [2022] K. Fang and Z.-W. Liu, No-go theorems for quantum resource purification: New approach and channel theory, PRX Quantum 3, 010337 (2022).
Takagi et al. [2022] R. Takagi, S. Endo, S. Minagawa, and M. Gu, Fundamental limits of quantum error mitigation, npj Quantum Inf. 8, 114 (2022).
Wang et al. [2021a] S. Wang, P. Czarnik, A. Arrasmith, M. Cerezo, L. Cincio, and P. J. Coles, Can error mitigation improve trainability of noisy variational quantum algorithms? arXiv:2109.01051 (2021a).
Wang et al. [2021b] S. Wang, E. Fontana, M. Cerezo, K. Sharma, A. Sone, L. Cincio, and P. J. Coles, Noise-induced barren plateaus in variational quantum algorithms, Nat. Commun. 12, 6961 (2021b).
Stilck França and García-Patrón [2021] D. Stilck França and R. García-Patrón, Limitations of optimization algorithms on noisy quantum devices, Nat. Phys. 17, 1221 (2021).
De Palma et al. [2023] G. De Palma, M. Marvian, C. Rouzé, and D. S. França, Limitations of variational quantum algorithms: A quantum optimal transport approach, PRX Quantum 4, 010309 (2023).
Nation et al. [2021] P. D. Nation, H. Kang, N. Sundaresan, and J. M. Gambetta, Scalable mitigation of measurement errors on quantum computers, PRX Quantum 2, 040326 (2021).
Viola and Lloyd [1998] L. Viola and S. Lloyd, Dynamical suppression of decoherence in two-state quantum systems, Phys. Rev. A 58, 2733 (1998).
Viola et al. [1999] L. Viola, E. Knill, and S. Lloyd, Dynamical decoupling of open quantum systems, Phys. Rev. Lett. 82, 2417 (1999).
Buscemi et al. [2013] F. Buscemi, M. Dall’Arno, M. Ozawa, and V. Vedral, Direct observation of any two-point quantum correlation function, arXiv:1312.4240 (2013).
Takagi [2021] R. Takagi, Optimal resource cost for error mitigation, Phys. Rev. Research 3, 033178 (2021).
Regula et al. [2021] B. Regula, R. Takagi, and M. Gu, Operational applications of the diamond norm and related measures in quantifying the non-physicality of quantum maps, Quantum 5, 522 (2021).
Jiang et al. [2021] J. Jiang, K. Wang, and X. Wang, Physical Implementability of Linear Maps and Its Application in Error Mitigation, Quantum 5, 600 (2021).
Sun et al. [2021] J. Sun, X. Yuan, T. Tsunoda, V. Vedral, S. C. Benjamin, and S. Endo, Mitigating realistic noise in practical noisy intermediate-scale quantum devices, Phys. Rev. Applied 15, 034026 (2021).
Piveteau et al. [2022] C. Piveteau, D. Sutter, and S. Woerner, Quasiprobability decompositions with reduced sampling overhead, npj Quantum Inf. 8, 12 (2022).
Note [1] Here, we consider the standard matrix inequality with respect to the positive semidefinite cone, i.e., $M_{1}\leq M_{2}$ for two matrices $M_{1}$ and $M_{2}$ means that the matrix $M_{2}-M_{1}$ is positive semidefinite.
Nielsen and Chuang [2011] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information, 10th ed. (Cambridge University Press, New York, NY, USA, 2011).
Note [2] See the Supplemental Material for detailed proofs and discussions of our main results, which includes Refs. [56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80].
Fuchs and van de Graaf [1999] C. Fuchs and J. van de Graaf, Cryptographic distinguishability measures for quantum-mechanical states, IEEE Trans. Inf. Theory 45, 1216 (1999).
Hiai et al. [1981] F. Hiai, M. Ohya, and M. Tsukada, Sufficiency, KMS condition and relative entropy in von Neumann algebras. Pac. J. Math. 96, 99 (1981).
Reeb et al. [2011] D. Reeb, M. J. Kastoryano, and M. M. Wolf, Hilbert’s projective metric in quantum information theory, J. Math. Phys. 52, 082201 (2011).
Kastoryano and Temme [2013] M. J. Kastoryano and K. Temme, Quantum logarithmic sobolev inequalities and rapid mixing, J. Math. Phys. 54, 052202 (2013).
De Palma et al. [2021] G. De Palma, M. Marvian, D. Trevisan, and S. Lloyd, The quantum wasserstein distance of order 1, IEEE Trans. Inf. Theory 67, 6627 (2021).
Gilchrist et al. [2005] A. Gilchrist, N. K. Langford, and M. A. Nielsen, Distance measures to compare real and ideal quantum processes, Phys. Rev. A 71, 062310 (2005).
Tajima et al. [2018] H. Tajima, N. Shiraishi, and K. Saito, Uncertainty relations in implementation of unitary operations, Phys. Rev. Lett. 121, 110403 (2018).
Katsube et al. [2020] R. Katsube, M. Hotta, and K. Yamaguchi, A fundamental upper bound for signal to noise ratio of quantum detectors, J. Phys. Soc. Japan 89, 054005 (2020).
Hayashi [2016] M. Hayashi, Quantum Information Theory: Mathematical Foundation (Springer, 2016).
Audenaert and Eisert [2005] K. M. R. Audenaert and J. Eisert, Continuity bounds on the quantum relative entropy, J. Math. Phys. 46, 102104 (2005).
Carbone and Martinelli [2015] R. Carbone and A. Martinelli, Logarithmic sobolev inequalities in non-commutative algebras, Infin. Dimens. Anal. Quantum Probab. Relat. Top. 18, 1550011 (2015).
Kastoryano and Brandão [2016] M. J. Kastoryano and F. G. S. L. Brandão, Quantum Gibbs Samplers: The Commuting Case, Commun. Math. Phys. 344, 915 (2016).
Müller-Hermes et al. [2016a] A. Müller-Hermes, D. Stilck França, and M. M. Wolf, Relative entropy convergence for depolarizing channels, J. Math. Phys. 57, 022202 (2016a).
Müller-Hermes et al. [2016b] A. Müller-Hermes, D. Stilck França, and M. M. Wolf, Entropy production of doubly stochastic quantum channels, J. Math. Phys. 57, 022203 (2016b).
Bardet et al. [2021] I. Bardet, Á. Capel, A. Lucia, D. Pérez-García, and C. Rouzé, On the modified logarithmic sobolev inequality for the heat-bath dynamics for 1d systems, J. Math. Phys. 62, 061901 (2021).
Beigi et al. [2020] S. Beigi, N. Datta, and C. Rouzé, Quantum Reverse Hypercontractivity: Its Tensorization and Application to Strong Converses, Commun. Math. Phys. 376, 753 (2020).
Capel et al. [2020] Á. Capel, C. Rouzé, and D. Stilck França, The modified logarithmic Sobolev inequality for quantum spin systems: classical and commuting nearest neighbour interactions, arXiv:2009.11817 (2020).
Hirche et al. [2022] C. Hirche, C. Rouzé, and D. Stilck França, On contraction coefficients, partial orders and approximation of capacities for quantum channels, Quantum 6, 862 (2022).
Müller-Lennert et al. [2013] M. Müller-Lennert, F. Dupuis, O. Szehr, S. Fehr, and M. Tomamichel, On quantum rényi entropies: A new generalization and some properties, J. Math. Phys. 54, 122203 (2013).
Gorini et al. [1976] V. Gorini, A. Kossakowski, and E. C. G. Sudarshan, Completely positive dynamical semigroups of n‐level systems, J. Math. Phys. 17, 821 (1976).
Temme et al. [2014] K. Temme, F. Pastawski, and M. J. Kastoryano, Hypercontractivity of quasi-free quantum semigroups, J. Phys. A: Math. Theor. 47, 405303 (2014).
Temme and Kastoryano [2015] K. Temme and M. J. Kastoryano, How fast do stabilizer Hamiltonians thermalize? arXiv:1505.07811 (2015).
Capel et al. [2018] Á. Capel, A. Lucia, and D. Pérez-García, Quantum conditional relative entropy and quasi-factorization of the relative entropy, J. Phys. A: Math. Theor. 51, 484001 (2018).
Temme et al. [2010] K. Temme, M. J. Kastoryano, M. B. Ruskai, M. M. Wolf, and F. Verstraete, The chi2-divergence and mixing times of quantum Markov processes, J. Math. Phys. 51, 122201 (2010).
[80] Qiskit: Device backend noise model simulations, .
Tsubouchi et al. [2023] K. Tsubouchi, T. Sagawa, and N. Yoshioka, Universal cost bound of quantum error mitigation based on quantum estimation theory, Phys. Rev. Lett. 131, 210601 (2023), published concurrently in the same issue.

Universal Sampling Lower Bounds for Quantum Error Mitigation
Supplemental Material

Appendix A Proof of Theorem 1

Proof.

For an ideal state $\rho\in\mathbb{S}$ and an observable $A\in\mathbb{O}$ , let $p_{A,\rho}$ be the probability distribution for $\hat{E}_{A}(\rho)$ . This means that we have

\displaystyle\mathcal{P}_{A}\left(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho)\right)=\int dap_{A,\rho}(a)|{a}\rangle\!\langle{a}|

(6)

where $\{\ket{a}\}$ is the classical states representing possible estimates represented by the random variable $\hat{E}_{A}(\rho)$ .

Take arbitrary two states $\rho,\sigma\in\mathbb{S}$ satisfying $|\operatorname{Tr}(A\rho)-\operatorname{Tr}(A\sigma)|\geq 2\delta$ . We assume $\operatorname{Tr}(A\rho)\geq\operatorname{Tr}(A\sigma)$ without loss of generality. Let us divide the regions of estimates into three sections as (see Fig. S.1)

\displaystyle L\coloneqq\left\{\left.a\;\rule{0.0pt}{9.5pt}\right|\;a\leq\operatorname{Tr}(A\sigma)+\delta\right\},\ M\coloneqq\left\{\left.a\;\rule{0.0pt}{9.5pt}\right|\;\operatorname{Tr}(A\sigma)+\delta<a<\operatorname{Tr}(A\rho)-\delta\right\},\ R\coloneqq\left\{\left.a\;\rule{0.0pt}{9.5pt}\right|\;a\geq\operatorname{Tr}(A\rho)-\delta\right\}.

(7)

The condition (1) ensures that

	$\displaystyle\int_{L}da\,p_{A,\rho}(a)\leq\varepsilon,\quad\int_{R}da\,p_{A,\sigma}(a)\leq\varepsilon,$		(8)
	$\displaystyle\int_{R}da\,p_{A,\rho}(a)\geq 1-\varepsilon,\quad\int_{L}da\,p_{A,\sigma}(a)\geq 1-\varepsilon.$		(8)

This allows us to bound the trace distance between two classical probability distributions $p_{A,\rho}$ and $p_{A,\sigma}$ as

$\displaystyle D_{\rm tr}(p_{A,\rho},p_{A,\sigma})$	$\displaystyle=\frac{1}{2}\int_{L+M+R}da\,\|p_{A,\rho}(a)-p_{A,\sigma}(a)\|$	(9)
	$\displaystyle\geq\frac{1}{2}\int_{L}da\,\|p_{A,\rho}(a)-p_{A,\sigma}(a)\|+\frac{1}{2}\int_{R}da\,\|p_{A,\rho}(a)-p_{A,\sigma}(a)\|$
	$\displaystyle\geq\frac{1}{2}\int_{L}da\,\left[p_{A,\sigma}(a)-p_{A,\rho}(a)\right]+\frac{1}{2}\int_{R}da\,\left[p_{A,\rho}(a)-p_{A,\sigma}(a)\right]$
	$\displaystyle\geq 1-2\varepsilon$

where we used (8) in the last inequality. Noting that $\mathcal{P}_{A}$ in (6) is a quantum channel, we apply the data-processing inequality to the trace distance to get

\displaystyle D_{\rm tr}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))\geq 1-2\varepsilon.

(10)

To extract $N$ out, recall that $D_{\rm tr}(\rho,\sigma)\leq\sqrt{1-F(\rho,\sigma)}$ [56] and that $F(\otimes_{i}\rho_{i},\otimes_{i}\sigma_{i})=\prod_{i}F(\rho_{i},\sigma_{i})$ . Under the condition $\varepsilon\leq 1/2$ , we can further bound (10) as

\displaystyle 1-\prod_{n=1}^{N}F(\mathcal{E}_{n}(\rho),\mathcal{E}_{n}(\sigma))\geq(1-2\varepsilon)^{2},

(11)

leading to

\displaystyle\sum_{n=1}^{N}\log F(\mathcal{E}_{n}(\rho),\mathcal{E}_{n}(\sigma))^{-1}\geq\log\left[\frac{1}{4\varepsilon(1-\varepsilon)}\right].

(12)

Taking $n^{\star}\coloneqq{\rm argmin}_{n}F(\mathcal{E}_{n}(\rho),\mathcal{E}_{n}(\sigma))$ , i.e., $F(\mathcal{E}_{n^{\star}}(\rho),\mathcal{E}_{n^{\star}}(\sigma))^{-1}\geq F(\mathcal{E}_{i}(\rho),\mathcal{E}_{i}(\sigma))^{-1},\forall i$ , we get

\displaystyle N\geq\frac{\log\left[\frac{1}{4\varepsilon(1-\varepsilon)}\right]}{\log F(\mathcal{E}_{n^{\star}}(\rho),\mathcal{E}_{n^{\star}}(\sigma))^{-1}}.

(13)

Since this holds for every state $\rho$ and $\sigma$ such that $|\operatorname{Tr}(A\rho)-\operatorname{Tr}(A\sigma)|\geq 2\delta$ , we maximize the right-hand side over such states to reach the advertised statement.

The relation with respect to the relative entropy can be obtained by upper bounding (10) with the relative entropy. Using the quantum Pinsker’s inequality [57]

\displaystyle D_{\operatorname{tr}}(\rho,\sigma)\leq\sqrt{\frac{\ln 2}{2}}\sqrt{S(\rho\|\sigma)}

(14)

and the additivity $S(\otimes_{i}\rho_{i}\|\otimes_{i}\sigma_{i})=\sum_{i}S(\rho_{i}\|\sigma_{i})$ , we can bound (10) as

\displaystyle\frac{\ln 2}{2}\sum_{n=1}^{N}S\left(\mathcal{E}_{n}(\rho)\|\mathcal{E}_{n}(\sigma)\right)\geq(1-2\varepsilon)^{2}

(15)

when $\varepsilon\leq 1/2$ . Taking $n^{\star}\coloneqq{\rm argmax}_{n}S\left(\mathcal{E}_{n}(\rho)\|\mathcal{E}_{n}(\sigma)\right)$ , i.e., $S(\mathcal{E}_{n^{\star}}(\rho)\|\mathcal{E}_{n^{\star}}(\sigma))\geq S(\mathcal{E}_{i}(\rho)\|\mathcal{E}_{i}(\sigma)),\forall i$ , we get

\displaystyle N\geq\frac{2(1-2\varepsilon)^{2}}{\ln 2\cdot S(\mathcal{E}_{n^{\star}}(\rho)\|\mathcal{E}_{n^{\star}}(\sigma))}.

(16)

∎

Appendix B Alternative bound with an explicit accuracy dependence

Here, we obtain an alternative bound to the one in Theorem 1 that has an explicit dependence on the accuracy $\delta$ when $\mathbb{S}$ contains all pure states. To this end, let us define a generalized contraction coefficient with respect to $\mathbb{O}$ by

\displaystyle\eta^{\mathbb{E},\mathbb{O}}\coloneqq\max_{\mathcal{E}\in\mathbb{E}}\max_{\rho,\sigma\in\mathbb{D}}\frac{D_{\operatorname{tr}}(\mathcal{E}(\rho),\mathcal{E}(\sigma))}{D_{\mathbb{O}}(\rho,\sigma)},

(17)

where $\mathbb{D}$ is an arbitrary set that contains all pure quantum states. Note that $\eta^{\mathbb{E},\mathbb{O}}$ can take any nonnegative value in general. However, this particularly becomes the measure of contraction of trace distance when $\mathbb{O}=\{A\,|\,0\leq A\leq\mathbb{I}\}$ , in which case $0\leq\eta^{\mathbb{E},\mathbb{O}}\leq 1$ always holds [58, 59, 60]. The following result represents the lower bound in terms of accuracy, success probability, and the generalized contraction coefficient.

Proposition S.1.

Suppose $\mathbb{S}$ contains the set of all pure states. Then, the sampling cost $N$ that satisfies (1) is lower bounded as

\displaystyle N\geq\frac{\log\left[\frac{1}{4\varepsilon(1-\varepsilon)}\right]}{\log\frac{1}{\left(1-2\eta^{\mathbb{E},\mathbb{O}}\,\delta\right)^{2}}}.

(18)

In the case of $\varepsilon\ll 1$ and $\delta\ll 1$ , the lower bound approximately becomes $\log\left(\frac{1}{4\varepsilon}\right)/(4\eta^{\mathbb{E},\mathbb{O}}\,\delta)$ . This bound has the advantage of having explicit dependence on the accuracy and separating the contraction coefficient of the effective noise channel, making explicit the role of the reduction in the state distinguishability.

Proof.

Let $\eta^{\mathbb{E},\mathbb{O},\mathbb{S}}$ be the generalized contraction coefficient (c.f., Refs. [58, 59, 60]) for $\mathcal{E}$ defined by

\displaystyle\eta^{\mathbb{E},\mathbb{O},\mathbb{S}}\coloneqq\max_{\mathcal{E}\in\mathbb{E}}\max_{\rho,\sigma\in\mathbb{S}}\frac{D_{\operatorname{tr}}(\mathcal{E}(\rho),\mathcal{E}(\sigma))}{D_{\mathbb{O}}(\rho,\sigma)}.

(19)

Noting that $F(\rho,\sigma)\geq\left[1-D_{\operatorname{tr}}(\rho,\sigma)\right]^{2}$ , we have for arbitrary $\rho,\sigma\in\mathbb{S}$ that

	$\displaystyle\max_{\mathcal{E}\in\mathbb{E}}\log F(\mathcal{E}(\rho),\mathcal{E}(\sigma))^{-1}$	$\displaystyle\leq\log\frac{1}{\left[1-\max_{\mathcal{E}\in\mathbb{E}}D_{\operatorname{tr}}(\mathcal{E}(\rho),\mathcal{E}(\sigma))\right]^{2}}$		(20)
		$\displaystyle\leq\log\frac{1}{\left[1-\eta^{\mathbb{E},\mathbb{O},\mathbb{S}}\max_{A\in\mathbb{O}}\|\operatorname{Tr}[A(\rho-\sigma)]\|\right]^{2}}$		(20)

Let $\tilde{\delta}\coloneqq\min_{\rho,\sigma\in\mathbb{S}}\left\{\left.\delta^{\prime}\;\rule{0.0pt}{9.5pt}\right|\;\delta^{\prime}=\max_{A\in\mathbb{O}}|\operatorname{Tr}[A(\rho-\sigma)]|\geq 2\delta\right\}$ . Maximizing the lower bound in (3) over $\rho,\sigma\in\mathbb{S}$ such that $\max_{A\in\mathbb{O}}|\operatorname{Tr}(A\rho)-\operatorname{Tr}(A\sigma)|\geq 2\delta$ , we get

\displaystyle N\geq\frac{\log\left[\frac{1}{4\varepsilon(1-\varepsilon)}\right]}{\log\frac{1}{\left(1-\eta^{\mathbb{E},\mathbb{O},\mathbb{S}}\,\tilde{\delta}\right)^{2}}}.

(21)

Now, let $\mathbb{D}$ be a set that contains all pure states and consider arbitrary error mitigation with $\mathbb{S}=\mathbb{D}$ . In such a scenario, we have $\tilde{\delta}=2\delta$ and $\eta^{\mathbb{E},\mathbb{O},\mathbb{S}}=\eta^{\mathbb{E},\mathbb{O}}$ where

\displaystyle\eta^{\mathbb{E},\mathbb{O}}\coloneqq\max_{\mathcal{E}\in\mathbb{E}}\max_{\rho,\sigma\in\mathbb{D}}\frac{D_{\operatorname{tr}}(\mathcal{E}(\rho),\mathcal{E}(\sigma))}{D_{\mathbb{O}}(\rho,\sigma)}.

(22)

This gives

\displaystyle N\geq\frac{\log\left[\frac{1}{4\varepsilon(1-\varepsilon)}\right]}{\log\frac{1}{\left(1-2\eta^{\mathbb{E}}\,\delta\right)^{2}}},

(23)

concluding the proof.

∎

Appendix C Proof of Theorem 2

Proof.

Let $D_{F}(\rho,\sigma)\coloneqq\sqrt{1-F(\rho,\sigma)}$ be the purified distance [61]. For an observable $A$ , let $\sigma_{A}(\rho)\coloneqq\sqrt{\operatorname{Tr}[(A-\operatorname{Tr}[A\rho])^{2}\rho]}$ be the standard deviation of the probability distribution for measuring the observable $A$ for state $\rho$ . Then, as an improvement of a relation reported in Ref. [62], it was shown in Ref. [63] that arbitrary states $\eta,\tau$ and an observable $O$ satisfy

\displaystyle|\operatorname{Tr}[O(\eta-\tau)]|\leq D_{F}(\eta,\tau)\left(\sigma_{O}(\eta)+\sigma_{O}(\tau)+|\operatorname{Tr}[O(\eta-\tau)]|\right).

(24)

Using the data-processing inequality and applying (24) to the error-mitigated classical states, we get for arbitrary $\rho,\sigma\in\mathbb{S}$ that

	$\displaystyle D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))$	$\displaystyle\geq D_{F}(\mathcal{P}_{A}\circ\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\mathcal{P}_{A}\circ\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))$		(25)
		$\displaystyle\geq\frac{\Delta}{\sigma_{A}^{\operatorname{QEM}}(\rho)+\sigma_{A}^{\operatorname{QEM}}(\sigma)+\Delta}.$		(25)

Here, $\Delta\coloneqq|\operatorname{Tr}[A^{\prime}\,\mathcal{P}_{A}\circ\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho)]-\operatorname{Tr}[A^{\prime}\,\mathcal{P}_{A}\circ\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma)]|=|\left\langle\hat{E}_{A}(\rho)\right\rangle-\left\langle\hat{E}_{A}(\sigma)\right\rangle|$ where $A^{\prime}=\sum_{a}|{a}\rangle\!\langle{a}|$ is the observable corresponding to the estimate (c.f., (6)), and $\sigma_{A}^{\operatorname{QEM}}(\rho)\coloneqq\sigma_{A^{\prime}}(\mathcal{P}_{A}\circ\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho))$ is the standard deviation of $\hat{E}_{A}(\rho)$ . Reorganizing the terms gives

\displaystyle\sigma_{A}^{\operatorname{QEM}}(\rho)+\sigma_{A}^{\operatorname{QEM}}(\sigma)\geq\left(\frac{1}{D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))}-1\right)\Delta

(26)

Without loss of generality, let us take $\left\langle\hat{E}_{A}(\rho)\right\rangle-\left\langle\hat{E}_{A}(\sigma)\right\rangle\geq 0$ . Then, recalling the definition of bias $b_{A}(\rho)\coloneqq\left\langle\hat{E}_{A}(\rho)\right\rangle-\operatorname{Tr}[A\rho]$ , we have

\displaystyle\Delta=b_{A}(\rho)-b_{A}(\sigma)+\operatorname{Tr}[A(\rho-\sigma)].

(27)

Combining these, we get

\displaystyle\sigma_{A}^{\operatorname{QEM}}(\rho)+\sigma_{A}^{\operatorname{QEM}}(\sigma)\geq\left(\frac{1}{D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))}-1\right)(\operatorname{Tr}[A(\rho-\sigma)]+b_{A}(\rho)-b_{A}(\sigma)).

(28)

Defining he maximum standard deviation as $\sigma_{\max}^{\operatorname{QEM}}\coloneqq\max_{A\in\mathbb{O}}\max_{\rho\in\mathbb{S}}\sigma_{A,\max}^{\operatorname{QEM}}$ and the maximum bias $b_{\max}\coloneqq\max_{A\in\mathbb{O}}\max_{\rho\in\mathbb{S}}|b_{A}(\rho)|$ , we get

\displaystyle\sigma_{\max}^{\operatorname{QEM}}\geq\max_{\rho,\sigma\in\mathbb{S}}\frac{1}{2}\left(\frac{1}{D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))}-1\right)(D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}).

(29)

where $D_{\mathbb{O}}$ is the observable-dependent distinguishability defined in (2).

The bound (29) can be turned into a lower bound on $N$ . To see this, note that (29) ensures that for arbitrary $\rho,\sigma\in\mathbb{S}$ , we have

\displaystyle\sigma_{\max}^{\operatorname{QEM}}\geq\frac{1-D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))}{2D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))}(D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}),

(30)

which leads to

	$\displaystyle D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))$	$\displaystyle\geq\frac{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}{2\sigma_{\max}^{\operatorname{QEM}}+D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}$		(31)
		$\displaystyle=\frac{1}{1+\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}}.$		(31)

Noting that

	$\displaystyle D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))$	$\displaystyle=\sqrt{1-F(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))}$		(32)
		$\displaystyle=\sqrt{1-\prod_{n=1}^{N}F(\mathcal{E}_{n}(\rho),\mathcal{E}_{n}(\sigma))}$		(32)

where we used the multiplicativity of the fidelity for tensor-product states. This gives

\displaystyle\prod_{n=1}^{N}F(\mathcal{E}_{n}(\rho),\mathcal{E}_{n}(\sigma))\leq 1-\frac{1}{\left(1+\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}\right)^{2}}

(33)

for $\rho,\sigma\in\mathbb{S}$ such that $D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0$ . Taking the inverse and logarithm on both sides, we get

\displaystyle\sum_{n=1}^{N}\log F^{-1}(\mathcal{E}_{n}(\rho),\mathcal{E}_{n}(\sigma))\geq\log\left[1-\frac{1}{\left(1+\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}\right)^{2}}\right]^{-1}.

(34)

Noting

\displaystyle N\max_{\mathcal{E}\in\mathbb{E}}\log F^{-1}(\mathcal{E}(\rho),\mathcal{E}(\sigma))\geq\sum_{n=1}^{N}\log F^{-1}(\mathcal{E}_{n}(\rho),\mathcal{E}_{n}(\sigma)),

(35)

we reach

\displaystyle N\geq\min_{\mathcal{E}\in\mathbb{E}}\frac{\log\left[1-\frac{1}{\left(1+\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}\right)^{2}}\right]^{-1}}{\log F^{-1}(\mathcal{E}(\rho),\mathcal{E}(\sigma))}

(36)

for arbitrary states $\rho,\sigma\in\mathbb{S}$ such that $D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0$ .

∎

Appendix D Comparison of the bound in Theorem 1 with existing protocols

To make an explicit comparison between the bound in Theorem 1 and the sample cost for specific protocols, we consider mitigating the local depolarizing noise $\mathcal{D}_{p}^{\otimes M}$ , where $\mathcal{D}_{p}(\rho)\coloneqq(1-p)\rho+p\mathbb{I}/2$ , with accuracy $\delta$ and probability $1-\varepsilon$ . For the sake of the analysis, we consider conservative strategies with the target observable $\mathbb{O}=\{\prod_{i=1}^{M}X_{i}\}$ and target ideal states $\mathbb{S}=\{\operatorname{GHZ}(\pm\theta_{1}),\operatorname{GHZ}(\pm\theta_{2})\}$ where

\displaystyle\ket{\operatorname{GHZ}(\theta)}\coloneqq\frac{1}{\sqrt{2}}\left(\ket{0}^{\otimes M}+e^{i(\theta+\pi/2)}\ket{1}^{\otimes M}\right)

(37)

and we set $\theta_{1}\coloneqq\arcsin(\delta)$ , $\theta_{2}\coloneqq\arcsin(2\delta)$ . This refers to the setting in which one would like to obtain good estimates at least for these states and observable. We also set the effective noise channels being the local depolarizing noise, i.e., $\mathcal{E}_{i}=\mathcal{D}_{p}^{\otimes M},\ \forall i$ , which is satisfied for all the specific protocols we study below.

Since $\operatorname{Tr}[\prod_{i=1}^{M}X_{i}\operatorname{GHZ}(\theta)]=-\sin\theta$ , we particularly have $D_{\mathbb{O}}(\operatorname{GHZ}(\theta_{1}),\operatorname{GHZ}(-\theta_{1}))=2\delta$ . This allows us to bound the lower bound in Theorem 1 to get

\displaystyle N\geq\frac{\log\frac{1}{4\varepsilon(1-\varepsilon)}}{\log[1/F(\mathcal{D}_{p}^{\otimes M}(\operatorname{GHZ}(\theta_{1})),D_{p}^{\otimes M}(\operatorname{GHZ}(-\theta_{1})))]}.

(38)

To study the sample numbers for specific error-mitigation protocols, we simulate each mitigation strategy and evaluate the actual sample numbers used to achieve the desired performance. Namely, we evaluate the probability of getting estimates within the accuracy $\delta$ for all ideal states in $\mathbb{S}$ using a certain sample budget. If this success probability is above the target probability $1-\varepsilon$ , we need less number of samples. On the other hand, if the success probability is below $1-\varepsilon$ , we need more samples. By binary search, we evaluate the sample number that realizes the success probability $1-\varepsilon$ with a small error tolerance, which we set 0.01. We also note that since the shift of the expectation value due to depolarizing noise is larger for $\operatorname{GHZ}(\pm\theta_{2})$ than $\operatorname{GHZ}(\pm\theta_{1})$ and the noise and error mitigation apply to $\operatorname{GHZ}(\theta_{2})$ and $\operatorname{GHZ}(-\theta_{2})$ in a symmetric manner, it suffices to ensure that the expectation value for $\operatorname{GHZ}(\theta_{2})$ can be estimated with the accuracy $\delta$ with probability $1-\varepsilon$ .

The error-mitigation protocols we study are virtual distillation, symmetry verification, and probabilistic error cancellation. For virtual distillation, we consider 2-copy virtual distillation and estimate $\operatorname{Tr}(\prod_{i=1}^{M}X_{i}[\mathcal{D}_{p}^{\otimes M}(\operatorname{GHZ}(\theta_{2}))]^{2})/\operatorname{Tr}([\mathcal{D}_{p}^{\otimes M}(\operatorname{GHZ}(\theta_{2}))]^{2})$ as in Refs. [15, 16]. For symmetry verification [13], we observe that all the ideal states in $\mathbb{S}$ are +1 eigenstates of the stabilizers generated by $\{Z_{i}Z_{i+1}\}_{i=1}^{M-1}$ . We thus estimate $\operatorname{Tr}\left(\prod_{i=1}^{M}X_{i}\tilde{\rho}\right)/\operatorname{Tr}(\tilde{\rho})$ where $\tilde{\rho}\coloneqq\left(\prod_{i=1}^{M-1}\frac{\mathbb{I}+Z_{i}Z_{i+1}}{2}\right)\mathcal{D}_{p}^{\otimes M}(\operatorname{GHZ}(\theta_{2}))\left(\prod_{i=1}^{M-1}\frac{\mathbb{I}+Z_{i}Z_{i+1}}{2}\right)$ is the noisy state projected onto the stabilizer subspace. To run probabilistic cancellation [5], we utilize the decomposition of the inverse map for depolarizing noise $\mathcal{D}_{p}^{-1}=\left[1+\frac{3p}{4(1-p)}\right]\operatorname{id}-\frac{p}{4(1-p)}(\mathcal{X}+\mathcal{Y}+\mathcal{Z})$ where $\mathcal{X},\mathcal{Y},\mathcal{Z}$ are unitary Pauli channels. After every depolarizing channel, we simulate the action of this inverse channel by following the procedure described in Appendix E.

Following the analysis described above, we plot in Fig. S.2 our lower bound and the actual sample costs to achieve accuracy $\delta=0.1$ and failure probability $\varepsilon=0.2$ for 7-qubit local depolarizing noise with respect to the noise strength $p$ . We observe that Theorem 1 provides nontrivial lower bounds that are comparable to the actual sample cost, the difference being the factor of 3 to 6 in the studied range.

Note that Fig. S.2 also includes the sample number required for the case when we do not apply any error mitigation but just measure $\prod_{i=1}^{M}X_{i}$ on $\mathcal{D}_{p}^{\otimes M}(\operatorname{GHZ}(\theta_{2}))$ . When the noise strength is small, the target accuracy and success probability can be achieved without error mitigation, making the no-mitigation strategy sample-efficient. However, as the noise gets stronger, it gets harder to achieve the target performance and eventually becomes impossible to realize the target accuracy no matter what sample number is allowed, as can be seen in the divergent behavior in Fig. S.2.

Appendix E Scaling of the bound in Theorem 2 with noise strength

Let us consider $M$ -qubit state suffering from the $M$ -qubit local dephasing noise $\mathcal{Z}_{p}^{\otimes M}$ where $\mathcal{Z}(\rho)\coloneqq(1-p)\rho+pZ\rho Z$ . Consider the class of error mitigation with the same effective noise channels $\mathcal{E}_{i}=\mathcal{Z}_{p}^{\otimes M},\ \forall i$ . We also have a mild condition that the set $\mathbb{O}$ of target observables consists of normalized observables, i.e., those with the unit operator norm, and contains the Pauli operator $\prod_{i=1}^{M}X_{i}$ , and the set $\mathbb{S}$ of target states contains $M$ -qubit GHZ states $\operatorname{GHZ}_{\pm}\coloneqq|{\operatorname{GHZ}_{\pm}}\rangle\!\langle{\operatorname{GHZ}_{\pm}}|$ with $\ket{\operatorname{GHZ}_{\pm}}\coloneqq\frac{1}{\sqrt{2}}(\ket{0}^{\otimes M}\pm\ket{1}^{\otimes M})$ . We also assume that $b_{\max}\leq 1$ , which excludes the trivial strategy that always outputs a fixed estimate.

Noting that $D_{\mathbb{O}}(\operatorname{GHZ}_{+},\operatorname{GHZ}_{-})=\operatorname{Tr}[\prod_{i=1}X_{i}(\operatorname{GHZ}_{+}-\operatorname{GHZ}_{-})]=2$ , the choice of $\rho=\operatorname{GHZ}_{+}$ and $\sigma=\operatorname{GHZ}_{-}$ satisfies $D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0$ . This puts a further bound for the lower bound in (4) and results in

\displaystyle N\geq\frac{\log\left[1-\frac{1}{1+\frac{2\sigma_{\max}^{\rm QEM}}{D_{\mathbb{O}}(\operatorname{GHZ}_{+},\operatorname{GHZ}_{-})-2b_{\max}}}\right]}{\log[1/F(\mathcal{Z}_{p}^{\otimes M}(\operatorname{GHZ}_{+}),\mathcal{Z}_{p}^{\otimes M}(\operatorname{GHZ}_{-}))]}.

(39)

Note that

$\displaystyle\mathcal{Z}_{p}^{\otimes M}(\operatorname{GHZ}_{\pm})$	$\displaystyle=\sum_{k=0}^{\lfloor M/2\rfloor}\binom{M}{2k}(1-p)^{M-2k}p^{2k}\operatorname{GHZ}_{\pm}$	(40)
	$\displaystyle\quad+\sum_{k=1}^{\lfloor(M+1)/2\rfloor}\binom{M}{2k-1}(1-p)^{M-2k+1}p^{2k-1}\operatorname{GHZ}_{\mp}$
	$\displaystyle=\frac{1+(1-2p)^{M}}{2}\operatorname{GHZ}_{\pm}+\frac{1-(1-2p)^{M}}{2}\operatorname{GHZ}_{\mp},$

where we used

	$\displaystyle\sum_{k=0}^{\lfloor M/2\rfloor}\binom{M}{2k}(1-p)^{M-2k}p^{2k}$	$\displaystyle=\sum_{k=0}^{M}\binom{M}{k}(1-p)^{M-k}\frac{p^{k}+(-p)^{k}}{2}$		(41)
		$\displaystyle=\frac{1+(1-2p)^{M}}{2}.$		(41)

	$\displaystyle\sum_{k=1}^{\lfloor(M+1)/2\rfloor}\binom{M}{2k-1}(1-p)^{M-2k+1}p^{2k-1}$	$\displaystyle=\sum_{k=0}^{M}\binom{M}{k}(1-p)^{M-k}\frac{p^{k}-(-p)^{k}}{2}$		(42)
		$\displaystyle=\frac{1-(1-2p)^{M}}{2}.$		(42)

This gives

\displaystyle F(\mathcal{Z}_{p}^{\otimes M}(\operatorname{GHZ}_{+}),\mathcal{Z}_{p}^{\otimes M}(\operatorname{GHZ}_{-}))=1-(1-2p)^{2M}.

(43)

Focusing on the case when $p=\frac{1}{2}-\delta$ with $\delta\ll 1$ , the scaling of the lower bound in (39) with respect to the noise strength is given by

\displaystyle\left[\log\frac{1}{1-(1-2p)^{2M}}\right]^{-1}\sim 1/(2\delta)^{2M}.

(44)

We now show that this scaling can be achieved by probabilistic error cancellation. The probabilistic error cancellation method is the strategy to mitigate the effect of noise channel $\mathcal{E}$ by simulating the action of the inverse noise channel $\mathcal{E}^{-1}$ . The simulation can be accomplished by considering a linear decomposition of the inverse channel $\mathcal{E}^{-1}=\sum_{i}c_{i}\mathcal{B}_{i}$ where $c_{i}$ is a (possibly negative) real number and $\mathcal{B}_{i}$ is a physical operation that can be implemented on the given device. One then applies the operation $\mathcal{B}_{i}$ with probability $|c_{i}|/\gamma$ with $\gamma\coloneqq\sum_{i}|c_{i}|$ , makes a measurement, and multiplies $\gamma{\rm sgn}(c_{i})$ to the outcome, where ${\rm sgn}(c_{i})$ is the sign of $c_{i}$ that takes +1 if $c_{i}\geq 0$ and $-1$ if $c_{i}<0$ . This constructs a distribution whose expected value coincides with the ideal expectation value with the standard deviation scaled by the factor of $\gamma$ . Therefore, by taking the average over $N$ samples from this distribution, one can construct an estimate of the ideal expectation value, which obeys the distribution with the standard deviation $\sim\gamma/\sqrt{N}$ .

In general, one can also consider optimizing $\gamma$ over the choices of implementable operations $\{\mathcal{B}_{i}\}_{i}$ . Such optimal cost for mitigating the local dephasing noise is characterized as [39]

\displaystyle\gamma_{\rm opt}=\frac{1}{(1-2p)^{M}}.

(45)

This implies that the sample number $N$ that achieves some fixed standard deviation becomes $N\propto\frac{1}{(1-2p)^{2M}}$ . When $p=\frac{1}{2}-\delta$ with $\delta\ll 1$ , the sample scales as $N\sim 1/(2\delta)^{2M}$ , realizing the same scaling in (44).

Appendix F Proof of Theorem 3

Proof.

Let $\mathcal{D}_{p}(\rho)\coloneqq(1-p)\rho+p\mathbb{I}/2$ be the single-qubit depolarizing noise. Let also $\mathcal{U}_{1},\dots,\mathcal{U}_{L}$ be the unitary channels followed by a local depolarizing channel with the noise strength at least $\gamma$ and suppose the $l^{\mathrm{th}}$ layer $\mathcal{U}_{l}$ in the $n^{\mathrm{th}}$ noisy circuit is followed by the local depolarizing channel $\otimes_{m=1}^{M}\mathcal{D}_{\gamma_{n,l,m}}$ . By assumption, we have $\gamma_{n,l,m}\geq\gamma\ \forall n,l,m$ . Also, let $\Lambda_{n,l}$ and $\Xi_{n,l}$ be the unital channels applied before and after the $l^{\rm th}$ layer $U_{l}$ in the $n^{\rm th}$ noisy layered circuit.

We first note that every $\mathcal{D}_{\gamma_{n,l,m}}$ can be written as $\mathcal{D}_{\gamma_{n,l,m}-\gamma}\circ\mathcal{D}_{\gamma}$ . Since the second local depolarizing channel $\otimes_{m=1}^{M}\mathcal{D}_{\gamma_{n,l,m}-\gamma}$ is unital, it can be absorbed in the following unital channel $\Xi_{n,l}$ , allowing us to focus on $\mathcal{D}_{\gamma}^{\otimes M}$ as the noise after each layer.

Then, the effective noise channel for the $n^{\rm th}$ layered circuit can be written as

\displaystyle\mathcal{E}_{n}

\displaystyle=\mathcal{D}_{\gamma}^{\otimes M}\circ\Xi_{n,L}\circ\mathcal{U}_{L}\circ\Lambda_{n,L}\circ\cdots\circ\mathcal{D}_{\gamma}^{\otimes M}\circ\Xi_{n,1}\circ\mathcal{U}_{1}\circ\Lambda_{n,1}\circ\mathcal{U}_{1}^{\dagger}\circ\dots\circ\mathcal{U}_{L}^{\dagger}

(46)

With a little abuse of notation, we write this as $\mathcal{E}_{n}=\prod_{l=1}^{L}\left[\mathcal{D}_{\gamma}^{\otimes M}\circ\Xi_{n,l}\circ\mathcal{U}_{l}\circ\Lambda_{n,l}\right]\circ\prod_{l=1}^{L}\mathcal{U}_{L-l+1}^{\dagger}$ where we let the product sign refer to the concatenation of quantum channels.

Let $\rho_{\rm in}$ and $\sigma_{\rm in}$ be some input states and $\rho=\mathcal{U}_{L}\circ\dots\circ\mathcal{U}_{1}(\rho_{\rm in})$ and $\sigma=\mathcal{U}_{L}\circ\dots\circ\mathcal{U}_{1}(\sigma_{\rm in})$ . Then, the unitary invariance and the triangle inequality of the trace distance imply that

	$\displaystyle D_{\operatorname{tr}}\left(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma)\right)$	$\displaystyle=D_{\operatorname{tr}}\left(\otimes_{n=1}^{N}\mathcal{F}_{n}(\rho_{\rm in}),\otimes_{n=1}^{N}\mathcal{F}_{n}(\sigma_{\rm in})\right)$		(47)
		$\displaystyle\leq D_{\rm tr}\left(\otimes_{n=1}^{N}\mathcal{F}_{n}(\rho_{\rm in}),\frac{\mathbb{I}}{2^{NM}}\right)+D_{\rm tr}\left(\otimes_{n=1}^{N}\mathcal{F}_{n}(\sigma_{\rm in}),\frac{\mathbb{I}}{2^{NM}}\right)$		(47)

where we defined $\mathcal{F}_{n}\coloneqq\prod_{l=1}^{L}\left[\mathcal{D}_{\gamma}^{\otimes M}\circ\Xi_{n,l}\circ\mathcal{U}_{l}\circ\Lambda_{n,l}\right]$ . Using the quantum Pinsker’s inequality

\displaystyle D_{\operatorname{tr}}(\rho,\sigma)\leq\sqrt{\frac{\ln 2}{2}}\,\sqrt{S(\rho\|\sigma)}

(48)

that holds all states $\rho$ , $\sigma$ , where $S(\rho\|\sigma)\coloneqq\operatorname{Tr}(\rho\log\rho)-\operatorname{Tr}(\rho\log\sigma)$ is the relative entropy, we get

	$\displaystyle D_{\rm tr}\left(\otimes_{n=1}^{N}\mathcal{F}_{n}(\rho_{\rm in}),\frac{\mathbb{I}}{2^{NM}}\right)$	$\displaystyle\leq\sqrt{\frac{\ln 2}{2}}\sqrt{S\left(\otimes_{n=1}^{N}\mathcal{F}_{n}(\rho_{\rm in})\,\Big{\\|}\,\frac{\mathbb{I}}{2^{NM}}\right)}$		(49)
		$\displaystyle=\sqrt{\frac{\ln 2}{2}}\sqrt{\sum_{n=1}^{N}S\left(\mathcal{F}_{n}(\rho_{\rm in})\,\Big{\\|}\,\frac{\mathbb{I}}{2^{M}}\right)}$		(49)

where in the second line we used the additivity of the relative entropy $S(\rho_{1}\otimes\rho_{2}\|\sigma_{1}\otimes\sigma_{2})=S(\rho_{1}\|\sigma_{1})+S(\rho_{2}\|\sigma_{2})$ .

We now recall the result in Ref. [59], showing that

\displaystyle S\left(\mathcal{D}_{\gamma}^{\otimes M}(\tau)\,\Big{\|}\,\mathbb{I}/2^{M}\right)\leq(1-\gamma)^{2}S(\tau\,\|\,\mathbb{I}/2^{M})

(50)

for arbitrary $M$ -qubit state $\tau$ . This implies that for arbitrary state $M$ -qubit state $\tau$ , noise strength $\gamma$ , unitary $\mathcal{U}$ , and unital channels $\Xi$ , $\Lambda$ ,

$\displaystyle S(\mathcal{D}_{\gamma}^{\otimes M}\circ\Xi\circ\mathcal{U}\circ\Lambda(\tau)\,\\|\,\mathbb{I}/2^{M})$	$\displaystyle\leq(1-\gamma)^{2}S(\Xi\circ\mathcal{U}\circ\Lambda(\tau)\,\\|\,\mathbb{I}/2^{M})$	(51)
	$\displaystyle=(1-\gamma)^{2}S(\Xi\circ\mathcal{U}\circ\Lambda(\tau)\,\\|\,\Xi\circ\mathcal{U}\circ\Lambda(\mathbb{I}/2^{M}))$
	$\displaystyle\leq(1-\gamma)^{2}S(\tau\,\\|\,\mathbb{I}/2^{M})$

where we used (50) in the first line, the fact that $\Xi\circ\mathcal{U}\circ\Lambda(\mathbb{I}/2^{M})=\mathbb{I}/2^{M}$ because $\Xi$ and $\Lambda$ are unital in the second line, and the data-processing inequality for the relative entropy in the third line. The sequential application of (51) results in

	$\displaystyle S\left(\mathcal{F}_{n}(\rho_{\rm in})\,\Big{\\|}\,\frac{\mathbb{I}}{2^{M}}\right)$	$\displaystyle\leq(1-\gamma)^{2L}\,S\left(\rho_{\rm in}\,\Big{\\|}\,\frac{\mathbb{I}}{2^{M}}\right)$		(52)
		$\displaystyle\leq(1-\gamma)^{2L}M,$		(52)

where the second line follows from the upper bound of the relative entropy. Combining (47), (52) and (49), we get

\displaystyle D_{\rm tr}\left(\otimes_{n=1}^{N}\mathcal{F}_{n}(\rho_{\rm in}),\otimes_{n=1}^{N}\mathcal{F}_{n}(\sigma_{\rm in})\right)\leq 2\sqrt{\frac{\ln 2}{2}}\sqrt{(1-\gamma)^{2L}MN}.

(53)

Combining this with (10), we get

\displaystyle N\geq\frac{(1-2\varepsilon)^{2}}{2\ln(2)M(1-\gamma)^{2L}}

(54)

for any $\varepsilon\leq 1/2$ . ∎

Appendix G Bound for layered circuits with a fixed target bias and standard deviation

Theorem 3 in the main text shows the exponential sampling overhead for noisy layered circuits with a fixed target accuracy and success probability. Here, we present a similar exponential growth of the required sample overhead for a fixed target bias and standard deviation.

Theorem S.2.

Suppose that an error-mitigation strategy described above is applied to an $M$ -qubit circuit to mitigate local depolarizing channels with strength at least $\gamma$ that follow $L$ layers of unitaries. Then, if the estimator of the error mitigation has the maximum standard deviation $\sigma_{\max}^{\operatorname{QEM}}$ and maximum bias $b_{\max}$ , the required number $N$ of samples is lower bounded as

\displaystyle N\geq\frac{1}{4M(1-\gamma)^{2L}}\left(\frac{1}{\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}+1}\right)^{2}

(55)

for arbitrary states $\rho,\sigma\in\mathbb{S}$ such that $D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0$ .

Proof.

We first note that $D_{F}(\rho,\sigma)\leq\sqrt{S(\rho\|\sigma)}$ for arbitrary states $\rho,\sigma$ [64]. Also, the purified distance $D_{F}$ satisfies the triangle inequality [61]. Therefore, we can repeat the argument in (47) and (49) with the purified distance to get

\displaystyle D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))\leq\left(\sqrt{\sum_{n=1}^{N}S\left(\mathcal{F}_{n}(\rho_{\rm in})\,\Big{\|}\,\frac{\mathbb{I}}{2^{M}}\right)}+\sqrt{\sum_{n=1}^{N}S\left(\mathcal{F}_{n}(\sigma_{\rm in})\,\Big{\|}\,\frac{\mathbb{I}}{2^{M}}\right)}\right).

(56)

Using (52), we further have

\displaystyle D_{F}(\otimes_{n=1}^{N}\mathcal{E}_{n}(\rho),\otimes_{n=1}^{N}\mathcal{E}_{n}(\sigma))\leq 2\sqrt{MN(1-\gamma)^{2L}}

(57)

This allows us to put another bound to (29) to get

\displaystyle\sigma_{\max}^{\operatorname{QEM}}\geq\max_{\begin{subarray}{c}\rho,\sigma\in\mathbb{S}\\ D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0\end{subarray}}\frac{1}{2}\left(\frac{1}{2\sqrt{MN(1-\gamma)^{2L}}}-1\right)(D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}).

(58)

Reorganizing this, we obtain

\displaystyle N\geq\frac{1}{4M(1-\gamma)^{2L}}\left(\frac{1}{\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}+1}\right)^{2}.

(59)

for arbitrary state $\rho,\sigma$ such that $D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0$ . ∎

Appendix H Alternative bounds for noisy layered circuits

Let us investigate alternative bounds that also diverge with the vanishing failure probability $\varepsilon\to 0$ . We first note the inequality [64]

\displaystyle S(\tau\|\eta)\geq\log F(\tau,\eta)^{-1}

(60)

that hold for arbitrary states $\tau$ and $\eta$ . Together with Theorem 1, our goal is to upper bound $S(\mathcal{E}_{n}(\rho)\|\mathcal{E}_{n}(\sigma))$ for the effective noise channel defined in (46).

To do this, we employ the continuity bounds of the relative entropy [65]. Namely, for arbitrary states $\tau$ and a full-rank state $\eta$ ,

\displaystyle S(\tau\|\eta)

\displaystyle\leq\frac{4D_{\operatorname{tr}}(\tau,\eta)^{2}}{\lambda_{\min}(\eta)}

(61)

where $\lambda_{\min}(\eta)$ is the minimum eigenvalue of a state $\eta$ , and for $D_{\operatorname{tr}}(\tau,\eta)\leq 1/2$ ,

\displaystyle S(\tau\|\eta)

\displaystyle\leq\frac{2}{\ln 2}D_{\operatorname{tr}}(\tau,\eta)\left[\log d+\log\left(\frac{2}{D_{\rm tr}(\tau,\eta)}\right)+\frac{1}{2}\log(1/\lambda_{\min}(\eta))\right].

(62)

where $d$ is the dimension of the Hilbert space that $\tau$ and $\eta$ act on. Note that (61) has the square dependence on the trace distance, while (62) has the log dependence on the minimum eigenvalue.

Note that $\mathcal{D}_{\gamma}^{\otimes M}=(1-(4\gamma)^{M})\Psi+(4\gamma)^{M}\mathcal{D}^{\otimes M}$ where $\Psi$ is some quantum channel and $\mathcal{D}$ is the completely depolarizing channel. This ensures that

\displaystyle\lambda_{\min}(\mathcal{D}_{\gamma}^{\otimes M}(\sigma))\geq\left(\frac{\gamma}{2}\right)^{M}

(63)

for arbitrary state $\sigma$ . Noting the structure in (46), where all components preserve the maximally mixed state, this results in

\displaystyle\lambda_{\min}(\mathcal{E}_{n}(\sigma))\geq\left(\frac{\gamma}{2}\right)^{M}.

(64)

We combine (61) with (53) and (64) to get

\displaystyle S(\mathcal{E}_{n}(\rho)\|\mathcal{E}_{n}(\sigma)))\leq 8\ln(2)\,M(1-\gamma)^{2L}\left(\frac{2}{\gamma}\right)^{M}.

(65)

Combining this with (60) and Theorem 1, we obtain

\displaystyle N

\displaystyle\geq\log\left[\frac{1}{4\varepsilon(1-\varepsilon)}\right]\frac{\left(\frac{\gamma}{2}\right)^{M}}{8\ln(2)\,M(1-\gamma)^{2L}}.

(66)

This bound diverges with $\varepsilon\to 0$ and grows exponentially with the circuit depth with exponent $2L$ . However, this is also exponentially loose with the qubit number $M$ .

The second bound (62) can remedy this drawback at the cost of a looser circuit depth dependence. Let us first note that the function $x\log(1/x)$ is monotonically increasing for $0<x\leq 1/2$ . Therefore, for $M$ , $\gamma$ , and $L$ such that $D_{\operatorname{tr}}(\mathcal{E}_{n}(\rho),\mathcal{E}_{n}(\sigma))\leq\sqrt{2\ln 2}\sqrt{M}(1-\gamma)^{L}\leq 1/2$ , we can combine (62), (53), and (64) to get

\displaystyle S(\mathcal{E}_{n}(\rho)\|\mathcal{E}_{n}(\sigma))\leq 2\sqrt{\frac{2}{\ln 2}}\sqrt{M}(1-\gamma)^{L}\left[M+\log\left(\frac{2}{\sqrt{2\ln 2}\sqrt{M}(1-\gamma)^{L}}\right)+\frac{M}{2}\log(2/\gamma)\right].

(67)

This, together with Theorem 1, we get

\displaystyle N\geq\log\left[\frac{1}{4\varepsilon(1-\varepsilon)}\right]\frac{\sqrt{\ln 2}}{\sqrt{8M}(1-\gamma)^{L}}\left[L\log\left(\frac{2}{1-\gamma}\right)+M-\frac{1}{2}\log\left(\frac{M\ln 2}{2}\right)+\frac{M}{2}\log(2/\gamma)\right]^{-1}.

(68)

This has a polynomial dependence with $M$ and still grows exponentially with $L$ (up to a polynomial correction), while the $L$ dependence is not as good as (66).

We can also apply a similar argument to show bounds for fixed standard deviation and bias, which diverge with $\sigma_{\max}^{\operatorname{QEM}}\to 0$ . Namely, we combine (65) with (60) and Theorem 2 to obtain

\displaystyle N\geq\log\left[1-\frac{1}{\left(1+\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}\right)^{2}}\right]^{-1}\frac{\left(\frac{\gamma}{2}\right)^{M}}{8\ln(2)\,M(1-\gamma)^{2L}}.

(69)

This bound diverges with $\sigma_{\max}^{\operatorname{QEM}}\to 0$ and grows exponentially with the circuit depth with exponent $2L$ . Similarly, using (67) and Theorem 2, we get

\displaystyle N\geq\log\left[1-\frac{1}{\left(1+\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}\right)^{2}}\right]^{-1}\frac{\sqrt{\ln 2}}{\sqrt{8M}(1-\gamma)^{L}}\left[L\log\left(\frac{2}{1-\gamma}\right)+M-\frac{1}{2}\log\left(\frac{M\ln 2}{2}\right)+\frac{M}{2}\log\left(\frac{2}{\gamma}\right)\right]^{-1}.

(70)

Appendix I Noisy layered circuits under other noise models

Let us consider a layered circuit affected by a noise channel $\mathcal{N}_{n,l}$ after the $l^{\mathrm{th}}$ layer $U_{l}$ in the $n^{\mathrm{th}}$ copy of the noisy circuits. Suppose that for every $n$ , all $\{\mathcal{N}_{n,l}\}_{l=1}^{L}$ and $\{U_{l}\}_{l=1}^{L}$ share the same fixed point $\sigma_{n}$ . We aim to mitigate errors by applying additional channels $\Lambda_{n,l}$ and $\Xi_{n,l}$ that also preserve $\sigma_{n}$ before and after $U_{l}$ in the $n^{\mathrm{th}}$ noisy circuit and applying a global trailing quantum process over $n$ copies of distorted state.

For an arbitrary quantum channel $\mathcal{N}$ with a fixed point $\sigma$ , let $\xi_{\mathcal{N}}$ be a contraction constant for $\mathcal{N}$ that satisfies

\displaystyle S(\mathcal{N}(\rho)\|\sigma)\leq\xi_{\mathcal{N}}S(\rho\|\sigma),\ \forall\rho.

(71)

The contraction constant $\xi_{\mathcal{N}}$ was studied for various situations, particularly in relation to logarithmic Sobolev inequalities for dynamical semigroups that represent continuous Markovian evolution [59, 66, 67, 68, 69, 70, 71, 72, 73].

Using the contraction constant, we can state the sample lower bound in the general form, which reproduces Theorems 3 and S.2 as special cases.

Theorem S.3.

Suppose that an error-mitigation strategy achieves (1) with some $\delta\geq 0$ and $0\leq\varepsilon\leq 1/2$ to mitigate noise $\{\mathcal{N}_{n,l}\}_{n,l}$ that occurs after each $U_{l}$ in the $n^{\mathrm{th}}$ layered circuit in an error mitigation strategy described above. Suppose also that every noise channel has the contraction constant smaller than $\xi$ , i.e., $\xi_{\mathcal{N}_{n,l}}\leq\xi,\forall n,l$ .

Then, if there exist at least two states $\rho,\sigma\in\mathbb{S}$ such that $D_{\mathbb{O}}(\rho,\sigma)\geq 2\delta$ , the required sample number $N$ is lower bounded as

\displaystyle N\geq\frac{(1-2\varepsilon)^{2}}{2\ln(2)\,M\xi^{L}}.

(72)

Similarly, to achieve the maximum standard deviation $\sigma_{\max}^{\operatorname{QEM}}$ and maximum bias $b_{\max}$ , the required number $N$ of samples is lower bounded as

\displaystyle N\geq\frac{1}{4M\xi^{L}}\left(\frac{1}{\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}+1}\right)^{2}

(73)

for arbitrary states $\rho,\sigma\in\mathbb{S}$ such that $D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0$ .

Proof.

Both statements can be obtained essentially by following the same arguments in the proofs of Theorems 3 and S.2. Namely, defining $\mathcal{F}_{n}\coloneqq\prod_{l=1}^{L}\left[\mathcal{N}_{n,l}\circ\Xi_{n,l}\circ\mathcal{U}_{l}\circ\Lambda_{n,l}\right]$ , the argument leading to (53) gives us

\displaystyle D_{\rm tr}\left(\otimes_{n=1}^{N}\mathcal{F}_{n}(\rho_{\rm in}),\otimes_{n=1}^{N}\mathcal{F}_{n}(\sigma_{\rm in})\right)\leq\sqrt{2\ln 2}\sqrt{\xi^{L}MN}.

(74)

This, together with (10), gives

\displaystyle N\geq\frac{(1-2\varepsilon)^{2}}{2\ln(2)M\xi^{L}}

(75)

whenever there exist two states $\rho,\sigma\in\mathbb{S}$ such that $D_{\mathbb{O}}(\rho,\sigma)\geq 2\delta$ .

Similarly, the argument to get (57) results in

\displaystyle D_{F}\left(\otimes_{n=1}^{N}\mathcal{F}_{n}(\rho_{\rm in})\|\otimes_{n=1}^{N}\mathcal{F}_{n}(\sigma_{\rm in})\right)\leq 2\sqrt{MN\xi^{L}}.

(76)

Combining this to (29) leads to

\displaystyle N\geq\frac{1}{4M\xi^{L}}\left(\frac{1}{\frac{2\sigma_{\max}^{\operatorname{QEM}}}{D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}}+1}\right)^{2}.

(77)

for arbitrary state $\rho,\sigma$ such that $D_{\mathbb{O}}(\rho,\sigma)-2b_{\max}\geq 0$ .

∎

Theorems 3 and S.2 are recovered by noting that the contraction constant for the local depolarizing noise $\mathcal{D}_{\gamma}^{\otimes M}$ is known as $\xi_{\mathcal{D}_{\gamma}^{\otimes M}}=(1-\gamma)^{2}$ [59].

We remark that analogous results to Theorem S.3 can also be obtained in terms of the contractivity constant for other distance measures, such as trace distance, purified distance, and Rényi- $\alpha$ divergences with $\alpha>1$ . Indeed, we can get corresponding bounds for stochastic Pauli noise by employing the contractivity for Rényi- $2$ sandwiched relative entropy[73]

\displaystyle S_{2}(\Lambda^{\otimes n}(\rho)\|\mathbb{I}/d^{n})\leq\exp\left[2(1-1/n)\frac{\ln r}{\ln n}\right]S_{2}(\rho\|\mathbb{I}/d^{n})

(78)

for every $n$ and $d$ -dimensional unital channel $\Lambda$ that satisfies $r^{-1}\|\Lambda-(1-r)\mathcal{D}\|_{2\to 2}\leq 1$ . Here, $S_{2}(\tau\|\eta)\coloneqq\log\operatorname{Tr}\left(\eta^{-1/2}\,\tau\,\eta^{-1/2}\,\tau\right)$ is the Rényi-2 sandwiched relative entropy [74], $\mathcal{D}$ is the completely depolarizing channel, and $\|\Phi\|_{2\to 2}\coloneqq\sup_{X\neq 0}\frac{\|\Phi(X)\|_{2}}{\|X\|_{2}}$ is the $2\to 2$ norm for an arbitrary linear map $\Phi$ . For an error probability ${\bf q}=(q_{x},q_{y},q_{z})$ for each Pauli error, define a qubit stochastic Pauli noise as $\mathcal{T}_{\bf q}(\tau)\coloneqq(1-q_{x}-q_{y}-q_{z})\tau+q_{x}X\tau X+q_{y}Y\tau Y+q_{z}Z\tau Z$ . Then, by taking $r=q\coloneqq|1-2\min\{q_{x}+q_{y},q_{y}+q_{z},q_{x}+q_{z}\}|$ in Eq. (78), we obtain

\displaystyle S_{2}(\mathcal{T}_{\bf q}^{\otimes n}(\rho)\|\mathbb{I}/2^{n})\leq q^{1/\ln 2}S_{2}(\rho\|\mathbb{I}/2^{n}).

(79)

We remark that this type of bound was also employed in Ref. [41]. Since $S(\tau\|\eta)\leq S_{2}(\tau\|\eta),\ \forall\tau,\eta$ [74], we can upper bound the form in (49) by (78). From there, we can follow the same argument as the ones in the proofs of Theorems 3 and S.2, noting the data-processing inequality and the additivity for tensor-product states known for Rény-2 sandwiched relative entropy [74]. This results in the sampling lower bounds $\Omega\left(q^{L/\ln 2}\right)$ for stochastic Pauli noise.

More generally, the exponential sample growth can be established for a large class of tensor-product channels. Let $\{\Phi_{t}\}_{t\geq 0}$ be a set of unital qubit channels constructing a dynamical semigroup, i.e., $\Phi_{0}=\operatorname{id}$ and $\Phi_{s+r}=\Phi_{s}\circ\Phi_{r},\forall s,t\geq 0$ , generated by a Liouvillian $\mathcal{L}$ , i.e., $\Phi_{t}=e^{t\mathcal{L}}$ [75]. Suppose further that $\mathbb{I}/2$ is the unique solution of $\mathcal{L}(\tau)=0$ , i.e., the unique fixed point for $\{\Phi_{t}\}_{t}$ . Then, for a noisy channel $\mathcal{N}=\Phi_{t}^{\otimes M}$ defined for some $t>0$ comes with $\xi_{\mathcal{N}}<1$ [69], leading to the exponential sample cost together with Theorem S.3. An example includes a channel generated by the Liouvillian constructed by a stochastic Pauli channel $\mathcal{L}(\tau)=\sum_{i=0}^{3}q_{i}P_{i}\tau P_{i}-\tau$ where $q_{i}$ is the probability that $i^{\mathrm{th}}$ Pauli $P_{i}$ is applied.

We can also apply this to noise channels globally applied to $M$ qubits. Consider the $M$ -qubit (not necessarily unital) depolarizing channel with a full-rank fixed point $\sigma$ defined as $\mathcal{D}_{\gamma,\sigma}(\tau)\coloneqq(1-\gamma)\tau+\gamma\sigma$ . For instance, the fixed state $\sigma$ can be taken as the thermal Gibbs state $\sigma_{\beta,H}=e^{-\beta H}/\operatorname{Tr}(e^{-\beta H})$ for some inverse temperature $\beta$ and Hamiltonian $H$ . Then, it was shown in Ref. [68] that

\displaystyle S(\mathcal{D}_{\gamma,\sigma}(\rho)\|\sigma)\leq(1-\gamma)^{2\alpha_{1}}S(\rho\|\sigma)

(80)

where $\alpha_{1}$ is computed by

\displaystyle\alpha_{1}=\min_{x\in[0,1]}\frac{1}{2}(1+q_{\lambda_{\min}(\sigma)}(x)).

(81)

Here, $\lambda_{\min}(\sigma)$ is the minimum eigenvalue of $\sigma$ , and $q_{x}(y)$ is defined for $x,y\in(0,1)$ as

\displaystyle q_{y}(x)\coloneqq\begin{cases}\frac{D_{2}(y\|x)}{D_{2}(x\|y)}&x\neq y\\ 1&x=y\end{cases}

(82)

where $D_{2}(x\|y)\coloneqq x\log(x/y)+(1-x)\log[(1-x)/(1-y)]$ is the binary relative entropy. As shown in Ref. [68], $\alpha_{1}$ monotonically increases with $\lambda_{\min}(\sigma)$ and behaves as $\alpha_{1}\to 1/2$ in the limit $\lambda_{\min}(\sigma)\to 0$ and $\alpha_{1}\to 1$ in the limit $\lambda_{\min}(\sigma)\to 1/2$ . Taking $\xi_{\mathcal{D}_{\gamma,\sigma}}=(1-\gamma)^{2\alpha_{1}}$ in Theorem S.3 then allows us to obtain sample lower bounds for the generalized global depolarizing noise with an arbitrary full-rank fixed state.

Similarly to the tensor-product channels, the exponential sample cost can be extended to a wide class of quantum channels. Let $\mathcal{N}$ be a unital channel and $\mathcal{N}^{\dagger}$ be its dual, i.e., $\operatorname{Tr}(X\mathcal{N}(Y))=\operatorname{Tr}(\mathcal{N}^{\dagger}(Y)X),\forall X,Y$ . Then, if $\mathcal{N}^{\dagger}\circ\mathcal{N}$ is primitive, i.e., $\lim_{n\to\infty}(\mathcal{N}^{\dagger}\circ\mathcal{N})^{n}(\tau)=\mathbb{I}/2^{M},\ \forall\tau$ , we have $\lambda_{\mathcal{N}}<1$ [69].

Our results also admit a thermodynamic interpretation. Consider the situation where the ideal circuit is the identity map and our system suffers from the continuous thermal map $\{\Phi_{t}\}_{t}$ generated by a Liouvillian $\mathcal{L}$ with a thermal Gibbs state $\sigma_{\beta,H}=e^{-\beta H}/\operatorname{Tr}(e^{-\beta H})$ as a fixed state, i.e., $\mathcal{L}(\sigma_{\beta,H})=0$ [59]. This represents the state preservation scenario, in which we would like to extract the classical information about the initial state after some period of time.

Let $\rho$ be the initial state and $\rho_{t}=\Phi_{t}(\rho)$ be a state at time $t\geq 0$ . Note that for an arbitrary state $\tau$ , we have $S(\tau\|\sigma_{\beta,H})=\beta(F(\tau)-F_{\rm eq})$ where $F(\tau)\coloneqq\operatorname{Tr}(\tau H)-S(\tau)/\beta$ is the nonequilibrium free energy of $\tau$ and $F_{\rm eq}\coloneqq F(\sigma_{\beta,H})$ is the equilibrium free energy. Then, Theorem 1 implies that, to extract the expectation value of the initial state $\rho$ , we need to use

\displaystyle N=\Omega\left(\frac{1}{F(\rho_{t})-F_{\rm eq}}\right)

(83)

samples. In other words, the necessary sampling cost grows as the free energy gets lost due to thermalization. The property of the dynamical semigroup further gives

\displaystyle F(\rho_{t})-F_{\rm eq}\leq e^{-\alpha_{\rm ent}t}\left[F(\rho)-F_{\rm eq}\right]

(84)

with

\displaystyle\alpha_{\rm ent}=\min_{\tau}\frac{\dot{\Sigma}(\tau)}{\beta\left[F(\tau)-F_{\rm eq}\right]}

(85)

where $\dot{\Sigma}(\tau)\coloneqq\left[\frac{d}{dt}S(\tau_{t})-\beta\frac{d}{dt}\operatorname{Tr}(\tau_{t}H)\right]|_{t=0}=-\operatorname{Tr}[\mathcal{L}(\tau)\ln(\tau)]-\operatorname{Tr}[\mathcal{L}(\tau)H]$ is the entropy production rate. In particular, when $\alpha_{\rm ent}>0$ holds, i.e., $\dot{\Sigma}(\tau)>0,\ \forall\tau\neq\sigma_{\beta,H}$ , this implies

\displaystyle N=\Omega\left(e^{\alpha_{\rm ent}t}\right),

(86)

the exponential sample cost with time $t$ . This cannot be avoided even if we apply intermediate operations that preserve the Gibbs state, as such operations cannot increase the free energy. We also remark that there has been an intense study on the evaluation of the exponent $\alpha_{1}$ for various types of thermal maps, lattices, and Hamiltonian (e.g., Refs. [59, 76, 77, 78, 72, 71]). These results can directly be applied to give lower bounds for the sampling overhead required for quantum error mitigation.

Thermal channels construct an important class of noise channels of physical significance, and the exponential contraction of the relative entropy, i.e., (71) with $\xi_{\mathcal{N}}<1$ , was established for the thermal map that constructs a Markov semi-group [59]. It was conjectured in Ref. [59] that an arbitrary Markov semi-group with a unique fixed point would also show the exponential decay of the relative entropy. If this is true, then Theorem S.3 applies to an even wider class of noise channels, particularly when $U_{l}=\mathbb{I}\,\forall l$ corresponding to the state preservation scenario.

Although this conjecture appears to be plausible, it is still generally elusive to rigorously show the exponential contraction of the relative entropy for non-unital channels. Nevertheless, our framework admits an alternative bound that also grows exponentially with circuit depth, as we show in the following.

The idea is to employ the exponential contraction of the trace distance instead of the relative entropy. For a channel $\mathcal{N}$ that has the unique fixed point $\tau$ , it holds that [79]

\displaystyle D_{\rm tr}\left(\mathcal{N}^{k}(\rho),\tau\right)\leq 2\left[s_{2}(\mathcal{N})\right]^{k}\left[\lambda_{\min}(\tau)\right]^{-1/2}\quad\forall\rho

(87)

for an arbitrary positive integer $k$ , where $\mathcal{N}^{k}$ refers to the $k$ sequential applications of $\mathcal{N}$ , $\lambda_{\min}(\tau)$ is the smallest eigenvalue of $\tau$ , and $s_{2}(\mathcal{N})$ is the second largest singular value of the map $\Gamma_{\tau}^{-1/2}\circ\mathcal{N}\circ\Gamma_{\tau}^{1/2}$ with $\Gamma_{\tau}$ defined by $\Gamma_{\tau}(\rho)\coloneqq\tau^{1/2}\rho\tau^{1/2}$ .

Consider the setting of the layered circuit with $U_{l}=\mathbb{I}\,\forall l$ , where we are to use $N$ $M$ -qubit circuits as an input to error mitigation. Let $\mathcal{N}$ be the $M$ -qubit noise channel that applies to each layer of a noisy circuit. This is the case where the effective noise channel of the depth- $L$ noisy circuit is represented by $\mathcal{N}^{L}$ . Then, the bound in Theorem 1 in terms of the relative entropy can be bounded as

	$\displaystyle N\geq\max_{\begin{subarray}{c}\rho,\sigma\in\mathbb{S}\\ D_{\mathbb{O}}(\rho,\sigma)\geq 2\delta\end{subarray}}\frac{2(1-2\varepsilon)^{2}}{\ln 2\cdot S(\mathcal{N}^{L}(\rho)\\|\mathcal{N}^{L}(\sigma))}$	$\displaystyle\geq\max_{\begin{subarray}{c}\rho,\sigma\in\mathbb{S}\\ D_{\mathbb{O}}(\rho,\sigma)\geq 2\delta\end{subarray}}\frac{(1-2\varepsilon)^{2}\lambda_{\min}(\mathcal{N}^{L}(\sigma))}{2\ln 2\,D_{\rm tr}(\mathcal{N}^{L}(\rho),\mathcal{N}^{L}(\sigma))^{2}}$		(88)
		$\displaystyle\geq\frac{(1-2\varepsilon)^{2}\lambda_{\min}(\mathcal{N})\lambda_{\min}(\tau)}{8\ln 2\left[s_{2}(\mathcal{N})\right]^{2L}}$		(88)

where the second inequality is due to the continuity bound of the relative entropy (61), and in the last inequality we used (87) and defined $\lambda_{\min}(\mathcal{N})\coloneqq\min_{\rho}\lambda_{\min}(\mathcal{N}(\rho))$ , which refers to the minimum eigenvalue of the output state of the channel $\mathcal{N}$ . This bound shows the exponential sampling cost for an arbitrary channel $\mathcal{N}$ that has a unique fixed state and $s_{2}(\mathcal{N})<1$ .

We remark that $\lambda_{\min}(\mathcal{N})$ and $\lambda_{\min}(\tau)$ can be exponentially small with the qubit number $M$ , which prevents one from using this bound to argue that the circuit with $L=\Omega(M)$ must use an exponential sample cost, unlike the one in Theorem S.3. Nevertheless, it still ensures exponential growth with respect to the layer number $L$ if we see it as a separate parameter from the qubit number $M$ . We also stress that the tighter bound in Theorem S.3 would become valid once the exponential decay of the relative entropy for non-unital channels is established, going along with future developments in the understanding of non-unital Markovian semi-groups.

As an example of an important non-unital channel, let us discuss the generalized amplitude damping noise [54], which describes the thermal relaxation at finite temperature in, e.g., superconducting systems [80]. The generalized amplitude damping $\mathcal{A}_{\gamma,\kappa}$ characterized by two parameters $\gamma,\kappa$ with $0<\gamma,\kappa<1$ has the Kraus operators

	$\displaystyle K_{1}$	$\displaystyle=\sqrt{1-\kappa}\left(\|{0}\rangle\!\langle{0}\|+\sqrt{1-\gamma}\|{1}\rangle\!\langle{1}\|\right)\quad K_{2}=\sqrt{\gamma(1-\kappa)}\|{0}\rangle\!\langle{1}\|$		(89)
	$\displaystyle K_{3}$	$\displaystyle=\sqrt{\kappa}\left(\sqrt{1-\gamma}\|{0}\rangle\!\langle{0}\|+\|{1}\rangle\!\langle{1}\|\right)\quad K_{4}=\sqrt{\gamma\kappa}\|{1}\rangle\!\langle{0}\|.$		(89)

The generalized amplitude damping channel maps the Pauli operators as

	$\displaystyle\mathcal{A}_{\gamma,\kappa}(X)$	$\displaystyle=\sqrt{1-\gamma}X\quad\mathcal{A}_{\gamma,\kappa}(Y)=\sqrt{1-\gamma}Y$		(90)
	$\displaystyle\mathcal{A}_{\gamma,\kappa}(Z)$	$\displaystyle=(1-\gamma)Z\quad\mathcal{A}_{\gamma,\kappa}(\mathbb{I})=\mathbb{I}+\gamma(1-2\kappa)Z.$		(90)

This ensures that at the limit of the infinite number of applications of $\mathcal{A}_{\gamma,\kappa}$ , the three non-trivial Pauli operators $X$ , $Y$ , and $Z$ vanish, while the identity operator approaches $\mathbb{I}+\gamma(1-2\kappa)\left(\sum_{k=0}^{\infty}(1-\gamma)^{k}\right)Z=\mathbb{I}+(1-2\kappa)Z$ . This shows that the fixed state of $\mathcal{A}_{\gamma,\kappa}$ is $\frac{\mathbb{I}+(1-2\kappa)Z}{2}$ , and more generally, the fixed point of the tensor-product channel $\mathcal{A}_{\gamma,\kappa}^{\otimes k}$ for an arbitrary positive integer $k$ is $\left(\frac{\mathbb{I}+(1-2\kappa)Z}{2}\right)^{\otimes k}$ , because whatever the $k$ -qubit initial state we start from, only the $\mathbb{I}/2^{k}$ term in the Pauli decomposition of the initial state survives after the infinite number of applications of $\mathcal{A}_{\gamma,\kappa}^{\otimes k}$ , which converges to $\left(\frac{\mathbb{I}+(1-2\kappa)Z}{2}\right)^{\otimes k}$ . This means that $\mathcal{A}_{\gamma,\kappa}^{\otimes k}$ has a unique fixed state, to which the bound (88) applies.

In particular, the sampling lower bound for mitigating the local generalized amplitude damping noise applied to $M$ -qubit circuit is obtained by setting $\mathcal{N}=\mathcal{A}_{\gamma,\kappa}^{\otimes M}$ in (88) with $\tau=\left(\frac{\mathbb{I}+(1-2\kappa)Z}{2}\right)^{\otimes M}$ . This immediately gives $\lambda_{\min}(\tau)=\tilde{\kappa}^{M}$ where $\tilde{\kappa}\coloneqq\min\{\kappa,1-\kappa\}$ .

To get $\lambda_{\min}(\mathcal{A}_{\gamma,\kappa})$ , we note that for a general channel $\mathcal{N}$ , if $\mathcal{N}$ satisfies $\mathcal{N}(\rho)\geq p\mathbb{I}$ for every state $\rho$ , we have $\lambda_{\min}(\mathcal{N})\geq p$ . This happens when its Choi state $J_{\mathcal{N}}=\operatorname{id}\otimes\mathcal{N}(\tilde{\Phi})$ for $\mathcal{N}$ , where $\tilde{\Phi}=|{ii}\rangle\!\langle{jj}|$ is the unnormalized maximally entangled state, has the minimum eigenvalue $\lambda_{\min}(J_{\mathcal{N}})\geq p$ . This allows us to lower bound $\lambda_{\min}(\mathcal{N})$ by looking at the eigenvalue of the Choi state. A straightforward calculation reveals that the Choi state of $\mathcal{A}_{\gamma,\kappa}$ is given by

\displaystyle J_{\mathcal{A}_{\gamma,\kappa}}=\begin{pmatrix}1-\gamma\kappa&0&0&\sqrt{1-\gamma}\\ 0&\gamma\kappa&0&0\\ 0&0&\gamma(1-\kappa)&0\\ \sqrt{1-\gamma}&0&0&1-\gamma(1-\kappa)\end{pmatrix},

(91)

which gives $\lambda_{\min}(J_{\mathcal{A}_{\gamma,\kappa}^{\otimes M}})=\left[\lambda_{\min}(J_{\mathcal{A}_{\gamma,\kappa}})\right]^{M}=\left(1-\gamma/2-\sqrt{1-\gamma+\gamma^{2}(\kappa-1/2)^{2}}\right)^{M}$ .

Finally, $s_{2}(\mathcal{A}_{\gamma,\kappa}^{\otimes M})$ can be obtained by computing the singular values of $\Gamma_{\tau}^{-1/2}\circ\mathcal{A}_{\gamma,\kappa}^{\otimes M}\circ\Gamma_{\tau}^{1/2}$ . Letting $\tau_{1}\coloneqq\frac{\mathbb{I}+(1-2\kappa)Z}{2}$ so that $\tau=\tau_{1}^{\otimes M}$ , we get $\Gamma_{\tau}^{-1/2}\circ\mathcal{A}_{\gamma,\kappa}^{\otimes M}\circ\Gamma_{\tau}^{1/2}=\left(\Gamma_{\tau_{1}}^{-1/2}\circ\mathcal{A}_{\gamma,\kappa}\circ\Gamma_{\tau_{1}}^{1/2}\right)^{\otimes M}$ . This implies that the singular values of the map $\Gamma_{\tau}^{-1/2}\circ\mathcal{A}_{\gamma,\kappa}^{\otimes M}\circ\Gamma_{\tau}^{1/2}$ on $M$ -qubit systems are obtained by multiplying the singular values of the single-qubit map $\Gamma_{\tau_{1}}^{-1/2}\circ\mathcal{A}_{\gamma,\kappa}\circ\Gamma_{\tau_{1}}^{1/2}$ . In addition, since the largest singular value of $\Gamma_{\tau_{1}}^{-1/2}\circ\mathcal{A}_{\gamma,\kappa}\circ\Gamma_{\tau_{1}}^{1/2}$ is 1, we get $s_{2}(\mathcal{A}_{\gamma,\kappa}^{\otimes M})=s_{2}(\mathcal{A}_{\gamma,\kappa})$ allowing us to only compute the second largest singular value of the single-qubit map $\Gamma_{\tau_{1}}^{-1/2}\circ\mathcal{A}_{\gamma,\kappa}\circ\Gamma_{\tau_{1}}^{1/2}$ . This can be obtained as $s_{2}(\mathcal{A}_{\gamma,\kappa}^{\otimes M})=s_{2}(\mathcal{A}_{\gamma,\kappa})=\sqrt{1-\gamma}$ by a simple computation.

Plugging these quantities into (88) results in

\displaystyle N\geq\frac{(1-2\varepsilon)^{2}\left[\tilde{\kappa}\left(1-\gamma/2-\sqrt{1-\gamma+\gamma^{2}(\kappa-1/2)^{2}}\right)\right]^{M}}{8\ln 2\left(1-\gamma\right)^{L}},\quad\tilde{\kappa}\coloneqq\min\{\kappa,1-\kappa\},

(92)

which particularly grows exponentially with the layer number $L$ , noting that $0<\gamma<1$ .

Appendix J Relation to previous works on noisy layered circuits

Theorems 3 and S.2 imply that if an $n$ -qubit quantum circuit with super-logarithmic depth suffers from local depolarizing noise (and other types of noise as discussed in Appendix I), super-polynomial number of samples are required, making the computation infeasible even with the help of quantum error mitigation. The observation that computation quickly becomes useless under the noise effects dates back to the seminal work by Aharonov et al. [10] (see, e.g., Refs. [9, 11] for the works that put forward and investigated related ideas). Here, we clarify differences between our results and the previous works, in particular Ref. [10].

The setting discussed in Ref. [10] involves a quantum circuit with $d$ layers of unitary gates, where each layer is followed by the action of local depolarizing noise. After the $d$ th layer, each qubit is measured with the computational basis. The authors showed that when the depth is super-logarithmic with the qubit number, the probability of obtaining 0 (and 1) cannot be away from 1/2 by a constant amount. This implies that to realize the “useful” computation — which should not be simulable by a random guess — the noisy quantum circuit should be restricted to the one with logarithmic depth. The authors showed this result by employing the fact — which was also shown in the same manuscript — that the relative entropy $S(\rho\|\mathbb{I}/2^{n})=n-S(\rho)$ decreases exponentially with circuit depth. They combined this with the fact that the computational-basis measurement (followed by partial trace) does not increase the relative entropy, showing that the entropy of the probability distribution of the measurement outcome of the first qubit can only be deviated from the maximal value by the amount scaling as $n2^{-d}$ . Therefore, to ensure the probability of obtaining 0 or 1 to be away from 1/2 by a constant amount, one needs to have $d=\mathcal{O}(\log n)$ to allow the entropy to be away from the maximal value with a large $n$ limit.

Although the above consideration contains important insights, and our results — as well as many other follow-up works after this such as Refs. [59, 43, 42, 41, 40] — benefit from their inspirations, the argument there is not directly applicable to the setting involving quantum error mitigation. As explained in the main text, the idea of quantum error mitigation is to use quantum and classical resources in hybrid, in which postprocessing computation after the noisy circuit is essential. In our framework, this part is included in the trailing process ( $\mathcal{P}_{A}$ in Fig. 1 in the main text) — indeed, our model allows an arbitrary quantum operation as the trailing process, which also includes classical postprocessing computation. Importantly, the trailing process is not unital in general, and therefore, entropy can decrease in the postprocessing step, preventing the results in Ref. [10] from being directly carried over.

We solve this problem by first reducing the performance of error mitigation to the distinguishability of two noisy quantum states, which eventually leads to Theorems 1 and 2 providing the necessary sampling cost to achieve the target error mitigation performance with respect to entropic measures. It is a priori not obvious how entropic quantity is quantitatively related to the operational performance quantifiers such as accuracy–probability and bias–standard deviation of quantum error mitigation. Theorems 1 and 2 establish the connections between these quantities, which then allow us to focus on analyzing the entropic measures, where we can apply the previous findings, such as the exponential decay of relative entropy from the maximally mixed state under local depolarizing noise, to show the exponential sampling overhead for the general error mitigation.

In addition, our quantitative bounds — which give nearly tight estimates for certain cases — provide concrete sampling lower bounds beyond just the exponential scaling behavior for layered circuits, which could also be used as a benchmark for error mitigation strategies.

Let us also remark on how our results are related to the prior works that investigated the capability of general error mitigation, which involves postprocessing operations. Refs. [39, 40] studied the maximum estimator spread, i.e., the range of outcomes of the estimator, imposed on general error mitigation protocols. These works showed that, for error mitigation with linear postprocessing, the maximum estimator spread must grow exponentially with the circuit depth for noisy layered circuits under local depolarizing noise. Although these results hinted that the exponential samples would be required in general error mitigation, they were not conclusive because the maximum estimator spread provides only a sufficient number of samples to ensure a certain accuracy. Our results close this gap by showing that exponential samples are necessary. Ref. [43] studied the general framework introduced in Ref. [39] and obtained a concentration bound, which places an upper bound for the probability of getting an estimate away from the value for the maximally mixed state by a certain amount. Their bound implies that, when the circuit depth is $\Omega(\log N)$ , the probability of achieving a certain constant accuracy will become exponentially small with respect to the accuracy multiplied by the Lipschitz constant of the estimator. However, as the authors pointed out, the Lipschitz constant of the estimator becomes exponentially large for many error mitigation protocols, which severely restricts the applicability of their bound. Also, their bound does not directly provide a sampling lower bound to achieve a certain accuracy. Our results, on the other hand, encompass the standard estimators with exponentially large Lipschitz constants and provide the first explicit sampling lower bounds that apply to general error mitigation protocols.

$\displaystyle S(\mathcal{D}_{\gamma}^{\otimes M}\circ\Xi\circ\mathcal{U}\circ\Lambda(\tau)\,\\|\,\mathbb{I}/2^{M})$	$\displaystyle\leq(1-\gamma)^{2}S(\Xi\circ\mathcal{U}\circ\Lambda(\tau)\,\\|\,\mathbb{I}/2^{M})$	(51)
	$\displaystyle=(1-\gamma)^{2}S(\Xi\circ\mathcal{U}\circ\Lambda(\tau)\,\\|\,\Xi\circ\mathcal{U}\circ\Lambda(\mathbb{I}/2^{M}))$
	$\displaystyle\leq(1-\gamma)^{2}S(\tau\,\\|\,\mathbb{I}/2^{M})$