
Corresponding author: [email protected]

Bootstrapping Classical Shadows for Neural Quantum State Tomography

Wirawat Kokaew (1QB Information Technologies (1QBit), Vancouver, British Columbia, Canada; Department of Physics & Astronomy, University of Waterloo, Waterloo, Ontario, Canada; Perimeter Institute for Theoretical Physics, Waterloo, Ontario, Canada); Bohdan Kulchytskyy (1QB Information Technologies (1QBit), Vancouver, British Columbia, Canada); Shunji Matsuura (1QB Information Technologies (1QBit), Vancouver, British Columbia, Canada); Pooya Ronagh (1QB Information Technologies (1QBit), Vancouver, British Columbia, Canada; Department of Physics & Astronomy, University of Waterloo, Waterloo, Ontario, Canada; Perimeter Institute for Theoretical Physics, Waterloo, Ontario, Canada; Institute for Quantum Computing, University of Waterloo, Waterloo, Ontario, Canada)
Abstract

We investigate the advantages of using autoregressive neural quantum states as ansatze for classical shadow tomography to improve its predictive power. We introduce a novel estimator for optimizing the cross-entropy loss function using classical shadows, and a new importance sampling strategy for estimating the loss gradient during training using stabilizer samples collected from classical shadows. We show that this loss function can be used to achieve stable reconstruction of GHZ states using a transformer-based neural network trained on classical shadow measurements. This loss function also enables the training of neural quantum states representing purifications of mixed states. Our results show that the intrinsic capability of autoregressive models in representing physically well-defined density matrices allows us to overcome the weakness of Pauli-based classical shadow tomography in predicting both high-weight observables and nonlinear observables such as the purity of pure and mixed states.

I Introduction

Access to experimentally prepared quantum states is obstructed by the limited rate at which classical information can be extracted from a quantum state through repeated, costly quantum measurements. Moreover, in near-term quantum devices, various sources of physical noise introduce additional systematic errors, corrupting the measurements. Even when preparing quantum states fault tolerantly, accessing the logical state's observables of interest will be susceptible to statistical errors resulting from the finite number of executions of the fault-tolerant algorithm. For these reasons, it is imperative to treat the quantum data collected from quantum experiments as scarce, highly valuable information and to maximize the utility of the available data.

The quantum data must be stored in a classical data structure as a proxy for our access to information about a given quantum state. In the simplest case, this data structure is the list of all the measurements performed (e.g., a tomographically complete set of measurements such as those used in the classical shadow technique [1]). In principle, such a raw dataset already includes all the information we have collected from the quantum state. This raises the question of what role a neural quantum state plays when it cannot learn more about the physics of the state than what is provided in the training data. However:

(a) the amount of memory required to store a tomographically complete set of measurements grows exponentially with the size of a quantum system;

(b) using this raw data to estimate various observables of the system may result in highly erroneous and often unphysical estimations; and

(c) every such estimation requires sweeping over the exponentially large list at least once.

While traditional quantum state tomography techniques that rely on solving inverse problems for various (partial or complete) approximations of the density matrix can overcome the last issue by providing a proxy to the raw data, they do not resolve the other issues.

Maximum likelihood estimation offers a path to alleviating all these challenges using memory-efficient parameterized models (i.e., ansatze) and randomized subsets of tomographically complete datasets. Model parameters are variationally optimized in update directions that make the training data the most likely to be generated by the learned ansatze. Such variational ansatze can flexibly be used to impose physicality on the reconstructed states. Among various approaches, tensor networks and neural networks provide flexible architectures capable of capturing relevant regions within the Hilbert space. Moreover, when these networks possess an autoregressive property, they can efficiently provide independent samples for downstream tasks, acting as generative models for the learned quantum states.

Recently, Ref. [2] proposed an approach for training a neural quantum state using classical shadows to combine the advantages of variational ansatze and classical shadow measurements. The authors introduce the infidelity between the classical shadow state (introduced in Ref. [1]) and the ansatz as their loss function, since it is provably efficient to estimate this quantity using Clifford shadow measurements. However, in the case of an unphysical state like the classical shadow state, the infidelity estimator is not constrained within the physically valid $[0,1]$ range. Additionally, infidelity cannot guarantee bounded errors in approximating many quantities of interest from the trained model. Finally, the loss does not generalize to ansatze for mixed states.

Moreover, the method in Ref. [2] faces important practical challenges in its training protocol. The trainability of the authors’ ansatz relies on an ad hoc initialization step. That is, they pretrain their model using the cross-entropy loss function on the classical dataset with measurements only in the computational basis, which works well for quantum states that are sparse in the computational basis, such as the GHZ state. This pretraining requires an additional set of measurements beyond the one used in the classical shadow technique. Furthermore, its success depends on the informational content of the underlying quantum state in the computational basis, rendering it non-universal.

In this paper, we introduce a novel estimator for optimizing the cross-entropy loss function. Our approach capitalizes on the classical shadow state to bootstrap the dataset used in constructing an unbiased estimator of this loss. Moreover, during training, we advocate using stabilizer samples instead of samples generated from our generative ansatz. This not only eliminates the necessity for pretraining but also provides a smoother training gradient. To summarize, the main contributions of our paper are as follows.

1. New loss function. We provide a novel estimator for the cross-entropy loss function using classical shadows. This loss function is viable for training mixed states.

2. New importance sampling strategy. We introduce stabilizer-based sampling for estimating the overlap between classical shadows and a neural quantum state. This overcomes the need for pretraining and significantly reduces the variance of gradient estimations during training.

3. Supervision of a physically valid mixed-state autoregressive model on measurement data. We show that our new loss function can be used to train a neural network representing a purification of the mixed quantum state from classical shadow training data.

4. Superior prediction of high-weight observables compared to raw classical shadow estimations. We show that the physicality constraints natively imposed by the explicit access of autoregressive models to conditional probability densities (in the autoregressive expansion) improve the accuracy of predictions extracted from classical shadow data.

Results on pure states. To showcase the strengths of our ideas, we focus on the Greenberger–Horne–Zeilinger (GHZ) state, which is typically used as a challenging benchmark for quantum state tomography due to its multi-partite entanglement. In Section III.2, we employ Clifford measurements of pure GHZ states to conduct a comparative study. Our new loss function and importance sampling strategy show superior performance in learning six- and eight-qubit GHZ states compared to existing methods [2, 3].

Results on mixed states. Since in any relevant experimental setup the states to be studied and characterized are mixed, we use a six-qubit GHZ state depolarized according to various channel strengths to highlight the significance of our techniques for experimentalists. We note that Clifford measurements require deep circuits to implement Clifford twirling, which is not feasible on noisy quantum computers. To overcome this challenge, shallow shadow protocols have been devised [4, 5, 6, 7]. The depth of the Clifford decomposition prescribes an upper bound on the weights of the observables that can be estimated well using the collected measurements. The generalization power of our natively physical ansatz allows us to use the “shallowest” possible Clifford tails (i.e., Pauli measurements) to achieve satisfactory predictions of high-weight Pauli observables as well as nonlinear observables such as purity.

II Methods

In this section, we review existing methods used for efficiently reconstructing quantum states. In particular, we consider the recent development of neural networks and the classical shadow technique for reconstructing valid quantum states [2, 8, 9].

II.1 Neural Quantum States

A promising way to efficiently represent a pure quantum state $\ket{\psi}$ of multiple qubits is with a neural network parameterized by a weight vector $\lambda$ as

\psi_{\lambda}(s)=\sqrt{p_{\lambda}(s)}e^{i\varphi_{\lambda}(s)}, (1)

where $s\in\mathcal{B}_{n}=\{0,1\}^{n}$ is a binary vector in the computational basis. Similarly, the larger state space of density matrices can be parameterized with a neural network. Among competing approaches, the one based on the idea of purification [10],

\rho_{\lambda}(s_{1},s_{2})=\sum_{\bar{s}\in\mathcal{B}}\psi^{*}_{\lambda}(s_{1},\bar{s})\psi_{\lambda}(s_{2},\bar{s}), (2)

is particularly appealing, as the resulting density matrix is physical. Here, s¯\bar{s} are auxiliary input variables representing the extended dimensions of the Hilbert space of the purified state. Therefore, their number ns¯n_{\bar{s}} constrains the entropy of ρλ\rho_{\lambda}, with nsn_{s} being a theoretically proven sufficient upper bound for capturing any density matrix over the physical input variables ss. We treat ns¯n_{\bar{s}} as a hyperparameter on the same footing as other parameters affecting the neural network’s architecture and hence its expressive power.

When the neural network parameters $\lambda$ in the ansatze of Eqs. 1 and 2 are embedded within a generative model based on an autoregressive architecture, the resulting model can be very expressive and amenable to an efficient extraction of information about the underlying wave function, as long as the quantity of interest can be cast as an unbiased estimator over samples from the model. For example, the purity of $\rho_{\lambda}$ can be estimated by evaluating the Swap operator between the physical and auxiliary qubits [11, 12, 13].
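As a small, self-contained illustration of Eq. 2 (not the transformer ansatz used in this work), the following NumPy sketch builds the purified density matrix from a table of amplitudes $\psi_{\lambda}(s,\bar{s})$ and evaluates its trace and purity directly; the random `psi` array stands in for the output of an autoregressive network and is assumed purely for illustration.

```python
import numpy as np

n_s, n_sbar = 3, 3                          # physical and auxiliary qubits
dim_s, dim_sbar = 2 ** n_s, 2 ** n_sbar

# Placeholder amplitudes psi_lambda(s, sbar); in practice these would be queried
# from an autoregressive network for each configuration (s, sbar).
rng = np.random.default_rng(0)
psi = rng.normal(size=(dim_s, dim_sbar)) + 1j * rng.normal(size=(dim_s, dim_sbar))
psi /= np.linalg.norm(psi)                  # normalize the purified state

# Eq. (2): rho(s1, s2) = sum_sbar psi*(s1, sbar) psi(s2, sbar)
rho = psi.conj() @ psi.T

print("trace :", np.trace(rho).real)        # ~1 by construction
print("purity:", np.trace(rho @ rho).real)  # Tr[rho^2] <= 1, with equality iff pure
```

For larger systems, the dense sums above are replaced by sample-based estimators such as the Swap-operator estimator mentioned above.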

II.2 Classical Shadows

The classical shadow formalism provides an efficient method for predicting various properties using relatively few measurements. A classical shadow is created by repeatedly performing a simple procedure: at every iteration $i$, a random unitary transformation $\rho\mapsto U_{i}\rho U_{i}^{\dagger}$ is applied, after which all the qubits are measured in the computational basis. The pairs $(U_{i},s_{i})$ can be recorded efficiently in classical memory. Let $\rho_{i}=U_{i}^{\dagger}\outerproduct{s_{i}}{s_{i}}U_{i}$ be the associated density matrix. The mapping $\mathcal{M}:\rho\mapsto\mathbb{E}[\rho_{i}]$ is invertible if the measurement protocol is tomographically complete. The set $\mathcal{S}_{\rho}=\{\mathcal{M}^{-1}(\rho_{i})\}_{i=1}^{N}$ is called a classical shadow of $\rho$ of size $N$ and can be used to construct an unbiased estimator $\hat{\rho}=\mathbb{E}[\mathcal{M}^{-1}(\rho_{i})]$ of $\rho$.

The classical shadow $\mathcal{S}_{\rho}$ suffices to predict $M$ arbitrary linear and nonlinear functions $O_{i}$ of the state $\rho$ up to an additive error $\epsilon$ if it is of size $\Omega(\log(M)\,\max_{i}||O_{i}||^{2}_{\rm shadow}/\epsilon^{2})$. The shadow norm depends on the ensemble from which the unitaries $U_{i}$ are drawn for creating the classical shadow [1] and has efficient bounds, particularly if the ensemble satisfies a unitary 3-design property. For example, this is the case if the unitaries are drawn from the Clifford set as in Ref. [1] or using analog evolutions of different lengths of time as in Ref. [14].

Two particularly interesting ensembles are the Pauli and Clifford groups. These ensembles are computationally favourable because of their efficient classical simulations, according to the Gottesman–Knill theorem. The inverse twirling maps for the $n$-qubit Clifford and Pauli groups take the forms $\mathcal{M}^{-1}_{n}(X)=(2^{n}+1)X-\mathbb{I}$ and $\otimes_{k=1}^{n}\mathcal{M}_{1}^{-1}$, respectively. The shadow norm scales well for the Clifford ensemble, while the Pauli ensemble is more experimentally friendly. For the Pauli ensemble, the shadow norm scales exponentially poorly with respect to the locality of an observable, rendering the procedure inefficient in estimating high-weight observables. A middle-ground solution suggested in the literature [4, 5, 6] is to use shallow subsets of the Clifford group.
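For concreteness, here is a minimal dense-matrix sketch of a single Pauli-ensemble snapshot, applying the single-qubit inverse map $\mathcal{M}_{1}^{-1}(\outerproduct{s}{s})=3\outerproduct{s}{s}-I$ factor by factor. The per-snapshot data format (a list of measured bases and outcome bits) is an assumption made for illustration, and the dense Kronecker products are practical only for small $n$.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
EIGVECS = {  # +1 (index 0) and -1 (index 1) eigenvectors of X, Y, Z
    "X": [np.array([1, 1]) / np.sqrt(2), np.array([1, -1]) / np.sqrt(2)],
    "Y": [np.array([1, 1j]) / np.sqrt(2), np.array([1, -1j]) / np.sqrt(2)],
    "Z": [np.array([1, 0]), np.array([0, 1])],
}

def snapshot(bases, bits):
    """Inverse-channel snapshot M^{-1}(rho_i) for one Pauli measurement.

    bases: e.g. ["X", "Z", "Y"]; bits: measured outcomes as 0/1, e.g. [0, 1, 0].
    """
    factors = []
    for basis, bit in zip(bases, bits):
        v = EIGVECS[basis][bit]
        factors.append(3 * np.outer(v, v.conj()) - I2)  # M_1^{-1} applied per qubit
    return reduce(np.kron, factors)

# The classical shadow state is the average over all recorded snapshots:
# rho_hat = np.mean([snapshot(b, s) for b, s in dataset], axis=0)
```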

In this paper, we show that an alternative approach to improving the predictive power of Pauli ensembles (i.e., the shallowest possible shadows) is to use a neural network (or another parameterized ansatz). Whereas the raw classical shadow is powerful at predicting many observables efficiently, it is not always sample efficient due to the unphysical nature of the state it provides. This is because, while the classical shadow state has unit trace, not all of its eigenvalues are guaranteed to be nonnegative. However, neural quantum states allow for natively imposing physicality constraints on a reconstructed state.

Another advantage of the unitary design property of the Clifford ensemble is that it allows for the efficient prediction of nonlinear functions of a quantum state. This is particularly useful in estimating entropy-based quantities such as the Rényi entropies; for example, the second Rényi entropy involves two copies of the state and is thus a quadratic function of it. We show that neural networks also improve the predictive power of such nonlinear quantities when combined with the purification technique explained in Section II.1.

II.3 Neural Quantum State Tomography

II.3.1 Shadow-Based Loss Functions

Ref. [2] introduces the idea of using a classical shadow as a training objective for a neural quantum state. In the paper, the infidelity between the neural and shadow states from Clifford ensembles,

1-F_{\lambda}=1-\expectationvalue{\hat{\rho}}{\psi_{\lambda}},

is used to derive the loss function

\mathcal{L}_{\rm{inf}}^{\rm{C}}(\lambda)=-\sum_{i=1}^{N}p_{\lambda}(\phi_{i}). (3)

Here, “C” stands for Clifford ensembles, “inf” refers to the infidelity, and $p_{\lambda}(\phi_{i})=\lvert\innerproduct{\psi_{\lambda}}{\phi_{i}}\rvert^{2}$ is the probability of measuring the snapshot $\phi_{i}=U_{i}^{\dagger}\ket{s_{i}}$.

The same loss function can also be modified for the case of Pauli ensembles as follows. For each Pauli $U_{i}=U_{i,1}\otimes\cdots\otimes U_{i,n}$ and any bitstring $b\in\mathcal{B}_{n}\equiv\{0,1\}^{n}$, let $|b|=\sum_{i=1}^{n}b_{i}$ be the number of ones in $b$. Let $U_{i,b}=U_{i,1}^{b_{1}}\otimes\cdots\otimes U_{i,n}^{b_{n}}$ be the operator that has an identity factor for each zero index in $b$. In this case,

\mathcal{M}^{-1}\rho_{i} = \bigotimes_{k=1}^{n}\left(3U_{i,k}^{\dagger}\outerproduct{s_{i,k}}{s_{i,k}}U_{i,k}-I\right)
= \bigotimes_{k=1}^{n}\left(3U_{i,k}^{\dagger}\outerproduct{s_{i,k}}{s_{i,k}}U_{i,k}-\left[\outerproduct{0}{0}+\outerproduct{1}{1}\right]\right)
= \sum_{b\in\mathcal{B}_{n}}\sum_{c\in\mathcal{B}_{n-|b|}}(-1)^{n-|b|}3^{|b|}\,U_{i,b}^{\dagger}\outerproduct{s_{i,b,c}}{s_{i,b,c}}U_{i,b}
= \sum_{b\in\mathcal{B}_{n}}\sum_{c\in\mathcal{B}_{n-|b|}}3^{|b|}\outerproduct{\phi_{i,b,c}}{\phi_{i,b,c}}.

Here, $\ket{\phi_{i,b,c}}=(-1)^{\frac{n-|b|}{2}}U_{i,b}^{\dagger}\ket{s_{i,b,c}}$, where the computational basis state $\ket{s_{i,b,c}}$ repeats the bit in $s_{i}$ for locations that are hot in the bitstring $b$ and uses the bit in bitstring $c$ in every other location. The resulting training loss function is

\mathcal{L}_{\rm{inf}}^{\rm{P}}(\lambda)=-\sum_{i=1}^{N}\sum_{b\in\mathcal{B}_{n}}\sum_{c\in\mathcal{B}_{n-|b|}}3^{|b|}p_{\lambda}(\phi_{i,b,c}), (4)

where “P” stands for Pauli ensembles.

We suppress the superscripts “P” and “C” in the rest of this section, as the discussion that follows applies to both training datasets. Unfortunately, these loss functions are not generalizable to mixed states. To circumvent this drawback, we reconsider the Kullback–Leibler (KL) divergence. Our Pauli or Clifford datasets can be viewed as samples drawn from a target probability measure $p^{\rm{data}}:\mathrm{Stab}\to\mathbb{R}$ on the space of stabilizer states. For Pauli measurements this is the product of single-qubit stabilizer states $\mathrm{Stab}_{1}^{\otimes n}$, and for Clifford measurements it is $\mathrm{Stab}_{n}$, the full set of all $n$-qubit stabilizer states. The neural network (or, more generally, the autoregressive model) representing the quantum state is a priori a generative model with explicit access to a probability distribution $p_{\lambda}:\{0,1\}^{n}\to\mathbb{R}$ on the computational basis states. However, the efficiency of the stabilizer state tableau representation allows us to extend it to a measure $p_{\lambda}:\mathrm{Stab}\to\mathbb{R}$ on the stabilizer states. The KL divergence between $p_{\lambda}$ and $p^{\rm{data}}$ can be written as

\mathrm{KL}\left(p^{\mathrm{data}}||p_{\lambda}\right)=\expectationvalue{\ln p^{\mathrm{data}}(\phi)}_{\phi\sim p^{\mathrm{data}}}-\expectationvalue{\ln p_{\lambda}(\phi)}_{\phi\sim p^{\mathrm{data}}}.

We note that the entropy of the underlying data distribution is constant with respect to the variational parameters $\lambda$. Consequently, the loss function reduces to the cross-entropy term $-\expectationvalue{\ln p_{\lambda}(\phi)}_{\phi\sim p^{\mathrm{data}}}$.

Let $\mathcal{D}\subseteq\mathrm{Stab}$ be an independent, identically distributed family of samples (allowing repetitions) drawn from $p^{\mathrm{data}}$, and let $\widetilde{\mathcal{D}}$ be the set of elements in the dataset $\mathcal{D}$ (removing repetitions). The cross-entropy loss function can then be approximated by the negative log-likelihood of $\mathcal{D}$,

\mathcal{L}_{\text{ECE}}(\lambda)=-\sum_{\phi\in\mathcal{D}}\ln p_{\lambda}(\phi), (5)

which we call the empirical cross-entropy (ECE) loss function.

This is consistent with the intuition that the optimal model should treat the observed samples as the most probable ones. Equation 5 can be viewed in light of the empirical distribution

p^{\mathrm{emp}}(\phi)=\frac{1}{|\mathcal{D}|}\sum_{\phi^{\prime}\in\mathcal{D}}\mathbb{I}_{\phi^{\prime}=\phi}, (6)

where $\mathbb{I}_{\phi^{\prime}=\phi}$ is an indicator function. Since $p^{\mathrm{emp}}$ is an unbiased estimator for $p^{\mathrm{data}}$, the loss function Eq. 5 can be rewritten as

\mathcal{L}_{\text{ECE}}(\lambda)=-\expectationvalue{\ln p_{\lambda}(\phi)}_{\phi\sim p^{\mathrm{emp}}}=-\sum_{\phi\in\widetilde{\mathcal{D}}}p^{\mathrm{emp}}(\phi)\ln p_{\lambda}(\phi).

We note that the new summation runs over distinct elements of the training data without repetition. This reformulation helps us to consider alternative unbiased estimators for the data distribution. Here, we consider an approach based on the classical shadow state $\hat{\rho}$. For a stabilizer state $\phi$, we define its shadow weight as

w(\phi)=|\matrixelement{\phi}{\hat{\rho}}{\phi}|,

where the absolute value is taken because $\matrixelement{\phi}{\hat{\rho}}{\phi}$ can take negative values, which prevents it from being interpreted as a probability distribution. The normalized shadow weights

p^{\mathrm{sh}}(\phi)=\frac{w(\phi)}{\sum_{\phi^{\prime}\in\widetilde{\mathcal{D}}}w(\phi^{\prime})} (7)

can now replace the empirical averaging in the cross-entropy approximation, justifying the new loss function

\mathcal{L}_{\text{SCE}}(\lambda)=-\expectationvalue{\ln p_{\lambda}(\phi)}_{\phi\sim p^{\mathrm{sh}}}=-\sum_{\phi\in\widetilde{\mathcal{D}}}p^{\mathrm{sh}}(\phi)\ln p_{\lambda}(\phi), (8)

which we call the shadow-based cross-entropy (SCE) loss function. In contrast to the empirical cross-entropy loss function Eq. (5), which assigns equal weight to the contribution of each stabilizer state $\phi$, the new loss function Eq. (8) leverages the classical shadow state to inject further signals useful for the training dynamics.
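The sketch below illustrates Eqs. (7) and (8) for a small, dense test case; the inputs (dense stabilizer vectors, the classical shadow matrix, and the log-probabilities returned by the ansatz) are placeholders for quantities that would, in practice, be handled through tableau representations and the autoregressive model.

```python
import numpy as np

def shadow_cross_entropy(stabilizer_vecs, rho_hat, log_p_lambda):
    """Shadow-based cross-entropy, Eq. (8), for a small dense test case.

    stabilizer_vecs: list of state vectors |phi> for the distinct stabilizers in the dataset.
    rho_hat:         classical shadow matrix (average of inverse-channel snapshots).
    log_p_lambda:    array of ln p_lambda(phi) from the ansatz, one entry per stabilizer.
    """
    # Shadow weights w(phi) = |<phi|rho_hat|phi>|; the absolute value handles the
    # negative eigenvalues the classical shadow state can have.
    w = np.array([np.abs(np.vdot(phi, rho_hat @ phi)) for phi in stabilizer_vecs])
    p_sh = w / w.sum()                   # Eq. (7): normalized shadow weights
    return -np.sum(p_sh * log_p_lambda)  # Eq. (8)

# The empirical cross-entropy, Eq. (5), is recovered by replacing p_sh with the
# relative frequency of each distinct stabilizer in the raw dataset.
```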

II.3.2 Monte Carlo Estimation and Gradients

Central to the computational tractability of the considered loss functions Eqs. 3, 5, and 8 is the evaluation of the overlap between a stabilizer state $\phi$ and the neural state $\psi_{\lambda}$:

\innerproduct{\psi_{\lambda}}{\phi}=\sum_{s\in\mathcal{B}}\psi_{\lambda}^{*}(s)\phi(s). (9)

Evaluating this exponentially large expansion in the computational basis exactly is intractable. However, the overlap can be estimated using Monte Carlo sampling in at least two ways. We can sample $s\in\mathcal{B}$ according to $p_{\lambda}(s)=|\psi_{\lambda}(s)|^{2}$ to obtain the estimator

\innerproduct{\psi_{\lambda}}{\phi}=\expectationvalue{\frac{\phi(s)}{\psi_{\lambda}(s)}}_{s\sim p_{\lambda}}. (10)

Alternatively, we can sample $s\in\mathcal{B}$ with probability proportional to its overlap with the stabilizer state $\phi$, $p_{\phi}(s)=|\phi(s)|^{2}$, to obtain the reformulated estimator

\innerproduct{\psi_{\lambda}}{\phi}=\expectationvalue{\frac{\psi_{\lambda}^{*}(s)}{\phi^{*}(s)}}_{s\sim p_{\phi}}. (11)

Both estimators are efficient to compute, as we can obtain the amplitudes and samples in polynomial time for both a stabilizer state $\phi$ and an autoregressive generative model representing $p_{\lambda}$. The question remains as to which estimator has a better (i.e., lower) variance when used in training the neural network. The estimator Eq. (10) is employed in Ref. [2]. This sampling strategy is akin to on-policy reinforcement learning, wherein the agent $p_{\lambda}$ attempts to choose its actions $s$ so as to maximize a reward. Hence, just like ordinary policy-gradient methods based on the REINFORCE technique, training is prone to suffering from a lack of sample diversity and therefore highly erroneous gradients.

In this paper, we propose using the estimator Eq. (11) instead, which keeps the training process more analogous to supervised learning. In this case, the stable statistics collected from $\phi$ provide a steady target during training. Moreover, unlike the gradient of Eq. (10) (see Ref. [2]), the gradient of Eq. (11) takes the simpler form

\nabla_{\lambda}\innerproduct{\psi_{\lambda}}{\phi}=\expectationvalue{\frac{\nabla_{\lambda}\psi_{\lambda}^{*}(s)}{\phi^{*}(s)}}_{s\sim p_{\phi}}.
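The two estimators can be compared directly on a small system. In the sketch below, dense vectors stand in for the amplitudes that would, in practice, be queried from the autoregressive model ($\psi_{\lambda}$) and the stabilizer tableau ($\phi$); this dense representation is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def overlap_ns(psi_lambda, phi, n_samples=500):
    """Eq. (10): sample s ~ p_lambda = |psi_lambda(s)|^2 and average phi(s)/psi_lambda(s)."""
    p = np.abs(psi_lambda) ** 2
    s = rng.choice(len(psi_lambda), size=n_samples, p=p / p.sum())
    return np.mean(phi[s] / psi_lambda[s])

def overlap_ss(psi_lambda, phi, n_samples=500):
    """Eq. (11): sample s ~ p_phi = |phi(s)|^2 and average psi_lambda*(s)/phi*(s)."""
    p = np.abs(phi) ** 2
    s = rng.choice(len(phi), size=n_samples, p=p / p.sum())
    return np.mean(psi_lambda[s].conj() / phi[s].conj())
```

In the stabilizer-based estimator, the sampling distribution $p_{\phi}$ does not depend on $\lambda$, so the gradient passes only through $\psi_{\lambda}^{*}(s)$, reproducing the simple form displayed above.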

III Results

III.1 Scaling and Variance Study

Prior to studying the training of neural quantum states with our new methods, we explore the benefits of using shadow weights and stabilizer-based importance sampling. To this end, we focus on the key training metrics: the informational content and the statistical properties of the gradient estimator.

To facilitate our analysis, we consider an empirical proxy for the data distribution, $p^{\rm{data}}_{\rm{emp}}(\phi)\propto p^{\rm{data}}(\phi)\mathbb{I}_{\phi\in\widetilde{\mathcal{D}}}$, where $\mathbb{I}_{\phi\in\widetilde{\mathcal{D}}}$ is the indicator function that excludes unobserved stabilizers. This distribution represents the most accurate reconstruction of the probability measure $p^{\rm{data}}$ given a dataset $\mathcal{D}$, assuming no prior knowledge about $p^{\rm{data}}$. To assess the informational content in the shadow weights, we consider the KL divergence ${\rm{KL}}(p^{\rm{data}}_{\rm{emp}}||p^{\rm{target}})$ between this typically unknown distribution and $p^{\rm{target}}$, which denotes either of the two empirically accessible distributions used for training: $p^{\rm{emp}}$ based on Eq. 6 or $p^{\rm{sh}}$ specified by Eq. 7. Specifically, we study the dependence of ${\rm{KL}}(p^{\rm{data}}_{\rm{emp}}||p^{\rm{target}})$ on the size of the dataset relative to the size of the system by scaling the number of qubits for a fixed dataset, as shown in Fig. 1(a), and by scaling the number of measurements in the dataset for a fixed number of qubits, as depicted in Fig. 1(b).
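Once the distributions are tabulated over the observed stabilizers, the diagnostic itself reduces to a simple sum; a minimal sketch, assuming dictionary representations keyed by stabilizer (the representation is an assumption for illustration):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the support of p, with p and q given as dicts {stabilizer: probability}."""
    # eps guards against q assigning zero mass to a stabilizer observed under p.
    return sum(pk * np.log(pk / max(q.get(k, 0.0), eps))
               for k, pk in p.items() if pk > 0)
```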

Figure 1: Analysis of the informational content within shadow weights through the KL divergence ${\rm{KL}}(p^{\rm{data}}_{\rm{emp}}||p^{\rm{target}})$. The KL divergence serves as a comparison of the proxy for the typically unknown data distribution, $p^{\rm{data}}_{\rm{emp}}$ (see the main text for the definition), with the distributions that can be inferred from the shadow measurements: $p^{\rm{emp}}$ based on Eq. 6 and $p^{\rm{sh}}$ specified by Eq. 7. The KL divergence is examined as a function of (a) the system size, i.e., the number of qubits, for a dataset of 1000 shadows, and (b) the number of shadows for a six-qubit system. The measurement type is specified by the marker type: circles and squares for Pauli (indicated by the subscript “P”) and Clifford (“C”) measurements, respectively. The target distribution type is specified by the colours and line styles: blue dashed and orange solid curves for the empirical distribution and shadow weights, respectively. Our results are averaged over 32 instances of pure states produced using random Clifford circuits, with error bars indicating the standard deviation.

The advantage of shadow weights in terms of estimating $p^{\rm{data}}_{\rm{emp}}$ is evident for smaller systems, as shown in Fig. 1(a), though it gradually diminishes as the systems grow larger. In contrast, Fig. 1(b) indicates that this benefit can be reclaimed by increasing the number of measurements performed. A crucial factor in interpreting the plots in Fig. 1 is the difference in the configuration space dimensions of $p^{\rm{data}}_{\rm{emp}}$ associated with Pauli and Clifford measurements. Notably, Pauli measurements can reconstruct $4^{n}$ distinct stabilizers, whereas Clifford measurements can reconstruct $2^{0.5n^{2}+\mathcal{O}(n)}$ distinct stabilizers [15]. For $n=6$ qubits, the dataset sizes considered in Fig. 1(b) must be compared to the dimension of the stabilizer space. For the Pauli distribution, the dataset ranges from 5% to 500% of the total number of distinct stabilizers. However, for the Clifford distribution, the maximum number of measurements is less than $0.001\%$ of the corresponding stabilizer set size. This significant difference explains the slight scaling advantage of $p^{\rm{emp}}$ for Pauli measurements over $p^{\rm{emp}}$ for Clifford measurements, as the former have a higher likelihood of obtaining repeated stabilizers, while the latter degenerate to a uniform distribution. For shadow weights, we observe a scaling advantage over the empirical distributions in both regimes. However, unlike in the case of the empirical distributions, the Clifford distribution shows a slight improvement over the Pauli distribution. Such an advantage of Clifford shadow weights over Pauli ones, despite being in a regime with less data, is consistent with the theoretical expectation of an exponential benefit in predicting high-weight, low-rank observables (i.e., other stabilizers) [1]. Therefore, given a sufficiently large measurement dataset, training with the shadow-weight-based cross-entropy is expected to produce neural quantum states that better capture the underlying quantum state.

We now evaluate the advantages of the proposed stabilizer-based method of Section II.3.2 over the neural state-based method of Eq. 10 for Monte Carlo estimation of the overlap, which is used to compute the training gradient. For benchmarking purposes, we analyze the errors in the gradient estimates relative to the true gradient, obtained through an exhaustive evaluation of the inner product of Eq. 9. Specifically, we focus on the error in the gradient direction and distinguish between errors calculated over a mini-batch and over a full batch of measurements. Figure 2 illustrates the evolution of these errors as the neural quantum state is trained using the true gradient. Figure 2(a) shows that both estimators struggle most with capturing the correct gradient direction during the early stages of training, when the neural quantum state likely differs most from the target state. However, after several epochs, as the neural state becomes more similar to the target, the stabilizer-based estimate improves significantly, while the error of the neural state-based estimate remains relatively high and inconsistent. Throughout the entire training process, the stabilizer-based estimator consistently provides estimates with much lower variance, which is essential for training stability. Additionally, our proposed estimator is consistently more accurate than the neural state-based estimator. For an unbiased estimator, the stochastic error due to mini-batching is expected to cancel out when averaged over the mini-batches and thus can be lowered using smaller learning rates, albeit at the cost of slower training dynamics. Figure 2(b) shows whether this expected behaviour holds for the given estimators. There is a clear difference between them: the stabilizer-based estimator behaves as an unbiased estimator, reducing gradient errors by averaging over mini-batches. In contrast, the neural state-based estimator exhibits a systematic bias throughout the entire training duration, resulting in an extreme accumulation of errors relative to the true gradient. To eliminate this finite-sample bias, a significant increase in the number of samples generated by the neural state would be necessary. Therefore, given a fixed computational resource budget, our benchmarking results demonstrate that stabilizer-based gradient estimation has a significant advantage over neural state-based gradient estimation, highlighting its potential for optimization.

Figure 2: Discrepancy in angle between the Monte Carlo-estimated gradients and the exact gradient throughout the training process. The dataset consists of 1000 Clifford measurements performed on a three-qubit GHZ state. Estimates obtained using neural state-based sampling (NS) and stabilizer-based sampling (SS) are depicted in blue and orange, respectively. Each estimate uses 500 Monte Carlo samples. The gradient errors are evaluated over (a) mini-batches and (b) a full batch. In the case of the mini-batch gradient error, the error bars correspond to the standard deviation of the error over all mini-batches.

III.2 Training Performance on Pure GHZ States

In this section, we focus on the reconstruction of pure GHZ states from Pauli and Clifford measurements as a benchmark for the training enhancements we have proposed. In particular, we use the shadow-based cross-entropy ($\mathcal{L}_{\text{SCE}}$), the empirical cross-entropy ($\mathcal{L}_{\text{ECE}}$), and the infidelity ($\mathcal{L}_{\text{inf}}$) as the loss functions. In addition, we employ the estimators Eqs. (10) and (11) in evaluating the overlap between the neural network and stabilizer states. The superscripts “NS” (for neural state-based samples) and “SS” (for stabilizer-based samples) appear with the corresponding loss function $\mathcal{L}$ to distinguish between the two estimators.

Figure 3: Benchmarking results of the performance of four methods in reconstructing six- and eight-qubit GHZ states. We conduct our experiments using 1000 classical shadows based on Pauli or Clifford measurements (as specified in the titles above the plots), averaging the results over five training trials on the same dataset, each trained over 50 epochs. Each estimate of the overlap between a neural quantum state and a stabilizer (see Eq. 9) is based on 500 newly generated samples. The final states for each loss function are represented by markers. Triangle and square markers represent the samples from the last epoch for the stabilizer- and neural state-based infidelity loss functions, respectively. Circle and “x” markers represent the samples from the last epoch for the empirical and shadow-based cross-entropy loss functions, respectively. The upper row displays the infidelity loss (left y-axis) and the cross-entropy loss (right y-axis) for the loss functions, while the lower row shows the associated infidelity with respect to the true state as a measure of the generalization power of the neural quantum state. In order to facilitate comparison with the infidelity loss, the range of the cross-entropy loss is normalized not to exceed 1.

We illustrate the trainability and the generalization power of these training methods in Fig. 3 via the optimization progress curves of the loss functions evaluated on the training set (upper row) and the infidelity with respect to the true states (lower row), over 50 epochs. For Pauli measurements, plotted in Fig. 3(a) and (b), the superiority of cross-entropy-based training over the infidelity-based approach is apparent. The latter struggles to achieve significant progress in approximating the underlying quantum state. Moreover, the benefits of the cross-entropy-based loss function become even more pronounced when considering computational costs. Both loss functions rely on computing the overlap between the neural quantum state and stabilizers (see Eq. 9). For the infidelity loss function, each Pauli measurement leads to a proliferation of stabilizers that grows super-exponentially with the number of qubits (see Eq. 4). In contrast, this prohibitive scaling is entirely absent for both cross-entropy-based loss functions, as each Pauli measurement contributes a single stabilizer. In a previous work [2], we attempted to address the prohibitive scaling of the infidelity-based approach through Monte Carlo sampling of stabilizers from a classical shadow state. However, this strategy led to unstable training. Additionally, we note that stabilizer-based training improves in-training performance across all loss functions (the results for the neural state-based cross-entropy loss functions have been omitted for visual clarity). Moreover, these improvements during training translate to enhanced generalization only for cross-entropy loss functions. In fact, for the cross-entropy loss function, stabilizer-based training proves to be crucial.

We now discuss the Clifford measurements, beginning with the benchmarking performed on the six-qubit GHZ state shown in Fig. 3(c) and (d). Our previous observations regarding the advantages of stabilizer-based sampling generalize to this scenario and, in fact, are applicable to the infidelity loss function as well. Indeed, the loss function $\mathcal{L}_{\text{inf}}^{\text{SS}}$ improves generalization, reducing the final infidelity by roughly an order of magnitude compared to the loss function $\mathcal{L}_{\text{inf}}^{\text{NS}}$, as shown in Fig. 3(d). Interestingly, this improvement in generalization is achieved despite the corresponding loss functions yielding similar values at the end of training. Further examination of Fig. 3(c) and (d) reveals that our loss function $\mathcal{L}_{\text{ECE}}^{\text{SS}}$ greatly outperforms the previously proposed shadow-based loss function $\mathcal{L}_{\text{inf}}^{\text{NS}}$ [2] in both the convergence rate and generalization. In addition, a comparison between the loss functions $\mathcal{L}_{\text{SCE}}^{\text{SS}}$ and $\mathcal{L}_{\text{ECE}}^{\text{SS}}$ reveals that our cross-entropy estimator leads to a superior convergence rate and lower infidelity, illustrating the benefits of shadow weights extracted from the classical shadow state (see Eq. (7)). Overall, both loss functions $\mathcal{L}_{\text{SCE}}^{\text{SS}}$ and $\mathcal{L}_{\text{inf}}^{\text{SS}}$ demonstrate the best convergence rates. The loss functions $\mathcal{L}_{\text{SCE}}^{\text{SS}}$ and $\mathcal{L}_{\text{ECE}}^{\text{SS}}$ exhibit the greatest generalization, approaching an infidelity of 0.01, while our estimator shows the greatest stability in reaching states in proximity to the true states. An even more drastic difference in performance is observed for the eight-qubit GHZ state, shown in Fig. 3(e) and (f). In this case, the loss functions $\mathcal{L}_{\text{inf}}^{\text{NS}}$, $\mathcal{L}_{\text{inf}}^{\text{SS}}$, and $\mathcal{L}_{\text{ECE}}^{\text{SS}}$ struggle to escape local minima: their infidelities remain stuck at 0.99, and only the loss function $\mathcal{L}_{\text{SCE}}^{\text{SS}}$ stands out, achieving an infidelity as low as 0.05. To summarize, our results indicate that our proposed method provides the best training objective for neural networks trained to approximate an underlying quantum state from limited measurements.

III.3 Applications on Mixed GHZ States

In this section, we describe our application of the methods introduced earlier, namely, the shadow-based cross-entropy loss function with stabilizer-based sampling, to mixed quantum states generated by noisy circuits. Specifically, we consider a circuit that prepares a six-qubit GHZ state beginning from the all-zeroes state $\ket{0}^{\otimes 6}$. The circuit is composed of a Hadamard gate applied to the first qubit, followed by a series of CNOT gates applied to consecutive pairs of qubits. For the noise model, we assume two-qubit depolarizing noise on the CNOT gates. The depolarizing channel $\rho\mapsto(1-\frac{16}{15}p)\rho+\frac{16}{15}p\,\mathbb{I}/4$ is parameterized via a noise strength parameter $p$. Motivated by the experimental relevance of our technique, we consider Pauli measurement datasets, as those are less affected by noise than Clifford measurements. As the ansatz for the density matrix, we employ the purification ansatz Eq. (2) with a transformer neural network used to parameterize $\psi_{\lambda}(s,\bar{s})$. This approach allows us to impose physical constraints on the ansatz.
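A sketch of how such a dataset could be generated with Stim (the simulator used in Appendix A) is given below. It assumes Stim's built-in DEPOLARIZE2 instruction, which applies each of the 15 non-identity two-qubit Paulis with probability $p/15$ and therefore realizes exactly the channel written above, together with its MX/MY/MZ basis measurements; this is an illustrative reconstruction, not the authors' exact data pipeline.

```python
import numpy as np
import stim

n, p = 6, 0.3
rng = np.random.default_rng(2)

def noisy_ghz_pauli_shot(bases):
    """Noisy GHZ preparation followed by one round of single-qubit Pauli measurements."""
    c = stim.Circuit()
    c.append("H", [0])
    for k in range(n - 1):
        c.append("CNOT", [k, k + 1])
        c.append("DEPOLARIZE2", [k, k + 1], p)  # two-qubit depolarizing noise on each CNOT
    for k, basis in enumerate(bases):
        c.append({"X": "MX", "Y": "MY", "Z": "MZ"}[basis], [k])
    return c

bases = rng.choice(list("XYZ"), size=n)         # random Pauli basis per qubit
bits = noisy_ghz_pauli_shot(bases).compile_sampler().sample(shots=1)[0]
# (bases, bits) forms one entry of the Pauli classical shadow dataset.
```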

III.3.1 Pauli Observables

Although classical shadows based on Pauli measurements have been demonstrated to predict local observables with high accuracy, their predictive accuracy worsens in the case of many-body observables. In this section, we demonstrate the benefits of the neural network approach over classical shadows in predicting Pauli observables. To do so, a set of 5000 random Pauli strings is generated by sampling each single-qubit Pauli operator from a uniform distribution.
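For reference, the classical shadow baseline for a single Pauli string reduces to the standard rule from Ref. [1]: a snapshot contributes $3^{w}$ times the product of measured eigenvalues on the string's support whenever every non-identity factor matches the measured basis, and zero otherwise. A minimal sketch, assuming the dataset is stored as arrays of measured bases and outcome bits (median-of-means aggregation is omitted for brevity):

```python
import numpy as np

def estimate_pauli(pauli, bases, bits):
    """Classical shadow estimate of <P> for a Pauli string from Pauli measurements.

    pauli: string over {"I", "X", "Y", "Z"}, e.g. "XIZYXI".
    bases: (N, n) array of measured bases ("X"/"Y"/"Z"); bits: (N, n) array of 0/1 outcomes.
    """
    signs = 1 - 2 * np.asarray(bits)               # map 0/1 outcomes to +1/-1 eigenvalues
    support = np.array([c != "I" for c in pauli])  # qubits where P acts nontrivially
    target = np.array(list(pauli))[support]
    match = np.all(np.asarray(bases)[:, support] == target, axis=1)
    values = np.where(match, 3.0 ** support.sum() * np.prod(signs[:, support], axis=1), 0.0)
    return values.mean()
```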

The absolute error in predicting Pauli observables from neural networks and classical shadows is shown in Fig. 4(a). As expected, the error of estimates extracted from the classical shadow technique grows, in both precision and accuracy, as the Pauli weight increases. This poor performance in predicting high-weight observables is alleviated by projecting the classical shadow state onto a physical state on the probability simplex (see Appendix B). Despite this adjustment, the neural state ansatz consistently demonstrates superior performance across all noise levels. This observation is further supported when focusing solely on prediction accuracy, as highlighted in Fig. 4(b), where we present the same dataset, now normalized by variance and averaged across all noise levels.

Figure 4: Predictive capabilities based on neural quantum states, classical shadows, and the simplex projection of the latter, for noisy observables at noise levels $p$ ranging from $0.0$ to $0.5$, sampled from 5000 random Pauli measurements. (a) Absolute error, $\epsilon$, in predicting random noisy observables, with standard deviations represented by error bars. (b) Accuracy, defined as the average of the absolute error over the standard deviation, $\overline{\epsilon/\sigma}$. The results are differentiated by colour, with neural quantum states shown in orange, classical shadows in blue, and the simplex projections of classical shadows in green.

III.3.2 Purity and Trace Distance

Continuing our analysis, we broaden our scope to include assessments of purity and trace distance, shown in Fig. 5(a) and (b), respectively. In contrast to the behaviour observed in the case of Pauli observables, projecting the classical shadow state onto the probability simplex leads to a degradation in purity estimation, as shown in Fig. 5(a). However, both neural quantum states and classical shadow states are able to accurately track theoretically predicted results. The distinguishing feature of the physically constrained neural quantum state lies in its ability to consistently provide physical estimates, whereas the classical shadow state’s estimate falls below zero for the least pure target state we consider. As expected, the observed trends of the trace distance shown in Fig. 5(b) align with the results of the Pauli observables: the simplex projection demonstrates successful improvement over the classical shadow state, while the neural quantum state exhibits the lowest error for all noise levels.

Figure 5: Performance comparison of reconstructed six-qubit GHZ mixed states using classical shadows and neural quantum states obtained from 5000 random Pauli measurements. (a) Theoretical values of purity for the ideal state (shown using a solid curve), along with the reconstructed states from neural quantum states (shown in orange), classical shadows (shown in blue), and the simplex projections of classical shadows (shown in green). (b) Trace distance with respect to the ideal state. (c)–(e) Values of the real components of the density matrix elements for the noise level $p=0.5$, comparing the ideal case (c), the neural quantum state (d), and the classical shadow state (e). The colour bar indicates the magnitude of each component of the density matrix.

Interestingly, we observe relatively weaker performance of the neural quantum states in the low-noise regime, specifically $p=0.1$ and $0.2$. We investigate this behaviour further by producing visualizations of density matrices, as exemplified in Fig. 5(c), (d), and (e) for $p=0.5$. For such high noise levels, all non-zero components in the theoretical density matrix are substantial. In this case, the neural quantum state can successfully capture the full structure of the theoretical density matrix, whereas the classical shadow state is extremely noisy. However, in lower-noise regimes, most of the diagonal elements of the density matrix decrease, and only the corner components remain large. The neural quantum state seems to struggle with capturing signals from very small components, possibly due to the Monte Carlo sampling noise present during training, leading to the relative weakening of its performance. We defer a detailed investigation of this phenomenon and the potential enhancements it could bring to neural quantum state tomography in this regime to future studies.

IV Conclusion

In this paper, we have introduced two new concepts for neural quantum state tomography: (a) the shadow-based cross-entropy loss function; and (b) stabilizer-based importance sampling for the estimation of quantum state overlaps. Our benchmarking using Pauli and Clifford measurements collected from pure GHZ states demonstrates a clear improvement in the trainability and the generalization power of the resulting neural quantum states. In particular, we demonstrate the superiority of these neural quantum states in comparison to the conventional neural quantum states trained using log-likelihood optimization and the recently introduced neural quantum states trained using infidelity loss, both of which rely on neural-state-based importance sampling. These advancements enable a significant reduction in both the classical and quantum resources necessary for neural quantum state tomography.

To highlight the significance of our techniques in experimental applications, we employ a physically constrained neural quantum state to reconstruct mixed quantum states from Pauli measurements. In comparison to the classical shadow state constructed from the same dataset, our neural quantum state is much more successful at predicting high-weight and nonlinear observables of both pure and mixed states.

With these promising results, we expect shadow-based neural quantum state tomography to be an invaluable tool for future experiments on near-term quantum devices. Its potential applications span a wide spectrum, from device characterization to quantum simulations in materials science, chemistry, and the physics of strong interactions. Moving forward, the next steps in the development and exploration of our method will involve scaling it up to accommodate larger systems and validating its utility on a more diverse set of quantum states, especially those prepared experimentally in the near term. One interesting avenue for future research is the characterization of quantum states prepared using analog quantum simulations, wherein the constraints imposed by limited-control electronics present a challenge for quantum state tomography [7].

Appendix A The Transformer Neural Quantum State and Details of Numerical Experiments

We base our neural quantum states on the transformer architecture from Ref. [3] (for details on its implementation, refer to Section I of the supplementary material of that work). More specifically, a quantum state is represented using the transformer model parameterized by the neural network weights $\lambda$, as specified in Eq. 1 or Eq. 2, depending on whether the target quantum state is a pure state or a mixed state. The details of the transformer's architecture, hyperparameters, and datasets are presented in Table 1. PyTorch [16] is used for the implementation and training of the transformer. The Stim stabilizer-circuit simulator [17] is used for dataset generation and classical shadow state reconstruction. For training, we use the Adam optimizer [18] for stochastic gradient descent. A cosine annealing schedule is applied to systematically adjust the learning rate throughout the training epochs [19]. Finally, early stopping is used to prevent over-fitting.
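A schematic PyTorch training skeleton matching this setup (Adam, cosine-annealed learning rate, early stopping) is sketched below; `model`, `sce_loss`, `loader`, and `validate` are placeholders for the transformer ansatz, the loss of Eq. 8 with stabilizer-based sampling, the mini-batch iterator, and the validation metric, none of which are specified here.

```python
import torch

def train(model, sce_loss, loader, validate, epochs=50, lr=0.01, patience=5):
    """Adam + cosine annealing + early stopping (schematic; see Table 1 for settings)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    best, stale = float("inf"), 0
    for epoch in range(epochs):
        for batch in loader:                # mini-batches of stabilizer measurements
            opt.zero_grad()
            loss = sce_loss(model, batch)   # e.g., Eq. (8) with stabilizer-based sampling
            loss.backward()
            opt.step()
        sched.step()                        # cosine-annealed learning rate
        score = validate(model)             # validation metric (lower is better)
        if score < best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:           # early stopping
                break
    return model
```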

                                        Fig. 3                    Figs. 4 and 5
Neural quantum state
  Layers                                2                         2
  Internal dimensions                   8                         8
  Heads                                 4                         4
  Trainable parameters                  858                       954
  Ansatz                                Eq. 1                     Eq. 2 with 6 ancilla qubits
Hyperparameters
  Epochs                                50                        100
  Initial learning rate                 0.01                      0.01
  Mini-batch size                       100                       20
  Monte Carlo samples                   500                       500
Dataset
  Training                              1000 classical shadows    3750 classical shadows
  Validation                            —                         1250 classical shadows

Table 1: Numerical experiment settings for Figs. 3, 4, and 5.

Appendix B Probability Simplex Projection

Unlike the density matrix of a physical quantum state, the classical shadow state is not necessarily positive semidefinite. One approach for obtaining a physical quantum state from a classical shadow state is through a projection onto the probability simplex. Formally, the eigenvalues $\lambda\in\mathbb{R}^{2^{n}}$ of the classical shadow state are projected onto the probability simplex $\Delta^{2^{n}}:=\{x=(x_{1},\dots,x_{2^{n}})^{T}\,|\,x_{i}\geq 0~\wedge~\sum_{i=1}^{2^{n}}x_{i}=1\}$, while the eigenvectors are left unchanged. The projected eigenvalues, $\lambda^{*}$, are obtained via a minimization of the Euclidean distance:

\lambda^{*}=\arg\min_{x\in\Delta^{2^{n}}}||x-\lambda||,

where $||\cdot||$ denotes the Euclidean norm. We use the algorithm presented in Ref. [20] to perform this minimization.
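A compact sort-and-threshold implementation of this projection is sketched below; since the Euclidean projection onto the simplex is unique, it produces the same output as the algorithm of Ref. [20]. It is applied to the eigenvalues of the classical shadow state while keeping its eigenvectors fixed.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a real vector v onto the probability simplex."""
    u = np.sort(v)[::-1]                                    # sort in decreasing order
    css = np.cumsum(u)
    k = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[k]) / (k + 1)
    return np.maximum(v + theta, 0.0)

def project_shadow_state(rho_hat):
    """Physical density matrix obtained by projecting the eigenvalues of rho_hat."""
    evals, evecs = np.linalg.eigh(rho_hat)                  # rho_hat is Hermitian by construction
    return (evecs * project_simplex(evals)) @ evecs.conj().T
```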

Acknowledgements

We thank our editor, Marko Bucyk, for his careful review and editing of the manuscript. The authors acknowledge Bill Coish and Christine Muschik for useful discussions. W. K. acknowledges the support of Mitacs and a scholarship through the Perimeter Scholars International program. P. R. acknowledges the financial support of Mike and Ophelia Lazaridis, Innovation, Science and Economic Development Canada (ISED), and the Perimeter Institute for Theoretical Physics. Research at the Perimeter Institute is supported in part by the Government of Canada through ISED and by the Province of Ontario through the Ministry of Colleges and Universities.

References

  • Huang et al. [2020] H.-Y. Huang, R. Kueng, and J. Preskill, Predicting many properties of a quantum system from very few measurements, Nat. Phys. 16, 1050–1057 (2020).
  • Wei et al. [2023] V. Wei, W. A. Coish, P. Ronagh, and C. A. Muschik, Neural-Shadow Quantum State Tomography, arXiv preprint arXiv:2305.01078  (2023).
  • Bennewitz et al. [2021] E. R. Bennewitz, F. Hopfmueller, B. Kulchytskyy, J. Carrasquilla, and P. Ronagh, Neural Error Mitigation of Near-Term Quantum Simulations, Nat. Mach. Intell. 4, 618–624 (2021).
  • Ippoliti et al. [2023] M. Ippoliti, Y. Li, T. Rakovszky, and V. Khemani, Operator Relaxation and the Optimal Depth of Classical Shadows, Phys. Rev. Lett. 130, 230403 (2023).
  • Rozon et al. [2023] P.-G. Rozon, N. Bao, and K. Agarwal, Optimal twirling depths for shadow tomography in the presence of noise, arXiv preprint arXiv:2311.10137  (2023).
  • Hu et al. [2024] H.-Y. Hu, A. Gu, S. Majumder, H. Ren, Y. Zhang, D. S. Wang, Y.-Z. You, Z. Minev, S. F. Yelin, and A. Seif, Demonstration of Robust and Efficient Quantum Property Learning with Shallow Shadows, arXiv preprint arXiv:2402.17911  (2024).
  • Hu and You [2022] H.-Y. Hu and Y.-Z. You, Hamiltonian-driven shadow tomography of quantum states, Phys. Rev. Research 4, 013054 (2022).
  • Cha et al. [2022] P. Cha, P. Ginsparg, F. Wu, J. Carrasquilla, P. L. McMahon, and E.-A. Kim, Attention-based quantum tomography, Mach. Learn.: Sci. Technol. 3, 01LT01 (2022).
  • Hu et al. [2023] H.-Y. Hu, S. Choi, and Y.-Z. You, Classical shadow tomography with locally scrambled quantum dynamics, Phys. Rev. Research 5, 023027 (2023).
  • Torlai and Melko [2018] G. Torlai and R. G. Melko, Latent Space Purification via Neural Density Operators, Phys. Rev. Lett. 120, 240503 (2018).
  • Hastings et al. [2010] M. B. Hastings, I. González, A. B. Kallin, and R. G. Melko, Measuring Renyi Entanglement Entropy in Quantum Monte Carlo Simulations, Phys. Rev. Lett. 104, 157201 (2010).
  • Kulchytskyy [2019] B. Kulchytskyy, Probing universality with entanglement entropy via quantum Monte Carlo, Ph.D. thesis, University of Waterloo  (2019).
  • Beach et al. [2019] M. J. S. Beach, I. D. Vlugt, A. Golubeva, P. Huembeli, B. Kulchytskyy, X. Luo, R. G. Melko, E. Merali, and G. Torlai, QuCumber: wavefunction reconstruction with neural networks, SciPost Phys. 7, 009 (2019).
  • Liu et al. [2023] Z. Liu, Z. Hao, and H.-Y. Hu, Predicting Arbitrary State Properties from Single Hamiltonian Quench Dynamics, arXiv preprint arXiv:2311.00695  (2023).
  • Aaronson and Gottesman [2004] S. Aaronson and D. Gottesman, Improved simulation of stabilizer circuits, Phys. Rev. A 70, 052328 (2004).
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, PyTorch: An Imperative Style, High-Performance Deep Learning Library, arXiv preprint arXiv:1912.01703  (2019).
  • Gidney [2021] C. Gidney, Stim: a fast stabilizer circuit simulator, Quantum 5, 497 (2021).
  • Kingma and Ba [2014] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, arXiv preprint arXiv:1412.6980  (2014).
  • Loshchilov and Hutter [2016] I. Loshchilov and F. Hutter, SGDR: Stochastic Gradient Descent with Warm Restarts, arXiv preprint arXiv:1608.03983  (2016).
  • Chen and Ye [2011] Y. Chen and X. Ye, Projection Onto A Simplex, arXiv preprint arXiv:1101.6081  (2011).