
Hutchinson’s Estimator is Bad at Kronecker-Trace-Estimation

Raphael A. Meyer
New York University
[email protected]
   Haim Avron
Tel Aviv University
[email protected]
Abstract

We study the problem of estimating the trace of a matrix $\mathbf{A}$ that can only be accessed through Kronecker-matrix-vector products. That is, for any Kronecker-structured vector $\mathbf{x}=\otimes_{i=1}^{k}\mathbf{x}_{i}$, we can compute $\mathbf{A}\mathbf{x}$. We focus on the natural generalization of Hutchinson's Estimator to this setting, proving tight rates for the number of matrix-vector products this estimator needs to find a $(1\pm\varepsilon)$ approximation to the trace of $\mathbf{A}$.

We find an exact equation for the variance of the estimator when using a Kronecker product of Gaussian vectors, revealing an intimate relationship between Hutchinson's Estimator, the partial trace operator, and the partial transpose operator. Using this equation, we show that, when using real vectors, this estimator needs $O(\frac{3^{k}}{\varepsilon^{2}})$ products in the worst case to recover a $(1\pm\varepsilon)$ approximation of the trace of any PSD $\mathbf{A}$, and we prove a matching lower bound for certain PSD $\mathbf{A}$. However, when using complex vectors, this can be exponentially improved to $\Theta(\frac{2^{k}}{\varepsilon^{2}})$. We show that Hutchinson's Estimator converges slowest when $\mathbf{A}$ itself also has Kronecker structure. We conclude with some theoretical evidence suggesting that, by combining Hutchinson's Estimator with other techniques, it may be possible to avoid the exponential dependence on $k$.

1 Introduction

A ubiquitous problem in scientific computing is that of estimating the trace of a $D\times D$ matrix $\mathbf{A}$ that can only be accessed via matrix-vector products. In other words, we are given access to an oracle that can evaluate $\mathbf{A}\mathbf{x}$ for any $\mathbf{x}\in\mathbb{R}^{D}$, and our goal is to approximate $\operatorname{tr}(\mathbf{A})$ with as few oracle queries as possible. This matrix-vector oracle model, or implicit matrix model, is a common computational framework used in much of the numerical linear algebra community [Sun et al., 2021a, Chen and Hallman, 2022, Halikias and Townsend, 2022, Bakshi et al., 2022, Persson et al., 2022]. However, in many applications, we cannot efficiently evaluate $\mathbf{A}\mathbf{x}$ for all vectors $\mathbf{x}\in\mathbb{R}^{D}$; instead, only some matrix-vector products can be efficiently computed.

In this paper, we study trace estimation in the Kronecker-matrix-vector oracle model, also known as the rank-one vector model, where we can evaluate $\mathbf{A}\mathbf{x}$ for any vector $\mathbf{x}$ that can be written as a Kronecker product of smaller vectors, i.e. $\mathbf{x}=\mathbf{x}_{1}\otimes\mathbf{x}_{2}\otimes\cdots\otimes\mathbf{x}_{k}$ for some vectors $\mathbf{x}_{1},\ldots,\mathbf{x}_{k}\in\mathbb{R}^{d}$. Here, $\otimes$ denotes the Kronecker product, which is a common primitive in many tensor-based algorithms and in much of quantum physics. The number of vectors ($k$) and the number of entries ($d$) in each of $\mathbf{x}_{1},\dots,\mathbf{x}_{k}$ are parameters of the model.

Our study is motivated by recent literature on entanglement estimation in quantum states [Feldman et al., 2022]. This computational model has also been studied in other applied domains [Chen and Hallman, 2022, Kueng et al., 2017]. An important setting where we can efficiently compute Kronecker-matrix-vector products is when $\mathbf{A}$ is the sum of a small number of matrices, where each summand is a Kronecker product of $k$ matrices. Such matrices occur as Hamiltonians of various quantum models [Wang et al., 2023]. Another instance in which we can efficiently compute Kronecker-matrix-vector products is when $\mathbf{A}$ is the flattening of a tensor network. This is the case in [Feldman et al., 2022], where the oracle is efficiently implemented when $\mathbf{A}$ describes a density operator that represents certain systems of quantum particles. Recently, there has been some effort to find algorithms that can efficiently solve linear algebra problems using this oracle, although the papers do not explicitly define such a Kronecker-matrix-vector oracle [Bujanovic and Kressner, 2021, Avron et al., 2014, Ahle et al., 2020, Sun et al., 2021b]. In this paper, we focus on the fundamental problem of trace estimation.

The trace estimation problem in the Kronecker-matrix-vector oracle model can be exactly solved using $D=d^{k}$ oracle queries. Since each standard basis vector $\mathbf{e}_{1},\ldots,\mathbf{e}_{D}$ obeys Kronecker structure, we can exactly look up each diagonal entry of $\mathbf{A}$ and evaluate $\operatorname{tr}(\mathbf{A})=\sum_{i=1}^{D}\mathbf{e}_{i}^{\intercal}\mathbf{A}\mathbf{e}_{i}$. So, our goal is to estimate the trace of $\mathbf{A}$ to $(1\pm\varepsilon)$ relative error using $o(d^{k})$ matrix-vector products. In the standard non-Kronecker setting, the most fundamental algorithm for trace estimation is Hutchinson's Estimator [Girard, 1987, Hutchinson, 1989]:

\[H_{\ell}(\mathbf{A}) := \frac{1}{\ell}\sum_{j=1}^{\ell}{\mathbf{x}^{(j)}}^{\intercal}\mathbf{A}\mathbf{x}^{(j)} \qquad\text{where}\qquad \mathbf{x}^{(j)}\overset{iid}{\sim}\mathcal{D}.\]

If the distribution of random vectors has $\mathbb{E}[\mathbf{x}]=\mathbf{0}$ and $\mathbb{E}[\mathbf{x}\mathbf{x}^{\intercal}]=\mathbf{I}$, then $\mathbb{E}[H_{\ell}(\mathbf{A})]=\operatorname{tr}(\mathbf{A})$. Further, if $\mathcal{D}$ is a Gaussian or Rademacher distribution, then $\operatorname{Var}[H_{\ell}(\mathbf{A})]\leq\frac{2}{\ell}\lVert\mathbf{A}\rVert_{F}^{2}$. In particular, under the common assumption that $\mathbf{A}$ is PSD, so that $\lVert\mathbf{A}\rVert_{F}\leq\operatorname{tr}(\mathbf{A})$, we then find that the standard deviation of Hutchinson's Estimator is

\[\sqrt{\operatorname{Var}[H_{\ell}(\mathbf{A})]}\leq\sqrt{\tfrac{2}{\ell}}\,\lVert\mathbf{A}\rVert_{F}\leq\sqrt{\tfrac{2}{\ell}}\operatorname{tr}(\mathbf{A})\leq\varepsilon\operatorname{tr}(\mathbf{A})\]

when we take $\ell\geq\Omega(\frac{1}{\varepsilon^{2}})$ queries. For large dimension $D$, this very simple algorithm is much more efficient than exactly computing the diagonal entries of $\mathbf{A}$, since the estimator has no dependence on the size of $\mathbf{A}$ at all.
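To make this concrete, here is a minimal numpy sketch of the classical estimator; the function name `hutchinson`, the Rademacher query distribution, and the toy PSD matrix are illustrative choices, not constructions from this paper.

```python
import numpy as np

def hutchinson(matvec, D, ell, rng=np.random.default_rng(0)):
    """Estimate tr(A) from ell matrix-vector products A @ x, using iid
    Rademacher query vectors (zero mean, identity covariance)."""
    total = 0.0
    for _ in range(ell):
        x = rng.choice([-1.0, 1.0], size=D)   # Rademacher query vector
        total += x @ matvec(x)                # one sample x^T A x
    return total / ell

# Toy example: a random PSD matrix accessed only through products A @ x.
D = 200
G = np.random.default_rng(1).standard_normal((D, D))
A = G @ G.T
print(hutchinson(lambda x: A @ x, D, ell=500), np.trace(A))
```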

This estimator can be naturally generalized to the Kronecker-matrix-vector oracle model by simply changing the distribution $\mathcal{D}$. Instead of taking $\mathcal{D}$ to be a distribution over Gaussian or Rademacher vectors, we now take $\mathcal{D}$ to be the Kronecker product of $k$ vectors drawn from some other isotropic zero-mean distribution $\mathcal{D}^{\prime}$:

\[\mathbf{x}=\mathbf{x}_{1}\otimes\mathbf{x}_{2}\otimes\cdots\otimes\mathbf{x}_{k}\qquad\text{where}\qquad\mathbf{x}_{i}\overset{iid}{\sim}\mathcal{D}^{\prime}.\]

Notably, if $\mathbb{E}[\mathbf{x}_{i}]=\mathbf{0}$ and $\mathbb{E}[\mathbf{x}_{i}\mathbf{x}_{i}^{\intercal}]=\mathbf{I}$, then we also immediately have that $\mathbb{E}[\mathbf{x}]=\mathbf{0}$ and $\mathbb{E}[\mathbf{x}\mathbf{x}^{\intercal}]=\mathbf{I}$, so that $\mathbb{E}[H_{\ell}(\mathbf{A})]=\operatorname{tr}(\mathbf{A})$. This simple estimator, which we refer to as the Kronecker-Hutchinson Estimator, fits the Kronecker-matrix-vector oracle model, and we may hope that it is efficient enough to estimate the trace of $\mathbf{A}$ very quickly. At a high level, we answer this in the negative, proving novel and very tight bounds on the variance of the Kronecker-Hutchinson Estimator, showing that the sample complexity of the estimator has an unavoidable exponential dependence on $k$.
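As an illustration, here is a hedged numpy sketch of the Kronecker-Hutchinson Estimator with Gaussian factors; the helper name `kron_hutchinson` is ours, and the dense toy matrix stands in for an oracle that would, in practice, exploit the structure of $\mathbf{A}$ rather than storing it.

```python
import numpy as np
from functools import reduce

def kron_hutchinson(kron_matvec, d, k, ell, rng=np.random.default_rng(0)):
    """Average x^T A x over ell queries, where each query x = x_1 (x) ... (x) x_k
    is the Kronecker product of k iid standard Gaussian vectors in R^d."""
    total = 0.0
    for _ in range(ell):
        factors = [rng.standard_normal(d) for _ in range(k)]
        x = reduce(np.kron, factors)          # Kronecker-structured query in R^{d^k}
        total += x @ kron_matvec(x)
    return total / ell

# Toy oracle: a small dense PSD matrix (real oracles never form A explicitly).
d, k = 3, 4
G = np.random.default_rng(1).standard_normal((d**k, d**k))
A = G @ G.T
print(kron_hutchinson(lambda x: A @ x, d, k, ell=5000), np.trace(A))
```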

1.1 Prior Work

We first remark that there is a great deal of literature on trace estimation without the Kronecker constraint. The design of Hutchinson's Estimator is attributed to [Girard, 1987, Hutchinson, 1989], and some of the core papers that analyze the estimator are [Avron and Toledo, 2011, Roosta-Khorasani and Ascher, 2015, Cortinovis and Kressner, 2021].

Although a relatively new topic, trace estimation in the Kronecker-matrix-vector oracle model has recently been explored in the literature. From the physics literature, the works of [Feldman et al., 2022] and [Huang et al., 2020] analyze trace estimators when $\mathbf{A}=\otimes_{i=1}^{k}\mathbf{A}_{i}$, and show some empirical results for their performance. From a theoretical perspective, [Vershynin, 2020] provides bounds that can be used to analyze the Kronecker-Hutchinson Estimator. However, the paper provides very coarse bounds for this problem, showing that if $k=O(d)$ then we have $\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}\in\operatorname{tr}(\mathbf{A})\pm O(kd^{k-1}\lVert\mathbf{A}\rVert_{2})$, where $\lVert\mathbf{A}\rVert_{2}$ is the operator norm of $\mathbf{A}$. To obtain a relative error of $(1\pm\varepsilon)$, this suggests an overall sample complexity of $\ell=O(\frac{kd^{k-1}}{\varepsilon^{2}}\frac{\lVert\mathbf{A}\rVert_{2}}{\operatorname{tr}(\mathbf{A})})=O(\frac{D}{\varepsilon^{2}}\frac{\lVert\mathbf{A}\rVert_{2}}{\operatorname{tr}(\mathbf{A})})$, which is generally more than $D$ queries.

The research of [Bujanovic and Kressner, 2021] directly analyzes the Kronecker-Hutchinson Estimator when $k=2$, showing first that $\ell=O(\frac{1}{\varepsilon^{2}})$ samples suffice to ensure that $H_{\ell}(\mathbf{A})\geq(1-\varepsilon)\operatorname{tr}(\mathbf{A})$, and second that $\ell=O(\frac{d}{\varepsilon^{2}}\frac{\lVert\mathbf{A}\rVert_{2}}{\operatorname{tr}(\mathbf{A})})$ samples suffice to ensure that $H_{\ell}(\mathbf{A})\leq(1+\varepsilon)\operatorname{tr}(\mathbf{A})$. Lastly, Remark 2 of the arXiv version of [Ahle et al., 2020] is a result on using a Kronecker-Johnson-Lindenstrauss transform, and implies that $O(\frac{3^{k}}{\varepsilon^{2}})$ samples suffice for the Kronecker-Hutchinson Estimator with Rademacher vectors to achieve relative error $(1\pm\varepsilon)$. While this worst-case bound is tight, the paper gives no way to understand when this bound is loose, or for what matrices fewer samples are needed. Further, all the aforementioned theoretical papers rely on heavy-duty analysis, for instance involving Gaussian and Rademacher chaos theory. In contrast, our contributions give tighter results using simpler techniques.

We also note that [Ahle et al., 2020] includes a Johnson-Lindenstrauss-style guarantee for working with Kronecker-shaped data using a sketch that has $O(\frac{k}{\varepsilon^{2}})$ rows. However, this sketch does not actually operate in the Kronecker-matrix-vector oracle model.

1.2 Our Contributions

Our core technical contribution is an exact expression of the variance of Hutchinson's Estimator when our query vectors are formed as a Kronecker product of Gaussian vectors. This also provides an upper bound on the variance when the estimator instead uses a Kronecker product of Rademacher vectors. For PSD matrices, our results imply a worst-case upper bound of $\ell=O(\frac{3^{k}}{\varepsilon^{2}})$ on the number of samples needed to estimate the trace of $\mathbf{A}$ to relative error $(1\pm\varepsilon)$. We match this upper bound with a lower bound showing that the Kronecker-Hutchinson Estimator with iid zero-mean unit-variance vectors has to use $\ell=\Omega(\frac{3^{k}}{\varepsilon^{2}})$ samples to achieve standard deviation $\varepsilon\operatorname{tr}(\mathbf{A})$. Lastly, we show that if we are allowed to use complex Kronecker-matrix-vector products instead of real matrix-vector products (i.e. if $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$ where $\mathbf{x}_{i}\in\mathbb{C}^{d}$), then this rate improves to $\Theta(\frac{2^{k}}{\varepsilon^{2}})$.

To more concretely state our core technical contribution, we recall the partial trace operator from quantum physics. In quantum physics, a Hermitian density matrix $\mathbf{A}\in\mathbb{C}^{d^{k}\times d^{k}}$ is, roughly speaking, used to represent the joint probability distribution for a set of $k$ particles. Suppose these particles are partitioned into two sets, so that $[k]=\mathcal{S}\cup\bar{\mathcal{S}}$. Further suppose that Alice can observe the particles in $\mathcal{S}$, and that Bob can observe the particles in $\bar{\mathcal{S}}$, but they cannot observe each other's particles. Then, the partial trace of $\mathbf{A}$ with respect to $\mathcal{S}$, denoted $\operatorname{tr}_{\mathcal{S}}(\mathbf{A})$, describes the marginal distribution of the particles that Bob observes. This is often called “tracing out the subsystem $\mathcal{S}$”.

We include a general formal definition of the partial trace in Section 2, but a rough intuition is easily gained by examining the $k=2$ case. Here, we can break down $\mathbf{A}\in\mathbb{R}^{d^{2}\times d^{2}}$ as a block matrix full of blocks $\mathbf{A}_{ij}\in\mathbb{R}^{d\times d}$:

\[\mathbf{A}=\begin{bmatrix}\mathbf{A}_{11}&\mathbf{A}_{12}&\ldots&\mathbf{A}_{1d}\\ \mathbf{A}_{21}&\mathbf{A}_{22}&\ldots&\mathbf{A}_{2d}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{A}_{d1}&\mathbf{A}_{d2}&\ldots&\mathbf{A}_{dd}\end{bmatrix}.\]

Then, the partial trace of $\mathbf{A}$ with respect to $\{1\}$ is the sum of the diagonal blocks, i.e. $\operatorname{tr}_{\{1\}}(\mathbf{A})=\sum_{i=1}^{d}\mathbf{A}_{ii}$, and the partial trace of $\mathbf{A}$ with respect to $\{2\}$ is the matrix that contains the trace of each block, i.e.

\[\operatorname{tr}_{\{2\}}(\mathbf{A})=\begin{bmatrix}\operatorname{tr}(\mathbf{A}_{11})&\operatorname{tr}(\mathbf{A}_{12})&\ldots&\operatorname{tr}(\mathbf{A}_{1d})\\ \operatorname{tr}(\mathbf{A}_{21})&\operatorname{tr}(\mathbf{A}_{22})&\ldots&\operatorname{tr}(\mathbf{A}_{2d})\\ \vdots&\vdots&\ddots&\vdots\\ \operatorname{tr}(\mathbf{A}_{d1})&\operatorname{tr}(\mathbf{A}_{d2})&\ldots&\operatorname{tr}(\mathbf{A}_{dd})\end{bmatrix}.\]
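For readers who prefer code to block matrices, here is a small numpy sketch of these two $k=2$ partial traces using the block decomposition above; the reshape-based helper is an illustrative implementation, not notation from the paper.

```python
import numpy as np

def partial_trace_k2(A, d, subsystem):
    """Partial traces of A in R^{d^2 x d^2}, viewed as a d x d grid of d x d blocks:
    subsystem=1 sums the diagonal blocks; subsystem=2 takes the trace of each block."""
    T = A.reshape(d, d, d, d)               # T[i, a, j, b] = [A_{ij}]_{ab}
    if subsystem == 1:
        return np.einsum('iaib->ab', T)     # sum_i A_{ii}
    return np.einsum('iaja->ij', T)         # matrix of tr(A_{ij})

A = np.arange(16.0).reshape(4, 4)           # d = 2 example
print(partial_trace_k2(A, 2, 1))            # A_11 + A_22
print(partial_trace_k2(A, 2, 2))            # [[tr A_11, tr A_12], [tr A_21, tr A_22]]
```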

We will also need to recall the partial transpose operator from quantum physics. This operator is often used in quantum physics, but it lacks an intuitive physical meaning analogous to the one the partial trace enjoys. Like the partial trace, the partial transpose can also be taken over any subset of particles $\mathcal{V}\subseteq[k]$. We denote the partial transpose of $\mathbf{A}$ with respect to $\mathcal{V}$ as $\mathbf{A}^{\intercal_{\mathcal{V}}}$, and call this “transposing the subsystem $\mathcal{V}$”. We give a formal definition of the partial transpose in Section 2. But, when $k=2$, the two partial transposes of $\mathbf{A}$ are:

\[\mathbf{A}^{\intercal_{\{1\}}}=\begin{bmatrix}\mathbf{A}_{11}&\mathbf{A}_{21}&\cdots&\mathbf{A}_{d1}\\ \mathbf{A}_{12}&\mathbf{A}_{22}&\cdots&\mathbf{A}_{d2}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{A}_{1d}&\mathbf{A}_{2d}&\cdots&\mathbf{A}_{dd}\end{bmatrix}\qquad\text{and}\qquad\mathbf{A}^{\intercal_{\{2\}}}=\begin{bmatrix}\mathbf{A}_{11}^{\intercal}&\mathbf{A}_{12}^{\intercal}&\cdots&\mathbf{A}_{1d}^{\intercal}\\ \mathbf{A}_{21}^{\intercal}&\mathbf{A}_{22}^{\intercal}&\cdots&\mathbf{A}_{2d}^{\intercal}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{A}_{d1}^{\intercal}&\mathbf{A}_{d2}^{\intercal}&\cdots&\mathbf{A}_{dd}^{\intercal}\end{bmatrix}.\]
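The same block-reshape view gives a short sketch of these two partial transposes for $k=2$; again, the helper below is an illustrative implementation under the assumption that $\mathbf{A}$ is stored as a dense numpy array.

```python
import numpy as np

def partial_transpose_k2(A, d, subsystem):
    """Partial transpose of A in R^{d^2 x d^2} over one of its two subsystems:
    subsystem=1 swaps the blocks A_{ij} <-> A_{ji}; subsystem=2 transposes each block."""
    T = A.reshape(d, d, d, d)               # T[i, a, j, b] = [A_{ij}]_{ab}
    if subsystem == 1:
        T = T.transpose(2, 1, 0, 3)         # block (i, j) becomes A_{ji}
    else:
        T = T.transpose(0, 3, 2, 1)         # block (i, j) becomes A_{ij}^T
    return T.reshape(d * d, d * d)

A = np.arange(16.0).reshape(4, 4)
print(partial_transpose_k2(A, 2, 1))
print(partial_transpose_k2(A, 2, 2))
```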

Having set up both the partial trace and the partial transpose, we can now state our core theorem:

Theorem 1.

Fix $d,k\in\mathbb{N}$, and let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$, where each $\mathbf{x}_{i}$ is independent and is either a Rademacher vector or a $\mathcal{N}(\mathbf{0},\mathbf{I})$ Gaussian vector. Let $\bar{\mathbf{A}}:=\frac{1}{2^{k}}\sum_{\mathcal{V}\subseteq[k]}\mathbf{A}^{\intercal_{\mathcal{V}}}$ be the average of all partial transposes of $\mathbf{A}$. Then, $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]\leq\sum_{\mathcal{S}\subset[k]}2^{k-|\mathcal{S}|}\lVert\operatorname{tr}_{\mathcal{S}}(\bar{\mathbf{A}})\rVert_{F}^{2}$. Furthermore, if all $\mathbf{x}_{i}$ are $\mathcal{N}(\mathbf{0},\mathbf{I})$, then this expression is an equality.

The matrix $\bar{\mathbf{A}}$ seems odd at a glance, so we first justify why it is natural. For the classical $k=1$ setting with Gaussian vectors, we can write that $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]=2\lVert\mathbf{A}\rVert_{F}^{2}$ for all symmetric matrices $\mathbf{A}$ [Avron and Toledo, 2011, Meyer, 2021]. For asymmetric matrices, this equality does not hold. Instead, we let $\hat{\mathbf{A}}:=\frac{\mathbf{A}+\mathbf{A}^{\intercal}}{2}$, and note that $\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}=\mathbf{x}^{\intercal}\hat{\mathbf{A}}\mathbf{x}$ for all $\mathbf{x}\in\mathbb{R}^{D}$. Since $\hat{\mathbf{A}}$ is symmetric, we then find that $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]=2\lVert\hat{\mathbf{A}}\rVert_{F}^{2}$. Theorem 1 generalizes this idea to all possible ways to partially transpose across all possible subsystems $\mathcal{V}\subseteq[k]$.

Next, note that $\lVert\hat{\mathbf{A}}\rVert_{F}\leq\lVert\mathbf{A}\rVert_{F}$, and that symmetric matrices have $\mathbf{A}=\hat{\mathbf{A}}$. So, we know that $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]=2\lVert\hat{\mathbf{A}}\rVert_{F}^{2}\leq 2\lVert\mathbf{A}\rVert_{F}^{2}$ holds for all matrices $\mathbf{A}$, with equality for symmetric $\mathbf{A}$. This bound can be easier to work with, since it does not require worrying about $\hat{\mathbf{A}}$. For large $k$, we similarly bound $\lVert\operatorname{tr}_{\mathcal{S}}(\bar{\mathbf{A}})\rVert_{F}\leq\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert_{F}$, providing a more approachable version of Theorem 1:

Theorem 2.

Fix $d,k\in\mathbb{N}$, and let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$, where each $\mathbf{x}_{i}$ is independent and is either a Rademacher vector or a $\mathcal{N}(\mathbf{0},\mathbf{I})$ Gaussian vector. Then, $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]\leq\sum_{\mathcal{S}\subset[k]}2^{k-|\mathcal{S}|}\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert_{F}^{2}$. Furthermore, if all $\mathbf{x}_{i}$ are $\mathcal{N}(\mathbf{0},\mathbf{I})$ and if $\mathbf{A}=\bar{\mathbf{A}}$, then this expression is an equality.

Note that many reasonable matrices will have $\mathbf{A}=\bar{\mathbf{A}}$. For instance, recall that the Kronecker-matrix-vector oracle model may be used when $\mathbf{A}$ is the sum of a small number of matrices, where each summand is a Kronecker product of $k$ matrices. If each summand is the Kronecker product of symmetric matrices, then we have $\mathbf{A}=\bar{\mathbf{A}}$.
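As a sanity check on Theorem 1, the following hedged numpy experiment compares the exact variance formula (for Gaussian factors, $k=2$, and a symmetric Kronecker-structured $\mathbf{A}$, so that $\mathbf{A}=\bar{\mathbf{A}}$) against an empirical variance; the partial-trace helpers and sample sizes are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
# A = A1 (x) A2 with symmetric factors, so A equals the average of its partial transposes.
A1 = rng.standard_normal((d, d)); A1 = A1 + A1.T
A2 = rng.standard_normal((d, d)); A2 = A2 + A2.T
A = np.kron(A1, A2)

# Theorem 1 for k = 2 (sum over proper subsets S of {1,2}):
#   Var[x^T A x] = 4*||A||_F^2 + 2*||tr_{1}(A)||_F^2 + 2*||tr_{2}(A)||_F^2.
T = A.reshape(d, d, d, d)
tr1 = np.einsum('iaib->ab', T)              # trace out subsystem 1
tr2 = np.einsum('iaja->ij', T)              # trace out subsystem 2
predicted = 4 * np.sum(A**2) + 2 * np.sum(tr1**2) + 2 * np.sum(tr2**2)

def one_sample():
    x = np.kron(rng.standard_normal(d), rng.standard_normal(d))
    return x @ A @ x

samples = np.array([one_sample() for _ in range(200_000)])
print(predicted, samples.var())             # the two numbers should roughly agree
```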

We demonstrate the utility of Theorem 2 by using it to show the following guarantee on estimating the trace of a random rank-one matrix:

Theorem 3.

Fix $d,k\in\mathbb{N}$. Let $\mathbf{A}=\mathbf{g}\mathbf{g}^{\intercal}$ where $\mathbf{g}\in\mathbb{R}^{d^{k}}$ is a vector of iid standard normal Gaussians. Then, with probability $\frac{3}{4}$, we know that $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}~|~\mathbf{g}]\leq O((2+\frac{1}{d})^{k}(\operatorname{tr}(\mathbf{A}))^{2})$. In particular, with probability $\frac{2}{3}$, we know that the Kronecker-Hutchinson Estimator returns a $(1\pm\varepsilon)$ relative error estimate to the trace of $\mathbf{A}$ if $\ell=O(\frac{1}{\varepsilon^{2}}(2+\frac{1}{d})^{k})$.

This rate of $(2+\frac{1}{d})^{k}$ is obviously expensive, but our sample complexity is lower than what is achievable with any other technique that we are aware of. Nevertheless, even without $\bar{\mathbf{A}}$, this sum of partial traces may sometimes still be too unwieldy to work with. For such cases, we provide a worst-case bound for PSD matrices:

Theorem 4.

Fix $d,k\in\mathbb{N}$, and let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ be a PSD matrix. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$, where each $\mathbf{x}_{i}$ is independent and is either a Rademacher vector or a $\mathcal{N}(\mathbf{0},\mathbf{I})$ Gaussian vector. Then, $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]\leq 3^{k}(\operatorname{tr}(\mathbf{A}))^{2}$.

In particular, by Chebyshev's Inequality, Theorem 4 implies that $\ell=O(\frac{3^{k}}{\varepsilon^{2}})$ samples suffice to estimate $\operatorname{tr}(\mathbf{A})$ to relative error $(1\pm\varepsilon)$. For the $k=2$ case studied by [Bujanovic and Kressner, 2021], we get the following:

Corollary 5.

Let $\mathbf{A}\in\mathbb{R}^{d^{2}\times d^{2}}$. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\mathbf{x}_{2}$, where each $\mathbf{x}_{i}$ is independent and is either a Rademacher vector or a $\mathcal{N}(\mathbf{0},\mathbf{I})$ Gaussian vector. Let $\bar{\mathbf{A}}=\frac{1}{4}(\mathbf{A}+\mathbf{A}^{\intercal_{\{1\}}}+\mathbf{A}^{\intercal_{\{2\}}}+\mathbf{A}^{\intercal})$. Then, $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]\leq 2\lVert\operatorname{tr}_{\{1\}}(\bar{\mathbf{A}})\rVert_{F}^{2}+2\lVert\operatorname{tr}_{\{2\}}(\bar{\mathbf{A}})\rVert_{F}^{2}+4\lVert\bar{\mathbf{A}}\rVert_{F}^{2}$. If $\mathbf{A}$ is PSD, then with probability $\frac{2}{3}$, the Kronecker-Hutchinson Estimator returns a $(1\pm\varepsilon)$ relative error estimate to the trace of $\mathbf{A}$ if $\ell=O(\varepsilon^{-2})$.

This success probability can be boosted to $1-\delta$ for an arbitrary $\delta$ by picking up a $\log(\frac{1}{\delta})$ dependence in the sample complexity: return the median of $O(\log(\frac{1}{\delta}))$ iid trials of $H_{\ell}(\mathbf{A})$. We further show that the expensive rate of $O(\frac{3^{k}}{\varepsilon^{2}})$ is essentially tight in the worst case for the Kronecker-Hutchinson Estimator:
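A hedged sketch of this standard median-boosting trick is below; the constant 24 in the trial count is an arbitrary illustrative choice, and `single_estimate` stands in for one run of $H_{\ell}(\mathbf{A})$.

```python
import numpy as np

def median_boost(single_estimate, delta, rng=np.random.default_rng(0)):
    """Median of O(log(1/delta)) independent copies of an estimator that is
    accurate with probability 2/3; the median is accurate w.p. at least 1 - delta."""
    trials = int(np.ceil(24 * np.log(1.0 / delta)))
    return np.median([single_estimate(rng) for _ in range(trials)])

# Stand-in estimator: an unbiased but noisy estimate of a trace equal to 100.
print(median_boost(lambda rng: 100.0 + rng.standard_normal(), delta=1e-6))
```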

Theorem 6.

Fix $d,k\in\mathbb{N}$. Consider Hutchinson's Estimator run with vectors $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$ where $\mathbf{x}_{i}$ are vectors of iid real zero-mean unit-variance entries. Then, there exists a PSD matrix $\mathbf{A}$ such that Hutchinson's Estimator needs $\ell=\Omega(\frac{(3-\frac{2}{d})^{k}}{\varepsilon^{2}})$ samples in order to have standard deviation $\leq\varepsilon\operatorname{tr}(\mathbf{A})$.

1.2.1 Complex Kronecker-Matrix-Vector Products

Although certainly cheaper than exactly computing the trace with $d^{k}$ queries to the diagonal of $\mathbf{A}$, the rate of $\frac{3^{k}}{\varepsilon^{2}}$ is prohibitively expensive for large $k$. Our next set of results shows that the bound improves if we can issue complex queries instead of real ones, that is, when $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$ where $\mathbf{x}_{i}\in\mathbb{C}^{d}$. It seems realistic to say that any software implementation of the Kronecker-matrix-vector oracle could handle complex queries instead of real ones.

Note that for $k=1$, there is almost no distinction between the real and complex oracle models, as any complex matrix-vector product can be simulated by running two real matrix-vector products. However, this is no longer the case for large values of $k$, where real simulation becomes expensive. To see this, note that the real part of the Kronecker product of complex vectors might not have Kronecker structure. Because of this, it may take $2^{k}$ real-valued Kronecker-matrix-vector products to simulate a single complex-valued Kronecker-matrix-vector product. In a vague intuitive sense, the complex Kronecker-structured vectors are exponentially less constrained than the real Kronecker-structured vectors. With this intuition in mind, we prove the following analogues of Theorems 1, 2, 3, 4 and 6:

Theorem 7.

Fix $d,k\in\mathbb{N}$, and let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$, where each $\mathbf{x}_{i}$ is independent and is either a complex Rademacher vector or a complex Gaussian vector. That is, we have $\mathbf{x}_{i}=\frac{1}{\sqrt{2}}\mathbf{r}_{i}+\frac{i}{\sqrt{2}}\mathbf{m}_{i}$ where $\mathbf{r}_{i}$ and $\mathbf{m}_{i}$ are iid real-valued vectors of either Rademacher or standard normal Gaussian entries. Let $\bar{\mathbf{A}}:=\frac{1}{2^{k}}\sum_{\mathcal{V}\subseteq[k]}\mathbf{A}^{\intercal_{\mathcal{V}}}$ be the average of all partial transposes of $\mathbf{A}$. Then, $\operatorname{Var}[\mathbf{x}^{\textsf{H}}\mathbf{A}\mathbf{x}]\leq\sum_{\mathcal{S}\subset[k]}\lVert\operatorname{tr}_{\mathcal{S}}(\bar{\mathbf{A}})\rVert_{F}^{2}\leq\sum_{\mathcal{S}\subset[k]}\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert_{F}^{2}$. If the $\mathbf{x}_{i}$ are complex Gaussians, then the first expression is an equality. If we further know that $\mathbf{A}=\bar{\mathbf{A}}$, then both expressions are equalities. Further, if $\mathbf{A}$ is PSD, then we have $\operatorname{Var}[\mathbf{x}^{\textsf{H}}\mathbf{A}\mathbf{x}]\leq 2^{k}(\operatorname{tr}(\mathbf{A}))^{2}$.

Theorem 8.

Fix $d,k\in\mathbb{N}$. Let $\mathbf{A}=\mathbf{g}\mathbf{g}^{\intercal}$ where $\mathbf{g}\in\mathbb{R}^{d^{k}}$ is a vector of iid standard normal Gaussians. Then, with probability $\frac{3}{4}$, we know that $\operatorname{Var}[\mathbf{x}^{\textsf{H}}\mathbf{A}\mathbf{x}~|~\mathbf{g}]\leq O((1+\frac{1}{d})^{k}(\operatorname{tr}(\mathbf{A}))^{2})$. In particular, with probability $\frac{2}{3}$, we know that the Kronecker-Hutchinson Estimator returns a $(1\pm\varepsilon)$ relative error estimate to the trace of $\mathbf{A}$ if $\ell=O(\frac{1}{\varepsilon^{2}}(1+\frac{1}{d})^{k})$.

Theorem 9.

Fix $d,k\in\mathbb{N}$. Consider Hutchinson's Estimator run with vectors $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$ where $\mathbf{x}_{i}$ are vectors of iid complex zero-mean unit-variance entries. Then, there exists a PSD matrix $\mathbf{A}$ such that Hutchinson's Estimator needs $\ell=\Omega(\frac{(2-\frac{1}{d})^{k}}{\varepsilon^{2}})$ samples in order to have standard deviation $\leq\varepsilon\operatorname{tr}(\mathbf{A})$.

By applying Chebyshev's Inequality to Theorems 4 and 7, we see an exponential speedup from $\ell=O(\frac{3^{k}}{\varepsilon^{2}})$ queries in the real-valued case to $\ell=O(\frac{2^{k}}{\varepsilon^{2}})$ queries in the complex-valued case. For large $k$, this gap from $3^{k}$ to $2^{k}$ is a very significant speedup, and we are not aware of any other prior works which discuss this potential.

Another way to view this speedup is to compare the Kronecker-Hutchinson Estimator to just summing the diagonal of $\mathbf{A}$, which takes exactly $d^{k}$ queries. When $d=2$, the real-valued Kronecker-Hutchinson Estimator is exponentially more expensive than just summing the diagonal directly. However, the complex-valued Kronecker-Hutchinson Estimator is only $O(\frac{1}{\varepsilon^{2}})$ times more expensive in the worst case. This is another way to frame the power of the complex-valued estimator: it is at most polynomially slower than summing the diagonal, while the real-valued estimator can be exponentially slower.

We can also directly compare Theorems 3 and 8. In the real-valued case, the sample complexity grows as $\ell=O(\frac{1}{\varepsilon^{2}}(2+\frac{1}{d})^{k})$, while in the complex-valued case, the sample complexity grows as $\ell=O(\frac{1}{\varepsilon^{2}}(1+\frac{1}{d})^{k})$. Notably, if $k=O(d)$ then this rate grows as $O(\frac{2^{k}}{\varepsilon^{2}})$ for real-valued queries and $O(\frac{1}{\varepsilon^{2}})$ for complex-valued queries. This shows an exponential gap between our bounds, where we are only leveraging the difference between complex-valued and real-valued vectors.
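To make the complex queries concrete, here is a hedged numpy sketch of the complex Kronecker-Hutchinson Estimator, with factors $\mathbf{x}_{i}=\frac{1}{\sqrt{2}}(\mathbf{r}_{i}+i\,\mathbf{m}_{i})$ as in Theorem 7; the function names and the dense toy oracle are our own illustrative choices.

```python
import numpy as np
from functools import reduce

def complex_kron_query(d, k, rng, kind="gaussian"):
    """One query x = x_1 (x) ... (x) x_k with x_i = (r_i + 1j*m_i)/sqrt(2),
    where r_i, m_i have iid real Gaussian or Rademacher entries."""
    def factor():
        if kind == "gaussian":
            r, m = rng.standard_normal(d), rng.standard_normal(d)
        else:
            r, m = rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)
        return (r + 1j * m) / np.sqrt(2)
    return reduce(np.kron, [factor() for _ in range(k)])

def complex_kron_hutchinson(kron_matvec, d, k, ell, rng=np.random.default_rng(0)):
    """Average of x^H A x over ell complex Kronecker queries; for real A the
    imaginary part vanishes in expectation, so we report the real part."""
    vals = [np.vdot(x, kron_matvec(x))
            for x in (complex_kron_query(d, k, rng) for _ in range(ell))]
    return np.mean(vals).real

d, k = 3, 4
G = np.random.default_rng(1).standard_normal((d**k, d**k))
A = G @ G.T
print(complex_kron_hutchinson(lambda x: A @ x, d, k, ell=2000), np.trace(A))
```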

1.2.2 Additional Results and Hope for a Faster Algorithm

We complement the aforementioned results with a few informative additions. First, we show a lower bound against all Kronecker-matrix-vector algorithms. By adapting the techniques of [Braverman et al., 2020] and [Jiang et al., 2021], we show that any Kronecker-matrix-vector algorithm that estimates the trace of $\mathbf{A}$ with standard deviation $\varepsilon\operatorname{tr}(\mathbf{A})$ must make at least $\Omega(\frac{\sqrt{k}}{\varepsilon})$ oracle queries.

Second, we note that the lower bounds in Theorems 6 and 9 both use constructions of $\mathbf{A}$ such that $\mathbf{A}=\mathbf{A}_{1}\otimes\cdots\otimes\mathbf{A}_{k}$. That is, the Kronecker-Hutchinson Estimator converges slowest for matrices that have Kronecker structure. However, if we truly are given $\mathbf{A}$ with the promise that $\mathbf{A}=\mathbf{A}_{1}\otimes\cdots\otimes\mathbf{A}_{k}$ for some unknown matrices $\mathbf{A}_{1},\ldots,\mathbf{A}_{k}$, then in Section 7 we show that the trace of $\mathbf{A}$ can be recovered exactly using just $kd+1$ Kronecker-matrix-vector products. Similarly, Theorems 3 and 8 suggest that the Kronecker-Hutchinson Estimator converges exponentially slowly for random rank-one matrices. But in this rank-one case, we show that we can exactly recover the trace of $\mathbf{A}$ with a single Kronecker-matrix-vector query!

This trend, where the matrices that make the Kronecker-Hutchinson Estimator converge slowly can have their traces efficiently estimated by another algorithm, is reminiscent of the setup for the analysis in [Meyer et al., 2021]. In that paper, the authors recall the analysis of the standard deviation of Hutchinson's Estimator:

\[\sqrt{\operatorname{Var}[H_{\ell}(\mathbf{A})]}\leq\sqrt{\frac{2}{\ell}}\,\lVert\mathbf{A}\rVert_{F}\leq\sqrt{\frac{2}{\ell}}\operatorname{tr}(\mathbf{A})\leq\varepsilon\operatorname{tr}(\mathbf{A}),\]

where the middle inequality uses that $\lVert\mathbf{A}\rVert_{F}\leq\operatorname{tr}(\mathbf{A})$ for PSD $\mathbf{A}$, and the last inequality forces us to pick the sample complexity $\ell\geq\frac{2}{\varepsilon^{2}}$. Since the bound $\lVert\mathbf{A}\rVert_{F}\leq\operatorname{tr}(\mathbf{A})$ is only tight for matrices $\mathbf{A}$ that are nearly low-rank, the authors propose an algorithm which combines Hutchinson's Estimator with a low-rank approximation method that can efficiently compute the trace of nearly-low-rank matrices. This way, their algorithm has a worst-case matrix-vector complexity of $\Theta(\frac{1}{\varepsilon})$, while Hutchinson's Estimator is stuck with $\Theta(\frac{1}{\varepsilon^{2}})$.

Our analysis similarly characterizes two cases where the Kronecker-Hutchinson Estimator converges slowly: when $\mathbf{A}$ is either low-rank or has Kronecker structure. We also show that we can compute the trace of such matrices very efficiently. This suggests that a more involved algorithm could interpolate between these low query complexities and perhaps even avoid any exponential dependence on $k$ at all. We leave this as an important problem for future work.

Lastly, we remark that the analysis from [Meyer et al., 2021] crucially uses a tight bound for the variance of Hutchinson's Estimator in terms of $\lVert\mathbf{A}\rVert_{F}$. We believe that our core results, Theorems 1 and 2, can serve an analogous role in constructing a faster trace-estimating algorithm.

2 Preliminaries

We use capital bold letters to denote matrices, lowercase bold letters to denote vectors, and lowercase non-bold letters to denote scalars. $\mathbb{R}$ is the set of reals, and $\mathbb{C}$ is the set of complex numbers. $\mathfrak{Re}[\mathbf{x}]$ and $\mathfrak{Im}[\mathbf{x}]$ denote the real and imaginary parts of a vector or scalar. $\mathbf{x}^{\intercal}$ denotes the transpose of a vector, while $\mathbf{x}^{\textsf{H}}$ denotes the conjugate transpose of a matrix or vector. $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ denotes our input matrix. We use the bracket notation $[\mathbf{B}]_{ij}$ to denote the $i,j$ entry of matrix $\mathbf{B}$. We let $\mathbf{e}_{i}$ denote the $i^{th}$ standard basis vector, whose exact dimension may depend on the context. We let $\mathbf{I}\in\mathbb{R}^{d\times d}$ always refer to the $d\times d$ identity matrix. We let $\lVert\mathbf{A}\rVert_{F}$ denote the Frobenius norm, $\operatorname{tr}(\mathbf{A})$ the trace, $\operatorname{tr}_{\mathcal{S}}(\mathbf{A})$ the partial trace, and $\mathbf{A}^{\intercal_{\mathcal{V}}}$ the partial transpose. We let $[k]=\{1,\ldots,k\}$ be the set of integers from $1$ to $k$.

2.1 Kronecker product

The Kronecker product of two matrices $\mathbf{A}\in\mathbb{R}^{n\times m}$ and $\mathbf{B}\in\mathbb{R}^{p\times q}$ is $\mathbf{A}\otimes\mathbf{B}\in\mathbb{R}^{np\times mq}$, where

\[\mathbf{A}\otimes\mathbf{B} := \begin{bmatrix}[\mathbf{A}]_{11}\mathbf{B}&[\mathbf{A}]_{12}\mathbf{B}&\ldots&[\mathbf{A}]_{1m}\mathbf{B}\\ [\mathbf{A}]_{21}\mathbf{B}&[\mathbf{A}]_{22}\mathbf{B}&\ldots&[\mathbf{A}]_{2m}\mathbf{B}\\ \vdots&\vdots&\ddots&\vdots\\ [\mathbf{A}]_{n1}\mathbf{B}&[\mathbf{A}]_{n2}\mathbf{B}&\ldots&[\mathbf{A}]_{nm}\mathbf{B}\end{bmatrix}.\]

We also use the shorthands $\otimes_{i=1}^{k}\mathbf{x}_{i} := \mathbf{x}_{1}\otimes\mathbf{x}_{2}\otimes\cdots\otimes\mathbf{x}_{k}$, as well as $\mathbf{A}^{\otimes k} := \otimes_{i=1}^{k}\mathbf{A}$. The Kronecker product has three key properties that we repeatedly use:
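A tiny numpy sketch of the definition and these shorthands (the helper `kron_all` is ours, not the paper's notation):

```python
import numpy as np
from functools import reduce

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.eye(2)

# Definition: every entry [A]_{ij} scales a full copy of B.
print(np.kron(A, B))

def kron_all(factors):
    """The shorthand kron_{i=1}^k x_i (or A^{(x) k} when all factors are equal)."""
    return reduce(np.kron, factors)

print(kron_all([A, B, A]).shape)                                  # (8, 8)
print(np.allclose(kron_all([A] * 3), np.kron(np.kron(A, A), A)))  # A^{(x) 3}
```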

Lemma 10 (Mixed-Product Property (Folklore)).

\[(\mathbf{A}\otimes\mathbf{B})(\mathbf{C}\otimes\mathbf{D})=\mathbf{A}\mathbf{C}\otimes\mathbf{B}\mathbf{D}\]

Lemma 11 (Transposing (Folklore)).

\[(\mathbf{A}\otimes\mathbf{B})^{\intercal}=(\mathbf{A}^{\intercal}\otimes\mathbf{B}^{\intercal})\]

Lemma 12 (Fact 7.4.1 in [Bernstein, 2011]).

Let $\mathbf{x}\in\mathbb{R}^{n}$ and $\mathbf{y}\in\mathbb{R}^{m}$. Then, $\mathbf{x}\otimes\mathbf{y}=(\mathbf{I}_{n}\otimes\mathbf{y})\mathbf{x}=(\mathbf{x}\otimes\mathbf{I}_{m})\mathbf{y}$.
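These three facts are easy to spot-check numerically; the quick script below is an illustrative verification, not part of the paper's development.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 3)), rng.standard_normal((4, 5))
C, D = rng.standard_normal((3, 2)), rng.standard_normal((5, 4))
x, y = rng.standard_normal(3), rng.standard_normal(5)
n, m = x.size, y.size

# Lemma 10 (mixed-product): (A (x) B)(C (x) D) = AC (x) BD.
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# Lemma 11 (transposing): (A (x) B)^T = A^T (x) B^T.
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))

# Lemma 12: x (x) y = (I_n (x) y) x = (x (x) I_m) y.
assert np.allclose(np.kron(x, y), np.kron(np.eye(n), y.reshape(m, 1)) @ x)
assert np.allclose(np.kron(x, y), np.kron(x.reshape(n, 1), np.eye(m)) @ y)
print("all three identities verified")
```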

At one point, we will also require a short lemma about the squared Frobenius norm and the Kronecker product. This lemma is roughly equivalent to noting that $\lVert\mathbf{A}\rVert_{F}^{2}=\sum_{i,j=1}^{d}\lVert\mathbf{A}_{ij}\rVert_{F}^{2}$ in the block matrix decomposition of $\mathbf{A}$ when $k=2$ in Section 1.2.

Lemma 13.

Let $\mathbf{C},\mathbf{E}\in\mathbb{R}^{d^{k-i}\times d^{k-i}}$ and $\mathbf{D}\in\mathbb{R}^{d^{k}\times d^{k}}$ for any $i\in[k]$. Then, we have

\[\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert(\mathbf{C}\otimes\mathbf{e}_{\hat{i}})^{\intercal}\mathbf{D}(\mathbf{E}\otimes\mathbf{e}_{\hat{j}})\rVert_{F}^{2}=\lVert(\mathbf{C}\otimes\mathbf{I})^{\intercal}\mathbf{D}(\mathbf{E}\otimes\mathbf{I})\rVert_{F}^{2}.\]
Proof.

First, recall that there exists a permutation matrix $\mathbf{P}$ such that $\mathbf{C}\otimes\mathbf{y}=\mathbf{P}(\mathbf{y}\otimes\mathbf{C})$ for all $\mathbf{y}\in\mathbb{R}^{d^{i}}$. Now let $\hat{\mathbf{D}}:=\mathbf{P}^{\intercal}\mathbf{D}\mathbf{P}$. Then, expanding $\hat{\mathbf{D}}$ as a block matrix of matrices $\hat{\mathbf{D}}_{\hat{i},\hat{j}}\in\mathbb{R}^{d^{k-i}\times d^{k-i}}$, and noting that $(\mathbf{e}_{\hat{i}}\otimes\mathbf{C})^{\intercal}=\begin{bmatrix}\mathbf{0}&\cdots&\mathbf{0}&\mathbf{C}^{\intercal}&\mathbf{0}&\cdots&\mathbf{0}\end{bmatrix}$, we get that

\[(\mathbf{e}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\hat{\mathbf{D}}(\mathbf{e}_{\hat{j}}\otimes\mathbf{E})=\begin{bmatrix}\mathbf{0}&\cdots&\mathbf{0}&\mathbf{C}^{\intercal}&\mathbf{0}&\cdots&\mathbf{0}\end{bmatrix}\begin{bmatrix}\hat{\mathbf{D}}_{11}&\cdots&\hat{\mathbf{D}}_{1d}\\ \vdots&\ddots&\vdots\\ \hat{\mathbf{D}}_{d1}&\cdots&\hat{\mathbf{D}}_{dd}\end{bmatrix}\begin{bmatrix}\mathbf{0}\\ \vdots\\ \mathbf{0}\\ \mathbf{E}\\ \mathbf{0}\\ \vdots\\ \mathbf{0}\end{bmatrix}=\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{\hat{i}\hat{j}}\mathbf{E}.\]

We can also notice that

\[(\mathbf{I}\otimes\mathbf{C})^{\intercal}\hat{\mathbf{D}}(\mathbf{I}\otimes\mathbf{E})=\begin{bmatrix}\mathbf{C}^{\intercal}\\ &\ddots\\ &&\mathbf{C}^{\intercal}\end{bmatrix}\begin{bmatrix}\hat{\mathbf{D}}_{11}&\ldots&\hat{\mathbf{D}}_{1d}\\ \vdots&\ddots&\vdots\\ \hat{\mathbf{D}}_{d1}&\ldots&\hat{\mathbf{D}}_{dd}\end{bmatrix}\begin{bmatrix}\mathbf{E}\\ &\ddots\\ &&\mathbf{E}\end{bmatrix}=\begin{bmatrix}\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{11}\mathbf{E}&\ldots&\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{1d}\mathbf{E}\\ \vdots&\ddots&\vdots\\ \mathbf{C}^{\intercal}\hat{\mathbf{D}}_{d1}\mathbf{E}&\ldots&\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{dd}\mathbf{E}\end{bmatrix}.\]

Then, noting that the Frobenius norm is invariant to permutation, we find that

\begin{align*}
\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert(\mathbf{C}\otimes\mathbf{e}_{\hat{i}})^{\intercal}\mathbf{D}(\mathbf{E}\otimes\mathbf{e}_{\hat{j}})\rVert_{F}^{2} &=\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert(\mathbf{e}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\mathbf{P}^{\intercal}\mathbf{D}\mathbf{P}(\mathbf{e}_{\hat{j}}\otimes\mathbf{E})\rVert_{F}^{2}\\
&=\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert(\mathbf{e}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\hat{\mathbf{D}}(\mathbf{e}_{\hat{j}}\otimes\mathbf{E})\rVert_{F}^{2}\\
&=\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{\hat{i}\hat{j}}\mathbf{E}\rVert_{F}^{2}\\
&=\lVert(\mathbf{I}\otimes\mathbf{C})^{\intercal}\hat{\mathbf{D}}(\mathbf{I}\otimes\mathbf{E})\rVert_{F}^{2}\\
&=\lVert(\mathbf{C}\otimes\mathbf{I})^{\intercal}\mathbf{D}(\mathbf{E}\otimes\mathbf{I})\rVert_{F}^{2},
\end{align*}

where the last equality comes from again permuting the matrices inside the norm. ∎

2.2 The Partial Trace Operator

We take a moment to formalize the partial trace operator and characterize some useful properties of it. We start by defining the partial trace with respect to a single subsystem:

Definition 14.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$. Then, the partial trace of $\mathbf{A}$ with respect to its $i^{th}$ subsystem is

\[\operatorname{tr}_{\{i\}}(\mathbf{A}) := \sum_{j=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-i})\in\mathbb{R}^{d^{k-1}\times d^{k-1}},\tag{1}\]

where these identity matrices are $d\times d$ matrices, and where $\mathbf{e}_{j}\in\mathbb{R}^{d}$ is the $j^{th}$ standard basis vector.

Equivalently, the partial trace operator can be defined as the unique linear operator which guarantees that for all $\mathbf{A}\in\mathbb{R}^{d^{i-1}\times d^{i-1}}$, $\mathbf{B}\in\mathbb{R}^{d\times d}$, and $\mathbf{C}\in\mathbb{R}^{d^{k-i}\times d^{k-i}}$, we have that $\operatorname{tr}_{\{i\}}(\mathbf{A}\otimes\mathbf{B}\otimes\mathbf{C})=\operatorname{tr}(\mathbf{B})\cdot(\mathbf{A}\otimes\mathbf{C})$.
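A hedged numpy implementation of Definition 14 is below; it contracts the appropriate pair of tensor indices rather than summing over basis vectors as in Equation 1, which is equivalent but faster, and it checks the characterization just stated.

```python
import numpy as np

def partial_trace(A, d, k, i):
    """tr_{i}(A) for A in R^{d^k x d^k}: contract the row and column indices of
    the i-th subsystem (1-indexed), matching Definition 14."""
    T = A.reshape([d] * (2 * k))            # axes ordered (r_1..r_k, c_1..c_k)
    out = np.trace(T, axis1=i - 1, axis2=k + i - 1)
    return out.reshape(d**(k - 1), d**(k - 1))

# Check tr_{i}(A (x) B (x) C) = tr(B) * (A (x) C) for i = 2, k = 3.
rng = np.random.default_rng(0)
d, k = 2, 3
A1, B1, C1 = (rng.standard_normal((d, d)) for _ in range(3))
M = np.kron(np.kron(A1, B1), C1)
assert np.allclose(partial_trace(M, d, k, 2), np.trace(B1) * np.kron(A1, C1))
```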

We will often consider the partial trace of $\mathbf{A}$ with respect to a set of subsystems $\mathcal{S}\subseteq[k]$. We can think of this as either composing the operation in Equation 1 $|\mathcal{S}|$ many times, or as extending the summation in Equation 1 to sum over all $d^{|\mathcal{S}|}$ basis vectors that span the product of the subspaces defined by $\mathcal{S}$. To see this equivalence, consider the partial trace of $\mathbf{A}$ with respect to $\mathcal{S}=\{1,2\}$:

\begin{align*}
\operatorname{tr}_{\{1,2\}}(\mathbf{A}) &=\operatorname{tr}_{\{1\}}(\operatorname{tr}_{\{2\}}(\mathbf{A}))\\
&=\sum_{j=1}^{d}(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})^{\intercal}\bigg(\sum_{\hat{j}=1}^{d}(\mathbf{I}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})^{\intercal}\mathbf{A}(\mathbf{I}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})\bigg)(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})\\
&=\sum_{j=1}^{d}\sum_{\hat{j}=1}^{d}\big((\mathbf{I}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})\big)^{\intercal}\mathbf{A}\big((\mathbf{I}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})\big)\\
&=\sum_{j=1}^{d}\sum_{\hat{j}=1}^{d}(\mathbf{e}_{j}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})^{\intercal}\mathbf{A}(\mathbf{e}_{j}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2}). &&\text{(Lemma 10)}
\end{align*}

Next, note that $\mathbf{e}_{j}\otimes\mathbf{e}_{\hat{j}}\in\mathbb{R}^{d^{2}}$ is a standard basis vector in $\mathbb{R}^{d^{2}}$, uniquely identified by the pair $(j,\hat{j})$. That is, we can replace this double sum above with a single sum over standard basis vectors in $\mathbb{R}^{d^{2}}$:

\[\operatorname{tr}_{\{1,2\}}(\mathbf{A})=\sum_{j=1}^{d^{2}}(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})^{\intercal}\mathbf{A}(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2}),\qquad\text{where now }\mathbf{e}_{j}\in\mathbb{R}^{d^{2}}.\]

This expression now looks remarkably similar to Definition 14, but with a larger notion of “subsystem” that the standard basis vector acts on. Formally, we define the partial trace of $\mathbf{A}$ with respect to a set $\mathcal{S}$ as follows:

Definition 15.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$, and let $\mathcal{S}\subseteq[k]$. Let $i_{1},\ldots,i_{|\mathcal{S}|}$ be the indices in $\mathcal{S}$, sorted so that $i_{1}<i_{2}<\ldots<i_{|\mathcal{S}|}$. Then, the partial trace of $\mathbf{A}$ with respect to the subsystem $\mathcal{S}$ is

\[\operatorname{tr}_{\mathcal{S}}(\mathbf{A}) := \operatorname{tr}_{\{i_{1}\}}(\operatorname{tr}_{\{i_{2}\}}(\cdots(\operatorname{tr}_{\{i_{|\mathcal{S}|}\}}(\mathbf{A}))))\in\mathbb{R}^{d^{k-|\mathcal{S}|}\times d^{k-|\mathcal{S}|}}.\]

In the special case that $\mathcal{S}=\{i,i+1,\ldots,k\}$, we denote this partial trace as $\operatorname{tr}_{i:}(\mathbf{A})$.

In principle, the order of the subscripts does not really matter. Tracing out the first subsystem and then tracing out the second subsystem produces the exact same matrix as tracing out the second subsystem and then tracing out the first subsystem. However, notationally, a non-ascending order becomes messy, since removing the second subsystem changes the indices. To see this messiness, note that the above statement would be formalized as $\operatorname{tr}_{\{1\}}(\operatorname{tr}_{\{2\}}(\mathbf{A}))=\operatorname{tr}_{\{1\}}(\operatorname{tr}_{\{1\}}(\mathbf{A}))$. To avoid this issue, we always expand the partial traces in $\mathcal{S}$ in sorted order, as in Definition 15.

Moving on, a vital property of the partial trace is that it is trace-preserving:

Lemma 16 (Folklore).

For all $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ and $\mathcal{S}\subseteq[k]$, we have $\operatorname{tr}(\operatorname{tr}_{\mathcal{S}}(\mathbf{A}))=\operatorname{tr}(\mathbf{A})$.

2.3 Post-Measurement Reduced Density Matrices

Our analysis also heavily involves post-measurement reduced density matrices, which we now define, and whose long name originates from quantum physics. As mentioned in Section 1.2, if $\mathbf{A}$ is a matrix that describes the joint probability distribution of many particles (a density matrix), then $\operatorname{tr}_{\mathcal{S}}(\mathbf{A})$ describes the joint probability distribution of the particles that Bob can view. The Post-Measurement Reduced Density Matrix (PMRDM) of $\mathbf{A}$ with respect to a vector $\mathbf{x}$ describes the joint distribution of the particles that Bob can view given that Alice measured her particles in the direction of the vector $\mathbf{x}$.¹ For brevity, we refer to this matrix as the PMRDM.

¹ For the purpose of understanding the paper, there is no particular need to understand this intuition. Those interested in learning more are recommended to read through Chapter 6 of [Aaronson, 2018].

Definition 17.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$, $i\in[k]$, and $\mathbf{x}_{i}\in\mathbb{R}^{d}$. Then the PMRDM of $\mathbf{A}$ after measuring $\mathbf{x}_{i}$ along the $i^{th}$ subsystem is

\[\mathcal{M}_{\{i\}}(\mathbf{A}) := (\mathbf{I}^{\otimes i-1}\otimes\mathbf{x}_{i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{x}_{i}\otimes\mathbf{I}^{\otimes k-i})\in\mathbb{R}^{d^{k-1}\times d^{k-1}}.\]

Further, let $\mathbf{x}_{1},\ldots,\mathbf{x}_{i}\in\mathbb{R}^{d}$. Then, the PMRDM of $\mathbf{A}$ after measuring according to $\mathbf{x}_{1},\ldots,\mathbf{x}_{i}$ in the first $i$ subsystems of $[k]$ is

\begin{align*}
\mathcal{M}_{:i}(\mathbf{A}) &:= \mathcal{M}_{\{1\}}(\mathcal{M}_{\{2\}}(\cdots(\mathcal{M}_{\{i\}}(\mathbf{A}))))\\
&=(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})\in\mathbb{R}^{d^{k-i}\times d^{k-i}},
\end{align*}

where $\mathbf{x}_{:i}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{i}\in\mathbb{R}^{d^{i}}$.
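Here is a hedged numpy sketch of this definition, together with a numerical check of the identity that Lemma 18 below establishes; the helper names are ours.

```python
import numpy as np
from functools import reduce

def pmrdm(A, xs, d, k):
    """M_{:i}(A) = (x_{:i} (x) I^{(x)(k-i)})^T A (x_{:i} (x) I^{(x)(k-i)}),
    where xs = [x_1, ..., x_i] are the measured directions (Definition 17)."""
    i = len(xs)
    x_to_i = reduce(np.kron, xs).reshape(-1, 1)          # x_{:i} as a d^i x 1 column
    S = np.kron(x_to_i, np.eye(d**(k - i)))              # d^k x d^{k-i}
    return S.T @ A @ S

# Check tr(M_{:i}(A)) = x_{:i}^T tr_{i+1:}(A) x_{:i} for d = 2, k = 3, i = 2.
rng = np.random.default_rng(0)
d, k, i = 2, 3, 2
A = rng.standard_normal((d**k, d**k))
xs = [rng.standard_normal(d) for _ in range(i)]
T = A.reshape([d] * (2 * k))
tr_rest = np.trace(T, axis1=2, axis2=5).reshape(d**i, d**i)   # tr_{3:}(A)
x_to_i = reduce(np.kron, xs)
assert np.isclose(np.trace(pmrdm(A, xs, d, k)), x_to_i @ tr_rest @ x_to_i)
```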

Notice that the notation for the PMRDM $\mathcal{M}_{\{i\}}(\mathbf{A})$ does not explicitly show the vector $\mathbf{x}_{i}$ that defines the measurement. This is done for visual clarity, and is safe because the meaning of the vector $\mathbf{x}_{i}$ will always be clear from context.² Further, while we could define the PMRDM with respect to an arbitrary set of subsystems $\mathcal{S}$, as done in Definition 15, our proofs only need to examine the PMRDM when looking at the first $i$ subsystems, so for simplicity we only define the PMRDM for that case.

² Namely, we always take $\mathbf{x}_{i}$ to be the $i^{th}$ term in the expansion of $\mathbf{x}=\otimes_{i=1}^{k}\mathbf{x}_{i}$, where $\mathbf{x}_{i}$ has either iid Gaussian or Rademacher entries.

We need one important property from the PMRDM, which relates it to the partial trace:

Lemma 18.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$, $i\in[k]$, and $\mathbf{x}_{1},\ldots,\mathbf{x}_{i}\in\mathbb{R}^{d}$. Define $\mathbf{x}_{:i}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{i}$. Then, $\operatorname{tr}(\mathcal{M}_{:i}(\mathbf{A}))=\mathbf{x}_{:i}^{\intercal}\operatorname{tr}_{i+1:}(\mathbf{A})\,\mathbf{x}_{:i}$.

Proof.

We directly expand and simplify:

\begin{align*}
\operatorname{tr}(\mathcal{M}_{:i}(\mathbf{A})) &=\operatorname{tr}((\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i}))\\
&=\sum_{j=1}^{d^{k-i}}\mathbf{e}_{j}^{\intercal}(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})\mathbf{e}_{j}\\
&=\sum_{j=1}^{d^{k-i}}\mathbf{x}_{:i}^{\intercal}(\mathbf{I}^{\otimes i}\otimes\mathbf{e}_{j})^{\intercal}\mathbf{A}(\mathbf{I}^{\otimes i}\otimes\mathbf{e}_{j})\mathbf{x}_{:i} &&\text{(Lemma 12)}\\
&=\mathbf{x}_{:i}^{\intercal}\operatorname{tr}_{i+1:}(\mathbf{A})\,\mathbf{x}_{:i}.
\end{align*}

2.4 The Partial Transpose Operator

Finally, we formalize the partial transpose operator and characterize a useful property of it. We start by defining the partial transpose of $\mathbf{A}$ with respect to a single subsystem.

Definition 19.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ and $i\in[k]$. Then the partial transpose of $\mathbf{A}$ with respect to the $i^{th}$ subsystem is

\[\mathbf{A}^{\intercal_{\{i\}}} := \sum_{\hat{i},\hat{j}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i}).\]

Equivalently, the partial transpose operator is the unique linear operator which guarantees that for all $\mathbf{B}\in\mathbb{R}^{d^{i-1}\times d^{i-1}}$, $\mathbf{C}\in\mathbb{R}^{d\times d}$, and $\mathbf{D}\in\mathbb{R}^{d^{k-i}\times d^{k-i}}$, we have that $(\mathbf{B}\otimes\mathbf{C}\otimes\mathbf{D})^{\intercal_{\{i\}}}=(\mathbf{B}\otimes\mathbf{C}^{\intercal}\otimes\mathbf{D})$. To verify this equivalence, notice that we can express the transpose of a matrix as

\[\sum_{\hat{i},\hat{j}=1}^{d}\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\mathbf{C}\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}=\sum_{\hat{i},\hat{j}=1}^{d}[\mathbf{C}]_{\hat{j}\hat{i}}\,\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}=\mathbf{C}^{\intercal}.\]

Then, we see that

\begin{align*}
\sum_{\hat{i},\hat{j}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})(\mathbf{B}\otimes\mathbf{C}\otimes\mathbf{D})(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i}) &=\sum_{\hat{i},\hat{j}=1}^{d}(\mathbf{B}\otimes[\mathbf{C}^{\intercal}]_{\hat{i}\hat{j}}\,\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{D})\\
&=(\mathbf{B}\otimes\mathbf{C}^{\intercal}\otimes\mathbf{D}).
\end{align*}
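The following hedged numpy sketch implements Definition 19 by swapping a pair of tensor axes (an equivalent, faster route than the double sum), and verifies the characterization above on a Kronecker-structured input.

```python
import numpy as np

def partial_transpose(A, d, k, i):
    """A^{T_{i}}: swap the row and column indices of the i-th subsystem
    (1-indexed), matching Definition 19."""
    T = A.reshape([d] * (2 * k))            # axes ordered (r_1..r_k, c_1..c_k)
    return np.swapaxes(T, i - 1, k + i - 1).reshape(d**k, d**k)

# Check (B (x) C (x) D)^{T_{2}} = B (x) C^T (x) D.
rng = np.random.default_rng(0)
d, k = 2, 3
B, C, D = (rng.standard_normal((d, d)) for _ in range(3))
M = np.kron(np.kron(B, C), D)
assert np.allclose(partial_transpose(M, d, k, 2), np.kron(np.kron(B, C.T), D))
```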

Like the partial trace, we will often be interested in taking the partial transpose of $\mathbf{A}$ with respect to a set of subsystems $\mathcal{V}\subseteq[k]$. We extend the definition of the partial transpose directly to this case:

Definition 20.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ and $\mathcal{V}\subseteq[k]$. Let $i_{1},\ldots,i_{|\mathcal{V}|}$ be the indices in $\mathcal{V}$. Then, the partial transpose of $\mathbf{A}$ with respect to the subsystem $\mathcal{V}$ is

\[\mathbf{A}^{\intercal_{\mathcal{V}}} := (((\mathbf{A}^{\intercal_{\{i_{1}\}}})^{\intercal_{\{i_{2}\}}})\cdots)^{\intercal_{\{i_{|\mathcal{V}|}\}}}.\]

Notably, the order of the partial transposes does not matter in this case. That is, we have $(\mathbf{A}^{\intercal_{\{i\}}})^{\intercal_{\{j\}}}=(\mathbf{A}^{\intercal_{\{j\}}})^{\intercal_{\{i\}}}$. We next establish two useful properties of the partial transpose:

Lemma 21.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} and i[k]i\in[k]. Then, tr{i}(𝐀{i})=tr{i}(𝐀)\operatorname{tr}_{\{i\}}(\mathbf{A}^{\intercal_{\{i\}}})=\operatorname{tr}_{\{i\}}(\mathbf{A}).

Proof.

We directly compute

tr{i}(𝐀{i})\displaystyle\operatorname{tr}_{\{i\}}(\mathbf{A}^{\intercal_{\{i\}}}) =i^=1d(𝐈i1𝐞i^𝐈ki)𝐀{i}(𝐈i1𝐞i^𝐈ki)\displaystyle=\sum_{\hat{i}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})
=i^,j^,^=1d(𝐈i1𝐞i^𝐞j^𝐞^𝐈ki)𝐀(𝐈i1𝐞j^𝐞^𝐞i^𝐈ki)\displaystyle=\sum_{\hat{i},\hat{j},\hat{\ell}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})
=i^,j^,^=1d𝟙[i^=j^]𝟙[i^=^](𝐈i1𝐞^𝐈ki)𝐀(𝐈i1𝐞j^𝐈ki)\displaystyle=\sum_{\hat{i},\hat{j},\hat{\ell}=1}^{d}\mathbbm{1}_{[\hat{i}=\hat{j}]}\mathbbm{1}_{[\hat{i}=\hat{\ell}]}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-i})
=i^=1d(𝐈i1𝐞i^𝐈ki)𝐀(𝐈i1𝐞i^𝐈ki)\displaystyle=\sum_{\hat{i}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})
=tr{i}(𝐀).\displaystyle=\operatorname{tr}_{\{i\}}(\mathbf{A}).

Lemma 22.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}, i[k]i\in[k], 𝐁di1×di1\mathbf{B}\in{\mathbb{R}}^{d^{i-1}\times d^{i-1}}, and 𝐂dki×dki\mathbf{C}\in{\mathbb{R}}^{d^{k-i}\times d^{k-i}}. Then, for all i^,j^[d]\hat{i},\hat{j}\in[d] we have

(𝐁𝐞j^𝐂)𝐀(𝐁𝐞i^𝐂)=(𝐁𝐞i^𝐂)𝐀{i}(𝐁𝐞j^𝐂)(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})^{\intercal}\mathbf{A}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C})=(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})
Proof.

We expand the right-hand-side:

(𝐁𝐞i^𝐂)𝐀{i}(𝐁𝐞j^𝐂)\displaystyle(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})
=\displaystyle= (𝐁𝐞i^𝐂)(^,m^=1d(𝐈i1𝐞^𝐞m^𝐈ki)𝐀(𝐈i1𝐞^𝐞m^𝐈ki))(𝐁𝐞j^𝐂)\displaystyle(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C})^{\intercal}({\textstyle\sum_{\hat{\ell},\hat{m}=1}^{d}}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}}\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}}\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i}))(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})
=\displaystyle= ^,m^=1d(𝐁𝐞i^𝐞^𝐞m^𝐂)𝐀(𝐁𝐞^𝐞m^𝐞j^𝐂)\displaystyle{\textstyle\sum_{\hat{\ell},\hat{m}=1}^{d}}(\mathbf{B}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{\ell}}\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\otimes\mathbf{C}^{\intercal})\mathbf{A}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}}\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})
=\displaystyle= (𝐁𝐞j^𝐂)𝐀(𝐁𝐞i^𝐂),\displaystyle(\mathbf{B}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}^{\intercal}\otimes\mathbf{C}^{\intercal})\mathbf{A}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C}),

where we remove the sum by noting that 𝐞i^𝐞^\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{\ell}} and 𝐞m^𝐞j^\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{j}} are zero unless ^=i^\hat{\ell}=\hat{i} and m^=j^\hat{m}=\hat{j}. ∎
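
The same reshaping viewpoint gives a quick numerical check of Lemmas 21 and 22 (again purely illustrative; the helper functions, indices, and sizes below are our own ad hoc choices).

```python
import numpy as np

def partial_transpose(A, d, k, i):
    # Swap the row and column axes of subsystem i (1-indexed).
    T = A.reshape((d,) * (2 * k))
    return np.swapaxes(T, i - 1, k + i - 1).reshape(d ** k, d ** k)

def partial_trace_one(A, d, k, i):
    # Trace out subsystem i (1-indexed), leaving a d^(k-1) x d^(k-1) matrix.
    T = A.reshape((d,) * (2 * k))
    T = np.trace(T, axis1=i - 1, axis2=k + i - 1)
    return T.reshape(d ** (k - 1), d ** (k - 1))

rng = np.random.default_rng(1)
d, k, i = 2, 3, 2
A = rng.standard_normal((d ** k, d ** k))

# Lemma 21: the partial trace over subsystem i ignores a partial transpose on subsystem i.
assert np.allclose(partial_trace_one(partial_transpose(A, d, k, i), d, k, i),
                   partial_trace_one(A, d, k, i))

# Lemma 22: swapping e_i and e_j across A matches a partial transpose on subsystem i.
B = rng.standard_normal((d ** (i - 1), d ** (i - 1)))
C = rng.standard_normal((d ** (k - i), d ** (k - i)))
e = np.eye(d)
ih, jh = 0, 1
lhs = np.kron(np.kron(B, e[:, [jh]]), C).T @ A @ np.kron(np.kron(B, e[:, [ih]]), C)
rhs = np.kron(np.kron(B, e[:, [ih]]), C).T @ partial_transpose(A, d, k, i) @ np.kron(np.kron(B, e[:, [jh]]), C)
assert np.allclose(lhs, rhs)
```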

We now have enough tools to exactly understand the variance of the Kronecker-Hutchinson Estimator.

3 The Exact Variance of Kronecker-Hutchinson Estimator

We begin by recalling the exact variance of the classical (k=1k=1) Hutchinson’s Estimator:

Lemma 23 ([Girard, 1987, Hutchinson, 1989, Avron and Toledo, 2011, Meyer, 2021]).

Let 𝐀D×D\mathbf{A}\in{\mathbb{R}}^{D\times D} and let 𝐠D\boldsymbol{\mathrm{g}}\in{\mathbb{R}}^{D} be a vector of either iid Rademacher or standard normal Gaussian entries. Let ^𝐀=12(𝐀+𝐀)\hat{}\mathbf{A}=\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}). Then, 𝔼[𝐠𝐀𝐠]=tr(𝐀)\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{g}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{g}}]=\operatorname{tr}(\mathbf{A}) and Var[𝐠𝐀𝐠]2^𝐀F22𝐀F2\operatorname*{Var}[\boldsymbol{\mathrm{g}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{g}}]\leq 2\mathinner{\lVert\hat{}\mathbf{A}\rVert}_{F}^{2}\leq 2\mathinner{\lVert\mathbf{A}\rVert}_{F}^{2}. The first inequality is tight if 𝐠\boldsymbol{\mathrm{g}} is Gaussian, and the second is tight if 𝐀\mathbf{A} is symmetric.

This factor of 2 is important for our analysis, and is the base of the exponent 2k|𝒮|2^{k-|{{\mathcal{S}}}|} in Theorem 1. The key to our proof technique is to use Lemma 12 to expand

𝐱𝐀𝐱\displaystyle\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}} =(𝐱1𝐱k)𝐀(𝐱1𝐱k)\displaystyle=(\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})^{\intercal}\mathbf{A}(\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})
=𝐱k(𝐱1𝐱k1𝐈)𝐀(𝐱1𝐱k1𝐈)𝐱k\displaystyle=\boldsymbol{\mathrm{x}}_{k}^{\intercal}(\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k-1}\otimes\mathbf{I})^{\intercal}\mathbf{A}(\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k-1}\otimes\mathbf{I})\boldsymbol{\mathrm{x}}_{k}
=𝐱k:k1(𝐀)𝐱k\displaystyle=\boldsymbol{\mathrm{x}}_{k}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})\boldsymbol{\mathrm{x}}_{k} (2)

which is then just a classical (k=1k=1) Hutchinson’s estimate of the trace of :k1(𝐀){\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A}). For Gaussian vectors, Lemma 23 then gives an exact understanding of the variance of this estimator. As a warm-up, we first use this proof technique to show that the Kronecker-Hutchinson Estimator is unbiased, though shorter proofs are certainly possible.

Theorem 24.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Let 𝐱1,,𝐱kd\boldsymbol{\mathrm{x}}_{1},\ldots,\boldsymbol{\mathrm{x}}_{k}\in{\mathbb{R}}^{d} be iid Rademacher or standard normal Gaussian vectors, and let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}. Then, 𝔼[𝐱𝐀𝐱]=tr(𝐀)\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname{tr}(\mathbf{A}).

Proof.

For any i[k]i\in[k],

𝔼[𝐱:itri+1:(𝐀)𝐱:i]\displaystyle\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}^{\intercal}\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}] =𝔼[𝐱i:i1(tri+1:(𝐀))𝐱i]\displaystyle=\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{i}^{\intercal}\,{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))\,\boldsymbol{\mathrm{x}}_{i}] (Equation 2)
=𝔼𝐱:i1𝔼𝐱i[𝐱i:i1(tri+1:(𝐀))𝐱i𝐱:i1]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{i}}[\boldsymbol{\mathrm{x}}_{i}^{\intercal}\,{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))\,\boldsymbol{\mathrm{x}}_{i}\mid\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}] (Tower Rule)
=𝔼𝐱:i1[tr(:i1(tri+1:(𝐀)))]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}[\operatorname{tr}({\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})))\,] (Lemma 23)
=𝔼[𝐱:i1tri:(𝐀)𝐱:i1].\displaystyle=\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}^{\intercal}\operatorname{tr}_{i\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}]. (Lemma 18)

Repeating this argument kk times in a row, and letting tr(𝐀):=𝐀\operatorname{tr}_{\emptyset}(\mathbf{A})\;{\vcentcolon=}\;\mathbf{A}, we conclude by Lemma 16 that

𝔼[𝐱𝐀𝐱]=𝔼[𝐱:ktr(𝐀)𝐱:k]==𝔼[𝐱1tr2:(𝐀)𝐱1]=tr(tr2:(𝐀))=tr(𝐀)\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k}^{\intercal}\operatorname{tr}_{\emptyset}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k}]=\ldots=\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{1}^{\intercal}\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{1}]=\operatorname{tr}(\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))=\operatorname{tr}(\mathbf{A})

which completes the proof.
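
For intuition, the following Monte Carlo sketch (illustrative only; the oracle is simulated by an explicit matrix, and the sample count is an arbitrary choice) runs the Kronecker-Hutchinson Estimator with Rademacher factors. Each query is a Kronecker-structured vector, so only Kronecker-matrix-vector products are used, and the empirical average concentrates around tr(𝐀), consistent with Theorem 24.

```python
import numpy as np

def kron_hutchinson(matvec, d, k, num_samples, rng):
    """Average of x^T A x over Kronecker-structured Rademacher vectors,
    accessing A only through the (Kronecker-)matrix-vector oracle `matvec`."""
    total = 0.0
    for _ in range(num_samples):
        x = np.ones(1)
        for _ in range(k):
            x = np.kron(x, rng.choice([-1.0, 1.0], size=d))
        total += x @ matvec(x)
    return total / num_samples

rng = np.random.default_rng(2)
d, k = 2, 4
A = rng.standard_normal((d ** k, d ** k))
estimate = kron_hutchinson(lambda v: A @ v, d, k, num_samples=100_000, rng=rng)
print(estimate, np.trace(A))   # the two values agree up to Monte Carlo error
```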

Analyzing the variance of the estimator requires a bit more effort due to the Frobenius norm term and the ^𝐀\hat{}\mathbf{A} matrix that appear in Lemma 23. More concretely, we approach the analysis of the variance of 𝐱𝐀𝐱\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}} via the classical formula

Var[𝐱𝐀𝐱]=𝔼[(𝐱𝐀𝐱)2](𝔼[𝐱𝐀𝐱])2.\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]-(\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}])^{2}.

Since we already know that (𝔼[𝐱𝐀𝐱])2=(tr(𝐀))2(\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}])^{2}=(\operatorname{tr}(\mathbf{A}))^{2}, we turn our attention to evaluating 𝔼[(𝐱𝐀𝐱)2]\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]. Note that rearranging Lemma 23 gives that

𝔼[(𝐠𝐀𝐠)2](tr(𝐀))2+2^𝐀F2,\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{g}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{g}})^{2}]\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\hat{}\mathbf{A}\rVert}_{F}^{2}, (3)

with equality if 𝐠\boldsymbol{\mathrm{g}} is Gaussian. Then, noting that 12(:k1(𝐀)+:k1(𝐀))=:k1(12(𝐀+𝐀))\frac{1}{2}({\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})+{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})^{\intercal})={\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal})), and taking the same approach as in the bias analysis, we see that

𝔼[(𝐱𝐀𝐱)2]\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}] =𝔼[(𝐱k:k1(𝐀)𝐱k)2]\displaystyle=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}_{k}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})\boldsymbol{\mathrm{x}}_{k})^{2}] (Equation 2)
=𝔼𝐱:k1𝔼𝐱k[(𝐱k:k1(𝐀)𝐱k)2𝐱:k1]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{k}}[(\boldsymbol{\mathrm{x}}_{k}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})\boldsymbol{\mathrm{x}}_{k})^{2}\mid\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}] (Tower Rule)
=𝔼𝐱:k1[(tr(:k1(𝐀)))2+2:k1(𝐁)F2]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}}[(\operatorname{tr}({\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})))^{2}+2\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{B})\rVert}_{F}^{2}] (Equation 3)
=𝔼𝐱:k1[(𝐱:k1tr{k}(𝐀)𝐱:k1)2]+2𝔼𝐱:k1[:k1(𝐁)F2],\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}}[(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}^{\intercal}\operatorname{tr}_{\{k\}}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1})^{2}]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{B})\rVert}_{F}^{2}], (Lemma 18)

where we let 𝐁:=12(𝐀+𝐀)\mathbf{B}\;{\vcentcolon=}\;\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}). The first term admits a recursive relationship just like in the proof of Theorem 24, so we can easily handle that. The second term however requires more effort. The key observation we use is that the squared Frobenius norm of the PMRDM is in fact a sum of the squares of many Kronecker-Hutchinson Estimators:

:i(𝐁)F2\displaystyle\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2} =i^,j^=1dki[:i(𝐁)]i^j^2\displaystyle=\sum_{\hat{i},\hat{j}=1}^{d^{k-i}}[{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})]_{\hat{i}\hat{j}}^{2}
=i^,j^=1dki(𝐞i^:i(𝐁)𝐞j^)2\displaystyle=\sum_{\hat{i},\hat{j}=1}^{d^{k-i}}(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\boldsymbol{\mathrm{e}}_{\hat{j}})^{2}
=i^,j^=1dki(𝐞i^(𝐱:i𝐈ki)𝐁(𝐱:i𝐈ki)𝐞j^)2\displaystyle=\sum_{\hat{i},\hat{j}=1}^{d^{k-i}}(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}\otimes\mathbf{I}^{\otimes k-i})\boldsymbol{\mathrm{e}}_{\hat{j}})^{2}
=i^,j^=1dki(𝐱:i(𝐈i𝐞i^)𝐁(𝐈i𝐞j^)𝐱:i)2\displaystyle=\sum_{\hat{i},\hat{j}=1}^{d^{k-i}}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}^{\intercal}(\mathbf{I}^{\otimes i}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\mathbf{I}^{\otimes i}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i})^{2} (Lemma 12)

In particular, this last line is the sum of squares of many Kronecker-Hutchinson Estimators for a wide variety of matrices. Despite these being trace estimates for different matrices, we can just carefully analyze the resulting terms using the Mixed-Product Property (Lemma 10), Lemma 12, and Equation 3. We get the following recurrence relation:

Lemma 25.

Fix i[k]i\in[k], and let 𝐁dj×dj\mathbf{B}\in{\mathbb{R}}^{d^{j}\times d^{j}} for some jij\geq i. Then,

𝔼𝐱:i[:i(𝐁)F2]𝔼𝐱:i1:i1(tri(𝐁))F2+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2,\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]\leq\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i}(\mathbf{B}))\rVert}_{F}^{2}+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2},

with equality if 𝐱i\boldsymbol{\mathrm{x}}_{i} is a standard normal Gaussian vector.

Proof.

We first extract the vector 𝐱i\boldsymbol{\mathrm{x}}_{i} via the same technique as in Equation 2. Recall that we always take 𝐈=𝐈dd×d\mathbf{I}=\mathbf{I}_{d}\in{\mathbb{R}}^{d\times d} to be the identity in dd dimensions.

𝔼𝐱:i[:i(𝐁)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]
=i^,j^𝔼𝐱:i[(𝐞i^:i(𝐁)𝐞j^)2]\displaystyle=\sum_{\hat{i},\hat{j}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\boldsymbol{\mathrm{e}}_{\hat{j}})^{2}]
=i^,j^𝔼𝐱:i[(𝐞i^(𝐱:i𝐈ji)𝐁(𝐱:i𝐈ji)𝐞j^)2]\displaystyle=\sum_{\hat{i},\hat{j}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}\otimes\mathbf{I}^{\otimes j-i})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}\otimes\mathbf{I}^{\otimes j-i})\boldsymbol{\mathrm{e}}_{\hat{j}})^{2}]
=i^,j^𝔼𝐱:i[((𝐱:i1𝐱i𝐞i^)𝐁(𝐱:i1𝐱i𝐞j^))2]\displaystyle=\sum_{\hat{i},\hat{j}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[((\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\boldsymbol{\mathrm{x}}_{i}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\boldsymbol{\mathrm{x}}_{i}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}))^{2}]
=𝔼𝐱:i1i^,j^𝔼𝐱i[(𝐱i(𝐱:i1𝐈𝐞i^)𝐁(𝐱:i1𝐈𝐞j^)𝐱i)2𝐱:i1]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{i}}[(\boldsymbol{\mathrm{x}}_{i}^{\intercal}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\boldsymbol{\mathrm{x}}_{i})^{2}\mid\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}]

We take a moment to examine the variance of this Hutchinson’s sample. Letting 𝐂:=𝐱:i1𝐈\mathbf{C}\;{\vcentcolon=}\;\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}, we get that the variance of the sample is at most

212((𝐂𝐞i^)𝐁(𝐂𝐞j^)+(𝐂𝐞j^)𝐁(𝐂𝐞i^))F2.2\mathinner{\lVert{\textstyle\frac{1}{2}}((\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})+(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})^{\intercal}\mathbf{B}(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}))\rVert}_{F}^{2}.

By Lemma 22, we know the term on the right is equivalent to (𝐂𝐞i^)𝐁{i}(𝐂𝐞j^)(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}^{\intercal_{\{i\}}}(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}), so that the norm above becomes 2(𝐂𝐞i^)12(𝐁+𝐁{i})(𝐂𝐞j^)F22\mathinner{\lVert(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\rVert}_{F}^{2}. Lemma 13 further tells us that the sum of these variances across all i^,j^[dji]\hat{i},\hat{j}\in[d^{j-i}] is 2(𝐂𝐈)12(𝐁+𝐁{i})(𝐂𝐈)F22\mathinner{\lVert(\mathbf{C}\otimes\mathbf{I})^{\intercal}\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})(\mathbf{C}\otimes\mathbf{I})\rVert}_{F}^{2}. Returning to our overall analysis, we get

𝔼𝐱:i[:i(𝐁)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]
𝔼𝐱:i1i^,j^[(tr((𝐱:i1𝐈𝐞i^)𝐁(𝐱:i1𝐈𝐞j^)))2+2(𝐱:i1𝐈𝐞i^)12(𝐁+𝐁{i})(𝐱:i1𝐈𝐞j^)F2]\displaystyle\leq\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\left[(\operatorname{tr}((\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})))^{2}+2\mathinner{\lVert(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}{\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\rVert}_{F}^{2}\right]
=𝔼𝐱:i1i^,j^=1[(tr((𝐱:i1𝐈𝐞i^)𝐁(𝐱:i1𝐈𝐞j^)))2]+2𝔼𝐱:i1(𝐱:i1𝐈2)12(𝐁+𝐁{i})(𝐱:i1𝐈2)F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}=1}\left[(\operatorname{tr}((\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})))^{2}\right]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}^{\otimes 2})^{\intercal}{\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}^{\otimes 2})\rVert}_{F}^{2}
=𝔼𝐱:i1i^,j^[(=1d𝐞(𝐱:i1𝐈𝐞i^)𝐁(𝐱:i1𝐈𝐞j^)𝐞)2]+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\left[\left(\sum_{\ell=1}^{d}\boldsymbol{\mathrm{e}}_{\ell}^{\intercal}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\boldsymbol{\mathrm{e}}_{\ell}\right)^{2}\right]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
=𝔼𝐱:i1i^,j^[(𝐞i^=1d(𝐱:i1𝐞𝐈)𝐁(𝐱:i1𝐞𝐈)𝐞j^)2]+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\left[\left(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\sum_{\ell=1}^{d}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\boldsymbol{\mathrm{e}}_{\ell}\otimes\mathbf{I})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\boldsymbol{\mathrm{e}}_{\ell}\otimes\mathbf{I})\boldsymbol{\mathrm{e}}_{\hat{j}}\right)^{2}\right]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
=𝔼𝐱:i1i^,j^[(𝐞i^:i1(tr{i}(𝐁))𝐞j^)2]+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\left[\left(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{\{i\}}(\mathbf{B}))\boldsymbol{\mathrm{e}}_{\hat{j}}\right)^{2}\right]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
=𝔼𝐱:i1:i1(tri(𝐁))F2+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i}(\mathbf{B}))\rVert}_{F}^{2}+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}

Note that the only inequality used is the application of Equation 3, which is an equality if 𝐱i\boldsymbol{\mathrm{x}}_{i} is a standard normal Gaussian vector. ∎
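
As an illustrative check of the base case of this recurrence (i = 1, j = k = 2, Gaussian 𝐱_1, where the bound is an equality), the following sketch compares a Monte Carlo estimate of the left-hand side against the closed-form right-hand side; the dimensions, seed, and sample count are arbitrary choices.

```python
import numpy as np

d = 2
rng = np.random.default_rng(3)
B = rng.standard_normal((d ** 2, d ** 2))
Bt = B.reshape(d, d, d, d)          # axes: [row_1, row_2, col_1, col_2]

# Monte Carlo estimate of E || M_{:1}(B) ||_F^2 over a Gaussian vector x_1.
n = 500_000
X = rng.standard_normal((n, d))
M = np.einsum('na,apbq,nb->npq', X, Bt, X)      # one M_{:1}(B) per sample of x_1
lhs = (M ** 2).sum(axis=(1, 2)).mean()

# Closed-form right-hand side: ||tr_{1}(B)||_F^2 + 2 ||(B + B^{T_{1}}) / 2||_F^2.
tr1_B = np.trace(Bt, axis1=0, axis2=2)                      # tr_{1}(B)
B_pt1 = np.swapaxes(Bt, 0, 2).reshape(d ** 2, d ** 2)       # B^{T_{1}}
rhs = np.linalg.norm(tr1_B) ** 2 + 2 * np.linalg.norm((B + B_pt1) / 2) ** 2
print(lhs, rhs)     # equal up to Monte Carlo error
```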

Notice that this lemma takes in an arbitrary matrix 𝐁\mathbf{B}, instead of the specific choice of 𝐁=12(𝐀+𝐀)\mathbf{B}=\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}) from earlier. This is useful, as we apply Lemma 25 repeatedly in the following proof:

Lemma 26.

Fix i[k]i\in[k], and let 𝐁dj×dj\mathbf{B}\in{\mathbb{R}}^{d^{j}\times d^{j}} for some jij\geq i. Let ¯𝐁=12i𝒱[i]𝐁𝒱\bar{}\mathbf{B}=\frac{1}{2^{i}}\sum_{{\mathcal{V}}\subseteq[i]}\mathbf{B}^{\intercal_{{\mathcal{V}}}}. Then,

𝔼𝐱:i[:i(𝐁)F2]𝒮[i]2i|𝒮|tr𝒮(¯𝐁)F2,\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]\leq\sum_{{\mathcal{S}}\subseteq[i]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\bar{}\mathbf{B})\rVert}_{F}^{2},

with equality if 𝐱1,,𝐱i\boldsymbol{\mathrm{x}}_{1},\ldots,\boldsymbol{\mathrm{x}}_{i} are iid standard normal Gaussian vectors.

Proof.

We prove this by induction over i=1,,ji=1,\ldots,j. For notational clarity in this induction, we let fi(𝐁):=12i𝒱[i]𝐁𝒱f_{i}(\mathbf{B})\;{\vcentcolon=}\;\frac{1}{2^{i}}\sum_{{\mathcal{V}}\subseteq[i]}\mathbf{B}^{\intercal_{{\mathcal{V}}}}. Then, our goal is to prove that 𝔼𝐱:i[:i(𝐁)F2]𝒮[i]2i|𝒮|tr𝒮(fi(𝐁))F2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]\leq\sum_{{\mathcal{S}}\subseteq[i]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} for all ii and 𝐁\mathbf{B}. We start with the base case i=1i=1. First note that tr{1}(𝐁)=tr{1}(12(𝐁+𝐁{1}))=tr{1}(f1(𝐁))\operatorname{tr}_{\{1\}}(\mathbf{B})=\operatorname{tr}_{\{1\}}(\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{1\}}}))=\operatorname{tr}_{\{1\}}(f_{1}(\mathbf{B})) by Lemma 21. Then Lemma 25 gives us that

𝔼𝐱1[{1}(𝐁)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{1}}[\mathinner{\lVert{\mathcal{M}}_{\{1\}}(\mathbf{B})\rVert}_{F}^{2}] tr{1}(𝐁)F2+212(𝐁+𝐁{1})F2\displaystyle\leq\mathinner{\lVert\operatorname{tr}_{\{1\}}(\mathbf{B})\rVert}_{F}^{2}+2\mathinner{\lVert{\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{1\}}})\rVert}_{F}^{2}
=211tr{1}(f1(𝐁))F2+210f1(𝐁)F2\displaystyle=2^{1-1}\mathinner{\lVert\operatorname{tr}_{\{1\}}(f_{1}(\mathbf{B}))\rVert}_{F}^{2}+2^{1-0}\mathinner{\lVert f_{1}(\mathbf{B})\rVert}_{F}^{2}
=𝒮[1]21|𝒮|tr𝒮(f1(𝐁))F2.\displaystyle=\sum_{{\mathcal{S}}\subseteq[1]}2^{1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(f_{1}(\mathbf{B}))\rVert}_{F}^{2}.

Next, for the induction case, we assume the lemma holds for i1i-1. Since fi(𝐁)f_{i}(\mathbf{B}) is the average of all partial transposes of 𝐁\mathbf{B} over subsets of the first ii subsystems, we can expand fi(𝐁)=fi1(12(𝐁+𝐁{i}))f_{i}(\mathbf{B})=f_{i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})). In particular, by our induction hypothesis we have that

𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2𝒮[i1]2i1|𝒮|tr𝒮(fi1(12(𝐁+𝐁{i})))F2=𝒮[i1]2i1|𝒮|tr𝒮(fi(𝐁))F2.\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}\leq\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})))\rVert}_{F}^{2}=\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2}.

Therefore, by Lemma 25, by our inductive hypothesis, and by the composition of partial traces, we have that

𝔼𝐱:i[:i(𝐁)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}] 𝔼𝐱:i1:i1(tr{i}(𝐁))F2\displaystyle\leq\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{\{i\}}(\mathbf{B}))\rVert}_{F}^{2} +2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
=𝔼𝐱:i1:i1(tr{i}(12(𝐁+𝐁{i})))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{\{i\}}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})))\rVert}_{F}^{2} +2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
𝒮[i1]2i1|𝒮|tr𝒮(tr{i}(fi(𝐁)))F2\displaystyle\leq\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\operatorname{tr}_{\{i\}}(f_{i}(\mathbf{B})))\rVert}_{F}^{2} +2𝒮[i1]2i1|𝒮|tr𝒮(fi(𝐁))F2\displaystyle+2\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2}
=𝒮[i1]2i1|𝒮|tr𝒮{i}(fi(𝐁))F2\displaystyle=\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i\}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} +2𝒮[i1]2i1|𝒮|tr𝒮(fi(𝐁))F2\displaystyle+2\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2}
=𝒮[i1]2i(|𝒮|+1)tr𝒮{i}(fi(𝐁))F2\displaystyle=\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-(|{{\mathcal{S}}}|+1)}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i\}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} +𝒮[i1]2i|𝒮|tr𝒮(fi(𝐁))F2\displaystyle+\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} (4)
=𝒮[i]2i|𝒮|tr𝒮(fi(𝐁))F2\displaystyle=\sum_{{\mathcal{S}}^{\prime}\subseteq[i]}2^{i-|{{\mathcal{S}}^{\prime}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}^{\prime}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} (5)

The second inequality applies the induction hypothesis to the matrix tr{i}(12(𝐁+𝐁{i}))\operatorname{tr}_{\{i\}}(\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})) on the left term, and applies the induction hypothesis to the matrix 12(𝐁+𝐁{i})\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}) on the right term.

Further, Equation 5 holds by comparing the first and second terms in Equation 4, where the left sum covers all 𝒮[i]{\mathcal{S}}^{\prime}\subseteq[i] such that i𝒮i\in{\mathcal{S}}^{\prime}, and where the right sum covers all 𝒮[i]{\mathcal{S}}^{\prime}\subseteq[i] such that i𝒮i\notin{\mathcal{S}}^{\prime}. Note that the only inequalities used are Equation 3 and Lemma 25, as well as the induction hypothesis, which are all equalities when considering iid standard normal Gaussian vectors. ∎

We now have all the tools needed to prove the bound on the variance of the Kronecker-Hutchinson Estimator.

Theorem 1 Restated.

Fix d,kd,k\in{\mathbb{N}}, and let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}, where each 𝐱i\boldsymbol{\mathrm{x}}_{i} is independent and is either a Rademacher vector or a 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) Gaussian vector. Let ¯𝐀:=12k𝒱[k]𝐀𝒱\bar{}\mathbf{A}\;{\vcentcolon=}\;\frac{1}{2^{k}}\sum_{{\mathcal{V}}\subseteq[k]}\mathbf{A}^{\intercal_{{\mathcal{V}}}} be the average of all partial transposes of 𝐀\mathbf{A}. Then, Var[𝐱𝐀𝐱]𝒮[k]2k|𝒮|tr𝒮(¯𝐀)F2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq\sum_{{\mathcal{S}}\subset[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2}. Furthermore, if all 𝐱i\boldsymbol{\mathrm{x}}_{i} are 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}), then this expression is an equality.

Proof.

We start again with the observation in Equation 2, which we use to prove the following claim by induction over i=1,,ki=1,\ldots,k:

𝔼𝐱:i[(𝐱:itri+1:(𝐀)𝐱:i)2](tr(𝐀))2+2tr2:(12(𝐀+𝐀))F2+2j=1i1𝔼𝐱:j:j(trj+2:(12(𝐀+𝐀)))F2,\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}^{\intercal}\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i})^{2}]\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}+2\sum_{j=1}^{i-1}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}j}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}j}(\operatorname{tr}_{j+2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}, (6)

where we interpret trk+1:(𝐀):=𝐀\operatorname{tr}_{k+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\;{\vcentcolon=}\;\mathbf{A}. The base case for Equation 6 is when i=1i=1, where we immediately have

𝔼𝐱1[(𝐱1tr2:(𝐀)𝐱1)2](tr(tr2:(𝐀)))2+2tr2:(12(𝐀+𝐀))F2=(tr(𝐀))2+2tr2:(12(𝐀+𝐀))F2\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{1}}[(\boldsymbol{\mathrm{x}}_{1}^{\intercal}\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\boldsymbol{\mathrm{x}}_{1})^{2}]\leq(\operatorname{tr}(\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}=(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}

which completes the base case. Now we assume Equation 6 holds for i1i-1, and we show that it holds for ii:

𝔼𝐱:i[(𝐱:itri+1:(𝐀)𝐱:i)2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}^{\intercal}\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i})^{2}]
=𝔼𝐱:i1𝔼𝐱i[(𝐱i:i1(tri+1:(𝐀))𝐱i)2𝐱:i1]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{i}}[(\boldsymbol{\mathrm{x}}_{i}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))\boldsymbol{\mathrm{x}}_{i})^{2}\mid\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}]
𝔼𝐱:i1[(tr(:i1(tri+1:(𝐀))))2+2:i1(tri+1:(12(𝐀+𝐀)))F2]\displaystyle\leq\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\left[(\operatorname{tr}({\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))))^{2}+2\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}\right] (Lemma 23)
=𝔼𝐱:i1[(𝐱:i1tri:(𝐀)𝐱:i1)2]+𝔼𝐱:i12:i1(tri+1:(12(𝐀+𝐀)))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}[(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}^{\intercal}\operatorname{tr}_{i\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1})^{2}]+\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}2\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}
(tr(𝐀))2+2tr2:(12(𝐀+𝐀))F2+2j=1i1𝔼𝐱:j:j(trj+2:(12(𝐀+𝐀)))F2\displaystyle\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}+2\sum_{j=1}^{i-1}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}j}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}j}(\operatorname{tr}_{j+2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}

where the last line applies the induction hypothesis. This completes the proof by induction of Equation 6. We can then appeal to Lemma 26 and merge terms together. First, let 𝐂i:=12i𝒱[i](12(𝐀+𝐀))𝒱\mathbf{C}_{i}\;{\vcentcolon=}\;\frac{1}{2^{i}}\sum_{{\mathcal{V}}\subseteq[i]}(\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}))^{\intercal_{{\mathcal{V}}}}, and note that 𝐂0=12(𝐀+𝐀)\mathbf{C}_{0}={\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}). Then,

𝔼[(𝐱𝐀𝐱)2]\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}] (tr(𝐀))2+2tr2:(12(𝐀+𝐀))F2+2i=1k1𝔼𝐱:i:i(tri+2:(12(𝐀+𝐀)))F2\displaystyle\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}+2\sum_{i=1}^{k-1}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\operatorname{tr}_{i+2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}
(tr(𝐀))2+2tr2:(𝐂0)F2+2i=1k1𝒮[i]2i|𝒮|tr𝒮{i+2,,k}(𝐂i)F2\displaystyle\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{C}_{0})\rVert}_{F}^{2}+2\sum_{i=1}^{k-1}\sum_{{\mathcal{S}}\subseteq[i]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+2,\ldots,k\}}(\mathbf{C}_{i})\rVert}_{F}^{2}
=(tr(𝐀))2+i=0k1𝒮[i]2i+1|𝒮|tr𝒮{i+2,,k}(𝐂i)F2\displaystyle=(\operatorname{tr}(\mathbf{A}))^{2}+\sum_{i=0}^{k-1}\sum_{{\mathcal{S}}\subseteq[i]}2^{i+1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+2,\ldots,k\}}(\mathbf{C}_{i})\rVert}_{F}^{2}
=(tr(𝐀))2+i=1k𝒮[i1]2i|𝒮|tr𝒮{i+1,,k}(𝐂i1)F2\displaystyle=(\operatorname{tr}(\mathbf{A}))^{2}+\sum_{i=1}^{k}\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\mathbf{C}_{i-1})\rVert}_{F}^{2}

We proceed by simplifying the partial trace tr𝒮{i+1,,k}(𝐂i1)\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\mathbf{C}_{i-1}). First, note that 𝐂i1\mathbf{C}_{i-1} is the average of all partial transposes of 12(𝐀+𝐀)\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}) in the first i1i-1 subsystems. By Lemma 21, we can extend this to be the average of all partial transposes of 12(𝐀+𝐀)\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}) across all subsets 𝒱[i1]{i+1,,k}{\mathcal{V}}\subseteq[i-1]\cup\{i+1,\ldots,k\}, so that

tri+1:(𝐂i1)\displaystyle\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{C}_{i-1}) =tri+1:(12k1𝒱[k]{i}(12(𝐀+𝐀))𝒱)\displaystyle=\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}\left(\frac{1}{2^{k-1}}\sum_{{\mathcal{V}}\subseteq[k]\setminus\{i\}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))^{\intercal_{{\mathcal{V}}}}\right)
=tri+1:(12k1𝒱[k]{i}𝐀𝒱+12k1𝒱[k]{i}(𝐀)𝒱2).\displaystyle=\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}\left(\frac{\frac{1}{2^{k-1}}\sum_{{\mathcal{V}}\subseteq[k]\setminus\{i\}}\mathbf{A}^{\intercal_{{\mathcal{V}}}}+\frac{1}{2^{k-1}}\sum_{{\mathcal{V}}\subseteq[k]\setminus\{i\}}(\mathbf{A}^{\intercal})^{\intercal_{{\mathcal{V}}}}}{2}\right).

The left term above is the average of all partial transposes of 𝐀\mathbf{A} that do not include subsystem ii. Since (𝐀)𝒱=𝐀¯𝒱(\mathbf{A}^{\intercal})^{\intercal_{{\mathcal{V}}}}=\mathbf{A}^{\intercal_{\bar{}{\mathcal{V}}}} where ¯𝒱:=[k]𝒱\bar{}{\mathcal{V}}\;{\vcentcolon=}\;[k]\setminus{\mathcal{V}}, the right term above is the average of all partial transposes of 𝐀\mathbf{A} that do include ii. So, we find that tri+1:(𝐂i1)=tri+1:(12k𝒱[k]𝐀𝒱)=tri+1:(¯𝐀)\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{C}_{i-1})=\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\frac{1}{2^{k}}\sum_{{\mathcal{V}}\subseteq[k]}\mathbf{A}^{\intercal_{{\mathcal{V}}}})=\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\bar{}\mathbf{A}). We are therefore left with the expression

𝔼[(𝐱𝐀𝐱)2](tr(𝐀))2+i=1k𝒮[i1]2i|𝒮|tr𝒮{i+1,,k}(¯𝐀)F2.\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]\leq(\operatorname{tr}(\mathbf{A}))^{2}+\sum_{i=1}^{k}\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\bar{}\mathbf{A})\rVert}_{F}^{2}. (7)

We now focus on the double summation in Equation 7. We start by understanding the inner summation in terms of the index ii from the outer summation. Let

F(i):=𝒮[i1]2i|𝒮|tr𝒮{i+1,,k}(¯𝐀)F2F(i)\;{\vcentcolon=}\;\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\bar{}\mathbf{A})\rVert}_{F}^{2}

so that

i=1k𝒮[i1]2i|𝒮|tr𝒮{i+1,,k}(¯𝐀)F2=i=1kF(i).\sum_{i=1}^{k}\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\bar{}\mathbf{A})\rVert}_{F}^{2}=\sum_{i=1}^{k}F(i).

We see that F(i)F(i) is a weighted sum of tr^𝒮(¯𝐀)F2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2} across all ^𝒮[k]\hat{}{\mathcal{S}}\subseteq[k] of the form ^𝒮=𝒮{i+1,,k}\hat{}{\mathcal{S}}={\mathcal{S}}\cup\{i+1,\ldots,k\} where 𝒮[i1]{\mathcal{S}}\subseteq[i-1]. That is, we consider all ^𝒮\hat{}{\mathcal{S}} that satisfy two properties:

  1. 1.

    Indices i+1,,ki+1,\ldots,k are all included: {i+1,,k}^𝒮\{i+1,\ldots,k\}\subseteq\hat{}{\mathcal{S}}

  2. 2.

    Index ii is not included: i^𝒮i\notin\hat{}{\mathcal{S}}

We can intuitively think of ii as a “Gap” that precedes the “Suffix” {i+1,,k}\{i+1,\ldots,k\}. Crucially, if tr^𝒮(¯𝐀)F2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2} appears in a term of F(i)F(i), then tr^𝒮(¯𝐀)F2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2} does not appear in a term of any other F(j)F(j). This is because each ^𝒮\hat{}{\mathcal{S}} has a unique gap: for any ^𝒮\hat{}{\mathcal{S}}, let i[k]i\in[k] be the smallest value such that {i+1,,k}^𝒮\{i+1,\ldots,k\}\subseteq\hat{}{\mathcal{S}}. Then ii is the gap of ^𝒮\hat{}{\mathcal{S}}. We note an edge-case, where if ^𝒮={1,,k}\hat{}{\mathcal{S}}=\{1,\ldots,k\} then the gap is i=0i=0 and we recover the term tr^𝒮(¯𝐀)F2=(tr(¯𝐀))2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2}=(\operatorname{tr}(\bar{}\mathbf{A}))^{2} in Equation 7.

Next, note that every ^𝒮\hat{}{\mathcal{S}} is scaled by 2i|𝒮|2^{i-|{{\mathcal{S}}}|}. Since |^𝒮|=|𝒮{i+1,,k}|=|𝒮|+ki|\hat{}{\mathcal{S}}|=|{{\mathcal{S}}\cup\{i+1,\ldots,k\}}|=|{{\mathcal{S}}}|+k-i, we know |𝒮|i=|^𝒮|k|{{\mathcal{S}}}|-i=|\hat{}{\mathcal{S}}|-k, so we can write 2i|𝒮|=2k|^𝒮|2^{i-|{{\mathcal{S}}}|}=2^{k-|\hat{}{\mathcal{S}}|}. So, we have shown that every ^𝒮[k]\hat{}{\mathcal{S}}\subseteq[k] appears exactly once in Equation 7, and each term tr^𝒮(¯𝐀)F2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2} is scaled by 2k|^𝒮|2^{k-|\hat{}{\mathcal{S}}|}. That is,

𝔼[(𝐱𝐀𝐱)2]\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}] ^𝒮[k]2k|^𝒮|tr^𝒮(¯𝐀)F2,\displaystyle\leq\sum_{\hat{}{\mathcal{S}}\subseteq[k]}2^{k-|\hat{}{\mathcal{S}}|}\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2},

which completes the proof since

Var[𝐱𝐀𝐱]=𝔼[(𝐱𝐀𝐱)2](𝔼[𝐱𝐀𝐱])2^𝒮[k]2k|^𝒮|tr^𝒮(¯𝐀)F2.\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]-(\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}])^{2}\leq\sum_{\hat{}{\mathcal{S}}\subset[k]}2^{k-|\hat{}{\mathcal{S}}|}\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2}.

Note that all inequalities in the proof become equalities if we know that 𝐱1,,𝐱k\boldsymbol{\mathrm{x}}_{1},\ldots,\boldsymbol{\mathrm{x}}_{k} are all iid standard normal Gaussians. ∎
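
As a numerical sanity check of Theorem 1 (our own illustration, not part of the formal argument), the following sketch compares a Monte Carlo estimate of Var[𝐱^⊺𝐀𝐱] for Gaussian Kronecker vectors against the closed-form expression, for a small random 𝐀 with d = k = 2; the helper functions implement the partial trace and partial transpose by tensor reshaping, and all sizes and seeds are arbitrary choices.

```python
import itertools
import numpy as np

d, k = 2, 2
rng = np.random.default_rng(4)
A = rng.standard_normal((d ** k, d ** k))

def partial_transpose(M, V):
    T = M.reshape((d,) * (2 * k))
    for i in V:
        T = np.swapaxes(T, i - 1, k + i - 1)
    return T.reshape(d ** k, d ** k)

def partial_trace(M, S):
    T = M.reshape((d,) * (2 * k))
    m = k
    for i in sorted(S, reverse=True):       # trace out the largest subsystem first
        T = np.trace(T, axis1=i - 1, axis2=m + i - 1)
        m -= 1
    return np.asarray(T).reshape(d ** m, d ** m)

subsets = [set(S) for r in range(k + 1)
           for S in itertools.combinations(range(1, k + 1), r)]

# Average of all partial transposes of A.
A_bar = sum(partial_transpose(A, V) for V in subsets) / 2 ** k

# Right-hand side of Theorem 1: sum over proper subsets of [k].
rhs = sum(2 ** (k - len(S)) * np.linalg.norm(partial_trace(A_bar, S)) ** 2
          for S in subsets if len(S) < k)

# Monte Carlo estimate of Var[x^T A x] for x = x_1 (x) x_2 with Gaussian factors.
n = 1_000_000
X1 = rng.standard_normal((n, d))
X2 = rng.standard_normal((n, d))
X = np.einsum('ni,nj->nij', X1, X2).reshape(n, d ** k)
quad = np.einsum('ni,ij,nj->n', X, A, X)
print(quad.var(), rhs)      # equal up to Monte Carlo error
```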

3.1 Proof of Theorem 2

The variance expression in Theorem 1 is extremely tight, being an equality in the case that our query vectors are Kronecker products of Gaussian vectors. However, it is not immediately clear how we should interpret this expression. We now prove Theorem 2, which gives a more approachable upper bound than the exact expression in Theorem 1. We start with a short lemma:

Lemma 27.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} and 𝒱[k]{\mathcal{V}}\subseteq[k]. Then, 𝐀𝒱F=𝐀F\mathinner{\lVert\mathbf{A}^{\intercal_{{\mathcal{V}}}}\rVert}_{F}=\mathinner{\lVert\mathbf{A}\rVert}_{F}.

Proof.

Consider any i𝒱i\in{\mathcal{V}}. Then, recall that each standard basis vector 𝐞i^dk\boldsymbol{\mathrm{e}}_{\hat{i}}\in{\mathbb{R}}^{d^{k}} can be decomposed as 𝐞i^=𝐞j^𝐞k^𝐞^\boldsymbol{\mathrm{e}}_{\hat{i}}=\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}} where 𝐞j^di1\boldsymbol{\mathrm{e}}_{\hat{j}}\in{\mathbb{R}}^{d^{i-1}}, 𝐞k^d\boldsymbol{\mathrm{e}}_{\hat{k}}\in{\mathbb{R}}^{d}, and 𝐞^dki\boldsymbol{\mathrm{e}}_{\hat{\ell}}\in{\mathbb{R}}^{d^{k-i}}. Then, we can expand the norm of 𝐀{i}\mathbf{A}^{\intercal_{\{i\}}} as

𝐀{i}F2\displaystyle\mathinner{\lVert\mathbf{A}^{\intercal_{\{i\}}}\rVert}_{F}^{2} =i^1,i^2=1dk(𝐞i^1𝐀{i}𝐞i^2)2\displaystyle=\sum_{\hat{i}_{1},\hat{i}_{2}=1}^{d^{k}}(\boldsymbol{\mathrm{e}}_{\hat{i}_{1}}^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}\boldsymbol{\mathrm{e}}_{\hat{i}_{2}})^{2}
=j^1,j^2=1di1k^1,k^2=1d^1,^2=1dki((𝐞j^1𝐞k^1𝐞^1)𝐀{i}(𝐞j^2𝐞k^2𝐞^2))2\displaystyle=\sum_{\hat{j}_{1},\hat{j}_{2}=1}^{d^{i-1}}\sum_{\hat{k}_{1},\hat{k}_{2}=1}^{d}\sum_{\hat{\ell}_{1},\hat{\ell}_{2}=1}^{d^{k-i}}((\boldsymbol{\mathrm{e}}_{\hat{j}_{1}}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{1}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{1}})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\boldsymbol{\mathrm{e}}_{\hat{j}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{2}}))^{2}
=j^1,j^2=1di1k^1,k^2=1d^1,^2=1dki(m^,n^=1d(𝐞j^1𝐞k^1𝐞m^𝐞n^𝐞^1)𝐀(𝐞j^2𝐞m^𝐞n^𝐞k^2𝐞^2))2\displaystyle=\sum_{\hat{j}_{1},\hat{j}_{2}=1}^{d^{i-1}}\sum_{\hat{k}_{1},\hat{k}_{2}=1}^{d}\sum_{\hat{\ell}_{1},\hat{\ell}_{2}=1}^{d^{k-i}}\left(\sum_{\hat{m},\hat{n}=1}^{d}(\boldsymbol{\mathrm{e}}_{\hat{j}_{1}}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{1}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{m}}\boldsymbol{\mathrm{e}}_{\hat{n}}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{1}}^{\intercal})\mathbf{A}(\boldsymbol{\mathrm{e}}_{\hat{j}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{m}}\boldsymbol{\mathrm{e}}_{\hat{n}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{k}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{2}})\right)^{2}
=j^1,j^2=1di1k^1,k^2=1d^1,^2=1dki((𝐞j^1𝐞k^2𝐞^1)𝐀(𝐞j^2𝐞k^1𝐞^2))2\displaystyle=\sum_{\hat{j}_{1},\hat{j}_{2}=1}^{d^{i-1}}\sum_{\hat{k}_{1},\hat{k}_{2}=1}^{d}\sum_{\hat{\ell}_{1},\hat{\ell}_{2}=1}^{d^{k-i}}((\boldsymbol{\mathrm{e}}_{\hat{j}_{1}}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{2}}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{1}}^{\intercal})\mathbf{A}(\boldsymbol{\mathrm{e}}_{\hat{j}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{1}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{2}}))^{2}
=i^1,i^2=1dk(𝐞i^1𝐀𝐞i^2)2\displaystyle=\sum_{\hat{i}_{1},\hat{i}_{2}=1}^{d^{k}}(\boldsymbol{\mathrm{e}}_{\hat{i}_{1}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{e}}_{\hat{i}_{2}})^{2}
=𝐀F2,\displaystyle=\mathinner{\lVert\mathbf{A}\rVert}_{F}^{2},

where we remove the sum over m^\hat{m} and n^\hat{n} by noting that 𝐞k^1𝐞m^\boldsymbol{\mathrm{e}}_{\hat{k}_{1}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{m}} and 𝐞n^𝐞k^2\boldsymbol{\mathrm{e}}_{\hat{n}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{k}_{2}} are zero unless m^=k^1\hat{m}=\hat{k}_{1} and n^=k^2\hat{n}=\hat{k}_{2}. We then complete the lemma by decomposing 𝐀𝒱=(((𝐀{i1}){i2})){i|𝒱|}\mathbf{A}^{\intercal_{{\mathcal{V}}}}=(((\mathbf{A}^{\intercal_{\{i_{1}\}}})^{\intercal_{\{i_{2}\}}})\cdots)^{\intercal_{\{i_{|{{\mathcal{V}}}|}\}}}, where 𝒱={i1,,i|𝒱|}{\mathcal{V}}=\{i_{1},\ldots,i_{|{{\mathcal{V}}}|}\}. ∎
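
A quick numerical illustration of Lemma 27 (ours, with arbitrary sizes and seed): a partial transpose only permutes the entries of 𝐀, so the Frobenius norm is unchanged.

```python
import numpy as np

d, k = 2, 3
rng = np.random.default_rng(5)
A = rng.standard_normal((d ** k, d ** k))

def partial_transpose(M, V):
    T = M.reshape((d,) * (2 * k))
    for i in V:
        T = np.swapaxes(T, i - 1, k + i - 1)
    return T.reshape(d ** k, d ** k)

# The Frobenius norm is invariant under any partial transpose.
assert np.isclose(np.linalg.norm(partial_transpose(A, {1, 3})), np.linalg.norm(A))
```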

We now prove the theorem:

Theorem 2 Restated.

Fix d,kd,k\in{\mathbb{N}}, and let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}, where each 𝐱i\boldsymbol{\mathrm{x}}_{i} is independent and is either a Rademacher vector or a 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) Gaussian vector. Then, Var[𝐱𝐀𝐱]𝒮[k]2k|𝒮|tr𝒮(𝐀)F2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq\sum_{{\mathcal{S}}\subset[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\mathbf{A})\rVert}_{F}^{2}. Furthermore, if all 𝐱i\boldsymbol{\mathrm{x}}_{i} are 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) and if 𝐀=¯𝐀\mathbf{A}=\bar{}\mathbf{A}, then this expression is an equality.

Proof.

We prove that tr𝒮(¯𝐀)Ftr𝒮(𝐀)F\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\bar{}\mathbf{A})\rVert}_{F}\leq\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}, which implies the theorem. First, recall that the partial trace is linear, so that by the triangle inequality we can write

tr𝒮(¯𝐀)F=12k𝒱[k]tr𝒮(𝐀𝒱)F12k𝒱[k]tr𝒮(𝐀𝒱)F.\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\bar{}\mathbf{A})\rVert}_{F}=\mathinner{\lVert{\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})\rVert}_{F}\leq{\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})\rVert}_{F}.

Next, we examine the norm tr𝒮(𝐀𝒱)F\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})\rVert}_{F}. First, by Lemma 21, we can assume without loss of generality that 𝒱𝒮={\mathcal{V}}\cap{\mathcal{S}}=\emptyset. Then, for notational simplicity, suppose that 𝒱={1,,|𝒱|}{\mathcal{V}}=\{1,\ldots,|{{\mathcal{V}}}|\}. We can then write tr𝒮(𝐀𝒱)=(tr𝒮(𝐀))𝒱\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})=(\operatorname{tr}_{\mathcal{S}}(\mathbf{A}))^{\intercal_{{\mathcal{V}}}}. To see this, fix any i𝒱i\in{\mathcal{V}} and j𝒮j\in{\mathcal{S}}. By our without-loss-of-generality assumptions, we know that i<ji<j, and can therefore expand

tr{j}(𝐀{i})\displaystyle\operatorname{tr}_{\{j\}}(\mathbf{A}^{\intercal_{\{i\}}}) =i^=1d(𝐈j1𝐞i^𝐈kj)𝐀{i}(𝐈j1𝐞i^𝐈kj)\displaystyle=\sum_{\hat{i}=1}^{d}(\mathbf{I}^{\otimes j-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-j})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\mathbf{I}^{\otimes j-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-j})
=i^,j^,^=1d(𝐈i1𝐞j^𝐞^𝐈ji1𝐞i^𝐈kj)𝐀(𝐈i1𝐞j^𝐞^𝐈ji1𝐞i^𝐈kj)\displaystyle=\sum_{\hat{i},\hat{j},\hat{\ell}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes j-i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\otimes\mathbf{I}^{\otimes k-j})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes j-i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-j})
=j^,^=1d(𝐈i1𝐞j^𝐞^𝐈ki1)tr{j}(𝐀)(𝐈i1𝐞j^𝐞^𝐈ki1)\displaystyle=\sum_{\hat{j},\hat{\ell}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i-1})\operatorname{tr}_{\{j\}}(\mathbf{A})(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i-1})
=(tr{j}(𝐀)){i}.\displaystyle=(\operatorname{tr}_{\{j\}}(\mathbf{A}))^{\intercal_{\{i\}}}.

So, we can pull all of the partial transposes from inside of the partial trace to outside the partial trace. Then, by Lemma 27, we can conclude that

tr𝒮(¯𝐀)F\displaystyle\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\bar{}\mathbf{A})\rVert}_{F} 12k𝒱[k]tr𝒮(𝐀𝒱)F\displaystyle\leq{\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})\rVert}_{F}
=12k𝒱[k](tr𝒮(𝐀))𝒱F\displaystyle={\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\mathinner{\lVert(\operatorname{tr}_{\mathcal{S}}(\mathbf{A}))^{\intercal_{{\mathcal{V}}}}\rVert}_{F}
=12k𝒱[k]tr𝒮(𝐀)F\displaystyle={\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}
=tr𝒮(𝐀)F.\displaystyle=\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}.

While Theorem 2 is more approachable than Theorem 1, it is still not immediately clear how to think of this rate. In order to get a better understanding of this variance, we turn to the worst-case performance of this estimator for the class of PSD matrices.

4 Cost of the Kronecker-Hutchinson Estimator for PSD Matrices

Trace estimation is often studied under the assumption that the input matrix 𝐀\mathbf{A} is PSD. This assumption is very common and reasonable in practice, and without such an assumption, it is impossible for an algorithm such as Hutchinson’s Estimator to achieve a (1±ε)(1\pm\varepsilon) relative error approximation to the trace of 𝐀\mathbf{A}. This is because an indefinite matrix 𝐀\mathbf{A} can have tr(𝐀)=0\operatorname{tr}(\mathbf{A})=0 while having large Frobenius norm, meaning that Hutchinson’s Estimator would have to perfectly recover the trace of 𝐀\mathbf{A} despite having a nonzero variance. This would incur infinite sample complexity. However, if we assume that 𝐀\mathbf{A} is PSD, then 𝐀Ftr(𝐀)\mathinner{\lVert\mathbf{A}\rVert}_{F}\leq\operatorname{tr}(\mathbf{A}), and so we know that the standard deviation of the classical Hutchinson’s Estimator with \ell samples is at most 2𝐀F2tr(𝐀)\sqrt{\frac{2}{\ell}}\mathinner{\lVert\mathbf{A}\rVert}_{F}\leq\sqrt{\frac{2}{\ell}}\operatorname{tr}(\mathbf{A}). That is, =O(1ε2)\ell=O(\frac{1}{\varepsilon^{2}}) samples suffice to achieve (1±ε)(1\pm\varepsilon) relative error with at least constant probability.
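
To make this classical rate concrete, here is a small illustrative sketch (the matrix, accuracy target, and seed are arbitrary choices) of the k=1 estimator on a PSD matrix with ℓ = ⌈2/ε²⌉ Rademacher queries.

```python
import numpy as np

rng = np.random.default_rng(6)
D, eps = 100, 0.05
G = rng.standard_normal((D, D))
A = G @ G.T                                     # a random PSD matrix
l = int(np.ceil(2 / eps ** 2))                  # ~1/eps^2 samples
X = rng.choice([-1.0, 1.0], size=(l, D))        # Rademacher query vectors
estimate = np.einsum('li,ij,lj->l', X, A, X).mean()
print(abs(estimate - np.trace(A)) / np.trace(A))    # typically well below eps
```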

We now make the same assumption and study the Kronecker-Hutchinson Estimator, and show that =O(3kε2)\ell=O(\frac{3^{k}}{\varepsilon^{2}}) samples suffice to estimate the trace of a PSD matrix to (1±ε)(1\pm\varepsilon) relative error. To start, we need a basic property of the partial trace:

Lemma 28.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} be a PSD matrix, and let 𝒮[k]{\mathcal{S}}\subseteq[k]. Then, tr𝒮(𝐀)\operatorname{tr}_{\mathcal{S}}(\mathbf{A}) is also PSD.

Proof.

Recall that if 𝐀\mathbf{A} is PSD, then 𝐁𝐀𝐁\mathbf{B}^{\intercal}\mathbf{A}\mathbf{B} is also PSD for any matrix 𝐁\mathbf{B} of compatible dimension. Further recall that if 𝐀\mathbf{A} and 𝐂\mathbf{C} are both PSD, then 𝐀+𝐂\mathbf{A}+\mathbf{C} is also PSD. Since tr𝒮(𝐀)\operatorname{tr}_{\mathcal{S}}(\mathbf{A}) is a composition of such operations on 𝐀\mathbf{A}, we get that tr𝒮(𝐀)\operatorname{tr}_{\mathcal{S}}(\mathbf{A}) is also PSD. ∎
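
A quick numerical illustration of Lemma 28 (ours, with arbitrary sizes and seed): tracing out subsystems of a random PSD matrix leaves a matrix with nonnegative eigenvalues.

```python
import numpy as np

d, k = 2, 3
rng = np.random.default_rng(7)
G = rng.standard_normal((d ** k, d ** k))
A = G @ G.T                                     # a random PSD matrix

def partial_trace(M, S):
    T = M.reshape((d,) * (2 * k))
    m = k
    for i in sorted(S, reverse=True):
        T = np.trace(T, axis1=i - 1, axis2=m + i - 1)
        m -= 1
    return np.asarray(T).reshape(d ** m, d ** m)

# The partial trace of a PSD matrix is PSD (up to numerical round-off).
assert np.linalg.eigvalsh(partial_trace(A, {2, 3})).min() >= -1e-10
```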

Lemma 29 (Folklore).

Let 𝐁\mathbf{B} be a PSD matrix. Then, 𝐁Ftr(𝐁)\mathinner{\lVert\mathbf{B}\rVert}_{F}\leq\operatorname{tr}(\mathbf{B}).

We now prove our next core theorem:

Theorem 4 Restated.

Fix d,kd,k\in{\mathbb{N}}, and let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} be a PSD matrix. Let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}, where each 𝐱i\boldsymbol{\mathrm{x}}_{i} is either a Rademacher vector or a 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) Gaussian vector. Then, Var[𝐱𝐀𝐱]3k(tr(𝐀))2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq 3^{k}(\operatorname{tr}(\mathbf{A}))^{2}.

Proof.

We first apply Theorem 2, then Lemmas 28 and 29, and then Lemma 16 to get

𝔼[(𝐱𝐀𝐱)2]𝒮[k]2k|𝒮|tr𝒮(𝐀)F2𝒮[k]2k|𝒮|(tr(tr𝒮(𝐀)))2=(tr(𝐀))2𝒮[k]2k|𝒮|.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]\leq\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\leq\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}(\operatorname{tr}(\operatorname{tr}_{\mathcal{S}}(\mathbf{A})))^{2}=(\operatorname{tr}(\mathbf{A}))^{2}\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}.

So we turn our attention to bounding this sum. If we let σi:= 1[i𝒮]\sigma_{i}\;{\vcentcolon=}\;\mathbbm{1}_{[i\notin{\mathcal{S}}]}, then we can write k|𝒮|=iσik-|{{\mathcal{S}}}|=\sum_{i}\sigma_{i}, and so

𝒮[k]2k|𝒮|=𝝈{0,1}k2iσi=σ1{0,1}2σ1((σk1{0,1}2σk1(σk{0,1}2σk)))=3k.\displaystyle\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}=\sum_{\boldsymbol{\sigma}\in\{0,1\}^{k}}2^{\sum_{i}\sigma_{i}}=\sum_{\sigma_{1}\in\{0,1\}}2^{\sigma_{1}}\left(\cdots\left(\sum_{\sigma_{k-1}\in\{0,1\}}2^{\sigma_{k-1}}\left(\sum_{\sigma_{k}\in\{0,1\}}2^{\sigma_{k}}\right)\right)\right)=3^{k}.

We then conclude that Var[𝐱𝐀𝐱]𝔼[(𝐱𝐀𝐱)2]3k(tr(𝐀))2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]\leq 3^{k}(\operatorname{tr}(\mathbf{A}))^{2}. ∎

In particular, if we take \ell samples in the Kronecker-Hutchinson Estimator, we get a standard deviation of Var[H(𝐀)]3k(tr(𝐀))\sqrt{\operatorname*{Var}[H_{\ell}(\mathbf{A})]}\leq\sqrt{\frac{3^{k}}{\ell}}(\operatorname{tr}(\mathbf{A})), which is less than εtr(𝐀)\varepsilon\operatorname{tr}(\mathbf{A}) for 3kε2\ell\geq\frac{3^{k}}{\varepsilon^{2}}. More formally, by Chebyshev’s inequality, we find that =O(3kε2)\ell=O(\frac{3^{k}}{\varepsilon^{2}}) samples suffices to get

(1ε)tr(𝐀)H(𝐀)(1+ε)tr(𝐀)(1-\varepsilon)\operatorname{tr}(\mathbf{A})\leq H_{\ell}(\mathbf{A})\leq(1+\varepsilon)\operatorname{tr}(\mathbf{A})

with constant probability.
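For concreteness, here is a NumPy sketch of the Kronecker-Hutchinson Estimator itself (our own illustrative code; the function name kron_hutchinson and all parameters are ours). For small d and k we form the matrix explicitly, whereas in applications only the Kronecker-matrix-vector oracle would be available:

```python
import numpy as np
from functools import reduce

def kron_hutchinson(matvec, d, k, num_samples, rng):
    """Kronecker-Hutchinson Estimator: average of x^T A x over probes
    x = x_1 kron ... kron x_k with independent Rademacher factors in R^d."""
    total = 0.0
    for _ in range(num_samples):
        factors = [rng.choice([-1.0, 1.0], size=d) for _ in range(k)]
        x = reduce(np.kron, factors)          # Kronecker-structured probe
        total += x @ matvec(x)
    return total / num_samples

rng = np.random.default_rng(1)
d, k = 3, 4
D = d ** k
B = rng.standard_normal((D, D))
A = B @ B.T                                   # PSD matrix of size d^k x d^k
estimate = kron_hutchinson(lambda v: A @ v, d, k, num_samples=2000, rng=rng)
print(estimate, np.trace(A))
```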

4.1 A Matching Lower Bound

We might hope that the painful rate of 3kε2\ell\geq\frac{3^{k}}{\varepsilon^{2}} results from a loose analysis of the Kronecker-Hutchinson Estimator. Or, perhaps, that some more clever choice of random vector distribution can avoid this painful exponential dependence. We show that this is not the case, at least not among vectors with real iid mean-zero unit-variance components:

Theorem 6 Restated.

Fix d,kd,k\in{\mathbb{N}}. Consider Hutchinson’s Estimator run with independent vectors 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} where 𝐱i\boldsymbol{\mathrm{x}}_{i} are vectors of real iid zero-mean unit-variance entries. Then, there exists a PSD matrix 𝐀\mathbf{A} such that Hutchinson’s Estimator needs =Ω((32d)kε2)\ell=\Omega(\frac{(3-\frac{2}{d})^{k}}{\varepsilon^{2}}) samples in order to have standard deviation εtr(𝐀)\leq\varepsilon\operatorname{tr}(\mathbf{A}).

Proof.

We let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} be the all-ones matrix. If we let 𝐞d\boldsymbol{\mathrm{e}}\in{\mathbb{R}}^{d} denote the all-ones vector, then we can write 𝐀=i=1k𝐞𝐞\mathbf{A}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\intercal}. This is helpful because we can then use the Mixed-Product property (Lemma 10) to show that

𝐱𝐀𝐱=i=1k𝐱i𝐞𝐞𝐱i=i=1k(𝐞𝐱i)2,\textstyle{\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{x}}_{i}^{\intercal}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{x}}_{i}=\prod_{i=1}^{k}(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{x}}_{i})^{2}},

where we can replace the Kronecker product with a scalar product because the Kronecker product of scalars is just the scalar product. Notably, we see that 𝐱𝐀𝐱\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}} is a product of kk iid terms. Since we are interested in the variance of the estimator, we actually want to examine

𝔼[(𝐱𝐀𝐱)2]=i=1k𝔼[(𝐞𝐱i)4]=(𝔼[(𝐞𝐱1)4])k.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{x}}_{i})^{4}]=\left(\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{x}}_{1})^{4}]\right)^{k}.

So, letting 𝐲:=𝐱1\boldsymbol{\mathrm{y}}\;{\vcentcolon=}\;\boldsymbol{\mathrm{x}}_{1} in order to simplify notation, we turn our attention to lower bounding the expectation of (𝐞𝐲)4=(i=1dyi)4(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{y}})^{4}=(\sum_{i=1}^{d}y_{i})^{4} where 𝐲\boldsymbol{\mathrm{y}} is a vector of iid zero-mean unit-variance terms. We expand this sum as

𝔼[(i=1dyi)4]=i,j,m,n=1d𝔼[yiyjymyn].\textstyle\operatorname*{\mathbb{E}}[(\sum_{i=1}^{d}y_{i})^{4}]=\sum_{i,j,m,n=1}^{d}\operatorname*{\mathbb{E}}[y_{i}y_{j}y_{m}y_{n}].

This expectation is zero if any random variable appears exactly once, like how 𝔼[y1y23]=𝔼[y1]𝔼[y23]=0\operatorname*{\mathbb{E}}[y_{1}y_{2}^{3}]=\operatorname*{\mathbb{E}}[y_{1}]\operatorname*{\mathbb{E}}[y_{2}^{3}]=0 by independence. So, we can reduce the sum above to terms that either look like 𝔼[yi4]\operatorname*{\mathbb{E}}[y_{i}^{4}] or like 𝔼[yi2yj2]\operatorname*{\mathbb{E}}[y_{i}^{2}y_{j}^{2}]. Terms like 𝔼[yi4]\operatorname*{\mathbb{E}}[y_{i}^{4}] occur exactly dd times. Terms like 𝔼[yi2yj2]\operatorname*{\mathbb{E}}[y_{i}^{2}y_{j}^{2}] occur exactly 3d23d3d^{2}-3d times (this can happen in 3 ways: if i=jm=ni=j\neq m=n, if i=mj=ni=m\neq j=n, or if i=nj=mi=n\neq j=m; each such pattern occurs d2dd^{2}-d times, so the overall pattern occurs 3d23d3d^{2}-3d times). We can then bound 𝔼[yi4](𝔼[yi2])2=1\operatorname*{\mathbb{E}}[y_{i}^{4}]\geq(\operatorname*{\mathbb{E}}[y_{i}^{2}])^{2}=1 by Jensen's Inequality, and 𝔼[yi2yj2]=𝔼[yi2]𝔼[yj2]=1\operatorname*{\mathbb{E}}[y_{i}^{2}y_{j}^{2}]=\operatorname*{\mathbb{E}}[y_{i}^{2}]\operatorname*{\mathbb{E}}[y_{j}^{2}]=1 by independence. So, we get that

𝔼[(𝐱𝐀𝐱)2]=(𝔼[(𝐞𝐲)4])k(d+3d23d)k=(3d22d)k.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]=\left(\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{y}})^{4}]\right)^{k}\geq\left(d+3d^{2}-3d\right)^{k}=(3d^{2}-2d)^{k}.

Noting that tr(𝐀)=dk\operatorname{tr}(\mathbf{A})=d^{k}, this in turn gives a variance lower bound of

Var[𝐱𝐀𝐱]=𝔼[(𝐱𝐀𝐱)2](tr(𝐀))2(3d22d)kd2k.\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]-(\operatorname{tr}(\mathbf{A}))^{2}\geq(3d^{2}-2d)^{k}-d^{2k}.

Then, since Var[H(𝐀)]=1Var[𝐱𝐀𝐱]\operatorname*{Var}[H_{\ell}(\mathbf{A})]=\frac{1}{\ell}\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}], in order for the Kronecker-Hutchinson’s Estimator to have standard deviation εtr(𝐀)\varepsilon\operatorname{tr}(\mathbf{A}), we need to take

Var[𝐱𝐀𝐱]ε2(tr(𝐀))2(3d22d)kd2kε2d2k=(32d)k1ε2=Ω((32d)kε2).\ell\geq\frac{\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}\geq\frac{(3d^{2}-2d)^{k}-d^{2k}}{\varepsilon^{2}d^{2k}}=\frac{(3-\frac{2}{d})^{k}-1}{\varepsilon^{2}}=\Omega\left(\frac{(3-\frac{2}{d})^{k}}{\varepsilon^{2}}\right).
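As a quick empirical check of this calculation (our own sketch, outside the formal proof; the sample size is arbitrary and the empirical value is noisy), the following snippet estimates the single-sample variance relative to tr(A)^2 for the all-ones hard instance with Rademacher factors and compares it against (3 - 2/d)^k - 1:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, trials = 3, 4, 200_000
tr_A = float(d ** k)                          # tr(A) for the all-ones matrix A = (ee^T)^{kron k}

# For this A, x^T A x = prod_i (e^T x_i)^2, so only the column sums of the
# Rademacher factors matter.
signs = rng.choice([-1.0, 1.0], size=(trials, k, d))
samples = np.prod(signs.sum(axis=2) ** 2, axis=1)

print("empirical Var[x^T A x] / tr(A)^2 :", samples.var() / tr_A ** 2)
print("predicted (3 - 2/d)^k - 1        :", (3 - 2 / d) ** k - 1)
```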

We note that this lower bound uses a rank-one matrix, which is interesting because we can exactly compute the trace of any rank-one matrix with a single Kronecker-matrix-vector product. In particular, if 𝐀=𝐮𝐮\mathbf{A}=\boldsymbol{\mathrm{u}}\boldsymbol{\mathrm{u}}^{\intercal} for some 𝐮dk\boldsymbol{\mathrm{u}}\in{\mathbb{R}}^{d^{k}}, then for any vector 𝐱\boldsymbol{\mathrm{x}} not orthogonal to 𝐮\boldsymbol{\mathrm{u}}, we have that

𝐀𝐱22𝐱𝐀𝐱=𝐱𝐮𝐮𝐮𝐮𝐱𝐱𝐮𝐮𝐱=𝐮𝐮=tr(𝐀).\frac{\mathinner{\lVert\mathbf{A}\boldsymbol{\mathrm{x}}\rVert}_{2}^{2}}{\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}}=\frac{\boldsymbol{\mathrm{x}}^{\intercal}\boldsymbol{\mathrm{u}}\cdot\boldsymbol{\mathrm{u}}^{\intercal}\boldsymbol{\mathrm{u}}\cdot\boldsymbol{\mathrm{u}}^{\intercal}\boldsymbol{\mathrm{x}}}{\boldsymbol{\mathrm{x}}^{\intercal}\boldsymbol{\mathrm{u}}\cdot\boldsymbol{\mathrm{u}}^{\intercal}\boldsymbol{\mathrm{x}}}=\boldsymbol{\mathrm{u}}^{\intercal}\boldsymbol{\mathrm{u}}=\operatorname{tr}(\mathbf{A}).

Since a Kronecker of Gaussian vectors is not orthogonal to any fixed vector 𝐮\boldsymbol{\mathrm{u}} with probability 1, we can exactly compute this trace with probability 1. That is to say, the hard instance used in this lower bound is truly only hard for the Kronecker-Hutchinson Estimator, and is not hard in general for the Kronecker-matrix-vector oracle model.
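The following short sketch (ours) carries out this single-query trace computation for a rank-one matrix, forming the probe as a Kronecker product of Gaussian vectors and applying the matrix only once:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(3)
d, k = 3, 4
u = rng.standard_normal(d ** k)               # A = u u^T is rank one
matvec = lambda v: u * (u @ v)                # applies A without forming it

x = reduce(np.kron, [rng.standard_normal(d) for _ in range(k)])  # Kronecker Gaussian probe
Ax = matvec(x)                                # the single Kronecker-matrix-vector product
trace_estimate = (Ax @ Ax) / (x @ Ax)         # ||Ax||^2 / (x^T A x) = u^T u = tr(A)
print(trace_estimate, u @ u)
```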

The lower bound is also structured in a second way – the matrix 𝐀=i=1k𝐞𝐞\mathbf{A}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\intercal} is not only rank-one but also has Kronecker structure. In Section 7 we prove that we can exactly compute the trace of any matrix 𝐀\mathbf{A} with Kronecker structure by using kd+1kd+1 Kronecker-matrix-vector products. That is to say, the hard instance used in this lower bound is in a second sense only hard for the Kronecker-Hutchinson’s Estimator, and its class of hard input matrices is not hard in general for the Kronecker-matrix-vector oracle model. That said, we do not have a clear sense of what algorithms may be able to bridge this gap and admit much faster trace estimation.

4.2 Estimating the Trace of a Random Rank-One Matrix

To round off our discussion of PSD matrix trace estimation, we show that estimating the trace of a random rank-one matrix still has an exponential dependence on kk, but a milder dependence than shown in the worst-case bound of Theorem 6. This proof serves two purposes. First, supposing our upper bound is tight, it suggests that the Kronecker-Hutchinson Estimator can still have an exponentially slow rate of convergence for matrices that are far from having Kronecker structure. Second, it shows how the variance bound from Theorem 2 can be used to get a tightened understanding of the concentration of the Kronecker-Hutchinson Estimator for particular matrices.

We study the random rank-one matrix 𝐀=𝐠𝐠dk×dk\mathbf{A}=\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal}\in{\mathbb{R}}^{d^{k}\times d^{k}}, where 𝐠𝒩(𝟎,𝐈)\boldsymbol{\mathrm{g}}\sim{\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}). By analyzing the expected squared Frobenius norm of the partial traces of 𝐀\mathbf{A}, we prove the following:

Theorem 3 Restated.

Fix d,kd,k\in{\mathbb{N}}. Let 𝐀=𝐠𝐠\mathbf{A}=\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal} where 𝐠dk\boldsymbol{\mathrm{g}}\in{\mathbb{R}}^{d^{k}} is a vector of iid standard normal Gaussians. Then, with probability 34\frac{3}{4}, we know that Var[𝐱𝐀𝐱|𝐠]O((2+1d)k(tr(𝐀))2)\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]\leq O((2+\frac{1}{d})^{k}(\operatorname{tr}(\mathbf{A}))^{2}). In particular, with probability 23\frac{2}{3}, we know that the Kronecker-Hutchinson Estimator returns a (1±ε)(1\pm\varepsilon) relative error estimate to the trace of 𝐀\mathbf{A} if =O(1ε2(2+1d)k)\ell=O(\frac{1}{\varepsilon^{2}}(2+\frac{1}{d})^{k}).

Proof.

We start by analyzing the expected squared Frobenius norm of the partial trace of 𝐀\mathbf{A} with respect to a set of subsystems 𝒮[k]{\mathcal{S}}\subseteq[k]. Let i:=|𝒮|i\;{\vcentcolon=}\;|{{\mathcal{S}}}|, and note that without loss of generality we can assume that 𝒮={1,,i}{\mathcal{S}}=\{1,\ldots,i\} when trying to compute 𝔼[tr𝒮(𝐀)F2]\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}]. We then decompose the vector 𝐠=[𝐠1𝐠di]\boldsymbol{\mathrm{g}}=\left[\begin{smallmatrix}\boldsymbol{\mathrm{g}}_{1}\\ \vdots\\ \boldsymbol{\mathrm{g}}_{d^{i}}\end{smallmatrix}\right] where each block 𝐠jdki\boldsymbol{\mathrm{g}}_{j}\in{\mathbb{R}}^{d^{k-i}} is a standard normal Gaussian vector. Then, note that we can decompose 𝐠=(𝐞1𝐠1)++(𝐞di𝐠di)\boldsymbol{\mathrm{g}}=(\boldsymbol{\mathrm{e}}_{1}\otimes\boldsymbol{\mathrm{g}}_{1})+\ldots+(\boldsymbol{\mathrm{e}}_{d^{i}}\otimes\boldsymbol{\mathrm{g}}_{d^{i}}). This lets us write out

𝐠(𝐞j𝐈ki)=j^=1di(𝐞j^𝐠j^)(𝐞j𝐈ki)=j^=1di(𝐞j^𝐞j𝐠j^)=𝐠j.\boldsymbol{\mathrm{g}}^{\intercal}(\boldsymbol{\mathrm{e}}_{j}\otimes\mathbf{I}^{\otimes k-i})=\sum_{\hat{j}=1}^{d^{i}}(\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\boldsymbol{\mathrm{g}}_{\hat{j}})^{\intercal}(\boldsymbol{\mathrm{e}}_{j}\otimes\mathbf{I}^{\otimes k-i})=\sum_{\hat{j}=1}^{d^{i}}(\boldsymbol{\mathrm{e}}_{\hat{j}}^{\intercal}\boldsymbol{\mathrm{e}}_{j}\otimes\boldsymbol{\mathrm{g}}_{\hat{j}}^{\intercal})=\boldsymbol{\mathrm{g}}_{j}^{\intercal}.

Then, if we define 𝐆=[𝐠1𝐠di]dki×di\mathbf{G}=\begin{bmatrix}\boldsymbol{\mathrm{g}}_{1}\,\cdots~{}\boldsymbol{\mathrm{g}}_{d^{i}}\end{bmatrix}\in{\mathbb{R}}^{d^{k-i}\times d^{i}}, we see that

tr𝒮(𝐠𝐠)=j=1di(𝐞j𝐈ki)𝐠𝐠(𝐞j𝐈ki)=j=1di𝐠j𝐠j=𝐆𝐆.\operatorname{tr}_{{\mathcal{S}}}(\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal})=\sum_{j=1}^{d^{i}}(\boldsymbol{\mathrm{e}}_{j}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal}(\boldsymbol{\mathrm{e}}_{j}\otimes\mathbf{I}^{\otimes k-i})=\sum_{j=1}^{d^{i}}\boldsymbol{\mathrm{g}}_{j}\boldsymbol{\mathrm{g}}_{j}^{\intercal}=\mathbf{G}\mathbf{G}^{\intercal}.

Noticing that 𝐆\mathbf{G} is just a matrix full of iid standard normal Gaussians, we can then bound tr𝒮(𝐀)F2=𝐆𝐆F2=𝐆44\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}=\mathinner{\lVert\mathbf{G}\mathbf{G}^{\intercal}\rVert}_{F}^{2}=\mathinner{\lVert\mathbf{G}\rVert}_{4}^{4} in expectation by just using known properties of the norms of Gaussians, where 4\mathinner{\lVert\cdot\rVert}_{4} denotes the Schatten 4-norm. For instance, Lemma A.1 from [Tropp and Webber, 2023] implies that a standard normal Gaussian matrix 𝐆N×P\mathbf{G}\in{\mathbb{R}}^{N\times P} has 𝔼[𝐆44]=NP(N+P+1)\operatorname*{\mathbb{E}}[\mathinner{\lVert\mathbf{G}\rVert}_{4}^{4}]=NP(N+P+1). So,

𝔼[tr𝒮(𝐀)F2]=𝔼[𝐆44]=dk(dk|𝒮|+d|𝒮|+1).\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\mathbf{A})\rVert}_{F}^{2}]=\operatorname*{\mathbb{E}}[\mathinner{\lVert\mathbf{G}\rVert}_{4}^{4}]=d^{k}(d^{k-|{{\mathcal{S}}}|}+d^{|{{\mathcal{S}}}|}+1).

This in turn implies that

𝔼𝐠[𝒮[k]2k|𝒮|tr𝒮(𝐀)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{g}}}\left[\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\right] =𝒮[k]2k|𝒮|dk(dk|𝒮|+d|𝒮|+1)\displaystyle=\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}d^{k}(d^{k-|{{\mathcal{S}}}|}+d^{|{{\mathcal{S}}}|}+1)
=i=0k(ki)2kidk(dki+di+1).\displaystyle=\sum_{i=0}^{k}\binom{k}{i}2^{k-i}d^{k}(d^{k-i}+d^{i}+1).

The last line above recognizes that we only depended on |𝒮||{{\mathcal{S}}}|, so we could just sum over all values of i=|𝒮|i=|{{\mathcal{S}}}|, scaled by the (ki)\binom{k}{i} times that |𝒮||{{\mathcal{S}}}| appears. Rearranging terms, and recalling that i=0k(ki)ci=(1+c)k\sum_{i=0}^{k}\binom{k}{i}c^{i}=(1+c)^{k}, we get

𝔼𝐠[𝒮[k]2k|𝒮|tr𝒮(𝐀)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{g}}}\left[\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\right] =2kdki=0k(ki)2i(dki+di+1)\displaystyle=2^{k}d^{k}\sum_{i=0}^{k}\binom{k}{i}2^{-i}(d^{k-i}+d^{i}+1)
=2kdki=0k(ki)(dk(12d)i+(d2)i+(12)i)\displaystyle=2^{k}d^{k}\sum_{i=0}^{k}\binom{k}{i}\left(d^{k}({\textstyle\frac{1}{2d}})^{i}+({\textstyle\frac{d}{2}})^{i}+({\textstyle\frac{1}{2}})^{i}\right)
=2kdk(dk(1+12d)k+(1+d2)k+(32)k)\displaystyle=2^{k}d^{k}\left(d^{k}(1+{\textstyle\frac{1}{2d}})^{k}+(1+{\textstyle\frac{d}{2}})^{k}+({\textstyle\frac{3}{2}})^{k}\right)
=2kd2k((1+12d)k+(12+1d)k+(32d)k)\displaystyle=2^{k}d^{2k}\left((1+{\textstyle\frac{1}{2d}})^{k}+({\textstyle\frac{1}{2}}+{\textstyle\frac{1}{d}})^{k}+({\textstyle\frac{3}{2d}})^{k}\right)
2kd2k3(1+12d)k\displaystyle\leq 2^{k}d^{2k}\cdot 3(1+{\textstyle\frac{1}{2d}})^{k}
=3d2k(2+1d)k.\displaystyle=3d^{2k}(2+{\textstyle\frac{1}{d}})^{k}.

Then, by Markov’s Inequality, we know that

\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]\leq\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\leq 8\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{g}}}\left[\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\right]\leq 8\cdot 3d^{2k}(2+{\textstyle\frac{1}{d}})^{k}

holds with probability at least 78\frac{7}{8}. Then, since tr(𝐀)=𝐠22\operatorname{tr}(\mathbf{A})=\mathinner{\lVert\boldsymbol{\mathrm{g}}\rVert}_{2}^{2} is a chi-square random variable with parameter dkd^{k}, we can use Chebyshev's Inequality to show that tr(𝐀)>12dk\operatorname{tr}(\mathbf{A})>\frac{1}{2}d^{k} holds with probability at least 78\frac{7}{8} whenever dk64d^{k}\geq 64. We condition on both of these events, which together hold with probability at least 34\frac{3}{4}. Given these events, the variance of a single sample of the Kronecker-Hutchinson Estimator is therefore

\displaystyle\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]\leq 24d^{2k}(2+{\textstyle\frac{1}{d}})^{k}\leq 96(2+{\textstyle\frac{1}{d}})^{k}(\operatorname{tr}(\mathbf{A}))^{2}.

We can now examine how many samples \ell are needed for the Kronecker-Hutchinson Estimator to achieve a (1\pm\varepsilon) relative error estimate of \operatorname{tr}(\mathbf{A}) with constant probability. By Chebyshev's Inequality, we get that |H_{\ell}(\mathbf{A})-\operatorname{tr}(\mathbf{A})|\leq\sqrt{12\operatorname*{Var}[H_{\ell}(\mathbf{A})\mid\boldsymbol{\mathrm{g}}]}=\frac{\sqrt{12}}{\sqrt{\ell}}\sqrt{\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}\mid\boldsymbol{\mathrm{g}}]} except with probability at most \frac{1}{12}. To make this additive error at most \varepsilon\operatorname{tr}(\mathbf{A}), we take

=12Var[𝐱𝐀𝐱|𝐠]ε2(tr(𝐀))21296(2+1d)k(tr(𝐀))2ε2(tr(𝐀))2=1152(2+1d)kε2.\ell=\frac{12\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}\leq\frac{12\cdot 96(2+{\textstyle\frac{1}{d}})^{k}(\operatorname{tr}(\mathbf{A}))^{2}}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}=1152\frac{(2+{\textstyle\frac{1}{d}})^{k}}{\varepsilon^{2}}.

That is, with constant probability, =O(1ε2(2+1d)k)\ell=O(\frac{1}{\varepsilon^{2}}(2+{\textstyle\frac{1}{d}})^{k}) samples suffice to get a relative error approximation to a random rank-one matrix with constant probability. By allowing larger dd and kk values, and by using sharper concentrations, the constant of 1152 can be significantly improved. ∎

5 The Complex-Valued Kronecker-Matrix-Vector Oracle

Until now, we have only considered using a real-valued Kronecker-matrix-vector oracle, where we are only allowed to compute 𝐀𝐱\mathbf{A}\boldsymbol{\mathrm{x}} for vectors 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} where 𝐱id\boldsymbol{\mathrm{x}}_{i}\in{\mathbb{R}}^{d}. In this section, we consider how our results change when we instead allow 𝐱id\boldsymbol{\mathrm{x}}_{i}\in{\mathbb{C}}^{d}. If we did not have any Kronecker constraint on our oracle, then this change of oracles would be banal. If some algorithm computes qq complex-valued non-Kronecker matrix-vector products with 𝐀\mathbf{A}, then we can find another algorithm which produces the exact same result, but instead uses 2q2q real-valued non-Kronecker matrix-vector products with 𝐀\mathbf{A}. Each complex-valued query gets decomposed as 𝐀𝐱=𝐀𝔢[𝐱]+i𝐀𝔪[𝐱]\mathbf{A}\boldsymbol{\mathrm{x}}=\mathbf{A}\mathfrak{Re}[\boldsymbol{\mathrm{x}}]+i\mathbf{A}\mathfrak{Im}[\boldsymbol{\mathrm{x}}], which can be evaluated with just two real-valued oracle queries.

However, this picture changes with a Kronecker constraint. If 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} for 𝐱id\boldsymbol{\mathrm{x}}_{i}\in{\mathbb{C}}^{d}, then the real part of the vector 𝔢[𝐱]\mathfrak{Re}[\boldsymbol{\mathrm{x}}] does not necessarily have Kronecker structure as well, and therefore we cannot reproduce the strategy above. We can see this even with very small vectors:

[1i][1i]=[1ii1]which gives𝔢([1ii1])=[1001]and𝔪([1ii1])=[0110].\begin{bmatrix}1\\ i\end{bmatrix}\otimes\begin{bmatrix}1\\ i\end{bmatrix}=\begin{bmatrix}1\\ i\\ i\\ -1\end{bmatrix}\hskip 19.91684pt\text{which gives}\hskip 19.91684pt\mathfrak{Re}\left(\begin{bmatrix}1\\ i\\ i\\ -1\end{bmatrix}\right)=\begin{bmatrix}1\\ 0\\ 0\\ -1\end{bmatrix}\hskip 14.22636pt\text{and}\hskip 14.22636pt\mathfrak{Im}\left(\begin{bmatrix}1\\ i\\ i\\ -1\end{bmatrix}\right)=\begin{bmatrix}0\\ 1\\ 1\\ 0\end{bmatrix}.

The real and imaginary component vectors do not have Kronecker structure. We can still simulate a complex-valued oracle with a real-valued oracle, but this requires far more real-valued oracle queries in the worst case. Specifically, we expand the real and imaginary parts of each vector 𝐱i\boldsymbol{\mathrm{x}}_{i}. To keep the notation simple, let 𝐱i=𝐫i+i𝐦i\boldsymbol{\mathrm{x}}_{i}=\boldsymbol{\mathrm{r}}_{i}+i\boldsymbol{\mathrm{m}}_{i} decompose the real and imaginary parts of 𝐱id\boldsymbol{\mathrm{x}}_{i}\in{\mathbb{C}}^{d}, so that 𝐫i,𝐦id\boldsymbol{\mathrm{r}}_{i},\boldsymbol{\mathrm{m}}_{i}\in{\mathbb{R}}^{d}. Then,

𝐱1𝐱2𝐱k\displaystyle\boldsymbol{\mathrm{x}}_{1}\otimes\boldsymbol{\mathrm{x}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} =(𝐫1+i𝐦1)𝐱2𝐱k\displaystyle=(\boldsymbol{\mathrm{r}}_{1}+i\boldsymbol{\mathrm{m}}_{1})\otimes\boldsymbol{\mathrm{x}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}
=(𝐫1𝐱2𝐱k)+i(𝐦1𝐱2𝐱k)\displaystyle=(\boldsymbol{\mathrm{r}}_{1}\otimes\boldsymbol{\mathrm{x}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})+i(\boldsymbol{\mathrm{m}}_{1}\otimes\boldsymbol{\mathrm{x}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})
=(𝐫1𝐫2𝐱k)+i(𝐫1𝐦2𝐱k)\displaystyle=(\boldsymbol{\mathrm{r}}_{1}\otimes\boldsymbol{\mathrm{r}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})+i(\boldsymbol{\mathrm{r}}_{1}\otimes\boldsymbol{\mathrm{m}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})
+i(𝐦1𝐫2𝐱k)(𝐦1𝐦2𝐱k)\displaystyle\hskip 28.45274pt+i(\boldsymbol{\mathrm{m}}_{1}\otimes\boldsymbol{\mathrm{r}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})-(\boldsymbol{\mathrm{m}}_{1}\otimes\boldsymbol{\mathrm{m}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})
=\displaystyle=\ldots

where we continue to expand this out into a sum of 2k2^{k} real-valued Kronecker-structured vectors. So, if some algorithm solves a problem using qq complex-valued Kronecker-structured matrix-vector products with 𝐀\mathbf{A}, then we can only immediately guarantee the existence of an algorithm that computes 2kq2^{k}q real-valued Kronecker-structured products with 𝐀\mathbf{A} that also solves the problem.
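The sketch below (ours, for small k) makes this expansion concrete: it writes a complex Kronecker-structured vector as a sum of 2^k real Kronecker-structured vectors, where each term picks either the real or the imaginary part of every factor and carries a coefficient that is a power of i:

```python
import numpy as np
from functools import reduce
from itertools import product

rng = np.random.default_rng(4)
d, k = 3, 3
factors = [rng.standard_normal(d) + 1j * rng.standard_normal(d) for _ in range(k)]
x = reduce(np.kron, factors)                  # complex Kronecker-structured vector

# Expand x into 2^k real Kronecker-structured vectors, one per choice of
# real part (0) or imaginary part (1) in each factor.
expansion = np.zeros(d ** k, dtype=complex)
for choice in product([0, 1], repeat=k):
    real_factors = [f.real if c == 0 else f.imag for f, c in zip(factors, choice)]
    coeff = 1j ** sum(choice)                 # i raised to the number of imaginary parts
    expansion += coeff * reduce(np.kron, real_factors)

print(np.allclose(x, expansion))              # True
```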

Keeping this exponential gap between the real-valued and complex-valued oracles in mind, we now turn our attention back to the Kronecker-Hutchinson Estimator. In the non-Kronecker case, it is well known that the Hutchinson Estimator performs well with complex queries:

Lemma 30.

Let 𝐀D×D\mathbf{A}\in{\mathbb{R}}^{D\times D} and let 𝐠D\boldsymbol{\mathrm{g}}\in{\mathbb{C}}^{D} be a vector of either iid complex Rademacher or complex standard normal Gaussian entries. That is, we have 𝐠=12𝐫+i2𝐦\boldsymbol{\mathrm{g}}=\frac{1}{\sqrt{2}}\boldsymbol{\mathrm{r}}+\frac{i}{\sqrt{2}}\boldsymbol{\mathrm{m}} where 𝐫\boldsymbol{\mathrm{r}} and 𝐦\boldsymbol{\mathrm{m}} are iid real-valued vectors of either Rademacher or standard normal Gaussian entries. Then, 𝔼[𝐠H𝐀𝐠]=tr(𝐀)\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{g}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{g}}]=\operatorname{tr}(\mathbf{A}) and Var[𝐠H𝐀𝐠]𝐀+𝐀2F2𝐀F2\operatorname*{Var}[\boldsymbol{\mathrm{g}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{g}}]\leq\mathinner{\lVert\frac{\mathbf{A}+\mathbf{A}^{\intercal}}{2}\rVert}_{F}^{2}\leq\mathinner{\lVert\mathbf{A}\rVert}_{F}^{2}, where the first inequality is tight if 𝐠\boldsymbol{\mathrm{g}} is complex Gaussian, and the second is tight if 𝐀\mathbf{A} is symmetric.
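As a quick Monte Carlo illustration of Lemma 30 (our own sketch; the matrix, sample counts, and seed are arbitrary), both the real and complex Gaussian estimators should be empirically unbiased, and for a symmetric matrix the complex estimator's single-sample variance should be roughly half of the real one's:

```python
import numpy as np

rng = np.random.default_rng(5)
D, trials = 40, 100_000
B = rng.standard_normal((D, D))
A = (B + B.T) / 2                             # a symmetric test matrix

r = rng.standard_normal((trials, D))
m = rng.standard_normal((trials, D))
real_samples = np.sum((r @ A) * r, axis=1)                       # x^T A x, real Gaussian probes
g = (r + 1j * m) / np.sqrt(2)                                    # complex Gaussian probes
complex_samples = np.real(np.sum((g @ A) * g.conj(), axis=1))    # g^H A g

print("trace        :", np.trace(A))
print("real:    mean", real_samples.mean(), " var", real_samples.var())
print("complex: mean", complex_samples.mean(), " var", complex_samples.var())
print("2 ||A||_F^2  :", 2 * np.linalg.norm(A, 'fro') ** 2)
print("  ||A||_F^2  :", np.linalg.norm(A, 'fro') ** 2)
```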

Note that the difference between Lemma 23 and Lemma 30 is just a factor of 2 in the variance term. However, this factor of 2 has been extremely influential in all of our proofs so far. Improving this constant to be 1 and following the exact same proofs give the following results:

Theorem 7 Restated.

Fix d,kd,k\in{\mathbb{N}}, and let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}, where each 𝐱i\boldsymbol{\mathrm{x}}_{i} is either a complex Rademacher vector or a complex Gaussian vector. Let ¯𝐀:=12k𝒱[k]𝐀𝒱\bar{}\mathbf{A}\;{\vcentcolon=}\;\frac{1}{2^{k}}\sum_{{\mathcal{V}}\subseteq[k]}\mathbf{A}^{\intercal_{{\mathcal{V}}}} be the average of all partial transposes of 𝐀\mathbf{A}. Then, Var[𝐱H𝐀𝐱]𝒮[k]tr𝒮(¯𝐀)F2𝒮[k]tr𝒮(𝐀)F2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq\sum_{{\mathcal{S}}\subset[k]}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2}\leq\sum_{{\mathcal{S}}\subset[k]}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\mathbf{A})\rVert}_{F}^{2}. If 𝐱i\boldsymbol{\mathrm{x}}_{i} are complex Gaussians, then the first expression is an equality. If we further know that 𝐀=¯𝐀\mathbf{A}=\bar{}\mathbf{A}, then both expressions are equalities. Further, if 𝐀\mathbf{A} is PSD, then we have Var[𝐱H𝐀𝐱]2k(tr(𝐀))2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq 2^{k}(\operatorname{tr}(\mathbf{A}))^{2}.

Proof.

Retrace the proofs of Lemmas 25 and 26 as well as Theorems 2 and 4. Anywhere that a 2 appears, replace it with a 1. ∎

We can similarly show that this dependence of 2k2^{k} is essentially tight:

Theorem 9 Restated.

Fix d,kd,k\in{\mathbb{N}}. Consider Hutchinson’s Estimator run with vectors 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} where 𝐱i\boldsymbol{\mathrm{x}}_{i} are vectors of iid complex zero-mean unit-variance entries. Then, there exists a PSD matrix 𝐀\mathbf{A} such that Hutchinson’s Estimator needs =Ω((21d)kε2)\ell=\Omega(\frac{(2-{\textstyle\frac{1}{d}})^{k}}{\varepsilon^{2}}) samples in order to have standard deviation εtr(𝐀)\leq\varepsilon\operatorname{tr}(\mathbf{A}).

Proof.

Like in the proof of Theorem 6, we take 𝐀=i=1k𝐞𝐞\mathbf{A}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\intercal} where 𝐞d\boldsymbol{\mathrm{e}}\in{\mathbb{R}}^{d} is the all-ones vector. In fact, we follow the proof of Theorem 6 rather closely, but we recreate much of it to handle the complex conjugation correctly. First, we expand the product

𝔼[(𝐱H𝐀𝐱)2]=i=1k𝔼[(𝐱iH𝐞𝐞H𝐱i)2]=(𝔼[(𝐞H𝐱1)2(𝐞H𝐱1)2¯])k.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}_{i}^{\textsf{H}}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{x}}_{i})^{2}]=\left(\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{x}}_{1})^{2}\overline{(\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{x}}_{1})^{2}}]\right)^{k}.

We let 𝐲:=𝐱1\boldsymbol{\mathrm{y}}\;{\vcentcolon=}\;\boldsymbol{\mathrm{x}}_{1} to simplify the notation. Then, we expand this inner expectation:

𝔼[(𝐞H𝐲)2(𝐞H𝐲)2¯]=𝔼[(i,j=1dyiyj)(m,n=1dymyn¯)]=i,j,m,n=1d𝔼[yiyjymyn¯].\textstyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{y}})^{2}\overline{(\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{y}})^{2}}]=\operatorname*{\mathbb{E}}[(\sum_{i,j=1}^{d}y_{i}y_{j})(\sum_{m,n=1}^{d}\overline{y_{m}y_{n}})]=\sum_{i,j,m,n=1}^{d}\operatorname*{\mathbb{E}}[y_{i}y_{j}\overline{y_{m}y_{n}}].

This expectation is zero if any random variable appears exactly once, like how 𝔼[y1y2y22¯]=𝔼[y1]𝔼[y2y22¯]=0\operatorname*{\mathbb{E}}[y_{1}y_{2}\overline{y_{2}^{2}}]=\operatorname*{\mathbb{E}}[y_{1}]\operatorname*{\mathbb{E}}[y_{2}\overline{y_{2}^{2}}]=0 by independence. So, we only need to consider terms in this sum where we do not have any one term appear alone. This can happen in three different ways.

First, if i=j=m=ni=j=m=n, then by Jensen’s inequality, we get 𝔼[yiyjymyn¯]=𝔼[|yi|4](𝔼[|yi|2])2=1\operatorname*{\mathbb{E}}[y_{i}y_{j}\overline{y_{m}y_{n}}]=\operatorname*{\mathbb{E}}[|y_{i}|^{4}]\geq(\operatorname*{\mathbb{E}}[|y_{i}|^{2}])^{2}=1. This case occurs dd times in the sum. Second, if i=jm=ni=j\neq m=n, then by the iid distribution of the entry of 𝐲\boldsymbol{\mathrm{y}}, we get 𝔼[yiyjymyn¯]=𝔼[yi2]𝔼[ym2]¯=|𝔼[yi2]|20\operatorname*{\mathbb{E}}[y_{i}y_{j}\overline{y_{m}y_{n}}]=\operatorname*{\mathbb{E}}[y_{i}^{2}]\overline{\operatorname*{\mathbb{E}}[y_{m}^{2}]}=|\operatorname*{\mathbb{E}}[y_{i}^{2}]|^{2}\geq 0. Lastly, if i=mj=ni=m\neq j=n or i=nj=mi=n\neq j=m, then 𝔼[yiyjymyn¯]=𝔼[|yi|2]𝔼[|yj|2]=1\operatorname*{\mathbb{E}}[y_{i}y_{j}\overline{y_{m}y_{n}}]=\operatorname*{\mathbb{E}}[|y_{i}|^{2}]\operatorname*{\mathbb{E}}[|y_{j}|^{2}]=1. This case occurs 2(d2d)2(d^{2}-d) times in the sum. So, we get that

𝔼[(𝐱H𝐀𝐱)2](2d22d+d)k=(2d2d)k.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]\geq\left(2d^{2}-2d+d\right)^{k}=(2d^{2}-d)^{k}.

Noting that tr(𝐀)=dk\operatorname{tr}(\mathbf{A})=d^{k}, this in turn gives a variance lower bound of

Var[𝐱H𝐀𝐱]=𝔼[(𝐱H𝐀𝐱)2](tr(𝐀))2(2d2d)kd2k.\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]-(\operatorname{tr}(\mathbf{A}))^{2}\geq(2d^{2}-d)^{k}-d^{2k}.

Then, since Var[H(𝐀)]=1Var[𝐱H𝐀𝐱]\operatorname*{Var}[H_{\ell}(\mathbf{A})]=\frac{1}{\ell}\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}], in order for the complex Kronecker-Hutchinson’s Estimator to have standard deviation εtr(𝐀)\varepsilon\operatorname{tr}(\mathbf{A}), we need to take

Var[𝐱H𝐀𝐱]ε2(tr(𝐀))2(2d2d)kd2kε2d2k=(21d)k1ε2=Ω((21d)kε2).\ell\geq\frac{\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}\geq\frac{(2d^{2}-d)^{k}-d^{2k}}{\varepsilon^{2}d^{2k}}=\frac{(2-\frac{1}{d})^{k}-1}{\varepsilon^{2}}=\Omega\left(\frac{(2-{\textstyle\frac{1}{d}})^{k}}{\varepsilon^{2}}\right).

Lastly, we produce a much nicer bound on the sample complexity of the Kronecker-Hutchinson’s Estimator when working with random rank-one matrices:

Theorem 8 Restated.

Fix d,kd,k\in{\mathbb{N}}. Let 𝐀=𝐠𝐠\mathbf{A}=\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal} where 𝐠dk\boldsymbol{\mathrm{g}}\in{\mathbb{R}}^{d^{k}} is a vector of iid standard normal Gaussians. Then, with probability 34\frac{3}{4}, we know that Var[𝐱H𝐀𝐱|𝐠]O((1+1d)k(tr(𝐀))2)\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]\leq O((1+\frac{1}{d})^{k}(\operatorname{tr}(\mathbf{A}))^{2}). In particular, with probability 23\frac{2}{3}, we know that the Kronecker-Hutchinson Estimator returns a (1±ε)(1\pm\varepsilon) relative error estimate to the trace of 𝐀\mathbf{A} if =O(1ε2(1+1d)k)\ell=O(\frac{1}{\varepsilon^{2}}(1+\frac{1}{d})^{k}).

Proof.

Recall from the proof of Theorem 3 that the expected squared Frobenius norm of the partial trace of 𝐀\mathbf{A} is

𝔼[tr𝒮(𝐀)F2]=𝔼[𝐆44]=dk(dk|𝒮|+d|𝒮|+1).\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\mathbf{A})\rVert}_{F}^{2}]=\operatorname*{\mathbb{E}}[\mathinner{\lVert\mathbf{G}\rVert}_{4}^{4}]=d^{k}(d^{k-|{{\mathcal{S}}}|}+d^{|{{\mathcal{S}}}|}+1).

Then, we can directly bound

𝒮[k]𝔼[tr𝒮(𝐀)F2]\displaystyle\sum_{{\mathcal{S}}\subseteq[k]}\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}] =𝒮[k]dk(dk|𝒮|+d|𝒮|+1)\displaystyle=\sum_{{\mathcal{S}}\subseteq[k]}d^{k}(d^{k-|{{\mathcal{S}}}|}+d^{|{{\mathcal{S}}}|}+1)
=i=0k(ki)dk(dki+di+1).\displaystyle=\sum_{i=0}^{k}\binom{k}{i}d^{k}(d^{k-i}+d^{i}+1).

Where the last line recognizes that we only depended on |𝒮||{{\mathcal{S}}}|, so we could just sum over all values of i=|𝒮|i=|{{\mathcal{S}}}|, scaled by the (ki)\binom{k}{i} times that |𝒮||{{\mathcal{S}}}| appears. Rearranging terms, and recalling that i=0k(ki)ci=(1+c)k\sum_{i=0}^{k}\binom{k}{i}c^{i}=(1+c)^{k}, we get

𝒮[k]𝔼[tr𝒮(𝐀)F2]\displaystyle\sum_{{\mathcal{S}}\subseteq[k]}\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}] =dki=0k(ki)(dki+di+1)\displaystyle=d^{k}\sum_{i=0}^{k}\binom{k}{i}(d^{k-i}+d^{i}+1)
=dki=0k(ki)(dk(1d)i+di+1i)\displaystyle=d^{k}\sum_{i=0}^{k}\binom{k}{i}\left(d^{k}({\textstyle\frac{1}{d}})^{i}+d^{i}+1^{i}\right)
=dk(dk(1+1d)k+(1+d)k+2k)\displaystyle=d^{k}\left(d^{k}(1+{\textstyle\frac{1}{d}})^{k}+(1+d)^{k}+2^{k}\right)
=d2k(2(1+1d)k+(2d)k)\displaystyle=d^{2k}\left(2(1+{\textstyle\frac{1}{d}})^{k}+({\textstyle\frac{2}{d}})^{k}\right)
3d2k(1+1d)k.\displaystyle\leq 3d^{2k}(1+{\textstyle\frac{1}{d}})^{k}.

As in the proof of Theorem 3, we know that tr(𝐀)12dk\operatorname{tr}(\mathbf{A})\geq\frac{1}{2}d^{k} with probability at least 78\frac{7}{8}. Further, applying Theorem 7 to the realized 𝐠\boldsymbol{\mathrm{g}} and Markov's Inequality over the draw of 𝐠\boldsymbol{\mathrm{g}}, with probability at least 78\frac{7}{8} we can bound

\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}\mid\boldsymbol{\mathrm{g}}]\leq\sum_{{\mathcal{S}}\subseteq[k]}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\leq 8\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{g}}}\left[\sum_{{\mathcal{S}}\subseteq[k]}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\right]\leq 24d^{2k}(1+{\textstyle\frac{1}{d}})^{k}.

Conditioning on both of these events, which together hold with probability at least 34\frac{3}{4}, we therefore have \operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}\mid\boldsymbol{\mathrm{g}}]\leq 96(1+{\textstyle\frac{1}{d}})^{k}(\operatorname{tr}(\mathbf{A}))^{2}.

Following the same probability arguments as in the proof of Theorem 3, we get that taking

=12Var[𝐱H𝐀𝐱]ε2(tr(𝐀))21296(1+1d)k(tr(𝐀))2ε2(tr(𝐀))2=1152(1+1d)kε2\ell=\frac{12\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}\leq\frac{12\cdot 96(1+{\textstyle\frac{1}{d}})^{k}(\operatorname{tr}(\mathbf{A}))^{2}}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}=1152\frac{(1+{\textstyle\frac{1}{d}})^{k}}{\varepsilon^{2}}

samples suffices for the complex-valued Kronecker-Hutchinson’s Estimator to recover a (1±ε)(1\pm\varepsilon) relative error approximation to the trace of 𝐀\mathbf{A} with probability 23\frac{2}{3}. ∎

Recall that the rate in the real-valued case was \ell=O(\frac{1}{\varepsilon^{2}}(2+\frac{1}{d})^{k}), and the rate here in the complex-valued case is \ell=O(\frac{1}{\varepsilon^{2}}(1+\frac{1}{d})^{k}). In the real-valued case, this rate grows at least as fast as 2^{k}, preventing any polynomial dependence on kk from being possible. However, in the complex-valued case, if k=O(d), then the Kronecker-Hutchinson Estimator actually converges in \ell=O(\frac{1}{\varepsilon^{2}}) samples. That is, for these random rank-one matrices, the bound we can show for the real-valued estimator is exponentially slow in kk, while the complex-valued estimator provably converges polynomially quickly!
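The following Monte Carlo sketch (ours; the parameters are arbitrary and the empirical numbers are noisy) compares the per-sample spread of the real-valued and complex-valued Kronecker-Hutchinson Estimators on one draw of the random rank-one matrix 𝐀 = 𝐠𝐠⊺; the gap between the two should be visible and is consistent with the (2+1/d)^k versus (1+1/d)^k rates:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(6)
d, k, trials = 3, 4, 100_000
g = rng.standard_normal(d ** k)               # A = g g^T, a random rank-one matrix
tr_A = g @ g                                  # tr(A) = ||g||^2

def kron_probe(use_complex):
    if use_complex:
        parts = [(rng.standard_normal(d) + 1j * rng.standard_normal(d)) / np.sqrt(2)
                 for _ in range(k)]
    else:
        parts = [rng.standard_normal(d) for _ in range(k)]
    return reduce(np.kron, parts)

for label, use_complex in [("real   ", False), ("complex", True)]:
    samples = np.empty(trials)
    for t in range(trials):
        x = kron_probe(use_complex)
        samples[t] = np.abs(np.vdot(x, g)) ** 2   # x^H (g g^T) x = |x^H g|^2
    print(label, "per-sample std / tr(A):", samples.std() / tr_A)
```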

6 Looking Beyond the Kronecker-Hutchinson Estimator

We have shown that the Kronecker-Hutchinson Estimator has a painful exponential dependence on kk in the worst case. However, we are broadly interested in the class of all algorithms that operate in the Kronecker-matrix-vector oracle model. We show that all trace estimation algorithms in this model must have query complexity that grows with kk:

Theorem 31.

Any algorithm that accesses a PSD matrix 𝐀\mathbf{A} via Kronecker-matrix-vector products 𝐀𝐱(1),,𝐀𝐱(q)\mathbf{A}\boldsymbol{\mathrm{x}}^{(1)},\ldots,\mathbf{A}\boldsymbol{\mathrm{x}}^{(q)}, where 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} are (possibly adaptively) chosen real-valued vectors, requires q=Ω(kε)q=\Omega(\frac{\sqrt{k}}{\varepsilon}) in order to output an estimate t^\hat{t} such that 𝔼[(t^tr(𝐀))2]εtr(𝐀)\sqrt{\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}]}\leq\varepsilon\operatorname{tr}(\mathbf{A}).

If we constrain ourselves to estimators t^\hat{t} that are unbiased (i.e. 𝔼[t^]=tr(𝐀)\operatorname*{\mathbb{E}}[\hat{t}]=\operatorname{tr}(\mathbf{A})), then this lower bound says that q=Ω(kε)q=\Omega(\frac{\sqrt{k}}{\varepsilon}) Kronecker-matrix-vector products are required to achieve standard deviation εtr(𝐀)\leq\varepsilon\operatorname{tr}(\mathbf{A}). Our analysis builds off of tools in [Braverman et al., 2020, Jiang et al., 2021], which study matrix-vector lower bounds without the Kronecker constraint. In particular, we use the following definition and imported theorem:

Definition 32.

Let 𝐆d×d\mathbf{G}\in{\mathbb{R}}^{d\times d} be a matrix of iid standard normal Gaussian entries. Then, the matrix 𝐖:=𝐆𝐆d×d\mathbf{W}\;{\vcentcolon=}\;\mathbf{G}^{\intercal}\mathbf{G}\in{\mathbb{R}}^{d\times d} is distributed as a Wishart(d)\text{Wishart}(d) random matrix.
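Since the argument below repeatedly uses that tr(W) = ‖G‖_F² is a chi-squared random variable with d² degrees of freedom (mean d², variance 2d²), here is a small sanity-check sketch (ours; the dimension, sample count, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
d, trials = 5, 200_000
G = rng.standard_normal((trials, d, d))
traces = np.einsum("tij,tij->t", G, G)        # tr(W) = tr(G^T G) = ||G||_F^2 per sample

print("mean of tr(W):", traces.mean(), " (d^2 =", d * d, ")")
print("var  of tr(W):", traces.var(),  " (2 d^2 =", 2 * d * d, ")")
```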

Imported Theorem 33 (Lemma 13 from [Braverman et al., 2020]).

Let 𝐖Wishart(d)\mathbf{W}\sim\emph{\text{Wishart(d)}}. Then, for any sequence of (possibly adaptively chosen) matrix-vector queries 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} and responses 𝐳(1),,𝐳(q)\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)} (so that 𝐳(j)=𝐖𝐱(j)\boldsymbol{\mathrm{z}}^{(j)}=\mathbf{W}\boldsymbol{\mathrm{x}}^{(j)}), there exists a rotation matrix 𝐕d×d\mathbf{V}\in{\mathbb{R}}^{d\times d} constructed as a function of 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} such that the matrix 𝐕𝐖𝐕\mathbf{V}\mathbf{W}\mathbf{V}^{\intercal} can be written as

𝐕𝐖𝐕=𝚫+[𝟎𝟎𝟎~𝐖],\mathbf{V}\mathbf{W}\mathbf{V}^{\intercal}=\mathbf{\Delta}+\begin{bmatrix}\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\widetilde{}\mathbf{W}\end{bmatrix},

where ~𝐖(d−q)×(d−q)\widetilde{}\mathbf{W}\in{\mathbb{R}}^{(d-q)\times(d-q)} conditioned on the values of 𝐱(1),,𝐱(q),𝐳(1),,𝐳(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)} is distributed as a Wishart matrix. That is, ~𝐖|(𝐱(1),,𝐱(q),𝐳(1),,𝐳(q))Wishart(dq)\widetilde{}\mathbf{W}~{}|~{}(\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)})\sim\emph{\text{Wishart}}(d-q). Further, 𝚫\mathbf{\Delta} is a PSD matrix that is a deterministic function of 𝐱(1),,𝐱(q),𝐳(1),,𝐳(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)}.

This theorem is powerful because it says that any algorithm that has accessed a Wishart matrix via qq matrix-vector products must have a submatrix about which the algorithm has absolutely no knowledge. We note that in [Braverman et al., 2020], they do not state that 𝚫\mathbf{\Delta} is a deterministic function of 𝐱(1),,𝐱(q),𝐳(1),,𝐳(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)}, but this fact is easily verified by examining the proof of Lemma 13 in [Braverman et al., 2020].

We start by extending Imported Theorem 33 to the Kronecker-matrix-vector oracle model. We construct our hard instance input matrix as 𝐀=i=1k𝐖idk×dk\mathbf{A}=\otimes_{i=1}^{k}\mathbf{W}_{i}\in{\mathbb{R}}^{d^{k}\times d^{k}}, where 𝐖1,,𝐖kd×d\mathbf{W}_{1},\ldots,\mathbf{W}_{k}\in{\mathbb{R}}^{d\times d} are iid Wishart matrices:

Corollary 34.

Let 𝐖1,,𝐖k\mathbf{W}_{1},\ldots,\mathbf{W}_{k} be iid Wishart(d)\emph{\text{Wishart}}(d) matrices, and let 𝐀:=i=1k𝐖idk×dk\mathbf{A}\;{\vcentcolon=}\;\otimes_{i=1}^{k}\mathbf{W}_{i}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Then, for any sequence of (possibly adaptively chosen) Kronecker-matrix-vector queries 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} and responses 𝐳(1),,𝐳(q)\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)} (so that 𝐳(j)=𝐀𝐱(j)\boldsymbol{\mathrm{z}}^{(j)}=\mathbf{A}\boldsymbol{\mathrm{x}}^{(j)}), there exist orthogonal matrices 𝐕1,,𝐕kd×d\mathbf{V}_{1},\ldots,\mathbf{V}_{k}\in{\mathbb{R}}^{d\times d} such that 𝐕=i=1k𝐕i\mathbf{V}=\otimes_{i=1}^{k}\mathbf{V}_{i} has

𝐕𝐀𝐕=i=1k(𝚫i+[𝟎𝟎𝟎~𝐖i]).\mathbf{V}\mathbf{A}\mathbf{V}^{\intercal}=\otimes_{i=1}^{k}\left(\mathbf{\Delta}_{i}+\begin{bmatrix}\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\widetilde{}\mathbf{W}_{i}\end{bmatrix}\right).

Let 𝐱(j)=i=1k𝐱i(j)\boldsymbol{\mathrm{x}}^{(j)}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{x}}_{i}^{(j)} and 𝐳i(j)=𝐖i𝐱i(j)\boldsymbol{\mathrm{z}}_{i}^{(j)}=\mathbf{W}_{i}\boldsymbol{\mathrm{x}}_{i}^{(j)} be decompositions of the query and response vectors. Then, each ~𝐖idq×dq\widetilde{}\mathbf{W}_{i}\in{\mathbb{R}}^{d-q\times d-q} conditioned on the values of 𝐱i(1),,𝐱i(q),𝐳i(1),,𝐳i(q)\boldsymbol{\mathrm{x}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{x}}_{i}^{(q)},\boldsymbol{\mathrm{z}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{z}}_{i}^{(q)} is distributed as a Wishart matrix. That is, ~𝐖i|(𝐱i(1),,𝐱i(q),𝐳i(1),,𝐳i(q))Wishart(dq)\widetilde{}\mathbf{W}_{i}~{}|~{}(\boldsymbol{\mathrm{x}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{x}}_{i}^{(q)},\boldsymbol{\mathrm{z}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{z}}_{i}^{(q)})\sim\emph{\text{Wishart}}(d-q). Further, each 𝚫i\mathbf{\Delta}_{i} is a PSD matrix that is a deterministic function of 𝐱i(1),,𝐱i(q),𝐳i(1),,𝐳i(q)\boldsymbol{\mathrm{x}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{x}}_{i}^{(q)},\boldsymbol{\mathrm{z}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{z}}_{i}^{(q)}.

With this setup, we can now form a lower bound against any trace estimating algorithm. We first prove in Lemma 35 that, in terms of minimizing mean squared error, no estimator can outperform the conditional expectation t~=𝔼[tr(𝐀)|𝐕1,,𝐕k,𝚫1,,𝚫k]\tilde{t}=\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})~{}|~{}\mathbf{V}_{1},\ldots,\mathbf{V}_{k},\mathbf{\Delta}_{1},\ldots,\mathbf{\Delta}_{k}]. Next, in Lemma 36, we compute the exact variance of this estimator. Lastly, in the final proof of Theorem 31, we compare this variance to the trace of 𝐀\mathbf{A}, to conclude that qΩ(kε)q\geq\Omega(\frac{\sqrt{k}}{\varepsilon}).

To simplify notation, we let 𝒲=(~𝐖1,,~𝐖k){\mathcal{W}}=(\widetilde{}\mathbf{W}_{1},\ldots,\widetilde{}\mathbf{W}_{k}) denote the list of hidden Wishart matrices, and let 𝒱:=(𝐕1,,𝐕k,𝚫1,,𝚫k){\mathcal{V}}\;{\vcentcolon=}\;(\mathbf{V}_{1},\ldots,\mathbf{V}_{k},\mathbf{\Delta}_{1},\ldots,\mathbf{\Delta}_{k}) be the variables that t~\tilde{t} is conditioned on. Note that the algorithm only observes the response vectors 𝐳(1),,𝐳(q)\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)}, and that the algorithm does not exactly know the vectors 𝐳i(j)=𝐖i𝐱i(j)\boldsymbol{\mathrm{z}}_{i}^{(j)}=\mathbf{W}_{i}\boldsymbol{\mathrm{x}}_{i}^{(j)}. Therefore, the algorithm might not observe all of 𝒱{\mathcal{V}}, but 𝒱{\mathcal{V}} definitely covers everything that the algorithm does see.

Lemma 35.

Consider an algorithm that interacts with 𝐀\mathbf{A} via qq Kronecker-matrix-vector products as in the setting of Corollary 34. Let t^\hat{t} be the output of that algorithm, and let t~:=𝔼[tr(𝐀)|𝒱]\tilde{t}\;{\vcentcolon=}\;\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}]. Then, t~\tilde{t} has mean squared error no larger than that of t^\hat{t}:

𝔼[(t^tr(𝐀))2]𝔼[(t~tr(𝐀))2].\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}]\geq\operatorname*{\mathbb{E}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}].
Proof.

Note that the algorithm only observes 𝐱(1),,𝐱(q),𝐳(1),𝐳(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots\boldsymbol{\mathrm{z}}^{(q)}, so the estimator t^\hat{t} has to be a (possibly randomized) function of 𝒱{\mathcal{V}}. We then expand the mean squared error of t^\hat{t} by using the tower rule:

𝔼[(t^tr(𝐀))2]\displaystyle\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}] =𝔼[𝔼𝒲[(t^tr(𝐀))2|𝒱]]\displaystyle=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{E}}_{{\mathcal{W}}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}~{}|~{}{\mathcal{V}}]]
𝔼[𝔼𝒲[(t~tr(𝐀))2|𝒱]]\displaystyle\geq\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{E}}_{{\mathcal{W}}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}~{}|~{}{\mathcal{V}}]]
=𝔼[(t~tr(𝐀))2],\displaystyle=\operatorname*{\mathbb{E}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}],

where the inequality comes from the fact that the conditional expectation t~=𝔼[tr(𝐀)|𝒱]\tilde{t}=\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}] minimizes the mean squared error among all estimators that depend only on 𝒱{\mathcal{V}}. ∎

For any specific instantiation of 𝒱{\mathcal{V}}, we can ask what the expected mean squared error of t~\tilde{t} is:

𝔼𝒲[(tr(𝐀)t~)2|𝒱]=𝔼𝒲[(tr(𝐀)𝔼[tr(𝐀)|𝒱])2|𝒱]=Var𝒲[tr(𝐀)|𝒱].\operatorname*{\mathbb{E}}_{{\mathcal{W}}}[(\operatorname{tr}(\mathbf{A})-\tilde{t})^{2}~{}|~{}{\mathcal{V}}]=\operatorname*{\mathbb{E}}_{{\mathcal{W}}}[(\operatorname{tr}(\mathbf{A})-\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})|{\mathcal{V}}])^{2}~{}|~{}{\mathcal{V}}]=\operatorname*{Var}_{{\mathcal{W}}}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}].

We do not really care about this value for a specific choice of 𝒱{\mathcal{V}}, but instead care about it on average across all values of 𝒱{\mathcal{V}}. That is, we care about the squared error 𝔼[(t~tr(𝐀))2]=𝔼𝒱[Var[tr(𝐀)|𝒱]]\operatorname*{\mathbb{E}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}]=\operatorname*{\mathbb{E}}_{{\mathcal{V}}}[\operatorname*{Var}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}]]. We then want to ensure that the error is less than ε𝔼[tr(𝐀)]\varepsilon\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})]. So, we want to find out what value of qq ensures that

𝔼[Var[tr(𝐀)|𝒱]]ε𝔼[tr(𝐀)].\sqrt{\operatorname*{\mathbb{E}}[\operatorname*{Var}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}]]}\leq\varepsilon\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})].

We start by computing the variance inside the square root on the left.

Lemma 36.

The mean squared error of t~\tilde{t} is 𝔼[Var[tr(𝐀)|𝒱]]=(d4+2d2)k(d4+2d22(dq)2)k\operatorname*{\mathbb{E}}[\operatorname*{Var}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}]]=(d^{4}+2d^{2})^{k}-(d^{4}+2d^{2}-2(d-q)^{2})^{k}.

Proof.

Let Xi:=tr(𝐖i)X_{i}\;{\vcentcolon=}\;\operatorname{tr}(\mathbf{W}_{i}) and X:=tr(𝐀)X\;{\vcentcolon=}\;\operatorname{tr}(\mathbf{A}), so that X=i=1kXiX=\prod_{i=1}^{k}X_{i} is a product of kk independent random variables. Then, since these terms are independent, we know that

Var[i=1kXi]=i=1k𝔼[Xi2]i=1k(𝔼[Xi])2.\operatorname*{Var}[{\textstyle\prod_{i=1}^{k}X_{i}}]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[X_{i}^{2}]-\prod_{i=1}^{k}(\operatorname*{\mathbb{E}}[X_{i}])^{2}.

Further, since Xi|𝒱X_{i}|{\mathcal{V}} is independent of Xj|𝒱X_{j}|{\mathcal{V}} for iji\neq j, we know that

𝔼[Var[i=1kXi|𝒱]]\displaystyle\operatorname*{\mathbb{E}}[\operatorname*{Var}[{\textstyle\prod_{i=1}^{k}X_{i}}~{}|~{}{\mathcal{V}}]] =𝔼[i=1k𝔼[Xi2|𝒱]i=1k(𝔼[Xi|𝒱])2]\displaystyle=\operatorname*{\mathbb{E}}\left[\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[X_{i}^{2}|{\mathcal{V}}]-\prod_{i=1}^{k}(\operatorname*{\mathbb{E}}[X_{i}|{\mathcal{V}}])^{2}\right]
=i=1k𝔼[Xi2]i=1k𝔼[(𝔼[Xi|𝒱])2].\displaystyle=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[X_{i}^{2}]-\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[(\operatorname*{\mathbb{E}}[X_{i}|{\mathcal{V}}])^{2}].

The last expectation on the right can be simplified a bit further. Since Var[Xi|𝒱]=𝔼[Xi2|𝒱](𝔼[Xi|𝒱])2\operatorname*{Var}[X_{i}~{}|~{}{\mathcal{V}}]=\operatorname*{\mathbb{E}}[X_{i}^{2}~{}|~{}{\mathcal{V}}]-(\operatorname*{\mathbb{E}}[X_{i}~{}|~{}{\mathcal{V}}])^{2}, we can write

𝔼[(𝔼[Xi|𝒱])2]=𝔼[𝔼[Xi2|𝒱]Var[Xi|𝒱]]=𝔼[Xi2]𝔼[Var[Xi|𝒱]],\operatorname*{\mathbb{E}}[(\operatorname*{\mathbb{E}}[X_{i}|{\mathcal{V}}])^{2}]=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{E}}[X_{i}^{2}|{\mathcal{V}}]-\operatorname*{Var}[X_{i}|{\mathcal{V}}]]=\operatorname*{\mathbb{E}}[X_{i}^{2}]-\operatorname*{\mathbb{E}}[\operatorname*{Var}[X_{i}|{\mathcal{V}}]],

which in turn means we have

𝔼[Var[i=1kXi|𝒱]]=i=1k𝔼[Xi2]i=1k(𝔼[Xi2]𝔼[Var[Xi|𝒱]]).\displaystyle\operatorname*{\mathbb{E}}[\operatorname*{Var}[{\textstyle\prod_{i=1}^{k}X_{i}}~{}|~{}{\mathcal{V}}]]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[X_{i}^{2}]-\prod_{i=1}^{k}\left(\operatorname*{\mathbb{E}}[X_{i}^{2}]-\operatorname*{\mathbb{E}}[\operatorname*{Var}[X_{i}|{\mathcal{V}}]]\right). (8)

Now we can directly analyze these means and expectations. First, note that Xi=tr(𝐖i)=tr(𝐆i𝐆i)=𝐆iF2χd22X_{i}=\operatorname{tr}(\mathbf{W}_{i})=\operatorname{tr}(\mathbf{G}_{i}^{\intercal}\mathbf{G}_{i})=\mathinner{\lVert\mathbf{G}_{i}\rVert}_{F}^{2}\sim\chi_{d^{2}}^{2} is a chi-squared random variable. Therefore, 𝔼[Xi2]=Var[Xi]+(𝔼[Xi])2=2d2+d4\operatorname*{\mathbb{E}}[X_{i}^{2}]=\operatorname*{Var}[X_{i}]+(\operatorname*{\mathbb{E}}[X_{i}])^{2}=2d^{2}+d^{4}. Next, we focus on 𝔼[Var[Xi|𝒱]]\operatorname*{\mathbb{E}}[\operatorname*{Var}[X_{i}|{\mathcal{V}}]]. By the linearity and cyclic property of the trace, and since 𝐕i𝐕i=𝐈\mathbf{V}_{i}^{\intercal}\mathbf{V}_{i}=\mathbf{I}, we have that

X_{i}=\operatorname{tr}(\mathbf{W}_{i})=\operatorname{tr}(\mathbf{W}_{i}\mathbf{V}_{i}^{\intercal}\mathbf{V}_{i})=\operatorname{tr}(\mathbf{V}_{i}\mathbf{W}_{i}\mathbf{V}_{i}^{\intercal})=\operatorname{tr}(\mathbf{\Delta}_{i})+\operatorname{tr}(\widetilde{}\mathbf{W}_{i}).

Therefore the only term in 𝒱{\mathcal{V}} that XiX_{i} depends on is 𝚫i\mathbf{\Delta}_{i}, and so we can expand

𝔼[Var[Xi|𝒱]]=𝔼[Var[tr(𝚫i)+tr(~𝐖i)|𝚫i]]=𝔼[Var[tr(~𝐖i)]]=Var[tr(~𝐖i)]=2(dq)2,\displaystyle\operatorname*{\mathbb{E}}[\operatorname*{Var}[X_{i}|{\mathcal{V}}]]=\operatorname*{\mathbb{E}}[\operatorname*{Var}[\operatorname{tr}(\mathbf{\Delta}_{i})+\operatorname{tr}(\widetilde{}\mathbf{W}_{i})~{}|~{}\mathbf{\Delta}_{i}]]=\operatorname*{\mathbb{E}}[\operatorname*{Var}[\operatorname{tr}(\widetilde{}\mathbf{W}_{i})]]=\operatorname*{Var}[\operatorname{tr}(\widetilde{}\mathbf{W}_{i})]=2(d-q)^{2},

where the last equality notes that \operatorname{tr}(\widetilde{}\mathbf{W}_{i})=\operatorname{tr}(\widetilde{}\mathbf{G}_{i}^{\intercal}\widetilde{}\mathbf{G}_{i})=\mathinner{\lVert\widetilde{}\mathbf{G}_{i}\rVert}_{F}^{2}\sim\chi_{(d-q)^{2}}^{2} is also a chi-squared random variable. Plugging our expectations back into Equation 8, we find that

𝔼[Var[i=1kXi|𝒱]]\displaystyle\operatorname*{\mathbb{E}}[\operatorname*{Var}[{\textstyle\prod_{i=1}^{k}X_{i}}~{}|~{}{\mathcal{V}}]] =(d4+2d2)k(d4+2d22(dq)2)k,\displaystyle=\left(d^{4}+2d^{2}\right)^{k}-\left(d^{4}+2d^{2}-2(d-q)^{2}\right)^{k},

which completes the proof. ∎

We now have built up the tools we need to prove the central lower bound:

Theorem 31 Restated.

Any algorithm that accesses a PSD matrix 𝐀\mathbf{A} via Kronecker-matrix-vector products 𝐀𝐱(1),,𝐀𝐱(q)\mathbf{A}\boldsymbol{\mathrm{x}}^{(1)},\ldots,\mathbf{A}\boldsymbol{\mathrm{x}}^{(q)}, where 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} are (possibly adaptively) chosen real-valued vectors, requires q=Ω(kε)q=\Omega(\frac{\sqrt{k}}{\varepsilon}) in order to output an estimate t^\hat{t} such that 𝔼[(t^tr(𝐀))2]εtr(𝐀)\sqrt{\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}]}\leq\varepsilon\operatorname{tr}(\mathbf{A}).

Proof.

We consider 𝐀\mathbf{A} as in Corollary 34. By Lemmas 35 and 36, we know that

𝔼[(t^tr(𝐀))2]𝔼[(t~tr(𝐀))2]=(d4+2d2)k(d4+2d22(dq)2)k.\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}]\geq\operatorname*{\mathbb{E}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}]=(d^{4}+2d^{2})^{k}-(d^{4}+2d^{2}-2(d-q)^{2})^{k}.

We then want to determine for which values of qq this right-hand side can be at most ε2(𝔼[tr(𝐀)])2\varepsilon^{2}(\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})])^{2}. First, we quickly note that 𝔼[tr(𝐀)]=i=1k𝔼[tr(𝐖i)]=d2k\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{W}_{i})]=d^{2k}, since tr(𝐖i)=tr(𝐆i𝐆i)=𝐆iF2χd22\operatorname{tr}(\mathbf{W}_{i})=\operatorname{tr}(\mathbf{G}_{i}^{\intercal}\mathbf{G}_{i})=\mathinner{\lVert\mathbf{G}_{i}\rVert}_{F}^{2}\sim\chi_{d^{2}}^{2} is a chi-squared random variable. So, we want to find what values of qq ensure that

(d4+2d2)k(d4+2d22(dq)2)kε2d4k.(d^{4}+2d^{2})^{k}-(d^{4}+2d^{2}-2(d-q)^{2})^{k}\leq\varepsilon^{2}d^{4k}.

In fact, it will be mathematically convenient to demand slightly less accuracy, which will still give a meaningful lower bound:

(d4+2d2)k(d4+2d22(dq)2)kε2(d4+2d2)k.(d^{4}+2d^{2})^{k}-(d^{4}+2d^{2}-2(d-q)^{2})^{k}\leq\varepsilon^{2}(d^{4}+2d^{2})^{k}.

We can rearrange this inequality to produce

(dq)212(d4+2d2)(1(1ε2)1/k).(d-q)^{2}\leq{\textstyle\frac{1}{2}}(d^{4}+2d^{2})(1-(1-\varepsilon^{2})^{1/k}).

Note that for ε(0,12)\varepsilon\in(0,\frac{1}{2}) and k1k\geq 1, we have that (12ε2k)k=(12ε2k)k2ε22ε2e2ε21ε2(1-\frac{2\varepsilon^{2}}{k})^{k}=(1-\frac{2\varepsilon^{2}}{k})^{\frac{k}{2\varepsilon^{2}}\cdot 2\varepsilon^{2}}\leq e^{-2\varepsilon^{2}}\leq 1-\varepsilon^{2}, so that (1ε2)1/k12ε2k(1-\varepsilon^{2})^{1/k}\geq 1-\frac{2\varepsilon^{2}}{k}. So, by additionally bounding d4+2d22d4d^{4}+2d^{2}\leq 2d^{4} for d2d\geq 2, we can simplify the above bound to be

(dq)212(d4+2d2)(1(1ε2)1/k)2d4ε2k.(d-q)^{2}\leq{\textstyle\frac{1}{2}}(d^{4}+2d^{2})(1-(1-\varepsilon^{2})^{1/k})\leq{\textstyle\frac{2d^{4}\varepsilon^{2}}{k}}.

Solving this inequality for qq gives qd2εkd2q\geq d-\frac{\sqrt{2}\varepsilon}{\sqrt{k}}d^{2}. Choosing dd to maximize this bound, namely d\approx\frac{\sqrt{k}}{2\sqrt{2}\varepsilon}, gives q\geq\frac{\sqrt{k}}{4\sqrt{2}\varepsilon}=\Omega(\frac{\sqrt{k}}{\varepsilon}), completing the proof. ∎

7 Exactly Computing the Trace of 𝐀=i=1k𝐀i\mathbf{A}=\otimes_{i=1}^{k}\mathbf{A}_{i} in kd+1kd+1 Queries

In this section, we are promised that 𝐀=𝐀1𝐀k\mathbf{A}=\mathbf{A}_{1}\otimes\cdots\otimes\mathbf{A}_{k} for some unknown matrices 𝐀1,,𝐀kd×d\mathbf{A}_{1},\ldots,\mathbf{A}_{k}\in{\mathbb{R}}^{d\times d}. We show that by using exactly kd+1kd+1 Kronecker-matrix-vector products, we can recover a set of matrices 𝐁1,,𝐁kd×d\mathbf{B}_{1},\ldots,\mathbf{B}_{k}\in{\mathbb{R}}^{d\times d} such that i=1k𝐁i=𝐀\otimes_{i=1}^{k}\mathbf{B}_{i}=\mathbf{A}. Note that we cannot exactly recover the matrices 𝐀1,,𝐀k\mathbf{A}_{1},\ldots,\mathbf{A}_{k} because the Kronecker product does not admit a unique decomposition: we have α𝐀𝐁=𝐀α𝐁\alpha\mathbf{A}\otimes\mathbf{B}=\mathbf{A}\otimes\alpha\mathbf{B} for all scalars α\alpha and matrices 𝐀\mathbf{A}, 𝐁\mathbf{B}. However, we are able to recover the trace exactly: since tr(𝐀1𝐀2)=tr(𝐀1)tr(𝐀2)\operatorname{tr}(\mathbf{A}_{1}\otimes\mathbf{A}_{2})=\operatorname{tr}(\mathbf{A}_{1})\operatorname{tr}(\mathbf{A}_{2}), we can compute tr(𝐀)=i=1ktr(𝐁i)\operatorname{tr}(\mathbf{A})=\prod_{i=1}^{k}\operatorname{tr}(\mathbf{B}_{i}).

Algorithm 1 Exact Kronecker Recovery

input: Kronecker-matrix-vector oracle access to 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}.
output: Factor matrices 𝐁1,,𝐁kd×d\mathbf{B}_{1},\ldots,\mathbf{B}_{k}\in{\mathbb{R}}^{d\times d}.

1:  Sample 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) vectors 𝐠1,,𝐠kd\boldsymbol{\mathrm{g}}_{1},\ldots,\boldsymbol{\mathrm{g}}_{k}\in{\mathbb{R}}^{d}.
2:  Let ¯𝐱=i=1k𝐠i\bar{}\boldsymbol{\mathrm{x}}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{g}}_{i}.
3:  Compute γ:=¯𝐱𝐀¯𝐱\gamma\;{\vcentcolon=}\;\bar{}\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\bar{}\boldsymbol{\mathrm{x}}.
4:  If γ=0\gamma=0, then return 𝐁1,,𝐁k=𝟎\mathbf{B}_{1},\ldots,\mathbf{B}_{k}=\mathbf{0}
5:  For all $i\in[k],m\in[d]$, let $\boldsymbol{\mathrm{x}}_{i}^{(m)}=(\otimes_{j=1}^{i-1}\boldsymbol{\mathrm{g}}_{j})\otimes\boldsymbol{\mathrm{e}}_{m}\otimes(\otimes_{j=i+1}^{k}\boldsymbol{\mathrm{g}}_{j})$.
6:  For all i[k],m[d]i\in[k],m\in[d], compute 𝐀𝐱i(m)\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(m)}.
7:  For all i[k],m[d],n[d]i\in[k],m\in[d],n\in[d], store [𝐁i]m,n=γ(11k)𝐱i(m)𝐀𝐱i(n)[\mathbf{B}_{i}]_{m,n}=\gamma^{-(1-\frac{1}{k})}{\boldsymbol{\mathrm{x}}_{i}^{(m)}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(n)}.
8:  return 𝐁1,,𝐁k\mathbf{B}_{1},\ldots,\mathbf{B}_{k}.
Theorem 37.

Let 𝐀=𝐀1𝐀kdk×dk\mathbf{A}=\mathbf{A}_{1}\otimes\cdots\otimes\mathbf{A}_{k}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Then, with probability 1, Algorithm 1 computes exactly kd+1kd+1 Kronecker-matrix-vector products with 𝐀\mathbf{A} and returns matrices 𝐁1,,𝐁k\mathbf{B}_{1},\ldots,\mathbf{B}_{k} such that 𝐀=𝐁1𝐁k\mathbf{A}=\mathbf{B}_{1}\otimes\cdots\otimes\mathbf{B}_{k}.

Proof.

At a high level, the quadratic form ${\boldsymbol{\mathrm{x}}_{i}^{(m)}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(n)}$ recovers the $(m,n)$ entry of $\mathbf{A}_{i}$ up to a known scaling factor. To see why, notice that

𝐱i(m)𝐀𝐱i(n)=(j=1i1𝐠j𝐀j𝐠j)𝐞m𝐀i𝐞n(j=i+1k𝐠j𝐀j𝐠j).{\boldsymbol{\mathrm{x}}_{i}^{(m)}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(n)}=(\otimes_{j=1}^{i-1}\boldsymbol{\mathrm{g}}_{j}^{\intercal}\mathbf{A}_{j}\boldsymbol{\mathrm{g}}_{j})\otimes\boldsymbol{\mathrm{e}}_{m}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{e}}_{n}\otimes(\otimes_{j=i+1}^{k}\boldsymbol{\mathrm{g}}_{j}^{\intercal}\mathbf{A}_{j}\boldsymbol{\mathrm{g}}_{j}).

If we let γi:=𝐠i𝐀i𝐠i\gamma_{i}\;{\vcentcolon=}\;\boldsymbol{\mathrm{g}}_{i}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{g}}_{i} denote the various components of γ=i=1k𝐠i𝐀i𝐠i=i=1kγi\gamma=\otimes_{i=1}^{k}\boldsymbol{\mathrm{g}}_{i}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{g}}_{i}=\prod_{i=1}^{k}\gamma_{i}, then we can write the above expression as

${\boldsymbol{\mathrm{x}}_{i}^{(m)}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(n)}=\big({\textstyle\prod_{j\neq i}\gamma_{j}}\big)\,\boldsymbol{\mathrm{e}}_{m}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{e}}_{n}={\textstyle\frac{\gamma}{\gamma_{i}}}[\mathbf{A}_{i}]_{m,n}.$

So, combining this with line 7 of Algorithm 1, we get that the matrix $\mathbf{B}_{i}$ has $[\mathbf{B}_{i}]_{m,n}=\gamma^{-(1-\frac{1}{k})}\cdot\frac{\gamma}{\gamma_{i}}[\mathbf{A}_{i}]_{m,n}=\frac{\gamma^{1/k}}{\gamma_{i}}[\mathbf{A}_{i}]_{m,n}$. We therefore conclude that

i=1k𝐁i=i=1k(γ1kγi𝐀i)=(i=1kγ1kγi)𝐀=γγ𝐀=𝐀.\otimes_{i=1}^{k}\mathbf{B}_{i}=\otimes_{i=1}^{k}\big{(}{\textstyle\frac{\gamma^{\frac{1}{k}}}{\gamma_{i}}}\mathbf{A}_{i}\big{)}=\big{(}{\textstyle\prod_{i=1}^{k}}{\textstyle\frac{\gamma^{\frac{1}{k}}}{\gamma_{i}}}\big{)}\mathbf{A}={\textstyle\frac{\gamma}{\gamma}}\mathbf{A}=\mathbf{A}.

This analysis implicitly assumes that $\gamma_{i}=\boldsymbol{\mathrm{g}}_{i}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{g}}_{i}\neq 0$ for all $i\in[k]$. Since the $\boldsymbol{\mathrm{g}}_{i}$ are random Gaussian vectors, we know that $\boldsymbol{\mathrm{g}}_{i}$ does not lie in the nullspace of $\mathbf{A}_{i}$ with probability 1, so long as $\mathbf{A}_{i}$ is not the all-zeros matrix. So, Algorithm 1 is correct so long as no factor $\mathbf{A}_{i}$ is the all-zeros matrix. If some $\mathbf{A}_{i}$ is the all-zeros matrix, then $\mathbf{A}$ is also the all-zeros matrix, so $\gamma=0$, and the algorithm returns $\mathbf{B}_{1},\ldots,\mathbf{B}_{k}$ all equal to the all-zeros matrix, which is correct.

To count the number of Kronecker-matrix-vector products computed, note that Algorithm 1 only computes such products on lines 3 and 6. Line 3 computes exactly 1 product, and line 6 computes exactly kdkd products. ∎
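To make the recovery procedure concrete, here is a minimal NumPy sketch in the spirit of Algorithm 1 (the code and the name exact_kronecker_recovery are ours, not the paper's). For illustration it takes the Kronecker-matrix-vector oracle as a Python callable matvec. To avoid taking the fractional power $\gamma^{-(1-\frac{1}{k})}$ of a possibly negative $\gamma$, the sketch folds the full normalization $\gamma^{-(k-1)}$ into the first factor rather than spreading it evenly across factors as in line 7; the returned matrices still satisfy $\mathbf{B}_{1}\otimes\cdots\otimes\mathbf{B}_{k}=\mathbf{A}$, and exactly $kd+1$ oracle calls are made.

from functools import reduce
import numpy as np

def kron_all(pieces):
    """Kronecker product of a list of vectors (or matrices)."""
    return reduce(np.kron, pieces)

def exact_kronecker_recovery(matvec, d, k, seed=0):
    """Recover B_1, ..., B_k with B_1 kron ... kron B_k = A from kd + 1
    Kronecker-matrix-vector products. `matvec(x)` must return A @ x for a
    Kronecker-structured x of length d**k."""
    rng = np.random.default_rng(seed)
    g = [rng.standard_normal(d) for _ in range(k)]                  # step 1
    x_bar = kron_all(g)                                             # step 2
    gamma = x_bar @ matvec(x_bar)                                   # step 3 (1 product)
    if gamma == 0:                                                  # step 4
        return [np.zeros((d, d)) for _ in range(k)]
    I = np.eye(d)
    B = []
    for i in range(k):                                              # steps 5-7 (kd products)
        # Columns n = 0..d-1: A applied to g_1 kron ... kron e_n kron ... kron g_k.
        AX = np.column_stack(
            [matvec(kron_all(g[:i] + [I[:, n]] + g[i + 1:])) for n in range(d)])
        # Rows m: multiply on the left by (g_1 kron ... kron e_m kron ... kron g_k)^T.
        X = np.column_stack(
            [kron_all(g[:i] + [I[:, m]] + g[i + 1:]) for m in range(d)])
        B.append(X.T @ AX)               # equals (gamma / gamma_i) * A_i
    B[0] = B[0] / gamma ** (k - 1)       # normalize so the Kronecker product equals A
    return B

# Usage: recover the factors of a Kronecker-structured A and read off its trace.
d, k = 3, 3
factors = [np.random.default_rng(i).standard_normal((d, d)) for i in range(k)]
A = kron_all(factors)
B = exact_kronecker_recovery(lambda x: A @ x, d, k)
assert np.allclose(kron_all(B), A)
print("tr(A) =", np.prod([np.trace(Bi) for Bi in B]))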

8 Conclusion

In this paper, we produced new and tight bounds on the rate of convergence of the Kronecker-Hutchinson Estimator. We showed that the estimator generally suffers an exponential dependence on $k$, and we derived an exact expression for its variance when using a Kronecker product of Gaussian vectors.

There are several natural future directions for this research. It is worth exploring the variance of more classes of matrices under Theorem 4. In particular, is there a natural property of matrices which exactly characterizes the input matrices for which the Kronecker-Hutchinson Estimator has a polynomial rate of convergence? If so, this would be a significant step towards a sub-exponential runtime for trace estimation in the Kronecker-matrix-vector oracle.

It is also worth looking at algorithms in the Kronecker-matrix-vector oracle model beyond the Kronecker-Hutchinson Estimator. For instance, when $\mathbf{A}$ has a quickly decaying spectrum, could a low-rank approximation method like the ones used in the non-Kronecker case provide good trace estimation guarantees [Li and Zhu, 2021, Saibaba et al., 2017]?

There is also potential for newer and stronger lower bounds. We provide a lower bound against possibly adaptive algorithms, which is very general, but perhaps too general to describe most of the algorithms practitioners currently use. Is there a lower bound against all sketching-based methods that is exponential in $k$? Or, even more narrowly, is there a lower bound against all quadratic-form-based algorithms, in the vein of [Wimmer et al., 2014]?

Lastly, there is also plenty of potential to study more problems in the Kronecker-matrix-vector oracle model. Is it possible to accurately and efficiently compute the top eigenvalue or eigenvector of a matrix from Kronecker-matrix-vector products? How about solving a linear system? These problems all need to be explored in great detail.

Acknowledgements

Raphael Meyer was partially supported by a GAANN fellowship from the US Department of Education. Haim Avron was partially supported by the US-Israel Binational Science Foundation (Grant no. 2017698). We thank Moshe Goldstein and Noa Feldman for discussions on quantum entanglement estimation, which inspired this work. We thank Tyler Chen for the design of Algorithm 1. We thank Christopher Musco for pointing us to relevant results in [Ahle et al., 2020].

References

  • [Aaronson, 2018] Aaronson, S. (2018). Introduction to quantum information science lecture notes. https://www.scottaaronson.com/qclec.pdf.
  • [Ahle et al., 2020] Ahle, T. D., Kapralov, M., Knudsen, J. B., Pagh, R., Velingker, A., Woodruff, D. P., and Zandieh, A. (2020). Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM.
  • [Avron et al., 2014] Avron, H., Nguyen, H., and Woodruff, D. (2014). Subspace embeddings for the polynomial kernel. Advances in Neural Information Processing Systems, 27.
  • [Avron and Toledo, 2011] Avron, H. and Toledo, S. (2011). Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM, 58(2).
  • [Bakshi et al., 2022] Bakshi, A., Clarkson, K. L., and Woodruff, D. P. (2022). Low-rank approximation with $1/\varepsilon^{1/3}$ matrix-vector products. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1130–1143.
  • [Bernstein, 2011] Bernstein, D. S. (2011). Matrix Mathematics: Theory, Facts, and Formulas. Princeton University Press, second edition.
  • [Braverman et al., 2020] Braverman, M., Hazan, E., Simchowitz, M., and Woodworth, B. (2020). The gradient complexity of linear regression. In Conference on Learning Theory, pages 627–647. PMLR.
  • [Bujanovic and Kressner, 2021] Bujanovic, Z. and Kressner, D. (2021). Norm and trace estimation with random rank-one vectors. SIAM Journal on Matrix Analysis and Applications, 42(1):202–223.
  • [Chen and Hallman, 2022] Chen, T. and Hallman, E. (2022). Krylov-aware stochastic trace estimation. arXiv preprint arXiv:2205.01736.
  • [Cortinovis and Kressner, 2021] Cortinovis, A. and Kressner, D. (2021). On randomized trace estimates for indefinite matrices with an application to determinants. Foundations of Computational Mathematics, pages 1–29.
  • [Feldman et al., 2022] Feldman, N., Kshetrimayum, A., Eisert, J., and Goldstein, M. (2022). Entanglement estimation in tensor network states via sampling. PRX Quantum, 3(3):030312.
  • [Girard, 1987] Girard, D. (1987). Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille: applications à la restauration d'image. Informatique et Mathématiques Appliquées de Grenoble.
  • [Halikias and Townsend, 2022] Halikias, D. and Townsend, A. (2022). Matrix recovery from matrix-vector products. arXiv preprint arXiv:2212.09841.
  • [Huang et al., 2020] Huang, H.-Y., Kueng, R., and Preskill, J. (2020). Predicting many properties of a quantum system from very few measurements. Nature Physics, 16(10):1050–1057.
  • [Hutchinson, 1989] Hutchinson, M. F. (1989). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076.
  • [Jiang et al., 2021] Jiang, S., Pham, H., Woodruff, D., and Zhang, R. (2021). Optimal sketching for trace estimation. Advances in Neural Information Processing Systems, 34:23741–23753.
  • [Kueng et al., 2017] Kueng, R., Rauhut, H., and Terstiege, U. (2017). Low rank matrix recovery from rank one measurements. Applied and Computational Harmonic Analysis, 42(1):88–116.
  • [Li and Zhu, 2021] Li, H. and Zhu, Y. (2021). Randomized block krylov subspace methods for trace and log-determinant estimators. BIT Numerical Mathematics, 61:911–939.
  • [Meyer, 2021] Meyer, R. A. (2021). Updates for Hutch++: Hutch++ for undergrads. https://ram900.hosting.nyu.edu/hutchplusplus/#hutch_for_undergrads.
  • [Meyer et al., 2021] Meyer, R. A., Musco, C., Musco, C., and Woodruff, D. P. (2021). Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM.
  • [Persson et al., 2022] Persson, D., Cortinovis, A., and Kressner, D. (2022). Improved variants of the Hutch++ algorithm for trace estimation. SIAM Journal on Matrix Analysis and Applications, 43(3):1162–1185.
  • [Roosta-Khorasani and Ascher, 2015] Roosta-Khorasani, F. and Ascher, U. (2015). Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212.
  • [Saibaba et al., 2017] Saibaba, A. K., Alexanderian, A., and Ipsen, I. C. (2017). Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395.
  • [Sun et al., 2021a] Sun, X., Woodruff, D. P., Yang, G., and Zhang, J. (2021a). Querying a matrix through matrix-vector products. ACM Transactions on Algorithms (TALG), 17(4):1–19.
  • [Sun et al., 2021b] Sun, Y., Guo, Y., Tropp, J. A., and Udell, M. (2021b). Tensor random projection for low memory dimension reduction. arXiv preprint arXiv:2105.00105.
  • [Tropp and Webber, 2023] Tropp, J. A. and Webber, R. J. (2023). Randomized algorithms for low-rank matrix approximation: Design, analysis, and applications. arXiv preprint arXiv:2306.12418.
  • [Vershynin, 2020] Vershynin, R. (2020). Concentration inequalities for random tensors. Bernoulli.
  • [Wang et al., 2023] Wang, S., McArdle, S., and Berta, M. (2023). Qubit-efficient randomized quantum algorithms for linear algebra. arXiv preprint arXiv:2302.01873.
  • [Wimmer et al., 2014] Wimmer, K., Wu, Y., and Zhang, P. (2014). Optimal query complexity for estimating the trace of a matrix. In Automata, Languages, and Programming: 41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8-11, 2014, Proceedings, Part I 41, pages 1051–1062. Springer.