
Hutchinson’s Estimator is Bad at Kronecker-Trace-Estimation

Raphael A. Meyer
New York University
[email protected]
   Haim Avron
Tel Aviv University
[email protected]
Abstract

We study the problem of estimating the trace of a matrix $\mathbf{A}$ that can only be accessed through Kronecker-matrix-vector products. That is, for any Kronecker-structured vector $\mathbf{x}=\otimes_{i=1}^{k}\mathbf{x}_{i}$, we can compute $\mathbf{A}\mathbf{x}$. We focus on the natural generalization of Hutchinson's Estimator to this setting, proving tight rates for the number of matrix-vector products this estimator needs to find a $(1\pm\varepsilon)$ approximation to the trace of $\mathbf{A}$.

We find an exact equation for the variance of the estimator when using a Kronecker product of Gaussian vectors, revealing an intimate relationship between Hutchinson's Estimator, the partial trace operator, and the partial transpose operator. Using this equation, we show that, when using real vectors, this estimator needs $O(\frac{3^{k}}{\varepsilon^{2}})$ products in the worst case to recover a $(1\pm\varepsilon)$ approximation of the trace of any PSD $\mathbf{A}$, and we prove a matching lower bound for certain PSD $\mathbf{A}$. However, when using complex vectors, this can be exponentially improved to $\Theta(\frac{2^{k}}{\varepsilon^{2}})$. We show that Hutchinson's Estimator converges slowest when $\mathbf{A}$ itself also has Kronecker structure. We conclude with some theoretical evidence suggesting that, by combining Hutchinson's Estimator with other techniques, it may be possible to avoid the exponential dependence on $k$.

1 Introduction

A ubiquitous problem in scientific computing is that of estimating the trace of a $D\times D$ matrix $\mathbf{A}$ that can only be accessed via matrix-vector products. In other words, we are given access to an oracle that can evaluate $\mathbf{A}\mathbf{x}$ for any $\mathbf{x}\in\mathbb{R}^{D}$, and our goal is to approximate $\operatorname{tr}(\mathbf{A})$ with as few oracle queries as possible. This matrix-vector oracle model, or implicit matrix model, is a common computational framework used in much of the numerical linear algebra community [Sun et al., 2021a, Chen and Hallman, 2022, Halikias and Townsend, 2022, Bakshi et al., 2022, Persson et al., 2022]. However, in many applications, we cannot efficiently evaluate $\mathbf{A}\mathbf{x}$ for all vectors $\mathbf{x}\in\mathbb{R}^{D}$; instead, only some matrix-vector products can be efficiently computed.

In this paper, we study trace estimation in the Kronecker-matrix-vector oracle model, also known as the rank-one vector model, where we can evaluate $\mathbf{A}\mathbf{x}$ for any vector $\mathbf{x}$ that can be written as a Kronecker product of smaller vectors, i.e. $\mathbf{x}=\mathbf{x}_{1}\otimes\mathbf{x}_{2}\otimes\cdots\otimes\mathbf{x}_{k}$ for some vectors $\mathbf{x}_{1},\ldots,\mathbf{x}_{k}\in\mathbb{R}^{d}$. Here, $\otimes$ denotes the Kronecker product, which is a common primitive in many tensor-based algorithms and in much of quantum physics. The number of vectors ($k$) and the number of entries ($d$) in each of $\mathbf{x}_{1},\dots,\mathbf{x}_{k}$ are parameters of the model.

Our study is motivated by recent literature on entanglement estimation in quantum states [Feldman et al., 2022]. This computational model has also been studied in other applied domains [Chen and Hallman, 2022, Kueng et al., 2017]. An important setting where we can efficiently compute Kronecker-matrix-vector products is when $\mathbf{A}$ is the sum of a small number of matrices, where each summand is a Kronecker product of $k$ matrices. Such matrices occur as Hamiltonians of various quantum models [Wang et al., 2023]. Another instance in which we can efficiently compute Kronecker-matrix-vector products is when $\mathbf{A}$ is the flattening of a tensor network. This is the case in [Feldman et al., 2022], where the oracle is efficiently implemented when $\mathbf{A}$ describes a density operator that represents certain systems of quantum particles. Recently, there has been some effort to find algorithms that can efficiently solve linear algebra problems using this oracle, although the papers do not explicitly define such a Kronecker-matrix-vector oracle [Bujanovic and Kressner, 2021, Avron et al., 2014, Ahle et al., 2020, Sun et al., 2021b]. In this paper, we focus on the fundamental problem of trace estimation.

The trace estimation problem in the Kronecker-matrix-vector oracle model can be exactly solved using $D=d^{k}$ oracle queries. Since each standard basis vector $\mathbf{e}_{1},\ldots,\mathbf{e}_{D}$ obeys Kronecker structure, we can exactly look up each diagonal entry of $\mathbf{A}$ and evaluate $\operatorname{tr}(\mathbf{A})=\sum_{i=1}^{D}\mathbf{e}_{i}^{\intercal}\mathbf{A}\mathbf{e}_{i}$. So, our goal is to estimate the trace of $\mathbf{A}$ to $(1\pm\varepsilon)$ relative error using $o(d^{k})$ matrix-vector products. In the standard non-Kronecker setting, the most fundamental algorithm for trace estimation is Hutchinson's Estimator [Girard, 1987, Hutchinson, 1989]:

\[H_{\ell}(\mathbf{A}) := \frac{1}{\ell}\sum_{j=1}^{\ell}{\mathbf{x}^{(j)}}^{\intercal}\mathbf{A}\mathbf{x}^{(j)} \qquad\text{where}\qquad \mathbf{x}^{(j)}\overset{iid}{\sim}\mathcal{D}.\]

If the distribution of random vectors has $\mathbb{E}[\mathbf{x}]=\mathbf{0}$ and $\mathbb{E}[\mathbf{x}\mathbf{x}^{\intercal}]=\mathbf{I}$, then $\mathbb{E}[H_{\ell}(\mathbf{A})]=\operatorname{tr}(\mathbf{A})$. Further, if $\mathcal{D}$ is a Gaussian or Rademacher distribution, then $\operatorname{Var}[H_{\ell}(\mathbf{A})]\leq\frac{2}{\ell}\lVert\mathbf{A}\rVert_{F}^{2}$. In particular, under the common assumption that $\mathbf{A}$ is PSD, so that $\lVert\mathbf{A}\rVert_{F}\leq\operatorname{tr}(\mathbf{A})$, we then find that the standard deviation of Hutchinson's Estimator is

\[\sqrt{\operatorname{Var}[H_{\ell}(\mathbf{A})]}\leq\sqrt{\tfrac{2}{\ell}}\,\lVert\mathbf{A}\rVert_{F}\leq\sqrt{\tfrac{2}{\ell}}\operatorname{tr}(\mathbf{A})\leq\varepsilon\operatorname{tr}(\mathbf{A})\]

when we take $\ell\geq\Omega(\frac{1}{\varepsilon^{2}})$ queries. For large dimension $D$, this very simple algorithm is much more efficient than exactly computing the diagonal entries of $\mathbf{A}$, since the estimator has no dependence on the size of $\mathbf{A}$ at all.
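To make this concrete, here is a minimal numpy sketch of the classical estimator; the function name `hutchinson`, the Rademacher query distribution, and the toy PSD matrix are illustrative choices, not constructions from this paper.

```python
import numpy as np

def hutchinson(matvec, D, ell, rng=np.random.default_rng(0)):
    """Estimate tr(A) from ell matrix-vector products A @ x, using iid
    Rademacher query vectors (zero mean, identity covariance)."""
    total = 0.0
    for _ in range(ell):
        x = rng.choice([-1.0, 1.0], size=D)   # Rademacher query vector
        total += x @ matvec(x)                # one sample x^T A x
    return total / ell

# Toy example: a random PSD matrix accessed only through products A @ x.
D = 200
G = np.random.default_rng(1).standard_normal((D, D))
A = G @ G.T
print(hutchinson(lambda x: A @ x, D, ell=500), np.trace(A))
```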

This estimator can be naturally generalized to the Kronecker-matrix-vector oracle model by simply changing the distribution $\mathcal{D}$. Instead of taking $\mathcal{D}$ to be a distribution over Gaussian or Rademacher vectors, we now take $\mathcal{D}$ to be the Kronecker product of $k$ vectors drawn from some other isotropic zero-mean distribution $\mathcal{D}^{\prime}$:

\[\mathbf{x}=\mathbf{x}_{1}\otimes\mathbf{x}_{2}\otimes\cdots\otimes\mathbf{x}_{k}\qquad\text{where}\qquad\mathbf{x}_{i}\overset{iid}{\sim}\mathcal{D}^{\prime}.\]

Notably, if $\mathbb{E}[\mathbf{x}_{i}]=\mathbf{0}$ and $\mathbb{E}[\mathbf{x}_{i}\mathbf{x}_{i}^{\intercal}]=\mathbf{I}$, then we also immediately have that $\mathbb{E}[\mathbf{x}]=\mathbf{0}$ and $\mathbb{E}[\mathbf{x}\mathbf{x}^{\intercal}]=\mathbf{I}$, so that $\mathbb{E}[H_{\ell}(\mathbf{A})]=\operatorname{tr}(\mathbf{A})$. This simple estimator, which we refer to as the Kronecker-Hutchinson Estimator, fits the Kronecker-matrix-vector oracle model, and we may hope that it is efficient enough to estimate the trace of $\mathbf{A}$ very quickly. At a high level, we answer this in the negative, proving novel and very tight bounds on the variance of the Kronecker-Hutchinson Estimator, showing that the sample complexity of the estimator has an unavoidable exponential dependence on $k$.
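As an illustration, here is a hedged numpy sketch of the Kronecker-Hutchinson Estimator with Gaussian factors; the helper name `kron_hutchinson` is ours, and the dense toy matrix stands in for an oracle that would, in practice, exploit the structure of $\mathbf{A}$ rather than storing it.

```python
import numpy as np
from functools import reduce

def kron_hutchinson(kron_matvec, d, k, ell, rng=np.random.default_rng(0)):
    """Average x^T A x over ell queries, where each query x = x_1 (x) ... (x) x_k
    is the Kronecker product of k iid standard Gaussian vectors in R^d."""
    total = 0.0
    for _ in range(ell):
        factors = [rng.standard_normal(d) for _ in range(k)]
        x = reduce(np.kron, factors)          # Kronecker-structured query in R^{d^k}
        total += x @ kron_matvec(x)
    return total / ell

# Toy oracle: a small dense PSD matrix (real oracles never form A explicitly).
d, k = 3, 4
G = np.random.default_rng(1).standard_normal((d**k, d**k))
A = G @ G.T
print(kron_hutchinson(lambda x: A @ x, d, k, ell=5000), np.trace(A))
```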

1.1 Prior Work

We first remark that there is a great deal of literature on trace estimation without the Kronecker constraint. The design of Hutchinson's Estimator is attributed to [Girard, 1987, Hutchinson, 1989], and some of the core papers that analyze the estimator are [Avron and Toledo, 2011, Roosta-Khorasani and Ascher, 2015, Cortinovis and Kressner, 2021].

Although a relatively new topic, trace estimation in the Kronecker-matrix-vector oracle model has recently been explored in the literature. From the physics literature, the works of [Feldman et al., 2022] and [Huang et al., 2020] analyze trace estimators when $\mathbf{A}=\otimes_{i=1}^{k}\mathbf{A}_{i}$, and show some empirical results for their performance. From a theoretical perspective, [Vershynin, 2020] provides bounds that can be used to analyze the Kronecker-Hutchinson Estimator. However, the paper provides very coarse bounds for this problem, showing that if $k=O(d)$ then we have $\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}\in\operatorname{tr}(\mathbf{A})\pm O(kd^{k-1}\lVert\mathbf{A}\rVert_{2})$, where $\lVert\mathbf{A}\rVert_{2}$ is the operator norm of $\mathbf{A}$. To obtain a relative error of $(1\pm\varepsilon)$, this suggests an overall sample complexity of $\ell=O(\frac{kd^{k-1}}{\varepsilon^{2}}\frac{\lVert\mathbf{A}\rVert_{2}}{\operatorname{tr}(\mathbf{A})})=O(\frac{D}{\varepsilon^{2}}\frac{\lVert\mathbf{A}\rVert_{2}}{\operatorname{tr}(\mathbf{A})})$, which is generally more than $D$ queries.

The research of [Bujanovic and Kressner, 2021] directly analyzes the Kronecker-Hutchinson Estimator when $k=2$, showing first that $\ell=O(\frac{1}{\varepsilon^{2}})$ samples suffice to ensure that $H_{\ell}(\mathbf{A})\geq(1-\varepsilon)\operatorname{tr}(\mathbf{A})$, and second that $\ell=O(\frac{d}{\varepsilon^{2}}\frac{\lVert\mathbf{A}\rVert_{2}}{\operatorname{tr}(\mathbf{A})})$ samples suffice to ensure that $H_{\ell}(\mathbf{A})\leq(1+\varepsilon)\operatorname{tr}(\mathbf{A})$. Lastly, Remark 2 of the arXiv version of [Ahle et al., 2020] is a result on using a Kronecker-Johnson-Lindenstrauss transform, and implies that $O(\frac{3^{k}}{\varepsilon^{2}})$ samples suffice for the Kronecker-Hutchinson Estimator with Rademacher vectors to achieve relative error $(1\pm\varepsilon)$. While this worst-case bound is tight, the paper gives no way to understand when this bound is loose, or for what matrices fewer samples are needed. Further, all the aforementioned theoretical papers rely on heavy-duty analysis, for instance involving Gaussian and Rademacher chaos theory. In contrast, our contributions give tighter results using simpler techniques.

We also note that [Ahle et al., 2020] includes a Johnson-Lindenstrauss-style guarantee for working with Kronecker-shaped data using a sketch that has $O(\frac{k}{\varepsilon^{2}})$ rows. However, this sketch does not actually operate in the Kronecker-matrix-vector oracle model.

1.2 Our Contributions

Our core technical contribution is an exact expression of the variance of Hutchinson's Estimator when our query vectors are formed as a Kronecker product of Gaussian vectors. This also provides an upper bound on the variance when the estimator instead uses a Kronecker product of Rademacher vectors. For PSD matrices, our results imply a worst-case upper bound of $\ell=O(\frac{3^{k}}{\varepsilon^{2}})$ on the number of samples needed to estimate the trace of $\mathbf{A}$ to relative error $(1\pm\varepsilon)$. We match this upper bound with a lower bound showing that the Kronecker-Hutchinson Estimator with iid zero-mean unit-variance vectors has to use $\ell=\Omega(\frac{3^{k}}{\varepsilon^{2}})$ samples to achieve standard deviation $\varepsilon\operatorname{tr}(\mathbf{A})$. Lastly, we show that if we are allowed to use complex Kronecker-matrix-vector products instead of real matrix-vector products (i.e. if $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$ where $\mathbf{x}_{i}\in\mathbb{C}^{d}$), then this rate improves to $\Theta(\frac{2^{k}}{\varepsilon^{2}})$.

To more concretely state our core technical contribution, we recall the partial trace operator from quantum physics. In quantum physics, a Hermitian density matrix $\mathbf{A}\in\mathbb{C}^{d^{k}\times d^{k}}$ is, roughly speaking, used to represent the joint probability distribution for a set of $k$ particles. Suppose these particles are partitioned into two sets, so that $[k]=\mathcal{S}\cup\bar{\mathcal{S}}$. Further suppose that Alice can observe the particles in $\mathcal{S}$, and that Bob can observe the particles in $\bar{\mathcal{S}}$, but they cannot observe each other's particles. Then, the partial trace of $\mathbf{A}$ with respect to $\mathcal{S}$, denoted $\operatorname{tr}_{\mathcal{S}}(\mathbf{A})$, describes the marginal distribution of the particles that Bob observes. This is often called “tracing out the subsystem $\mathcal{S}$”.

We include a general formal definition of the partial trace in Section 2, but a rough intuition is easily gained by examining the $k=2$ case. Here, we can break down $\mathbf{A}\in\mathbb{R}^{d^{2}\times d^{2}}$ as a block matrix full of blocks $\mathbf{A}_{ij}\in\mathbb{R}^{d\times d}$:

\[\mathbf{A}=\begin{bmatrix}\mathbf{A}_{11}&\mathbf{A}_{12}&\ldots&\mathbf{A}_{1d}\\ \mathbf{A}_{21}&\mathbf{A}_{22}&\ldots&\mathbf{A}_{2d}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{A}_{d1}&\mathbf{A}_{d2}&\ldots&\mathbf{A}_{dd}\end{bmatrix}.\]

Then, the partial trace of $\mathbf{A}$ with respect to $\{1\}$ is the sum of the diagonal blocks, i.e. $\operatorname{tr}_{\{1\}}(\mathbf{A})=\sum_{i=1}^{d}\mathbf{A}_{ii}$, and the partial trace of $\mathbf{A}$ with respect to $\{2\}$ is the matrix that contains the trace of each block, i.e.

\[\operatorname{tr}_{\{2\}}(\mathbf{A})=\begin{bmatrix}\operatorname{tr}(\mathbf{A}_{11})&\operatorname{tr}(\mathbf{A}_{12})&\ldots&\operatorname{tr}(\mathbf{A}_{1d})\\ \operatorname{tr}(\mathbf{A}_{21})&\operatorname{tr}(\mathbf{A}_{22})&\ldots&\operatorname{tr}(\mathbf{A}_{2d})\\ \vdots&\vdots&\ddots&\vdots\\ \operatorname{tr}(\mathbf{A}_{d1})&\operatorname{tr}(\mathbf{A}_{d2})&\ldots&\operatorname{tr}(\mathbf{A}_{dd})\end{bmatrix}.\]
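For readers who prefer code to block matrices, here is a small numpy sketch of these two $k=2$ partial traces using the block decomposition above; the reshape-based helper is an illustrative implementation, not notation from the paper.

```python
import numpy as np

def partial_trace_k2(A, d, subsystem):
    """Partial traces of A in R^{d^2 x d^2}, viewed as a d x d grid of d x d blocks:
    subsystem=1 sums the diagonal blocks; subsystem=2 takes the trace of each block."""
    T = A.reshape(d, d, d, d)               # T[i, a, j, b] = [A_{ij}]_{ab}
    if subsystem == 1:
        return np.einsum('iaib->ab', T)     # sum_i A_{ii}
    return np.einsum('iaja->ij', T)         # matrix of tr(A_{ij})

A = np.arange(16.0).reshape(4, 4)           # d = 2 example
print(partial_trace_k2(A, 2, 1))            # A_11 + A_22
print(partial_trace_k2(A, 2, 2))            # [[tr A_11, tr A_12], [tr A_21, tr A_22]]
```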

We will also need to recall the partial transpose operator from quantum physics. This operator is often used in quantum physics, but it lacks an intuitive physical meaning analogous to the one the partial trace enjoys. Like the partial trace, the partial transpose can also be taken over any subset of particles $\mathcal{V}\subseteq[k]$. We denote the partial transpose of $\mathbf{A}$ with respect to $\mathcal{V}$ as $\mathbf{A}^{\intercal_{\mathcal{V}}}$, and call this “transposing the subsystem $\mathcal{V}$”. We give a formal definition of the partial transpose in Section 2. But, when $k=2$, the two partial transposes of $\mathbf{A}$ are:

\[\mathbf{A}^{\intercal_{\{1\}}}=\begin{bmatrix}\mathbf{A}_{11}&\mathbf{A}_{21}&\cdots&\mathbf{A}_{d1}\\ \mathbf{A}_{12}&\mathbf{A}_{22}&\cdots&\mathbf{A}_{d2}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{A}_{1d}&\mathbf{A}_{2d}&\cdots&\mathbf{A}_{dd}\end{bmatrix}\qquad\text{and}\qquad\mathbf{A}^{\intercal_{\{2\}}}=\begin{bmatrix}\mathbf{A}_{11}^{\intercal}&\mathbf{A}_{12}^{\intercal}&\cdots&\mathbf{A}_{1d}^{\intercal}\\ \mathbf{A}_{21}^{\intercal}&\mathbf{A}_{22}^{\intercal}&\cdots&\mathbf{A}_{2d}^{\intercal}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{A}_{d1}^{\intercal}&\mathbf{A}_{d2}^{\intercal}&\cdots&\mathbf{A}_{dd}^{\intercal}\end{bmatrix}.\]
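The same block-reshape view gives a short sketch of these two partial transposes for $k=2$; again, the helper below is an illustrative implementation under the assumption that $\mathbf{A}$ is stored as a dense numpy array.

```python
import numpy as np

def partial_transpose_k2(A, d, subsystem):
    """Partial transpose of A in R^{d^2 x d^2} over one of its two subsystems:
    subsystem=1 swaps the blocks A_{ij} <-> A_{ji}; subsystem=2 transposes each block."""
    T = A.reshape(d, d, d, d)               # T[i, a, j, b] = [A_{ij}]_{ab}
    if subsystem == 1:
        T = T.transpose(2, 1, 0, 3)         # block (i, j) becomes A_{ji}
    else:
        T = T.transpose(0, 3, 2, 1)         # block (i, j) becomes A_{ij}^T
    return T.reshape(d * d, d * d)

A = np.arange(16.0).reshape(4, 4)
print(partial_transpose_k2(A, 2, 1))
print(partial_transpose_k2(A, 2, 2))
```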

Having set up both the partial trace and the partial transpose, we can now state our core theorem:

Theorem 1.

Fix $d,k\in\mathbb{N}$, and let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$, where each $\mathbf{x}_{i}$ is independent and is either a Rademacher vector or a $\mathcal{N}(\mathbf{0},\mathbf{I})$ Gaussian vector. Let $\bar{\mathbf{A}}:=\frac{1}{2^{k}}\sum_{\mathcal{V}\subseteq[k]}\mathbf{A}^{\intercal_{\mathcal{V}}}$ be the average of all partial transposes of $\mathbf{A}$. Then, $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]\leq\sum_{\mathcal{S}\subset[k]}2^{k-|\mathcal{S}|}\lVert\operatorname{tr}_{\mathcal{S}}(\bar{\mathbf{A}})\rVert_{F}^{2}$. Furthermore, if all $\mathbf{x}_{i}$ are $\mathcal{N}(\mathbf{0},\mathbf{I})$, then this expression is an equality.

The matrix $\bar{\mathbf{A}}$ seems odd at a glance, so we first justify why it is natural. For the classical $k=1$ setting with Gaussian vectors, we can write that $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]=2\lVert\mathbf{A}\rVert_{F}^{2}$ for all symmetric matrices $\mathbf{A}$ [Avron and Toledo, 2011, Meyer, 2021]. For asymmetric matrices, this equality does not hold. Instead, we let $\hat{\mathbf{A}}:=\frac{\mathbf{A}+\mathbf{A}^{\intercal}}{2}$, and note that $\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}=\mathbf{x}^{\intercal}\hat{\mathbf{A}}\mathbf{x}$ for all $\mathbf{x}\in\mathbb{R}^{D}$. Since $\hat{\mathbf{A}}$ is symmetric, we then find that $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]=2\lVert\hat{\mathbf{A}}\rVert_{F}^{2}$. Theorem 1 generalizes this idea to all possible ways to partially transpose across all possible subsystems $\mathcal{V}\subseteq[k]$.

Next, note that $\lVert\hat{\mathbf{A}}\rVert_{F}\leq\lVert\mathbf{A}\rVert_{F}$, and that symmetric matrices have $\mathbf{A}=\hat{\mathbf{A}}$. So, we know that $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]=2\lVert\hat{\mathbf{A}}\rVert_{F}^{2}\leq 2\lVert\mathbf{A}\rVert_{F}^{2}$ holds for all matrices $\mathbf{A}$, with equality for symmetric $\mathbf{A}$. This bound can be easier to work with, since it does not require worrying about $\hat{\mathbf{A}}$. For large $k$, we similarly bound $\lVert\operatorname{tr}_{\mathcal{S}}(\bar{\mathbf{A}})\rVert_{F}\leq\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert_{F}$, providing a more approachable version of Theorem 1:

Theorem 2.

Fix $d,k\in\mathbb{N}$, and let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$, where each $\mathbf{x}_{i}$ is independent and is either a Rademacher vector or a $\mathcal{N}(\mathbf{0},\mathbf{I})$ Gaussian vector. Then, $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]\leq\sum_{\mathcal{S}\subset[k]}2^{k-|\mathcal{S}|}\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert_{F}^{2}$. Furthermore, if all $\mathbf{x}_{i}$ are $\mathcal{N}(\mathbf{0},\mathbf{I})$ and if $\mathbf{A}=\bar{\mathbf{A}}$, then this expression is an equality.

Note that many reasonable matrices will have $\mathbf{A}=\bar{\mathbf{A}}$. For instance, recall that the Kronecker-matrix-vector oracle model may be used when $\mathbf{A}$ is the sum of a small number of matrices, where each summand is a Kronecker product of $k$ matrices. If each summand is the Kronecker product of symmetric matrices, then we have $\mathbf{A}=\bar{\mathbf{A}}$.
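As a sanity check on Theorem 1, the following hedged numpy experiment compares the exact variance formula (for Gaussian factors, $k=2$, and a symmetric Kronecker-structured $\mathbf{A}$, so that $\mathbf{A}=\bar{\mathbf{A}}$) against an empirical variance; the partial-trace helpers and sample sizes are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
# A = A1 (x) A2 with symmetric factors, so A equals the average of its partial transposes.
A1 = rng.standard_normal((d, d)); A1 = A1 + A1.T
A2 = rng.standard_normal((d, d)); A2 = A2 + A2.T
A = np.kron(A1, A2)

# Theorem 1 for k = 2 (sum over proper subsets S of {1,2}):
#   Var[x^T A x] = 4*||A||_F^2 + 2*||tr_{1}(A)||_F^2 + 2*||tr_{2}(A)||_F^2.
T = A.reshape(d, d, d, d)
tr1 = np.einsum('iaib->ab', T)              # trace out subsystem 1
tr2 = np.einsum('iaja->ij', T)              # trace out subsystem 2
predicted = 4 * np.sum(A**2) + 2 * np.sum(tr1**2) + 2 * np.sum(tr2**2)

def one_sample():
    x = np.kron(rng.standard_normal(d), rng.standard_normal(d))
    return x @ A @ x

samples = np.array([one_sample() for _ in range(200_000)])
print(predicted, samples.var())             # the two numbers should roughly agree
```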

We demonstrate the utility of Theorem 2 by using it to show the following guarantee on estimating the trace of a random rank-one matrix:

Theorem 3.

Fix $d,k\in\mathbb{N}$. Let $\mathbf{A}=\mathbf{g}\mathbf{g}^{\intercal}$ where $\mathbf{g}\in\mathbb{R}^{d^{k}}$ is a vector of iid standard normal Gaussians. Then, with probability $\frac{3}{4}$, we know that $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}~|~\mathbf{g}]\leq O((2+\frac{1}{d})^{k}(\operatorname{tr}(\mathbf{A}))^{2})$. In particular, with probability $\frac{2}{3}$, we know that the Kronecker-Hutchinson Estimator returns a $(1\pm\varepsilon)$ relative error estimate to the trace of $\mathbf{A}$ if $\ell=O(\frac{1}{\varepsilon^{2}}(2+\frac{1}{d})^{k})$.

This rate of $(2+\frac{1}{d})^{k}$ is obviously expensive, but our sample complexity is lower than what is achievable with any other technique that we are aware of. Nevertheless, even without $\bar{\mathbf{A}}$, this sum of partial traces may sometimes still be too unwieldy to work with. For such cases, we provide a worst-case bound for PSD matrices:

Theorem 4.

Fix $d,k\in\mathbb{N}$, and let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ be a PSD matrix. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$, where each $\mathbf{x}_{i}$ is independent and is either a Rademacher vector or a $\mathcal{N}(\mathbf{0},\mathbf{I})$ Gaussian vector. Then, $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]\leq 3^{k}(\operatorname{tr}(\mathbf{A}))^{2}$.

In particular, by Chebyshev's Inequality, Theorem 4 implies that $\ell=O(\frac{3^{k}}{\varepsilon^{2}})$ samples suffice to estimate $\operatorname{tr}(\mathbf{A})$ to relative error $(1\pm\varepsilon)$. For the $k=2$ case studied by [Bujanovic and Kressner, 2021], we get the following:

Corollary 5.

Let $\mathbf{A}\in\mathbb{R}^{d^{2}\times d^{2}}$. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\mathbf{x}_{2}$, where each $\mathbf{x}_{i}$ is independent and is either a Rademacher vector or a $\mathcal{N}(\mathbf{0},\mathbf{I})$ Gaussian vector. Let $\bar{\mathbf{A}}=\frac{1}{4}(\mathbf{A}+\mathbf{A}^{\intercal_{\{1\}}}+\mathbf{A}^{\intercal_{\{2\}}}+\mathbf{A}^{\intercal})$. Then, $\operatorname{Var}[\mathbf{x}^{\intercal}\mathbf{A}\mathbf{x}]\leq 2\lVert\operatorname{tr}_{\{1\}}(\bar{\mathbf{A}})\rVert_{F}^{2}+2\lVert\operatorname{tr}_{\{2\}}(\bar{\mathbf{A}})\rVert_{F}^{2}+4\lVert\bar{\mathbf{A}}\rVert_{F}^{2}$. If $\mathbf{A}$ is PSD, then with probability $\frac{2}{3}$, the Kronecker-Hutchinson Estimator returns a $(1\pm\varepsilon)$ relative error estimate to the trace of $\mathbf{A}$ if $\ell=O(\varepsilon^{-2})$.

This success probability can be boosted to $1-\delta$ for an arbitrary $\delta$ by picking up a $\log(\frac{1}{\delta})$ dependence in the sample complexity: return the median of $O(\log(\frac{1}{\delta}))$ iid trials of $H_{\ell}(\mathbf{A})$. We further show that the expensive rate of $O(\frac{3^{k}}{\varepsilon^{2}})$ is essentially tight in the worst case for the Kronecker-Hutchinson Estimator:
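A hedged sketch of this standard median-boosting trick is below; the constant 24 in the trial count is an arbitrary illustrative choice, and `single_estimate` stands in for one run of $H_{\ell}(\mathbf{A})$.

```python
import numpy as np

def median_boost(single_estimate, delta, rng=np.random.default_rng(0)):
    """Median of O(log(1/delta)) independent copies of an estimator that is
    accurate with probability 2/3; the median is accurate w.p. at least 1 - delta."""
    trials = int(np.ceil(24 * np.log(1.0 / delta)))
    return np.median([single_estimate(rng) for _ in range(trials)])

# Stand-in estimator: an unbiased but noisy estimate of a trace equal to 100.
print(median_boost(lambda rng: 100.0 + rng.standard_normal(), delta=1e-6))
```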

Theorem 6.

Fix $d,k\in\mathbb{N}$. Consider Hutchinson's Estimator run with vectors $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$ where $\mathbf{x}_{i}$ are vectors of iid real zero-mean unit-variance entries. Then, there exists a PSD matrix $\mathbf{A}$ such that Hutchinson's Estimator needs $\ell=\Omega(\frac{(3-\frac{2}{d})^{k}}{\varepsilon^{2}})$ samples in order to have standard deviation $\leq\varepsilon\operatorname{tr}(\mathbf{A})$.

1.2.1 Complex Kronecker-Matrix-Vector Products

Although certainly cheaper than exactly computing the trace with $d^{k}$ queries to the diagonal of $\mathbf{A}$, the rate of $\frac{3^{k}}{\varepsilon^{2}}$ is prohibitively expensive for large $k$. Our next set of results shows that the bound improves if we can issue complex queries instead of real ones, that is, when $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$ where $\mathbf{x}_{i}\in\mathbb{C}^{d}$. It seems realistic to say that any software implementation of the Kronecker-matrix-vector oracle could handle complex queries instead of real ones.

Note that for $k=1$, there is almost no distinction between the real and complex oracle models, as any complex matrix-vector product can be simulated by running two real matrix-vector products. However, this is no longer the case for large values of $k$, where real simulation becomes expensive. To see this, note that the real part of the Kronecker product of complex vectors might not have Kronecker structure. Because of this, it may take $2^{k}$ real-valued Kronecker-matrix-vector products to simulate a single complex-valued Kronecker-matrix-vector product. In a vague intuitive sense, the complex Kronecker-structured vectors are exponentially less constrained than the real Kronecker-structured vectors. With this intuition in mind, we prove the following analogues of Theorems 1, 2, 3, 4 and 6:

Theorem 7.

Fix $d,k\in\mathbb{N}$, and let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$. Let $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$, where each $\mathbf{x}_{i}$ is independent and is either a complex Rademacher vector or a complex Gaussian vector. That is, we have $\mathbf{x}_{i}=\frac{1}{\sqrt{2}}\mathbf{r}_{i}+\frac{i}{\sqrt{2}}\mathbf{m}_{i}$ where $\mathbf{r}_{i}$ and $\mathbf{m}_{i}$ are iid real-valued vectors of either Rademacher or standard normal Gaussian entries. Let $\bar{\mathbf{A}}:=\frac{1}{2^{k}}\sum_{\mathcal{V}\subseteq[k]}\mathbf{A}^{\intercal_{\mathcal{V}}}$ be the average of all partial transposes of $\mathbf{A}$. Then, $\operatorname{Var}[\mathbf{x}^{\textsf{H}}\mathbf{A}\mathbf{x}]\leq\sum_{\mathcal{S}\subset[k]}\lVert\operatorname{tr}_{\mathcal{S}}(\bar{\mathbf{A}})\rVert_{F}^{2}\leq\sum_{\mathcal{S}\subset[k]}\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert_{F}^{2}$. If the $\mathbf{x}_{i}$ are complex Gaussians, then the first expression is an equality. If we further know that $\mathbf{A}=\bar{\mathbf{A}}$, then both expressions are equalities. Further, if $\mathbf{A}$ is PSD, then we have $\operatorname{Var}[\mathbf{x}^{\textsf{H}}\mathbf{A}\mathbf{x}]\leq 2^{k}(\operatorname{tr}(\mathbf{A}))^{2}$.

Theorem 8.

Fix $d,k\in\mathbb{N}$. Let $\mathbf{A}=\mathbf{g}\mathbf{g}^{\intercal}$ where $\mathbf{g}\in\mathbb{R}^{d^{k}}$ is a vector of iid standard normal Gaussians. Then, with probability $\frac{3}{4}$, we know that $\operatorname{Var}[\mathbf{x}^{\textsf{H}}\mathbf{A}\mathbf{x}~|~\mathbf{g}]\leq O((1+\frac{1}{d})^{k}(\operatorname{tr}(\mathbf{A}))^{2})$. In particular, with probability $\frac{2}{3}$, we know that the Kronecker-Hutchinson Estimator returns a $(1\pm\varepsilon)$ relative error estimate to the trace of $\mathbf{A}$ if $\ell=O(\frac{1}{\varepsilon^{2}}(1+\frac{1}{d})^{k})$.

Theorem 9.

Fix $d,k\in\mathbb{N}$. Consider Hutchinson's Estimator run with vectors $\mathbf{x}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{k}$ where $\mathbf{x}_{i}$ are vectors of iid complex zero-mean unit-variance entries. Then, there exists a PSD matrix $\mathbf{A}$ such that Hutchinson's Estimator needs $\ell=\Omega(\frac{(2-\frac{1}{d})^{k}}{\varepsilon^{2}})$ samples in order to have standard deviation $\leq\varepsilon\operatorname{tr}(\mathbf{A})$.

By applying Chebyshev's Inequality to Theorems 4 and 7, we see an exponential speedup from $\ell=O(\frac{3^{k}}{\varepsilon^{2}})$ queries in the real-valued case to $\ell=O(\frac{2^{k}}{\varepsilon^{2}})$ queries in the complex-valued case. For large $k$, this gap from $3^{k}$ to $2^{k}$ is a very significant speedup, and we are not aware of any other prior works which discuss this potential.

Another way to view this speedup is to compare the Kronecker-Hutchinson Estimator to just summing the diagonal of $\mathbf{A}$, which takes exactly $d^{k}$ queries. When $d=2$, the real-valued Kronecker-Hutchinson Estimator is exponentially more expensive than just summing the diagonal directly. However, the complex-valued Kronecker-Hutchinson Estimator is only $O(\frac{1}{\varepsilon^{2}})$ times more expensive in the worst case. This is another way to frame the power of the complex-valued estimator: it is at most polynomially slower than summing the diagonal, while the real-valued estimator can be exponentially slower.

We can also directly compare Theorems 3 and 8. In the real-valued case, the sample complexity grows as $\ell=O(\frac{1}{\varepsilon^{2}}(2+\frac{1}{d})^{k})$, while in the complex-valued case, the sample complexity grows as $\ell=O(\frac{1}{\varepsilon^{2}}(1+\frac{1}{d})^{k})$. Notably, if $k=O(d)$ then this rate grows as $O(\frac{2^{k}}{\varepsilon^{2}})$ for real-valued queries and $O(\frac{1}{\varepsilon^{2}})$ for complex-valued queries. This shows an exponential gap between our bounds, where we are only leveraging the difference between complex-valued and real-valued vectors.
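To make the complex queries concrete, here is a hedged numpy sketch of the complex Kronecker-Hutchinson Estimator, with factors $\mathbf{x}_{i}=\frac{1}{\sqrt{2}}(\mathbf{r}_{i}+i\,\mathbf{m}_{i})$ as in Theorem 7; the function names and the dense toy oracle are our own illustrative choices.

```python
import numpy as np
from functools import reduce

def complex_kron_query(d, k, rng, kind="gaussian"):
    """One query x = x_1 (x) ... (x) x_k with x_i = (r_i + 1j*m_i)/sqrt(2),
    where r_i, m_i have iid real Gaussian or Rademacher entries."""
    def factor():
        if kind == "gaussian":
            r, m = rng.standard_normal(d), rng.standard_normal(d)
        else:
            r, m = rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)
        return (r + 1j * m) / np.sqrt(2)
    return reduce(np.kron, [factor() for _ in range(k)])

def complex_kron_hutchinson(kron_matvec, d, k, ell, rng=np.random.default_rng(0)):
    """Average of x^H A x over ell complex Kronecker queries; for real A the
    imaginary part vanishes in expectation, so we report the real part."""
    vals = [np.vdot(x, kron_matvec(x))
            for x in (complex_kron_query(d, k, rng) for _ in range(ell))]
    return np.mean(vals).real

d, k = 3, 4
G = np.random.default_rng(1).standard_normal((d**k, d**k))
A = G @ G.T
print(complex_kron_hutchinson(lambda x: A @ x, d, k, ell=2000), np.trace(A))
```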

1.2.2 Additional Results and Hope for a Faster Algorithm

We complement the aforementioned results with a few informative additions. First, we show a lower bound against all Kronecker-matrix-vector algorithms. By adapting the techniques of [Braverman et al., 2020] and [Jiang et al., 2021], we show that any Kronecker-matrix-vector algorithm that estimates the trace of $\mathbf{A}$ with standard deviation $\varepsilon\operatorname{tr}(\mathbf{A})$ must make at least $\Omega(\frac{\sqrt{k}}{\varepsilon})$ oracle queries.

Second, we note that the lower bounds in Theorems 6 and 9 both use constructions of $\mathbf{A}$ such that $\mathbf{A}=\mathbf{A}_{1}\otimes\cdots\otimes\mathbf{A}_{k}$. That is, the Kronecker-Hutchinson Estimator converges slowest for matrices that have Kronecker structure. However, if we truly are given $\mathbf{A}$ with the promise that $\mathbf{A}=\mathbf{A}_{1}\otimes\cdots\otimes\mathbf{A}_{k}$ for some unknown matrices $\mathbf{A}_{1},\ldots,\mathbf{A}_{k}$, then in Section 7 we show that the trace of $\mathbf{A}$ can be recovered exactly using just $kd+1$ Kronecker-matrix-vector products. Similarly, Theorems 3 and 8 suggest that the Kronecker-Hutchinson Estimator converges exponentially slowly for random rank-one matrices. But in this rank-one case, we show that we can exactly recover the trace of $\mathbf{A}$ with a single Kronecker-matrix-vector query!

This trend, where the matrices that make the Kronecker-Hutchinson Estimator converge slowly can have their traces efficiently estimated by another algorithm, is reminiscent of the setup for the analysis in [Meyer et al., 2021]. In that paper, the authors recall the analysis of the standard deviation of Hutchinson's Estimator:

\[\sqrt{\operatorname{Var}[H_{\ell}(\mathbf{A})]}\leq\sqrt{\frac{2}{\ell}}\,\lVert\mathbf{A}\rVert_{F}\leq\sqrt{\frac{2}{\ell}}\operatorname{tr}(\mathbf{A})\leq\varepsilon\operatorname{tr}(\mathbf{A}),\]

where the middle inequality uses that $\lVert\mathbf{A}\rVert_{F}\leq\operatorname{tr}(\mathbf{A})$ for PSD $\mathbf{A}$, and the last inequality forces us to pick the sample complexity $\ell\geq\frac{2}{\varepsilon^{2}}$. Since the bound $\lVert\mathbf{A}\rVert_{F}\leq\operatorname{tr}(\mathbf{A})$ is only tight for matrices $\mathbf{A}$ that are nearly low-rank, the authors propose an algorithm which combines Hutchinson's Estimator with a low-rank approximation method that can efficiently compute the trace of nearly-low-rank matrices. This way, their algorithm has a worst-case matrix-vector complexity of $\Theta(\frac{1}{\varepsilon})$, while Hutchinson's Estimator is stuck with $\Theta(\frac{1}{\varepsilon^{2}})$.

Our analysis similarly characterizes two cases where the Kronecker-Hutchinson Estimator converges slowly: when $\mathbf{A}$ is either low-rank or has Kronecker structure. We also show that we can compute the trace of such matrices very efficiently. This suggests that a more involved algorithm could interpolate between these low query complexities and perhaps even avoid any exponential dependence on $k$ at all. We leave this as an important problem for future work.

Lastly, we remark that the analysis from [Meyer et al., 2021] crucially uses a tight bound for the variance of Hutchinson's Estimator in terms of $\lVert\mathbf{A}\rVert_{F}$. We believe that our core results, Theorems 1 and 2, can serve an analogous role in constructing a faster trace-estimating algorithm.

2 Preliminaries

We use capital bold letters to denote matrices, lowercase bold letters to denote vectors, and lowercase non-bold letters to denote scalars. $\mathbb{R}$ is the set of reals, and $\mathbb{C}$ is the set of complex numbers. $\mathfrak{Re}[\mathbf{x}]$ and $\mathfrak{Im}[\mathbf{x}]$ denote the real and imaginary parts of a vector or scalar. $\mathbf{x}^{\intercal}$ denotes the transpose of a vector, while $\mathbf{x}^{\textsf{H}}$ denotes the conjugate transpose of a matrix or vector. $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ denotes our input matrix. We use the bracket notation $[\mathbf{B}]_{ij}$ to denote the $i,j$ entry of matrix $\mathbf{B}$. We let $\mathbf{e}_{i}$ denote the $i^{th}$ standard basis vector, whose exact dimension may depend on the context. We let $\mathbf{I}\in\mathbb{R}^{d\times d}$ always refer to the $d\times d$ identity matrix. We let $\lVert\mathbf{A}\rVert_{F}$ denote the Frobenius norm, $\operatorname{tr}(\mathbf{A})$ the trace, $\operatorname{tr}_{\mathcal{S}}(\mathbf{A})$ the partial trace, and $\mathbf{A}^{\intercal_{\mathcal{V}}}$ the partial transpose. We let $[k]=\{1,\ldots,k\}$ be the set of integers from $1$ to $k$.

2.1 Kronecker product

The Kronecker product of two matrices $\mathbf{A}\in\mathbb{R}^{n\times m}$ and $\mathbf{B}\in\mathbb{R}^{p\times q}$ is $\mathbf{A}\otimes\mathbf{B}\in\mathbb{R}^{np\times mq}$, where

\[\mathbf{A}\otimes\mathbf{B} := \begin{bmatrix}[\mathbf{A}]_{11}\mathbf{B}&[\mathbf{A}]_{12}\mathbf{B}&\ldots&[\mathbf{A}]_{1m}\mathbf{B}\\ [\mathbf{A}]_{21}\mathbf{B}&[\mathbf{A}]_{22}\mathbf{B}&\ldots&[\mathbf{A}]_{2m}\mathbf{B}\\ \vdots&\vdots&\ddots&\vdots\\ [\mathbf{A}]_{n1}\mathbf{B}&[\mathbf{A}]_{n2}\mathbf{B}&\ldots&[\mathbf{A}]_{nm}\mathbf{B}\end{bmatrix}.\]

We also use the shorthands $\otimes_{i=1}^{k}\mathbf{x}_{i} := \mathbf{x}_{1}\otimes\mathbf{x}_{2}\otimes\cdots\otimes\mathbf{x}_{k}$, as well as $\mathbf{A}^{\otimes k} := \otimes_{i=1}^{k}\mathbf{A}$. The Kronecker product has three key properties that we repeatedly use:
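A tiny numpy sketch of the definition and these shorthands (the helper `kron_all` is ours, not the paper's notation):

```python
import numpy as np
from functools import reduce

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.eye(2)

# Definition: every entry [A]_{ij} scales a full copy of B.
print(np.kron(A, B))

def kron_all(factors):
    """The shorthand kron_{i=1}^k x_i (or A^{(x) k} when all factors are equal)."""
    return reduce(np.kron, factors)

print(kron_all([A, B, A]).shape)                                  # (8, 8)
print(np.allclose(kron_all([A] * 3), np.kron(np.kron(A, A), A)))  # A^{(x) 3}
```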

Lemma 10 (Mixed-Product Property (Folklore)).

\[(\mathbf{A}\otimes\mathbf{B})(\mathbf{C}\otimes\mathbf{D})=\mathbf{A}\mathbf{C}\otimes\mathbf{B}\mathbf{D}\]

Lemma 11 (Transposing (Folklore)).

\[(\mathbf{A}\otimes\mathbf{B})^{\intercal}=(\mathbf{A}^{\intercal}\otimes\mathbf{B}^{\intercal})\]

Lemma 12 (Fact 7.4.1 in [Bernstein, 2011]).

Let $\mathbf{x}\in\mathbb{R}^{n}$ and $\mathbf{y}\in\mathbb{R}^{m}$. Then, $\mathbf{x}\otimes\mathbf{y}=(\mathbf{I}_{n}\otimes\mathbf{y})\mathbf{x}=(\mathbf{x}\otimes\mathbf{I}_{m})\mathbf{y}$.
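These three facts are easy to spot-check numerically; the quick script below is an illustrative verification, not part of the paper's development.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 3)), rng.standard_normal((4, 5))
C, D = rng.standard_normal((3, 2)), rng.standard_normal((5, 4))
x, y = rng.standard_normal(3), rng.standard_normal(5)
n, m = x.size, y.size

# Lemma 10 (mixed-product): (A (x) B)(C (x) D) = AC (x) BD.
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# Lemma 11 (transposing): (A (x) B)^T = A^T (x) B^T.
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))

# Lemma 12: x (x) y = (I_n (x) y) x = (x (x) I_m) y.
assert np.allclose(np.kron(x, y), np.kron(np.eye(n), y.reshape(m, 1)) @ x)
assert np.allclose(np.kron(x, y), np.kron(x.reshape(n, 1), np.eye(m)) @ y)
print("all three identities verified")
```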

At one point, we will also require a short lemma about the squared Frobenius norm and the Kronecker product. This lemma is roughly equivalent to noting that $\lVert\mathbf{A}\rVert_{F}^{2}=\sum_{i,j=1}^{d}\lVert\mathbf{A}_{ij}\rVert_{F}^{2}$ in the block matrix decomposition of $\mathbf{A}$ when $k=2$ in Section 1.2.

Lemma 13.

Let $\mathbf{C},\mathbf{E}\in\mathbb{R}^{d^{k-i}\times d^{k-i}}$ and $\mathbf{D}\in\mathbb{R}^{d^{k}\times d^{k}}$ for any $i\in[k]$. Then, we have

\[\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert(\mathbf{C}\otimes\mathbf{e}_{\hat{i}})^{\intercal}\mathbf{D}(\mathbf{E}\otimes\mathbf{e}_{\hat{j}})\rVert_{F}^{2}=\lVert(\mathbf{C}\otimes\mathbf{I})^{\intercal}\mathbf{D}(\mathbf{E}\otimes\mathbf{I})\rVert_{F}^{2}.\]
Proof.

First, recall that there exists a permutation matrix $\mathbf{P}$ such that $\mathbf{C}\otimes\mathbf{y}=\mathbf{P}(\mathbf{y}\otimes\mathbf{C})$ for all $\mathbf{y}\in\mathbb{R}^{d^{i}}$. Now let $\hat{\mathbf{D}}:=\mathbf{P}^{\intercal}\mathbf{D}\mathbf{P}$. Then, expanding $\hat{\mathbf{D}}$ as a block matrix of matrices $\hat{\mathbf{D}}_{\hat{i},\hat{j}}\in\mathbb{R}^{d^{k-i}\times d^{k-i}}$, and noting that $(\mathbf{e}_{\hat{i}}\otimes\mathbf{C})^{\intercal}=\begin{bmatrix}\mathbf{0}&\cdots&\mathbf{0}&\mathbf{C}^{\intercal}&\mathbf{0}&\cdots&\mathbf{0}\end{bmatrix}$, we get that

\[(\mathbf{e}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\hat{\mathbf{D}}(\mathbf{e}_{\hat{j}}\otimes\mathbf{E})=\begin{bmatrix}\mathbf{0}&\cdots&\mathbf{0}&\mathbf{C}^{\intercal}&\mathbf{0}&\cdots&\mathbf{0}\end{bmatrix}\begin{bmatrix}\hat{\mathbf{D}}_{11}&\cdots&\hat{\mathbf{D}}_{1d}\\ \vdots&\ddots&\vdots\\ \hat{\mathbf{D}}_{d1}&\cdots&\hat{\mathbf{D}}_{dd}\end{bmatrix}\begin{bmatrix}\mathbf{0}\\ \vdots\\ \mathbf{0}\\ \mathbf{E}\\ \mathbf{0}\\ \vdots\\ \mathbf{0}\end{bmatrix}=\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{\hat{i}\hat{j}}\mathbf{E}.\]

We can also notice that

\[(\mathbf{I}\otimes\mathbf{C})^{\intercal}\hat{\mathbf{D}}(\mathbf{I}\otimes\mathbf{E})=\begin{bmatrix}\mathbf{C}^{\intercal}\\ &\ddots\\ &&\mathbf{C}^{\intercal}\end{bmatrix}\begin{bmatrix}\hat{\mathbf{D}}_{11}&\ldots&\hat{\mathbf{D}}_{1d}\\ \vdots&\ddots&\vdots\\ \hat{\mathbf{D}}_{d1}&\ldots&\hat{\mathbf{D}}_{dd}\end{bmatrix}\begin{bmatrix}\mathbf{E}\\ &\ddots\\ &&\mathbf{E}\end{bmatrix}=\begin{bmatrix}\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{11}\mathbf{E}&\ldots&\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{1d}\mathbf{E}\\ \vdots&\ddots&\vdots\\ \mathbf{C}^{\intercal}\hat{\mathbf{D}}_{d1}\mathbf{E}&\ldots&\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{dd}\mathbf{E}\end{bmatrix}.\]

Then, noting that the Frobenius norm is invariant to permutation, we find that

\begin{align*}
\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert(\mathbf{C}\otimes\mathbf{e}_{\hat{i}})^{\intercal}\mathbf{D}(\mathbf{E}\otimes\mathbf{e}_{\hat{j}})\rVert_{F}^{2} &=\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert(\mathbf{e}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\mathbf{P}^{\intercal}\mathbf{D}\mathbf{P}(\mathbf{e}_{\hat{j}}\otimes\mathbf{E})\rVert_{F}^{2}\\
&=\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert(\mathbf{e}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\hat{\mathbf{D}}(\mathbf{e}_{\hat{j}}\otimes\mathbf{E})\rVert_{F}^{2}\\
&=\sum_{\hat{i},\hat{j}=1}^{d^{i}}\lVert\mathbf{C}^{\intercal}\hat{\mathbf{D}}_{\hat{i}\hat{j}}\mathbf{E}\rVert_{F}^{2}\\
&=\lVert(\mathbf{I}\otimes\mathbf{C})^{\intercal}\hat{\mathbf{D}}(\mathbf{I}\otimes\mathbf{E})\rVert_{F}^{2}\\
&=\lVert(\mathbf{C}\otimes\mathbf{I})^{\intercal}\mathbf{D}(\mathbf{E}\otimes\mathbf{I})\rVert_{F}^{2},
\end{align*}

where the last equality comes from again permuting the matrices inside the norm. ∎

2.2 The Partial Trace Operator

We take a moment to formalize the partial trace operator and characterize some useful properties of it. We start by defining the partial trace with respect to a single subsystem:

Definition 14.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$. Then, the partial trace of $\mathbf{A}$ with respect to its $i^{th}$ subsystem is

\[\operatorname{tr}_{\{i\}}(\mathbf{A}) := \sum_{j=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-i})\in\mathbb{R}^{d^{k-1}\times d^{k-1}},\tag{1}\]

where these identity matrices are $d\times d$ matrices, and where $\mathbf{e}_{j}\in\mathbb{R}^{d}$ is the $j^{th}$ standard basis vector.

Equivalently, the partial trace operator can be defined as the unique linear operator which guarantees that for all $\mathbf{A}\in\mathbb{R}^{d^{i-1}\times d^{i-1}}$, $\mathbf{B}\in\mathbb{R}^{d\times d}$, and $\mathbf{C}\in\mathbb{R}^{d^{k-i}\times d^{k-i}}$, we have that $\operatorname{tr}_{\{i\}}(\mathbf{A}\otimes\mathbf{B}\otimes\mathbf{C})=\operatorname{tr}(\mathbf{B})\cdot(\mathbf{A}\otimes\mathbf{C})$.
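A hedged numpy implementation of Definition 14 is below; it contracts the appropriate pair of tensor indices rather than summing over basis vectors as in Equation 1, which is equivalent but faster, and it checks the characterization just stated.

```python
import numpy as np

def partial_trace(A, d, k, i):
    """tr_{i}(A) for A in R^{d^k x d^k}: contract the row and column indices of
    the i-th subsystem (1-indexed), matching Definition 14."""
    T = A.reshape([d] * (2 * k))            # axes ordered (r_1..r_k, c_1..c_k)
    out = np.trace(T, axis1=i - 1, axis2=k + i - 1)
    return out.reshape(d**(k - 1), d**(k - 1))

# Check tr_{i}(A (x) B (x) C) = tr(B) * (A (x) C) for i = 2, k = 3.
rng = np.random.default_rng(0)
d, k = 2, 3
A1, B1, C1 = (rng.standard_normal((d, d)) for _ in range(3))
M = np.kron(np.kron(A1, B1), C1)
assert np.allclose(partial_trace(M, d, k, 2), np.trace(B1) * np.kron(A1, C1))
```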

We will often consider the partial trace of $\mathbf{A}$ with respect to a set of subsystems $\mathcal{S}\subseteq[k]$. We can think of this as either composing the operation in Equation 1 $|\mathcal{S}|$ many times, or as extending the summation in Equation 1 to sum over all $d^{|\mathcal{S}|}$ basis vectors that span the product of the subspaces defined by $\mathcal{S}$. To see this equivalence, consider the partial trace of $\mathbf{A}$ with respect to $\mathcal{S}=\{1,2\}$:

\begin{align*}
\operatorname{tr}_{\{1,2\}}(\mathbf{A}) &=\operatorname{tr}_{\{1\}}(\operatorname{tr}_{\{2\}}(\mathbf{A}))\\
&=\sum_{j=1}^{d}(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})^{\intercal}\bigg(\sum_{\hat{j}=1}^{d}(\mathbf{I}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})^{\intercal}\mathbf{A}(\mathbf{I}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})\bigg)(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})\\
&=\sum_{j=1}^{d}\sum_{\hat{j}=1}^{d}\big((\mathbf{I}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})\big)^{\intercal}\mathbf{A}\big((\mathbf{I}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})\big)\\
&=\sum_{j=1}^{d}\sum_{\hat{j}=1}^{d}(\mathbf{e}_{j}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2})^{\intercal}\mathbf{A}(\mathbf{e}_{j}\otimes\mathbf{e}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-2}). &&\text{(Lemma 10)}
\end{align*}

Next, note that $\mathbf{e}_{j}\otimes\mathbf{e}_{\hat{j}}\in\mathbb{R}^{d^{2}}$ is a standard basis vector in $\mathbb{R}^{d^{2}}$, uniquely identified by the pair $(j,\hat{j})$. That is, we can replace this double sum above with a single sum over standard basis vectors in $\mathbb{R}^{d^{2}}$:

\[\operatorname{tr}_{\{1,2\}}(\mathbf{A})=\sum_{j=1}^{d^{2}}(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2})^{\intercal}\mathbf{A}(\mathbf{e}_{j}\otimes\mathbf{I}^{\otimes k-2}),\qquad\text{where now }\mathbf{e}_{j}\in\mathbb{R}^{d^{2}}.\]

This expression now looks remarkably similar to Definition 14, but with a larger notion of “subsystem” that the standard basis vector acts on. Formally, we define the partial trace of $\mathbf{A}$ with respect to a set $\mathcal{S}$ as follows:

Definition 15.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$, and let $\mathcal{S}\subseteq[k]$. Let $i_{1},\ldots,i_{|\mathcal{S}|}$ be the indices in $\mathcal{S}$, sorted so that $i_{1}<i_{2}<\ldots<i_{|\mathcal{S}|}$. Then, the partial trace of $\mathbf{A}$ with respect to the subsystem $\mathcal{S}$ is

\[\operatorname{tr}_{\mathcal{S}}(\mathbf{A}) := \operatorname{tr}_{\{i_{1}\}}(\operatorname{tr}_{\{i_{2}\}}(\cdots(\operatorname{tr}_{\{i_{|\mathcal{S}|}\}}(\mathbf{A}))))\in\mathbb{R}^{d^{k-|\mathcal{S}|}\times d^{k-|\mathcal{S}|}}.\]

In the special case that $\mathcal{S}=\{i,i+1,\ldots,k\}$, we denote this partial trace as $\operatorname{tr}_{i:}(\mathbf{A})$.

In principle, the order of the subscripts does not really matter. Tracing out the first subsystem and then tracing out the second subsystem produces the exact same matrix as tracing out the second subsystem and then tracing out the first subsystem. However, notationally, a non-ascending order becomes messy, since removing the second subsystem changes the indices. To see this messiness, note that the above statement would be formalized as $\operatorname{tr}_{\{1\}}(\operatorname{tr}_{\{2\}}(\mathbf{A}))=\operatorname{tr}_{\{1\}}(\operatorname{tr}_{\{1\}}(\mathbf{A}))$. To avoid this issue, we always expand the partial traces in $\mathcal{S}$ in sorted order, as in Definition 15.

Moving on, a vital property of the partial trace is that it is trace-preserving:

Lemma 16 (Folklore).

For all $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ and $\mathcal{S}\subseteq[k]$, we have $\operatorname{tr}(\operatorname{tr}_{\mathcal{S}}(\mathbf{A}))=\operatorname{tr}(\mathbf{A})$.

2.3 Post-Measurement Reduced Density Matrices

Our analysis also heavily involves post-measurement reduced density matrices, which we now define, and whose long name originates from quantum physics. As mentioned in Section 1.2, if $\mathbf{A}$ is a matrix that describes the joint probability distribution of many particles (a density matrix), then $\operatorname{tr}_{\mathcal{S}}(\mathbf{A})$ describes the joint probability distribution of the particles that Bob can view. The Post-Measurement Reduced Density Matrix (PMRDM) of $\mathbf{A}$ with respect to a vector $\mathbf{x}$ describes the joint distribution of the particles that Bob can view given that Alice measured her particles in the direction of the vector $\mathbf{x}$.¹ For brevity, we refer to this matrix as the PMRDM.

¹ For the purpose of understanding the paper, there is no particular need to understand this intuition. Those interested in learning more are recommended to read through Chapter 6 of [Aaronson, 2018].

Definition 17.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$, $i\in[k]$, and $\mathbf{x}_{i}\in\mathbb{R}^{d}$. Then the PMRDM of $\mathbf{A}$ after measuring $\mathbf{x}_{i}$ along the $i^{th}$ subsystem is

\[\mathcal{M}_{\{i\}}(\mathbf{A}) := (\mathbf{I}^{\otimes i-1}\otimes\mathbf{x}_{i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{x}_{i}\otimes\mathbf{I}^{\otimes k-i})\in\mathbb{R}^{d^{k-1}\times d^{k-1}}.\]

Further, let $\mathbf{x}_{1},\ldots,\mathbf{x}_{i}\in\mathbb{R}^{d}$. Then, the PMRDM of $\mathbf{A}$ after measuring according to $\mathbf{x}_{1},\ldots,\mathbf{x}_{i}$ in the first $i$ subsystems of $[k]$ is

\begin{align*}
\mathcal{M}_{:i}(\mathbf{A}) &:= \mathcal{M}_{\{1\}}(\mathcal{M}_{\{2\}}(\cdots(\mathcal{M}_{\{i\}}(\mathbf{A}))))\\
&=(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})\in\mathbb{R}^{d^{k-i}\times d^{k-i}},
\end{align*}

where $\mathbf{x}_{:i}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{i}\in\mathbb{R}^{d^{i}}$.
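Here is a hedged numpy sketch of this definition, together with a numerical check of the identity that Lemma 18 below establishes; the helper names are ours.

```python
import numpy as np
from functools import reduce

def pmrdm(A, xs, d, k):
    """M_{:i}(A) = (x_{:i} (x) I^{(x)(k-i)})^T A (x_{:i} (x) I^{(x)(k-i)}),
    where xs = [x_1, ..., x_i] are the measured directions (Definition 17)."""
    i = len(xs)
    x_to_i = reduce(np.kron, xs).reshape(-1, 1)          # x_{:i} as a d^i x 1 column
    S = np.kron(x_to_i, np.eye(d**(k - i)))              # d^k x d^{k-i}
    return S.T @ A @ S

# Check tr(M_{:i}(A)) = x_{:i}^T tr_{i+1:}(A) x_{:i} for d = 2, k = 3, i = 2.
rng = np.random.default_rng(0)
d, k, i = 2, 3, 2
A = rng.standard_normal((d**k, d**k))
xs = [rng.standard_normal(d) for _ in range(i)]
T = A.reshape([d] * (2 * k))
tr_rest = np.trace(T, axis1=2, axis2=5).reshape(d**i, d**i)   # tr_{3:}(A)
x_to_i = reduce(np.kron, xs)
assert np.isclose(np.trace(pmrdm(A, xs, d, k)), x_to_i @ tr_rest @ x_to_i)
```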

Notice that the notation for the PMRDM $\mathcal{M}_{\{i\}}(\mathbf{A})$ does not explicitly show the vector $\mathbf{x}_{i}$ that defines the measurement. This is done for visual clarity, and is safe because the meaning of the vector $\mathbf{x}_{i}$ will always be clear from context.² Further, while we could define the PMRDM with respect to an arbitrary set of subsystems $\mathcal{S}$, as done in Definition 15, our proofs only need to examine the PMRDM when looking at the first $i$ subsystems, so for simplicity we only define the PMRDM for that case.

² Namely, we always take $\mathbf{x}_{i}$ to be the $i^{th}$ term in the expansion of $\mathbf{x}=\otimes_{i=1}^{k}\mathbf{x}_{i}$, where $\mathbf{x}_{i}$ has either iid Gaussian or Rademacher entries.

We need one important property from the PMRDM, which relates it to the partial trace:

Lemma 18.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$, $i\in[k]$, and $\mathbf{x}_{1},\ldots,\mathbf{x}_{i}\in\mathbb{R}^{d}$. Define $\mathbf{x}_{:i}=\mathbf{x}_{1}\otimes\cdots\otimes\mathbf{x}_{i}$. Then, $\operatorname{tr}(\mathcal{M}_{:i}(\mathbf{A}))=\mathbf{x}_{:i}^{\intercal}\operatorname{tr}_{i+1:}(\mathbf{A})\,\mathbf{x}_{:i}$.

Proof.

We directly expand and simplify:

\begin{align*}
\operatorname{tr}(\mathcal{M}_{:i}(\mathbf{A})) &=\operatorname{tr}((\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i}))\\
&=\sum_{j=1}^{d^{k-i}}\mathbf{e}_{j}^{\intercal}(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{x}_{:i}\otimes\mathbf{I}^{\otimes k-i})\mathbf{e}_{j}\\
&=\sum_{j=1}^{d^{k-i}}\mathbf{x}_{:i}^{\intercal}(\mathbf{I}^{\otimes i}\otimes\mathbf{e}_{j})^{\intercal}\mathbf{A}(\mathbf{I}^{\otimes i}\otimes\mathbf{e}_{j})\mathbf{x}_{:i} &&\text{(Lemma 12)}\\
&=\mathbf{x}_{:i}^{\intercal}\operatorname{tr}_{i+1:}(\mathbf{A})\,\mathbf{x}_{:i}.
\end{align*}

2.4 The Partial Transpose Operator

Finally, we formalize the partial transpose operator and characterize a useful property of it. We start by defining the partial transpose of $\mathbf{A}$ with respect to a single subsystem.

Definition 19.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ and $i\in[k]$. Then the partial transpose of $\mathbf{A}$ with respect to the $i^{th}$ subsystem is

\[\mathbf{A}^{\intercal_{\{i\}}} := \sum_{\hat{i},\hat{j}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i}).\]

Equivalently, the partial transpose operator is the unique linear operator which guarantees that for all $\mathbf{B}\in\mathbb{R}^{d^{i-1}\times d^{i-1}}$, $\mathbf{C}\in\mathbb{R}^{d\times d}$, and $\mathbf{D}\in\mathbb{R}^{d^{k-i}\times d^{k-i}}$, we have that $(\mathbf{B}\otimes\mathbf{C}\otimes\mathbf{D})^{\intercal_{\{i\}}}=(\mathbf{B}\otimes\mathbf{C}^{\intercal}\otimes\mathbf{D})$. To verify this equivalence, notice that we can express the transpose of a matrix as

\[\sum_{\hat{i},\hat{j}=1}^{d}\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\mathbf{C}\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}=\sum_{\hat{i},\hat{j}=1}^{d}[\mathbf{C}]_{\hat{j}\hat{i}}\,\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}=\mathbf{C}^{\intercal}.\]

Then, we see that

\begin{align*}
\sum_{\hat{i},\hat{j}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})(\mathbf{B}\otimes\mathbf{C}\otimes\mathbf{D})(\mathbf{I}^{\otimes i-1}\otimes\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i}) &=\sum_{\hat{i},\hat{j}=1}^{d}(\mathbf{B}\otimes[\mathbf{C}^{\intercal}]_{\hat{i}\hat{j}}\,\mathbf{e}_{\hat{i}}\mathbf{e}_{\hat{j}}^{\intercal}\otimes\mathbf{D})\\
&=(\mathbf{B}\otimes\mathbf{C}^{\intercal}\otimes\mathbf{D}).
\end{align*}
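The following hedged numpy sketch implements Definition 19 by swapping a pair of tensor axes (an equivalent, faster route than the double sum), and verifies the characterization above on a Kronecker-structured input.

```python
import numpy as np

def partial_transpose(A, d, k, i):
    """A^{T_{i}}: swap the row and column indices of the i-th subsystem
    (1-indexed), matching Definition 19."""
    T = A.reshape([d] * (2 * k))            # axes ordered (r_1..r_k, c_1..c_k)
    return np.swapaxes(T, i - 1, k + i - 1).reshape(d**k, d**k)

# Check (B (x) C (x) D)^{T_{2}} = B (x) C^T (x) D.
rng = np.random.default_rng(0)
d, k = 2, 3
B, C, D = (rng.standard_normal((d, d)) for _ in range(3))
M = np.kron(np.kron(B, C), D)
assert np.allclose(partial_transpose(M, d, k, 2), np.kron(np.kron(B, C.T), D))
```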

Like the partial trace, we will often be interested in taking the partial transpose of $\mathbf{A}$ with respect to a set of subsystems $\mathcal{V}\subseteq[k]$. We extend the definition of the partial transpose directly to this case:

Definition 20.

Let $\mathbf{A}\in\mathbb{R}^{d^{k}\times d^{k}}$ and $\mathcal{V}\subseteq[k]$. Let $i_{1},\ldots,i_{|\mathcal{V}|}$ be the indices in $\mathcal{V}$. Then, the partial transpose of $\mathbf{A}$ with respect to the subsystem $\mathcal{V}$ is

\[\mathbf{A}^{\intercal_{\mathcal{V}}} := (((\mathbf{A}^{\intercal_{\{i_{1}\}}})^{\intercal_{\{i_{2}\}}})\cdots)^{\intercal_{\{i_{|\mathcal{V}|}\}}}.\]

Notably, the order of the partial transposes does not matter in this case. That is, we have $(\mathbf{A}^{\intercal_{\{i\}}})^{\intercal_{\{j\}}}=(\mathbf{A}^{\intercal_{\{j\}}})^{\intercal_{\{i\}}}$. We next establish two useful properties of the partial transpose:

Lemma 21.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} and i[k]i\in[k]. Then, tr{i}(𝐀{i})=tr{i}(𝐀)\operatorname{tr}_{\{i\}}(\mathbf{A}^{\intercal_{\{i\}}})=\operatorname{tr}_{\{i\}}(\mathbf{A}).

Proof.

We directly compute

tr{i}(𝐀{i})\displaystyle\operatorname{tr}_{\{i\}}(\mathbf{A}^{\intercal_{\{i\}}}) =i^=1d(𝐈i1𝐞i^𝐈ki)𝐀{i}(𝐈i1𝐞i^𝐈ki)\displaystyle=\sum_{\hat{i}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})
=i^,j^,^=1d(𝐈i1𝐞i^𝐞j^𝐞^𝐈ki)𝐀(𝐈i1𝐞j^𝐞^𝐞i^𝐈ki)\displaystyle=\sum_{\hat{i},\hat{j},\hat{\ell}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})
=i^,j^,^=1d𝟙[i^=j^]𝟙[i^=^](𝐈i1𝐞^𝐈ki)𝐀(𝐈i1𝐞j^𝐈ki)\displaystyle=\sum_{\hat{i},\hat{j},\hat{\ell}=1}^{d}\mathbbm{1}_{[\hat{i}=\hat{j}]}\mathbbm{1}_{[\hat{i}=\hat{\ell}]}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{I}^{\otimes k-i})
=i^=1d(𝐈i1𝐞i^𝐈ki)𝐀(𝐈i1𝐞i^𝐈ki)\displaystyle=\sum_{\hat{i}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-i})
=tr{i}(𝐀).\displaystyle=\operatorname{tr}_{\{i\}}(\mathbf{A}).

Lemma 22.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}, i[k]i\in[k], 𝐁di1×di1\mathbf{B}\in{\mathbb{R}}^{d^{i-1}\times d^{i-1}}, and 𝐂dki×dki\mathbf{C}\in{\mathbb{R}}^{d^{k-i}\times d^{k-i}}. Then, for all i^,j^[d]\hat{i},\hat{j}\in[d] we have

(𝐁𝐞j^𝐂)𝐀(𝐁𝐞i^𝐂)=(𝐁𝐞i^𝐂)𝐀{i}(𝐁𝐞j^𝐂)(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})^{\intercal}\mathbf{A}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C})=(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})
Proof.

We expand the right-hand-side:

(𝐁𝐞i^𝐂)𝐀{i}(𝐁𝐞j^𝐂)\displaystyle(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})
=\displaystyle= (𝐁𝐞i^𝐂)(^,m^=1d(𝐈i1𝐞^𝐞m^𝐈ki)𝐀(𝐈i1𝐞^𝐞m^𝐈ki))(𝐁𝐞j^𝐂)\displaystyle(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C})^{\intercal}({\textstyle\sum_{\hat{\ell},\hat{m}=1}^{d}}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}}\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}}\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i}))(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})
=\displaystyle= ^,m^=1d(𝐁𝐞i^𝐞^𝐞m^𝐂)𝐀(𝐁𝐞^𝐞m^𝐞j^𝐂)\displaystyle{\textstyle\sum_{\hat{\ell},\hat{m}=1}^{d}}(\mathbf{B}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{\ell}}\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\otimes\mathbf{C}^{\intercal})\mathbf{A}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}}\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\mathbf{C})
=\displaystyle= (𝐁𝐞j^𝐂)𝐀(𝐁𝐞i^𝐂),\displaystyle(\mathbf{B}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}^{\intercal}\otimes\mathbf{C}^{\intercal})\mathbf{A}(\mathbf{B}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{C}),

where we remove the sum by noting that 𝐞i^𝐞^\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{\ell}} and 𝐞m^𝐞j^\boldsymbol{\mathrm{e}}_{\hat{m}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{j}} are zero unless ^=i^\hat{\ell}=\hat{i} and m^=j^\hat{m}=\hat{j}. ∎
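
The same reshaping viewpoint gives a quick numerical check of Lemmas 21 and 22 (again purely illustrative; the helper functions, indices, and sizes below are our own ad hoc choices).

```python
import numpy as np

def partial_transpose(A, d, k, i):
    # Swap the row and column axes of subsystem i (1-indexed).
    T = A.reshape((d,) * (2 * k))
    return np.swapaxes(T, i - 1, k + i - 1).reshape(d ** k, d ** k)

def partial_trace_one(A, d, k, i):
    # Trace out subsystem i (1-indexed), leaving a d^(k-1) x d^(k-1) matrix.
    T = A.reshape((d,) * (2 * k))
    T = np.trace(T, axis1=i - 1, axis2=k + i - 1)
    return T.reshape(d ** (k - 1), d ** (k - 1))

rng = np.random.default_rng(1)
d, k, i = 2, 3, 2
A = rng.standard_normal((d ** k, d ** k))

# Lemma 21: the partial trace over subsystem i ignores a partial transpose on subsystem i.
assert np.allclose(partial_trace_one(partial_transpose(A, d, k, i), d, k, i),
                   partial_trace_one(A, d, k, i))

# Lemma 22: swapping e_i and e_j across A matches a partial transpose on subsystem i.
B = rng.standard_normal((d ** (i - 1), d ** (i - 1)))
C = rng.standard_normal((d ** (k - i), d ** (k - i)))
e = np.eye(d)
ih, jh = 0, 1
lhs = np.kron(np.kron(B, e[:, [jh]]), C).T @ A @ np.kron(np.kron(B, e[:, [ih]]), C)
rhs = np.kron(np.kron(B, e[:, [ih]]), C).T @ partial_transpose(A, d, k, i) @ np.kron(np.kron(B, e[:, [jh]]), C)
assert np.allclose(lhs, rhs)
```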

We now have enough tools to exactly understand the variance of the Kronecker-Hutchinson Estimator.

3 The Exact Variance of Kronecker-Hutchinson Estimator

We begin by recalling the exact variance of the classical (k=1k=1) Hutchinson’s Estimator:

Lemma 23 ([Girard, 1987, Hutchinson, 1989, Avron and Toledo, 2011, Meyer, 2021]).

Let 𝐀D×D\mathbf{A}\in{\mathbb{R}}^{D\times D} and let 𝐠D\boldsymbol{\mathrm{g}}\in{\mathbb{R}}^{D} be a vector of either iid Rademacher or standard normal Gaussian entries. Let ^𝐀=12(𝐀+𝐀)\hat{}\mathbf{A}=\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}). Then, 𝔼[𝐠𝐀𝐠]=tr(𝐀)\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{g}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{g}}]=\operatorname{tr}(\mathbf{A}) and Var[𝐠𝐀𝐠]2^𝐀F22𝐀F2\operatorname*{Var}[\boldsymbol{\mathrm{g}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{g}}]\leq 2\mathinner{\lVert\hat{}\mathbf{A}\rVert}_{F}^{2}\leq 2\mathinner{\lVert\mathbf{A}\rVert}_{F}^{2}. The first inequality is tight if 𝐠\boldsymbol{\mathrm{g}} is Gaussian, and the second is tight if 𝐀\mathbf{A} is symmetric.

This factor of 2 is important for our analysis, and is the base of the exponent 2k|𝒮|2^{k-|{{\mathcal{S}}}|} in Theorem 1. The key to our proof technique is to use Lemma 12 to expand

𝐱𝐀𝐱\displaystyle\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}} =(𝐱1𝐱k)𝐀(𝐱1𝐱k)\displaystyle=(\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})^{\intercal}\mathbf{A}(\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})
=𝐱k(𝐱1𝐱k1𝐈)𝐀(𝐱1𝐱k1𝐈)𝐱k\displaystyle=\boldsymbol{\mathrm{x}}_{k}^{\intercal}(\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k-1}\otimes\mathbf{I})^{\intercal}\mathbf{A}(\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k-1}\otimes\mathbf{I})\boldsymbol{\mathrm{x}}_{k}
=𝐱k:k1(𝐀)𝐱k\displaystyle=\boldsymbol{\mathrm{x}}_{k}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})\boldsymbol{\mathrm{x}}_{k} (2)

which is then just a classical (k=1k=1) Hutchinson’s estimate of the trace of :k1(𝐀){\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A}). For Gaussian vectors, Lemma 23 then gives an exact understanding of the variance of this estimator. As a warm-up, we first use this proof technique to show that the Kronecker-Hutchinson Estimator is unbiased, though shorter proofs are certainly possible.

Theorem 24.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Let 𝐱1,,𝐱kd\boldsymbol{\mathrm{x}}_{1},\ldots,\boldsymbol{\mathrm{x}}_{k}\in{\mathbb{R}}^{d} be iid Rademacher or standard normal Gaussian vectors, and let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}. Then, 𝔼[𝐱𝐀𝐱]=tr(𝐀)\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname{tr}(\mathbf{A}).

Proof.

For any i[k]i\in[k],

𝔼[𝐱:itri+1:(𝐀)𝐱:i]\displaystyle\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}^{\intercal}\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}] =𝔼[𝐱i:i1(tri+1:(𝐀))𝐱i]\displaystyle=\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{i}^{\intercal}\,{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))\,\boldsymbol{\mathrm{x}}_{i}] (Equation 2)
=𝔼𝐱:i1𝔼𝐱i[𝐱i:i1(tri+1:(𝐀))𝐱i𝐱:i1]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{i}}[\boldsymbol{\mathrm{x}}_{i}^{\intercal}\,{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))\,\boldsymbol{\mathrm{x}}_{i}\mid\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}] (Tower Rule)
=𝔼𝐱:i1[tr(:i1(tri+1:(𝐀)))]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}[\operatorname{tr}({\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})))\,] (Lemma 23)
=𝔼[𝐱:i1tri:(𝐀)𝐱:i1].\displaystyle=\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}^{\intercal}\operatorname{tr}_{i\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}]. (Lemma 18)

Repeating this argument kk times in a row, and letting tr(𝐀):=𝐀\operatorname{tr}_{\emptyset}(\mathbf{A})\;{\vcentcolon=}\;\mathbf{A}, we conclude by Lemma 16 that

𝔼[𝐱𝐀𝐱]=𝔼[𝐱:ktr(𝐀)𝐱:k]==𝔼[𝐱1tr2:(𝐀)𝐱1]=tr(tr2:(𝐀))=tr(𝐀)\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k}^{\intercal}\operatorname{tr}_{\emptyset}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k}]=\ldots=\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}_{1}^{\intercal}\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{1}]=\operatorname{tr}(\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))=\operatorname{tr}(\mathbf{A})

which completes the proof.
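
For intuition, the following Monte Carlo sketch (illustrative only; the oracle is simulated by an explicit matrix, and the sample count is an arbitrary choice) runs the Kronecker-Hutchinson Estimator with Rademacher factors. Each query is a Kronecker-structured vector, so only Kronecker-matrix-vector products are used, and the empirical average concentrates around tr(𝐀), consistent with Theorem 24.

```python
import numpy as np

def kron_hutchinson(matvec, d, k, num_samples, rng):
    """Average of x^T A x over Kronecker-structured Rademacher vectors,
    accessing A only through the (Kronecker-)matrix-vector oracle `matvec`."""
    total = 0.0
    for _ in range(num_samples):
        x = np.ones(1)
        for _ in range(k):
            x = np.kron(x, rng.choice([-1.0, 1.0], size=d))
        total += x @ matvec(x)
    return total / num_samples

rng = np.random.default_rng(2)
d, k = 2, 4
A = rng.standard_normal((d ** k, d ** k))
estimate = kron_hutchinson(lambda v: A @ v, d, k, num_samples=100_000, rng=rng)
print(estimate, np.trace(A))   # the two values agree up to Monte Carlo error
```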

Analyzing the variance of the estimator requires a bit more effort due to the Frobenius norm term and the ^𝐀\hat{}\mathbf{A} matrix that appear in Lemma 23. More concretely, we approach the analysis of the variance of 𝐱𝐀𝐱\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}} via the classical formula

Var[𝐱𝐀𝐱]=𝔼[(𝐱𝐀𝐱)2](𝔼[𝐱𝐀𝐱])2.\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]-(\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}])^{2}.

Since we already know that (𝔼[𝐱𝐀𝐱])2=(tr(𝐀))2(\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}])^{2}=(\operatorname{tr}(\mathbf{A}))^{2}, we turn our attention to evaluating 𝔼[(𝐱𝐀𝐱)2]\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]. Note that rearranging Lemma 23 gives that

𝔼[(𝐠𝐀𝐠)2](tr(𝐀))2+2^𝐀F2,\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{g}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{g}})^{2}]\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\hat{}\mathbf{A}\rVert}_{F}^{2}, (3)

with equality if 𝐠\boldsymbol{\mathrm{g}} is Gaussian. Then, noting that 12(:k1(𝐀)+:k1(𝐀))=:k1(12(𝐀+𝐀))\frac{1}{2}({\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})+{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})^{\intercal})={\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal})), and taking the same approach as in the bias analysis, we see that

𝔼[(𝐱𝐀𝐱)2]\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}] =𝔼[(𝐱k:k1(𝐀)𝐱k)2]\displaystyle=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}_{k}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})\boldsymbol{\mathrm{x}}_{k})^{2}] (Equation 2)
=𝔼𝐱:k1𝔼𝐱k[(𝐱k:k1(𝐀)𝐱k)2𝐱:k1]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{k}}[(\boldsymbol{\mathrm{x}}_{k}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})\boldsymbol{\mathrm{x}}_{k})^{2}\mid\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}] (Tower Rule)
=𝔼𝐱:k1[(tr(:k1(𝐀)))2+2:k1(𝐁)F2]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}}[(\operatorname{tr}({\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{A})))^{2}+2\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{B})\rVert}_{F}^{2}] (Equation 3)
=𝔼𝐱:k1[(𝐱:k1tr{k}(𝐀)𝐱:k1)2]+2𝔼𝐱:k1[:k1(𝐁)F2],\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}}[(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}^{\intercal}\operatorname{tr}_{\{k\}}(\mathbf{A})\,\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1})^{2}]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}k-1}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}k-1}(\mathbf{B})\rVert}_{F}^{2}], (Lemma 18)

where we let 𝐁:=12(𝐀+𝐀)\mathbf{B}\;{\vcentcolon=}\;\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}). The first term admits a recursive relationship just like in the proof of Theorem 24, so we can easily handle that. The second term however requires more effort. The key observation we use is that the squared Frobenius norm of the PMRDM is in fact a sum of the squares of many Kronecker-Hutchinson Estimators:

:i(𝐁)F2\displaystyle\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2} =i^,j^=1dki[:i(𝐁)]i^j^2\displaystyle=\sum_{\hat{i},\hat{j}=1}^{d^{k-i}}[{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})]_{\hat{i}\hat{j}}^{2}
=i^,j^=1dki(𝐞i^:i(𝐁)𝐞j^)2\displaystyle=\sum_{\hat{i},\hat{j}=1}^{d^{k-i}}(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\boldsymbol{\mathrm{e}}_{\hat{j}})^{2}
=i^,j^=1dki(𝐞i^(𝐱:i𝐈ki)𝐁(𝐱:i𝐈ki)𝐞j^)2\displaystyle=\sum_{\hat{i},\hat{j}=1}^{d^{k-i}}(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}\otimes\mathbf{I}^{\otimes k-i})\boldsymbol{\mathrm{e}}_{\hat{j}})^{2}
=i^,j^=1dki(𝐱:i(𝐈i𝐞i^)𝐁(𝐈i𝐞j^)𝐱:i)2\displaystyle=\sum_{\hat{i},\hat{j}=1}^{d^{k-i}}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}^{\intercal}(\mathbf{I}^{\otimes i}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\mathbf{I}^{\otimes i}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i})^{2} (Lemma 12)

In particular, this last line is the sum of squares of many Kronecker-Hutchinson Estimators for a wide variety of matrices. Despite these being trace estimates for different matrices, we can just carefully analyze the resulting terms using the Mixed-Product Property (Lemma 10), Lemma 12, and Equation 3. We get the following recurrence relation:

Lemma 25.

Fix i[k]i\in[k], and let 𝐁dj×dj\mathbf{B}\in{\mathbb{R}}^{d^{j}\times d^{j}} for some jij\geq i. Then,

𝔼𝐱:i[:i(𝐁)F2]𝔼𝐱:i1:i1(tri(𝐁))F2+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2,\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]\leq\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i}(\mathbf{B}))\rVert}_{F}^{2}+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2},

with equality if 𝐱i\boldsymbol{\mathrm{x}}_{i} is a standard normal Gaussian vector.

Proof.

We first extract the vector 𝐱i\boldsymbol{\mathrm{x}}_{i} via the same technique as in Equation 2. Recall that we always take 𝐈=𝐈dd×d\mathbf{I}=\mathbf{I}_{d}\in{\mathbb{R}}^{d\times d} to be the identity in dd dimensions.

𝔼𝐱:i[:i(𝐁)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]
=i^,j^𝔼𝐱:i[(𝐞i^:i(𝐁)𝐞j^)2]\displaystyle=\sum_{\hat{i},\hat{j}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\boldsymbol{\mathrm{e}}_{\hat{j}})^{2}]
=i^,j^𝔼𝐱:i[(𝐞i^(𝐱:i𝐈ji)𝐁(𝐱:i𝐈ji)𝐞j^)2]\displaystyle=\sum_{\hat{i},\hat{j}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}\otimes\mathbf{I}^{\otimes j-i})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}\otimes\mathbf{I}^{\otimes j-i})\boldsymbol{\mathrm{e}}_{\hat{j}})^{2}]
=i^,j^𝔼𝐱:i[((𝐱:i1𝐱i𝐞i^)𝐁(𝐱:i1𝐱i𝐞j^))2]\displaystyle=\sum_{\hat{i},\hat{j}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[((\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\boldsymbol{\mathrm{x}}_{i}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\boldsymbol{\mathrm{x}}_{i}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}))^{2}]
=𝔼𝐱:i1i^,j^𝔼𝐱i[(𝐱i(𝐱:i1𝐈𝐞i^)𝐁(𝐱:i1𝐈𝐞j^)𝐱i)2𝐱:i1]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{i}}[(\boldsymbol{\mathrm{x}}_{i}^{\intercal}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\boldsymbol{\mathrm{x}}_{i})^{2}\mid\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}]

We take a moment to examine the variance of this Hutchinson’s sample. Letting 𝐂:=𝐱:i1𝐈\mathbf{C}\;{\vcentcolon=}\;\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}, we get that the variance of the sample is at most

212((𝐂𝐞i^)𝐁(𝐂𝐞j^)+(𝐂𝐞j^)𝐁(𝐂𝐞i^))F2.2\mathinner{\lVert{\textstyle\frac{1}{2}}((\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})+(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})^{\intercal}\mathbf{B}(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}))\rVert}_{F}^{2}.

By Lemma 22, we know the term on the right is equivalent to (𝐂𝐞i^)𝐁{i}(𝐂𝐞j^)(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}^{\intercal_{\{i\}}}(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}), so that the norm above becomes 2(𝐂𝐞i^)12(𝐁+𝐁{i})(𝐂𝐞j^)F22\mathinner{\lVert(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})(\mathbf{C}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\rVert}_{F}^{2}. Lemma 13 further tells us that the sum of these variances across all i^,j^[dji]\hat{i},\hat{j}\in[d^{j-i}] is 2(𝐂𝐈)12(𝐁+𝐁{i})(𝐂𝐈)F22\mathinner{\lVert(\mathbf{C}\otimes\mathbf{I})^{\intercal}\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})(\mathbf{C}\otimes\mathbf{I})\rVert}_{F}^{2}. Returning to our overall analysis, we get

𝔼𝐱:i[:i(𝐁)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]
𝔼𝐱:i1i^,j^[(tr((𝐱:i1𝐈𝐞i^)𝐁(𝐱:i1𝐈𝐞j^)))2+2(𝐱:i1𝐈𝐞i^)12(𝐁+𝐁{i})(𝐱:i1𝐈𝐞j^)F2]\displaystyle\leq\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\left[(\operatorname{tr}((\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})))^{2}+2\mathinner{\lVert(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}{\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\rVert}_{F}^{2}\right]
=𝔼𝐱:i1i^,j^=1[(tr((𝐱:i1𝐈𝐞i^)𝐁(𝐱:i1𝐈𝐞j^)))2]+2𝔼𝐱:i1(𝐱:i1𝐈2)12(𝐁+𝐁{i})(𝐱:i1𝐈2)F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}=1}\left[(\operatorname{tr}((\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})))^{2}\right]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}^{\otimes 2})^{\intercal}{\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}^{\otimes 2})\rVert}_{F}^{2}
=𝔼𝐱:i1i^,j^[(=1d𝐞(𝐱:i1𝐈𝐞i^)𝐁(𝐱:i1𝐈𝐞j^)𝐞)2]+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\left[\left(\sum_{\ell=1}^{d}\boldsymbol{\mathrm{e}}_{\ell}^{\intercal}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\mathbf{I}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}})\boldsymbol{\mathrm{e}}_{\ell}\right)^{2}\right]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
=𝔼𝐱:i1i^,j^[(𝐞i^=1d(𝐱:i1𝐞𝐈)𝐁(𝐱:i1𝐞𝐈)𝐞j^)2]+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\left[\left(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\sum_{\ell=1}^{d}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\boldsymbol{\mathrm{e}}_{\ell}\otimes\mathbf{I})^{\intercal}\mathbf{B}(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}\otimes\boldsymbol{\mathrm{e}}_{\ell}\otimes\mathbf{I})\boldsymbol{\mathrm{e}}_{\hat{j}}\right)^{2}\right]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
=𝔼𝐱:i1i^,j^[(𝐞i^:i1(tr{i}(𝐁))𝐞j^)2]+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\sum_{\hat{i},\hat{j}}\left[\left(\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{\{i\}}(\mathbf{B}))\boldsymbol{\mathrm{e}}_{\hat{j}}\right)^{2}\right]+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
=𝔼𝐱:i1:i1(tri(𝐁))F2+2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i}(\mathbf{B}))\rVert}_{F}^{2}+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}

Note that the only inequality used is the application of Equation 3, which is an equality if 𝐱i\boldsymbol{\mathrm{x}}_{i} is a standard normal Gaussian vector. ∎
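
As an illustrative check of the base case of this recurrence (i = 1, j = k = 2, Gaussian 𝐱_1, where the bound is an equality), the following sketch compares a Monte Carlo estimate of the left-hand side against the closed-form right-hand side; the dimensions, seed, and sample count are arbitrary choices.

```python
import numpy as np

d = 2
rng = np.random.default_rng(3)
B = rng.standard_normal((d ** 2, d ** 2))
Bt = B.reshape(d, d, d, d)          # axes: [row_1, row_2, col_1, col_2]

# Monte Carlo estimate of E || M_{:1}(B) ||_F^2 over a Gaussian vector x_1.
n = 500_000
X = rng.standard_normal((n, d))
M = np.einsum('na,apbq,nb->npq', X, Bt, X)      # one M_{:1}(B) per sample of x_1
lhs = (M ** 2).sum(axis=(1, 2)).mean()

# Closed-form right-hand side: ||tr_{1}(B)||_F^2 + 2 ||(B + B^{T_{1}}) / 2||_F^2.
tr1_B = np.trace(Bt, axis1=0, axis2=2)                      # tr_{1}(B)
B_pt1 = np.swapaxes(Bt, 0, 2).reshape(d ** 2, d ** 2)       # B^{T_{1}}
rhs = np.linalg.norm(tr1_B) ** 2 + 2 * np.linalg.norm((B + B_pt1) / 2) ** 2
print(lhs, rhs)     # equal up to Monte Carlo error
```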

Notice that this lemma takes in an arbitrary matrix 𝐁\mathbf{B}, instead of the specific choice of 𝐁=12(𝐀+𝐀)\mathbf{B}=\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}) from earlier. This is useful, as we apply Lemma 25 repeatedly in the following proof:

Lemma 26.

Fix i[k]i\in[k], and let 𝐁dj×dj\mathbf{B}\in{\mathbb{R}}^{d^{j}\times d^{j}} for some jij\geq i. Let ¯𝐁=12i𝒱[i]𝐁𝒱\bar{}\mathbf{B}=\frac{1}{2^{i}}\sum_{{\mathcal{V}}\subseteq[i]}\mathbf{B}^{\intercal_{{\mathcal{V}}}}. Then,

𝔼𝐱:i[:i(𝐁)F2]𝒮[i]2i|𝒮|tr𝒮(¯𝐁)F2,\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]\leq\sum_{{\mathcal{S}}\subseteq[i]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\bar{}\mathbf{B})\rVert}_{F}^{2},

with equality if 𝐱1,,𝐱i\boldsymbol{\mathrm{x}}_{1},\ldots,\boldsymbol{\mathrm{x}}_{i} are iid standard normal Gaussian vectors.

Proof.

We prove this by induction over i=1,,ji=1,\ldots,j. For notational clarity in this induction, we let fi(𝐁):=12i𝒱[i]𝐁𝒱f_{i}(\mathbf{B})\;{\vcentcolon=}\;\frac{1}{2^{i}}\sum_{{\mathcal{V}}\subseteq[i]}\mathbf{B}^{\intercal_{{\mathcal{V}}}}. Then, our goal is to prove that 𝔼𝐱:i[:i(𝐁)F2]𝒮[i]2i|𝒮|tr𝒮(fi(𝐁))F2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}]\leq\sum_{{\mathcal{S}}\subseteq[i]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} for all ii and 𝐁\mathbf{B}. We start with the base case i=1i=1. First note that tr{1}(𝐁)=tr{1}(12(𝐁+𝐁{1}))=tr{1}(f1(𝐁))\operatorname{tr}_{\{1\}}(\mathbf{B})=\operatorname{tr}_{\{1\}}(\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{1\}}}))=\operatorname{tr}_{\{1\}}(f_{1}(\mathbf{B})) by Lemma 21. Then Lemma 25 gives us that

𝔼𝐱1[{1}(𝐁)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{1}}[\mathinner{\lVert{\mathcal{M}}_{\{1\}}(\mathbf{B})\rVert}_{F}^{2}] tr{1}(𝐁)F2+212(𝐁+𝐁{1})F2\displaystyle\leq\mathinner{\lVert\operatorname{tr}_{\{1\}}(\mathbf{B})\rVert}_{F}^{2}+2\mathinner{\lVert{\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{1\}}})\rVert}_{F}^{2}
=211tr{1}(f1(𝐁))F2+210f1(𝐁)F2\displaystyle=2^{1-1}\mathinner{\lVert\operatorname{tr}_{\{1\}}(f_{1}(\mathbf{B}))\rVert}_{F}^{2}+2^{1-0}\mathinner{\lVert f_{1}(\mathbf{B})\rVert}_{F}^{2}
=𝒮[1]21|𝒮|tr𝒮(f1(𝐁))F2.\displaystyle=\sum_{{\mathcal{S}}\subseteq[1]}2^{1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(f_{1}(\mathbf{B}))\rVert}_{F}^{2}.

Next, for the induction case, we assume the lemma holds for i1i-1. Since fi(𝐁)f_{i}(\mathbf{B}) is the average of all partial transposes of 𝐁\mathbf{B} over subsets of the first ii subsystems, we can expand fi(𝐁)=fi1(12(𝐁+𝐁{i}))f_{i}(\mathbf{B})=f_{i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})). In particular, by our induction hypothesis we have that

𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2𝒮[i1]2i1|𝒮|tr𝒮(fi1(12(𝐁+𝐁{i})))F2=𝒮[i1]2i1|𝒮|tr𝒮(fi(𝐁))F2.\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}\leq\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})))\rVert}_{F}^{2}=\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2}.

Therefore, by Lemma 25, by our inductive hypothesis, and by the composition of partial traces, we have that

𝔼𝐱:i[:i(𝐁)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\mathbf{B})\rVert}_{F}^{2}] 𝔼𝐱:i1:i1(tr{i}(𝐁))F2\displaystyle\leq\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{\{i\}}(\mathbf{B}))\rVert}_{F}^{2} +2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
=𝔼𝐱:i1:i1(tr{i}(12(𝐁+𝐁{i})))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{\{i\}}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})))\rVert}_{F}^{2} +2𝔼𝐱:i1:i1(12(𝐁+𝐁{i}))F2\displaystyle+2\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}({\textstyle\frac{1}{2}}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}))\rVert}_{F}^{2}
𝒮[i1]2i1|𝒮|tr𝒮(tr{i}(fi(𝐁)))F2\displaystyle\leq\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\operatorname{tr}_{\{i\}}(f_{i}(\mathbf{B})))\rVert}_{F}^{2} +2𝒮[i1]2i1|𝒮|tr𝒮(fi(𝐁))F2\displaystyle+2\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2}
=𝒮[i1]2i1|𝒮|tr𝒮{i}(fi(𝐁))F2\displaystyle=\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i\}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} +2𝒮[i1]2i1|𝒮|tr𝒮(fi(𝐁))F2\displaystyle+2\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2}
=𝒮[i1]2i(|𝒮|+1)tr𝒮{i}(fi(𝐁))F2\displaystyle=\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-(|{{\mathcal{S}}}|+1)}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i\}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} +𝒮[i1]2i|𝒮|tr𝒮(fi(𝐁))F2\displaystyle+\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} (4)
=𝒮[i]2i|𝒮|tr𝒮(fi(𝐁))F2\displaystyle=\sum_{{\mathcal{S}}^{\prime}\subseteq[i]}2^{i-|{{\mathcal{S}}^{\prime}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}^{\prime}}(f_{i}(\mathbf{B}))\rVert}_{F}^{2} (5)

The second inequality applies the induction hypothesis to the matrix tr{i}(12(𝐁+𝐁{i}))\operatorname{tr}_{\{i\}}(\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}})) on the left term, and applies the induction hypothesis to the matrix 12(𝐁+𝐁{i})\frac{1}{2}(\mathbf{B}+\mathbf{B}^{\intercal_{\{i\}}}) on the right term.

Further, Equation 5 holds by comparing the first and second terms in Equation 4, where the left sum covers all 𝒮[i]{\mathcal{S}}^{\prime}\subseteq[i] such that i𝒮i\in{\mathcal{S}}^{\prime}, and where the right sum covers all 𝒮[i]{\mathcal{S}}^{\prime}\subseteq[i] such that i𝒮i\notin{\mathcal{S}}^{\prime}. Note that the only inequalities used are Equation 3 and Lemma 25, as well as the induction hypothesis, which are all equalities when considering iid standard normal Gaussian vectors. ∎

We now have all the tools needed to prove the bound on the variance of the Kronecker-Hutchinson Estimator.

Theorem 1 Restated.

Fix d,kd,k\in{\mathbb{N}}, and let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}, where each 𝐱i\boldsymbol{\mathrm{x}}_{i} is independent and is either a Rademacher vector or a 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) Gaussian vector. Let ¯𝐀:=12k𝒱[k]𝐀𝒱\bar{}\mathbf{A}\;{\vcentcolon=}\;\frac{1}{2^{k}}\sum_{{\mathcal{V}}\subseteq[k]}\mathbf{A}^{\intercal_{{\mathcal{V}}}} be the average of all partial transposes of 𝐀\mathbf{A}. Then, Var[𝐱𝐀𝐱]𝒮[k]2k|𝒮|tr𝒮(¯𝐀)F2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq\sum_{{\mathcal{S}}\subset[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2}. Furthermore, if all 𝐱i\boldsymbol{\mathrm{x}}_{i} are 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}), then this expression is an equality.

Proof.

We start again with the observation in Equation 2, which we use to prove the following claim by induction over i=1,,ki=1,\ldots,k:

𝔼𝐱:i[(𝐱:itri+1:(𝐀)𝐱:i)2](tr(𝐀))2+2tr2:(12(𝐀+𝐀))F2+2j=1i1𝔼𝐱:j:j(trj+2:(12(𝐀+𝐀)))F2,\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}^{\intercal}\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i})^{2}]\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}+2\sum_{j=1}^{i-1}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}j}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}j}(\operatorname{tr}_{j+2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}, (6)

where we interpret trk+1:(𝐀):=𝐀\operatorname{tr}_{k+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\;{\vcentcolon=}\;\mathbf{A}. The base case for Equation 6 is when i=1i=1, where we immediately have

𝔼𝐱1[(𝐱1tr2:(𝐀)𝐱1)2](tr(tr2:(𝐀)))2+2tr2:(12(𝐀+𝐀))F2=(tr(𝐀))2+2tr2:(12(𝐀+𝐀))F2\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{1}}[(\boldsymbol{\mathrm{x}}_{1}^{\intercal}\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\boldsymbol{\mathrm{x}}_{1})^{2}]\leq(\operatorname{tr}(\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}=(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}

which completes the base case. Now we assume Equation 6 holds for i1i-1, and we show that it holds for ii:

𝔼𝐱:i[(𝐱:itri+1:(𝐀)𝐱:i)2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}[(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}^{\intercal}\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i})^{2}]
=𝔼𝐱:i1𝔼𝐱i[(𝐱i:i1(tri+1:(𝐀))𝐱i)2𝐱:i1]\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{i}}[(\boldsymbol{\mathrm{x}}_{i}^{\intercal}{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))\boldsymbol{\mathrm{x}}_{i})^{2}\mid\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}]
𝔼𝐱:i1[(tr(:i1(tri+1:(𝐀))))2+2:i1(tri+1:(12(𝐀+𝐀)))F2]\displaystyle\leq\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}\left[(\operatorname{tr}({\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A}))))^{2}+2\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}\right] (Lemma 23)
=𝔼𝐱:i1[(𝐱:i1tri:(𝐀)𝐱:i1)2]+𝔼𝐱:i12:i1(tri+1:(12(𝐀+𝐀)))F2\displaystyle=\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}[(\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}^{\intercal}\operatorname{tr}_{i\mathrel{\mathop{\ordinarycolon}}}(\mathbf{A})\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1})^{2}]+\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i-1}}2\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i-1}(\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}
(tr(𝐀))2+2tr2:(12(𝐀+𝐀))F2+2j=1i1𝔼𝐱:j:j(trj+2:(12(𝐀+𝐀)))F2\displaystyle\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}+2\sum_{j=1}^{i-1}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}j}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}j}(\operatorname{tr}_{j+2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}

where the last line applies the induction hypothesis. This completes the proof by induction of Equation 6. We can then appeal to Lemma 26 and merge terms together. First, let 𝐂i:=12i𝒱[i](12(𝐀+𝐀))𝒱\mathbf{C}_{i}\;{\vcentcolon=}\;\frac{1}{2^{i}}\sum_{{\mathcal{V}}\subseteq[i]}(\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}))^{\intercal_{{\mathcal{V}}}}, and note that 𝐂0=12(𝐀+𝐀)\mathbf{C}_{0}={\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}). Then,

𝔼[(𝐱𝐀𝐱)2]\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}] (tr(𝐀))2+2tr2:(12(𝐀+𝐀))F2+2i=1k1𝔼𝐱:i:i(tri+2:(12(𝐀+𝐀)))F2\displaystyle\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))\rVert}_{F}^{2}+2\sum_{i=1}^{k-1}\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{x}}_{\mathrel{\mathop{\ordinarycolon}}i}}\mathinner{\lVert{\mathcal{M}}_{\mathrel{\mathop{\ordinarycolon}}i}(\operatorname{tr}_{i+2\mathrel{\mathop{\ordinarycolon}}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal})))\rVert}_{F}^{2}
(tr(𝐀))2+2tr2:(𝐂0)F2+2i=1k1𝒮[i]2i|𝒮|tr𝒮{i+2,,k}(𝐂i)F2\displaystyle\leq(\operatorname{tr}(\mathbf{A}))^{2}+2\mathinner{\lVert\operatorname{tr}_{2\mathrel{\mathop{\ordinarycolon}}}(\mathbf{C}_{0})\rVert}_{F}^{2}+2\sum_{i=1}^{k-1}\sum_{{\mathcal{S}}\subseteq[i]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+2,\ldots,k\}}(\mathbf{C}_{i})\rVert}_{F}^{2}
=(tr(𝐀))2+i=0k1𝒮[i]2i+1|𝒮|tr𝒮{i+2,,k}(𝐂i)F2\displaystyle=(\operatorname{tr}(\mathbf{A}))^{2}+\sum_{i=0}^{k-1}\sum_{{\mathcal{S}}\subseteq[i]}2^{i+1-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+2,\ldots,k\}}(\mathbf{C}_{i})\rVert}_{F}^{2}
=(tr(𝐀))2+i=1k𝒮[i1]2i|𝒮|tr𝒮{i+1,,k}(𝐂i1)F2\displaystyle=(\operatorname{tr}(\mathbf{A}))^{2}+\sum_{i=1}^{k}\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\mathbf{C}_{i-1})\rVert}_{F}^{2}

We proceed by simplifying the partial trace tr𝒮{i+1,,k}(𝐂i1)\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\mathbf{C}_{i-1}). First, note that 𝐂i1\mathbf{C}_{i-1} is the average of all partial transposes of 12(𝐀+𝐀)\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}) in the first i1i-1 subsystems. By Lemma 21, we can extend this to be the average of all partial transposes of 12(𝐀+𝐀)\frac{1}{2}(\mathbf{A}+\mathbf{A}^{\intercal}) across all subsets 𝒱[i1]{i+1,,k}{\mathcal{V}}\subseteq[i-1]\cup\{i+1,\ldots,k\}, so that

tri+1:(𝐂i1)\displaystyle\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{C}_{i-1}) =tri+1:(12k1𝒱[k]{i}(12(𝐀+𝐀))𝒱)\displaystyle=\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}\left(\frac{1}{2^{k-1}}\sum_{{\mathcal{V}}\subseteq[k]\setminus\{i\}}({\textstyle\frac{1}{2}}(\mathbf{A}+\mathbf{A}^{\intercal}))^{\intercal_{{\mathcal{V}}}}\right)
=tri+1:(12k1𝒱[k]{i}𝐀𝒱+12k1𝒱[k]{i}(𝐀)𝒱2).\displaystyle=\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}\left(\frac{\frac{1}{2^{k-1}}\sum_{{\mathcal{V}}\subseteq[k]\setminus\{i\}}\mathbf{A}^{\intercal_{{\mathcal{V}}}}+\frac{1}{2^{k-1}}\sum_{{\mathcal{V}}\subseteq[k]\setminus\{i\}}(\mathbf{A}^{\intercal})^{\intercal_{{\mathcal{V}}}}}{2}\right).

The left term above is the average of all partial transposes of 𝐀\mathbf{A} that do not include subsystem ii. Since (𝐀)𝒱=𝐀¯𝒱(\mathbf{A}^{\intercal})^{\intercal_{{\mathcal{V}}}}=\mathbf{A}^{\intercal_{\bar{}{\mathcal{V}}}} where ¯𝒱:=[k]𝒱\bar{}{\mathcal{V}}\;{\vcentcolon=}\;[k]\setminus{\mathcal{V}}, the right term above is the average of all partial transposes of 𝐀\mathbf{A} that do include ii. So, we find that tri+1:(𝐂i1)=tri+1:(12k𝒱[k]𝐀𝒱)=tri+1:(¯𝐀)\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\mathbf{C}_{i-1})=\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\frac{1}{2^{k}}\sum_{{\mathcal{V}}\subseteq[k]}\mathbf{A}^{\intercal_{{\mathcal{V}}}})=\operatorname{tr}_{i+1\mathrel{\mathop{\ordinarycolon}}}(\bar{}\mathbf{A}). We are therefore left with the expression

𝔼[(𝐱𝐀𝐱)2](tr(𝐀))2+i=1k𝒮[i1]2i|𝒮|tr𝒮{i+1,,k}(¯𝐀)F2.\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]\leq(\operatorname{tr}(\mathbf{A}))^{2}+\sum_{i=1}^{k}\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\bar{}\mathbf{A})\rVert}_{F}^{2}. (7)

We now focus on the double summation in Equation 7. We start by understanding the inner summation in terms of the index ii from the outer summation. Let

F(i):=𝒮[i1]2i|𝒮|tr𝒮{i+1,,k}(¯𝐀)F2F(i)\;{\vcentcolon=}\;\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\bar{}\mathbf{A})\rVert}_{F}^{2}

so that

i=1k𝒮[i1]2i|𝒮|tr𝒮{i+1,,k}(¯𝐀)F2=i=1kF(i).\sum_{i=1}^{k}\sum_{{\mathcal{S}}\subseteq[i-1]}2^{i-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}\cup\{i+1,\ldots,k\}}(\bar{}\mathbf{A})\rVert}_{F}^{2}=\sum_{i=1}^{k}F(i).

We see that F(i)F(i) is a weighted sum of tr^𝒮(¯𝐀)F2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2} across all ^𝒮[k]\hat{}{\mathcal{S}}\subseteq[k] of the form ^𝒮=𝒮{i+1,,k}\hat{}{\mathcal{S}}={\mathcal{S}}\cup\{i+1,\ldots,k\} where 𝒮[i1]{\mathcal{S}}\subseteq[i-1]. That is, we consider all ^𝒮\hat{}{\mathcal{S}} that satisfy two properties:

  1. 1.

    Indices i+1,,ki+1,\ldots,k are all included: {i+1,,k}^𝒮\{i+1,\ldots,k\}\subseteq\hat{}{\mathcal{S}}

  2. 2.

    Index ii is not included: i^𝒮i\notin\hat{}{\mathcal{S}}

We can intuitively think of ii as a “Gap” that precedes the “Suffix” {i+1,,k}\{i+1,\ldots,k\}. Crucially, if tr^𝒮(¯𝐀)F2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2} appears in a term of F(i)F(i), then tr^𝒮(¯𝐀)F2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2} does not appear in a term of any other F(j)F(j). This is because each ^𝒮\hat{}{\mathcal{S}} has a unique gap: for any ^𝒮\hat{}{\mathcal{S}}, let i[k]i\in[k] be the smallest value such that {i+1,,k}^𝒮\{i+1,\ldots,k\}\subseteq\hat{}{\mathcal{S}}. Then ii is the gap of ^𝒮\hat{}{\mathcal{S}}. We note an edge-case, where if ^𝒮={1,,k}\hat{}{\mathcal{S}}=\{1,\ldots,k\} then the gap is i=0i=0 and we recover the term tr^𝒮(¯𝐀)F2=(tr(¯𝐀))2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2}=(\operatorname{tr}(\bar{}\mathbf{A}))^{2} in Equation 7.

Next, note that every ^𝒮\hat{}{\mathcal{S}} is scaled by 2i|𝒮|2^{i-|{{\mathcal{S}}}|}. Since |^𝒮|=|𝒮{i+1,,k}|=|𝒮|+ki|\hat{}{\mathcal{S}}|=|{{\mathcal{S}}\cup\{i+1,\ldots,k\}}|=|{{\mathcal{S}}}|+k-i, we know |𝒮|i=|^𝒮|k|{{\mathcal{S}}}|-i=|\hat{}{\mathcal{S}}|-k, so we can write 2i|𝒮|=2k|^𝒮|2^{i-|{{\mathcal{S}}}|}=2^{k-|\hat{}{\mathcal{S}}|}. So, we have shown that every ^𝒮[k]\hat{}{\mathcal{S}}\subseteq[k] appears exactly once in Equation 7, and each term tr^𝒮(¯𝐀)F2\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2} is scaled by 2k|^𝒮|2^{k-|\hat{}{\mathcal{S}}|}. That is,

𝔼[(𝐱𝐀𝐱)2]\displaystyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}] ^𝒮[k]2k|^𝒮|tr^𝒮(¯𝐀)F2,\displaystyle\leq\sum_{\hat{}{\mathcal{S}}\subseteq[k]}2^{k-|\hat{}{\mathcal{S}}|}\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2},

which completes the proof since

Var[𝐱𝐀𝐱]=𝔼[(𝐱𝐀𝐱)2](𝔼[𝐱𝐀𝐱])2^𝒮[k]2k|^𝒮|tr^𝒮(¯𝐀)F2.\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]-(\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}])^{2}\leq\sum_{\hat{}{\mathcal{S}}\subset[k]}2^{k-|\hat{}{\mathcal{S}}|}\mathinner{\lVert\operatorname{tr}_{\hat{}{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2}.

Note that all inequalities in the proof become equalities if we know that 𝐱1,,𝐱k\boldsymbol{\mathrm{x}}_{1},\ldots,\boldsymbol{\mathrm{x}}_{k} are all iid standard normal Gaussians. ∎
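
As a numerical sanity check of Theorem 1 (our own illustration, not part of the formal argument), the following sketch compares a Monte Carlo estimate of Var[𝐱^⊺𝐀𝐱] for Gaussian Kronecker vectors against the closed-form expression, for a small random 𝐀 with d = k = 2; the helper functions implement the partial trace and partial transpose by tensor reshaping, and all sizes and seeds are arbitrary choices.

```python
import itertools
import numpy as np

d, k = 2, 2
rng = np.random.default_rng(4)
A = rng.standard_normal((d ** k, d ** k))

def partial_transpose(M, V):
    T = M.reshape((d,) * (2 * k))
    for i in V:
        T = np.swapaxes(T, i - 1, k + i - 1)
    return T.reshape(d ** k, d ** k)

def partial_trace(M, S):
    T = M.reshape((d,) * (2 * k))
    m = k
    for i in sorted(S, reverse=True):       # trace out the largest subsystem first
        T = np.trace(T, axis1=i - 1, axis2=m + i - 1)
        m -= 1
    return np.asarray(T).reshape(d ** m, d ** m)

subsets = [set(S) for r in range(k + 1)
           for S in itertools.combinations(range(1, k + 1), r)]

# Average of all partial transposes of A.
A_bar = sum(partial_transpose(A, V) for V in subsets) / 2 ** k

# Right-hand side of Theorem 1: sum over proper subsets of [k].
rhs = sum(2 ** (k - len(S)) * np.linalg.norm(partial_trace(A_bar, S)) ** 2
          for S in subsets if len(S) < k)

# Monte Carlo estimate of Var[x^T A x] for x = x_1 (x) x_2 with Gaussian factors.
n = 1_000_000
X1 = rng.standard_normal((n, d))
X2 = rng.standard_normal((n, d))
X = np.einsum('ni,nj->nij', X1, X2).reshape(n, d ** k)
quad = np.einsum('ni,ij,nj->n', X, A, X)
print(quad.var(), rhs)      # equal up to Monte Carlo error
```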

3.1 Proof of Theorem 2

The variance expression in Theorem 1 is extremely tight, being an equality in the case that our query vectors are Kronecker products of Gaussian vectors. However, it is not immediately clear how we should interpret this expression. We now prove Theorem 2, which gives a more approachable upper bound than the exact expression in Theorem 1. We start with a short lemma:

Lemma 27.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} and 𝒱[k]{\mathcal{V}}\subseteq[k]. Then, 𝐀𝒱F=𝐀F\mathinner{\lVert\mathbf{A}^{\intercal_{{\mathcal{V}}}}\rVert}_{F}=\mathinner{\lVert\mathbf{A}\rVert}_{F}.

Proof.

Consider any i𝒱i\in{\mathcal{V}}. Then, recall that each standard basis vector 𝐞i^dk\boldsymbol{\mathrm{e}}_{\hat{i}}\in{\mathbb{R}}^{d^{k}} can be decomposed as 𝐞i^=𝐞j^𝐞k^𝐞^\boldsymbol{\mathrm{e}}_{\hat{i}}=\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}} where 𝐞j^di1\boldsymbol{\mathrm{e}}_{\hat{j}}\in{\mathbb{R}}^{d^{i-1}}, 𝐞k^d\boldsymbol{\mathrm{e}}_{\hat{k}}\in{\mathbb{R}}^{d}, and 𝐞^dki\boldsymbol{\mathrm{e}}_{\hat{\ell}}\in{\mathbb{R}}^{d^{k-i}}. Then, we can expand the norm of 𝐀{i}\mathbf{A}^{\intercal_{\{i\}}} as

𝐀{i}F2\displaystyle\mathinner{\lVert\mathbf{A}^{\intercal_{\{i\}}}\rVert}_{F}^{2} =i^1,i^2=1dk(𝐞i^1𝐀{i}𝐞i^2)2\displaystyle=\sum_{\hat{i}_{1},\hat{i}_{2}=1}^{d^{k}}(\boldsymbol{\mathrm{e}}_{\hat{i}_{1}}^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}\boldsymbol{\mathrm{e}}_{\hat{i}_{2}})^{2}
=j^1,j^2=1di1k^1,k^2=1d^1,^2=1dki((𝐞j^1𝐞k^1𝐞^1)𝐀{i}(𝐞j^2𝐞k^2𝐞^2))2\displaystyle=\sum_{\hat{j}_{1},\hat{j}_{2}=1}^{d^{i-1}}\sum_{\hat{k}_{1},\hat{k}_{2}=1}^{d}\sum_{\hat{\ell}_{1},\hat{\ell}_{2}=1}^{d^{k-i}}((\boldsymbol{\mathrm{e}}_{\hat{j}_{1}}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{1}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{1}})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\boldsymbol{\mathrm{e}}_{\hat{j}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{2}}))^{2}
=j^1,j^2=1di1k^1,k^2=1d^1,^2=1dki(m^,n^=1d(𝐞j^1𝐞k^1𝐞m^𝐞n^𝐞^1)𝐀(𝐞j^2𝐞m^𝐞n^𝐞k^2𝐞^2))2\displaystyle=\sum_{\hat{j}_{1},\hat{j}_{2}=1}^{d^{i-1}}\sum_{\hat{k}_{1},\hat{k}_{2}=1}^{d}\sum_{\hat{\ell}_{1},\hat{\ell}_{2}=1}^{d^{k-i}}\left(\sum_{\hat{m},\hat{n}=1}^{d}(\boldsymbol{\mathrm{e}}_{\hat{j}_{1}}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{1}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{m}}\boldsymbol{\mathrm{e}}_{\hat{n}}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{1}}^{\intercal})\mathbf{A}(\boldsymbol{\mathrm{e}}_{\hat{j}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{m}}\boldsymbol{\mathrm{e}}_{\hat{n}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{k}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{2}})\right)^{2}
=j^1,j^2=1di1k^1,k^2=1d^1,^2=1dki((𝐞j^1𝐞k^2𝐞^1)𝐀(𝐞j^2𝐞k^1𝐞^2))2\displaystyle=\sum_{\hat{j}_{1},\hat{j}_{2}=1}^{d^{i-1}}\sum_{\hat{k}_{1},\hat{k}_{2}=1}^{d}\sum_{\hat{\ell}_{1},\hat{\ell}_{2}=1}^{d^{k-i}}((\boldsymbol{\mathrm{e}}_{\hat{j}_{1}}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{2}}^{\intercal}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{1}}^{\intercal})\mathbf{A}(\boldsymbol{\mathrm{e}}_{\hat{j}_{2}}\otimes\boldsymbol{\mathrm{e}}_{\hat{k}_{1}}\otimes\boldsymbol{\mathrm{e}}_{\hat{\ell}_{2}}))^{2}
=i^1,i^2=1dk(𝐞i^1𝐀𝐞i^2)2\displaystyle=\sum_{\hat{i}_{1},\hat{i}_{2}=1}^{d^{k}}(\boldsymbol{\mathrm{e}}_{\hat{i}_{1}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{e}}_{\hat{i}_{2}})^{2}
=𝐀F2,\displaystyle=\mathinner{\lVert\mathbf{A}\rVert}_{F}^{2},

where we remove the sum over m^\hat{m} and n^\hat{n} by noting that 𝐞k^1𝐞m^\boldsymbol{\mathrm{e}}_{\hat{k}_{1}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{m}} and 𝐞n^𝐞k^2\boldsymbol{\mathrm{e}}_{\hat{n}}^{\intercal}\boldsymbol{\mathrm{e}}_{\hat{k}_{2}} are zero unless m^=k^1\hat{m}=\hat{k}_{1} and n^=k^2\hat{n}=\hat{k}_{2}. We then complete the lemma by decomposing 𝐀𝒱=(((𝐀{i1}){i2})){i|𝒱|}\mathbf{A}^{\intercal_{{\mathcal{V}}}}=(((\mathbf{A}^{\intercal_{\{i_{1}\}}})^{\intercal_{\{i_{2}\}}})\cdots)^{\intercal_{\{i_{|{{\mathcal{V}}}|}\}}}, where 𝒱={i1,,i|𝒱|}{\mathcal{V}}=\{i_{1},\ldots,i_{|{{\mathcal{V}}}|}\}. ∎
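
A quick numerical illustration of Lemma 27 (ours, with arbitrary sizes and seed): a partial transpose only permutes the entries of 𝐀, so the Frobenius norm is unchanged.

```python
import numpy as np

d, k = 2, 3
rng = np.random.default_rng(5)
A = rng.standard_normal((d ** k, d ** k))

def partial_transpose(M, V):
    T = M.reshape((d,) * (2 * k))
    for i in V:
        T = np.swapaxes(T, i - 1, k + i - 1)
    return T.reshape(d ** k, d ** k)

# The Frobenius norm is invariant under any partial transpose.
assert np.isclose(np.linalg.norm(partial_transpose(A, {1, 3})), np.linalg.norm(A))
```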

We now prove the theorem:

Theorem 2 Restated.

Fix d,kd,k\in{\mathbb{N}}, and let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}, where each 𝐱i\boldsymbol{\mathrm{x}}_{i} is independent and is either a Rademacher vector or a 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) Gaussian vector. Then, Var[𝐱𝐀𝐱]𝒮[k]2k|𝒮|tr𝒮(𝐀)F2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq\sum_{{\mathcal{S}}\subset[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\mathbf{A})\rVert}_{F}^{2}. Furthermore, if all 𝐱i\boldsymbol{\mathrm{x}}_{i} are 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) and if 𝐀=¯𝐀\mathbf{A}=\bar{}\mathbf{A}, then this expression is an equality.

Proof.

We prove that tr𝒮(¯𝐀)Ftr𝒮(𝐀)F\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\bar{}\mathbf{A})\rVert}_{F}\leq\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}, which implies the theorem. First, recall that the partial trace is linear, so that by the triangle inequality we can write

tr𝒮(¯𝐀)F=12k𝒱[k]tr𝒮(𝐀𝒱)F12k𝒱[k]tr𝒮(𝐀𝒱)F.\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\bar{}\mathbf{A})\rVert}_{F}=\mathinner{\lVert{\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})\rVert}_{F}\leq{\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})\rVert}_{F}.

Next, we examine the norm tr𝒮(𝐀𝒱)F\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})\rVert}_{F}. First, by Lemma 21, we can assume without loss of generality that 𝒱𝒮={\mathcal{V}}\cap{\mathcal{S}}=\emptyset. Then, for notational simplicity, suppose that 𝒱={1,,|𝒱|}{\mathcal{V}}=\{1,\ldots,|{{\mathcal{V}}}|\}. We can then write tr𝒮(𝐀𝒱)=(tr𝒮(𝐀))𝒱\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})=(\operatorname{tr}_{\mathcal{S}}(\mathbf{A}))^{\intercal_{{\mathcal{V}}}}. To see this, fix any i𝒱i\in{\mathcal{V}} and j𝒮j\in{\mathcal{S}}. By our without-loss-of-generality assumptions, we know that i<ji<j, and can therefore expand

tr{j}(𝐀{i})\displaystyle\operatorname{tr}_{\{j\}}(\mathbf{A}^{\intercal_{\{i\}}}) =i^=1d(𝐈j1𝐞i^𝐈kj)𝐀{i}(𝐈j1𝐞i^𝐈kj)\displaystyle=\sum_{\hat{i}=1}^{d}(\mathbf{I}^{\otimes j-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-j})^{\intercal}\mathbf{A}^{\intercal_{\{i\}}}(\mathbf{I}^{\otimes j-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-j})
=i^,j^,^=1d(𝐈i1𝐞j^𝐞^𝐈ji1𝐞i^𝐈kj)𝐀(𝐈i1𝐞j^𝐞^𝐈ji1𝐞i^𝐈kj)\displaystyle=\sum_{\hat{i},\hat{j},\hat{\ell}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes j-i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}^{\intercal}\otimes\mathbf{I}^{\otimes k-j})\mathbf{A}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes j-i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{i}}\otimes\mathbf{I}^{\otimes k-j})
=j^,^=1d(𝐈i1𝐞j^𝐞^𝐈ki1)tr{j}(𝐀)(𝐈i1𝐞j^𝐞^𝐈ki1)\displaystyle=\sum_{\hat{j},\hat{\ell}=1}^{d}(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i-1})\operatorname{tr}_{\{j\}}(\mathbf{A})(\mathbf{I}^{\otimes i-1}\otimes\boldsymbol{\mathrm{e}}_{\hat{j}}\boldsymbol{\mathrm{e}}_{\hat{\ell}}^{\intercal}\otimes\mathbf{I}^{\otimes k-i-1})
=(tr{j}(𝐀)){i}.\displaystyle=(\operatorname{tr}_{\{j\}}(\mathbf{A}))^{\intercal_{\{i\}}}.

So, we can pull all of the partial transposes from inside of the partial trace to outside the partial trace. Then, by Lemma 27, we can conclude that

tr𝒮(¯𝐀)F\displaystyle\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\bar{}\mathbf{A})\rVert}_{F} 12k𝒱[k]tr𝒮(𝐀𝒱)F\displaystyle\leq{\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A}^{\intercal_{{\mathcal{V}}}})\rVert}_{F}
=12k𝒱[k](tr𝒮(𝐀))𝒱F\displaystyle={\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\mathinner{\lVert(\operatorname{tr}_{\mathcal{S}}(\mathbf{A}))^{\intercal_{{\mathcal{V}}}}\rVert}_{F}
=12k𝒱[k]tr𝒮(𝐀)F\displaystyle={\textstyle\frac{1}{2^{k}}}{\textstyle\sum_{{\mathcal{V}}\subseteq[k]}}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}
=tr𝒮(𝐀)F.\displaystyle=\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}.

While Theorem 2 is more approachable than Theorem 1, it is still not immediately clear how to think of this rate. In order to get a better understanding of this variance, we turn to the worst-case performance of this estimator for the class of PSD matrices.

4 Cost of the Kronecker-Hutchinson Estimator for PSD Matrices

Trace estimation is often studied under the assumption that the input matrix 𝐀\mathbf{A} is PSD. This assumption is very common and reasonable in practice, and without such an assumption, it is impossible for an algorithm such as Hutchinson’s Estimator to achieve a (1±ε)(1\pm\varepsilon) relative error approximation to the trace of 𝐀\mathbf{A}. This is because an indefinite matrix 𝐀\mathbf{A} can have tr(𝐀)=0\operatorname{tr}(\mathbf{A})=0 while having large Frobenius norm, meaning that Hutchinson’s Estimator would have to perfectly recover the trace of 𝐀\mathbf{A} despite having a nonzero variance. This would incur infinite sample complexity. However, if we assume that 𝐀\mathbf{A} is PSD, then 𝐀Ftr(𝐀)\mathinner{\lVert\mathbf{A}\rVert}_{F}\leq\operatorname{tr}(\mathbf{A}), and so we know that the standard deviation of the classical Hutchinson’s Estimator with \ell samples is at most 2𝐀F2tr(𝐀)\sqrt{\frac{2}{\ell}}\mathinner{\lVert\mathbf{A}\rVert}_{F}\leq\sqrt{\frac{2}{\ell}}\operatorname{tr}(\mathbf{A}). That is, =O(1ε2)\ell=O(\frac{1}{\varepsilon^{2}}) samples suffice to achieve (1±ε)(1\pm\varepsilon) relative error with at least constant probability.
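
To make this classical rate concrete, here is a small illustrative sketch (the matrix, accuracy target, and seed are arbitrary choices) of the k=1 estimator on a PSD matrix with ℓ = ⌈2/ε²⌉ Rademacher queries.

```python
import numpy as np

rng = np.random.default_rng(6)
D, eps = 100, 0.05
G = rng.standard_normal((D, D))
A = G @ G.T                                     # a random PSD matrix
l = int(np.ceil(2 / eps ** 2))                  # ~1/eps^2 samples
X = rng.choice([-1.0, 1.0], size=(l, D))        # Rademacher query vectors
estimate = np.einsum('li,ij,lj->l', X, A, X).mean()
print(abs(estimate - np.trace(A)) / np.trace(A))    # typically well below eps
```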

We now make the same assumption and study the Kronecker-Hutchinson Estimator, and show that =O(3kε2)\ell=O(\frac{3^{k}}{\varepsilon^{2}}) samples suffice to estimate the trace of a PSD matrix to (1±ε)(1\pm\varepsilon) relative error. To start, we need a basic property of the partial trace:

Lemma 28.

Let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} be a PSD matrix, and let 𝒮[k]{\mathcal{S}}\subseteq[k]. Then, tr𝒮(𝐀)\operatorname{tr}_{\mathcal{S}}(\mathbf{A}) is also PSD.

Proof.

Recall that if 𝐀\mathbf{A} is PSD, then 𝐁𝐀𝐁\mathbf{B}^{\intercal}\mathbf{A}\mathbf{B} is also PSD for any matrix 𝐁\mathbf{B} of compatible dimension. Further recall that if 𝐀\mathbf{A} and 𝐂\mathbf{C} are both PSD, then 𝐀+𝐂\mathbf{A}+\mathbf{C} is also PSD. Since tr𝒮(𝐀)\operatorname{tr}_{\mathcal{S}}(\mathbf{A}) is a composition of such operations on 𝐀\mathbf{A}, we get that tr𝒮(𝐀)\operatorname{tr}_{\mathcal{S}}(\mathbf{A}) is also PSD. ∎
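
A quick numerical illustration of Lemma 28 (ours, with arbitrary sizes and seed): tracing out subsystems of a random PSD matrix leaves a matrix with nonnegative eigenvalues.

```python
import numpy as np

d, k = 2, 3
rng = np.random.default_rng(7)
G = rng.standard_normal((d ** k, d ** k))
A = G @ G.T                                     # a random PSD matrix

def partial_trace(M, S):
    T = M.reshape((d,) * (2 * k))
    m = k
    for i in sorted(S, reverse=True):
        T = np.trace(T, axis1=i - 1, axis2=m + i - 1)
        m -= 1
    return np.asarray(T).reshape(d ** m, d ** m)

# The partial trace of a PSD matrix is PSD (up to numerical round-off).
assert np.linalg.eigvalsh(partial_trace(A, {2, 3})).min() >= -1e-10
```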

Lemma 29 (Folklore).

Let 𝐁\mathbf{B} be a PSD matrix. Then, 𝐁Ftr(𝐁)\mathinner{\lVert\mathbf{B}\rVert}_{F}\leq\operatorname{tr}(\mathbf{B}).

We now prove our next core theorem:

Theorem 4 Restated.

Fix d,kd,k\in{\mathbb{N}}, and let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} be a PSD matrix. Let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}, where each 𝐱i\boldsymbol{\mathrm{x}}_{i} is either a Rademacher vector or a 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) Gaussian vector. Then, Var[𝐱𝐀𝐱]3k(tr(𝐀))2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq 3^{k}(\operatorname{tr}(\mathbf{A}))^{2}.

Proof.

We first apply Theorem 2, then Lemmas 28 and 29, and then Lemma 16 to get

𝔼[(𝐱𝐀𝐱)2]𝒮[k]2k|𝒮|tr𝒮(𝐀)F2𝒮[k]2k|𝒮|(tr(tr𝒮(𝐀)))2=(tr(𝐀))2𝒮[k]2k|𝒮|.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]\leq\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\leq\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}(\operatorname{tr}(\operatorname{tr}_{\mathcal{S}}(\mathbf{A})))^{2}=(\operatorname{tr}(\mathbf{A}))^{2}\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}.

So we turn our attention to bounding this sum. If we let σi:= 1[i𝒮]\sigma_{i}\;{\vcentcolon=}\;\mathbbm{1}_{[i\notin{\mathcal{S}}]}, then we can write k|𝒮|=iσik-|{{\mathcal{S}}}|=\sum_{i}\sigma_{i}, and so

𝒮[k]2k|𝒮|=𝝈{0,1}k2iσi=σ1{0,1}2σ1((σk1{0,1}2σk1(σk{0,1}2σk)))=3k.\displaystyle\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}=\sum_{\boldsymbol{\sigma}\in\{0,1\}^{k}}2^{\sum_{i}\sigma_{i}}=\sum_{\sigma_{1}\in\{0,1\}}2^{\sigma_{1}}\left(\cdots\left(\sum_{\sigma_{k-1}\in\{0,1\}}2^{\sigma_{k-1}}\left(\sum_{\sigma_{k}\in\{0,1\}}2^{\sigma_{k}}\right)\right)\right)=3^{k}.

We then conclude that Var[𝐱𝐀𝐱]𝔼[(𝐱𝐀𝐱)2]3k(tr(𝐀))2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]\leq 3^{k}(\operatorname{tr}(\mathbf{A}))^{2}. ∎

In particular, if we take \ell samples in the Kronecker-Hutchinson Estimator, we get a standard deviation of Var[H(𝐀)]3k(tr(𝐀))\sqrt{\operatorname*{Var}[H_{\ell}(\mathbf{A})]}\leq\sqrt{\frac{3^{k}}{\ell}}(\operatorname{tr}(\mathbf{A})), which is less than εtr(𝐀)\varepsilon\operatorname{tr}(\mathbf{A}) for 3kε2\ell\geq\frac{3^{k}}{\varepsilon^{2}}. More formally, by Chebyshev’s inequality, we find that =O(3kε2)\ell=O(\frac{3^{k}}{\varepsilon^{2}}) samples suffices to get

(1ε)tr(𝐀)H(𝐀)(1+ε)tr(𝐀)(1-\varepsilon)\operatorname{tr}(\mathbf{A})\leq H_{\ell}(\mathbf{A})\leq(1+\varepsilon)\operatorname{tr}(\mathbf{A})

with constant probability.
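For concreteness, here is a NumPy sketch of the Kronecker-Hutchinson Estimator itself (our own illustrative code; the function name kron_hutchinson and all parameters are ours). For small d and k we form the matrix explicitly, whereas in applications only the Kronecker-matrix-vector oracle would be available:

```python
import numpy as np
from functools import reduce

def kron_hutchinson(matvec, d, k, num_samples, rng):
    """Kronecker-Hutchinson Estimator: average of x^T A x over probes
    x = x_1 kron ... kron x_k with independent Rademacher factors in R^d."""
    total = 0.0
    for _ in range(num_samples):
        factors = [rng.choice([-1.0, 1.0], size=d) for _ in range(k)]
        x = reduce(np.kron, factors)          # Kronecker-structured probe
        total += x @ matvec(x)
    return total / num_samples

rng = np.random.default_rng(1)
d, k = 3, 4
D = d ** k
B = rng.standard_normal((D, D))
A = B @ B.T                                   # PSD matrix of size d^k x d^k
estimate = kron_hutchinson(lambda v: A @ v, d, k, num_samples=2000, rng=rng)
print(estimate, np.trace(A))
```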

4.1 A Matching Lower Bound

We might hope that the painful rate of 3kε2\ell\geq\frac{3^{k}}{\varepsilon^{2}} results from a loose analysis of the Kronecker-Hutchinson Estimator. Or, perhaps, that some more clever choice of random vector distribution can avoid this painful exponential dependence. We show that this is not the case, at least not among vectors with real iid mean-zero unit-variance components:

Theorem 6 Restated.

Fix d,kd,k\in{\mathbb{N}}. Consider Hutchinson’s Estimator run with independent vectors 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} where 𝐱i\boldsymbol{\mathrm{x}}_{i} are vectors of real iid zero-mean unit-variance entries. Then, there exists a PSD matrix 𝐀\mathbf{A} such that Hutchinson’s Estimator needs =Ω((32d)kε2)\ell=\Omega(\frac{(3-\frac{2}{d})^{k}}{\varepsilon^{2}}) samples in order to have standard deviation εtr(𝐀)\leq\varepsilon\operatorname{tr}(\mathbf{A}).

Proof.

We let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}} be the all-ones matrix. If we let 𝐞d\boldsymbol{\mathrm{e}}\in{\mathbb{R}}^{d} denote the all-ones vector, then we can write 𝐀=i=1k𝐞𝐞\mathbf{A}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\intercal}. This is helpful because we can then use the Mixed-Product property (Lemma 10) to show that

𝐱𝐀𝐱=i=1k𝐱i𝐞𝐞𝐱i=i=1k(𝐞𝐱i)2,\textstyle{\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{x}}_{i}^{\intercal}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{x}}_{i}=\prod_{i=1}^{k}(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{x}}_{i})^{2}},

where we can replace the Kronecker product with a scalar product because the Kronecker product of scalars is just the scalar product. Notably, we see that 𝐱𝐀𝐱\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}} is a product of kk iid terms. Since we are interested in the variance of the estimator, we actually want to examine

𝔼[(𝐱𝐀𝐱)2]=i=1k𝔼[(𝐞𝐱i)4]=(𝔼[(𝐞𝐱1)4])k.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{x}}_{i})^{4}]=\left(\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{x}}_{1})^{4}]\right)^{k}.

So, letting 𝐲:=𝐱1\boldsymbol{\mathrm{y}}\;{\vcentcolon=}\;\boldsymbol{\mathrm{x}}_{1} in order to simplify notation, we turn our attention to lower bounding the expectation of (𝐞𝐲)4=(i=1dyi)4(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{y}})^{4}=(\sum_{i=1}^{d}y_{i})^{4} where 𝐲\boldsymbol{\mathrm{y}} is a vector of iid zero-mean unit-variance terms. We expand this sum as

𝔼[(i=1dyi)4]=i,j,m,n=1d𝔼[yiyjymyn].\textstyle\operatorname*{\mathbb{E}}[(\sum_{i=1}^{d}y_{i})^{4}]=\sum_{i,j,m,n=1}^{d}\operatorname*{\mathbb{E}}[y_{i}y_{j}y_{m}y_{n}].

This expectation is zero if any random variable appears exactly once, like how 𝔼[y1y23]=𝔼[y1]𝔼[y23]=0\operatorname*{\mathbb{E}}[y_{1}y_{2}^{3}]=\operatorname*{\mathbb{E}}[y_{1}]\operatorname*{\mathbb{E}}[y_{2}^{3}]=0 by independence. So, we can reduce the sum above to terms that either look like 𝔼[yi4]\operatorname*{\mathbb{E}}[y_{i}^{4}] or like 𝔼[yi2yj2]\operatorname*{\mathbb{E}}[y_{i}^{2}y_{j}^{2}]. Terms like 𝔼[yi4]\operatorname*{\mathbb{E}}[y_{i}^{4}] occur exactly dd times. Terms like 𝔼[yi2yj2]\operatorname*{\mathbb{E}}[y_{i}^{2}y_{j}^{2}] occur exactly 3d23d3d^{2}-3d times (this can happen in 3 ways: if i=jm=ni=j\neq m=n, if i=mj=ni=m\neq j=n, or if i=nj=mi=n\neq j=m; each such pattern occurs d2dd^{2}-d times, so the overall pattern occurs 3d23d3d^{2}-3d times). We can then bound 𝔼[yi4](𝔼[yi2])2=1\operatorname*{\mathbb{E}}[y_{i}^{4}]\geq(\operatorname*{\mathbb{E}}[y_{i}^{2}])^{2}=1 by Jensen's Inequality, and 𝔼[yi2yj2]=𝔼[yi2]𝔼[yj2]=1\operatorname*{\mathbb{E}}[y_{i}^{2}y_{j}^{2}]=\operatorname*{\mathbb{E}}[y_{i}^{2}]\operatorname*{\mathbb{E}}[y_{j}^{2}]=1 by independence. So, we get that

𝔼[(𝐱𝐀𝐱)2]=(𝔼[(𝐞𝐲)4])k(d+3d23d)k=(3d22d)k.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]=\left(\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\intercal}\boldsymbol{\mathrm{y}})^{4}]\right)^{k}\geq\left(d+3d^{2}-3d\right)^{k}=(3d^{2}-2d)^{k}.

Noting that tr(𝐀)=dk\operatorname{tr}(\mathbf{A})=d^{k}, this in turn gives a variance lower bound of

Var[𝐱𝐀𝐱]=𝔼[(𝐱𝐀𝐱)2](tr(𝐀))2(3d22d)kd2k.\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]-(\operatorname{tr}(\mathbf{A}))^{2}\geq(3d^{2}-2d)^{k}-d^{2k}.

Then, since Var[H(𝐀)]=1Var[𝐱𝐀𝐱]\operatorname*{Var}[H_{\ell}(\mathbf{A})]=\frac{1}{\ell}\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}], in order for the Kronecker-Hutchinson’s Estimator to have standard deviation εtr(𝐀)\varepsilon\operatorname{tr}(\mathbf{A}), we need to take

Var[𝐱𝐀𝐱]ε2(tr(𝐀))2(3d22d)kd2kε2d2k=(32d)k1ε2=Ω((32d)kε2).\ell\geq\frac{\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}]}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}\geq\frac{(3d^{2}-2d)^{k}-d^{2k}}{\varepsilon^{2}d^{2k}}=\frac{(3-\frac{2}{d})^{k}-1}{\varepsilon^{2}}=\Omega\left(\frac{(3-\frac{2}{d})^{k}}{\varepsilon^{2}}\right).
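As a quick empirical check of this calculation (our own sketch, outside the formal proof; the sample size is arbitrary and the empirical value is noisy), the following snippet estimates the single-sample variance relative to tr(A)^2 for the all-ones hard instance with Rademacher factors and compares it against (3 - 2/d)^k - 1:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, trials = 3, 4, 200_000
tr_A = float(d ** k)                          # tr(A) for the all-ones matrix A = (ee^T)^{kron k}

# For this A, x^T A x = prod_i (e^T x_i)^2, so only the column sums of the
# Rademacher factors matter.
signs = rng.choice([-1.0, 1.0], size=(trials, k, d))
samples = np.prod(signs.sum(axis=2) ** 2, axis=1)

print("empirical Var[x^T A x] / tr(A)^2 :", samples.var() / tr_A ** 2)
print("predicted (3 - 2/d)^k - 1        :", (3 - 2 / d) ** k - 1)
```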

We note that this lower bound uses a rank-one matrix, which is interesting because we can exactly compute the trace of any rank-one matrix with a single Kronecker-matrix-vector product. In particular, if 𝐀=𝐮𝐮\mathbf{A}=\boldsymbol{\mathrm{u}}\boldsymbol{\mathrm{u}}^{\intercal} for some 𝐮dk\boldsymbol{\mathrm{u}}\in{\mathbb{R}}^{d^{k}}, then for any vector 𝐱\boldsymbol{\mathrm{x}} not orthogonal to 𝐮\boldsymbol{\mathrm{u}}, we have that

𝐀𝐱22𝐱𝐀𝐱=𝐱𝐮𝐮𝐮𝐮𝐱𝐱𝐮𝐮𝐱=𝐮𝐮=tr(𝐀).\frac{\mathinner{\lVert\mathbf{A}\boldsymbol{\mathrm{x}}\rVert}_{2}^{2}}{\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}}=\frac{\boldsymbol{\mathrm{x}}^{\intercal}\boldsymbol{\mathrm{u}}\cdot\boldsymbol{\mathrm{u}}^{\intercal}\boldsymbol{\mathrm{u}}\cdot\boldsymbol{\mathrm{u}}^{\intercal}\boldsymbol{\mathrm{x}}}{\boldsymbol{\mathrm{x}}^{\intercal}\boldsymbol{\mathrm{u}}\cdot\boldsymbol{\mathrm{u}}^{\intercal}\boldsymbol{\mathrm{x}}}=\boldsymbol{\mathrm{u}}^{\intercal}\boldsymbol{\mathrm{u}}=\operatorname{tr}(\mathbf{A}).

Since a Kronecker of Gaussian vectors is not orthogonal to any fixed vector 𝐮\boldsymbol{\mathrm{u}} with probability 1, we can exactly compute this trace with probability 1. That is to say, the hard instance used in this lower bound is truly only hard for the Kronecker-Hutchinson Estimator, and is not hard in general for the Kronecker-matrix-vector oracle model.
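The following short sketch (ours) carries out this single-query trace computation for a rank-one matrix, forming the probe as a Kronecker product of Gaussian vectors and applying the matrix only once:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(3)
d, k = 3, 4
u = rng.standard_normal(d ** k)               # A = u u^T is rank one
matvec = lambda v: u * (u @ v)                # applies A without forming it

x = reduce(np.kron, [rng.standard_normal(d) for _ in range(k)])  # Kronecker Gaussian probe
Ax = matvec(x)                                # the single Kronecker-matrix-vector product
trace_estimate = (Ax @ Ax) / (x @ Ax)         # ||Ax||^2 / (x^T A x) = u^T u = tr(A)
print(trace_estimate, u @ u)
```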

The lower bound is also structured in a second way – the matrix 𝐀=i=1k𝐞𝐞\mathbf{A}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\intercal} is not only rank-one but also has Kronecker structure. In Section 7 we prove that we can exactly compute the trace of any matrix 𝐀\mathbf{A} with Kronecker structure by using kd+1kd+1 Kronecker-matrix-vector products. That is to say, the hard instance used in this lower bound is in a second sense only hard for the Kronecker-Hutchinson’s Estimator, and its class of hard input matrices is not hard in general for the Kronecker-matrix-vector oracle model. That said, we do not have a clear sense of what algorithms may be able to bridge this gap and admit much faster trace estimation.

4.2 Estimating the Trace of a Random Rank-One Matrix

To round off our discussion of PSD matrix trace estimation, we show that estimating the trace of a random rank-one matrix still has an exponential dependence on kk, but a milder dependence than shown in the worst-case bound of Theorem 6. This proof serves two purposes. First, supposing our upper bound is tight, it suggests that the Kronecker-Hutchinson Estimator can still have an exponentially slow rate of convergence for matrices that are far from having Kronecker structure. Second, it shows how the variance bound from Theorem 2 can be used to get a tightened understanding of the concentration of the Kronecker-Hutchinson Estimator for particular matrices.

We study the random rank-one matrix 𝐀=𝐠𝐠dk×dk\mathbf{A}=\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal}\in{\mathbb{R}}^{d^{k}\times d^{k}}, where 𝐠𝒩(𝟎,𝐈)\boldsymbol{\mathrm{g}}\sim{\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}). By analyzing the expected squared Frobenius norm of the partial traces of 𝐀\mathbf{A}, we prove the following:

Theorem 3 Restated.

Fix d,kd,k\in{\mathbb{N}}. Let 𝐀=𝐠𝐠\mathbf{A}=\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal} where 𝐠dk\boldsymbol{\mathrm{g}}\in{\mathbb{R}}^{d^{k}} is a vector of iid standard normal Gaussians. Then, with probability 34\frac{3}{4}, we know that Var[𝐱𝐀𝐱|𝐠]O((2+1d)k(tr(𝐀))2)\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]\leq O((2+\frac{1}{d})^{k}(\operatorname{tr}(\mathbf{A}))^{2}). In particular, with probability 23\frac{2}{3}, we know that the Kronecker-Hutchinson Estimator returns a (1±ε)(1\pm\varepsilon) relative error estimate to the trace of 𝐀\mathbf{A} if =O(1ε2(2+1d)k)\ell=O(\frac{1}{\varepsilon^{2}}(2+\frac{1}{d})^{k}).

Proof.

We start by analyzing the expected squared Frobenius norm of the partial trace of 𝐀\mathbf{A} with respect to a set of subsystems 𝒮[k]{\mathcal{S}}\subseteq[k]. Let i:=|𝒮|i\;{\vcentcolon=}\;|{{\mathcal{S}}}|, and note that without loss of generality we can assume that 𝒮={1,,i}{\mathcal{S}}=\{1,\ldots,i\} when trying to compute 𝔼[tr𝒮(𝐀)F2]\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}]. We then decompose the vector 𝐠=[𝐠1𝐠di]\boldsymbol{\mathrm{g}}=\left[\begin{smallmatrix}\boldsymbol{\mathrm{g}}_{1}\\ \vdots\\ \boldsymbol{\mathrm{g}}_{d^{i}}\end{smallmatrix}\right] where each block 𝐠jdki\boldsymbol{\mathrm{g}}_{j}\in{\mathbb{R}}^{d^{k-i}} is a standard normal Gaussian vector. Then, note that we can decompose 𝐠=(𝐞1𝐠1)++(𝐞di𝐠di)\boldsymbol{\mathrm{g}}=(\boldsymbol{\mathrm{e}}_{1}\otimes\boldsymbol{\mathrm{g}}_{1})+\ldots+(\boldsymbol{\mathrm{e}}_{d^{i}}\otimes\boldsymbol{\mathrm{g}}_{d^{i}}). This lets us write out

𝐠(𝐞j𝐈ki)=j^=1di(𝐞j^𝐠j^)(𝐞j𝐈ki)=j^=1di(𝐞j^𝐞j𝐠j^)=𝐠j.\boldsymbol{\mathrm{g}}^{\intercal}(\boldsymbol{\mathrm{e}}_{j}\otimes\mathbf{I}^{\otimes k-i})=\sum_{\hat{j}=1}^{d^{i}}(\boldsymbol{\mathrm{e}}_{\hat{j}}\otimes\boldsymbol{\mathrm{g}}_{\hat{j}})^{\intercal}(\boldsymbol{\mathrm{e}}_{j}\otimes\mathbf{I}^{\otimes k-i})=\sum_{\hat{j}=1}^{d^{i}}(\boldsymbol{\mathrm{e}}_{\hat{j}}^{\intercal}\boldsymbol{\mathrm{e}}_{j}\otimes\boldsymbol{\mathrm{g}}_{\hat{j}}^{\intercal})=\boldsymbol{\mathrm{g}}_{j}^{\intercal}.

Then, if we define 𝐆=[𝐠1𝐠di]dki×di\mathbf{G}=\begin{bmatrix}\boldsymbol{\mathrm{g}}_{1}\,\cdots~{}\boldsymbol{\mathrm{g}}_{d^{i}}\end{bmatrix}\in{\mathbb{R}}^{d^{k-i}\times d^{i}}, we see that

tr𝒮(𝐠𝐠)=j=1di(𝐞j𝐈ki)𝐠𝐠(𝐞j𝐈ki)=j=1di𝐠j𝐠j=𝐆𝐆.\operatorname{tr}_{{\mathcal{S}}}(\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal})=\sum_{j=1}^{d^{i}}(\boldsymbol{\mathrm{e}}_{j}\otimes\mathbf{I}^{\otimes k-i})^{\intercal}\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal}(\boldsymbol{\mathrm{e}}_{j}\otimes\mathbf{I}^{\otimes k-i})=\sum_{j=1}^{d^{i}}\boldsymbol{\mathrm{g}}_{j}\boldsymbol{\mathrm{g}}_{j}^{\intercal}=\mathbf{G}\mathbf{G}^{\intercal}.

Noticing that 𝐆\mathbf{G} is just a matrix full of iid standard normal Gaussians, we can then bound tr𝒮(𝐀)F2=𝐆𝐆F2=𝐆44\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}=\mathinner{\lVert\mathbf{G}\mathbf{G}^{\intercal}\rVert}_{F}^{2}=\mathinner{\lVert\mathbf{G}\rVert}_{4}^{4} in expectation by just using known properties of the norms of Gaussians, where 4\mathinner{\lVert\cdot\rVert}_{4} denotes the Schatten 4-norm. For instance, Lemma A.1 from [Tropp and Webber, 2023] implies that a standard normal Gaussian matrix 𝐆N×P\mathbf{G}\in{\mathbb{R}}^{N\times P} has 𝔼[𝐆44]=NP(N+P+1)\operatorname*{\mathbb{E}}[\mathinner{\lVert\mathbf{G}\rVert}_{4}^{4}]=NP(N+P+1). So,

𝔼[tr𝒮(𝐀)F2]=𝔼[𝐆44]=dk(dk|𝒮|+d|𝒮|+1).\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\mathbf{A})\rVert}_{F}^{2}]=\operatorname*{\mathbb{E}}[\mathinner{\lVert\mathbf{G}\rVert}_{4}^{4}]=d^{k}(d^{k-|{{\mathcal{S}}}|}+d^{|{{\mathcal{S}}}|}+1).

This in turn implies that

𝔼𝐠[𝒮[k]2k|𝒮|tr𝒮(𝐀)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{g}}}\left[\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\right] =𝒮[k]2k|𝒮|dk(dk|𝒮|+d|𝒮|+1)\displaystyle=\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}d^{k}(d^{k-|{{\mathcal{S}}}|}+d^{|{{\mathcal{S}}}|}+1)
=i=0k(ki)2kidk(dki+di+1).\displaystyle=\sum_{i=0}^{k}\binom{k}{i}2^{k-i}d^{k}(d^{k-i}+d^{i}+1).

The last line above recognizes that we only depended on |𝒮||{{\mathcal{S}}}|, so we could just sum over all values of i=|𝒮|i=|{{\mathcal{S}}}|, scaled by the (ki)\binom{k}{i} times that |𝒮||{{\mathcal{S}}}| appears. Rearranging terms, and recalling that i=0k(ki)ci=(1+c)k\sum_{i=0}^{k}\binom{k}{i}c^{i}=(1+c)^{k}, we get

𝔼𝐠[𝒮[k]2k|𝒮|tr𝒮(𝐀)F2]\displaystyle\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{g}}}\left[\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\right] =2kdki=0k(ki)2i(dki+di+1)\displaystyle=2^{k}d^{k}\sum_{i=0}^{k}\binom{k}{i}2^{-i}(d^{k-i}+d^{i}+1)
=2kdki=0k(ki)(dk(12d)i+(d2)i+(12)i)\displaystyle=2^{k}d^{k}\sum_{i=0}^{k}\binom{k}{i}\left(d^{k}({\textstyle\frac{1}{2d}})^{i}+({\textstyle\frac{d}{2}})^{i}+({\textstyle\frac{1}{2}})^{i}\right)
=2kdk(dk(1+12d)k+(1+d2)k+(32)k)\displaystyle=2^{k}d^{k}\left(d^{k}(1+{\textstyle\frac{1}{2d}})^{k}+(1+{\textstyle\frac{d}{2}})^{k}+({\textstyle\frac{3}{2}})^{k}\right)
=2kd2k((1+12d)k+(12+1d)k+(32d)k)\displaystyle=2^{k}d^{2k}\left((1+{\textstyle\frac{1}{2d}})^{k}+({\textstyle\frac{1}{2}}+{\textstyle\frac{1}{d}})^{k}+({\textstyle\frac{3}{2d}})^{k}\right)
2kd2k3(1+12d)k\displaystyle\leq 2^{k}d^{2k}\cdot 3(1+{\textstyle\frac{1}{2d}})^{k}
=3d2k(2+1d)k.\displaystyle=3d^{2k}(2+{\textstyle\frac{1}{d}})^{k}.

Then, by Markov’s Inequality, we know that

\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]\leq\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\leq 8\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{g}}}\left[\sum_{{\mathcal{S}}\subseteq[k]}2^{k-|{{\mathcal{S}}}|}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\right]\leq 8\cdot 3d^{2k}(2+{\textstyle\frac{1}{d}})^{k}

holds with probability at least 78\frac{7}{8}. Then, since tr(𝐀)=𝐠22\operatorname{tr}(\mathbf{A})=\mathinner{\lVert\boldsymbol{\mathrm{g}}\rVert}_{2}^{2} is a chi-square random variable with parameter dkd^{k}, we can use Chebyshev's Inequality to show that tr(𝐀)>12dk\operatorname{tr}(\mathbf{A})>\frac{1}{2}d^{k} holds with probability at least 78\frac{7}{8} whenever dk64d^{k}\geq 64. We condition on both of these events, which together hold with probability at least 34\frac{3}{4}. Given these events, the variance of a single sample of the Kronecker-Hutchinson Estimator is therefore

\displaystyle\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]\leq 24d^{2k}(2+{\textstyle\frac{1}{d}})^{k}\leq 96(2+{\textstyle\frac{1}{d}})^{k}(\operatorname{tr}(\mathbf{A}))^{2}.

We can now examine how many samples \ell are needed for the Kronecker-Hutchinson Estimator to achieve a (1\pm\varepsilon) relative error estimate of \operatorname{tr}(\mathbf{A}) with constant probability. By Chebyshev's Inequality, we get that |H_{\ell}(\mathbf{A})-\operatorname{tr}(\mathbf{A})|\leq\sqrt{12\operatorname*{Var}[H_{\ell}(\mathbf{A})\mid\boldsymbol{\mathrm{g}}]}=\frac{\sqrt{12}}{\sqrt{\ell}}\sqrt{\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}\mid\boldsymbol{\mathrm{g}}]} except with probability at most \frac{1}{12}. To make this additive error at most \varepsilon\operatorname{tr}(\mathbf{A}), we take

=12Var[𝐱𝐀𝐱|𝐠]ε2(tr(𝐀))21296(2+1d)k(tr(𝐀))2ε2(tr(𝐀))2=1152(2+1d)kε2.\ell=\frac{12\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}\leq\frac{12\cdot 96(2+{\textstyle\frac{1}{d}})^{k}(\operatorname{tr}(\mathbf{A}))^{2}}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}=1152\frac{(2+{\textstyle\frac{1}{d}})^{k}}{\varepsilon^{2}}.

That is, with constant probability, =O(1ε2(2+1d)k)\ell=O(\frac{1}{\varepsilon^{2}}(2+{\textstyle\frac{1}{d}})^{k}) samples suffice to get a relative error approximation to a random rank-one matrix with constant probability. By allowing larger dd and kk values, and by using sharper concentrations, the constant of 1152 can be significantly improved. ∎

5 The Complex-Valued Kronecker-Matrix-Vector Oracle

Until now, we have only considered using a real-valued Kronecker-matrix-vector oracle, where we are only allowed to compute 𝐀𝐱\mathbf{A}\boldsymbol{\mathrm{x}} for vectors 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} where 𝐱id\boldsymbol{\mathrm{x}}_{i}\in{\mathbb{R}}^{d}. In this section, we consider how our results change when we instead allow 𝐱id\boldsymbol{\mathrm{x}}_{i}\in{\mathbb{C}}^{d}. If we did not have any Kronecker constraint on our oracle, then this change of oracles would be banal. If some algorithm computes qq complex-valued non-Kronecker matrix-vector products with 𝐀\mathbf{A}, then we can find another algorithm which produces the exact same result, but instead uses 2q2q real-valued non-Kronecker matrix-vector products with 𝐀\mathbf{A}. Each complex-valued query gets decomposed as 𝐀𝐱=𝐀𝔢[𝐱]+i𝐀𝔪[𝐱]\mathbf{A}\boldsymbol{\mathrm{x}}=\mathbf{A}\mathfrak{Re}[\boldsymbol{\mathrm{x}}]+i\mathbf{A}\mathfrak{Im}[\boldsymbol{\mathrm{x}}], which can be evaluated with just two real-valued oracle queries.

However, this picture changes with a Kronecker constraint. If 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} for 𝐱id\boldsymbol{\mathrm{x}}_{i}\in{\mathbb{C}}^{d}, then the real part of the vector 𝔢[𝐱]\mathfrak{Re}[\boldsymbol{\mathrm{x}}] does not necessarily have Kronecker structure as well, and therefore we cannot reproduce the strategy above. We can see this even with very small vectors:

[1i][1i]=[1ii1]which gives𝔢([1ii1])=[1001]and𝔪([1ii1])=[0110].\begin{bmatrix}1\\ i\end{bmatrix}\otimes\begin{bmatrix}1\\ i\end{bmatrix}=\begin{bmatrix}1\\ i\\ i\\ -1\end{bmatrix}\hskip 19.91684pt\text{which gives}\hskip 19.91684pt\mathfrak{Re}\left(\begin{bmatrix}1\\ i\\ i\\ -1\end{bmatrix}\right)=\begin{bmatrix}1\\ 0\\ 0\\ -1\end{bmatrix}\hskip 14.22636pt\text{and}\hskip 14.22636pt\mathfrak{Im}\left(\begin{bmatrix}1\\ i\\ i\\ -1\end{bmatrix}\right)=\begin{bmatrix}0\\ 1\\ 1\\ 0\end{bmatrix}.

The real and imaginary component vectors do not have Kronecker structure. We can still simulate a complex-valued oracle with a real-valued oracle, but this requires far more real-valued oracle queries in the worst case. Specifically, we expand the real and imaginary parts of each vector 𝐱i\boldsymbol{\mathrm{x}}_{i}. To keep the notation simple, let 𝐱i=𝐫i+i𝐦i\boldsymbol{\mathrm{x}}_{i}=\boldsymbol{\mathrm{r}}_{i}+i\boldsymbol{\mathrm{m}}_{i} decompose the real and imaginary parts of 𝐱id\boldsymbol{\mathrm{x}}_{i}\in{\mathbb{C}}^{d}, so that 𝐫i,𝐦id\boldsymbol{\mathrm{r}}_{i},\boldsymbol{\mathrm{m}}_{i}\in{\mathbb{R}}^{d}. Then,

𝐱1𝐱2𝐱k\displaystyle\boldsymbol{\mathrm{x}}_{1}\otimes\boldsymbol{\mathrm{x}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} =(𝐫1+i𝐦1)𝐱2𝐱k\displaystyle=(\boldsymbol{\mathrm{r}}_{1}+i\boldsymbol{\mathrm{m}}_{1})\otimes\boldsymbol{\mathrm{x}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}
=(𝐫1𝐱2𝐱k)+i(𝐦1𝐱2𝐱k)\displaystyle=(\boldsymbol{\mathrm{r}}_{1}\otimes\boldsymbol{\mathrm{x}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})+i(\boldsymbol{\mathrm{m}}_{1}\otimes\boldsymbol{\mathrm{x}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})
=(𝐫1𝐫2𝐱k)+i(𝐫1𝐦2𝐱k)\displaystyle=(\boldsymbol{\mathrm{r}}_{1}\otimes\boldsymbol{\mathrm{r}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})+i(\boldsymbol{\mathrm{r}}_{1}\otimes\boldsymbol{\mathrm{m}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})
+i(𝐦1𝐫2𝐱k)(𝐦1𝐦2𝐱k)\displaystyle\hskip 28.45274pt+i(\boldsymbol{\mathrm{m}}_{1}\otimes\boldsymbol{\mathrm{r}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})-(\boldsymbol{\mathrm{m}}_{1}\otimes\boldsymbol{\mathrm{m}}_{2}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k})
=\displaystyle=\ldots

where we continue to expand this out into a sum of 2k2^{k} real-valued Kronecker-structured vectors. So, if some algorithm solves a problem using qq complex-valued Kronecker-structured matrix-vector products with 𝐀\mathbf{A}, then we can only immediately guarantee the existence of an algorithm that computes 2kq2^{k}q real-valued Kronecker-structured products with 𝐀\mathbf{A} that also solves the problem.
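The sketch below (ours, for small k) makes this expansion concrete: it writes a complex Kronecker-structured vector as a sum of 2^k real Kronecker-structured vectors, where each term picks either the real or the imaginary part of every factor and carries a coefficient that is a power of i:

```python
import numpy as np
from functools import reduce
from itertools import product

rng = np.random.default_rng(4)
d, k = 3, 3
factors = [rng.standard_normal(d) + 1j * rng.standard_normal(d) for _ in range(k)]
x = reduce(np.kron, factors)                  # complex Kronecker-structured vector

# Expand x into 2^k real Kronecker-structured vectors, one per choice of
# real part (0) or imaginary part (1) in each factor.
expansion = np.zeros(d ** k, dtype=complex)
for choice in product([0, 1], repeat=k):
    real_factors = [f.real if c == 0 else f.imag for f, c in zip(factors, choice)]
    coeff = 1j ** sum(choice)                 # i raised to the number of imaginary parts
    expansion += coeff * reduce(np.kron, real_factors)

print(np.allclose(x, expansion))              # True
```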

Keeping this exponential gap between the real-valued and complex-valued oracles in mind, we now turn our attention back to the Kronecker-Hutchinson Estimator. In the non-Kronecker case, it is well known that the Hutchinson Estimator performs well with complex queries:

Lemma 30.

Let 𝐀D×D\mathbf{A}\in{\mathbb{R}}^{D\times D} and let 𝐠D\boldsymbol{\mathrm{g}}\in{\mathbb{C}}^{D} be a vector of either iid complex Rademacher or complex standard normal Gaussian entries. That is, we have 𝐠=12𝐫+i2𝐦\boldsymbol{\mathrm{g}}=\frac{1}{\sqrt{2}}\boldsymbol{\mathrm{r}}+\frac{i}{\sqrt{2}}\boldsymbol{\mathrm{m}} where 𝐫\boldsymbol{\mathrm{r}} and 𝐦\boldsymbol{\mathrm{m}} are iid real-valued vectors of either Rademacher or standard normal Gaussian entries. Then, 𝔼[𝐠H𝐀𝐠]=tr(𝐀)\operatorname*{\mathbb{E}}[\boldsymbol{\mathrm{g}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{g}}]=\operatorname{tr}(\mathbf{A}) and Var[𝐠H𝐀𝐠]𝐀+𝐀2F2𝐀F2\operatorname*{Var}[\boldsymbol{\mathrm{g}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{g}}]\leq\mathinner{\lVert\frac{\mathbf{A}+\mathbf{A}^{\intercal}}{2}\rVert}_{F}^{2}\leq\mathinner{\lVert\mathbf{A}\rVert}_{F}^{2}, where the first inequality is tight if 𝐠\boldsymbol{\mathrm{g}} is complex Gaussian, and the second is tight if 𝐀\mathbf{A} is symmetric.
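As a quick Monte Carlo illustration of Lemma 30 (our own sketch; the matrix, sample counts, and seed are arbitrary), both the real and complex Gaussian estimators should be empirically unbiased, and for a symmetric matrix the complex estimator's single-sample variance should be roughly half of the real one's:

```python
import numpy as np

rng = np.random.default_rng(5)
D, trials = 40, 100_000
B = rng.standard_normal((D, D))
A = (B + B.T) / 2                             # a symmetric test matrix

r = rng.standard_normal((trials, D))
m = rng.standard_normal((trials, D))
real_samples = np.sum((r @ A) * r, axis=1)                       # x^T A x, real Gaussian probes
g = (r + 1j * m) / np.sqrt(2)                                    # complex Gaussian probes
complex_samples = np.real(np.sum((g @ A) * g.conj(), axis=1))    # g^H A g

print("trace        :", np.trace(A))
print("real:    mean", real_samples.mean(), " var", real_samples.var())
print("complex: mean", complex_samples.mean(), " var", complex_samples.var())
print("2 ||A||_F^2  :", 2 * np.linalg.norm(A, 'fro') ** 2)
print("  ||A||_F^2  :", np.linalg.norm(A, 'fro') ** 2)
```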

Note that the difference between Lemma 23 and Lemma 30 is just a factor of 2 in the variance term. However, this factor of 2 has been extremely influential in all of our proofs so far. Improving this constant to be 1 and following the exact same proofs give the following results:

Theorem 7 Restated.

Fix d,kd,k\in{\mathbb{N}}, and let 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Let 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k}, where each 𝐱i\boldsymbol{\mathrm{x}}_{i} is either a complex Rademacher vector or a complex Gaussian vector. Let ¯𝐀:=12k𝒱[k]𝐀𝒱\bar{}\mathbf{A}\;{\vcentcolon=}\;\frac{1}{2^{k}}\sum_{{\mathcal{V}}\subseteq[k]}\mathbf{A}^{\intercal_{{\mathcal{V}}}} be the average of all partial transposes of 𝐀\mathbf{A}. Then, Var[𝐱H𝐀𝐱]𝒮[k]tr𝒮(¯𝐀)F2𝒮[k]tr𝒮(𝐀)F2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq\sum_{{\mathcal{S}}\subset[k]}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\bar{}\mathbf{A})\rVert}_{F}^{2}\leq\sum_{{\mathcal{S}}\subset[k]}\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\mathbf{A})\rVert}_{F}^{2}. If 𝐱i\boldsymbol{\mathrm{x}}_{i} are complex Gaussians, then the first expression is an equality. If we further know that 𝐀=¯𝐀\mathbf{A}=\bar{}\mathbf{A}, then both expressions are equalities. Further, if 𝐀\mathbf{A} is PSD, then we have Var[𝐱H𝐀𝐱]2k(tr(𝐀))2\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]\leq 2^{k}(\operatorname{tr}(\mathbf{A}))^{2}.

Proof.

Retrace the proofs of Lemmas 25 and 26 as well as Theorems 2 and 4. Anywhere that a 2 appears, replace it with a 1. ∎

We can similarly show that this dependence of 2k2^{k} is essentially tight:

Theorem 9 Restated.

Fix d,kd,k\in{\mathbb{N}}. Consider Hutchinson’s Estimator run with vectors 𝐱=𝐱1𝐱k\boldsymbol{\mathrm{x}}=\boldsymbol{\mathrm{x}}_{1}\otimes\cdots\otimes\boldsymbol{\mathrm{x}}_{k} where 𝐱i\boldsymbol{\mathrm{x}}_{i} are vectors of iid complex zero-mean unit-variance entries. Then, there exists a PSD matrix 𝐀\mathbf{A} such that Hutchinson’s Estimator needs =Ω((21d)kε2)\ell=\Omega(\frac{(2-{\textstyle\frac{1}{d}})^{k}}{\varepsilon^{2}}) samples in order to have standard deviation εtr(𝐀)\leq\varepsilon\operatorname{tr}(\mathbf{A}).

Proof.

Like in the proof of Theorem 6, we take 𝐀=i=1k𝐞𝐞\mathbf{A}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\intercal} where 𝐞d\boldsymbol{\mathrm{e}}\in{\mathbb{R}}^{d} is the all-ones vector. In fact, we follow the proof of Theorem 6 rather closely, but we recreate much of it to handle the complex conjugation correctly. First, we expand the product

𝔼[(𝐱H𝐀𝐱)2]=i=1k𝔼[(𝐱iH𝐞𝐞H𝐱i)2]=(𝔼[(𝐞H𝐱1)2(𝐞H𝐱1)2¯])k.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}_{i}^{\textsf{H}}\boldsymbol{\mathrm{e}}\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{x}}_{i})^{2}]=\left(\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{x}}_{1})^{2}\overline{(\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{x}}_{1})^{2}}]\right)^{k}.

We let 𝐲:=𝐱1\boldsymbol{\mathrm{y}}\;{\vcentcolon=}\;\boldsymbol{\mathrm{x}}_{1} to simplify the notation. Then, we expand this inner expectation:

𝔼[(𝐞H𝐲)2(𝐞H𝐲)2¯]=𝔼[(i,j=1dyiyj)(m,n=1dymyn¯)]=i,j,m,n=1d𝔼[yiyjymyn¯].\textstyle\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{y}})^{2}\overline{(\boldsymbol{\mathrm{e}}^{\textsf{H}}\boldsymbol{\mathrm{y}})^{2}}]=\operatorname*{\mathbb{E}}[(\sum_{i,j=1}^{d}y_{i}y_{j})(\sum_{m,n=1}^{d}\overline{y_{m}y_{n}})]=\sum_{i,j,m,n=1}^{d}\operatorname*{\mathbb{E}}[y_{i}y_{j}\overline{y_{m}y_{n}}].

This expectation is zero if any random variable appears exactly once, like how 𝔼[y1y2y22¯]=𝔼[y1]𝔼[y2y22¯]=0\operatorname*{\mathbb{E}}[y_{1}y_{2}\overline{y_{2}^{2}}]=\operatorname*{\mathbb{E}}[y_{1}]\operatorname*{\mathbb{E}}[y_{2}\overline{y_{2}^{2}}]=0 by independence. So, we only need to consider terms in this sum where we do not have any one term appear alone. This can happen in three different ways.

First, if i=j=m=ni=j=m=n, then by Jensen’s inequality, we get 𝔼[yiyjymyn¯]=𝔼[|yi|4](𝔼[|yi|2])2=1\operatorname*{\mathbb{E}}[y_{i}y_{j}\overline{y_{m}y_{n}}]=\operatorname*{\mathbb{E}}[|y_{i}|^{4}]\geq(\operatorname*{\mathbb{E}}[|y_{i}|^{2}])^{2}=1. This case occurs dd times in the sum. Second, if i=jm=ni=j\neq m=n, then by the iid distribution of the entry of 𝐲\boldsymbol{\mathrm{y}}, we get 𝔼[yiyjymyn¯]=𝔼[yi2]𝔼[ym2]¯=|𝔼[yi2]|20\operatorname*{\mathbb{E}}[y_{i}y_{j}\overline{y_{m}y_{n}}]=\operatorname*{\mathbb{E}}[y_{i}^{2}]\overline{\operatorname*{\mathbb{E}}[y_{m}^{2}]}=|\operatorname*{\mathbb{E}}[y_{i}^{2}]|^{2}\geq 0. Lastly, if i=mj=ni=m\neq j=n or i=nj=mi=n\neq j=m, then 𝔼[yiyjymyn¯]=𝔼[|yi|2]𝔼[|yj|2]=1\operatorname*{\mathbb{E}}[y_{i}y_{j}\overline{y_{m}y_{n}}]=\operatorname*{\mathbb{E}}[|y_{i}|^{2}]\operatorname*{\mathbb{E}}[|y_{j}|^{2}]=1. This case occurs 2(d2d)2(d^{2}-d) times in the sum. So, we get that

𝔼[(𝐱H𝐀𝐱)2](2d22d+d)k=(2d2d)k.\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]\geq\left(2d^{2}-2d+d\right)^{k}=(2d^{2}-d)^{k}.

Noting that tr(𝐀)=dk\operatorname{tr}(\mathbf{A})=d^{k}, this in turn gives a variance lower bound of

Var[𝐱H𝐀𝐱]=𝔼[(𝐱H𝐀𝐱)2](tr(𝐀))2(2d2d)kd2k.\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]=\operatorname*{\mathbb{E}}[(\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}})^{2}]-(\operatorname{tr}(\mathbf{A}))^{2}\geq(2d^{2}-d)^{k}-d^{2k}.

Then, since Var[H(𝐀)]=1Var[𝐱H𝐀𝐱]\operatorname*{Var}[H_{\ell}(\mathbf{A})]=\frac{1}{\ell}\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}], in order for the complex Kronecker-Hutchinson’s Estimator to have standard deviation εtr(𝐀)\varepsilon\operatorname{tr}(\mathbf{A}), we need to take

Var[𝐱H𝐀𝐱]ε2(tr(𝐀))2(2d2d)kd2kε2d2k=(21d)k1ε2=Ω((21d)kε2).\ell\geq\frac{\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}\geq\frac{(2d^{2}-d)^{k}-d^{2k}}{\varepsilon^{2}d^{2k}}=\frac{(2-\frac{1}{d})^{k}-1}{\varepsilon^{2}}=\Omega\left(\frac{(2-{\textstyle\frac{1}{d}})^{k}}{\varepsilon^{2}}\right).

Lastly, we produce a much nicer bound on the sample complexity of the Kronecker-Hutchinson’s Estimator when working with random rank-one matrices:

Theorem 8 Restated.

Fix d,kd,k\in{\mathbb{N}}. Let 𝐀=𝐠𝐠\mathbf{A}=\boldsymbol{\mathrm{g}}\boldsymbol{\mathrm{g}}^{\intercal} where 𝐠dk\boldsymbol{\mathrm{g}}\in{\mathbb{R}}^{d^{k}} is a vector of iid standard normal Gaussians. Then, with probability 34\frac{3}{4}, we know that Var[𝐱H𝐀𝐱|𝐠]O((1+1d)k(tr(𝐀))2)\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}~{}|~{}\boldsymbol{\mathrm{g}}]\leq O((1+\frac{1}{d})^{k}(\operatorname{tr}(\mathbf{A}))^{2}). In particular, with probability 23\frac{2}{3}, we know that the Kronecker-Hutchinson Estimator returns a (1±ε)(1\pm\varepsilon) relative error estimate to the trace of 𝐀\mathbf{A} if =O(1ε2(1+1d)k)\ell=O(\frac{1}{\varepsilon^{2}}(1+\frac{1}{d})^{k}).

Proof.

Recall from the proof of Theorem 3 that the expected squared Frobenius norm of the partial trace of 𝐀\mathbf{A} is

𝔼[tr𝒮(𝐀)F2]=𝔼[𝐆44]=dk(dk|𝒮|+d|𝒮|+1).\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{{\mathcal{S}}}(\mathbf{A})\rVert}_{F}^{2}]=\operatorname*{\mathbb{E}}[\mathinner{\lVert\mathbf{G}\rVert}_{4}^{4}]=d^{k}(d^{k-|{{\mathcal{S}}}|}+d^{|{{\mathcal{S}}}|}+1).

Then, we can directly bound

𝒮[k]𝔼[tr𝒮(𝐀)F2]\displaystyle\sum_{{\mathcal{S}}\subseteq[k]}\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}] =𝒮[k]dk(dk|𝒮|+d|𝒮|+1)\displaystyle=\sum_{{\mathcal{S}}\subseteq[k]}d^{k}(d^{k-|{{\mathcal{S}}}|}+d^{|{{\mathcal{S}}}|}+1)
=i=0k(ki)dk(dki+di+1).\displaystyle=\sum_{i=0}^{k}\binom{k}{i}d^{k}(d^{k-i}+d^{i}+1).

Where the last line recognizes that we only depended on |𝒮||{{\mathcal{S}}}|, so we could just sum over all values of i=|𝒮|i=|{{\mathcal{S}}}|, scaled by the (ki)\binom{k}{i} times that |𝒮||{{\mathcal{S}}}| appears. Rearranging terms, and recalling that i=0k(ki)ci=(1+c)k\sum_{i=0}^{k}\binom{k}{i}c^{i}=(1+c)^{k}, we get

𝒮[k]𝔼[tr𝒮(𝐀)F2]\displaystyle\sum_{{\mathcal{S}}\subseteq[k]}\operatorname*{\mathbb{E}}[\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}] =dki=0k(ki)(dki+di+1)\displaystyle=d^{k}\sum_{i=0}^{k}\binom{k}{i}(d^{k-i}+d^{i}+1)
=dki=0k(ki)(dk(1d)i+di+1i)\displaystyle=d^{k}\sum_{i=0}^{k}\binom{k}{i}\left(d^{k}({\textstyle\frac{1}{d}})^{i}+d^{i}+1^{i}\right)
=dk(dk(1+1d)k+(1+d)k+2k)\displaystyle=d^{k}\left(d^{k}(1+{\textstyle\frac{1}{d}})^{k}+(1+d)^{k}+2^{k}\right)
=d2k(2(1+1d)k+(2d)k)\displaystyle=d^{2k}\left(2(1+{\textstyle\frac{1}{d}})^{k}+({\textstyle\frac{2}{d}})^{k}\right)
3d2k(1+1d)k.\displaystyle\leq 3d^{2k}(1+{\textstyle\frac{1}{d}})^{k}.

As in the proof of Theorem 3, we know that tr(𝐀)12dk\operatorname{tr}(\mathbf{A})\geq\frac{1}{2}d^{k} with probability at least 78\frac{7}{8}. Further, applying Theorem 7 to the realized 𝐠\boldsymbol{\mathrm{g}} and Markov's Inequality over the draw of 𝐠\boldsymbol{\mathrm{g}}, with probability at least 78\frac{7}{8} we can bound

\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}\mid\boldsymbol{\mathrm{g}}]\leq\sum_{{\mathcal{S}}\subseteq[k]}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\leq 8\operatorname*{\mathbb{E}}_{\boldsymbol{\mathrm{g}}}\left[\sum_{{\mathcal{S}}\subseteq[k]}\mathinner{\lVert\operatorname{tr}_{\mathcal{S}}(\mathbf{A})\rVert}_{F}^{2}\right]\leq 24d^{2k}(1+{\textstyle\frac{1}{d}})^{k}.

Conditioning on both of these events, which together hold with probability at least 34\frac{3}{4}, we therefore have \operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}\mid\boldsymbol{\mathrm{g}}]\leq 96(1+{\textstyle\frac{1}{d}})^{k}(\operatorname{tr}(\mathbf{A}))^{2}.

Following the same probability arguments as in the proof of Theorem 3, we get that taking

=12Var[𝐱H𝐀𝐱]ε2(tr(𝐀))21296(1+1d)k(tr(𝐀))2ε2(tr(𝐀))2=1152(1+1d)kε2\ell=\frac{12\operatorname*{Var}[\boldsymbol{\mathrm{x}}^{\textsf{H}}\mathbf{A}\boldsymbol{\mathrm{x}}]}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}\leq\frac{12\cdot 96(1+{\textstyle\frac{1}{d}})^{k}(\operatorname{tr}(\mathbf{A}))^{2}}{\varepsilon^{2}(\operatorname{tr}(\mathbf{A}))^{2}}=1152\frac{(1+{\textstyle\frac{1}{d}})^{k}}{\varepsilon^{2}}

samples suffices for the complex-valued Kronecker-Hutchinson’s Estimator to recover a (1±ε)(1\pm\varepsilon) relative error approximation to the trace of 𝐀\mathbf{A} with probability 23\frac{2}{3}. ∎

Recall that the rate in the real-valued case was \ell=O(\frac{1}{\varepsilon^{2}}(2+\frac{1}{d})^{k}), and the rate here in the complex-valued case is \ell=O(\frac{1}{\varepsilon^{2}}(1+\frac{1}{d})^{k}). In the real-valued case, this rate grows at least as fast as 2^{k}, preventing any polynomial dependence on kk from being possible. However, in the complex-valued case, if k=O(d), then the Kronecker-Hutchinson Estimator actually converges in \ell=O(\frac{1}{\varepsilon^{2}}) samples. That is, for these random rank-one matrices, the bound we can show for the real-valued estimator is exponentially slow in kk, while the complex-valued estimator provably converges polynomially quickly!
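The following Monte Carlo sketch (ours; the parameters are arbitrary and the empirical numbers are noisy) compares the per-sample spread of the real-valued and complex-valued Kronecker-Hutchinson Estimators on one draw of the random rank-one matrix 𝐀 = 𝐠𝐠⊺; the gap between the two should be visible and is consistent with the (2+1/d)^k versus (1+1/d)^k rates:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(6)
d, k, trials = 3, 4, 100_000
g = rng.standard_normal(d ** k)               # A = g g^T, a random rank-one matrix
tr_A = g @ g                                  # tr(A) = ||g||^2

def kron_probe(use_complex):
    if use_complex:
        parts = [(rng.standard_normal(d) + 1j * rng.standard_normal(d)) / np.sqrt(2)
                 for _ in range(k)]
    else:
        parts = [rng.standard_normal(d) for _ in range(k)]
    return reduce(np.kron, parts)

for label, use_complex in [("real   ", False), ("complex", True)]:
    samples = np.empty(trials)
    for t in range(trials):
        x = kron_probe(use_complex)
        samples[t] = np.abs(np.vdot(x, g)) ** 2   # x^H (g g^T) x = |x^H g|^2
    print(label, "per-sample std / tr(A):", samples.std() / tr_A)
```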

6 Looking Beyond the Kronecker-Hutchinson Estimator

We have shown that the Kronecker-Hutchinson Estimator has a painful exponential dependence on kk in the worst case. However, we are broadly interested in the class of all algorithms that operate in the Kronecker-matrix-vector oracle model. We show that all trace estimation algorithms in this model must have query complexity that grows with kk:

Theorem 31.

Any algorithm that accesses a PSD matrix 𝐀\mathbf{A} via Kronecker-matrix-vector products 𝐀𝐱(1),,𝐀𝐱(q)\mathbf{A}\boldsymbol{\mathrm{x}}^{(1)},\ldots,\mathbf{A}\boldsymbol{\mathrm{x}}^{(q)}, where 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} are (possibly adaptively) chosen real-valued vectors, requires q=Ω(kε)q=\Omega(\frac{\sqrt{k}}{\varepsilon}) in order to output an estimate t^\hat{t} such that 𝔼[(t^tr(𝐀))2]εtr(𝐀)\sqrt{\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}]}\leq\varepsilon\operatorname{tr}(\mathbf{A}).

If we constrain ourselves to estimators t^\hat{t} that are unbiased (i.e. 𝔼[t^]=tr(𝐀)\operatorname*{\mathbb{E}}[\hat{t}]=\operatorname{tr}(\mathbf{A})), then this lower bound says that q=Ω(kε)q=\Omega(\frac{\sqrt{k}}{\varepsilon}) Kronecker-matrix-vector products are required to achieve standard deviation εtr(𝐀)\leq\varepsilon\operatorname{tr}(\mathbf{A}). Our analysis builds off of tools in [Braverman et al., 2020, Jiang et al., 2021], which study matrix-vector lower bounds without the Kronecker constraint. In particular, we use the following definition and imported theorem:

Definition 32.

Let 𝐆d×d\mathbf{G}\in{\mathbb{R}}^{d\times d} be a matrix of iid standard normal Gaussian entries. Then, the matrix 𝐖:=𝐆𝐆d×d\mathbf{W}\;{\vcentcolon=}\;\mathbf{G}^{\intercal}\mathbf{G}\in{\mathbb{R}}^{d\times d} is distributed as a Wishart(d)\text{Wishart}(d) random matrix.
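Since the argument below repeatedly uses that tr(W) = ‖G‖_F² is a chi-squared random variable with d² degrees of freedom (mean d², variance 2d²), here is a small sanity-check sketch (ours; the dimension, sample count, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
d, trials = 5, 200_000
G = rng.standard_normal((trials, d, d))
traces = np.einsum("tij,tij->t", G, G)        # tr(W) = tr(G^T G) = ||G||_F^2 per sample

print("mean of tr(W):", traces.mean(), " (d^2 =", d * d, ")")
print("var  of tr(W):", traces.var(),  " (2 d^2 =", 2 * d * d, ")")
```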

Imported Theorem 33 (Lemma 13 from [Braverman et al., 2020]).

Let 𝐖Wishart(d)\mathbf{W}\sim\emph{\text{Wishart(d)}}. Then, for any sequence of (possibly adaptively chosen) matrix-vector queries 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} and responses 𝐳(1),,𝐳(q)\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)} (so that 𝐳(j)=𝐖𝐱(j)\boldsymbol{\mathrm{z}}^{(j)}=\mathbf{W}\boldsymbol{\mathrm{x}}^{(j)}), there exists a rotation matrix 𝐕d×d\mathbf{V}\in{\mathbb{R}}^{d\times d} constructed as a function of 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} such that the matrix 𝐕𝐖𝐕\mathbf{V}\mathbf{W}\mathbf{V}^{\intercal} can be written as

𝐕𝐖𝐕=𝚫+[𝟎𝟎𝟎~𝐖],\mathbf{V}\mathbf{W}\mathbf{V}^{\intercal}=\mathbf{\Delta}+\begin{bmatrix}\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\widetilde{}\mathbf{W}\end{bmatrix},

where ~𝐖(d−q)×(d−q)\widetilde{}\mathbf{W}\in{\mathbb{R}}^{(d-q)\times(d-q)} conditioned on the values of 𝐱(1),,𝐱(q),𝐳(1),,𝐳(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)} is distributed as a Wishart matrix. That is, ~𝐖|(𝐱(1),,𝐱(q),𝐳(1),,𝐳(q))Wishart(dq)\widetilde{}\mathbf{W}~{}|~{}(\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)})\sim\emph{\text{Wishart}}(d-q). Further, 𝚫\mathbf{\Delta} is a PSD matrix that is a deterministic function of 𝐱(1),,𝐱(q),𝐳(1),,𝐳(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)}.

This theorem is powerful because it says that any algorithm that has accessed a Wishart matrix via qq matrix-vector products must have a submatrix about which the algorithm has absolutely no knowledge. We note that in [Braverman et al., 2020], they do not state that 𝚫\mathbf{\Delta} is a deterministic function of 𝐱(1),,𝐱(q),𝐳(1),,𝐳(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)}, but this fact is easily verified by examining the proof of Lemma 13 in [Braverman et al., 2020].

We start by extending Imported Theorem 33 to the Kronecker-matrix-vector oracle model. We construct our hard instance input matrix as 𝐀=i=1k𝐖idk×dk\mathbf{A}=\otimes_{i=1}^{k}\mathbf{W}_{i}\in{\mathbb{R}}^{d^{k}\times d^{k}}, where 𝐖1,,𝐖kd×d\mathbf{W}_{1},\ldots,\mathbf{W}_{k}\in{\mathbb{R}}^{d\times d} are iid Wishart matrices:

Corollary 34.

Let 𝐖1,,𝐖k\mathbf{W}_{1},\ldots,\mathbf{W}_{k} be iid Wishart(d)\emph{\text{Wishart}}(d) matrices, and let 𝐀:=i=1k𝐖idk×dk\mathbf{A}\;{\vcentcolon=}\;\otimes_{i=1}^{k}\mathbf{W}_{i}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Then, for any sequence of (possibly adaptively chosen) Kronecker-matrix-vector queries 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} and responses 𝐳(1),,𝐳(q)\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)} (so that 𝐳(j)=𝐀𝐱(j)\boldsymbol{\mathrm{z}}^{(j)}=\mathbf{A}\boldsymbol{\mathrm{x}}^{(j)}), there exist orthogonal matrices 𝐕1,,𝐕kd×d\mathbf{V}_{1},\ldots,\mathbf{V}_{k}\in{\mathbb{R}}^{d\times d} such that 𝐕=i=1k𝐕i\mathbf{V}=\otimes_{i=1}^{k}\mathbf{V}_{i} has

𝐕𝐀𝐕=i=1k(𝚫i+[𝟎𝟎𝟎~𝐖i]).\mathbf{V}\mathbf{A}\mathbf{V}^{\intercal}=\otimes_{i=1}^{k}\left(\mathbf{\Delta}_{i}+\begin{bmatrix}\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\widetilde{}\mathbf{W}_{i}\end{bmatrix}\right).

Let 𝐱(j)=i=1k𝐱i(j)\boldsymbol{\mathrm{x}}^{(j)}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{x}}_{i}^{(j)} and 𝐳i(j)=𝐖i𝐱i(j)\boldsymbol{\mathrm{z}}_{i}^{(j)}=\mathbf{W}_{i}\boldsymbol{\mathrm{x}}_{i}^{(j)} be decompositions of the query and response vectors. Then, each ~𝐖idq×dq\widetilde{}\mathbf{W}_{i}\in{\mathbb{R}}^{d-q\times d-q} conditioned on the values of 𝐱i(1),,𝐱i(q),𝐳i(1),,𝐳i(q)\boldsymbol{\mathrm{x}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{x}}_{i}^{(q)},\boldsymbol{\mathrm{z}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{z}}_{i}^{(q)} is distributed as a Wishart matrix. That is, ~𝐖i|(𝐱i(1),,𝐱i(q),𝐳i(1),,𝐳i(q))Wishart(dq)\widetilde{}\mathbf{W}_{i}~{}|~{}(\boldsymbol{\mathrm{x}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{x}}_{i}^{(q)},\boldsymbol{\mathrm{z}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{z}}_{i}^{(q)})\sim\emph{\text{Wishart}}(d-q). Further, each 𝚫i\mathbf{\Delta}_{i} is a PSD matrix that is a deterministic function of 𝐱i(1),,𝐱i(q),𝐳i(1),,𝐳i(q)\boldsymbol{\mathrm{x}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{x}}_{i}^{(q)},\boldsymbol{\mathrm{z}}_{i}^{(1)},\ldots,\boldsymbol{\mathrm{z}}_{i}^{(q)}.

With this setup, we can now form a lower bound against any trace estimating algorithm. We first prove in Lemma 35 that, in terms of minimizing mean squared error, no estimator can outperform the conditional expectation t~=𝔼[tr(𝐀)|𝐕1,,𝐕k,𝚫1,,𝚫k]\tilde{t}=\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})~{}|~{}\mathbf{V}_{1},\ldots,\mathbf{V}_{k},\mathbf{\Delta}_{1},\ldots,\mathbf{\Delta}_{k}]. Next, in Lemma 36, we compute the exact variance of this estimator. Lastly, in the final proof of Theorem 31, we compare this variance to the trace of 𝐀\mathbf{A}, to conclude that qΩ(kε)q\geq\Omega(\frac{\sqrt{k}}{\varepsilon}).

To simplify notation, we let 𝒲=(~𝐖1,,~𝐖k){\mathcal{W}}=(\widetilde{}\mathbf{W}_{1},\ldots,\widetilde{}\mathbf{W}_{k}) denote the list of hidden Wishart matrices, and let 𝒱:=(𝐕1,,𝐕k,𝚫1,,𝚫k){\mathcal{V}}\;{\vcentcolon=}\;(\mathbf{V}_{1},\ldots,\mathbf{V}_{k},\mathbf{\Delta}_{1},\ldots,\mathbf{\Delta}_{k}) be the variables that t~\tilde{t} is conditioned on. Note that the algorithm only observes the response vectors 𝐳(1),,𝐳(q)\boldsymbol{\mathrm{z}}^{(1)},\ldots,\boldsymbol{\mathrm{z}}^{(q)}, and that the algorithm does not exactly know the vectors 𝐳i(j)=𝐖i𝐱i(j)\boldsymbol{\mathrm{z}}_{i}^{(j)}=\mathbf{W}_{i}\boldsymbol{\mathrm{x}}_{i}^{(j)}. Therefore, the algorithm might not observe all of 𝒱{\mathcal{V}}, but 𝒱{\mathcal{V}} definitely covers everything that the algorithm does see.

Lemma 35.

Consider an algorithm that interacts with 𝐀\mathbf{A} via qq Kronecker-matrix-vector products as in the setting of Corollary 34. Let t^\hat{t} be the output of that algorithm, and let t~:=𝔼[tr(𝐀)|𝒱]\tilde{t}\;{\vcentcolon=}\;\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}]. Then, t~\tilde{t} has mean squared error no larger than that of t^\hat{t}:

𝔼[(t^tr(𝐀))2]𝔼[(t~tr(𝐀))2].\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}]\geq\operatorname*{\mathbb{E}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}].
Proof.

Note that the algorithm only observes 𝐱(1),,𝐱(q),𝐳(1),𝐳(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)},\boldsymbol{\mathrm{z}}^{(1)},\ldots\boldsymbol{\mathrm{z}}^{(q)}, so the estimator t^\hat{t} has to be a (possibly randomized) function of 𝒱{\mathcal{V}}. We then expand the mean squared error of t^\hat{t} by using the tower rule:

𝔼[(t^tr(𝐀))2]\displaystyle\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}] =𝔼[𝔼𝒲[(t^tr(𝐀))2|𝒱]]\displaystyle=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{E}}_{{\mathcal{W}}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}~{}|~{}{\mathcal{V}}]]
𝔼[𝔼𝒲[(t~tr(𝐀))2|𝒱]]\displaystyle\geq\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{E}}_{{\mathcal{W}}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}~{}|~{}{\mathcal{V}}]]
=𝔼[(t~tr(𝐀))2],\displaystyle=\operatorname*{\mathbb{E}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}],

where the inequality comes from the fact that the conditional expectation t~=𝔼[tr(𝐀)|𝒱]\tilde{t}=\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}] minimizes the mean squared error among all estimators that depend only on 𝒱{\mathcal{V}}. ∎

For any specific instantiation of 𝒱{\mathcal{V}}, we can ask what the expected mean squared error of t~\tilde{t} is:

𝔼𝒲[(tr(𝐀)t~)2|𝒱]=𝔼𝒲[(tr(𝐀)𝔼[tr(𝐀)|𝒱])2|𝒱]=Var𝒲[tr(𝐀)|𝒱].\operatorname*{\mathbb{E}}_{{\mathcal{W}}}[(\operatorname{tr}(\mathbf{A})-\tilde{t})^{2}~{}|~{}{\mathcal{V}}]=\operatorname*{\mathbb{E}}_{{\mathcal{W}}}[(\operatorname{tr}(\mathbf{A})-\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})|{\mathcal{V}}])^{2}~{}|~{}{\mathcal{V}}]=\operatorname*{Var}_{{\mathcal{W}}}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}].

We do not really care about this value for a specific choice of 𝒱{\mathcal{V}}, but instead care about it on average across all values of 𝒱{\mathcal{V}}. That is, we care about the squared error 𝔼[(t~tr(𝐀))2]=𝔼𝒱[Var[tr(𝐀)|𝒱]]\operatorname*{\mathbb{E}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}]=\operatorname*{\mathbb{E}}_{{\mathcal{V}}}[\operatorname*{Var}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}]]. We then want to ensure that the error is less than ε𝔼[tr(𝐀)]\varepsilon\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})]. So, we want to find out what value of qq ensures that

𝔼[Var[tr(𝐀)|𝒱]]ε𝔼[tr(𝐀)].\sqrt{\operatorname*{\mathbb{E}}[\operatorname*{Var}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}]]}\leq\varepsilon\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})].

We start by computing the variance inside the square root on the left.

Lemma 36.

The mean squared error of t~\tilde{t} is 𝔼[Var[tr(𝐀)|𝒱]]=(d4+2d2)k(d4+2d22(dq)2)k\operatorname*{\mathbb{E}}[\operatorname*{Var}[\operatorname{tr}(\mathbf{A})~{}|~{}{\mathcal{V}}]]=(d^{4}+2d^{2})^{k}-(d^{4}+2d^{2}-2(d-q)^{2})^{k}.

Proof.

Let Xi:=tr(𝐖i)X_{i}\;{\vcentcolon=}\;\operatorname{tr}(\mathbf{W}_{i}) and X:=tr(𝐀)X\;{\vcentcolon=}\;\operatorname{tr}(\mathbf{A}), so that X=i=1kXiX=\prod_{i=1}^{k}X_{i} is a product of kk independent random variables. Then, since these terms are independent, we know that

Var[i=1kXi]=i=1k𝔼[Xi2]i=1k(𝔼[Xi])2.\operatorname*{Var}[{\textstyle\prod_{i=1}^{k}X_{i}}]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[X_{i}^{2}]-\prod_{i=1}^{k}(\operatorname*{\mathbb{E}}[X_{i}])^{2}.

Further, since Xi|𝒱X_{i}|{\mathcal{V}} is independent of Xj|𝒱X_{j}|{\mathcal{V}} for iji\neq j, we know that

𝔼[Var[i=1kXi|𝒱]]\displaystyle\operatorname*{\mathbb{E}}[\operatorname*{Var}[{\textstyle\prod_{i=1}^{k}X_{i}}~{}|~{}{\mathcal{V}}]] =𝔼[i=1k𝔼[Xi2|𝒱]i=1k(𝔼[Xi|𝒱])2]\displaystyle=\operatorname*{\mathbb{E}}\left[\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[X_{i}^{2}|{\mathcal{V}}]-\prod_{i=1}^{k}(\operatorname*{\mathbb{E}}[X_{i}|{\mathcal{V}}])^{2}\right]
=i=1k𝔼[Xi2]i=1k𝔼[(𝔼[Xi|𝒱])2].\displaystyle=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[X_{i}^{2}]-\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[(\operatorname*{\mathbb{E}}[X_{i}|{\mathcal{V}}])^{2}].

The last expectation on the right can be simplified a bit further. Since Var[Xi|𝒱]=𝔼[Xi2|𝒱](𝔼[Xi|𝒱])2\operatorname*{Var}[X_{i}~{}|~{}{\mathcal{V}}]=\operatorname*{\mathbb{E}}[X_{i}^{2}~{}|~{}{\mathcal{V}}]-(\operatorname*{\mathbb{E}}[X_{i}~{}|~{}{\mathcal{V}}])^{2}, we can write

𝔼[(𝔼[Xi|𝒱])2]=𝔼[𝔼[Xi2|𝒱]Var[Xi|𝒱]]=𝔼[Xi2]𝔼[Var[Xi|𝒱]],\operatorname*{\mathbb{E}}[(\operatorname*{\mathbb{E}}[X_{i}|{\mathcal{V}}])^{2}]=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{E}}[X_{i}^{2}|{\mathcal{V}}]-\operatorname*{Var}[X_{i}|{\mathcal{V}}]]=\operatorname*{\mathbb{E}}[X_{i}^{2}]-\operatorname*{\mathbb{E}}[\operatorname*{Var}[X_{i}|{\mathcal{V}}]],

which in turn means we have

𝔼[Var[i=1kXi|𝒱]]=i=1k𝔼[Xi2]i=1k(𝔼[Xi2]𝔼[Var[Xi|𝒱]]).\displaystyle\operatorname*{\mathbb{E}}[\operatorname*{Var}[{\textstyle\prod_{i=1}^{k}X_{i}}~{}|~{}{\mathcal{V}}]]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[X_{i}^{2}]-\prod_{i=1}^{k}\left(\operatorname*{\mathbb{E}}[X_{i}^{2}]-\operatorname*{\mathbb{E}}[\operatorname*{Var}[X_{i}|{\mathcal{V}}]]\right). (8)

Now we can directly analyze these means and expectations. First, note that Xi=tr(𝐖i)=tr(𝐆i𝐆i)=𝐆iF2χd22X_{i}=\operatorname{tr}(\mathbf{W}_{i})=\operatorname{tr}(\mathbf{G}_{i}^{\intercal}\mathbf{G}_{i})=\mathinner{\lVert\mathbf{G}_{i}\rVert}_{F}^{2}\sim\chi_{d^{2}}^{2} is a chi-squared random variable. Therefore, 𝔼[Xi2]=Var[Xi]+(𝔼[Xi])2=2d2+d4\operatorname*{\mathbb{E}}[X_{i}^{2}]=\operatorname*{Var}[X_{i}]+(\operatorname*{\mathbb{E}}[X_{i}])^{2}=2d^{2}+d^{4}. Next, we focus on 𝔼[Var[Xi|𝒱]]\operatorname*{\mathbb{E}}[\operatorname*{Var}[X_{i}|{\mathcal{V}}]]. By the linearity and cyclic property of the trace, and since 𝐕i𝐕i=𝐈\mathbf{V}_{i}^{\intercal}\mathbf{V}_{i}=\mathbf{I}, we have that

X_{i}=\operatorname{tr}(\mathbf{W}_{i})=\operatorname{tr}(\mathbf{W}_{i}\mathbf{V}_{i}^{\intercal}\mathbf{V}_{i})=\operatorname{tr}(\mathbf{V}_{i}\mathbf{W}_{i}\mathbf{V}_{i}^{\intercal})=\operatorname{tr}(\mathbf{\Delta}_{i})+\operatorname{tr}(\widetilde{}\mathbf{W}_{i}).

Therefore the only term in 𝒱{\mathcal{V}} that XiX_{i} depends on is 𝚫i\mathbf{\Delta}_{i}, and so we can expand

𝔼[Var[Xi|𝒱]]=𝔼[Var[tr(𝚫i)+tr(~𝐖i)|𝚫i]]=𝔼[Var[tr(~𝐖i)]]=Var[tr(~𝐖i)]=2(dq)2,\displaystyle\operatorname*{\mathbb{E}}[\operatorname*{Var}[X_{i}|{\mathcal{V}}]]=\operatorname*{\mathbb{E}}[\operatorname*{Var}[\operatorname{tr}(\mathbf{\Delta}_{i})+\operatorname{tr}(\widetilde{}\mathbf{W}_{i})~{}|~{}\mathbf{\Delta}_{i}]]=\operatorname*{\mathbb{E}}[\operatorname*{Var}[\operatorname{tr}(\widetilde{}\mathbf{W}_{i})]]=\operatorname*{Var}[\operatorname{tr}(\widetilde{}\mathbf{W}_{i})]=2(d-q)^{2},

where the last equality notes that \operatorname{tr}(\widetilde{}\mathbf{W}_{i})=\operatorname{tr}(\widetilde{}\mathbf{G}_{i}^{\intercal}\widetilde{}\mathbf{G}_{i})=\mathinner{\lVert\widetilde{}\mathbf{G}_{i}\rVert}_{F}^{2}\sim\chi_{(d-q)^{2}}^{2} is also a chi-squared random variable. Plugging our expectations back into Equation 8, we find that

𝔼[Var[i=1kXi|𝒱]]\displaystyle\operatorname*{\mathbb{E}}[\operatorname*{Var}[{\textstyle\prod_{i=1}^{k}X_{i}}~{}|~{}{\mathcal{V}}]] =(d4+2d2)k(d4+2d22(dq)2)k,\displaystyle=\left(d^{4}+2d^{2}\right)^{k}-\left(d^{4}+2d^{2}-2(d-q)^{2}\right)^{k},

which completes the proof. ∎

We now have built up the tools we need to prove the central lower bound:

Theorem 31 Restated.

Any algorithm that accesses a PSD matrix 𝐀\mathbf{A} via Kronecker-matrix-vector products 𝐀𝐱(1),,𝐀𝐱(q)\mathbf{A}\boldsymbol{\mathrm{x}}^{(1)},\ldots,\mathbf{A}\boldsymbol{\mathrm{x}}^{(q)}, where 𝐱(1),,𝐱(q)\boldsymbol{\mathrm{x}}^{(1)},\ldots,\boldsymbol{\mathrm{x}}^{(q)} are (possibly adaptively) chosen real-valued vectors, requires q=Ω(kε)q=\Omega(\frac{\sqrt{k}}{\varepsilon}) in order to output an estimate t^\hat{t} such that 𝔼[(t^tr(𝐀))2]εtr(𝐀)\sqrt{\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}]}\leq\varepsilon\operatorname{tr}(\mathbf{A}).

Proof.

We consider 𝐀\mathbf{A} as in Corollary 34. By Lemmas 35 and 36, we know that

𝔼[(t^tr(𝐀))2]𝔼[(t~tr(𝐀))2]=(d4+2d2)k(d4+2d22(dq)2)k.\operatorname*{\mathbb{E}}[(\hat{t}-\operatorname{tr}(\mathbf{A}))^{2}]\geq\operatorname*{\mathbb{E}}[(\tilde{t}-\operatorname{tr}(\mathbf{A}))^{2}]=(d^{4}+2d^{2})^{k}-(d^{4}+2d^{2}-2(d-q)^{2})^{k}.

We then want to determine for which values of qq this right-hand side can be at most ε2(𝔼[tr(𝐀)])2\varepsilon^{2}(\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})])^{2}. First, we quickly note that 𝔼[tr(𝐀)]=i=1k𝔼[tr(𝐖i)]=d2k\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{A})]=\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[\operatorname{tr}(\mathbf{W}_{i})]=d^{2k}, since tr(𝐖i)=tr(𝐆i𝐆i)=𝐆iF2χd22\operatorname{tr}(\mathbf{W}_{i})=\operatorname{tr}(\mathbf{G}_{i}^{\intercal}\mathbf{G}_{i})=\mathinner{\lVert\mathbf{G}_{i}\rVert}_{F}^{2}\sim\chi_{d^{2}}^{2} is a chi-squared random variable. So, we want to find what values of qq ensure that

(d4+2d2)k(d4+2d22(dq)2)kε2d4k.(d^{4}+2d^{2})^{k}-(d^{4}+2d^{2}-2(d-q)^{2})^{k}\leq\varepsilon^{2}d^{4k}.

In fact, it will be mathematically convenient to demand slightly less accuracy, which will still give a meaningful lower bound:

(d4+2d2)k(d4+2d22(dq)2)kε2(d4+2d2)k.(d^{4}+2d^{2})^{k}-(d^{4}+2d^{2}-2(d-q)^{2})^{k}\leq\varepsilon^{2}(d^{4}+2d^{2})^{k}.

We can rearrange this inequality to produce

(dq)212(d4+2d2)(1(1ε2)1/k).(d-q)^{2}\leq{\textstyle\frac{1}{2}}(d^{4}+2d^{2})(1-(1-\varepsilon^{2})^{1/k}).

Note that for ε(0,12)\varepsilon\in(0,\frac{1}{2}) and k1k\geq 1, we have that (12ε2k)k=(12ε2k)k2ε22ε2e2ε21ε2(1-\frac{2\varepsilon^{2}}{k})^{k}=(1-\frac{2\varepsilon^{2}}{k})^{\frac{k}{2\varepsilon^{2}}\cdot 2\varepsilon^{2}}\leq e^{-2\varepsilon^{2}}\leq 1-\varepsilon^{2}, so that (1ε2)1/k12ε2k(1-\varepsilon^{2})^{1/k}\geq 1-\frac{2\varepsilon^{2}}{k}. So, by additionally bounding d4+2d22d4d^{4}+2d^{2}\leq 2d^{4} for d2d\geq 2, we can simplify the above bound to be

(dq)212(d4+2d2)(1(1ε2)1/k)2d4ε2k.(d-q)^{2}\leq{\textstyle\frac{1}{2}}(d^{4}+2d^{2})(1-(1-\varepsilon^{2})^{1/k})\leq{\textstyle\frac{2d^{4}\varepsilon^{2}}{k}}.

Solving this inequality for qq gives qd2εkd2q\geq d-\frac{\sqrt{2}\varepsilon}{\sqrt{k}}d^{2}. Choosing dd to maximize this bound, namely d\approx\frac{\sqrt{k}}{2\sqrt{2}\varepsilon}, gives q\geq\frac{\sqrt{k}}{4\sqrt{2}\varepsilon}=\Omega(\frac{\sqrt{k}}{\varepsilon}), completing the proof. ∎

7 Exactly Computing the Trace of 𝐀=i=1k𝐀i\mathbf{A}=\otimes_{i=1}^{k}\mathbf{A}_{i} in kd+1kd+1 Queries

In this section, we are promised that 𝐀=𝐀1𝐀k\mathbf{A}=\mathbf{A}_{1}\otimes\cdots\otimes\mathbf{A}_{k} for some unknown matrices 𝐀1,,𝐀kd×d\mathbf{A}_{1},\ldots,\mathbf{A}_{k}\in{\mathbb{R}}^{d\times d}. We show that by using exactly kd+1kd+1 Kronecker-matrix-vector products, we can recover a set of matrices 𝐁1,,𝐁kd×d\mathbf{B}_{1},\ldots,\mathbf{B}_{k}\in{\mathbb{R}}^{d\times d} such that i=1k𝐁i=𝐀\otimes_{i=1}^{k}\mathbf{B}_{i}=\mathbf{A}. Note that we cannot exactly recover the matrices 𝐀1,,𝐀k\mathbf{A}_{1},\ldots,\mathbf{A}_{k} because the Kronecker product does not admit a unique decomposition: we have α𝐀𝐁=𝐀α𝐁\alpha\mathbf{A}\otimes\mathbf{B}=\mathbf{A}\otimes\alpha\mathbf{B} for all scalars α\alpha and matrices 𝐀\mathbf{A}, 𝐁\mathbf{B}. However, we are able to recover the trace exactly: since tr(𝐀1𝐀2)=tr(𝐀1)tr(𝐀2)\operatorname{tr}(\mathbf{A}_{1}\otimes\mathbf{A}_{2})=\operatorname{tr}(\mathbf{A}_{1})\operatorname{tr}(\mathbf{A}_{2}), we can compute tr(𝐀)=i=1ktr(𝐁i)\operatorname{tr}(\mathbf{A})=\prod_{i=1}^{k}\operatorname{tr}(\mathbf{B}_{i}).

Algorithm 1 Exact Kronecker Recovery

input: Kronecker-matrix-vector oracle access to 𝐀dk×dk\mathbf{A}\in{\mathbb{R}}^{d^{k}\times d^{k}}.
output: Factor matrices 𝐁1,,𝐁kd×d\mathbf{B}_{1},\ldots,\mathbf{B}_{k}\in{\mathbb{R}}^{d\times d}.

1:  Sample 𝒩(𝟎,𝐈){\mathcal{N}}(\boldsymbol{\mathrm{0}},\mathbf{I}) vectors 𝐠1,,𝐠kd\boldsymbol{\mathrm{g}}_{1},\ldots,\boldsymbol{\mathrm{g}}_{k}\in{\mathbb{R}}^{d}.
2:  Let ¯𝐱=i=1k𝐠i\bar{}\boldsymbol{\mathrm{x}}=\otimes_{i=1}^{k}\boldsymbol{\mathrm{g}}_{i}.
3:  Compute γ:=¯𝐱𝐀¯𝐱\gamma\;{\vcentcolon=}\;\bar{}\boldsymbol{\mathrm{x}}^{\intercal}\mathbf{A}\bar{}\boldsymbol{\mathrm{x}}.
4:  If γ=0\gamma=0, then return 𝐁1,,𝐁k=𝟎\mathbf{B}_{1},\ldots,\mathbf{B}_{k}=\mathbf{0}
5:  For all $i\in[k],m\in[d]$, let $\boldsymbol{\mathrm{x}}_{i}^{(m)}=(\otimes_{j=1}^{i-1}\boldsymbol{\mathrm{g}}_{j})\otimes\boldsymbol{\mathrm{e}}_{m}\otimes(\otimes_{j=i+1}^{k}\boldsymbol{\mathrm{g}}_{j})$.
6:  For all i[k],m[d]i\in[k],m\in[d], compute 𝐀𝐱i(m)\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(m)}.
7:  For all i[k],m[d],n[d]i\in[k],m\in[d],n\in[d], store [𝐁i]m,n=γ(11k)𝐱i(m)𝐀𝐱i(n)[\mathbf{B}_{i}]_{m,n}=\gamma^{-(1-\frac{1}{k})}{\boldsymbol{\mathrm{x}}_{i}^{(m)}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(n)}.
8:  return 𝐁1,,𝐁k\mathbf{B}_{1},\ldots,\mathbf{B}_{k}.
Theorem 37.

Let 𝐀=𝐀1𝐀kdk×dk\mathbf{A}=\mathbf{A}_{1}\otimes\cdots\otimes\mathbf{A}_{k}\in{\mathbb{R}}^{d^{k}\times d^{k}}. Then, with probability 1, Algorithm 1 computes exactly kd+1kd+1 Kronecker-matrix-vector products with 𝐀\mathbf{A} and returns matrices 𝐁1,,𝐁k\mathbf{B}_{1},\ldots,\mathbf{B}_{k} such that 𝐀=𝐁1𝐁k\mathbf{A}=\mathbf{B}_{1}\otimes\cdots\otimes\mathbf{B}_{k}.

Proof.

At a high level, the quadratic form ${\boldsymbol{\mathrm{x}}_{i}^{(m)}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(n)}$ recovers the $(m,n)$ entry of $\mathbf{A}_{i}$ up to a known scaling factor. To see why, notice that

𝐱i(m)𝐀𝐱i(n)=(j=1i1𝐠j𝐀j𝐠j)𝐞m𝐀i𝐞n(j=i+1k𝐠j𝐀j𝐠j).{\boldsymbol{\mathrm{x}}_{i}^{(m)}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(n)}=(\otimes_{j=1}^{i-1}\boldsymbol{\mathrm{g}}_{j}^{\intercal}\mathbf{A}_{j}\boldsymbol{\mathrm{g}}_{j})\otimes\boldsymbol{\mathrm{e}}_{m}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{e}}_{n}\otimes(\otimes_{j=i+1}^{k}\boldsymbol{\mathrm{g}}_{j}^{\intercal}\mathbf{A}_{j}\boldsymbol{\mathrm{g}}_{j}).

If we let γi:=𝐠i𝐀i𝐠i\gamma_{i}\;{\vcentcolon=}\;\boldsymbol{\mathrm{g}}_{i}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{g}}_{i} denote the various components of γ=i=1k𝐠i𝐀i𝐠i=i=1kγi\gamma=\otimes_{i=1}^{k}\boldsymbol{\mathrm{g}}_{i}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{g}}_{i}=\prod_{i=1}^{k}\gamma_{i}, then we can write the above expression as

${\boldsymbol{\mathrm{x}}_{i}^{(m)}}^{\intercal}\mathbf{A}\boldsymbol{\mathrm{x}}_{i}^{(n)}=\big({\textstyle\prod_{j\neq i}\gamma_{j}}\big)\,\boldsymbol{\mathrm{e}}_{m}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{e}}_{n}={\textstyle\frac{\gamma}{\gamma_{i}}}[\mathbf{A}_{i}]_{m,n}.$

So, combining this with line 7 of Algorithm 1, we get that the matrix $\mathbf{B}_{i}$ has $[\mathbf{B}_{i}]_{m,n}=\gamma^{-(1-\frac{1}{k})}\cdot\frac{\gamma}{\gamma_{i}}[\mathbf{A}_{i}]_{m,n}=\frac{\gamma^{1/k}}{\gamma_{i}}[\mathbf{A}_{i}]_{m,n}$. We therefore conclude that

i=1k𝐁i=i=1k(γ1kγi𝐀i)=(i=1kγ1kγi)𝐀=γγ𝐀=𝐀.\otimes_{i=1}^{k}\mathbf{B}_{i}=\otimes_{i=1}^{k}\big{(}{\textstyle\frac{\gamma^{\frac{1}{k}}}{\gamma_{i}}}\mathbf{A}_{i}\big{)}=\big{(}{\textstyle\prod_{i=1}^{k}}{\textstyle\frac{\gamma^{\frac{1}{k}}}{\gamma_{i}}}\big{)}\mathbf{A}={\textstyle\frac{\gamma}{\gamma}}\mathbf{A}=\mathbf{A}.

This analysis implicitly assumes that $\gamma_{i}=\boldsymbol{\mathrm{g}}_{i}^{\intercal}\mathbf{A}_{i}\boldsymbol{\mathrm{g}}_{i}\neq 0$ for all $i\in[k]$. Since the $\boldsymbol{\mathrm{g}}_{i}$ are random Gaussian vectors, we know that $\boldsymbol{\mathrm{g}}_{i}$ does not lie in the nullspace of $\mathbf{A}_{i}$ with probability 1, so long as $\mathbf{A}_{i}$ is not the all-zeros matrix. So, Algorithm 1 is correct so long as no factor $\mathbf{A}_{i}$ is the all-zeros matrix. If some $\mathbf{A}_{i}$ is the all-zeros matrix, then $\mathbf{A}$ is also the all-zeros matrix, so $\gamma=0$, and the algorithm returns $\mathbf{B}_{1},\ldots,\mathbf{B}_{k}$ all equal to the all-zeros matrix, which is correct.

To count the number of Kronecker-matrix-vector products computed, note that Algorithm 1 only computes such products on lines 3 and 6. Line 3 computes exactly 1 product, and line 6 computes exactly kdkd products. ∎
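To make the recovery procedure concrete, here is a minimal NumPy sketch in the spirit of Algorithm 1 (the code and the name exact_kronecker_recovery are ours, not the paper's). For illustration it takes the Kronecker-matrix-vector oracle as a Python callable matvec. To avoid taking the fractional power $\gamma^{-(1-\frac{1}{k})}$ of a possibly negative $\gamma$, the sketch folds the full normalization $\gamma^{-(k-1)}$ into the first factor rather than spreading it evenly across factors as in line 7; the returned matrices still satisfy $\mathbf{B}_{1}\otimes\cdots\otimes\mathbf{B}_{k}=\mathbf{A}$, and exactly $kd+1$ oracle calls are made.

from functools import reduce
import numpy as np

def kron_all(pieces):
    """Kronecker product of a list of vectors (or matrices)."""
    return reduce(np.kron, pieces)

def exact_kronecker_recovery(matvec, d, k, seed=0):
    """Recover B_1, ..., B_k with B_1 kron ... kron B_k = A from kd + 1
    Kronecker-matrix-vector products. `matvec(x)` must return A @ x for a
    Kronecker-structured x of length d**k."""
    rng = np.random.default_rng(seed)
    g = [rng.standard_normal(d) for _ in range(k)]                  # step 1
    x_bar = kron_all(g)                                             # step 2
    gamma = x_bar @ matvec(x_bar)                                   # step 3 (1 product)
    if gamma == 0:                                                  # step 4
        return [np.zeros((d, d)) for _ in range(k)]
    I = np.eye(d)
    B = []
    for i in range(k):                                              # steps 5-7 (kd products)
        # Columns n = 0..d-1: A applied to g_1 kron ... kron e_n kron ... kron g_k.
        AX = np.column_stack(
            [matvec(kron_all(g[:i] + [I[:, n]] + g[i + 1:])) for n in range(d)])
        # Rows m: multiply on the left by (g_1 kron ... kron e_m kron ... kron g_k)^T.
        X = np.column_stack(
            [kron_all(g[:i] + [I[:, m]] + g[i + 1:]) for m in range(d)])
        B.append(X.T @ AX)               # equals (gamma / gamma_i) * A_i
    B[0] = B[0] / gamma ** (k - 1)       # normalize so the Kronecker product equals A
    return B

# Usage: recover the factors of a Kronecker-structured A and read off its trace.
d, k = 3, 3
factors = [np.random.default_rng(i).standard_normal((d, d)) for i in range(k)]
A = kron_all(factors)
B = exact_kronecker_recovery(lambda x: A @ x, d, k)
assert np.allclose(kron_all(B), A)
print("tr(A) =", np.prod([np.trace(Bi) for Bi in B]))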

8 Conclusion

In this paper, we produced new and tight bounds on the rate of convergence of the Kronecker-Hutchinson Estimator. We showed that the estimator generally suffers an exponential dependence on $k$, and we derived an exact expression for its variance when using a Kronecker product of Gaussian vectors.

There are several natural future directions for this research. It is worth exploring the variance of more classes of matrices under Theorem 4. In particular, is there a natural property of matrices which exactly characterizes the input matrices for which the Kronecker-Hutchinson Estimator has a polynomial rate of convergence? If so, this would be a significant step towards a sub-exponential runtime for trace estimation in the Kronecker-matrix-vector oracle.

It is also worth looking at algorithms in the Kronecker-matrix-vector oracle model beyond the Kronecker-Hutchinson Estimator. For instance, when $\mathbf{A}$ has a quickly decaying spectrum, could a low-rank approximation method like the ones used in the non-Kronecker case provide good trace estimation guarantees [Li and Zhu, 2021, Saibaba et al., 2017]?

There is also potential for newer and stronger lower bounds. We provide a lower bound against possibly adaptive algorithms, which is very general, but perhaps too general to describe most of the algorithms practitioners currently use. Is there a lower bound against all sketching-based methods that is exponential in $k$? Or, even more narrowly, is there a lower bound against all quadratic-form-based algorithms, in the vein of [Wimmer et al., 2014]?

Lastly, there is also plenty of potential to study more problems in the Kronecker-matrix-vector oracle model. Is it possible to accurately and efficiently compute the top eigenvalue or eigenvector of a matrix from Kronecker-matrix-vector products? How about solving a linear system? These problems all need to be explored in great detail.

Acknowledgements

Raphael Meyer was partially supported by a GAANN fellowship from the US Department of Education. Haim Avron was partially supported by the US-Israel Binational Science Foundation (Grant no. 2017698). We thank Moshe Goldstein and Noa Feldman for discussions on quantum entanglement estimation, which inspired this work. We thank Tyler Chen for the design of Algorithm 1. We thank Christopher Musco for pointing us to relevant results in [Ahle et al., 2020].

References

  • [Aaronson, 2018] Aaronson, S. (2018). Introduction to quantum information science lecture notes. https://www.scottaaronson.com/qclec.pdf.
  • [Ahle et al., 2020] Ahle, T. D., Kapralov, M., Knudsen, J. B., Pagh, R., Velingker, A., Woodruff, D. P., and Zandieh, A. (2020). Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM.
  • [Avron et al., 2014] Avron, H., Nguyen, H., and Woodruff, D. (2014). Subspace embeddings for the polynomial kernel. Advances in Neural Information Processing Systems, 27.
  • [Avron and Toledo, 2011] Avron, H. and Toledo, S. (2011). Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM, 58(2).
  • [Bakshi et al., 2022] Bakshi, A., Clarkson, K. L., and Woodruff, D. P. (2022). Low-rank approximation with $1/\varepsilon^{1/3}$ matrix-vector products. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1130–1143.
  • [Bernstein, 2011] Bernstein, D. S. (2011). Matrix Mathematics: Theory, Facts, and Formulas. Princeton University Press, second edition.
  • [Braverman et al., 2020] Braverman, M., Hazan, E., Simchowitz, M., and Woodworth, B. (2020). The gradient complexity of linear regression. In Conference on Learning Theory, pages 627–647. PMLR.
  • [Bujanovic and Kressner, 2021] Bujanovic, Z. and Kressner, D. (2021). Norm and trace estimation with random rank-one vectors. SIAM Journal on Matrix Analysis and Applications, 42(1):202–223.
  • [Chen and Hallman, 2022] Chen, T. and Hallman, E. (2022). Krylov-aware stochastic trace estimation. arXiv preprint arXiv:2205.01736.
  • [Cortinovis and Kressner, 2021] Cortinovis, A. and Kressner, D. (2021). On randomized trace estimates for indefinite matrices with an application to determinants. Foundations of Computational Mathematics, pages 1–29.
  • [Feldman et al., 2022] Feldman, N., Kshetrimayum, A., Eisert, J., and Goldstein, M. (2022). Entanglement estimation in tensor network states via sampling. PRX Quantum, 3(3):030312.
  • [Girard, 1987] Girard, D. (1987). Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille: applications à la restauration d'image. Informatique et Mathématiques Appliquées de Grenoble.
  • [Halikias and Townsend, 2022] Halikias, D. and Townsend, A. (2022). Matrix recovery from matrix-vector products. arXiv preprint arXiv:2212.09841.
  • [Huang et al., 2020] Huang, H.-Y., Kueng, R., and Preskill, J. (2020). Predicting many properties of a quantum system from very few measurements. Nature Physics, 16(10):1050–1057.
  • [Hutchinson, 1989] Hutchinson, M. F. (1989). A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076.
  • [Jiang et al., 2021] Jiang, S., Pham, H., Woodruff, D., and Zhang, R. (2021). Optimal sketching for trace estimation. Advances in Neural Information Processing Systems, 34:23741–23753.
  • [Kueng et al., 2017] Kueng, R., Rauhut, H., and Terstiege, U. (2017). Low rank matrix recovery from rank one measurements. Applied and Computational Harmonic Analysis, 42(1):88–116.
  • [Li and Zhu, 2021] Li, H. and Zhu, Y. (2021). Randomized block krylov subspace methods for trace and log-determinant estimators. BIT Numerical Mathematics, 61:911–939.
  • [Meyer, 2021] Meyer, R. A. (2021). Updates for Hutch++: Hutch++ for undergrads. https://ram900.hosting.nyu.edu/hutchplusplus/#hutch_for_undergrads.
  • [Meyer et al., 2021] Meyer, R. A., Musco, C., Musco, C., and Woodruff, D. P. (2021). Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM.
  • [Persson et al., 2022] Persson, D., Cortinovis, A., and Kressner, D. (2022). Improved variants of the Hutch++ algorithm for trace estimation. SIAM Journal on Matrix Analysis and Applications, 43(3):1162–1185.
  • [Roosta-Khorasani and Ascher, 2015] Roosta-Khorasani, F. and Ascher, U. (2015). Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212.
  • [Saibaba et al., 2017] Saibaba, A. K., Alexanderian, A., and Ipsen, I. C. (2017). Randomized matrix-free trace and log-determinant estimators. Numerische Mathematik, 137(2):353–395.
  • [Sun et al., 2021a] Sun, X., Woodruff, D. P., Yang, G., and Zhang, J. (2021a). Querying a matrix through matrix-vector products. ACM Transactions on Algorithms (TALG), 17(4):1–19.
  • [Sun et al., 2021b] Sun, Y., Guo, Y., Tropp, J. A., and Udell, M. (2021b). Tensor random projection for low memory dimension reduction. arXiv preprint arXiv:2105.00105.
  • [Tropp and Webber, 2023] Tropp, J. A. and Webber, R. J. (2023). Randomized algorithms for low-rank matrix approximation: Design, analysis, and applications. arXiv preprint arXiv:2306.12418.
  • [Vershynin, 2020] Vershynin, R. (2020). Concentration inequalities for random tensors. Bernoulli.
  • [Wang et al., 2023] Wang, S., McArdle, S., and Berta, M. (2023). Qubit-efficient randomized quantum algorithms for linear algebra. arXiv preprint arXiv:2302.01873.
  • [Wimmer et al., 2014] Wimmer, K., Wu, Y., and Zhang, P. (2014). Optimal query complexity for estimating the trace of a matrix. In Automata, Languages, and Programming: 41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8-11, 2014, Proceedings, Part I 41, pages 1051–1062. Springer.