
Extremal bounds for Gaussian trace estimation

Eric Hallman
Abstract: This work derives extremal tail bounds for the Gaussian trace estimator applied to a real symmetric matrix. We define a partial ordering on the eigenvalues so that when a matrix has a greater spectrum under this ordering, its estimator has worse tail bounds. This is done for two families of matrices: positive semidefinite matrices with bounded effective rank, and indefinite matrices with bounded 2-norm and fixed Frobenius norm. In each case, the tail region is defined rigorously and is constant for a given family.

1 Introduction

Let ${\bf A}\in\mathbb{R}^{n\times n}$ be a symmetric matrix with eigenvalues $\lambda_{1}\geq\ldots\geq\lambda_{n}$, and let ${\bf z}_{1},\ldots,{\bf z}_{m}\in\mathbb{R}^{n}$ be i.i.d. standard Gaussian vectors. The Gaussian trace estimator is given by

$$\operatorname{tr}({\bf A})\approx\operatorname{tr}_{m}^{G}({\bf A}):=\frac{1}{m}\sum_{j=1}^{m}{\bf z}_{j}^{T}{\bf A}{\bf z}_{j}. \tag{1}$$

It is well known that this estimator is unbiased and has variance $\tfrac{2}{m}\|{\bf A}\|_{F}^{2}$.
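As a point of reference, here is a minimal NumPy sketch of the estimator (1). The function name and demo matrix are my own illustration, not from the paper.

```python
import numpy as np

def trace_estimate_gaussian(A, m, seed=None):
    """Gaussian trace estimator (1): average of m quadratic forms z^T A z."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Z = rng.standard_normal((n, m))               # columns are i.i.d. Gaussian vectors
    return np.einsum("im,ij,jm->", Z, A, Z) / m   # (1/m) * sum_j z_j^T A z_j

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2                                 # make A symmetric
print(np.trace(A), trace_estimate_gaussian(A, m=500, seed=1))
```

Over repeated runs, the sample variance of this estimate should be close to $\tfrac{2}{m}\|{\bf A}\|_{F}^{2}$.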

Often, it is useful to know how many samples $m$ are needed to produce an estimate that satisfies a given error tolerance. Cortinovis and Kressner [1] have used concentration inequalities to derive good results on this problem, with slight improvements by Persson, Cortinovis, and Kressner in [6].

Theorem 1 ([1], Thm. 1).

Let ${\bf A}\in\mathbb{R}^{n\times n}$ be symmetric. Then for all $\varepsilon>0$,

$$\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A})-\operatorname{tr}({\bf A})|\geq\varepsilon\right)\leq 2\exp\left(-\frac{m\varepsilon^{2}}{4\|{\bf A}\|_{F}^{2}+4\varepsilon\|{\bf A}\|_{2}}\right).$$
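To make the bound actionable, one can invert it: the tail probability is at most $\delta$ once $m\geq(4\|{\bf A}\|_{F}^{2}+4\varepsilon\|{\bf A}\|_{2})\ln(2/\delta)/\varepsilon^{2}$. Here is a small sketch of that calculation (the helper name is mine, not from [1]):

```python
import math

def samples_needed(frob_norm, two_norm, eps, delta):
    """Smallest m making the Theorem 1 tail bound at most delta."""
    return math.ceil((4 * frob_norm**2 + 4 * eps * two_norm)
                     * math.log(2 / delta) / eps**2)

print(samples_needed(frob_norm=10.0, two_norm=1.0, eps=5.0, delta=1e-3))
```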

For nonzero symmetric positive semidefinite (SPSD) matrices, this result can be turned into a relative error estimate. The trace estimator will generally be more accurate on matrices with tightly clustered eigenvalues.

Definition 1.

The effective rank of a nonzero SPSD matrix ${\bf A}$ is

$$r_{\mathrm{eff}}({\bf A}):=\frac{\operatorname{tr}({\bf A})}{\|{\bf A}\|_{2}}.$$
Corollary 1 ([1], Remark 2).

For nonzero SPSD ${\bf A}\in\mathbb{R}^{n\times n}$, replace $\varepsilon$ by $\varepsilon\cdot\operatorname{tr}({\bf A})$ in Theorem 1, and note that $\|{\bf A}\|_{F}^{2}/\operatorname{tr}({\bf A})^{2}\leq r_{\mathrm{eff}}({\bf A})^{-1}$. For $\varepsilon>0$,

$$\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A})-\operatorname{tr}({\bf A})|\geq\varepsilon\cdot\operatorname{tr}({\bf A})\right)\leq 2\exp\left(-\frac{m\varepsilon^{2}\cdot r_{\mathrm{eff}}({\bf A})}{4(1+\varepsilon)}\right).$$
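Solving this bound for $m$ as before, a failure probability of at most $\delta$ is guaranteed once

$$m\geq\frac{4(1+\varepsilon)\ln(2/\delta)}{\varepsilon^{2}\,r_{\mathrm{eff}}({\bf A})},$$

so the required sample count decreases linearly in the effective rank.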

The goal of this paper is to tighten these bounds as much as possible using techniques from [10, 9, 8]. The practical benefit may be marginal, since the above bounds are already quite good. Still, I found the techniques interesting.

1.1 Extremal bounds

Words like tight and optimal can be slippery. To be precise, I mean to find extremal tail probabilities: ones that can be expressed as the supremum or infimum over some set. The error bounds of Theorem 1 and Corollary 1 use only certain information about ${\bf A}$; respectively, they use $(\|{\bf A}\|_{2},\|{\bf A}\|_{F})$ and $r_{\mathrm{eff}}({\bf A})$. It is therefore worth considering the sets of all matrices with the same summary statistics:

$$\begin{aligned}\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)&:=\{{\bf A}\in\mathcal{S}_{\infty}\,:\,\|{\bf A}\|_{2}\leq\lambda,\ \|{\bf A}\|_{F}=\phi\},&&0<\lambda\leq\phi,\\ \mathcal{A}_{\mathrm{rel}}(\mu)&:=\{{\bf A}\in\mathcal{S}_{\infty}^{+}\,:\,\|{\bf A}\|_{2}\leq\tfrac{1}{\mu},\ \operatorname{tr}({\bf A})=1\},&&1\leq\mu,\end{aligned}$$

where $\mathcal{S}_{\infty}$ is the set of all symmetric matrices of any dimension, and $\mathcal{S}_{\infty}^{+}$ is the set of all SPSD matrices of any dimension. For $\mathcal{A}_{\mathrm{rel}}(\mu)$, the parameter $\mu$ is a lower bound on the effective rank of ${\bf A}$; we can fix $\operatorname{tr}({\bf A})$ without loss of generality since the relative error of $\operatorname{tr}_{m}^{G}({\bf A})$ is scale invariant.

For each of the above sets $\mathcal{A}$, we define a partial ordering:

  • For $\mathcal{A}_{\mathrm{rel}}(\mu)$, the partial ordering is the vector majorization order on the eigenvalues. This paper will show that $\boldsymbol{\lambda}_{{\bf A}}\preceq\boldsymbol{\lambda}_{{\bf B}}$ implies that $\operatorname{tr}_{m}^{G}({\bf B})$ has worse upper and lower tail bounds than $\operatorname{tr}_{m}^{G}({\bf A})$.

  • For $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$, the partial ordering is related to the majorization order but is a little more complicated. This paper will show that $\boldsymbol{\lambda}_{{\bf A}}\preceq\boldsymbol{\lambda}_{{\bf B}}$ implies that $\operatorname{tr}_{m}^{G}({\bf B})$ has worse upper tail bounds and that $\operatorname{tr}_{m}^{G}({\bf A})$ has worse lower tail bounds.

With these partial orderings, $\mathcal{A}_{\mathrm{rel}}(\mu)$ has a maximum element and $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$ has both maximum and minimum elements. (Strictly speaking, the maximum is unique only if we treat matrices as equivalent when they have the same nonzero eigenvalues.)

1.2 The tail

It is not particularly useful to show that if $\boldsymbol{\lambda}_{{\bf A}}\preceq\boldsymbol{\lambda}_{{\bf B}}$ then one trace estimate eventually has worse tail bounds than the other—this much can already be deduced just by comparing the asymptotic behavior of the probability distributions.

The critical result, then, is that we can define the “tail” regions so that their location depends only on $\mathcal{A}$. For error tolerances within these regions, the partial ordering works as advertised on all elements of $\mathcal{A}$.

Unfortunately, it has so far been beyond my ability to prove exactly where these tail regions begin. This paper contains some pessimistic bounds, but otherwise is restricted to conjecture.

1.3 The extremal matrices

For the relative error, the matrix with the worst tail bounds is

$${\bf A}_{\mathrm{rel}}(\mu)=\frac{1}{\mu}\begin{bmatrix}{\bf I}_{\lfloor\mu\rfloor}&{\bf 0}&{\bf 0}\\ {\bf 0}&\mu-\lfloor\mu\rfloor&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}\end{bmatrix}. \tag{2}$$

For the absolute error, the matrix with the worst upper tail bounds is

$${\bf A}_{\mathrm{abs}}(\lambda,\phi)=\lambda\begin{bmatrix}{\bf I}_{\lfloor\rho\rfloor}&{\bf 0}&{\bf 0}\\ {\bf 0}&\sqrt{\rho-\lfloor\rho\rfloor}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}\end{bmatrix},\quad\rho:=\frac{\phi^{2}}{\lambda^{2}}, \tag{3}$$

where $\rho$ is an upper bound on the stable rank of ${\bf A}$. By symmetry, the matrix with the worst lower tail bounds is $-{\bf A}_{\mathrm{abs}}(\lambda,\phi)$.
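For concreteness, here is a short sketch (my own code, not from the paper) that builds the extremal matrices (2) and (3) as diagonal matrices:

```python
import numpy as np

def A_rel(mu):
    """Worst-case SPSD matrix (2): trace 1, 2-norm 1/mu, effective rank mu."""
    k = int(np.floor(mu))
    diag = [1.0] * k + ([mu - k] if mu > k else [])
    return np.diag(diag) / mu

def A_abs(lam, phi):
    """Worst-case matrix (3) for the upper tail: ||A||_2 = lam, ||A||_F = phi."""
    rho = phi**2 / lam**2                  # stable rank
    k = int(np.floor(rho))
    diag = [1.0] * k + ([np.sqrt(rho - k)] if rho > k else [])
    return lam * np.diag(diag)
```

For instance, A_rel(2.5) is diag(0.4, 0.4, 0.2), which has trace 1, 2-norm 0.4, and effective rank 2.5.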

Definition 2.

The stable rank of a nonzero matrix ${\bf A}$ is

$$r_{\mathrm{stab}}({\bf A}):=\frac{\|{\bf A}\|_{F}^{2}}{\|{\bf A}\|_{2}^{2}}.$$
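A sketch of both summary statistics (the helper names are mine); note that for SPSD ${\bf A}$ we always have $r_{\mathrm{stab}}({\bf A})\leq r_{\mathrm{eff}}({\bf A})$, since $\|{\bf A}\|_{F}^{2}\leq\|{\bf A}\|_{2}\operatorname{tr}({\bf A})$:

```python
import numpy as np

def effective_rank(A):
    """tr(A) / ||A||_2 (Definition 1; A assumed nonzero SPSD)."""
    return np.trace(A) / np.linalg.norm(A, 2)

def stable_rank(A):
    """||A||_F^2 / ||A||_2^2 (Definition 2; A assumed nonzero)."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2
```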

1.3.1 Extension to Gamma random variables

When $\mu$ and $\rho$ are not integers, the distributions of the trace estimators for (2) and (3) are not quite as elegant as I would have liked them to be. We can, however, relax the problem by considering more general linear combinations of Gamma random variables, as opposed to only those whose distribution can be expressed as the Gaussian trace estimate of some matrix as in (1).

For distribution families $\mathcal{Q}_{\mathrm{rel}}$ and $\mathcal{Q}_{\mathrm{abs}}$ to be defined later, we will show that

$$\max\,\mathcal{Q}_{\mathrm{rel}}\sim \mathrm{Gamma}\left(\frac{m\mu}{2},\frac{m\mu}{2}\right) \tag{4}$$

and

$$\operatorname*{extrema}\,\mathcal{Q}_{\mathrm{abs}}=\pm\lambda Q,\quad Q\sim \mathrm{Gamma}\left(\frac{m\rho}{2},\frac{m}{2}\right),\ \rho:=\frac{\phi^{2}}{\lambda^{2}}. \tag{5}$$

When $\mu$ and $\rho$ are integers, these are exactly the distributions of the trace estimators $\operatorname{tr}_{m}^{G}({\bf A}_{\mathrm{rel}})$ and $\operatorname{tr}_{m}^{G}({\bf A}_{\mathrm{abs}})$ from (2) and (3).

1.4 Related work

This paper is primarily a spiritual successor to the works [10, 9, 8]. Székely [10] gives thorough tail bounds for the relative error when $Q$ is a nonnegative sum of chi-squared r.v.'s with no further restrictions (the worst case is typically $Q\sim\chi_{1}^{2}$) and provides a majorization result as a corollary. Roosta-Khorasani, Székely, and Ascher [9] extend the main result to the case where $Q$ is a nonnegative sum of Gamma r.v.'s with arbitrary shape and scale parameters. This allows them to consider the effects of using a larger sampling number $m$ for trace estimation. Finally, Roosta-Khorasani and Székely [8] generalize the worst-case bounds of [9] by obtaining a majorization result for nonnegative sums of Gamma r.v.'s. The present work's Theorem 2, in particular, is more or less a restatement of a result from [8].

The present paper is novel in two main ways. First, it considers how the above bounds might be strengthened when the matrix in question has a large effective rank. Second, it obtains absolute error bounds for indefinite matrices, that is, when the sum of Gamma random variables is not necessarily nonnegative.

2 Majorization

In [8], Roosta-Khorasani and Székely use the majorization order on the eigenvalues of a matrix as a measure of the “skewness” of its spectrum. Their general observation is that the Gaussian trace estimator performs worse when the spectrum is highly skewed. In other words: the majorization order is the partial ordering that they (and we) use to determine for which matrices the Gaussian trace estimator yields the worst relative error bounds.

Given a nonnegative vector $\boldsymbol{\lambda}\in\mathbb{R}^{n}$, let $\lambda_{[i]}$ denote its entries in decreasing order, so that $\lambda_{[1]}\geq\ldots\geq\lambda_{[n]}\geq 0$. We say that $\boldsymbol{\lambda}$ majorizes $\boldsymbol{\mu}$, denoted $\boldsymbol{\mu}\preceq\boldsymbol{\lambda}$, if

$$\sum_{i=1}^{k}\mu_{[i]}\leq\sum_{i=1}^{k}\lambda_{[i]},\quad\forall k\leq n, \tag{6a}$$
$$\sum_{i=1}^{n}\mu_{i}=\sum_{i=1}^{n}\lambda_{i}. \tag{6b}$$

Similarly, we say that $\boldsymbol{\lambda}$ weakly majorizes $\boldsymbol{\mu}$, denoted $\boldsymbol{\mu}\preceq_{w}\boldsymbol{\lambda}$, just as long as (6a) holds. If $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$ do not have the same length, pad the shorter one with zeroes as necessary.
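A small sketch (my own code) of these tests for nonnegative vectors, padding and sorting as described above:

```python
import numpy as np

def weakly_majorizes(lam, mu, tol=1e-12):
    """True if mu <=_w lam, i.e. the partial sums satisfy (6a)."""
    n = max(len(lam), len(mu))
    l = np.sort(np.pad(np.asarray(lam, float), (0, n - len(lam))))[::-1]
    m = np.sort(np.pad(np.asarray(mu, float), (0, n - len(mu))))[::-1]
    return bool(np.all(np.cumsum(m) <= np.cumsum(l) + tol))

def majorizes(lam, mu, tol=1e-12):
    """True if mu <= lam, i.e. both (6a) and (6b) hold."""
    return weakly_majorizes(lam, mu, tol) and abs(sum(lam) - sum(mu)) < tol

print(majorizes([1.0, 0.0], [0.5, 0.5]))   # True: (0.5, 0.5) <= (1, 0)
```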

If $\boldsymbol{\lambda}$ weakly majorizes $\boldsymbol{\mu}$ but does not majorize it, then $\sum_{i=1}^{n}\mu_{i}<\sum_{i=1}^{n}\lambda_{i}$. For our proofs, it will be useful to identify the indices for which the inequalities in (6a) are strict.

Definition 3.

Given $\boldsymbol{\mu}\preceq_{w}\boldsymbol{\lambda}$, the leading slack index is the smallest number $j$ such that

$$\sum_{i=1}^{\ell}\mu_{[i]}<\sum_{i=1}^{\ell}\lambda_{[i]},\quad\forall\ell\in\{j,\ldots,n\}.$$

This index has the property that $\mu_{[j]}$ (and no larger entry of $\boldsymbol{\mu}$) can be increased by a nonzero amount without violating the majorization condition.

In deriving their main results, Roosta-Khorasani and Székely make use of the following lemma [7, 12.5a].

Lemma 1.

If $\boldsymbol{\mu}\preceq\boldsymbol{\lambda}$, then there exists a finite sequence of vectors

$$\boldsymbol{\mu}\preceq\boldsymbol{\eta}_{1}\preceq\ldots\preceq\boldsymbol{\eta}_{r}=\boldsymbol{\lambda}$$

such that $\boldsymbol{\eta}_{i}$ and $\boldsymbol{\eta}_{i+1}$ differ in two coordinates only, for $i=1,\ldots,r-1$.

This lemma lets us assume without loss of generality that $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ differ in two coordinates only, satisfying

$$0\leq\lambda_{k}<\mu_{k}\leq\mu_{j}<\lambda_{j}.$$

2.1 Indefinite majorization

For indefinite matrices with fixed Frobenius norm as in $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$, the majorization condition is somewhat more complicated. Here is why: a Gamma random variable is nonnegative, so its lower tail can't get too far from the mean. The average of many Gamma random variables, however, looks more like a normal distribution, which has tails in both directions. This suggests that $\operatorname{tr}_{m}^{G}({\bf A})$ will have worse absolute upper tail bounds when the nonnegative eigenvalues are highly skewed and when the negative eigenvalues are tightly clustered.

We will also see that $\operatorname{tr}_{m}^{G}(|{\bf A}|)$ has worse upper tail bounds than $\operatorname{tr}_{m}^{G}({\bf A})$ for indefinite ${\bf A}$. Analogous results for lower tail bounds follow by symmetry.

Now we describe the majorization condition more precisely.

Definition 4.

For a vector $\boldsymbol{\lambda}$, define

$$\boldsymbol{\lambda}^{-}:=\min(\boldsymbol{\lambda},\boldsymbol{0})\quad\text{and}\quad\boldsymbol{\lambda}^{+}:=\max(\boldsymbol{\lambda},\boldsymbol{0}),$$

where the min and max operations are done elementwise.

Definition 5.

For vectors $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$, we say that $\boldsymbol{\mu}\preceq_{F}\boldsymbol{\lambda}$ if

$$(\boldsymbol{\mu}^{+})^{2}\preceq_{w}(\boldsymbol{\lambda}^{+})^{2}, \tag{7a}$$
$$(\boldsymbol{\lambda}^{-})^{2}\preceq_{w}(\boldsymbol{\mu}^{-})^{2}, \tag{7b}$$
$$\sum_{i=1}^{n}\mu_{i}^{2}=\sum_{i=1}^{n}\lambda_{i}^{2}. \tag{7c}$$

Condition (7a) specifies that the positive entries of $\boldsymbol{\lambda}$ are more skewed, and perhaps have more total weight, than those of $\boldsymbol{\mu}$. Condition (7b) specifies the opposite for the negative entries. Condition (7c) means that the matrices associated with $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$ have the same Frobenius norm. This last condition also means that the associated trace estimators have the same variance.
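Reusing weakly_majorizes from the earlier sketch, the order $\preceq_{F}$ can be checked directly from (7a)-(7c) (again my own code, not from the paper):

```python
import numpy as np

def majorizes_F(lam, mu, tol=1e-12):
    """True if mu <=_F lam in the sense of Definition 5."""
    lam, mu = np.asarray(lam, float), np.asarray(mu, float)
    lp, mp = np.maximum(lam, 0), np.maximum(mu, 0)          # positive parts
    ln, mn = np.minimum(lam, 0), np.minimum(mu, 0)          # negative parts
    return (weakly_majorizes(lp**2, mp**2, tol)             # (7a)
            and weakly_majorizes(mn**2, ln**2, tol)         # (7b)
            and abs(np.sum(lam**2) - np.sum(mu**2)) < tol)  # (7c)
```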

For the main result, we propose a lemma analogous to Lemma 1 (see Appendix A for the proof).

Lemma 2.

If $\boldsymbol{\mu}\preceq_{F}\boldsymbol{\lambda}$, then there exists a finite sequence of vectors

$$\boldsymbol{\mu}\preceq_{F}\boldsymbol{\eta}_{1}\preceq_{F}\ldots\preceq_{F}\boldsymbol{\eta}_{r}=\boldsymbol{\lambda}$$

such that $\boldsymbol{\eta}_{i}$ and $\boldsymbol{\eta}_{i+1}$ differ in two coordinates only, for $i=1,\ldots,r-1$. Furthermore, the difference between consecutive vectors may be assumed to take one of the following forms:

  1. $\mu_{k}<\lambda_{k}\leq 0\leq\mu_{j}<\lambda_{j}$;

  2. $0\leq\lambda_{k}<\mu_{k}\leq\mu_{j}<\lambda_{j}$;

  3. $\mu_{k}<\lambda_{k}\leq\lambda_{j}<\mu_{j}\leq 0$.

Once again, we may assume without loss of generality that $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ differ in two coordinates only, this time satisfying $\lambda_{j}^{2}-\mu_{j}^{2}=\mu_{k}^{2}-\lambda_{k}^{2}>0$.

3 Gamma random variables

In this section we cover some relevant properties of Gamma random variables, and explain how to reformulate the original trace estimation problem in terms of such variables.

Since real symmetric matrices are orthogonally diagonalizable and Gaussian vectors are rotationally invariant, the Gaussian trace estimator (1) can be written in the form

$$\operatorname{tr}_{m}^{G}({\bf A})=\sum_{i=1}^{n}\lambda_{i}X_{i},\quad X_{i}\overset{\text{i.i.d.}}{\sim}\tfrac{1}{m}\chi_{m}^{2}, \tag{8}$$

where $\chi_{m}^{2}$ is a chi-squared variable with $m$ degrees of freedom.
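The identity (8) is easy to spot-check by simulation; the snippet below (my own sketch) draws the estimator both ways for a toy diagonal matrix and compares moments.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([3.0, 1.0, -0.5])          # eigenvalues of a toy symmetric A
m, trials = 5, 200_000

# Direct form (1) with A = diag(lam)
Z = rng.standard_normal((trials, m, lam.size))
direct = (Z**2 * lam).sum(axis=2).mean(axis=1)

# Equivalent form (8): weighted scaled chi-squared variables
X = rng.chisquare(m, size=(trials, lam.size)) / m
weighted = X @ lam

print(direct.mean(), weighted.mean())     # both near tr(A) = 3.5
print(direct.var(), weighted.var())       # both near (2/m)||A||_F^2 = 4.1
```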

The chi-squared distribution is a special case of the Gamma distribution, and so we can generalize (8) as follows:

Definition 6.

Given a vector $\boldsymbol{\lambda}=(\lambda_{1},\ldots,\lambda_{n})\in\mathbb{R}^{n}$, denote

$$Q(\boldsymbol{\lambda};\alpha,\beta):=\sum_{i=1}^{n}\lambda_{i}X_{i},\quad X_{i}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(\alpha,\beta).$$

When $\alpha$ and $\beta$ are clear from context, we will use $Q_{\boldsymbol{\lambda}}$ for short.

With the notation above, $\operatorname{tr}_{m}^{G}({\bf A})$ has the same distribution as $Q\left(\boldsymbol{\lambda}_{{\bf A}};\tfrac{m}{2},\tfrac{m}{2}\right)$.

3.1 Basic properties

Here are a few other facts about the Gamma distribution (a numerical spot check follows the list):

  • A Gamma random variable with shape $\alpha>0$ and rate $\beta>0$ has the PDF

    $$f(x)=\begin{cases}\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}&x>0,\\ 0&x\leq 0.\end{cases} \tag{9}$$

    The tail decays more slowly than the tail of a normal distribution.

  • If $X\sim \mathrm{Gamma}(\alpha,\beta)$ then $\mathbb{E}[X]=\alpha/\beta$ and $\operatorname{Var}[X]=\alpha/\beta^{2}$. Furthermore, $X$ is unimodal with mode $(\alpha-1)/\beta$ if $\alpha\geq 1$ and $0$ otherwise.

  • Gamma random variables follow a scaling law: if $X\sim \mathrm{Gamma}(\alpha,\beta)$ and $\lambda>0$, then $\lambda X\sim \mathrm{Gamma}(\alpha,\beta/\lambda)$.

  • Gamma random variables with the same rate parameter have an additive property: if $X_{1}\sim \mathrm{Gamma}(\alpha_{1},\beta)$ and $X_{2}\sim \mathrm{Gamma}(\alpha_{2},\beta)$ are independent, then $X_{1}+X_{2}\sim \mathrm{Gamma}(\alpha_{1}+\alpha_{2},\beta)$.

  • Conversely, Gamma random variables are infinitely divisible: if $X\sim \mathrm{Gamma}(\alpha,\beta)$ then for any positive integer $T$, $X$ has the same distribution as $\sum_{i=1}^{T}X_{i}$, where $X_{1},\ldots,X_{T}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(\alpha/T,\beta)$.
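The following sketch (mine, using SciPy) spot-checks the scaling and additive properties by comparing samples against the claimed distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, beta, lam = 3.0, 2.0, 1.5          # shape, rate, scaling factor

# Scaling law: lam * Gamma(alpha, rate beta) ~ Gamma(alpha, rate beta/lam)
x = lam * rng.gamma(alpha, 1 / beta, size=100_000)   # NumPy's scale = 1/rate
print(stats.kstest(x, stats.gamma(alpha, scale=lam / beta).cdf).pvalue)

# Additivity: Gamma(1, beta) + Gamma(2, beta) ~ Gamma(3, beta)
y = rng.gamma(1.0, 1 / beta, 100_000) + rng.gamma(2.0, 1 / beta, 100_000)
print(stats.kstest(y, stats.gamma(3.0, scale=1 / beta).cdf).pvalue)
```

Both Kolmogorov-Smirnov p-values should typically be large, since the null hypothesis holds.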

3.2 Generalizing the distribution families

In order to define $\mathcal{Q}_{\mathrm{rel}}$ and $\mathcal{Q}_{\mathrm{abs}}$ as extensions of $\mathcal{A}_{\mathrm{rel}}$ and $\mathcal{A}_{\mathrm{abs}}$, we first generalize the notions of 2-norm and effective rank.

Definition 7.

For a random variable $Q_{\boldsymbol{\lambda}}=Q(\boldsymbol{\lambda};\alpha,\beta)$, we define the scale of $Q_{\boldsymbol{\lambda}}$ as

$$\operatorname{scale}(Q_{\boldsymbol{\lambda}})=\frac{\max_{i}|\lambda_{i}|}{\beta}.$$

It is worth mentioning that $\operatorname{scale}(Q_{\boldsymbol{\lambda}})$ is a property of the distribution itself, not just its representation in terms of $(\boldsymbol{\lambda},\alpha,\beta)$.

Definition 8.

For a random variable $Q_{\boldsymbol{\lambda}}=Q(\boldsymbol{\lambda};\alpha,\beta)$ with nonnegative weights $\boldsymbol{\lambda}$, we define the effective shape of $Q_{\boldsymbol{\lambda}}$ as

$$\alpha_{\mathrm{eff}}(Q_{\boldsymbol{\lambda}}):=\frac{\mathbb{E}[Q_{\boldsymbol{\lambda}}]}{\operatorname{scale}(Q_{\boldsymbol{\lambda}})}=\frac{\alpha\sum_{i=1}^{n}\lambda_{i}}{\max_{i}\lambda_{i}}.$$

Now we can extend the definition of $\mathcal{A}_{\mathrm{rel}}(\mu)$:

Definition 9.

For fixed $\alpha>0$ and $\beta>0$, define

$$\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta):=\{Q_{\boldsymbol{\lambda}}\,:\,\operatorname{scale}(Q_{\boldsymbol{\lambda}})\leq\tfrac{1}{\mu},\ \mathbb{E}[Q_{\boldsymbol{\lambda}}]=1\},\quad\alpha\leq\mu,$$

where the weights $\boldsymbol{\lambda}\in\mathbb{R}^{n}$ are nonnegative.

The parameter $\mu$ in $\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta)$ is a lower bound on the effective shape of $Q_{\boldsymbol{\lambda}}$. This definition generalizes the case of matrix trace estimation since

$$\operatorname{tr}_{m}^{G}\left(\mathcal{A}_{\mathrm{rel}}(\mu)\right)=\mathcal{Q}_{\mathrm{rel}}\left(\tfrac{m\mu}{2};\tfrac{m}{2},\tfrac{m}{2}\right).$$

The definition of $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$ can be extended similarly:

Definition 10.

For fixed $\alpha>0$ and $\beta>0$, define

$$\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\alpha,\beta):=\{Q_{\boldsymbol{\lambda}}\,:\,\operatorname{scale}(Q_{\boldsymbol{\lambda}})\leq\lambda,\ \operatorname{Var}[Q_{\boldsymbol{\lambda}}]=\phi^{2}\},\quad 0<\lambda\leq\tfrac{1}{\sqrt{\alpha}}\phi.$$

In this case, we have

$$\operatorname{tr}_{m}^{G}(\mathcal{A}_{\mathrm{abs}}(\lambda,\phi))=\mathcal{Q}_{\mathrm{abs}}\left(\tfrac{2\lambda}{m},\ \phi\sqrt{\tfrac{2}{m}};\ \tfrac{m}{2},\tfrac{m}{2}\right).$$

3.3 Infinite division

Our strategy for relaxing the original trace estimation problem is to use the fact that Gamma random variables are infinitely divisible (see Section 3.1). Any Gamma random variable may be rewritten as a sum of arbitrarily many Gamma random variables with smaller shape parameters $\alpha$, and this fact enables us to nest some distribution families within others.

Lemma 3.

For any integer $T\geq 1$, it holds that $Q(\boldsymbol{\lambda};\alpha,\beta)$ has the same distribution as $Q\big((\underbrace{\boldsymbol{\lambda},\ldots,\boldsymbol{\lambda}}_{T\text{ times}});\tfrac{\alpha}{T},\beta\big)$, and consequently that

$$\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta)\subseteq\mathcal{Q}_{\mathrm{rel}}\left(\mu;\tfrac{\alpha}{T},\beta\right)\quad\text{and}\quad\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\alpha,\beta)\subseteq\mathcal{Q}_{\mathrm{abs}}\left(\lambda,\phi;\tfrac{\alpha}{T},\beta\right).$$

By considering the limit as $T\rightarrow\infty$, we can effectively do away with the constraint of sharing a single shape parameter, and in doing so obtain the more general sets $\mathcal{Q}_{\mathrm{rel}}(\mu)$ and $\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi)$ promised in (4) and (5).

Definition 11.

For $\mu>0$, let $\mathcal{Q}_{\mathrm{rel}}(\mu)$ be the set of all finite linear combinations of Gamma random variables

$$Q=\sum_{i=1}^{n}\lambda_{i}X_{i},\quad X_{i}\overset{\text{indep.}}{\sim}\mathrm{Gamma}(\alpha_{i},\beta_{i}),\quad\lambda_{i}\geq 0,\ \alpha_{i}>0,\ \beta_{i}>0,$$

subject to the constraints $\operatorname{scale}(Q)\leq\tfrac{1}{\mu}$ and $\mathbb{E}[Q]=1$.

Definition 12.

For $\lambda>0$ and $\phi>0$, let $\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi)$ be the set of all finite linear combinations of Gamma random variables

$$Q=\sum_{i=1}^{n}\lambda_{i}X_{i},\quad X_{i}\overset{\text{indep.}}{\sim}\mathrm{Gamma}(\alpha_{i},\beta_{i}),\quad\alpha_{i}>0,\ \beta_{i}>0,$$

subject to the constraints $\operatorname{scale}(Q)\leq\lambda$ and $\operatorname{Var}[Q]=\phi^{2}$.

For any fixed $\alpha>0$ and $\beta>0$, for sufficiently large $T$ the set $\mathcal{Q}_{\mathrm{rel}}(\mu;\tfrac{\alpha}{T},\beta)$ contains distributions arbitrarily close to any given distribution in $\mathcal{Q}_{\mathrm{rel}}(\mu)$; the same goes for $\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\tfrac{\alpha}{T},\beta)$ and $\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi)$. Thus any bounds we derive for the former set will also apply to the latter.

3.4 Fixed shape and scale parameters

The main results assume that the parameters $\alpha$ and $\beta$ are fixed in order to keep the majorization definitions as simple as possible. If $\alpha_{i}$ and $\beta_{i}$ were allowed to differ for each random variable $X_{i}$ in the mixture $Q$, we could still map $Q$ to a random variable that takes on values $\lambda_{i}/\beta_{i}$ with probability $\alpha_{i}$, then make comparisons using the stochastic ordering. I don't feel that the added complexity is worth it, so I won't pursue this approach further.

4 Majorization theorems

This section presents the main majorization results. The relative error result is essentially a restatement of results from [8]; to the best of my knowledge the absolute error result is novel.

4.1 Relative error

As mentioned above, this first result was essentially proved as part of [8, Thm. 1]. I’ve still included the proof (Appendix A.1) for the sake of completeness.

Theorem 2.

Fix $\mu\geq\alpha>0$ and $\beta>0$. Define $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$ to be the extreme values of the mode of the PDF of the distribution

$$Q_{\boldsymbol{\lambda}}+\lambda_{j}\psi+\lambda_{k}\psi^{\prime},\quad Q_{\boldsymbol{\lambda}}\in\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta),\ j\neq k,$$

where $\psi,\psi^{\prime}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(1,\beta)$ are exponential r.v.'s independent of $Q_{\boldsymbol{\lambda}}$. If $Q_{\boldsymbol{\lambda}},Q_{\boldsymbol{\mu}}\in\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta)$ satisfy $\boldsymbol{\mu}\preceq\boldsymbol{\lambda}$, then

$$\mathrm{Pr}\left(Q_{\boldsymbol{\mu}}\leq x\right)\geq\mathrm{Pr}\left(Q_{\boldsymbol{\lambda}}\leq x\right),\quad\forall x\geq\overline{x}_{\mathrm{upper}},$$
$$\mathrm{Pr}\left(Q_{\boldsymbol{\mu}}\leq x\right)\leq\mathrm{Pr}\left(Q_{\boldsymbol{\lambda}}\leq x\right),\quad\forall x\leq\overline{x}_{\mathrm{lower}}.$$

That is, $Q_{\boldsymbol{\lambda}}$ has worse upper and lower tail bounds.

4.1.1 Locating the tail

But what are the values of $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$? In the most general case $\mu=\alpha$, [8, Thm. 1] shows that $\overline{x}_{\mathrm{upper}}=1+(2\alpha)^{-1}$ and $\overline{x}_{\mathrm{lower}}=1-\alpha^{-1}$. These values necessarily serve as bounds whenever $\mu\geq\alpha$, but I would like to obtain stronger results for distributions with greater effective shape.

Theorem 3 gives a pessimistic result: it tells us where the tails are not, rather than where they are.

Theorem 3.

If $\mu>\alpha$, the points $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$ as defined in Theorem 2 satisfy

$$\overline{x}_{\mathrm{upper}}\geq 1+(\alpha\lceil\mu/\alpha\rceil)^{-1}\quad\text{and}\quad\overline{x}_{\mathrm{lower}}\leq 1-(\alpha\lceil\mu/\alpha\rceil)^{-1}.$$
Proof.

For the bound on $\overline{x}_{\mathrm{upper}}$, set $Q_{\boldsymbol{\lambda}}=\sum_{i=1}^{\lceil\mu/\alpha\rceil}\lambda X_{i}$ with $\lambda=\tfrac{\beta}{\alpha\lceil\mu/\alpha\rceil}$. The distribution of $Q_{\boldsymbol{\lambda}}+\lambda_{1}\psi+\lambda_{2}\psi^{\prime}$ is $\mathrm{Gamma}(\alpha\lceil\mu/\alpha\rceil+2,\lambda^{-1}\beta)$, so its mode is $1+(\alpha\lceil\mu/\alpha\rceil)^{-1}$ as desired, while $\mathbb{E}[Q_{\boldsymbol{\lambda}}]=1$.

For the bound on $\overline{x}_{\mathrm{lower}}$, take the same $Q_{\boldsymbol{\lambda}}$, but pad it with eigenvalues $\lambda_{j}=\lambda_{k}=0$ instead. ∎

Unfortunately, I have so far been unable to establish bounds in the opposite direction. I will therefore leave them as a conjecture:

Conjecture 1.

The points $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$ as defined in Theorem 2 satisfy

$$\overline{x}_{\mathrm{upper}}\leq 1+\mu^{-1}\quad\text{and}\quad\overline{x}_{\mathrm{lower}}\geq 1-\mu^{-1}.$$

4.2 Absolute error

The relative error analysis in the previous section has two significant limitations. First, it only applies when ${\bf A}$ is SPSD. Second, $r_{\mathrm{eff}}({\bf A})$ may sometimes be significantly larger than $r_{\mathrm{stab}}({\bf A})$, which is a better indicator of the quality of the Gaussian trace estimator since the variance of $\operatorname{tr}_{m}^{G}({\bf A})$ is $\tfrac{2}{m}\|{\bf A}\|_{F}^{2}$.

We therefore turn to the case where ${\bf A}$ may be indefinite but bounds on $\|{\bf A}\|_{F}$ and $\|{\bf A}\|_{2}$ are known. As before, it is useful to present the results in terms of more general Gamma random variables.

Here is the main result on the absolute error of the estimator (proof in Appendix A.2).

Theorem 4.

Fix $\alpha>0$, $\beta>0$, and $\frac{\phi}{\sqrt{\alpha}}\geq\lambda>0$. Define $\hat{x}_{\mathrm{upper}}$ to be the supremum over all inflection points of the PDF of the distribution

$$Q_{\boldsymbol{\lambda}}+\lambda_{j}\psi+\lambda_{k}\psi^{\prime}-\mathbb{E}[Q_{\boldsymbol{\lambda}}],\quad Q_{\boldsymbol{\lambda}}\in\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\alpha,\beta),\ j\neq k, \tag{10}$$

where $\psi,\psi^{\prime}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(1,\beta)$ are exponential r.v.'s independent of $Q_{\boldsymbol{\lambda}}$. If $Q_{\boldsymbol{\lambda}},Q_{\boldsymbol{\mu}}\in\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\alpha,\beta)$ satisfy $\boldsymbol{\mu}\preceq_{F}\boldsymbol{\lambda}$, then

$$\mathrm{Pr}(Q_{\boldsymbol{\mu}}-\mathbb{E}[Q_{\boldsymbol{\mu}}]\leq x)\geq\mathrm{Pr}(Q_{\boldsymbol{\lambda}}-\mathbb{E}[Q_{\boldsymbol{\lambda}}]\leq x),\quad\forall |x|>\hat{x}_{\mathrm{upper}}.$$

This single equation (note the use of the absolute value $|x|$ in the condition) means that $Q_{\boldsymbol{\lambda}}$ has worse upper tail bounds and that $Q_{\boldsymbol{\mu}}$ has worse lower tail bounds.

4.2.1 Locating the tail

Theorem 4 implies that the “tails” are the regions where all density functions of the form (10) are convex. For reference, a normal distribution has inflection points equal to the mean plus or minus one standard deviation.

How much further away could the tails begin? Theorem 5 gives a pessimistic result.

Theorem 5.

The point $\hat{x}_{\mathrm{upper}}$ as defined in Theorem 4 satisfies

$$\hat{x}_{\mathrm{upper}}\geq\phi\,\frac{1+\sqrt{r\alpha+1}}{\sqrt{r\alpha}},\quad\text{where}\ r=\left\lceil\frac{\phi^{2}}{\lambda^{2}\alpha}\right\rceil.$$

See Appendix A.3 for the proof. I’ll leave an upper bound as a conjecture.

Conjecture 2.

The point $\hat{x}_{\mathrm{upper}}$ as defined in Theorem 4 satisfies

$$\hat{x}_{\mathrm{upper}}\leq\lambda+\sqrt{\phi^{2}+\lambda^{2}}.$$

5 Application to trace estimation

In this section I’ll rephrase the main theorems (Theorems 2 and 4) more directly in terms of Gaussian trace estimation.

Here is the basic idea: the matrix ${\bf A}_{\mathrm{rel}}(\mu)$ from (2) (resp. ${\bf A}_{\mathrm{abs}}(\lambda,\phi)$ from (3)) majorizes every other element of $\mathcal{A}_{\mathrm{rel}}(\mu)$ (resp. $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$). Thus by the main theorems, these two matrices have the worst tail bounds. Employing the infinite division strategy of Section 3.3 shows that the Gaussian trace estimators for these matrices are in turn tail-bounded by the Gamma distributions in (4) and (5), respectively. Finally, increasing the sampling number $m$ will make the tails take up a larger portion of the distribution, increasing the domain over which the bounds in this paper are effective.

First up is the relative error bound. Note that a little unit conversion is used here: if $r_{\mathrm{eff}}({\bf A})=\mu$, then $\alpha_{\mathrm{eff}}(\operatorname{tr}_{m}^{G}({\bf A}))=\frac{m\mu}{2}$.

Theorem 6.

For nonzero SPSD ${\bf A}\in\mathbb{R}^{n\times n}$ and sampling number $m$, let $\mu=r_{\mathrm{eff}}({\bf A})$. Then there exists $\varepsilon_{\mathrm{rel}}$ such that for $\varepsilon\geq\varepsilon_{\mathrm{rel}}$,

$$\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A})-\operatorname{tr}({\bf A})|\geq\varepsilon\cdot\operatorname{tr}({\bf A})\right)\leq\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A}_{\mathrm{rel}})-1|\geq\varepsilon\right)\leq\mathrm{Pr}(|X-1|\geq\varepsilon),$$

where ${\bf A}_{\mathrm{rel}}={\bf A}_{\mathrm{rel}}(\mu)$ is defined in (2) and $X\sim \mathrm{Gamma}\left(\frac{m\mu}{2},\frac{m\mu}{2}\right)$.
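Within the tail region, the outer bound of Theorem 6 is just a Gamma CDF and can be evaluated exactly; here is a sketch (my own code, not from the paper):

```python
from scipy import stats

def gamma_rel_tail(eps, m, mu):
    """Pr(|X - 1| >= eps) for X ~ Gamma(m*mu/2, rate m*mu/2), as in Theorem 6."""
    a = m * mu / 2
    X = stats.gamma(a, scale=1 / a)       # shape a, rate a, so E[X] = 1
    return X.cdf(1 - eps) + X.sf(1 + eps)

# e.g. m = 50 samples, effective rank 10, relative tolerance 0.1
print(gamma_rel_tail(0.1, m=50, mu=10.0))
```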

Conjecture 3.

$\varepsilon_{\mathrm{rel}}\leq 2/(m\mu)$.

For comparison, Corollary 1 holds for all $\varepsilon>0$, but with marginally weaker bounds.

Next is the absolute error bound. Again there is a bit of unit conversion: if $\|{\bf A}\|_{2}=\lambda$ and $\|{\bf A}\|_{F}=\phi$, then $\operatorname{scale}(\operatorname{tr}_{m}^{G}({\bf A}))=\frac{2\lambda}{m}$ and $\operatorname{Var}[\operatorname{tr}_{m}^{G}({\bf A})]=\frac{2}{m}\phi^{2}$.

Theorem 7.

Let ${\bf A}\in\mathbb{R}^{n\times n}$ be symmetric with $\|{\bf A}\|_{2}=\lambda$ and $\|{\bf A}\|_{F}=\phi$. For fixed sampling number $m$, there exists $\varepsilon_{\mathrm{abs}}$ such that for $\varepsilon\geq\varepsilon_{\mathrm{abs}}$,

$$\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A})-\operatorname{tr}({\bf A})|\geq\varepsilon\right)\leq 2\,\mathrm{Pr}\left(\operatorname{tr}_{m}^{G}({\bf A}_{\mathrm{abs}})-\operatorname{tr}({\bf A}_{\mathrm{abs}})\geq\varepsilon\right)\leq 2\,\mathrm{Pr}(X-\mathbb{E}[X]\geq\varepsilon),$$

where ${\bf A}_{\mathrm{abs}}={\bf A}_{\mathrm{abs}}(\lambda,\phi)$ is defined in (3) and $X\sim \mathrm{Gamma}\left(\frac{m\rho}{2},\frac{m}{2\lambda}\right)$ with $\rho=r_{\mathrm{stab}}({\bf A})=\phi^{2}/\lambda^{2}$.

Conjecture 4.

$$\varepsilon_{\mathrm{abs}}\leq\frac{2\lambda}{m}+\sqrt{\frac{2}{m}\phi^{2}+\left(\frac{2\lambda}{m}\right)^{2}}.$$

For comparison, Theorem 1 holds for all $\varepsilon>0$, but with marginally weaker bounds.

The factor of 2 appearing in the bounds in Theorem 7 is due to the fact that the result uses ${\bf A}_{\mathrm{abs}}$ for the upper tail bound and $-{\bf A}_{\mathrm{abs}}$ for the lower tail bound (resp. $X$ and $-X$). Note also that in Conjecture 4 the terms $\phi$ and $\lambda$ scale differently as the sampling number $m$ increases. If true, the conjecture implies that $\varepsilon_{\mathrm{abs}}$ behaves asymptotically like $\phi\sqrt{2/m}$, one standard deviation of the estimator, as $m\rightarrow\infty$.

6 Conclusion

So what exactly are the practical implications of all of this? One takeaway, following from Theorem 6, is that the relative error bound for SPSD matrices depends on $m\cdot r_{\mathrm{eff}}({\bf A})$, so having a large effective rank is just as beneficial as using a large sampling number. One particular application is in estimating the Frobenius norm of a matrix. The worst-case bounds of [9] hold when the matrix has rank one, but with a lower bound on the stable rank of ${\bf A}$ these bounds can be tightened. This could in turn reduce the number of samples required to estimate the Frobenius norm to a given tolerance, as in [5, Lemma 2.2].

Another takeaway is that the bounds by Cortinovis and Kressner in Theorem 1 and Corollary 1 are already quite good. The improvements made in this paper, which establishes bounds that are tight given certain information about ${\bf A}$, are fairly minor. If you want any further improvements to these tail bounds, you will have to use more information about the spectrum of ${\bf A}$.

In practice, you probably shouldn't use the Gaussian trace estimator on its own. If your matrix has a small stable rank or effective rank, you'd do better to combine the trace estimator with a low-rank approximation such as in [4, 5, 3]. Furthermore, you can reduce the variance by using the Hutchinson estimator (sampling vectors with random $\pm 1$ entries), or by normalizing the Gaussian vectors to have unit length [2]. The main proof techniques in this paper do not work on the Hutchinson estimator because it has a discrete distribution. The techniques could potentially be applied to the normalized Gaussian estimator, but the proofs would be more complicated.

Finally, this paper successfully applies the proof techniques of [10, 9, 8] to obtain a majorization result and tight tail bounds for matrices with fixed $2$-norm and Frobenius norm. The application to trace estimation may not yield results significantly better than those that can be obtained through concentration inequalities, but the approach—comparing the CDFs of distributions to solve an optimization problem directly—is markedly different.

Appendix A Proofs

Proof of Lemma 2.

Recall from Definition 5 that $\boldsymbol{\mu}\preceq_{F}\boldsymbol{\lambda}$ if

$$(\boldsymbol{\mu}^{+})^{2}\preceq_{w}(\boldsymbol{\lambda}^{+})^{2}, \tag{7a revisited}$$
$$(\boldsymbol{\lambda}^{-})^{2}\preceq_{w}(\boldsymbol{\mu}^{-})^{2}, \tag{7b revisited}$$
$$\sum_{i=1}^{n}\mu_{i}^{2}=\sum_{i=1}^{n}\lambda_{i}^{2}. \tag{7c revisited}$$

First suppose that the inequalities in (7a) and (7b) are strict (either both are, or neither). Find the leading slack indices (see Definition 3) $j^{+}$ and $j^{-}$ for each inequality. Increase the corresponding nonnegative coordinate of $\boldsymbol{\mu}$ while shrinking the negative coordinate toward zero, keeping their sum of squares constant, until one of the slack inequalities becomes an equality. (If no such coordinate exists, pad $\boldsymbol{\mu}$ with zeros as needed. For example, if $\boldsymbol{\lambda}=(2,2)$ and $\boldsymbol{\mu}=(-2,-2)$, we would get the sequence $(-2,-2,0)\mapsto(-2,0,2)\mapsto(0,2,2)$. Such padding allows us to keep the sum of squares constant while changing the vectors continuously.) Repeat a finite number of times to eliminate all slack.

Then, apply Lemma 1 to the cases $(\boldsymbol{\mu}^{+})^{2}\preceq(\boldsymbol{\lambda}^{+})^{2}$ and $(\boldsymbol{\lambda}^{-})^{2}\preceq(\boldsymbol{\mu}^{-})^{2}$ separately. ∎

A.1 Relative error majorization theorem

As a reminder, most of this proof is substantially the same as the one given in [8]. I provide it here in part to show how the proof techniques relate to those used for the absolute error case.

Proof of Theorem 2.

Lemma 1 implies that we can without loss of generality assume that $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ differ in two coordinates only, satisfying $0\leq\lambda_{k}<\mu_{k}\leq\mu_{j}<\lambda_{j}$. For $t\in[0,1]$, define

$$\begin{aligned}\nu_{i}(t)&:=t\lambda_{i}+(1-t)\mu_{i},&&i\in\{j,k\},\\ \nu_{i}(t)&:=\lambda_{i},&&i\neq j,k,\\ Y(t)&:=\sum_{i=1}^{n}\nu_{i}(t)X_{i}.\end{aligned}$$

This interpolation satisfies $Y(0)=Q_{\boldsymbol{\mu}}$ and $Y(1)=Q_{\boldsymbol{\lambda}}$. Our goal is to show that for certain fixed values of $x$, the CDF $F_{Y(t)}(x)$ is monotonic for $t\in[0,1]$.

Take the Laplace transform of $F_{Y(t)}$ as

$$\begin{aligned}J(t,z):=\mathcal{L}[F_{Y(t)}](z)&=\int_{0}^{\infty}e^{-zx}F_{Y(t)}(x)\,dx\\ &=\frac{-1}{z}\int_{0}^{\infty}F_{Y(t)}(x)\,d\left(e^{-zx}\right)\\ &=\frac{1}{z}\int_{0}^{\infty}e^{-zx}\,dF_{Y(t)}(x)\\ &=\frac{1}{z}\mathcal{L}[Y(t)](z),\end{aligned}$$

where $\mathcal{L}[Y(t)](z):=\mathbb{E}[e^{-zY(t)}]$, the Laplace transform of $Y(t)$, satisfies

$$\mathcal{L}[Y(t)](z)=\prod_{i=1}^{n}\left(1+\frac{\nu_{i}(t)z}{\beta}\right)^{-\alpha},\quad z\in\mathbb{C},\ \Re(z)>-\min_{1\leq i\leq n}\frac{\beta}{\nu_{i}(t)}. \tag{11}$$

Differentiating with respect to $t$ yields

$$\begin{aligned}\frac{\partial}{\partial t}J(t,z)&=J(t,z)\frac{\partial}{\partial t}\ln J(t,z)\\ &=J(t,z)\frac{\partial}{\partial t}\sum_{i\in\{j,k\}}\left(-\alpha\ln\left(1+\frac{\nu_{i}(t)z}{\beta}\right)\right)\\ &=-J(t,z)\,z\,\frac{\alpha}{\beta}\sum_{i\in\{j,k\}}\frac{\lambda_{i}-\mu_{i}}{1+\frac{\nu_{i}(t)z}{\beta}}.\end{aligned}$$

Recall that $\lambda_{j}-\mu_{j}=\mu_{k}-\lambda_{k}>0$ (since $\boldsymbol{\mu}\preceq\boldsymbol{\lambda}$), and see that

$$\frac{1}{1+\frac{\nu_{j}(t)z}{\beta}}-\frac{1}{1+\frac{\nu_{k}(t)z}{\beta}}=(\nu_{k}(t)-\nu_{j}(t))\frac{z}{\beta}\prod_{i\in\{j,k\}}\left(1+\frac{\nu_{i}(t)z}{\beta}\right)^{-1}. \tag{12}$$

It follows that

$$\frac{\partial J}{\partial t}(t,z)=J(t,z)\,z^{2}\,\frac{\alpha}{\beta^{2}}\,\frac{(\lambda_{j}-\mu_{j})(\nu_{j}(t)-\nu_{k}(t))}{\prod_{i\in\{j,k\}}\left(1+\frac{\nu_{i}(t)z}{\beta}\right)}.$$

Substitute $J(t,z):=\mathcal{L}[F_{Y(t)}](z)$ and apply the inverse Laplace transform to get

$$\begin{aligned}\frac{\partial}{\partial t}F_{Y(t)}(x)&=\frac{\alpha}{\beta^{2}}(\lambda_{j}-\mu_{j})(\nu_{j}(t)-\nu_{k}(t))\frac{\partial^{2}}{\partial x^{2}}\mathrm{Pr}\left(Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}\leq x\right)\\ &=\frac{\alpha}{\beta^{2}}(\lambda_{j}-\mu_{j})(\nu_{j}(t)-\nu_{k}(t))\frac{\partial}{\partial x}f_{Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}}(x),\end{aligned}$$

where $\psi,\psi^{\prime}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(1,\beta)$ are exponential r.v.'s independent of $Y(t)$, and $f(x)$ denotes the probability density function (PDF). Now $\lambda_{j}>\mu_{j}$ by assumption, and for any $t\in[0,1]$ it holds that $\nu_{j}(t)\geq\nu_{k}(t)$ as well. Thus for each $t\in[0,1]$ the left-hand side $\frac{\partial}{\partial t}F_{Y(t)}(x)$ has the same sign as $\frac{\partial}{\partial x}f_{Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}}(x)$, the derivative of the density function.

It is known that the distribution of any linear combination of Gamma random variables is unimodal (see [10, Thm. 4]), so by the definition of $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$, the density function is increasing on $(0,\overline{x}_{\mathrm{lower}})$ and decreasing on $(\overline{x}_{\mathrm{upper}},\infty)$. Therefore, for any $x$ in either of these regions, $F_{Y(t)}(x)$ is monotonic with respect to $t\in[0,1]$. Since $Y(0)=Q_{\boldsymbol{\mu}}$ and $Y(1)=Q_{\boldsymbol{\lambda}}$, the desired inequalities follow. ∎

A.2 Absolute error majorization theorem

Proof of Theorem 4.

Lemma 2 implies that we can without loss of generality assume that $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$ differ in two coordinates, satisfying $\lambda_{j}^{2}-\mu_{j}^{2}=\mu_{k}^{2}-\lambda_{k}^{2}>0$. For $t\in[0,1]$, define

$$\begin{aligned}\nu_{i}(t)&:=\mathrm{sgn}(\lambda_{i}+\mu_{i})\sqrt{t\lambda_{i}^{2}+(1-t)\mu_{i}^{2}},&&i\in\{j,k\},\\ \nu_{i}(t)&:=\lambda_{i},&&i\neq j,k,\\ Y(t)&:=\sum_{i=1}^{n}\nu_{i}(t)(X_{i}-\alpha/\beta).\end{aligned}$$

Note that $\mathbb{E}[Y(t)]=0$ by design. As for the term $\mathrm{sgn}(\lambda_{i}+\mu_{i})$, Lemma 2 implies that we need only consider cases where $\lambda_{i}$ and $\mu_{i}$ are both nonpositive or both nonnegative. This slightly awkward term therefore ensures that $Y(0)=Q_{\boldsymbol{\mu}}-\mathbb{E}[Q_{\boldsymbol{\mu}}]$ and $Y(1)=Q_{\boldsymbol{\lambda}}-\mathbb{E}[Q_{\boldsymbol{\lambda}}]$. We also note that for $i\in\{j,k\}$ and $\nu_{i}(t)\neq 0$, we have

$$\frac{d}{dt}\nu_{i}(t)=\frac{\lambda_{i}^{2}-\mu_{i}^{2}}{2\nu_{i}(t)}. \tag{13}$$

Our goal is again to show that for certain fixed values of $x$, the CDF $F_{Y(t)}$ is monotonic for $t\in[0,1]$.

Take the Laplace transform of $F_{Y(t)}$ as

$$J(t,z):=\mathcal{L}[F_{Y(t)}](z)=\frac{1}{z}\mathcal{L}[Y(t)](z),$$

where $\mathcal{L}[Y(t)](z):=\mathbb{E}[e^{-zY(t)}]$, the Laplace transform of $Y(t)$, satisfies

$$\mathcal{L}[Y(t)](z)=\prod_{i=1}^{n}e^{z\nu_{i}(t)\alpha/\beta}\left(1+\frac{\nu_{i}(t)z}{\beta}\right)^{-\alpha},$$

defined over the strip

$$z\in\mathbb{C},\quad\frac{-\beta}{\max_{\nu_{i}(t)>0}\nu_{i}(t)}<\Re(z)<\frac{\beta}{\max_{\nu_{i}(t)<0}|\nu_{i}(t)|}. \tag{14}$$

As compared with (11), the extra term $e^{z\nu_{i}(t)\alpha/\beta}$ comes from our using the centered variables $X_{i}-\alpha/\beta$. The region of convergence in (14) is a strip rather than a half-plane because $Y(t)$ may include both positive and negative weights $\nu_{i}(t)$.

Differentiating $J(t,z)$ with respect to $t$ and using (13) yields

$$\begin{aligned}\frac{\partial}{\partial t}J(t,z)&=J(t,z)\frac{\partial}{\partial t}\ln J(t,z)\\ &=J(t,z)\sum_{i\in\{j,k\}}\frac{\partial}{\partial t}\left(z\nu_{i}(t)\frac{\alpha}{\beta}-\alpha\ln\left(1+\frac{\nu_{i}(t)z}{\beta}\right)\right)\\ &=J(t,z)\sum_{i\in\{j,k\}}z\frac{\alpha}{\beta}\left(1-\frac{1}{1+\tfrac{\nu_{i}(t)z}{\beta}}\right)\frac{d}{dt}\nu_{i}(t)\\ &=J(t,z)\,z^{2}\,\frac{\alpha}{2\beta^{2}}\sum_{i\in\{j,k\}}\frac{\lambda_{i}^{2}-\mu_{i}^{2}}{1+\nu_{i}(t)z/\beta}.\end{aligned}$$

Recall that $\lambda_{j}^{2}-\mu_{j}^{2}=\mu_{k}^{2}-\lambda_{k}^{2}>0$, and again use (12) to get

$$\frac{\partial}{\partial t}J(t,z)=J(t,z)\,z^{3}\,\frac{\alpha}{2\beta^{3}}\,\frac{(\lambda_{j}^{2}-\mu_{j}^{2})(\nu_{k}(t)-\nu_{j}(t))}{\left(1+\tfrac{\nu_{j}(t)z}{\beta}\right)\left(1+\tfrac{\nu_{k}(t)z}{\beta}\right)}.$$

Taking the inverse transform then gives

$$\begin{aligned}\frac{\partial}{\partial t}F_{Y(t)}(x)&=\frac{\alpha}{2\beta^{3}}(\lambda_{j}^{2}-\mu_{j}^{2})(\nu_{k}(t)-\nu_{j}(t))\frac{\partial^{3}}{\partial x^{3}}\mathrm{Pr}\left(Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}\leq x\right)\\ &=\frac{\alpha}{2\beta^{3}}(\lambda_{j}^{2}-\mu_{j}^{2})(\nu_{k}(t)-\nu_{j}(t))\frac{\partial^{2}}{\partial x^{2}}f_{Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}}(x),\end{aligned}$$

where $\psi,\psi^{\prime}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(1,\beta)$ are exponential r.v.'s independent of $Y(t)$, and $f(x)$ denotes the probability density function (PDF).

By design, $\lambda_{j}^{2}-\mu_{j}^{2}>0$. Checking the three cases considered in Lemma 2 shows also that $\nu_{k}(t)-\nu_{j}(t)\leq 0$ for $t\in[0,1]$. Thus the sign of $\frac{\partial}{\partial t}F_{Y(t)}(x)$ is always opposite that of $\frac{\partial^{2}}{\partial x^{2}}f_{Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}}(x)$, the convexity of the density function. By the definition of $\hat{x}_{\mathrm{upper}}$, the density function is convex on $(\hat{x}_{\mathrm{upper}},\infty)$, and so for any $x$ in this region, $F_{Y(t)}(x)$ decreases monotonically for $t\in[0,1]$. Since $Y(0)=Q_{\boldsymbol{\mu}}-\mathbb{E}[Q_{\boldsymbol{\mu}}]$ and $Y(1)=Q_{\boldsymbol{\lambda}}-\mathbb{E}[Q_{\boldsymbol{\lambda}}]$, the desired upper-tail inequality follows.

The lower-tail bound follows by symmetry. ∎

A.3 Absolute error tail results

Proof of Theorem 5.

Let $Q_{\boldsymbol{\lambda}}=\sum_{i=1}^{r}\hat{\lambda}X_{i}$, where $X_{i}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(\alpha,\beta)$, $r=\lceil\phi^{2}/(\lambda^{2}\alpha)\rceil$, and $\hat{\lambda}=\frac{\beta\phi}{\sqrt{r\alpha}}$. This is a valid choice of distribution since

$$\operatorname{Var}[Q_{\boldsymbol{\lambda}}]=r\hat{\lambda}^{2}\operatorname{Var}[X_{1}]=r\left(\frac{\beta^{2}\phi^{2}}{r\alpha}\right)\frac{\alpha}{\beta^{2}}=\phi^{2}$$

and

$$\operatorname{scale}(Q_{\boldsymbol{\lambda}})=\frac{\hat{\lambda}}{\beta}=\frac{\phi}{\sqrt{r\alpha}}\leq\frac{\phi}{\sqrt{\phi^{2}/\lambda^{2}}}=\lambda.$$

Then $\mathbb{E}[Q_{\boldsymbol{\lambda}}]=r\hat{\lambda}\frac{\alpha}{\beta}=\phi\sqrt{r\alpha}$, and the r.v. $Q_{\boldsymbol{\lambda}}+\lambda_{1}\psi+\lambda_{2}\psi^{\prime}$ has distribution $\mathrm{Gamma}(r\alpha+2,\beta\hat{\lambda}^{-1})$, or equivalently, $\mathrm{Gamma}(r\alpha+2,\sqrt{r\alpha}/\phi)$.

Now in general, a $\mathrm{Gamma}(\alpha,\beta)$ random variable has density proportional to $x^{\alpha-1}e^{-\beta x}$ for $x\geq 0$ (see (9)). The inflection points are where the second derivative is equal to zero. Since

$$\begin{aligned}\frac{\mathrm{d}^{2}}{\mathrm{d}x^{2}}\,x^{\alpha-1}e^{-\beta x}&=\frac{\mathrm{d}}{\mathrm{d}x}\,\left(-\beta x^{\alpha-1}+(\alpha-1)x^{\alpha-2}\right)e^{-\beta x}\\ &=\left(\beta^{2}x^{\alpha-1}-2\beta(\alpha-1)x^{\alpha-2}+(\alpha-1)(\alpha-2)x^{\alpha-3}\right)e^{-\beta x}\\ &=\left((\beta x)^{2}-2(\alpha-1)(\beta x)+(\alpha-1)(\alpha-2)\right)x^{\alpha-3}e^{-\beta x},\end{aligned}$$

the nontrivial inflection points are the two zeros of the quadratic term. The larger of the two zeros is given by

$$\hat{x}_{+}=\beta^{-1}\,\frac{2(\alpha-1)+\sqrt{4(\alpha-1)^{2}-4(\alpha-1)(\alpha-2)}}{2}=\frac{\alpha-1+\sqrt{\alpha-1}}{\beta}.$$

Substitute for $\alpha$ and $\beta$ the values $r\alpha+2$ and $\sqrt{r\alpha}/\phi$, and subtract $\mathbb{E}[Q_{\boldsymbol{\lambda}}]$ to get

$$\hat{x}_{\mathrm{upper}}\geq\frac{r\alpha+1+\sqrt{r\alpha+1}}{\sqrt{r\alpha}/\phi}-\phi\sqrt{r\alpha}=\phi\,\frac{1+\sqrt{r\alpha+1}}{\sqrt{r\alpha}}. \tag{15}$$

Conjecture 2 is motivated by the fact that the right-hand side of (15) is bounded above as

$$\phi\,\frac{1+\sqrt{r\alpha+1}}{\sqrt{r\alpha}}\leq\phi\,\frac{1+\sqrt{\phi^{2}/\lambda^{2}+1}}{\sqrt{\phi^{2}/\lambda^{2}}}=\lambda+\sqrt{\phi^{2}+\lambda^{2}}.$$
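As a numerical sanity check of the inflection-point formula $\hat{x}_{+}=(\alpha-1+\sqrt{\alpha-1})/\beta$ used above, one can locate the sign changes of the second derivative of the Gamma density by finite differences (my own sketch, not from the paper):

```python
import numpy as np
from scipy import stats

alpha, beta = 8.0, 2.0
x = np.linspace(1e-3, 10, 200_001)
pdf = stats.gamma(alpha, scale=1 / beta).pdf(x)
d2 = np.gradient(np.gradient(pdf, x), x)   # finite-difference second derivative

crossings = x[1:][np.diff(np.sign(d2)) != 0]
print(crossings.max(), (alpha - 1 + np.sqrt(alpha - 1)) / beta)  # both ~4.82
```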

References

  • [1] Alice Cortinovis and Daniel Kressner. On randomized trace estimates for indefinite matrices with an application to determinants. Foundations of Computational Mathematics, 22(3):875–903, 2022.
  • [2] Ethan N. Epperly. Don't use Gaussians in stochastic trace estimation. https://www.ethanepperly.com/index.php/2024/01/28/, Jan 2024. Accessed: 2024-11-01.
  • [3] Ethan N. Epperly, Joel A. Tropp, and Robert J. Webber. XTrace: making the most of every sample in stochastic trace estimation. SIAM Journal on Matrix Analysis and Applications, 45(1):1–23, 2024.
  • [4] Raphael A. Meyer, Cameron Musco, Christopher Musco, and David P. Woodruff. Hutch++: optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
  • [5] David Persson, Alice Cortinovis, and Daniel Kressner. Improved variants of the Hutch++ algorithm for trace estimation. SIAM Journal on Matrix Analysis and Applications, 43(3):1162–1185, 2022.
  • [6] David Persson, Alice Cortinovis, and Daniel Kressner. Improved variants of the Hutch++ algorithm for trace estimation (preprint), 2022.
  • [7] Josip E. Pečarić and Yung Liang Tong. Convex functions, partial orderings, and statistical applications. Academic Press, 1992.
  • [8] Farbod Roosta-Khorasani and Gábor J. Székely. Schur properties of convolutions of gamma random variables. Metrika, 78(8):997–1014, 2015.
  • [9] Farbod Roosta-Khorasani, Gábor J. Székely, and Uri M. Ascher. Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA Journal on Uncertainty Quantification, 3(1):61–90, 2015.
  • [10] Gábor J. Székely and Nail K. Bakirov. Extremal probabilities for Gaussian quadratic forms. Probability Theory and Related Fields, 126(2):184–202, 2003.