
Extremal bounds for Gaussian trace estimation

Eric Hallman
Abstract: This work derives extremal tail bounds for the Gaussian trace estimator applied to a real symmetric matrix. We define a partial ordering on the eigenvalues so that when a matrix has a greater spectrum under this ordering, its estimator has worse tail bounds. This is done for two families of matrices: positive semidefinite matrices with bounded effective rank, and indefinite matrices with bounded 2-norm and fixed Frobenius norm. In each case, the tail region is defined rigorously and is constant for a given family.

1 Introduction

Let ${\bf A}\in\mathbb{R}^{n\times n}$ be a symmetric matrix with eigenvalues $\lambda_{1}\geq\ldots\geq\lambda_{n}$, and let ${\bf z}_{1},\ldots,{\bf z}_{m}\in\mathbb{R}^{n}$ be i.i.d. standard Gaussian vectors. The Gaussian trace estimator is given by

$$\operatorname{tr}({\bf A})\approx\operatorname{tr}_{m}^{G}({\bf A}):=\frac{1}{m}\sum_{j=1}^{m}{\bf z}_{j}^{T}{\bf A}{\bf z}_{j}. \tag{1}$$

It is well known that this estimator is unbiased and has variance $\tfrac{2}{m}\|{\bf A}\|_{F}^{2}$.
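As a point of reference, here is a minimal NumPy sketch of the estimator (1). The function name and demo matrix are my own illustration, not from the paper.

```python
import numpy as np

def trace_estimate_gaussian(A, m, seed=None):
    """Gaussian trace estimator (1): average of m quadratic forms z^T A z."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Z = rng.standard_normal((n, m))               # columns are i.i.d. Gaussian vectors
    return np.einsum("im,ij,jm->", Z, A, Z) / m   # (1/m) * sum_j z_j^T A z_j

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2                                 # make A symmetric
print(np.trace(A), trace_estimate_gaussian(A, m=500, seed=1))
```

Over repeated runs, the sample variance of this estimate should be close to $\tfrac{2}{m}\|{\bf A}\|_{F}^{2}$.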

Often, it is useful to know how many samples $m$ are needed to produce an estimate that satisfies a given error tolerance. Cortinovis and Kressner [1] have used concentration inequalities to derive good results on this problem, with slight improvements by Persson, Cortinovis, and Kressner in [6].

Theorem 1 ([1], Thm. 1).

Let ${\bf A}\in\mathbb{R}^{n\times n}$ be symmetric. Then for all $\varepsilon>0$,

$$\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A})-\operatorname{tr}({\bf A})|\geq\varepsilon\right)\leq 2\exp\left(-\frac{m\varepsilon^{2}}{4\|{\bf A}\|_{F}^{2}+4\varepsilon\|{\bf A}\|_{2}}\right).$$
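To make the bound actionable, one can invert it: the tail probability is at most $\delta$ once $m\geq(4\|{\bf A}\|_{F}^{2}+4\varepsilon\|{\bf A}\|_{2})\ln(2/\delta)/\varepsilon^{2}$. Here is a small sketch of that calculation (the helper name is mine, not from [1]):

```python
import math

def samples_needed(frob_norm, two_norm, eps, delta):
    """Smallest m making the Theorem 1 tail bound at most delta."""
    return math.ceil((4 * frob_norm**2 + 4 * eps * two_norm)
                     * math.log(2 / delta) / eps**2)

print(samples_needed(frob_norm=10.0, two_norm=1.0, eps=5.0, delta=1e-3))
```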

For nonzero symmetric positive semidefinite (SPSD) matrices, this result can be turned into a relative error estimate. The trace estimator will generally be more accurate on matrices with tightly clustered eigenvalues.

Definition 1.

The effective rank of a nonzero SPSD matrix ${\bf A}$ is

$$r_{\mathrm{eff}}({\bf A}):=\frac{\operatorname{tr}({\bf A})}{\|{\bf A}\|_{2}}.$$
Corollary 1 ([1], Remark 2).

For nonzero SPSD ${\bf A}\in\mathbb{R}^{n\times n}$, replace $\varepsilon$ by $\varepsilon\cdot\operatorname{tr}({\bf A})$ in Theorem 1, and note that $\|{\bf A}\|_{F}^{2}/\operatorname{tr}({\bf A})^{2}\leq r_{\mathrm{eff}}({\bf A})^{-1}$. For $\varepsilon>0$,

$$\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A})-\operatorname{tr}({\bf A})|\geq\varepsilon\cdot\operatorname{tr}({\bf A})\right)\leq 2\exp\left(-\frac{m\varepsilon^{2}\cdot r_{\mathrm{eff}}({\bf A})}{4(1+\varepsilon)}\right).$$
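Solving this bound for $m$ as before, a failure probability of at most $\delta$ is guaranteed once

$$m\geq\frac{4(1+\varepsilon)\ln(2/\delta)}{\varepsilon^{2}\,r_{\mathrm{eff}}({\bf A})},$$

so the required sample count decreases linearly in the effective rank.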

The goal of this paper is to tighten these bounds as much as possible using techniques from [10, 9, 8]. The practical benefit may be marginal, since the above bounds are already quite good. Still, I found the techniques interesting.

1.1 Extremal bounds

Words like tight and optimal can be slippery. To be precise, I mean to find extremal tail probabilities: ones that can be expressed as the supremum or infimum over some set. The error bounds of Theorem 1 and Corollary 1 use only certain information about ${\bf A}$; respectively, they use $(\|{\bf A}\|_{2},\|{\bf A}\|_{F})$ and $r_{\mathrm{eff}}({\bf A})$. It is therefore worth considering the sets of all matrices with the same summary statistics:

$$\begin{aligned}\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)&:=\{{\bf A}\in\mathcal{S}_{\infty}\,:\,\|{\bf A}\|_{2}\leq\lambda,\ \|{\bf A}\|_{F}=\phi\},&&0<\lambda\leq\phi,\\ \mathcal{A}_{\mathrm{rel}}(\mu)&:=\{{\bf A}\in\mathcal{S}_{\infty}^{+}\,:\,\|{\bf A}\|_{2}\leq\tfrac{1}{\mu},\ \operatorname{tr}({\bf A})=1\},&&1\leq\mu,\end{aligned}$$

where $\mathcal{S}_{\infty}$ is the set of all symmetric matrices of any dimension, and $\mathcal{S}_{\infty}^{+}$ is the set of all SPSD matrices of any dimension. For $\mathcal{A}_{\mathrm{rel}}(\mu)$, the parameter $\mu$ is a lower bound on the effective rank of ${\bf A}$; we can fix $\operatorname{tr}({\bf A})$ without loss of generality since the relative error of $\operatorname{tr}_{m}^{G}({\bf A})$ is scale invariant.

For each of the above sets $\mathcal{A}$, we define a partial ordering:

  • For $\mathcal{A}_{\mathrm{rel}}(\mu)$, the partial ordering is the vector majorization order on the eigenvalues. This paper will show that $\boldsymbol{\lambda}_{{\bf A}}\preceq\boldsymbol{\lambda}_{{\bf B}}$ implies that $\operatorname{tr}_{m}^{G}({\bf B})$ has worse upper and lower tail bounds than $\operatorname{tr}_{m}^{G}({\bf A})$.

  • For $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$, the partial ordering is related to the majorization order but is a little more complicated. This paper will show that $\boldsymbol{\lambda}_{{\bf A}}\preceq\boldsymbol{\lambda}_{{\bf B}}$ implies that $\operatorname{tr}_{m}^{G}({\bf B})$ has worse upper tail bounds and that $\operatorname{tr}_{m}^{G}({\bf A})$ has worse lower tail bounds.

With these partial orderings, $\mathcal{A}_{\mathrm{rel}}(\mu)$ has a maximum element and $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$ has both maximum and minimum elements. (Strictly speaking, the maximum is unique only if we treat matrices as equivalent when they have the same nonzero eigenvalues.)

1.2 The tail

It is not particularly useful to show that if $\boldsymbol{\lambda}_{{\bf A}}\preceq\boldsymbol{\lambda}_{{\bf B}}$ then one trace estimate eventually has worse tail bounds than the other—this much can already be deduced just by comparing the asymptotic behavior of the probability distributions.

The critical result, then, is that we can define the “tail” regions so that their location depends only on $\mathcal{A}$. For error tolerances within these regions, the partial ordering works as advertised on all elements of $\mathcal{A}$.

Unfortunately, it has so far been beyond my ability to prove exactly where these tail regions begin. This paper contains some pessimistic bounds, but otherwise is restricted to conjecture.

1.3 The extremal matrices

For the relative error, the matrix with the worst tail bounds is

$${\bf A}_{\mathrm{rel}}(\mu)=\frac{1}{\mu}\begin{bmatrix}{\bf I}_{\lfloor\mu\rfloor}&{\bf 0}&{\bf 0}\\ {\bf 0}&\mu-\lfloor\mu\rfloor&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}\end{bmatrix}. \tag{2}$$

For the absolute error, the matrix with the worst upper tail bounds is

$${\bf A}_{\mathrm{abs}}(\lambda,\phi)=\lambda\begin{bmatrix}{\bf I}_{\lfloor\rho\rfloor}&{\bf 0}&{\bf 0}\\ {\bf 0}&\sqrt{\rho-\lfloor\rho\rfloor}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}\end{bmatrix},\quad\rho:=\frac{\phi^{2}}{\lambda^{2}}, \tag{3}$$

where $\rho$ is an upper bound on the stable rank of ${\bf A}$. By symmetry, the matrix with the worst lower tail bounds is $-{\bf A}_{\mathrm{abs}}(\lambda,\phi)$.
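For concreteness, here is a short sketch (my own code, not from the paper) that builds the extremal matrices (2) and (3) as diagonal matrices:

```python
import numpy as np

def A_rel(mu):
    """Worst-case SPSD matrix (2): trace 1, 2-norm 1/mu, effective rank mu."""
    k = int(np.floor(mu))
    diag = [1.0] * k + ([mu - k] if mu > k else [])
    return np.diag(diag) / mu

def A_abs(lam, phi):
    """Worst-case matrix (3) for the upper tail: ||A||_2 = lam, ||A||_F = phi."""
    rho = phi**2 / lam**2                  # stable rank
    k = int(np.floor(rho))
    diag = [1.0] * k + ([np.sqrt(rho - k)] if rho > k else [])
    return lam * np.diag(diag)
```

For instance, A_rel(2.5) is diag(0.4, 0.4, 0.2), which has trace 1, 2-norm 0.4, and effective rank 2.5.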

Definition 2.

The stable rank of a nonzero matrix ${\bf A}$ is

$$r_{\mathrm{stab}}({\bf A}):=\frac{\|{\bf A}\|_{F}^{2}}{\|{\bf A}\|_{2}^{2}}.$$
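A sketch of both summary statistics (the helper names are mine); note that for SPSD ${\bf A}$ we always have $r_{\mathrm{stab}}({\bf A})\leq r_{\mathrm{eff}}({\bf A})$, since $\|{\bf A}\|_{F}^{2}\leq\|{\bf A}\|_{2}\operatorname{tr}({\bf A})$:

```python
import numpy as np

def effective_rank(A):
    """tr(A) / ||A||_2 (Definition 1; A assumed nonzero SPSD)."""
    return np.trace(A) / np.linalg.norm(A, 2)

def stable_rank(A):
    """||A||_F^2 / ||A||_2^2 (Definition 2; A assumed nonzero)."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2
```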

1.3.1 Extension to Gamma random variables

When $\mu$ and $\rho$ are not integers, the distributions of the trace estimators for (2) and (3) are not quite as elegant as I would have liked them to be. We can, however, relax the problem by considering more general linear combinations of Gamma random variables, as opposed to only those whose distribution can be expressed as the Gaussian trace estimate of some matrix as in (1).

For distribution families $\mathcal{Q}_{\mathrm{rel}}$ and $\mathcal{Q}_{\mathrm{abs}}$ to be defined later, we will show that

$$\max\,\mathcal{Q}_{\mathrm{rel}}\sim \mathrm{Gamma}\left(\frac{m\mu}{2},\frac{m\mu}{2}\right) \tag{4}$$

and

$$\operatorname*{extrema}\,\mathcal{Q}_{\mathrm{abs}}=\pm\lambda Q,\quad Q\sim \mathrm{Gamma}\left(\frac{m\rho}{2},\frac{m}{2}\right),\ \rho:=\frac{\phi^{2}}{\lambda^{2}}. \tag{5}$$

When $\mu$ and $\rho$ are integers, these are exactly the distributions of the trace estimators $\operatorname{tr}_{m}^{G}({\bf A}_{\mathrm{rel}})$ and $\operatorname{tr}_{m}^{G}({\bf A}_{\mathrm{abs}})$ from (2) and (3).

1.4 Related work

This paper is primarily a spiritual successor to the works [10, 9, 8]. Székely [10] gives thorough tail bounds for the relative error when $Q$ is a nonnegative sum of chi-squared r.v.'s with no further restrictions (the worst case is typically $Q\sim\chi_{1}^{2}$) and provides a majorization result as a corollary. Roosta-Khorasani, Székely, and Ascher [9] extend the main result to the case where $Q$ is a nonnegative sum of Gamma r.v.'s with arbitrary shape and scale parameters. This allows them to consider the effects of using a larger sampling number $m$ for trace estimation. Finally, Roosta-Khorasani and Székely [8] generalize the worst-case bounds of [9] by obtaining a majorization result for nonnegative sums of Gamma r.v.'s. The present work's Theorem 2, in particular, is more or less a restatement of a result from [8].

The present paper is novel in two main ways. First, it considers how the above bounds might be strengthened when the matrix in question has a large effective rank. Second, it obtains absolute error bounds for indefinite matrices, that is, when the sum of Gamma random variables is not necessarily nonnegative.

2 Majorization

In [8], Roosta-Khorasani and Székely use the majorization order on the eigenvalues of a matrix as a measure of the “skewness” of its spectrum. Their general observation is that the Gaussian trace estimator performs worse when the spectrum is highly skewed. In other words: the majorization order is the partial ordering that they (and we) use to determine for which matrices the Gaussian trace estimator yields the worst relative error bounds.

Given a nonnegative vector $\boldsymbol{\lambda}\in\mathbb{R}^{n}$, let $\lambda_{[i]}$ denote its entries in decreasing order, so that $\lambda_{[1]}\geq\ldots\geq\lambda_{[n]}\geq 0$. We say that $\boldsymbol{\lambda}$ majorizes $\boldsymbol{\mu}$, denoted $\boldsymbol{\mu}\preceq\boldsymbol{\lambda}$, if

$$\sum_{i=1}^{k}\mu_{[i]}\leq\sum_{i=1}^{k}\lambda_{[i]},\quad\forall k\leq n, \tag{6a}$$
$$\sum_{i=1}^{n}\mu_{i}=\sum_{i=1}^{n}\lambda_{i}. \tag{6b}$$

Similarly, we say that $\boldsymbol{\lambda}$ weakly majorizes $\boldsymbol{\mu}$, denoted $\boldsymbol{\mu}\preceq_{w}\boldsymbol{\lambda}$, just as long as (6a) holds. If $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$ do not have the same length, pad the shorter one with zeroes as necessary.
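A small sketch (my own code) of these tests for nonnegative vectors, padding and sorting as described above:

```python
import numpy as np

def weakly_majorizes(lam, mu, tol=1e-12):
    """True if mu <=_w lam, i.e. the partial sums satisfy (6a)."""
    n = max(len(lam), len(mu))
    l = np.sort(np.pad(np.asarray(lam, float), (0, n - len(lam))))[::-1]
    m = np.sort(np.pad(np.asarray(mu, float), (0, n - len(mu))))[::-1]
    return bool(np.all(np.cumsum(m) <= np.cumsum(l) + tol))

def majorizes(lam, mu, tol=1e-12):
    """True if mu <= lam, i.e. both (6a) and (6b) hold."""
    return weakly_majorizes(lam, mu, tol) and abs(sum(lam) - sum(mu)) < tol

print(majorizes([1.0, 0.0], [0.5, 0.5]))   # True: (0.5, 0.5) <= (1, 0)
```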

If $\boldsymbol{\lambda}$ weakly majorizes $\boldsymbol{\mu}$ but does not majorize it, then $\sum_{i=1}^{n}\mu_{i}<\sum_{i=1}^{n}\lambda_{i}$. For our proofs, it will be useful to identify the indices for which the inequalities in (6a) are strict.

Definition 3.

Given $\boldsymbol{\mu}\preceq_{w}\boldsymbol{\lambda}$, the leading slack index is the smallest number $j$ such that

$$\sum_{i=1}^{\ell}\mu_{[i]}<\sum_{i=1}^{\ell}\lambda_{[i]},\quad\forall\ell\in\{j,\ldots,n\}.$$

This index has the property that $\mu_{[j]}$ (and no larger entry of $\boldsymbol{\mu}$) can be increased by a nonzero amount without violating the majorization condition.

In deriving their main results, Roosta-Khorasani and Székely make use of the following lemma [7, 12.5a].

Lemma 1.

If $\boldsymbol{\mu}\preceq\boldsymbol{\lambda}$, then there exists a finite sequence of vectors

$$\boldsymbol{\mu}\preceq\boldsymbol{\eta}_{1}\preceq\ldots\preceq\boldsymbol{\eta}_{r}=\boldsymbol{\lambda}$$

such that $\boldsymbol{\eta}_{i}$ and $\boldsymbol{\eta}_{i+1}$ differ in two coordinates only, for $i=1,\ldots,r-1$.

This lemma lets us assume without loss of generality that $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ differ in two coordinates only, satisfying

$$0\leq\lambda_{k}<\mu_{k}\leq\mu_{j}<\lambda_{j}.$$

2.1 Indefinite majorization

For indefinite matrices with fixed Frobenius norm as in $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$, the majorization condition is somewhat more complicated. Here is why: a Gamma random variable is nonnegative, so its lower tail can't get too far from the mean. The average of many Gamma random variables, however, looks more like a normal distribution, which has tails in both directions. This suggests that $\operatorname{tr}_{m}^{G}({\bf A})$ will have worse absolute upper tail bounds when the nonnegative eigenvalues are highly skewed and when the negative eigenvalues are tightly clustered.

We will also see that $\operatorname{tr}_{m}^{G}(|{\bf A}|)$ has worse upper tail bounds than $\operatorname{tr}_{m}^{G}({\bf A})$ for indefinite ${\bf A}$. Analogous results for lower tail bounds follow by symmetry.

Now we describe the majorization condition more precisely.

Definition 4.

For a vector $\boldsymbol{\lambda}$, define

$$\boldsymbol{\lambda}^{-}:=\min(\boldsymbol{\lambda},\boldsymbol{0})\quad\text{and}\quad\boldsymbol{\lambda}^{+}:=\max(\boldsymbol{\lambda},\boldsymbol{0}),$$

where the min and max operations are done elementwise.

Definition 5.

For vectors $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$, we say that $\boldsymbol{\mu}\preceq_{F}\boldsymbol{\lambda}$ if

$$(\boldsymbol{\mu}^{+})^{2}\preceq_{w}(\boldsymbol{\lambda}^{+})^{2}, \tag{7a}$$
$$(\boldsymbol{\lambda}^{-})^{2}\preceq_{w}(\boldsymbol{\mu}^{-})^{2}, \tag{7b}$$
$$\sum_{i=1}^{n}\mu_{i}^{2}=\sum_{i=1}^{n}\lambda_{i}^{2}. \tag{7c}$$

Condition (7a) specifies that the positive entries of $\boldsymbol{\lambda}$ are more skewed, and perhaps have more total weight, than those of $\boldsymbol{\mu}$. Condition (7b) specifies the opposite for the negative entries. Condition (7c) means that the matrices associated with $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$ have the same Frobenius norm. This last condition also means that the associated trace estimators have the same variance.
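Reusing weakly_majorizes from the earlier sketch, the order $\preceq_{F}$ can be checked directly from (7a)-(7c) (again my own code, not from the paper):

```python
import numpy as np

def majorizes_F(lam, mu, tol=1e-12):
    """True if mu <=_F lam in the sense of Definition 5."""
    lam, mu = np.asarray(lam, float), np.asarray(mu, float)
    lp, mp = np.maximum(lam, 0), np.maximum(mu, 0)          # positive parts
    ln, mn = np.minimum(lam, 0), np.minimum(mu, 0)          # negative parts
    return (weakly_majorizes(lp**2, mp**2, tol)             # (7a)
            and weakly_majorizes(mn**2, ln**2, tol)         # (7b)
            and abs(np.sum(lam**2) - np.sum(mu**2)) < tol)  # (7c)
```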

For the main result, we propose a lemma analogous to Lemma 1 (see Appendix A for the proof).

Lemma 2.

If $\boldsymbol{\mu}\preceq_{F}\boldsymbol{\lambda}$, then there exists a finite sequence of vectors

$$\boldsymbol{\mu}\preceq_{F}\boldsymbol{\eta}_{1}\preceq_{F}\ldots\preceq_{F}\boldsymbol{\eta}_{r}=\boldsymbol{\lambda}$$

such that $\boldsymbol{\eta}_{i}$ and $\boldsymbol{\eta}_{i+1}$ differ in two coordinates only, for $i=1,\ldots,r-1$. Furthermore, the difference between consecutive vectors may be assumed to take one of the following forms:

  1. $\mu_{k}<\lambda_{k}\leq 0\leq\mu_{j}<\lambda_{j}$;

  2. $0\leq\lambda_{k}<\mu_{k}\leq\mu_{j}<\lambda_{j}$;

  3. $\mu_{k}<\lambda_{k}\leq\lambda_{j}<\mu_{j}\leq 0$.

Once again, we may assume without loss of generality that $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ differ in two coordinates only, this time satisfying $\lambda_{j}^{2}-\mu_{j}^{2}=\mu_{k}^{2}-\lambda_{k}^{2}>0$.

3 Gamma random variables

In this section we cover some relevant properties of Gamma random variables, and explain how to reformulate the original trace estimation problem in terms of such variables.

Since real symmetric matrices are orthogonally diagonalizable and Gaussian vectors are rotationally invariant, the Gaussian trace estimator (1) can be written in the form

$$\operatorname{tr}_{m}^{G}({\bf A})=\sum_{i=1}^{n}\lambda_{i}X_{i},\quad X_{i}\overset{\text{i.i.d.}}{\sim}\tfrac{1}{m}\chi_{m}^{2}, \tag{8}$$

where $\chi_{m}^{2}$ is a chi-squared variable with $m$ degrees of freedom.
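The identity (8) is easy to spot-check by simulation; the snippet below (my own sketch) draws the estimator both ways for a toy diagonal matrix and compares moments.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.array([3.0, 1.0, -0.5])          # eigenvalues of a toy symmetric A
m, trials = 5, 200_000

# Direct form (1) with A = diag(lam)
Z = rng.standard_normal((trials, m, lam.size))
direct = (Z**2 * lam).sum(axis=2).mean(axis=1)

# Equivalent form (8): weighted scaled chi-squared variables
X = rng.chisquare(m, size=(trials, lam.size)) / m
weighted = X @ lam

print(direct.mean(), weighted.mean())     # both near tr(A) = 3.5
print(direct.var(), weighted.var())       # both near (2/m)||A||_F^2 = 4.1
```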

The chi-squared distribution is a special case of the Gamma distribution, and so we can generalize (8) as follows:

Definition 6.

Given a vector $\boldsymbol{\lambda}=(\lambda_{1},\ldots,\lambda_{n})\in\mathbb{R}^{n}$, denote

$$Q(\boldsymbol{\lambda};\alpha,\beta):=\sum_{i=1}^{n}\lambda_{i}X_{i},\quad X_{i}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(\alpha,\beta).$$

When $\alpha$ and $\beta$ are clear from context, we will use $Q_{\boldsymbol{\lambda}}$ for short.

With the notation above, $\operatorname{tr}_{m}^{G}({\bf A})$ has the same distribution as $Q\left(\boldsymbol{\lambda}_{{\bf A}};\tfrac{m}{2},\tfrac{m}{2}\right)$.

3.1 Basic properties

Here are a few other facts about the Gamma distribution (a numerical spot check follows the list):

  • A Gamma random variable with shape $\alpha>0$ and rate $\beta>0$ has the PDF

    $$f(x)=\begin{cases}\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}&x>0,\\ 0&x\leq 0.\end{cases} \tag{9}$$

    The tail decays more slowly than the tail of a normal distribution.

  • If $X\sim \mathrm{Gamma}(\alpha,\beta)$ then $\mathbb{E}[X]=\alpha/\beta$ and $\operatorname{Var}[X]=\alpha/\beta^{2}$. Furthermore, $X$ is unimodal with mode $(\alpha-1)/\beta$ if $\alpha\geq 1$ and $0$ otherwise.

  • Gamma random variables follow a scaling law: if $X\sim \mathrm{Gamma}(\alpha,\beta)$ and $\lambda>0$, then $\lambda X\sim \mathrm{Gamma}(\alpha,\beta/\lambda)$.

  • Gamma random variables with the same rate parameter have an additive property: if $X_{1}\sim \mathrm{Gamma}(\alpha_{1},\beta)$ and $X_{2}\sim \mathrm{Gamma}(\alpha_{2},\beta)$ are independent, then $X_{1}+X_{2}\sim \mathrm{Gamma}(\alpha_{1}+\alpha_{2},\beta)$.

  • Conversely, Gamma random variables are infinitely divisible: if $X\sim \mathrm{Gamma}(\alpha,\beta)$ then for any positive integer $T$, $X$ has the same distribution as $\sum_{i=1}^{T}X_{i}$, where $X_{1},\ldots,X_{T}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(\alpha/T,\beta)$.
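The following sketch (mine, using SciPy) spot-checks the scaling and additive properties by comparing samples against the claimed distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, beta, lam = 3.0, 2.0, 1.5          # shape, rate, scaling factor

# Scaling law: lam * Gamma(alpha, rate beta) ~ Gamma(alpha, rate beta/lam)
x = lam * rng.gamma(alpha, 1 / beta, size=100_000)   # NumPy's scale = 1/rate
print(stats.kstest(x, stats.gamma(alpha, scale=lam / beta).cdf).pvalue)

# Additivity: Gamma(1, beta) + Gamma(2, beta) ~ Gamma(3, beta)
y = rng.gamma(1.0, 1 / beta, 100_000) + rng.gamma(2.0, 1 / beta, 100_000)
print(stats.kstest(y, stats.gamma(3.0, scale=1 / beta).cdf).pvalue)
```

Both Kolmogorov-Smirnov p-values should typically be large, since the null hypothesis holds.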

3.2 Generalizing the distribution families

In order to define $\mathcal{Q}_{\mathrm{rel}}$ and $\mathcal{Q}_{\mathrm{abs}}$ as extensions of $\mathcal{A}_{\mathrm{rel}}$ and $\mathcal{A}_{\mathrm{abs}}$, we first generalize the notions of 2-norm and effective rank.

Definition 7.

For a random variable $Q_{\boldsymbol{\lambda}}=Q(\boldsymbol{\lambda};\alpha,\beta)$, we define the scale of $Q_{\boldsymbol{\lambda}}$ as

$$\operatorname{scale}(Q_{\boldsymbol{\lambda}})=\frac{\max_{i}|\lambda_{i}|}{\beta}.$$

It is worth mentioning that $\operatorname{scale}(Q_{\boldsymbol{\lambda}})$ is a property of the distribution itself, not just its representation in terms of $(\boldsymbol{\lambda},\alpha,\beta)$.

Definition 8.

For a random variable $Q_{\boldsymbol{\lambda}}=Q(\boldsymbol{\lambda};\alpha,\beta)$ with nonnegative weights $\boldsymbol{\lambda}$, we define the effective shape of $Q_{\boldsymbol{\lambda}}$ as

$$\alpha_{\mathrm{eff}}(Q_{\boldsymbol{\lambda}}):=\frac{\mathbb{E}[Q_{\boldsymbol{\lambda}}]}{\operatorname{scale}(Q_{\boldsymbol{\lambda}})}=\frac{\alpha\sum_{i=1}^{n}\lambda_{i}}{\max_{i}\lambda_{i}}.$$

Now we can extend the definition of $\mathcal{A}_{\mathrm{rel}}(\mu)$:

Definition 9.

For fixed $\alpha>0$ and $\beta>0$, define

$$\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta):=\{Q_{\boldsymbol{\lambda}}\,:\,\operatorname{scale}(Q_{\boldsymbol{\lambda}})\leq\tfrac{1}{\mu},\ \mathbb{E}[Q_{\boldsymbol{\lambda}}]=1\},\quad\alpha\leq\mu,$$

where the weights $\boldsymbol{\lambda}\in\mathbb{R}^{n}$ are nonnegative.

The parameter $\mu$ in $\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta)$ is a lower bound on the effective shape of $Q_{\boldsymbol{\lambda}}$. This definition generalizes the case of matrix trace estimation since

$$\operatorname{tr}_{m}^{G}\left(\mathcal{A}_{\mathrm{rel}}(\mu)\right)=\mathcal{Q}_{\mathrm{rel}}\left(\tfrac{m\mu}{2};\tfrac{m}{2},\tfrac{m}{2}\right).$$

The definition of $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$ can be extended similarly:

Definition 10.

For fixed $\alpha>0$ and $\beta>0$, define

$$\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\alpha,\beta):=\{Q_{\boldsymbol{\lambda}}\,:\,\operatorname{scale}(Q_{\boldsymbol{\lambda}})\leq\lambda,\ \operatorname{Var}[Q_{\boldsymbol{\lambda}}]=\phi^{2}\},\quad 0<\lambda\leq\tfrac{1}{\sqrt{\alpha}}\phi.$$

In this case, we have

$$\operatorname{tr}_{m}^{G}(\mathcal{A}_{\mathrm{abs}}(\lambda,\phi))=\mathcal{Q}_{\mathrm{abs}}\left(\tfrac{2\lambda}{m},\ \phi\sqrt{\tfrac{2}{m}};\ \tfrac{m}{2},\tfrac{m}{2}\right).$$

3.3 Infinite division

Our strategy for relaxing the original trace estimation problem is to use the fact that Gamma random variables are infinitely divisible (see Section 3.1). Any Gamma random variable may be rewritten as a sum of arbitrarily many Gamma random variables with smaller shape parameters $\alpha$, and this fact enables us to nest some distribution families within others.

Lemma 3.

For any integer $T\geq 1$, it holds that $Q(\boldsymbol{\lambda};\alpha,\beta)$ has the same distribution as $Q\big((\underbrace{\boldsymbol{\lambda},\ldots,\boldsymbol{\lambda}}_{T\text{ times}});\tfrac{\alpha}{T},\beta\big)$, and consequently that

$$\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta)\subseteq\mathcal{Q}_{\mathrm{rel}}\left(\mu;\tfrac{\alpha}{T},\beta\right)\quad\text{and}\quad\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\alpha,\beta)\subseteq\mathcal{Q}_{\mathrm{abs}}\left(\lambda,\phi;\tfrac{\alpha}{T},\beta\right).$$

By considering the limit as $T\rightarrow\infty$, we can effectively do away with the constraint of sharing a single shape parameter, and in doing so obtain the more general sets $\mathcal{Q}_{\mathrm{rel}}(\mu)$ and $\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi)$ promised in (4) and (5).

Definition 11.

For $\mu>0$, let $\mathcal{Q}_{\mathrm{rel}}(\mu)$ be the set of all finite linear combinations of Gamma random variables

$$Q=\sum_{i=1}^{n}\lambda_{i}X_{i},\quad X_{i}\overset{\text{indep.}}{\sim}\mathrm{Gamma}(\alpha_{i},\beta_{i}),\quad\lambda_{i}\geq 0,\ \alpha_{i}>0,\ \beta_{i}>0,$$

subject to the constraints $\operatorname{scale}(Q)\leq\tfrac{1}{\mu}$ and $\mathbb{E}[Q]=1$.

Definition 12.

For $\lambda>0$ and $\phi>0$, let $\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi)$ be the set of all finite linear combinations of Gamma random variables

$$Q=\sum_{i=1}^{n}\lambda_{i}X_{i},\quad X_{i}\overset{\text{indep.}}{\sim}\mathrm{Gamma}(\alpha_{i},\beta_{i}),\quad\alpha_{i}>0,\ \beta_{i}>0,$$

subject to the constraints $\operatorname{scale}(Q)\leq\lambda$ and $\operatorname{Var}[Q]=\phi^{2}$.

For any fixed $\alpha>0$ and $\beta>0$, for sufficiently large $T$ the set $\mathcal{Q}_{\mathrm{rel}}(\mu;\tfrac{\alpha}{T},\beta)$ contains distributions arbitrarily close to any given distribution in $\mathcal{Q}_{\mathrm{rel}}(\mu)$; the same goes for $\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\tfrac{\alpha}{T},\beta)$ and $\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi)$. Thus any bounds we derive for the former set will also apply to the latter.

3.4 Fixed shape and scale parameters

The main results assume that the parameters $\alpha$ and $\beta$ are fixed in order to keep the majorization definitions as simple as possible. If $\alpha_{i}$ and $\beta_{i}$ were allowed to differ for each random variable $X_{i}$ in the mixture $Q$, we could still map $Q$ to a random variable that takes on values $\lambda_{i}/\beta_{i}$ with probability $\alpha_{i}$, then make comparisons using the stochastic ordering. I don't feel that the added complexity is worth it, so I won't pursue this approach further.

4 Majorization theorems

This section presents the main majorization results. The relative error result is essentially a restatement of results from [8]; to the best of my knowledge the absolute error result is novel.

4.1 Relative error

As mentioned above, this first result was essentially proved as part of [8, Thm. 1]. I’ve still included the proof (Appendix A.1) for the sake of completeness.

Theorem 2.

Fix $\mu\geq\alpha>0$ and $\beta>0$. Define $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$ to be the extreme values of the mode of the PDF of the distribution

$$Q_{\boldsymbol{\lambda}}+\lambda_{j}\psi+\lambda_{k}\psi^{\prime},\quad Q_{\boldsymbol{\lambda}}\in\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta),\ j\neq k,$$

where $\psi,\psi^{\prime}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(1,\beta)$ are exponential r.v.'s independent of $Q_{\boldsymbol{\lambda}}$. If $Q_{\boldsymbol{\lambda}},Q_{\boldsymbol{\mu}}\in\mathcal{Q}_{\mathrm{rel}}(\mu;\alpha,\beta)$ satisfy $\boldsymbol{\mu}\preceq\boldsymbol{\lambda}$, then

$$\mathrm{Pr}\left(Q_{\boldsymbol{\mu}}\leq x\right)\geq\mathrm{Pr}\left(Q_{\boldsymbol{\lambda}}\leq x\right),\quad\forall x\geq\overline{x}_{\mathrm{upper}},$$
$$\mathrm{Pr}\left(Q_{\boldsymbol{\mu}}\leq x\right)\leq\mathrm{Pr}\left(Q_{\boldsymbol{\lambda}}\leq x\right),\quad\forall x\leq\overline{x}_{\mathrm{lower}}.$$

That is, $Q_{\boldsymbol{\lambda}}$ has worse upper and lower tail bounds.

4.1.1 Locating the tail

But what are the values of $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$? In the most general case $\mu=\alpha$, [8, Thm. 1] shows that $\overline{x}_{\mathrm{upper}}=1+(2\alpha)^{-1}$ and $\overline{x}_{\mathrm{lower}}=1-\alpha^{-1}$. These values necessarily serve as bounds whenever $\mu\geq\alpha$, but I would like to obtain stronger results for distributions with greater effective shape.

Theorem 3 gives a pessimistic result: it tells us where the tails are not, rather than where they are.

Theorem 3.

If $\mu>\alpha$, the points $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$ as defined in Theorem 2 satisfy

$$\overline{x}_{\mathrm{upper}}\geq 1+(\alpha\lceil\mu/\alpha\rceil)^{-1}\quad\text{and}\quad\overline{x}_{\mathrm{lower}}\leq 1-(\alpha\lceil\mu/\alpha\rceil)^{-1}.$$
Proof.

For the bound on $\overline{x}_{\mathrm{upper}}$, set $Q_{\boldsymbol{\lambda}}=\sum_{i=1}^{\lceil\mu/\alpha\rceil}\lambda X_{i}$ with $\lambda=\tfrac{\beta}{\alpha\lceil\mu/\alpha\rceil}$. The distribution of $Q_{\boldsymbol{\lambda}}+\lambda_{1}\psi+\lambda_{2}\psi^{\prime}$ is $\mathrm{Gamma}(\alpha\lceil\mu/\alpha\rceil+2,\lambda^{-1}\beta)$, so its mode is $1+(\alpha\lceil\mu/\alpha\rceil)^{-1}$ as desired, while $\mathbb{E}[Q_{\boldsymbol{\lambda}}]=1$.

For the bound on $\overline{x}_{\mathrm{lower}}$, take the same $Q_{\boldsymbol{\lambda}}$, but pad it with eigenvalues $\lambda_{j}=\lambda_{k}=0$ instead. ∎

Unfortunately, I have so far been unable to establish bounds in the opposite direction. I will therefore leave them as a conjecture:

Conjecture 1.

The points $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$ as defined in Theorem 2 satisfy

$$\overline{x}_{\mathrm{upper}}\leq 1+\mu^{-1}\quad\text{and}\quad\overline{x}_{\mathrm{lower}}\geq 1-\mu^{-1}.$$

4.2 Absolute error

The relative error analysis in the previous section has two significant limitations. First, it only applies when ${\bf A}$ is SPSD. Second, $r_{\mathrm{eff}}({\bf A})$ may sometimes be significantly larger than $r_{\mathrm{stab}}({\bf A})$, which is a better indicator of the quality of the Gaussian trace estimator since the variance of $\operatorname{tr}_{m}^{G}({\bf A})$ is $\tfrac{2}{m}\|{\bf A}\|_{F}^{2}$.

We therefore turn to the case where ${\bf A}$ may be indefinite but bounds on $\|{\bf A}\|_{F}$ and $\|{\bf A}\|_{2}$ are known. As before, it is useful to present the results in terms of more general Gamma random variables.

Here is the main result on the absolute error of the estimator (proof in Appendix A.2).

Theorem 4.

Fix $\alpha>0$, $\beta>0$, and $\frac{\phi}{\sqrt{\alpha}}\geq\lambda>0$. Define $\hat{x}_{\mathrm{upper}}$ to be the supremum over all inflection points of the PDF of the distribution

$$Q_{\boldsymbol{\lambda}}+\lambda_{j}\psi+\lambda_{k}\psi^{\prime}-\mathbb{E}[Q_{\boldsymbol{\lambda}}],\quad Q_{\boldsymbol{\lambda}}\in\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\alpha,\beta),\ j\neq k, \tag{10}$$

where $\psi,\psi^{\prime}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(1,\beta)$ are exponential r.v.'s independent of $Q_{\boldsymbol{\lambda}}$. If $Q_{\boldsymbol{\lambda}},Q_{\boldsymbol{\mu}}\in\mathcal{Q}_{\mathrm{abs}}(\lambda,\phi;\alpha,\beta)$ satisfy $\boldsymbol{\mu}\preceq_{F}\boldsymbol{\lambda}$, then

$$\mathrm{Pr}(Q_{\boldsymbol{\mu}}-\mathbb{E}[Q_{\boldsymbol{\mu}}]\leq x)\geq\mathrm{Pr}(Q_{\boldsymbol{\lambda}}-\mathbb{E}[Q_{\boldsymbol{\lambda}}]\leq x),\quad\forall |x|>\hat{x}_{\mathrm{upper}}.$$

This single equation (note the use of the absolute value $|x|$ in the condition) means that $Q_{\boldsymbol{\lambda}}$ has worse upper tail bounds and that $Q_{\boldsymbol{\mu}}$ has worse lower tail bounds.

4.2.1 Locating the tail

Theorem 4 implies that the “tails” are the regions where all density functions of the form (10) are convex. For reference, a normal distribution has inflection points equal to the mean plus or minus one standard deviation.

How much further away could the tails begin? Theorem 5 gives a pessimistic result.

Theorem 5.

The point $\hat{x}_{\mathrm{upper}}$ as defined in Theorem 4 satisfies

$$\hat{x}_{\mathrm{upper}}\geq\phi\,\frac{1+\sqrt{r\alpha+1}}{\sqrt{r\alpha}},\quad\text{where}\ r=\left\lceil\frac{\phi^{2}}{\lambda^{2}\alpha}\right\rceil.$$

See Appendix A.3 for the proof. I’ll leave an upper bound as a conjecture.

Conjecture 2.

The point $\hat{x}_{\mathrm{upper}}$ as defined in Theorem 4 satisfies

$$\hat{x}_{\mathrm{upper}}\leq\lambda+\sqrt{\phi^{2}+\lambda^{2}}.$$

5 Application to trace estimation

In this section I’ll rephrase the main theorems (Theorems 2 and 4) more directly in terms of Gaussian trace estimation.

Here is the basic idea: the matrix ${\bf A}_{\mathrm{rel}}(\mu)$ from (2) (resp. ${\bf A}_{\mathrm{abs}}(\lambda,\phi)$ from (3)) majorizes every other element of $\mathcal{A}_{\mathrm{rel}}(\mu)$ (resp. $\mathcal{A}_{\mathrm{abs}}(\lambda,\phi)$). Thus by the main theorems, these two matrices have the worst tail bounds. Employing the infinite division strategy of Section 3.3 shows that the Gaussian trace estimators for these matrices are in turn tail-bounded by the Gamma distributions in (4) and (5), respectively. Finally, increasing the sampling number $m$ will make the tails take up a larger portion of the distribution, increasing the domain over which the bounds in this paper are effective.

First up is the relative error bound. Note that a little unit conversion is used here: if $r_{\mathrm{eff}}({\bf A})=\mu$, then $\alpha_{\mathrm{eff}}(\operatorname{tr}_{m}^{G}({\bf A}))=\frac{m\mu}{2}$.

Theorem 6.

For nonzero SPSD ${\bf A}\in\mathbb{R}^{n\times n}$ and sampling number $m$, let $\mu=r_{\mathrm{eff}}({\bf A})$. Then there exists $\varepsilon_{\mathrm{rel}}$ such that for $\varepsilon\geq\varepsilon_{\mathrm{rel}}$,

$$\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A})-\operatorname{tr}({\bf A})|\geq\varepsilon\cdot\operatorname{tr}({\bf A})\right)\leq\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A}_{\mathrm{rel}})-1|\geq\varepsilon\right)\leq\mathrm{Pr}(|X-1|\geq\varepsilon),$$

where ${\bf A}_{\mathrm{rel}}={\bf A}_{\mathrm{rel}}(\mu)$ is defined in (2) and $X\sim \mathrm{Gamma}\left(\frac{m\mu}{2},\frac{m\mu}{2}\right)$.
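Within the tail region, the outer bound of Theorem 6 is just a Gamma CDF and can be evaluated exactly; here is a sketch (my own code, not from the paper):

```python
from scipy import stats

def gamma_rel_tail(eps, m, mu):
    """Pr(|X - 1| >= eps) for X ~ Gamma(m*mu/2, rate m*mu/2), as in Theorem 6."""
    a = m * mu / 2
    X = stats.gamma(a, scale=1 / a)       # shape a, rate a, so E[X] = 1
    return X.cdf(1 - eps) + X.sf(1 + eps)

# e.g. m = 50 samples, effective rank 10, relative tolerance 0.1
print(gamma_rel_tail(0.1, m=50, mu=10.0))
```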

Conjecture 3.

$\varepsilon_{\mathrm{rel}}\leq 2/(m\mu)$.

For comparison, Corollary 1 holds for all $\varepsilon>0$, but with marginally weaker bounds.

Next is the absolute error bound. Again there is a bit of unit conversion: if $\|{\bf A}\|_{2}=\lambda$ and $\|{\bf A}\|_{F}=\phi$, then $\operatorname{scale}(\operatorname{tr}_{m}^{G}({\bf A}))=\frac{2\lambda}{m}$ and $\operatorname{Var}[\operatorname{tr}_{m}^{G}({\bf A})]=\frac{2}{m}\phi^{2}$.

Theorem 7.

Let ${\bf A}\in\mathbb{R}^{n\times n}$ be symmetric with $\|{\bf A}\|_{2}=\lambda$ and $\|{\bf A}\|_{F}=\phi$. For fixed sampling number $m$, there exists $\varepsilon_{\mathrm{abs}}$ such that for $\varepsilon\geq\varepsilon_{\mathrm{abs}}$,

$$\mathrm{Pr}\left(|\operatorname{tr}_{m}^{G}({\bf A})-\operatorname{tr}({\bf A})|\geq\varepsilon\right)\leq 2\,\mathrm{Pr}\left(\operatorname{tr}_{m}^{G}({\bf A}_{\mathrm{abs}})-\operatorname{tr}({\bf A}_{\mathrm{abs}})\geq\varepsilon\right)\leq 2\,\mathrm{Pr}(X-\mathbb{E}[X]\geq\varepsilon),$$

where ${\bf A}_{\mathrm{abs}}={\bf A}_{\mathrm{abs}}(\lambda,\phi)$ is defined in (3) and $X\sim \mathrm{Gamma}\left(\frac{m\rho}{2},\frac{m}{2\lambda}\right)$ with $\rho=r_{\mathrm{stab}}({\bf A})=\phi^{2}/\lambda^{2}$.

Conjecture 4.

$$\varepsilon_{\mathrm{abs}}\leq\frac{2\lambda}{m}+\sqrt{\frac{2}{m}\phi^{2}+\left(\frac{2\lambda}{m}\right)^{2}}.$$

For comparison, Theorem 1 holds for all $\varepsilon>0$, but with marginally weaker bounds.

The factor of 2 appearing in the bounds in Theorem 7 is due to the fact that the result uses ${\bf A}_{\mathrm{abs}}$ for the upper tail bound and $-{\bf A}_{\mathrm{abs}}$ for the lower tail bound (resp. $X$ and $-X$). Note also that in Conjecture 4 the terms $\phi$ and $\lambda$ scale differently as the sampling number $m$ increases. If true, the conjecture implies that $\varepsilon_{\mathrm{abs}}$ behaves asymptotically like $\phi\sqrt{2/m}$, one standard deviation of the estimator, as $m\rightarrow\infty$.

6 Conclusion

So what exactly are the practical implications of all of this? One takeaway, following from Theorem 6, is that the relative error bound for SPSD matrices depends on $m\cdot r_{\mathrm{eff}}({\bf A})$, so having a large effective rank is just as beneficial as using a large sampling number. One particular application is in estimating the Frobenius norm of a matrix. The worst-case bounds of [9] hold when the matrix has rank one, but with a lower bound on the stable rank of ${\bf A}$ these bounds can be tightened. This could in turn reduce the number of samples required to estimate the Frobenius norm to a given tolerance, as in [5, Lemma 2.2].

Another takeaway is that the bounds by Cortinovis and Kressner in Theorem 1 and Corollary 1 are already quite good. The improvements made in this paper, which establishes bounds that are tight given certain information about ${\bf A}$, are fairly minor. If you want any further improvements to these tail bounds, you will have to use more information about the spectrum of ${\bf A}$.

In practice, you probably shouldn't use the Gaussian trace estimator on its own. If your matrix has a small stable rank or effective rank, you'd do better to combine the trace estimator with a low-rank approximation such as in [4, 5, 3]. Furthermore, you can reduce the variance by using the Hutchinson estimator (sampling vectors with random $\pm 1$ entries), or by normalizing the Gaussian vectors to have unit length [2]. The main proof techniques in this paper do not work on the Hutchinson estimator because it has a discrete distribution. The techniques could potentially be applied to the normalized Gaussian estimator, but the proofs would be more complicated.

Finally, this paper successfully applies the proof techniques of [10, 9, 8] to obtain a majorization result and tight tail bounds for matrices with fixed $2$-norm and Frobenius norm. The application to trace estimation may not yield results significantly better than those that can be obtained through concentration inequalities, but the approach—comparing the CDFs of distributions to solve an optimization problem directly—is markedly different.

Appendix A Proofs

Proof of Lemma 2.

Recall from Definition 5 that $\boldsymbol{\mu}\preceq_{F}\boldsymbol{\lambda}$ if

$$(\boldsymbol{\mu}^{+})^{2}\preceq_{w}(\boldsymbol{\lambda}^{+})^{2}, \tag{7a revisited}$$
$$(\boldsymbol{\lambda}^{-})^{2}\preceq_{w}(\boldsymbol{\mu}^{-})^{2}, \tag{7b revisited}$$
$$\sum_{i=1}^{n}\mu_{i}^{2}=\sum_{i=1}^{n}\lambda_{i}^{2}. \tag{7c revisited}$$

First suppose that the inequalities in (7a) and (7b) are strict (either both are, or neither). Find the leading slack indices (see Definition 3) $j^{+}$ and $j^{-}$ for each inequality. Increase the corresponding nonnegative coordinate of $\boldsymbol{\mu}$ while shrinking the negative coordinate toward zero, keeping their sum of squares constant, until one of the slack inequalities becomes an equality. (If no such coordinate exists, pad $\boldsymbol{\mu}$ with zeros as needed. For example, if $\boldsymbol{\lambda}=(2,2)$ and $\boldsymbol{\mu}=(-2,-2)$, we would get the sequence $(-2,-2,0)\mapsto(-2,0,2)\mapsto(0,2,2)$. Such padding allows us to keep the sum of squares constant while changing the vectors continuously.) Repeat a finite number of times to eliminate all slack.

Then, apply Lemma 1 to the cases $(\boldsymbol{\mu}^{+})^{2}\preceq(\boldsymbol{\lambda}^{+})^{2}$ and $(\boldsymbol{\lambda}^{-})^{2}\preceq(\boldsymbol{\mu}^{-})^{2}$ separately. ∎

A.1 Relative error majorization theorem

As a reminder, most of this proof is substantially the same as the one given in [8]. I provide it here in part to show how the proof techniques relate to those used for the absolute error case.

Proof of Theorem 2.

Lemma 1 implies that we can without loss of generality assume that $\boldsymbol{\mu}$ and $\boldsymbol{\lambda}$ differ in two coordinates only, satisfying $0\leq\lambda_{k}<\mu_{k}\leq\mu_{j}<\lambda_{j}$. For $t\in[0,1]$, define

$$\begin{aligned}\nu_{i}(t)&:=t\lambda_{i}+(1-t)\mu_{i},&&i\in\{j,k\},\\ \nu_{i}(t)&:=\lambda_{i},&&i\neq j,k,\\ Y(t)&:=\sum_{i=1}^{n}\nu_{i}(t)X_{i}.\end{aligned}$$

This interpolation satisfies $Y(0)=Q_{\boldsymbol{\mu}}$ and $Y(1)=Q_{\boldsymbol{\lambda}}$. Our goal is to show that for certain fixed values of $x$, the CDF $F_{Y(t)}(x)$ is monotonic for $t\in[0,1]$.

Take the Laplace transform of $F_{Y(t)}$ as

$$\begin{aligned}J(t,z):=\mathcal{L}[F_{Y(t)}](z)&=\int_{0}^{\infty}e^{-zx}F_{Y(t)}(x)\,dx\\ &=\frac{-1}{z}\int_{0}^{\infty}F_{Y(t)}(x)\,d\left(e^{-zx}\right)\\ &=\frac{1}{z}\int_{0}^{\infty}e^{-zx}\,dF_{Y(t)}(x)\\ &=\frac{1}{z}\mathcal{L}[Y(t)](z),\end{aligned}$$

where $\mathcal{L}[Y(t)](z):=\mathbb{E}[e^{-zY(t)}]$, the Laplace transform of $Y(t)$, satisfies

$$\mathcal{L}[Y(t)](z)=\prod_{i=1}^{n}\left(1+\frac{\nu_{i}(t)z}{\beta}\right)^{-\alpha},\quad z\in\mathbb{C},\ \Re(z)>-\min_{1\leq i\leq n}\frac{\beta}{\nu_{i}(t)}. \tag{11}$$

Differentiating with respect to $t$ yields

$$\begin{aligned}\frac{\partial}{\partial t}J(t,z)&=J(t,z)\frac{\partial}{\partial t}\ln J(t,z)\\ &=J(t,z)\frac{\partial}{\partial t}\sum_{i\in\{j,k\}}\left(-\alpha\ln\left(1+\frac{\nu_{i}(t)z}{\beta}\right)\right)\\ &=-J(t,z)\,z\,\frac{\alpha}{\beta}\sum_{i\in\{j,k\}}\frac{\lambda_{i}-\mu_{i}}{1+\frac{\nu_{i}(t)z}{\beta}}.\end{aligned}$$

Recall that $\lambda_{j}-\mu_{j}=\mu_{k}-\lambda_{k}>0$ (since $\boldsymbol{\mu}\preceq\boldsymbol{\lambda}$), and see that

$$\frac{1}{1+\frac{\nu_{j}(t)z}{\beta}}-\frac{1}{1+\frac{\nu_{k}(t)z}{\beta}}=(\nu_{k}(t)-\nu_{j}(t))\frac{z}{\beta}\prod_{i\in\{j,k\}}\left(1+\frac{\nu_{i}(t)z}{\beta}\right)^{-1}. \tag{12}$$

It follows that

$$\frac{\partial J}{\partial t}(t,z)=J(t,z)\,z^{2}\,\frac{\alpha}{\beta^{2}}\,\frac{(\lambda_{j}-\mu_{j})(\nu_{j}(t)-\nu_{k}(t))}{\prod_{i\in\{j,k\}}\left(1+\frac{\nu_{i}(t)z}{\beta}\right)}.$$

Substitute $J(t,z):=\mathcal{L}[F_{Y(t)}](z)$ and apply the inverse Laplace transform to get

$$\begin{aligned}\frac{\partial}{\partial t}F_{Y(t)}(x)&=\frac{\alpha}{\beta^{2}}(\lambda_{j}-\mu_{j})(\nu_{j}(t)-\nu_{k}(t))\frac{\partial^{2}}{\partial x^{2}}\mathrm{Pr}\left(Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}\leq x\right)\\ &=\frac{\alpha}{\beta^{2}}(\lambda_{j}-\mu_{j})(\nu_{j}(t)-\nu_{k}(t))\frac{\partial}{\partial x}f_{Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}}(x),\end{aligned}$$

where $\psi,\psi^{\prime}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(1,\beta)$ are exponential r.v.'s independent of $Y(t)$, and $f(x)$ denotes the probability density function (PDF). Now $\lambda_{j}>\mu_{j}$ by assumption, and for any $t\in[0,1]$ it holds that $\nu_{j}(t)\geq\nu_{k}(t)$ as well. Thus for each $t\in[0,1]$ the left-hand side $\frac{\partial}{\partial t}F_{Y(t)}(x)$ has the same sign as $\frac{\partial}{\partial x}f_{Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}}(x)$, the derivative of the density function.

It is known that the distribution of any linear combination of Gamma random variables is unimodal (see [10, Thm. 4]), so by the definition of $\overline{x}_{\mathrm{upper}}$ and $\overline{x}_{\mathrm{lower}}$, the density function is increasing on $(0,\overline{x}_{\mathrm{lower}})$ and decreasing on $(\overline{x}_{\mathrm{upper}},\infty)$. Therefore, for any $x$ in either of these regions, $F_{Y(t)}(x)$ is monotonic with respect to $t\in[0,1]$. Since $Y(0)=Q_{\boldsymbol{\mu}}$ and $Y(1)=Q_{\boldsymbol{\lambda}}$, the desired inequalities follow. ∎

A.2 Absolute error majorization theorem

Proof of Theorem 4.

Lemma 2 implies that we can without loss of generality assume that $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$ differ in two coordinates, satisfying $\lambda_{j}^{2}-\mu_{j}^{2}=\mu_{k}^{2}-\lambda_{k}^{2}>0$. For $t\in[0,1]$, define

$$\begin{aligned}\nu_{i}(t)&:=\mathrm{sgn}(\lambda_{i}+\mu_{i})\sqrt{t\lambda_{i}^{2}+(1-t)\mu_{i}^{2}},&&i\in\{j,k\},\\ \nu_{i}(t)&:=\lambda_{i},&&i\neq j,k,\\ Y(t)&:=\sum_{i=1}^{n}\nu_{i}(t)(X_{i}-\alpha/\beta).\end{aligned}$$

Note that $\mathbb{E}[Y(t)]=0$ by design. As for the term $\mathrm{sgn}(\lambda_{i}+\mu_{i})$, Lemma 2 implies that we need only consider cases where $\lambda_{i}$ and $\mu_{i}$ are both nonpositive or both nonnegative. This slightly awkward term therefore ensures that $Y(0)=Q_{\boldsymbol{\mu}}-\mathbb{E}[Q_{\boldsymbol{\mu}}]$ and $Y(1)=Q_{\boldsymbol{\lambda}}-\mathbb{E}[Q_{\boldsymbol{\lambda}}]$. We also note that for $i\in\{j,k\}$ and $\nu_{i}(t)\neq 0$, we have

$$\frac{d}{dt}\nu_{i}(t)=\frac{\lambda_{i}^{2}-\mu_{i}^{2}}{2\nu_{i}(t)}. \tag{13}$$

Our goal is again to show that for certain fixed values of $x$, the CDF $F_{Y(t)}$ is monotonic for $t\in[0,1]$.

Take the Laplace transform of $F_{Y(t)}$ as

$$J(t,z):=\mathcal{L}[F_{Y(t)}](z)=\frac{1}{z}\mathcal{L}[Y(t)](z),$$

where $\mathcal{L}[Y(t)](z):=\mathbb{E}[e^{-zY(t)}]$, the Laplace transform of $Y(t)$, satisfies

$$\mathcal{L}[Y(t)](z)=\prod_{i=1}^{n}e^{z\nu_{i}(t)\alpha/\beta}\left(1+\frac{\nu_{i}(t)z}{\beta}\right)^{-\alpha},$$

defined over the strip

$$z\in\mathbb{C},\quad\frac{-\beta}{\max_{\nu_{i}(t)>0}\nu_{i}(t)}<\Re(z)<\frac{\beta}{\max_{\nu_{i}(t)<0}|\nu_{i}(t)|}. \tag{14}$$

As compared with (11), the extra term $e^{z\nu_{i}(t)\alpha/\beta}$ comes from our using the centered variables $X_{i}-\alpha/\beta$. The region of convergence in (14) is a strip rather than a half-plane because $Y(t)$ may include both positive and negative weights $\nu_{i}(t)$.

Differentiating $J(t,z)$ with respect to $t$ and using (13) yields

$$\begin{aligned}\frac{\partial}{\partial t}J(t,z)&=J(t,z)\frac{\partial}{\partial t}\ln J(t,z)\\ &=J(t,z)\sum_{i\in\{j,k\}}\frac{\partial}{\partial t}\left(z\nu_{i}(t)\frac{\alpha}{\beta}-\alpha\ln\left(1+\frac{\nu_{i}(t)z}{\beta}\right)\right)\\ &=J(t,z)\sum_{i\in\{j,k\}}z\frac{\alpha}{\beta}\left(1-\frac{1}{1+\tfrac{\nu_{i}(t)z}{\beta}}\right)\frac{d}{dt}\nu_{i}(t)\\ &=J(t,z)\,z^{2}\,\frac{\alpha}{2\beta^{2}}\sum_{i\in\{j,k\}}\frac{\lambda_{i}^{2}-\mu_{i}^{2}}{1+\nu_{i}(t)z/\beta}.\end{aligned}$$

Recall that $\lambda_{j}^{2}-\mu_{j}^{2}=\mu_{k}^{2}-\lambda_{k}^{2}>0$, and again use (12) to get

$$\frac{\partial}{\partial t}J(t,z)=J(t,z)\,z^{3}\,\frac{\alpha}{2\beta^{3}}\,\frac{(\lambda_{j}^{2}-\mu_{j}^{2})(\nu_{k}(t)-\nu_{j}(t))}{\left(1+\tfrac{\nu_{j}(t)z}{\beta}\right)\left(1+\tfrac{\nu_{k}(t)z}{\beta}\right)}.$$

Taking the inverse transform then gives

$$\begin{aligned}\frac{\partial}{\partial t}F_{Y(t)}(x)&=\frac{\alpha}{2\beta^{3}}(\lambda_{j}^{2}-\mu_{j}^{2})(\nu_{k}(t)-\nu_{j}(t))\frac{\partial^{3}}{\partial x^{3}}\mathrm{Pr}\left(Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}\leq x\right)\\ &=\frac{\alpha}{2\beta^{3}}(\lambda_{j}^{2}-\mu_{j}^{2})(\nu_{k}(t)-\nu_{j}(t))\frac{\partial^{2}}{\partial x^{2}}f_{Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}}(x),\end{aligned}$$

where $\psi,\psi^{\prime}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(1,\beta)$ are exponential r.v.'s independent of $Y(t)$, and $f(x)$ denotes the probability density function (PDF).

By design, $\lambda_{j}^{2}-\mu_{j}^{2}>0$. Checking the three cases considered in Lemma 2 shows also that $\nu_{k}(t)-\nu_{j}(t)\leq 0$ for $t\in[0,1]$. Thus the sign of $\frac{\partial}{\partial t}F_{Y(t)}(x)$ is always opposite that of $\frac{\partial^{2}}{\partial x^{2}}f_{Y(t)+\nu_{j}(t)\psi+\nu_{k}(t)\psi^{\prime}}(x)$, the convexity of the density function. By the definition of $\hat{x}_{\mathrm{upper}}$, the density function is convex on $(\hat{x}_{\mathrm{upper}},\infty)$, and so for any $x$ in this region, $F_{Y(t)}(x)$ decreases monotonically for $t\in[0,1]$. Since $Y(0)=Q_{\boldsymbol{\mu}}-\mathbb{E}[Q_{\boldsymbol{\mu}}]$ and $Y(1)=Q_{\boldsymbol{\lambda}}-\mathbb{E}[Q_{\boldsymbol{\lambda}}]$, the desired upper-tail inequality follows.

The lower-tail bound follows by symmetry. ∎

A.3 Absolute error tail results

Proof of Theorem 5.

Let $Q_{\boldsymbol{\lambda}}=\sum_{i=1}^{r}\hat{\lambda}X_{i}$, where $X_{i}\overset{\text{i.i.d.}}{\sim}\mathrm{Gamma}(\alpha,\beta)$, $r=\lceil\phi^{2}/(\lambda^{2}\alpha)\rceil$, and $\hat{\lambda}=\frac{\beta\phi}{\sqrt{r\alpha}}$. This is a valid choice of distribution since

$$\operatorname{Var}[Q_{\boldsymbol{\lambda}}]=r\hat{\lambda}^{2}\operatorname{Var}[X_{1}]=r\left(\frac{\beta^{2}\phi^{2}}{r\alpha}\right)\frac{\alpha}{\beta^{2}}=\phi^{2}$$

and

$$\operatorname{scale}(Q_{\boldsymbol{\lambda}})=\frac{\hat{\lambda}}{\beta}=\frac{\phi}{\sqrt{r\alpha}}\leq\frac{\phi}{\sqrt{\phi^{2}/\lambda^{2}}}=\lambda.$$

Then $\mathbb{E}[Q_{\boldsymbol{\lambda}}]=r\hat{\lambda}\frac{\alpha}{\beta}=\phi\sqrt{r\alpha}$, and the r.v. $Q_{\boldsymbol{\lambda}}+\lambda_{1}\psi+\lambda_{2}\psi^{\prime}$ has distribution $\mathrm{Gamma}(r\alpha+2,\beta\hat{\lambda}^{-1})$, or equivalently, $\mathrm{Gamma}(r\alpha+2,\sqrt{r\alpha}/\phi)$.

Now in general, a $\mathrm{Gamma}(\alpha,\beta)$ random variable has density proportional to $x^{\alpha-1}e^{-\beta x}$ for $x\geq 0$ (see (9)). The inflection points are where the second derivative is equal to zero. Since

$$\begin{aligned}\frac{\mathrm{d}^{2}}{\mathrm{d}x^{2}}\,x^{\alpha-1}e^{-\beta x}&=\frac{\mathrm{d}}{\mathrm{d}x}\,\left(-\beta x^{\alpha-1}+(\alpha-1)x^{\alpha-2}\right)e^{-\beta x}\\ &=\left(\beta^{2}x^{\alpha-1}-2\beta(\alpha-1)x^{\alpha-2}+(\alpha-1)(\alpha-2)x^{\alpha-3}\right)e^{-\beta x}\\ &=\left((\beta x)^{2}-2(\alpha-1)(\beta x)+(\alpha-1)(\alpha-2)\right)x^{\alpha-3}e^{-\beta x},\end{aligned}$$

the nontrivial inflection points are the two zeros of the quadratic term. The larger of the two zeros is given by

$$\hat{x}_{+}=\beta^{-1}\,\frac{2(\alpha-1)+\sqrt{4(\alpha-1)^{2}-4(\alpha-1)(\alpha-2)}}{2}=\frac{\alpha-1+\sqrt{\alpha-1}}{\beta}.$$

Substitute for $\alpha$ and $\beta$ the values $r\alpha+2$ and $\sqrt{r\alpha}/\phi$, and subtract $\mathbb{E}[Q_{\boldsymbol{\lambda}}]$ to get

$$\hat{x}_{\mathrm{upper}}\geq\frac{r\alpha+1+\sqrt{r\alpha+1}}{\sqrt{r\alpha}/\phi}-\phi\sqrt{r\alpha}=\phi\,\frac{1+\sqrt{r\alpha+1}}{\sqrt{r\alpha}}. \tag{15}$$

Conjecture 2 is motivated by the fact that the right-hand side of (15) is bounded above as

$$\phi\,\frac{1+\sqrt{r\alpha+1}}{\sqrt{r\alpha}}\leq\phi\,\frac{1+\sqrt{\phi^{2}/\lambda^{2}+1}}{\sqrt{\phi^{2}/\lambda^{2}}}=\lambda+\sqrt{\phi^{2}+\lambda^{2}}.$$
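As a numerical sanity check of the inflection-point formula $\hat{x}_{+}=(\alpha-1+\sqrt{\alpha-1})/\beta$ used above, one can locate the sign changes of the second derivative of the Gamma density by finite differences (my own sketch, not from the paper):

```python
import numpy as np
from scipy import stats

alpha, beta = 8.0, 2.0
x = np.linspace(1e-3, 10, 200_001)
pdf = stats.gamma(alpha, scale=1 / beta).pdf(x)
d2 = np.gradient(np.gradient(pdf, x), x)   # finite-difference second derivative

crossings = x[1:][np.diff(np.sign(d2)) != 0]
print(crossings.max(), (alpha - 1 + np.sqrt(alpha - 1)) / beta)  # both ~4.82
```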

References

  • [1] Alice Cortinovis and Daniel Kressner. On randomized trace estimates for indefinite matrices with an application to determinants. Foundations of Computational Mathematics, 22(3):875–903, 2022.
  • [2] Ethan N. Epperly. Don't use Gaussians in stochastic trace estimation. https://www.ethanepperly.com/index.php/2024/01/28/, Jan 2024. Accessed: 2024-11-01.
  • [3] Ethan N. Epperly, Joel A. Tropp, and Robert J. Webber. XTrace: making the most of every sample in stochastic trace estimation. SIAM Journal on Matrix Analysis and Applications, 45(1):1–23, 2024.
  • [4] Raphael A. Meyer, Cameron Musco, Christopher Musco, and David P. Woodruff. Hutch++: optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
  • [5] David Persson, Alice Cortinovis, and Daniel Kressner. Improved variants of the Hutch++ algorithm for trace estimation. SIAM Journal on Matrix Analysis and Applications, 43(3):1162–1185, 2022.
  • [6] David Persson, Alice Cortinovis, and Daniel Kressner. Improved variants of the Hutch++ algorithm for trace estimation (preprint), 2022.
  • [7] Josip E. Pečarić and Yung Liang Tong. Convex functions, partial orderings, and statistical applications. Academic Press, 1992.
  • [8] Farbod Roosta-Khorasani and Gábor J. Székely. Schur properties of convolutions of gamma random variables. Metrika, 78(8):997–1014, 2015.
  • [9] Farbod Roosta-Khorasani, Gábor J. Székely, and Uri M. Ascher. Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA Journal on Uncertainty Quantification, 3(1):61–90, 2015.
  • [10] Gábor J. Székely and Nail K. Bakirov. Extremal probabilities for Gaussian quadratic forms. Probability Theory and Related Fields, 126(2):184–202, 2003.