University of Oxford
Tel.: +44-1865-615319
Email: [email protected], [email protected]
Stochastic diagonal estimation: probabilistic bounds and an improved algorithm
Abstract
We study the problem of estimating the diagonal of an implicitly given matrix $A$. For such a matrix we have access to an oracle that allows us to evaluate the matrix-vector product $Av$. For a random vector $v$ drawn from an appropriate distribution, this may be used to return an estimate of the diagonal of the matrix $A$. Whilst results exist for probabilistic guarantees relating to the error of estimates of the trace of $A$, no such results have yet been derived for the diagonal. We make two contributions in this regard. We analyse the number of queries required to guarantee that, with probability at least $1-\delta$, the relative error of the estimates of the diagonal entries is at most $\epsilon$. We extend this analysis to the 2-norm of the difference between the estimate and the diagonal of $A$. We prove, discuss and experiment with bounds on the number of queries required to guarantee a probabilistic bound on the estimates of the diagonal by employing Rademacher and Gaussian random variables. Two sufficient upper bounds on the minimum number of query vectors are proved, extending the work of Avron and Toledo [JACM 58(2):8, 2011], and later work of Roosta-Khorasani and Ascher [FoCM 15, 1187-1212, 2015]. We first prove sufficient bounds for diagonal estimation with Gaussian vectors, and then similarly with Rademacher vectors. We find that, generally, there is little difference between the two, with the error for individual diagonal elements converging at the Monte Carlo rate $1/\sqrt{s}$ in the number of queries $s$. However, for small $s$, we find that the Rademacher estimator is superior. These results allow us to extend the ideas of Meyer, Musco, Musco and Woodruff [SOSA, 142-155, 2021], suggesting the algorithm Diag++, to speed up the convergence of diagonal estimation from $\mathcal{O}(1/\epsilon^2)$ to $\mathcal{O}(1/\epsilon)$ queries and to make it robust to the spectrum of any positive semi-definite matrix $A$. We analyse our algorithm to demonstrate state-of-the-art convergence.
Keywords: Monte Carlo · Stochastic estimation · Diagonal · Matrix · Trace estimation

MSC: 65C05 · 68W20 · 60E15

1 Introduction
An ever-present problem in numerical linear algebra relates to estimating the trace of a matrix that may only be accessed through its operation on a given vector. Such a calculation is required in a wide variety of applications, including: matrix norm estimation han2017approximating ; musco2017spectrum , image restoration golub1979generalized ; golub1997generalized , density functional theory bekas2007estimator and log-determinant approximation boutsidis2017randomized ; han2015large . The concepts surrounding diagonal estimation are younger than those of trace estimation, but have recently come of age in multiple applications. The theory, however, is lagging. Originally used for electronic structure calculations bekas2007estimator ; goedecker1999linear ; goedecker1995tight , the ideas developed for diagonal estimation have recently been used by Yao et al. yao2020adahessian in a novel, state-of-the-art machine learning algorithm. They estimate the diagonal of a Hessian matrix to “precondition” descent vectors, for Newton-style descent, in the training of convolutional neural networks. Their method produces exceptional accuracy across a wide array of problems. By understanding the properties of diagonal estimation more fully we may uncover new algorithms or methods to improve even these results.
Such estimation involves a matrix $A$ which is a “black box”. We are given an oracle that may evaluate $Av$ for any vector $v$, and we seek to return the chosen property of $A$ using as few query vectors as possible: with $s$ queries, we desire an accurate estimate. The first exploration into this area was by Hutchinson hutchinson1989stochastic , which resulted in the standard approach to estimate the trace of a matrix:
$$\operatorname{tr}(A) \approx \operatorname{tr}_R^s(A) := \frac{1}{s}\sum_{k=1}^{s} v_k^{T} A v_k. \qquad (1)$$
The $v_k$ are vectors with entries given as independent, identically distributed Rademacher variables: the $i$th entry of $v_k$ takes the value $+1$ or $-1$ with equal probability, with independence across both $i$ and $k$. The subscript $R$ signifies the distribution of the entries of $v_k$ to be Rademacher.
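For concreteness, the following is a minimal sketch of Hutchinson's estimator (1), assuming the implicit matrix is exposed only through a matrix-vector product routine; the function name and interface are illustrative rather than taken from any reference implementation.

```python
import numpy as np

def hutchinson_trace(matvec, n, s, rng=None):
    """Monte Carlo trace estimate: (1/s) * sum_k v_k^T (A v_k), Rademacher v_k."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(s):
        v = rng.choice([-1.0, 1.0], size=n)   # i.i.d. +/-1 entries
        total += v @ matvec(v)                # one oracle call A @ v per query
    return total / s

# Example usage with an explicit matrix standing in for the oracle:
# A = np.random.default_rng(0).standard_normal((500, 500)); A = A @ A.T
# print(hutchinson_trace(lambda x: A @ x, 500, 100), np.trace(A))
```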
Later work expanded on this idea, using Gaussian vectors with i.i.d. standard normal entries, but until the paper of Avron and Toledo avron2011randomized , analysis and comparison of these estimators was limited to explorations of their variance. It is well known hutchinson1989stochastic that the variance of the Hutchinson method is the lowest when compared to other standard methods, and thus it is used most extensively in applications. However, whilst intuition suggests this is a good choice, it is not completely rigorous. The techniques of avron2011randomized address the error bounds in probability directly.
The $(\epsilon,\delta)$ approach
In avron2011randomized the authors propose so-called $(\epsilon,\delta)$ bounds. A bound on the number of queries, $s$, sufficient to achieve a relative error of $\epsilon$ in the estimated trace is obtained. This is a probabilistic guarantee. Formally, for a pair of values $(\epsilon,\delta)$, a sufficient number of queries, $s$, is found such that, for a positive semi-definite matrix $A$,
$$\Pr\!\left(\left|\operatorname{tr}^s_D(A) - \operatorname{tr}(A)\right| \le \epsilon \operatorname{tr}(A)\right) \ge 1-\delta. \qquad (2)$$
The term $\operatorname{tr}^s_D(A)$ is the $s$-query estimator of the trace, using query vectors drawn from the distribution $D$. Supposing $s$ random vectors have been used to query the matrix, we have, with high probability, a relative error that is at most $\epsilon$, where $\epsilon$ is dependent on $s$. The authors conclude that analysis of the variance of the estimators is not the most informative approach for selecting which estimator is likely to return the best results for the fewest queries. This is a key insight.
The diagonal estimator.
Bekas et al. bekas2007estimator extended Hutchinson’s original method to diagonal estimation by considering the Hadamard product of $v_k$ and $Av_k$. The method is different from trace estimation in two regards. Firstly, in trace estimation, we sum the element-wise multiplication terms to get the inner product of $v_k$ and $Av_k$. This step is removed for diagonal estimation. Secondly, the estimation is not necessarily scaled by the number of queries, but by the element-size of the query vectors. This gives
$$D^s = \left[\sum_{k=1}^{s} v_k \odot A v_k\right] \oslash \left[\sum_{k=1}^{s} v_k \odot v_k\right]. \qquad (3)$$
The method returns $D^s$, an estimate of $\operatorname{diag}(A)$, the diagonal of $A$ reshaped as an $n$-dimensional vector. The symbol $\odot$ represents component-wise multiplication of vectors and $\oslash$ represents component-wise division. In bekas2007estimator , the authors focus on employing deterministic vectors for $v_k$; no further treatment of the stochastic estimators is given, except in numerical experiments.
Contributions
In this paper, we proceed to consider the same aims as those of Avron and Toledo avron2011randomized and those of Bekas et al. bekas2007estimator : applying the ideas of avron2011randomized to analyse the stochastic diagonal estimator suggested in bekas2007estimator and shown in equation (3). No paper to date has presented a theoretical bound on the number of queries required to achieve an approximation of the diagonal of a matrix. Specifically, we prove upper bounds on the minimum number of query vectors required for the relative (and “row-dependent”) error of the entries of the diagonal estimate to be at most $\epsilon$, with probability at least $1-\delta$. We extend this result to consider the error across all diagonal entries, and the associated number of queries required for a similar error bound. We consider estimators whose query vectors obey different distributions. This analysis allows us to then suggest and analyse a novel algorithm for speed-up of diagonal estimation. Extending the work in meyer2021hutch++ for trace estimation, we propose an algorithm for diagonal estimation that shifts the query cost from the classic Monte Carlo regime of $\mathcal{O}(1/\epsilon^2)$ to $\mathcal{O}(1/\epsilon)$, a fundamental, state-of-the-art improvement for diagonal estimation. Much of the paper is based on the first author’s MSc dissertation baston2021thesis .
Outline of this work
This paper is outlined as follows. In Section 2 we outline our use of notation, provide a brief overview of previous work, and outline the relation of the Rademacher diagonal estimator to Hutchinson’s trace estimator. In Section 3 we define what is meant by an $(\epsilon,\delta)$-approximator in the context of diagonal estimation, and outline two types of stochastic diagonal estimators, Gaussian and Rademacher. In Section 4 we state our results, proved in the following sections. We prove an upper bound on the minimum number of sufficient queries for an $(\epsilon,\delta)$-approximator using the Gaussian diagonal estimator in Section 5, and then for the Rademacher diagonal estimator in Section 6. Section 7 manipulates and explores these results in an attempt to make them more useful to the practitioner. We perform numerical experiments illustrating the results in Section 8. Where appropriate, a touch of theory is also covered in this section. In Section 9 we use these previous results to adapt the Hutch++ algorithm from meyer2021hutch++ , suggesting and analysing a similar algorithm, Diag++, that enables a speed-up of stochastic diagonal estimation. Section 10 provides concluding remarks.
2 Preliminaries
In this section we outline our use of notation, and provide a brief overview of previous work.
2.1 Notation
For all , denotes the norm and likewise denotes the norm. For a general matrix , denotes the Frobenius norm of the matrix, where, of course, is the element of from the row and the column. In the case of , is the trace of the matrix , with its eigenvalues. Sometimes in this paper we consider the matrix to be symmetric, or symmetric positive semi-definite (PSD). Such a matrix has an eigendecomposition of the form where is the orthogonal matrix of eigenvectors of and is a real-valued diagonal matrix. We write a vector of length containing the eigenvalues of in descending order. Where the differences between the eigenvalues are relatively small, we say a matrix has a “flatter” spectrum. Where the differences between the eigenvalues are larger, we describe the spectrum as “steep” (whilst there is much more nuance to be found in the spectra of matrices, this description serves our purposes sufficiently, without over-complicating the details). For PSD matrices all eigenvalues are non-negative. Where the matrix is diagonalisable we use the corresponding identity for the trace. We denote the optimal rank-$r$ approximation to the matrix. When the matrix is PSD, the eigenvalues of the matrix coincide with its singular values. Thus, by the Eckart-Young-Mirsky theorem eckart1936approximation , the optimal rank-$r$ approximation is obtained from the leading eigenvectors and eigenvalues. Finally we denote the reshaping of the diagonal elements of the matrix to a vector of length $n$. For this vector, in the PSD case the entries are necessarily non-negative.
2.2 Hutchinson’s Method and Previous Work
The classic Monte Carlo method for approximating the trace of an implicit matrix is due to Hutchinson hutchinson1989stochastic , who states and proves the following lemma.
Lemma 1
Let $A$ be an $n \times n$ symmetric matrix. Let $v$ be a random vector whose entries are i.i.d. Rademacher random variables; then $v^T A v$ is an unbiased estimator of $\operatorname{tr}(A)$, that is,
$$\mathbb{E}\left[v^T A v\right] = \operatorname{tr}(A) \qquad (4)$$
and
$$\operatorname{Var}\left[v^T A v\right] = 2\left(\|A\|_F^2 - \sum_{i=1}^{n} A_{ii}^2\right). \qquad (5)$$
Considering the variance term, we can easily see that it captures how much of the matrix’s “energy” (i.e. the Frobenius norm) is located in the off-diagonal elements. Unfortunately, in the general case, this variance may be arbitrarily large. More vexatious is the fact that Lemma 1 does not provide the user with a rigorous bound on the number of matrix-vector queries for a given accuracy. To address this issue, Avron and Toledo avron2011randomized introduced the following definition (Definition 4.1 of avron2011randomized ):
Definition 1
Let $A$ be a symmetric positive semi-definite matrix. A randomised trace estimator $T$ is an $(\epsilon,\delta)$-approximator of $\operatorname{tr}(A)$ if
$$\Pr\left(\left|T - \operatorname{tr}(A)\right| \le \epsilon \operatorname{tr}(A)\right) \ge 1-\delta. \qquad (6)$$
This definition provides a better way to bound the requisite number of queries. It guarantees that the probability of the relative error exceeding $\epsilon$ is at most $\delta$. For example, when estimating the trace according to Hutchinson’s method, it may be shown roosta2015improved that it is sufficient to take the number of queries, $s$, as
$$s \ge 6\,\epsilon^{-2}\ln(2/\delta) \qquad (7)$$
for any PSD matrix $A$ (see Theorem 1, Section 2, roosta2015improved ), whereas if Gaussian vectors are used, the bound on $s$ becomes
$$s \ge 8\,\epsilon^{-2}\ln(2/\delta) \qquad (8)$$
(Theorem 3, Section 3, roosta2015improved ). We note that the recent paper cortinovis2021randomized gives query bounds for indefinite matrices.
Both equations (7) and (8) immediately put us in the Monte Carlo regime, with error converging as $1/\sqrt{s}$ for $s$ queries. Let us step back and remark on the fact that the trace of a matrix is invariant under orthogonal similarity transforms of the matrix. As such, we can intuit that the loose, matrix-independent bounds in equations (7) and (8) are naturally to be expected. Can we say the same of estimating the diagonal? We cannot be sure that the previous intuition applies, as the diagonal of a matrix is not (generally) invariant under orthogonal similarity transformations.
2.3 Approximating the Diagonal of a Matrix
The ideas of Hutchinson are readily extended to approximate the full diagonal, as was first introduced by Bekas et al. bekas2007estimator , and which we outline here. Take a sequence of vectors $v_1,\dots,v_s$, and suppose that their entries follow some suitable distribution, being i.i.d. with expectation zero (because we scale the estimation as in equation (9), it may be shown that the distribution may have arbitrary variance, though unit variance is used for simplicity throughout). Then the diagonal approximation of the matrix $A$, expressed as an $n$-dimensional vector $D^s$, may be taken as
$$D^s = \left[\sum_{k=1}^{s} v_k \odot A v_k\right] \oslash \left[\sum_{k=1}^{s} v_k \odot v_k\right], \qquad (9)$$
where $\odot$ is the component-wise multiplication of vectors (a Hadamard product), and $\oslash$, similarly, the component-wise division of vectors. Let us examine a single component of $D^s$ to gain more insight. Let $D^s_i$ denote the $i$th element of the vector; writing $v_k^{(i)}$ for the $i$th entry of $v_k$, we have
$$D^s_i = A_{ii} + \sum_{j \ne i} A_{ij}\, \frac{\sum_{k=1}^{s} v_k^{(i)} v_k^{(j)}}{\sum_{k=1}^{s} \big(v_k^{(i)}\big)^2}.$$
In expectation, the coefficients of the off-diagonal entries $A_{ij}$, $j \ne i$, shall be zero. This is true provided that the components of the vectors $v_k$ themselves have an expectation of zero, regardless of the form of the denominator (easily shown by the chain rule of expectation).
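As an aside for the practitioner, a minimal sketch of the estimator (9) is given below, assuming the matrix is available only through a matrix-vector product routine. The function name, interface and choice of distributions are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def stochastic_diagonal(matvec, n, s, dist="rademacher", rng=None):
    """Stochastic diagonal estimate of equation (9):
    D^s = [sum_k v_k .* (A v_k)] ./ [sum_k v_k .* v_k]."""
    rng = np.random.default_rng(rng)
    num = np.zeros(n)   # accumulates sum_k v_k .* (A v_k)
    den = np.zeros(n)   # accumulates sum_k v_k .* v_k
    for _ in range(s):
        if dist == "rademacher":
            v = rng.choice([-1.0, 1.0], size=n)
        else:  # "gaussian": i.i.d. standard normal entries
            v = rng.standard_normal(n)
        num += v * matvec(v)
        den += v * v
    return num / den
```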
2.4 Relation to Hutchinson’s method
If the vectors have i.i.d. Rademacher entries, then it is easy to see that the sum of the entries of the vector $D^s$ is nothing but the Hutchinson trace estimate. To show this rigorously we examine the expectation and variance of a single component of $D^s$, when the query vector entries are independent Rademacher variables. We note that this is straightforward to find, and so the results are included only for completeness.
and by the linearity of expectation it is easy to see that
(10) |
Specifically for a single query we are returned equation (4), the first result of Lemma 1. We now find the variance; most easily done by considering a single estimate, . Thus
(11) |
where the vector appearing is a row of $A$ (speaking of a row possessing a 2-norm is intuitive, but formally we are really considering it reshaped as a vector). When considering the variance of the sum of the entries of $D^s$, we must also find the covariance of the terms between two elements. For two distinct indices, we have
and since and , along with the fact we are examining , only one term is returned: when and . All other terms vanish as a result of independence and zero expectation. This leaves
Of course, if is symmetric, this is . Thus, for such symmetric , we have
(12) |
So we are returned equation (5), the second result of Lemma 1 and the relationship of trace and diagonal estimation is clear.
The dependence of the elements upon one another threatens to confound any further analysis. Thus, to circumvent this, we first examine individual elements of the diagonal estimator, . When treating all elements simultaneously, we employ general bounds and inequalities that, irrespective of dependence, may be applied. In some instances, this can result in looser bounds than might be necessary in practice. Therefore our results ought to be treated as a guide, rather than exact truths. We are left with two main questions to answer:
1. How many query vectors are needed to bound the error of the stochastic diagonal estimator in (9), and how does this change with the distribution of vector entries?
2. How does matrix structure affect estimation, and how might we improve worst case scenarios?
3 Definitions and Estimators
3.1 Definitions
We state precisely what is meant by an $(\epsilon,\delta)$-approximator in the context of diagonal estimation: for reasons which will become apparent, we introduce two such definitions. We also define what is meant by a diagonal estimator when using vectors of a given distribution.
3.1.1 $(\epsilon,\delta)$-approximators
Definition 2
The $i$th component, $D^s_i$, of a diagonal estimator is said to be a row-dependent $(\epsilon,\delta)$-approximator if
where is the row of the matrix , reshaped as a vector.
The error bound in this definition may be tightened to be dependent only upon $A_{ii}$, giving Definition 3 for examination of the relative error:
Definition 3
The $i$th component, $D^s_i$, of a diagonal estimator is said to be a relative $(\epsilon,\delta)$-approximator if
3.1.2 Two estimators
Recall that all estimators follow the same simple pattern, as in equation (9): some random vector is drawn from a fixed distribution, and a normalised Hadamard product is used to estimate the diagonal. The process is repeated $s$ times, using i.i.d. queries.
The first estimator uses vectors whose entries are standard Gaussian (normal) variables.
Definition 4
A Gaussian diagonal estimator $D^s_G$ for any square matrix $A$ is given by
$$D^s_G = \left[\sum_{k=1}^{s} v_k \odot A v_k\right] \oslash \left[\sum_{k=1}^{s} v_k \odot v_k\right], \qquad (13)$$
where the vectors $v_k$ are $s$ independent random vectors whose entries are i.i.d. standard normal variables, $\mathcal{N}(0,1)$.
Note that the Gaussian estimator does not constrain the 2-norm of the vectors; so any single estimate may be arbitrarily large or small.
In contrast, the Rademacher diagonal estimator constrains the query vectors to have a fixed 2-norm, and results in a fixed denominator in the estimator equation.
Definition 5
A Rademacher diagonal estimator $D^s_R$ for any square matrix $A$ is given by
$$D^s_R = \frac{1}{s}\sum_{k=1}^{s} v_k \odot A v_k, \qquad (14)$$
where the vectors $v_k$ are $s$ independent random vectors whose entries are i.i.d. Rademacher variables: each entry takes the value $+1$ or $-1$ with probability $1/2$.
It is worth noting that, unlike the trace estimators, the diagonal estimators above return different values when applied to instead of . It is possible that some (but never all, as our results indicate) diagonal elements are estimated more accurately by working with .
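To make the two definitions concrete, the following snippet compares the estimators on a small explicit PSD matrix, reusing the illustrative `stochastic_diagonal` sketch from Section 2.3; the matrix, seeds and query counts are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
B = rng.standard_normal((n, n))
A = B @ B.T                       # a PSD test matrix standing in for the oracle
matvec = lambda x: A @ x
true_diag = np.diag(A)

for s in (1, 10, 100):
    d_rad = stochastic_diagonal(matvec, n, s, dist="rademacher", rng=1)
    d_gau = stochastic_diagonal(matvec, n, s, dist="gaussian", rng=1)
    rel = lambda d: np.linalg.norm(d - true_diag) / np.linalg.norm(true_diag)
    print(f"s = {s:3d}: Rademacher {rel(d_rad):.3f}, Gaussian {rel(d_gau):.3f}")
```

With a single query the Gaussian error can occasionally be very large, foreshadowing the unbounded-variance discussion of Section 8.2.1.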
4 Comparing the Quality of Estimators
Bekas et al. bekas2007estimator sought to replace stochastic estimation by taking the vectors to be deterministic, as empirically they found that stochastic estimation could be slow. Sections 5 and 6 show why. Table 1 summarises our results.
| Query vector distribution | Rademacher | Gaussian |
|---|---|---|
| Sufficient query bound for a row-dependent $(\epsilon,\delta)$-approximator | equation (32) | equation (15) |
| Sufficient query bound for a relative $(\epsilon,\delta)$-approximator | equation (40) | equation (26) |
| Permitted $\delta$ values | arbitrary | restricted (see Theorem 5.1) |
5 Gaussian diagonal estimator
We now prove an $(\epsilon,\delta)$ bound for the Gaussian diagonal estimator. We find a sufficient minimum number of queries $s$ for a row-dependent bound to hold.
Theorem 5.1
Let be the Gaussian diagonal estimator (13) of , then it is sufficient to take as
(15) |
with , such that the estimator satisfies
for , where is the row of the matrix, reshaped as a vector.
The proof of this theorem is best read as three sequential steps: rotational invariance, conditioned integration and function bounding. We caution the reader that an auxiliary vector is introduced in the course of the proof, with its own component notation. This notation is an unfortunate, yet unavoidable, consequence of the proof.
Proof
Step 1: Rotational invariance
The first trick is to rewrite the coefficient of each , in
(16) |
Let us denote a new vector, composed of the $i$th components of each query vector (we use separate notation so as not to confuse ourselves with the query vectors), such that the coefficient may be rewritten. Thus
Namely the coefficient of each , becomes
(17) |
for any orthogonal rotation , . The key insight is to exploit the rotational invariance of . Hence choose such that
So the coefficients (17) become, for all ,
since is orthogonal, so . Note that the unit direction of a standard normal multivariate vector and its 2-norm are independent, implying that and thus are independent of . Thus we may write (16) as
Let us denote as a vector containing the elements of the row of , excluding . Also let be a vector containing the independent Gaussian entries. That is . So we have
The second trick is to consider
(18) |
Once again, we use the rotational invariance of the distribution of . The distribution of is equal to , where now, and rotates to . Hence we have that the distribution of is equal to some , giving
(19) |
where clearly the numerator and denominator are independent (this is a scaled F-distribution).
Step 2: Conditioned integration
Recall that and so , is a chi-squared distribution of degree , and, of course, , is a chi-squared distribution of degree 1. Thus the proof reduces to bounding on the right hand side of equation (19). This is most easily examined by investigating the tail bound,
(20) |
where, for ease of notation, we have let . The next key insight is to approach this by conditioning the probability. Thus we have,
(21) |
where is the probability density function of the variable. So we have
which, introducing the pdf, gives
Let , such that . Thus substitution yields
Step 3: Function bounding
Note that since and , and consider that
for . Therefore we can bound by
(22) |
Naively (we do not deal with optimal here), setting , and bounding the above by gives
(23) |
thus we have
(24) |
as a sufficient bound for , such that, for all such
(25) |
which completes the proof.
Implications
Theorem 5.1 deals only with a single entry of the diagonal. Concretely, we have found that
for . This means is a row-dependent -approximator for all such . For a relative -approximator, we must substitute
to get
provided
(26) |
For the entire diagonal, if we demand we may apply the ideas from the proof of Theorem 5.1, to obtain the number of vectors required for this condition to hold.
Corollary 1
For
with it is sufficient to take
query vectors if vector entries are i.i.d Gaussian, where is the row of the matrix A, and is the component of the Gaussian diagonal estimator.
Proof
Taking the summation over , Corollary 1 means we have, with probability at least
(28) |
if we have as in equation (27). To make this bound a relative estimate, in comparison to , we write
to get
(29) |
provided that the number of query vectors is
(30) |
In summary, we have found that to get a relative 2-norm estimate of the entire diagonal with error at most , and probability at least , it is sufficient to use the queries as in equation (30), if Gaussian query vectors are used.
6 Rademacher diagonal estimator
This section deals with the proof of an bound for the Rademacher diagonal estimator. We choose to break this down into two main steps, the first of which is given in Lemma 2. The lemma concerns the expectation of an exponential, which will subsequently be used in a Chernoff-style argument. One may regard it simply as a slight tweak on Khintchine’s lemma, but introducing an additional summation term and so we outline it fully for completeness. Before continuing, we note that examining the Rademacher estimator element-wise we have
(31) |
where we have rewritten the product of the two query-vector entries as a single Rademacher variable (it is easy to verify that a Rademacher variable times an independent Rademacher variable is also a Rademacher variable). We now introduce the following lemma.
Lemma 2
For
we have that
where is a vector containing the elements of the row of A, excluding the element.
Proof
Note that is simply
Consider now
which completes the proof.
With this lemma in place, we can now prove an bound for the Rademacher diagonal estimator. Specifically, we find the minimum sufficient number of queries , for a row-dependent bound to hold.
Theorem 6.1
Let be the Rademacher diagonal estimator (14) of , then it is sufficient to take as
(32) |
with arbitrary , such that the estimator satisfies
where is the row of the matrix reshaped as a vector.
Proof
The problem may be reduced to finding the tail probability, . Investigating the positive tail probability we have that
(33) |
This is minimised for an appropriate choice of parameter, so we obtain (we can also obtain (34) from a direct application of Hoeffding’s bound for the sum of unbounded random variables (wainwright2019high , Prop. 2.5))
(34) |
Since we are considering , we need also to bound the negative tail
It is also easily shown, by Lemma 2, that
which means that, just as for the positive tail
(35) |
Or indeed, we may easily deduce the above by considering the symmetry of the Rademacher variables in Lemma 2.
Implications
The above result is for a single diagonal entry. We have found that
so that $D^s_i$ is a row-dependent $(\epsilon,\delta)$-approximator for such $s$. For a relative $(\epsilon,\delta)$-approximator we must substitute (making the corresponding substitution in equation (33) yields the same result as in equation (40))
to get
(39) |
provided
(40) |
For the whole diagonal, and in much the same way as when considering the Gaussian case, if we demand , we need simply apply the ideas from the proof of Theorem 6.1, to obtain the number of vectors required for this condition to hold.
Corollary 2
For
with arbitrary it is sufficient to take
(41) |
query vectors if vector entries are i.i.d Rademacher, where is the row of the matrix A, and is the component of the Rademacher diagonal estimator.
Proof
Thus taking the summation over , we have, with probability
(43) |
provided we have used -queries as in (42). If we seek a relative error, in comparison to , let
to get
(44) |
provided
(45) |
Overall, we have found that to get a relative 2-norm estimate of the entire diagonal with error at most , and probability at least , it is sufficient to use the queries as in equation (45), if we use Rademacher query vectors.
A brief comparison
Comparing (15) with (32) (or their whole-diagonal counterparts), we see that both the Rademacher estimator and the Gaussian estimator display strikingly similar criteria for the selection of $s$. Whilst the Gaussian estimator appears slightly worse than its Rademacher counterpart, this difference is, in general terms, very small. A more insidious point to note relates to the demands made in the Gaussian case: a condition was required for equation (22) to hold. Whilst seemingly a minor detail, this becomes very important if we are to use a small number of query vectors $s$, say, on the order of just a few. In Section 8 we demonstrate the implications of this assumption, and offer intuition as to why, when $s$ is small, the Gaussian diagonal estimator performs worse than the Rademacher diagonal estimator.
7 Positive-semi-definite diagonal estimation
We now examine the case where , as is relevant in many applications braverman2020schatten ; chen2016accurately ; cohen2018approximating ; di2016efficient ; han2017approximating ; li2020well ; lin2016approximating ; ubaru2017applications , and analysed extensively in avron2011randomized ; roosta2015improved . For both estimators we have
(46) |
with probability, , provided we have used queries and for given . As before, is the row of the matrix (and therefore the column by symmetry). There is no escaping the dependence of the absolute error on the matrix structure, and, as we have also seen, shifting to a relative error framework means that the number of sufficient queries becomes dependent on the matrix. Can we draw any other general conclusions about the dependence of the number of queries on the structure of the matrix? Might they fit with our intuitions, and indicate when stochastic estimation may be suitable? We find, as expected, that both flatter spectra lend themselves to diagonal estimation and that the eigenvector basis of the matrix plays a key role. These results are included for the practitioner, who may have access to some of the quantities explored in the following sections (e.g. ), or may be able to simulate other quantities in particular scenarios (e.g. ).
7.1 Spectrum influence - element-wise
To make a start let us consider the following lemma. Recall that is the largest eigenvalue.
Lemma 3
For all we have that .
Proof
We may express the column as
where is the eigenvector of and so we also have that . Thus
where the second equality holds by orthogonality, and the inequality by positive-semi-definiteness.
Remark 1
For , where , then may take the value . When does take this value, by Lemma 3 we have that and thus all diagonal zeroes will be found exactly by either the Rademacher or Gaussian estimator.
Now suppose , such that , then by equation (46) and Lemma 3, for sufficient we have
(47) |
Then, setting , we shall return to a relative -approximator provided
(48) |
Extending this to all diagonal elements, we have the following lemma.
Lemma 4
For
we have that
(49) |
is a sufficient number of queries for both the Rademacher or Gaussian estimator.
Of course, this analysis produces a weaker sufficiency bound on $s$ than either of those in Table 1 (due to the inequalities applied, we expect to arrive at sufficiency before the value in (49)), but it is more likely to be practically computable (e.g. via a few steps of the Lanczos algorithm) or perhaps even known for implicit $A$.
Lemma 4 suggests roughly that the conditioning of the matrix indicates its ability to be estimated. For a given , if is larger, error estimates are likely to be worse, but as approaches 1 they become better and better. This is intuitive, since if , then the matrix shall be a scalar multiple of the identity: for some scalar . Hence the lower bound on becomes and exact estimates are found after a single query. More generally, if , then the bounds (48) and (49) are good, and stochastic estimation is suitable to estimate the diagonal. Conversely, if or greater, then clearly this bound is no longer helpful since we seek queries.
However, if the bound of Lemma 3 is loose, and line three of equation (47) is loose, then the bounds (48) and (49) are too pessimistic. To partly rectify this, and extend these ideas for general , we stop at the second line of (47) to write, with
where is best considered as the “conditioning” of a diagonal element. Note again that in this analysis we are considering , since we already know that if it will be found exactly. Let us define
(50) |
as the conditioning of the diagonal. So we have,
and the same ideas as for may be applied. In particular, note that we shall obtain a relative -approximator if, with
(51) |
and, extending this to the diagonal elements, we readily obtain the following lemma.
Lemma 5
For
we have that
(52) |
is a sufficient number of queries for both Rademacher and Gaussian estimators.
So if then stochastic estimation is suitable to estimate the diagonal of such a matrix. Since , this demand is easier to satisfy than demanding that .
7.2 Spectrum influence - the entire diagonal
So far we have built our understanding by considering single element estimates, and examining their implications for the whole diagonal. We can gain more information by examining the inequality
(53) |
since this accounts for the entire matrix structure. This holds, with probability at least if . Recall that, for , , so we have
(54) |
Now, we also have for , that
which means that
(55) |
and we can once again see the influence of the eigenvalues on the error of our estimates.
This equation, however, is dependent on the entire spectrum, rather than only on the extreme eigenvalues as was previously found. As expected, the conditioning of a matrix does not tell the whole story. The condition number of the matrix in question may be large, yet equation (55) may still suggest that the scaling of the absolute error will be acceptable. Clearly, a flat spectrum, such as that corresponding to the identity, yields a right hand side of zero. Meanwhile, the value is greatest when mass is concentrated on a single eigenvalue.
For completeness, note that the reverse is true of the mass of the diagonal, since
and so
(56) |
which is worse as diagonal elements become flatter but better as most diagonal mass is concentrated on a handful of entries.
7.3 Eigenvector influence
Equation (56) hints at the notion that it is not only the spectrum of that influences the error of diagonal estimation. Indeed, if we have a very steep spectrum, but the eigenvectors of are columns of the identity, that is , then we shall still have a diagonal matrix and so expect one-shot estimation. Consider the following.
Lemma 6
Let be a diagonalisable matrix, , with , and , then
where , is the Hadamard (element-wise) product of with itself.
Proof
The above is easily found by recalling
and simply extending to the matrix equation.
Hence, Lemma 6 and equation (54) imply, for ,
If we write
so we may readily obtain the following lemma.
Lemma 7
For
we have that
(57) |
is a sufficient number of queries.
This returns our prediction for one-shot estimation when the eigenvector matrix is the identity. In this case the bound (57) is zero, so a single query is sufficient to recover the diagonal exactly. Further analysis of the relevant quantity is complex and outside the scope of this paper. We point the interested reader to ando1987singular for further investigation.
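As a quick sanity check of this one-shot prediction, the following snippet (with an arbitrary diagonal test matrix) verifies that a single Rademacher query recovers the diagonal of a diagonal matrix exactly.

```python
import numpy as np

# If A is diagonal, a single Rademacher query recovers diag(A) exactly,
# since v .* (A v) = diag(A) .* v .* v = diag(A) and v .* v is the all-ones vector.
rng = np.random.default_rng(0)
n = 5
A = np.diag(rng.uniform(0.0, 10.0, size=n))
v = rng.choice([-1.0, 1.0], size=n)
estimate = (v * (A @ v)) / (v * v)
print(np.allclose(estimate, np.diag(A)))   # True
```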
7.4 Links with Hutchinson’s estimator variance
Lastly, once again let us turn our attention to the quantity:
(58) |
and recall that, but for a factor of 2, this is in fact the variance of Hutchinson’s estimator for the trace. A variety of papers suggest particular variance reduction schemes for Hutchinson’s estimator, focused on reducing (58). Certain approaches take advantage of the sparsity of the matrix stathopoulos2013hierarchical ; tang2011domain (by a process often called probing), whilst others still take a “decomposition” approach adams2018estimating . Perhaps the most promising approach in this case is that taken by Gambhir et al. gambhir2017deflation , Meyer et al. meyer2021hutch++ and Lin lin2017randomized . In these papers the approach is to form a decomposition by approximate projection onto the top eigenspace of $A$, using a matrix $Q$ with orthonormal columns designed to capture most of the column space of $A$. This matrix roughly spans the top eigenspace of $A$. In particular, gambhir2017deflation and lin2017randomized justify this approach for estimating the trace when $A$ is low rank, as the projection will capture most of the trace of $A$. A moment’s thought suggests that this method may be adapted to estimate the whole diagonal, since the projected matrices (for instance $QQ^TA$ or $QQ^TAQQ^T$) are randomised low-rank approximations to the matrix halko2011finding . We might do well to find the exact diagonal of such a projected part and then estimate the diagonal of the remainder of the matrix. We expand further on this idea in Section 9, extending the results and analysis of meyer2021hutch++ .
8 Numerical Experiments
8.1 Single element estimates
We first examine the Rademacher diagonal estimator and explore the implications of equations (39) and (40), since the sufficient query bound for such an estimator is likely the tightest derived thus far. Moreover, the value of $\delta$ in Theorem 6.1 is arbitrary, so any theoretical bound ought to hold for all values of $s$, at all values of $\delta$. For different diagonal elements we expect the same rates of convergence, but different scaling, according to the value of the corresponding row norm for each element.
This is exactly what is found in practice, as shown in Figure 1. This figure shows the relative convergence of estimates of selected diagonal elements of the real world matrix “Boeing msc10480” from the SuiteSparse matrix collection (formerly the University of Florida sparse matrix collection) davis2011university . These elements were chosen to be displayed as they have large differences in their row-dependent scaling, and so are most easily resolved in the plots of relative error. As expected, the larger this quantity, the greater the error after a given number of queries.
For a fixed value of and a given matrix entry, the ratio of the numerical error and theoretical bound is roughly fixed - the convergence of the numerical experiments mirrors that of the theoretical convergence bound. Thus the theoretical bound is tight in , as numerical error also scales with . In addition, for decreasing , the bound to observation ratio can be seen to decrease, suggesting that the query bound is tight in , since we would have otherwise expected it to grow.
[Figure 1: relative error against number of queries for Rademacher estimates of selected diagonal elements of the Boeing msc10480 matrix, together with the theoretical bounds.]
8.2 Rademacher vs Gaussian element estimates
| Query vector distribution | Rademacher | Gaussian |
|---|---|---|
| Sufficient query bound* | equation (32) | equation (15) |
| Demand on $\delta$ | none | see Theorem 5.1 |
[Figure 2: comparison of Rademacher and Gaussian diagonal element estimates on the Boeing/msc10480 matrix, with the corresponding theoretical bounds.]
Table 2 summarises sufficient query bounds for a row-dependent $(\epsilon,\delta)$-approximator for ease of reference. For a relative $(\epsilon,\delta)$-approximator, we simply introduce the matrix constant in each bound. The bound for the Rademacher diagonal estimator is lower than that for the Gaussian diagonal estimator, so we expect to be able to reach a lower error faster with the Rademacher estimator. Note, however, that since we have not tuned the proof of Theorem 5.1 for an optimal choice of parameter (see equation (23)), the query bound might indeed be lower, but not by much. Experiments suggest that (23) is quite tight unless $s$ is small.
Figure 2 illustrates this comparison, again on the matrix “Boeing/msc10480”. For the median experiments, , the Gaussian theoretical bound contains the corresponding experimental error, but for the percentile it seems as if this is not the case. In this instance, where , the lowest value may take is when . Hence
(59) |
so we can only expect the theoretical bound to be valid for sufficiently many queries. If only one Gaussian query is made the relative error appears not to be bounded by the theoretical limit, but this is no surprise, since Theorem 5.1 poses the question: for given $\epsilon$ and $\delta$, what is $s$ bounded by? Whereas experimentation shows us, for given $s$ and $\delta$, the value $\epsilon$ will take (note that the plots are of relative error, but in the context of Theorem 5.1 we refer to row-dependent $(\epsilon,\delta)$-approximators). The theorem gives us a tool to find $s$, but in return asks that a quantity be bounded from above (for equation (22) to be valid). The experiment, meanwhile, takes an input of $s$ and $\delta$ and returns $\epsilon$.
Whilst the theoretical bound in the right hand plot of Figure 2 only kicks in at , we have drawn it extending back to to illustrate the fact that it does not apply for small enough. Meanwhile, as noted, the Rademacher bound holds for arbitrary . The effect is further manifested by the fact that with , the Rademacher estimator is more accurate for sufficiently small , roughly .
8.2.1 Intuitions for performance with small
To gain more intuition for why the Rademacher diagonal estimator is better than the Gaussian diagonal estimator for small , it is useful to compare the variances of individual elements for the two estimators. From equation (11), the variance of the Radamacher estimate is
(60) |
whilst in the Gaussian case for , we find
(61) |
but for , the variance is not bounded. For a given , this is found in the following manner
where and are as in the proof of Theorem 5.1. Referring back to this proof again, we already know that the distribution of is the same as for some independent of . This gives
where the relevant ratio follows an F-distribution with the appropriate parameters. The expectation of such a variable is readily found for sufficiently large $s$, and otherwise is infinite, yielding the result of equation (61). Thus we expect the estimators to be similarly behaved for sufficiently large $s$ but not if $s$ is small, in which case we expect that the Rademacher estimator shall be the better of the two, as the numerical results indicate.
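As an illustrative check of this point, the following snippet (with an arbitrary dense PSD test matrix and hypothetical choices of element index and sample size) compares the empirical spread of single-query ($s=1$) estimates of one diagonal element under the two distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, i, trials = 100, 0, 20000
B = rng.standard_normal((n, n))
A = B @ B.T                                    # PSD test matrix

def one_query_estimate(dist):
    """i-th component of the s = 1 estimator: (v .* (A v))_i / (v .* v)_i."""
    v = rng.choice([-1.0, 1.0], size=n) if dist == "rademacher" else rng.standard_normal(n)
    return (v * (A @ v))[i] / (v * v)[i]

for dist in ("rademacher", "gaussian"):
    samples = np.array([one_query_estimate(dist) for _ in range(trials)])
    # The Rademacher sample variance settles near a fixed value; the Gaussian one
    # is typically far larger and unstable across runs, reflecting its heavy tails.
    print(dist, np.var(samples))
```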
8.2.2 Implications for related work
This has implications for the recent paper of Yao et al. yao2020adahessian . The authors estimate the diagonal of the Hessian of a loss function in machine learning tasks, implicit from back-propagation. Only a single Rademacher query vector is used due to computational restrictions. These results suggest it would be unwise to switch to a single Gaussian query vector as in Definition 4, since such an estimate would have unbounded variance.
We remark further that our results, particularly Theorem 6.1 and Corollary 2, suggest the algorithm may greatly benefit from taking further estimates of the diagonal. Indeed the authors indicate yao2020video that the noise the algorithm experiences is dominated by the error of estimation of the diagonal. It is now clear why.
8.3 Full diagonal estimates
Figure 3 shows the convergence of the Rademacher and Gaussian diagonal estimators to the full diagonal. Note that, in contrast to the experiments for single elements of the diagonal, the theoretical bounds appear looser. This is expected, since in arriving at the theoretical bounds for our complete diagonal estimates we have used multiple inequalities for single elements. Therefore the query bounds will be loose. Use of a union bound in Corollaries 1 and 2 shall further compound this loosening. Note once again that, in the Gaussian case, the theoretical bound holds only for sufficiently large $s$, as expected. The effects seen in the error of single diagonal element estimates carry over to the full diagonal, but are even more pronounced. For a handful of queries the Rademacher estimator outperforms its Gaussian counterpart, but as a greater number of queries are used the convergence behaviours become increasingly similar. As before, with the Rademacher estimator the ratio of the theoretical bound to the experimental error is roughly constant for each $s$, so the convergence is of the order of $1/\sqrt{s}$, just as for single elements. In the Gaussian case, for large enough $s$, the same may be said.
[Figure 3: convergence of the Rademacher and Gaussian estimators to the full diagonal, with theoretical bounds.]
8.4 Experiments on Synthetic Matrices.
We also test the Rademacher estimator on random matrices with power law spectra. We take $\Lambda$ to be diagonal with $\Lambda_{ii} = i^{-c}$, for varying $c$. We generate a random orthogonal matrix $U$ by orthogonalising a Gaussian matrix, and set $A = U\Lambda U^T$. A larger value of $c$ results in a more quickly decaying spectrum, meaning that the quantity in equation (55) will be larger, so we expect worse estimates for a given number of queries. This is exactly as found empirically, as shown in Figure 4. In particular, for the steeper spectra, the error of our estimates is still woeful even after many query vectors! This does not bode well for stochastic diagonal estimation of general dense matrices with sharply decaying spectra. Whilst we still have convergence, the error starts off too poorly for this to be of much use.
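A sketch of how such synthetic test matrices can be generated is given below; the eigenvalue form $\lambda_i = i^{-c}$ matches the power-law description above, and the function name is illustrative.

```python
import numpy as np

def power_law_psd(n, c, rng=None):
    """Random PSD test matrix A = U diag(lambda) U^T with lambda_i = i**(-c),
    U a random orthogonal matrix obtained by orthogonalising a Gaussian matrix."""
    rng = np.random.default_rng(rng)
    lam = np.arange(1, n + 1, dtype=float) ** (-c)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
    return (U * lam) @ U.T                             # equals U @ np.diag(lam) @ U.T
```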
8.5 Summary
At the start of this section, we set out to find an answer for the number of queries needed for a given error of estimation. We have seen that, for fixed $\delta$, the error goes as $1/\sqrt{s}$, and that it scales with the relevant row norm. We have seen that the Rademacher diagonal estimator has both a better theoretical bound than its Gaussian counterpart and lower variance for each element. This is in contrast to the conclusions of Avron and Toledo avron2011randomized , who found that Gaussian trace estimators were better than Hutchinson’s Rademacher estimator. However, there is no real contention here. Summing the components of the Gaussian diagonal estimator is not the same as the Gaussian estimator of the trace examined in avron2011randomized . The difference is merely an artifact of the denominator introduced in (13).
We have also seen that positive semi-definite matrices with flatter spectra are suitable for estimation in this way, but as the spectrum of a matrix becomes steeper, estimates become worse. Perhaps this may spell the end of stochastic diagonal estimation. Perhaps it may restrict it to a specific class of matrices, in contexts where the spectrum of the matrix is known to be close to flat due to properties of the system under investigation. All things considered, it so far seems that, at best, this form of estimation may only be useful under special circumstances. Might there be a way to improve on this poor performance for general, dense matrices with rapidly decaying spectra? This is the subject of Section 9. Since it is found to be the best, we take forward only the Rademacher diagonal estimator.
9 Improved diagonal estimation
Is it possible to improve stochastic diagonal estimation, particularly in the worst case scenarios of the previous section? This section seeks to address this by extending the ideas of meyer2021hutch++ , adapting the algorithm therein for full diagonal estimation. For positive semi-definite $A$, we propose an $\mathcal{O}(1/\epsilon^2)$ to $\mathcal{O}(1/\epsilon)$ improvement in the number of required query vectors and illustrate the robustness of our algorithm to matrix spectra.
9.1 Relation to Hutch++
We briefly give an overview of the work of Meyer, Musco, Musco and Woodruff meyer2021hutch++ . In their paper, they motivate their algorithm, “Hutch++” as a quadratic improvement in trace estimation. Using randomised projection to approximately project off the top eigenvalues, they then approximate the trace of the remainder of the matrix. All of the error results from this remainder.
They arrive at their result in the following manner, first introducing the following lemma (Lemma 2, Section 2, meyer2021hutch++ ) for a general square matrix .
Lemma 8
For . Let be the Hutchinson estimate of the trace. For fixed constants , if , then with probability
Thus, if then, with probability , .
So, for , we simply have , to get a relative trace approximator. Note, this bound is only tight when . The authors then give the following lemma (see Lemma 3, Section 3, meyer2021hutch++ ).
Lemma 9
Let be the best rank-r approximation to PSD matrix . Then
Proof
Since ,
They argue that this result immediately lends itself to the possibility of an algorithm with complexity. Setting and splitting , the first term may be computed exactly if (the top eigenvectors) is known, since . Thus the error in the estimate of the trace is contained in our estimate of . So (roughly) combining Lemmas 8 and 9, the error becomes, for their trace estimator:
(62) |
Hence for fixed is sufficient for an -approximator to the trace of .
Of course, the top eigenvectors cannot be computed exactly in a handful of matrix-vector queries, but this is easily resolved with standard tools from randomised linear algebra halko2011finding . Namely, a modest number of queries is sufficient to find an approximation with the required property with high probability, which is all that is required in the second line of (62), and is how the algorithm is implemented in practice. For a full, formal treatment of these ideas see meyer2021hutch++ , Theorem 4, Section 3. We elaborate on this in the next section.
9.2 Diag++
Here we outline how the above ideas may be applied to estimation of the entire diagonal, using queries. Recall once again, for
(63) |
it is sufficient to take
(64) |
We may now engage in a trade-off. For fixed and , we swap convergence for convergence. This comes at the (potentially minor) expense of a greater multiplicative factor in our query bound. Just as in Hutch++, we first motivate this supposing we know , the top eigenvectors, exactly.
Supposing we have projected off the -rank approximation and extracted its diagonal, , we are left to estimate the diagonal of
Denoting as the diagonal estimate of this, we have
(65) |
with probability, , if we have used a sufficient number of queries: for the Rademacher diagonal estimator. Ignoring the term, and employing the result of Lemma 9, yields
Letting
gives
(66) |
still with probability , if we have used sufficient . Note that this is not a relative estimation of the diagonal of ; rather the error is qualified relative to , since this is, overall, the error that we are interested in. Explicitly, for the Rademacher diagonal estimator, this results in sufficiency if
(67) |
and we have arrived at $\mathcal{O}(1/\epsilon)$ convergence, unlike all previous estimates! We remark that when (67) is tight, this is a dramatic improvement! In certain cases the result of (67) is very favourable (such a case may be found for the “Model Hamiltonian” in Section 4 of bekas2007estimator ). The worst case scenario (when the diagonal entries take uniform values) yields
(68) |
and we observe the introduction of a potentially expensive constant, but one which, for moderate and will nonetheless give for large .
Once again, we have motivated the above for exact -rank approximation, but, as in Hutch++, the same tools from randomised linear algebra halko2011finding mean that we can approximate using queries. Specifically queries are sufficient to find such that
(69) |
In detail (see musco2020projection , Corollary 7 and Claim 1), provided vectors have been used to form , then, with probability
and since
so equation (69) holds for . Thus, overall it is sufficient to take queries where clearly the first term dominates. Applying the same ideas as in equations (65) to (67), for the randomised Rademacher case we readily arrive at
(70) |
for some constant, is sufficient to approximate the diagonal to within the stated error, with probability $1-\delta$. Of course, we remark that this might be a very pessimistic bound. We have ignored certain terms and employed the potentially loose result of Lemma 9, so cannot necessarily expect this bound to be good. However, if we have a steep spectrum (such an example may be found for the “Graph Estrada Index” in Section 5.2 of meyer2021hutch++ ) satisfying the conditions above, then the bound on the number of queries should result in Diag++ outperforming simple stochastic estimation. Concretely, the algorithm for Diag++ with Rademacher query vectors is as follows:
Input: Implicit-matrix oracle, for . Number of queries .
Output: Diag++(): Approximation to diag()
for with Rademacher entries
In total, the required matrix-vector multiplications with $A$ are split evenly between the three stages: forming the sketch, computing the exact diagonal of the projected part, and stochastically estimating the diagonal of the remainder. (A recent preprint persson2021improved suggests that one can often improve the estimate by adaptively choosing the number of queries used for each part, instead of the equal split used in Hutch++. It would be of interest to explore such an improvement for diagonal estimation.) Computing the exact diagonal of the projected part requires extra flops, but this cost is dominated by that of the matrix-vector multiplications.
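The following is a minimal sketch of how such an algorithm can be realised within this query budget. The function name, the `matvec` interface, the equal three-way split of an assumed total budget `m`, and the choice of the one-sided deflation $\operatorname{diag}(A) = \operatorname{diag}(QQ^TA) + \operatorname{diag}(A - QQ^TA)$ (one of the options discussed in Section 7.4) are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np

def diag_pp(matvec, n, m, rng=None):
    """Sketch of a Diag++-style estimator with a total budget of m matrix-vector
    queries, split equally between sketching, exact projection and estimation."""
    rng = np.random.default_rng(rng)
    k = m // 3

    # 1) Range finder: sketch the top eigenspace of A with k queries.
    S = rng.choice([-1.0, 1.0], size=(n, k))                  # Rademacher sketch
    Y = np.column_stack([matvec(S[:, j]) for j in range(k)])
    Q, _ = np.linalg.qr(Y)                                    # Q roughly spans the top eigenspace

    # 2) Exact diagonal of the projected part, diag(Q Q^T A), using k further queries.
    B = np.column_stack([matvec(Q[:, j]) for j in range(k)])  # B = A Q (A symmetric)
    d_proj = np.sum(Q * B, axis=1)                            # diag(Q Q^T A)_i = sum_j Q_ij (A Q)_ij

    # 3) Rademacher diagonal estimate of the remainder A - Q Q^T A, with k queries.
    num = np.zeros(n)
    for _ in range(k):
        v = rng.choice([-1.0, 1.0], size=n)
        Av = matvec(v)
        num += v * (Av - Q @ (Q.T @ Av))                      # v .* ((A - Q Q^T A) v)
    d_rem = num / k                                           # Rademacher denominator is k

    return d_proj + d_rem
```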
9.3 Numerical Experiments
To investigate the performance of Diag++, we once again turn to synthetic matrices with power law spectra, $A = U\Lambda U^T$, with $U$ a random orthogonal matrix formed by orthogonalising a Gaussian matrix. Recall that with stochastic estimation alone, convergence was slow for such matrices with steep spectra. We expect Diag++ to perform more or less similarly to standard stochastic estimation if the spectrum of the matrix is flat, since (70) is pessimistic, whilst if the spectrum is steep, (70) is much more optimistic and faster convergence is to be expected. The algorithm is robust to the spectrum of $A$, making the “best of both worlds” from projection and estimation. If the spectrum is steep, then we shall be able to capture most of the diagonal of $A$ through projection, whilst if it is flat, the stochastic estimation of the diagonal of the remainder will buffer the effects of projection. This is as found in Figure 5. Not only is Diag++ robust to spectra decaying at different rates, but it also converges faster than simple stochastic estimation when the matrices in question have rapidly decaying eigenvalues. The algorithm even overtakes simple projection after a handful of vector queries in such cases. Thus, the answer to our original question at the start of this section is yes: we can improve stochastic estimation, particularly for what would otherwise be worst case scenarios.
[Figure 5: Diag++ compared with simple stochastic estimation and with projection alone, on synthetic matrices with spectra decaying at different rates.]
10 Conclusion
10.1 Summary
In setting out, we sought to gain insight into particular questions surrounding diagonal estimation:
1. How many query vectors are needed to bound the error of the stochastic diagonal estimator in (9), and how does this change with the distribution of vector entries?
2. How does matrix structure affect estimation, and how might we improve worst case scenarios?
In answer to 1, we have seen that elemental error probabilistically converges as , and scales as . We have seen that for the entire diagonal, similar results arise, with error going as , and scaling as . Additionally we have observed both theoretically and numerically, that if is small, error arising from the use of the Rademacher diagonal estimator is less than that arising from the Gaussian diagonal estimator. Conversely, if is large, there is an inconsequential difference between the two estimators.
In answer to 2, we have seen, as per our intuitions, that both the eigenvector basis and the eigenvalue spectrum influence the estimation of positive semi-definite matrices. Generally, the steeper the spectrum, the worse the stochastic diagonal estimator. However, we have introduced a new method of overcoming these worst case scenarios. We have analysed it to find favourable convergence properties, and demonstrated its superiority for estimating the diagonal of general dense matrices with steep spectra, whilst maintaining robustness to flatter spectra.
10.2 Future Work
There is ample scope to extend the work presented herein, both to applications, and further analysis.
1. We have seen that the error of a given diagonal element is dependent on the rest of its row. How can we ascertain the actual error of our estimates if we don’t know the norm of the row of our implicit matrix? Or if we don’t have an idea of its spectrum? Is the context of use of stochastic estimation a necessity? We note that we can make tracks on this question since we can estimate the relevant quantity using trace estimation, a free byproduct of diagonal estimation.
2. In our analysis we have perhaps wielded a hammer, rather than a scalpel, in order to arrive at the bounds for the full diagonal. In particular, union bounding introduces the potentially unfavourable factor of $n$. Perhaps future analysis could employ the dependence of element estimates in our favour to remove such a term?
3. Having noted that Yao et al. yao2020adahessian use only a single Rademacher estimator in their state-of-the-art second order machine learning optimiser, how much better will their method become with further estimation? Given the possibility of parallelised estimation, will this introduce a rapid new stochastic optimisation technique that outperforms current methods?
These are interesting and exciting open questions, providing the potential investigator with both theoretical and numerical directions in which to head. During this work, we also considered extending elemental diagonal analysis to shed new light on the original investigations into trace estimation. Whilst initial work suggested this may be fruitful, the dependence of the element estimates upon one another confounded further analysis. We readily invite suggestions or insights into how these ideas may be useful in approaching trace estimation.
References
- [1] R. P. Adams, J. Pennington, M. J. Johnson, J. Smith, Y. Ovadia, B. Patton, and J. Saunderson. Estimating the spectral density of large implicit matrices. arXiv:1802.03451, 2018.
- [2] T. Ando, R. A. Horn, and C. R. Johnson. The singular values of a Hadamard product: A basic inequality. Linear Multilinear Algebra, 21(4):345–365, 1987.
- [3] H. Avron and S. Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2):1–34, 2011.
- [4] R. A. Baston. On matrix estimation. Master’s thesis, University of Oxford, 2021.
- [5] C. Bekas, E. Kokiopoulou, and Y. Saad. An estimator for the diagonal of a matrix. Appl. Numer. Math., 57(11-12):1214–1229, 2007.
- [6] C. Boutsidis, P. Drineas, P. Kambadur, E.-M. Kontopoulou, and A. Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra Appl., 533:95–117, 2017.
- [7] V. Braverman, R. Krauthgamer, A. Krishnan, and R. Sinoff. Schatten norms in matrix streams: Hello sparsity, goodbye dimension. In ICML 2020. PMLR, 2020.
- [8] J. Chen. How accurately should I compute implicit matrix-vector products when applying the Hutchinson trace estimator? SIAM J. Sci. Comp, 38(6):A3515–A3539, 2016.
- [9] D. Cohen-Steiner, W. Kong, C. Sohler, and G. Valiant. Approximating the spectrum of a graph. In Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining, pages 1263–1271, 2018.
- [10] A. Cortinovis and D. Kressner. On randomized trace estimates for indefinite matrices with an application to determinants. Found. Comput. Math., pages 1–29, 2021.
- [11] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1–25, 2011.
- [12] E. Di Napoli, E. Polizzi, and Y. Saad. Efficient estimation of eigenvalue counts in an interval. Numer. Lin. Alg. Appl., 23(4):674–692, 2016.
- [13] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
- [14] A. S. Gambhir, A. Stathopoulos, and K. Orginos. Deflation as a method of variance reduction for estimating the trace of a matrix inverse. SIAM J. Sci. Comp, 39(2):A532–A558, 2017.
- [15] S. Goedecker. Linear scaling electronic structure methods. Reviews of Modern Physics, 71(4):1085, 1999.
- [16] S. Goedecker and M. Teter. Tight-binding electronic-structure calculations and tight-binding molecular dynamics with localized orbitals. Physical Review B, 51(15):9455, 1995.
- [17] G. H. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
- [18] G. H. Golub and U. Von Matt. Generalized cross-validation for large-scale problems. J. Comput. Graph. Stat., 6(1):1–34, 1997.
- [19] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288, 2011.
- [20] I. Han, D. Malioutov, H. Avron, and J. Shin. Approximating spectral sums of large-scale matrices using stochastic Chebyshev approximations. SIAM J. Sci. Comp, 39(4):A1558–A1585, 2017.
- [21] I. Han, D. Malioutov, and J. Shin. Large-scale log-determinant computation through stochastic Chebyshev expansions. In ICML 2015, pages 908–917. PMLR, 2015.
- [22] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun. Stat. - Simul. Comput., 18(3):1059–1076, 1989.
- [23] J. Li, A. Sidford, K. Tian, and H. Zhang. Well-conditioned methods for ill-conditioned systems: Linear regression with semi-random noise. arXiv:2008.01722, 2020.
- [24] L. Lin. Randomized estimation of spectral densities of large matrices made accurate. Numer. Math., 136(1):183–213, 2017.
- [25] L. Lin, Y. Saad, and C. Yang. Approximating spectral densities of large matrices. SIAM Rev., 58(1):34–65, 2016.
- [26] R. A. Meyer, C. Musco, C. Musco, and D. P. Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
- [27] C. Musco and C. Musco. Projection-cost-preserving sketches: Proof strategies and constructions. arXiv:2004.08434, 2020.
- [28] C. Musco, P. Netrapalli, A. Sidford, S. Ubaru, and D. P. Woodruff. Spectrum approximation beyond fast matrix multiplication: Algorithms and hardness. arXiv:1704.04163, 2017.
- [29] D. Persson, A. Cortinovis, and D. Kressner. Improved variants of the Hutch++ algorithm for trace estimation. arXiv:2109.10659, 2021.
- [30] F. Roosta-Khorasani and U. Ascher. Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212, 2015.
- [31] A. Stathopoulos, J. Laeuchli, and K. Orginos. Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. SIAM J. Sci. Comp, 35(5):S299–S322, 2013.
- [32] J. M. Tang and Y. Saad. Domain-decomposition-type methods for computing the diagonal of a matrix inverse. SIAM J. Sci. Comp, 33(5):2823–2847, 2011.
- [33] S. Ubaru and Y. Saad. Applications of trace estimation techniques. In International Conference on High Performance Computing in Science and Engineering, pages 19–33. Springer, 2017.
- [34] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- [35] Z. Yao and A. Gholami. With the authors: AdaHessian optimiser. https://youtu.be/S87ancnZ0MM?t=2000. Accessed: 24/08/2021.
- [36] Z. Yao, A. Gholami, S. Shen, M. Mustafa, K. Keutzer, and M. W. Mahoney. AdaHessian: An adaptive second order optimizer for machine learning. arXiv:2006.00719, 2020.