
R. A. Baston and Y. Nakatsukasa
Mathematical Institute, University of Oxford
Tel.: +44-1865-615319
email: [email protected], [email protected]

Stochastic diagonal estimation: probabilistic bounds and an improved algorithm

Robert A. Baston    Yuji Nakatsukasa
(Received: date / Accepted: date)
Abstract

We study the problem of estimating the diagonal of an implicitly given matrix $A$. For such a matrix we have access to an oracle that evaluates the matrix-vector product $A\bm{v}$ for any $\bm{v}$. For random vectors $\bm{v}$ drawn from an appropriate distribution, these products may be used to return an estimate of the diagonal of $A$. Whilst probabilistic guarantees exist for the error of estimates of the trace of $A$, no such results have yet been derived for the diagonal. We make two contributions in this regard. We analyse the number of queries $s$ required to guarantee that, with probability at least $1-\delta$, the relative error of the estimates of the diagonal entries is at most $\varepsilon$. We extend this analysis to the 2-norm of the difference between the estimate and the diagonal of $A$. We prove, discuss and experiment with bounds on the number of queries $s$ required to guarantee a probabilistic bound on the estimates of the diagonal, employing Rademacher and Gaussian random variables. Two sufficient upper bounds on the minimum number of query vectors are proved, extending the work of Avron and Toledo [JACM 58(2)8, 2011] and later work of Roosta-Khorasani and Ascher [FoCM 15, 1187-1212, 2015]. We first prove sufficient bounds for diagonal estimation with Gaussian vectors, and then similarly with Rademacher vectors. We find that, generally, there is little difference between the two, with convergence going as $O(\log(1/\delta)/\varepsilon^{2})$ for individual diagonal elements. However, for small $s$ we find that the Rademacher estimator is superior. These results allow us to extend the ideas of Meyer, Musco, Musco and Woodruff [SOSA, 142-155, 2021], suggesting the algorithm Diag++, to speed up the convergence of diagonal estimation from $O(1/\varepsilon^{2})$ to $O(1/\varepsilon)$ and make it robust to the spectrum of any positive semi-definite matrix $A$. We analyse our algorithm to demonstrate state-of-the-art convergence.

Keywords:
Monte Carlo Stochastic estimation Diagonal Matrix Trace estimation
MSC:
65C05 68W20 60E15

1 Introduction

An ever-present problem in numerical linear algebra is estimating the trace of a matrix $A\in\mathbb{R}^{n\times n}$ that may only be accessed through its operation on a given vector. Such a calculation is required in a wide variety of applications, including matrix norm estimation han2017approximating ; musco2017spectrum , image restoration golub1979generalized ; golub1997generalized , density functional theory bekas2007estimator and log-determinant approximation boutsidis2017randomized ; han2015large . The concepts surrounding diagonal estimation are younger than those of trace estimation, but have recently come of age in multiple applications. The theory, however, is lagging. Originally used for electronic structure calculations bekas2007estimator ; goedecker1999linear ; goedecker1995tight , the ideas developed for diagonal estimation have recently been used by Yao et al. yao2020adahessian in a novel, state-of-the-art machine learning algorithm. They estimate the diagonal of a Hessian matrix to “precondition” descent vectors, for Newton-style descent, in the training of convolutional neural networks. Their method produces exceptional accuracy across a wide array of problems. By understanding the properties of diagonal estimation more fully we may uncover new algorithms or methods to improve even these results.

Such estimations involve a matrix AA which is a “black box”. We are given an oracle that may evaluate A𝒗A\bm{v} for any 𝒗n\bm{v}\in\mathbb{R}^{n}, and we seek to return the chosen property of AA using as few query vectors 𝒗\bm{v} as possible: with ss queries, we desire sns\ll n. The first exploration into this area was by Hutchinson hutchinson1989stochastic , which resulted in the standard approach to estimate the trace of a matrix:

tr(A)trRs(A):=1sk=1s𝒗kTA𝒗k.\text{tr}(A)\approx\text{tr}_{R}^{s}(A):=\frac{1}{s}\sum_{k=1}^{s}\bm{v}_{k}^{T}A\bm{v}_{k}. (1)

The 𝒗kn\bm{v}_{k}\in\mathbb{R}^{n} are vectors with entries given as independent, identically distributed Rademacher variables: the ithi^{th} entry of 𝒗k\bm{v}_{k} takes the value 1-1 or 11 with equal probability for all ii, and independence across both ii and kk. The subscript RR signifies the distribution of the entries of 𝒗k\bm{v}_{k} to be Rademacher.
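As a point of reference for the diagonal analogues that follow, the snippet below is a minimal sketch (ours, not from hutchinson1989stochastic ) of the estimator (1), assuming $A$ is available only through a matrix-vector product routine.

```python
import numpy as np

def hutchinson_trace(matvec, n, s, seed=None):
    """Estimate tr(A) with s Rademacher query vectors; A is accessed only via matvec(v) = A @ v."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(s):
        v = rng.choice([-1.0, 1.0], size=n)  # i.i.d. Rademacher entries
        total += v @ matvec(v)               # v_k^T A v_k
    return total / s                         # average over the s queries, as in (1)
```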

Later work expanded on this idea, using Gaussian vectors distributed as 𝒗kN(0,In)\bm{v}_{k}\sim N(0,I_{n}), but until the paper of Avron and Toledo avron2011randomized , analysis and comparison of these estimators was limited to explorations of their variance. It is well known hutchinson1989stochastic that the variance of the Hutchinson method is the lowest when compared to other standard methods, thus is used most extensively in applications. However, whilst intuition suggests this is a good choice, it is not completely rigorous. The techniques of avron2011randomized address the error bounds in probability directly.

The (ε,δ)(\varepsilon,\delta) approach

In avron2011randomized the authors propose so-called (ε,δ)(\varepsilon,\delta) bounds. A lower bound on the number of queries, ss, sufficient to achieve a relative error of the estimated trace, is obtained. This is a probabilistic guarantee. Formally, for a pair of (ε,δ)(\varepsilon,\delta) values, a sufficient number of queries, ss, is found, such that, for positive semi-definite matrix AA

Pr(|trs(A)tr(A)|εtr(A))1δ.\Pr\Big{(}|\text{tr}_{\mathbb{P}}^{s}(A)-\text{tr}(A)|\leq\varepsilon\text{tr}(A)\Big{)}\geq 1-\delta. (2)

The term trs(A)\text{tr}_{\mathbb{P}}^{s}(A) is the ss-query estimator of the trace, using query vectors drawn from the distribution \mathbb{P}. Supposing ss random vectors have been used to query the matrix, we have, with high probability, a relative error that is at most ε\varepsilon, where ε\varepsilon is dependent on ss. The authors conclude that analysis of the variance of the estimators is not the most informative approach for selecting which estimator is likely to return the best results for the fewest queries. This is a key insight.

The diagonal estimator.

Bekas et al. bekas2007estimator extended Hutchinson’s original method to diagonal estimation by considering the Hadamard product of $\bm{v}_{k}$ and $A\bm{v}_{k}$. The method differs from trace estimation in two regards. Firstly, in trace estimation we sum the element-wise multiplication terms to obtain the inner product of $\bm{v}_{k}$ and $A\bm{v}_{k}$; this step is removed for diagonal estimation. Secondly, the estimate is not necessarily scaled by the number of queries, but by the element-wise sizes of the query vectors. This gives

𝑫s=[k=1s𝒗kA𝒗k][k=1s𝒗k𝒗k].\bm{D}^{s}=\Big{[}\sum_{k=1}^{s}\bm{v}_{k}\odot A\bm{v}_{k}\Big{]}\oslash\Big{[}\sum_{k=1}^{s}\bm{v}_{k}\odot\bm{v}_{k}\Big{]}. (3)

The method returns $\bm{D}^{s}$, an estimate of diag($A$), the diagonal of $A$ reshaped as an $n$-dimensional vector. The symbol $\odot$ represents component-wise multiplication of vectors and $\oslash$ represents component-wise division. In bekas2007estimator the authors focus on employing deterministic vectors for $\bm{v}_{k}$; no further treatment of the stochastic estimators is given, except in numerical experiments.
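A minimal sketch of the estimator (3), assuming only a matrix-vector product routine for $A$ and query vectors with i.i.d. zero-mean entries; the function name and interface are our own, not taken from bekas2007estimator .

```python
import numpy as np

def diag_estimate(matvec, n, s, dist="rademacher", seed=None):
    """Estimate diag(A) as in (3); A is accessed only via matvec(v) = A @ v."""
    rng = np.random.default_rng(seed)
    num = np.zeros(n)  # accumulates sum_k of v_k (componentwise *) A v_k
    den = np.zeros(n)  # accumulates sum_k of v_k (componentwise *) v_k
    for _ in range(s):
        if dist == "rademacher":
            v = rng.choice([-1.0, 1.0], size=n)
        else:  # "gaussian"
            v = rng.standard_normal(n)
        num += v * matvec(v)
        den += v * v
    return num / den   # componentwise division, the circled slash in (3)
```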

Contributions

In this paper we pursue the same aims as Avron and Toledo avron2011randomized and Bekas et al. bekas2007estimator : applying the ideas of avron2011randomized to analyse the stochastic diagonal estimator suggested in bekas2007estimator and shown in equation (3). No paper to date has presented a theoretical bound on the number of queries $s$ required to achieve an $\varepsilon$ approximation of the diagonal of a matrix. Specifically, we prove upper bounds on the minimum number of query vectors required for the relative (and “row-dependent”) error of the entries of the diagonal estimate to be at most $\varepsilon$, with probability $1-\delta$. We extend this result to consider the error across all diagonal entries, and the associated number of queries required for a similar error bound. We consider estimators whose query vectors $\bm{v}_{k}$ obey different distributions. This analysis allows us to suggest and analyse a novel algorithm for speeding up diagonal estimation. Extending the work in meyer2021hutch++ for trace estimation, we propose an algorithm for diagonal estimation that shifts from the classic Monte-Carlo regime of $O(1/\varepsilon^{2})$ to $O(1/\varepsilon)$, a fundamental, state-of-the-art improvement for diagonal estimation. Much of the paper is based on the first author’s MSc dissertation baston2021thesis .

Outline of this work

This paper is outlined as follows. In Section 2 we outline our use of notation, provide a brief overview of previous work, and outline the relation of the Rademacher diagonal estimator to Hutchinson’s trace estimator. In Section 3 we define what is meant by an (ε,δ)(\varepsilon,\delta)-approximator in the context of diagonal estimation, and outline two types of stochastic diagonal estimators, Gaussian and Rademacher. In Section 4 we state our results, proved in the following sections. We prove an upper bound on the minimum number of sufficient queries for an (ε,δ)(\varepsilon,\delta)-approximator using the Gaussian diagonal estimator in Section 5, and then for the Rademacher diagonal estimator in Section 6. Section 7 manipulates and explores these results in an attempt to make them more useful to the practitioner. We perform numerical experiments illustrating the results in Section 8. Where appropriate a touch of theory is also covered in this section. In Section 9 we use these previous results to adapt the Hutch++ algorithm from meyer2021hutch++ to suggest and analyse a similar algorithm, Diag++, that enables O(1/ε2)O(1/\varepsilon^{2}) to O(1/ε)O(1/\varepsilon) speed-up of stochastic diagonal estimation. Section 10 provides concluding remarks.

2 Preliminaries

In this section we outline our use of notation, and provide a brief overview of previous work.

2.1 Notation

For all $\bm{u}\in\mathbb{R}^{n}$, $\|\bm{u}\|_{2}=(\sum_{i=1}^{n}u_{i}^{2})^{1/2}$ denotes the $\ell_{2}$ norm and likewise $\|\bm{u}\|_{1}=\sum_{i=1}^{n}|u_{i}|$ denotes the $\ell_{1}$ norm. For a general matrix $A\in\mathbb{R}^{m\times n}$, $\|A\|_{F}=(\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}^{2})^{1/2}$ denotes the Frobenius norm, where, of course, $A_{ij}$ is the element of $A$ in the $i^{th}$ row and $j^{th}$ column. In the case of $A\in\mathbb{R}^{n\times n}$, $\text{tr}(A)=\sum_{i=1}^{n}A_{ii}=\sum_{i=1}^{n}\lambda_{i}$ is the trace of the matrix $A$, with $\lambda_{i}$ its eigenvalues. Sometimes in this paper we consider the matrix $A$ to be symmetric, $A=A^{T}$, or symmetric positive semi-definite (PSD), indicated by $A\succeq 0$. Such $A$ has an eigendecomposition of the form $A=V\Lambda V^{T}$, where $V\in\mathbb{R}^{n\times n}$ is the orthogonal matrix of eigenvectors of $A$ and $\Lambda$ is a real-valued diagonal matrix. We write $\bm{\lambda}=\text{diag}(\Lambda)$ for the vector of length $n$ containing the eigenvalues of $A$ in descending order $\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{n}$. Where the differences between the eigenvalues are relatively small, we say a matrix has a “flatter” spectrum; where the differences are larger, we describe the spectrum as “steep”111Whilst there is much more nuance to be found in the spectra of matrices, this description serves our purposes sufficiently, without over-complicating the details.. For $A\succeq 0$ we have that $\lambda_{i}\geq 0$ for all $i$. Where $A$ is diagonalisable we use the identity $\|\bm{\lambda}\|_{2}\equiv\|A\|_{F}$, and where $A\succeq 0$ we write $\|\bm{\lambda}\|_{1}\equiv\text{tr}(A)$. We denote $A_{r}=\operatorname*{arg\,min}_{B,\text{ rank}(B)=r}\|A-B\|_{F}$ as the optimal rank-$r$ approximation to $A$. When the matrix $A$ is PSD, the eigenvalues of the matrix coincide with its singular values. Thus, by the Eckart-Young-Mirsky theorem eckart1936approximation , for such $A$, $A_{r}=V_{r}\Lambda_{r}V_{r}^{T}$ with $V_{r}\in\mathbb{R}^{n\times r}$ containing the first $r$ columns of $V$ and $\Lambda_{r}\in\mathbb{R}^{r\times r}$ being the top left sub-matrix of $\Lambda$. Finally we denote $\bm{A}_{d}=\text{diag}(A)$ as the reshaping of the diagonal elements of the matrix $A$ into a vector of length $n$. For this vector, we clearly have $\|\bm{A}_{d}\|_{2}=(\sum_{i=1}^{n}A_{ii}^{2})^{1/2}$ and, in the case of $A\succeq 0$, $\|\bm{A}_{d}\|_{1}\equiv\text{tr}(A)$, since the $A_{ii}$ are necessarily non-negative for such $A$.

2.2 Hutchinson’s Method and Previous Work

The classic Monte Carlo method for approximating the trace of an implicit matrix is due to Hutchinson hutchinson1989stochastic , who states and proves the following lemma.

Lemma 1

Let $A$ be an $n\times n$ symmetric matrix. Let $\bm{v}$ be a random vector whose entries are i.i.d Rademacher random variables ($\Pr(v^{i}=\pm 1)=1/2$); then $\bm{v}^{T}A\bm{v}$ is an unbiased estimator of $\textnormal{tr}(A)$, that is

𝔼[𝒗TA𝒗]=tr(A)\mathbb{E}\Big{[}\bm{v}^{T}A\bm{v}\Big{]}=\textnormal{tr}(A) (4)

and

Var(𝒗TA𝒗)=2(AF2i=1nAii2)=2(AF2𝑨d22).\textnormal{Var}\Big{(}\bm{v}^{T}A\bm{v}\Big{)}=2\Big{(}\|A\|_{F}^{2}-\sum_{i=1}^{n}A_{ii}^{2}\Big{)}=2\Big{(}\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}\Big{)}. (5)

Considering the variance term, we can easily see that it captures how much of the matrix’s “energy” (i.e. the Frobenius norm) is located in the off-diagonal elements. Unfortunately, in the general case, this variance may be arbitrarily large. More vexatious is the fact that Lemma 1 does not provide the user with a rigorous bound on the number of matrix-vector queries for a given accuracy. To address this issue, Avron and Toledo avron2011randomized introduced the following definition for $A\succeq 0$ (Definition 4.1 of avron2011randomized ):

Definition 1

Let AA be a symmetric positive semi-definite matrix. A randomised trace estimator: trs(A)\textnormal{tr}_{\mathbb{P}}^{s}(A), is an (ε,δ)(\varepsilon,\delta)-approximator of tr(A)\textnormal{tr}(A) if

Pr(|trs(A)tr(A)|εtr(A))1δ.\Pr\Big{(}|\textnormal{tr}_{\mathbb{P}}^{s}(A)-\textnormal{tr}(A)|\leq\varepsilon\textnormal{tr}(A)\Big{)}\geq 1-\delta. (6)

This definition provides a better way to be able to bound the requisite number of queries. It guarantees that the probability of the relative error exceeding ε\varepsilon is at most δ\delta. For example, when estimating the trace according to Hutchinson’s method, it may be shown roosta2015improved that it is sufficient to take the number of queries, ss as

s>6ln(2/δ)ε2s>\frac{6\ln(2/\delta)}{\varepsilon^{2}} (7)

for any PSD matrix AA (see Theorem 1, Section 2, roosta2015improved ), whereas if Gaussian vectors are used, the bound on ss becomes

s>8ln(2/δ)ε2s>\frac{8\ln(2/\delta)}{\varepsilon^{2}} (8)

(Theorem 3, Section 3, roosta2015improved ). We note that the recent paper cortinovis2021randomized gives query bounds for indefinite matrices.

Both equations (7) and (8) immediately put us in the Monte Carlo regime, with error converging as O(1/s)O\big{(}1/\sqrt{s}\big{)} for ss queries. Let us step back and remark on the fact that the trace of a matrix is invariant under orthogonal similarity transforms of the matrix. As such, we can intuit that the loose, matrix-independent bounds in equations (7) and (8) are naturally to be expected. Can we say the same of estimating the diagonal? We cannot be sure that the previous intuition applies, as the diagonal of a matrix is not (generally) invariant under orthogonal similarity transformations.

2.3 Approximating the Diagonal of a Matrix

The ideas of Hutchinson are readily extended to approximate the full diagonal, as first introduced by Bekas et al. bekas2007estimator , and which we outline here. Take a sequence of $s$ vectors $\bm{v}_{1},...,\bm{v}_{s}$, and suppose that their entries follow some suitable distribution, being i.i.d with expectation222Because we scale the estimate as in equation (9) it may be shown that the distribution may have arbitrary variance, though $\mathbb{E}[X^{2}]=1$ is used for simplicity throughout. zero. Then the diagonal approximation of matrix $A$, expressed as an $n$-dimensional vector $\bm{D}^{s}$, may be taken as

𝑫s=[k=1s𝒗kA𝒗k][k=1s𝒗k𝒗k]\bm{D}^{s}=\Bigg{[}\sum_{k=1}^{s}\bm{v}_{k}\odot A\bm{v}_{k}\Bigg{]}\oslash\Bigg{[}\sum_{k=1}^{s}\bm{v}_{k}\odot\bm{v}_{k}\Bigg{]} (9)

where \odot is the component-wise multiplication of the vector (a Hadamard product), and similarly, \oslash the component-wise division of the vectors. Let us examine a single component of 𝑫s\bm{D}^{s} to gain more insight. Let vkiv_{k}^{i} denote the ithi^{th} element of the kthk^{th} vector, we have

Dis=[k=1svkij=1nAijvkj]/[k=1s(vki)2]=Aii+[k=1svkijinAijvkj]/[k=1s(vki)2]=Aii+jinAij(k=1svkivkjk=1s(vki)2).\begin{split}D_{i}^{s}&=\Bigg{[}\sum_{k=1}^{s}v_{k}^{i}\sum_{j=1}^{n}A_{ij}v_{k}^{j}\Bigg{]}\Bigg{/}\Bigg{[}\sum_{k=1}^{s}{(v_{k}^{i})^{2}}\Bigg{]}\\ &=A_{ii}+\Bigg{[}\sum_{k=1}^{s}v_{k}^{i}\sum_{j\neq i}^{n}A_{ij}v_{k}^{j}\Bigg{]}\Bigg{/}\Bigg{[}\sum_{k=1}^{s}{(v_{k}^{i})^{2}}\Bigg{]}\\ &=A_{ii}+{{\sum_{j\neq i}^{n}}}A_{ij}\Bigg{(}\frac{\sum_{k=1}^{s}v_{k}^{i}v_{k}^{j}}{\sum_{k=1}^{s}{(v_{k}^{i})^{2}}}\Bigg{)}.\end{split}

In expectation, the coefficients of the $A_{ij}$ entries are zero. This is true provided that the components $v_{k}^{i}$ of the $\bm{v}_{k}$ vectors themselves have expectation zero, regardless of the form of the denominator333Easily shown by conditioning on the denominator and using the tower rule of expectation..

2.4 Relation to Hutchinson’s method

If the vectors 𝒗k\bm{v}_{k} have i.i.d Rademacher entries, then it is clear to see that the sum of the entries of the vector 𝑫s\bm{D}^{s} is nothing but the Hutchinson trace estimate. To show this rigorously we examine the expectation and variance of a single component DisD_{i}^{s} of 𝑫s\bm{D}^{s}, when vkiv_{k}^{i} are independent Rademacher variables. We note that this is straightforward to find, and so the results are included only for completeness.

𝔼[Dis]=𝔼[Aii+jinAij(k=1svkivkjk=1s(vki)2)]=Aii+jinAij𝔼[k=1svkivkjk=1s(vki)2](by linearity)=Aii+jinAij𝔼[k=1svkivkjs](since (vki)2=1i,k)=Aii+jinAijk=1s𝔼[vki]𝔼[vkj]s(by independence, ji)=Aii\centering\begin{split}\mathbb{E}[D_{i}^{s}]&=\mathbb{E}\Bigg{[}A_{ii}+{\sum_{j\neq i}^{n}}A_{ij}\Bigg{(}\frac{\sum_{k=1}^{s}v_{k}^{i}v_{k}^{j}}{\sum_{k=1}^{s}{(v_{k}^{i})^{2}}}\Bigg{)}\Bigg{]}\\ &=A_{ii}+{\sum_{j\neq i}^{n}}A_{ij}\mathbb{E}\Bigg{[}\frac{\sum_{k=1}^{s}v_{k}^{i}v_{k}^{j}}{\sum_{k=1}^{s}{(v_{k}^{i})^{2}}}\Bigg{]}\hskip 35.56593pt\text{(by linearity)}\\ &=A_{ii}+{\sum_{j\neq i}^{n}}A_{ij}\mathbb{E}\Bigg{[}\frac{\sum_{k=1}^{s}v_{k}^{i}v_{k}^{j}}{s}\Bigg{]}\hskip 37.55785pt\text{(since $(v_{k}^{i})^{2}=1\;\forall\;i,k$)}\\ &=A_{ii}+{\sum_{j\neq i}^{n}}A_{ij}\frac{\sum_{k=1}^{s}\mathbb{E}[v_{k}^{i}]\mathbb{E}[v_{k}^{j}]}{s}\hskip 32.72049pt\text{(by independence, $j\neq i$)}\\ &=A_{ii}\end{split}\@add@centering

and by the linearity of expectation it is easy to see that

𝔼[i=1nDis]=i=1n𝔼[Dis]=tr(A).\mathbb{E}\Big{[}\sum_{i=1}^{n}D^{s}_{i}\Big{]}=\sum_{i=1}^{n}\mathbb{E}\Big{[}D^{s}_{i}\Big{]}=\text{tr}(A). (10)

Specifically for a single query s=1s=1 we are returned equation (4), the first result of Lemma 1. We now find the variance; most easily done by considering a single estimate, s=1s=1. Thus

Var(Di1)=Var(Di1Aii)=𝔼[(jinAijvivj)2]𝔼[jinAijvivj]2=jinlinAijAil𝔼[(vi)2vjvl]=jinAij2=(𝑨i22Aii2)\begin{split}\text{Var}(D_{i}^{1})&=\text{Var}(D_{i}^{1}-A_{ii})\\ &=\mathbb{E}\Bigg{[}\Big{(}\sum_{j\neq i}^{n}A_{ij}v^{i}v^{j}\Big{)}^{2}\Bigg{]}-\mathbb{E}\Bigg{[}\sum_{j\neq i}^{n}A_{ij}v^{i}v^{j}\Bigg{]}^{2}\\ &=\sum_{j\neq i}^{n}\sum_{l\neq i}^{n}A_{ij}A_{il}\mathbb{E}\Big{[}(v^{i})^{2}v^{j}v^{l}\Big{]}\\ &=\sum_{j\neq i}^{n}A_{ij}^{2}\\ &=\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)}\end{split} (11)

where $\bm{A}_{i}$ is the $i^{th}$ row of $A$444$\bm{A}_{i}$ possessing a 2-norm is intuitive, but formally we are really considering $\bm{A}_{i}^{T}$.. When considering the variance of the sum of the entries of $\bm{D}^{s}$, we must also find the covariance of the terms between two elements. For $D_{i}^{1}$ and $D_{j}^{1}$, where $i\neq j$, and with $s=1$, we have

Cov(Di1,Dj1)=𝔼[(Di1Aii)(Dj1Ajj)]=𝔼[(kinAikvivk)(ljnAjlvjvl)]=kinljnAikAjl𝔼[vivjvkvl]\begin{split}\text{Cov}(D_{i}^{1},D_{j}^{1})&=\mathbb{E}\Big{[}(D_{i}^{1}-A_{ii})(D_{j}^{1}-A_{jj})\Big{]}\\ &=\mathbb{E}\Big{[}\Big{(}\sum_{k\neq i}^{n}A_{ik}v^{i}v^{k}\Big{)}\Big{(}\sum_{l\neq j}^{n}A_{jl}v^{j}v^{l}\Big{)}\Big{]}\\ &=\sum_{k\neq i}^{n}\sum_{l\neq j}^{n}A_{ik}A_{jl}\mathbb{E}\Big{[}v^{i}v^{j}v^{k}v^{l}\Big{]}\end{split}

and since kik\neq i and ljl\neq j, along with the fact we are examining iji\neq j, only one term is returned: when k=jk=j and l=il=i. All other terms vanish as a result of independence and zero expectation. This leaves

Cov(Di1,Dj1)=AijAji𝔼[(vi)2(vj)2]=AijAji𝔼[(vi)2]𝔼[(vj)2]=AijAji.\centering\begin{split}\text{Cov}(D_{i}^{1},D_{j}^{1})&=A_{ij}A_{ji}\mathbb{E}\Big{[}(v^{i})^{2}(v^{j})^{2}\Big{]}\\ &=A_{ij}A_{ji}\mathbb{E}\Big{[}(v^{i})^{2}\Big{]}\mathbb{E}\Big{[}(v^{j})^{2}\Big{]}\\ &=A_{ij}A_{ji}.\end{split}\@add@centering

Of course, if AA is symmetric, this is Aij2A_{ij}^{2}. Thus, for such symmetric AA, we have

\begin{split}\text{Var}\Big{(}\sum_{i=1}^{n}D_{i}\Big{)}&=\sum_{i=1}^{n}\text{Var}(D_{i})+\sum_{i}^{n}\sum_{j\neq i}^{n}\text{Cov}(D_{i},D_{j})\\ &=\sum_{i=1}^{n}(\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2})+\sum_{i}^{n}\sum_{j\neq i}^{n}A_{ij}^{2}\\ &=2\Big{(}\|A\|_{F}^{2}-\sum_{i=1}^{n}A_{ii}^{2}\Big{)}.\end{split} (12)

So we recover equation (5), the second result of Lemma 1, and the relationship between trace and diagonal estimation is clear.
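As a quick numerical illustration of this relationship (our own check, not an experiment from the paper): with Rademacher queries, summing the diagonal estimate (9) reproduces Hutchinson’s trace estimate (1) exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50)); A = (A + A.T) / 2    # symmetric test matrix
V = rng.choice([-1.0, 1.0], size=(50, 10))              # 10 Rademacher query vectors
D = (V * (A @ V)).sum(axis=1) / (V * V).sum(axis=1)     # diagonal estimate, equation (9)
tr_hutch = np.mean(np.einsum('ik,ik->k', V, A @ V))     # Hutchinson trace estimate, equation (1)
print(np.isclose(D.sum(), tr_hutch))                    # True: the diagonal estimate sums to the trace estimate
```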

The dependence of the elements DisD_{i}^{s} upon one another threatens to confound any further analysis. Thus, to circumvent this, we first examine individual elements of the diagonal estimator, 𝑫s\bm{D}^{s}. When treating all elements simultaneously, we employ general bounds and inequalities that, irrespective of dependence, may be applied. In some instances, this can result in looser bounds than might be necessary in practice. Therefore our results ought to be treated as a guide, rather than exact truths. We are left with two main questions to answer:

  1. 1.

    How many query vectors are needed to bound the error of the stochastic diagonal estimator in (9), and how does this change with distribution of vector entries?

  2. 2.

    How does matrix structure affect estimation, and how might we improve worst case scenarios?

3 Definitions and Estimators

3.1 Definitions

We state precisely what is meant by an (ε,δ)(\varepsilon,\delta)-approximator in the context of diagonal estimation: for reasons which will become apparent, we introduce two such definitions. We also define what is meant by a diagonal estimator when using vectors of a given distribution.

3.1.1 (ε,δ)(\varepsilon,\delta)-approximators

Definition 2

The ithi^{th} component, DisD_{i}^{s}, of a diagonal estimator is said to be a row dependent (ε,δ)(\varepsilon,\delta)-approximator, if

Pr(|DisAii|2ε2(𝑨i22Aii2))1δ\Pr\Bigg{(}|D_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)}\Bigg{)}\geq 1-\delta

where 𝑨i\bm{A}_{i} is the ithi^{th} row of the matrix AA, reshaped as a vector.

The error bound in this definition may be tightened to be dependent only upon |Aii||A_{ii}|, giving Definition 3 for examination of the relative error:

Definition 3

The ithi^{th} component, DisD_{i}^{s}, of a diagonal estimator is said to be a relative (ε,δ)(\varepsilon,\delta)-approximator, if

Pr(|DisAii|ε|Aii|)1δ\Pr\Big{(}|D_{i}^{s}-A_{ii}|\leq\varepsilon|A_{ii}|\Big{)}\geq 1-\delta

Definition 3 gives us a way to analyse the estimators by bounding the number of queries required to guarantee that the probability the relative error exceeds ε\varepsilon is at most δ\delta. Definition 2 loosens this condition slightly and is useful for reasons which become shortly apparent.

3.1.2 Two estimators

Recall that all estimators follow the same simple pattern, as in equation (9): some random vector 𝒗k\bm{v}_{k} is drawn from a fixed distribution, and a normalised Hadamard product is used to estimate the diagonals. The process is repeated ss times, using i.i.d queries.

The first estimator uses vectors whose entries are standard Gaussian (normal) variables.

Definition 4

A Gaussian diagonal estimator for any square matrix An×nA\in\mathbb{R}^{n\times n} is given by

𝑮s=[k=1s𝒗kA𝒗k][k=1s𝒗k𝒗k]\bm{G}^{s}=\Big{[}\sum_{k=1}^{s}\bm{v}_{k}\odot A\bm{v}_{k}\Big{]}\oslash\Big{[}\sum_{k=1}^{s}\bm{v}_{k}\odot\bm{v}_{k}\Big{]} (13)

where the 𝒗k\bm{v}_{k} vectors are ss independent random vectors whose entries are i.i.d standard normal variables, vkiN(0,1)𝒗kN(0,In)v_{k}^{i}\sim N(0,1)\iff\bm{v}_{k}\sim N(0,I_{n}).

Note that the Gaussian estimator does not constrain the 2-norm of the 𝒗k\bm{v}_{k} vectors; so any single estimate may be arbitrarily large or small.

In contrast, the Rademacher diagonal estimator constrains the 𝒗k\bm{v}_{k} to have a fixed 2-norm, and results in a fixed denominator in the estimator equation.

Definition 5

A Rademacher diagonal estimator for any square matrix An×nA\in\mathbb{R}^{n\times n} is given by

𝑹s=[k=1s𝒗kA𝒗k][k=1s𝒗k𝒗k]=1sk=1s𝒗kA𝒗k\bm{R}^{s}=\Big{[}\sum_{k=1}^{s}\bm{v}_{k}\odot A\bm{v}_{k}\Big{]}\oslash\Big{[}\sum_{k=1}^{s}\bm{v}_{k}\odot\bm{v}_{k}\Big{]}=\frac{1}{s}\sum_{k=1}^{s}\bm{v}_{k}\odot A\bm{v}_{k} (14)

where the 𝒗k\bm{v}_{k} vectors are s independent random vectors whose entries are i.i.d uniformly distributed Rademacher variables: vki{1,1}v_{k}^{i}\sim\{-1,1\} each with probability 1/21/2.

It is worth noting that, unlike the trace estimators, the diagonal estimators above return different values when applied to ATA^{T} instead of AA. It is possible that some (but never all, as our results indicate) diagonal elements are estimated more accurately by working with ATA^{T}.
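For concreteness, the following self-contained sketch (ours) instantiates Definitions 4 and 5 on an explicit PSD test matrix via equation (9); it makes no claim about which estimator is more accurate for a given $s$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s = 200, 50
A = rng.standard_normal((n, n)); A = A @ A.T                          # a PSD test matrix

Vg = rng.standard_normal((n, s))                                      # Gaussian queries (Definition 4)
Vr = rng.choice([-1.0, 1.0], size=(n, s))                             # Rademacher queries (Definition 5)
estimate = lambda V: (V * (A @ V)).sum(axis=1) / (V * V).sum(axis=1)  # equation (9)

rel_err = lambda D: np.linalg.norm(D - np.diag(A)) / np.linalg.norm(np.diag(A))
print(rel_err(estimate(Vg)), rel_err(estimate(Vr)))                   # relative 2-norm errors of the two estimates
```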

4 Comparing the Quality of Estimators

Bekas et al. bekas2007estimator sought to replace stochastic estimation by taking the vectors $\bm{v}_{k}$ to be deterministic, as empirically they found that stochastic estimation could be slow. Sections 5 and 6 show why. Table 1 summarises our results.

Table 1: Summary of results for the convergence of estimators according to the distribution of the query vectors. Note that these results do not make any assumptions on the matrix $A$.

Query vector distribution: Rademacher | Gaussian
Sufficient query bound for a row-dependent $(\varepsilon,\delta)$-approximator: $2\ln(2/\delta)/\varepsilon^{2}$ | $4\log_{2}(\sqrt{2}/\delta)/\varepsilon^{2}$
Sufficient query bound for a relative $(\varepsilon,\delta)$-approximator: $2\Big(\frac{\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}}{A_{ii}^{2}}\Big)\ln(2/\delta)/\varepsilon^{2}$ | $4\Big(\frac{\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}}{A_{ii}^{2}}\Big)\log_{2}(\sqrt{2}/\delta)/\varepsilon^{2}$
Permitted $\varepsilon$ values: arbitrary | $(0,1]$
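As a practical aid, the small helper below (ours, not from the paper) evaluates the two row-dependent query bounds of Table 1 for a given pair $(\varepsilon,\delta)$; the ceiling is taken because $s$ must be an integer.

```python
import numpy as np

def queries_rademacher(eps, delta):
    """Sufficient s for a row-dependent (eps, delta)-approximator with Rademacher queries."""
    return int(np.ceil(2 * np.log(2 / delta) / eps**2))

def queries_gaussian(eps, delta):
    """Sufficient s for a row-dependent (eps, delta)-approximator with Gaussian queries (eps in (0, 1])."""
    return int(np.ceil(4 * np.log2(np.sqrt(2) / delta) / eps**2))

print(queries_rademacher(0.1, 0.01), queries_gaussian(0.1, 0.01))   # 1060 and 2858 queries respectively
```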

5 Gaussian diagonal estimator

We now prove an (ε,δ)(\varepsilon,\delta) bound for the Gaussian diagonal estimator. We find a sufficient minimum number of queries ss, for a row-dependent (ε,δ)(\varepsilon,\delta) bound to hold.

Theorem 5.1

Let GisG_{i}^{s} be the Gaussian diagonal estimator (13) of AA, then it is sufficient to take ss as

s>4log2(2/δ)/ε2s>4\log_{2}(\sqrt{2}/\delta)/\varepsilon^{2} (15)

with ε(0,1]\varepsilon\in(0,1], such that the estimator satisfies

Pr(|GisAii|2ε2(𝑨i22Aii2))1δ,\Pr\Bigg{(}|G_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)}\Bigg{)}\geq 1-\delta,

for i=1,,ni=1,\ldots,n, where 𝐀i\bm{A}_{i} is the ithi^{th} row of the matrix, reshaped as a vector.

The proof of this theorem is best read as three sequential steps: rotational invariance, conditioned integration and function bounding. We caution the reader that a vector $\bm{u}$ is introduced whose components are denoted by $v$. This notation is an unfortunate, yet unavoidable, consequence of the proof.

Proof


Step 1: Rotational invariance

The first trick is to rewrite the coefficient of each Aij,jiA_{ij},\;j\neq i, in

Gis=Aii+jiAijk=1svkivkjk=1s(vki)2.G_{i}^{s}=A_{ii}+\sum_{j\neq i}A_{ij}\frac{\sum_{k=1}^{s}v_{k}^{i}v_{k}^{j}}{\sum_{k=1}^{s}(v_{k}^{i})^{2}}. (16)

Let us denote555We use the notation 𝒖is\bm{u}_{i}\in\mathbb{R}^{s} so as not to confuse ourselves with the query vectors 𝒗kn\bm{v}_{k}\in\mathbb{R}^{n}. Definitively, the vector 𝒖i\bm{u}_{i} is composed of the ithi^{th} components of each query vector. 𝒖i=[v1i,v2i,,vsi]Ts\bm{u}_{i}=[v_{1}^{i},v_{2}^{i},...,v_{s}^{i}]^{T}\in\mathbb{R}^{s} such that 𝒖iN(0,Is)\bm{u}_{i}\sim N(0,I_{s}). Thus

Gis=Aii+jiAij𝒖iT𝒖j𝒖i22.G_{i}^{s}=A_{ii}+\sum_{j\neq i}A_{ij}\frac{\bm{u}_{i}^{T}\bm{u}_{j}}{\|\bm{u}_{i}\|_{2}^{2}}.

Namely the coefficient of each Aij,jiA_{ij},\;j\neq i, becomes

𝒖iT𝒖j𝒖i22=𝒖iTQTQ𝒖jQ𝒖i22\frac{\bm{u}_{i}^{T}\bm{u}_{j}}{\|\bm{u}_{i}\|_{2}^{2}}=\frac{\bm{u}_{i}^{T}Q^{T}Q\bm{u}_{j}}{\|Q\bm{u}_{i}\|_{2}^{2}} (17)

for any orthogonal rotation QQ, QTQ=IsQ^{T}Q=I_{s}. The key insight is to exploit the rotational invariance of 𝒖jN(0,Is)\bm{u}_{j}\sim N(0,I_{s}). Hence choose QQ such that

Q𝒖iQ𝒖i2=[1,0,,0]T=𝒆1s.\frac{Q\bm{u}_{i}}{\|Q\bm{u}_{i}\|_{2}}=[1,0,\ldots,0]^{T}=\bm{e}_{1}\in\mathbb{R}^{s}.

So the coefficients (17) become, for all jij\neq i,

1Q𝒖i2𝒆1TQ𝒖j=1𝒖i2𝒆1TQ𝒖j=gj𝒖i2\frac{1}{\|Q\bm{u}_{i}\|_{2}}\bm{e}_{1}^{T}Q\bm{u}_{j}=\frac{1}{\|\bm{u}_{i}\|_{2}}\bm{e}_{1}^{T}Q\bm{u}_{j}=\frac{g_{j}}{\|\bm{u}_{i}\|_{2}}

since QQ is orthogonal, so Q𝒖jN(0,Is)𝒆1TQ𝒖j=gjN(0,1)Q\bm{u}_{j}\sim N(0,I_{s})\implies\bm{e}_{1}^{T}Q\bm{u}_{j}=g_{j}\sim N(0,1). Note that the unit direction of a standard normal multivariate vector and its 2-norm are independent, implying that QQ and thus Q𝒖jQ\bm{u}_{j} are independent of 𝒖i2\|\bm{u}_{i}\|_{2}. Thus we may write (16) as

Gis=Aii+jiAijgj𝒖i2.G_{i}^{s}=A_{ii}+\sum_{j\neq i}A_{ij}\frac{g_{j}}{\|\bm{u}_{i}\|_{2}}.

Let us denote 𝑨~in1\bm{\widetilde{A}}_{i}\in\mathbb{R}^{n-1} as a vector containing the elements of the ithi^{th} row of AA, excluding AiiA_{ii}. Also let 𝒈~n1\bm{\widetilde{g}}\in\mathbb{R}^{n-1} be a vector containing the independent Gaussian gjg_{j} entries. That is 𝒈~N(0,In1)\bm{\widetilde{g}}\sim N(0,I_{n-1}). So we have

GisAii=𝑨~iT𝒈~𝒖i2.G_{i}^{s}-A_{ii}=\frac{\bm{\widetilde{A}}_{i}^{T}\bm{\widetilde{g}}}{\|\bm{u}_{i}\|_{2}}.

The second trick is to consider

GisAii𝑨~i2=1𝒖i2𝑨~iT𝒈~𝑨~i2.\frac{G_{i}^{s}-A_{ii}}{\|\bm{\widetilde{A}}_{i}\|_{2}}=\frac{1}{\|\bm{u}_{i}\|_{2}}\cdot\frac{\bm{\widetilde{A}}_{i}^{T}\bm{\widetilde{g}}}{\|\bm{\widetilde{A}}_{i}\|_{2}}. (18)

Once again, we use the rotational invariance of the distribution of 𝒈~\bm{\widetilde{g}}. The distribution of 𝑨~iT𝒈~/𝑨~i2\bm{\widetilde{A}}_{i}^{T}\bm{\widetilde{g}}/\|\bm{\widetilde{A}}_{i}\|_{2} is equal to 𝒆1TQ~𝒈~\bm{e}_{1}^{T}\widetilde{Q}\bm{\widetilde{g}}, where now, 𝒆1=[1,0,,0]Tn1\bm{e}_{1}=[1,0,...,0]^{T}\in\mathbb{R}^{n-1} and Q~\widetilde{Q} rotates 𝑨~i/𝑨~i2\widetilde{\bm{A}}_{i}/\|\widetilde{\bm{A}}_{i}\|_{2} to 𝒆1\bm{e}_{1}. Hence we have that the distribution of 𝑨~iT𝒈~/𝑨~i2\bm{\widetilde{A}}_{i}^{T}\bm{\widetilde{g}}/\|\bm{\widetilde{A}}_{i}\|_{2} is equal to some xN(0,1)x\sim N(0,1), giving

GisAii𝑨~i2=x𝒖i2(GisAii)2𝑨~i22=x2𝒖i22\frac{G_{i}^{s}-A_{ii}}{\|\bm{\widetilde{A}}_{i}\|_{2}}=\frac{x}{\|\bm{u}_{i}\|_{2}}\implies\frac{(G_{i}^{s}-A_{ii})^{2}}{\|\bm{\widetilde{A}}_{i}\|_{2}^{2}}=\frac{x^{2}}{\|\bm{u}_{i}\|_{2}^{2}} (19)

where clearly $x$ and $\|\bm{u}_{i}\|_{2}$ are independent666This is a scaled F-distribution..

Step 2: Conditioned integration

Recall that 𝒖iN(0,Is)s\bm{u}_{i}\sim N(0,I_{s})\in\mathbb{R}^{s} and so 𝒖i22χs2\|\bm{u}_{i}\|_{2}^{2}\sim\chi^{2}_{s}, is a chi-squared distribution of degree ss, and, of course, x2χ12x^{2}\sim\chi^{2}_{1}, is a chi-squared distribution of degree 1. Thus the proof reduces to bounding Pr(|x/𝒖i2|ε)=Pr(x2/𝒖i22ε2)\Pr(\big{|}x/\|\bm{u}_{i}\|_{2}|\leq\varepsilon)=\Pr(x^{2}/\|\bm{u}_{i}\|_{2}^{2}\leq\varepsilon^{2}) on the right hand side of equation (19). This is most easily examined by investigating the tail bound,

F:=Pr(x2𝒖i22ε2)=Pr(x2yε2)F:=\Pr\Bigg{(}\frac{x^{2}}{\|\bm{u}_{i}\|_{2}^{2}}\geq\varepsilon^{2}\Bigg{)}=\Pr\Bigg{(}\frac{x^{2}}{y}\geq\varepsilon^{2}\Bigg{)} (20)

where, for ease of notation, we have let y=𝒖i22y=\|\bm{u}_{i}\|_{2}^{2}. The next key insight is to approach this by conditioning the probability. Thus we have,

F=Pr(x2ε2y)=0Pr(x2ε2y|y)f(y)𝑑y\begin{split}F=&\Pr(x^{2}\geq\varepsilon^{2}y)\\ =&\int_{0}^{\infty}\Pr(x^{2}\geq\varepsilon^{2}y\big{|}y)\cdot f(y)dy\end{split} (21)

where f(y)f(y) is the probability density function of the yχs2y\sim\chi_{s}^{2} variable. So we have

\begin{split}F=&\int_{0}^{\infty}\Pr(x^{2}\geq\varepsilon^{2}y\big{|}y)\cdot f(y)dy\\ =&\int_{0}^{\infty}\Pr(\text{e}^{\lambda x^{2}}\geq\text{e}^{\lambda\varepsilon^{2}y}\big{|}y)\cdot f(y)dy\qquad\text{(for $\lambda>0$)}\\ \leq&\int_{0}^{\infty}\frac{\mathbb{E}[\text{e}^{\lambda x^{2}}]}{\text{e}^{\lambda\varepsilon^{2}y}}\cdot f(y)dy\qquad\text{(by Markov's inequality)}\\ =&\int_{0}^{\infty}\frac{1}{\sqrt{1-2\lambda}\>\text{e}^{\lambda\varepsilon^{2}y}}\cdot f(y)dy\qquad(\lambda\in(0,1/2),\text{ MGF of }x^{2}\sim\chi^{2}_{1})\end{split}

which, introducing the pdf, gives

=0112λeλε2y12s/2Γ(s/2)ys/21e12y𝑑y(pdf of χs2 variable)=0112λ12s/2Γ(s/2)ys/21e(12+λε2)y𝑑y(rearrange).\begin{split}=&\int_{0}^{\infty}\frac{1}{\sqrt{1-2\lambda}\>\text{e}^{\lambda\varepsilon^{2}y}}\cdot\frac{1}{2^{s/2}\Gamma(s/2)}y^{s/2-1}\text{e}^{-\frac{1}{2}y}dy\hskip 62.59596pt\text{(pdf of $\chi_{s}^{2}$ variable)}\\ =&\int_{0}^{\infty}\frac{1}{\sqrt{1-2\lambda}}\cdot\frac{1}{2^{s/2}\Gamma(s/2)}y^{s/2-1}\text{e}^{-(\frac{1}{2}+\lambda\varepsilon^{2})y}dy\hskip 96.73918pt\text{(rearrange).}\end{split}

Let z=(1+2λε2)yz=(1+2\lambda\varepsilon^{2})y, such that dz=(1+2λε2)dydz=(1+2\lambda\varepsilon^{2})dy. Thus substitution yields

F0112λ1(1+2λε2)s/212s/2Γ(s/2)zs/21e12z𝑑z=112λ1(1+2λε2)s/2(by the integral of the pdf)=112λ1eln(1+2λε2)s/2.\begin{split}F\leq&\int_{0}^{\infty}\frac{1}{\sqrt{1-2\lambda}}\cdot\frac{1}{(1+2\lambda\varepsilon^{2})^{s/2}}\cdot\frac{1}{2^{s/2}\Gamma(s/2)}z^{s/2-1}\text{e}^{-\frac{1}{2}z}dz\\ =&\frac{1}{\sqrt{1-2\lambda}}\cdot\frac{1}{(1+2\lambda\varepsilon^{2})^{s/2}}\hskip 113.81102pt\text{(by the integral of the pdf)}\\ =&\frac{1}{\sqrt{1-2\lambda}}\cdot\frac{1}{\text{e}^{\ln(1+2\lambda\varepsilon^{2})s/2}}.\end{split}

Step 3: Function bounding

Note that 2λε2(0,1)2\lambda\varepsilon^{2}\in(0,1) since λ(0,1/2)\lambda\in(0,1/2) and ε(0,1]\varepsilon\in(0,1], and consider that

ln(1+x)ln(2)x\ln(1+x)\geq\ln(2)x

for x(0,1)x\in(0,1). Therefore we can bound FF by

F112λ1eln(2)λε2s=112λ12λε2s\begin{split}F\leq\frac{1}{\sqrt{1-2\lambda}}\cdot\frac{1}{\text{e}^{\ln(2)\lambda\varepsilon^{2}s}}=\frac{1}{\sqrt{1-2\lambda}}\cdot\frac{1}{2^{\lambda\varepsilon^{2}s}}\end{split} (22)

Naively (we do not deal with optimal λ\lambda here), setting λ=1/4\lambda=1/4, and bounding the above by δ\delta gives

F22sε2/4<δ\begin{split}F\leq\frac{\sqrt{2}}{2^{s\varepsilon^{2}/4}}<\delta\end{split} (23)

thus we have

s>4log2(2δ)/ε2s>4\log_{2}\Big{(}\frac{\sqrt{2}}{\delta}\Big{)}\Big{/}\varepsilon^{2} (24)

as a sufficient bound for ss, such that, for all such ss

Pr(|GisAii|ε𝑨~i2)=F<δ\Pr(|G_{i}^{s}-A_{ii}|\geq\varepsilon\|\bm{\widetilde{A}}_{i}\|_{2})=F<\delta (25)

which completes the proof. \square

Implications

Theorem 5.1 deals only with a single entry of the diagonal. Concretely, we have found that

Pr(|GisAii|2ε2𝑨~i22)1δ\begin{split}\Pr\Big{(}|G_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\|\bm{\widetilde{A}}_{i}\|_{2}^{2}\Big{)}\geq 1-\delta\end{split}

for s>4log2(2/δ)/ε2s>4\log_{2}(\sqrt{2}/\delta)/\varepsilon^{2}. This means GisG_{i}^{s} is a row-dependent (ε,δ)(\varepsilon,\delta)-approximator for all such ss. For a relative (ε,δ)(\varepsilon,\delta)-approximator, we must substitute

ε2=ε¯2Aii2𝑨~i22\varepsilon^{2}=\bar{\varepsilon}^{2}\frac{A_{ii}^{2}}{\|\bm{\widetilde{A}}_{i}\|_{2}^{2}}

to get

Pr(|GisAii|ε¯|Aii|)1δ\Pr\Big{(}|G_{i}^{s}-A_{ii}|\leq\bar{\varepsilon}|A_{ii}|\Big{)}\geq 1-\delta

provided

s>4(𝑨~i22Aii2)log2(2/δ)ε¯2.s>4\Bigg{(}\frac{\|\bm{\widetilde{A}}_{i}\|_{2}^{2}}{A_{ii}^{2}}\Bigg{)}\frac{\log_{2}(\sqrt{2}/\delta)}{\bar{\varepsilon}^{2}}. (26)

For the entire diagonal, if we demand |GisAii|2ε2(𝑨i22Aii2)i|G_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)}\hskip 5.69046pt\forall\;i we may apply the ideas from the proof of Theorem 5.1, to obtain the number of vectors required for this condition to hold.

Corollary 1

For

Pr(i=1n|GisAii|2ε2i=1n(𝑨i22Aii2))1δ\Pr\Bigg{(}\sum_{i=1}^{n}|G_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\sum_{i=1}^{n}\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)}\Bigg{)}\geq 1-\delta

with ε(0,1]\varepsilon\in(0,1] it is sufficient to take

s>4log2(n2δ)/ε2s>4\log_{2}\Big{(}\frac{n\sqrt{2}}{\delta}\Big{)}\Big{/}\varepsilon^{2}

query vectors if vector entries are i.i.d Gaussian, where 𝐀i\bm{A}_{i} is the ithi^{th} row of the matrix A, and GisG_{i}^{s} is the ithi^{th} component of the Gaussian diagonal estimator.

Proof

By equation (23) of the proof of Theorem 5.1, we seek

Pr((GisAii)2𝑨~i22ε2)=Pr(χ12χs2ε2)22sε2/4i.\Pr\Bigg{(}\frac{(G_{i}^{s}-A_{ii})^{2}}{\|\bm{\widetilde{A}}_{i}\|_{2}^{2}}\geq\varepsilon^{2}\Bigg{)}=\Pr\Bigg{(}\frac{\chi^{2}_{1}}{\chi^{2}_{s}}\geq\varepsilon^{2}\Bigg{)}\leq\frac{\sqrt{2}}{2^{s\varepsilon^{2}/4}}\;\forall\;i.

Applying a union bound, and bounding this by δ\delta we readily find

n22sε2/4<δs>4log2(n2δ)/ε2n\cdot\frac{\sqrt{2}}{2^{s\varepsilon^{2}/4}}<\delta\implies s>4\log_{2}\Big{(}\frac{n\sqrt{2}}{\delta}\Big{)}\Big{/}\varepsilon^{2} (27)

is sufficient, which completes the proof. \square

Taking the summation over ii, Corollary 1 means we have, with probability at least 1δ1-\delta

𝑮s𝑨d22ε2(AF2𝑨d22)\|\bm{G}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\varepsilon^{2}\big{(}\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}\big{)} (28)

if we have ss as in equation (27). To make this bound a relative estimate, in comparison to 𝑨d\bm{A}_{d}, we write

ε2=(AF2𝑨d22𝑨d2)ε¯2\varepsilon^{2}=\Bigg{(}\frac{\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}}{\|\bm{A}_{d}\|^{2}}\Bigg{)}\bar{\varepsilon}^{2}

to get

Pr(𝑮s𝑨d2ε¯𝑨d2)1δ\Pr\Big{(}\|\bm{G}^{s}-\bm{A}_{d}\|_{2}\leq\bar{\varepsilon}\|\bm{A}_{d}\|_{2}\Big{)}\geq 1-\delta (29)

provided that the number of query vectors ss is

s>4(AF2𝑨d22𝑨d2)log2(n2/δ)ε¯2.s>4\Bigg{(}\frac{\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}}{\|\bm{A}_{d}\|^{2}}\Bigg{)}\frac{\log_{2}\Big{(}n\sqrt{2}/\delta\Big{)}}{\bar{\varepsilon}^{2}}. (30)

In summary, we have found that to get a relative 2-norm estimate of the entire diagonal with error at most ε¯\bar{\varepsilon}, and probability at least 1δ1-\delta, it is sufficient to use the ss queries as in equation (30), if Gaussian query vectors are used.

6 Rademacher diagonal estimator

This section deals with the proof of an $(\varepsilon,\delta)$ bound for the Rademacher diagonal estimator. We choose to break this down into two main steps, the first of which is given in Lemma 2. The lemma concerns the expectation of an exponential, which will subsequently be used in a Chernoff-style argument. One may regard it simply as a slight tweak on Khintchine’s lemma that introduces an additional summation term, so we outline it fully for completeness. Before continuing, we note that, examining the Rademacher estimator element-wise, we have

Ris=Aii+jinAijk=1svkivkjs=Aii+1sjinAijk=1szkij\begin{split}R_{i}^{s}=A_{ii}+\sum_{j\neq i}^{n}A_{ij}\frac{\sum_{k=1}^{s}v_{k}^{i}v_{k}^{j}}{s}=A_{ii}+\frac{1}{s}\sum_{j\neq i}^{n}A_{ij}\sum_{k=1}^{s}z_{k}^{ij}\end{split} (31)

where we have rewritten $v_{k}^{i}v_{k}^{j}$ as $z_{k}^{ij}$, and $z_{k}^{ij}$ is also a Rademacher variable777It is easy to verify that a Rademacher variable times an independent ($j\neq i$) Rademacher variable is also a Rademacher variable.. We now introduce the following lemma.

Lemma 2

For

Xis:=RisAiiX^{s}_{i}:=R_{i}^{s}-A_{ii}

we have that

𝔼[exp(tXis)]exp(t2𝑨~i222s)\mathbb{E}[\exp(tX^{s}_{i})]\leq\exp\Bigg{(}\frac{t^{2}\|\bm{\widetilde{A}}_{i}\|_{2}^{2}}{2s}\Bigg{)}

where 𝐀~in1\bm{\widetilde{A}}_{i}\in\mathbb{R}^{n-1} is a vector containing the elements of the ithi^{th} row of A, excluding the ithi^{th} element.

Proof

Note that XisX^{s}_{i} is simply

Xis=1sjinAijk=1szkijX_{i}^{s}=\frac{1}{s}\sum_{j\neq i}^{n}A_{ij}\sum_{k=1}^{s}z_{k}^{ij}

Consider now

\begin{split}\mathbb{E}[\exp(tX_{i}^{s})]=&\mathbb{E}\Big{[}\exp\Big{(}\frac{t}{s}\sum_{j\neq i}^{n}A_{ij}\sum_{k=1}^{s}z_{k}^{ij}\Big{)}\Big{]}\\ =&\prod_{j\neq i}^{n}\prod_{k=1}^{s}\mathbb{E}\Big{[}\exp\Big{(}\frac{t}{s}A_{ij}z_{k}^{ij}\Big{)}\Big{]}\\ =&\prod_{j\neq i}^{n}\prod_{k=1}^{s}\frac{1}{2}\Big{(}\exp\Big{(}\frac{t}{s}A_{ij}\Big{)}+\exp\Big{(}-\frac{t}{s}A_{ij}\Big{)}\Big{)}\\ =&\prod_{j\neq i}^{n}\prod_{k=1}^{s}\Big{(}1+\frac{(tA_{ij}/s)^{2}}{2!}+\frac{(tA_{ij}/s)^{4}}{4!}+...\Big{)}\\ \leq&\prod_{j\neq i}^{n}\prod_{k=1}^{s}\Big{(}1+\frac{(tA_{ij}/s)^{2}}{1!\cdot 2}+\frac{(tA_{ij}/s)^{4}}{2!\cdot 2^{2}}+...\Big{)}\\ =&\prod_{j\neq i}^{n}\prod_{k=1}^{s}\exp\Big{(}\frac{t^{2}A_{ij}^{2}}{2s^{2}}\Big{)}\\ =&\prod_{j\neq i}^{n}\exp\Big{(}\frac{t^{2}A_{ij}^{2}}{2s}\Big{)}\\ =&\exp\Big{(}\frac{t^{2}\|\bm{\widetilde{A}}_{i}\|_{2}^{2}}{2s}\Big{)}\end{split}

which completes the proof. \square

With this lemma in place, we can now prove an (ε,δ)(\varepsilon,\delta) bound for the Rademacher diagonal estimator. Specifically, we find the minimum sufficient number of queries ss, for a row-dependent (ε,δ)(\varepsilon,\delta) bound to hold.

Theorem 6.1

Let RisR_{i}^{s} be the Rademacher diagonal estimator (14) of AA, then it is sufficient to take ss as

s>2ln(2/δ)/ε2s>2\ln(2/\delta)/\varepsilon^{2} (32)

with arbitrary ε\varepsilon, such that the estimator satisfies

Pr(|RisAii|2ε2(𝑨i22Aii2))1δ,\Pr\Bigg{(}|R_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)}\Bigg{)}\geq 1-\delta,

where 𝐀i\bm{A}_{i} is the ithi^{th} row of the matrix reshaped as a vector.

Proof

The problem may be reduced to finding the tail probability, Pr(|Xis|ε𝑨~i2)\Pr(|X_{i}^{s}|\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2}). Investigating the positive tail probability we have that

Pr(Xisε𝑨~i2)=Pr(etXisetε𝑨~i2)(all t>0)𝔼[etXis]etε𝑨~i2(By Markov’s inequality)exp(tε𝑨~i2+t2𝑨~i22/2s).(By Lemma 2)\begin{split}\Pr(X^{s}_{i}\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2})=&\Pr(\text{e}^{tX^{s}_{i}}\geq\text{e}^{t\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2}})\hskip 88.2037pt\text{(all $t>0$)}\\ \leq&\frac{\mathbb{E}[\text{e}^{tX^{s}_{i}}]}{\text{e}^{t\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2}}}\hskip 139.41832pt\text{(By Markov's inequality)}\\ \leq&\exp\Big{(}-t\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2}+t^{2}\|\widetilde{\bm{A}}_{i}\|_{2}^{2}/2s\Big{)}.\hskip 34.14322pt\text{(By Lemma \ref{Chernoff Lemma})}\end{split} (33)

This is minimised at $t=s\varepsilon/\|\widetilde{\bm{A}}_{i}\|_{2}$, so we obtain 888We can also obtain (34) from a direct application of Hoeffding’s bound for the sum of unbounded random variables (wainwright2019high, , Prop. 2.5).

Pr(Xisε𝑨~i2)exp(12sε2).\Pr(X^{s}_{i}\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2})\leq\exp\Big{(}-\frac{1}{2}s\varepsilon^{2}\Big{)}. (34)

Since we are considering |Xis|ε𝑨~i2|X^{s}_{i}|\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2}, we need also to bound the negative tail

Pr(Xisε𝑨~i2)=Pr(etXisetε𝑨~i2)𝔼[etXis]etε𝑨~i2.\begin{split}\Pr(-X^{s}_{i}\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2})&=\Pr(\text{e}^{-tX^{s}_{i}}\geq\text{e}^{t\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2}})\\ &\leq\frac{\mathbb{E}[\text{e}^{-tX^{s}_{i}}]}{\text{e}^{t\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2}}}.\end{split}

It is also easily shown, by Lemma 2, that

𝔼[etXis]exp((t)2𝑨~i22/2s)\mathbb{E}[\text{e}^{-tX^{s}_{i}}]\leq\exp((-t)^{2}\|\widetilde{\bm{A}}_{i}\|^{2}_{2}/2s)

which means that, just as for the positive tail

\Pr(-X^{s}_{i}\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2})\leq\exp\Big{(}-\frac{1}{2}s\varepsilon^{2}\Big{)}. (35)

Or indeed, we may easily deduce the above by considering the symmetry of the Rademacher variables zkijz_{k}^{ij} in Lemma 2.

Given that the events in equations (34) and (35) are disjoint, we have

Pr(|Xis|ε𝑨~i2)2exp(12sε2).\Pr(|X^{s}_{i}|\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2})\leq 2\exp\Big{(}-\frac{1}{2}s\varepsilon^{2}\Big{)}. (36)

Bounding this probability by δ\delta yields

2exp(12sε2)<δs>2ln(2δ)/ε22\exp\Big{(}-\frac{1}{2}s\varepsilon^{2}\Big{)}<\delta\implies s>2\ln\Big{(}\frac{2}{\delta}\Big{)}\Big{/}\varepsilon^{2} (37)

as the sufficient lower bound for ss, such that, for all such ss

Pr(|Xis|ε𝑨~i2)<δ\Pr(|X_{i}^{s}|\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2})<\delta (38)

which completes the proof. \square

Implications

The above result is for a single diagonal entry. We have found that

Pr(|RisAii|2ε2𝑨~i22)2exp(sε2/2)<δ\Pr\Big{(}|R_{i}^{s}-A_{ii}|^{2}\geq\varepsilon^{2}\|\widetilde{\bm{A}}_{i}\|_{2}^{2}\Big{)}\leq 2\exp(-s\varepsilon^{2}/2)<\delta

so that $R_{i}^{s}$ is a row-dependent $(\varepsilon,\delta)$-approximator for $s>2\ln(2/\delta)/\varepsilon^{2}$. For a relative $(\varepsilon,\delta)$-approximator we must substitute999Replacing $\|\bm{\widetilde{A}}_{i}\|_{2}$ with $A_{ii}$ in equation (33) yields the same result as in equation (40).

\varepsilon^{2}=\bar{\varepsilon}^{2}\frac{A_{ii}^{2}}{\|\bm{\widetilde{A}}_{i}\|_{2}^{2}}

to get

Pr(|RisAii|ε¯|Aii|)<δ\Pr\Big{(}|R_{i}^{s}-A_{ii}|\geq\bar{\varepsilon}|A_{ii}|\Big{)}<\delta (39)

provided

s>2(𝑨~i22Aii2)ln(2/δ)ε¯2.s>2\Bigg{(}\frac{\|\widetilde{\bm{A}}_{i}\|_{2}^{2}}{A_{ii}^{2}}\Bigg{)}\frac{\ln(2/\delta)}{\bar{\varepsilon}^{2}}. (40)

For the whole diagonal, and in much the same way as when considering the Gaussian case, if we demand |RisAii|2ε2(𝑨i22Aii2)i|R_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)}\hskip 5.69046pt\forall\;i, we need simply apply the ideas from the proof of Theorem 6.1, to obtain the number of vectors required for this condition to hold.

Corollary 2

For

Pr(i=1n|RisAii|2ε2i=1n(𝑨i22Aii2))1δ\Pr\Bigg{(}\sum_{i=1}^{n}|R_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\sum_{i=1}^{n}\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)}\Bigg{)}\geq 1-\delta

with ε\varepsilon arbitrary it is sufficient to take

s>2ln(2nδ)/ε2s>2\ln\Big{(}\frac{2n}{\delta}\Big{)}\Big{/}\varepsilon^{2} (41)

query vectors if vector entries are i.i.d Rademacher, where 𝐀i\bm{A}_{i} is the ithi^{th} row of the matrix A, and RisR_{i}^{s} is the ithi^{th} component of the Rademacher diagonal estimator.

Proof

By equation (36) in the proof of Theorem 6.1, we seek

Pr(|Xis|ε𝑨~i2)2exp(12sε2)i.\Pr(|X^{s}_{i}|\geq\varepsilon\|\widetilde{\bm{A}}_{i}\|_{2})\leq 2\exp\Big{(}-\frac{1}{2}s\varepsilon^{2}\Big{)}\;\forall\;i.

Applying a union bound for all entries, and bounding this, we readily find

n2exp(12sε2)<δs>2ln(2nδ)/ε2n\cdot 2\exp\Big{(}-\frac{1}{2}s\varepsilon^{2}\Big{)}<\delta\implies s>2\ln\Big{(}\frac{2n}{\delta}\Big{)}\Big{/}\varepsilon^{2} (42)

which completes the proof. \square

Thus taking the summation over ii, we have, with probability 1δ1-\delta

𝑹s𝑨d22ε2(AF2𝑨d22)\|\bm{R}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\varepsilon^{2}\big{(}\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}\big{)} (43)

provided we have used ss-queries as in (42). If we seek a relative error, in comparison to 𝑨d\bm{A}_{d}, let

\varepsilon^{2}=\Bigg{(}\frac{\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}}{\|\bm{A}_{d}\|_{2}^{2}}\Bigg{)}\bar{\varepsilon}^{2}

to get

Pr(𝑹s𝑨d2ε¯𝑨d2)1δ\Pr\Big{(}\|\bm{R}^{s}-\bm{A}_{d}\|_{2}\leq\bar{\varepsilon}\|\bm{A}_{d}\|_{2}\Big{)}\geq 1-\delta (44)

provided

s>2\Bigg{(}\frac{\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}}{\|\bm{A}_{d}\|_{2}^{2}}\Bigg{)}\frac{\ln\Big{(}2n/\delta\Big{)}}{\bar{\varepsilon}^{2}}. (45)

Overall, we have found that to get a relative 2-norm estimate of the entire diagonal with error at most $\bar{\varepsilon}$, and probability at least $1-\delta$, it is sufficient to use $s$ queries as in equation (45), if we use Rademacher query vectors.
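To make the matrix dependence of (45) concrete, the sketch below (ours, and only usable when $A$ is explicitly available, e.g. for offline validation) evaluates this sufficient query count for the Rademacher estimator.

```python
import numpy as np

def queries_relative_rademacher(A, eps_bar, delta):
    """Sufficient s from (45): 2-norm error at most eps_bar * ||A_d||_2 with probability >= 1 - delta."""
    n = A.shape[0]
    Ad = np.diag(A)
    factor = (np.linalg.norm(A, 'fro')**2 - Ad @ Ad) / (Ad @ Ad)  # (||A||_F^2 - ||A_d||_2^2) / ||A_d||_2^2
    return int(np.ceil(2 * factor * np.log(2 * n / delta) / eps_bar**2))
```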

A brief comparison

Comparing (15) with (32) (or (27) with (41)), we see that both the Rademacher estimator and the Gaussian estimator display strikingly similar criteria for the selection of $s$. Whilst the Gaussian estimator appears slightly worse than its Rademacher counterpart, this difference is, in general terms, very small. A more insidious point relates to the demands on $\varepsilon$ in the Gaussian case: we required $\varepsilon\in(0,1]$ for equation (22) to hold. Whilst seemingly a minor detail, this becomes very important if we are to use a small number of query vectors $s$, say on the order of $10$. In Section 8 we demonstrate the implications of this assumption, and offer intuition as to why, when $s$ is small, the Gaussian diagonal estimator performs worse than the Rademacher diagonal estimator.

7 Positive-semi-definite diagonal estimation

We now examine the case where A0A\succeq 0, as is relevant in many applications braverman2020schatten ; chen2016accurately ; cohen2018approximating ; di2016efficient ; han2017approximating ; li2020well ; lin2016approximating ; ubaru2017applications , and analysed extensively in avron2011randomized ; roosta2015improved . For both estimators we have

|DisAii|2ε2(𝑨i22Aii2)|D_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}\Big{(}\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}\Big{)} (46)

with probability, 1δ1-\delta, provided we have used s=O(log(1/δ)/ε2)s=O(\log(1/\delta)/\varepsilon^{2}) queries and for given ii. As before, 𝑨i\bm{A}_{i} is the ithi^{th} row of the matrix (and therefore the ithi^{th} column by symmetry). There is no escaping the dependence of the absolute error on the matrix structure, and, as we have also seen, shifting to a relative error framework means that the number of sufficient queries becomes dependent on the matrix. Can we draw any other general conclusions about the dependence of the number of queries on the structure of the matrix? Might they fit with our intuitions, and indicate when stochastic estimation may be suitable? We find, as expected, that both flatter spectra lend themselves to diagonal estimation and that the eigenvector basis of the matrix plays a key role. These results are included for the practitioner, who may have access to some of the quantities explored in the following sections (e.g. κ2(A)\kappa_{2}(A)), or may be able to simulate other quantities in particular scenarios (e.g. σmin(VV)\sigma_{\min}(V\odot V)).

7.1 Spectrum influence - element-wise

To make a start let us consider the following lemma. Recall that λ1\lambda_{1} is the largest eigenvalue.

Lemma 3

For all A0A\succeq 0 we have that 𝐀i22λ1Aiii\|\bm{A}_{i}\|_{2}^{2}\leq\lambda_{1}A_{ii}\hskip 5.69046pt\forall\;i.

Proof

We may express the column 𝑨i\bm{A}_{i} as

𝑨i=j=1nλjVij𝑽j\bm{A}_{i}=\sum_{j=1}^{n}\lambda_{j}V_{ij}\bm{V}_{j}

where 𝑽j\bm{V}_{j} is the jthj^{th} eigenvector of AA and so we also have that Aii=j=1nλj(Vij)2A_{ii}=\sum_{j=1}^{n}\lambda_{j}(V_{ij})^{2}. Thus

𝑨i22=j=1nλjVij𝑽j22=j=1n(λjVij)2λ1j=1nλj(Vij)2=λ1Aii\|\bm{A}_{i}\|_{2}^{2}=\Big{\|}\sum_{j=1}^{n}\lambda_{j}V_{ij}\bm{V}_{j}\Big{\|}_{2}^{2}=\sum_{j=1}^{n}(\lambda_{j}V_{ij})^{2}\leq\lambda_{1}\sum_{j=1}^{n}\lambda_{j}(V_{ij})^{2}=\lambda_{1}A_{ii}

where the second equality holds by orthogonality, and the inequality by positive-semi-definiteness. \square

Remark 1

For $A\succeq 0$ with $\lambda_{n}=0$, $A_{ii}$ may take the value 0. When $A_{ii}$ does take this value, by Lemma 3 we have that $\bm{A}_{i}=\bm{0}$, and thus all diagonal zeroes will be found exactly by either the Rademacher or Gaussian estimator.

Now suppose A0A\succ 0, such that λn0\lambda_{n}\neq 0, then by equation (46) and Lemma 3, for sufficient ss we have

|DisAii|2ε2(λ1AiiAii2)=ε2(λ1Aii1)Aii2ε2(κ2(A)1)Aii2.\begin{split}|D_{i}^{s}-A_{ii}|^{2}&\leq\varepsilon^{2}(\lambda_{1}A_{ii}-A_{ii}^{2})\\ &=\varepsilon^{2}\Big{(}\frac{\lambda_{1}}{A_{ii}}-1\Big{)}A_{ii}^{2}\\ &\leq\varepsilon^{2}(\kappa_{2}(A)-1)A_{ii}^{2}.\end{split} (47)

Then, setting ε¯2=ε2/(κ2(A)1)\bar{\varepsilon}^{2}=\varepsilon^{2}/(\kappa_{2}(A)-1), we shall return to a relative (ε,δ)(\varepsilon,\delta)-approximator provided

s>O((κ2(A)1)log(1/δ)ε¯2).s>O\Bigg{(}\frac{(\kappa_{2}(A)-1)\log(1/\delta)}{\bar{\varepsilon}^{2}}\Bigg{)}. (48)

Extending this to all diagonal elements, we have the following lemma.

Lemma 4

For

Pr(𝑫s𝑨d22ε¯2𝑨d22)1δ\Pr\Big{(}\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\bar{\varepsilon}^{2}\|\bm{A}_{d}\|_{2}^{2}\Big{)}\geq 1-\delta

we have that

s>O((κ2(A)1)log(n/δ)ε¯2)s>O\Bigg{(}\frac{(\kappa_{2}(A)-1)\log(n/\delta)}{\bar{\varepsilon}^{2}}\Bigg{)} (49)

is a sufficient number of queries for both the Rademacher or Gaussian estimator.

Of course, this analysis produces a weaker sufficiency bound101010Due to the inequalities applied, we expect to arrive at sufficiency before $s$ as in (49). on $s$ than either of those in Table 1, but it is more likely to be practically computable (e.g. via a few steps of the Lanczos algorithm) or perhaps even known for implicit $A$.

Lemma 4 suggests roughly that the conditioning of the matrix indicates its ability to be estimated. For a given ss, if κ2(A)\kappa_{2}(A) is larger, error estimates are likely to be worse, but as κ2(A)\kappa_{2}(A) approaches 1 they become better and better. This is intuitive, since if κ2(A)=1\kappa_{2}(A)=1, then the matrix AA shall be a scalar multiple of the identity: A=V(cI)VT=cIA=V(cI)V^{T}=cI for some scalar cc. Hence the lower bound on ss becomes 0 and exact estimates are found after a single query. More generally, if κ2(A)n\kappa_{2}(A)\ll n, then the bounds (48) and (49) are good, and stochastic estimation is suitable to estimate the diagonal. Conversely, if κ2(A)=O(n)\kappa_{2}(A)=O(n) or greater, then clearly this bound is no longer helpful since we seek sns\ll n queries.

However, if the bound of Lemma 3 is loose, and line three of equation (47) is loose, then the bounds (48) and (49) are too pessimistic. To partly rectify this, and extend these ideas for general A0A\succeq 0, we stop at the second line of (47) to write, with ci(A):=λ1/Aiic_{i}(A):=\lambda_{1}/A_{ii}

|DisAii|2ε2(ci(A)1)Aii2|D_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}(c_{i}(A)-1)A_{ii}^{2}

where ci(A)c_{i}(A) is best considered as the “conditioning” of a diagonal element. Note again that in this analysis we are considering Aii0A_{ii}\neq 0, since we already know that if Aii=0A_{ii}=0 it will be found exactly. Let us define

κd(A):=λ1mini,Aii0Aii\kappa_{d}(A):=\frac{\lambda_{1}}{\min_{i,A_{ii}\neq 0}A_{ii}} (50)

as the conditioning of the diagonal. So we have,

|D_{i}^{s}-A_{ii}|^{2}\leq\varepsilon^{2}(\kappa_{d}(A)-1)A_{ii}^{2}

and the same ideas as for κ2(A)\kappa_{2}(A) may be applied. In particular, note that we shall obtain a relative (ε,δ)(\varepsilon,\delta)-approximator if, with ε¯2=ε2/(κd(A)1)\bar{\varepsilon}^{2}=\varepsilon^{2}/(\kappa_{d}(A)-1)

s>O((κd(A)1)log(1/δ)ε¯2)s>O\Bigg{(}\frac{(\kappa_{d}(A)-1)\log(1/\delta)}{\bar{\varepsilon}^{2}}\Bigg{)} (51)

and, extending this to the diagonal elements, we readily obtain the following lemma.

Lemma 5

For

Pr(𝑫s𝑨d22ε¯2𝑨d22)1δ\Pr\Big{(}\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\bar{\varepsilon}^{2}\|\bm{A}_{d}\|_{2}^{2}\Big{)}\geq 1-\delta

we have that

s>O((κd(A)1)log(n/δ)ε¯2)s>O\Bigg{(}\frac{(\kappa_{d}(A)-1)\log(n/\delta)}{\bar{\varepsilon}^{2}}\Bigg{)} (52)

is a sufficient number of queries for both Rademacher and Gaussian estimators.

So if κd(A)n\kappa_{d}(A)\ll n then stochastic estimation is suitable to estimate the diagonal of such a matrix. Since κ2(A)κd(A)\kappa_{2}(A)\geq\kappa_{d}(A), this demand is easier to satisfy than demanding that κ2(A)n\kappa_{2}(A)\ll n.

7.2 Spectrum influence - the entire diagonal

So far we have built our understanding by considering single element estimates, and examining their implications for the whole diagonal. We can gain more information by examining the inequality

𝑫s𝑨d22ε2(AF2𝑨d22)\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\varepsilon^{2}(\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}) (53)

since this accounts for the entire matrix structure. This holds, with probability at least 1δ1-\delta if s>O(log(n/δ)/ε2)s>O\Big{(}\log(n/\delta)/\varepsilon^{2}\Big{)}. Recall that, for A0A\succeq 0, AF2=𝝀22\|A\|_{F}^{2}=\|\bm{\lambda}\|_{2}^{2}, so we have

𝑫s𝑨d22ε2(𝝀22𝑨d22).\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\varepsilon^{2}\Big{(}\|\bm{\lambda}\|_{2}^{2}-\|\bm{A}_{d}\|_{2}^{2}\Big{)}. (54)

Now, we also have for A0A\succeq 0, that

𝑨d221n𝑨d12=1n𝝀12\|\bm{A}_{d}\|_{2}^{2}\geq\frac{1}{n}\|\bm{A}_{d}\|_{1}^{2}=\frac{1}{n}\|\bm{\lambda}\|_{1}^{2}

which means that

\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\varepsilon^{2}\Big{(}\|\bm{\lambda}\|_{2}^{2}-\frac{1}{n}\|\bm{\lambda}\|_{1}^{2}\Big{)} (55)

and we can once again see the influence of the eigenvalues on the error of our estimates.

This equation, however, is dependent on the entire spectrum, in comparison to only κ2(A)\kappa_{2}(A) or κd(A)\kappa_{d}(A) as was previously found. As expected, the conditioning of a matrix does not tell the whole story. If 𝝀=[1,1,,1,ϵ]\bm{\lambda}=[1,1,\ldots,1,\epsilon] the condition number of the matrix in question may be large, even O(n)O(n), but equation (55) suggests that the scaling of the absolute error will be acceptable. Clearly, a flat spectrum, such as 𝝀=[1,1,,1]n\bm{\lambda}=[1,1,...,1]\in\mathbb{R}^{n}, corresponding to the identity, yields a right hand side of zero. Meanwhile, the value of 𝝀22𝝀12/n\|\bm{\lambda}\|_{2}^{2}-\|\bm{\lambda}\|_{1}^{2}/n is greatest when mass is concentrated on a single eigenvalue.
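As a small worked illustration of these two extremes (our own example): with n = 4, the concentrated spectrum λ = [4,0,0,0] gives \|\bm{\lambda}\|_{2}^{2}-\|\bm{\lambda}\|_{1}^{2}/n = 16 - 4 = 12, whereas the flat spectrum λ = [1,1,1,1], with the same trace, gives 4 - 4 = 0.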

For completeness, note that the reverse is true of the mass of the diagonal, since

𝝀22𝝀12=𝑨d12\begin{split}\|\bm{\lambda}\|_{2}^{2}&\leq\|\bm{\lambda}\|_{1}^{2}=\|\bm{A}_{d}\|_{1}^{2}\end{split}

and so

𝑫s𝑨d22ε2(𝑨d12𝑨d22)\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\varepsilon^{2}\Big{(}\|\bm{A}_{d}\|_{1}^{2}-\|\bm{A}_{d}\|_{2}^{2}\Big{)} (56)

which is worse as the diagonal becomes flatter, but better when most of the diagonal mass is concentrated on a handful of entries.

7.3 Eigenvector influence

Equation (56) hints at the notion that it is not only the spectrum of AA that influences the error of diagonal estimation. Indeed, if we have a very steep spectrum, but the eigenvectors of AA are columns of the identity, that is V=IV=I, then we shall still have a diagonal matrix and so expect one-shot estimation. Consider the following.

Lemma 6

Let An×nA\in\mathbb{R}^{n\times n} be a diagonalisable matrix, A=VΛVTA=V\Lambda V^{T}, with 𝛌=diag(Λ)\bm{\lambda}=\textnormal{diag}(\Lambda), and 𝐀d=diag(A)\bm{A}_{d}=\textnormal{diag}(A), then

𝑨d=(VV)𝝀\bm{A}_{d}=(V\odot V)\bm{\lambda}

where V\odot V is the Hadamard (element-wise) product of V with itself.

Proof

The above is easily found by recalling

Aii=j=1nλj(Vij)2iA_{ii}=\sum_{j=1}^{n}\lambda_{j}(V_{ij})^{2}\quad\forall\;i

and simply extending to the matrix equation. \square
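Lemma 6 is also easy to verify numerically; the snippet below is a minimal sketch (not part of the original experiments) checking the identity on a small random positive semi-definite matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal eigenvector basis
lam = rng.uniform(0.1, 2.0, size=n)                # eigenvalues of a PSD test matrix
A = V @ np.diag(lam) @ V.T

# Lemma 6: diag(A) equals the Hadamard square of V applied to the eigenvalues.
print(np.allclose(np.diag(A), (V * V) @ lam))      # expected: True
```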

Hence, Lemma 6 and equation (54) imply, for s>O(log(n/δ)/ε2)s>O\Big{(}\log(n/\delta)/\varepsilon^{2}\Big{)},

𝑫s𝑨d22ε2(𝝀22(VV)𝝀22)ε2(1σmin2(VV))𝝀22.\begin{split}\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}&\leq\varepsilon^{2}\Big{(}\|\bm{\lambda}\|_{2}^{2}-\|(V\odot V)\bm{\lambda}\|_{2}^{2}\Big{)}\\ &\leq\varepsilon^{2}\Big{(}1-\sigma^{2}_{\min}(V\odot V)\Big{)}\|\bm{\lambda}\|_{2}^{2}.\end{split}

If we write

ε2=ε¯2𝑨d22(1σmin2(VV))𝝀22\varepsilon^{2}=\bar{\varepsilon}^{2}\frac{\|\bm{A}_{d}\|_{2}^{2}}{\Big{(}1-\sigma^{2}_{\min}(V\odot V)\Big{)}\|\bm{\lambda}\|_{2}^{2}}

then we readily obtain the following lemma.

Lemma 7

For

𝑫s𝑨d22ε¯2𝑨d22\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\bar{\varepsilon}^{2}\|\bm{A}_{d}\|_{2}^{2}

we have that

s>O((1σmin2(VV))𝝀22𝑨d22log(n/δ)ε¯2)s>O\Bigg{(}\frac{\Big{(}1-\sigma^{2}_{\min}(V\odot V)\Big{)}\|\bm{\lambda}\|_{2}^{2}}{\|\bm{A}_{d}\|_{2}^{2}}\cdot\frac{\log(n/\delta)}{\bar{\varepsilon}^{2}}\Bigg{)} (57)

is a sufficient number of queries.

This returns our prediction for one-shot estimation if V=IV=I. In this case we find σmin(VV)=σmin(I)=1\sigma_{\min}(V\odot V)=\sigma_{\min}(I)=1 and the bound (57) is zero, so s=1s=1 is sufficient to recover the diagonal exactly. Further analysis of the value of σmin(VV)\sigma_{\min}(V\odot V) is complex and outside the scope of this paper. We point the interested reader to ando1987singular for further investigation.

7.4 Links with Hutchinson’s estimator variance

Lastly, once again let us turn our attention to the quantity:

AF2𝑨d22\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2} (58)

and recall that, but for a factor of 2, this is in fact the variance of Hutchinson's estimator for the trace. A variety of papers suggest variance reduction schemes for Hutchinson's estimator, focused on reducing (58). Certain approaches take advantage of the sparsity of the matrix A stathopoulos2013hierarchical ; tang2011domain (by a process often called probing), whilst others take a "decomposition" approach adams2018estimating . Perhaps the most promising approach in this case is that taken by Gambhir et al. gambhir2017deflation , Meyer et al. meyer2021hutch++ and Lin lin2017randomized . In these papers the idea is to form a decomposition by approximate projection onto the top eigenspace of A, using a matrix Q\in\mathbb{R}^{n\times r} designed to capture most of the column space of A, so that Q roughly spans the top eigenspace. In particular, gambhir2017deflation and lin2017randomized justify this approach for estimating the trace when A is (close to) low rank, since \text{tr}(QQ^{T}A)=\text{tr}(Q^{T}AQ) will then capture most of the trace of A. A moment's thought suggests that this method may be adapted to estimate the whole diagonal, since QQ^{T}A, or indeed QQ^{T}AQQ^{T}, is a randomised low-rank approximation to the matrix A halko2011finding . We might do well to find the exact diagonal of either QQ^{T}A or QQ^{T}AQQ^{T} and then estimate the diagonal of the remainder of the matrix, (I-QQ^{T})A or (I-QQ^{T})A(I-QQ^{T}). We expand further on this idea in Section 9, extending the results and analysis of meyer2021hutch++ .

8 Numerical Experiments

8.1 Single element estimates

We first examine the Rademacher diagonal estimator and explore the implications of equations (39) and (40), since the sufficient query bound for this estimator is likely the tightest derived thus far. Moreover, the value of ε in Theorem 6.1 is arbitrary, so any theoretical bound ought to hold for all values of s, at all values of δ. For different diagonal elements we expect the same rate of convergence, but different scaling, according to the value of (\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2})/A_{ii}^{2} for each i.
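For reference, the estimator being tested can be realised in a few lines. The sketch below assumes the elementwise form D_i^s = (\sum_k v_k^i (A\bm{v}_k)_i)/(\sum_k (v_k^i)^2) used throughout; for Rademacher entries the denominator is simply s. The test matrix, function name and parameters here are illustrative only, not the experimental code of this section.

```python
import numpy as np

def rademacher_diag_estimate(matvec, n, s, rng):
    """Stochastic diagonal estimate with s Rademacher query vectors:
    elementwise (sum_k v_k * A v_k) / s, one matrix-vector query per k."""
    acc = np.zeros(n)
    for _ in range(s):
        v = rng.choice([-1.0, 1.0], size=n)
        acc += v * matvec(v)
    return acc / s

# Hypothetical usage on an explicit PSD test matrix:
rng = np.random.default_rng(0)
n = 500
B = rng.standard_normal((n, n))
A = (B @ B.T) / n
D = rademacher_diag_estimate(lambda x: A @ x, n, s=64, rng=rng)
rel_err = np.abs(D - np.diag(A)) / np.abs(np.diag(A))
print(rel_err[:3])                    # relative error of a few selected elements
```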

This scaling is exactly what is found in practice, as shown in Figure 1. This figure shows the relative convergence of estimates of selected diagonal elements of the real-world matrix "Boeing/msc10480" from the SuiteSparse matrix collection (formerly the University of Florida matrix collection) davis2011university . These elements were chosen for display as they have large differences in their values of (\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2})/A_{ii}^{2}, and so are most easily resolved in the plots of relative error. As expected, the larger the value of (\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2})/A_{ii}^{2}, the greater the error after a given number of queries.

For a fixed value of δ and a given matrix entry, the ratio of the numerical error to the theoretical bound is roughly fixed: the convergence of the numerical experiments mirrors that of the theoretical bound. Thus the theoretical bound is tight in ε, as the numerical error also scales as O(1/ε²). In addition, for decreasing δ, the bound-to-observation ratio can be seen to decrease, suggesting that the query bound is tight in δ, since we would otherwise have expected it to grow.

Figure 1: The convergence of Rademacher estimates of selected diagonal elements of the matrix “Boeing/msc10480” from the SuiteSparse matrix collection. The numerical errors are shown in solid, and the theoretical bounds in dash. For 100100 trials, we plot the median error (top left), the 67th67^{th} percentile error (top right), the 80th80^{th} percentile error (bottom left) and the 90th90^{th} percentile error (bottom right), along with the corresponding theoretical error bounds for the associated value of δ\delta. The constants (𝐀i22Aii2)/Aii2(\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2})/A_{ii}^{2} associated with each index are found as 0.38680.3868, 0.01250.0125 and 4.71544.7154 for ii equal to 11, 1010 and 10001000 respectively.

8.2 Rademacher vs Gaussian element estimates

Table 2: Sufficient query bounds for a row-dependent (ε,δ)(\varepsilon,\delta)-approximator as found by Theorems 5.1 and 6.1, with the corresponding demands on ε\varepsilon.
Query vector distribution Rademacher Gaussian
Sufficient query bound 2ln(2/δ)/ε22\ln(2/\delta)/\varepsilon^{2} 4log2(2/δ)/ε24\log_{2}(\sqrt{2}/\delta)/\varepsilon^{2}
Demand on ε\varepsilon none ε(0,1]\varepsilon\in(0,1]
Figure 2: Comparison of the convergence of the Rademacher and Gaussian diagonal estimators. We run 100 trials for each distribution and report the median error and the 90th90^{th} percentile error in each case, along with their respective theoretical bounds.

Table 2 summarises the sufficient query bounds for a row-dependent (ε,δ)-approximator for ease of reference. For a relative (ε,δ)-approximator, we simply introduce the matrix constant (\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2})/A_{ii}^{2} into each bound. The bound for the Rademacher diagonal estimator is lower than that for the Gaussian diagonal estimator, so we expect to be able to reach a lower error faster with the Rademacher estimator. Note, however, that since we have not tuned the proof of Theorem 5.1 for the optimal λ (see equation (23)), the Gaussian query bound could in fact be somewhat lower, but not by much: experiments suggest that (23) is quite tight unless s ≫ 1/ε².

Figure 2 illustrates this comparison, again on the matrix "Boeing/msc10480". For the median experiments, δ=0.5, the Gaussian theoretical bound contains the corresponding experimental error, but for the 90th percentile it appears this is not the case. In this instance, where δ=0.1, the lowest value s may take is attained at ε=1. Hence

s>4log2(2/0.1)/12=15.3s>4\log_{2}(\sqrt{2}/0.1)/1^{2}=15.3 (59)

so we can only expect the theoretical bound to be valid for s ≥ 16. If only one Gaussian query is made, the relative error appears not to be bounded by the theoretical limit, but this is no surprise, since Theorem 5.1 answers the question: for given δ < 1 and ε ∈ (0,1], how large must s be? Experimentation, by contrast, shows us, for given s and δ, the value that ε takes (note that the plots show relative error, whereas ε in the context of Theorem 5.1 refers to row-dependent (ε,δ)-approximators). The theorem gives us a tool to find s, but in return asks that ε is bounded from above by 1 (for equation (22) to be valid). The experiment, meanwhile, takes an input of s and δ and returns ε.

Whilst the theoretical bound in the right-hand plot of Figure 2 only kicks in at s ≥ 16, we have drawn it extending back to s = 1 to illustrate that it does not apply for sufficiently small s. Meanwhile, as noted, the Rademacher bound holds for arbitrary ε. The effect is further manifested by the fact that, with δ = 0.1, the Rademacher estimator is more accurate for sufficiently small s, roughly s ≲ 10.

8.2.1 Intuitions for performance with small ss

To gain more intuition for why the Rademacher diagonal estimator is better than the Gaussian diagonal estimator for small s, it is useful to compare the variances of individual elements for the two estimators. From equation (11), the variance of the Rademacher estimate is

Var(Ris)=1sjiAij2,\text{Var}(R_{i}^{s})=\frac{1}{s}\sum_{j\neq i}A_{ij}^{2}, (60)

whilst in the Gaussian case for s>2s>2, we find

Var(Gis)=1s2jiAij2\text{Var}(G_{i}^{s})=\frac{1}{s-2}\sum_{j\neq i}A_{ij}^{2} (61)

but for s ≤ 2 the variance is unbounded. For a given i, this is found in the following manner:

Var(Gis)=Var(GisAii)=Var(jinAij(k=1svkivkjk=1s(vki)2))=Var(jinAij𝒖iT𝒖j𝒖𝒊22)=jinAij2Var(𝒖iT𝒖j𝒖𝒊22)(independence wrt j)\begin{split}\text{Var}(G_{i}^{s})&=\text{Var}(G_{i}^{s}-A_{ii})\\ &=\text{Var}\Bigg{(}{\sum_{j\neq i}^{n}}A_{ij}\Bigg{(}\frac{\sum_{k=1}^{s}v_{k}^{i}v_{k}^{j}}{\sum_{k=1}^{s}{(v_{k}^{i})^{2}}}\Bigg{)}\Bigg{)}\\ &=\text{Var}\Bigg{(}{\sum_{j\neq i}^{n}}A_{ij}\frac{\bm{u}_{i}^{T}\bm{u}_{j}}{\|\bm{u_{i}}\|_{2}^{2}}\Bigg{)}\\ &={\sum_{j\neq i}^{n}}A_{ij}^{2}\text{Var}\Bigg{(}\frac{\bm{u}_{i}^{T}\bm{u}_{j}}{\|\bm{u_{i}}\|_{2}^{2}}\Bigg{)}\hskip 28.45274pt\text{(independence wrt $j$)}\end{split}

where 𝒖i\bm{u}_{i} and 𝒖j\bm{u}_{j} are as in the proof of Theorem 5.1. Referring back to this proof again, we already know that the distribution of (𝒖iT𝒖j/𝒖i22)(\bm{u}_{i}^{T}\bm{u}_{j}/\|\bm{u}_{i}\|_{2}^{2}) is the same as gj/𝒖i2g_{j}/\|\bm{u}_{i}\|_{2} for some gjN(0,1)g_{j}\sim N(0,1) independent of 𝒖i2\|\bm{u}_{i}\|_{2}. This gives

\begin{split}\text{Var}(G_{i}^{s})&=\sum_{j\neq i}A_{ij}^{2}\Bigg{(}\mathbb{E}\Bigg{[}\frac{g_{j}^{2}}{\|\bm{u}_{i}\|_{2}^{2}}\Bigg{]}-\mathbb{E}\Bigg{[}\frac{g_{j}}{\|\bm{u}_{i}\|_{2}}\Bigg{]}^{2}\Bigg{)}\\ &=\sum_{j\neq i}A_{ij}^{2}\mathbb{E}\Bigg{[}\frac{g_{j}^{2}}{\|\bm{u}_{i}\|_{2}^{2}}\Bigg{]}\\ &=\sum_{j\neq i}A_{ij}^{2}\mathbb{E}\Bigg{[}\frac{\chi_{1}^{2}}{\chi_{s}^{2}}\Bigg{]}\\ &=\sum_{j\neq i}A_{ij}^{2}\mathbb{E}\Bigg{[}\frac{1}{s}F(1,s)\Bigg{]}\end{split}

where F(1,s)F(1,s) is an FF-distribution with parameters 11 and ss. The expectation of such a variable is readily found as s/(s2)s/(s-2), for s>2s>2, and otherwise is infinite, yielding the result of equation (61). Thus we expect the estimators to be similarly behaved for sufficiently large ss but not if ss is small, in which case we expect that the Rademacher estimator shall be the better of the two, as the numerical results indicate.
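The two variance formulae (60) and (61) are easy to check by simulation; the sketch below (our own illustration, not the experimental code of this paper) compares empirical variances of single-element Rademacher and Gaussian estimates against them on a small PSD test matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, trials, i = 100, 8, 10000, 0

B = rng.standard_normal((n, n))
A = (B @ B.T) / n                             # small PSD test matrix
row_term = np.sum(A[i, :]**2) - A[i, i]**2    # sum over j != i of A_ij^2

def element_estimate(draw):
    V = draw((s, n))                          # s query vectors, stored as rows
    num = np.einsum('ki,ki->i', V, V @ A)     # sum_k v_k * (A v_k), elementwise
    den = np.einsum('ki,ki->i', V, V)         # sum_k v_k * v_k
    return (num / den)[i]

rad = [element_estimate(lambda size: rng.choice([-1.0, 1.0], size=size)) for _ in range(trials)]
gau = [element_estimate(rng.standard_normal) for _ in range(trials)]

print(np.var(rad), row_term / s)              # empirical variance vs (60)
print(np.var(gau), row_term / (s - 2))        # empirical variance vs (61)
```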

8.2.2 Implications for related work

This has implications for the recent paper of Yao et al. yao2020adahessian . The authors estimate the diagonal of the Hessian of a loss function in machine learning tasks, accessed implicitly through back-propagation. Only a single Rademacher query vector is used, due to computational restrictions. These results suggest it would be unwise to switch to a single Gaussian query vector as in Definition 4, since such an estimate would have unbounded variance.

We remark further that our results, particularly Theorem 6.1 and Corollary 2, suggest the algorithm may greatly benefit from taking further estimates of the diagonal. Indeed, the authors indicate yao2020video that the noise the algorithm experiences is dominated by the error of estimation of the diagonal. It is now clear why.

8.3 Full diagonal estimates

Figure 3 shows the convergence of the Rademacher and Gaussian diagonal estimators to the full diagonal. Note that, in contrast to the experiments for single elements of the diagonal, the theoretical bounds appear looser. This is expected, since in arriving at the theoretical bounds for our complete diagonal estimates we have used multiple inequalities for single elements; therefore the query bounds will be loose. Use of a union bound in Corollaries 1 and 2 further compounds this loosening. Note once again that, in the Gaussian case, the theoretical bound holds only for sufficiently large s, as expected. The effects seen in the error of single diagonal element estimates carry over to the full diagonal, but are even more pronounced. For a handful of queries (again, s ≲ 10 or so) the Rademacher estimator outperforms its Gaussian counterpart, but as a greater number of queries are used the convergence behaviours become increasingly similar. As before, with the Rademacher estimator the ratio of the theoretical bound to the experimental error is roughly constant for each δ, so the convergence is of the order of O(1/ε²), just as for single elements. In the Gaussian case, for large enough s, the same may be said.

Figure 3: Convergence of the Rademacher and Gaussian diagonal estimators to the actual diagonal of the matrix “Boeing/msc10480” from the SuiteSparse matrix collection. The effect of the infinite variance for the Gaussian estimator when s2s\leq 2 is much more pronounced.
Figure 4: Convergence of the Rademacher estimator on random matrices. We run 5050 trials and report the median and 90th90^{th} percentile errors for three values of c,c, 0.50.5, 11 and 1.51.5, corresponding to slow, medium and fast eigenvalue decay respectively. As expected, with all matrices, convergence goes as O(1/ε2)O(1/\varepsilon^{2}). The difference between the matrices arises from the fact that the values of (AF2𝐀d22)/𝐀d22(\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2})/\|\bm{A}_{d}\|_{2}^{2} increase as cc increases.

8.4 Experiments on Synthetic Matrices.

We also test the Rademacher estimator on random matrices with power law spectra. We take Λ to be diagonal with Λ_ii = i^{-c}, for varying c. We generate a random orthogonal matrix V ∈ ℝ^{5000×5000} by orthogonalising a Gaussian matrix, and set A = V^TΛV. A larger value of c results in a more quickly decaying spectrum, meaning that the quantity \|\bm{\lambda}\|_{2}^{2}-\frac{1}{n}\|\bm{\lambda}\|_{1}^{2} in equation (55) becomes large relative to \|\bm{A}_{d}\|_{2}^{2}, so we expect worse relative estimates for a given number of queries. This is exactly as found empirically, as shown in Figure 4. In particular, for c = 1 and c = 1.5, the error of our estimates is still woeful even after s = 512 query vectors! This does not bode well for stochastic diagonal estimation of general dense matrices with sharply decaying spectra. Whilst we still have O(1/ε²) convergence, the error starts off too poorly for this to be of much use.
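For reproducibility, the synthetic test matrices can be generated as follows; this is a minimal sketch of the construction just described (with a smaller n than the 5000 used above, simply to keep the example light).

```python
import numpy as np

def synthetic_power_law_matrix(n, c, rng):
    """A = V^T diag(i^{-c}) V with V a random orthogonal matrix obtained by
    orthogonalising a Gaussian matrix, as in the experiments above."""
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    lam = np.arange(1, n + 1, dtype=float) ** (-c)
    return (V.T * lam) @ V, lam               # V^T diag(lam) V without forming diag

rng = np.random.default_rng(0)
A, lam = synthetic_power_law_matrix(1000, c=1.5, rng=rng)
# The quantity appearing in (55), which governs how hard the diagonal is to estimate:
print(lam @ lam - lam.sum()**2 / len(lam))
```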

8.5 Summary

At the start of this section, we set out to find an answer for the number of queries needed for a given estimation error. We have seen that, for fixed δ, s goes as O(1/ε²), and that the error scales with (\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2})^{1/2}. We have seen that the Rademacher diagonal estimator has both a better theoretical bound than its Gaussian counterpart and a lower variance for each element. This is in contrast to the conclusions of Avron and Toledo avron2011randomized , who found that Gaussian trace estimators were better than Hutchinson's Rademacher estimator. However, there is no real contention here: summing the components of the Gaussian diagonal estimator is not the same as the Gaussian trace estimator examined in avron2011randomized . The difference is merely an artifact of the denominator introduced in (13).

We have also seen that positive semi-definite matrices with flatter spectra are suitable for estimation in this way, but as the spectrum of a matrix becomes steeper, estimates become worse. Perhaps this may spell the end of stochastic diagonal estimation, or perhaps it may restrict it to a specific class of matrices, in contexts where the spectrum of the matrix is known to be close to flat due to properties of the system under investigation. All things considered, it so far seems that, at best, this form of estimation may only be useful under special circumstances. Might there be a way to improve on this poor performance for general, dense matrices with rapidly decaying spectra? This is the subject of Section 9. Since it is found to be the best, we take forward only the Rademacher diagonal estimator.

9 Improved diagonal estimation

Is it possible to improve stochastic diagonal estimation, particularly the worst case scenarios of the previous section? This section seeks to address this by extending the ideas of meyer2021hutch++ , adapting the algorithm therein for full diagonal estimation. For A0A\succeq 0, we propose an O(1/ε2)O(1/\varepsilon^{2}) to O(1/ε)O(1/\varepsilon) improvement in the number of required query vectors and illustrate the robustness of our algorithm to matrix spectra.

9.1 Relation to Hutch++

We briefly give an overview of the work of Meyer, Musco, Musco and Woodruff meyer2021hutch++ . In their paper, they motivate their algorithm, "Hutch++", as a quadratic improvement in trace estimation. Using randomised projection to approximately project off the top eigenvalues, they then estimate the trace of the remainder of the matrix; all of the error results from this remainder.

They arrive at their result in the following manner, first introducing the following lemma (Lemma 2, Section 2, meyer2021hutch++ ) for a general square matrix AA.

Lemma 8

Let A ∈ ℝ^{n×n}, δ ∈ (0,1/2] and s ∈ ℕ, and let \textnormal{tr}_{R}^{s}(A) be the Hutchinson estimate of the trace. For fixed constants c, C, if s ≥ c log(1/δ), then with probability 1−δ

|trRs(A)tr(A)|Clog(1/δ)sAF.|\textnormal{tr}_{R}^{s}(A)-\textnormal{tr}(A)|\leq C\sqrt{\frac{\log(1/\delta)}{s}}\|A\|_{F}.

Thus, if s=O(log(1/δ)/ε2)s=O(\log(1/\delta)/\varepsilon^{2}) then, with probability 1δ1-\delta, |trRs(A)tr(A)|εAF|\textnormal{tr}_{R}^{s}(A)-\textnormal{tr}(A)|\leq\varepsilon\|A\|_{F}.

So, for A ⪰ 0, we simply use \|A\|_{F}=\|\bm{\lambda}\|_{2}\leq\|\bm{\lambda}\|_{1}=\text{tr}(A) to get a relative trace approximator. Note that this bound is only tight when \|\bm{\lambda}\|_{2}\approx\|\bm{\lambda}\|_{1}. The authors then give the following lemma (see Lemma 3, Section 3, meyer2021hutch++ ).

Lemma 9

Let ArA_{r} be the best rank-r approximation to PSD matrix AA. Then

AArF1rtr(A).\|A-A_{r}\|_{F}\leq\frac{1}{\sqrt{r}}\textnormal{tr}(A).
Proof

Since λr+11ri=1rλi1rtr(A)\lambda_{r+1}\leq\frac{1}{r}\sum_{i=1}^{r}\lambda_{i}\leq\frac{1}{r}\text{tr}(A),

\|A-A_{r}\|_{F}^{2}=\sum_{i=r+1}^{n}\lambda_{i}^{2}\leq\lambda_{r+1}\sum_{i=r+1}^{n}\lambda_{i}\leq\frac{1}{r}\text{tr}(A)\sum_{i=r+1}^{n}\lambda_{i}\leq\frac{1}{r}\text{tr}(A)^{2}.

\square

They argue that this result immediately lends itself to the possibility of an algorithm with O(1/ε) complexity. Setting s = r = O(1/ε) and splitting tr(A) = tr(A_s) + tr(A−A_s), the first term may be computed exactly if V_s ∈ ℝ^{n×s} (the top s eigenvectors) is known, since \text{tr}(A_{s})=\text{tr}(V_{s}^{T}AV_{s}). Thus the error in the estimate of the trace is contained in our estimate of tr(A−A_s). So, (roughly) combining Lemmas 8 and 9, the error for their trace estimator \text{tr}_{H++}^{s}(A) becomes

|trH++s(A)tr(A)||trRs(AAs)tr(AAs)|Clog(1/δ)sAAsFClog(1/δ)str(A).\begin{split}|\text{tr}_{H++}^{s}(A)-\text{tr}(A)|&\approx|\text{tr}_{R}^{s}(A-A_{s})-\text{tr}(A-A_{s})|\\ &\leq C\sqrt{\frac{\log(1/\delta)}{s}}\|A-A_{s}\|_{F}\\ &\leq C\frac{\sqrt{\log(1/\delta)}}{s}\text{tr}(A).\end{split} (62)

Hence s=O(1/ε)s=O(1/\varepsilon) for fixed δ\delta is sufficient for an (ε,δ)(\varepsilon,\delta)-approximator to the trace of AA.

Of course, V_s cannot be computed exactly in a handful of matrix-vector queries, but this is easily resolved with standard tools from randomised linear algebra halko2011finding . Namely, O(s) queries are sufficient to find a Q with \|(I-QQ^{T})A(I-QQ^{T})\|_{F}\leq O(\|A-A_{s}\|_{F}) with high probability, which is all that is required in the second line of (62); this is how the algorithm is implemented in practice. For a full, formal treatment of these ideas see meyer2021hutch++ , Theorem 4, Section 3. We elaborate on this in the next section.

9.2 Diag++

Here we outline how the above ideas may be applied to estimation of the entire diagonal, using O(1/ε)O(1/\varepsilon) queries. Recall once again, for

Pr(𝑫s𝑨d22ε2(AF2𝑨d22))1δ\Pr\Bigg{(}\|\bm{D}^{s}-\bm{A}_{d}\|_{2}^{2}\leq\varepsilon^{2}\Big{(}\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}\Big{)}\Bigg{)}\geq 1-\delta (63)

it is sufficient to take

s>O(log(n/δ)ε2).s>O\Bigg{(}\frac{\log(n/\delta)}{\varepsilon^{2}}\Bigg{)}. (64)

We may now engage in a trade-off. For fixed δ\delta and nn, we swap O(1/ε2)O(1/\varepsilon^{2}) convergence for O(1/ε)O(1/\varepsilon) convergence. This comes at the (potentially minor) expense of a greater multiplicative factor in our query bound. Just as in Hutch++, we first motivate this supposing we know VsV_{s}, the top ss eigenvectors, exactly.

Supposing we have projected off the rank-s approximation A_s and extracted its diagonal, diag(A_s), we are left to estimate the diagonal of

A^:=AAs.\widehat{A}:=A-A_{s}.

Denoting 𝑫^s\widehat{\bm{D}}^{s} as the diagonal estimate of this, we have

𝑫^s𝑨^d22ε2(A^F2𝑨^d22)\|\widehat{\bm{D}}^{s}-\widehat{\bm{A}}_{d}\|_{2}^{2}\leq\varepsilon^{2}\Big{(}\|\widehat{A}\|_{F}^{2}-\|\widehat{\bm{A}}_{d}\|_{2}^{2}\Big{)} (65)

with probability 1−δ, if we have used a sufficient number of queries: s > \frac{2}{\varepsilon^{2}}\ln(2n/\delta) for the Rademacher diagonal estimator. Ignoring the \|\widehat{\bm{A}}_{d}\|_{2}^{2} term and employing the result of Lemma 9 yields

𝑫^s𝑨^d22ε2tr(A)2s.\|\widehat{\bm{D}}^{s}-\widehat{\bm{A}}_{d}\|_{2}^{2}\leq\frac{\varepsilon^{2}\text{tr}(A)^{2}}{s}.

Letting

ε2=sε¯2(𝑨d22tr(A)2)\varepsilon^{2}=s\bar{\varepsilon}^{2}\Bigg{(}\frac{\|\bm{A}_{d}\|_{2}^{2}}{\text{tr}(A)^{2}}\Bigg{)}

gives

𝑫^s𝑨^d22ε¯2𝑨d22\|\widehat{\bm{D}}^{s}-\widehat{\bm{A}}_{d}\|_{2}^{2}\leq\bar{\varepsilon}^{2}\|\bm{A}_{d}\|_{2}^{2} (66)

still with probability 1−δ, if we have used a sufficient s. Note that this is not a relative estimation of the diagonal of \widehat{A}; rather, the error is measured relative to \|\bm{A}_{d}\|_{2}^{2}, since this is, overall, the error that we are interested in. Explicitly, for the Rademacher diagonal estimator, this results in sufficiency if

s>2s(tr(A)2𝑨d22)ln(2n/δ)ε¯2s>2(tr(A)𝑨d2)ln(2n/δ)ε¯\begin{split}s&>\frac{2}{s}\Bigg{(}\frac{\text{tr}(A)^{2}}{\|\bm{A}_{d}\|_{2}^{2}}\Bigg{)}\frac{\ln(2n/\delta)}{\bar{\varepsilon}^{2}}\\ \implies s&>\sqrt{2}\Bigg{(}\frac{\text{tr}(A)}{\|\bm{A}_{d}\|_{2}}\Bigg{)}\frac{\sqrt{\ln(2n/\delta)}}{\bar{\varepsilon}}\end{split} (67)

and we have arrived at O(1/ε) convergence, rather than the O(1/ε²) of all previous estimates. We remark that when (67) is tight, this is a dramatic improvement. In the case where \|\bm{A}_{d}\|_{1}/\|\bm{A}_{d}\|_{2}=O(1), the result of (67) is very favourable (such a case may be found for the "Model Hamiltonian" in Section 4 of bekas2007estimator ). Since \text{tr}(A)=\|\bm{A}_{d}\|_{1}\leq\sqrt{n}\|\bm{A}_{d}\|_{2}, the worst case scenario (when the diagonal entries take uniform values) yields

s>2nln(2n/δ)ε¯s>\frac{\sqrt{2n\ln(2n/\delta})}{\bar{\varepsilon}} (68)

and we observe the introduction of a potentially expensive \sqrt{n} factor, but one which, for moderate \bar{\varepsilon} and δ, will nonetheless give s ≪ n for large n.
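To put (68) into numbers (an illustrative calculation, not an experiment from this paper): taking, say, n = 10^6, δ = 0.01 and \bar{\varepsilon} = 0.1, the bound asks for s > \sqrt{2\cdot 10^{6}\ln(2\cdot 10^{8})}/0.1 \approx 6.2\times 10^{4}, which, whilst far from cheap, is still well below n.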

Once again, we have motivated the above assuming an exact rank-s approximation but, as in Hutch++, the same tools from randomised linear algebra halko2011finding mean that we can approximate V_s using O(s) queries. Specifically, O(s) queries are sufficient to find Q such that

(IQQT)A(IQQT)FO(AAsF).\|(I-QQ^{T})A(I-QQ^{T})\|_{F}\leq O(\|A-A_{s}\|_{F}). (69)

In detail (see musco2020projection , Corollary 7 and Claim 1), provided O(r+log(1/δ))O(r+\log(1/\delta)) vectors have been used to form Qn×rQ\in\mathbb{R}^{n\times r}, then, with probability 1δ1-\delta

A(IQQT)F22AArF2\|A(I-QQ^{T})\|_{F}^{2}\leq 2\|A-A_{r}\|_{F}^{2}

and since

(IQQT)A(IQQT)F(IQQT)2A(IQQT)F=A(IQQT)F\begin{split}\|(I-QQ^{T})A(I-QQ^{T})\|_{F}&\leq\|(I-QQ^{T})\|_{2}\|A(I-QQ^{T})\|_{F}\\ &=\|A(I-QQ^{T})\|_{F}\end{split}

equation (69) holds for s = O(r + log(1/δ)). Thus, overall it is sufficient to take s=O\Big{(}\frac{\text{tr}(A)}{\|\bm{A}_{d}\|_{2}}\frac{\sqrt{\log(n/\delta)}}{\bar{\varepsilon}}+\log(1/\delta)\Big{)} queries, where clearly the first term dominates. Applying the same ideas as in equations (65) to (67), for the randomised Rademacher case we readily find that

s>4(tr(A)𝑨d2)ln(2n/δ)ε¯+clog(1/δ)s>4\Bigg{(}\frac{\text{tr}(A)}{\|\bm{A}_{d}\|_{2}}\Bigg{)}\frac{\sqrt{\ln(2n/\delta)}}{\bar{\varepsilon}}+c\log(1/\delta) (70)

for some constant c, is sufficient to approximate the diagonal to within \bar{\varepsilon} error, with probability 1−δ. Of course, we remark that this might be a very pessimistic bound: we have ignored \|\widehat{\bm{A}}_{d}\|_{2}^{2} and employed the potentially loose result of Lemma 9, so cannot necessarily expect this bound to be good. However, if we have a steep spectrum (such an example may be found for the "Graph Estrada Index" in Section 5.2 of meyer2021hutch++ ), with both \|\widehat{A}\|_{F}^{2}\gg\|\widehat{\bm{A}}_{d}\|_{2}^{2} and \|A\|_{F}\approx\text{tr}(A), then the bound on s should result in Diag++ outperforming simple stochastic estimation. Concretely, the algorithm for Diag++ with Rademacher query vectors is as follows:

Algorithm 1 Diag++. Adaptation of Hutch++ for diagonal estimation.

Input: Implicit-matrix oracle, for A0A\succeq 0. Number of queries ss.
Output: Diag++(AA): Approximation to diag(AA)

1:Draw R1n×s/3R_{1}\in\mathbb{R}^{n\times s/3} with Rademacher entries.
2:Compute orthonormal basis QQ for AR1AR_{1}
3:Estimate diag((I-QQ^{T})A(I-QQ^{T})) by \widehat{\bm{D}}^{s/3}=\frac{3}{s}\sum_{k=1}^{s/3}\bm{v}_{k}\odot(I-QQ^{T})A(I-QQ^{T})\bm{v}_{k}, where the \bm{v}_{k} have Rademacher entries
4:return Diag++(AA) =diag(QQTAQQT)+𝑫^s/3=\text{diag}(QQ^{T}AQQ^{T})+\widehat{\bm{D}}^{s/3}

In total, s matrix-vector multiplications with A are required: s/3 for AR_1, s/3 for \widehat{\bm{D}}^{s/3} and s/3 for AQ. (A recent preprint persson2021improved suggests that one can often improve the estimate by adaptively choosing the number of queries used for computing AR_1, AQ and \widehat{\bm{D}}^{s/3}, instead of the equal split of s/3 used in Hutch++; it would be of interest to explore such an improvement for diagonal estimation.) Computing Q requires O(s^{2}n) extra flops, but this is dominated by the cost of the s matrix-vector multiplications: O(sn^{2}).
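For concreteness, Algorithm 1 can be implemented in a few lines on top of a matrix-vector oracle. The sketch below is our own NumPy rendering of the pseudocode (the function and variable names are ours); it accepts a matvec that applies A to a block of vectors and spends roughly s/3 queries on each of AR_1, AQ and the estimate of the remainder.

```python
import numpy as np

def diag_pp(matvec, n, s, rng):
    """Sketch of Algorithm 1 (Diag++) with Rademacher query vectors.
    matvec(X) must return A @ X for an (n, k) block X."""
    k = s // 3
    # Steps 1-2: randomised range finder for the top eigenspace of A.
    R1 = rng.choice([-1.0, 1.0], size=(n, k))
    Q, _ = np.linalg.qr(matvec(R1))                 # orthonormal basis of A R1 (k queries)
    AQ = matvec(Q)                                  # k further queries
    # Exact diagonal of the projected part: diag(Q Q^T A Q Q^T) = diag(Q (Q^T A Q) Q^T).
    T = Q.T @ AQ
    diag_proj = np.einsum('ij,jk,ik->i', Q, T, Q)
    # Step 3: Rademacher estimate of diag((I - QQ^T) A (I - QQ^T)) with k queries.
    V = rng.choice([-1.0, 1.0], size=(n, k))
    W = V - Q @ (Q.T @ V)                           # (I - QQ^T) V
    AW = matvec(W)
    PAW = AW - Q @ (Q.T @ AW)                       # (I - QQ^T) A (I - QQ^T) V
    diag_rem = np.mean(V * PAW, axis=1)
    return diag_proj + diag_rem                     # Step 4

# Hypothetical usage on an explicit matrix with a steep spectrum:
rng = np.random.default_rng(0)
n = 1000
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U.T * np.arange(1, n + 1, dtype=float) ** -1.5) @ U
est = diag_pp(lambda X: A @ X, n, s=90, rng=rng)
print(np.linalg.norm(est - np.diag(A)) / np.linalg.norm(np.diag(A)))
```

On such a steeply decaying spectrum most of the diagonal is captured by the projection term, which is exactly the regime in which (70) predicts Diag++ to outperform plain stochastic estimation.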

9.3 Numerical Experiments

To investigate the performance of Diag++, we once again turn to synthetic matrices A = V^TΛV, with Λ_ii = i^{-c} and V ∈ ℝ^{5000×5000} a random orthogonal matrix formed by orthogonalising a Gaussian matrix. Recall that with stochastic estimation alone, convergence was slow for such matrices with steep spectra. We expect Diag++ to perform more or less similarly to standard stochastic estimation if the spectrum of the matrix is flat, since (70) is then pessimistic, whilst if the spectrum is steep, (70) is much more optimistic and faster convergence is to be expected. The algorithm is robust to the spectrum of A, making the "best of both worlds" from projection and estimation: if the spectrum is steep, then we shall be able to capture most of the diagonal of A through projection, whilst if it is flat, the stochastic estimation of the diagonal of the remainder \widehat{A} will buffer the effects of projection. This is exactly what is found in Figure 5. Not only is Diag++ robust to spectra decaying at different rates, it also converges faster than simple stochastic estimation when the matrices in question have rapidly decaying eigenvalues; the algorithm even overtakes simple projection after a handful of vector queries in such cases. Thus, the answer to our original question at the start of this section is yes: we can improve stochastic estimation, particularly for what would otherwise be worst case scenarios.

Figure 5: Relative error versus number of matrix-vector queries: we report the median and 90th90^{th} percentile after 10 trials for Synthetic Matrices with c=0.5, 1, 1.5c=0.5,\>1,\>1.5. Whilst pure stochastic estimation outperforms Diag++ for slower eigenvalue decay, Diag++ converges at roughly the same rate. Estimation through projection only however, is poor. Meanwhile Diag++ obtains high performance for faster eigenvalue decay. It is clear that this algorithm is robust to the spectrum of the matrix in question, converging for both steep and flat eigenvalue distributions.

10 Conclusion

10.1 Summary

In setting out, we sought to gain insight into particular questions surrounding diagonal estimation:

  1.

    How many query vectors are needed to bound the error of the stochastic diagonal estimator in (9), and how does this change with the distribution of the vector entries?

  2.

    How does matrix structure affect estimation, and how might we improve worst case scenarios?

In answer to 1, we have seen that elemental error probabilistically converges as O(1/ε2)O(1/\varepsilon^{2}), and scales as 𝑨i22Aii2\|\bm{A}_{i}\|_{2}^{2}-A_{ii}^{2}. We have seen that for the entire diagonal, similar results arise, with error going as O(1/ε2)O(1/\varepsilon^{2}), and scaling as AF2𝑨d22\|A\|_{F}^{2}-\|\bm{A}_{d}\|_{2}^{2}. Additionally we have observed both theoretically and numerically, that if ss is small, error arising from the use of the Rademacher diagonal estimator is less than that arising from the Gaussian diagonal estimator. Conversely, if ss is large, there is an inconsequential difference between the two estimators.

In answer to 2, we have seen, as per our intuitions, that both the eigenvector basis and the eigenvalue spectrum influence the estimation of positive semi-definite matrices. Generally, the steeper the spectrum, the worse the stochastic diagonal estimator performs. However, we have introduced a new method of overcoming these worst case scenarios. We have analysed it to find favourable convergence properties, and demonstrated its superiority for estimating the diagonal of general dense matrices with steep spectra, whilst maintaining robustness to flatter spectra.

10.2 Future Work

There is ample scope to extend the work presented herein, both to applications, and further analysis.

  1.

    We have seen that the error of a given diagonal element is dependent on the rest of its row. How can we ascertain the actual error of our estimates if we do not know the norm of the row of our implicit matrix, or if we do not have an idea of its spectrum? Is knowledge of the context in which stochastic estimation is used a necessity? We note that we can make tracks on this question since we can estimate \|A\|_{F}^{2} via trace estimation of \text{tr}(A^{T}A), a free byproduct of diagonal estimation.

  2.

    In our analysis we have perhaps wielded a hammer, rather than a scalpel, in order to arrive at the bounds for \|\bm{D}^{s}-\bm{A}_{d}\|_{2}. In particular, union bounding introduces the potentially unfavourable factor of \ln(n). Perhaps future analysis could employ the dependence of the element estimates upon one another in our favour to remove such a term?

  3.

    Having noted that Yao et al. yao2020adahessian use only a single Rademacher estimator in their state-of-the-art second order machine learning optimiser, how much better will their method become with further estimation? Given the possibility of parallelised estimation, will this introduce a rapid new stochastic optimisation technique that outperforms current methods?

These are interesting and exciting open questions, providing the potential investigator with both theoretical and numerical directions in which to head. In the course of this work, we also considered extending the elemental diagonal analysis to shed new light on the original investigations into trace estimation. Whilst initial work seemed fruitful, the dependence of the element estimates upon one another confounded further analysis. We readily invite suggestions or insights into how these ideas may be useful in approaching trace estimation.

References

  • [1] R. P. Adams, J. Pennington, M. J. Johnson, J. Smith, Y. Ovadia, B. Patton, and J. Saunderson. Estimating the spectral density of large implicit matrices. arXiv:1802.03451, 2018.
  • [2] T. Ando, R. A. Horn, and C. R. Johnson. The singular values of a Hadamard product: A basic inequality. Linear Multilinear Algebra, 21(4):345–365, 1987.
  • [3] H. Avron and S. Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2):1–34, 2011.
  • [4] R. A. Baston. On matrix estimation. Master’s thesis, University of Oxford, 2021.
  • [5] C. Bekas, E. Kokiopoulou, and Y. Saad. An estimator for the diagonal of a matrix. Appl. Numer. Math., 57(11-12):1214–1229, 2007.
  • [6] C. Boutsidis, P. Drineas, P. Kambadur, E.-M. Kontopoulou, and A. Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra Appl., 533:95–117, 2017.
  • [7] V. Braverman, R. Krauthgamer, A. Krishnan, and R. Sinoff. Schatten norms in matrix streams: Hello sparsity, goodbye dimension. In ICML 2020. PMLR, 2020.
  • [8] J. Chen. How accurately should I compute implicit matrix-vector products when applying the Hutchinson trace estimator? SIAM J. Sci. Comp, 38(6):A3515–A3539, 2016.
  • [9] D. Cohen-Steiner, W. Kong, C. Sohler, and G. Valiant. Approximating the spectrum of a graph. In Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining, pages 1263–1271, 2018.
  • [10] A. Cortinovis and D. Kressner. On randomized trace estimates for indefinite matrices with an application to determinants. Found. Comput. Math., pages 1–29, 2021.
  • [11] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1–25, 2011.
  • [12] E. Di Napoli, E. Polizzi, and Y. Saad. Efficient estimation of eigenvalue counts in an interval. Numer. Lin. Alg. Appl., 23(4):674–692, 2016.
  • [13] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
  • [14] A. S. Gambhir, A. Stathopoulos, and K. Orginos. Deflation as a method of variance reduction for estimating the trace of a matrix inverse. SIAM J. Sci. Comp, 39(2):A532–A558, 2017.
  • [15] S. Goedecker. Linear scaling electronic structure methods. Reviews of Modern Physics, 71(4):1085, 1999.
  • [16] S. Goedecker and M. Teter. Tight-binding electronic-structure calculations and tight-binding molecular dynamics with localized orbitals. Physical Review B, 51(15):9455, 1995.
  • [17] G. H. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
  • [18] G. H. Golub and U. Von Matt. Generalized cross-validation for large-scale problems. J. Comput. Graph. Stat., 6(1):1–34, 1997.
  • [19] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288, 2011.
  • [20] I. Han, D. Malioutov, H. Avron, and J. Shin. Approximating spectral sums of large-scale matrices using stochastic Chebyshev approximations. SIAM J. Sci. Comp, 39(4):A1558–A1585, 2017.
  • [21] I. Han, D. Malioutov, and J. Shin. Large-scale log-determinant computation through stochastic Chebyshev expansions. In ICML 2015, pages 908–917. PMLR, 2015.
  • [22] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun. Stat. - Simul. Comput., 18(3):1059–1076, 1989.
  • [23] J. Li, A. Sidford, K. Tian, and H. Zhang. Well-conditioned methods for ill-conditioned systems: Linear regression with semi-random noise. arXiv:2008.01722, 2020.
  • [24] L. Lin. Randomized estimation of spectral densities of large matrices made accurate. Numer. Math., 136(1):183–213, 2017.
  • [25] L. Lin, Y. Saad, and C. Yang. Approximating spectral densities of large matrices. SIAM Rev., 58(1):34–65, 2016.
  • [26] R. A. Meyer, C. Musco, C. Musco, and D. P. Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
  • [27] C. Musco and C. Musco. Projection-cost-preserving sketches: Proof strategies and constructions. arXiv:2004.08434, 2020.
  • [28] C. Musco, P. Netrapalli, A. Sidford, S. Ubaru, and D. P. Woodruff. Spectrum approximation beyond fast matrix multiplication: Algorithms and hardness. arXiv:1704.04163, 2017.
  • [29] D. Persson, A. Cortinovis, and D. Kressner. Improved variants of the Hutch++ algorithm for trace estimation. arXiv:2109.10659, 2021.
  • [30] F. Roosta-Khorasani and U. Ascher. Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212, 2015.
  • [31] A. Stathopoulos, J. Laeuchli, and K. Orginos. Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. SIAM J. Sci. Comp, 35(5):S299–S322, 2013.
  • [32] J. M. Tang and Y. Saad. Domain-decomposition-type methods for computing the diagonal of a matrix inverse. SIAM J. Sci. Comp, 33(5):2823–2847, 2011.
  • [33] S. Ubaru and Y. Saad. Applications of trace estimation techniques. In International Conference on High Performance Computing in Science and Engineering, pages 19–33. Springer, 2017.
  • [34] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • [35] Z. Yao and A. Gholami. With the authors: AdaHessian optimiser. https://youtu.be/S87ancnZ0MM?t=2000. Accessed: 24/08/2021.
  • [36] Z. Yao, A. Gholami, S. Shen, M. Mustafa, K. Keutzer, and M. W. Mahoney. AdaHessian: An adaptive second order optimizer for machine learning. arXiv:2006.00719, 2020.