University of Oxford
Tel.: +44-1865-615319
Email: [email protected], [email protected]
Stochastic diagonal estimation: probabilistic bounds and an improved algorithm
Abstract
We study the problem of estimating the diagonal of an implicitly given matrix $A$. For such a matrix we have access to an oracle that allows us to evaluate the matrix-vector product $Av$. For a random vector $v$ drawn from an appropriate distribution, this may be used to return an estimate of the diagonal of the matrix $A$. Whilst results exist for probabilistic guarantees relating to the error of estimates of the trace of $A$, no such results have yet been derived for the diagonal. We make two contributions in this regard. We analyse the number of queries required to guarantee that, with probability at least $1-\delta$, the relative error of the estimates of the diagonal entries is at most $\epsilon$. We extend this analysis to the 2-norm of the difference between the estimate and the diagonal of $A$. We prove, discuss and experiment with bounds on the number of queries required to guarantee a probabilistic bound on the estimates of the diagonal by employing Rademacher and Gaussian random variables. Two sufficient upper bounds on the minimum number of query vectors are proved, extending the work of Avron and Toledo [JACM 58(2):8, 2011], and later work of Roosta-Khorasani and Ascher [FoCM 15, 1187-1212, 2015]. We first prove sufficient bounds for diagonal estimation with Gaussian vectors, and then similarly with Rademacher vectors. We find that, generally, there is little difference between the two, with the error for individual diagonal elements converging at the Monte Carlo rate $1/\sqrt{s}$ in the number of queries $s$. However, for small $s$, we find that the Rademacher estimator is superior. These results allow us to extend the ideas of Meyer, Musco, Musco and Woodruff [SOSA, 142-155, 2021], suggesting the algorithm Diag++, to speed up the convergence of diagonal estimation from $\mathcal{O}(1/\epsilon^2)$ to $\mathcal{O}(1/\epsilon)$ queries and to make it robust to the spectrum of any positive semi-definite matrix $A$. We analyse our algorithm to demonstrate state-of-the-art convergence.
Keywords: Monte Carlo · Stochastic estimation · Diagonal · Matrix · Trace estimation

MSC: 65C05 · 68W20 · 60E15

1 Introduction
An ever-present problem in numerical linear algebra relates to estimating the trace of a matrix that may only be accessed through its operation on a given vector. Such a calculation is required in a wide variety of applications, including: matrix norm estimation han2017approximating ; musco2017spectrum , image restoration golub1979generalized ; golub1997generalized , density functional theory bekas2007estimator and log-determinant approximation boutsidis2017randomized ; han2015large . The concepts surrounding diagonal estimation are younger than those of trace estimation, but have recently come of age in multiple applications. The theory, however, is lagging. Originally used for electronic structure calculations bekas2007estimator ; goedecker1999linear ; goedecker1995tight , the ideas developed for diagonal estimation have recently been used by Yao et al. yao2020adahessian in a novel, state-of-the-art machine learning algorithm. They estimate the diagonal of a Hessian matrix to “precondition” descent vectors, for Newton-style descent, in the training of convolutional neural networks. Their method produces exceptional accuracy across a wide array of problems. By understanding the properties of diagonal estimation more fully we may uncover new algorithms or methods to improve even these results.
Such estimation involves a matrix $A$ which is a “black box”. We are given an oracle that may evaluate $Av$ for any vector $v$, and we seek to return the chosen property of $A$ using as few query vectors as possible: with $s$ queries, we desire an accurate estimate. The first exploration into this area was by Hutchinson hutchinson1989stochastic , which resulted in the standard approach to estimate the trace of a matrix:
$$\operatorname{tr}(A) \approx \operatorname{tr}_R^s(A) := \frac{1}{s}\sum_{k=1}^{s} v_k^{T} A v_k. \qquad (1)$$
The $v_k$ are vectors with entries given as independent, identically distributed Rademacher variables: the $i$th entry of $v_k$ takes the value $+1$ or $-1$ with equal probability, with independence across both $i$ and $k$. The subscript $R$ signifies the distribution of the entries of $v_k$ to be Rademacher.
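For concreteness, the following is a minimal sketch of Hutchinson's estimator (1), assuming the implicit matrix is exposed only through a matrix-vector product routine; the function name and interface are illustrative rather than taken from any reference implementation.

```python
import numpy as np

def hutchinson_trace(matvec, n, s, rng=None):
    """Monte Carlo trace estimate: (1/s) * sum_k v_k^T (A v_k), Rademacher v_k."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(s):
        v = rng.choice([-1.0, 1.0], size=n)   # i.i.d. +/-1 entries
        total += v @ matvec(v)                # one oracle call A @ v per query
    return total / s

# Example usage with an explicit matrix standing in for the oracle:
# A = np.random.default_rng(0).standard_normal((500, 500)); A = A @ A.T
# print(hutchinson_trace(lambda x: A @ x, 500, 100), np.trace(A))
```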
Later work expanded on this idea, using Gaussian vectors with i.i.d. standard normal entries, but until the paper of Avron and Toledo avron2011randomized , analysis and comparison of these estimators was limited to explorations of their variance. It is well known hutchinson1989stochastic that the variance of the Hutchinson method is the lowest when compared to other standard methods, and thus it is used most extensively in applications. However, whilst intuition suggests this is a good choice, it is not completely rigorous. The techniques of avron2011randomized address the error bounds in probability directly.
The $(\epsilon,\delta)$ approach
In avron2011randomized the authors propose so-called $(\epsilon,\delta)$ bounds. A bound on the number of queries, $s$, sufficient to achieve a relative error of $\epsilon$ in the estimated trace is obtained. This is a probabilistic guarantee. Formally, for a pair of values $(\epsilon,\delta)$, a sufficient number of queries, $s$, is found such that, for a positive semi-definite matrix $A$,
$$\Pr\!\left(\left|\operatorname{tr}^s_D(A) - \operatorname{tr}(A)\right| \le \epsilon \operatorname{tr}(A)\right) \ge 1-\delta. \qquad (2)$$
The term $\operatorname{tr}^s_D(A)$ is the $s$-query estimator of the trace, using query vectors drawn from the distribution $D$. Supposing $s$ random vectors have been used to query the matrix, we have, with high probability, a relative error that is at most $\epsilon$, where $\epsilon$ is dependent on $s$. The authors conclude that analysis of the variance of the estimators is not the most informative approach for selecting which estimator is likely to return the best results for the fewest queries. This is a key insight.
The diagonal estimator.
Bekas et al. bekas2007estimator extended Hutchinson’s original method to diagonal estimation by considering the Hadamard product of $v_k$ and $Av_k$. The method is different from trace estimation in two regards. Firstly, in trace estimation, we sum the element-wise multiplication terms to get the inner product of $v_k$ and $Av_k$. This step is removed for diagonal estimation. Secondly, the estimation is not necessarily scaled by the number of queries, but by the element-size of the query vectors. This gives
$$D^s = \left[\sum_{k=1}^{s} v_k \odot A v_k\right] \oslash \left[\sum_{k=1}^{s} v_k \odot v_k\right]. \qquad (3)$$
The method returns $D^s$, an estimate of $\operatorname{diag}(A)$, the diagonal of $A$ reshaped as an $n$-dimensional vector. The symbol $\odot$ represents component-wise multiplication of vectors and $\oslash$ represents component-wise division. In bekas2007estimator , the authors focus on employing deterministic vectors for $v_k$; no further treatment of the stochastic estimators is given, except in numerical experiments.
Contributions
In this paper, we proceed to consider the same aims as those of Avron and Toledo avron2011randomized and those of Bekas et al. bekas2007estimator : applying the ideas of avron2011randomized to analyse the stochastic diagonal estimator suggested in bekas2007estimator and shown in equation (3). No paper to date has presented a theoretical bound on the number of queries required to achieve an approximation of the diagonal of a matrix. Specifically, we prove upper bounds on the minimum number of query vectors required for the relative (and “row-dependent”) error of the entries of the diagonal estimate to be at most $\epsilon$, with probability at least $1-\delta$. We extend this result to consider the error across all diagonal entries, and the associated number of queries required for a similar error bound. We consider estimators whose query vectors obey different distributions. This analysis allows us to then suggest and analyse a novel algorithm for speed-up of diagonal estimation. Extending the work in meyer2021hutch++ for trace estimation, we propose an algorithm for diagonal estimation that shifts the query cost from the classic Monte Carlo regime of $\mathcal{O}(1/\epsilon^2)$ to $\mathcal{O}(1/\epsilon)$, a fundamental, state-of-the-art improvement for diagonal estimation. Much of the paper is based on the first author’s MSc dissertation baston2021thesis .
Outline of this work
This paper is outlined as follows. In Section 2 we outline our use of notation, provide a brief overview of previous work, and outline the relation of the Rademacher diagonal estimator to Hutchinson’s trace estimator. In Section 3 we define what is meant by an $(\epsilon,\delta)$-approximator in the context of diagonal estimation, and outline two types of stochastic diagonal estimators, Gaussian and Rademacher. In Section 4 we state our results, proved in the following sections. We prove an upper bound on the minimum number of sufficient queries for an $(\epsilon,\delta)$-approximator using the Gaussian diagonal estimator in Section 5, and then for the Rademacher diagonal estimator in Section 6. Section 7 manipulates and explores these results in an attempt to make them more useful to the practitioner. We perform numerical experiments illustrating the results in Section 8. Where appropriate, a touch of theory is also covered in this section. In Section 9 we use these previous results to adapt the Hutch++ algorithm from meyer2021hutch++ , suggesting and analysing a similar algorithm, Diag++, that enables a speed-up of stochastic diagonal estimation. Section 10 provides concluding remarks.
2 Preliminaries
In this section we outline our use of notation, and provide a brief overview of previous work.
2.1 Notation
For all , denotes the norm and likewise denotes the norm. For a general matrix , denotes the Frobenius norm of the matrix, where, of course, is the element of from the row and the column. In the case of , is the trace of the matrix , with its eigenvalues. Sometimes in this paper we consider the matrix to be symmetric, or symmetric positive semi-definite (PSD). Such a matrix has an eigendecomposition of the form where is the orthogonal matrix of eigenvectors of and is a real-valued diagonal matrix. We write a vector of length containing the eigenvalues of in descending order. Where the differences between the eigenvalues are relatively small, we say a matrix has a “flatter” spectrum. Where the differences between the eigenvalues are larger, we describe the spectrum as “steep” (whilst there is much more nuance to be found in the spectra of matrices, this description serves our purposes sufficiently, without over-complicating the details). For PSD matrices all eigenvalues are non-negative. Where the matrix is diagonalisable we use the corresponding identity for the trace. We denote the optimal rank-$r$ approximation to the matrix. When the matrix is PSD, the eigenvalues of the matrix coincide with its singular values. Thus, by the Eckart-Young-Mirsky theorem eckart1936approximation , the optimal rank-$r$ approximation is obtained from the leading eigenvectors and eigenvalues. Finally we denote the reshaping of the diagonal elements of the matrix to a vector of length $n$. For this vector, in the PSD case the entries are necessarily non-negative.
2.2 Hutchinson’s Method and Previous Work
The classic Monte Carlo method for approximating the trace of an implicit matrix is due to Hutchinson hutchinson1989stochastic , who states and proves the following lemma.
Lemma 1
Let $A$ be an $n \times n$ symmetric matrix. Let $v$ be a random vector whose entries are i.i.d. Rademacher random variables; then $v^T A v$ is an unbiased estimator of $\operatorname{tr}(A)$, that is,
$$\mathbb{E}\left[v^T A v\right] = \operatorname{tr}(A) \qquad (4)$$
and
$$\operatorname{Var}\left[v^T A v\right] = 2\left(\|A\|_F^2 - \sum_{i=1}^{n} A_{ii}^2\right). \qquad (5)$$
Considering the variance term, we can easily see that it captures how much of the matrix’s “energy” (i.e. the Frobenius norm) is located in the off-diagonal elements. Unfortunately, in the general case, this variance may be arbitrarily large. More vexatious is the fact that Lemma 1 does not provide the user with a rigorous bound on the number of matrix-vector queries for a given accuracy. To address this issue, Avron and Toledo avron2011randomized introduced the following definition (Definition 4.1 of avron2011randomized ):
Definition 1
Let $A$ be a symmetric positive semi-definite matrix. A randomised trace estimator $T$ is an $(\epsilon,\delta)$-approximator of $\operatorname{tr}(A)$ if
$$\Pr\left(\left|T - \operatorname{tr}(A)\right| \le \epsilon \operatorname{tr}(A)\right) \ge 1-\delta. \qquad (6)$$
This definition provides a better way to bound the requisite number of queries. It guarantees that the probability of the relative error exceeding $\epsilon$ is at most $\delta$. For example, when estimating the trace according to Hutchinson’s method, it may be shown roosta2015improved that it is sufficient to take the number of queries, $s$, as
$$s \ge 6\,\epsilon^{-2}\ln(2/\delta) \qquad (7)$$
for any PSD matrix $A$ (see Theorem 1, Section 2, roosta2015improved ), whereas if Gaussian vectors are used, the bound on $s$ becomes
$$s \ge 8\,\epsilon^{-2}\ln(2/\delta) \qquad (8)$$
(Theorem 3, Section 3, roosta2015improved ). We note that the recent paper cortinovis2021randomized gives query bounds for indefinite matrices.
Both equations (7) and (8) immediately put us in the Monte Carlo regime, with error converging as $1/\sqrt{s}$ for $s$ queries. Let us step back and remark on the fact that the trace of a matrix is invariant under orthogonal similarity transforms of the matrix. As such, we can intuit that the loose, matrix-independent bounds in equations (7) and (8) are naturally to be expected. Can we say the same of estimating the diagonal? We cannot be sure that the previous intuition applies, as the diagonal of a matrix is not (generally) invariant under orthogonal similarity transformations.
2.3 Approximating the Diagonal of a Matrix
The ideas of Hutchinson are readily extended to approximate the full diagonal, as was first introduced by Bekas et al. bekas2007estimator , and which we outline here. Take a sequence of vectors $v_1,\dots,v_s$, and suppose that their entries follow some suitable distribution, being i.i.d. with expectation zero (because we scale the estimation as in equation (9), it may be shown that the distribution may have arbitrary variance, though unit variance is used for simplicity throughout). Then the diagonal approximation of the matrix $A$, expressed as an $n$-dimensional vector $D^s$, may be taken as
$$D^s = \left[\sum_{k=1}^{s} v_k \odot A v_k\right] \oslash \left[\sum_{k=1}^{s} v_k \odot v_k\right], \qquad (9)$$
where $\odot$ is the component-wise multiplication of vectors (a Hadamard product), and $\oslash$, similarly, the component-wise division of vectors. Let us examine a single component of $D^s$ to gain more insight. Let $D^s_i$ denote the $i$th element of the vector; writing $v_k^{(i)}$ for the $i$th entry of $v_k$, we have
$$D^s_i = A_{ii} + \sum_{j \ne i} A_{ij}\, \frac{\sum_{k=1}^{s} v_k^{(i)} v_k^{(j)}}{\sum_{k=1}^{s} \big(v_k^{(i)}\big)^2}.$$
In expectation, the coefficients of the off-diagonal entries $A_{ij}$, $j \ne i$, shall be zero. This is true provided that the components of the vectors $v_k$ themselves have an expectation of zero, regardless of the form of the denominator (easily shown by the chain rule of expectation).
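As an aside for the practitioner, a minimal sketch of the estimator (9) is given below, assuming the matrix is available only through a matrix-vector product routine. The function name, interface and choice of distributions are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def stochastic_diagonal(matvec, n, s, dist="rademacher", rng=None):
    """Stochastic diagonal estimate of equation (9):
    D^s = [sum_k v_k .* (A v_k)] ./ [sum_k v_k .* v_k]."""
    rng = np.random.default_rng(rng)
    num = np.zeros(n)   # accumulates sum_k v_k .* (A v_k)
    den = np.zeros(n)   # accumulates sum_k v_k .* v_k
    for _ in range(s):
        if dist == "rademacher":
            v = rng.choice([-1.0, 1.0], size=n)
        else:  # "gaussian": i.i.d. standard normal entries
            v = rng.standard_normal(n)
        num += v * matvec(v)
        den += v * v
    return num / den
```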
2.4 Relation to Hutchinson’s method
If the vectors have i.i.d. Rademacher entries, then it is easy to see that the sum of the entries of the vector $D^s$ is nothing but the Hutchinson trace estimate. To show this rigorously we examine the expectation and variance of a single component of $D^s$, when the query vector entries are independent Rademacher variables. We note that this is straightforward to find, and so the results are included only for completeness.
and by the linearity of expectation it is easy to see that
(10) |
Specifically for a single query we are returned equation (4), the first result of Lemma 1. We now find the variance; most easily done by considering a single estimate, . Thus
(11) |
where the vector appearing is a row of $A$ (speaking of a row possessing a 2-norm is intuitive, but formally we are really considering it reshaped as a vector). When considering the variance of the sum of the entries of $D^s$, we must also find the covariance of the terms between two elements. For two distinct indices, we have
and since and , along with the fact we are examining , only one term is returned: when and . All other terms vanish as a result of independence and zero expectation. This leaves
Of course, if is symmetric, this is . Thus, for such symmetric , we have
(12) |
So we are returned equation (5), the second result of Lemma 1 and the relationship of trace and diagonal estimation is clear.
The dependence of the elements upon one another threatens to confound any further analysis. Thus, to circumvent this, we first examine individual elements of the diagonal estimator, . When treating all elements simultaneously, we employ general bounds and inequalities that, irrespective of dependence, may be applied. In some instances, this can result in looser bounds than might be necessary in practice. Therefore our results ought to be treated as a guide, rather than exact truths. We are left with two main questions to answer:
1. How many query vectors are needed to bound the error of the stochastic diagonal estimator in (9), and how does this change with the distribution of vector entries?
2. How does matrix structure affect estimation, and how might we improve worst case scenarios?
3 Definitions and Estimators
3.1 Definitions
We state precisely what is meant by an $(\epsilon,\delta)$-approximator in the context of diagonal estimation: for reasons which will become apparent, we introduce two such definitions. We also define what is meant by a diagonal estimator when using vectors of a given distribution.
3.1.1 $(\epsilon,\delta)$-approximators
Definition 2
The $i$th component, $D^s_i$, of a diagonal estimator is said to be a row-dependent $(\epsilon,\delta)$-approximator if
where is the row of the matrix , reshaped as a vector.
The error bound in this definition may be tightened to be dependent only upon $A_{ii}$, giving Definition 3 for examination of the relative error:
Definition 3
The $i$th component, $D^s_i$, of a diagonal estimator is said to be a relative $(\epsilon,\delta)$-approximator if
3.1.2 Two estimators
Recall that all estimators follow the same simple pattern, as in equation (9): some random vector is drawn from a fixed distribution, and a normalised Hadamard product is used to estimate the diagonal. The process is repeated $s$ times, using i.i.d. queries.
The first estimator uses vectors whose entries are standard Gaussian (normal) variables.
Definition 4
A Gaussian diagonal estimator $D^s_G$ for any square matrix $A$ is given by
$$D^s_G = \left[\sum_{k=1}^{s} v_k \odot A v_k\right] \oslash \left[\sum_{k=1}^{s} v_k \odot v_k\right], \qquad (13)$$
where the vectors $v_k$ are $s$ independent random vectors whose entries are i.i.d. standard normal variables, $\mathcal{N}(0,1)$.
Note that the Gaussian estimator does not constrain the 2-norm of the vectors; so any single estimate may be arbitrarily large or small.
In contrast, the Rademacher diagonal estimator constrains the query vectors to have a fixed 2-norm, and results in a fixed denominator in the estimator equation.
Definition 5
A Rademacher diagonal estimator $D^s_R$ for any square matrix $A$ is given by
$$D^s_R = \frac{1}{s}\sum_{k=1}^{s} v_k \odot A v_k, \qquad (14)$$
where the vectors $v_k$ are $s$ independent random vectors whose entries are i.i.d. Rademacher variables: each entry takes the value $+1$ or $-1$ with probability $1/2$.
It is worth noting that, unlike the trace estimators, the diagonal estimators above return different values when applied to instead of . It is possible that some (but never all, as our results indicate) diagonal elements are estimated more accurately by working with .
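To make the two definitions concrete, the following snippet compares the estimators on a small explicit PSD matrix, reusing the illustrative `stochastic_diagonal` sketch from Section 2.3; the matrix, seeds and query counts are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
B = rng.standard_normal((n, n))
A = B @ B.T                       # a PSD test matrix standing in for the oracle
matvec = lambda x: A @ x
true_diag = np.diag(A)

for s in (1, 10, 100):
    d_rad = stochastic_diagonal(matvec, n, s, dist="rademacher", rng=1)
    d_gau = stochastic_diagonal(matvec, n, s, dist="gaussian", rng=1)
    rel = lambda d: np.linalg.norm(d - true_diag) / np.linalg.norm(true_diag)
    print(f"s = {s:3d}: Rademacher {rel(d_rad):.3f}, Gaussian {rel(d_gau):.3f}")
```

With a single query the Gaussian error can occasionally be very large, foreshadowing the unbounded-variance discussion of Section 8.2.1.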
4 Comparing the Quality of Estimators
Bekas et al. bekas2007estimator sought to replace stochastic estimation by taking the vectors to be deterministic, as empirically they found that stochastic estimation could be slow. Sections 5 and 6 show why. Table 1 summarises our results.
| Query vector distribution | Rademacher | Gaussian |
|---|---|---|
| Sufficient query bound for a row-dependent $(\epsilon,\delta)$-approximator | equation (32) | equation (15) |
| Sufficient query bound for a relative $(\epsilon,\delta)$-approximator | equation (40) | equation (26) |
| Permitted $\delta$ values | arbitrary | restricted (see Theorem 5.1) |
5 Gaussian diagonal estimator
We now prove an $(\epsilon,\delta)$ bound for the Gaussian diagonal estimator. We find a sufficient minimum number of queries $s$ for a row-dependent bound to hold.
Theorem 5.1
Let be the Gaussian diagonal estimator (13) of , then it is sufficient to take as
(15) |
with , such that the estimator satisfies
for , where is the row of the matrix, reshaped as a vector.
The proof of this theorem is best read as three sequential steps: rotational invariance, conditioned integration and function bounding. We caution the reader that an auxiliary vector is introduced in the course of the proof, with its own component notation. This notation is an unfortunate, yet unavoidable, consequence of the proof.
Proof
Step 1: Rotational invariance
The first trick is to rewrite the coefficient of each , in
(16) |
Let us denote a new vector, composed of the $i$th components of each query vector (we use separate notation so as not to confuse ourselves with the query vectors), such that the coefficient may be rewritten. Thus
Namely the coefficient of each , becomes
(17) |
for any orthogonal rotation , . The key insight is to exploit the rotational invariance of . Hence choose such that
So the coefficients (17) become, for all ,
since is orthogonal, so . Note that the unit direction of a standard normal multivariate vector and its 2-norm are independent, implying that and thus are independent of . Thus we may write (16) as
Let us denote as a vector containing the elements of the row of , excluding . Also let be a vector containing the independent Gaussian entries. That is . So we have
The second trick is to consider
(18) |
Once again, we use the rotational invariance of the distribution of . The distribution of is equal to , where now, and rotates to . Hence we have that the distribution of is equal to some , giving
(19) |
where clearly the numerator and denominator are independent (this is a scaled F-distribution).
Step 2: Conditioned integration
Recall that and so , is a chi-squared distribution of degree , and, of course, , is a chi-squared distribution of degree 1. Thus the proof reduces to bounding on the right hand side of equation (19). This is most easily examined by investigating the tail bound,
(20) |
where, for ease of notation, we have let . The next key insight is to approach this by conditioning the probability. Thus we have,
(21) |
where is the probability density function of the variable. So we have
which, introducing the pdf, gives
Let , such that . Thus substitution yields
Step 3: Function bounding
Note that since and , and consider that
for . Therefore we can bound by
(22) |
Naively (we do not deal with optimal here), setting , and bounding the above by gives
(23) |
thus we have
(24) |
as a sufficient bound for , such that, for all such
(25) |
which completes the proof.
Implications
Theorem 5.1 deals only with a single entry of the diagonal. Concretely, we have found that
for . This means is a row-dependent -approximator for all such . For a relative -approximator, we must substitute
to get
provided
(26) |
For the entire diagonal, if we demand we may apply the ideas from the proof of Theorem 5.1, to obtain the number of vectors required for this condition to hold.
Corollary 1
For
with it is sufficient to take
query vectors if vector entries are i.i.d Gaussian, where is the row of the matrix A, and is the component of the Gaussian diagonal estimator.
Proof
Taking the summation over , Corollary 1 means we have, with probability at least
(28) |
if we have as in equation (27). To make this bound a relative estimate, in comparison to , we write
to get
(29) |
provided that the number of query vectors is
(30) |
In summary, we have found that to get a relative 2-norm estimate of the entire diagonal with error at most , and probability at least , it is sufficient to use the queries as in equation (30), if Gaussian query vectors are used.
6 Rademacher diagonal estimator
This section deals with the proof of an bound for the Rademacher diagonal estimator. We choose to break this down into two main steps, the first of which is given in Lemma 2. The lemma concerns the expectation of an exponential, which will subsequently be used in a Chernoff-style argument. One may regard it simply as a slight tweak on Khintchine’s lemma, but introducing an additional summation term and so we outline it fully for completeness. Before continuing, we note that examining the Rademacher estimator element-wise we have
(31) |
where we have rewritten the product of the two query-vector entries as a single Rademacher variable (it is easy to verify that a Rademacher variable times an independent Rademacher variable is also a Rademacher variable). We now introduce the following lemma.
Lemma 2
For
we have that
where is a vector containing the elements of the row of A, excluding the element.
Proof
Note that is simply
Consider now
which completes the proof.
With this lemma in place, we can now prove an bound for the Rademacher diagonal estimator. Specifically, we find the minimum sufficient number of queries , for a row-dependent bound to hold.
Theorem 6.1
Let be the Rademacher diagonal estimator (14) of , then it is sufficient to take as
(32) |
with arbitrary , such that the estimator satisfies
where is the row of the matrix reshaped as a vector.
Proof
The problem may be reduced to finding the tail probability, . Investigating the positive tail probability we have that
(33) |
This is minimised for an appropriate choice of parameter, so we obtain (we can also obtain (34) from a direct application of Hoeffding’s bound for the sum of unbounded random variables (wainwright2019high , Prop. 2.5))
(34) |
Since we are considering , we need also to bound the negative tail
It is also easily shown, by Lemma 2, that
which means that, just as for the positive tail
(35) |
Or indeed, we may easily deduce the above by considering the symmetry of the Rademacher variables in Lemma 2.
Implications
The above result is for a single diagonal entry. We have found that
so that $D^s_i$ is a row-dependent $(\epsilon,\delta)$-approximator for such $s$. For a relative $(\epsilon,\delta)$-approximator we must substitute (making the corresponding substitution in equation (33) yields the same result as in equation (40))
to get
(39) |
provided
(40) |
For the whole diagonal, and in much the same way as when considering the Gaussian case, if we demand , we need simply apply the ideas from the proof of Theorem 6.1, to obtain the number of vectors required for this condition to hold.
Corollary 2
For
with arbitrary it is sufficient to take
(41) |
query vectors if vector entries are i.i.d Rademacher, where is the row of the matrix A, and is the component of the Rademacher diagonal estimator.
Proof
Thus taking the summation over , we have, with probability
(43) |
provided we have used -queries as in (42). If we seek a relative error, in comparison to , let
to get
(44) |
provided
(45) |
Overall, we have found that to get a relative 2-norm estimate of the entire diagonal with error at most , and probability at least , it is sufficient to use the queries as in equation (45), if we use Rademacher query vectors.
A brief comparison
Comparing (15) with (32) (or their whole-diagonal counterparts), we see that both the Rademacher estimator and the Gaussian estimator display strikingly similar criteria for the selection of $s$. Whilst the Gaussian estimator appears slightly worse than its Rademacher counterpart, this difference is, in general terms, very small. A more insidious point to note relates to the demands made in the Gaussian case: a condition was required for equation (22) to hold. Whilst seemingly a minor detail, this becomes very important if we are to use a small number of query vectors $s$, say, on the order of just a few. In Section 8 we demonstrate the implications of this assumption, and offer intuition as to why, when $s$ is small, the Gaussian diagonal estimator performs worse than the Rademacher diagonal estimator.
7 Positive-semi-definite diagonal estimation
We now examine the case where , as is relevant in many applications braverman2020schatten ; chen2016accurately ; cohen2018approximating ; di2016efficient ; han2017approximating ; li2020well ; lin2016approximating ; ubaru2017applications , and analysed extensively in avron2011randomized ; roosta2015improved . For both estimators we have
(46) |
with probability, , provided we have used queries and for given . As before, is the row of the matrix (and therefore the column by symmetry). There is no escaping the dependence of the absolute error on the matrix structure, and, as we have also seen, shifting to a relative error framework means that the number of sufficient queries becomes dependent on the matrix. Can we draw any other general conclusions about the dependence of the number of queries on the structure of the matrix? Might they fit with our intuitions, and indicate when stochastic estimation may be suitable? We find, as expected, that both flatter spectra lend themselves to diagonal estimation and that the eigenvector basis of the matrix plays a key role. These results are included for the practitioner, who may have access to some of the quantities explored in the following sections (e.g. ), or may be able to simulate other quantities in particular scenarios (e.g. ).
7.1 Spectrum influence - element-wise
To make a start let us consider the following lemma. Recall that is the largest eigenvalue.
Lemma 3
For all we have that .
Proof
We may express the column as
where is the eigenvector of and so we also have that . Thus
where the second equality holds by orthogonality, and the inequality by positive-semi-definiteness.
Remark 1
For , where , then may take the value . When does take this value, by Lemma 3 we have that and thus all diagonal zeroes will be found exactly by either the Rademacher or Gaussian estimator.
Now suppose , such that , then by equation (46) and Lemma 3, for sufficient we have
(47) |
Then, setting , we shall return to a relative -approximator provided
(48) |
Extending this to all diagonal elements, we have the following lemma.
Lemma 4
For
we have that
(49) |
is a sufficient number of queries for both the Rademacher or Gaussian estimator.
Of course, this analysis produces a weaker sufficiency bound on $s$ than either of those in Table 1 (due to the inequalities applied, we expect to arrive at sufficiency before the value in (49)), but it is more likely to be practically computable (e.g. via a few steps of the Lanczos algorithm) or perhaps even known for implicit $A$.
Lemma 4 suggests roughly that the conditioning of the matrix indicates its ability to be estimated. For a given , if is larger, error estimates are likely to be worse, but as approaches 1 they become better and better. This is intuitive, since if , then the matrix shall be a scalar multiple of the identity: for some scalar . Hence the lower bound on becomes and exact estimates are found after a single query. More generally, if , then the bounds (48) and (49) are good, and stochastic estimation is suitable to estimate the diagonal. Conversely, if or greater, then clearly this bound is no longer helpful since we seek queries.
However, if the bound of Lemma 3 is loose, and line three of equation (47) is loose, then the bounds (48) and (49) are too pessimistic. To partly rectify this, and extend these ideas for general , we stop at the second line of (47) to write, with
where is best considered as the “conditioning” of a diagonal element. Note again that in this analysis we are considering , since we already know that if it will be found exactly. Let us define
(50) |
as the conditioning of the diagonal. So we have,
and the same ideas as for may be applied. In particular, note that we shall obtain a relative -approximator if, with
(51) |
and, extending this to the diagonal elements, we readily obtain the following lemma.
Lemma 5
For
we have that
(52) |
is a sufficient number of queries for both Rademacher and Gaussian estimators.
So if then stochastic estimation is suitable to estimate the diagonal of such a matrix. Since , this demand is easier to satisfy than demanding that .
7.2 Spectrum influence - the entire diagonal
So far we have built our understanding by considering single element estimates, and examining their implications for the whole diagonal. We can gain more information by examining the inequality
(53) |
since this accounts for the entire matrix structure. This holds, with probability at least if . Recall that, for , , so we have
(54) |
Now, we also have for , that
which means that
(55) |
and we can once again see the influence of the eigenvalues on the error of our estimates.
This equation, however, is dependent on the entire spectrum, rather than only on the extreme eigenvalues as was previously found. As expected, the conditioning of a matrix does not tell the whole story. The condition number of the matrix in question may be large, yet equation (55) may still suggest that the scaling of the absolute error will be acceptable. Clearly, a flat spectrum, such as that corresponding to the identity, yields a right hand side of zero. Meanwhile, the value is greatest when mass is concentrated on a single eigenvalue.
For completeness, note that the reverse is true of the mass of the diagonal, since
and so
(56) |
which is worse as diagonal elements become flatter but better as most diagonal mass is concentrated on a handful of entries.
7.3 Eigenvector influence
Equation (56) hints at the notion that it is not only the spectrum of that influences the error of diagonal estimation. Indeed, if we have a very steep spectrum, but the eigenvectors of are columns of the identity, that is , then we shall still have a diagonal matrix and so expect one-shot estimation. Consider the following.
Lemma 6
Let be a diagonalisable matrix, , with , and , then
where , is the Hadamard (element-wise) product of with itself.
Proof
The above is easily found by recalling
and simply extending to the matrix equation.
Hence, Lemma 6 and equation (54) imply, for ,
If we write
so we may readily obtain the following lemma.
Lemma 7
For
we have that
(57) |
is a sufficient number of queries.
This returns our prediction for one-shot estimation when the eigenvector matrix is the identity. In this case the bound (57) is zero, so a single query is sufficient to recover the diagonal exactly. Further analysis of the relevant quantity is complex and outside the scope of this paper. We point the interested reader to ando1987singular for further investigation.
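As a quick sanity check of this one-shot prediction, the following snippet (with an arbitrary diagonal test matrix) verifies that a single Rademacher query recovers the diagonal of a diagonal matrix exactly.

```python
import numpy as np

# If A is diagonal, a single Rademacher query recovers diag(A) exactly,
# since v .* (A v) = diag(A) .* v .* v = diag(A) and v .* v is the all-ones vector.
rng = np.random.default_rng(0)
n = 5
A = np.diag(rng.uniform(0.0, 10.0, size=n))
v = rng.choice([-1.0, 1.0], size=n)
estimate = (v * (A @ v)) / (v * v)
print(np.allclose(estimate, np.diag(A)))   # True
```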
7.4 Links with Hutchinson’s estimator variance
Lastly, once again let us turn our attention to the quantity:
(58) |
and recall that, but for a factor of 2, this is in fact the variance of Hutchinson’s estimator for the trace. A variety of papers suggest particular variance reduction schemes for Hutchinson’s estimator, focused on reducing (58). Certain approaches take advantage of the sparsity of the matrix stathopoulos2013hierarchical ; tang2011domain (by a process often called probing), whilst others still take a “decomposition” approach adams2018estimating . Perhaps the most promising approach in this case is that taken by Gambhir et al. gambhir2017deflation , Meyer et al. meyer2021hutch++ and Lin lin2017randomized . In these papers the approach is to form a decomposition by approximate projection onto the top eigenspace of $A$, using a matrix $Q$ with orthonormal columns designed to capture most of the column space of $A$. This matrix roughly spans the top eigenspace of $A$. In particular, gambhir2017deflation and lin2017randomized justify this approach for estimating the trace when $A$ is low rank, as the projection will capture most of the trace of $A$. A moment’s thought suggests that this method may be adapted to estimate the whole diagonal, since the projected matrices (for instance $QQ^TA$ or $QQ^TAQQ^T$) are randomised low-rank approximations to the matrix halko2011finding . We might do well to find the exact diagonal of such a projected part and then estimate the diagonal of the remainder of the matrix. We expand further on this idea in Section 9, extending the results and analysis of meyer2021hutch++ .
8 Numerical Experiments
8.1 Single element estimates
We first examine the Rademacher diagonal estimator and explore the implications of equations (39) and (40), since the sufficient query bound for such an estimator is likely the tightest derived thus far. Moreover, the value of $\delta$ in Theorem 6.1 is arbitrary, so any theoretical bound ought to hold for all values of $s$, at all values of $\delta$. For different diagonal elements we expect the same rates of convergence, but different scaling, according to the value of the corresponding row norm for each element.
This is exactly what is found in practice, as shown in Figure 1. This figure shows the relative convergence of estimates of selected diagonal elements of the real world matrix “Boeing msc10480” from the SuiteSparse matrix collection (formerly the University of Florida sparse matrix collection) davis2011university . These elements were chosen to be displayed as they have large differences in their row-dependent scaling, and so are most easily resolved in the plots of relative error. As expected, the larger this quantity, the greater the error after a given number of queries.
For a fixed value of and a given matrix entry, the ratio of the numerical error and theoretical bound is roughly fixed - the convergence of the numerical experiments mirrors that of the theoretical convergence bound. Thus the theoretical bound is tight in , as numerical error also scales with . In addition, for decreasing , the bound to observation ratio can be seen to decrease, suggesting that the query bound is tight in , since we would have otherwise expected it to grow.
[Figure 1: relative error against number of queries for Rademacher estimates of selected diagonal elements of the Boeing msc10480 matrix, together with the theoretical bounds.]
8.2 Rademacher vs Gaussian element estimates
| Query vector distribution | Rademacher | Gaussian |
|---|---|---|
| Sufficient query bound* | equation (32) | equation (15) |
| Demand on $\delta$ | none | see Theorem 5.1 |
[Figure 2: comparison of Rademacher and Gaussian diagonal element estimates on the Boeing/msc10480 matrix, with the corresponding theoretical bounds.]
Table 2 summarises sufficient query bounds for a row-dependent $(\epsilon,\delta)$-approximator for ease of reference. For a relative $(\epsilon,\delta)$-approximator, we simply introduce the matrix constant in each bound. The bound for the Rademacher diagonal estimator is lower than that for the Gaussian diagonal estimator, so we expect to be able to reach a lower error faster with the Rademacher estimator. Note, however, that since we have not tuned the proof of Theorem 5.1 for an optimal choice of parameter (see equation (23)), the query bound might indeed be lower, but not by much. Experiments suggest that (23) is quite tight unless $s$ is small.
Figure 2 illustrates this comparison, again on the matrix “Boeing/msc10480”. For the median experiments, , the Gaussian theoretical bound contains the corresponding experimental error, but for the percentile it seems as if this is not the case. In this instance, where , the lowest value may take is when . Hence
(59) |
so we can only expect the theoretical bound to be valid for sufficiently many queries. If only one Gaussian query is made the relative error appears not to be bounded by the theoretical limit, but this is no surprise, since Theorem 5.1 poses the question: for given $\epsilon$ and $\delta$, what is $s$ bounded by? Whereas experimentation shows us, for given $s$ and $\delta$, the value $\epsilon$ will take (note that the plots are of relative error, but in the context of Theorem 5.1 we refer to row-dependent $(\epsilon,\delta)$-approximators). The theorem gives us a tool to find $s$, but in return asks that a quantity be bounded from above (for equation (22) to be valid). The experiment, meanwhile, takes an input of $s$ and $\delta$ and returns $\epsilon$.
Whilst the theoretical bound in the right hand plot of Figure 2 only kicks in at , we have drawn it extending back to to illustrate the fact that it does not apply for small enough. Meanwhile, as noted, the Rademacher bound holds for arbitrary . The effect is further manifested by the fact that with , the Rademacher estimator is more accurate for sufficiently small , roughly .
8.2.1 Intuitions for performance with small
To gain more intuition for why the Rademacher diagonal estimator is better than the Gaussian diagonal estimator for small , it is useful to compare the variances of individual elements for the two estimators. From equation (11), the variance of the Radamacher estimate is
(60) |
whilst in the Gaussian case for , we find
(61) |
but for , the variance is not bounded. For a given , this is found in the following manner
where and are as in the proof of Theorem 5.1. Referring back to this proof again, we already know that the distribution of is the same as for some independent of . This gives
where the relevant ratio follows an F-distribution with the appropriate parameters. The expectation of such a variable is readily found for sufficiently large $s$, and otherwise is infinite, yielding the result of equation (61). Thus we expect the estimators to be similarly behaved for sufficiently large $s$ but not if $s$ is small, in which case we expect that the Rademacher estimator shall be the better of the two, as the numerical results indicate.
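As an illustrative check of this point, the following snippet (with an arbitrary dense PSD test matrix and hypothetical choices of element index and sample size) compares the empirical spread of single-query ($s=1$) estimates of one diagonal element under the two distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, i, trials = 100, 0, 20000
B = rng.standard_normal((n, n))
A = B @ B.T                                    # PSD test matrix

def one_query_estimate(dist):
    """i-th component of the s = 1 estimator: (v .* (A v))_i / (v .* v)_i."""
    v = rng.choice([-1.0, 1.0], size=n) if dist == "rademacher" else rng.standard_normal(n)
    return (v * (A @ v))[i] / (v * v)[i]

for dist in ("rademacher", "gaussian"):
    samples = np.array([one_query_estimate(dist) for _ in range(trials)])
    # The Rademacher sample variance settles near a fixed value; the Gaussian one
    # is typically far larger and unstable across runs, reflecting its heavy tails.
    print(dist, np.var(samples))
```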
8.2.2 Implications for related work
This has implications for the recent paper of Yao et al. yao2020adahessian . The authors estimate the diagonal of the Hessian of a loss function in machine learning tasks, implicit from back-propagation. Only a single Rademacher query vector is used due to computational restrictions. These results suggest it would be unwise to switch to a single Gaussian query vector as in Definition 4, since such an estimate would have unbounded variance.
We remark further that our results, particularly Theorem 6.1 and Corollary 2, suggest the algorithm may greatly benefit from taking further estimates of the diagonal. Indeed the authors indicate yao2020video that the noise the algorithm experiences is dominated by the error of estimation of the diagonal. It is now clear why.
8.3 Full diagonal estimates
Figure 3 shows the convergence of the Rademacher and Gaussian diagonal estimators to the full diagonal. Note that, in contrast to the experiments for single elements of the diagonal, the theoretical bounds appear looser. This is expected, since in arriving at the theoretical bounds for our complete diagonal estimates we have used multiple inequalities for single elements. Therefore the query bounds will be loose. Use of a union bound in Corollaries 1 and 2 shall further compound this loosening. Note once again that, in the Gaussian case, the theoretical bound holds only for sufficiently large $s$, as expected. The effects seen in the error of single diagonal element estimates carry over to the full diagonal, but are even more pronounced. For a handful of queries the Rademacher estimator outperforms its Gaussian counterpart, but as a greater number of queries are used the convergence behaviours become increasingly similar. As before, with the Rademacher estimator the ratio of the theoretical bound to the experimental error is roughly constant for each $s$, so the convergence is of the order of $1/\sqrt{s}$, just as for single elements. In the Gaussian case, for large enough $s$, the same may be said.
[Figure 3: convergence of the Rademacher and Gaussian estimators to the full diagonal, with theoretical bounds.]
8.4 Experiments on Synthetic Matrices.
We also test the Rademacher estimator on random matrices with power law spectra. We take $\Lambda$ to be diagonal with $\Lambda_{ii} = i^{-c}$, for varying $c$. We generate a random orthogonal matrix $U$ by orthogonalising a Gaussian matrix, and set $A = U\Lambda U^T$. A larger value of $c$ results in a more quickly decaying spectrum, meaning that the quantity in equation (55) will be larger, so we expect worse estimates for a given number of queries. This is exactly as found empirically, as shown in Figure 4. In particular, for the steeper spectra, the error of our estimates is still woeful even after many query vectors! This does not bode well for stochastic diagonal estimation of general dense matrices with sharply decaying spectra. Whilst we still have convergence, the error starts off too poorly for this to be of much use.
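A sketch of how such synthetic test matrices can be generated is given below; the eigenvalue form $\lambda_i = i^{-c}$ matches the power-law description above, and the function name is illustrative.

```python
import numpy as np

def power_law_psd(n, c, rng=None):
    """Random PSD test matrix A = U diag(lambda) U^T with lambda_i = i**(-c),
    U a random orthogonal matrix obtained by orthogonalising a Gaussian matrix."""
    rng = np.random.default_rng(rng)
    lam = np.arange(1, n + 1, dtype=float) ** (-c)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
    return (U * lam) @ U.T                             # equals U @ np.diag(lam) @ U.T
```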
8.5 Summary
At the start of this section, we set out to find an answer for the number of queries needed for a given error of estimation. We have seen that, for fixed $\delta$, the error goes as $1/\sqrt{s}$, and that it scales with the relevant row norm. We have seen that the Rademacher diagonal estimator has both a better theoretical bound than its Gaussian counterpart and lower variance for each element. This is in contrast to the conclusions of Avron and Toledo avron2011randomized , who found that Gaussian trace estimators were better than Hutchinson’s Rademacher estimator. However, there is no real contention here. Summing the components of the Gaussian diagonal estimator is not the same as the Gaussian estimator of the trace examined in avron2011randomized . The difference is merely an artifact of the denominator introduced in (13).
We have also seen that positive semi-definite matrices with flatter spectra are suitable for estimation in this way, but as the spectrum of a matrix becomes steeper, estimates become worse. Perhaps this may spell the end of stochastic diagonal estimation. Perhaps it may restrict it to a specific class of matrices, in contexts where the spectrum of the matrix is known to be close to flat due to properties of the system under investigation. All things considered, it so far seems that, at best, this form of estimation may only be useful under special circumstances. Might there be a way to improve on this poor performance for general, dense matrices with rapidly decaying spectra? This is the subject of Section 9. Since it is found to be the best, we take forward only the Rademacher diagonal estimator.
9 Improved diagonal estimation
Is it possible to improve stochastic diagonal estimation, particularly in the worst case scenarios of the previous section? This section seeks to address this by extending the ideas of meyer2021hutch++ , adapting the algorithm therein for full diagonal estimation. For positive semi-definite $A$, we propose an $\mathcal{O}(1/\epsilon^2)$ to $\mathcal{O}(1/\epsilon)$ improvement in the number of required query vectors and illustrate the robustness of our algorithm to matrix spectra.
9.1 Relation to Hutch++
We briefly give an overview of the work of Meyer, Musco, Musco and Woodruff meyer2021hutch++ . In their paper, they motivate their algorithm, “Hutch++” as a quadratic improvement in trace estimation. Using randomised projection to approximately project off the top eigenvalues, they then approximate the trace of the remainder of the matrix. All of the error results from this remainder.
They arrive at their result in the following manner, first introducing the following lemma (Lemma 2, Section 2, meyer2021hutch++ ) for a general square matrix .
Lemma 8
For . Let be the Hutchinson estimate of the trace. For fixed constants , if , then with probability
Thus, if then, with probability , .
So, for , we simply have , to get a relative trace approximator. Note, this bound is only tight when . The authors then give the following lemma (see Lemma 3, Section 3, meyer2021hutch++ ).
Lemma 9
Let be the best rank-r approximation to PSD matrix . Then
Proof
Since ,
They argue that this result immediately lends itself to the possibility of an algorithm with complexity. Setting and splitting , the first term may be computed exactly if (the top eigenvectors) is known, since . Thus the error in the estimate of the trace is contained in our estimate of . So (roughly) combining Lemmas 8 and 9, the error becomes, for their trace estimator:
(62) |
Hence for fixed is sufficient for an -approximator to the trace of .
Of course, the top eigenvectors cannot be computed exactly in a handful of matrix-vector queries, but this is easily resolved with standard tools from randomised linear algebra halko2011finding . Namely, a modest number of queries is sufficient to find an approximation with the required property with high probability, which is all that is required in the second line of (62), and is how the algorithm is implemented in practice. For a full, formal treatment of these ideas see meyer2021hutch++ , Theorem 4, Section 3. We elaborate on this in the next section.
9.2 Diag++
Here we outline how the above ideas may be applied to estimation of the entire diagonal, using queries. Recall once again, for
(63) |
it is sufficient to take
(64) |
We may now engage in a trade-off. For fixed and , we swap convergence for convergence. This comes at the (potentially minor) expense of a greater multiplicative factor in our query bound. Just as in Hutch++, we first motivate this supposing we know , the top eigenvectors, exactly.
Supposing we have projected off the -rank approximation and extracted its diagonal, , we are left to estimate the diagonal of
Denoting as the diagonal estimate of this, we have
(65) |
with probability, , if we have used a sufficient number of queries: for the Rademacher diagonal estimator. Ignoring the term, and employing the result of Lemma 9, yields
Letting
gives
(66) |
still with probability , if we have used sufficient . Note that this is not a relative estimation of the diagonal of ; rather the error is qualified relative to , since this is, overall, the error that we are interested in. Explicitly, for the Rademacher diagonal estimator, this results in sufficiency if
(67) |
and we have arrived at $\mathcal{O}(1/\epsilon)$ convergence, unlike all previous estimates! We remark that when (67) is tight, this is a dramatic improvement! In certain cases the result of (67) is very favourable (such a case may be found for the “Model Hamiltonian” in Section 4 of bekas2007estimator ). The worst case scenario (when the diagonal entries take uniform values) yields
(68) |
and we observe the introduction of a potentially expensive constant, but one which, for moderate and will nonetheless give for large .
Once again, we have motivated the above for exact -rank approximation, but, as in Hutch++, the same tools from randomised linear algebra halko2011finding mean that we can approximate using queries. Specifically queries are sufficient to find such that
(69) |
In detail (see musco2020projection , Corollary 7 and Claim 1), provided vectors have been used to form , then, with probability
and since
so equation (69) holds for . Thus, overall it is sufficient to take queries where clearly the first term dominates. Applying the same ideas as in equations (65) to (67), for the randomised Rademacher case we readily arrive at
(70) |
for some constant, is sufficient to approximate the diagonal to within the stated error, with probability $1-\delta$. Of course, we remark that this might be a very pessimistic bound. We have ignored certain terms and employed the potentially loose result of Lemma 9, so cannot necessarily expect this bound to be good. However, if we have a steep spectrum (such an example may be found for the “Graph Estrada Index” in Section 5.2 of meyer2021hutch++ ) satisfying the conditions above, then the bound on the number of queries should result in Diag++ outperforming simple stochastic estimation. Concretely, the algorithm for Diag++ with Rademacher query vectors is as follows:
Input: Implicit-matrix oracle, for . Number of queries .
Output: Diag++(): Approximation to diag()
for with Rademacher entries
In total, the required matrix-vector multiplications with $A$ are split evenly between the three stages: forming the sketch, computing the exact diagonal of the projected part, and stochastically estimating the diagonal of the remainder. (A recent preprint persson2021improved suggests that one can often improve the estimate by adaptively choosing the number of queries used for each part, instead of the equal split used in Hutch++. It would be of interest to explore such an improvement for diagonal estimation.) Computing the exact diagonal of the projected part requires extra flops, but this cost is dominated by that of the matrix-vector multiplications.
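The following is a minimal sketch of how such an algorithm can be realised within this query budget. The function name, the `matvec` interface, the equal three-way split of an assumed total budget `m`, and the choice of the one-sided deflation $\operatorname{diag}(A) = \operatorname{diag}(QQ^TA) + \operatorname{diag}(A - QQ^TA)$ (one of the options discussed in Section 7.4) are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np

def diag_pp(matvec, n, m, rng=None):
    """Sketch of a Diag++-style estimator with a total budget of m matrix-vector
    queries, split equally between sketching, exact projection and estimation."""
    rng = np.random.default_rng(rng)
    k = m // 3

    # 1) Range finder: sketch the top eigenspace of A with k queries.
    S = rng.choice([-1.0, 1.0], size=(n, k))                  # Rademacher sketch
    Y = np.column_stack([matvec(S[:, j]) for j in range(k)])
    Q, _ = np.linalg.qr(Y)                                    # Q roughly spans the top eigenspace

    # 2) Exact diagonal of the projected part, diag(Q Q^T A), using k further queries.
    B = np.column_stack([matvec(Q[:, j]) for j in range(k)])  # B = A Q (A symmetric)
    d_proj = np.sum(Q * B, axis=1)                            # diag(Q Q^T A)_i = sum_j Q_ij (A Q)_ij

    # 3) Rademacher diagonal estimate of the remainder A - Q Q^T A, with k queries.
    num = np.zeros(n)
    for _ in range(k):
        v = rng.choice([-1.0, 1.0], size=n)
        Av = matvec(v)
        num += v * (Av - Q @ (Q.T @ Av))                      # v .* ((A - Q Q^T A) v)
    d_rem = num / k                                           # Rademacher denominator is k

    return d_proj + d_rem
```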
9.3 Numerical Experiments
To investigate the performance of Diag++, we once again turn to synthetic matrices with power law spectra, $A = U\Lambda U^T$, with $U$ a random orthogonal matrix formed by orthogonalising a Gaussian matrix. Recall that with stochastic estimation alone, convergence was slow for such matrices with steep spectra. We expect Diag++ to perform more or less similarly to standard stochastic estimation if the spectrum of the matrix is flat, since (70) is pessimistic, whilst if the spectrum is steep, (70) is much more optimistic and faster convergence is to be expected. The algorithm is robust to the spectrum of $A$, making the “best of both worlds” from projection and estimation. If the spectrum is steep, then we shall be able to capture most of the diagonal of $A$ through projection, whilst if it is flat, the stochastic estimation of the diagonal of the remainder will buffer the effects of projection. This is as found in Figure 5. Not only is Diag++ robust to spectra decaying at different rates, but it also converges faster than simple stochastic estimation when the matrices in question have rapidly decaying eigenvalues. The algorithm even overtakes simple projection after a handful of vector queries in such cases. Thus, the answer to our original question at the start of this section is yes: we can improve stochastic estimation, particularly for what would otherwise be worst case scenarios.
[Figure 5: Diag++ compared with simple stochastic estimation and with projection alone, on synthetic matrices with spectra decaying at different rates.]
10 Conclusion
10.1 Summary
In setting out, we sought to gain insight into particular questions surrounding diagonal estimation:
1. How many query vectors are needed to bound the error of the stochastic diagonal estimator in (9), and how does this change with the distribution of vector entries?
2. How does matrix structure affect estimation, and how might we improve worst case scenarios?
In answer to 1, we have seen that elemental error probabilistically converges as , and scales as . We have seen that for the entire diagonal, similar results arise, with error going as , and scaling as . Additionally we have observed both theoretically and numerically, that if is small, error arising from the use of the Rademacher diagonal estimator is less than that arising from the Gaussian diagonal estimator. Conversely, if is large, there is an inconsequential difference between the two estimators.
In answer to 2, we have seen, as per our intuitions, that both the eigenvector basis and the eigenvalue spectrum influence the estimation of positive semi-definite matrices. Generally, the steeper the spectrum, the worse the stochastic diagonal estimator. However, we have introduced a new method of overcoming these worst case scenarios. We have analysed it to find favourable convergence properties, and demonstrated its superiority for estimating the diagonal of general dense matrices with steep spectra, whilst maintaining robustness to flatter spectra.
10.2 Future Work
There is ample scope to extend the work presented herein, both to applications, and further analysis.
1. We have seen that the error of a given diagonal element is dependent on the rest of its row. How can we ascertain the actual error of our estimates if we don’t know the norm of the row of our implicit matrix? Or if we don’t have an idea of its spectrum? Is the context of use of stochastic estimation a necessity? We note that we can make tracks on this question since we can estimate the relevant quantity using trace estimation, a free byproduct of diagonal estimation.
2. In our analysis we have perhaps wielded a hammer, rather than a scalpel, in order to arrive at the bounds for the full diagonal. In particular, union bounding introduces the potentially unfavourable factor of $n$. Perhaps future analysis could employ the dependence of element estimates in our favour to remove such a term?
3. Having noted that Yao et al. yao2020adahessian use only a single Rademacher estimator in their state-of-the-art second order machine learning optimiser, how much better will their method become with further estimation? Given the possibility of parallelised estimation, will this introduce a rapid new stochastic optimisation technique that outperforms current methods?
These are interesting and exciting open questions, providing the potential investigator with both theoretical and numerical directions in which to head. During this work, we also considered extending elemental diagonal analysis to shed new light on the original investigations into trace estimation. Whilst initial work suggested this may be fruitful, the dependence of the element estimates upon one another confounded further analysis. We readily invite suggestions or insights into how these ideas may be useful in approaching trace estimation.
References
- [1] R. P. Adams, J. Pennington, M. J. Johnson, J. Smith, Y. Ovadia, B. Patton, and J. Saunderson. Estimating the spectral density of large implicit matrices. arXiv:1802.03451, 2018.
- [2] T. Ando, R. A. Horn, and C. R. Johnson. The singular values of a Hadamard product: A basic inequality. Linear Multilinear Algebra, 21(4):345–365, 1987.
- [3] H. Avron and S. Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2):1–34, 2011.
- [4] R. A. Baston. On matrix estimation. Master’s thesis, University of Oxford, 2021.
- [5] C. Bekas, E. Kokiopoulou, and Y. Saad. An estimator for the diagonal of a matrix. Appl. Numer. Math., 57(11-12):1214–1229, 2007.
- [6] C. Boutsidis, P. Drineas, P. Kambadur, E.-M. Kontopoulou, and A. Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. Linear Algebra Appl., 533:95–117, 2017.
- [7] V. Braverman, R. Krauthgamer, A. Krishnan, and R. Sinoff. Schatten norms in matrix streams: Hello sparsity, goodbye dimension. In ICML 2020. PMLR, 2020.
- [8] J. Chen. How accurately should I compute implicit matrix-vector products when applying the Hutchinson trace estimator? SIAM J. Sci. Comp, 38(6):A3515–A3539, 2016.
- [9] D. Cohen-Steiner, W. Kong, C. Sohler, and G. Valiant. Approximating the spectrum of a graph. In Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining, pages 1263–1271, 2018.
- [10] A. Cortinovis and D. Kressner. On randomized trace estimates for indefinite matrices with an application to determinants. Found. Comput. Math., pages 1–29, 2021.
- [11] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1–25, 2011.
- [12] E. Di Napoli, E. Polizzi, and Y. Saad. Efficient estimation of eigenvalue counts in an interval. Numer. Lin. Alg. Appl., 23(4):674–692, 2016.
- [13] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
- [14] A. S. Gambhir, A. Stathopoulos, and K. Orginos. Deflation as a method of variance reduction for estimating the trace of a matrix inverse. SIAM J. Sci. Comp, 39(2):A532–A558, 2017.
- [15] S. Goedecker. Linear scaling electronic structure methods. Reviews of Modern Physics, 71(4):1085, 1999.
- [16] S. Goedecker and M. Teter. Tight-binding electronic-structure calculations and tight-binding molecular dynamics with localized orbitals. Physical Review B, 51(15):9455, 1995.
- [17] G. H. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
- [18] G. H. Golub and U. Von Matt. Generalized cross-validation for large-scale problems. J. Comput. Graph. Stat., 6(1):1–34, 1997.
- [19] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288, 2011.
- [20] I. Han, D. Malioutov, H. Avron, and J. Shin. Approximating spectral sums of large-scale matrices using stochastic Chebyshev approximations. SIAM J. Sci. Comp, 39(4):A1558–A1585, 2017.
- [21] I. Han, D. Malioutov, and J. Shin. Large-scale log-determinant computation through stochastic Chebyshev expansions. In ICML 2015, pages 908–917. PMLR, 2015.
- [22] M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun. Stat. - Simul. Comput., 18(3):1059–1076, 1989.
- [23] J. Li, A. Sidford, K. Tian, and H. Zhang. Well-conditioned methods for ill-conditioned systems: Linear regression with semi-random noise. arXiv:2008.01722, 2020.
- [24] L. Lin. Randomized estimation of spectral densities of large matrices made accurate. Numer. Math., 136(1):183–213, 2017.
- [25] L. Lin, Y. Saad, and C. Yang. Approximating spectral densities of large matrices. SIAM Rev., 58(1):34–65, 2016.
- [26] R. A. Meyer, C. Musco, C. Musco, and D. P. Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
- [27] C. Musco and C. Musco. Projection-cost-preserving sketches: Proof strategies and constructions. arXiv:2004.08434, 2020.
- [28] C. Musco, P. Netrapalli, A. Sidford, S. Ubaru, and D. P. Woodruff. Spectrum approximation beyond fast matrix multiplication: Algorithms and hardness. arXiv:1704.04163, 2017.
- [29] D. Persson, A. Cortinovis, and D. Kressner. Improved variants of the Hutch++ algorithm for trace estimation. arXiv:2109.10659, 2021.
- [30] F. Roosta-Khorasani and U. Ascher. Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187–1212, 2015.
- [31] A. Stathopoulos, J. Laeuchli, and K. Orginos. Hierarchical probing for estimating the trace of the matrix inverse on toroidal lattices. SIAM J. Sci. Comp, 35(5):S299–S322, 2013.
- [32] J. M. Tang and Y. Saad. Domain-decomposition-type methods for computing the diagonal of a matrix inverse. SIAM J. Sci. Comp, 33(5):2823–2847, 2011.
- [33] S. Ubaru and Y. Saad. Applications of trace estimation techniques. In International Conference on High Performance Computing in Science and Engineering, pages 19–33. Springer, 2017.
- [34] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- [35] Z. Yao and A. Gholami. With the authors: AdaHessian optimiser. https://youtu.be/S87ancnZ0MM?t=2000. Accessed: 24/08/2021.
- [36] Z. Yao, A. Gholami, S. Shen, M. Mustafa, K. Keutzer, and M. W. Mahoney. AdaHessian: An adaptive second order optimizer for machine learning. arXiv:2006.00719, 2020.