Vector-wise Joint Diagonalization of Almost Commuting Matrices
Abstract
This work aims to numerically construct exactly commuting matrices close to given almost commuting ones, which is equivalent to the joint approximate diagonalization problem. We first prove that almost commuting matrices generically have approximate common eigenvectors that are almost orthogonal to each other. Based on this key observation, we propose an efficient and robust vector-wise joint diagonalization (VJD) algorithm, which constructs the orthogonal similarity transform by sequentially finding these approximate common eigenvectors. In doing so, we consider sub-optimization problems over the unit sphere, for which we present a Riemannian approximate Newton method with a rigorous convergence analysis. We also discuss the numerical stability of the proposed VJD algorithm. Numerical experiments with applications in the independent component analysis are provided to reveal the relation with Huaxin Lin’s theorem and to demonstrate that our method compares favorably with the state-of-the-art Jacobi-type joint diagonalization algorithm.
keywords:
Almost commuting matrices, Huaxin Lin's theorem, Joint diagonalizability, Riemannian approximate Newton method, Vector-wise algorithm
MSC [2020]: 15A18, 15A27, 15A45, 15A60, 58C15, 65F15, 65F35
1 Introduction
It is a fundamental result in linear algebra that a set of Hermitian matrices commute if and only if they are jointly diagonalizable by a unitary matrix. A closely related and long-standing question [38, 60], dating back to the 1950s, was whether almost commuting matrices are necessarily near commuting pairs. It was answered affirmatively by Huaxin Lin [49] in 1995, who proved the following celebrated theorem: for any $\varepsilon > 0$, there exists $\delta > 0$ independent of the matrix dimension, such that if $A, B$ are Hermitian matrices satisfying $\|A\|, \|B\| \le 1$ and $\|AB - BA\| \le \delta$, then there exist commuting Hermitian matrices $A', B'$ such that $\|A - A'\| + \|B - B'\| \le \varepsilon$. A simplified proof of Lin's result was provided soon after by Friis and Rørdam [30], and an analogous result in the real symmetric case was recently given by Loring and Sørensen [51]. However, their arguments were abstract and did not show how to construct the matrices $A', B'$, whose construction would be very useful from the application point of view [51]. In 2008, Hastings [40] suggested a constructive approach to finding the nearby commuting matrices $A'$ and $B'$ and discussed the dependence of $\delta$ on $\varepsilon$ (see also [43] for extended discussions of Hastings' method). Later, Kachkovskiy and Safarov, in their seminal work [44], revisited this problem in the framework of operator algebra theory and gave the optimal dependence of $\varepsilon$ on $\delta$ by an argument different from the one in [40]. We state their result below for future reference. The interested reader is referred to [43, 44] for a more complete historical review of Lin's theorem and its development. A list of notations used throughout this work is given at the end of this section.
Theorem 1.1 ([44]).
For any two self-adjoint operators $A$ and $B$ on a Hilbert space $\mathcal{H}$, there exists a pair of commuting self-adjoint operators $A'$ and $B'$ such that
(1.1) $\|A - A'\| + \|B - B'\| \le C \, \|[A, B]\|^{1/2},$
where $[A, B] = AB - BA$ is the commutator and the constant $C$ is independent of $A$, $B$, and the dimension of $\mathcal{H}$.
The theory of almost commuting matrices and Lin's theorem find numerous applications in many fields, e.g., quantum mechanics [39, 41, 42, 57], computer graphics [13, 46], and signal processing [16, 25, 61]. In particular, Hastings used Lin's theorem to argue for the existence of localized Wannier functions for non-interacting fermions [39], which inspired a series of subsequent works [41, 42]. On the computational side, the well-known Joint Approximate Diagonalization (JADE) algorithm was developed by Cardoso and Souloumiac for blind source separation [16, 17]. This algorithm has also been used in computational quantum chemistry to generate Wannier functions [36]. Machine learning-related applications have recently been discussed in [33, 46]. While numerous attempts have been made to constructively prove Lin's theorem and explore its applications in various areas, a numerically feasible algorithm for finding the commuting pair $A', B'$ in Theorem 1.1 remains elusive.
In this work, our focus is on designing a fast algorithm for Theorem 1.1 in the almost commuting regime. Given matrices with , we consider the following equivalent optimization problem:
(OP) |
In view of practical applications [24, 33, 36], we will focus on real matrices, but all the results and discussions can be adapted to the complex case easily. It may be clear but worth emphasizing that any approximate solution to (OP) gives a potential pair of commuting matrices satisfying the estimate (1.1). The problem (OP) is generally very difficult to solve by the existing manifold optimization techniques, since the operator norm involved is closely related to Finsler geometry [9], which is a rather uncommon structure in the optimization community. Even if we replace the operator norm with the Frobenius one, (OP) is still a challenging problem, due to the non-convexity of the objective function and the Stiefel manifold constraint. One old but popular method is a Jacobi-type algorithm presented in [14] with ideas dating back to [20, 34]. This algorithm minimizes the off-diagonal objective function: for Hermitian matrices and ,
by a sequence of unitary transformations constructed from Givens rotation matrices:
where is a unitary matrix with the complex adjoint , measures the off-diagonal part of a square matrix , and parameters are the Jacobi angles satisfying . The optimal and are derived in [17, Theorem 1.1] in a closed-form expression. For the readers’ convenience, we briefly summarize the main procedures of the Jacobi algorithm [14] in the following pseudocode.
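As a complement to the pseudocode of Algorithm 1, the following is a minimal Python sketch of one sweep in the real symmetric case (the complex Hermitian case additionally carries the phase angle). The closed-form rotation angle follows a standard choice consistent with [17]; the function and variable names are ours, and the matrices are assumed to be floating-point NumPy arrays.

```python
import numpy as np

def jacobi_sweep(A, B, U):
    """One Jacobi sweep for the joint diagonalization of real symmetric A and B.
    A, B, and the accumulated orthogonal transform U are updated in place."""
    n = A.shape[0]
    for p in range(n - 1):
        for q in range(p + 1, n):
            # 2x2 data of the (p, q) plane collected from both matrices
            g = np.array([[A[p, p] - A[q, q], 2.0 * A[p, q]],
                          [B[p, p] - B[q, q], 2.0 * B[p, q]]])
            G = g.T @ g
            ton, toff = G[0, 0] - G[1, 1], G[0, 1] + G[1, 0]
            theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
            c, s = np.cos(theta), np.sin(theta)
            if abs(s) < 1e-14:
                continue
            R = np.array([[c, -s], [s, c]])          # Givens rotation in the (p, q) plane
            for M in (A, B):
                M[:, [p, q]] = M[:, [p, q]] @ R
                M[[p, q], :] = R.T @ M[[p, q], :]
            U[:, [p, q]] = U[:, [p, q]] @ R
    return A, B, U
```

Sweeps are repeated until all rotation angles fall below a prescribed tolerance.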
Algorithm 1 is known to have a quadratic asymptotic convergence rate [14] and can be efficiently implemented in a parallel framework as discussed in [8, 21]. Another advantage of the Jacobi algorithm is its numerical stability against rounding errors. It is worth noting that Theis [68] extended the above Jacobi algorithm to joint block diagonalization, but the output block decomposition may not be the finest one. For the finest block diagonalization, a class of algorithms based on the theory of matrix $*$-algebras has been proposed in [18] and developed in [52, 53, 54]. Other joint diagonalization techniques in various problem settings include a gradient-based algorithm for approximately diagonalizing almost commuting Laplacians [13], subspace fitting methods with non-orthogonal transformations [71, 75], and the fast approximate joint diagonalization free of trivial solutions [48], just to name a few. In particular, Riemannian optimization algorithms have recently been explored for the joint diagonalization problem with or without orthogonality constraints. Theis et al. [69] applied the Riemannian trust-region method to the joint diagonalization on the Stiefel manifold. Within the same framework, an efficient Riemannian Newton-type method was investigated in [62] by Sato. More recently, Bouchard et al. [11] proposed a unified Riemannian optimization framework for the approximate joint diagonalization on the general linear group, which generalizes previous algorithms [2, 4].
The joint matrix diagonalization problem is also closely related to the canonical polyadic decomposition (CPD). In [19], De Lathauwer derived a sufficient condition for the uniqueness of the CPD and showed that the decomposition can be obtained by a simultaneous diagonalization by congruence. Sørensen [64] proved that for a given tensor, the optimal low-rank CPD approximation exists with one of the matrix factors being column-wise orthonormal. In particular, numerical algorithms based on joint congruence diagonalization and alternating least squares are also explored in [64] for this constrained CPD. We also mention a recent work of Evert and De Lathauwer [23], which discussed the optimal CPD of a noisy low-rank tensor and connected it to a joint generalized eigenvalue problem.
In this work, we propose a practically efficient vector-wise algorithm for finding the approximate solution to the optimization problem (OP) in the case where , and present the associated error analysis. Numerical experiments demonstrate that our algorithm has comparable or even better performance than the state-of-the-art Jacobi-type method while significantly reducing the computational time. Moreover, we numerically compute the error between the given almost commuting matrices and the commuting ones constructed by our algorithm. It turns out that in the almost commuting regime, this error is of the order , and thus the constructed pair satisfies the upper bound (1.1) in Lin's theorem. Some applications in independent component analysis are also provided to justify the effectiveness of our method.
We now summarize the main idea and contributions of this work in detail. Our method is motivated by the simple fact that commuting matrices can be simultaneously diagonalized by an orthogonal matrix whose columns consist of common orthonormal eigenvectors of , which can be characterized as the global minimizers of the following optimization problem:
() |
Note that admits minimizers with and being mutually orthonormal. This observation leads to an ideal approach for diagonalizing commuting matrices . Suppose that we have successfully computed the first minimizers of . To determine the -th minimizer, we solve () but with an additional orthogonality constraint:
(1.2) |
However, this elegant approach may encounter practical challenges due to numerical noise and errors. Since we cannot solve the optimization problem (1.2) exactly at each step, the numerical errors accumulate, so the constraints may exclude the minimizers of the cost functional , and the algorithm then fails to find a complete orthonormal set of common eigenvectors. Moreover, the input commuting matrices are generally subject to numerical noise or rounding errors. As proved in Lemma 2.3, any such small perturbation could destroy all the common eigenvectors. Consequently, the functional might not admit sufficiently many local minimizers, in which case the algorithm will also fail.
Our first contribution is Theorem 2.13, which proves that in the almost commuting regime, generically has local minimizers with being almost orthogonal to each other (which may be viewed as approximate common eigenvectors). For this, we provide a stability result in Theorem 2.6 by using perturbation techniques. Then, based on the theory of Gaussian orthogonal ensemble (GOE) [5, 65], we show that the common eigenspaces of commuting matrices are generically of one dimension from both topological and probabilistic points of view (cf. Lemma 2.9 and Corollary 2.11), similarly to the well-known genericity of simple eigenvalues of self-adjoint operators [66, 67, 70]. Combining these results with the estimates of the spectral gaps for random matrices in [56], we can immediately conclude the desired claim in Theorem 2.13.
In view of this generic property of the local minimizers of , we consider a relaxed version of the aforementioned approach for finding the columns of the orthogonal matrix and the diagonal entries of matrices sequentially, called the vector-wise joint diagonalization (VJD); see Algorithm 2. The core idea behind VJD is to relax the exact orthogonality constraint in (1.2) to . This relaxation accommodates possible numerical errors and non-commutativity of input matrices. However, enforcing constraints of the form , , can be numerically challenging. Therefore, in practice, we adopt the equivalent constraint:
(1.3) |
where and is the error tolerance parameter. Notably, this form (1.3) allows for the straightforward computation of the associated projection, as shown in Lemma 2.4. The VJD algorithm proceeds as follows; see Section 3 for more details. We first solve () to find . Then, for , we consider the relaxed problem of (1.2):
() |
which gives . This iterative process culminates in the construction of an almost orthogonal matrix and two diagonal matrices: and . The algorithm ends up with a classical nearest orthogonal matrix problem (i.e., orthogonal Procrustes problem [63]):
(1.4) |
where is either the spectral norm or the Frobenius norm . In both cases, the problem (1.4) has an analytic solution , with being the singular value decomposition (SVD) of . For solving sub-optimization problems () and (), we derive a Riemannian approximate Newton-type method with the approximate Hessian matrix being structurally simple and positive semidefinite; see Algorithm 3. Our method exploits the second-order information of the cost function and enjoys stability with low computational cost. We would like to remark that there might be some other algorithms that are also potentially suitable for our optimization problems, for example, the trust-region method [1], the column-wise block coordinate descent method [31, 32], the Cayley transform-based optimization [72], and the consensus-based optimization method [29, 37, 45], but the detailed discussion about these approaches is beyond the scope of this work.
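For the nearest orthogonal matrix problem (1.4), the closed-form solution via the SVD mentioned above takes only a few lines; the sketch below uses our own variable names. The orthogonal polar factor is the minimizer in every unitarily invariant norm, which is why the spectral and Frobenius norms lead to the same answer.

```python
import numpy as np

def nearest_orthogonal(V):
    """Solve min_Q ||Q - V|| over orthogonal Q: the minimizer is the orthogonal
    polar factor W @ Zt of the SVD V = W @ diag(sigma) @ Zt."""
    W, _, Zt = np.linalg.svd(V)
    return W @ Zt
```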
Our third contribution lies in establishing the global convergence of the proposed Riemannian approximate Newton method for () and demonstrating the numerical stability of our VJD algorithm. By leveraging techniques from [10, 47, 74] for Newton-type methods on manifolds, we prove in Theorem 4.1 that Algorithm 3 for () converges linearly with the convergence rate and quadratically if . However, it is important to note that our optimization process, owing to its sequential nature, does not exactly correspond to solving (OP) when and are non-commuting matrices. Nevertheless, based on empirical observations from numerical experiments, our method provides a good approximation to the original problem. To further substantiate this numerical advantage, we theoretically establish the error-tolerant property of our VJD algorithm in Proposition 4.5. We remark that although our discussions primarily focus on a pair of almost commuting matrices, our algorithm and analysis can be directly extended to a family of pairwise almost commuting matrices.
This work is organized as follows. Section 2 delves into the stability and generic behaviors of local minimizers of , serving as motivation for the VJD algorithm outlined in Section 3, which relies on an efficient approximate Newton method for sub-optimization problems. Additionally, Section 4 is dedicated to the examination of the convergence properties and numerical stability of the presented algorithms. In Section 5, various numerical examples are presented to justify the efficiency of our method.
Notations.
•
Let be the set of symmetric matrices, and be the orthogonal group in dimension . We denote by the space of diagonal matrices, and by the closed subset of with increasing diagonal entries . By abuse of notation, we also write for the diagonal matrix with diagonal entries . We denote (resp., ) for if is positive semidefinite (resp., positive definite).
•
For a matrix , denotes its element. Moreover, we denote by the Frobenius norm, and by (or, simply, ) the operator norm.
•
Let be the Euclidean inner product of and be the associated norm. The standard basis of is given by , where denotes the vector with in the -th coordinate and elsewhere. We will not distinguish the row vector and the column one unless otherwise specified. We define the -dimensional unit sphere by
•
We denote by the spectrum of a matrix and by its spectral gap:
We say that an eigenvalue of is simple (resp., degenerate) if its multiplicity is equal to (resp., greater than) one.
•
We recall the commutator of two matrices and define the set of commuting matrices by
Then, we say that is an eigenvalue pair of if the associated common eigenspace is non-empty:
(1.5) With slight abuse of notation, we define the joint spectrum of by the set of all its eigenvalue pairs and the eigenvalue gap by
(1.6) We also denote by the convex hull of . The multiplicity of is defined by the dimension of the subspace , and is said to be simple (resp., degenerate) if is equal to (resp., greater than) one.
2 Analysis of local minimizers of
This section is devoted to the analysis of the local minimizers of the functional in (). Note that if is a local minimum of , so is . It is convenient to consider a point as an equivalence class in what follows. We shall first derive the optimality conditions of () and characterize the minimizers of in the commuting case. Then, we establish some stability estimates on the minimizers by perturbation arguments. We also show that with high probability , the functional for almost commuting random matrices has local minima on the manifold with minimizing vectors almost orthogonal to each other. These findings shed light on designing the column-wise algorithm for the optimization problem (OP), which we will discuss in detail in the next section.
Recall that in this work, we consider the almost commuting symmetric matrices with . By Theorem 1.1, without loss of generality, we assume that the matrices can be written as
(2.1) |
for some commuting matrices with
(2.2) |
Note that in the case of (i.e., ), we can always find local (also global) minima of (with minimum value zero) given by the eigenvalue pairs of and the associated common orthonormal eigenvectors. Indeed, the converse is also true; see Proposition 2.2 below.
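For reference, in the exactly commuting case such a common orthonormal eigenbasis can also be computed directly, without any optimization: diagonalize one matrix and then rotate within each of its eigenspaces to diagonalize the other. A minimal sketch (with a hypothetical tolerance tol for grouping numerically equal eigenvalues) is:

```python
import numpy as np

def diagonalize_commuting_pair(A, B, tol=1e-10):
    """Common eigenbasis of exactly commuting symmetric A and B: diagonalize A,
    then diagonalize B restricted to each eigenspace of A."""
    lam, Q = np.linalg.eigh(A)                     # eigenvalues of A in ascending order
    n, i = len(lam), 0
    while i < n:
        j = i + 1
        while j < n and lam[j] - lam[i] <= tol:    # group numerically equal eigenvalues
            j += 1
        if j - i > 1:                              # degenerate eigenspace: rotate to diagonalize B there
            block = Q[:, i:j]
            _, R = np.linalg.eigh(block.T @ B @ block)
            Q[:, i:j] = block @ R
        i = j
    return Q, np.diag(Q.T @ A @ Q), np.diag(Q.T @ B @ Q)
```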
We start with the optimality conditions for (). For ease of exposition, we recall some basic concepts on Riemannian optimization on a smooth manifold embedded in a Euclidean space [3, 12]. We denote by the tangent space at and by the orthogonal projection from to . Let be a smooth function on , which admits a smooth extension to a neighborhood of in . In what follows, we will denote by and the standard Euclidean gradient and Hessian, respectively. Then, the Riemannian gradient of is given by
(2.3) |
see [12, Proposition 3.53]. Letting be any smooth extension of , the Riemannian Hessian of , as a linear map on , can be computed by [12, Corollary 5.14]
(2.4) |
In the special case of , we have the tangent space at with the associated projection given by . Thanks to (2.3) and (2.4) above, the Riemannian gradient and Hessian of on can be explicitly computed as
(2.5) |
We also recall the notion of retraction on [12, Definition 3.41]. In this work, we limit our discussion to the following one for its simplicity (another popular choice is the exponential map):
(2.6) |
It is worth noting that in (2.6) is a second-order retraction on , i.e., for any and , the curve satisfies , , and ; see [12, Definition 5.41].
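For concreteness, the projection, retraction, and Riemannian gradient on the sphere described above translate directly into code; the sketch below assumes that the retraction in (2.6) is the metric projection of the perturbed point back onto the sphere and that the Euclidean gradient of the extended cost is available. The function names are ours.

```python
import numpy as np

def tangent_project(v, xi):
    """Orthogonal projection of xi onto the tangent space {z : z . v = 0} at unit v."""
    return xi - (v @ xi) * v

def retract(v, xi):
    """Retraction R_v(xi) = (v + xi) / ||v + xi|| on the unit sphere."""
    w = v + xi
    return w / np.linalg.norm(w)

def riemannian_grad(v, egrad):
    """Riemannian gradient: tangential part of the Euclidean gradient, cf. (2.3)."""
    return tangent_project(v, egrad)
```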
We now compute the Riemannian gradient of the functional on the manifold by (2.3). Note that is well-defined on . For and , we have
(2.7) |
where denotes the Riemannian gradient of in . Similarly, by (2.4), we calculate the Riemannian Hessian of as follows:
(2.8) |
where the matrices and are given by
(2.9) |
and
(2.10) |
Here, is the Euclidean Hessian of in :
(2.11) |
We define the stationary points of () by . Then, if is a local minimizer of (), it is a stationary point and there holds on the tangent space . Conversely, if is a stationary point such that on , then it is an isolated local minimizer of (). We next give a characterization for the stationary points of () for commuting matrices. We denote by the -simplex:
Proposition 2.1.
Proof.
From Proposition 2.1 above, we see that for , the stationary points of () are essentially characterized by the set in (2.12), which always includes the nodes and their averages , while the full analytical characterization of would rely on the distribution of the eigenvalue pairs of . The following result shows that in the commuting case, the local minimizers of () are exactly given by eigenvalue pairs and the common eigenvectors.
Proposition 2.2.
Proof.
It suffices to prove the direction . Since is a local minimizer, we have and on . It follows that the Schur complement of in is also positive semidefinite, that is,
(2.16) |
Suppose that is not an eigenvalue pair and the common eigenvector of . From (2.13), is a convex combination of some non-identical with for and with (otherwise, for some and is the common eigenvector). Moreover, there holds for and some constant . Then, for any with , we can compute, by (2.10) and (2.16),
(2.17) |
Noting that are not identical, either or is not collinear with . Thus, we can choose such that , which contradicts (2) and completes the proof. ∎
However, the following observation shows that any small perturbation of commuting symmetric matrices could eliminate all common eigenvectors.
Lemma 2.3.
The set is an open dense subset of the space .
Proof.
Let and be the distinct eigenvalues of and with associated eigenspaces and , respectively. We define and , and see that both and are closed subsets of , and has no common eigenvectors if and only if the set is empty. It follows that there exist open neighbourhoods of and in with empty intersection. Then, by the perturbation theory, we can conclude that the set is open in . We next show that is also dense. It suffices to prove that for having common eigenvectors and for small enough , we can find , with , such that , have no common eigenvectors. For this, by perturbing the eigenvalues of and , without loss of generality, we assume that both and have only simple eigenvalues. We let and next construct . Consider the spectral decompositions for and : and for . Then, for a given , we choose an orthogonal matrix satisfying for any and , and define the matrix by , where and . When is small enough, the condition implies for all . Thus, and have no common eigenvectors, and the proof is complete. ∎
Although an arbitrary pair of symmetric matrices may not have any common eigenvectors, the functional can still admit many local minima. Indeed, recalling the decomposition (2.1) for almost commuting matrices , let be a common eigenspace (1.5) of associated with some eigenvalue pair :
(2.18) |
We shall prove in Theorem 2.6 below that when , there always exists a local minimizer to in some neighborhood of in , where . For convenience, we denote by and the orthogonal projections to the subspace and its orthogonal complement , respectively. The following lemma is useful in the sequel.
Lemma 2.4.
Let the set and the projections and be defined as above. For , it holds that
(2.19) |
where the minimum value is attained at . Similarly, let be the -neighborhood of the set : . Then, there holds, for ,
(2.20) |
where is the unique minimizer to given by
(2.21) |
Proof.
The first claim (2.19) directly follows from the definition, so we only show the second one (2.20). For this, we note from (2.19) that is equivalent to . Hence the statements (2.20) and (2.21) clearly hold for . Then we consider the case where , which gives . It is easy to see
(2.22) |
By elementary analysis, we have that the function is monotone increasing in for . Therefore, it follows from (2.21) and (2.22) that
The proof is complete. ∎
Remark 2.5.
One can also define the distance function for , by using the geodesic distance on , which has been investigated in [26, 27]. It is easy to see that if , which implies that and are equivalent, and in (2.21) is also the unique minimizer to . However, the set of interest is not geodesically convex by [27, Proposition 2]. Therefore, the results in [26, 27] do not directly apply to our case and we choose to focus on the function in (2.20) in this work.
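To make the projection in Lemma 2.4 concrete, consider the representative case in which the relaxed constraint (1.3) takes the spherical-cap form $\|P v\| \le c$, where $P$ projects onto the span of the previously computed vectors; this specific form and the threshold $c$ are our assumptions for illustration. The closest feasible unit vector then rescales the two orthogonal components of $v$, in line with the structure of (2.21):

```python
import numpy as np

def project_to_cap(v, W, c):
    """Closest point to the unit vector v in {w on the sphere : ||P_W w|| <= c},
    where the columns of W form an orthonormal basis of the subspace to avoid.
    Assumes v is not entirely contained in span(W)."""
    pv = W @ (W.T @ v)                  # component of v in span(W)
    qv = v - pv                         # component in the orthogonal complement
    if np.linalg.norm(pv) <= c:         # constraint already satisfied
        return v
    return c * pv / np.linalg.norm(pv) + np.sqrt(1.0 - c ** 2) * qv / np.linalg.norm(qv)
```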
Theorem 2.6.
Suppose that are almost commuting matrices with decomposition (2.1), and denote by the eigenvalue gap (1.6) for commuting matrices in (2.1). Also, let the eigenvalue pair with the space and the set be defined as before Lemma 2.4. We assume
and define constants
(2.23) |
Then, it holds that () admits a local minimizer in the domain:
(2.24) |
Proof.
By Lemma 2.4, we write, for ,
(2.25) |
and then have
(2.26) |
where , , and are common orthonormal eigenvectors of spanning . Note that for any and , the set is compact, which implies that admits a minimizer. We shall prove that for small enough , there exists and , depending on and satisfying as , such that the minimizer of is in the open set and hence also a local minimizer of (). For this, we define a constant
(2.27) |
and, without loss of generality, assume . Indeed, if holds with a minimizer , then we readily have , and the point is the desired local minimizer of () in .
We next construct constants and such that
(2.28) |
which, by (2), yields . It follows that and hence completes the proof. We first have with
A simple estimate implies
(2.29) |
We now estimate on . By (2.25) and (2.26), there holds for , which gives
where we have also used . If we choose and satisfying
(2.30) |
it follows from (2.29) that
(2.31) |
We next estimate on . Let be the eigenvalue pair of associated with the common eigenvector . Similarly by (2.25), we can derive
(2.32) |
thanks to the elementary inequality for . For and small enough , it is clear that and
Then, it follows from (2) that
If we choose and such that
(2.33) |
then recalling (2.29), we have
(2.34) |
One can check that in the regime , the choice (2.23) of parameters satisfies the conditions (2.30) and (2.33). Then, the claim (2.28) follows from (2.31) and (2.34) as desired. ∎
Remark 2.7.
The prefactors in Theorem 2.6 (e.g., and ) are chosen for convenience and can be improved, but the orders in and are optimal by the standard perturbation theory.
Remark 2.8.
In the extreme case where is a singleton, equivalently, for some , we have and can be set as by convention (note that (1.6) is not well-defined in this case). Then, Theorem 2.6 reduces to the trivial fact that () has a local minimum with near the point . Moreover, we can see that in this case, () reduces to for , which is hard to analyze due to its generality (except , in which case it is essentially a one-dimensional optimization problem; see also Lemma 3.2 below). It is possible to carry out a landscape analysis as in [76], but one may have to consider special and such as rank-one and diagonal matrices.
We next discuss generic properties of local minimizers of () for almost commuting matrices. Given , let () be its distinct eigenvalue pairs and be the associated common eigenspaces (1.5). Theorem 2.6 guarantees that () for the perturbed pair with admits a local minimizer in the neighborhood of each closed set , where . Thus, if holds, equivalently, all of the common eigenspaces are one-dimensional, we readily have that has at least local minima, and the associated minimizing vectors are almost orthogonal to each other in the sense that for . We shall see that generically, it is indeed the case. Note that the set of commuting matrices is a complete metric space.
Lemma 2.9.
The subset is open dense in .
Proof.
To show that is dense in , it suffices to note that for , we can always perturb the eigenvalues of such that the perturbed matrix has only simple eigenvalues, which implies , while the claim that is open in follows from the standard perturbation result (cf. Lemma 4.3). ∎
The above lemma is an analog of Lemma 2.3, which gives the genericity of in the topological sense. It would also be interesting to interpret this property from a probabilistic perspective. We next define a probability measure on the set of commuting matrices and show that the event happens almost surely. Note that can be parameterized as
Our probabilistic model is largely motivated by the GOE in random matrix theory. We recall that is a GOE if the diagonal elements for are i.i.d. Gaussian random variables and the off-diagonal elements are i.i.d. Gaussian random variables independent of . We let be uniformly distributed on and , , be random diagonal matrices with diagonal entries independently sampled from the following probability density on the set :
where is the normalization constant. Equivalently, we define the product probability measure:
on , where denotes the Haar measure on the orthogonal group . Clearly, the map is smooth and thus the pushforward measure on is well-defined. In what follows, we limit our discussion to the case where is equipped with the probability . One may have noticed that is precisely the joint distribution of eigenvalues of the GOE [5]. Thus, the pair has the same marginal distribution as the GOE, i.e., , . To proceed, we recall a basic result that may be due to Wigner and von Neumann [55]; see also [65, 67].
Lemma 2.10.
The set of symmetric matrices with degenerate eigenvalues is a surface of codimension two in the -dimensional real vector space . Thus, given an absolutely continuous probability measure on with respect to the Lebesgue measure, a matrix has all simple eigenvalues almost surely.
Corollary 2.11.
For a pair of random commuting matrices drawn from the probability , almost surely we have that all its common eigenspaces are one-dimensional.
In order to combine these observations with Theorem 2.6, we recall from [56, Corollary 2.2] a tail bound of the spectral gap for Wigner matrices (i.e., off-diagonal entries are i.i.d. sub-Gaussian random variables with mean and variance and diagonal entries are independent sub-Gaussians with mean and variances bounded by ), which include GOE as a subclass.
Lemma 2.12.
For real symmetric Wigner matrices with sub-Gaussian entries, there holds
with probability , where are eigenvalues of (counting multiplicity) in an increasing order.
Applying Lemma 2.12 to GOE, by Theorem 2.6, we can conclude the following result, which shows the desired claim that generically has almost orthogonal local minimizing vectors .
Theorem 2.13.
Let be almost commuting matrices with decomposition (2.1), where is sampled from . Then, for , with probability , the function admits local minima on with associated minimizing vectors satisfying
3 Vector-wise joint diagonalization algorithm
In this section, we will provide a detailed description of the VJD algorithm for solving (OP). Following the ideas presented in the introduction, we first summarize its key steps in Algorithm 2 below.
Remark 3.1.
The rest of this section is devoted to designing algorithms for the optimization problems () and (). Note that the cost functional is quadratic and convex with respect to both variables and , which motivates us to consider the alternating descent framework, namely, successively updating the variables and to minimize the cost . In particular, we have
(3.1) |
and there holds, for any and ,
(3.2) |
with
(3.3) |
which is a fourth-order polynomial in . Thus, we can update explicitly via (3.1). To update the vector , we consider Riemannian Newton’s method with a line search, which takes advantage of the Hessian information of and enables a fast convergence. More precisely, in the -th step, given and , we compute the search direction by
(3.4) |
and then update and , where the stepsize is determined by a backtracking line search (cf. (3.15) below). However, per our numerical experiments, such an alternating minimization may converge slowly and become less efficient when the matrix size increases. We also note that may have negative eigenvalues so that in (3.4) is not necessarily a descent direction of , which leads to the potential instability of the algorithm. Our practically efficient algorithm is motivated by the following lemmas.
Lemma 3.2.
Proof.
The Riemannian gradient (3.5) and Hessian (3.6) of follow from a direct computation by (2.5). For the first claim, let be a local minimum of . Then, is a local minimum of the quadratic function , which gives . By (3.2), we have, for near ,
(3.7) |
i.e., is a local minimum of . The other direction can be proved similarly. ∎
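To fix ideas, the sketch below assumes that the cost takes the natural least-squares form $f(v, \lambda, \mu) = \|Av - \lambda v\|^2 + \|Bv - \mu v\|^2$ on the unit sphere; this explicit form, and all names below, are our reading of the formulas above rather than a verbatim transcription. Under this assumption, the update (3.1) is the Rayleigh-quotient rule $\lambda = v^\top A v$, $\mu = v^\top B v$, and eliminating $(\lambda, \mu)$ leaves a fourth-order polynomial in $v$, consistent with (3.3).

```python
import numpy as np

def lam_mu_update(A, B, v):
    """Closed-form minimizers of f(v, ., .) for a fixed unit vector v."""
    return v @ A @ v, v @ B @ v

def reduced_cost(A, B, v):
    """g(v) = min over (lam, mu) of f(v, lam, mu), a quartic polynomial in v."""
    lam, mu = lam_mu_update(A, B, v)
    return np.sum((A @ v - lam * v) ** 2) + np.sum((B @ v - mu * v) ** 2)

def euclidean_grad(A, B, v):
    """Euclidean gradient of f in v with (lam, mu) held at their optimal values;
    its tangential part is the Riemannian gradient of g on the sphere."""
    lam, mu = lam_mu_update(A, B, v)
    rA, rB = A @ v - lam * v, B @ v - mu * v
    return 2.0 * (A @ rA - lam * rA) + 2.0 * (B @ rB - mu * rB)
```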
Thanks to Lemma 3.2, optimizing the function is equivalent to optimizing the one , since they have exactly the same local minima in the variable . Noting from (3.5) that , we will show that in the almost commuting regime , the alternating descent algorithm given above can be regarded as a Riemannian approximate Newton method for solving
(3.8) |
namely, a special case of Algorithm 3 below with . In this work, the approximate Newton method refers to a Newton-type algorithm that employs structurally simple approximate Hessian matrices with easily computed inverses.
Lemma 3.3.
Given a pair of almost commuting matrices with decomposition (2.1), let be a local minimum of the associated on . Then, for both and with , it holds that
(3.9) |
for near and small enough. In particular, if , we have
(3.10) |
Proof.
It suffices to consider , since the other one can be similarly analyzed. By formulas (2.7) and (3.5), the optimality condition gives
(3.11) |
which yields, by (2.10),
(3.12) |
For the desired estimate (3.9), we first observe
by the Lipschitz continuity of and . Then, recalling from Proposition 2.2 and Lemma 3.2 that the local minima of for commuting matrices are characterized by their common eigenvectors, we have and for , by formulas (2.9), (3.6), and (3.12). It follows from the stability estimate in Theorem 2.6 that . Thus, (3.9) and also (3.10) hold. ∎
In what follows, we say that a symmetric matrix is an approximate Riemannian Hessian of if the property (3.9) holds as and . In addition to with from the alternating minimization, Lemma 3.3 also suggests a new choice for approximating with the favorable positive semidefinite property. It is easy to see that its restriction on : also satisfies the estimate (3.9) and hence can be a good candidate for the approximate Hessian as well.
We are now ready to give our approximate Newton method (Algorithm 3) for solving () and (). We note that the only difference between () and () is the inequality constraint , which can be easily dealt with by a projection step, thanks to Lemma 2.4. One may also observe that for , there holds by definition and then () reduces to (). Thus, it suffices to state the algorithm for (). For clarity, we recall that given (),
(3.13) |
and for some . If , then . We denote by the projection to the minimizer of given in (2.21) and also recall the retraction in (2.6).
(3.14) |
(3.15) |
We end this section with several remarks. First, by Theorem 2.13, the parameter in Algorithm 3 can be chosen of the order or larger. Second, by the above discussion, it is clear that Algorithm 3 with gives the alternating minimization for () and (), while for , Algorithm 3 is precisely the standard Riemannian Newton method. Moreover, combining the abstract VJD scheme in Algorithm 2 with Algorithm 3 gives the algorithm used in practice.
Finally, for the step (3.14), if the approximate Hessian maps to , we can simply take the Moore–Penrose generalized inverse of to find . But if does not have as an invariant subspace, say, , then the equation (3.14) may not admit a solution in . In this case, we consider the least-squares solution to (3.14). See Section 5.1 for more computational details, where we numerically test the aforementioned candidates of the approximate Hessian .
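Putting the pieces together, one iteration of Algorithm 3 without the inequality constraint can be sketched as follows. We again assume the least-squares form of the cost from the sketch after Lemma 3.2, take the positive semidefinite candidate $2(A - \lambda I)^2 + 2(B - \mu I)^2$ as the approximate Hessian, use the Moore–Penrose pseudoinverse on the tangent space for (3.14), and use an Armijo-type backtracking rule as a stand-in for (3.15); the parameter values are illustrative.

```python
import numpy as np

def approx_newton_step(A, B, v, beta=0.5, sigma=1e-4):
    """One approximate Newton iteration on the unit sphere (no inequality constraint)."""
    n = len(v)
    lam, mu = v @ A @ v, v @ B @ v
    P = np.eye(n) - np.outer(v, v)                  # projection onto the tangent space at v
    rA, rB = A @ v - lam * v, B @ v - mu * v
    grad = P @ (2.0 * (A @ rA - lam * rA) + 2.0 * (B @ rB - mu * rB))
    M, N = A - lam * np.eye(n), B - mu * np.eye(n)
    H = 2.0 * M @ M + 2.0 * N @ N                   # positive semidefinite approximate Hessian
    xi = -np.linalg.pinv(P @ H @ P) @ grad          # search direction in the tangent space

    def g(w):                                       # reduced cost at a unit vector w
        return np.sum((A @ w - (w @ A @ w) * w) ** 2) + np.sum((B @ w - (w @ B @ w) * w) ** 2)

    def retract(t):                                 # step of size t followed by normalization
        w = v + t * xi
        return w / np.linalg.norm(w)

    t = 1.0
    while t > 1e-12 and g(retract(t)) > g(v) + sigma * t * (grad @ xi):
        t *= beta                                   # backtracking line search
    return retract(t)
```

For the constrained problem, the returned vector would additionally be projected onto the feasible set of (1.3), as in the cap projection sketched after Remark 2.5.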
4 Error analysis
This section is devoted to the error analysis for our VJD algorithm. In Section 4.1, we establish the global convergence of Algorithm 3 for () (i.e., without the inequality constraint), while in Section 4.2, we discuss the numerical stability of Algorithm 2 with respect to noises and numerical errors.
4.1 Convergence of Riemannian approximate Newton method
Before stating our main theorem, we recall some preliminaries. First, by [59, Lemma 6], there exist , , and such that for any and with ,
(4.1) |
where is the geodesic distance on . We define the following constant for later use:
(4.2) |
We denote by the parallel transport of tangent vectors at to the tangent vectors at along the geodesic curve connecting points and ; see [12] for the precise definition of . We also recall the following expansion for the Riemannian gradient of a smooth function on [28, 47]: for near ,
(4.3) |
Theorem 4.1.
Let be an positive semidefinite approximate Hessian of satisfying (3.9). Suppose that is an accumulation point of the iterative sequence generated by Algorithm 3 for (); and that for all and are non-singular operators on the tangent space. Then, is a stationary point of (), and it holds that for , the sequence converges to linearly with the convergence rate:
In the commuting case , we have that converges to quadratically:
Proof.
In this proof, we always regard as a linear operator on the tangent space . By abuse of notation, we define the norm for any by . Since is non-singular and is continuous, there exists a neighborhood of such that for is positive definite on and satisfies
(4.4) |
which implies that we can find a constant such that on . It follows that
(4.5) |
Then by (4.5) and the line search condition (3.15), we see that the stepsize is well defined.
Step 1. We first prove that the accumulation point is also a stationary point, i.e., . Again by (3.15), we have
(4.6) |
which means that is strictly decreasing. By (4.6), if holds, then we find
and as , which readily gives . If not, i.e., holds, given any , we can take a subsequence such that for large enough. Hence, does not satisfy the line search condition (3.15), that is,
(4.7) |
Without loss of generality, we assume that there holds and . Taking the limit in (4.7) gives
Then it is easy to see
Noting in Algorithm 3, we obtain and thus .
Step 2. We next show that there exists a neighborhood of such that if , then . Recalling that is a second-order retraction, we have the following Taylor expansion [12, Proposition 5.43]:
(4.8) |
for with small enough. Then by (4.4), for near , it holds that
which, along with (4.8), yields
(4.9) |
where we have also used the assumption that (3.9) holds. The above estimate (4.1) allows us to find a neighborhood of such that for any , there holds
By the above estimate and the line search condition (3.15), we conclude our claim.
Step 3. We finally consider the local convergence rate with stepsize one. For this, we claim that there exists a neighborhood of such that holds for any , where . Let
By a direct computation with the relation [12], we have
which gives the estimate
(4.10) |
By the Lipschitz property of and , we estimate, for near ,
(4.11) |
Without loss of generality, we let be small enough such that (4.1) holds with constant defined in (4.2). Then it is easy to see from (4.10) and (4.11) that for some constant ,
(4.12) |
where we used by (4.3) and , and the last estimate is from (2.2) and (3.9). When is small enough, we obtain
(4.13) |
In particular, if , it holds that by Lemma 3.3. In this case, similarly, we have
(4.14) |
We now have all the ingredients to finish the proof. Since is the accumulation point, there exists such that . Then by Step 2 and Step 3, a simple induction argument yields and for . In view of estimates (4.12) and (4.14), we readily have the global convergence of and its convergence rate. ∎
Remark 4.2.
Empirically, the projected approximate Newton method in Algorithm 3 exhibits a similar convergence rate when applied to () with inequality constraints as it does for (); see Section 5.1 for the numerical results. Nevertheless, to the best of the authors' knowledge, there exists very little research on the convergence of projection-type Riemannian optimization algorithms, and we choose to postpone the convergence analysis of Algorithm 3 for () to future investigations. We also refer interested readers to [7, 50, 73] for recent advancements in Riemannian optimization with constraints.
4.2 Numerical stability
We next show that for commuting matrices , our VJD algorithm can produce a reliable simultaneous approximate diagonalization. Note that when , for suitable in (1.3), each sub-optimization problem () or () in Algorithm 2 admits a minimizer satisfying . However, due to the presence of noise and iteration errors, the input matrices may not exactly commute and the sub-optimization problems cannot be solved exactly. In this scenario, we assume that the sub-optimization problems () and () are solved by Algorithm 3 with outputs satisfying
(4.15) |
for some precision parameter , and prove that our Algorithm 2 can output matrices and with a controlled error; see Proposition 4.5. We remark that the justification of the assumption (4.15) relies on the convergence of Algorithm 3 for () and hence is left open. The proof is based on the following two lemmas: a standard perturbation result for eigenvalue problems [58, p.58] and a backward error stability result for the common eigenvector problem [22, Theorem 4].
Lemma 4.3.
Suppose that has eigenvalues with the associated orthonormal eigenvectors . For any with small enough, let be the eigenvalues of . Then, the perturbed matrix has orthonormal eigenvectors associated with , which satisfy
Lemma 4.4.
Suppose that is a common eigenvector of matrices with and . If is an approximation of , then there exist and such that
and
We now state and prove the main result of this section.
Proposition 4.5.
Proof.
We first estimate the error for the nearest orthogonal matrix problem (1.4) involved in Algorithm 2. From Algorithm 3, it is clear that , and then the assumption (4.15) implies
(4.16) |
By viewing as an approximation to the common eigenvector and applying Lemma 4.4, there exist small perturbations such that is an eigenvector of both and and there holds
(4.17) |
Therefore, by (4.17) and Lemma 4.3, we obtain
and hence
(4.18) |
Letting , it readily follows from (4.18) that and , which implies
(4.19) |
Again by Lemma 4.3, the estimate (4.19) above gives that all the eigenvalues of are perturbations of , that is, all the diagonal entries of in the SVD of are of the form . This allows us to conclude
5 Numerical experiments
In this section, we present extensive numerical experiments to verify the efficiency and robustness of our proposed algorithms in Section 3. In Section 5.1, we test the performance of the approximate Newton method (cf. Algorithm 3) with the exact Riemannian Hessian and various approximate Hessian matrices. Then in Section 5.2, we compare the Jacobi algorithm and the VJD one for almost commuting matrices in detail, and we also investigate, both numerically and theoretically, the relations between the error cost function and the magnitude of the commutator. In Section 5.3, we apply our algorithm to the independent component analysis (ICA) problem. All the experiments presented below are conducted using Python on a personal laptop with 16 GB of RAM and an 8-core 2.6 GHz CPU. The implementation of the JADE algorithm is based on the Python package [6]. The synthetic almost commuting matrices in Sections 5.1 and 5.2 are generated by , where the commuting pair is sampled from the distribution introduced in Section 2 and is the additive Gaussian noise with noise level , namely, and with are i.i.d. Gaussian .
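The data generation just described can be reproduced, up to the exact normalization of the GOE, by the following sketch: the commuting pair shares a Haar-distributed eigenbasis and carries independently sampled GOE-type eigenvalues, and the perturbation is a symmetric matrix with i.i.d. Gaussian entries scaled by the noise level delta. The normalization constants below are our own choices.

```python
import numpy as np
rng = np.random.default_rng(0)

def goe_eigenvalues(n):
    """Eigenvalues of a GOE-type random symmetric matrix (one common normalization)."""
    G = rng.standard_normal((n, n))
    return np.linalg.eigvalsh((G + G.T) / np.sqrt(2.0))

def sym_noise(n):
    """Symmetric matrix with i.i.d. standard Gaussian entries on and above the diagonal."""
    E = rng.standard_normal((n, n))
    return np.triu(E) + np.triu(E, 1).T

def almost_commuting_pair(n, delta):
    """Commuting pair sharing a Haar-distributed eigenbasis, perturbed at noise level delta."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    Q = Q * np.sign(np.diag(R))                     # Haar-distributed orthogonal matrix
    A_star = Q @ np.diag(goe_eigenvalues(n)) @ Q.T
    B_star = Q @ np.diag(goe_eigenvalues(n)) @ Q.T
    return A_star + delta * sym_noise(n), B_star + delta * sym_noise(n)
```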
5.1 Comparison between different approximate Hessians
We explore different approximate Hessian matrices within Algorithm 3 and compare their performance for solving () and () in terms of computing time. As discussed after Algorithm 3, these approximate Hessian matrices can be broadly categorized into two families: those that have the tangent space as an invariant subspace and those that do not. For the first family, the search direction in (3.14) can be readily computed by the pseudoinverse:
(5.1) |
while for the second family, we consider the associated least-squares problem:
(5.2) |
It is well-known [35] that the above problem (5.2) admits a minimizer:
(5.3) |
Here we consider the following four approximate Hessian matrices (which have been justified in Lemma 3.3 and the discussion afterward): the Euclidean Hessian of in : , the Riemannian Hessian of in (2.10): , the projected Euclidean Hessian of : , the exact Riemannian Hessian of : .
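In code, the two ways of producing a search direction from an approximate Hessian H and a Riemannian gradient differ only in where the tangent-space restriction enters; the sketch below is one natural reading of (5.1) and of the least-squares scheme (5.2)-(5.3), with our own names.

```python
import numpy as np

def direction_pinv(H, grad, v):
    """(5.1)-type direction: pseudoinverse of H restricted to the tangent space at v."""
    P = np.eye(len(v)) - np.outer(v, v)
    return -np.linalg.pinv(P @ H @ P) @ grad

def direction_lstsq(H, grad, v):
    """Least-squares direction: minimize ||H xi + grad|| over tangent vectors xi;
    pinv(H P) returns the minimum-norm solution, which is automatically tangent."""
    P = np.eye(len(v)) - np.outer(v, v)
    return -np.linalg.pinv(H @ P) @ grad
```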
We test Algorithm 3 with the above four candidates of the approximate Hessian for solving () and () with and let be updated accordingly either by (5.1) or (5.3), for three randomly generated pairs of almost commuting matrices of dimension with of orders , , and . The numerical results are presented in Figure 1, which clearly shows that all the candidates achieve similar levels of accuracy within nearly the same number of iterations. However, compared to the alternating minimization and the Riemannian Newton method (i.e., Algorithm 3 with approximate Hessian and , respectively), Algorithm 3 with or significantly reduces the computational cost, with the running time being roughly one fifth of that of the standard Riemannian Newton method. Note that the convergence for the case of has been established in Theorem 4.1. The algorithm with the approximate Hessian , which appears to be the most efficient option, relies on the heuristic scheme (5.3) and does not align with the Riemannian optimization framework. Some additional techniques may be necessary to ensure its convergence properties. Figure 1 also shows that the superiority of the approximate Hessians and is robust with respect to the magnitude of .

5.2 Comparison with Jacobi algorithm and relation with Lin’s theorem
We test Jacobi Algorithm 1 and our VJD Algorithm 2 with the approximate Hessian given in Section 5.1 on randomly generated almost commuting symmetric matrices of dimensions with noise levels , respectively, to compare their computational efficiencies. We plot the distributions of the computing times for both algorithms across all the generated samples in Figure 2. It clearly demonstrates that our VJD algorithm is robust against noise and much more efficient than the Jacobi one in the almost commuting regime, especially when the matrix size is large (say, ). In addition, note that Jacobi Algorithm 1 involves Givens rotation sweeps and that in each sweep, operations are needed to update the matrices and ; the computational complexity of the Jacobi algorithm is therefore . We numerically compare the complexities of the Jacobi algorithm and the proposed VJD one in Figure 3, where we plot the computational times of Algorithms 1 and 2 as functions of the matrix size on a double logarithmic scale with base 10. It shows that in practice, our VJD algorithm is also of complexity but with a smaller prefactor.


Recalling that this work was initially motivated by finding a numerically feasible solution to Lin’s theorem, we are interested in whether our VJD algorithm can produce commuting matrices satisfying the estimate (1.1) in the almost commuting regime. For this, we consider the same pairs of almost commuting matrices as in Figure 2 and compute the numerical errors (OP) of both Jacobi and VJD algorithms, denoted by and , respectively. We plot as a function of the commutator in Figure 4(a) and find that scales as , which is independent of the problem size. Thus, it is clear that when , there holds
(5.4) |
which means that and from Algorithm 2 would satisfy the bound (1.1). We provide a theoretical lower bound of below to justify (5.4), which is generalized from [33, Theorem 3.1]. The proof is given in Appendix A for the sake of completeness.
Proposition 5.1.
Let be symmetric matrices with . Suppose that matrices and are computed by Algorithm 2 with input . Then it holds that
(5.5) |
We also compute the relative errors between Jacobi and VJD algorithms in the same experimental setup by with the results shown in Figure 4(b). We can see that our VJD algorithm can achieve nearly the same accuracy as the Jacobi method but with much less computational time (cf. Figures 2 and 3).


5.3 Application in independent component analysis
Let be independent pure signals with length , which are assumed to be non-Gaussian. Suppose that the mixed signals are the linear combinations of the pure signals, i.e., , where and are matrices consisting of rows and , respectively. The goal of ICA is to reconstruct the source signals and the mixing matrix from the noisy measured data . Cardoso et al. [15, 16] proposed a standard framework for solving this problem based on the joint diagonalization. Here we follow the setup in [61] and apply the VJD algorithm to the ICA problem. The basic procedures are summarized as follows for the reader’s convenience. First, normalize such that each row of has zero mean, and define the whitened matrix with computed from the SVD of : . Second, compute the fourth-order cumulant tensor of the matrix by
(5.6) |
where are the columns of ; notation represents the element-wise product (Hadamard product); and means the expected value (average). Third, project the cumulant tensor onto a set of orthogonal eigenmatrices of dimension , where of them are zero matrices with one diagonal entry being , and the others are zero matrices with two symmetrically off-diagonal entries being . Denote the projected matrices by . Fourth, apply the VJD Algorithm 2 to jointly diagonalize the matrices . Denote the output orthogonal matrix by . Finally, reconstruct the source signals and the mixing matrix from this orthogonal matrix and the whitening transform.
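A compact sketch of this pipeline is given below. The routine joint_diagonalize stands for VJD Algorithm 2 (or any orthogonal joint diagonalizer) applied to a list of symmetric matrices, and the cumulant matrices are formed in the standard JADE fashion, whose normalization conventions may differ in minor details from (5.6).

```python
import numpy as np

def symmetric_basis(m):
    """Orthonormal 'eigenmatrices': diagonal units and normalized symmetric off-diagonal units."""
    for i in range(m):
        E = np.zeros((m, m)); E[i, i] = 1.0
        yield E
    for i in range(m):
        for j in range(i + 1, m):
            E = np.zeros((m, m)); E[i, j] = E[j, i] = 1.0 / np.sqrt(2.0)
            yield E

def cumulant_matrices(Z):
    """Projected fourth-order cumulant matrices of whitened data Z (m signals x T samples)."""
    m, T = Z.shape
    mats = []
    for E in symmetric_basis(m):
        w = np.einsum('it,ij,jt->t', Z, E, Z)       # z_t' E z_t for every sample t
        Q = (Z * w) @ Z.T / T - E - E.T - np.trace(E) * np.eye(m)
        mats.append(Q)
    return mats

def ica_via_joint_diagonalization(X, joint_diagonalize):
    """Whiten the measured signals, jointly diagonalize the cumulant matrices, and unmix;
    joint_diagonalize is a placeholder for the VJD routine."""
    T = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)          # center each measured signal
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    W = np.sqrt(T) * (U / s).T                      # whitening matrix from the SVD of Xc
    Z = W @ Xc                                      # whitened data: Z @ Z.T / T = I
    V = joint_diagonalize(cumulant_matrices(Z))     # orthogonal matrix from joint diagonalization
    return V.T @ Z, np.linalg.pinv(W) @ V           # recovered sources and mixing matrix
```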
To test our VJD algorithm in the ICA application, we consider simulated one-dimensional and two-dimensional source signals in Figures 5(a) and 6(a). We assume that the noisy measured signals are given by , where is a random orthogonal matrix and is a random matrix with being i.i.d. Gaussian ; see Figures 5(b) and 6(b). Then, in our experiments, there are symmetric matrices of size to be jointly diagonalized. We plot in Figures 5 and 6 the reconstructed signals by Jacobi Algorithm 1 and VJD Algorithm 2 to compare their performance, which clearly shows that both algorithms recover the source signals effectively. In addition, we repeat the experiments for the same source signals with 100 samples of and and record the average time cost and -error between the reconstructed signals and the source signals in Tables 1 and 2. It again demonstrates the advantages of our VJD algorithm over the standard Jacobi one.








6 Conclusion
The approximate joint diagonalization of almost commuting matrices arises in many applications and closely relates to the celebrated Huaxin Lin’s theorem. In this work, we have designed a vector-wise algorithm framework for jointly diagonalizing a given pair of almost commuting matrices , which relies on solving the sub-optimization problems () and (). To motivate our VJD algorithm (cf. Algorithm 2), we have first analyzed the stability of the local minimizers of in (), and then, with the help of the random matrix theory, we have shown that for almost commuting random matrices, with high probability the functional has local minimizers with almost orthogonal minimizing vectors (cf. Theorem 2.13). To efficiently solve the sub-optimization problems, we have proposed a projected Riemannian approximate Newton method (cf. Algorithm 3) for () and (). Moreover, we have proved in Theorem 4.1 that in the almost commuting regime, Algorithm 3 for () exhibits linear convergence with a rate of , transitioning to quadratic convergence when and commute. However, the convergence of the projected Newton method (i.e., Algorithm 3 for ()) is still open and needs further investigation. We have also proved that our VJD algorithm is stable with respect to iteration errors and noise.
Numerical results suggest that our vector-wise framework is more efficient than the popular Jacobi-type algorithm, especially for large-scale matrices, and can be applied to the ICA problem for signal reconstruction. We have numerically and theoretically examined the relations between the cost functional in (OP) and the commutator and found , which indicates that when is small enough, the commuting matrices constructed from Algorithm 2 satisfy the estimate (1.1) and hence give a desired numerical solution for Lin's theorem. Finally, we would like to point out that though our approach works quite well for almost commuting matrices, it would still be important and useful to find an algorithm for Lin's theorem beyond the almost commuting regime.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
The work of BL and JL is supported in part by the US National Science Foundation under award CCF-1910571 and by the US Department of Energy via grant DE-SC0019449. JL would like to thank Terry Loring for helpful discussions regarding the problem of almost commuting matrices.
Appendix A Proof of Proposition 5.1
Proof of Proposition 5.1.
We recall the construction of the diagonal matrices in Algorithm 2:
where is the -th column of the orthogonal matrix . Then it is easy to see that
Here and in what follows, means the diagonal part of a matrix . Hence, to bound from below, it suffices to consider the lower bound of
(A.1) |
Let be a minimizer of the optimization problem (A.1), and write
(A.2) |
where and . Noting that holds for , we readily have, by definition,
(A.3) |
The decomposition (A.2) can be rewritten as and , which implies
Then, a direct computation gives
(A.4) |
By the triangle inequality and the unitary invariance of operator norm, there holds
which further yields, by a simple estimate using (A.3),
The proof is complete.
∎
References
- [1] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
- [2] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In 2006 IEEE international conference on acoustics speech and signal processing proceedings, volume 5, pages V–V. IEEE, 2006.
- [3] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
- [4] B. Afsari and P. S. Krishnaprasad. Some gradient-based joint diagonalization methods for ICA. In Independent Component Analysis and Blind Signal Separation: Fifth International Conference, ICA 2004, Granada, Spain, September 22-24, 2004. Proceedings 5, pages 437–444. Springer, 2004.
- [5] G. W. Anderson, A. Guionnet, and O. Zeitouni. An introduction to random matrices. Cambridge Studies in Advanced Mathematics. Cambridge university press, 2010.
- [6] A. Barachant et al. pyriemann: Biosignals classification with Riemannian geometry. https://pyriemann.readthedocs.io/en/latest/index.html, 2015. [Online].
- [7] R. Bergmann and R. Herzog. Intrinsic formulation of KKT conditions and constraint qualifications on smooth manifolds. SIAM Journal on Optimization, 29(4):2423–2444, 2019.
- [8] M. Berry and A. Sameh. Multiprocessor Jacobi algorithms for dense symmetric eigenvalue and singular value decompositions. Technical report, Illinois Univ., Urbana (USA). Center for Supercomputing Research and Development, 1986.
- [9] R. Bhatia. Spectral variation, normal matrices, and Finsler geometry. Mathematical Intelligencer, 29(3):41–46, 2007.
- [10] M. A. A. Bortoloti, T. A. Fernandes, and O. P. Ferreira. An efficient damped Newton-type algorithm with globalization strategy on Riemannian manifolds. Journal of Computational and Applied Mathematics, 403:113853, 2022.
- [11] F. Bouchard, B. Afsari, J. Malick, and M. Congedo. Approximate joint diagonalization with Riemannian optimization on the general linear group. SIAM Journal on Matrix Analysis and Applications, 41(1):152–170, 2020.
- [12] N. Boumal. An introduction to optimization on smooth manifolds. To appear with Cambridge University Press, Apr 2022.
- [13] M. M. Bronstein, K. Glashoff, and T. A. Loring. Making Laplacians commute. arXiv preprint arXiv:1307.6549, 2013.
- [14] A. Bunse-Gerstner, R. Byers, and V. Mehrmann. Numerical methods for simultaneous diagonalization. SIAM journal on matrix analysis and applications, 14(4):927–949, 1993.
- [15] J.-F. Cardoso. Multidimensional independent component analysis. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), volume 4, pages 1941–1944. IEEE, 1998.
- [16] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In IEE proceedings F (radar and signal processing), volume 140, pages 362–370. IET, 1993.
- [17] J.-F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM journal on matrix analysis and applications, 17(1):161–164, 1996.
- [18] E. de Klerk, C. Dobre, and D. V. Pasechnik. Numerical block diagonalization of matrix $*$-algebras with application to semidefinite programming. Mathematical Programming, 129(1):91–111, 2011.
- [19] L. De Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM journal on Matrix Analysis and Applications, 28(3):642–666, 2006.
- [20] P. Eberlein. Solution to the complex eigenproblem by a norm reducing Jacobi type method. Numerische Mathematik, 14(3):232–245, 1970.
- [21] P. J. Eberlein. On one-sided Jacobi methods for parallel computation. SIAM Journal on Algebraic Discrete Methods, 8(4):790–796, 1987.
- [22] A. El Ghazi, S. El Hajji, L. Giraud, and S. Gratton. Newton’s method for the common eigenvector problem. Journal of computational and applied mathematics, 219(2):398–407, 2008.
- [23] E. Evert and L. De Lathauwer. Guarantees for existence of a best canonical polyadic approximation of a noisy low-rank tensor. SIAM Journal on Matrix Analysis and Applications, 43(1):328–369, 2022.
- [24] D. Eynard, K. Glashoff, M. M. Bronstein, and A. M. Bronstein. Multimodal diffusion geometry by joint diagonalization of Laplacians. arXiv preprint arXiv:1209.2295, 2012.
- [25] D.-Z. Feng, X.-D. Zhang, and Z. Bao. An efficient multistage decomposition approach for independent components. Signal Processing, 83(1):181–197, 2003.
- [26] O. P. Ferreira, A. N. Iusem, and S. Z. Németh. Projections onto convex sets on the sphere. Journal of Global Optimization, 57(3):663–676, 2013.
- [27] O. P. Ferreira, A. N. Iusem, and S. Z. Németh. Concepts and techniques of optimization on the sphere. TOP, 22(3):1148–1170, 2014.
- [28] O. P. Ferreira and B. F. Svaiter. Kantorovich’s theorem on Newton’s method in Riemannian manifolds. Journal of Complexity, 18(1):304–329, 2002.
- [29] M. Fornasier, H. Huang, L. Pareschi, and P. Sünnen. Consensus-based optimization on the sphere: Convergence to global minimizers and machine learning. Journal of Machine Learning Research, 22(237):1–55, 2021.
- [30] P. Friis and M. Rørdam. Almost commuting self-adjoint matrices: a short proof of Huaxin Lin’s theorem. Journal für die reine und angewandte Mathematik, 479:121–131, 1996.
- [31] B. Gao, X. Liu, X. Chen, and Y.-x. Yuan. A new first-order algorithmic framework for optimization problems with orthogonality constraints. SIAM Journal on Optimization, 28(1):302–332, 2018.
- [32] B. Gao, X. Liu, and Y.-x. Yuan. Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM Journal on Scientific Computing, 41(3):A1949–A1983, 2019.
- [33] K. Glashoff and M. M. Bronstein. Matrix commutators: their asymptotic metric properties and relation to approximate joint diagonalization. Linear Algebra and its Applications, 439(8):2503–2513, 2013.
- [34] H. H. Goldstine and L. P. Horwitz. A procedure for the diagonalization of normal matrices. Journal of the ACM (JACM), 6(2):176–195, 1959.
- [35] G. H. Golub and C. F. Van Loan. Matrix computations. JHU Press, 2013.
- [36] F. Gygi, J.-L. Fattebert, and E. Schwegler. Computation of maximally localized Wannier functions using a simultaneous diagonalization algorithm. Computer Physics Communications, 155(1):1–6, 2003.
- [37] S.-Y. Ha, M. Kang, D. Kim, J. Kim, and I. Yang. Stochastic consensus dynamics for nonconvex optimization on the Stiefel manifold: Mean-field limit and convergence. Mathematical Models and Methods in Applied Sciences, 32(03):533–617, 2022.
- [38] P. R. Halmos. Some unsolved problems of unknown depth about operators on Hilbert space. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 76(1):67–76, 1976.
- [39] M. B. Hastings. Topology and phases in fermionic systems. Journal of Statistical Mechanics: Theory and Experiment, 2008(01):L01001, 2008.
- [40] M. B. Hastings. Making almost commuting matrices commute. Communications in Mathematical Physics, 291(2):321–345, 2009.
- [41] M. B. Hastings and T. A. Loring. Almost commuting matrices, localized Wannier functions, and the quantum Hall effect. Journal of Mathematical Physics, 51(1):015214, 2010.
- [42] M. B. Hastings and T. A. Loring. Topological insulators and C*-algebras: Theory and numerical practice. Annals of Physics, 326(7):1699–1759, 2011.
- [43] D. Herrera. Constructive Approaches to Lin’s Theorem for Almost Commuting Matrices. arXiv preprint arXiv:2011.11800, 2020.
- [44] I. Kachkovskiy and Y. Safarov. Distance to normal elements in C*-algebras of real rank zero. Journal of the American Mathematical Society, 29(1):61–80, 2016.
- [45] J. Kim, M. Kang, D. Kim, S.-Y. Ha, and I. Yang. A stochastic consensus method for nonconvex optimization on the Stiefel manifold. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1050–1057. IEEE, 2020.
- [46] A. Kovnatsky, M. M. Bronstein, A. M. Bronstein, K. Glashoff, and R. Kimmel. Coupled quasi-harmonic bases. In Computer Graphics Forum, volume 32, pages 439–448. Wiley Online Library, 2013.
- [47] X.-b. Li, N.-j. Huang, Q. H. Ansari, and J.-C. Yao. Convergence rate of descent method with new inexact line-search on Riemannian manifolds. Journal of Optimization Theory and Applications, 180(3):830–854, 2019.
- [48] X.-L. Li and X.-D. Zhang. Nonorthogonal joint diagonalization free of degenerate solution. IEEE Transactions on Signal Processing, 55(5):1803–1814, 2007.
- [49] H. Lin. Almost commuting selfadjoint matrices and applications. Fields Inst. Commun., 13:193–233, 1997.
- [50] C. Liu and N. Boumal. Simple algorithms for optimization on Riemannian manifolds with constraints. Applied Mathematics & Optimization, 82(3):949–981, 2020.
- [51] T. A. Loring and A. P. Sørensen. Almost commuting self-adjoint matrices: the real and self-dual cases. Reviews in Mathematical Physics, 28(07):1650017, 2016.
- [52] T. Maehara and K. Murota. A numerical algorithm for block-diagonal decomposition of matrix *-algebras with general irreducible components. Japan Journal of Industrial and Applied Mathematics, 27(2):263–293, 2010.
- [53] T. Maehara and K. Murota. Algorithm for error-controlled simultaneous block-diagonalization of matrices. SIAM Journal on Matrix Analysis and Applications, 32(2):605–620, 2011.
- [54] K. Murota, Y. Kanno, M. Kojima, and S. Kojima. A numerical algorithm for block-diagonal decomposition of matrix *-algebras with application to semidefinite programming. Japan Journal of Industrial and Applied Mathematics, 27(1):125–160, 2010.
- [55] J. v. Neumann and E. P. Wigner. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. In The Collected Works of Eugene Paul Wigner, pages 294–297. Springer, 1993.
- [56] H. Nguyen, T. Tao, and V. Vu. Random matrices: tail bounds for gaps between eigenvalues. Probability Theory and Related Fields, 167(3):777–816, 2017.
- [57] Y. Ogata. Approximating macroscopic observables in quantum spin systems with commuting matrices. Journal of Functional Analysis, 264(9):2005–2033, 2013.
- [58] F. Rellich and J. Berkowitz. Perturbation theory of eigenvalue problems. CRC Press, 1969.
- [59] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.
- [60] P. Rosenthal. Are almost commuting matrices near commuting matrices? The American Mathematical Monthly, 76(8):925–926, 1969.
- [61] D. N. Rutledge and D. J.-R. Bouveresse. Independent components analysis with the JADE algorithm. TrAC Trends in Analytical Chemistry, 50:22–32, 2013.
- [62] H. Sato. Riemannian Newton-type methods for joint diagonalization on the Stiefel manifold with application to independent component analysis. Optimization, 66(12):2211–2231, 2017.
- [63] P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966.
- [64] M. Sørensen, L. De Lathauwer, P. Comon, S. Icart, and L. Deneire. Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM Journal on Matrix Analysis and Applications, 33(4):1190–1213, 2012.
- [65] T. Tao. Topics in random matrix theory, volume 132. American Mathematical Society, 2012.
- [66] T. Tao and V. Vu. Random matrices have simple spectrum. Combinatorica, 37(3):539–553, 2017.
- [67] M. Teytel. How rare are multiple eigenvalues? Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 52(8):917–934, 1999.
- [68] F. Theis. Towards a general independent subspace analysis. Advances in Neural Information Processing Systems, 19, 2006.
- [69] F. J. Theis, T. P. Cason, and P.-A. Absil. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In Independent Component Analysis and Signal Separation: 8th International Conference, ICA 2009, Paraty, Brazil, March 15-18, 2009. Proceedings 8, pages 354–361. Springer, 2009.
- [70] K. Uhlenbeck. Generic properties of eigenfunctions. American Journal of Mathematics, 98(4):1059–1078, 1976.
- [71] A.-J. Van Der Veen. Joint diagonalization via subspace fitting techniques. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 5, pages 2773–2776. IEEE, 2001.
- [72] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142:397–434, 2013.
- [73] W. H. Yang, L.-H. Zhang, and R. Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization, 10(2):415–434, 2014.
- [74] Y. Yang. Globally convergent optimization algorithms on Riemannian manifolds: Uniform framework for unconstrained and constrained optimization. Journal of Optimization Theory and Applications, 132(2):245–265, 2007.
- [75] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.
- [76] H. Zhang, A. Milzarek, Z. Wen, and W. Yin. On the geometric analysis of a quartic–quadratic optimization problem under a spherical constraint. Mathematical Programming, 195(1-2):421–473, 2022.