
Vector-wise Joint Diagonalization of Almost Commuting Matrices

Bowen Li bowen.li200@duke.edu Jianfeng Lu jianfeng@math.duke.edu Ziang Yu yuziang@sjtu.edu.cn Department of Mathematics, Duke University, United States Department of Chemistry and Department of Physics, Duke University, United States School of Mathematical Sciences, Shanghai Jiao Tong University, China
Abstract

This work aims to numerically construct exactly commuting matrices close to given almost commuting ones, which is equivalent to the joint approximate diagonalization problem. We first prove that almost commuting matrices generically have approximate common eigenvectors that are almost orthogonal to each other. Based on this key observation, we propose an efficient and robust vector-wise joint diagonalization (VJD) algorithm, which constructs the orthogonal similarity transform by sequentially finding these approximate common eigenvectors. In doing so, we consider sub-optimization problems over the unit sphere, for which we present a Riemannian approximate Newton method with a rigorous convergence analysis. We also discuss the numerical stability of the proposed VJD algorithm. Numerical experiments with applications in the independent component analysis are provided to reveal the relation with Huaxin Lin’s theorem and to demonstrate that our method compares favorably with the state-of-the-art Jacobi-type joint diagonalization algorithm.

keywords:
Almost commuting matrices , Huaxin Lin’s theorem , Joint diagonalizability , Riemannian approximate Newton method , Vector-wise algorithm
MSC:
[2020] 15A18 , 15A27 , 15A45 , 15A60 , 58C15 , 65F15 , 65F35

1 Introduction

It is a fundamental result in linear algebra that a set of Hermitian matrices are commuting if and only if they are jointly diagonalizable by a unitary matrix. A closely related and long-standing question [38, 60], dating back to the 1950s, was whether almost commuting matrices are necessarily close to commuting pairs. It was answered affirmatively by Huaxin Lin [49] in 1995, who gave the following celebrated theorem: for any \epsilon>0, there exists \delta>0 such that if A,B\in\mathbb{C}^{n\times n} are Hermitian matrices satisfying \lVert A\rVert,\lVert B\rVert\leq 1 and \lVert AB-BA\rVert<\delta, then there exist commuting Hermitian matrices A^{\prime},B^{\prime} such that \lVert A^{\prime}-A\rVert,\lVert B^{\prime}-B\rVert<\epsilon. A simplified proof of Lin's result was provided soon after by Friis and Rørdam [30], and an analogous result in the real symmetric case was recently given by Loring and Sørensen [51]. However, their arguments were abstract and did not show how to construct the matrices A^{\prime},B^{\prime}, which would be very useful from the application point of view [51]. In 2008, Hastings [40] suggested a constructive approach to finding the nearby commuting matrices A^{\prime} and B^{\prime} and discussed the dependence of \delta on \epsilon (see also [43] for extended discussions of Hastings' method). Later, Kachkovskiy and Safarov, in their seminal work [44], revisited this problem in the framework of operator algebra theory and obtained the optimal dependence of \delta on \epsilon by an argument different from the one in [40]. We state their result below for future reference. The interested reader is referred to [43, 44] for a more complete historical review of Lin's theorem and its development. A list of notations used throughout this work is given at the end of this section.

Theorem 1.1 ([44]).

For any two self-adjoint operators A,B on a Hilbert space H, there exists a pair of commuting self-adjoint operators A^{\prime},B^{\prime} such that

\lVert A-A^{\prime}\rVert+\lVert B-B^{\prime}\rVert\leq C\lVert[A,B]\rVert^{1/2}\,, (1.1)

where [A,B]=AB-BA is the commutator and the constant C is independent of A,B and the dimension of H.
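The quantities in Theorem 1.1 are easy to monitor numerically. The following minimal NumPy sketch (an illustration only; the perturbation size is an arbitrary choice) builds a commuting pair (A_{*},B_{*}), perturbs it to obtain (A,B), and reports the commutator norm \lVert[A,B]\rVert together with \lVert A-A_{*}\rVert+\lVert B-B_{*}\rVert, i.e., the distance to one admissible commuting pair.

import numpy as np

rng = np.random.default_rng(0)
n = 50

# A commuting pair: two diagonal matrices conjugated by the same orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A_star = Q @ np.diag(rng.standard_normal(n)) @ Q.T
B_star = Q @ np.diag(rng.standard_normal(n)) @ Q.T

def sym(M):
    return (M + M.T) / 2

delta = 1e-4  # size of the symmetric perturbations (arbitrary illustrative choice)
A = A_star + delta * sym(rng.standard_normal((n, n)))
B = B_star + delta * sym(rng.standard_normal((n, n)))

comm = np.linalg.norm(A @ B - B @ A, 2)                       # ||[A, B]|| (spectral norm)
dist = np.linalg.norm(A - A_star, 2) + np.linalg.norm(B - B_star, 2)
print(f"||[A,B]|| = {comm:.2e},  ||A-A_*|| + ||B-B_*|| = {dist:.2e}")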

The theory of almost commuting matrices and Lin’s theorem find numerous applications in many fields, e.g., quantum mechanics [39, 41, 42, 57], computer graphics [13, 46], and signal processing [16, 25, 61]. In particular, Hastings used Lin’s theorem to claim the existence of localized Wannier functions for non-interacting fermions [39], which inspired a series of subsequent works [41, 42]. For the application in computer science, there is a famous algorithm called Joint Approximate Diagonalization (JADE) developed by Cardoso and Souloumiac for blind source separation [16, 17]. Such an algorithm has also been used in computational quantum chemistry to generate Wannier functions [36]. The machine learning-related applications are recently discussed in [33, 46]. While numerous attempts have been made to constructively prove Lin’s theorem and explore its applications in various areas, a numerically feasible algorithm for finding A,BA^{\prime},B^{\prime} in Theorem 1.1 remains elusive.

In this work, our focus is on designing a fast algorithm for Theorem 1.1 in the almost commuting regime. Given matrices A,BS(n)A,B\in{\rm S}(n) with [A,B]1\lVert[A,B]\rVert\ll 1, we consider the following equivalent optimization problem:

minD1,D2diag(n),UO(n)𝒥(D1,D2,U):=UTAUD12+UTBUD22.\begin{split}&\min_{D_{1},\,D_{2}\in\mathrm{diag}\left(n\right),\,U\in O(n)}\mathcal{J}(D_{1},D_{2},U):=\lVert U^{T}AU-D_{1}\rVert^{2}+\lVert U^{T}BU-D_{2}\rVert^{2}.\end{split} (OP)

In view of practical applications [24, 33, 36], we will focus on real matrices, but all the results and discussions can be adapted to the complex case easily. It may be clear but worth emphasizing that any approximate solution to (OP) gives a potential pair of commuting matrices (A^{\prime},B^{\prime})=(UD_{1}U^{T},UD_{2}U^{T}) satisfying the estimate (1.1). The problem (OP) is generally very difficult to solve by existing manifold optimization techniques, since the operator norm involved is closely related to Finsler geometry [9], which is a rather uncommon structure in the optimization community. Even if we replace the operator norm with the Frobenius one, (OP) is still a challenging problem, due to the non-convexity of the objective function and the Stiefel manifold constraint. One old but popular method is the Jacobi-type algorithm presented in [14], with ideas dating back to [20, 34]. This algorithm minimizes the off-diagonal objective function: for Hermitian matrices A and B,

J(A,B,U)={\rm off}(U^{*}AU)+{\rm off}(U^{*}BU)\,,

by a sequence of unitary transformations constructed from Givens rotation matrices:

R(i,j,c,s):=I+(c-1)e_{i}e_{i}^{T}-\bar{s}e_{i}e_{j}^{T}+se_{j}e_{i}^{T}+(\bar{c}-1)e_{j}e_{j}^{T}\,,

where U is a unitary matrix with conjugate transpose U^{*}, {\rm off}(A):=\sum_{i\neq j}|a_{ij}|^{2} measures the off-diagonal part of a square matrix A, and the parameters c,s\in\mathbb{C} are the Jacobi angles satisfying |c|^{2}+|s|^{2}=1. The optimal c and s are derived in closed form in [17, Theorem 1.1]. For the readers' convenience, we briefly summarize the main procedures of the Jacobi algorithm [14] in the following pseudocode, together with a small numerical sketch after Algorithm 1.

Input: \epsilon>0; Hermitian commuting matrices A,B
Output: unitary matrix U such that {\rm off}(U^{*}AU)+{\rm off}(U^{*}BU)\leq\epsilon(\lVert A\rVert_{\rm F}+\lVert B\rVert_{\rm F})

U=I;
while {\rm off}(A)+{\rm off}(B)>\epsilon(\lVert A\rVert_{\rm F}+\lVert B\rVert_{\rm F}) do
      for i=1,\ldots,n and j=i+1,\ldots,n do
            set parameters c and s as described in [17]
            R=R(i,j,c,s)
            U=UR; A=R^{*}AR; B=R^{*}BR
      end for
end while
Algorithm 1 Jacobi algorithm for simultaneous diagonalization [14, 17]
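For illustration, the following is a compact NumPy sketch of Algorithm 1 for real symmetric inputs. The closed-form Jacobi angles below are the standard real-symmetric specialization of the construction in [17] (top eigenvector of a 2x2 Gram matrix); this specialization is an assumption of the sketch, and the complex case requires the three-component form given in [17].

import numpy as np

def off(A):
    # off(A): sum of squared off-diagonal entries.
    return float(np.sum(A * A) - np.sum(np.diag(A) ** 2))

def jacobi_angles(A, B, i, j):
    # Real-case closed-form Jacobi angles (c, s) for the (i, j) plane.
    h = np.array([[A[i, i] - A[j, j], 2.0 * A[i, j]],
                  [B[i, i] - B[j, j], 2.0 * B[i, j]]])
    G = h.T @ h
    if np.allclose(G, 0.0):
        return 1.0, 0.0                       # plane already jointly diagonal
    _, V = np.linalg.eigh(G)
    x, y = V[:, -1]                           # unit eigenvector of the largest eigenvalue
    if x < 0:
        x, y = -x, -y
    c = np.sqrt((x + 1.0) / 2.0)
    s = y / np.sqrt(2.0 * (x + 1.0))
    return c, s

def jacobi_simultaneous_diag(A, B, eps=1e-12, max_sweeps=50):
    # One possible implementation of Algorithm 1 (real symmetric case).
    A, B = A.astype(float).copy(), B.astype(float).copy()
    n = A.shape[0]
    U = np.eye(n)
    thresh = eps * (np.linalg.norm(A, 'fro') + np.linalg.norm(B, 'fro'))
    for _ in range(max_sweeps):
        if off(A) + off(B) <= thresh:
            break
        for i in range(n):
            for j in range(i + 1, n):
                c, s = jacobi_angles(A, B, i, j)
                R = np.eye(n)
                R[i, i] = c; R[j, j] = c
                R[i, j] = -s; R[j, i] = s     # Givens rotation R(i, j, c, s)
                A, B, U = R.T @ A @ R, R.T @ B @ R, U @ R
    return U, A, B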

Algorithm 1 is known to have the quadratic asymptotic convergence rate [14] and can be efficiently implemented in the parallel framework as discussed in [8, 21]. Another advantage of the Jacobi algorithm is its numerical stability against rounding errors. It is worth noting that Theis [68] extended the above Jacobi algorithm for the joint block diagonalization, but the output block decomposition may not be the finest one. For the finest block diagonalization, a class of algorithms based on the theory of matrix *-algebra has been proposed in [18] and developed in [52, 53, 54]. Other joint diagonalization techniques in various problem settings include a gradient-based algorithm for approximately diagonalizing almost commuting Laplacians [13], subspace fitting methods with non-orthogonal transformations [71, 75], and the fast approximate joint diagonalization free of trivial solutions [48], just to name a few. In particular, Riemannian optimization algorithms have been recently explored for the joint diagonalization problem with or without orthogonality constraints. Theis et al. [69] applied the Riemannian trust-region method to the joint diagonalization on the Stiefel manifold. Within the same framework, an efficient Riemannian Newton-type method was investigated in [62] by Sato. More recently, Bouchard et al. [11] proposed a unified Riemannian optimization framework for the approximate joint diagonalization on the general linear group, which generalizes previous algorithms [2, 4].

The joint matrix diagonalization problem is also closely related to the canonical polyadic decomposition (CPD). In [19], De Lathauwer derived a sufficient condition for the uniqueness of CPD and showed that the decomposition can be obtained by a simultaneous diagonalization by congruence. Sørensen [64] proved that for a given tensor, the optimal low-rank CPD approximation exists with one of the matrix factors being column-wise orthonormal. In particular, numerical algorithms based on joint congruence diagonalization and alternating least squares are also explored in [64] for this constrained CPD. We also mention that in a recent work [23] of Evert and De Lathauwer, they discussed the optimal CPD of a noisy low-rank tensor and connected it to a joint generalized eigenvalue problem.

In this work, we propose a practically efficient vector-wise algorithm for finding the approximate solution to the optimization problem (OP) in the case where [A,B]1\lVert[A,B]\rVert\ll 1, and present the associated error analysis. Numerical experiments demonstrate that our algorithm has comparable or even better performance than that of the state-of-the-art Jacobi-type method while significantly reducing computational time. Moreover, we numerically compute the error between the given almost commuting matrices (A,B)(A,B) and the commuting ones (A,B)(A^{\prime},B^{\prime}) constructed by our algorithm. It turns out that in the almost commuting regime, AA+BB\lVert A-A^{\prime}\rVert+\lVert B-B^{\prime}\rVert is of the order [A,B]\lVert[A,B]\rVert, and thus the constructed (A,B)(A^{\prime},B^{\prime}) satisfies the upper bound (1.1) in Lin’s theorem. Some applications in independent component analysis are also provided to justify the effectiveness of our method.

We now summarize the main idea and contributions of this work in detail. Our method is motivated by the simple fact that commuting matrices A,B can be simultaneously diagonalized by an orthogonal matrix whose columns consist of common orthonormal eigenvectors of A,B, and these eigenvectors can be characterized by the global minimizers of the following optimization problem:

\min_{\Lambda=(\lambda,\mu)\in\mathbb{R}^{2},\,v\in\mathbb{S}^{n-1}}\mathcal{L}(\Lambda,v):=\lVert Av-\lambda v\rVert^{2}+\lVert Bv-\mu v\rVert^{2}\,. ({\rm OP}_{1})

Note that \mathcal{L}(\Lambda,v) admits n minimizers (\Lambda_{j},v_{j}) with \mathcal{L}(\Lambda_{j},v_{j})=0 and the v_{j} being mutually orthonormal. This observation leads to an ideal approach for diagonalizing commuting matrices (A,B). Suppose that we have successfully found the first i minimizers \{(\Lambda_{j},v_{j})\}_{j=1}^{i} of \mathcal{L}(\Lambda,v). To determine the (i+1)-th minimizer, we solve ({\rm OP}_{1}) with an additional orthogonality constraint:

\min\left\{\mathcal{L}(\Lambda,v)\,;\ \Lambda\in\mathbb{R}^{2}\,,\ v\in\mathbb{S}^{n-1}\,,\ \langle v,v_{j}\rangle=0\,,\ 1\leq j\leq i\right\}\,. (1.2)

However, this elegant approach may encounter practical challenges due to numerical noise and errors. Since we cannot solve the optimization problem (1.2) exactly at each step, the numerical errors will accumulate, so that the constraints \langle v,v_{j}\rangle=0 may exclude the minimizers of the cost functional \mathcal{L}, which leads to the failure of the algorithm to find a complete orthonormal set of common eigenvectors. Moreover, the input commuting matrices are generally subject to numerical noise or rounding errors. As proved in Lemma 2.3, any such small perturbation could destroy all the common eigenvectors. Consequently, the functional \mathcal{L} might not exhibit sufficiently many local minimizers, in which case the algorithm will also fail.

Our first contribution is Theorem 2.13, which proves that in the almost commuting regime, \mathcal{L}(\Lambda,v) generically has n local minimizers \{(\Lambda_{j},v_{j})\}_{j=1}^{n} with the v_{j} being almost orthogonal to each other (so that they may be viewed as approximate common eigenvectors). For this, we provide a stability result in Theorem 2.6 by using perturbation techniques. Then, based on the theory of the Gaussian orthogonal ensemble (GOE) [5, 65], we show that the common eigenspaces of commuting matrices are generically one-dimensional from both topological and probabilistic points of view (cf. Lemma 2.9 and Corollary 2.11), similar to the well-known genericity of simple eigenvalues of self-adjoint operators [66, 67, 70]. Combining these results with the estimates of the spectral gaps of random matrices in [56], we immediately conclude the desired claim in Theorem 2.13.

In view of this generic property of the local minimizers of \mathcal{L}(\Lambda,v), we consider a relaxed version of the aforementioned approach that finds the columns of the orthogonal matrix U and the diagonal entries of D_{1},D_{2} (given by the pairs \Lambda_{i}) sequentially, called vector-wise joint diagonalization (VJD); see Algorithm 2. The core idea behind VJD is to relax the exact orthogonality constraint \langle v,v_{j}\rangle=0 in (1.2) to |\langle v,v_{j}\rangle|\leq\varepsilon. This relaxation accommodates possible numerical errors and the non-commutativity of the input matrices. However, enforcing constraints of the form |\langle v,v_{j}\rangle|\leq\varepsilon, 1\leq j\leq i, can be numerically challenging. Therefore, in practice, we adopt the equivalent constraint:

{\rm dist}(v,S_{i})\leq\sqrt{\varepsilon}\,, (1.3)

where S_{i}={\rm span}\{v_{1},\ldots,v_{i}\}^{\perp}\cap\mathbb{S}^{n-1} and \varepsilon is the error tolerance parameter. Notably, the form (1.3) allows for the straightforward computation of the associated projection, as shown in Lemma 2.4. The VJD algorithm proceeds as follows; see Section 3 for more details. We first solve ({\rm OP}_{1}) to find (\lambda_{1},\mu_{1},v_{1}). Then, for j\geq 2, we consider the relaxed version of (1.2):

\min\left\{\mathcal{L}(\Lambda,v)\,;\ \Lambda=(\lambda,\mu)\in\mathbb{R}^{2}\,,\ v\in\mathbb{S}^{n-1}\,,\ {\rm dist}(v,S_{j-1})\leq\sqrt{\varepsilon}\right\}\,, ({\rm OP}_{2})

which gives (\lambda_{j},\mu_{j},v_{j}). This iterative process culminates in the construction of an almost orthogonal matrix V=[v_{1},\ldots,v_{n}] and two diagonal matrices D_{1}={\rm diag}(\lambda_{1},\ldots,\lambda_{n}) and D_{2}={\rm diag}(\mu_{1},\ldots,\mu_{n}). The algorithm ends with a classical nearest orthogonal matrix problem (i.e., the orthogonal Procrustes problem [63]):

\min\{\lVert U-V\rVert\,;\ U\in{\rm O}(n)\}\,, (1.4)

where \lVert\cdot\rVert is either the spectral norm op\lVert\cdot\rVert_{{\rm op}} or the Frobenius norm F\lVert\cdot\rVert_{{\rm F}}. In both cases, the problem (1.4) has an analytic solution U=U1U2TU=U_{1}U_{2}^{T}, with V=U1ΣU2TV=U_{1}\Sigma U_{2}^{T} being the singular value decomposition (SVD) of VV. For solving sub-optimization problems (OP1{\rm OP}_{1}) and (OP2{\rm OP}_{2}), we derive a Riemannian approximate Newton-type method with the approximate Hessian matrix being structurally simple and positive semidefinite; see Algorithm 3. Our method exploits the second-order information of the cost function and enjoys stability with low computational cost. We would like to remark that there might be some other algorithms that are also potentially suitable for our optimization problems, for example, the trust-region method [1], the column-wise block coordinate descent method [31, 32], the Cayley transform-based optimization [72], and the consensus-based optimization method [29, 37, 45], but the detailed discussion about these approaches is beyond the scope of this work.
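Returning to the Procrustes step (1.4): given the SVD of V, its solution is a one-liner, as in the following minimal NumPy sketch.

import numpy as np

def nearest_orthogonal(V):
    # Solve min_{U in O(n)} ||U - V|| via the SVD V = U1 @ diag(S) @ U2^T.
    U1, S, U2t = np.linalg.svd(V)
    return U1 @ U2t

# Example: a slightly non-orthogonal matrix is snapped back to O(n).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V = Q + 1e-3 * rng.standard_normal((5, 5))
U = nearest_orthogonal(V)
print(np.linalg.norm(U.T @ U - np.eye(5)))   # ~ machine precision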

Our third contribution lies in establishing the global convergence of the proposed Riemannian approximate Newton method for ({\rm OP}_{1}) and demonstrating the numerical stability of our VJD algorithm. By leveraging techniques from [10, 47, 74] for Newton-type methods on manifolds, we prove in Theorem 4.1 that Algorithm 3 for ({\rm OP}_{1}) converges linearly with convergence rate O(\lVert[A,B]\rVert^{1/2})\ll 1, and quadratically if [A,B]=0. However, it is important to note that our optimization process, owing to its sequential nature, does not exactly correspond to solving (OP) when A and B are non-commuting. Nevertheless, based on empirical observations from numerical experiments, our method provides a good approximation to the original problem. To further substantiate this numerical advantage, we theoretically establish the error-tolerant property of our VJD algorithm in Proposition 4.5. We remark that although our discussion primarily focuses on a pair of almost commuting matrices, our algorithm and analysis extend directly to a family of matrices that are pairwise almost commuting.

This work is organized as follows. Section 2 delves into the stability and generic behaviors of local minimizers of (Λ,v)\mathcal{L}(\Lambda,v), serving as motivation for the VJD algorithm outlined in Section 3, which relies on an efficient approximate Newton method for sub-optimization problems. Additionally, Section 4 is dedicated to the examination of the convergence properties and numerical stability of the presented algorithms. In Section 5, various numerical examples are presented to justify the efficiency of our method.

Notations.

  1. Let {\rm S}(n) be the set of symmetric n\times n matrices, and {\rm O}(n) be the orthogonal group in dimension n. We denote by {\rm diag}(n) the space of n\times n diagonal matrices, and by {\rm diag}_{\leq}(n) the closed subset of {\rm diag}(n) with non-decreasing diagonal entries \lambda_{1}\leq\cdots\leq\lambda_{n}. By abuse of notation, we also write {\rm diag}(\lambda_{1},\ldots,\lambda_{n}) for the diagonal matrix with diagonal entries \{\lambda_{i}\}_{i=1}^{n}. We write A\succeq 0 (resp., A\succ 0) for A\in{\rm S}(n) if A is positive semidefinite (resp., positive definite).

  2. For a matrix M\in\mathbb{R}^{m\times n}, M_{ij} denotes its (i,j) element. Moreover, we denote by \lVert\cdot\rVert_{\rm F} the Frobenius norm, and by \lVert\cdot\rVert_{\rm op} (or, simply, \lVert\cdot\rVert) the operator norm.

  3. Let \langle\cdot,\cdot\rangle be the Euclidean inner product of \mathbb{R}^{n} and \lVert\cdot\rVert be the associated norm. The standard basis of \mathbb{R}^{n} is given by \{e_{i}\}_{i=1}^{n}, where e_{i} denotes the vector with 1 in the i-th coordinate and 0 elsewhere. We will not distinguish the row vector (x_{1},\ldots,x_{n}) and the column one (x_{1},\ldots,x_{n})^{T} unless otherwise specified. We define the (n-1)-dimensional unit sphere by \mathbb{S}^{n-1}:=\{x=(x_{1},\ldots,x_{n})\in\mathbb{R}^{n}\,;\ \lVert x\rVert=1\}.

  4. We denote by \sigma(X) the spectrum of a matrix X\in{\rm S}(n) and by \delta(X) its spectral gap:

    \delta(X)=\min\{|\lambda_{i}-\lambda_{j}|\,;\ \lambda_{i}\neq\lambda_{j}\,,\ \lambda_{i},\lambda_{j}\in\sigma(X)\}\,.

    We say that an eigenvalue \lambda\in\sigma(X) of X\in{\rm S}(n) is simple (resp., degenerate) if its multiplicity is equal to (resp., greater than) one.

  5. We recall the commutator [A,B]=AB-BA of two matrices A,B\in\mathbb{R}^{n\times n} and define the set of commuting matrices by

    {\rm F}_{\rm com}:=\{(A,B)\in{\rm S}(n)\times{\rm S}(n)\,;\ [A,B]=0\}\,.

    Then, we say that \Lambda:=(\lambda,\mu)\in\sigma(A)\times\sigma(B) is an eigenvalue pair of (A,B)\in{\rm F}_{\rm com} if the associated common eigenspace is non-trivial:

    E_{A,B}^{\Lambda}:=\{v\in\mathbb{R}^{n}\,;\ Av=\lambda v,\,Bv=\mu v\}\neq\{0\}\,. (1.5)

    With slight abuse of notation, we define the joint spectrum \sigma(A,B)\subset\mathbb{R}^{2} of (A,B)\in{\rm F}_{\rm com} as the set of all its eigenvalue pairs \Lambda_{j}, and the eigenvalue gap by

    \delta(A,B):=\min\{\lVert\Lambda_{i}-\Lambda_{j}\rVert\,;\ \Lambda_{i}\neq\Lambda_{j}\,,\ \Lambda_{i},\Lambda_{j}\in\sigma(A,B)\}\,. (1.6)

    We also denote by {\rm conv}(\sigma(A,B))\subset\mathbb{R}^{2} the convex hull of \sigma(A,B). The multiplicity of \Lambda\in\sigma(A,B) is defined as the dimension of the subspace E_{A,B}^{\Lambda}, and \Lambda\in\sigma(A,B) is said to be simple (resp., degenerate) if \dim(E_{A,B}^{\Lambda}) is equal to (resp., greater than) one.
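In particular, the joint spectrum \sigma(A,B) and the eigenvalue gap \delta(A,B) of a commuting pair with one-dimensional common eigenspaces can be computed as in the following NumPy sketch; diagonalizing the generic combination A+tB to obtain common eigenvectors is a standard device and an assumption of this sketch rather than a construction used later in the paper.

import numpy as np

def joint_spectrum(A, B, t=0.7071):
    # Eigenvalue pairs and eigenvalue gap (1.6) of a commuting pair (A, B).
    # Assumes all common eigenspaces are one-dimensional, so that the
    # eigenvectors of the generic combination A + t*B are common eigenvectors.
    _, V = np.linalg.eigh(A + t * B)
    pairs = np.array([[v @ A @ v, v @ B @ v] for v in V.T])     # the Lambda_j
    gaps = [np.linalg.norm(p - q) for k, p in enumerate(pairs)
            for q in pairs[k + 1:]]
    return pairs, min(gaps)

# Example: a random commuting pair built from a shared eigenbasis.
rng = np.random.default_rng(2)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.standard_normal(n)) @ Q.T
B = Q @ np.diag(rng.standard_normal(n)) @ Q.T
pairs, gap = joint_spectrum(A, B)
print(pairs.round(3), gap)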

2 Analysis of local minimizers of ({\rm OP}_{1})

This section is devoted to the analysis of the local minimizers of the functional (Λ,v)\mathcal{L}(\Lambda,v) in (OP1{\rm OP}_{1}). Note (Λ,v)=(Λ,v)\mathcal{L}(\Lambda,v)=\mathcal{L}(\Lambda,-v) and that if (Λ,v)(\Lambda,v) is a local minimum of \mathcal{L}, so is (Λ,v)(\Lambda,-v). It is convenient to consider a point (Λ,v)2×𝕊n1(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1} as an equivalence class {(Λ,v),(Λ,v)}\{(\Lambda,v),(\Lambda,-v)\} in what follows. We shall first derive the optimality conditions of (OP1{\rm OP}_{1}) and characterize the minimizers of \mathcal{L} in the commuting case. Then, we establish some stability estimates on the minimizers by perturbation arguments. We also show that with high probability 1o(1)1-o(1), the functional (Λ,v)\mathcal{L}(\Lambda,v) for almost commuting random matrices has nn local minima {(Λi,vi)}i=1n\{(\Lambda_{i},v_{i})\}_{i=1}^{n} on the manifold 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1} with minimizing vectors vi𝕊n1v_{i}\in\mathbb{S}^{n-1} almost orthogonal to each other. These findings shed light on designing the column-wise algorithm for the optimization problem (OP), which we will discuss in detail in the next section.

Recall that in this work, we consider the almost commuting symmetric matrices (A,B)(A,B) with [A,B]1\lVert[A,B]\rVert\ll 1. By Theorem 1.1, without loss of generality, we assume that the matrices (A,B)(A,B) can be written as

A=A_{*}+\Delta A\,,\quad B=B_{*}+\Delta B\,, (2.1)

for some commuting matrices (A,B)Fcom(A_{*},B_{*})\in{\rm F}_{\rm com} with

\lVert\Delta A\rVert+\lVert\Delta B\rVert\leq C\lVert[A,B]\rVert^{1/2}\ll 1\,. (2.2)

Note that in the case of [A,B]=0[A,B]=0 (i.e., ΔA=ΔB=0\Delta A=\Delta B=0), we can always find nn local (also global) minima of (Λ,v)\mathcal{L}(\Lambda,v) (with minimum value zero) given by the eigenvalue pairs of (A,B)(A,B) and the associated common orthonormal eigenvectors. Indeed, the converse is also true; see Proposition 2.2 below.

We start with the optimality conditions for ({\rm OP}_{1}). For ease of exposition, we recall some basic concepts of Riemannian optimization on a smooth manifold \mathcal{M} embedded in a Euclidean space \mathbb{R}^{n} [3, 12]. We denote by T_{x}\mathcal{M} the tangent space at x\in\mathcal{M} and by \mathcal{P}_{x} the orthogonal projection from \mathbb{R}^{n} to T_{x}\mathcal{M}. Let f be a smooth function on \mathcal{M}, which admits a smooth extension \bar{f} to a neighborhood of \mathcal{M} in \mathbb{R}^{n}. In what follows, we denote by \nabla and \nabla^{2} the standard Euclidean gradient and Hessian, respectively. Then, the Riemannian gradient of f is given by

\operatorname{grad}f(x)=\mathcal{P}_{x}(\nabla\bar{f}(x))\,; (2.3)

see [12, Proposition 3.53]. Letting G¯\bar{G} be any smooth extension of gradf\operatorname{grad}f, the Riemannian Hessian of ff, as a linear map on TxT_{x}\mathcal{M}, can be computed by [12, Corollary 5.14]

\operatorname{Hess}f(x)=\mathcal{P}_{x}(\nabla\bar{G}(x))\,. (2.4)

In the special case of =𝕊n1\mathcal{M}=\mathbb{S}^{n-1}, we have the tangent space Tv𝕊n1={pn;p,v=0}T_{v}\mathbb{S}^{n-1}=\{p\in\mathbb{R}^{n}\,;\ \langle p,v\rangle=0\} at v𝕊n1v\in\mathbb{S}^{n-1} with the associated projection 𝒫v\mathcal{P}_{v} given by InvvTI_{n}-vv^{T}. Thanks to (2.3) and (2.4) above, the Riemannian gradient and Hessian of ff on 𝕊n1\mathbb{S}^{n-1} can be explicitly computed as

\operatorname{grad}f(v)=(I_{n}-vv^{T})\nabla\bar{f}(v)\,,\quad\operatorname{Hess}f(v)=(I_{n}-vv^{T})(\nabla^{2}\bar{f}(v)-\langle\nabla\bar{f}(v),v\rangle I_{n})\,. (2.5)

We also recall the notion of retraction on 𝕊n1\mathbb{S}^{n-1} [12, Definition 3.41]. In this work, we limit our discussion to the following one for its simplicity (another popular choice is the exponential map):

R_{v}(p)=\frac{v+p}{\lVert v+p\rVert}\,,\quad v\in\mathbb{S}^{n-1}\,,\ p\in T_{v}\mathbb{S}^{n-1}\,. (2.6)

It is worth noting that R_{v} in (2.6) is a second-order retraction on \mathbb{S}^{n-1}, i.e., for any v\in\mathbb{S}^{n-1} and p\in T_{v}\mathbb{S}^{n-1}, the curve c(t)=R_{v}(tp) satisfies c(0)=v, c^{\prime}(0)=p, and c^{\prime\prime}(0)=0; see [12, Definition 5.41].
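The tangent-space projection and the retraction (2.6) take only a few lines of NumPy; the sketch below also checks by finite differences that c^{\prime}(0)=p and that the acceleration of c(t)=R_{v}(tp) at t=0 has no tangential component, which is the second-order property stated above (the ambient acceleration points in the normal direction v).

import numpy as np

def proj_tangent(v, z):
    # Orthogonal projection of z onto T_v S^{n-1}, i.e. (I - v v^T) z.
    return z - v * (v @ z)

def retract(v, p):
    # Retraction R_v(p) = (v + p) / ||v + p|| from (2.6).
    w = v + p
    return w / np.linalg.norm(w)

rng = np.random.default_rng(3)
v = rng.standard_normal(8); v /= np.linalg.norm(v)
p = proj_tangent(v, rng.standard_normal(8))

t = 1e-4
c_plus, c_minus = retract(v, t * p), retract(v, -t * p)
vel = (c_plus - c_minus) / (2 * t)             # finite-difference c'(0)
acc = (c_plus - 2 * v + c_minus) / t**2        # finite-difference ambient c''(0)
print(np.linalg.norm(vel - p))                 # ~ 0: c'(0) = p
print(np.linalg.norm(proj_tangent(v, acc)))    # ~ 0: no tangential acceleration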

We now compute the Riemannian gradient of the functional (Λ,v)\mathcal{L}(\Lambda,v) on the manifold 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1} by (2.3). Note that \mathcal{L} is well-defined on 2+n\mathbb{R}^{2+n}. For Λ=(λ,μ)2\Lambda=(\lambda,\mu)\in\mathbb{R}^{2} and v𝕊n1v\in\mathbb{S}^{n-1}, we have

\operatorname{grad}\mathcal{L}(\Lambda,v)=\left[\begin{matrix}2\lambda-2\langle v,Av\rangle\\ 2\mu-2\langle v,Bv\rangle\\ \operatorname{grad}_{v}\mathcal{L}(\Lambda,v)\end{matrix}\right]\quad\text{with}\quad\operatorname{grad}_{v}\mathcal{L}(\Lambda,v)=2\left(I_{n}-vv^{T}\right)\left((A-\lambda)^{2}+(B-\mu)^{2}\right)v\,, (2.7)

where gradv\operatorname{grad}_{v}\mathcal{L} denotes the Riemannian gradient of (Λ,v)\mathcal{L}(\Lambda,v) in v𝕊n1v\in\mathbb{S}^{n-1}. Similarly, by (2.4), we calculate the Riemannian Hessian of \mathcal{L} as follows:

\operatorname{Hess}\mathcal{L}(\Lambda,v)=\left[\begin{matrix}2I_{2}&W^{T}\\ W&\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)\end{matrix}\right], (2.8)

where the matrices Wn×2W\in\mathbb{R}^{n\times 2} and Hessv(Λ,v)S(n)\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)\in{\rm S}(n) are given by

W=-4\left[\begin{matrix}\left(I_{n}-vv^{T}\right)Av&\left(I_{n}-vv^{T}\right)Bv\end{matrix}\right], (2.9)

and

\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)=(I_{n}-vv^{T})\left(\nabla^{2}_{v}\mathcal{L}(\Lambda,v)-\langle v,\nabla^{2}_{v}\mathcal{L}(\Lambda,v)v\rangle I_{n}\right). (2.10)

Here, v2(Λ,v)\nabla^{2}_{v}\mathcal{L}(\Lambda,v) is the Euclidean Hessian of (Λ,v)\mathcal{L}(\Lambda,v) in vv:

\nabla^{2}_{v}\mathcal{L}(\Lambda,v)=2(A-\lambda)^{2}+2(B-\mu)^{2}\,. (2.11)
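The formulas (2.7)-(2.11) translate directly into a few lines of NumPy; the sketch below assembles the Riemannian gradient and the Hessian blocks at a point (\Lambda,v) and is meant only as an illustration (the function name is ours).

import numpy as np

def riemannian_grad_hess(A, B, lam, mu, v):
    # Riemannian gradient (2.7) and Hessian blocks (2.8)-(2.11) of L at (Lambda, v).
    n = len(v)
    P = np.eye(n) - np.outer(v, v)                     # projection onto T_v S^{n-1}
    Ares, Bres = A - lam * np.eye(n), B - mu * np.eye(n)

    grad_Lam = np.array([2 * lam - 2 * v @ A @ v,
                         2 * mu - 2 * v @ B @ v])
    grad_v = 2 * P @ (Ares @ Ares + Bres @ Bres) @ v   # (2.7)

    W = -4 * np.column_stack([P @ A @ v, P @ B @ v])   # (2.9)
    H_euc = 2 * Ares @ Ares + 2 * Bres @ Bres          # (2.11)
    Hess_v = P @ (H_euc - (v @ H_euc @ v) * np.eye(n)) # (2.10)

    Hess = np.block([[2 * np.eye(2), W.T],
                     [W, Hess_v]])                     # (2.8)
    return np.concatenate([grad_Lam, grad_v]), Hess

# Usage at a random point:
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5)); A = (M + M.T) / 2
M = rng.standard_normal((5, 5)); B = (M + M.T) / 2
v = rng.standard_normal(5); v /= np.linalg.norm(v)
g, H = riemannian_grad_hess(A, B, v @ A @ v, v @ B @ v, v)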

We define the stationary points of ({\rm OP}_{1}) as the set \{(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}\,;\ \operatorname{grad}\mathcal{L}(\Lambda,v)=0\}. If (\Lambda,v) is a local minimizer of ({\rm OP}_{1}), then it is a stationary point and \operatorname{Hess}\mathcal{L}(\Lambda,v)\succeq 0 holds on the tangent space T_{(\Lambda,v)}(\mathbb{R}^{2}\times\mathbb{S}^{n-1})=\mathbb{R}^{2}\times T_{v}\mathbb{S}^{n-1}. Conversely, if (\Lambda,v) is a stationary point such that \operatorname{Hess}\mathcal{L}(\Lambda,v)\succ 0 on \mathbb{R}^{2}\times T_{v}\mathbb{S}^{n-1}, then it is an isolated local minimizer of ({\rm OP}_{1}). We next give a characterization of the stationary points of ({\rm OP}_{1}) for commuting matrices. We denote by \Delta_{n} the n-simplex:

\Delta_{n}=\Big\{(p_{i})_{i}\in\mathbb{R}^{n}\,;\ p_{i}\in[0,1],\ \sum_{i=1}^{n}p_{i}=1\Big\}\,.
Proposition 2.1.

Given (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, let {Λi}i=1n\{\Lambda_{i}\}_{i=1}^{n} be its eigenvalue pairs (counting multiplicity) and {vi}i=1n𝕊n1\{v_{i}\}_{i=1}^{n}\subset\mathbb{S}^{n-1} be the associated common orthonormal eigenvectors. We define the subset of Δn\Delta_{n}:

\Delta_{\rm sta}:=\Big\{(p_{i})_{i}\in\Delta_{n}\,;\ \Lambda=\sum_{i=1}^{n}p_{i}\Lambda_{i}\,,\ \lVert\Lambda-\Lambda_{i}\rVert\ \text{is constant for}\ \{i\,;\ p_{i}\neq 0\}\Big\}\,. (2.12)

Then, the stationary points of ({\rm OP}_{1}) are given by

\Lambda=\sum_{i=1}^{n}p_{i}\Lambda_{i}\in{\rm conv}(\sigma(A,B))\,,\quad v=\sum_{i=1}^{n}c_{i}v_{i}\in\mathbb{S}^{n-1}\,, (2.13)

with (p_{i})_{i}\in\Delta_{\rm sta} and c_{i}=\pm\sqrt{p_{i}}.

Proof.

It is clear from (2.7) that grad(Λ,v)=0\operatorname{grad}\mathcal{L}(\Lambda,v)=0 is equivalent to Λ=(v,Av,v,Bv)\Lambda=(\langle v,Av\rangle,\langle v,Bv\rangle) and

((Aλ)2+(Bμ)2)v=ξv,((A-\lambda)^{2}+(B-\mu)^{2})v=\xi v\,, (2.14)

for some constant ξ0\xi\geq 0. We expand v=iciviv=\sum_{i}c_{i}v_{i} by the common eigenvectors {vi}i=1n\{v_{i}\}_{i=1}^{n} with the eigenvalue pairs {Λi}i=1n\{\Lambda_{i}\}_{i=1}^{n}, and find that the equation (2.14) reduces to

ΛiΛ2ci=ξci,fori=1,,n,\lVert\Lambda_{i}-\Lambda\rVert^{2}c_{i}=\xi c_{i}\,,\quad\text{for}\ i=1,\ldots,n\,, (2.15)

where Λ=(v,Av,v,Bv)=i|ci|2Λiconv(σ(A,B))\Lambda=(\langle v,Av\rangle,\langle v,Bv\rangle)=\sum_{i}|c_{i}|^{2}\Lambda_{i}\in{\rm conv}(\sigma(A,B)). Then, it readily follows that ΛiΛ2=ξ\lVert\Lambda_{i}-\Lambda\rVert^{2}=\xi for any ii with ci0c_{i}\neq 0. The proof is complete. ∎
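To illustrate Proposition 2.1, one can check numerically that an equal-weight combination of two eigenvalue pairs, i.e., the node (e_{i}+e_{j})/2 of \Delta_{\rm sta} discussed in the next paragraph, yields a stationary point. A small NumPy sketch (verifying that the gradient (2.7) vanishes up to round-off):

import numpy as np

rng = np.random.default_rng(4)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # common orthonormal eigenbasis
lam, mu = rng.standard_normal(n), rng.standard_normal(n)
A, B = Q @ np.diag(lam) @ Q.T, Q @ np.diag(mu) @ Q.T

# Stationary point with p = (1/2, 1/2, 0, ..., 0) in Delta_sta:
v = (Q[:, 0] + Q[:, 1]) / np.sqrt(2)
Lam = 0.5 * np.array([lam[0] + lam[1], mu[0] + mu[1]])

P = np.eye(n) - np.outer(v, v)
Ares, Bres = A - Lam[0] * np.eye(n), B - Lam[1] * np.eye(n)
grad_v = 2 * P @ (Ares @ Ares + Bres @ Bres) @ v
grad_Lam = np.array([2 * Lam[0] - 2 * v @ A @ v, 2 * Lam[1] - 2 * v @ B @ v])
print(np.linalg.norm(grad_Lam), np.linalg.norm(grad_v))   # both ~ 1e-15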

From Proposition 2.1 above, we see that for (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, the stationary points of (OP1{\rm OP}_{1}) are essentially characterized by the set Δsta\Delta_{\rm sta} in (2.12), which always includes the nodes eie_{i} and their averages (ei+ej)/2(e_{i}+e_{j})/2, while the full analytical characterization of Δsta\Delta_{\rm sta} would rely on the distribution of the eigenvalue pairs {Λi}\{\Lambda_{i}\} of (A,B)(A,B). The following result shows that in the commuting case, the local minimizers of (OP1{\rm OP}_{1}) are exactly given by eigenvalue pairs and the common eigenvectors.

Proposition 2.2.

For a given (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, (Λ,v)2×𝕊n1(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1} is a local minimizer of (𝑂𝑃1{\rm OP}_{1}) if and only if Λ\Lambda is an eigenvalue pair and vv is the associated common eigenvector.

Proof.

It suffices to prove the direction ()(\Rightarrow). Since (Λ,v)(\Lambda,v) is a local minimizer, we have grad(Λ,v)=0\operatorname{grad}\mathcal{L}(\Lambda,v)=0 and Hess(Λ,v)0\operatorname{Hess}\mathcal{L}(\Lambda,v)\succeq 0 on 2×Tv𝕊n1\mathbb{R}^{2}\times T_{v}\mathbb{S}^{n-1}. It follows that the Schur complement of 2I22I_{2} in Hess(Λ,v)\operatorname{Hess}\mathcal{L}(\Lambda,v) is also positive semidefinite, that is,

Hessv(Λ,v)21WWT0,onTv𝕊n1.\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)-2^{-1}WW^{T}\succeq 0\,,\ \text{on}\ T_{v}\mathbb{S}^{n-1}\,. (2.16)

Suppose that (Λ,v)(\Lambda,v) is not an eigenvalue pair and the common eigenvector of (A,B)(A,B). From (2.13), Λ=ipiΛi\Lambda=\sum_{i\in\mathcal{I}}p_{i}\Lambda_{i} is a convex combination of some non-identical {Λi}i\{\Lambda_{i}\}_{i\in\mathcal{I}} with pi0p_{i}\neq 0 for ii\in\mathcal{I} and v=iciviv=\sum_{i\in\mathcal{I}}c_{i}v_{i} with ci=±pic_{i}=\pm\sqrt{p_{i}} (otherwise, Λ=Λi\Lambda=\Lambda_{i} for some ii and vv is the common eigenvector). Moreover, there holds ΛΛi2=ξ\lVert\Lambda-\Lambda_{i}\rVert^{2}=\xi for ii\in\mathcal{I} and some constant ξ\xi. Then, for any q=ibiviTv𝕊n1q=\sum_{i\in\mathcal{I}}b_{i}v_{i}\in T_{v}\mathbb{S}^{n-1} with q=1\lVert q\rVert=1, we can compute, by (2.10) and (2.16),

\big\langle q,\big(\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)-2^{-1}WW^{T}\big)q\big\rangle
=\langle q,\nabla^{2}_{v}\mathcal{L}(\Lambda,v)q\rangle-\langle v,\nabla^{2}_{v}\mathcal{L}(\Lambda,v)v\rangle-8(\langle q,Av\rangle^{2}+\langle q,Bv\rangle^{2})
=2\xi-2\xi-8(\langle q,Av\rangle^{2}+\langle q,Bv\rangle^{2})\geq 0\,. (2.17)

Since \{\Lambda_{i}=(\lambda_{i},\mu_{i})\}_{i\in\mathcal{I}} are not all identical, either Av or Bv is not collinear with v. Thus, we can choose q\in T_{v}\mathbb{S}^{n-1} such that \langle q,Av\rangle^{2}+\langle q,Bv\rangle^{2}>0, which contradicts (2.17) and completes the proof. ∎

However, the following observation shows that any small perturbation of commuting symmetric matrices could eliminate all common eigenvectors.

Lemma 2.3.

The set M:={(A,B)S(n)×S(n);(A,B)has no common eigenvectors}{\rm M}:=\{(A,B)\in{\rm S}(n)\times{\rm S}(n)\,;\ (A,B)\ \text{has \emph{no} common eigenvectors}\} is an open dense subset of the space S(n)×S(n){\rm S}(n)\times{\rm S}(n).

Proof.

Let {λi}iA\{\lambda_{i}\}_{i\in\mathcal{I}_{A}} and {μi}iB\{\mu_{i}\}_{i\in\mathcal{I}_{B}} be the distinct eigenvalues of AA and BB with associated eigenspaces Ei,AE_{i,A} and Ei,BE_{i,B}, respectively. We define Si,A:=𝕊n1Ei,AS_{i,A}:=\mathbb{S}^{n-1}\cap E_{i,A} and Si,B:=𝕊n1Ei,BS_{i,B}:=\mathbb{S}^{n-1}\cap E_{i,B}, and see that both iASi,A\cup_{i\in\mathcal{I}_{A}}S_{i,A} and iBSi,B\cup_{i\in\mathcal{I}_{B}}S_{i,B} are closed subsets of 𝕊n1\mathbb{S}^{n-1}, and (A,B)(A,B) has no common eigenvectors if and only if the set (iASi,A)(iBSi,B)(\cup_{i\in\mathcal{I}_{A}}S_{i,A})\cap(\cup_{i\in\mathcal{I}_{B}}S_{i,B}) is empty. It follows that there exist open neighbourhoods of (iASi,A)(\cup_{i\in\mathcal{I}_{A}}S_{i,A}) and (iBSi,B)(\cup_{i\in\mathcal{I}_{B}}S_{i,B}) in 𝕊n1\mathbb{S}^{n-1} with empty intersection. Then, by the perturbation theory, we can conclude that the set M{\rm M} is open in S(n)×S(n){\rm S}(n)\times{\rm S}(n). We next show that M{\rm M} is also dense. It suffices to prove that for A,BS(n)A,B\in{\rm S}(n) having common eigenvectors and for small enough ε>0\varepsilon>0, we can find AεA_{\varepsilon}, BεS(n)B_{\varepsilon}\in{\rm S}(n) with AAεε\lVert A-A_{\varepsilon}\rVert\leq\varepsilon, BBεε\lVert B-B_{\varepsilon}\rVert\leq\varepsilon such that AεA_{\varepsilon}, BεB_{\varepsilon} have no common eigenvectors. For this, by perturbing the eigenvalues of AA and BB, without loss of generality, we assume that both AA and BB have only simple eigenvalues. We let Aε=AA_{\varepsilon}=A and next construct BεB_{\varepsilon}. Consider the spectral decompositions for AA and BB: Awi=λiwiAw_{i}=\lambda_{i}w_{i} and Bvi=μiviBv_{i}=\mu_{i}v_{i} for 1in1\leq i\leq n. Then, for a given ε>0\varepsilon>0, we choose an orthogonal matrix UεU_{\varepsilon} satisfying UεviviU_{\varepsilon}v_{i}\neq v_{i} for any ii and UεIε\lVert U_{\varepsilon}-I\rVert\leq\varepsilon, and define the matrix BεB_{\varepsilon} by BεUεV=UεVDB_{\varepsilon}U_{\varepsilon}V=U_{\varepsilon}VD, where V=[v1,,vn]V=[v_{1},\ldots,v_{n}] and D=diag(μ1,,μn)D={\rm diag}(\mu_{1},\ldots,\mu_{n}). When ε\varepsilon is small enough, the condition UεIε\lVert U_{\varepsilon}-I\rVert\leq\varepsilon implies Uεvivj,wkU_{\varepsilon}v_{i}\neq v_{j},w_{k} for all 1i,j,kn1\leq i,j,k\leq n. Thus, AεA_{\varepsilon} and BεB_{\varepsilon} have no common eigenvectors, and the proof is complete. ∎

Although an arbitrary pair of symmetric matrices may not have any common eigenvectors, the functional (Λ,v)\mathcal{L}(\Lambda,v) can still admit many local minima. Indeed, recalling the decomposition (2.1) for almost commuting matrices (A,B)(A,B), let EE_{*} be a common eigenspace (1.5) of (A,B)(A_{*},B_{*}) associated with some eigenvalue pair Λ:=(λ,μ)\Lambda_{*}:=(\lambda_{*},\mu_{*}):

E_{*}:=\{v\in\mathbb{R}^{n}\,;\ A_{*}v=\lambda_{*}v\,,\ B_{*}v=\mu_{*}v\}\,. (2.18)

We shall prove in Theorem 2.6 below that when [A,B]1\lVert[A,B]\rVert\ll 1, there always exists a local minimizer to \mathcal{L} in some neighborhood of (Λ,S)(\Lambda_{*},S_{*}) in 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1}, where S:=𝕊n1ES_{*}:=\mathbb{S}^{n-1}\cap E_{*}. For convenience, we denote by 𝒫\mathcal{P}_{*} and 𝒫\mathcal{P}_{*}^{\perp} the orthogonal projections to the subspace EE_{*} and its orthogonal complement EE_{*}^{\perp}, respectively. The following lemma is useful in the sequel.

Lemma 2.4.

Let the set S_{*} and the projections \mathcal{P}_{*} and \mathcal{P}_{*}^{\perp} be defined as above. For v\in\mathbb{S}^{n-1}, it holds that

\operatorname{dist}(v,S_{*})^{2}:=\min_{w\in S_{*}}\lVert v-w\rVert^{2}=(1-\lVert\mathcal{P}_{*}(v)\rVert)^{2}+\lVert\mathcal{P}_{*}^{\perp}(v)\rVert^{2}=2-2\lVert\mathcal{P}_{*}(v)\rVert\,, (2.19)

where the minimum value is attained at v_{*}=\mathcal{P}_{*}(v)/\lVert\mathcal{P}_{*}(v)\rVert. Similarly, let S_{*,\varepsilon} be the \varepsilon-neighborhood of the set S_{*}: S_{*,\varepsilon}:=\{v\in\mathbb{S}^{n-1}\,;\ \operatorname{dist}(v,S_{*})\leq\sqrt{\varepsilon}\}. Then, there holds, for v\in\mathbb{S}^{n-1},

\operatorname{dist}(v,S_{*,\varepsilon})^{2}=\min_{w\in S_{*,\varepsilon}}\lVert v-w\rVert^{2}=\lVert v-\mathcal{P}_{\varepsilon}(v)\rVert^{2}\,, (2.20)

where \mathcal{P}_{\varepsilon}(v) is the unique minimizer to \operatorname{dist}(v,S_{*,\varepsilon}), given by

\mathcal{P}_{\varepsilon}(v)=\theta\frac{\mathcal{P}_{*}(v)}{\lVert\mathcal{P}_{*}(v)\rVert}+\sqrt{1-\theta^{2}}\frac{\mathcal{P}^{\perp}_{*}(v)}{\lVert\mathcal{P}^{\perp}_{*}(v)\rVert}\quad\text{with}\quad\theta=\max\left\{1-\frac{\varepsilon}{2},\lVert\mathcal{P}_{*}(v)\rVert\right\}. (2.21)
Proof.

The first claim (2.19) directly follows from the definition, so we only show the second one (2.20). For this, we note from (2.19) that vS,εv\in S_{*,\varepsilon} is equivalent to 𝒫(v)1ε/2\lVert\mathcal{P}_{*}(v)\rVert\geq 1-\varepsilon/2. Hence the statements (2.20) and (2.21) clearly hold for vS,εv\in S_{*,\varepsilon}. Then we consider the case where v𝕊n1\S,εv\in\mathbb{S}^{n-1}\backslash S_{*,\varepsilon}, which gives θ=1ε/2\theta=1-\varepsilon/2. It is easy to see

vw2=𝒫(vw)2+𝒫(vw)2|𝒫(v)𝒫(w)|2+|𝒫(v)𝒫(w)|2.\displaystyle\lVert v-w\rVert^{2}=\lVert\mathcal{P}_{*}(v-w)\rVert^{2}+\lVert\mathcal{P}^{\perp}_{*}(v-w)\rVert^{2}\geq\big{|}\lVert\mathcal{P}_{*}(v)\rVert-\lVert\mathcal{P}_{*}(w)\rVert\big{|}^{2}+\big{|}\lVert\mathcal{P}^{\perp}_{*}(v)\rVert-\lVert\mathcal{P}^{\perp}_{*}(w)\rVert\big{|}^{2}. (2.22)

By elementary analysis, we have that the function |𝒫(v)x|2+|𝒫(v)1x2|2\big{|}\lVert\mathcal{P}_{*}(v)\rVert-x\big{|}^{2}+\big{|}\lVert\mathcal{P}^{\perp}_{*}(v)\rVert-\sqrt{1-x^{2}}\big{|}^{2} is monotone increasing in x[1ε/2,1]x\in[1-\varepsilon/2,1] for v𝕊n1\S,εv\in\mathbb{S}^{n-1}\backslash S_{*,\varepsilon}. Therefore, it follows from (2.21) and (2.22) that

dist(v,S,ε)2|𝒫(v)θ|2+|𝒫(v)1θ2|2=v𝒫ε(v)2.\operatorname{dist}(v,S_{*,\varepsilon})^{2}\geq\big{|}\lVert\mathcal{P}_{*}(v)\rVert-\theta\big{|}^{2}+\big{|}\lVert\mathcal{P}^{\perp}_{*}(v)\rVert-\sqrt{1-\theta^{2}}\big{|}^{2}=\lVert v-\mathcal{P}_{\varepsilon}(v)\rVert^{2}\,.

The proof is complete. ∎
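The explicit projection (2.21) is what makes the relaxed constraint in ({\rm OP}_{2}) convenient in practice. A minimal NumPy sketch (the function name and the orthonormal-basis input W are our illustrative choices):

import numpy as np

def project_S_eps(v, W, eps):
    # Projection P_eps(v) of a unit vector v onto the eps-neighborhood
    # S_{*,eps} of S_* = E_* \cap S^{n-1}, following (2.21).
    # W: n x k matrix whose orthonormal columns span the subspace E_*.
    p_par = W @ (W.T @ v)                   # P_*(v)
    p_perp = v - p_par                      # P_*^perp(v)
    a, b = np.linalg.norm(p_par), np.linalg.norm(p_perp)
    theta = min(max(1.0 - eps / 2.0, a), 1.0)
    if b < 1e-15:                           # v already (numerically) in E_*
        return p_par / a
    return theta * (p_par / a) + np.sqrt(1.0 - theta**2) * (p_perp / b)

# Example: distance from v to S_{*,eps}, cf. (2.20).
rng = np.random.default_rng(5)
W, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # orthonormal basis of E_*
v = rng.standard_normal(8); v /= np.linalg.norm(v)
Pv = project_S_eps(v, W, eps=1e-2)
print(np.linalg.norm(v - Pv))                      # dist(v, S_{*,eps})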

Remark 2.5.

One can also define the distance function dist~(v,E):=infwEd(v,w)\widetilde{{\rm dist}}(v,E):=\inf_{w\in E}d(v,w) for E𝕊n1E\subset\mathbb{S}^{n-1}, by using the geodesic distance d(v,w):=arccosv,wd(v,w):=\arccos\langle v,w\rangle on 𝕊n1\mathbb{S}^{n-1}, which has been investigated in [26, 27]. It is easy to see that dist(v,E)2=22cosdist~(v,E)\operatorname{dist}(v,E)^{2}=2-2\cos\widetilde{{\rm dist}}(v,E) if 0dist~(v,E)π0\leq\widetilde{{\rm dist}}(v,E)\leq\pi, which implies that dist\operatorname{dist} and dist~\widetilde{{\rm dist}} are equivalent, and 𝒫ε(v)\mathcal{P}_{\varepsilon}(v) in (2.21) is also the unique minimizer to dist~(v,S,ε)\widetilde{\operatorname{dist}}(v,S_{*,\varepsilon}). However, the interested set S,εS_{*,\varepsilon} is not geodesically convex by [27, Proposition 2]. Therefore, the results in [26, 27] do not directly apply to our case and we choose to focus on the function dist(,)\operatorname{dist}(\cdot,\cdot) in (2.20) in this work.

Theorem 2.6.

Suppose that (A,B) are almost commuting matrices with the decomposition (2.1), and denote by \delta(A_{*},B_{*}) the eigenvalue gap (1.6) of the commuting matrices (A_{*},B_{*}) in (2.1). Also, let the eigenvalue pair \Lambda_{*}=(\lambda_{*},\mu_{*}) with the space E_{*} and the set S_{*} be defined as before Lemma 2.4. We assume

\lVert\Delta A\rVert+\lVert\Delta B\rVert\leq\frac{1}{8}\delta(A_{*},B_{*})\,,

and define the constants

\eta:=\frac{15}{4}(\lVert\Delta A\rVert+\lVert\Delta B\rVert)\quad\text{and}\quad\varepsilon:=\frac{128}{3}\left(\frac{\lVert\Delta A\rVert+\lVert\Delta B\rVert}{\delta(A_{*},B_{*})}\right)^{2}\,. (2.23)

Then, ({\rm OP}_{1}) admits a local minimizer in the domain

D_{\eta,\varepsilon}:=\big\{(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}\,;\ \lVert\Lambda-\Lambda_{*}\rVert<\eta,\ \operatorname{dist}(v,S_{*})<\sqrt{\varepsilon}\big\}\,. (2.24)
Proof.

By Lemma 2.4, we write, for v𝕊n1v\in\mathbb{S}^{n-1},

v=cv+j=1mcjvj,|c|2+j=1m|cj|2=1,\displaystyle v=c_{*}v_{*}+\sum_{j=1}^{m}c_{j}v_{j}\,,\quad|c_{*}|^{2}+\sum_{j=1}^{m}|c_{j}|^{2}=1\,, (2.25)

and then have

dist(v,S)2:=|c1|2+i=1m|ci|2=22c,\displaystyle\operatorname{dist}(v,S_{*})^{2}:=|c_{*}-1|^{2}+\sum_{i=1}^{m}|c_{i}|^{2}=2-2c_{*}\,, (2.26)

where c=𝒫(v)c_{*}=\lVert\mathcal{P}_{*}(v)\rVert, v=𝒫(v)/cSv_{*}=\mathcal{P}_{*}(v)/c_{*}\in S_{*}, and {vj}j=1m\{v_{j}\}_{j=1}^{m} are common orthonormal eigenvectors of (A,B)(A_{*},B_{*}) spanning EE_{*}^{\perp}. Note that for any ε\varepsilon and η\eta, the set Dη,ε¯\overline{D_{\eta,\varepsilon}} is compact, which implies that infDη,ε¯(Λ,v)\inf_{\overline{D_{\eta,\varepsilon}}}\mathcal{L}(\Lambda,v) admits a minimizer. We shall prove that for small enough ΔA+ΔB\lVert\Delta A\rVert+\lVert\Delta B\rVert, there exists ε\varepsilon and η\eta, depending on ΔA+ΔB\lVert\Delta A\rVert+\lVert\Delta B\rVert and satisfying ε,η0\varepsilon,\eta\to 0 as ΔA+ΔB0\lVert\Delta A\rVert+\lVert\Delta B\rVert\to 0, such that the minimizer of infDη,ε¯(Λ,v)\inf_{\overline{D_{\eta,\varepsilon}}}\mathcal{L}(\Lambda,v) is in the open set Dη,εD_{\eta,\varepsilon} and hence also a local minimizer of (OP1{\rm OP}_{1}). For this, we define a constant

L0:=\displaystyle L_{0}:= infvS(Aλ)v2+(Bμ)v2\displaystyle\inf_{v\in S_{*}}\lVert(A-\lambda_{*})v\rVert^{2}+\lVert(B-\mu_{*})v\rVert^{2}
=\displaystyle= infvSΔAv2+ΔBv2ΔA2+ΔB2,\displaystyle\inf_{v\in S_{*}}\lVert\Delta Av\rVert^{2}+\lVert\Delta Bv\rVert^{2}\leq\lVert\Delta A\rVert^{2}+\lVert\Delta B\rVert^{2}\,, (2.27)

and, without loss of generality, assume L0>0L_{0}>0. Indeed, if L0=0L_{0}=0 holds with a minimizer vSv\in S_{*}, then we readily have (Λ,v)=0\mathcal{L}(\Lambda_{*},v)=0, and the point (Λ,v)(\Lambda_{*},v) is the desired local minimizer of (OP1{\rm OP}_{1}) in Dη,εD_{\eta,\varepsilon}.

We next construct constants ε\varepsilon and η\eta such that

infDη,ε¯\Dη,ε(Λ,v)>ΔA2+ΔB2,\displaystyle\inf_{\overline{D_{\eta,\varepsilon}}\backslash D_{\eta,\varepsilon}}\mathcal{L}(\Lambda,v)>\lVert\Delta A\rVert^{2}+\lVert\Delta B\rVert^{2}\,, (2.28)

which, by (2.27), yields \inf_{\overline{D_{\eta,\varepsilon}}\backslash D_{\eta,\varepsilon}}\mathcal{L}(\Lambda,v)>L_{0}. It follows that \arg\min_{\overline{D_{\eta,\varepsilon}}}\mathcal{L}(\Lambda,v)\in D_{\eta,\varepsilon}, which completes the proof. We first have \overline{D_{\eta,\varepsilon}}\backslash D_{\eta,\varepsilon}\subset D_{1}\cup D_{2} with

D1:={(Λ,v)2×𝕊n1;ΛΛ=η,dist(v,S)ε},\displaystyle D_{1}:=\big{\{}(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}\,;\ \lVert\Lambda-\Lambda_{*}\rVert=\eta\,,\ \operatorname{dist}(v,S_{*})\leq\sqrt{\varepsilon}\big{\}}\,,
D2:={(Λ,v)2×𝕊n1;ΛΛη,dist(v,S)=ε}.\displaystyle D_{2}:=\big{\{}(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}\,;\ \lVert\Lambda-\Lambda_{*}\rVert\leq\eta\,,\ \operatorname{dist}(v,S_{*})=\sqrt{\varepsilon}\big{\}}\,.

A simple estimate implies

2(Λ,v)\displaystyle\sqrt{2\mathcal{L}(\Lambda,v)} (Aλ)v+(Bμ)vΔAΔB.\displaystyle\geq\lVert(A_{*}-\lambda)v\rVert+\lVert(B_{*}-\mu)v\rVert-\lVert\Delta A\rVert-\lVert\Delta B\rVert\,. (2.29)

We now estimate (Λ,v)\mathcal{L}(\Lambda,v) on D1D_{1}. By (2.25) and (2.26), there holds c1ε/2c_{*}\geq 1-\varepsilon/2 for (Λ,v)D1(\Lambda,v)\in D_{1}, which gives

(Aλ)v+(Bμ)v|c|ΛΛ(1ε2)η,\displaystyle\lVert(A_{*}-\lambda)v\rVert+\lVert(B_{*}-\mu)v\rVert\geq|c_{*}|\lVert\Lambda-\Lambda_{*}\rVert\geq\left(1-\frac{\varepsilon}{2}\right)\eta\,,

where we have also used |λλ|+|μμ|ΛΛ|\lambda-\lambda_{*}|+|\mu-\mu_{*}|\geq\lVert\Lambda-\Lambda_{*}\rVert. If we choose ε\varepsilon and η\eta satisfying

(1ε2)η>(1+2)(ΔA+ΔB),\displaystyle\left(1-\frac{\varepsilon}{2}\right)\eta>(1+\sqrt{2})(\lVert\Delta A\rVert+\lVert\Delta B\rVert)\,, (2.30)

it follows from (2.29) that

infD1(Λ,v)>ΔA2+ΔB2.\inf_{D_{1}}\mathcal{L}(\Lambda,v)>\lVert\Delta A\rVert^{2}+\lVert\Delta B\rVert^{2}\,. (2.31)

We next estimate (Λ,v)\mathcal{L}(\Lambda,v) on D2D_{2}. Let Λj=(λj,μj)\Lambda_{j}=(\lambda_{j},\mu_{j}) be the eigenvalue pair of (A,B)(A_{*},B_{*}) associated with the common eigenvector vjv_{j}. Similarly by (2.25), we can derive

(Aλ)v+(Bμ)v\displaystyle\lVert(A_{*}-\lambda)v\rVert+\lVert(B_{*}-\mu)v\rVert j=1m|cj(λjλ)|2+j=1m|cj(μjμ)|2\displaystyle\geq\sqrt{\sum_{j=1}^{m}|c_{j}(\lambda_{j}-\lambda)|^{2}}+\sqrt{\sum_{j=1}^{m}|c_{j}(\mu_{j}-\mu)|^{2}}
j=1m|cj|2ΛjΛ2,\displaystyle\geq\sqrt{\sum_{j=1}^{m}|c_{j}|^{2}\lVert\Lambda_{j}-\Lambda\rVert^{2}}\,, (2.32)

thanks to the elementary inequality x+yx+y\sqrt{x}+\sqrt{y}\geq\sqrt{x+y} for x,y0x,y\geq 0. For (Λ,v)D2(\Lambda,v)\in D_{2} and small enough η\eta, it is clear that j=1m|cj|2=1(1ε/2)2\sum_{j=1}^{m}|c_{j}|^{2}=1-(1-\varepsilon/2)^{2} and

ΛΛjΛjΛΛΛδ(A,B)η.\lVert\Lambda-\Lambda_{j}\rVert\geq\lVert\Lambda_{j}-\Lambda_{*}\rVert-\lVert\Lambda-\Lambda_{*}\rVert\geq\delta(A_{*},B_{*})-\eta\,.

Then, it follows from (2.32) that

(Aλ)v+(Bμ)vj=1m|cj|2ΛjΛ2(δ(A,B)η)1(1ε2)2.\displaystyle\lVert(A_{*}-\lambda)v\rVert+\lVert(B_{*}-\mu)v\rVert\geq\sqrt{\sum_{j=1}^{m}|c_{j}|^{2}\lVert\Lambda_{j}-\Lambda\rVert^{2}}\geq(\delta(A_{*},B_{*})-\eta)\sqrt{1-\left(1-\frac{\varepsilon}{2}\right)^{2}}\,.

If we choose ε\varepsilon and η\eta such that

(δ(A,B)η)1(1ε2)2>(1+2)(ΔA+ΔB),\displaystyle(\delta(A_{*},B_{*})-\eta)\sqrt{1-\left(1-\frac{\varepsilon}{2}\right)^{2}}>(1+\sqrt{2})(\lVert\Delta A\rVert+\lVert\Delta B\rVert)\,, (2.33)

then recalling (2.29), we have

infD2(Λ,v)>ΔA2+ΔB2.\inf_{D_{2}}\mathcal{L}(\Lambda,v)>\lVert\Delta A\rVert^{2}+\lVert\Delta B\rVert^{2}\,. (2.34)

One can check that in the regime \lVert\Delta A\rVert+\lVert\Delta B\rVert\leq\delta(A_{*},B_{*})/8, the choice (2.23) of the parameters (\varepsilon,\eta) satisfies the conditions (2.30) and (2.33). Then, the claim (2.28) follows from (2.31) and (2.34), as desired. ∎

Remark 2.7.

The prefactors in Theorem 2.6 (e.g., 18\frac{1}{8} and 154\frac{15}{4}) are chosen for convenience and can be improved, but the orders in ΔA+ΔB\lVert\Delta A\rVert+\lVert\Delta B\rVert and δ(A,B)\delta(A_{*},B_{*}) are optimal by the standard perturbation theory.

Remark 2.8.

In the extreme case where σ(A,B)\sigma(A_{*},B_{*}) is a singleton, equivalently, (A,B)=(aI,bI)(A_{*},B_{*})=(aI,bI) for some a,ba,b\in\mathbb{R}, we have S=𝕊n1S_{*}=\mathbb{S}^{n-1} and δ(A,B)\delta(A_{*},B_{*}) can be set as ++\infty by convention (note that (1.6) is not well-defined in this case). Then, Theorem 2.6 is a trivial fact that (OP1{\rm OP}_{1}) has a local minimum (Λ,v)2×𝕊n1(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1} with Λ\Lambda near the point (a,b)2(a,b)\in\mathbb{R}^{2}. Moreover, we can see that in this case, (OP1{\rm OP}_{1}) reduces to min(Λ,v)2×𝕊n1(ΔAλ)v2+(ΔBμ)v2\min_{(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}}\lVert(\Delta A-\lambda)v\rVert^{2}+\lVert(\Delta B-\mu)v\rVert^{2} for ΔA,ΔBS(n)\Delta A,\Delta B\in{\rm S}(n), which is hard to analyze due to its generality (except n=2n=2, in which case it is essentially a one-dimensional optimization problem; see also Lemma 3.2 below). It is possible to carry out a landscape analysis as in [76], but one may have to consider special ΔA\Delta A and ΔB\Delta B such as rank-one and diagonal matrices.

We next discuss generic properties of local minimizers of (OP1{\rm OP}_{1}) for almost commuting matrices. Given (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, let {Λl}l=1L\{\Lambda_{l}\}_{l=1}^{L} (LnL\leq n) be its distinct eigenvalue pairs and {EA,BΛl}l=1L\{E^{\Lambda_{l}}_{A,B}\}_{l=1}^{L} be the associated common eigenspaces (1.5). Theorem 2.6 guarantees that (OP1{\rm OP}_{1}) for the perturbed pair (A+ΔA,B+ΔB)(A+\Delta A,B+\Delta B) with ΔA+ΔB1\lVert\Delta A\rVert+\lVert\Delta B\rVert\ll 1 admits a local minimizer in the neighborhood of each closed set (Λl,Sl)(\Lambda_{l},S_{l}), where Sl:=EA,BΛl𝕊n1S_{l}:=E_{A,B}^{\Lambda_{l}}\cap\mathbb{S}^{n-1}. Thus, if L=nL=n holds, equivalently, all of the common eigenspaces are one-dimensional, we readily have that (Λ,v)\mathcal{L}(\Lambda,v) has at least nn local minima, and the associated minimizing vectors vjv_{j} are almost orthogonal to each other in the sense that |vi,vj|1|\langle v_{i},v_{j}\rangle|\ll 1 for iji\neq j. We shall see that generically, it is indeed the case. Note that the set of commuting matrices FcomS(n)×S(n){\rm F}_{\rm com}\subset{\rm S}(n)\times{\rm S}(n) is a complete metric space.

Lemma 2.9.

The subset F0:={(A,B)Fcom;dim(EA,BΛl)=1,Λlσ(A,B)}{\rm F}_{0}:=\{(A,B)\in{\rm F}_{\rm com}\,;\ \dim(E_{A,B}^{\Lambda_{l}})=1\,,\ \Lambda_{l}\in\sigma(A,B)\} is open dense in Fcom{\rm F}_{\rm com}.

Proof.

To show that F0{\rm F}_{0} is dense in Fcom{\rm F}_{\rm com}, it suffices to note that for (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, we can always perturb the eigenvalues of AA such that the perturbed matrix A~\widetilde{A} has only simple eigenvalues, which implies (A~,B)F0(\widetilde{A},B)\in{\rm F}_{0}, while the claim that F0{\rm F}_{0} is open in Fcom{\rm F}_{\rm com} follows from the standard perturbation result (cf. Lemma 4.3). ∎

The above lemma is an analog of Lemma 2.3, which gives the genericity of dim(EA,BΛl)=1\dim(E_{A,B}^{\Lambda_{l}})=1 in the topological sense. It would also be interesting to interpret this property from a probabilistic perspective. We next define a probability measure on the set Fcom{\rm F}_{\rm com} of commuting matrices and show that the event F0{\rm F}_{0} happens almost surely. Note that Fcom{\rm F}_{\rm com} can be parameterized as

\Phi:(D_{1},D_{2},U)\in{\rm diag}_{\leq}(n)\times{\rm diag}_{\leq}(n)\times{\rm O}(n)\longrightarrow(UD_{1}U^{T},UD_{2}U^{T})\in{\rm F}_{\rm com}.

Our probabilistic model is largely motivated by GOE in random matrix theory. We recall that XX is a GOE if diagonal elements XiiN(0,1)X_{ii}\sim N(0,1) for 1in1\leq i\leq n are i.i.d. Gaussian random variables and off-diagonal elements {Xij}1i<jn\{X_{ij}\}_{1\leq i<j\leq n} are i.i.d. Gaussian N(0,12)N(0,\frac{1}{2}) that are independent of XiiX_{ii}. We let UU be uniformly distributed on O(n){\rm O}(n) and Didiag(n)D_{i}\in{\rm diag}_{\leq}(n), i=1,2i=1,2, be random diagonal matrices with diagonal entries independently sampled from the following probability density on the set {x1x2xn}\{x_{1}\leq x_{2}\leq\cdots\leq x_{n}\}:

\varphi(x_{1},\ldots,x_{n})=c_{n}e^{-\frac{1}{2}(x_{1}^{2}+\cdots+x_{n}^{2})}\prod_{1\leq i<j\leq n}(x_{j}-x_{i})\,,

where cnc_{n} is the normalization constant. Equivalently, we define the product probability measure:

\mathbb{P}=\varphi(\lambda_{1},\ldots,\lambda_{n})\,d\lambda_{1}\cdots d\lambda_{n}\times\varphi(\mu_{1},\ldots,\mu_{n})\,d\mu_{1}\cdots d\mu_{n}\times\mu_{\rm Haar}\,,

on diag(n)×diag(n)×O(n){\rm diag}_{\leq}(n)\times{\rm diag}_{\leq}(n)\times{\rm O}(n), where μHaar\mu_{\rm Haar} denotes the Haar measure on the orthogonal group O(n){\rm O}(n). Clearly, the map Φ\Phi is smooth and thus the pushforward measure Fcom:=Φ1\mathbb{P}_{{\rm F}_{\rm com}}:=\mathbb{P}\circ\Phi^{-1} on Fcom{\rm F}_{\rm com} is well-defined. In what follows, we limit our discussion to the case where Fcom{\rm F}_{\rm com} is equipped with the probability Fcom\mathbb{P}_{{\rm F}_{\rm com}}. One may have noted that φ\varphi is nothing else than the joint distribution of eigenvalues of GOE [5]. Thus, the pair (A,B)Fcom(A,B)\in{\rm F}_{\rm com} has the same marginal distribution as GOE, i.e., AGOEA\sim{\rm GOE}, BGOEB\sim{\rm GOE}. To proceed, we recall a basic result that may be due to Wigner and von Neumann [55]; see also [65, 67].

Lemma 2.10.

The set of symmetric matrices with degenerate eigenvalues is a surface of codimension 2 in the n(n+1)/2-dimensional real vector space {\rm S}(n). Thus, given an absolutely continuous probability measure on {\rm S}(n) with respect to the Lebesgue measure, a matrix X\in{\rm S}(n) has only simple eigenvalues almost surely.

We immediately observe from Lemma 2.10 that

\mathbb{P}_{{\rm F}_{\rm com}}({\rm F}_{\rm com}\backslash{\rm F}_{0})=\mathbb{P}_{{\rm F}_{\rm com}}(\Lambda\in\sigma(A,B)\ \text{is degenerate})\leq\mathbb{P}_{{\rm GOE}}(X\in{\rm S}(n)\ \text{has a degenerate eigenvalue})=0\,,

and the following corollary holds.

Corollary 2.11.

For a pair of random commuting matrices drawn from the probability Fcom\mathbb{P}_{{\rm F}_{\rm com}}, almost surely we have that all its common eigenspaces are one-dimensional.
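The probability model above is straightforward to sample: \varphi is the joint eigenvalue density of the GOE, so sorted eigenvalues of independent GOE draws provide D_{1},D_{2}, and a Haar-distributed orthogonal matrix is obtained from the QR factorization of a Gaussian matrix (with the usual column-sign correction). A minimal NumPy sketch:

import numpy as np

def sample_goe(n, rng):
    # GOE draw: X_ii ~ N(0,1), X_ij ~ N(0,1/2) for i < j, symmetric.
    G = rng.standard_normal((n, n))
    return (G + G.T) / 2.0

def sample_haar_orthogonal(n, rng):
    # Haar-distributed U in O(n) via QR of a Gaussian matrix.
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))       # fix column signs for the exact Haar law

def sample_commuting_pair(n, rng):
    # Sample (A_*, B_*) from P_{F_com} as described in the text.
    U = sample_haar_orthogonal(n, rng)
    D1 = np.sort(np.linalg.eigvalsh(sample_goe(n, rng)))   # eigenvalue law phi
    D2 = np.sort(np.linalg.eigvalsh(sample_goe(n, rng)))
    return U @ np.diag(D1) @ U.T, U @ np.diag(D2) @ U.T

rng = np.random.default_rng(6)
A_star, B_star = sample_commuting_pair(20, rng)
print(np.linalg.norm(A_star @ B_star - B_star @ A_star))   # ~ 1e-14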

In order to combine these observations with Theorem 2.6, we recall from [56, Corollary 2.2] a tail bound of the spectral gap for Wigner matrices (i.e., off-diagonal entries are i.i.d. sub-Gaussian random variables with mean 0 and variance 11 and diagonal entries are independent sub-Gaussians with mean 0 and variances bounded by n1o(1)n^{1-o(1)}), which include GOE as a subclass.

Lemma 2.12.

For real symmetric Wigner matrices XnX_{n} with sub-Gaussian entries, there holds

\min_{1\leq i\leq n-1}(\lambda_{i+1}-\lambda_{i})\geq n^{-\frac{3}{2}-o(1)}\,,

with probability 1-o(1), where \{\lambda_{i}\}_{i=1}^{n} are the eigenvalues of X_{n} (counting multiplicity) in increasing order.

Applying Lemma 2.12 to the GOE and invoking Theorem 2.6, we conclude the following result, which establishes the desired claim that, generically, \mathcal{L}(\Lambda,v) has almost orthogonal local minimizing vectors v_{i}.

Theorem 2.13.

Let A,BS(n)A,B\in{\rm S}(n) be almost commuting matrices with decomposition (2.1), where (A,B)Fcom(A_{*},B_{*})\in{\rm F}_{\rm com} is sampled from Fcom\mathbb{P}_{{\rm F}_{\rm com}}. Then, for ΔA+ΔBn3/2o(1)/8\lVert\Delta A\rVert+\lVert\Delta B\rVert\leq n^{-3/2-o(1)}/8, with probability 1o(1)1-o(1), the function (Λ,v)\mathcal{L}(\Lambda,v) admits nn local minima on 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1} with associated minimizing vectors {vj}j=1n\{v_{j}\}_{j=1}^{n} satisfying

\displaystyle|\langle v_{i},v_{j}\rangle|=O\left(\frac{\lVert\Delta A\rVert+\lVert\Delta B\rVert}{n^{-3/2-o(1)}}\right)\quad\text{for}\ i\neq j\,.

3 Vector-wise joint diagonalization algorithm

In this section, we will provide a detailed description of the VJD algorithm for solving (OP). Following the ideas presented in the introduction, we first summarize its key steps in Algorithm 2 below.

Input: almost commuting matrices A,B\in{\rm S}(n)
Output: orthogonal matrix U\in{\rm O}(n); diagonal matrices D_{1},D_{2}\in\mathrm{diag}(n)
1. set (\lambda^{(1)},\mu^{(1)},v^{(1)}) to the minimizer of ({\rm OP}_{1})
2. V=v^{(1)}
3. w^{(1)}=v^{(1)}
4. for j=2,\ldots,n do
5.    set (\lambda^{(j)},\mu^{(j)},v^{(j)}) to the minimizer of ({\rm OP}_{2}) with inputs \{w^{(i)}\}_{i=1}^{j-1}
6.    V=[V,v^{(j)}]
7.    compute w^{(j)} by orthogonalizing \{w^{(1)},\ldots,w^{(j-1)},v^{(j)}\} via the Gram–Schmidt process
8. end for
9. compute the SVD V=U_{1}\Sigma U_{2} for solving (1.4)
10. U=U_{1}U_{2}, D_{1}={\rm diag}(\lambda^{(1)},\ldots,\lambda^{(n)}), D_{2}={\rm diag}(\mu^{(1)},\ldots,\mu^{(n)})
Algorithm 2 VJD algorithm for (OP)
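The outer loop of Algorithm 2 translates almost line by line into Python. The sketch below is only illustrative: solve_subproblem is a hypothetical stand-in for Algorithm 3 that returns an approximate minimizer (\lambda,\mu,v) of ({\rm OP}_{1}) or ({\rm OP}_{2}) given the previously collected orthonormal vectors.

```python
import numpy as np

def vjd(A, B, solve_subproblem):
    """Sketch of Algorithm 2.  `solve_subproblem(A, B, W)` is assumed to return
    (lam, mu, v) approximately minimizing (OP1) when W has no columns and
    (OP2) with inputs w^(1), ..., w^(j-1) (the columns of W) otherwise."""
    n = A.shape[0]
    V = np.zeros((n, n))
    W = np.zeros((n, 0))                  # orthonormalized directions w^(i)
    lams, mus = np.zeros(n), np.zeros(n)
    for j in range(n):
        lam, mu, v = solve_subproblem(A, B, W)
        lams[j], mus[j], V[:, j] = lam, mu, v
        w = v - W @ (W.T @ v)             # Gram-Schmidt step against w^(1..j-1)
        W = np.column_stack([W, w / np.linalg.norm(w)])
    U1, _, U2 = np.linalg.svd(V)          # nearest orthogonal matrix, cf. (1.4)
    U = U1 @ U2
    return U, np.diag(lams), np.diag(mus)
```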
Remark 3.1.

The introduction of w^{(j)} is conceptually unnecessary; it mainly serves to implement the projection \mathcal{P}_{\varepsilon} (2.21) associated with {\rm span}\{v^{(1)},\ldots,v^{(j)}\}, which is needed for solving ({\rm OP}_{2}); see Algorithm 3.

The rest of this section is devoted to designing algorithms for the optimization problems ({\rm OP}_{1}) and ({\rm OP}_{2}). Note that the cost functional \mathcal{L}(\Lambda,v) is quadratic and convex in each of the variables \Lambda\in\mathbb{R}^{2} and v\in\mathbb{R}^{n} separately, which motivates us to consider an alternating descent framework, namely, successively updating the variables \Lambda and v to minimize the cost \mathcal{L}. In particular, we have

argminΛ2(Λ,v)=(v,Av,v,Bv),\displaystyle\arg\min_{\Lambda\in\mathbb{R}^{2}}\mathcal{L}(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle)\,, (3.1)

and there holds, for any v𝕊n1v\in\mathbb{S}^{n-1} and Λ=(λ,μ)2\Lambda=(\lambda,\mu)\in\mathbb{R}^{2},

~(v)(Λ,v),\displaystyle\widetilde{\mathcal{L}}(v)\leq\mathcal{L}(\Lambda,v)\,, (3.2)

with

~(v):=(v,Av,v,Bv,v)=v,(A2+B2)vv,Av2v,Bv2,\widetilde{\mathcal{L}}(v):=\mathcal{L}(\langle v,Av\rangle,\langle v,Bv\rangle,v)=\langle v,(A^{2}+B^{2})v\rangle-\langle v,Av\rangle^{2}-\langle v,Bv\rangle^{2}\,, (3.3)

which is a fourth-order polynomial in vv. Thus, we can update Λ\Lambda explicitly via (3.1). To update the vector v𝕊n1v\in\mathbb{S}^{n-1}, we consider Riemannian Newton’s method with a line search, which takes advantage of the Hessian information of \mathcal{L} and enables a fast convergence. More precisely, in the kk-th step, given Λk=(vk,Avk,vk,Bvk)\Lambda_{k}=(\langle v_{k},Av_{k}\rangle,\langle v_{k},Bv_{k}\rangle) and vk𝕊n1v_{k}\in\mathbb{S}^{n-1}, we compute the search direction skTvk𝕊n1s_{k}\in T_{v_{k}}\mathbb{S}^{n-1} by

Hessv(Λk,vk)[sk]=gradv(Λk,vk),-\operatorname{Hess}_{v}{\mathcal{L}(\Lambda_{k},v_{k})}[s_{k}]=\operatorname{grad}_{v}{\mathcal{L}(\Lambda_{k},v_{k})}\,, (3.4)

and then update v_{k+1}=R_{v_{k}}(\alpha_{k}s_{k}) and \Lambda_{k+1}=(\langle v_{k+1},Av_{k+1}\rangle,\langle v_{k+1},Bv_{k+1}\rangle), where the stepsize \alpha_{k} is determined by a backtracking line search (cf. (3.15) below). However, in our numerical experiments, such an alternating minimization may converge slowly and become less efficient as the matrix size n increases. We also note that \operatorname{Hess}_{v}\mathcal{L} may have negative eigenvalues, so that s_{k} in (3.4) is not necessarily a descent direction of \mathcal{L}(\Lambda,\cdot), which can make the algorithm unstable. Our practically efficient algorithm is motivated by the following lemmas.

Lemma 3.2.

Given A,BS(n)A,B\in{\rm S}(n), let functions (Λ,v)\mathcal{L}(\Lambda,v) on 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1} and ~(v)\widetilde{\mathcal{L}}(v) on 𝕊n1\mathbb{S}^{n-1} be defined in (𝑂𝑃1{\rm OP}_{1}) and (3.3), respectively. Then, (Λ,v)(\Lambda_{*},v_{*}) is a local minimum of \mathcal{L} if and only if Λ=(v,Av,v,Bv)\Lambda_{*}=(\langle v_{*},Av_{*}\rangle,\langle v_{*},Bv_{*}\rangle) and vv_{*} is a local minimum of ~\widetilde{\mathcal{L}}. Moreover, we have

grad~(v)=gradv(Λ,v)|(Λ,v)=(v,Av,v,Bv,v),\displaystyle\operatorname{grad}\widetilde{\mathcal{L}}(v)={\color[rgb]{0,0,0}\operatorname{grad}_{v}\mathcal{L}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}}\,, (3.5)

and Hess~(v)\operatorname{Hess}\widetilde{\mathcal{L}}(v) is the Schur complement of the block 2I22I_{2} of the Hessian Hess\operatorname{Hess}\mathcal{L} (2.8), i.e.,

Hess~(v)=Hessv(Λ,v)|(Λ,v)=(v,Av,v,Bv,v)21WWT.\displaystyle\operatorname{Hess}\widetilde{\mathcal{L}}(v)={\color[rgb]{0,0,0}\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}}-2^{-1}WW^{T}\,. (3.6)
Proof.

The Riemannian gradient (3.5) and Hessian (3.6) of ~(v)\widetilde{\mathcal{L}}(v) follow from a direct computation by (2.5). For the first claim, let (Λ,v)(\Lambda_{*},v_{*}) be a local minimum of \mathcal{L}. Then, Λ\Lambda_{*} is a local minimum of the quadratic function (,v)\mathcal{L}(\cdot,v_{*}), which gives Λ=(v,Av,v,Bv)\Lambda_{*}=(\langle v_{*},Av_{*}\rangle,\langle v_{*},Bv_{*}\rangle). By (3.2), we have, for (Λ,v)(\Lambda,v) near (Λ,v)(\Lambda_{*},v_{*}),

~(v)~(v)(Λ,v),\displaystyle\widetilde{\mathcal{L}}(v_{*})\leq\widetilde{\mathcal{L}}(v)\leq\mathcal{L}(\Lambda,v)\,, (3.7)

i.e., vv_{*} is a local minimum of ~(v)\widetilde{\mathcal{L}}(v). The other direction can be proved similarly. ∎

Thanks to Lemma 3.2, optimizing the function \mathcal{L}(\Lambda,v) is equivalent to optimizing \widetilde{\mathcal{L}}(v), since they have exactly the same local minima in the variable v\in\mathbb{S}^{n-1}. Noting from (3.5) that \operatorname{grad}_{v}\mathcal{L}(\Lambda_{k},v_{k})=\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}), we will show that in the almost commuting regime \lVert[A,B]\rVert\ll 1, the alternating descent algorithm given above can be regarded as a Riemannian approximate Newton method for solving

minv𝕊n1~(v)=v,(A2+B2)vv,Av2v,Bv2,\displaystyle\min_{v\in\mathbb{S}^{n-1}}\widetilde{\mathcal{L}}(v)=\langle v,(A^{2}+B^{2})v\rangle-\langle v,Av\rangle^{2}-\langle v,Bv\rangle^{2}\,, (3.8)

namely, a special case of Algorithm 3 below with H(v)=Hessv(v,Av,v,Bv,v)H(v)=\operatorname{Hess}_{v}\mathcal{L}(\langle v,Av\rangle,\langle v,Bv\rangle,v). In this work, the approximate Newton method refers to a Newton-type algorithm that employs structurally simple approximate Hessian matrices with easily computed inverses.
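For reference, the objective (3.8), its Riemannian gradient on the sphere (the tangential projection of the Euclidean gradient 2(A^{2}+B^{2})v-4\langle v,Av\rangle Av-4\langle v,Bv\rangle Bv), and a retraction can be coded as follows; the normalization retraction used here is an assumption standing in for (2.6).

```python
import numpy as np

def L_tilde(A, B, v):
    """Objective (3.8): <v,(A^2+B^2)v> - <v,Av>^2 - <v,Bv>^2 for a unit vector v."""
    Av, Bv = A @ v, B @ v
    return Av @ Av + Bv @ Bv - (v @ Av) ** 2 - (v @ Bv) ** 2

def grad_L_tilde(A, B, v):
    """Riemannian gradient: projection of the Euclidean gradient
    2(A^2+B^2)v - 4<v,Av>Av - 4<v,Bv>Bv onto the tangent space T_v S^{n-1}."""
    Av, Bv = A @ v, B @ v
    g = 2.0 * (A @ Av + B @ Bv) - 4.0 * (v @ Av) * Av - 4.0 * (v @ Bv) * Bv
    return g - (v @ g) * v

def retract(v, s):
    """Retraction on the unit sphere (assumed here to be the normalization map)."""
    w = v + s
    return w / np.linalg.norm(w)
```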

Lemma 3.3.

Given a pair of almost commuting matrices (A,B)(A,B) with decomposition (2.1), let vv_{*} be a local minimum of the associated ~(v)\widetilde{\mathcal{L}}(v) on 𝕊n1\mathbb{S}^{n-1}. Then, for both H(v)=v2(Λ,v)H(v)=\nabla_{v}^{2}\mathcal{L}(\Lambda,v) and H(v)=Hessv(Λ,v)H(v)=\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v) with Λ=(v,Av,v,Bv)\Lambda=(\langle v,Av\rangle,\langle v,Bv\rangle), it holds that

Hess~(v)H(v)=O(vv)+O(ΔA+ΔB),\displaystyle\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v)-H(v)\rVert=O(\lVert v-v_{*}\rVert)+O(\lVert\Delta A\rVert+\lVert\Delta B\rVert)\,, (3.9)

for vv near vv_{*} and ΔA+ΔB\lVert\Delta A\rVert+\lVert\Delta B\rVert small enough. In particular, if [A,B]=0[A,B]=0, we have

Hess~(v)H(v)=O(vv).\displaystyle\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v)-H(v)\rVert=O(\lVert v-v_{*}\rVert)\,. (3.10)
Proof.

It suffices to consider H(v)=v2(Λ,v)|(Λ,v)=(v,Av,v,Bv,v)H(v)=\nabla_{v}^{2}\mathcal{L}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}, since the other one can be similarly analyzed. By formulas (2.7) and (3.5), the optimality condition grad~(v)=0\operatorname{grad}\widetilde{\mathcal{L}}(v_{*})=0 gives

(H(v)v,H(v)vI)v=0,\displaystyle(H(v_{*})-\langle v_{*},H(v_{*})v_{*}\rangle I)v_{*}=0\,, (3.11)

which yields, by (2.10),

M(v):=Hessv(Λ,v)|(Λ,v)=(v,Av,v,Bv,v)=H(v)v,H(v)vI.\displaystyle M(v_{*}):={\color[rgb]{0,0,0}\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)|_{(\Lambda,v)=(\langle v_{*},Av_{*}\rangle,\langle v_{*},Bv_{*}\rangle,v_{*})}}=H(v_{*})-\langle v_{*},H(v_{*})v_{*}\rangle I\,. (3.12)

To prove the desired estimate (3.9), we first observe

Hess~(v)H(v)O(vv)+Hess~(v)H(v),\displaystyle\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v)-H(v)\rVert\leq O(\lVert v-v_{*}\rVert)+\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v_{*})\rVert\,,

by the Lipschitz continuity of Hess~(v)\operatorname{Hess}\widetilde{\mathcal{L}}(v) and H(v)H(v). Then, recalling from Proposition 2.2 and Lemma 3.2 that the local minima of ~(v)\widetilde{\mathcal{L}}(v) for commuting matrices are characterized by their common eigenvectors, we have H(v)v=0H(v_{*})v_{*}=0 and Hess~(v)=M(v)=H(v)\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})=M(v_{*})=H(v_{*}) for (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, by formulas (2.9), (3.6), and (3.12). It follows from the stability estimate in Theorem 2.6 that Hess~(v)H(v)=O(ΔA+ΔB)\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v_{*})\rVert=O(\lVert\Delta A\rVert+\lVert\Delta B\rVert). Thus, (3.9) and also (3.10) hold. ∎

In what follows, we say that a symmetric matrix H(v)\in{\rm S}(n) is an approximate Riemannian Hessian of \widetilde{\mathcal{L}}(v) if the property (3.9) holds as \lVert v-v_{*}\rVert\to 0 and \lVert\Delta A\rVert+\lVert\Delta B\rVert\to 0. In addition to \operatorname{Hess}_{v}\mathcal{L}(\Lambda,v) with \Lambda=(\langle v,Av\rangle,\langle v,Bv\rangle) from the alternating minimization, Lemma 3.3 also suggests a new choice \nabla_{v}^{2}\mathcal{L}(\Lambda,v) for approximating \operatorname{Hess}\widetilde{\mathcal{L}}(v), which has the favorable property of being positive semidefinite. It is easy to see that its restriction to T_{v}\mathbb{S}^{n-1}, namely (I_{n}-vv^{T})\nabla_{v}^{2}\mathcal{L}(\Lambda,v)(I_{n}-vv^{T}), also satisfies the estimate (3.9) and hence is a good candidate for the approximate Hessian as well.

We are now ready to give our approximate Newton method (Algorithm 3) for solving (OP1{\rm OP}_{1}) and (OP2{\rm OP}_{2}). We note that the only difference between (OP1{\rm OP}_{1}) and (OP2{\rm OP}_{2}) is the inequality constraint dist(v,Sj1)ε\operatorname{dist}(v,S_{j-1})\leq\sqrt{\varepsilon}, which can be easily dealt with by a projection step, thanks to Lemma 2.4. One may also observe that for j=1j=1, there holds Sj1=𝕊n1S_{j-1}=\mathbb{S}^{n-1} by definition and then (OP2{\rm OP}_{2}) reduces to (OP1{\rm OP}_{1}). Thus, it suffices to state the algorithm for (OP2{\rm OP}_{2}). For clarity, we recall that given w1,,wj𝕊n1w_{1},\ldots,w_{j}\in\mathbb{S}^{n-1} (j1j\geq 1),

Sj:=span{w1,,wj}𝕊n1,\displaystyle S_{j}:={\rm span}\{w_{1},\ldots,w_{j}\}^{\perp}\cap\mathbb{S}^{n-1}\,, (3.13)

and Sj,ε:={v𝕊n1;dist(v,Sj)ε}S_{j,\varepsilon}:=\left\{v\in\mathbb{S}^{n-1}\,;\ {\rm dist}(v,S_{j})\leq\sqrt{\varepsilon}\right\} for some ε>0\varepsilon>0. If j=0j=0, then Sj=Sj,ε:=𝕊n1S_{j}=S_{j,\varepsilon}:=\mathbb{S}^{n-1}. We denote by 𝒫j,ε\mathcal{P}_{j,\varepsilon} the projection to the minimizer of dist(v,Sj,ε)\operatorname{dist}(v,S_{j,\varepsilon}) given in (2.21) and also recall the retraction RvR_{v} in (2.6).
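The two elementary operations attached to S_{j-1}, namely the random initialization in step 1 of Algorithm 3 below and the metric projection onto the subsphere, admit simple closed forms; a small Python sketch (with names of our choosing) is given below. The \varepsilon-relaxed projection \mathcal{P}_{j,\varepsilon} of (2.21) is not reproduced here.

```python
import numpy as np

def random_point_on_subsphere(W, seed=None):
    """Uniform random unit vector in S_{j-1} = span{w_1,...,w_{j-1}}^perp on the
    sphere; W stores the orthonormal w_i as columns (zero columns for j = 1)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[0])
    v -= W @ (W.T @ v)                  # remove components along w_1, ..., w_{j-1}
    return v / np.linalg.norm(v)

def project_to_subsphere(v, W):
    """Metric projection of a unit vector v onto S_{j-1}: normalize the orthogonal
    projection of v onto span{w_1,...,w_{j-1}}^perp (well defined when nonzero)."""
    p = v - W @ (W.T @ v)
    return p / np.linalg.norm(p)
```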

Input: almost commuting matrices A,B\in{\rm S}(n);
      vectors w_{1},\ldots,w_{j-1}\in\mathbb{S}^{n-1} for j\geq 1 (j=1 means no input vectors);
      parameters \tau\in(0,1/2), \beta\in(0,1), \epsilon>0, and 0<\varepsilon\ll 1;
      approximate Hessian matrix H(\cdot)\in{\rm S}(n)
Output: a triplet (\lambda,\mu,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}
1. generate a random initial vector v_{0}\in S_{j-1} with S_{j-1} defined by \{w_{1},\ldots,w_{j-1}\} as in (3.13)
2. k=0
3. while \lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert>\epsilon do
4.    compute the search direction s_{k}\in T_{v_{k}}\mathbb{S}^{n-1} by solving
      -H(v_{k})[s_{k}]=\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}) (3.14)
5.    compute the stepsize \alpha_{k} by the backtracking line search:
      \alpha_{k}:=\max\left\{\beta^{\ell}\,;\ \widetilde{\mathcal{L}}(R_{v_{k}}(\beta^{\ell}s_{k}))\leq\widetilde{\mathcal{L}}(v_{k})+\tau\beta^{\ell}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}),s_{k}\rangle\right\} (3.15)
6.    v_{k+1}=\mathcal{P}_{j,\varepsilon}R_{v_{k}}(\alpha_{k}s_{k})
7.    k=k+1
8. end while
9. v=v_{k}\in\mathbb{S}^{n-1}, \lambda=\langle v_{k},Av_{k}\rangle, \mu=\langle v_{k},Bv_{k}\rangle
Algorithm 3 Projected Riemannian approximate Newton method for ({\rm OP}_{1}) and ({\rm OP}_{2})

We end this section with several remarks. First, by Theorem 2.13, the parameter \varepsilon in Algorithm 3 can be chosen of the order (\lVert\Delta A\rVert+\lVert\Delta B\rVert)/n^{-3/2} or larger. Second, by the above discussion, it is clear that Algorithm 3 with H(v)=\operatorname{Hess}_{v}\mathcal{L}(\langle v,Av\rangle,\langle v,Bv\rangle,v) gives the alternating minimization for ({\rm OP}_{1}) and ({\rm OP}_{2}), while for H(v)=\operatorname{Hess}\widetilde{\mathcal{L}}(v), Algorithm 3 is nothing other than the standard Riemannian Newton method. Moreover, combining the abstract VJD scheme in Algorithm 2 with Algorithm 3 gives the algorithm used in practice.

Finally, for the step (3.14), if the approximate Hessian maps T_{v}\mathbb{S}^{n-1} to T_{v}\mathbb{S}^{n-1}, we can simply take the Moore–Penrose generalized inverse of H to find s_{k}=-H(v_{k})^{\dagger}\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}). But if H(v) does not have T_{v}\mathbb{S}^{n-1} as an invariant subspace, say, H(v)=\nabla_{v}^{2}\mathcal{L}(\langle v,Av\rangle,\langle v,Bv\rangle,v), then the equation (3.14) may not admit a solution in T_{v}\mathbb{S}^{n-1}. In this case, we consider the least-squares solution to (3.14). See Section 5.1 for more computational details, where we numerically test the aforementioned candidates for the approximate Hessian H(v).
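To make the preceding discussion concrete, here is a minimal Python sketch of Algorithm 3 for ({\rm OP}_{1}) (so the projection step is the identity), using the least-squares direction described above and reusing the helpers L_tilde, grad_L_tilde, and retract from the sketch after (3.8). The argument H is any of the approximate Hessian candidates, e.g. the Euclidean Hessian tested in Section 5.1; this is an illustrative sketch rather than the implementation used in the experiments.

```python
import numpy as np

def approx_newton_op1(A, B, H, v0, tol=1e-8, tau=0.25, beta=0.5, max_iter=200):
    """Riemannian approximate Newton method for (OP1).  `H(A, B, v)` returns an
    approximate Hessian matrix; L_tilde, grad_L_tilde and retract are the
    helpers from the earlier sketch.  For (OP2) one would in addition draw v0
    in S_{j-1} and apply the projection P_{j,eps} after each retraction."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        g = grad_L_tilde(A, B, v)
        if np.linalg.norm(g) <= tol:
            break
        # tangential least-squares solution of -H(v) s = grad, as in (3.14)
        P = np.eye(len(v)) - np.outer(v, v)
        s = -P @ np.linalg.pinv(H(A, B, v) @ P) @ g
        # backtracking line search (3.15)
        alpha, f0, slope = 1.0, L_tilde(A, B, v), g @ s
        while (L_tilde(A, B, retract(v, alpha * s)) > f0 + tau * alpha * slope
               and alpha > 1e-12):
            alpha *= beta
        v = retract(v, alpha * s)
    return v @ (A @ v), v @ (B @ v), v
```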

4 Error analysis

This section is devoted to the error analysis of our VJD algorithm. In Section 4.1, we establish the global convergence of Algorithm 3 for ({\rm OP}_{1}) (i.e., without the inequality constraint), while in Section 4.2, we discuss the numerical stability of Algorithm 2 with respect to noise and numerical errors.

4.1 Convergence of Riemannian approximate Newton method

Before stating our main theorem, we recall some preliminaries. First, by [59, Lemma 6], there exist a0>0a_{0}>0, a1>0a_{1}>0, and δa0,a1>0\delta_{a_{0},a_{1}}>0 such that for any v𝕊n1v\in\mathbb{S}^{n-1} and pTv𝕊n1p\in T_{v}\mathbb{S}^{n-1} with pδa0,a1\lVert p\rVert\leq\delta_{a_{0},a_{1}},

a0pd(v,Rv(p))a1p,\displaystyle a_{0}\lVert p\rVert\leq d(v,R_{v}(p))\leq a_{1}\lVert p\rVert\,, (4.1)

where d(,)d(\cdot,\cdot) is the geodesic distance on 𝕊n1\mathbb{S}^{n-1}. We define the following constant for later use:

K:=sup{d(Rv(p),Rv(q))pq;p,qTv𝕊n1,pδa0,a1,qδa0,a1}.\displaystyle K:=\sup\left\{\frac{d(R_{v}(p),R_{v}(q))}{\lVert p-q\rVert}\,;\ p,q\in T_{v}\mathbb{S}^{n-1}\,,\ \lVert p\rVert\leq\delta_{a_{0},a_{1}}\,,\ \lVert q\rVert\leq\delta_{a_{0},a_{1}}\right\}. (4.2)

We denote by Pv,wP_{v,w} the parallel transport of tangent vectors at vv to the tangent vectors at ww along the geodesic curve connecting points vv and ww; see [12] for the precise definition of Pv,wP_{v,w}. We also recall the following expansion for the Riemannian gradient of a smooth function ff on 𝕊n1\mathbb{S}^{n-1} [28, 47]: for ww near vv,

gradf(w)=Pv,wgradf(v)+Pv,wHessf(v)[Rv1w]+O(Rv1w2).\displaystyle\operatorname{grad}f(w)=P_{v,w}\operatorname{grad}f(v)+P_{v,w}\operatorname{Hess}f(v)[R_{v}^{-1}w]+O(\lVert R_{v}^{-1}w\rVert^{2})\,. (4.3)
Theorem 4.1.

Let H(v):T_{v}\mathbb{S}^{n-1}\to T_{v}\mathbb{S}^{n-1} be a positive semidefinite approximate Hessian of \widetilde{\mathcal{L}} satisfying (3.9). Suppose that v_{*} is an accumulation point of the iterative sequence \{v_{k}\} generated by Algorithm 3 for ({\rm OP}_{1}), and that H(v_{k}) for all k and H(v_{*}) are non-singular operators on the tangent space. Then, v_{*} is a stationary point of ({\rm OP}_{1}), and for 0\neq\lVert[A,B]\rVert\ll 1, the sequence \{v_{k}\} converges to v_{*} linearly with the convergence rate:

lim supkd(vk+1,v)d(vk,v)=O([A,B]1/2).{\color[rgb]{0,0,0}\limsup_{k\to\infty}\frac{d(v_{k+1},v_{*})}{d(v_{k},v_{*})}=O(\lVert[A,B]\rVert^{1/2})\,.}

In the commuting case [A,B]=0[A,B]=0, we have that {vk}\{v_{k}\} converges to vv_{*} quadratically:

lim supkd(vk+1,v)d(vk,v)2=0.{\color[rgb]{0,0,0}\limsup_{k\to\infty}\frac{d(v_{k+1},v_{*})}{d(v_{k},v_{*})^{2}}=0\,.}
Proof.

In this proof, we always regard H(v) as a linear operator on the tangent space T_{v}\mathbb{S}^{n-1}. By abuse of notation, we define the norm of any T:T_{v}\mathbb{S}^{n-1}\to T_{v}\mathbb{S}^{n-1} by \lVert T\rVert=\sup\{\lVert Ts\rVert\,;\ s\in T_{v}\mathbb{S}^{n-1}\,,\ \lVert s\rVert=1\}. Since H(v_{*}) is non-singular and H(v) is continuous, there exists a neighborhood U of v_{*} such that H(v) for v\in U is positive definite on T_{v}\mathbb{S}^{n-1} and satisfies

H(v)12H(v)1,\displaystyle\lVert H(v)^{-1}\rVert\leq 2\lVert H(v_{*})^{-1}\rVert\,, (4.4)

which implies that we can find a constant σ>0\sigma>0 such that H(vk)1σIH(v_{k})^{-1}\geq\sigma I on Tvk𝕊n1T_{v_{k}}\mathbb{S}^{n-1}. It follows that

grad~(vk),sk=grad~(vk),H(vk)1grad~(vk)σgrad~(vk)2.\displaystyle\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}),s_{k}\rangle=-\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}),H(v_{k})^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rangle\leq-\sigma\lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert^{2}\,. (4.5)

Then by (4.5) and the line search condition (3.15), we see that the stepsize αk\alpha_{k} is well defined.

Step 1. We first prove that the accumulation point v_{*} is also a stationary point, i.e., \operatorname{grad}\widetilde{\mathcal{L}}(v_{*})=0. Again by (3.15), we have

~(vk)~(vk+1)ταkgrad~(vk),skαkστgrad~(vk)2>0,\displaystyle\widetilde{\mathcal{L}}(v_{k})-\widetilde{\mathcal{L}}(v_{k+1})\geq-\tau\alpha_{k}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}),s_{k}\rangle\geq\alpha_{k}\sigma\tau\lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert^{2}>0\,, (4.6)

which means that ~(vk)\widetilde{\mathcal{L}}(v_{k}) is strictly decreasing. By (4.6), if lim infkαk=μ>0\liminf_{k\to\infty}\alpha_{k}=\mu>0 holds, then we find

k=1grad~(vk)2<+,\displaystyle\sum_{k=1}^{\infty}\lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert^{2}<+\infty\,,

and \lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert\to 0 as k\to\infty, which readily gives \lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{*})\rVert=0. If not, i.e., if \liminf_{k\to\infty}\alpha_{k}=0 holds, then given any s\in\mathbb{N}, we can take a subsequence k_{j} such that \alpha_{k_{j}}<\beta^{s} for j large enough. Hence, \beta^{s} does not satisfy the line search condition (3.15), that is,

\widetilde{\mathcal{L}}(R_{v_{k_{j}}}(\beta^{s}s_{k_{j}}))>\widetilde{\mathcal{L}}(v_{k_{j}})+\tau\beta^{s}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k_{j}}),s_{k_{j}}\rangle\,. (4.7)

Without loss of generality, we assume that there holds vkjvv_{k_{j}}\to v_{*} and skjs=H(v)1grad~(v)s_{k_{j}}\to s_{*}=-H(v_{*})^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}). Taking the limit jj\to\infty in (4.7) gives

\widetilde{\mathcal{L}}(R_{v_{*}}(\beta^{s}s_{*}))>\widetilde{\mathcal{L}}(v_{*})+\tau\beta^{s}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}),s_{*}\rangle\,.

Dividing by \beta^{s} and letting s\to\infty, we then easily see

\displaystyle\tau\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}),s_{*}\rangle\leq\liminf_{s\to\infty}\frac{\widetilde{\mathcal{L}}(R_{v_{*}}(\beta^{s}s_{*}))-\widetilde{\mathcal{L}}(v_{*})}{\beta^{s}}=\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}),s_{*}\rangle\,.

Noting \tau\in(0,1/2) in Algorithm 3 and recalling (4.5), we obtain \langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}),s_{*}\rangle=0 and thus \operatorname{grad}\widetilde{\mathcal{L}}(v_{*})=0.

Step 2. We next show that there exists a neighborhood VUV\subset U of vv_{*} such that if vkVv_{k}\in V, then αk=1\alpha_{k}=1. Recalling that RvR_{v} is a second-order retraction, we have the following Taylor expansion [12, Proposition 5.43]:

~(Rv(s))=~(v)+grad~(v),s+12Hess~(v)[s],s+O(s3),\widetilde{\mathcal{L}}(R_{v}(s))=\widetilde{\mathcal{L}}(v)+\langle\operatorname{grad}\widetilde{\mathcal{L}}(v),s\rangle+\frac{1}{2}\langle\operatorname{Hess}\widetilde{\mathcal{L}}(v)[s],s\rangle+O(\lVert s\rVert^{3})\,, (4.8)

for sTv𝕊n1s\in T_{v}\mathbb{S}^{n-1} with s\lVert s\rVert small enough. Then by (4.4), for vv near vv_{*}, it holds that

s=H(v)1grad~(v)=O(vv),\displaystyle s=-H(v)^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v)=O(\lVert v-v_{*}\rVert)\,,

which, along with (4.8), yields

~(Rv(s))~(v)\displaystyle\widetilde{\mathcal{L}}(R_{v}(s))-\widetilde{\mathcal{L}}(v) =grad~(v),s+12Hess~(v)[s],s+O(s3)\displaystyle=\langle\operatorname{grad}\widetilde{\mathcal{L}}(v),s\rangle+\frac{1}{2}\langle\operatorname{Hess}\widetilde{\mathcal{L}}(v)[s],s\rangle+O(\lVert s\rVert^{3})
=12grad~(v),s+O(ΔA+ΔB)vv2+O(vv3),\displaystyle=\frac{1}{2}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v),s\rangle+O\left(\lVert\Delta A\rVert+\lVert\Delta B\rVert\right)\lVert v-v_{*}\rVert^{2}+O(\lVert v-v_{*}\rVert^{3})\,, (4.9)

where we have also used the assumption that (3.9) holds. The above estimate (4.9) allows us to find a neighborhood V of v_{*} such that for any v\in V, there holds

~(Rv(s))~(v)τgrad~(v),s.\widetilde{\mathcal{L}}(R_{v}(s))-\widetilde{\mathcal{L}}(v)\leq\tau\langle\operatorname{grad}\widetilde{\mathcal{L}}(v),s\rangle\,.

By the above estimate and the line search condition (3.15), we conclude our claim.

Step 3. We finally consider the local convergence rate with stepsize one. For this, we claim that there exists a neighborhood V0VV_{0}\subset V of vv_{*} such that Rv(s)V0R_{v}(s)\in V_{0} holds for any vV0v\in V_{0}, where s=H(v)1grad~(v)s=-H(v)^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v). Let

r(v):=grad~(v)Pv,vHess~(v)[Rv1v].r(v):=\operatorname{grad}\widetilde{\mathcal{L}}(v)-P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})[R_{v_{*}}^{-1}v]\,.

By a direct computation with the relation Rv1(v)=Pv,vRv1(v)R_{v}^{-1}(v_{*})=-P_{v_{*},v}R_{v_{*}}^{-1}(v) [12], we have

H(v)1grad~(v)+Rv1(v)=H(v)1(r(v)+Pv,vHess~(v)[Rv1v]H(v)Pv,vRv1(v)),\displaystyle H(v)^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v)+R_{v}^{-1}(v_{*})=H(v)^{-1}\left(r(v)+P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})[R_{v_{*}}^{-1}v]-H(v)P_{v_{*},v}R_{v_{*}}^{-1}(v)\right),

which gives the estimate

sRv1(v)\displaystyle\lVert s-R_{v}^{-1}(v_{*})\rVert H(v)1(r(v)+Pv,vHess~(v)H(v)Pv,vRv1(v)).\displaystyle\leq\lVert H(v)^{-1}\rVert\left(\lVert r(v)\rVert+\lVert P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v)P_{v_{*},v}\rVert\lVert R_{v_{*}}^{-1}(v)\rVert\right). (4.10)

By the Lipschitz property of H(v)H(v) and vvd(v,v)\lVert v-v_{*}\rVert\leq d(v,v_{*}), we estimate, for vv near vv_{*},

Pv,vHess~(v)H(v)Pv,vPv,vHess~(v)H(v)Pv,v+O(d(v,v)).\displaystyle\lVert P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v)P_{v_{*},v}\rVert\leq\lVert P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v_{*})P_{v_{*},v}\rVert+O(d(v,v_{*}))\,. (4.11)

Without loss of generality, we let V0V_{0} be small enough such that (4.1) holds with constant KK defined in (4.2). Then it is easy to see from (4.10) and (4.11) that for some constant CC,

\displaystyle\limsup_{v\to v_{*}}\frac{d(R_{v}(s),v_{*})}{d(v,v_{*})}\leq\limsup_{v\to v_{*}}\frac{K\lVert s-R_{v}^{-1}(v_{*})\rVert}{a_{0}\lVert R_{v_{*}}^{-1}(v)\rVert}\leq C\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v_{*})\rVert=O(\lVert[A,B]\rVert^{1/2})\,, (4.12)

where we used r(v)=O(Rv1(v)2)r(v)=O\left(\lVert R_{v_{*}}^{-1}(v)\rVert^{2}\right) by (4.3) and grad~(v)=0\operatorname{grad}\widetilde{\mathcal{L}}(v_{*})=0, and the last estimate is from (2.2) and (3.9). When [A,B]\lVert[A,B]\rVert is small enough, we obtain

lim supvvd(Rv(s),v)d(v,v)<1.\displaystyle\limsup_{v\to v_{*}}\frac{d(R_{v}(s),v_{*})}{d(v,v_{*})}<1\,. (4.13)

In particular, if [A,B]=0[A,B]=0, it holds that Hess~(v)=H(v)\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})=H(v_{*}) by Lemma 3.3. In this case, similarly, we have

lim supvvd(Rv(s),v)d(v,v)2<+.\displaystyle\limsup_{v\to v_{*}}\frac{d(R_{v}(s),v_{*})}{d(v,v_{*})^{2}}<+\infty\,. (4.14)

We now have all the ingredients to finish the proof. Since v_{*} is an accumulation point, there exists k_{0} such that v_{k_{0}}\in V_{0}. Then, by Step 2 and Step 3, a simple induction argument yields v_{k}\in V_{0} and \alpha_{k}=1 for k\geq k_{0}. In view of the estimates (4.12) and (4.14), we readily obtain the convergence of the whole sequence \{v_{k}\} to v_{*} together with the stated convergence rate. ∎

Remark 4.2.

Empirically, the projected approximate Newton method in Algorithm 3 exhibits a similar convergence rate when applied to ({\rm OP}_{2}) with inequality constraints as it does for ({\rm OP}_{1}); see Section 5.1 for the numerical results. Nevertheless, to the best of the authors' knowledge, there is very little research on the convergence of projection-type Riemannian optimization algorithms, and we postpone the convergence analysis of Algorithm 3 for ({\rm OP}_{2}) to future investigations. We also refer interested readers to [7, 50, 73] for recent advancements in Riemannian optimization with constraints.

4.2 Numerical stability

We next show that for commuting matrices (A,B)\in{\rm F}_{\rm com}, our VJD algorithm produces a reliable simultaneous approximate diagonalization. Note that when [A,B]=0, for suitable \varepsilon in (1.3), each sub-optimization problem ({\rm OP}_{1}) or ({\rm OP}_{2}) in Algorithm 2 admits a minimizer (\lambda^{(j)},\mu^{(j)},v^{(j)}) satisfying \mathcal{L}(\lambda^{(j)},\mu^{(j)},v^{(j)})=0. However, due to the presence of noise and iteration errors, the input matrices may not exactly commute and (\lambda^{(j)},\mu^{(j)},v^{(j)}) cannot be computed exactly. In this scenario, we assume that the sub-optimization problems ({\rm OP}_{1}) and ({\rm OP}_{2}) are solved by Algorithm 3 with outputs (\widetilde{\lambda}_{j},\widetilde{\mu}_{j},\widetilde{v}_{j}) satisfying

(λ~j,μ~j,v~j)=O(δ),\displaystyle\mathcal{L}(\widetilde{\lambda}_{j},\widetilde{\mu}_{j},\widetilde{v}_{j})=O(\delta)\,, (4.15)

for some precision parameter δ>0\delta>0, and prove that our Algorithm 2 can output matrices UO(n)U\in{\rm O}(n) and D1,D2diag(n)D_{1},D_{2}\in\mathrm{diag}\left(n\right) with a controlled error; see Proposition 4.5. We remark that the justification of the assumption (4.15) relies on the convergence of Algorithm 3 for (OP2{\rm OP}_{2}) and hence is left open. The proof is based on the following two lemmas: a standard perturbation result for eigenvalue problems [58, p.58] and a backward error stability result for the common eigenvector problem [22, Theorem 4].

Lemma 4.3.

Suppose that AS(n)A\in{\rm S}(n) has nn eigenvalues λ1λn\lambda_{1}\leq\cdots\leq\lambda_{n} with the associated orthonormal eigenvectors v1,,vnv_{1},\ldots,v_{n}. For any ΔAS(n)\Delta A\in{\rm S}(n) with ΔA\lVert\Delta A\rVert small enough, let λ~1λ~n\widetilde{\lambda}_{1}\leq\cdots\leq\widetilde{\lambda}_{n} be the eigenvalues of A+ΔAA+\Delta A. Then, the perturbed matrix A+ΔAA+\Delta A has orthonormal eigenvectors v~i\widetilde{v}_{i} associated with λ~i\widetilde{\lambda}_{i}, which satisfy

|λ~iλi|=O(ΔA),v~ivi=O(ΔA).|\widetilde{\lambda}_{i}-\lambda_{i}|=O(\lVert\Delta A\rVert)\,,\quad\lVert\widetilde{v}_{i}-v_{i}\rVert=O(\lVert\Delta A\rVert)\,.
Lemma 4.4.

Suppose that v𝕊n1v\in\mathbb{S}^{n-1} is a common eigenvector of matrices A,BS(n)A,B\in{\rm S}(n) with Av=λvAv=\lambda v and Bv=μvBv=\mu v. If v~\widetilde{v} is an approximation of vv, then there exist λ~,μ~\widetilde{\lambda},\widetilde{\mu}\in\mathbb{R} and ΔA,ΔBS(n)\Delta A,\Delta B\in{\rm S}(n) such that

(A+ΔA)v~=λ~v~,(B+ΔB)v~=μ~v~,\displaystyle(A+\Delta A)\widetilde{v}=\widetilde{\lambda}\widetilde{v}\,,\quad(B+\Delta B)\widetilde{v}=\widetilde{\mu}\widetilde{v}\,,

and

ΔA2A2+ΔB2B2=Av~v~,Av~v~2A2+Bv~v~,Bv~v~2B2.\displaystyle\sqrt{\frac{\lVert\Delta A\rVert^{2}}{\lVert A\rVert^{2}}+\frac{\lVert\Delta B\rVert^{2}}{\lVert B\rVert^{2}}}=\sqrt{\frac{\lVert A\widetilde{v}-\langle\widetilde{v},A\widetilde{v}\rangle\widetilde{v}\rVert^{2}}{\lVert A\rVert^{2}}+\frac{\lVert B\widetilde{v}-\langle\widetilde{v},B\widetilde{v}\rangle\widetilde{v}\rVert^{2}}{\lVert B\rVert^{2}}}\,.

We now state and prove the main result of this section.

Proposition 4.5.

For commuting matrices A,BS(n)A,B\in{\rm S}(n), suppose that the minimizers (λ(j),μ(j),v(j))(\lambda^{(j)},\mu^{(j)},v^{(j)}) in Algorithm 2 are approximated by (λ~j,μ~j,v~j)(\widetilde{\lambda}_{j},\widetilde{\mu}_{j},\widetilde{v}_{j}) computed from Algorithm 3 with the estimate (4.15). Then, Algorithm 2 generates matrices UO(n)U\in{\rm O}(n) and D1,D2diag(n)D_{1},D_{2}\in\mathrm{diag}\left(n\right) that satisfy

AUD1UT2+BUD2UT2=O(nδ)+O(n2ϵδ),\displaystyle\lVert A-UD_{1}U^{T}\rVert^{2}+\lVert B-UD_{2}U^{T}\rVert^{2}=O(n\delta)+O(n^{2}\epsilon\sqrt{\delta})\,,

where ϵ\epsilon is the threshold parameter in the stopping criterion of Algorithm 3.

Proof.

We first estimate the error for the nearest orthogonal matrix problem (1.4) involved in Algorithm 2. From Algorithm 3, it is clear that (λ~j,μ~j,v~j)=(v~j,Av~j,v~j,Bv~j,v~j)(\widetilde{\lambda}_{j},\widetilde{\mu}_{j},\widetilde{v}_{j})=(\langle\widetilde{v}_{j},A\widetilde{v}_{j}\rangle,\langle\widetilde{v}_{j},B\widetilde{v}_{j}\rangle,\widetilde{v}_{j}), and then the assumption (4.15) implies

~(v~j)=O(δ)for allj.\widetilde{\mathcal{L}}\left(\tilde{v}_{j}\right)=O(\delta)\quad\text{for all}\ j\,. (4.16)

By viewing \widetilde{v}_{j} as an approximation to the common eigenvector v^{(j)} and applying Lemma 4.4, there exist small perturbations \Delta A_{j},\Delta B_{j}\in{\rm S}(n) such that \widetilde{v}_{j} is an eigenvector of both A+\Delta A_{j} and B+\Delta B_{j} and there holds

ΔAj2+ΔBj2max{A,B}min{A,B}~(v~j)=O(δ).\sqrt{\lVert\Delta A_{j}\rVert^{2}+\lVert\Delta B_{j}\rVert^{2}}\leq\frac{\max\{\lVert A\rVert,\lVert B\rVert\}}{\min\{\lVert A\rVert,\lVert B\rVert\}}\sqrt{\widetilde{\mathcal{L}}(\widetilde{v}_{j})}=O(\sqrt{\delta})\,. (4.17)

Therefore, by (4.17) and Lemma 4.3, we obtain

v~jvj=O(δ),\displaystyle\lVert\widetilde{v}_{j}-v_{j}\rVert=O(\sqrt{\delta})\,,

and hence

|v~i,v~j|=O(δ)forij.\left|\langle\tilde{v}_{i},\tilde{v}_{j}\rangle\right|=O(\sqrt{\delta})\quad\text{for}\ i\neq j\,. (4.18)

Letting V=[v~1,,v~n]V=[\widetilde{v}_{1},\ldots,\widetilde{v}_{n}], it readily follows from (4.18) that (VTV)ii=1(V^{T}V)_{ii}=1 and (VTV)ij=O(δ)(V^{T}V)_{ij}=O(\sqrt{\delta}), which implies

VTVIVTVIF=O(nδ).\displaystyle\lVert V^{T}V-I\rVert\leq\lVert V^{T}V-I\rVert_{{\rm F}}=O(n\sqrt{\delta})\,. (4.19)

Again by Lemma 4.3, the estimate (4.19) above gives that all the eigenvalues of VTVV^{T}V are O(nδ)O(n\sqrt{\delta}) perturbations of 11, that is, all the diagonal entries of Σdiag(n)\Sigma\in{\rm diag}(n) in the SVD V=U1ΣU2V=U_{1}\Sigma U_{2} of VV are of the form 1+O(nδ)1+O(n\sqrt{\delta}). This allows us to conclude

U1U2V=IΣ=O(nδ).\displaystyle\lVert U_{1}U_{2}-V\rVert=\lVert I-\Sigma\rVert=O(n\sqrt{\delta})\,.

We now write U=U1U2=[v¯1,,v¯n]U=U_{1}U_{2}=[\bar{v}_{1},\cdots,\bar{v}_{n}] and find

j=1nv~jv¯j2=UVF2=IΣF2=O(n3δ).\sum_{j=1}^{n}\lVert\tilde{v}_{j}-\bar{v}_{j}\rVert^{2}=\left\|U-V\right\|_{{\rm F}}^{2}=\left\|I-\Sigma\right\|_{{\rm F}}^{2}=O(n^{3}\delta)\,. (4.20)

By the stopping criterion in Algorithm 3, we directly see \lVert\operatorname{grad}\widetilde{\mathcal{L}}(\widetilde{v}_{j})\rVert\leq\epsilon. Then, from (4.20), we have

|j=1n~(v¯j)j=1n~(v~j)|j=1n|~(v¯j)~(v~j)|O(ϵ)j=1nv¯jv~j=O(n2ϵδ),\begin{split}&\Big{|}\sum_{j=1}^{n}\widetilde{\mathcal{L}}\left(\bar{v}_{j}\right)-\sum_{j=1}^{n}\widetilde{\mathcal{L}}\left(\tilde{v}_{j}\right)\Big{|}\leq\sum_{j=1}^{n}\Big{|}\widetilde{\mathcal{L}}\left(\bar{v}_{j}\right)-\widetilde{\mathcal{L}}\left(\tilde{v}_{j}\right)\Big{|}\leq O(\epsilon)\sum_{j=1}^{n}\lVert\bar{v}_{j}-\widetilde{v}_{j}\rVert=O(n^{2}\epsilon\sqrt{\delta})\,,\end{split}

which further gives

j=1n~(v¯j)j=1n~(v~j)+O(n2ϵδ)=O(nδ)+O(n2ϵδ).\displaystyle\sum_{j=1}^{n}\widetilde{\mathcal{L}}(\bar{v}_{j})\leq\sum_{j=1}^{n}\widetilde{\mathcal{L}}(\tilde{v}_{j})+O(n^{2}\epsilon\sqrt{\delta})=O(n\delta)+O(n^{2}\epsilon\sqrt{\delta})\,.

The proof is completed by the following simple estimate:

AUD1UT2+BUD2UT2AUD1UTF2+BUD2UTF2j=1n~(v¯j),\displaystyle\lVert A-UD_{1}U^{T}\rVert^{2}+\lVert B-UD_{2}U^{T}\rVert^{2}\leq\lVert A-UD_{1}U^{T}\rVert_{{\rm F}}^{2}+\lVert B-UD_{2}U^{T}\rVert_{{\rm F}}^{2}\leq\sum_{j=1}^{n}\widetilde{\mathcal{L}}(\bar{v}_{j})\,,

where the last inequality follows from the definitions of UO(n)U\in{\rm O}(n) and D1,D2diag(n)D_{1},D_{2}\in\mathrm{diag}\left(n\right). ∎

5 Numerical experiments

In this section, we present extensive numerical experiments to verify the efficiency and robustness of the algorithms proposed in Section 3. In Section 5.1, we test the performance of the approximate Newton method (cf. Algorithm 3) with the exact Riemannian Hessian and various approximate Hessian matrices. Then, in Section 5.2, we compare the Jacobi algorithm and the VJD algorithm for almost commuting matrices in detail, and we also investigate, both numerically and theoretically, the relation between the error cost function and the magnitude of the commutator. In Section 5.3, we apply our algorithm to the independent component analysis (ICA) problem. All the experiments presented below are conducted using Python on a personal laptop with 16 GB RAM and an 8-core 2.6 GHz CPU. The implementation of the JADE algorithm is based on the Python package [6]. The synthetic almost commuting matrices in Sections 5.1 and 5.2 are generated by (A,B)=(A_{*},B_{*})+(\Delta A,\Delta B), where the commuting pair (A_{*},B_{*})\in{\rm F}_{\rm com} is sampled from the distribution \mathbb{P}_{{\rm F}_{\rm com}} introduced in Section 2 and (\Delta A,\Delta B)\in{\rm S}(n)\times{\rm S}(n) is additive Gaussian noise with noise level \sigma, namely, (\Delta A)_{ij} and (\Delta B)_{ij} with i\leq j are i.i.d. Gaussian \mathcal{N}(0,\sigma^{2}).

5.1 Comparison between different approximate Hessians

We explore different approximate Hessian matrices within Algorithm 3 and compare their performance for solving ({\rm OP}_{1}) and ({\rm OP}_{2}) in terms of computing time. As discussed after Algorithm 3, these approximate Hessian matrices can be broadly categorized into two families: those that have the tangent space T_{v}\mathbb{S}^{n-1} as an invariant subspace and those that do not. For the first family, the search direction s_{k} in (3.14) can be readily computed via the pseudoinverse:

sk=H(vk)grad~(vk),\displaystyle s_{k}=-H(v_{k})^{\dagger}\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\,, (5.1)

while for the second family, we consider the associated least-squares problem:

minsTvk𝕊n1H(vk)s+grad~(vk)=minxnH(vk)(IvkvkT)x+grad~(vk).\displaystyle\min_{s\in T_{v_{k}}\mathbb{S}^{n-1}}\lVert H(v_{k})s+\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert=\min_{x\in\mathbb{R}^{n}}\lVert H(v_{k})(I-v_{k}v_{k}^{T})x+\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert\,. (5.2)

It is well-known [35] that the above problem (5.2) admits a minimizer:

sk=(IvkvkT)(H(vk)(IvkvkT))grad~(vk).\displaystyle s_{k}=-\left(I-v_{k}v_{k}^{T}\right)\left(H(v_{k})(I-v_{k}v_{k}^{T})\right)^{\dagger}\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\,. (5.3)

Here we consider the following four approximate Hessian matrices (which have been justified in Lemma 3.3 and the discussion afterward): (i) the Euclidean Hessian of \mathcal{L}(\Lambda,v) in v: H_{0}(v)=\nabla_{v}^{2}{\mathcal{L}}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}; (ii) the Riemannian Hessian of \mathcal{L} in v (2.10): H_{1}(v)=\operatorname{Hess}_{v}{\mathcal{L}}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}; (iii) the projected Euclidean Hessian of \mathcal{L}: H_{2}(v)=(I_{n}-vv^{T})H_{0}(v)(I_{n}-vv^{T}); (iv) the exact Riemannian Hessian of \widetilde{\mathcal{L}}(v): H_{3}(v)=\operatorname{Hess}{\widetilde{\mathcal{L}}}(v).
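For the first two of these, a minimal Python sketch is as follows, under the assumption (consistent with (3.1) and (3.3)) that \mathcal{L}(\Lambda,v)=\lVert Av-\lambda v\rVert^{2}+\lVert Bv-\mu v\rVert^{2}, so that \nabla_{v}^{2}\mathcal{L}(\Lambda,v)=2[(A-\lambda I)^{2}+(B-\mu I)^{2}]; either function can be passed as the approximate Hessian H in the Newton sketch at the end of Section 3.

```python
import numpy as np

def H0(A, B, v):
    """Euclidean Hessian of L(Lambda, v) in v at Lambda = (<v,Av>, <v,Bv>),
    assuming L(Lambda, v) = ||Av - lam v||^2 + ||Bv - mu v||^2, which gives
    H0(v) = 2[(A - lam I)^2 + (B - mu I)^2] (positive semidefinite)."""
    n = len(v)
    lam, mu = v @ (A @ v), v @ (B @ v)
    Al, Bm = A - lam * np.eye(n), B - mu * np.eye(n)
    return 2.0 * (Al @ Al + Bm @ Bm)

def H2(A, B, v):
    """Projected Euclidean Hessian (I - v v^T) H0(v) (I - v v^T)."""
    P = np.eye(len(v)) - np.outer(v, v)
    return P @ H0(A, B, v) @ P
```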

We test Algorithm 3 with the above four candidates for the approximate Hessian H(v) for solving ({\rm OP}_{1}) and ({\rm OP}_{2}) with j=2, letting s_{k} be updated accordingly either by (5.1) or (5.3), for three randomly generated pairs of almost commuting matrices of dimension n=1000 with \lVert[A,B]\rVert of orders O(10^{-6}), O(10^{-4}), and O(10^{-2}). The numerical results are presented in Figure 1, which clearly shows that all the candidates achieve similar levels of accuracy with nearly the same number of iteration steps. However, compared to the alternating minimization and the Riemannian Newton method (i.e., Algorithm 3 with approximate Hessian H_{1} and H_{3}, respectively), Algorithm 3 with H_{0} or H_{2} significantly reduces the computational cost, with a running time roughly one fifth of that of the standard Riemannian Newton method. Note that the convergence for the case of H_{2} has been established in Theorem 4.1. The algorithm with the approximate Hessian H_{0}, which appears to be the most efficient option, relies on the heuristic scheme (5.3) and does not align with the Riemannian optimization framework; some additional techniques may be necessary to ensure its convergence properties. Figure 1 also shows that the advantage of the approximate Hessians H_{0} and H_{2} is robust with respect to the magnitude of [A,B].

Figure 1: Convergence histories of Algorithm 3 for ({\rm OP}_{1}) (i.e., the first triplet) and ({\rm OP}_{2}) with j=2 (i.e., the second triplet) with various approximate Hessian matrices H for randomly generated almost commuting matrices (A,B) of dimension n=1000 with \lVert[A,B]\rVert=O(10^{-6}),O(10^{-4}),O(10^{-2}) (left to right). The results for ({\rm OP}_{1}) and ({\rm OP}_{2}) with j=2 are given in the first and second rows, respectively.

5.2 Comparison with Jacobi algorithm and relation with Lin’s theorem

We test Jacobi Algorithm 1 and our VJD Algorithm 2 with the approximate Hessian H_{0} given in Section 5.1 on randomly generated almost commuting symmetric matrices of dimensions n=50,100,500,1000 with noise levels \sigma=10^{-2},10^{-3},10^{-4},10^{-5}, respectively, to compare their computational efficiency. We plot the distributions of the computing times of both algorithms across all the generated samples in Figure 2. It clearly demonstrates that our VJD algorithm is robust against noise and much more efficient than the Jacobi one in the almost commuting regime, especially when the matrix size is large (say, n=500,1000). In addition, each sweep of Jacobi Algorithm 1 consists of n(n-1)/2 Givens rotations, and each rotation requires O(n) operations to update the matrices A and B, so the computational complexity of the Jacobi algorithm is O(n^{3}) per sweep. We numerically compare the complexities of the Jacobi algorithm and the proposed VJD algorithm in Figure 3, where we plot the computational times of Algorithms 1 and 2 as functions of the matrix size on a double logarithmic scale with base 10. It shows that in practice, our VJD algorithm is also of complexity O(n^{3}) but with a smaller prefactor.

Figure 2: Statistical distributions of the computational times of Jacobi Algorithm 1 (blue dots) and VJD Algorithm 2 (red dots) for 30 randomly generated pairs of almost commuting matrices of dimensions n=50,100,500,1000 with noise levels \sigma=10^{-2},10^{-3},10^{-4},10^{-5} (from left to right), respectively.
Figure 3: Averaged computational times of Jacobi Algorithm 1 and VJD Algorithm 2 for almost commuting matrices with dimensions ranging from 128 to 2048 (log–log scale). For each dimension, 30 samples are generated with noise level \sigma=0.01. The blue and red dashed lines are obtained by least-squares fitting of the data points. The results are benchmarked by the black dashed line with a slope of 3.

Recalling that this work was initially motivated by finding a numerically feasible solution to Lin's theorem, we are interested in whether our VJD algorithm can produce commuting matrices (A',B') satisfying the estimate (1.1) in the almost commuting regime. For this, we consider the same pairs of almost commuting matrices (A,B) as in Figure 2 and compute the numerical errors \mathcal{J}(D_{1},D_{2},U) in (OP) for both the Jacobi and VJD algorithms, denoted by \mathcal{J}_{\rm Jacobi} and \mathcal{J}_{\rm VJD}, respectively. We plot \mathcal{J}_{\rm VJD} as a function of the commutator norm \lVert[A,B]\rVert in Figure 4(a) and find that \mathcal{J}_{\rm VJD} scales as \lVert[A,B]\rVert^{2}, independently of the problem size. Thus, it is clear that when \lVert[A,B]\rVert\ll 1, there holds

𝒥VJD[A,B]2[A,B],\displaystyle\mathcal{J}_{\rm VJD}\sim\lVert[A,B]\rVert^{2}\ll\lVert[A,B]\rVert\,, (5.4)

which means that A'=UD_{1}U^{T} and B'=UD_{2}U^{T} from Algorithm 2 satisfy the bound (1.1). We provide a theoretical lower bound on \mathcal{J}_{\rm VJD} below to justify (5.4), which is a generalization of [33, Theorem 3.1]. The proof is given in Appendix A for the sake of completeness.

Proposition 5.1.

Let A,BS(n)A,B\in{\rm S}(n) be symmetric matrices with A,B1\lVert A\rVert,\lVert B\rVert\leq 1. Suppose that matrices UO(n)U\in{\rm O}(n) and D1,D2diag(n)D_{1},D_{2}\in\mathrm{diag}\left(n\right) are computed by Algorithm 2 with input (A,B)(A,B). Then it holds that

𝒥(D1,D2,U)18[A,B]2.\mathcal{J}(D_{1},D_{2},U)\geq\frac{1}{8}\lVert[A,B]\rVert^{2}. (5.5)
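Since the proof in Appendix A in fact bounds the minimum of \mathcal{J} over all of {\rm O}(n), the inequality (5.5) holds for any orthogonal matrix, which makes it easy to sanity-check numerically; a minimal Python check with hypothetical random data is sketched below.

```python
import numpy as np

def off_norm(M):
    """Operator norm of the off-diagonal part of a matrix."""
    return np.linalg.norm(M - np.diag(np.diag(M)), 2)

# Hypothetical random test of (5.5): the bound holds for ANY orthogonal U,
# since the proof bounds the minimum of J over O(n).
rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n)); A = (A + A.T) / 2; A /= np.linalg.norm(A, 2)
B = rng.standard_normal((n, n)); B = (B + B.T) / 2; B /= np.linalg.norm(B, 2)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))      # some orthogonal U
J = off_norm(Q.T @ A @ Q) ** 2 + off_norm(Q.T @ B @ Q) ** 2
assert J >= np.linalg.norm(A @ B - B @ A, 2) ** 2 / 8
```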

We also compute the relative errors between Jacobi and VJD algorithms in the same experimental setup by |𝒥Jacobi𝒥VJD|/𝒥Jacobi|\mathcal{J}_{\rm Jacobi}-\mathcal{J}_{\rm VJD}|/\mathcal{J}_{\rm Jacobi} with the results shown in Figure 4(b). We can see that our VJD algorithm can achieve nearly the same accuracy as the Jacobi method but with much less computational time (cf. Figures 2 and 3).

(a) Error of VJD
(b) Relative error
Figure 4: (a) Errors \mathcal{J}_{\rm VJD} of VJD Algorithm 2 for almost commuting matrices of the various dimensions used in Figure 2 (log–log scale). The black dashed line represents the lower bound \lVert[A,B]\rVert^{2}/8 in (5.5). (b) Relative errors |\mathcal{J}_{\rm Jacobi}-\mathcal{J}_{\rm VJD}|/\mathcal{J}_{\rm Jacobi} between the Jacobi and VJD algorithms (log–log scale).

5.3 Application in independent component analysis

Let s_{1},\ldots,s_{n}\in\mathbb{R}^{t} be independent pure signals of length t, which are assumed to be non-Gaussian. Suppose that the mixed signals x_{1},\ldots,x_{r} (r\geq n) are linear combinations of the pure signals, i.e., X=AS, where X and S are the matrices with rows x_{i} and s_{i}, respectively. The goal of ICA is to reconstruct the source signals S and the mixing matrix A from the noisy measured data X. Cardoso et al. [15, 16] proposed a standard framework for solving this problem based on joint diagonalization. Here we follow the setup in [61] and apply the VJD algorithm to the ICA problem. The basic procedure is summarized as follows for the reader's convenience. First, normalize X such that each row of X has zero mean, and define the whitened matrix P_{W}=\sqrt{t}P with P computed from the SVD of X: X=Q\Sigma P^{T}. Second, compute the fourth-order cumulant tensor of the matrix P_{W} by

K(i,j,k,l)=pipjpkplpipjpkplpipkpjplpiplpjpk,K(i,j,k,l)=\langle p_{i}\circ p_{j}\circ p_{k}\circ p_{l}\rangle-\langle p_{i}\circ p_{j}\rangle\langle p_{k}\circ p_{l}\rangle-\langle p_{i}\circ p_{k}\rangle\langle p_{j}\circ p_{l}\rangle-\langle p_{i}\circ p_{l}\rangle\langle p_{j}\circ p_{k}\rangle\,, (5.6)

where p_{i} are the columns of P_{W}, the notation \circ represents the element-wise (Hadamard) product, and \langle\cdot\rangle denotes the expected value (average). Third, project the cumulant tensor K onto a set of m=n(n+1)/2 orthogonal eigenmatrices of dimension n\times n, n of which have a single diagonal entry equal to 1 and all other entries zero, while the others have two symmetrically placed off-diagonal entries equal to \sqrt{0.5} and all other entries zero. Denote the projected matrices by M_{1},\cdots,M_{m}. Fourth, apply the VJD Algorithm 2 to jointly diagonalize the m matrices M_{1},\cdots,M_{m}, and denote the output orthogonal matrix by U. Finally, reconstruct the source signals and the mixing matrix A by

S~=tUΣ1QTX,A~=XS~T(S~S~T)1.\tilde{S}=\sqrt{t}U\Sigma^{-1}Q^{T}X\,,\quad\tilde{A}=X\tilde{S}^{T}\left(\tilde{S}\tilde{S}^{T}\right)^{-1}\,.
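The pipeline just described can be sketched in Python as follows. The sketch assumes, as in our experiments, that the number of mixtures equals the number of sources (square, full-rank X), and joint_diagonalize is a hypothetical routine (e.g. a multi-matrix variant of the VJD Algorithm 2) returning the orthogonal matrix U; the projection of the cumulant tensor onto an eigenmatrix E is implemented in the natural way as M_{E}[a,b]=\sum_{c,d}K(a,b,c,d)E[c,d].

```python
import numpy as np

def ica_vjd(X, joint_diagonalize):
    """Sketch of the ICA pipeline above (square mixing assumed)."""
    n, t = X.shape
    X = X - X.mean(axis=1, keepdims=True)               # zero-mean rows
    Q, sig, Pt = np.linalg.svd(X, full_matrices=False)  # X = Q Sigma P^T
    PW = np.sqrt(t) * Pt.T                              # whitened data, t x n

    # fourth-order cumulant tensor (5.6), with <.> the average over the t samples
    C2 = PW.T @ PW / t
    K = np.einsum('ta,tb,tc,td->abcd', PW, PW, PW, PW) / t
    K -= (np.einsum('ab,cd->abcd', C2, C2)
          + np.einsum('ac,bd->abcd', C2, C2)
          + np.einsum('ad,bc->abcd', C2, C2))

    # n(n+1)/2 orthogonal eigenmatrices and the projected matrices M_1, ..., M_m
    Es = []
    for i in range(n):
        E = np.zeros((n, n)); E[i, i] = 1.0; Es.append(E)
    for i in range(n):
        for j in range(i + 1, n):
            E = np.zeros((n, n)); E[i, j] = E[j, i] = np.sqrt(0.5); Es.append(E)
    Ms = [np.einsum('abcd,cd->ab', K, E) for E in Es]

    U = joint_diagonalize(Ms)                           # hypothetical joint diagonalizer
    S = np.sqrt(t) * U @ np.diag(1.0 / sig) @ Q.T @ X   # reconstructed sources
    A = X @ S.T @ np.linalg.inv(S @ S.T)                # reconstructed mixing matrix
    return S, A
```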

To test our VJD algorithm in the ICA application, we consider the simulated one-dimensional and two-dimensional source signals in Figures 5(a) and 6(a). We assume that the noisy measured signals are given by X=AS+E, where A is a random orthogonal matrix and E is a random matrix with E_{ij} being i.i.d. Gaussian \mathcal{N}(0,\eta^{2}); see Figures 5(b) and 6(b). Then, in our experiments, there are 21 symmetric matrices of size 6\times 6 to be jointly diagonalized. We plot in Figures 5 and 6 the signals reconstructed by Jacobi Algorithm 1 and VJD Algorithm 2 to compare their performance, which clearly shows that both algorithms can recover the source signals effectively. In addition, we repeat the experiments for the same source signals with 100 samples of A and E and record the average time cost and the L^{2}-error \lVert X_{\rm recons}-X_{\rm true}\rVert_{\rm F} between the reconstructed signals X_{\rm recons} and the source signals X_{\rm true} in Tables 1 and 2. This again demonstrates the advantages of our VJD algorithm over the standard Jacobi one.

(a) Source signals
(b) Measured signals with noise level \eta=0.1
(c) Reconstruction by JADE
(d) Reconstruction by VJD
Figure 5: ICA for one-dimensional signals. The operator norms of the pairwise commutators \lVert[M_{i},M_{j}]\rVert of the matrices \{M_{j}\}_{j=1}^{m} to be jointly diagonalized range from 0.041 to 0.319.
(a) Source signals
(b) Measured signals with noise level \eta=0.01
(c) Reconstruction by JADE
(d) Reconstruction by VJD
Figure 6: ICA for two-dimensional signals. The operator norms of the pairwise commutators \lVert[M_{i},M_{j}]\rVert of the matrices \{M_{j}\}_{j=1}^{m} to be jointly diagonalized range from 0.112 to 0.669.
Avg. time (ms)  Avg. L^{2}-error
JADE  5.526  0.227
VJD  2.143  0.228
Table 1: Computing time and L^{2}-error for ICA in Figure 5
Avg. time (ms)  Avg. L^{2}-error
JADE  6.935  5.602\times 10^{-2}
VJD  2.493  4.813\times 10^{-2}
Table 2: Computing time and L^{2}-error for ICA in Figure 6

6 Conclusion

The approximate joint diagonalization of almost commuting matrices arises in many applications and is closely related to the celebrated Huaxin Lin's theorem. In this work, we have designed a vector-wise algorithmic framework for jointly diagonalizing a given pair of almost commuting matrices (A,B), which relies on solving the sub-optimization problems ({\rm OP}_{1}) and ({\rm OP}_{2}). To motivate our VJD algorithm (cf. Algorithm 2), we have first analyzed the stability of the local minimizers of \mathcal{L}(\Lambda,v) in ({\rm OP}_{1}), and then, with the help of random matrix theory, we have shown that for almost commuting random matrices, with high probability the functional \mathcal{L} has n local minimizers with almost orthogonal minimizing vectors (cf. Theorem 2.13). To efficiently solve the sub-optimization problems, we have proposed a projected Riemannian approximate Newton method (cf. Algorithm 3) for ({\rm OP}_{1}) and ({\rm OP}_{2}). Moreover, we have proved in Theorem 4.1 that in the almost commuting regime, Algorithm 3 for ({\rm OP}_{1}) exhibits linear convergence with a rate of O(\lVert[A,B]\rVert^{1/2}), transitioning to quadratic convergence when A and B commute. However, the convergence of the projected Newton method (i.e., Algorithm 3 for ({\rm OP}_{2})) is still open and needs further investigation. We have also proved that our VJD algorithm is stable with respect to iteration errors and noise.

Numerical results suggest that our vector-wise framework is more efficient than the popular Jacobi-type algorithm, especially for large-scale matrices, and can be applied to the ICA problem for signal reconstruction. We have numerically and theoretically examined the relation between the cost functional \mathcal{J}(D_{1},D_{2},U) in (OP) and the commutator [A,B] and found \mathcal{J}\sim\lVert[A,B]\rVert^{2}, which indicates that when \lVert[A,B]\rVert is small enough, the commuting matrices (A',B')=(UD_{1}U^{T},UD_{2}U^{T}) constructed from Algorithm 2 satisfy the estimate (1.1) and hence give a desired numerical solution for Lin's theorem. Finally, we would like to point out that although our approach works quite well for almost commuting matrices, it would still be important and useful to find an algorithm for Lin's theorem beyond the almost commuting regime.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The work of BL and JL is supported in part by the US National Science Foundation under award CCF-1910571 and by the US Department of Energy via grant DE-SC0019449. JL would like to thank Terry Loring for helpful discussions regarding the problem of almost commuting matrices.

Appendix A Proof of Proposition 5.1

Proof of Proposition 5.1.

We recall the construction of the diagonal matrices D1,D2D_{1},D_{2} in Algorithm 2:

D1=diag(v1,Av1,,vn,Avn),D2=diag(v1,Bv1,,vn,Bvn),\displaystyle D_{1}={\rm diag}(\langle v_{1},Av_{1}\rangle,\ldots,\langle v_{n},Av_{n}\rangle)\,,\quad D_{2}={\rm diag}(\langle v_{1},Bv_{1}\rangle,\ldots,\langle v_{n},Bv_{n}\rangle)\,,

where viv_{i} is the ii-th column of the orthogonal matrix UU. Then it is easy to see that

\displaystyle\mathcal{J}(D_{1},D_{2},U)=\lVert U^{T}AU-\mathrm{diag}\left(U^{T}AU\right)\rVert^{2}+\lVert U^{T}BU-\mathrm{diag}\left(U^{T}BU\right)\rVert^{2}=:\mathcal{J}(U)\,.

Here and in what follows, \mathrm{diag}\left(M\right) denotes the diagonal part of an n\times n matrix M. Hence, to bound \mathcal{J}(D_{1},D_{2},U) from below, it suffices to bound from below the quantity

min{𝒥(U);UO(n)}.\displaystyle\min\left\{\mathcal{J}(U)\,;\ U\in{\rm O}(n)\right\}\,. (A.1)

Let V be a minimizer of the optimization problem (A.1), and write

VTAV=DA+X,VTBV=DB+Y.V^{T}AV=D_{A}+X\,,\quad V^{T}BV=D_{B}+Y\,. (A.2)

where D_{A}:={\rm diag}(V^{T}AV) and D_{B}:={\rm diag}(V^{T}BV). Noting that \lVert A\rVert=\max_{x\in\mathbb{S}^{n-1}}|\langle x,Ax\rangle| holds for A\in{\rm S}(n), we readily have, by definition,

DAA1,DBB1.\lVert D_{A}\rVert\leq\lVert A\rVert\leq 1\,,\quad\lVert D_{B}\rVert\leq\lVert B\rVert\leq 1\,. (A.3)

The decomposition (A.2) can be rewritten as A=VD_{A}V^{T}+VXV^{T} and B=VD_{B}V^{T}+VYV^{T}, which implies

AB=VDADBVT+VDAYVT+VXDBVT+VXYVT,BA=VDBDAVT+VDBXVT+VYDAVT+VYXVT.\begin{split}&AB=VD_{A}D_{B}V^{T}+VD_{A}YV^{T}+VXD_{B}V^{T}+VXYV^{T},\\ &BA=VD_{B}D_{A}V^{T}+VD_{B}XV^{T}+VYD_{A}V^{T}+VYXV^{T}.\end{split}

Then, a direct computation gives

[A,B]=ABBA=V([DA,DB]+[DA,Y]+[X,DB]+[X,Y])VT=V([DA,Y]+[X,DB]+[X,Y])VT=V([DA+X,Y]+[X,DB])VT.\begin{split}[A,B]&=AB-BA=V\left([D_{A},D_{B}]+[D_{A},Y]+[X,D_{B}]+[X,Y]\right)V^{T}\\ &=V\left([D_{A},Y]+[X,D_{B}]+[X,Y]\right)V^{T}\\ &=V\left([D_{A}+X,Y]+[X,D_{B}]\right)V^{T}\,.\end{split} (A.4)

By the triangle inequality and the unitary invariance of operator norm, there holds

[A,B][DA+X,Y]+[X,DB],\lVert[A,B]\rVert\leq\lVert[D_{A}+X,Y]\rVert+\lVert[X,D_{B}]\rVert\,,

which further yields, by a simple estimate using (A.3),

[A,B]\displaystyle\lVert[A,B]\rVert 2DA+XY+2XDB2(X+Y)22𝒥(V).\displaystyle\leq 2\lVert D_{A}+X\rVert\lVert Y\rVert+2\lVert X\rVert\lVert D_{B}\rVert\leq 2\left(\lVert X\rVert+\lVert Y\rVert\right)\leq 2\sqrt{2\mathcal{J}(V)}\,.

The proof is complete.

References

  • [1] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
  • [2] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In 2006 IEEE international conference on acoustics speech and signal processing proceedings, volume 5, pages V–V. IEEE, 2006.
  • [3] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • [4] B. Afsari and P. S. Krishnaprasad. Some gradient based joint diagonalization methods for ICA. In Independent Component Analysis and Blind Signal Separation: Fifth International Conference, ICA 2004, Granada, Spain, September 22-24, 2004. Proceedings 5, pages 437–444. Springer, 2004.
  • [5] G. W. Anderson, A. Guionnet, and O. Zeitouni. An introduction to random matrices. Cambridge Studies in Advanced Mathematics. Cambridge university press, 2010.
  • [6] A. Barachant et al. pyriemann: Biosignals classification with Riemannian geometry. https://pyriemann.readthedocs.io/en/latest/index.html, 2015. [Online].
  • [7] R. Bergmann and R. Herzog. Intrinsic formulation of KKT conditions and constraint qualifications on smooth manifolds. SIAM Journal on Optimization, 29(4):2423–2444, 2019.
  • [8] M. Berry and A. Sameh. Multiprocessor Jacobi algorithms for dense symmetric eigenvalue and singular value decompositions. Technical report, Illinois Univ., Urbana (USA). Center for Supercomputing Research and Development, 1986.
  • [9] R. Bhatia. Spectral variation, normal matrices, and Finsler geometry. Mathematical Intelligencer, 29(3):41–46, 2007.
  • [10] M. A. A. Bortoloti, T. A. Fernandes, and O. P. Ferreira. An efficient damped Newton-type algorithm with globalization strategy on Riemannian manifolds. Journal of Computational and Applied Mathematics, 403:113853, 2022.
  • [11] F. Bouchard, B. Afsari, J. Malick, and M. Congedo. Approximate joint diagonalization with Riemannian optimization on the general linear group. SIAM Journal on Matrix Analysis and Applications, 41(1):152–170, 2020.
  • [12] N. Boumal. An introduction to optimization on smooth manifolds. To appear with Cambridge University Press, Apr 2022.
  • [13] M. M. Bronstein, K. Glashoff, and T. A. Loring. Making Laplacians commute. arXiv preprint arXiv:1307.6549, 2013.
  • [14] A. Bunse-Gerstner, R. Byers, and V. Mehrmann. Numerical methods for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 14(4):927–949, 1993.
  • [15] J.-F. Cardoso. Multidimensional independent component analysis. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), volume 4, pages 1941–1944. IEEE, 1998.
  • [16] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 362–370. IET, 1993.
  • [17] J.-F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161–164, 1996.
  • [18] E. de Klerk, C. Dobre, and D. V. Pasechnik. Numerical block diagonalization of matrix *-algebras with application to semidefinite programming. Mathematical Programming, 129(1):91–111, 2011.
  • [19] L. De Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM Journal on Matrix Analysis and Applications, 28(3):642–666, 2006.
  • [20] P. Eberlein. Solution to the complex eigenproblem by a norm reducing Jacobi type method. Numerische Mathematik, 14(3):232–245, 1970.
  • [21] P. J. Eberlein. On one-sided Jacobi methods for parallel computation. SIAM Journal on Algebraic Discrete Methods, 8(4):790–796, 1987.
  • [22] A. El Ghazi, S. El Hajji, L. Giraud, and S. Gratton. Newton’s method for the common eigenvector problem. Journal of Computational and Applied Mathematics, 219(2):398–407, 2008.
  • [23] E. Evert and L. De Lathauwer. Guarantees for existence of a best canonical polyadic approximation of a noisy low-rank tensor. SIAM Journal on Matrix Analysis and Applications, 43(1):328–369, 2022.
  • [24] D. Eynard, K. Glashoff, M. M. Bronstein, and A. M. Bronstein. Multimodal diffusion geometry by joint diagonalization of Laplacians. arXiv preprint arXiv:1209.2295, 2012.
  • [25] D.-Z. Feng, X.-D. Zhang, and Z. Bao. An efficient multistage decomposition approach for independent components. Signal Processing, 83(1):181–197, 2003.
  • [26] O. P. Ferreira, A. N. Iusem, and S. Z. Németh. Projections onto convex sets on the sphere. Journal of Global Optimization, 57(3):663–676, 2013.
  • [27] O. P. Ferreira, A. N. Iusem, and S. Z. Németh. Concepts and techniques of optimization on the sphere. Top, 22(3):1148–1170, 2014.
  • [28] O. P. Ferreira and B. F. Svaiter. Kantorovich’s theorem on Newton’s method in Riemannian manifolds. Journal of Complexity, 18(1):304–329, 2002.
  • [29] M. Fornasier, H. Huang, L. Pareschi, and P. Sünnen. Consensus-based optimization on the sphere: Convergence to global minimizers and machine learning. Journal of Machine Learning Research, 22(237):1–55, 2021.
  • [30] P. Friis and M. Rørdam. Almost commuting self-adjoint matrices – a short proof of Huaxin Lin’s theorem. Journal für die reine und angewandte Mathematik, 479:121–131, 1996.
  • [31] B. Gao, X. Liu, X. Chen, and Y.-x. Yuan. A new first-order algorithmic framework for optimization problems with orthogonality constraints. SIAM Journal on Optimization, 28(1):302–332, 2018.
  • [32] B. Gao, X. Liu, and Y.-x. Yuan. Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM Journal on Scientific Computing, 41(3):A1949–A1983, 2019.
  • [33] K. Glashoff and M. M. Bronstein. Matrix commutators: their asymptotic metric properties and relation to approximate joint diagonalization. Linear Algebra and its Applications, 439(8):2503–2513, 2013.
  • [34] H. H. Goldstine and L. P. Horwitz. A procedure for the diagonalization of normal matrices. Journal of the ACM (JACM), 6(2):176–195, 1959.
  • [35] G. H. Golub and C. F. Van Loan. Matrix computations. JHU Press, 2013.
  • [36] F. Gygi, J.-L. Fattebert, and E. Schwegler. Computation of maximally localized Wannier functions using a simultaneous diagonalization algorithm. Computer Physics Communications, 155(1):1–6, 2003.
  • [37] S.-Y. Ha, M. Kang, D. Kim, J. Kim, and I. Yang. Stochastic consensus dynamics for nonconvex optimization on the Stiefel manifold: Mean-field limit and convergence. Mathematical Models and Methods in Applied Sciences, 32(03):533–617, 2022.
  • [38] P. R. Halmos. Some unsolved problems of unknown depth about operators on Hilbert space. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 76(1):67–76, 1976.
  • [39] M. B. Hastings. Topology and phases in fermionic systems. Journal of Statistical Mechanics: Theory and Experiment, 2008(01):L01001, 2008.
  • [40] M. B. Hastings. Making almost commuting matrices commute. Communications in Mathematical Physics, 291(2):321–345, 2009.
  • [41] M. B. Hastings and T. A. Loring. Almost commuting matrices, localized Wannier functions, and the quantum Hall effect. Journal of Mathematical Physics, 51(1):015214, 2010.
  • [42] M. B. Hastings and T. A. Loring. Topological insulators and C*-algebras: Theory and numerical practice. Annals of Physics, 326(7):1699–1759, 2011.
  • [43] D. Herrera. Constructive Approaches to Lin’s Theorem for Almost Commuting Matrices. arXiv preprint arXiv:2011.11800, 2020.
  • [44] I. Kachkovskiy and Y. Safarov. Distance to normal elements in C*-algebras of real rank zero. Journal of the American Mathematical Society, 29(1):61–80, 2016.
  • [45] J. Kim, M. Kang, D. Kim, S.-Y. Ha, and I. Yang. A stochastic consensus method for nonconvex optimization on the Stiefel manifold. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1050–1057. IEEE, 2020.
  • [46] A. Kovnatsky, M. M. Bronstein, A. M. Bronstein, K. Glashoff, and R. Kimmel. Coupled quasi-harmonic bases. In Computer Graphics Forum, volume 32, pages 439–448. Wiley Online Library, 2013.
  • [47] X.-b. Li, N.-j. Huang, Q. H. Ansari, and J.-C. Yao. Convergence rate of descent method with new inexact line-search on Riemannian manifolds. Journal of Optimization Theory and Applications, 180(3):830–854, 2019.
  • [48] X.-L. Li and X.-D. Zhang. Nonorthogonal joint diagonalization free of degenerate solution. IEEE Transactions on Signal Processing, 55(5):1803–1814, 2007.
  • [49] H. Lin. Almost commuting selfadjoint matrices and applications. Fields Inst. Commun, 13:193–233, 1997.
  • [50] C. Liu and N. Boumal. Simple algorithms for optimization on Riemannian manifolds with constraints. Applied Mathematics & Optimization, 82(3):949–981, 2020.
  • [51] T. A. Loring and A. P. Sørensen. Almost commuting self-adjoint matrices: the real and self-dual cases. Reviews in Mathematical Physics, 28(07):1650017, 2016.
  • [52] T. Maehara and K. Murota. A numerical algorithm for block-diagonal decomposition of matrix *-algebras with general irreducible components. Japan Journal of Industrial and Applied Mathematics, 27(2):263–293, 2010.
  • [53] T. Maehara and K. Murota. Algorithm for error-controlled simultaneous block-diagonalization of matrices. SIAM Journal on Matrix Analysis and Applications, 32(2):605–620, 2011.
  • [54] K. Murota, Y. Kanno, M. Kojima, and S. Kojima. A numerical algorithm for block-diagonal decomposition of matrix *-algebras with application to semidefinite programming. Japan Journal of Industrial and Applied Mathematics, 27(1):125–160, 2010.
  • [55] J. v. Neumann and E. P. Wigner. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. In The Collected Works of Eugene Paul Wigner, pages 294–297. Springer, 1993.
  • [56] H. Nguyen, T. Tao, and V. Vu. Random matrices: tail bounds for gaps between eigenvalues. Probability Theory and Related Fields, 167(3):777–816, 2017.
  • [57] Y. Ogata. Approximating macroscopic observables in quantum spin systems with commuting matrices. Journal of Functional Analysis, 264(9):2005–2033, 2013.
  • [58] F. Rellich and J. Berkowitz. Perturbation theory of eigenvalue problems. CRC Press, 1969.
  • [59] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.
  • [60] P. Rosenthal. Are almost commuting matrices near commuting matrices? The American Mathematical Monthly, 76(8):925–926, 1969.
  • [61] D. N. Rutledge and D. J.-R. Bouveresse. Independent components analysis with the JADE algorithm. TrAC Trends in Analytical Chemistry, 50:22–32, 2013.
  • [62] H. Sato. Riemannian Newton-type methods for joint diagonalization on the Stiefel manifold with application to independent component analysis. Optimization, 66(12):2211–2231, 2017.
  • [63] P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966.
  • [64] M. Sørensen, L. D. Lathauwer, P. Comon, S. Icart, and L. Deneire. Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM Journal on Matrix Analysis and Applications, 33(4):1190–1213, 2012.
  • [65] T. Tao. Topics in random matrix theory, volume 132. American Mathematical Society, 2012.
  • [66] T. Tao and V. Vu. Random matrices have simple spectrum. Combinatorica, 37(3):539–553, 2017.
  • [67] M. Teytel. How rare are multiple eigenvalues? Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 52(8):917–934, 1999.
  • [68] F. Theis. Towards a general independent subspace analysis. Advances in Neural Information Processing Systems, 19, 2006.
  • [69] F. J. Theis, T. P. Cason, and P.-A. Absil. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In Independent Component Analysis and Signal Separation: 8th International Conference, ICA 2009, Paraty, Brazil, March 15-18, 2009. Proceedings 8, pages 354–361. Springer, 2009.
  • [70] K. Uhlenbeck. Generic properties of eigenfunctions. American Journal of Mathematics, 98(4):1059–1078, 1976.
  • [71] A.-J. Van Der Veen. Joint diagonalization via subspace fitting techniques. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 5, pages 2773–2776. IEEE, 2001.
  • [72] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142:397–434, 2013.
  • [73] W. H. Yang, L.-H. Zhang, and R. Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization, 10(2):415–434, 2014.
  • [74] Y. Yang. Globally convergent optimization algorithms on Riemannian manifolds: Uniform framework for unconstrained and constrained optimization. Journal of Optimization Theory and Applications, 132(2):245–265, 2007.
  • [75] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.
  • [76] H. Zhang, A. Milzarek, Z. Wen, and W. Yin. On the geometric analysis of a quartic–quadratic optimization problem under a spherical constraint. Mathematical Programming, 195(1-2):421–473, 2022.