
Vector-wise Joint Diagonalization of Almost Commuting Matrices

Bowen Li bowen.li200@duke.edu Jianfeng Lu jianfeng@math.duke.edu Ziang Yu yuziang@sjtu.edu.cn Department of Mathematics, Duke University, United States Department of Chemistry and Department of Physics, Duke University, United States School of Mathematical Sciences, Shanghai Jiao Tong University, China
Abstract

This work aims to numerically construct exactly commuting matrices close to given almost commuting ones, which is equivalent to the joint approximate diagonalization problem. We first prove that almost commuting matrices generically have approximate common eigenvectors that are almost orthogonal to each other. Based on this key observation, we propose an efficient and robust vector-wise joint diagonalization (VJD) algorithm, which constructs the orthogonal similarity transform by sequentially finding these approximate common eigenvectors. In doing so, we consider sub-optimization problems over the unit sphere, for which we present a Riemannian approximate Newton method with a rigorous convergence analysis. We also discuss the numerical stability of the proposed VJD algorithm. Numerical experiments with applications in the independent component analysis are provided to reveal the relation with Huaxin Lin’s theorem and to demonstrate that our method compares favorably with the state-of-the-art Jacobi-type joint diagonalization algorithm.

keywords:
Almost commuting matrices , Huaxin Lin’s theorem , Joint diagonalizability , Riemannian approximate Newton method , Vector-wise algorithm
MSC:
[2020] 15A18 , 15A27 , 15A45 , 15A60 , 58C15 , 65F15 , 65F35

1 Introduction

It is a fundamental result in linear algebra that a set of Hermitian matrices are commuting if and only if they are jointly diagonalizable by a unitary matrix. A closely related and long-standing question [38, 60], dating back to the 1950s, was whether almost commuting matrices are necessarily close to commuting pairs. It was answered affirmatively by Huaxin Lin [49] in 1995, who gave the following celebrated theorem: for any \epsilon>0, there exists \delta>0 such that if A,B\in\mathbb{C}^{n\times n} are Hermitian matrices satisfying \lVert A\rVert,\lVert B\rVert\leq 1 and \lVert AB-BA\rVert<\delta, then there exist commuting Hermitian matrices A^{\prime},B^{\prime} such that \lVert A^{\prime}-A\rVert,\lVert B^{\prime}-B\rVert<\epsilon. A simplified proof of Lin's result was provided soon after by Friis and Rørdam [30], and an analogous result in the real symmetric case was recently given by Loring and Sørensen [51]. However, their arguments were abstract and did not show how to construct the matrices A^{\prime},B^{\prime}, which would be very useful from the application point of view [51]. In 2008, Hastings [40] suggested a constructive approach to finding the nearby commuting matrices A^{\prime} and B^{\prime} and discussed the dependence of \delta on \epsilon (see also [43] for extended discussions of Hastings' method). Later, Kachkovskiy and Safarov, in their seminal work [44], revisited this problem in the framework of operator algebra theory and obtained the optimal dependence of \delta on \epsilon by an argument different from the one in [40]. We state their result below for future reference. The interested reader is referred to [43, 44] for a more complete historical review of Lin's theorem and its development. A list of notations used throughout this work is given at the end of this section.

Theorem 1.1 ([44]).

For any two self-adjoint operators A,B on a Hilbert space H, there exists a pair of commuting self-adjoint operators A^{\prime},B^{\prime} such that

\lVert A-A^{\prime}\rVert+\lVert B-B^{\prime}\rVert\leq C\lVert[A,B]\rVert^{1/2}\,, (1.1)

where [A,B]=AB-BA is the commutator and the constant C is independent of A,B and the dimension of H.
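The quantities in Theorem 1.1 are easy to monitor numerically. The following minimal NumPy sketch (an illustration only; the perturbation size is an arbitrary choice) builds a commuting pair (A_{*},B_{*}), perturbs it to obtain (A,B), and reports the commutator norm \lVert[A,B]\rVert together with \lVert A-A_{*}\rVert+\lVert B-B_{*}\rVert, i.e., the distance to one admissible commuting pair.

import numpy as np

rng = np.random.default_rng(0)
n = 50

# A commuting pair: two diagonal matrices conjugated by the same orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A_star = Q @ np.diag(rng.standard_normal(n)) @ Q.T
B_star = Q @ np.diag(rng.standard_normal(n)) @ Q.T

def sym(M):
    return (M + M.T) / 2

delta = 1e-4  # size of the symmetric perturbations (arbitrary illustrative choice)
A = A_star + delta * sym(rng.standard_normal((n, n)))
B = B_star + delta * sym(rng.standard_normal((n, n)))

comm = np.linalg.norm(A @ B - B @ A, 2)                       # ||[A, B]|| (spectral norm)
dist = np.linalg.norm(A - A_star, 2) + np.linalg.norm(B - B_star, 2)
print(f"||[A,B]|| = {comm:.2e},  ||A-A_*|| + ||B-B_*|| = {dist:.2e}")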

The theory of almost commuting matrices and Lin’s theorem find numerous applications in many fields, e.g., quantum mechanics [39, 41, 42, 57], computer graphics [13, 46], and signal processing [16, 25, 61]. In particular, Hastings used Lin’s theorem to claim the existence of localized Wannier functions for non-interacting fermions [39], which inspired a series of subsequent works [41, 42]. For the application in computer science, there is a famous algorithm called Joint Approximate Diagonalization (JADE) developed by Cardoso and Souloumiac for blind source separation [16, 17]. Such an algorithm has also been used in computational quantum chemistry to generate Wannier functions [36]. The machine learning-related applications are recently discussed in [33, 46]. While numerous attempts have been made to constructively prove Lin’s theorem and explore its applications in various areas, a numerically feasible algorithm for finding A,BA^{\prime},B^{\prime} in Theorem 1.1 remains elusive.

In this work, our focus is on designing a fast algorithm for Theorem 1.1 in the almost commuting regime. Given matrices A,BS(n)A,B\in{\rm S}(n) with [A,B]1\lVert[A,B]\rVert\ll 1, we consider the following equivalent optimization problem:

minD1,D2diag(n),UO(n)𝒥(D1,D2,U):=UTAUD12+UTBUD22.\begin{split}&\min_{D_{1},\,D_{2}\in\mathrm{diag}\left(n\right),\,U\in O(n)}\mathcal{J}(D_{1},D_{2},U):=\lVert U^{T}AU-D_{1}\rVert^{2}+\lVert U^{T}BU-D_{2}\rVert^{2}.\end{split} (OP)

In view of practical applications [24, 33, 36], we will focus on real matrices, but all the results and discussions can be adapted to the complex case easily. It may be clear but worth emphasizing that any approximate solution to (OP) gives a potential pair of commuting matrices (A^{\prime},B^{\prime})=(UD_{1}U^{T},UD_{2}U^{T}) satisfying the estimate (1.1). The problem (OP) is generally very difficult to solve by existing manifold optimization techniques, since the operator norm involved is closely related to Finsler geometry [9], which is a rather uncommon structure in the optimization community. Even if we replace the operator norm with the Frobenius one, (OP) is still a challenging problem, due to the non-convexity of the objective function and the Stiefel manifold constraint. One old but popular method is the Jacobi-type algorithm presented in [14], with ideas dating back to [20, 34]. This algorithm minimizes the off-diagonal objective function: for Hermitian matrices A and B,

J(A,B,U)={\rm off}(U^{*}AU)+{\rm off}(U^{*}BU)\,,

by a sequence of unitary transformations constructed from Givens rotation matrices:

R(i,j,c,s):=I+(c-1)e_{i}e_{i}^{T}-\bar{s}e_{i}e_{j}^{T}+se_{j}e_{i}^{T}+(\bar{c}-1)e_{j}e_{j}^{T}\,,

where U is a unitary matrix with conjugate transpose U^{*}, {\rm off}(A):=\sum_{i\neq j}|a_{ij}|^{2} measures the off-diagonal part of a square matrix A, and the parameters c,s\in\mathbb{C} are the Jacobi angles satisfying |c|^{2}+|s|^{2}=1. The optimal c and s are derived in closed form in [17, Theorem 1.1]. For the readers' convenience, we briefly summarize the main procedures of the Jacobi algorithm [14] in the following pseudocode, together with a small numerical sketch after Algorithm 1.

Input: \epsilon>0; Hermitian commuting matrices A,B
Output: unitary matrix U such that {\rm off}(U^{*}AU)+{\rm off}(U^{*}BU)\leq\epsilon(\lVert A\rVert_{\rm F}+\lVert B\rVert_{\rm F})

U=I;
while {\rm off}(A)+{\rm off}(B)>\epsilon(\lVert A\rVert_{\rm F}+\lVert B\rVert_{\rm F}) do
      for i=1,\ldots,n and j=i+1,\ldots,n do
            set parameters c and s as described in [17]
            R=R(i,j,c,s)
            U=UR; A=R^{*}AR; B=R^{*}BR
      end for
end while
Algorithm 1 Jacobi algorithm for simultaneous diagonalization [14, 17]
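For illustration, the following is a compact NumPy sketch of Algorithm 1 for real symmetric inputs. The closed-form Jacobi angles below are the standard real-symmetric specialization of the construction in [17] (top eigenvector of a 2x2 Gram matrix); this specialization is an assumption of the sketch, and the complex case requires the three-component form given in [17].

import numpy as np

def off(A):
    # off(A): sum of squared off-diagonal entries.
    return float(np.sum(A * A) - np.sum(np.diag(A) ** 2))

def jacobi_angles(A, B, i, j):
    # Real-case closed-form Jacobi angles (c, s) for the (i, j) plane.
    h = np.array([[A[i, i] - A[j, j], 2.0 * A[i, j]],
                  [B[i, i] - B[j, j], 2.0 * B[i, j]]])
    G = h.T @ h
    if np.allclose(G, 0.0):
        return 1.0, 0.0                       # plane already jointly diagonal
    _, V = np.linalg.eigh(G)
    x, y = V[:, -1]                           # unit eigenvector of the largest eigenvalue
    if x < 0:
        x, y = -x, -y
    c = np.sqrt((x + 1.0) / 2.0)
    s = y / np.sqrt(2.0 * (x + 1.0))
    return c, s

def jacobi_simultaneous_diag(A, B, eps=1e-12, max_sweeps=50):
    # One possible implementation of Algorithm 1 (real symmetric case).
    A, B = A.astype(float).copy(), B.astype(float).copy()
    n = A.shape[0]
    U = np.eye(n)
    thresh = eps * (np.linalg.norm(A, 'fro') + np.linalg.norm(B, 'fro'))
    for _ in range(max_sweeps):
        if off(A) + off(B) <= thresh:
            break
        for i in range(n):
            for j in range(i + 1, n):
                c, s = jacobi_angles(A, B, i, j)
                R = np.eye(n)
                R[i, i] = c; R[j, j] = c
                R[i, j] = -s; R[j, i] = s     # Givens rotation R(i, j, c, s)
                A, B, U = R.T @ A @ R, R.T @ B @ R, U @ R
    return U, A, B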

Algorithm 1 is known to have the quadratic asymptotic convergence rate [14] and can be efficiently implemented in the parallel framework as discussed in [8, 21]. Another advantage of the Jacobi algorithm is its numerical stability against rounding errors. It is worth noting that Theis [68] extended the above Jacobi algorithm for the joint block diagonalization, but the output block decomposition may not be the finest one. For the finest block diagonalization, a class of algorithms based on the theory of matrix *-algebra has been proposed in [18] and developed in [52, 53, 54]. Other joint diagonalization techniques in various problem settings include a gradient-based algorithm for approximately diagonalizing almost commuting Laplacians [13], subspace fitting methods with non-orthogonal transformations [71, 75], and the fast approximate joint diagonalization free of trivial solutions [48], just to name a few. In particular, Riemannian optimization algorithms have been recently explored for the joint diagonalization problem with or without orthogonality constraints. Theis et al. [69] applied the Riemannian trust-region method to the joint diagonalization on the Stiefel manifold. Within the same framework, an efficient Riemannian Newton-type method was investigated in [62] by Sato. More recently, Bouchard et al. [11] proposed a unified Riemannian optimization framework for the approximate joint diagonalization on the general linear group, which generalizes previous algorithms [2, 4].

The joint matrix diagonalization problem is also closely related to the canonical polyadic decomposition (CPD). In [19], De Lathauwer derived a sufficient condition for the uniqueness of CPD and showed that the decomposition can be obtained by a simultaneous diagonalization by congruence. Sørensen [64] proved that for a given tensor, the optimal low-rank CPD approximation exists with one of the matrix factors being column-wise orthonormal. In particular, numerical algorithms based on joint congruence diagonalization and alternating least squares are also explored in [64] for this constrained CPD. We also mention that in a recent work [23] of Evert and De Lathauwer, they discussed the optimal CPD of a noisy low-rank tensor and connected it to a joint generalized eigenvalue problem.

In this work, we propose a practically efficient vector-wise algorithm for finding the approximate solution to the optimization problem (OP) in the case where [A,B]1\lVert[A,B]\rVert\ll 1, and present the associated error analysis. Numerical experiments demonstrate that our algorithm has comparable or even better performance than that of the state-of-the-art Jacobi-type method while significantly reducing computational time. Moreover, we numerically compute the error between the given almost commuting matrices (A,B)(A,B) and the commuting ones (A,B)(A^{\prime},B^{\prime}) constructed by our algorithm. It turns out that in the almost commuting regime, AA+BB\lVert A-A^{\prime}\rVert+\lVert B-B^{\prime}\rVert is of the order [A,B]\lVert[A,B]\rVert, and thus the constructed (A,B)(A^{\prime},B^{\prime}) satisfies the upper bound (1.1) in Lin’s theorem. Some applications in independent component analysis are also provided to justify the effectiveness of our method.

We now summarize the main idea and contributions of this work in detail. Our method is motivated by the simple fact that commuting matrices A,B can be simultaneously diagonalized by an orthogonal matrix whose columns consist of common orthonormal eigenvectors of A,B, and these eigenvectors can be characterized by the global minimizers of the following optimization problem:

\min_{\Lambda=(\lambda,\mu)\in\mathbb{R}^{2},\,v\in\mathbb{S}^{n-1}}\mathcal{L}(\Lambda,v):=\lVert Av-\lambda v\rVert^{2}+\lVert Bv-\mu v\rVert^{2}\,. ({\rm OP}_{1})

Note that \mathcal{L}(\Lambda,v) admits n minimizers (\Lambda_{j},v_{j}) with \mathcal{L}(\Lambda_{j},v_{j})=0 and the v_{j} being mutually orthonormal. This observation leads to an ideal approach for diagonalizing commuting matrices (A,B). Suppose that we have successfully found the first i minimizers \{(\Lambda_{j},v_{j})\}_{j=1}^{i} of \mathcal{L}(\Lambda,v). To determine the (i+1)-th minimizer, we solve ({\rm OP}_{1}) with an additional orthogonality constraint:

\min\left\{\mathcal{L}(\Lambda,v)\,;\ \Lambda\in\mathbb{R}^{2}\,,\ v\in\mathbb{S}^{n-1}\,,\ \langle v,v_{j}\rangle=0\,,\ 1\leq j\leq i\right\}\,. (1.2)

However, this elegant approach may encounter practical challenges due to numerical noise and errors. Since we cannot solve the optimization problem (1.2) exactly at each step, the numerical errors will accumulate, so that the constraints \langle v,v_{j}\rangle=0 may exclude the minimizers of the cost functional \mathcal{L}, which leads to the failure of the algorithm to find a complete orthonormal set of common eigenvectors. Moreover, the input commuting matrices are generally subject to numerical noise or rounding errors. As proved in Lemma 2.3, any such small perturbation could destroy all the common eigenvectors. Consequently, the functional \mathcal{L} might not exhibit sufficiently many local minimizers, in which case the algorithm will also fail.

Our first contribution is Theorem 2.13, which proves that in the almost commuting regime, \mathcal{L}(\Lambda,v) generically has n local minimizers \{(\Lambda_{j},v_{j})\}_{j=1}^{n} with the v_{j} being almost orthogonal to each other (so that they may be viewed as approximate common eigenvectors). For this, we provide a stability result in Theorem 2.6 by using perturbation techniques. Then, based on the theory of the Gaussian orthogonal ensemble (GOE) [5, 65], we show that the common eigenspaces of commuting matrices are generically one-dimensional from both topological and probabilistic points of view (cf. Lemma 2.9 and Corollary 2.11), similar to the well-known genericity of simple eigenvalues of self-adjoint operators [66, 67, 70]. Combining these results with the estimates of the spectral gaps of random matrices in [56], we immediately conclude the desired claim in Theorem 2.13.

In view of this generic property of the local minimizers of \mathcal{L}(\Lambda,v), we consider a relaxed version of the aforementioned approach that finds the columns of the orthogonal matrix U and the diagonal entries of D_{1},D_{2} (given by the pairs \Lambda_{i}) sequentially, called vector-wise joint diagonalization (VJD); see Algorithm 2. The core idea behind VJD is to relax the exact orthogonality constraint \langle v,v_{j}\rangle=0 in (1.2) to |\langle v,v_{j}\rangle|\leq\varepsilon. This relaxation accommodates possible numerical errors and the non-commutativity of the input matrices. However, enforcing constraints of the form |\langle v,v_{j}\rangle|\leq\varepsilon, 1\leq j\leq i, can be numerically challenging. Therefore, in practice, we adopt the equivalent constraint:

{\rm dist}(v,S_{i})\leq\sqrt{\varepsilon}\,, (1.3)

where S_{i}={\rm span}\{v_{1},\ldots,v_{i}\}^{\perp}\cap\mathbb{S}^{n-1} and \varepsilon is the error tolerance parameter. Notably, the form (1.3) allows for the straightforward computation of the associated projection, as shown in Lemma 2.4. The VJD algorithm proceeds as follows; see Section 3 for more details. We first solve ({\rm OP}_{1}) to find (\lambda_{1},\mu_{1},v_{1}). Then, for j\geq 2, we consider the relaxed version of (1.2):

\min\left\{\mathcal{L}(\Lambda,v)\,;\ \Lambda=(\lambda,\mu)\in\mathbb{R}^{2}\,,\ v\in\mathbb{S}^{n-1}\,,\ {\rm dist}(v,S_{j-1})\leq\sqrt{\varepsilon}\right\}\,, ({\rm OP}_{2})

which gives (\lambda_{j},\mu_{j},v_{j}). This iterative process culminates in the construction of an almost orthogonal matrix V=[v_{1},\ldots,v_{n}] and two diagonal matrices D_{1}={\rm diag}(\lambda_{1},\ldots,\lambda_{n}) and D_{2}={\rm diag}(\mu_{1},\ldots,\mu_{n}). The algorithm ends with a classical nearest orthogonal matrix problem (i.e., the orthogonal Procrustes problem [63]):

\min\{\lVert U-V\rVert\,;\ U\in{\rm O}(n)\}\,, (1.4)

where \lVert\cdot\rVert is either the spectral norm op\lVert\cdot\rVert_{{\rm op}} or the Frobenius norm F\lVert\cdot\rVert_{{\rm F}}. In both cases, the problem (1.4) has an analytic solution U=U1U2TU=U_{1}U_{2}^{T}, with V=U1ΣU2TV=U_{1}\Sigma U_{2}^{T} being the singular value decomposition (SVD) of VV. For solving sub-optimization problems (OP1{\rm OP}_{1}) and (OP2{\rm OP}_{2}), we derive a Riemannian approximate Newton-type method with the approximate Hessian matrix being structurally simple and positive semidefinite; see Algorithm 3. Our method exploits the second-order information of the cost function and enjoys stability with low computational cost. We would like to remark that there might be some other algorithms that are also potentially suitable for our optimization problems, for example, the trust-region method [1], the column-wise block coordinate descent method [31, 32], the Cayley transform-based optimization [72], and the consensus-based optimization method [29, 37, 45], but the detailed discussion about these approaches is beyond the scope of this work.
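Returning to the Procrustes step (1.4): given the SVD of V, its solution is a one-liner, as in the following minimal NumPy sketch.

import numpy as np

def nearest_orthogonal(V):
    # Solve min_{U in O(n)} ||U - V|| via the SVD V = U1 @ diag(S) @ U2^T.
    U1, S, U2t = np.linalg.svd(V)
    return U1 @ U2t

# Example: a slightly non-orthogonal matrix is snapped back to O(n).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V = Q + 1e-3 * rng.standard_normal((5, 5))
U = nearest_orthogonal(V)
print(np.linalg.norm(U.T @ U - np.eye(5)))   # ~ machine precision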

Our third contribution lies in establishing the global convergence of the proposed Riemannian approximate Newton method for ({\rm OP}_{1}) and demonstrating the numerical stability of our VJD algorithm. By leveraging techniques from [10, 47, 74] for Newton-type methods on manifolds, we prove in Theorem 4.1 that Algorithm 3 for ({\rm OP}_{1}) converges linearly with convergence rate O(\lVert[A,B]\rVert^{1/2})\ll 1, and quadratically if [A,B]=0. However, it is important to note that our optimization process, owing to its sequential nature, does not exactly correspond to solving (OP) when A and B are non-commuting. Nevertheless, based on empirical observations from numerical experiments, our method provides a good approximation to the original problem. To further substantiate this numerical advantage, we theoretically establish the error-tolerant property of our VJD algorithm in Proposition 4.5. We remark that although our discussion primarily focuses on a pair of almost commuting matrices, our algorithm and analysis extend directly to a family of matrices that are pairwise almost commuting.

This work is organized as follows. Section 2 delves into the stability and generic behaviors of local minimizers of (Λ,v)\mathcal{L}(\Lambda,v), serving as motivation for the VJD algorithm outlined in Section 3, which relies on an efficient approximate Newton method for sub-optimization problems. Additionally, Section 4 is dedicated to the examination of the convergence properties and numerical stability of the presented algorithms. In Section 5, various numerical examples are presented to justify the efficiency of our method.

Notations.

  1. Let {\rm S}(n) be the set of symmetric n\times n matrices, and {\rm O}(n) be the orthogonal group in dimension n. We denote by {\rm diag}(n) the space of n\times n diagonal matrices, and by {\rm diag}_{\leq}(n) the closed subset of {\rm diag}(n) with non-decreasing diagonal entries \lambda_{1}\leq\cdots\leq\lambda_{n}. By abuse of notation, we also write {\rm diag}(\lambda_{1},\ldots,\lambda_{n}) for the diagonal matrix with diagonal entries \{\lambda_{i}\}_{i=1}^{n}. We write A\succeq 0 (resp., A\succ 0) for A\in{\rm S}(n) if A is positive semidefinite (resp., positive definite).

  2. For a matrix M\in\mathbb{R}^{m\times n}, M_{ij} denotes its (i,j) element. Moreover, we denote by \lVert\cdot\rVert_{\rm F} the Frobenius norm, and by \lVert\cdot\rVert_{\rm op} (or, simply, \lVert\cdot\rVert) the operator norm.

  3. Let \langle\cdot,\cdot\rangle be the Euclidean inner product of \mathbb{R}^{n} and \lVert\cdot\rVert be the associated norm. The standard basis of \mathbb{R}^{n} is given by \{e_{i}\}_{i=1}^{n}, where e_{i} denotes the vector with 1 in the i-th coordinate and 0 elsewhere. We will not distinguish the row vector (x_{1},\ldots,x_{n}) and the column one (x_{1},\ldots,x_{n})^{T} unless otherwise specified. We define the (n-1)-dimensional unit sphere by \mathbb{S}^{n-1}:=\{x=(x_{1},\ldots,x_{n})\in\mathbb{R}^{n}\,;\ \lVert x\rVert=1\}.

  4. We denote by \sigma(X) the spectrum of a matrix X\in{\rm S}(n) and by \delta(X) its spectral gap:

    \delta(X)=\min\{|\lambda_{i}-\lambda_{j}|\,;\ \lambda_{i}\neq\lambda_{j}\,,\ \lambda_{i},\lambda_{j}\in\sigma(X)\}\,.

    We say that an eigenvalue \lambda\in\sigma(X) of X\in{\rm S}(n) is simple (resp., degenerate) if its multiplicity is equal to (resp., greater than) one.

  5. We recall the commutator [A,B]=AB-BA of two matrices A,B\in\mathbb{R}^{n\times n} and define the set of commuting matrices by

    {\rm F}_{\rm com}:=\{(A,B)\in{\rm S}(n)\times{\rm S}(n)\,;\ [A,B]=0\}\,.

    Then, we say that \Lambda:=(\lambda,\mu)\in\sigma(A)\times\sigma(B) is an eigenvalue pair of (A,B)\in{\rm F}_{\rm com} if the associated common eigenspace is non-trivial:

    E_{A,B}^{\Lambda}:=\{v\in\mathbb{R}^{n}\,;\ Av=\lambda v,\,Bv=\mu v\}\neq\{0\}\,. (1.5)

    With slight abuse of notation, we define the joint spectrum \sigma(A,B)\subset\mathbb{R}^{2} of (A,B)\in{\rm F}_{\rm com} as the set of all its eigenvalue pairs \Lambda_{j}, and the eigenvalue gap by

    \delta(A,B):=\min\{\lVert\Lambda_{i}-\Lambda_{j}\rVert\,;\ \Lambda_{i}\neq\Lambda_{j}\,,\ \Lambda_{i},\Lambda_{j}\in\sigma(A,B)\}\,. (1.6)

    We also denote by {\rm conv}(\sigma(A,B))\subset\mathbb{R}^{2} the convex hull of \sigma(A,B). The multiplicity of \Lambda\in\sigma(A,B) is defined as the dimension of the subspace E_{A,B}^{\Lambda}, and \Lambda\in\sigma(A,B) is said to be simple (resp., degenerate) if \dim(E_{A,B}^{\Lambda}) is equal to (resp., greater than) one.
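In particular, the joint spectrum \sigma(A,B) and the eigenvalue gap \delta(A,B) of a commuting pair with one-dimensional common eigenspaces can be computed as in the following NumPy sketch; diagonalizing the generic combination A+tB to obtain common eigenvectors is a standard device and an assumption of this sketch rather than a construction used later in the paper.

import numpy as np

def joint_spectrum(A, B, t=0.7071):
    # Eigenvalue pairs and eigenvalue gap (1.6) of a commuting pair (A, B).
    # Assumes all common eigenspaces are one-dimensional, so that the
    # eigenvectors of the generic combination A + t*B are common eigenvectors.
    _, V = np.linalg.eigh(A + t * B)
    pairs = np.array([[v @ A @ v, v @ B @ v] for v in V.T])     # the Lambda_j
    gaps = [np.linalg.norm(p - q) for k, p in enumerate(pairs)
            for q in pairs[k + 1:]]
    return pairs, min(gaps)

# Example: a random commuting pair built from a shared eigenbasis.
rng = np.random.default_rng(2)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(rng.standard_normal(n)) @ Q.T
B = Q @ np.diag(rng.standard_normal(n)) @ Q.T
pairs, gap = joint_spectrum(A, B)
print(pairs.round(3), gap)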

2 Analysis of local minimizers of ({\rm OP}_{1})

This section is devoted to the analysis of the local minimizers of the functional (Λ,v)\mathcal{L}(\Lambda,v) in (OP1{\rm OP}_{1}). Note (Λ,v)=(Λ,v)\mathcal{L}(\Lambda,v)=\mathcal{L}(\Lambda,-v) and that if (Λ,v)(\Lambda,v) is a local minimum of \mathcal{L}, so is (Λ,v)(\Lambda,-v). It is convenient to consider a point (Λ,v)2×𝕊n1(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1} as an equivalence class {(Λ,v),(Λ,v)}\{(\Lambda,v),(\Lambda,-v)\} in what follows. We shall first derive the optimality conditions of (OP1{\rm OP}_{1}) and characterize the minimizers of \mathcal{L} in the commuting case. Then, we establish some stability estimates on the minimizers by perturbation arguments. We also show that with high probability 1o(1)1-o(1), the functional (Λ,v)\mathcal{L}(\Lambda,v) for almost commuting random matrices has nn local minima {(Λi,vi)}i=1n\{(\Lambda_{i},v_{i})\}_{i=1}^{n} on the manifold 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1} with minimizing vectors vi𝕊n1v_{i}\in\mathbb{S}^{n-1} almost orthogonal to each other. These findings shed light on designing the column-wise algorithm for the optimization problem (OP), which we will discuss in detail in the next section.

Recall that in this work, we consider the almost commuting symmetric matrices (A,B)(A,B) with [A,B]1\lVert[A,B]\rVert\ll 1. By Theorem 1.1, without loss of generality, we assume that the matrices (A,B)(A,B) can be written as

A=A_{*}+\Delta A\,,\quad B=B_{*}+\Delta B\,, (2.1)

for some commuting matrices (A,B)Fcom(A_{*},B_{*})\in{\rm F}_{\rm com} with

\lVert\Delta A\rVert+\lVert\Delta B\rVert\leq C\lVert[A,B]\rVert^{1/2}\ll 1\,. (2.2)

Note that in the case of [A,B]=0[A,B]=0 (i.e., ΔA=ΔB=0\Delta A=\Delta B=0), we can always find nn local (also global) minima of (Λ,v)\mathcal{L}(\Lambda,v) (with minimum value zero) given by the eigenvalue pairs of (A,B)(A,B) and the associated common orthonormal eigenvectors. Indeed, the converse is also true; see Proposition 2.2 below.

We start with the optimality conditions for ({\rm OP}_{1}). For ease of exposition, we recall some basic concepts of Riemannian optimization on a smooth manifold \mathcal{M} embedded in a Euclidean space \mathbb{R}^{n} [3, 12]. We denote by T_{x}\mathcal{M} the tangent space at x\in\mathcal{M} and by \mathcal{P}_{x} the orthogonal projection from \mathbb{R}^{n} to T_{x}\mathcal{M}. Let f be a smooth function on \mathcal{M}, which admits a smooth extension \bar{f} to a neighborhood of \mathcal{M} in \mathbb{R}^{n}. In what follows, we denote by \nabla and \nabla^{2} the standard Euclidean gradient and Hessian, respectively. Then, the Riemannian gradient of f is given by

\operatorname{grad}f(x)=\mathcal{P}_{x}(\nabla\bar{f}(x))\,; (2.3)

see [12, Proposition 3.53]. Letting G¯\bar{G} be any smooth extension of gradf\operatorname{grad}f, the Riemannian Hessian of ff, as a linear map on TxT_{x}\mathcal{M}, can be computed by [12, Corollary 5.14]

\operatorname{Hess}f(x)=\mathcal{P}_{x}(\nabla\bar{G}(x))\,. (2.4)

In the special case of =𝕊n1\mathcal{M}=\mathbb{S}^{n-1}, we have the tangent space Tv𝕊n1={pn;p,v=0}T_{v}\mathbb{S}^{n-1}=\{p\in\mathbb{R}^{n}\,;\ \langle p,v\rangle=0\} at v𝕊n1v\in\mathbb{S}^{n-1} with the associated projection 𝒫v\mathcal{P}_{v} given by InvvTI_{n}-vv^{T}. Thanks to (2.3) and (2.4) above, the Riemannian gradient and Hessian of ff on 𝕊n1\mathbb{S}^{n-1} can be explicitly computed as

\operatorname{grad}f(v)=(I_{n}-vv^{T})\nabla\bar{f}(v)\,,\quad\operatorname{Hess}f(v)=(I_{n}-vv^{T})(\nabla^{2}\bar{f}(v)-\langle\nabla\bar{f}(v),v\rangle I_{n})\,. (2.5)

We also recall the notion of retraction on 𝕊n1\mathbb{S}^{n-1} [12, Definition 3.41]. In this work, we limit our discussion to the following one for its simplicity (another popular choice is the exponential map):

R_{v}(p)=\frac{v+p}{\lVert v+p\rVert}\,,\quad v\in\mathbb{S}^{n-1}\,,\ p\in T_{v}\mathbb{S}^{n-1}\,. (2.6)

It is worth noting that R_{v} in (2.6) is a second-order retraction on \mathbb{S}^{n-1}, i.e., for any v\in\mathbb{S}^{n-1} and p\in T_{v}\mathbb{S}^{n-1}, the curve c(t)=R_{v}(tp) satisfies c(0)=v, c^{\prime}(0)=p, and c^{\prime\prime}(0)=0; see [12, Definition 5.41].
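The tangent-space projection and the retraction (2.6) take only a few lines of NumPy; the sketch below also checks by finite differences that c^{\prime}(0)=p and that the acceleration of c(t)=R_{v}(tp) at t=0 has no tangential component, which is the second-order property stated above (the ambient acceleration points in the normal direction v).

import numpy as np

def proj_tangent(v, z):
    # Orthogonal projection of z onto T_v S^{n-1}, i.e. (I - v v^T) z.
    return z - v * (v @ z)

def retract(v, p):
    # Retraction R_v(p) = (v + p) / ||v + p|| from (2.6).
    w = v + p
    return w / np.linalg.norm(w)

rng = np.random.default_rng(3)
v = rng.standard_normal(8); v /= np.linalg.norm(v)
p = proj_tangent(v, rng.standard_normal(8))

t = 1e-4
c_plus, c_minus = retract(v, t * p), retract(v, -t * p)
vel = (c_plus - c_minus) / (2 * t)             # finite-difference c'(0)
acc = (c_plus - 2 * v + c_minus) / t**2        # finite-difference ambient c''(0)
print(np.linalg.norm(vel - p))                 # ~ 0: c'(0) = p
print(np.linalg.norm(proj_tangent(v, acc)))    # ~ 0: no tangential acceleration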

We now compute the Riemannian gradient of the functional (Λ,v)\mathcal{L}(\Lambda,v) on the manifold 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1} by (2.3). Note that \mathcal{L} is well-defined on 2+n\mathbb{R}^{2+n}. For Λ=(λ,μ)2\Lambda=(\lambda,\mu)\in\mathbb{R}^{2} and v𝕊n1v\in\mathbb{S}^{n-1}, we have

\operatorname{grad}\mathcal{L}(\Lambda,v)=\left[\begin{matrix}2\lambda-2\langle v,Av\rangle\\ 2\mu-2\langle v,Bv\rangle\\ \operatorname{grad}_{v}\mathcal{L}(\Lambda,v)\end{matrix}\right]\quad\text{with}\quad\operatorname{grad}_{v}\mathcal{L}(\Lambda,v)=2\left(I_{n}-vv^{T}\right)\left((A-\lambda)^{2}+(B-\mu)^{2}\right)v\,, (2.7)

where gradv\operatorname{grad}_{v}\mathcal{L} denotes the Riemannian gradient of (Λ,v)\mathcal{L}(\Lambda,v) in v𝕊n1v\in\mathbb{S}^{n-1}. Similarly, by (2.4), we calculate the Riemannian Hessian of \mathcal{L} as follows:

\operatorname{Hess}\mathcal{L}(\Lambda,v)=\left[\begin{matrix}2I_{2}&W^{T}\\ W&\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)\end{matrix}\right], (2.8)

where the matrices Wn×2W\in\mathbb{R}^{n\times 2} and Hessv(Λ,v)S(n)\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)\in{\rm S}(n) are given by

W=-4\left[\begin{matrix}\left(I_{n}-vv^{T}\right)Av&\left(I_{n}-vv^{T}\right)Bv\end{matrix}\right], (2.9)

and

\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)=(I_{n}-vv^{T})\left(\nabla^{2}_{v}\mathcal{L}(\Lambda,v)-\langle v,\nabla^{2}_{v}\mathcal{L}(\Lambda,v)v\rangle I_{n}\right). (2.10)

Here, v2(Λ,v)\nabla^{2}_{v}\mathcal{L}(\Lambda,v) is the Euclidean Hessian of (Λ,v)\mathcal{L}(\Lambda,v) in vv:

\nabla^{2}_{v}\mathcal{L}(\Lambda,v)=2(A-\lambda)^{2}+2(B-\mu)^{2}\,. (2.11)
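The formulas (2.7)-(2.11) translate directly into a few lines of NumPy; the sketch below assembles the Riemannian gradient and the Hessian blocks at a point (\Lambda,v) and is meant only as an illustration (the function name is ours).

import numpy as np

def riemannian_grad_hess(A, B, lam, mu, v):
    # Riemannian gradient (2.7) and Hessian blocks (2.8)-(2.11) of L at (Lambda, v).
    n = len(v)
    P = np.eye(n) - np.outer(v, v)                     # projection onto T_v S^{n-1}
    Ares, Bres = A - lam * np.eye(n), B - mu * np.eye(n)

    grad_Lam = np.array([2 * lam - 2 * v @ A @ v,
                         2 * mu - 2 * v @ B @ v])
    grad_v = 2 * P @ (Ares @ Ares + Bres @ Bres) @ v   # (2.7)

    W = -4 * np.column_stack([P @ A @ v, P @ B @ v])   # (2.9)
    H_euc = 2 * Ares @ Ares + 2 * Bres @ Bres          # (2.11)
    Hess_v = P @ (H_euc - (v @ H_euc @ v) * np.eye(n)) # (2.10)

    Hess = np.block([[2 * np.eye(2), W.T],
                     [W, Hess_v]])                     # (2.8)
    return np.concatenate([grad_Lam, grad_v]), Hess

# Usage at a random point:
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5)); A = (M + M.T) / 2
M = rng.standard_normal((5, 5)); B = (M + M.T) / 2
v = rng.standard_normal(5); v /= np.linalg.norm(v)
g, H = riemannian_grad_hess(A, B, v @ A @ v, v @ B @ v, v)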

We define the stationary points of ({\rm OP}_{1}) as the set \{(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}\,;\ \operatorname{grad}\mathcal{L}(\Lambda,v)=0\}. If (\Lambda,v) is a local minimizer of ({\rm OP}_{1}), then it is a stationary point and \operatorname{Hess}\mathcal{L}(\Lambda,v)\succeq 0 holds on the tangent space T_{(\Lambda,v)}(\mathbb{R}^{2}\times\mathbb{S}^{n-1})=\mathbb{R}^{2}\times T_{v}\mathbb{S}^{n-1}. Conversely, if (\Lambda,v) is a stationary point such that \operatorname{Hess}\mathcal{L}(\Lambda,v)\succ 0 on \mathbb{R}^{2}\times T_{v}\mathbb{S}^{n-1}, then it is an isolated local minimizer of ({\rm OP}_{1}). We next give a characterization of the stationary points of ({\rm OP}_{1}) for commuting matrices. We denote by \Delta_{n} the n-simplex:

\Delta_{n}=\Big\{(p_{i})_{i}\in\mathbb{R}^{n}\,;\ p_{i}\in[0,1],\ \sum_{i=1}^{n}p_{i}=1\Big\}\,.
Proposition 2.1.

Given (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, let {Λi}i=1n\{\Lambda_{i}\}_{i=1}^{n} be its eigenvalue pairs (counting multiplicity) and {vi}i=1n𝕊n1\{v_{i}\}_{i=1}^{n}\subset\mathbb{S}^{n-1} be the associated common orthonormal eigenvectors. We define the subset of Δn\Delta_{n}:

\Delta_{\rm sta}:=\Big\{(p_{i})_{i}\in\Delta_{n}\,;\ \Lambda=\sum_{i=1}^{n}p_{i}\Lambda_{i}\,,\ \lVert\Lambda-\Lambda_{i}\rVert\ \text{is constant for}\ \{i\,;\ p_{i}\neq 0\}\Big\}\,. (2.12)

Then, the stationary points of ({\rm OP}_{1}) are given by

\Lambda=\sum_{i=1}^{n}p_{i}\Lambda_{i}\in{\rm conv}(\sigma(A,B))\,,\quad v=\sum_{i=1}^{n}c_{i}v_{i}\in\mathbb{S}^{n-1}\,, (2.13)

with (p_{i})_{i}\in\Delta_{\rm sta} and c_{i}=\pm\sqrt{p_{i}}.

Proof.

It is clear from (2.7) that grad(Λ,v)=0\operatorname{grad}\mathcal{L}(\Lambda,v)=0 is equivalent to Λ=(v,Av,v,Bv)\Lambda=(\langle v,Av\rangle,\langle v,Bv\rangle) and

((Aλ)2+(Bμ)2)v=ξv,((A-\lambda)^{2}+(B-\mu)^{2})v=\xi v\,, (2.14)

for some constant ξ0\xi\geq 0. We expand v=iciviv=\sum_{i}c_{i}v_{i} by the common eigenvectors {vi}i=1n\{v_{i}\}_{i=1}^{n} with the eigenvalue pairs {Λi}i=1n\{\Lambda_{i}\}_{i=1}^{n}, and find that the equation (2.14) reduces to

ΛiΛ2ci=ξci,fori=1,,n,\lVert\Lambda_{i}-\Lambda\rVert^{2}c_{i}=\xi c_{i}\,,\quad\text{for}\ i=1,\ldots,n\,, (2.15)

where Λ=(v,Av,v,Bv)=i|ci|2Λiconv(σ(A,B))\Lambda=(\langle v,Av\rangle,\langle v,Bv\rangle)=\sum_{i}|c_{i}|^{2}\Lambda_{i}\in{\rm conv}(\sigma(A,B)). Then, it readily follows that ΛiΛ2=ξ\lVert\Lambda_{i}-\Lambda\rVert^{2}=\xi for any ii with ci0c_{i}\neq 0. The proof is complete. ∎
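To illustrate Proposition 2.1, one can check numerically that an equal-weight combination of two eigenvalue pairs, i.e., the node (e_{i}+e_{j})/2 of \Delta_{\rm sta} discussed in the next paragraph, yields a stationary point. A small NumPy sketch (verifying that the gradient (2.7) vanishes up to round-off):

import numpy as np

rng = np.random.default_rng(4)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # common orthonormal eigenbasis
lam, mu = rng.standard_normal(n), rng.standard_normal(n)
A, B = Q @ np.diag(lam) @ Q.T, Q @ np.diag(mu) @ Q.T

# Stationary point with p = (1/2, 1/2, 0, ..., 0) in Delta_sta:
v = (Q[:, 0] + Q[:, 1]) / np.sqrt(2)
Lam = 0.5 * np.array([lam[0] + lam[1], mu[0] + mu[1]])

P = np.eye(n) - np.outer(v, v)
Ares, Bres = A - Lam[0] * np.eye(n), B - Lam[1] * np.eye(n)
grad_v = 2 * P @ (Ares @ Ares + Bres @ Bres) @ v
grad_Lam = np.array([2 * Lam[0] - 2 * v @ A @ v, 2 * Lam[1] - 2 * v @ B @ v])
print(np.linalg.norm(grad_Lam), np.linalg.norm(grad_v))   # both ~ 1e-15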

From Proposition 2.1 above, we see that for (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, the stationary points of (OP1{\rm OP}_{1}) are essentially characterized by the set Δsta\Delta_{\rm sta} in (2.12), which always includes the nodes eie_{i} and their averages (ei+ej)/2(e_{i}+e_{j})/2, while the full analytical characterization of Δsta\Delta_{\rm sta} would rely on the distribution of the eigenvalue pairs {Λi}\{\Lambda_{i}\} of (A,B)(A,B). The following result shows that in the commuting case, the local minimizers of (OP1{\rm OP}_{1}) are exactly given by eigenvalue pairs and the common eigenvectors.

Proposition 2.2.

For a given (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, (Λ,v)2×𝕊n1(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1} is a local minimizer of (𝑂𝑃1{\rm OP}_{1}) if and only if Λ\Lambda is an eigenvalue pair and vv is the associated common eigenvector.

Proof.

It suffices to prove the direction ()(\Rightarrow). Since (Λ,v)(\Lambda,v) is a local minimizer, we have grad(Λ,v)=0\operatorname{grad}\mathcal{L}(\Lambda,v)=0 and Hess(Λ,v)0\operatorname{Hess}\mathcal{L}(\Lambda,v)\succeq 0 on 2×Tv𝕊n1\mathbb{R}^{2}\times T_{v}\mathbb{S}^{n-1}. It follows that the Schur complement of 2I22I_{2} in Hess(Λ,v)\operatorname{Hess}\mathcal{L}(\Lambda,v) is also positive semidefinite, that is,

Hessv(Λ,v)21WWT0,onTv𝕊n1.\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)-2^{-1}WW^{T}\succeq 0\,,\ \text{on}\ T_{v}\mathbb{S}^{n-1}\,. (2.16)

Suppose that (Λ,v)(\Lambda,v) is not an eigenvalue pair and the common eigenvector of (A,B)(A,B). From (2.13), Λ=ipiΛi\Lambda=\sum_{i\in\mathcal{I}}p_{i}\Lambda_{i} is a convex combination of some non-identical {Λi}i\{\Lambda_{i}\}_{i\in\mathcal{I}} with pi0p_{i}\neq 0 for ii\in\mathcal{I} and v=iciviv=\sum_{i\in\mathcal{I}}c_{i}v_{i} with ci=±pic_{i}=\pm\sqrt{p_{i}} (otherwise, Λ=Λi\Lambda=\Lambda_{i} for some ii and vv is the common eigenvector). Moreover, there holds ΛΛi2=ξ\lVert\Lambda-\Lambda_{i}\rVert^{2}=\xi for ii\in\mathcal{I} and some constant ξ\xi. Then, for any q=ibiviTv𝕊n1q=\sum_{i\in\mathcal{I}}b_{i}v_{i}\in T_{v}\mathbb{S}^{n-1} with q=1\lVert q\rVert=1, we can compute, by (2.10) and (2.16),

\big\langle q,\big(\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)-2^{-1}WW^{T}\big)q\big\rangle
=\langle q,\nabla^{2}_{v}\mathcal{L}(\Lambda,v)q\rangle-\langle v,\nabla^{2}_{v}\mathcal{L}(\Lambda,v)v\rangle-8(\langle q,Av\rangle^{2}+\langle q,Bv\rangle^{2})
=2\xi-2\xi-8(\langle q,Av\rangle^{2}+\langle q,Bv\rangle^{2})\geq 0\,. (2.17)

Since \{\Lambda_{i}=(\lambda_{i},\mu_{i})\}_{i\in\mathcal{I}} are not all identical, either Av or Bv is not collinear with v. Thus, we can choose q\in T_{v}\mathbb{S}^{n-1} such that \langle q,Av\rangle^{2}+\langle q,Bv\rangle^{2}>0, which contradicts (2.17) and completes the proof. ∎

However, the following observation shows that any small perturbation of commuting symmetric matrices could eliminate all common eigenvectors.

Lemma 2.3.

The set M:={(A,B)S(n)×S(n);(A,B)has no common eigenvectors}{\rm M}:=\{(A,B)\in{\rm S}(n)\times{\rm S}(n)\,;\ (A,B)\ \text{has \emph{no} common eigenvectors}\} is an open dense subset of the space S(n)×S(n){\rm S}(n)\times{\rm S}(n).

Proof.

Let {λi}iA\{\lambda_{i}\}_{i\in\mathcal{I}_{A}} and {μi}iB\{\mu_{i}\}_{i\in\mathcal{I}_{B}} be the distinct eigenvalues of AA and BB with associated eigenspaces Ei,AE_{i,A} and Ei,BE_{i,B}, respectively. We define Si,A:=𝕊n1Ei,AS_{i,A}:=\mathbb{S}^{n-1}\cap E_{i,A} and Si,B:=𝕊n1Ei,BS_{i,B}:=\mathbb{S}^{n-1}\cap E_{i,B}, and see that both iASi,A\cup_{i\in\mathcal{I}_{A}}S_{i,A} and iBSi,B\cup_{i\in\mathcal{I}_{B}}S_{i,B} are closed subsets of 𝕊n1\mathbb{S}^{n-1}, and (A,B)(A,B) has no common eigenvectors if and only if the set (iASi,A)(iBSi,B)(\cup_{i\in\mathcal{I}_{A}}S_{i,A})\cap(\cup_{i\in\mathcal{I}_{B}}S_{i,B}) is empty. It follows that there exist open neighbourhoods of (iASi,A)(\cup_{i\in\mathcal{I}_{A}}S_{i,A}) and (iBSi,B)(\cup_{i\in\mathcal{I}_{B}}S_{i,B}) in 𝕊n1\mathbb{S}^{n-1} with empty intersection. Then, by the perturbation theory, we can conclude that the set M{\rm M} is open in S(n)×S(n){\rm S}(n)\times{\rm S}(n). We next show that M{\rm M} is also dense. It suffices to prove that for A,BS(n)A,B\in{\rm S}(n) having common eigenvectors and for small enough ε>0\varepsilon>0, we can find AεA_{\varepsilon}, BεS(n)B_{\varepsilon}\in{\rm S}(n) with AAεε\lVert A-A_{\varepsilon}\rVert\leq\varepsilon, BBεε\lVert B-B_{\varepsilon}\rVert\leq\varepsilon such that AεA_{\varepsilon}, BεB_{\varepsilon} have no common eigenvectors. For this, by perturbing the eigenvalues of AA and BB, without loss of generality, we assume that both AA and BB have only simple eigenvalues. We let Aε=AA_{\varepsilon}=A and next construct BεB_{\varepsilon}. Consider the spectral decompositions for AA and BB: Awi=λiwiAw_{i}=\lambda_{i}w_{i} and Bvi=μiviBv_{i}=\mu_{i}v_{i} for 1in1\leq i\leq n. Then, for a given ε>0\varepsilon>0, we choose an orthogonal matrix UεU_{\varepsilon} satisfying UεviviU_{\varepsilon}v_{i}\neq v_{i} for any ii and UεIε\lVert U_{\varepsilon}-I\rVert\leq\varepsilon, and define the matrix BεB_{\varepsilon} by BεUεV=UεVDB_{\varepsilon}U_{\varepsilon}V=U_{\varepsilon}VD, where V=[v1,,vn]V=[v_{1},\ldots,v_{n}] and D=diag(μ1,,μn)D={\rm diag}(\mu_{1},\ldots,\mu_{n}). When ε\varepsilon is small enough, the condition UεIε\lVert U_{\varepsilon}-I\rVert\leq\varepsilon implies Uεvivj,wkU_{\varepsilon}v_{i}\neq v_{j},w_{k} for all 1i,j,kn1\leq i,j,k\leq n. Thus, AεA_{\varepsilon} and BεB_{\varepsilon} have no common eigenvectors, and the proof is complete. ∎

Although an arbitrary pair of symmetric matrices may not have any common eigenvectors, the functional (Λ,v)\mathcal{L}(\Lambda,v) can still admit many local minima. Indeed, recalling the decomposition (2.1) for almost commuting matrices (A,B)(A,B), let EE_{*} be a common eigenspace (1.5) of (A,B)(A_{*},B_{*}) associated with some eigenvalue pair Λ:=(λ,μ)\Lambda_{*}:=(\lambda_{*},\mu_{*}):

E_{*}:=\{v\in\mathbb{R}^{n}\,;\ A_{*}v=\lambda_{*}v\,,\ B_{*}v=\mu_{*}v\}\,. (2.18)

We shall prove in Theorem 2.6 below that when [A,B]1\lVert[A,B]\rVert\ll 1, there always exists a local minimizer to \mathcal{L} in some neighborhood of (Λ,S)(\Lambda_{*},S_{*}) in 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1}, where S:=𝕊n1ES_{*}:=\mathbb{S}^{n-1}\cap E_{*}. For convenience, we denote by 𝒫\mathcal{P}_{*} and 𝒫\mathcal{P}_{*}^{\perp} the orthogonal projections to the subspace EE_{*} and its orthogonal complement EE_{*}^{\perp}, respectively. The following lemma is useful in the sequel.

Lemma 2.4.

Let the set S_{*} and the projections \mathcal{P}_{*} and \mathcal{P}_{*}^{\perp} be defined as above. For v\in\mathbb{S}^{n-1}, it holds that

\operatorname{dist}(v,S_{*})^{2}:=\min_{w\in S_{*}}\lVert v-w\rVert^{2}=(1-\lVert\mathcal{P}_{*}(v)\rVert)^{2}+\lVert\mathcal{P}_{*}^{\perp}(v)\rVert^{2}=2-2\lVert\mathcal{P}_{*}(v)\rVert\,, (2.19)

where the minimum value is attained at v_{*}=\mathcal{P}_{*}(v)/\lVert\mathcal{P}_{*}(v)\rVert. Similarly, let S_{*,\varepsilon} be the \varepsilon-neighborhood of the set S_{*}: S_{*,\varepsilon}:=\{v\in\mathbb{S}^{n-1}\,;\ \operatorname{dist}(v,S_{*})\leq\sqrt{\varepsilon}\}. Then, there holds, for v\in\mathbb{S}^{n-1},

\operatorname{dist}(v,S_{*,\varepsilon})^{2}=\min_{w\in S_{*,\varepsilon}}\lVert v-w\rVert^{2}=\lVert v-\mathcal{P}_{\varepsilon}(v)\rVert^{2}\,, (2.20)

where \mathcal{P}_{\varepsilon}(v) is the unique minimizer to \operatorname{dist}(v,S_{*,\varepsilon}), given by

\mathcal{P}_{\varepsilon}(v)=\theta\frac{\mathcal{P}_{*}(v)}{\lVert\mathcal{P}_{*}(v)\rVert}+\sqrt{1-\theta^{2}}\frac{\mathcal{P}^{\perp}_{*}(v)}{\lVert\mathcal{P}^{\perp}_{*}(v)\rVert}\quad\text{with}\quad\theta=\max\left\{1-\frac{\varepsilon}{2},\lVert\mathcal{P}_{*}(v)\rVert\right\}. (2.21)
Proof.

The first claim (2.19) directly follows from the definition, so we only show the second one (2.20). For this, we note from (2.19) that vS,εv\in S_{*,\varepsilon} is equivalent to 𝒫(v)1ε/2\lVert\mathcal{P}_{*}(v)\rVert\geq 1-\varepsilon/2. Hence the statements (2.20) and (2.21) clearly hold for vS,εv\in S_{*,\varepsilon}. Then we consider the case where v𝕊n1\S,εv\in\mathbb{S}^{n-1}\backslash S_{*,\varepsilon}, which gives θ=1ε/2\theta=1-\varepsilon/2. It is easy to see

vw2=𝒫(vw)2+𝒫(vw)2|𝒫(v)𝒫(w)|2+|𝒫(v)𝒫(w)|2.\displaystyle\lVert v-w\rVert^{2}=\lVert\mathcal{P}_{*}(v-w)\rVert^{2}+\lVert\mathcal{P}^{\perp}_{*}(v-w)\rVert^{2}\geq\big{|}\lVert\mathcal{P}_{*}(v)\rVert-\lVert\mathcal{P}_{*}(w)\rVert\big{|}^{2}+\big{|}\lVert\mathcal{P}^{\perp}_{*}(v)\rVert-\lVert\mathcal{P}^{\perp}_{*}(w)\rVert\big{|}^{2}. (2.22)

By elementary analysis, we have that the function |𝒫(v)x|2+|𝒫(v)1x2|2\big{|}\lVert\mathcal{P}_{*}(v)\rVert-x\big{|}^{2}+\big{|}\lVert\mathcal{P}^{\perp}_{*}(v)\rVert-\sqrt{1-x^{2}}\big{|}^{2} is monotone increasing in x[1ε/2,1]x\in[1-\varepsilon/2,1] for v𝕊n1\S,εv\in\mathbb{S}^{n-1}\backslash S_{*,\varepsilon}. Therefore, it follows from (2.21) and (2.22) that

dist(v,S,ε)2|𝒫(v)θ|2+|𝒫(v)1θ2|2=v𝒫ε(v)2.\operatorname{dist}(v,S_{*,\varepsilon})^{2}\geq\big{|}\lVert\mathcal{P}_{*}(v)\rVert-\theta\big{|}^{2}+\big{|}\lVert\mathcal{P}^{\perp}_{*}(v)\rVert-\sqrt{1-\theta^{2}}\big{|}^{2}=\lVert v-\mathcal{P}_{\varepsilon}(v)\rVert^{2}\,.

The proof is complete. ∎
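The explicit projection (2.21) is what makes the relaxed constraint in ({\rm OP}_{2}) convenient in practice. A minimal NumPy sketch (the function name and the orthonormal-basis input W are our illustrative choices):

import numpy as np

def project_S_eps(v, W, eps):
    # Projection P_eps(v) of a unit vector v onto the eps-neighborhood
    # S_{*,eps} of S_* = E_* \cap S^{n-1}, following (2.21).
    # W: n x k matrix whose orthonormal columns span the subspace E_*.
    p_par = W @ (W.T @ v)                   # P_*(v)
    p_perp = v - p_par                      # P_*^perp(v)
    a, b = np.linalg.norm(p_par), np.linalg.norm(p_perp)
    theta = min(max(1.0 - eps / 2.0, a), 1.0)
    if b < 1e-15:                           # v already (numerically) in E_*
        return p_par / a
    return theta * (p_par / a) + np.sqrt(1.0 - theta**2) * (p_perp / b)

# Example: distance from v to S_{*,eps}, cf. (2.20).
rng = np.random.default_rng(5)
W, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # orthonormal basis of E_*
v = rng.standard_normal(8); v /= np.linalg.norm(v)
Pv = project_S_eps(v, W, eps=1e-2)
print(np.linalg.norm(v - Pv))                      # dist(v, S_{*,eps})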

Remark 2.5.

One can also define the distance function dist~(v,E):=infwEd(v,w)\widetilde{{\rm dist}}(v,E):=\inf_{w\in E}d(v,w) for E𝕊n1E\subset\mathbb{S}^{n-1}, by using the geodesic distance d(v,w):=arccosv,wd(v,w):=\arccos\langle v,w\rangle on 𝕊n1\mathbb{S}^{n-1}, which has been investigated in [26, 27]. It is easy to see that dist(v,E)2=22cosdist~(v,E)\operatorname{dist}(v,E)^{2}=2-2\cos\widetilde{{\rm dist}}(v,E) if 0dist~(v,E)π0\leq\widetilde{{\rm dist}}(v,E)\leq\pi, which implies that dist\operatorname{dist} and dist~\widetilde{{\rm dist}} are equivalent, and 𝒫ε(v)\mathcal{P}_{\varepsilon}(v) in (2.21) is also the unique minimizer to dist~(v,S,ε)\widetilde{\operatorname{dist}}(v,S_{*,\varepsilon}). However, the interested set S,εS_{*,\varepsilon} is not geodesically convex by [27, Proposition 2]. Therefore, the results in [26, 27] do not directly apply to our case and we choose to focus on the function dist(,)\operatorname{dist}(\cdot,\cdot) in (2.20) in this work.

Theorem 2.6.

Suppose that (A,B) are almost commuting matrices with the decomposition (2.1), and denote by \delta(A_{*},B_{*}) the eigenvalue gap (1.6) of the commuting matrices (A_{*},B_{*}) in (2.1). Also, let the eigenvalue pair \Lambda_{*}=(\lambda_{*},\mu_{*}) with the space E_{*} and the set S_{*} be defined as before Lemma 2.4. We assume

\lVert\Delta A\rVert+\lVert\Delta B\rVert\leq\frac{1}{8}\delta(A_{*},B_{*})\,,

and define the constants

\eta:=\frac{15}{4}(\lVert\Delta A\rVert+\lVert\Delta B\rVert)\quad\text{and}\quad\varepsilon:=\frac{128}{3}\left(\frac{\lVert\Delta A\rVert+\lVert\Delta B\rVert}{\delta(A_{*},B_{*})}\right)^{2}\,. (2.23)

Then, ({\rm OP}_{1}) admits a local minimizer in the domain

D_{\eta,\varepsilon}:=\big\{(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}\,;\ \lVert\Lambda-\Lambda_{*}\rVert<\eta,\ \operatorname{dist}(v,S_{*})<\sqrt{\varepsilon}\big\}\,. (2.24)
Proof.

By Lemma 2.4, we write, for v𝕊n1v\in\mathbb{S}^{n-1},

v=cv+j=1mcjvj,|c|2+j=1m|cj|2=1,\displaystyle v=c_{*}v_{*}+\sum_{j=1}^{m}c_{j}v_{j}\,,\quad|c_{*}|^{2}+\sum_{j=1}^{m}|c_{j}|^{2}=1\,, (2.25)

and then have

dist(v,S)2:=|c1|2+i=1m|ci|2=22c,\displaystyle\operatorname{dist}(v,S_{*})^{2}:=|c_{*}-1|^{2}+\sum_{i=1}^{m}|c_{i}|^{2}=2-2c_{*}\,, (2.26)

where c=𝒫(v)c_{*}=\lVert\mathcal{P}_{*}(v)\rVert, v=𝒫(v)/cSv_{*}=\mathcal{P}_{*}(v)/c_{*}\in S_{*}, and {vj}j=1m\{v_{j}\}_{j=1}^{m} are common orthonormal eigenvectors of (A,B)(A_{*},B_{*}) spanning EE_{*}^{\perp}. Note that for any ε\varepsilon and η\eta, the set Dη,ε¯\overline{D_{\eta,\varepsilon}} is compact, which implies that infDη,ε¯(Λ,v)\inf_{\overline{D_{\eta,\varepsilon}}}\mathcal{L}(\Lambda,v) admits a minimizer. We shall prove that for small enough ΔA+ΔB\lVert\Delta A\rVert+\lVert\Delta B\rVert, there exists ε\varepsilon and η\eta, depending on ΔA+ΔB\lVert\Delta A\rVert+\lVert\Delta B\rVert and satisfying ε,η0\varepsilon,\eta\to 0 as ΔA+ΔB0\lVert\Delta A\rVert+\lVert\Delta B\rVert\to 0, such that the minimizer of infDη,ε¯(Λ,v)\inf_{\overline{D_{\eta,\varepsilon}}}\mathcal{L}(\Lambda,v) is in the open set Dη,εD_{\eta,\varepsilon} and hence also a local minimizer of (OP1{\rm OP}_{1}). For this, we define a constant

L0:=\displaystyle L_{0}:= infvS(Aλ)v2+(Bμ)v2\displaystyle\inf_{v\in S_{*}}\lVert(A-\lambda_{*})v\rVert^{2}+\lVert(B-\mu_{*})v\rVert^{2}
=\displaystyle= infvSΔAv2+ΔBv2ΔA2+ΔB2,\displaystyle\inf_{v\in S_{*}}\lVert\Delta Av\rVert^{2}+\lVert\Delta Bv\rVert^{2}\leq\lVert\Delta A\rVert^{2}+\lVert\Delta B\rVert^{2}\,, (2.27)

and, without loss of generality, assume L0>0L_{0}>0. Indeed, if L0=0L_{0}=0 holds with a minimizer vSv\in S_{*}, then we readily have (Λ,v)=0\mathcal{L}(\Lambda_{*},v)=0, and the point (Λ,v)(\Lambda_{*},v) is the desired local minimizer of (OP1{\rm OP}_{1}) in Dη,εD_{\eta,\varepsilon}.

We next construct constants ε\varepsilon and η\eta such that

infDη,ε¯\Dη,ε(Λ,v)>ΔA2+ΔB2,\displaystyle\inf_{\overline{D_{\eta,\varepsilon}}\backslash D_{\eta,\varepsilon}}\mathcal{L}(\Lambda,v)>\lVert\Delta A\rVert^{2}+\lVert\Delta B\rVert^{2}\,, (2.28)

which, by (2.27), yields \inf_{\overline{D_{\eta,\varepsilon}}\backslash D_{\eta,\varepsilon}}\mathcal{L}(\Lambda,v)>L_{0}. It follows that \arg\min_{\overline{D_{\eta,\varepsilon}}}\mathcal{L}(\Lambda,v)\in D_{\eta,\varepsilon}, which completes the proof. We first have \overline{D_{\eta,\varepsilon}}\backslash D_{\eta,\varepsilon}\subset D_{1}\cup D_{2} with

D1:={(Λ,v)2×𝕊n1;ΛΛ=η,dist(v,S)ε},\displaystyle D_{1}:=\big{\{}(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}\,;\ \lVert\Lambda-\Lambda_{*}\rVert=\eta\,,\ \operatorname{dist}(v,S_{*})\leq\sqrt{\varepsilon}\big{\}}\,,
D2:={(Λ,v)2×𝕊n1;ΛΛη,dist(v,S)=ε}.\displaystyle D_{2}:=\big{\{}(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}\,;\ \lVert\Lambda-\Lambda_{*}\rVert\leq\eta\,,\ \operatorname{dist}(v,S_{*})=\sqrt{\varepsilon}\big{\}}\,.

A simple estimate implies

2(Λ,v)\displaystyle\sqrt{2\mathcal{L}(\Lambda,v)} (Aλ)v+(Bμ)vΔAΔB.\displaystyle\geq\lVert(A_{*}-\lambda)v\rVert+\lVert(B_{*}-\mu)v\rVert-\lVert\Delta A\rVert-\lVert\Delta B\rVert\,. (2.29)

We now estimate (Λ,v)\mathcal{L}(\Lambda,v) on D1D_{1}. By (2.25) and (2.26), there holds c1ε/2c_{*}\geq 1-\varepsilon/2 for (Λ,v)D1(\Lambda,v)\in D_{1}, which gives

(Aλ)v+(Bμ)v|c|ΛΛ(1ε2)η,\displaystyle\lVert(A_{*}-\lambda)v\rVert+\lVert(B_{*}-\mu)v\rVert\geq|c_{*}|\lVert\Lambda-\Lambda_{*}\rVert\geq\left(1-\frac{\varepsilon}{2}\right)\eta\,,

where we have also used |λλ|+|μμ|ΛΛ|\lambda-\lambda_{*}|+|\mu-\mu_{*}|\geq\lVert\Lambda-\Lambda_{*}\rVert. If we choose ε\varepsilon and η\eta satisfying

(1ε2)η>(1+2)(ΔA+ΔB),\displaystyle\left(1-\frac{\varepsilon}{2}\right)\eta>(1+\sqrt{2})(\lVert\Delta A\rVert+\lVert\Delta B\rVert)\,, (2.30)

it follows from (2.29) that

infD1(Λ,v)>ΔA2+ΔB2.\inf_{D_{1}}\mathcal{L}(\Lambda,v)>\lVert\Delta A\rVert^{2}+\lVert\Delta B\rVert^{2}\,. (2.31)

We next estimate (Λ,v)\mathcal{L}(\Lambda,v) on D2D_{2}. Let Λj=(λj,μj)\Lambda_{j}=(\lambda_{j},\mu_{j}) be the eigenvalue pair of (A,B)(A_{*},B_{*}) associated with the common eigenvector vjv_{j}. Similarly by (2.25), we can derive

(Aλ)v+(Bμ)v\displaystyle\lVert(A_{*}-\lambda)v\rVert+\lVert(B_{*}-\mu)v\rVert j=1m|cj(λjλ)|2+j=1m|cj(μjμ)|2\displaystyle\geq\sqrt{\sum_{j=1}^{m}|c_{j}(\lambda_{j}-\lambda)|^{2}}+\sqrt{\sum_{j=1}^{m}|c_{j}(\mu_{j}-\mu)|^{2}}
j=1m|cj|2ΛjΛ2,\displaystyle\geq\sqrt{\sum_{j=1}^{m}|c_{j}|^{2}\lVert\Lambda_{j}-\Lambda\rVert^{2}}\,, (2.32)

thanks to the elementary inequality x+yx+y\sqrt{x}+\sqrt{y}\geq\sqrt{x+y} for x,y0x,y\geq 0. For (Λ,v)D2(\Lambda,v)\in D_{2} and small enough η\eta, it is clear that j=1m|cj|2=1(1ε/2)2\sum_{j=1}^{m}|c_{j}|^{2}=1-(1-\varepsilon/2)^{2} and

ΛΛjΛjΛΛΛδ(A,B)η.\lVert\Lambda-\Lambda_{j}\rVert\geq\lVert\Lambda_{j}-\Lambda_{*}\rVert-\lVert\Lambda-\Lambda_{*}\rVert\geq\delta(A_{*},B_{*})-\eta\,.

Then, it follows from (2.32) that

(Aλ)v+(Bμ)vj=1m|cj|2ΛjΛ2(δ(A,B)η)1(1ε2)2.\displaystyle\lVert(A_{*}-\lambda)v\rVert+\lVert(B_{*}-\mu)v\rVert\geq\sqrt{\sum_{j=1}^{m}|c_{j}|^{2}\lVert\Lambda_{j}-\Lambda\rVert^{2}}\geq(\delta(A_{*},B_{*})-\eta)\sqrt{1-\left(1-\frac{\varepsilon}{2}\right)^{2}}\,.

If we choose ε\varepsilon and η\eta such that

(δ(A,B)η)1(1ε2)2>(1+2)(ΔA+ΔB),\displaystyle(\delta(A_{*},B_{*})-\eta)\sqrt{1-\left(1-\frac{\varepsilon}{2}\right)^{2}}>(1+\sqrt{2})(\lVert\Delta A\rVert+\lVert\Delta B\rVert)\,, (2.33)

then recalling (2.29), we have

infD2(Λ,v)>ΔA2+ΔB2.\inf_{D_{2}}\mathcal{L}(\Lambda,v)>\lVert\Delta A\rVert^{2}+\lVert\Delta B\rVert^{2}\,. (2.34)

One can check that in the regime \lVert\Delta A\rVert+\lVert\Delta B\rVert\leq\delta(A_{*},B_{*})/8, the choice (2.23) of the parameters (\varepsilon,\eta) satisfies the conditions (2.30) and (2.33). Then, the claim (2.28) follows from (2.31) and (2.34), as desired. ∎

Remark 2.7.

The prefactors in Theorem 2.6 (e.g., 18\frac{1}{8} and 154\frac{15}{4}) are chosen for convenience and can be improved, but the orders in ΔA+ΔB\lVert\Delta A\rVert+\lVert\Delta B\rVert and δ(A,B)\delta(A_{*},B_{*}) are optimal by the standard perturbation theory.

Remark 2.8.

In the extreme case where σ(A,B)\sigma(A_{*},B_{*}) is a singleton, equivalently, (A,B)=(aI,bI)(A_{*},B_{*})=(aI,bI) for some a,ba,b\in\mathbb{R}, we have S=𝕊n1S_{*}=\mathbb{S}^{n-1} and δ(A,B)\delta(A_{*},B_{*}) can be set as ++\infty by convention (note that (1.6) is not well-defined in this case). Then, Theorem 2.6 is a trivial fact that (OP1{\rm OP}_{1}) has a local minimum (Λ,v)2×𝕊n1(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1} with Λ\Lambda near the point (a,b)2(a,b)\in\mathbb{R}^{2}. Moreover, we can see that in this case, (OP1{\rm OP}_{1}) reduces to min(Λ,v)2×𝕊n1(ΔAλ)v2+(ΔBμ)v2\min_{(\Lambda,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}}\lVert(\Delta A-\lambda)v\rVert^{2}+\lVert(\Delta B-\mu)v\rVert^{2} for ΔA,ΔBS(n)\Delta A,\Delta B\in{\rm S}(n), which is hard to analyze due to its generality (except n=2n=2, in which case it is essentially a one-dimensional optimization problem; see also Lemma 3.2 below). It is possible to carry out a landscape analysis as in [76], but one may have to consider special ΔA\Delta A and ΔB\Delta B such as rank-one and diagonal matrices.

We next discuss generic properties of local minimizers of (OP1{\rm OP}_{1}) for almost commuting matrices. Given (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, let {Λl}l=1L\{\Lambda_{l}\}_{l=1}^{L} (LnL\leq n) be its distinct eigenvalue pairs and {EA,BΛl}l=1L\{E^{\Lambda_{l}}_{A,B}\}_{l=1}^{L} be the associated common eigenspaces (1.5). Theorem 2.6 guarantees that (OP1{\rm OP}_{1}) for the perturbed pair (A+ΔA,B+ΔB)(A+\Delta A,B+\Delta B) with ΔA+ΔB1\lVert\Delta A\rVert+\lVert\Delta B\rVert\ll 1 admits a local minimizer in the neighborhood of each closed set (Λl,Sl)(\Lambda_{l},S_{l}), where Sl:=EA,BΛl𝕊n1S_{l}:=E_{A,B}^{\Lambda_{l}}\cap\mathbb{S}^{n-1}. Thus, if L=nL=n holds, equivalently, all of the common eigenspaces are one-dimensional, we readily have that (Λ,v)\mathcal{L}(\Lambda,v) has at least nn local minima, and the associated minimizing vectors vjv_{j} are almost orthogonal to each other in the sense that |vi,vj|1|\langle v_{i},v_{j}\rangle|\ll 1 for iji\neq j. We shall see that generically, it is indeed the case. Note that the set of commuting matrices FcomS(n)×S(n){\rm F}_{\rm com}\subset{\rm S}(n)\times{\rm S}(n) is a complete metric space.

Lemma 2.9.

The subset F0:={(A,B)Fcom;dim(EA,BΛl)=1,Λlσ(A,B)}{\rm F}_{0}:=\{(A,B)\in{\rm F}_{\rm com}\,;\ \dim(E_{A,B}^{\Lambda_{l}})=1\,,\ \Lambda_{l}\in\sigma(A,B)\} is open dense in Fcom{\rm F}_{\rm com}.

Proof.

To show that F0{\rm F}_{0} is dense in Fcom{\rm F}_{\rm com}, it suffices to note that for (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, we can always perturb the eigenvalues of AA such that the perturbed matrix A~\widetilde{A} has only simple eigenvalues, which implies (A~,B)F0(\widetilde{A},B)\in{\rm F}_{0}, while the claim that F0{\rm F}_{0} is open in Fcom{\rm F}_{\rm com} follows from the standard perturbation result (cf. Lemma 4.3). ∎

The above lemma is an analog of Lemma 2.3, which gives the genericity of dim(EA,BΛl)=1\dim(E_{A,B}^{\Lambda_{l}})=1 in the topological sense. It would also be interesting to interpret this property from a probabilistic perspective. We next define a probability measure on the set Fcom{\rm F}_{\rm com} of commuting matrices and show that the event F0{\rm F}_{0} happens almost surely. Note that Fcom{\rm F}_{\rm com} can be parameterized as

\Phi:(D_{1},D_{2},U)\in{\rm diag}_{\leq}(n)\times{\rm diag}_{\leq}(n)\times{\rm O}(n)\longrightarrow(UD_{1}U^{T},UD_{2}U^{T})\in{\rm F}_{\rm com}.

Our probabilistic model is largely motivated by GOE in random matrix theory. We recall that XX is a GOE if diagonal elements XiiN(0,1)X_{ii}\sim N(0,1) for 1in1\leq i\leq n are i.i.d. Gaussian random variables and off-diagonal elements {Xij}1i<jn\{X_{ij}\}_{1\leq i<j\leq n} are i.i.d. Gaussian N(0,12)N(0,\frac{1}{2}) that are independent of XiiX_{ii}. We let UU be uniformly distributed on O(n){\rm O}(n) and Didiag(n)D_{i}\in{\rm diag}_{\leq}(n), i=1,2i=1,2, be random diagonal matrices with diagonal entries independently sampled from the following probability density on the set {x1x2xn}\{x_{1}\leq x_{2}\leq\cdots\leq x_{n}\}:

\varphi(x_{1},\ldots,x_{n})=c_{n}e^{-\frac{1}{2}(x_{1}^{2}+\cdots+x_{n}^{2})}\prod_{1\leq i<j\leq n}(x_{j}-x_{i})\,,

where cnc_{n} is the normalization constant. Equivalently, we define the product probability measure:

\mathbb{P}=\varphi(\lambda_{1},\ldots,\lambda_{n})\,d\lambda_{1}\cdots d\lambda_{n}\times\varphi(\mu_{1},\ldots,\mu_{n})\,d\mu_{1}\cdots d\mu_{n}\times\mu_{\rm Haar}\,,

on diag(n)×diag(n)×O(n){\rm diag}_{\leq}(n)\times{\rm diag}_{\leq}(n)\times{\rm O}(n), where μHaar\mu_{\rm Haar} denotes the Haar measure on the orthogonal group O(n){\rm O}(n). Clearly, the map Φ\Phi is smooth and thus the pushforward measure Fcom:=Φ1\mathbb{P}_{{\rm F}_{\rm com}}:=\mathbb{P}\circ\Phi^{-1} on Fcom{\rm F}_{\rm com} is well-defined. In what follows, we limit our discussion to the case where Fcom{\rm F}_{\rm com} is equipped with the probability Fcom\mathbb{P}_{{\rm F}_{\rm com}}. One may have noted that φ\varphi is nothing else than the joint distribution of eigenvalues of GOE [5]. Thus, the pair (A,B)Fcom(A,B)\in{\rm F}_{\rm com} has the same marginal distribution as GOE, i.e., AGOEA\sim{\rm GOE}, BGOEB\sim{\rm GOE}. To proceed, we recall a basic result that may be due to Wigner and von Neumann [55]; see also [65, 67].

Lemma 2.10.

The set of symmetric matrices with degenerate eigenvalues is a surface of codimension 2 in the n(n+1)/2-dimensional real vector space {\rm S}(n). Thus, given an absolutely continuous probability measure on {\rm S}(n) with respect to the Lebesgue measure, a matrix X\in{\rm S}(n) has only simple eigenvalues almost surely.

We immediately observe from Lemma 2.10 that

\mathbb{P}_{{\rm F}_{\rm com}}({\rm F}_{\rm com}\backslash{\rm F}_{0})=\mathbb{P}_{{\rm F}_{\rm com}}(\Lambda\in\sigma(A,B)\ \text{is degenerate})\leq\mathbb{P}_{{\rm GOE}}(X\in{\rm S}(n)\ \text{has a degenerate eigenvalue})=0\,,

and the following corollary holds.

Corollary 2.11.

For a pair of random commuting matrices drawn from the probability Fcom\mathbb{P}_{{\rm F}_{\rm com}}, almost surely we have that all its common eigenspaces are one-dimensional.
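The probability model above is straightforward to sample: \varphi is the joint eigenvalue density of the GOE, so sorted eigenvalues of independent GOE draws provide D_{1},D_{2}, and a Haar-distributed orthogonal matrix is obtained from the QR factorization of a Gaussian matrix (with the usual column-sign correction). A minimal NumPy sketch:

import numpy as np

def sample_goe(n, rng):
    # GOE draw: X_ii ~ N(0,1), X_ij ~ N(0,1/2) for i < j, symmetric.
    G = rng.standard_normal((n, n))
    return (G + G.T) / 2.0

def sample_haar_orthogonal(n, rng):
    # Haar-distributed U in O(n) via QR of a Gaussian matrix.
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))       # fix column signs for the exact Haar law

def sample_commuting_pair(n, rng):
    # Sample (A_*, B_*) from P_{F_com} as described in the text.
    U = sample_haar_orthogonal(n, rng)
    D1 = np.sort(np.linalg.eigvalsh(sample_goe(n, rng)))   # eigenvalue law phi
    D2 = np.sort(np.linalg.eigvalsh(sample_goe(n, rng)))
    return U @ np.diag(D1) @ U.T, U @ np.diag(D2) @ U.T

rng = np.random.default_rng(6)
A_star, B_star = sample_commuting_pair(20, rng)
print(np.linalg.norm(A_star @ B_star - B_star @ A_star))   # ~ 1e-14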

In order to combine these observations with Theorem 2.6, we recall from [56, Corollary 2.2] a tail bound of the spectral gap for Wigner matrices (i.e., off-diagonal entries are i.i.d. sub-Gaussian random variables with mean 0 and variance 11 and diagonal entries are independent sub-Gaussians with mean 0 and variances bounded by n1o(1)n^{1-o(1)}), which include GOE as a subclass.

Lemma 2.12.

For real symmetric Wigner matrices XnX_{n} with sub-Gaussian entries, there holds

\min_{1\leq i\leq n-1}(\lambda_{i+1}-\lambda_{i})\geq n^{-\frac{3}{2}-o(1)}\,,

with probability 1-o(1), where \{\lambda_{i}\}_{i=1}^{n} are the eigenvalues of X_{n} (counting multiplicity) in increasing order.

Applying Lemma 2.12 to the GOE and invoking Theorem 2.6, we conclude the following result, which establishes the desired claim that, generically, \mathcal{L}(\Lambda,v) has almost orthogonal local minimizing vectors v_{i}.

Theorem 2.13.

Let A,BS(n)A,B\in{\rm S}(n) be almost commuting matrices with decomposition (2.1), where (A,B)Fcom(A_{*},B_{*})\in{\rm F}_{\rm com} is sampled from Fcom\mathbb{P}_{{\rm F}_{\rm com}}. Then, for ΔA+ΔBn3/2o(1)/8\lVert\Delta A\rVert+\lVert\Delta B\rVert\leq n^{-3/2-o(1)}/8, with probability 1o(1)1-o(1), the function (Λ,v)\mathcal{L}(\Lambda,v) admits nn local minima on 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1} with associated minimizing vectors {vj}j=1n\{v_{j}\}_{j=1}^{n} satisfying

\displaystyle|\langle v_{i},v_{j}\rangle|=O\left(\frac{\lVert\Delta A\rVert+\lVert\Delta B\rVert}{n^{-3/2-o(1)}}\right)\quad\text{for}\ i\neq j\,.

3 Vector-wise joint diagonalization algorithm

In this section, we will provide a detailed description of the VJD algorithm for solving (OP). Following the ideas presented in the introduction, we first summarize its key steps in Algorithm 2 below.

Input: almost commuting matrices A,B\in{\rm S}(n)
Output: orthogonal matrix U\in{\rm O}(n); diagonal matrices D_{1},D_{2}\in\mathrm{diag}(n)
1. set (\lambda^{(1)},\mu^{(1)},v^{(1)}) to the minimizer of ({\rm OP}_{1})
2. V=v^{(1)}
3. w^{(1)}=v^{(1)}
4. for j=2,\ldots,n do
5.    set (\lambda^{(j)},\mu^{(j)},v^{(j)}) to the minimizer of ({\rm OP}_{2}) with inputs \{w^{(i)}\}_{i=1}^{j-1}
6.    V=[V,v^{(j)}]
7.    compute w^{(j)} by orthogonalizing \{w^{(1)},\ldots,w^{(j-1)},v^{(j)}\} via the Gram–Schmidt process
8. end for
9. compute the SVD V=U_{1}\Sigma U_{2} for solving (1.4)
10. U=U_{1}U_{2}, D_{1}={\rm diag}(\lambda^{(1)},\ldots,\lambda^{(n)}), D_{2}={\rm diag}(\mu^{(1)},\ldots,\mu^{(n)})
Algorithm 2 VJD algorithm for (OP)
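The outer loop of Algorithm 2 translates almost line by line into Python. The sketch below is only illustrative: solve_subproblem is a hypothetical stand-in for Algorithm 3 that returns an approximate minimizer (\lambda,\mu,v) of ({\rm OP}_{1}) or ({\rm OP}_{2}) given the previously collected orthonormal vectors.

```python
import numpy as np

def vjd(A, B, solve_subproblem):
    """Sketch of Algorithm 2.  `solve_subproblem(A, B, W)` is assumed to return
    (lam, mu, v) approximately minimizing (OP1) when W has no columns and
    (OP2) with inputs w^(1), ..., w^(j-1) (the columns of W) otherwise."""
    n = A.shape[0]
    V = np.zeros((n, n))
    W = np.zeros((n, 0))                  # orthonormalized directions w^(i)
    lams, mus = np.zeros(n), np.zeros(n)
    for j in range(n):
        lam, mu, v = solve_subproblem(A, B, W)
        lams[j], mus[j], V[:, j] = lam, mu, v
        w = v - W @ (W.T @ v)             # Gram-Schmidt step against w^(1..j-1)
        W = np.column_stack([W, w / np.linalg.norm(w)])
    U1, _, U2 = np.linalg.svd(V)          # nearest orthogonal matrix, cf. (1.4)
    U = U1 @ U2
    return U, np.diag(lams), np.diag(mus)
```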
Remark 3.1.

The introduction of w^{(j)} is conceptually unnecessary; it mainly serves to implement the projection \mathcal{P}_{\varepsilon} (2.21) associated with {\rm span}\{v^{(1)},\ldots,v^{(j)}\}, which is needed for solving ({\rm OP}_{2}); see Algorithm 3.

The rest of this section is devoted to designing algorithms for the optimization problems ({\rm OP}_{1}) and ({\rm OP}_{2}). Note that the cost functional \mathcal{L}(\Lambda,v) is quadratic and convex in each of the variables \Lambda\in\mathbb{R}^{2} and v\in\mathbb{R}^{n} separately, which motivates us to consider an alternating descent framework, namely, successively updating the variables \Lambda and v to minimize the cost \mathcal{L}. In particular, we have

argminΛ2(Λ,v)=(v,Av,v,Bv),\displaystyle\arg\min_{\Lambda\in\mathbb{R}^{2}}\mathcal{L}(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle)\,, (3.1)

and there holds, for any v𝕊n1v\in\mathbb{S}^{n-1} and Λ=(λ,μ)2\Lambda=(\lambda,\mu)\in\mathbb{R}^{2},

~(v)(Λ,v),\displaystyle\widetilde{\mathcal{L}}(v)\leq\mathcal{L}(\Lambda,v)\,, (3.2)

with

~(v):=(v,Av,v,Bv,v)=v,(A2+B2)vv,Av2v,Bv2,\widetilde{\mathcal{L}}(v):=\mathcal{L}(\langle v,Av\rangle,\langle v,Bv\rangle,v)=\langle v,(A^{2}+B^{2})v\rangle-\langle v,Av\rangle^{2}-\langle v,Bv\rangle^{2}\,, (3.3)

which is a fourth-order polynomial in vv. Thus, we can update Λ\Lambda explicitly via (3.1). To update the vector v𝕊n1v\in\mathbb{S}^{n-1}, we consider Riemannian Newton’s method with a line search, which takes advantage of the Hessian information of \mathcal{L} and enables a fast convergence. More precisely, in the kk-th step, given Λk=(vk,Avk,vk,Bvk)\Lambda_{k}=(\langle v_{k},Av_{k}\rangle,\langle v_{k},Bv_{k}\rangle) and vk𝕊n1v_{k}\in\mathbb{S}^{n-1}, we compute the search direction skTvk𝕊n1s_{k}\in T_{v_{k}}\mathbb{S}^{n-1} by

Hessv(Λk,vk)[sk]=gradv(Λk,vk),-\operatorname{Hess}_{v}{\mathcal{L}(\Lambda_{k},v_{k})}[s_{k}]=\operatorname{grad}_{v}{\mathcal{L}(\Lambda_{k},v_{k})}\,, (3.4)

and then update v_{k+1}=R_{v_{k}}(\alpha_{k}s_{k}) and \Lambda_{k+1}=(\langle v_{k+1},Av_{k+1}\rangle,\langle v_{k+1},Bv_{k+1}\rangle), where the stepsize \alpha_{k} is determined by a backtracking line search (cf. (3.15) below). However, in our numerical experiments, such an alternating minimization may converge slowly and become less efficient as the matrix size n increases. We also note that \operatorname{Hess}_{v}\mathcal{L} may have negative eigenvalues, so that s_{k} in (3.4) is not necessarily a descent direction of \mathcal{L}(\Lambda,\cdot), which can make the algorithm unstable. Our practically efficient algorithm is motivated by the following lemmas.

Lemma 3.2.

Given A,BS(n)A,B\in{\rm S}(n), let functions (Λ,v)\mathcal{L}(\Lambda,v) on 2×𝕊n1\mathbb{R}^{2}\times\mathbb{S}^{n-1} and ~(v)\widetilde{\mathcal{L}}(v) on 𝕊n1\mathbb{S}^{n-1} be defined in (𝑂𝑃1{\rm OP}_{1}) and (3.3), respectively. Then, (Λ,v)(\Lambda_{*},v_{*}) is a local minimum of \mathcal{L} if and only if Λ=(v,Av,v,Bv)\Lambda_{*}=(\langle v_{*},Av_{*}\rangle,\langle v_{*},Bv_{*}\rangle) and vv_{*} is a local minimum of ~\widetilde{\mathcal{L}}. Moreover, we have

grad~(v)=gradv(Λ,v)|(Λ,v)=(v,Av,v,Bv,v),\displaystyle\operatorname{grad}\widetilde{\mathcal{L}}(v)={\color[rgb]{0,0,0}\operatorname{grad}_{v}\mathcal{L}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}}\,, (3.5)

and Hess~(v)\operatorname{Hess}\widetilde{\mathcal{L}}(v) is the Schur complement of the block 2I22I_{2} of the Hessian Hess\operatorname{Hess}\mathcal{L} (2.8), i.e.,

Hess~(v)=Hessv(Λ,v)|(Λ,v)=(v,Av,v,Bv,v)21WWT.\displaystyle\operatorname{Hess}\widetilde{\mathcal{L}}(v)={\color[rgb]{0,0,0}\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}}-2^{-1}WW^{T}\,. (3.6)
Proof.

The Riemannian gradient (3.5) and Hessian (3.6) of ~(v)\widetilde{\mathcal{L}}(v) follow from a direct computation by (2.5). For the first claim, let (Λ,v)(\Lambda_{*},v_{*}) be a local minimum of \mathcal{L}. Then, Λ\Lambda_{*} is a local minimum of the quadratic function (,v)\mathcal{L}(\cdot,v_{*}), which gives Λ=(v,Av,v,Bv)\Lambda_{*}=(\langle v_{*},Av_{*}\rangle,\langle v_{*},Bv_{*}\rangle). By (3.2), we have, for (Λ,v)(\Lambda,v) near (Λ,v)(\Lambda_{*},v_{*}),

~(v)~(v)(Λ,v),\displaystyle\widetilde{\mathcal{L}}(v_{*})\leq\widetilde{\mathcal{L}}(v)\leq\mathcal{L}(\Lambda,v)\,, (3.7)

i.e., vv_{*} is a local minimum of ~(v)\widetilde{\mathcal{L}}(v). The other direction can be proved similarly. ∎

Thanks to Lemma 3.2, optimizing the function \mathcal{L}(\Lambda,v) is equivalent to optimizing \widetilde{\mathcal{L}}(v), since they have exactly the same local minima in the variable v\in\mathbb{S}^{n-1}. Noting from (3.5) that \operatorname{grad}_{v}\mathcal{L}(\Lambda_{k},v_{k})=\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}), we will show that in the almost commuting regime \lVert[A,B]\rVert\ll 1, the alternating descent algorithm given above can be regarded as a Riemannian approximate Newton method for solving

minv𝕊n1~(v)=v,(A2+B2)vv,Av2v,Bv2,\displaystyle\min_{v\in\mathbb{S}^{n-1}}\widetilde{\mathcal{L}}(v)=\langle v,(A^{2}+B^{2})v\rangle-\langle v,Av\rangle^{2}-\langle v,Bv\rangle^{2}\,, (3.8)

namely, a special case of Algorithm 3 below with H(v)=Hessv(v,Av,v,Bv,v)H(v)=\operatorname{Hess}_{v}\mathcal{L}(\langle v,Av\rangle,\langle v,Bv\rangle,v). In this work, the approximate Newton method refers to a Newton-type algorithm that employs structurally simple approximate Hessian matrices with easily computed inverses.
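For reference, the objective (3.8), its Riemannian gradient on the sphere (the tangential projection of the Euclidean gradient 2(A^{2}+B^{2})v-4\langle v,Av\rangle Av-4\langle v,Bv\rangle Bv), and a retraction can be coded as follows; the normalization retraction used here is an assumption standing in for (2.6).

```python
import numpy as np

def L_tilde(A, B, v):
    """Objective (3.8): <v,(A^2+B^2)v> - <v,Av>^2 - <v,Bv>^2 for a unit vector v."""
    Av, Bv = A @ v, B @ v
    return Av @ Av + Bv @ Bv - (v @ Av) ** 2 - (v @ Bv) ** 2

def grad_L_tilde(A, B, v):
    """Riemannian gradient: projection of the Euclidean gradient
    2(A^2+B^2)v - 4<v,Av>Av - 4<v,Bv>Bv onto the tangent space T_v S^{n-1}."""
    Av, Bv = A @ v, B @ v
    g = 2.0 * (A @ Av + B @ Bv) - 4.0 * (v @ Av) * Av - 4.0 * (v @ Bv) * Bv
    return g - (v @ g) * v

def retract(v, s):
    """Retraction on the unit sphere (assumed here to be the normalization map)."""
    w = v + s
    return w / np.linalg.norm(w)
```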

Lemma 3.3.

Given a pair of almost commuting matrices (A,B)(A,B) with decomposition (2.1), let vv_{*} be a local minimum of the associated ~(v)\widetilde{\mathcal{L}}(v) on 𝕊n1\mathbb{S}^{n-1}. Then, for both H(v)=v2(Λ,v)H(v)=\nabla_{v}^{2}\mathcal{L}(\Lambda,v) and H(v)=Hessv(Λ,v)H(v)=\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v) with Λ=(v,Av,v,Bv)\Lambda=(\langle v,Av\rangle,\langle v,Bv\rangle), it holds that

Hess~(v)H(v)=O(vv)+O(ΔA+ΔB),\displaystyle\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v)-H(v)\rVert=O(\lVert v-v_{*}\rVert)+O(\lVert\Delta A\rVert+\lVert\Delta B\rVert)\,, (3.9)

for vv near vv_{*} and ΔA+ΔB\lVert\Delta A\rVert+\lVert\Delta B\rVert small enough. In particular, if [A,B]=0[A,B]=0, we have

Hess~(v)H(v)=O(vv).\displaystyle\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v)-H(v)\rVert=O(\lVert v-v_{*}\rVert)\,. (3.10)
Proof.

It suffices to consider H(v)=v2(Λ,v)|(Λ,v)=(v,Av,v,Bv,v)H(v)=\nabla_{v}^{2}\mathcal{L}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}, since the other one can be similarly analyzed. By formulas (2.7) and (3.5), the optimality condition grad~(v)=0\operatorname{grad}\widetilde{\mathcal{L}}(v_{*})=0 gives

(H(v)v,H(v)vI)v=0,\displaystyle(H(v_{*})-\langle v_{*},H(v_{*})v_{*}\rangle I)v_{*}=0\,, (3.11)

which yields, by (2.10),

M(v):=Hessv(Λ,v)|(Λ,v)=(v,Av,v,Bv,v)=H(v)v,H(v)vI.\displaystyle M(v_{*}):={\color[rgb]{0,0,0}\operatorname{Hess}_{v}\mathcal{L}(\Lambda,v)|_{(\Lambda,v)=(\langle v_{*},Av_{*}\rangle,\langle v_{*},Bv_{*}\rangle,v_{*})}}=H(v_{*})-\langle v_{*},H(v_{*})v_{*}\rangle I\,. (3.12)

To prove the desired estimate (3.9), we first observe

Hess~(v)H(v)O(vv)+Hess~(v)H(v),\displaystyle\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v)-H(v)\rVert\leq O(\lVert v-v_{*}\rVert)+\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v_{*})\rVert\,,

by the Lipschitz continuity of Hess~(v)\operatorname{Hess}\widetilde{\mathcal{L}}(v) and H(v)H(v). Then, recalling from Proposition 2.2 and Lemma 3.2 that the local minima of ~(v)\widetilde{\mathcal{L}}(v) for commuting matrices are characterized by their common eigenvectors, we have H(v)v=0H(v_{*})v_{*}=0 and Hess~(v)=M(v)=H(v)\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})=M(v_{*})=H(v_{*}) for (A,B)Fcom(A,B)\in{\rm F}_{\rm com}, by formulas (2.9), (3.6), and (3.12). It follows from the stability estimate in Theorem 2.6 that Hess~(v)H(v)=O(ΔA+ΔB)\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v_{*})\rVert=O(\lVert\Delta A\rVert+\lVert\Delta B\rVert). Thus, (3.9) and also (3.10) hold. ∎

In what follows, we say that a symmetric matrix H(v)\in{\rm S}(n) is an approximate Riemannian Hessian of \widetilde{\mathcal{L}}(v) if the property (3.9) holds as \lVert v-v_{*}\rVert\to 0 and \lVert\Delta A\rVert+\lVert\Delta B\rVert\to 0. In addition to \operatorname{Hess}_{v}\mathcal{L}(\Lambda,v) with \Lambda=(\langle v,Av\rangle,\langle v,Bv\rangle) from the alternating minimization, Lemma 3.3 also suggests a new choice \nabla_{v}^{2}\mathcal{L}(\Lambda,v) for approximating \operatorname{Hess}\widetilde{\mathcal{L}}(v), which has the favorable property of being positive semidefinite. It is easy to see that its restriction to T_{v}\mathbb{S}^{n-1}, namely (I_{n}-vv^{T})\nabla_{v}^{2}\mathcal{L}(\Lambda,v)(I_{n}-vv^{T}), also satisfies the estimate (3.9) and hence is a good candidate for the approximate Hessian as well.

We are now ready to give our approximate Newton method (Algorithm 3) for solving (OP1{\rm OP}_{1}) and (OP2{\rm OP}_{2}). We note that the only difference between (OP1{\rm OP}_{1}) and (OP2{\rm OP}_{2}) is the inequality constraint dist(v,Sj1)ε\operatorname{dist}(v,S_{j-1})\leq\sqrt{\varepsilon}, which can be easily dealt with by a projection step, thanks to Lemma 2.4. One may also observe that for j=1j=1, there holds Sj1=𝕊n1S_{j-1}=\mathbb{S}^{n-1} by definition and then (OP2{\rm OP}_{2}) reduces to (OP1{\rm OP}_{1}). Thus, it suffices to state the algorithm for (OP2{\rm OP}_{2}). For clarity, we recall that given w1,,wj𝕊n1w_{1},\ldots,w_{j}\in\mathbb{S}^{n-1} (j1j\geq 1),

Sj:=span{w1,,wj}𝕊n1,\displaystyle S_{j}:={\rm span}\{w_{1},\ldots,w_{j}\}^{\perp}\cap\mathbb{S}^{n-1}\,, (3.13)

and Sj,ε:={v𝕊n1;dist(v,Sj)ε}S_{j,\varepsilon}:=\left\{v\in\mathbb{S}^{n-1}\,;\ {\rm dist}(v,S_{j})\leq\sqrt{\varepsilon}\right\} for some ε>0\varepsilon>0. If j=0j=0, then Sj=Sj,ε:=𝕊n1S_{j}=S_{j,\varepsilon}:=\mathbb{S}^{n-1}. We denote by 𝒫j,ε\mathcal{P}_{j,\varepsilon} the projection to the minimizer of dist(v,Sj,ε)\operatorname{dist}(v,S_{j,\varepsilon}) given in (2.21) and also recall the retraction RvR_{v} in (2.6).
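The two elementary operations attached to S_{j-1}, namely the random initialization in step 1 of Algorithm 3 below and the metric projection onto the subsphere, admit simple closed forms; a small Python sketch (with names of our choosing) is given below. The \varepsilon-relaxed projection \mathcal{P}_{j,\varepsilon} of (2.21) is not reproduced here.

```python
import numpy as np

def random_point_on_subsphere(W, seed=None):
    """Uniform random unit vector in S_{j-1} = span{w_1,...,w_{j-1}}^perp on the
    sphere; W stores the orthonormal w_i as columns (zero columns for j = 1)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[0])
    v -= W @ (W.T @ v)                  # remove components along w_1, ..., w_{j-1}
    return v / np.linalg.norm(v)

def project_to_subsphere(v, W):
    """Metric projection of a unit vector v onto S_{j-1}: normalize the orthogonal
    projection of v onto span{w_1,...,w_{j-1}}^perp (well defined when nonzero)."""
    p = v - W @ (W.T @ v)
    return p / np.linalg.norm(p)
```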

Input: almost commuting matrices A,B\in{\rm S}(n);
      vectors w_{1},\ldots,w_{j-1}\in\mathbb{S}^{n-1} for j\geq 1 (j=1 means no input vectors);
      parameters \tau\in(0,1/2), \beta\in(0,1), \epsilon>0, and 0<\varepsilon\ll 1;
      approximate Hessian matrix H(\cdot)\in{\rm S}(n)
Output: a triplet (\lambda,\mu,v)\in\mathbb{R}^{2}\times\mathbb{S}^{n-1}
1. generate a random initial vector v_{0}\in S_{j-1} with S_{j-1} defined by \{w_{1},\ldots,w_{j-1}\} as in (3.13)
2. k=0
3. while \lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert>\epsilon do
4.    compute the search direction s_{k}\in T_{v_{k}}\mathbb{S}^{n-1} by solving
      -H(v_{k})[s_{k}]=\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}) (3.14)
5.    compute the stepsize \alpha_{k} by the backtracking line search:
      \alpha_{k}:=\max\left\{\beta^{\ell}\,;\ \widetilde{\mathcal{L}}(R_{v_{k}}(\beta^{\ell}s_{k}))\leq\widetilde{\mathcal{L}}(v_{k})+\tau\beta^{\ell}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}),s_{k}\rangle\right\} (3.15)
6.    v_{k+1}=\mathcal{P}_{j,\varepsilon}R_{v_{k}}(\alpha_{k}s_{k})
7.    k=k+1
8. end while
9. v=v_{k}\in\mathbb{S}^{n-1}, \lambda=\langle v_{k},Av_{k}\rangle, \mu=\langle v_{k},Bv_{k}\rangle
Algorithm 3 Projected Riemannian approximate Newton method for ({\rm OP}_{1}) and ({\rm OP}_{2})

We end this section with several remarks. First, by Theorem 2.13, the parameter \varepsilon in Algorithm 3 can be chosen of the order (\lVert\Delta A\rVert+\lVert\Delta B\rVert)/n^{-3/2} or larger. Second, by the above discussion, it is clear that Algorithm 3 with H(v)=\operatorname{Hess}_{v}\mathcal{L}(\langle v,Av\rangle,\langle v,Bv\rangle,v) gives the alternating minimization for ({\rm OP}_{1}) and ({\rm OP}_{2}), while for H(v)=\operatorname{Hess}\widetilde{\mathcal{L}}(v), Algorithm 3 is nothing other than the standard Riemannian Newton method. Moreover, combining the abstract VJD scheme in Algorithm 2 with Algorithm 3 gives the algorithm used in practice.

Finally, for the step (3.14), if the approximate Hessian maps T_{v}\mathbb{S}^{n-1} to T_{v}\mathbb{S}^{n-1}, we can simply take the Moore–Penrose generalized inverse of H to find s_{k}=-H(v_{k})^{\dagger}\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}). But if H(v) does not have T_{v}\mathbb{S}^{n-1} as an invariant subspace, say, H(v)=\nabla_{v}^{2}\mathcal{L}(\langle v,Av\rangle,\langle v,Bv\rangle,v), then the equation (3.14) may not admit a solution in T_{v}\mathbb{S}^{n-1}. In this case, we consider the least-squares solution to (3.14). See Section 5.1 for more computational details, where we numerically test the aforementioned candidates for the approximate Hessian H(v).
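To make the preceding discussion concrete, here is a minimal Python sketch of Algorithm 3 for ({\rm OP}_{1}) (so the projection step is the identity), using the least-squares direction described above and reusing the helpers L_tilde, grad_L_tilde, and retract from the sketch after (3.8). The argument H is any of the approximate Hessian candidates, e.g. the Euclidean Hessian tested in Section 5.1; this is an illustrative sketch rather than the implementation used in the experiments.

```python
import numpy as np

def approx_newton_op1(A, B, H, v0, tol=1e-8, tau=0.25, beta=0.5, max_iter=200):
    """Riemannian approximate Newton method for (OP1).  `H(A, B, v)` returns an
    approximate Hessian matrix; L_tilde, grad_L_tilde and retract are the
    helpers from the earlier sketch.  For (OP2) one would in addition draw v0
    in S_{j-1} and apply the projection P_{j,eps} after each retraction."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        g = grad_L_tilde(A, B, v)
        if np.linalg.norm(g) <= tol:
            break
        # tangential least-squares solution of -H(v) s = grad, as in (3.14)
        P = np.eye(len(v)) - np.outer(v, v)
        s = -P @ np.linalg.pinv(H(A, B, v) @ P) @ g
        # backtracking line search (3.15)
        alpha, f0, slope = 1.0, L_tilde(A, B, v), g @ s
        while (L_tilde(A, B, retract(v, alpha * s)) > f0 + tau * alpha * slope
               and alpha > 1e-12):
            alpha *= beta
        v = retract(v, alpha * s)
    return v @ (A @ v), v @ (B @ v), v
```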

4 Error analysis

This section is devoted to the error analysis of our VJD algorithm. In Section 4.1, we establish the global convergence of Algorithm 3 for ({\rm OP}_{1}) (i.e., without the inequality constraint), while in Section 4.2, we discuss the numerical stability of Algorithm 2 with respect to noise and numerical errors.

4.1 Convergence of Riemannian approximate Newton method

Before stating our main theorem, we recall some preliminaries. First, by [59, Lemma 6], there exist a0>0a_{0}>0, a1>0a_{1}>0, and δa0,a1>0\delta_{a_{0},a_{1}}>0 such that for any v𝕊n1v\in\mathbb{S}^{n-1} and pTv𝕊n1p\in T_{v}\mathbb{S}^{n-1} with pδa0,a1\lVert p\rVert\leq\delta_{a_{0},a_{1}},

a0pd(v,Rv(p))a1p,\displaystyle a_{0}\lVert p\rVert\leq d(v,R_{v}(p))\leq a_{1}\lVert p\rVert\,, (4.1)

where d(,)d(\cdot,\cdot) is the geodesic distance on 𝕊n1\mathbb{S}^{n-1}. We define the following constant for later use:

K:=sup{d(Rv(p),Rv(q))pq;p,qTv𝕊n1,pδa0,a1,qδa0,a1}.\displaystyle K:=\sup\left\{\frac{d(R_{v}(p),R_{v}(q))}{\lVert p-q\rVert}\,;\ p,q\in T_{v}\mathbb{S}^{n-1}\,,\ \lVert p\rVert\leq\delta_{a_{0},a_{1}}\,,\ \lVert q\rVert\leq\delta_{a_{0},a_{1}}\right\}. (4.2)

We denote by Pv,wP_{v,w} the parallel transport of tangent vectors at vv to the tangent vectors at ww along the geodesic curve connecting points vv and ww; see [12] for the precise definition of Pv,wP_{v,w}. We also recall the following expansion for the Riemannian gradient of a smooth function ff on 𝕊n1\mathbb{S}^{n-1} [28, 47]: for ww near vv,

gradf(w)=Pv,wgradf(v)+Pv,wHessf(v)[Rv1w]+O(Rv1w2).\displaystyle\operatorname{grad}f(w)=P_{v,w}\operatorname{grad}f(v)+P_{v,w}\operatorname{Hess}f(v)[R_{v}^{-1}w]+O(\lVert R_{v}^{-1}w\rVert^{2})\,. (4.3)
Theorem 4.1.

Let H(v):T_{v}\mathbb{S}^{n-1}\to T_{v}\mathbb{S}^{n-1} be a positive semidefinite approximate Hessian of \widetilde{\mathcal{L}} satisfying (3.9). Suppose that v_{*} is an accumulation point of the iterative sequence \{v_{k}\} generated by Algorithm 3 for ({\rm OP}_{1}), and that H(v_{k}) for all k and H(v_{*}) are non-singular operators on the tangent space. Then, v_{*} is a stationary point of ({\rm OP}_{1}), and for 0\neq\lVert[A,B]\rVert\ll 1, the sequence \{v_{k}\} converges to v_{*} linearly with the convergence rate:

lim supkd(vk+1,v)d(vk,v)=O([A,B]1/2).{\color[rgb]{0,0,0}\limsup_{k\to\infty}\frac{d(v_{k+1},v_{*})}{d(v_{k},v_{*})}=O(\lVert[A,B]\rVert^{1/2})\,.}

In the commuting case [A,B]=0[A,B]=0, we have that {vk}\{v_{k}\} converges to vv_{*} quadratically:

lim supkd(vk+1,v)d(vk,v)2=0.{\color[rgb]{0,0,0}\limsup_{k\to\infty}\frac{d(v_{k+1},v_{*})}{d(v_{k},v_{*})^{2}}=0\,.}
Proof.

In this proof, we always regard H(v) as a linear operator on the tangent space T_{v}\mathbb{S}^{n-1}. By abuse of notation, we define the norm of any T:T_{v}\mathbb{S}^{n-1}\to T_{v}\mathbb{S}^{n-1} by \lVert T\rVert=\sup\{\lVert Ts\rVert\,;\ s\in T_{v}\mathbb{S}^{n-1}\,,\ \lVert s\rVert=1\}. Since H(v_{*}) is non-singular and H(v) is continuous, there exists a neighborhood U of v_{*} such that H(v) for v\in U is positive definite on T_{v}\mathbb{S}^{n-1} and satisfies

H(v)12H(v)1,\displaystyle\lVert H(v)^{-1}\rVert\leq 2\lVert H(v_{*})^{-1}\rVert\,, (4.4)

which implies that we can find a constant σ>0\sigma>0 such that H(vk)1σIH(v_{k})^{-1}\geq\sigma I on Tvk𝕊n1T_{v_{k}}\mathbb{S}^{n-1}. It follows that

grad~(vk),sk=grad~(vk),H(vk)1grad~(vk)σgrad~(vk)2.\displaystyle\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}),s_{k}\rangle=-\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}),H(v_{k})^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rangle\leq-\sigma\lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert^{2}\,. (4.5)

Then by (4.5) and the line search condition (3.15), we see that the stepsize αk\alpha_{k} is well defined.

Step 1. We first prove that the accumulation point v_{*} is also a stationary point, i.e., \operatorname{grad}\widetilde{\mathcal{L}}(v_{*})=0. Again by (3.15), we have

~(vk)~(vk+1)ταkgrad~(vk),skαkστgrad~(vk)2>0,\displaystyle\widetilde{\mathcal{L}}(v_{k})-\widetilde{\mathcal{L}}(v_{k+1})\geq-\tau\alpha_{k}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k}),s_{k}\rangle\geq\alpha_{k}\sigma\tau\lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert^{2}>0\,, (4.6)

which means that ~(vk)\widetilde{\mathcal{L}}(v_{k}) is strictly decreasing. By (4.6), if lim infkαk=μ>0\liminf_{k\to\infty}\alpha_{k}=\mu>0 holds, then we find

k=1grad~(vk)2<+,\displaystyle\sum_{k=1}^{\infty}\lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert^{2}<+\infty\,,

and \lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert\to 0 as k\to\infty, which readily gives \lVert\operatorname{grad}\widetilde{\mathcal{L}}(v_{*})\rVert=0. If not, i.e., if \liminf_{k\to\infty}\alpha_{k}=0 holds, then given any s\in\mathbb{N}, we can take a subsequence k_{j} such that \alpha_{k_{j}}<\beta^{s} for j large enough. Hence, \beta^{s} does not satisfy the line search condition (3.15), that is,

\widetilde{\mathcal{L}}(R_{v_{k_{j}}}(\beta^{s}s_{k_{j}}))>\widetilde{\mathcal{L}}(v_{k_{j}})+\tau\beta^{s}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{k_{j}}),s_{k_{j}}\rangle\,. (4.7)

Without loss of generality, we assume that there holds vkjvv_{k_{j}}\to v_{*} and skjs=H(v)1grad~(v)s_{k_{j}}\to s_{*}=-H(v_{*})^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}). Taking the limit jj\to\infty in (4.7) gives

\widetilde{\mathcal{L}}(R_{v_{*}}(\beta^{s}s_{*}))>\widetilde{\mathcal{L}}(v_{*})+\tau\beta^{s}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}),s_{*}\rangle\,.

Dividing by \beta^{s} and letting s\to\infty, we then easily see

\displaystyle\tau\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}),s_{*}\rangle\leq\liminf_{s\to\infty}\frac{\widetilde{\mathcal{L}}(R_{v_{*}}(\beta^{s}s_{*}))-\widetilde{\mathcal{L}}(v_{*})}{\beta^{s}}=\langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}),s_{*}\rangle\,.

Noting \tau\in(0,1/2) in Algorithm 3 and recalling (4.5), we obtain \langle\operatorname{grad}\widetilde{\mathcal{L}}(v_{*}),s_{*}\rangle=0 and thus \operatorname{grad}\widetilde{\mathcal{L}}(v_{*})=0.

Step 2. We next show that there exists a neighborhood VUV\subset U of vv_{*} such that if vkVv_{k}\in V, then αk=1\alpha_{k}=1. Recalling that RvR_{v} is a second-order retraction, we have the following Taylor expansion [12, Proposition 5.43]:

~(Rv(s))=~(v)+grad~(v),s+12Hess~(v)[s],s+O(s3),\widetilde{\mathcal{L}}(R_{v}(s))=\widetilde{\mathcal{L}}(v)+\langle\operatorname{grad}\widetilde{\mathcal{L}}(v),s\rangle+\frac{1}{2}\langle\operatorname{Hess}\widetilde{\mathcal{L}}(v)[s],s\rangle+O(\lVert s\rVert^{3})\,, (4.8)

for sTv𝕊n1s\in T_{v}\mathbb{S}^{n-1} with s\lVert s\rVert small enough. Then by (4.4), for vv near vv_{*}, it holds that

s=H(v)1grad~(v)=O(vv),\displaystyle s=-H(v)^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v)=O(\lVert v-v_{*}\rVert)\,,

which, along with (4.8), yields

~(Rv(s))~(v)\displaystyle\widetilde{\mathcal{L}}(R_{v}(s))-\widetilde{\mathcal{L}}(v) =grad~(v),s+12Hess~(v)[s],s+O(s3)\displaystyle=\langle\operatorname{grad}\widetilde{\mathcal{L}}(v),s\rangle+\frac{1}{2}\langle\operatorname{Hess}\widetilde{\mathcal{L}}(v)[s],s\rangle+O(\lVert s\rVert^{3})
=12grad~(v),s+O(ΔA+ΔB)vv2+O(vv3),\displaystyle=\frac{1}{2}\langle\operatorname{grad}\widetilde{\mathcal{L}}(v),s\rangle+O\left(\lVert\Delta A\rVert+\lVert\Delta B\rVert\right)\lVert v-v_{*}\rVert^{2}+O(\lVert v-v_{*}\rVert^{3})\,, (4.9)

where we have also used the assumption that (3.9) holds. The above estimate (4.9) allows us to find a neighborhood V of v_{*} such that for any v\in V, there holds

~(Rv(s))~(v)τgrad~(v),s.\widetilde{\mathcal{L}}(R_{v}(s))-\widetilde{\mathcal{L}}(v)\leq\tau\langle\operatorname{grad}\widetilde{\mathcal{L}}(v),s\rangle\,.

By the above estimate and the line search condition (3.15), we conclude our claim.

Step 3. We finally consider the local convergence rate with stepsize one. For this, we claim that there exists a neighborhood V0VV_{0}\subset V of vv_{*} such that Rv(s)V0R_{v}(s)\in V_{0} holds for any vV0v\in V_{0}, where s=H(v)1grad~(v)s=-H(v)^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v). Let

r(v):=grad~(v)Pv,vHess~(v)[Rv1v].r(v):=\operatorname{grad}\widetilde{\mathcal{L}}(v)-P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})[R_{v_{*}}^{-1}v]\,.

By a direct computation with the relation Rv1(v)=Pv,vRv1(v)R_{v}^{-1}(v_{*})=-P_{v_{*},v}R_{v_{*}}^{-1}(v) [12], we have

H(v)1grad~(v)+Rv1(v)=H(v)1(r(v)+Pv,vHess~(v)[Rv1v]H(v)Pv,vRv1(v)),\displaystyle H(v)^{-1}\operatorname{grad}\widetilde{\mathcal{L}}(v)+R_{v}^{-1}(v_{*})=H(v)^{-1}\left(r(v)+P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})[R_{v_{*}}^{-1}v]-H(v)P_{v_{*},v}R_{v_{*}}^{-1}(v)\right),

which gives the estimate

sRv1(v)\displaystyle\lVert s-R_{v}^{-1}(v_{*})\rVert H(v)1(r(v)+Pv,vHess~(v)H(v)Pv,vRv1(v)).\displaystyle\leq\lVert H(v)^{-1}\rVert\left(\lVert r(v)\rVert+\lVert P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v)P_{v_{*},v}\rVert\lVert R_{v_{*}}^{-1}(v)\rVert\right). (4.10)

By the Lipschitz property of H(v)H(v) and vvd(v,v)\lVert v-v_{*}\rVert\leq d(v,v_{*}), we estimate, for vv near vv_{*},

Pv,vHess~(v)H(v)Pv,vPv,vHess~(v)H(v)Pv,v+O(d(v,v)).\displaystyle\lVert P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v)P_{v_{*},v}\rVert\leq\lVert P_{v_{*},v}\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v_{*})P_{v_{*},v}\rVert+O(d(v,v_{*}))\,. (4.11)

Without loss of generality, we let V0V_{0} be small enough such that (4.1) holds with constant KK defined in (4.2). Then it is easy to see from (4.10) and (4.11) that for some constant CC,

\displaystyle\limsup_{v\to v_{*}}\frac{d(R_{v}(s),v_{*})}{d(v,v_{*})}\leq\limsup_{v\to v_{*}}\frac{K\lVert s-R_{v}^{-1}(v_{*})\rVert}{a_{0}\lVert R_{v_{*}}^{-1}(v)\rVert}\leq C\lVert\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})-H(v_{*})\rVert=O(\lVert[A,B]\rVert^{1/2})\,, (4.12)

where we used r(v)=O(Rv1(v)2)r(v)=O\left(\lVert R_{v_{*}}^{-1}(v)\rVert^{2}\right) by (4.3) and grad~(v)=0\operatorname{grad}\widetilde{\mathcal{L}}(v_{*})=0, and the last estimate is from (2.2) and (3.9). When [A,B]\lVert[A,B]\rVert is small enough, we obtain

lim supvvd(Rv(s),v)d(v,v)<1.\displaystyle\limsup_{v\to v_{*}}\frac{d(R_{v}(s),v_{*})}{d(v,v_{*})}<1\,. (4.13)

In particular, if [A,B]=0[A,B]=0, it holds that Hess~(v)=H(v)\operatorname{Hess}\widetilde{\mathcal{L}}(v_{*})=H(v_{*}) by Lemma 3.3. In this case, similarly, we have

lim supvvd(Rv(s),v)d(v,v)2<+.\displaystyle\limsup_{v\to v_{*}}\frac{d(R_{v}(s),v_{*})}{d(v,v_{*})^{2}}<+\infty\,. (4.14)

We now have all the ingredients to finish the proof. Since v_{*} is an accumulation point, there exists k_{0} such that v_{k_{0}}\in V_{0}. Then, by Step 2 and Step 3, a simple induction argument yields v_{k}\in V_{0} and \alpha_{k}=1 for k\geq k_{0}. In view of the estimates (4.12) and (4.14), we readily obtain the convergence of the whole sequence \{v_{k}\} to v_{*} together with the stated convergence rate. ∎

Remark 4.2.

Empirically, the projected approximate Newton method in Algorithm 3 exhibits a similar convergence rate when applied to ({\rm OP}_{2}) with inequality constraints as it does for ({\rm OP}_{1}); see Section 5.1 for the numerical results. Nevertheless, to the best of the authors' knowledge, there is very little research on the convergence of projection-type Riemannian optimization algorithms, and we postpone the convergence analysis of Algorithm 3 for ({\rm OP}_{2}) to future investigations. We also refer interested readers to [7, 50, 73] for recent advancements in Riemannian optimization with constraints.

4.2 Numerical stability

We next show that for commuting matrices (A,B)\in{\rm F}_{\rm com}, our VJD algorithm produces a reliable simultaneous approximate diagonalization. Note that when [A,B]=0, for suitable \varepsilon in (1.3), each sub-optimization problem ({\rm OP}_{1}) or ({\rm OP}_{2}) in Algorithm 2 admits a minimizer (\lambda^{(j)},\mu^{(j)},v^{(j)}) satisfying \mathcal{L}(\lambda^{(j)},\mu^{(j)},v^{(j)})=0. However, due to the presence of noise and iteration errors, the input matrices may not exactly commute and (\lambda^{(j)},\mu^{(j)},v^{(j)}) cannot be computed exactly. In this scenario, we assume that the sub-optimization problems ({\rm OP}_{1}) and ({\rm OP}_{2}) are solved by Algorithm 3 with outputs (\widetilde{\lambda}_{j},\widetilde{\mu}_{j},\widetilde{v}_{j}) satisfying

(λ~j,μ~j,v~j)=O(δ),\displaystyle\mathcal{L}(\widetilde{\lambda}_{j},\widetilde{\mu}_{j},\widetilde{v}_{j})=O(\delta)\,, (4.15)

for some precision parameter δ>0\delta>0, and prove that our Algorithm 2 can output matrices UO(n)U\in{\rm O}(n) and D1,D2diag(n)D_{1},D_{2}\in\mathrm{diag}\left(n\right) with a controlled error; see Proposition 4.5. We remark that the justification of the assumption (4.15) relies on the convergence of Algorithm 3 for (OP2{\rm OP}_{2}) and hence is left open. The proof is based on the following two lemmas: a standard perturbation result for eigenvalue problems [58, p.58] and a backward error stability result for the common eigenvector problem [22, Theorem 4].

Lemma 4.3.

Suppose that AS(n)A\in{\rm S}(n) has nn eigenvalues λ1λn\lambda_{1}\leq\cdots\leq\lambda_{n} with the associated orthonormal eigenvectors v1,,vnv_{1},\ldots,v_{n}. For any ΔAS(n)\Delta A\in{\rm S}(n) with ΔA\lVert\Delta A\rVert small enough, let λ~1λ~n\widetilde{\lambda}_{1}\leq\cdots\leq\widetilde{\lambda}_{n} be the eigenvalues of A+ΔAA+\Delta A. Then, the perturbed matrix A+ΔAA+\Delta A has orthonormal eigenvectors v~i\widetilde{v}_{i} associated with λ~i\widetilde{\lambda}_{i}, which satisfy

|λ~iλi|=O(ΔA),v~ivi=O(ΔA).|\widetilde{\lambda}_{i}-\lambda_{i}|=O(\lVert\Delta A\rVert)\,,\quad\lVert\widetilde{v}_{i}-v_{i}\rVert=O(\lVert\Delta A\rVert)\,.
Lemma 4.4.

Suppose that v𝕊n1v\in\mathbb{S}^{n-1} is a common eigenvector of matrices A,BS(n)A,B\in{\rm S}(n) with Av=λvAv=\lambda v and Bv=μvBv=\mu v. If v~\widetilde{v} is an approximation of vv, then there exist λ~,μ~\widetilde{\lambda},\widetilde{\mu}\in\mathbb{R} and ΔA,ΔBS(n)\Delta A,\Delta B\in{\rm S}(n) such that

(A+ΔA)v~=λ~v~,(B+ΔB)v~=μ~v~,\displaystyle(A+\Delta A)\widetilde{v}=\widetilde{\lambda}\widetilde{v}\,,\quad(B+\Delta B)\widetilde{v}=\widetilde{\mu}\widetilde{v}\,,

and

ΔA2A2+ΔB2B2=Av~v~,Av~v~2A2+Bv~v~,Bv~v~2B2.\displaystyle\sqrt{\frac{\lVert\Delta A\rVert^{2}}{\lVert A\rVert^{2}}+\frac{\lVert\Delta B\rVert^{2}}{\lVert B\rVert^{2}}}=\sqrt{\frac{\lVert A\widetilde{v}-\langle\widetilde{v},A\widetilde{v}\rangle\widetilde{v}\rVert^{2}}{\lVert A\rVert^{2}}+\frac{\lVert B\widetilde{v}-\langle\widetilde{v},B\widetilde{v}\rangle\widetilde{v}\rVert^{2}}{\lVert B\rVert^{2}}}\,.

We now state and prove the main result of this section.

Proposition 4.5.

For commuting matrices A,BS(n)A,B\in{\rm S}(n), suppose that the minimizers (λ(j),μ(j),v(j))(\lambda^{(j)},\mu^{(j)},v^{(j)}) in Algorithm 2 are approximated by (λ~j,μ~j,v~j)(\widetilde{\lambda}_{j},\widetilde{\mu}_{j},\widetilde{v}_{j}) computed from Algorithm 3 with the estimate (4.15). Then, Algorithm 2 generates matrices UO(n)U\in{\rm O}(n) and D1,D2diag(n)D_{1},D_{2}\in\mathrm{diag}\left(n\right) that satisfy

AUD1UT2+BUD2UT2=O(nδ)+O(n2ϵδ),\displaystyle\lVert A-UD_{1}U^{T}\rVert^{2}+\lVert B-UD_{2}U^{T}\rVert^{2}=O(n\delta)+O(n^{2}\epsilon\sqrt{\delta})\,,

where ϵ\epsilon is the threshold parameter in the stopping criterion of Algorithm 3.

Proof.

We first estimate the error for the nearest orthogonal matrix problem (1.4) involved in Algorithm 2. From Algorithm 3, it is clear that (λ~j,μ~j,v~j)=(v~j,Av~j,v~j,Bv~j,v~j)(\widetilde{\lambda}_{j},\widetilde{\mu}_{j},\widetilde{v}_{j})=(\langle\widetilde{v}_{j},A\widetilde{v}_{j}\rangle,\langle\widetilde{v}_{j},B\widetilde{v}_{j}\rangle,\widetilde{v}_{j}), and then the assumption (4.15) implies

~(v~j)=O(δ)for allj.\widetilde{\mathcal{L}}\left(\tilde{v}_{j}\right)=O(\delta)\quad\text{for all}\ j\,. (4.16)

By viewing \widetilde{v}_{j} as an approximation to the common eigenvector v^{(j)} and applying Lemma 4.4, there exist small perturbations \Delta A_{j},\Delta B_{j}\in{\rm S}(n) such that \widetilde{v}_{j} is an eigenvector of both A+\Delta A_{j} and B+\Delta B_{j} and there holds

ΔAj2+ΔBj2max{A,B}min{A,B}~(v~j)=O(δ).\sqrt{\lVert\Delta A_{j}\rVert^{2}+\lVert\Delta B_{j}\rVert^{2}}\leq\frac{\max\{\lVert A\rVert,\lVert B\rVert\}}{\min\{\lVert A\rVert,\lVert B\rVert\}}\sqrt{\widetilde{\mathcal{L}}(\widetilde{v}_{j})}=O(\sqrt{\delta})\,. (4.17)

Therefore, by (4.17) and Lemma 4.3, we obtain

v~jvj=O(δ),\displaystyle\lVert\widetilde{v}_{j}-v_{j}\rVert=O(\sqrt{\delta})\,,

and hence

|v~i,v~j|=O(δ)forij.\left|\langle\tilde{v}_{i},\tilde{v}_{j}\rangle\right|=O(\sqrt{\delta})\quad\text{for}\ i\neq j\,. (4.18)

Letting V=[v~1,,v~n]V=[\widetilde{v}_{1},\ldots,\widetilde{v}_{n}], it readily follows from (4.18) that (VTV)ii=1(V^{T}V)_{ii}=1 and (VTV)ij=O(δ)(V^{T}V)_{ij}=O(\sqrt{\delta}), which implies

VTVIVTVIF=O(nδ).\displaystyle\lVert V^{T}V-I\rVert\leq\lVert V^{T}V-I\rVert_{{\rm F}}=O(n\sqrt{\delta})\,. (4.19)

Again by Lemma 4.3, the estimate (4.19) above gives that all the eigenvalues of VTVV^{T}V are O(nδ)O(n\sqrt{\delta}) perturbations of 11, that is, all the diagonal entries of Σdiag(n)\Sigma\in{\rm diag}(n) in the SVD V=U1ΣU2V=U_{1}\Sigma U_{2} of VV are of the form 1+O(nδ)1+O(n\sqrt{\delta}). This allows us to conclude

U1U2V=IΣ=O(nδ).\displaystyle\lVert U_{1}U_{2}-V\rVert=\lVert I-\Sigma\rVert=O(n\sqrt{\delta})\,.

We now write U=U1U2=[v¯1,,v¯n]U=U_{1}U_{2}=[\bar{v}_{1},\cdots,\bar{v}_{n}] and find

j=1nv~jv¯j2=UVF2=IΣF2=O(n3δ).\sum_{j=1}^{n}\lVert\tilde{v}_{j}-\bar{v}_{j}\rVert^{2}=\left\|U-V\right\|_{{\rm F}}^{2}=\left\|I-\Sigma\right\|_{{\rm F}}^{2}=O(n^{3}\delta)\,. (4.20)

By the stopping criterion in Algorithm 3, we directly see \lVert\operatorname{grad}\widetilde{\mathcal{L}}(\widetilde{v}_{j})\rVert\leq\epsilon. Then, from (4.20), we have

|j=1n~(v¯j)j=1n~(v~j)|j=1n|~(v¯j)~(v~j)|O(ϵ)j=1nv¯jv~j=O(n2ϵδ),\begin{split}&\Big{|}\sum_{j=1}^{n}\widetilde{\mathcal{L}}\left(\bar{v}_{j}\right)-\sum_{j=1}^{n}\widetilde{\mathcal{L}}\left(\tilde{v}_{j}\right)\Big{|}\leq\sum_{j=1}^{n}\Big{|}\widetilde{\mathcal{L}}\left(\bar{v}_{j}\right)-\widetilde{\mathcal{L}}\left(\tilde{v}_{j}\right)\Big{|}\leq O(\epsilon)\sum_{j=1}^{n}\lVert\bar{v}_{j}-\widetilde{v}_{j}\rVert=O(n^{2}\epsilon\sqrt{\delta})\,,\end{split}

which further gives

j=1n~(v¯j)j=1n~(v~j)+O(n2ϵδ)=O(nδ)+O(n2ϵδ).\displaystyle\sum_{j=1}^{n}\widetilde{\mathcal{L}}(\bar{v}_{j})\leq\sum_{j=1}^{n}\widetilde{\mathcal{L}}(\tilde{v}_{j})+O(n^{2}\epsilon\sqrt{\delta})=O(n\delta)+O(n^{2}\epsilon\sqrt{\delta})\,.

The proof is completed by the following simple estimate:

AUD1UT2+BUD2UT2AUD1UTF2+BUD2UTF2j=1n~(v¯j),\displaystyle\lVert A-UD_{1}U^{T}\rVert^{2}+\lVert B-UD_{2}U^{T}\rVert^{2}\leq\lVert A-UD_{1}U^{T}\rVert_{{\rm F}}^{2}+\lVert B-UD_{2}U^{T}\rVert_{{\rm F}}^{2}\leq\sum_{j=1}^{n}\widetilde{\mathcal{L}}(\bar{v}_{j})\,,

where the last inequality follows from the definitions of UO(n)U\in{\rm O}(n) and D1,D2diag(n)D_{1},D_{2}\in\mathrm{diag}\left(n\right). ∎

5 Numerical experiments

In this section, we present extensive numerical experiments to verify the efficiency and robustness of the algorithms proposed in Section 3. In Section 5.1, we test the performance of the approximate Newton method (cf. Algorithm 3) with the exact Riemannian Hessian and various approximate Hessian matrices. Then, in Section 5.2, we compare the Jacobi algorithm and the VJD algorithm for almost commuting matrices in detail, and we also investigate, both numerically and theoretically, the relation between the error cost function and the magnitude of the commutator. In Section 5.3, we apply our algorithm to the independent component analysis (ICA) problem. All the experiments presented below are conducted using Python on a personal laptop with 16 GB RAM and an 8-core 2.6 GHz CPU. The implementation of the JADE algorithm is based on the Python package [6]. The synthetic almost commuting matrices in Sections 5.1 and 5.2 are generated by (A,B)=(A_{*},B_{*})+(\Delta A,\Delta B), where the commuting pair (A_{*},B_{*})\in{\rm F}_{\rm com} is sampled from the distribution \mathbb{P}_{{\rm F}_{\rm com}} introduced in Section 2 and (\Delta A,\Delta B)\in{\rm S}(n)\times{\rm S}(n) is additive Gaussian noise with noise level \sigma, namely, (\Delta A)_{ij} and (\Delta B)_{ij} with i\leq j are i.i.d. Gaussian \mathcal{N}(0,\sigma^{2}).

5.1 Comparison between different approximate Hessians

We explore different approximate Hessian matrices within Algorithm 3 and compare their performance for solving ({\rm OP}_{1}) and ({\rm OP}_{2}) in terms of computing time. As discussed after Algorithm 3, these approximate Hessian matrices can be broadly categorized into two families: those that have the tangent space T_{v}\mathbb{S}^{n-1} as an invariant subspace and those that do not. For the first family, the search direction s_{k} in (3.14) can be readily computed via the pseudoinverse:

sk=H(vk)grad~(vk),\displaystyle s_{k}=-H(v_{k})^{\dagger}\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\,, (5.1)

while for the second family, we consider the associated least-squares problem:

minsTvk𝕊n1H(vk)s+grad~(vk)=minxnH(vk)(IvkvkT)x+grad~(vk).\displaystyle\min_{s\in T_{v_{k}}\mathbb{S}^{n-1}}\lVert H(v_{k})s+\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert=\min_{x\in\mathbb{R}^{n}}\lVert H(v_{k})(I-v_{k}v_{k}^{T})x+\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\rVert\,. (5.2)

It is well-known [35] that the above problem (5.2) admits a minimizer:

sk=(IvkvkT)(H(vk)(IvkvkT))grad~(vk).\displaystyle s_{k}=-\left(I-v_{k}v_{k}^{T}\right)\left(H(v_{k})(I-v_{k}v_{k}^{T})\right)^{\dagger}\operatorname{grad}\widetilde{\mathcal{L}}(v_{k})\,. (5.3)

Here we consider the following four approximate Hessian matrices (which have been justified in Lemma 3.3 and the discussion afterward): (i) the Euclidean Hessian of \mathcal{L}(\Lambda,v) in v: H_{0}(v)=\nabla_{v}^{2}{\mathcal{L}}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}; (ii) the Riemannian Hessian of \mathcal{L} in v (2.10): H_{1}(v)=\operatorname{Hess}_{v}{\mathcal{L}}(\Lambda,v)|_{(\Lambda,v)=(\langle v,Av\rangle,\langle v,Bv\rangle,v)}; (iii) the projected Euclidean Hessian of \mathcal{L}: H_{2}(v)=(I_{n}-vv^{T})H_{0}(v)(I_{n}-vv^{T}); (iv) the exact Riemannian Hessian of \widetilde{\mathcal{L}}(v): H_{3}(v)=\operatorname{Hess}{\widetilde{\mathcal{L}}}(v).
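For the first two of these, a minimal Python sketch is as follows, under the assumption (consistent with (3.1) and (3.3)) that \mathcal{L}(\Lambda,v)=\lVert Av-\lambda v\rVert^{2}+\lVert Bv-\mu v\rVert^{2}, so that \nabla_{v}^{2}\mathcal{L}(\Lambda,v)=2[(A-\lambda I)^{2}+(B-\mu I)^{2}]; either function can be passed as the approximate Hessian H in the Newton sketch at the end of Section 3.

```python
import numpy as np

def H0(A, B, v):
    """Euclidean Hessian of L(Lambda, v) in v at Lambda = (<v,Av>, <v,Bv>),
    assuming L(Lambda, v) = ||Av - lam v||^2 + ||Bv - mu v||^2, which gives
    H0(v) = 2[(A - lam I)^2 + (B - mu I)^2] (positive semidefinite)."""
    n = len(v)
    lam, mu = v @ (A @ v), v @ (B @ v)
    Al, Bm = A - lam * np.eye(n), B - mu * np.eye(n)
    return 2.0 * (Al @ Al + Bm @ Bm)

def H2(A, B, v):
    """Projected Euclidean Hessian (I - v v^T) H0(v) (I - v v^T)."""
    P = np.eye(len(v)) - np.outer(v, v)
    return P @ H0(A, B, v) @ P
```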

We test Algorithm 3 with the above four candidates for the approximate Hessian H(v) for solving ({\rm OP}_{1}) and ({\rm OP}_{2}) with j=2, letting s_{k} be updated accordingly either by (5.1) or (5.3), for three randomly generated pairs of almost commuting matrices of dimension n=1000 with \lVert[A,B]\rVert of orders O(10^{-6}), O(10^{-4}), and O(10^{-2}). The numerical results are presented in Figure 1, which clearly shows that all the candidates achieve similar levels of accuracy with nearly the same number of iteration steps. However, compared to the alternating minimization and the Riemannian Newton method (i.e., Algorithm 3 with approximate Hessian H_{1} and H_{3}, respectively), Algorithm 3 with H_{0} or H_{2} significantly reduces the computational cost, with a running time roughly one fifth of that of the standard Riemannian Newton method. Note that the convergence for the case of H_{2} has been established in Theorem 4.1. The algorithm with the approximate Hessian H_{0}, which appears to be the most efficient option, relies on the heuristic scheme (5.3) and does not align with the Riemannian optimization framework; some additional techniques may be necessary to ensure its convergence properties. Figure 1 also shows that the advantage of the approximate Hessians H_{0} and H_{2} is robust with respect to the magnitude of [A,B].

Figure 1: Convergence histories of Algorithm 3 for ({\rm OP}_{1}) (i.e., the first triplet) and ({\rm OP}_{2}) with j=2 (i.e., the second triplet) with various approximate Hessian matrices H for randomly generated almost commuting matrices (A,B) of dimension n=1000 with \lVert[A,B]\rVert=O(10^{-6}),O(10^{-4}),O(10^{-2}) (left to right). The results for ({\rm OP}_{1}) and ({\rm OP}_{2}) with j=2 are given in the first and second rows, respectively.

5.2 Comparison with Jacobi algorithm and relation with Lin’s theorem

We test Jacobi Algorithm 1 and our VJD Algorithm 2 with the approximate Hessian H_{0} given in Section 5.1 on randomly generated almost commuting symmetric matrices of dimensions n=50,100,500,1000 with noise levels \sigma=10^{-2},10^{-3},10^{-4},10^{-5}, respectively, to compare their computational efficiency. We plot the distributions of the computing times of both algorithms across all the generated samples in Figure 2. It clearly demonstrates that our VJD algorithm is robust against noise and much more efficient than the Jacobi one in the almost commuting regime, especially when the matrix size is large (say, n=500,1000). In addition, each sweep of Jacobi Algorithm 1 consists of n(n-1)/2 Givens rotations, and each rotation requires O(n) operations to update the matrices A and B, so the computational complexity of the Jacobi algorithm is O(n^{3}) per sweep. We numerically compare the complexities of the Jacobi algorithm and the proposed VJD algorithm in Figure 3, where we plot the computational times of Algorithms 1 and 2 as functions of the matrix size on a double logarithmic scale with base 10. It shows that in practice, our VJD algorithm is also of complexity O(n^{3}) but with a smaller prefactor.

Figure 2: Statistical distributions of the computational times of Jacobi Algorithm 1 (blue dots) and VJD Algorithm 2 (red dots) for 30 randomly generated pairs of almost commuting matrices of dimensions n=50,100,500,1000 with noise levels \sigma=10^{-2},10^{-3},10^{-4},10^{-5} (from left to right), respectively.
Figure 3: Averaged computational times of Jacobi Algorithm 1 and VJD Algorithm 2 for almost commuting matrices with dimensions ranging from 128 to 2048 (log–log scale). For each dimension, 30 samples are generated with noise level \sigma=0.01. The blue and red dashed lines are obtained by least-squares fitting of the data points. The results are benchmarked by the black dashed line with a slope of 3.

Recalling that this work was initially motivated by finding a numerically feasible solution to Lin's theorem, we are interested in whether our VJD algorithm can produce commuting matrices (A',B') satisfying the estimate (1.1) in the almost commuting regime. For this, we consider the same pairs of almost commuting matrices (A,B) as in Figure 2 and compute the numerical errors \mathcal{J}(D_{1},D_{2},U) in (OP) for both the Jacobi and VJD algorithms, denoted by \mathcal{J}_{\rm Jacobi} and \mathcal{J}_{\rm VJD}, respectively. We plot \mathcal{J}_{\rm VJD} as a function of the commutator norm \lVert[A,B]\rVert in Figure 4(a) and find that \mathcal{J}_{\rm VJD} scales as \lVert[A,B]\rVert^{2}, independently of the problem size. Thus, it is clear that when \lVert[A,B]\rVert\ll 1, there holds

𝒥VJD[A,B]2[A,B],\displaystyle\mathcal{J}_{\rm VJD}\sim\lVert[A,B]\rVert^{2}\ll\lVert[A,B]\rVert\,, (5.4)

which means that A'=UD_{1}U^{T} and B'=UD_{2}U^{T} from Algorithm 2 satisfy the bound (1.1). We provide a theoretical lower bound on \mathcal{J}_{\rm VJD} below to justify (5.4), which is a generalization of [33, Theorem 3.1]. The proof is given in Appendix A for the sake of completeness.

Proposition 5.1.

Let A,BS(n)A,B\in{\rm S}(n) be symmetric matrices with A,B1\lVert A\rVert,\lVert B\rVert\leq 1. Suppose that matrices UO(n)U\in{\rm O}(n) and D1,D2diag(n)D_{1},D_{2}\in\mathrm{diag}\left(n\right) are computed by Algorithm 2 with input (A,B)(A,B). Then it holds that

𝒥(D1,D2,U)18[A,B]2.\mathcal{J}(D_{1},D_{2},U)\geq\frac{1}{8}\lVert[A,B]\rVert^{2}. (5.5)
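Since the proof in Appendix A in fact bounds the minimum of \mathcal{J} over all of {\rm O}(n), the inequality (5.5) holds for any orthogonal matrix, which makes it easy to sanity-check numerically; a minimal Python check with hypothetical random data is sketched below.

```python
import numpy as np

def off_norm(M):
    """Operator norm of the off-diagonal part of a matrix."""
    return np.linalg.norm(M - np.diag(np.diag(M)), 2)

# Hypothetical random test of (5.5): the bound holds for ANY orthogonal U,
# since the proof bounds the minimum of J over O(n).
rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n)); A = (A + A.T) / 2; A /= np.linalg.norm(A, 2)
B = rng.standard_normal((n, n)); B = (B + B.T) / 2; B /= np.linalg.norm(B, 2)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))      # some orthogonal U
J = off_norm(Q.T @ A @ Q) ** 2 + off_norm(Q.T @ B @ Q) ** 2
assert J >= np.linalg.norm(A @ B - B @ A, 2) ** 2 / 8
```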

We also compute the relative errors between Jacobi and VJD algorithms in the same experimental setup by |𝒥Jacobi𝒥VJD|/𝒥Jacobi|\mathcal{J}_{\rm Jacobi}-\mathcal{J}_{\rm VJD}|/\mathcal{J}_{\rm Jacobi} with the results shown in Figure 4(b). We can see that our VJD algorithm can achieve nearly the same accuracy as the Jacobi method but with much less computational time (cf. Figures 2 and 3).

(a) Error of VJD
(b) Relative error
Figure 4: (a) Errors \mathcal{J}_{\rm VJD} of VJD Algorithm 2 for almost commuting matrices of the various dimensions used in Figure 2 (log–log scale). The black dashed line represents the lower bound \lVert[A,B]\rVert^{2}/8 in (5.5). (b) Relative errors |\mathcal{J}_{\rm Jacobi}-\mathcal{J}_{\rm VJD}|/\mathcal{J}_{\rm Jacobi} between the Jacobi and VJD algorithms (log–log scale).

5.3 Application in independent component analysis

Let s_{1},\ldots,s_{n}\in\mathbb{R}^{t} be independent pure signals of length t, which are assumed to be non-Gaussian. Suppose that the mixed signals x_{1},\ldots,x_{r} (r\geq n) are linear combinations of the pure signals, i.e., X=AS, where X and S are the matrices with rows x_{i} and s_{i}, respectively. The goal of ICA is to reconstruct the source signals S and the mixing matrix A from the noisy measured data X. Cardoso et al. [15, 16] proposed a standard framework for solving this problem based on joint diagonalization. Here we follow the setup in [61] and apply the VJD algorithm to the ICA problem. The basic procedure is summarized as follows for the reader's convenience. First, normalize X such that each row of X has zero mean, and define the whitened matrix P_{W}=\sqrt{t}P with P computed from the SVD of X: X=Q\Sigma P^{T}. Second, compute the fourth-order cumulant tensor of the matrix P_{W} by

K(i,j,k,l)=pipjpkplpipjpkplpipkpjplpiplpjpk,K(i,j,k,l)=\langle p_{i}\circ p_{j}\circ p_{k}\circ p_{l}\rangle-\langle p_{i}\circ p_{j}\rangle\langle p_{k}\circ p_{l}\rangle-\langle p_{i}\circ p_{k}\rangle\langle p_{j}\circ p_{l}\rangle-\langle p_{i}\circ p_{l}\rangle\langle p_{j}\circ p_{k}\rangle\,, (5.6)

where p_{i} are the columns of P_{W}, the notation \circ represents the element-wise (Hadamard) product, and \langle\cdot\rangle denotes the expected value (average). Third, project the cumulant tensor K onto a set of m=n(n+1)/2 orthogonal eigenmatrices of dimension n\times n, n of which have a single diagonal entry equal to 1 and all other entries zero, while the others have two symmetrically placed off-diagonal entries equal to \sqrt{0.5} and all other entries zero. Denote the projected matrices by M_{1},\cdots,M_{m}. Fourth, apply the VJD Algorithm 2 to jointly diagonalize the m matrices M_{1},\cdots,M_{m}, and denote the output orthogonal matrix by U. Finally, reconstruct the source signals and the mixing matrix A by

S~=tUΣ1QTX,A~=XS~T(S~S~T)1.\tilde{S}=\sqrt{t}U\Sigma^{-1}Q^{T}X\,,\quad\tilde{A}=X\tilde{S}^{T}\left(\tilde{S}\tilde{S}^{T}\right)^{-1}\,.
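The pipeline just described can be sketched in Python as follows. The sketch assumes, as in our experiments, that the number of mixtures equals the number of sources (square, full-rank X), and joint_diagonalize is a hypothetical routine (e.g. a multi-matrix variant of the VJD Algorithm 2) returning the orthogonal matrix U; the projection of the cumulant tensor onto an eigenmatrix E is implemented in the natural way as M_{E}[a,b]=\sum_{c,d}K(a,b,c,d)E[c,d].

```python
import numpy as np

def ica_vjd(X, joint_diagonalize):
    """Sketch of the ICA pipeline above (square mixing assumed)."""
    n, t = X.shape
    X = X - X.mean(axis=1, keepdims=True)               # zero-mean rows
    Q, sig, Pt = np.linalg.svd(X, full_matrices=False)  # X = Q Sigma P^T
    PW = np.sqrt(t) * Pt.T                              # whitened data, t x n

    # fourth-order cumulant tensor (5.6), with <.> the average over the t samples
    C2 = PW.T @ PW / t
    K = np.einsum('ta,tb,tc,td->abcd', PW, PW, PW, PW) / t
    K -= (np.einsum('ab,cd->abcd', C2, C2)
          + np.einsum('ac,bd->abcd', C2, C2)
          + np.einsum('ad,bc->abcd', C2, C2))

    # n(n+1)/2 orthogonal eigenmatrices and the projected matrices M_1, ..., M_m
    Es = []
    for i in range(n):
        E = np.zeros((n, n)); E[i, i] = 1.0; Es.append(E)
    for i in range(n):
        for j in range(i + 1, n):
            E = np.zeros((n, n)); E[i, j] = E[j, i] = np.sqrt(0.5); Es.append(E)
    Ms = [np.einsum('abcd,cd->ab', K, E) for E in Es]

    U = joint_diagonalize(Ms)                           # hypothetical joint diagonalizer
    S = np.sqrt(t) * U @ np.diag(1.0 / sig) @ Q.T @ X   # reconstructed sources
    A = X @ S.T @ np.linalg.inv(S @ S.T)                # reconstructed mixing matrix
    return S, A
```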

To test our VJD algorithm in the ICA application, we consider the simulated one-dimensional and two-dimensional source signals in Figures 5(a) and 6(a). We assume that the noisy measured signals are given by X=AS+E, where A is a random orthogonal matrix and E is a random matrix with E_{ij} being i.i.d. Gaussian \mathcal{N}(0,\eta^{2}); see Figures 5(b) and 6(b). Then, in our experiments, there are 21 symmetric matrices of size 6\times 6 to be jointly diagonalized. We plot in Figures 5 and 6 the signals reconstructed by Jacobi Algorithm 1 and VJD Algorithm 2 to compare their performance, which clearly shows that both algorithms can recover the source signals effectively. In addition, we repeat the experiments for the same source signals with 100 samples of A and E and record the average time cost and the L^{2}-error \lVert X_{\rm recons}-X_{\rm true}\rVert_{\rm F} between the reconstructed signals X_{\rm recons} and the source signals X_{\rm true} in Tables 1 and 2. This again demonstrates the advantages of our VJD algorithm over the standard Jacobi one.

(a) Source signals
(b) Measured signals with noise level \eta=0.1
(c) Reconstruction by JADE
(d) Reconstruction by VJD
Figure 5: ICA for one-dimensional signals. The operator norms of the pairwise commutators \lVert[M_{i},M_{j}]\rVert of the matrices \{M_{j}\}_{j=1}^{m} to be jointly diagonalized range from 0.041 to 0.319.
(a) Source signals
(b) Measured signals with noise level \eta=0.01
(c) Reconstruction by JADE
(d) Reconstruction by VJD
Figure 6: ICA for two-dimensional signals. The operator norms of the pairwise commutators \lVert[M_{i},M_{j}]\rVert of the matrices \{M_{j}\}_{j=1}^{m} to be jointly diagonalized range from 0.112 to 0.669.
Avg. time (ms)  Avg. L^{2}-error
JADE  5.526  0.227
VJD  2.143  0.228
Table 1: Computing time and L^{2}-error for ICA in Figure 5
Avg. time (ms)  Avg. L^{2}-error
JADE  6.935  5.602\times 10^{-2}
VJD  2.493  4.813\times 10^{-2}
Table 2: Computing time and L^{2}-error for ICA in Figure 6

6 Conclusion

The approximate joint diagonalization of almost commuting matrices arises in many applications and is closely related to the celebrated Huaxin Lin's theorem. In this work, we have designed a vector-wise algorithmic framework for jointly diagonalizing a given pair of almost commuting matrices (A,B), which relies on solving the sub-optimization problems ({\rm OP}_{1}) and ({\rm OP}_{2}). To motivate our VJD algorithm (cf. Algorithm 2), we have first analyzed the stability of the local minimizers of \mathcal{L}(\Lambda,v) in ({\rm OP}_{1}), and then, with the help of random matrix theory, we have shown that for almost commuting random matrices, with high probability the functional \mathcal{L} has n local minimizers with almost orthogonal minimizing vectors (cf. Theorem 2.13). To efficiently solve the sub-optimization problems, we have proposed a projected Riemannian approximate Newton method (cf. Algorithm 3) for ({\rm OP}_{1}) and ({\rm OP}_{2}). Moreover, we have proved in Theorem 4.1 that in the almost commuting regime, Algorithm 3 for ({\rm OP}_{1}) exhibits linear convergence with a rate of O(\lVert[A,B]\rVert^{1/2}), transitioning to quadratic convergence when A and B commute. However, the convergence of the projected Newton method (i.e., Algorithm 3 for ({\rm OP}_{2})) is still open and needs further investigation. We have also proved that our VJD algorithm is stable with respect to iteration errors and noise.

Numerical results suggest that our vector-wise framework is more efficient than the popular Jacobi-type algorithm, especially for large-scale matrices, and can be applied to the ICA problem for signal reconstruction. We have numerically and theoretically examined the relation between the cost functional \mathcal{J}(D_{1},D_{2},U) in (OP) and the commutator [A,B] and found \mathcal{J}\sim\lVert[A,B]\rVert^{2}, which indicates that when \lVert[A,B]\rVert is small enough, the commuting matrices (A',B')=(UD_{1}U^{T},UD_{2}U^{T}) constructed from Algorithm 2 satisfy the estimate (1.1) and hence give a desired numerical solution for Lin's theorem. Finally, we would like to point out that although our approach works quite well for almost commuting matrices, it would still be important and useful to find an algorithm for Lin's theorem beyond the almost commuting regime.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The work of BL and JL is supported in part by the US National Science Foundation under award CCF-1910571 and by the US Department of Energy via grant DE-SC0019449. JL would like to thank Terry Loring for helpful discussions regarding the problem of almost commuting matrices.

Appendix A Proof of Proposition 5.1

Proof of Proposition 5.1.

We recall the construction of the diagonal matrices D1,D2D_{1},D_{2} in Algorithm 2:

D1=diag(v1,Av1,,vn,Avn),D2=diag(v1,Bv1,,vn,Bvn),\displaystyle D_{1}={\rm diag}(\langle v_{1},Av_{1}\rangle,\ldots,\langle v_{n},Av_{n}\rangle)\,,\quad D_{2}={\rm diag}(\langle v_{1},Bv_{1}\rangle,\ldots,\langle v_{n},Bv_{n}\rangle)\,,

where viv_{i} is the ii-th column of the orthogonal matrix UU. Then it is easy to see that

\displaystyle\mathcal{J}(D_{1},D_{2},U)=\lVert U^{T}AU-\mathrm{diag}\left(U^{T}AU\right)\rVert^{2}+\lVert U^{T}BU-\mathrm{diag}\left(U^{T}BU\right)\rVert^{2}=:\mathcal{J}(U)\,.

Here and in what follows, \mathrm{diag}\left(M\right) denotes the diagonal part of an n\times n matrix M. Hence, to bound \mathcal{J}(D_{1},D_{2},U) from below, it suffices to bound from below the quantity

min{𝒥(U);UO(n)}.\displaystyle\min\left\{\mathcal{J}(U)\,;\ U\in{\rm O}(n)\right\}\,. (A.1)

Let V be a minimizer of the optimization problem (A.1), and write

VTAV=DA+X,VTBV=DB+Y.V^{T}AV=D_{A}+X\,,\quad V^{T}BV=D_{B}+Y\,. (A.2)

where D_{A}:={\rm diag}(V^{T}AV) and D_{B}:={\rm diag}(V^{T}BV). Noting that \lVert A\rVert=\max_{x\in\mathbb{S}^{n-1}}|\langle x,Ax\rangle| holds for A\in{\rm S}(n), we readily have, by definition,

DAA1,DBB1.\lVert D_{A}\rVert\leq\lVert A\rVert\leq 1\,,\quad\lVert D_{B}\rVert\leq\lVert B\rVert\leq 1\,. (A.3)

The decomposition (A.2) can be rewritten as A=VD_{A}V^{T}+VXV^{T} and B=VD_{B}V^{T}+VYV^{T}, which implies

AB=VDADBVT+VDAYVT+VXDBVT+VXYVT,BA=VDBDAVT+VDBXVT+VYDAVT+VYXVT.\begin{split}&AB=VD_{A}D_{B}V^{T}+VD_{A}YV^{T}+VXD_{B}V^{T}+VXYV^{T},\\ &BA=VD_{B}D_{A}V^{T}+VD_{B}XV^{T}+VYD_{A}V^{T}+VYXV^{T}.\end{split}

Then, a direct computation gives

[A,B]=ABBA=V([DA,DB]+[DA,Y]+[X,DB]+[X,Y])VT=V([DA,Y]+[X,DB]+[X,Y])VT=V([DA+X,Y]+[X,DB])VT.\begin{split}[A,B]&=AB-BA=V\left([D_{A},D_{B}]+[D_{A},Y]+[X,D_{B}]+[X,Y]\right)V^{T}\\ &=V\left([D_{A},Y]+[X,D_{B}]+[X,Y]\right)V^{T}\\ &=V\left([D_{A}+X,Y]+[X,D_{B}]\right)V^{T}\,.\end{split} (A.4)

By the triangle inequality and the unitary invariance of operator norm, there holds

[A,B][DA+X,Y]+[X,DB],\lVert[A,B]\rVert\leq\lVert[D_{A}+X,Y]\rVert+\lVert[X,D_{B}]\rVert\,,

which further yields, by a simple estimate using (A.3),

[A,B]\displaystyle\lVert[A,B]\rVert 2DA+XY+2XDB2(X+Y)22𝒥(V).\displaystyle\leq 2\lVert D_{A}+X\rVert\lVert Y\rVert+2\lVert X\rVert\lVert D_{B}\rVert\leq 2\left(\lVert X\rVert+\lVert Y\rVert\right)\leq 2\sqrt{2\mathcal{J}(V)}\,.

The proof is complete.

References

  • [1] P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
  • [2] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In 2006 IEEE international conference on acoustics speech and signal processing proceedings, volume 5, pages V–V. IEEE, 2006.
  • [3] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • [4] B. Afsari and P. S. Krishnaprasad. Some gradient based joint diagonalization methods for ICA. In Independent Component Analysis and Blind Signal Separation: Fifth International Conference, ICA 2004, Granada, Spain, September 22-24, 2004. Proceedings 5, pages 437–444. Springer, 2004.
  • [5] G. W. Anderson, A. Guionnet, and O. Zeitouni. An introduction to random matrices. Cambridge Studies in Advanced Mathematics. Cambridge university press, 2010.
  • [6] A. Barachant et al. pyriemann: Biosignals classification with Riemannian geometry. https://pyriemann.readthedocs.io/en/latest/index.html, 2015. [Online].
  • [7] R. Bergmann and R. Herzog. Intrinsic formulation of KKT conditions and constraint qualifications on smooth manifolds. SIAM Journal on Optimization, 29(4):2423–2444, 2019.
  • [8] M. Berry and A. Sameh. Multiprocessor Jacobi algorithms for dense symmetric eigenvalue and singular value decompositions. Technical report, Illinois Univ., Urbana (USA). Center for Supercomputing Research and Development, 1986.
  • [9] R. Bhatia. Spectral variation, normal matrices, and Finsler geometry. Mathematical Intelligencer, 29(3):41–46, 2007.
  • [10] M. A. A. Bortoloti, T. A. Fernandes, and O. P. Ferreira. An efficient damped Newton-type algorithm with globalization strategy on Riemannian manifolds. Journal of Computational and Applied Mathematics, 403:113853, 2022.
  • [11] F. Bouchard, B. Afsari, J. Malick, and M. Congedo. Approximate joint diagonalization with Riemannian optimization on the general linear group. SIAM Journal on Matrix Analysis and Applications, 41(1):152–170, 2020.
  • [12] N. Boumal. An introduction to optimization on smooth manifolds. To appear with Cambridge University Press, Apr 2022.
  • [13] M. M. Bronstein, K. Glashoff, and T. A. Loring. Making Laplacians commute. arXiv preprint arXiv:1307.6549, 2013.
  • [14] A. Bunse-Gerstner, R. Byers, and V. Mehrmann. Numerical methods for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 14(4):927–949, 1993.
  • [15] J.-F. Cardoso. Multidimensional independent component analysis. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), volume 4, pages 1941–1944. IEEE, 1998.
  • [16] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 362–370. IET, 1993.
  • [17] J.-F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161–164, 1996.
  • [18] E. de Klerk, C. Dobre, and D. V. Pasechnik. Numerical block diagonalization of matrix *-algebras with application to semidefinite programming. Mathematical Programming, 129(1):91–111, 2011.
  • [19] L. De Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM Journal on Matrix Analysis and Applications, 28(3):642–666, 2006.
  • [20] P. Eberlein. Solution to the complex eigenproblem by a norm reducing Jacobi type method. Numerische Mathematik, 14(3):232–245, 1970.
  • [21] P. J. Eberlein. On one-sided Jacobi methods for parallel computation. SIAM Journal on Algebraic Discrete Methods, 8(4):790–796, 1987.
  • [22] A. El Ghazi, S. El Hajji, L. Giraud, and S. Gratton. Newton’s method for the common eigenvector problem. Journal of Computational and Applied Mathematics, 219(2):398–407, 2008.
  • [23] E. Evert and L. De Lathauwer. Guarantees for existence of a best canonical polyadic approximation of a noisy low-rank tensor. SIAM Journal on Matrix Analysis and Applications, 43(1):328–369, 2022.
  • [24] D. Eynard, K. Glashoff, M. M. Bronstein, and A. M. Bronstein. Multimodal diffusion geometry by joint diagonalization of Laplacians. arXiv preprint arXiv:1209.2295, 2012.
  • [25] D.-Z. Feng, X.-D. Zhang, and Z. Bao. An efficient multistage decomposition approach for independent components. Signal Processing, 83(1):181–197, 2003.
  • [26] O. P. Ferreira, A. N. Iusem, and S. Z. Németh. Projections onto convex sets on the sphere. Journal of Global Optimization, 57(3):663–676, 2013.
  • [27] O. P. Ferreira, A. N. Iusem, and S. Z. Németh. Concepts and techniques of optimization on the sphere. Top, 22(3):1148–1170, 2014.
  • [28] O. P. Ferreira and B. F. Svaiter. Kantorovich’s theorem on Newton’s method in Riemannian manifolds. Journal of Complexity, 18(1):304–329, 2002.
  • [29] M. Fornasier, H. Huang, L. Pareschi, and P. Sünnen. Consensus-based optimization on the sphere: Convergence to global minimizers and machine learning. Journal of Machine Learning Research, 22(237):1–55, 2021.
  • [30] P. Friis and M. Rørdam. Almost commuting self-adjoint matrices – a short proof of Huaxin Lin’s theorem. Journal für die reine und angewandte Mathematik, 479:121–131, 1996.
  • [31] B. Gao, X. Liu, X. Chen, and Y.-x. Yuan. A new first-order algorithmic framework for optimization problems with orthogonality constraints. SIAM Journal on Optimization, 28(1):302–332, 2018.
  • [32] B. Gao, X. Liu, and Y.-x. Yuan. Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM Journal on Scientific Computing, 41(3):A1949–A1983, 2019.
  • [33] K. Glashoff and M. M. Bronstein. Matrix commutators: their asymptotic metric properties and relation to approximate joint diagonalization. Linear Algebra and its Applications, 439(8):2503–2513, 2013.
  • [34] H. H. Goldstine and L. P. Horwitz. A procedure for the diagonalization of normal matrices. Journal of the ACM (JACM), 6(2):176–195, 1959.
  • [35] G. H. Golub and C. F. Van Loan. Matrix computations. JHU Press, 2013.
  • [36] F. Gygi, J.-L. Fattebert, and E. Schwegler. Computation of maximally localized Wannier functions using a simultaneous diagonalization algorithm. Computer Physics Communications, 155(1):1–6, 2003.
  • [37] S.-Y. Ha, M. Kang, D. Kim, J. Kim, and I. Yang. Stochastic consensus dynamics for nonconvex optimization on the Stiefel manifold: Mean-field limit and convergence. Mathematical Models and Methods in Applied Sciences, 32(03):533–617, 2022.
  • [38] P. R. Halmos. Some unsolved problems of unknown depth about operators on Hilbert space. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 76(1):67–76, 1976.
  • [39] M. B. Hastings. Topology and phases in fermionic systems. Journal of Statistical Mechanics: Theory and Experiment, 2008(01):L01001, 2008.
  • [40] M. B. Hastings. Making almost commuting matrices commute. Communications in Mathematical Physics, 291(2):321–345, 2009.
  • [41] M. B. Hastings and T. A. Loring. Almost commuting matrices, localized Wannier functions, and the quantum Hall effect. Journal of Mathematical Physics, 51(1):015214, 2010.
  • [42] M. B. Hastings and T. A. Loring. Topological insulators and C*-algebras: Theory and numerical practice. Annals of Physics, 326(7):1699–1759, 2011.
  • [43] D. Herrera. Constructive Approaches to Lin’s Theorem for Almost Commuting Matrices. arXiv preprint arXiv:2011.11800, 2020.
  • [44] I. Kachkovskiy and Y. Safarov. Distance to normal elements in C*-algebras of real rank zero. Journal of the American Mathematical Society, 29(1):61–80, 2016.
  • [45] J. Kim, M. Kang, D. Kim, S.-Y. Ha, and I. Yang. A stochastic consensus method for nonconvex optimization on the Stiefel manifold. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1050–1057. IEEE, 2020.
  • [46] A. Kovnatsky, M. M. Bronstein, A. M. Bronstein, K. Glashoff, and R. Kimmel. Coupled quasi-harmonic bases. In Computer Graphics Forum, volume 32, pages 439–448. Wiley Online Library, 2013.
  • [47] X.-b. Li, N.-j. Huang, Q. H. Ansari, and J.-C. Yao. Convergence rate of descent method with new inexact line-search on Riemannian manifolds. Journal of Optimization Theory and Applications, 180(3):830–854, 2019.
  • [48] X.-L. Li and X.-D. Zhang. Nonorthogonal joint diagonalization free of degenerate solution. IEEE Transactions on Signal Processing, 55(5):1803–1814, 2007.
  • [49] H. Lin. Almost commuting selfadjoint matrices and applications. Fields Inst. Commun, 13:193–233, 1997.
  • [50] C. Liu and N. Boumal. Simple algorithms for optimization on Riemannian manifolds with constraints. Applied Mathematics & Optimization, 82(3):949–981, 2020.
  • [51] T. A. Loring and A. P. Sørensen. Almost commuting self-adjoint matrices: the real and self-dual cases. Reviews in Mathematical Physics, 28(07):1650017, 2016.
  • [52] T. Maehara and K. Murota. A numerical algorithm for block-diagonal decomposition of matrix *-algebras with general irreducible components. Japan Journal of Industrial and Applied Mathematics, 27(2):263–293, 2010.
  • [53] T. Maehara and K. Murota. Algorithm for error-controlled simultaneous block-diagonalization of matrices. SIAM Journal on Matrix Analysis and Applications, 32(2):605–620, 2011.
  • [54] K. Murota, Y. Kanno, M. Kojima, and S. Kojima. A numerical algorithm for block-diagonal decomposition of matrix *-algebras with application to semidefinite programming. Japan Journal of Industrial and Applied Mathematics, 27(1):125–160, 2010.
  • [55] J. v. Neumann and E. P. Wigner. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. In The Collected Works of Eugene Paul Wigner, pages 294–297. Springer, 1993.
  • [56] H. Nguyen, T. Tao, and V. Vu. Random matrices: tail bounds for gaps between eigenvalues. Probability Theory and Related Fields, 167(3):777–816, 2017.
  • [57] Y. Ogata. Approximating macroscopic observables in quantum spin systems with commuting matrices. Journal of Functional Analysis, 264(9):2005–2033, 2013.
  • [58] F. Rellich and J. Berkowitz. Perturbation theory of eigenvalue problems. CRC Press, 1969.
  • [59] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.
  • [60] P. Rosenthal. Are almost commuting matrices near commuting matrices? The American Mathematical Monthly, 76(8):925–926, 1969.
  • [61] D. N. Rutledge and D. J.-R. Bouveresse. Independent components analysis with the JADE algorithm. TrAC Trends in Analytical Chemistry, 50:22–32, 2013.
  • [62] H. Sato. Riemannian Newton-type methods for joint diagonalization on the Stiefel manifold with application to independent component analysis. Optimization, 66(12):2211–2231, 2017.
  • [63] P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966.
  • [64] M. Sørensen, L. D. Lathauwer, P. Comon, S. Icart, and L. Deneire. Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM Journal on Matrix Analysis and Applications, 33(4):1190–1213, 2012.
  • [65] T. Tao. Topics in random matrix theory, volume 132. American Mathematical Society, 2012.
  • [66] T. Tao and V. Vu. Random matrices have simple spectrum. Combinatorica, 37(3):539–553, 2017.
  • [67] M. Teytel. How rare are multiple eigenvalues? Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 52(8):917–934, 1999.
  • [68] F. Theis. Towards a general independent subspace analysis. Advances in Neural Information Processing Systems, 19, 2006.
  • [69] F. J. Theis, T. P. Cason, and P.-A. Absil. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In Independent Component Analysis and Signal Separation: 8th International Conference, ICA 2009, Paraty, Brazil, March 15-18, 2009. Proceedings 8, pages 354–361. Springer, 2009.
  • [70] K. Uhlenbeck. Generic properties of eigenfunctions. American Journal of Mathematics, 98(4):1059–1078, 1976.
  • [71] A.-J. Van Der Veen. Joint diagonalization via subspace fitting techniques. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 5, pages 2773–2776. IEEE, 2001.
  • [72] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142:397–434, 2013.
  • [73] W. H. Yang, L.-H. Zhang, and R. Song. Optimality conditions for the nonlinear programming problems on Riemannian manifolds. Pacific Journal of Optimization, 10(2):415–434, 2014.
  • [74] Y. Yang. Globally convergent optimization algorithms on Riemannian manifolds: Uniform framework for unconstrained and constrained optimization. Journal of Optimization Theory and Applications, 132(2):245–265, 2007.
  • [75] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.
  • [76] H. Zhang, A. Milzarek, Z. Wen, and W. Yin. On the geometric analysis of a quartic–quadratic optimization problem under a spherical constraint. Mathematical Programming, 195(1-2):421–473, 2022.