
Fair Principal Component Analysis and Filter Design

Gad Zalcberg(1) and Ami Wiesel(1)(2)
(1) School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
(2) Google Research, Israel
Abstract

We consider Fair Principal Component Analysis (FPCA) and search for a low dimensional subspace that spans multiple target vectors in a fair manner. FPCA is defined as a non-concave maximization of the worst projected target norm within a given set. The problem arises in filter design in signal processing, and when incorporating fairness into dimensionality reduction schemes. The state of the art approach to FPCA is via semidefinite programming followed by rank reduction methods. Instead, we propose to address FPCA using simple sub-gradient descent. We analyze the landscape of the underlying optimization in the case of orthogonal targets. We prove that the landscape is benign and that all local minima are globally optimal. Interestingly, the SDR approach leads to sub-optimal solutions in this orthogonal case. Finally, we discuss the equivalence between orthogonal FPCA and the design of normalized tight frames.

Index Terms:
Dimensionality Reduction, SDP, Fairness, Normalized Tight Frame, PCA

I Introduction

Dimensionality reduction is a fundamental problem in signal processing and machine learning. In particular, Principal Component Analysis (PCA) is among the most popular data science tools. It involves a non-concave maximization but has a tight semidefinite relaxation (SDR). Its optimization landscape, saddle points and extreme points are all well understood and it is routinely solved using scalable first order methods [1]. PCA maximizes the average performance across a given set of vector targets. In many settings, worst case metrics are preferred in order to ensure fairness and equal performance across all targets. This gives rise to Fair PCA (FPCA) [2], which will be defined formally in the next section. Unfortunately, changing the average PCA objective to a worst case FPCA objective results in an NP-hard problem [2] which is poorly understood. There is a growing body of work on convex relaxations via SDR for FPCA [2, 3], but these methods do not scale well and are inapplicable to many realistic settings. Therefore, the goal of this paper is to consider scalable first order solutions to FPCA and shed more light on the landscape of this important optimization problem.

Due to the significance of PCA it is non-trivial to track the origins of FPCA. In the context of filter design, FPCA with rank one constraints is known as multicast beamforming and there is a huge body of literature on this topic, e.g., [4, 5, 6]. In the modern context of fairness in machine learning, FPCA was considered in [7, 8, 2, 3]. It was shown that SDR with an iterative rounding technique provides near optimal performance when the rank is much larger than the square root of the number of targets. More generally, by interpreting the worst case operator as an L_{\infty} norm, FPCA is a special case of L_{2,p} norm optimizations. Classical PCA corresponds to L_{2,2}. Robust PCA algorithms such as [9] rely on L_{2,1}, and FPCA is the other extreme using L_{2,\infty}. Most of these works capitalize on the use of SDR, which leads to conic optimizations with provable performance guarantees. Finally, [10] proposed a different definition of fairness in PCA via multi-objective optimization. They developed a first order method for attaining random solutions on the PCA Pareto frontier. FPCA as defined above may be considered as one of these points.

SDR and nuclear norm relaxations are currently the state of the art in a wide range of subspace recovery problems. Unfortunately, SDR is known to scale poorly to high dimensions. Therefore, there is a growing body of work on first order solutions to semidefinite programs. The main trick is to factorize the low rank matrix and show that the landscape of the resulting non-convex objective is benign [11, 12, 13, 14, 15, 16, 17]. The SDR of FPCA involves two types of linear matrix inequalities and still poses a challenge. Therefore, we first reformulate the problem and then apply sub-gradient descent on the factorized formulation.

The main contribution of this paper is the observation that the landscape of the factorized FPCA optimization is benign when the targets are orthogonal. This is the case in which dimensionality reduction is most lossy. Yet, we show that it is easy from an optimization perspective. The maximization is non-concave, but every (non-connected) local minimum is globally optimal. Surprisingly, we show that this case is challenging for SDR. Its objective is tight but it is not trivial to project its solution onto the feasible set. Numerical experiments with synthetic data suggest that these properties also hold in more realistic near-orthogonal settings. Finally, a direct corollary of our analysis is an equivalence between orthogonal FPCA and the design of finite normalized tight frames [18]. This characterization may be useful in future works on data-driven normalized tight frame design.

Notations:

We use bold uppercase letters (e.g., {\mathbf{P}}) for matrices, bold lowercase letters (e.g., {\mathbf{v}}) for vectors and non-bold letters (e.g., n) for scalars. We use pythonic notation for indices of matrices: {\mathbf{U}}_{i:} for the i'th row, {\mathbf{U}}_{:j} for the j'th column and {\mathbf{U}}_{ij} for the i,j'th entry of a matrix. The set of d\times r (r\leq d) semi-orthogonal matrices (matrices with orthonormal columns) is denoted by \mathcal{O}^{d\times r}, the set of positive semidefinite matrices by \mathbb{S}^{d}_{+}, and the set of d\times d projection matrices of rank at most r by \mathcal{P}^{d}_{r} (and \mathcal{P}^{d}:=\mathcal{P}^{d}_{d}). Given a function f:A\rightarrow\mathbb{R}^{m} and {\mathbf{U}}\in A, we define the set of indices \mathcal{I}_{\mathbf{U}}:=\arg\max_{i}f_{i}({\mathbf{U}}). Finally, we define a projection operator onto the set of projection matrices of rank at most r, \Pi_{r}:\mathbb{S}^{d}_{+}\rightarrow\mathcal{P}^{d}_{r}, as follows: let {\mathbf{P}}={\mathbf{U}}{\bf\Sigma}{\mathbf{U}}^{T} be an eigenvalue decomposition with {\mathbf{U}}=({\mathbf{u}}_{1},...,{\mathbf{u}}_{d}) ordered by decreasing eigenvalue; then \Pi_{r}[{\mathbf{P}}]:=({\mathbf{u}}_{1},...,{\mathbf{u}}_{r})({\mathbf{u}}_{1},...,{\mathbf{u}}_{r})^{T}.
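As an illustration, a minimal NumPy sketch (ours, not part of the paper) of the projection operator \Pi_{r} could look as follows; the function name project_rank_r is our own.

```python
import numpy as np

def project_rank_r(P, r):
    """Return Pi_r[P] = U_r U_r^T, where U_r holds the top-r eigenvectors of P."""
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh((P + P.T) / 2)
    U_r = eigvecs[:, -r:]          # eigenvectors of the r largest eigenvalues
    return U_r @ U_r.T             # a projection matrix of rank at most r
```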

II Problem formulation

The goal of this paper is to identify a low dimensional subspace that maximizes the smallest norm of a given set of projected targets. More specifically, let \{{\mathbf{x}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d} be the set of targets; we consider the problem:

{\rm{FPCA:}}\quad\begin{array}{ll}\max_{{\mathbf{P}}\in\mathcal{P}^{d}}&\min_{i\in[n]}\;{\mathbf{x}}_{i}^{T}{\mathbf{P}}{\mathbf{x}}_{i}\\ {\rm{s.t.}}&{\rm{rank}}\left({\mathbf{P}}\right)\leq r\end{array} (1)

Our motivation for FPCA arises in the context of filter design for detection. We are interested in the design of a linear sampling device from \mathbb{R}^{d} to \mathbb{R}^{r} that will allow detection of n known targets denoted by \{{\mathbf{x}}_{i}\}_{i=1}^{n}. The motivation for using a small rank is that the cost of power, space and/or time resources typically scales with r. The probability of a detection error in additive white Gaussian noise decreases exponentially with the received signal to noise ratio (SNR), and it is therefore natural to maximize the worst SNR across all the targets. Hopefully, this will lead to a fair solution with equal norms for all the targets.

FPCA with r=1 is concerned with the design of a single beamforming filter, and is equivalent to multicast downlink transmit beamforming [4, 5]:

\begin{array}{ll}\max_{{\mathbf{u}}\in\mathbb{R}^{d}}&\min_{i\in[n]}\;({\mathbf{x}}_{i}^{T}{\mathbf{u}})^{2}\\ {\rm{s.t.}}&\left\|{\mathbf{u}}\right\|\leq 1\end{array} (2)

Practical systems typically satisfy r<n\ll d, e.g., the design of a few expensive sensors that downsample a high resolution digital signal (or even an infinite dimensional analog signal). Without loss of optimality, we assume a first stage of dimensionality reduction via PCA that results in effective dimensions such that n=d.

As detailed in the introduction, FPCA was also recently introduced in the context of fair machine learning. There, it is more natural to assume a block structure. The targets are divided into n blocks, denoted by d\times b_{i} matrices {\mathbf{X}}_{i}, and fairness needs to be respected with respect to properties such as gender or race [7, 2, 3]:

{\rm{FPCA}}^{\rm{blocks}}:\quad\begin{array}{ll}\max_{{\mathbf{P}}\in\mathcal{P}^{d}}&\min_{i\in[n]}\;{\rm{Tr}}\left({\mathbf{X}}_{i}^{T}{\mathbf{P}}{\mathbf{X}}_{i}\right)\\ {\rm{s.t.}}&{\rm{rank}}\left({\mathbf{P}}\right)\leq r\end{array} (3)

Throughout this paper, we will consider the simpler non-block FPCA formulation corresponding to filter design. Preliminary experiments suggest that most of the results also hold in the block case.

FPCA is known to be NP-hard [4, 2]. The state of the art approach to FPCA is SDR. Specifically, we relax the rank constraint by its convex hull, the nuclear norm, and the projection constraint by linear matrix inequalities [5, 2]. This yields the SDP:

{\rm{SDR:}}\quad\begin{array}{ll}\max_{{\mathbf{P}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}\;{\mathbf{x}}_{i}^{T}{\mathbf{P}}{\mathbf{x}}_{i}\\ {\rm{s.t.}}&{\rm{Trace}}\left({\mathbf{P}}\right)\leq r\\ &{\bf{0}}\preceq{\mathbf{P}}\preceq{\bf{I}}\end{array} (4)

The computational complexity of solving an SDR using an interior point method is intractable for most applications, but [2] proposes a practical and efficient multiplicative weight update. Unfortunately, the optimal solution to SDR might not be a feasible projection, and {\mathbf{P}}_{\rm{SDR}} may have any rank. Due to the relaxation, SDR always results in an upper bound on FPCA. To obtain a feasible approximation, it is customary to define

{\mathbf{P}}_{\rm{PrSDR}}=\Pi_{r}[{\mathbf{P}}_{\rm{SDR}}] (5)

PrSDR is a feasible projection matrix of rank r, and therefore yields a lower bound on FPCA. Better approximations may be obtained via randomized procedures [5]. Recently, an iterative rounding technique was proven to provide a \left(1-\frac{O(\sqrt{n})}{r}\right) approximation [2]. This result is near optimal in the block case, where it is reasonable to assume r\gg\sqrt{n}. It is less applicable to filter design, where n is large and smaller ranks are required.
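For concreteness, here is a minimal CVXPY sketch (ours, not the authors' implementation) of the relaxation (4) followed by the eigenvalue-based projection (5); the multiplicative weight update and the iterative rounding of [2] are not reproduced.

```python
import cvxpy as cp
import numpy as np

def sdr_fpca(X, r):
    """X: n x d array of targets (rows are x_i). Returns (P_SDR, upper bound)."""
    d = X.shape[1]
    P = cp.Variable((d, d), symmetric=True)
    t = cp.Variable()
    constraints = [cp.trace(P) <= r, P >> 0, np.eye(d) - P >> 0]
    # x_i^T P x_i = trace(P x_i x_i^T), written as an elementwise sum (affine in P)
    constraints += [cp.sum(cp.multiply(P, np.outer(x, x))) >= t for x in X]
    prob = cp.Problem(cp.Maximize(t), constraints)
    prob.solve()  # requires an SDP-capable solver such as SCS (shipped with CVXPY)
    return P.value, t.value

def pr_sdr(P_sdr, r):
    """Project the SDR solution onto rank <= r projection matrices, as in (5)."""
    eigvals, eigvecs = np.linalg.eigh(P_sdr)
    U_r = eigvecs[:, -r:]
    return U_r @ U_r.T
```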

The goal of this paper is to provide a scalable, yet accurate solution to FPCA, without the need for additional rank reduction schemes. Motivated by the growing success of simple gradient based methods in complex optimization problems, e.g., deep learning, we consider the application of sub-gradient descent to FPCA and analyze its optimization landscape.

III Algorithm

In this section, we propose an alternative and more scalable approach for solving SDR. The two optimization challenges in FPCA are the projection and rank constraints. We confront the first challenge by reformulating the problem using a quadratic objective, and the second by decomposing the projection matrix using its low rank factors. Together, we define factorized FPCA:

{\rm{F-FPCA:}}\quad\min_{{\mathbf{U}}\in\mathbb{R}^{d\times r}}\;\max_{i\in[n]}\;f_{i}({\mathbf{U}}) (6)

where

f_{i}({\mathbf{U}})=\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2} (7)

The formal equivalence between (1) and (6) is stated below.

Proposition 1.

Let {\mathbf{U}} be a globally optimal solution to F-FPCA in (6). Then, {\mathbf{P}}=\Pi[{\mathbf{U}}{\mathbf{U}}^{T}] is a globally optimal solution to FPCA in (1).

Before proving the proposition, we note that the projection \Pi[{\mathbf{U}}{\mathbf{U}}^{T}] is only needed in order to handle a degenerate case in which the dimension of the subspace spanned by the targets is smaller than r. Typically, this projection is not needed and {\mathbf{U}}{\mathbf{U}}^{T} is feasible.

Proof.

We rely on the observation that F-FPCA has an optimal solution with a semi-orthogonal matrix, and for any semi-orthogonal matrix we have:

-f_{i}({\mathbf{U}})={\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}

In addition, the function {\mathbf{U}}\mapsto{\mathbf{U}}{\mathbf{U}}^{T} is surjective from \mathcal{O}^{d\times r} onto \mathcal{P}_{r}\setminus\mathcal{P}_{r-1}, so the optimization over both sets is equivalent. More details are available in the Appendix. ∎
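As a quick numerical sanity check of the identity used above (ours, not part of the paper), the following snippet verifies that \left\|{\mathbf{x}}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}\right\|^{2}-\left\|{\mathbf{x}}\right\|^{2}=-{\mathbf{x}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}} for a semi-orthogonal {\mathbf{U}}:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 3
U, _ = np.linalg.qr(rng.standard_normal((d, r)))   # semi-orthogonal: U^T U = I_r
x = rng.standard_normal(d)

lhs = np.linalg.norm(x - U @ U.T @ x) ** 2 - np.linalg.norm(x) ** 2
rhs = -x @ U @ U.T @ x
assert np.isclose(lhs, rhs)
```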

The advantage of solving F-FPCA rather than FPCA is that it forces a low rank solution via an unconstrained optimization. A member of the sub-gradient of the F-FPCA objective can be computed in O(drn). In particular, Algorithm 1 describes a promising sub-gradient descent method for its minimization.

The obvious downside of using F-FPCA is its non-convexity that may cause descent algorithms to converge to bad stationary points. Its convergence analysis is more difficult due to the non-smooth maximum function. Nonetheless, in the next section, we prove that there are no bad local minima when the targets are orthogonal. This is also demonstrated in the experimental section where we show the advantages of F-FPCA in terms of accuracy.

Relation to other low rank optimization papers: We note in passing that there is a large body of literature on global optimality properties of low rank optimizations [16, 17]. These provide sufficient conditions for convergence to global optimum in factorized formulations, e.g., Restricted Strong Convexity and Smoothness. Observe that these guarantees require the existence of a low rank optimal solution in the original problem. These conditions do not hold in FPCA, and therefore our analysis below takes a different approach.

Algorithm 1 F-FPCA via sub-gradient descent
Input:  \{{\mathbf{x}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d}, r\in\mathbb{N}, \eta.
Output:  {\mathbf{P}}\in\mathcal{P}^{d}, {\rm{rank}}\left({\mathbf{P}}\right)\leq r.
1:  t\leftarrow 0
2:  draw {\mathbf{U}}\in\mathbb{R}^{d\times r} randomly.
3:  repeat
4:     t\leftarrow t+1
5:     \hat{i}\leftarrow\arg\max_{i\in[n]}\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}
6:     {\mathbf{U}}\leftarrow{\mathbf{U}}-\frac{\eta}{t}\left({\mathbf{x}}_{\hat{i}}{\mathbf{x}}_{\hat{i}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}+{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{\hat{i}}{\mathbf{x}}_{\hat{i}}^{T}-2{\mathbf{x}}_{\hat{i}}{\mathbf{x}}_{\hat{i}}^{T}\right){\mathbf{U}}
7:  until convergence
8:  return  {\mathbf{P}}=\Pi\left[{\mathbf{U}}{\mathbf{U}}^{T}\right]
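The following is a minimal NumPy sketch of Algorithm 1, written by us for illustration. The step size schedule \eta/t follows the listing above, while the fixed iteration count and the function name f_fpca are our own choices.

```python
import numpy as np

def f_fpca(X, r, eta=0.1, iters=5000, seed=0):
    """X: n x d array of targets (rows are x_i). Returns a d x d projection of rank <= r."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((d, r))
    for t in range(1, iters + 1):
        # worst target: argmax_i ||x_i - U U^T x_i||^2 - ||x_i||^2
        R = X - X @ U @ U.T                           # rows are x_i - U U^T x_i
        vals = (R ** 2).sum(axis=1) - (X ** 2).sum(axis=1)
        x = X[int(np.argmax(vals))]
        C = np.outer(x, x)
        G = (C @ U @ U.T + U @ U.T @ C - 2 * C) @ U   # sub-gradient direction of step 6
        U = U - (eta / t) * G
    # final projection Pi[U U^T] via the top-r eigenvectors
    w, V = np.linalg.eigh(U @ U.T)
    Ur = V[:, -r:]
    return Ur @ Ur.T
```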

IV Analysis - the orthogonal case

In this section, we analyze FPCA in the special case of orthogonal targets. As explained, FPCA is NP-hard and we do not expect a scalable and accurate solution for arbitrary targets. Interestingly, our analysis shows that the problem becomes significantly easier when the targets are orthogonal. This is approximately the case, for example, when the targets are randomly generated and the number of targets is much smaller than their dimension.

We will use the following assumptions:

  1. A1: The targets \{{\mathbf{x}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d} are orthogonal vectors.

  2. A2: The problem is not degenerate in the sense that

     \frac{r}{n}H<\min_{i}\left\|{\mathbf{x}}_{i}\right\|^{2}

     where

     H=\frac{n}{\sum_{i=1}^{n}\frac{1}{\left\|{\mathbf{x}}_{i}\right\|^{2}}}

     is the harmonic mean of the squared norms of \{{\mathbf{x}}_{i}\}_{i=1}^{n}.

Assumption A1 is the main property that simplifies the landscape and allows a tractable solution and analysis. On the other hand, assumption A2 is a technical condition that prevents a trivial degenerate solution based on the norms of the targets.
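For reference, a small helper (ours, not from the paper) that computes the harmonic mean H and checks the non-degeneracy condition A2 might look as follows:

```python
import numpy as np

def harmonic_mean_sq_norms(X):
    """X: n x d array of targets. Returns H, the harmonic mean of ||x_i||^2."""
    sq = (X ** 2).sum(axis=1)
    return len(sq) / np.sum(1.0 / sq)

def a2_holds(X, r):
    """Check the non-degeneracy condition (r/n) * H < min_i ||x_i||^2."""
    sq = (X ** 2).sum(axis=1)
    H = harmonic_mean_sq_norms(X)
    return (r / len(sq)) * H < sq.min()
```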

Using these assumptions, we have the following results.

Proposition 2.

Under assumptions A1-A2, any local minimizer of F-FPCA is a global maximizer of FPCA and FPCA=\frac{r}{n}H.

Proof.

The proof consists of the following lemmas (proofs in the appendix):

Lemma 1.

Under assumptions A1-A2, let {\mathbf{U}}\in\mathbb{R}^{d\times r} be a local minimizer of F-FPCA; then {\mathbf{U}}\in\mathcal{O}^{d\times r}.

Lemma 2.

Under assumptions A1-A2, let {\mathbf{U}}\in\mathcal{O}^{d\times r} be a local minimizer of F-FPCA; then f=f_{i}({\mathbf{U}})=f_{j}({\mathbf{U}}) for all i,j\in[n].

Intuitively, if the property in Lemma 2 is violated, then {\mathbf{U}} can be infinitesimally changed in a manner that decreases the correlation of {\mathbf{U}} with some direction {\mathbf{w}} such that {\mathbf{w}}\perp{\mathbf{x}}_{j} for all j\in\mathcal{I}_{\mathbf{U}}. We can decrease the value of f_{i} for some i\in\mathcal{I}_{\mathbf{U}} without harming the objective function using a sequence of Givens rotations with respect to the pairs \{{\mathbf{w}},{\mathbf{x}}_{i}\} for each i\in\mathcal{I}_{\mathbf{U}}. After decreasing f_{i} for all i\in\mathcal{I}_{\mathbf{U}} the objective will also be decreased.

Finally, in order to prove global optimality we define:

{\mathbf{X}}=\left(\frac{{\mathbf{x}}_{1}}{\left\|{\mathbf{x}}_{1}\right\|},...,\frac{{\mathbf{x}}_{n}}{\left\|{\mathbf{x}}_{n}\right\|}\right),\quad\hat{{\mathbf{U}}}={\mathbf{X}}^{T}{\mathbf{U}} (8)

If f=f_{i}({\mathbf{U}})=f_{j}({\mathbf{U}}):

\left\|\hat{{\mathbf{U}}}_{i:}\right\|^{2}=-\frac{f_{i}({\mathbf{U}})}{\left\|{\mathbf{x}}_{i}\right\|^{2}}=-\frac{f}{\left\|{\mathbf{x}}_{i}\right\|^{2}}

We have:

r=\left\|\hat{{\mathbf{U}}}\right\|^{2}_{F}=\sum_{i=1}^{n}\left\|\hat{{\mathbf{U}}}_{i:}\right\|^{2}=-\sum_{i=1}^{n}\frac{f}{\left\|{\mathbf{x}}_{i}\right\|^{2}}

Rearranging yields f=-\frac{r}{n}H. Together with the equivalence in Proposition 1, we conclude that all local minima yield an identical objective of \frac{r}{n}H, which is globally optimal. ∎

Proposition 2 justifies the use of Algorithm 1 when the targets are orthogonal. Numerical results in the next section suggest that bad local minima are rare even in more realistic near-orthogonal scenarios.

Given the favourable properties of F-FPCA in the orthogonal case, it is interesting to analyze the performance of SDR in this case.

Proposition 3.

Under assumptions A1-A2, SDR is tight and its optimal objective value is

{\rm{SDR}}=\frac{r}{n}H.

However, the optimal solution may be full rank and infeasible for FPCA.

Proof.

See Appendix. ∎

Observe that the rank constraint is hard, and a rank reduction procedure such as PrSDR is necessary for finding a feasible solution. The iterative rounding algorithm of [2] relies on finding an extreme point solution, and guarantees an upper bound on its rank. The bound is not always tight. For example, their algorithm fails to find an optimal low rank solution in the orthogonal case. On the other hand, Algorithm 1 easily finds the global solution.
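To illustrate the rank issue numerically, the following sketch (ours, following the construction in the proof of Proposition 3 in Appendix D) builds the diagonal SDR-optimal solution for orthogonal targets and shows that it attains the value \frac{r}{n}H with trace r while being full rank, hence infeasible for FPCA:

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 6
r = 2
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
X = Q * rng.uniform(1.0, 2.0, size=d)         # columns are orthogonal targets x_i
sq = (X ** 2).sum(axis=0)                     # squared norms ||x_i||^2
H = n / np.sum(1.0 / sq)                      # harmonic mean of the squared norms

P_hat = np.diag((r / n) * H / sq)             # optimal SDR solution in the basis of X
Xn = X / np.sqrt(sq)                          # normalized targets as columns
P = Xn @ P_hat @ Xn.T                         # back to the original coordinates

print(np.trace(P))                            # approximately r
print(min(x @ P @ x for x in X.T))            # approximately (r/n) * H for every target
print(np.linalg.matrix_rank(P))               # d > r: full rank, not a valid projection
```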

Finally, we conclude this section by noting an interesting relation between FPCA with orthogonal targets and the design of Finite Tight Frames [18]. Recall the following definition:

Definition 1.

  • Let \{{\mathbf{u}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{r}. If {\rm{span}}(\{{\mathbf{u}}_{i}\}_{i=1}^{n})=\mathbb{R}^{r} then \{{\mathbf{u}}_{i}\}_{i=1}^{n} is a frame for \mathbb{R}^{r}.

  • A frame \{{\mathbf{u}}_{i}\}_{i=1}^{n} is tight with frame bound A if for all {\mathbf{v}}\in\mathbb{R}^{r}:

    {\mathbf{v}}=\frac{1}{A}\sum_{i=1}^{n}\left<{\mathbf{v}},{\mathbf{u}}_{i}\right>{\mathbf{u}}_{i}

  • A frame \{{\mathbf{u}}_{i}\}_{i=1}^{n} is a normalized tight frame if it is a tight frame and \left\|{\mathbf{u}}_{i}\right\|=1 for all i.

A straightforward consequence is the following result.

Corollary 1.

Under assumptions A1-A2, if {\mathbf{U}} is an optimal solution for F-FPCA, then {\mathbf{U}}^{T} is a tight frame. In particular, if the targets are the standard basis, then \sqrt{\frac{d}{r}}{\mathbf{U}}^{T} is a normalized tight frame.

Sketch of proof (full proof in the appendix): As proved above, any optimal solution of F-FPCA lies in \mathcal{O}^{d\times r}, and the transpose of any {\mathbf{U}}\in\mathcal{O}^{d\times r} is a tight frame. The second part follows since any optimal solution of F-FPCA satisfies \left\|{\mathbf{x}}_{k}^{T}{\mathbf{U}}\right\|^{2}=\frac{r}{n}H for all k. For the standard basis we get \left\|{\mathbf{U}}^{T}{\mathbf{e}}_{i}\right\|=\left\|{\mathbf{U}}^{T}{\mathbf{e}}_{j}\right\| for all i,j, i.e., the norms of all rows of {\mathbf{U}} are equal.

It is well known that normalized tight frames can be derived as minimizers of frame potential functions [18]. The corollary provides an alternative derivation via FPCA with different targets 𝐱i{\mathbf{x}}_{i}. Depending on the properties of the targets, this allows a flexible data-driven design that will be pursued in future work.
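A short numerical check (ours) of the frame property in Corollary 1: for any {\mathbf{U}} with orthonormal columns, the rows of {\mathbf{U}} form a tight frame for \mathbb{R}^{r} with frame bound 1, since their frame operator equals {\mathbf{U}}^{T}{\mathbf{U}}={\mathbf{I}}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 10, 3
U, _ = np.linalg.qr(rng.standard_normal((d, r)))     # U in O^{d x r}

S = sum(np.outer(u, u) for u in U)                    # frame operator of the rows of U
assert np.allclose(S, np.eye(r))                      # tight frame with bound A = 1

v = rng.standard_normal(r)
recon = sum((v @ u) * u for u in U)                   # (1/A) sum <v, u_i> u_i with A = 1
assert np.allclose(recon, v)                          # exact reconstruction
```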

V Experimental results

In this section, we illustrate the efficacy of the different algorithms using numerical experiments. We compare the following competitors:

  • SDR - a (possibly infeasible) upper bound defined as the solution to (4) via CVXPY [19, 20].

  • PIRSDR - the projection of SDR onto the feasible set using an eigenvalue decomposition. Before projecting the solution, the iterative rounding rank reduction from [2] is performed.

  • F-FPCA - the solution to (6) via Algorithm 1 with a random initialization.

  • F-FPCAi - the solution to (6) via Algorithm 1 with PIRSDR initialization.

To allow easy comparison, we normalize the results by the value of SDR, so that a ratio of 1 corresponds to a tight solution.

V-A Synthetic simulations

We begin with experiments on synthetic targets with independent, zero mean, unit variance, Gaussian random variables. This is clearly a simplistic setting, but it allows control over the different parameters r, n and d. Each of the experiments was performed 15 times and we report the average performance.

Rank effect: The first experiment is presented in Fig. 1 and illustrates the dependency on the rank r. It is easy to see that even with very small rank, the gap between the upper and lower bounds vanishes. We conclude that in this non-orthogonal setting, the landscape of FPCA is benign as long as the rank is not very small.


Figure 1: Quality of approximation as a function of the rank (d=200, n=50).

Orthogonality effect: The second experiment is presented in Fig. 2 and addresses the effect of orthogonality. As explained, the targets are drawn randomly and they tend to orthogonality as d increases. Our analysis proved that the gap should vanish when the targets are exactly orthogonal. The numerical results suggest that this is also true for more realistic and near-orthogonal targets. The optimality gap clearly decreases as d increases.


Figure 2: Quality of approximation as a function of the orthogonality (n=50, r=2).

V-B Minerals dataset

In order to illustrate the performance in a more realistic setting, we also considered a real world dataset. We consider the design of hyperspectral matched filters for the detection of known minerals. We downloaded spectral signatures of minerals from the Spectral Library of the United States Geological Survey (USGS). We experimented with 114 different minerals, each with 480 bands in the range 0.01\mu-3\mu. Some of the measurements were missing and their corresponding bands were omitted. We then performed PCA and reduced the dimension from 421 to 114. These vectors were normalized and then centered. Fig. 3 provides the signatures of the first minerals before and after the pre-processing. Finally, we performed fair dimension reduction to r=1,...,6. Fig. 4 summarizes the quality of the approximation of the different algorithms. As before, it is easy to see that F-FPCA is near optimal at very small ranks. Interestingly, PIRSDR is beneficial as an initialization but shows inferior and non-monotonic performance on its own. As expected, all the algorithms easily attain the optimal performance at higher values of r.



Figure 3: The spectral signatures of Actinolite, Adularia, Albite, Allanite and Almandine.


Figure 4: Quality of approximation as a function of the rank: minerals dataset.

V-C Credit dataset

Next, we continue to the credit dataset [21] that was considered for block-FPCA in [7, 2, 3]. Following these works, we consider the functions

f_{i}({\mathbf{U}})=\left\|{\mathbf{X}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{X}}_{i}\right\|^{2}-{\rm{PCA}}({\mathbf{X}}_{i}) (9)

where {\mathbf{X}}_{i}=\sum_{j=1}^{n_{i}}{\mathbf{x}}_{j}^{i}{\mathbf{x}}_{j}^{iT} and {\rm{PCA}} is the objective of the standard PCA of {\mathbf{X}}_{i}, which is independent of {\mathbf{U}}. The results in Fig. 5 are identical to those in [2] with our additional F-FPCA algorithm. PIRSDR achieves the SDR lower bound at all ranks except r=7. Remarkably, F-FPCA is optimal in this specific setting and attains the bound for all r without exception. Apparently, the landscape of the credit dataset is benign. We emphasize that this is pure luck and we can easily find other non-orthogonal examples with spurious local minima. In real applications, we recommend running F-FPCA using multiple initializations and choosing the best solution.


Figure 5: Marginal loss from [2] when splitting the credit data to 6 classes

VI Conclusion

In this paper, we suggested tackling the problem of fairness in linear dimension reduction by simply using first order methods on a non-convex optimization problem. We provided an analysis of the landscape of this problem in the special case where the targets are orthogonal to each other. We also provided experimental results which support our approach by showing that sub-gradient descent is successful also in the near-orthogonal case and on real world data.

There are many interesting extensions to this paper that are worth pursuing. Analysis of the near-orthogonal case is still an open question. In addition, a drawback of our approach is the non-smoothness of the landscape, which might prevent the use of standard convergence bounds for first order methods. This can be treated by approximating the L_{2,\infty} norm in our formulation by a log-sum-exp function or by an L_{2,p} norm with p<\infty. Experimental results show that our results can be extended to the block case that is more relevant to machine learning. Finally, we only considered the case of classical linear dimension reduction. Future work may focus on extensions to non-linear methods and tensor decompositions.
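As a sketch of the smoothing idea (ours; the paper only mentions it as a possible extension), one can replace \max_{i}f_{i}({\mathbf{U}}) by a log-sum-exp surrogate, which is smooth and upper bounds the maximum:

```python
import numpy as np

def smoothed_objective(U, X, beta=50.0):
    """Log-sum-exp approximation of max_i f_i(U), with f_i as in (7)."""
    R = X - X @ U @ U.T
    f = (R ** 2).sum(axis=1) - (X ** 2).sum(axis=1)     # f_i(U) for all targets
    m = f.max()                                          # standard numerical stabilization
    return m + np.log(np.exp(beta * (f - m)).sum()) / beta
```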

Appendix A Proof of Proposition 1

Let {\mathbf{O}}{\mathbf{D}}{\mathbf{O}}^{T} be a truncated EVD of {\mathbf{U}}{\mathbf{U}}^{T}; then:

\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}
=\left\|{\mathbf{x}}_{i}\right\|^{2}+{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-2{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-\left\|{\mathbf{x}}_{i}\right\|^{2}
={\mathbf{x}}_{i}^{T}{\mathbf{O}}{\mathbf{D}}^{2}{\mathbf{O}}^{T}{\mathbf{x}}_{i}-2{\mathbf{x}}_{i}^{T}{\mathbf{O}}{\mathbf{D}}{\mathbf{O}}^{T}{\mathbf{x}}_{i}
={\mathbf{x}}_{i}^{T}{\mathbf{O}}({\mathbf{D}}^{2}-2{\mathbf{D}}){\mathbf{O}}^{T}{\mathbf{x}}_{i}
=\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{2}-2{\mathbf{D}}_{ll})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{i}\right>^{2}

Observe that this expression is minimized when {\mathbf{D}}_{ll}=1 for all l\leq r, so:

\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}\geq-{\mathbf{x}}_{i}^{T}\Pi[{\mathbf{U}}{\mathbf{U}}^{T}]{\mathbf{x}}_{i}

So F-FPCA is equivalent to the following problem (over the semi-orthogonal matrices):

\min_{{\mathbf{U}}\in\mathcal{O}^{d\times r}}\;\max_{i\in[n]}\;\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}

Now, for any semi-orthogonal matrix {\mathbf{U}} we get:

\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}
=\left\|{\mathbf{x}}_{i}\right\|^{2}+{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-2{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-\left\|{\mathbf{x}}_{i}\right\|^{2}
={\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-2{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}
=-{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}

Finally, observe that:

  • {\mathbf{U}} is a feasible solution for the problem above iff {\mathbf{P}}={\mathbf{U}}{\mathbf{U}}^{T} is a feasible solution for FPCA.

  • The objective function of FPCA at {\mathbf{P}} is equal to the objective function of the problem above at {\mathbf{U}} (multiplied by -1).

So we conclude that the problems are equivalent.

Appendix B Proof of Lemma 1

We begin with the following lemma:

Lemma 3.

Let {\mathbf{U}}\in\mathcal{O}^{d\times r}. If A2 holds, then \forall i\in\mathcal{I}_{\mathbf{U}}:\;\left\|{\mathbf{x}}_{i}\right\|>\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|.

Proof.

Assume by contradiction that there exists k\in\mathcal{I}_{\mathbf{U}} (recall \mathcal{I}_{\mathbf{U}}:=\arg\max_{i}f_{i}({\mathbf{U}})) such that \left\|{\mathbf{x}}_{k}\right\|=\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{k}\right\|, and let j\in\arg\min_{i}\left\|{\mathbf{x}}_{i}\right\|. We get for all i:

f_{i}({\mathbf{U}})\leq f_{k}({\mathbf{U}})\leq-\left\|{\mathbf{x}}_{k}\right\|^{2}\leq-\left\|{\mathbf{x}}_{j}\right\|^{2}

Now recall the definition of \hat{{\mathbf{U}}} in (8) and observe that:

r=\left\|\hat{{\mathbf{U}}}\right\|^{2}_{F}=\sum_{i=1}^{n}\left\|\hat{{\mathbf{U}}}_{i:}\right\|^{2}=\sum_{i=1}^{n}\frac{-f_{i}({\mathbf{U}})}{\left\|{\mathbf{x}}_{i}\right\|^{2}}\geq\sum_{i=1}^{n}\frac{\left\|{\mathbf{x}}_{j}\right\|^{2}}{\left\|{\mathbf{x}}_{i}\right\|^{2}}
\;\Rightarrow\;\left\|{\mathbf{x}}_{j}\right\|^{2}\leq\frac{r}{\sum_{i=1}^{n}\frac{1}{\left\|{\mathbf{x}}_{i}\right\|^{2}}}

This means that A2 does not hold. ∎

We will now show that if {\mathbf{U}} is not semi-orthogonal, then we can decrease either the size of \mathcal{I}_{\mathbf{U}} or the value of \max_{i}f_{i}({\mathbf{U}}) by choosing an arbitrarily close {\mathbf{U}}^{\prime}.

Lemma 4.

Let {\mathbf{U}}\notin\mathcal{O}^{d\times r}. Then for any \epsilon>0 there exists a {\mathbf{U}}^{\prime} such that:

  1. \left\|{\mathbf{U}}-{\mathbf{U}}^{\prime}\right\|\leq\epsilon.

  2. Either |\mathcal{I}_{{\mathbf{U}}}|>|\mathcal{I}_{{\mathbf{U}}^{\prime}}|, or \max_{i}f_{i}({\mathbf{U}})>\max_{i}f_{i}({\mathbf{U}}^{\prime}).

Proof.

Let {\mathbf{O}}{\mathbf{D}}{\mathbf{O}}^{T} be an EVD of {\mathbf{U}}{\mathbf{U}}^{T}; then:

\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}=\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{2}-2{\mathbf{D}}_{ll})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{i}\right>^{2}

Since {\mathbf{U}}\notin\mathcal{O}^{d\times r}, there is an \hat{l}\leq r such that {\mathbf{D}}_{\hat{l},\hat{l}}\neq 1, and an i such that \left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{i}\right>\neq 0. Observe that h_{i}({\mathbf{D}}_{\hat{l},\hat{l}})=({\mathbf{D}}_{\hat{l},\hat{l}}^{2}-2{\mathbf{D}}_{\hat{l},\hat{l}})\left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{i}\right>^{2} has a local minimum only at {\mathbf{D}}_{\hat{l},\hat{l}}=1. Therefore, define {\mathbf{U}}^{\prime}={\mathbf{O}}{\mathbf{D}}^{\prime}{\mathbf{O}}^{T} where:

{\mathbf{D}}_{\hat{l},\hat{l}}^{\prime}=\begin{cases}{\mathbf{D}}_{\hat{l},\hat{l}}-\epsilon&\quad{\mathbf{D}}_{\hat{l},\hat{l}}>1\\ {\mathbf{D}}_{\hat{l},\hat{l}}+\epsilon&\quad{\mathbf{D}}_{\hat{l},\hat{l}}<1\end{cases}

and we get f_{j}({\mathbf{U}})>f_{j}({\mathbf{U}}^{\prime}) for all j such that \left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{j}\right>^{2}\neq 0.

If there exists an \hat{l} such that {\mathbf{D}}_{\hat{l},\hat{l}}\neq 1 and \left|\left\{j\,|\,\left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{j}\right>^{2}\neq 0\right\}\cap\mathcal{I}_{\mathbf{U}}\right|>0, then we are done.

Otherwise, pick some \hat{l} with {\mathbf{D}}_{\hat{l},\hat{l}}\neq 1, and after the procedure above take {\mathbf{x}}_{k} with k\in\mathcal{I}_{\mathbf{U}} and define {\mathbf{x}}^{\perp}_{k} as the projection of {\mathbf{x}}_{k} onto {\rm{Im}}({\mathbf{U}})^{\perp} (by Lemma 3, {\mathbf{x}}_{k}^{\perp}\neq 0). Define {\mathbf{O}}^{\prime} by adding \epsilon{\mathbf{x}}^{\perp}_{k} to the \hat{l}'th eigenvector {\mathbf{O}}_{:\hat{l}} of {\mathbf{U}}^{\prime} and define {\mathbf{U}}^{\prime\prime}={\mathbf{O}}^{\prime}{\mathbf{D}}^{\prime}{\mathbf{O}}^{\prime T}. Now we get that for all i\in\mathcal{I}_{{\mathbf{U}}^{\prime}}:

\left<{\mathbf{O}}_{:\hat{l}}^{\prime},{\mathbf{x}}_{i}\right>^{2}=\left(\left<{\mathbf{O}}_{:\hat{l}}+\epsilon{\mathbf{x}}_{k}^{\perp},{\mathbf{x}}_{i}\right>\right)^{2}=\left(\left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{i}\right>+\epsilon\left<{\mathbf{x}}_{k}^{\perp},{\mathbf{x}}_{i}\right>\right)^{2}=\epsilon^{2}\left<{\mathbf{x}}_{k}^{\perp},{\mathbf{x}}_{i}\right>^{2}\geq 0

\Rightarrow\;\left\|{\mathbf{x}}_{i}-{\mathbf{U}}^{\prime\prime}{\mathbf{U}}^{\prime\prime T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}=\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l}^{\prime},{\mathbf{x}}_{i}\right>^{2}\leq\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{i}\right>^{2}

Similarly, for {\mathbf{x}}_{k} we get:

\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l}^{\prime},{\mathbf{x}}_{k}\right>^{2}
=\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{k}\right>^{2}+\epsilon({\mathbf{D}}_{\hat{l}\hat{l}}^{\prime 2}-2{\mathbf{D}}_{\hat{l}\hat{l}}^{\prime})\left<{\mathbf{x}}_{k}^{\perp},{\mathbf{x}}_{k}\right>^{2}
<\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{k}\right>^{2}

as required. ∎

We can now apply Lemma 4 iteratively as follows. Let {\mathbf{U}}\notin\mathcal{O}^{d\times r}, and let \epsilon>0. By Lemma 4:

  • There is {\mathbf{U}}_{1} with \left\|{\mathbf{U}}_{1}-{\mathbf{U}}\right\|\leq\frac{\epsilon}{n}, s.t. |\mathcal{I}_{{\mathbf{U}}}|>|\mathcal{I}_{{\mathbf{U}}_{1}}|.

  • There is {\mathbf{U}}_{2} with \left\|{\mathbf{U}}_{2}-{\mathbf{U}}_{1}\right\|\leq\frac{\epsilon}{n}, s.t. |\mathcal{I}_{{\mathbf{U}}_{1}}|>|\mathcal{I}_{{\mathbf{U}}_{2}}|, and so on up to some {\mathbf{U}}_{K}.

  • There is {\mathbf{U}}^{\prime} with \left\|{\mathbf{U}}^{\prime}-{\mathbf{U}}_{K}\right\|\leq\frac{\epsilon}{n}, s.t. \max_{i}f_{i}({\mathbf{U}}_{K})>\max_{i}f_{i}({\mathbf{U}}^{\prime}).

Finally, observe that K+1\leq|\mathcal{I}_{U}|\leq n, so \left\|{\mathbf{U}}-{\mathbf{U}}^{\prime}\right\|\leq\epsilon\frac{K+1}{n}\leq\epsilon, and we can find an arbitrarily close {\mathbf{U}}^{\prime} such that \max_{i}f_{i}({\mathbf{U}}^{\prime})<\max_{i}f_{i}({\mathbf{U}}), i.e., {\mathbf{U}} is not a local minimizer.

Appendix C Proof of Lemma 2

We begin with the following lemma, which states that we can utilize the orthogonality of the targets in order to infinitesimally change {\mathbf{U}} in a manner that increases the value of f_{j} for some j\notin\mathcal{I}_{\mathbf{U}}, decreases the value of f_{i} for some i\in\mathcal{I}_{\mathbf{U}} and does not change the value of f_{k} for k\in\mathcal{I}_{{\mathbf{U}}}\setminus\{i\}.

Lemma 5.

Let {\mathbf{U}}\in\mathcal{O}^{d\times r} be such that there exists j with f_{j}({\mathbf{U}})<\max_{l\in[n]}f_{l}({\mathbf{U}}). Then there exists an i\in\mathcal{I}_{\mathbf{U}} such that for any \epsilon>0 there exists a {\mathbf{U}}_{\theta} with:

  1. \left\|{\mathbf{U}}-{\mathbf{U}}_{\theta}\right\|\leq\epsilon.

  2. \forall k\in[n]\setminus\mathcal{I}_{\mathbf{U}}:\quad f_{k}({\mathbf{U}}_{\theta})<f_{i}({\mathbf{U}}).

  3. \forall k\in\mathcal{I}_{\mathbf{U}}\setminus\left\{i\right\}:\quad f_{k}({\mathbf{U}})=f_{k}({\mathbf{U}}_{\theta}).

  4. f_{i}({\mathbf{U}}_{\theta})<f_{i}({\mathbf{U}}).

Proof.

Define {\mathbf{R}}(\theta), a Givens rotation (for some \theta) over the 1,2 axes, i.e.:

{\mathbf{R}}(\theta)_{ij}=\begin{cases}\cos\theta&ij=11\;\;{\rm{or}}\;\;ij=22\\ \sin\theta&ij=12\\ -\sin\theta&ij=21\\ {\mathbf{I}}_{ij}&{\rm{else}}\end{cases}

along with two orthogonal vectors

{\mathbf{v}}_{1}=\frac{{\mathbf{x}}_{i}}{\left\|{\mathbf{x}}_{i}\right\|},\qquad{\mathbf{v}}_{2}=\begin{cases}\frac{{\mathbf{x}}_{j}}{\left\|{\mathbf{x}}_{j}\right\|}&\exists i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\neq 0\\ \frac{{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}}{\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\right\|}&{\rm{else}}\end{cases} (11)

and complete them to an orthonormal basis of \mathbb{R}^{d}: {\mathbf{V}}=\left({\mathbf{v}}_{1},...,{\mathbf{v}}_{d}\right). Now define {\mathbf{U}}_{\theta}={\mathbf{V}}{\mathbf{R}}(\theta){\mathbf{V}}^{T}{\mathbf{U}} and we get:

Properties 1 and 2 hold since h_{1}(\theta):={\mathbf{U}}_{\theta} and h_{2}(\theta):=f_{j}({\mathbf{U}}_{\theta}) are continuous functions of \theta.

Property 3 holds since for all k\in\mathcal{I}_{U}\setminus\left\{i\right\} we have {\mathbf{x}}_{k}\perp{\mathbf{v}}_{1},{\mathbf{v}}_{2}, thus {\mathbf{R}}(\theta){\mathbf{V}}^{T}{\mathbf{x}}_{k}={\mathbf{I}}{\mathbf{V}}^{T}{\mathbf{x}}_{k} and:

{\mathbf{x}}_{k}^{T}{\mathbf{U}}_{\theta}{\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{k}={\mathbf{x}}_{k}^{T}{\mathbf{V}}{\mathbf{R}}(\theta){\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{R}}(\theta)^{T}{\mathbf{V}}^{T}{\mathbf{x}}_{k}={\mathbf{x}}_{k}^{T}{\mathbf{V}}{\mathbf{I}}{\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{I}}{\mathbf{V}}^{T}{\mathbf{x}}_{k}={\mathbf{x}}_{k}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{k}

In order to show Property 4, we use the equality in (12) below; its proof is deferred to Appendix F since it is quite technical:

f_{i}({\mathbf{U}}_{\theta})-f_{i}({\mathbf{U}})=\begin{cases}\sin\left(2\theta\right){\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{j}+\sin^{2}\left(\theta\right)\left({\mathbf{e}}_{j}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{j}-{\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{i}\right)&\exists i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\neq 0\\ \left(\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}\right)\sin^{2}(\theta)&{\rm{else}}\end{cases} (12)

Now, if \exists i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\neq 0:

  • If {\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}<0 then for any \frac{\pi}{2}>\theta>0: \sin\left(2\theta\right){\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{j}<0.

  • If {\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}>0 then for any -\frac{\pi}{2}<\theta<0: \sin\left(2\theta\right){\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{j}<0.

Since \sin^{2}(\theta)=o(\sin(2\theta)), for small enough |\theta| we get f_{i}({\mathbf{U}}_{\theta})<f_{i}({\mathbf{U}}).

On the other hand, by Lemma 3, \left\|{\mathbf{x}}_{i}\right\|^{2}>\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}, so if \forall i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}=0 then:

\left(\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}\right)\sin^{2}(\theta)<0

Armed with these results, we proceed to the rest of the proof of Lemma 2. Assume f_{i}({\mathbf{U}})<\max_{l}f_{l}({\mathbf{U}}) for some i. By Lemma 5, let \epsilon>0; then:

  • There is {\mathbf{U}}_{1} with \left\|{\mathbf{U}}_{1}-{\mathbf{U}}\right\|\leq\frac{\epsilon}{K}, s.t. \max_{i}f_{i}({\mathbf{U}}_{1})=\max_{i}f_{i}({\mathbf{U}}) and |\mathcal{I}_{{\mathbf{U}}_{1}}|=|\mathcal{I}_{{\mathbf{U}}}|-1, and so on for {\mathbf{U}}_{2},...,{\mathbf{U}}_{K-1}.

  • There is {\mathbf{U}}_{K} with \left\|{\mathbf{U}}_{K}-{\mathbf{U}}_{K-1}\right\|\leq\frac{\epsilon}{K}, s.t. \max_{i}f_{i}({\mathbf{U}}_{K})<\max_{i}f_{i}({\mathbf{U}}_{K-1}).

Finally, observe that \left\|{\mathbf{U}}-{\mathbf{U}}_{K}\right\|\leq\epsilon and K\leq|\mathcal{I}_{U}|, so we can find an arbitrarily close {\mathbf{U}}_{K} such that \max_{i}f_{i}({\mathbf{U}}_{K})<\max_{i}f_{i}({\mathbf{U}}), i.e., {\mathbf{U}} is not a local minimizer.

Appendix D Proof of Proposition 3

Proof.

Given \{{\mathbf{x}}_{i}\}_{i=1}^{n}, recall the definition of {\mathbf{X}} in (8). In the SDR problem we solve:

\begin{array}{ll}\max_{{\mathbf{P}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}\;{\mathbf{x}}_{i}^{T}{\mathbf{P}}{\mathbf{x}}_{i}\\ {\rm{s.t.}}&{\rm{Trace}}\left({\mathbf{P}}\right)\leq r,\;\;{\bf{0}}\preceq{\mathbf{P}}\preceq{\bf{I}}\end{array}
=\begin{array}{ll}\max_{{\mathbf{P}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}\;\left\|{\mathbf{x}}_{i}\right\|^{2}{\mathbf{e}}_{i}^{T}{\mathbf{X}}^{T}{\mathbf{P}}{\mathbf{X}}{\mathbf{e}}_{i}\\ {\rm{s.t.}}&{\rm{Trace}}\left({\mathbf{P}}\right)\leq r,\;\;{\bf{0}}\preceq{\mathbf{P}}\preceq{\bf{I}}\end{array}
=\begin{array}{ll}\max_{{\mathbf{P}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}\;\left\|{\mathbf{x}}_{i}\right\|^{2}[{\mathbf{X}}^{T}{\mathbf{P}}{\mathbf{X}}]_{ii}\\ {\rm{s.t.}}&{\rm{Trace}}\left({\mathbf{X}}^{T}{\mathbf{P}}{\mathbf{X}}\right)\leq r,\;\;{\bf{0}}\preceq{\mathbf{X}}^{T}{\mathbf{P}}{\mathbf{X}}\preceq{\bf{I}}\end{array}
=\begin{array}{ll}\max_{\hat{{\mathbf{P}}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}\;\left\|{\mathbf{x}}_{i}\right\|^{2}\hat{{\mathbf{P}}}_{ii}\\ {\rm{s.t.}}&{\rm{Trace}}\left(\hat{{\mathbf{P}}}\right)\leq r,\;\;{\bf{0}}\preceq\hat{{\mathbf{P}}}\preceq{\bf{I}}\end{array}

where \hat{{\mathbf{P}}}={\mathbf{X}}^{T}{\mathbf{P}}{\mathbf{X}} and we have used the orthogonality of {\mathbf{X}}.

Now, define \hat{{\mathbf{P}}} as a diagonal matrix with \hat{{\mathbf{P}}}_{ii}=\frac{r}{n}\cdot\frac{H}{\left\|{\mathbf{x}}_{i}\right\|^{2}}.

It is easy to verify that \hat{{\mathbf{P}}} is feasible and yields an objective of \frac{r}{n}H. Now let \hat{{\mathbf{P}}}^{\prime} be such that \min_{k}\left\|{\mathbf{x}}_{k}\right\|^{2}\hat{{\mathbf{P}}}_{kk}^{\prime}>\frac{r}{n}H; then:

\forall i:\;\left\|{\mathbf{x}}_{i}\right\|^{2}\hat{{\mathbf{P}}}_{ii}^{\prime}>\frac{r}{n}H
\;\Rightarrow\;\forall i:\;\hat{{\mathbf{P}}}_{ii}^{\prime}>\frac{\frac{r}{n}H}{\left\|{\mathbf{x}}_{i}\right\|^{2}}
\;\Rightarrow\;{\rm{Trace}}(\hat{{\mathbf{P}}}^{\prime})=\sum_{i=1}^{n}\hat{{\mathbf{P}}}_{ii}^{\prime}>\sum_{i=1}^{n}\frac{\frac{r}{n}H}{\left\|{\mathbf{x}}_{i}\right\|^{2}}=r

so \hat{{\mathbf{P}}}^{\prime} is not feasible and we conclude that \hat{{\mathbf{P}}} is optimal for SDR. By Proposition 2, the optimal value of F-FPCA is -\frac{r}{n}H, so by Proposition 1 the value of FPCA is \frac{r}{n}H and SDR=FPCA (but the SDR solution might not be low rank, as in our positive definite construction). ∎

Appendix E Proof of Corollary 1

We start the proof of the corollary with the observation that a tight frame is characterized by its action on the standard basis:

Lemma 6.

A frame \{{\mathbf{u}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{r} is tight with frame bound A if and only if for every {\mathbf{e}}_{i} in the standard basis of \mathbb{R}^{r}:

{\mathbf{e}}_{i}=\frac{1}{A}\sum_{j=1}^{n}\left<{\mathbf{e}}_{i},{\mathbf{u}}_{j}\right>{\mathbf{u}}_{j}
Proof.

Observe that if for all {\mathbf{e}}_{i}\in\left\{{\mathbf{e}}_{i}\right\}_{i=1}^{r} we have {\mathbf{e}}_{i}=\frac{1}{A}\sum_{j=1}^{n}\left\langle{\mathbf{e}}_{i},{\mathbf{u}}_{j}\right\rangle{\mathbf{u}}_{j}, then we get for all {\mathbf{v}}\in\mathbb{R}^{r}:

{\mathbf{v}}=\sum_{j=1}^{r}{\mathbf{v}}_{j}{\mathbf{e}}_{j}=\sum_{j=1}^{r}{\mathbf{v}}_{j}\frac{1}{A}\sum_{i=1}^{n}\left\langle{\mathbf{e}}_{j},{\mathbf{u}}_{i}\right\rangle{\mathbf{u}}_{i}
=\frac{1}{A}\sum_{i=1}^{n}\left\langle\sum_{j=1}^{r}{\mathbf{v}}_{j}{\mathbf{e}}_{j},{\mathbf{u}}_{i}\right\rangle{\mathbf{u}}_{i}=\frac{1}{A}\sum_{i=1}^{n}\left\langle{\mathbf{v}},{\mathbf{u}}_{i}\right\rangle{\mathbf{u}}_{i}

Now we use the observation above to show that a tight frame is, up to scaling, the transpose of a semi-orthogonal matrix:

Lemma 7.

Let {\mathbf{U}}^{T}=({\mathbf{u}}_{1},...,{\mathbf{u}}_{n})\in\mathbb{R}^{r\times n}. Then \{{\mathbf{u}}_{i}\}_{i=1}^{n} is a tight frame with frame bound A iff {\mathbf{U}} has orthogonal columns with norm \sqrt{A}.

Proof.

Consider equality 1 below:

{\mathbf{U}}^{T}{\mathbf{U}}=\left({\mathbf{u}}_{1},...,{\mathbf{u}}_{n}\right)\left({\mathbf{u}}_{1},...,{\mathbf{u}}_{n}\right)^{T}
=\left(\sum_{j=1}^{n}{\mathbf{u}}_{j}^{1}{\mathbf{u}}_{j},\;...,\;\sum_{j=1}^{n}{\mathbf{u}}_{j}^{r}{\mathbf{u}}_{j}\right)
\stackrel{1}{=}\left(A{\mathbf{e}}_{1},\;...,\;A{\mathbf{e}}_{r}\right)=A{\mathbf{I}}

Observe that {\mathbf{U}}^{T}{\mathbf{U}}=A{\mathbf{I}} iff {\mathbf{U}} has orthogonal columns with norm \sqrt{A}. Equality 1 also holds iff A{\mathbf{e}}_{i}=\sum_{j=1}^{n}{\mathbf{u}}_{j}^{i}{\mathbf{u}}_{j}=\sum_{j=1}^{n}\left\langle{\mathbf{e}}_{i},{\mathbf{u}}_{j}\right\rangle{\mathbf{u}}_{j}, which holds iff \{{\mathbf{u}}_{i}\}_{i=1}^{n} is a tight frame with frame bound A (by Lemma 6), so we conclude that the conditions are equivalent.

Lemma 8.

If \left\|{\mathbf{x}}_{k}^{T}{\mathbf{U}}\right\|^{2}=\frac{r}{n}H for all k, then \left\|{\mathbf{U}}\right\|_{F}^{2}=r.

Proof.

Recall the definition of \hat{{\mathbf{U}}} in (8) and observe that:

\left\|{\mathbf{U}}\right\|_{F}^{2}=\left\|\hat{{\mathbf{U}}}\right\|_{F}^{2}=\sum_{i=1}^{n}\left\|{\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\right\|^{2}=\sum_{i=1}^{n}\frac{1}{\left\|{\mathbf{x}}_{i}\right\|^{2}}\left\|{\mathbf{x}}_{i}^{T}{\mathbf{U}}\right\|^{2}
=\sum_{i=1}^{n}\frac{1}{\left\|{\mathbf{x}}_{i}\right\|^{2}}\,r\left(\sum_{i=1}^{n}\frac{1}{\left\|{\mathbf{x}}_{i}\right\|^{2}}\right)^{-1}=r

Now, let {\mathbf{U}}\in\mathbb{R}^{d\times r} be an optimal solution for F-FPCA. By Lemma 1, the columns of {\mathbf{U}} are orthonormal, so by Lemma 7, {\mathbf{U}}^{T} is a tight frame, and by Proposition 2 we have for all k:

\left\|{\mathbf{x}}_{k}^{T}{\mathbf{U}}\right\|^{2}=-f_{k}({\mathbf{U}})=\frac{r}{n}H

On the other hand, let {\mathbf{U}}^{T} be a tight frame as above. By Lemma 7, the columns of {\mathbf{U}} are orthogonal and have the same norm. By Lemma 8, \left\|{\mathbf{U}}\right\|_{F}^{2}=r, so the columns have unit norm, i.e., the columns are orthonormal and for all i\in[n]:

f_{i}({\mathbf{U}})=-{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}=-\frac{r}{n}H

i.e., \min_{i}\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}=-\frac{r}{n}H, which is the optimal value. Finally, if {\mathbf{x}}_{i}={\mathbf{e}}_{i} for all i, then F-FPCA reduces to the problem of finding a normalized tight frame.

Acknowledgment

The authors would like to thank Uri Okon who initiated this research and defined the problem, as well as Gal Elidan. This work was partially supported by ISF grant 1339/15.

References

  • [1] J. Sun, Q. Qu, and J. Wright, “When are nonconvex problems not scary?” arXiv preprint arXiv:1510.06096, 2015.
  • [2] J. Morgenstern, S. Samadi, M. Singh, U. Tantipongpipat, and S. Vempala, “Fair dimensionality reduction and iterative rounding for sdps,” arXiv preprint arXiv:1902.11281, 2019.
  • [3] S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala, “The price of fair pca: One extra dimension,” in Advances in Neural Information Processing Systems, 2018, pp. 10 976–10 987.
  • [4] N. D. Sidiropoulos, T. N. Davidson, and Z.-Q. Luo, “Transmit beamforming for physical-layer multicasting,” IEEE Trans. Signal Processing, vol. 54, no. 6-1, pp. 2239–2251, 2006.
  • [5] W.-K. K. Ma, “Semidefinite relaxation of quadratic optimization problems and applications,” IEEE Signal Processing Magazine, vol. 1053, no. 5888/10, 2010.
  • [6] A. Cheriyadat and L. M. Bruce, “Why principal component analysis is not an appropriate feature extraction method for hyperspectral data,” in IGARSS 2003. 2003 IEEE International Geoscience and Remote Sensing Symposium. Proceedings (IEEE Cat. No. 03CH37477), vol. 6.   IEEE, 2003, pp. 3420–3422.
  • [7] M. Olfat and A. Aswani, “Convex formulations for fair principal component analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 663–670.
  • [8] W. Bian and D. Tao, “Max-min distance analysis by using sequential sdp relaxation for dimension reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 1037–1050, 2010.
  • [9] G. Lerman, M. B. McCoy, J. A. Tropp, and T. Zhang, “Robust computation of linear models by convex relaxation,” Foundations of Computational Mathematics, vol. 15, no. 2, pp. 363–410, 2015.
  • [10] M. M. Kamani, F. Haddadpour, R. Forsati, and M. Mahdavi, “Efficient fair principal component analysis,” arXiv preprint arXiv:1911.04931, 2019.
  • [11] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.
  • [12] ——, “Local minima and convergence in low-rank semidefinite programming,” Mathematical Programming, vol. 103, no. 3, pp. 427–444, 2005.
  • [13] N. Boumal, V. Voroninski, and A. S. Bandeira, “Deterministic guarantees for burer-monteiro factorizations of smooth semidefinite programs,” Communications on Pure and Applied Mathematics, 2018.
  • [14] N. Boumal, V. Voroninski, and A. Bandeira, “The non-convex burer-monteiro approach works on smooth semidefinite programs,” in Advances in Neural Information Processing Systems, 2016, pp. 2757–2765.
  • [15] D. Cifuentes, “Burer-monteiro guarantees for general semidefinite programs,” arXiv preprint arXiv:1904.07147, 2019.
  • [16] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin, “Global optimality in low-rank matrix optimization,” IEEE Transactions on Signal Processing, vol. 66, no. 13, pp. 3614–3628, 2018.
  • [17] Q. Li, Z. Zhu, and G. Tang, “The non-convex geometry of low-rank matrix optimization,” Information and Inference: A Journal of the IMA, vol. 8, no. 1, pp. 51–96, 2018.
  • [18] J. J. Benedetto and M. Fickus, “Finite normalized tight frames,” Advances in Computational Mathematics, vol. 18, no. 2-4, pp. 357–385, 2003.
  • [19] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
  • [20] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd, “A rewriting system for convex optimization problems,” Journal of Control and Decision, vol. 5, no. 1, pp. 42–60, 2018.
  • [21] I.-C. Yeh and C.-h. Lien, “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert Systems with Applications, vol. 36, no. 2, pp. 2473–2480, 2009.

Appendix F Proof of equation (12)

Lemma 9.

Let {\mathbf{U}}\in\mathcal{O}^{d\times r} and assume \forall i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}=0. Define:

{\mathbf{v}}_{2}=\frac{{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}}{\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\right\|},\quad{\mathbf{v}}_{1}=\frac{{\mathbf{x}}_{i}}{\left\|{\mathbf{x}}_{i}\right\|}

for some i\in\mathcal{I}_{{\mathbf{U}}}, complete these vectors to an orthonormal basis {\mathbf{V}}=\left({\mathbf{v}}_{1},...,{\mathbf{v}}_{d}\right), and define {\mathbf{U}}_{\theta}={\mathbf{V}}{\mathbf{R}}(\theta){\mathbf{V}}^{T}{\mathbf{U}}. Then:

f_{i}({\mathbf{U}})-f_{i}({\mathbf{U}}_{\theta})=\left(\left\|{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\right)\sin^{2}(\theta)
Proof.

\begin{align*}
f_{i}({\mathbf{U}})-f_{i}({\mathbf{U}}_{\theta})=&\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}_{\theta}{\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{i}-{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\\
=&\;\left({\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{i}-{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right)^{T}\left({\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{i}+{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right)\\
=&\;\left({\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{R}}(\theta)^{T}{\mathbf{V}}^{T}{\mathbf{x}}_{i}-{\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{V}}^{T}{\mathbf{x}}_{i}\right)^{T}\left({\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{R}}(\theta)^{T}{\mathbf{V}}^{T}{\mathbf{x}}_{i}+{\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{V}}^{T}{\mathbf{x}}_{i}\right)\\
=&\;{\mathbf{x}}_{i}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)^{T}{\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}+{\mathbf{I}}\right){\mathbf{V}}^{T}{\mathbf{x}}_{i}\\
=&\;{\mathbf{x}}_{i}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)^{T}{\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right){\mathbf{V}}^{T}{\mathbf{x}}_{i}+2\,{\mathbf{x}}_{i}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)^{T}{\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\\
=&\;\hat{{\mathbf{x}}}_{i}^{T}\hat{{\mathbf{x}}}_{i}+2\,\hat{{\mathbf{x}}}_{i}^{T}{\mathbf{U}}^{T}{\mathbf{x}}_{i},
\end{align*}

where $\hat{{\mathbf{x}}}_{i}:={\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right){\mathbf{V}}^{T}{\mathbf{x}}_{i}$.
We first compute
\begin{align*}
\hat{{\mathbf{x}}}_{i}=&\;{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right){\mathbf{V}}^{T}{\mathbf{x}}_{i}\\
=&\;{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)\left\langle{\mathbf{v}}_{1},{\mathbf{x}}_{i}\right\rangle{\mathbf{e}}_{1}\\
=&\;\left\|{\mathbf{x}}_{i}\right\|\,{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right){\mathbf{e}}_{1}\\
=&\;\left\|{\mathbf{x}}_{i}\right\|\,{\mathbf{U}}^{T}{\mathbf{V}}\left(\cos(\theta){\mathbf{e}}_{1}-\sin(\theta){\mathbf{e}}_{2}-{\mathbf{e}}_{1}\right)\\
=&\;\left\|{\mathbf{x}}_{i}\right\|\left({\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)-{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)\right)\\
=&\;{\mathbf{U}}^{T}{\mathbf{x}}_{i}\left(\cos(\theta)-1\right)-\left\|{\mathbf{x}}_{i}\right\|\,{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta),
\end{align*}
and therefore
\begin{align*}
\hat{{\mathbf{x}}}_{i}^{T}{\mathbf{U}}^{T}{\mathbf{x}}_{i}=&\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}\left({\mathbf{U}}^{T}{\mathbf{x}}_{i}\left(\cos(\theta)-1\right)-\left\|{\mathbf{x}}_{i}\right\|\,{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)\right)\\
=&\;\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\left(\cos(\theta)-1\right)-\left\|{\mathbf{x}}_{i}\right\|\frac{{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}}{\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\right\|}\sin(\theta)\\
=&\;\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\left(\cos(\theta)-1\right),
\end{align*}
where the second equality uses ${\mathbf{U}}^{T}{\mathbf{U}}={\mathbf{I}}$ and the last equality uses the assumption ${\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}=0$.
Similarly,
\begin{align*}
\hat{{\mathbf{x}}}_{i}^{T}\hat{{\mathbf{x}}}_{i}=&\;\left\|{\mathbf{x}}_{i}\right\|^{2}\left({\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)-{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)\right)^{T}\left({\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)-{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)\right)\\
=&\;\left\|{\mathbf{x}}_{i}\right\|^{2}\left({\mathbf{v}}_{1}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)^{2}+{\mathbf{v}}_{2}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin^{2}(\theta)-2\sin(\theta)\left(\cos(\theta)-1\right){\mathbf{v}}_{2}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{1}\right)\\
=&\;\left\|{\mathbf{x}}_{i}\right\|^{2}\left({\mathbf{v}}_{1}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)^{2}+{\mathbf{v}}_{2}^{T}{\mathbf{v}}_{2}\sin^{2}(\theta)\right)\\
=&\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\left(\cos(\theta)-1\right)^{2}+\left\|{\mathbf{x}}_{i}\right\|^{2}\sin^{2}(\theta)\\
=&\;\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\cos^{2}(\theta)-2\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\cos(\theta)+\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}+\left(\left\|{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\right)\sin^{2}(\theta)+\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\sin^{2}(\theta)\\
=&\;\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-2\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\cos(\theta)+\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}+\left(\left\|{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\right)\sin^{2}(\theta)\\
=&\;\left(\left\|{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\right)\sin^{2}(\theta)+2\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\left(1-\cos(\theta)\right),
\end{align*}
where the third equality holds because ${\mathbf{v}}_{2}$ lies in the range of ${\mathbf{U}}$, so that ${\mathbf{v}}_{2}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{2}={\mathbf{v}}_{2}^{T}{\mathbf{v}}_{2}=1$ and ${\mathbf{v}}_{2}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{1}\propto{\mathbf{x}}_{j}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}=0$.

So finally:

\begin{align*}
f_{i}({\mathbf{U}})-f_{i}({\mathbf{U}}_{\theta})=&\;\hat{{\mathbf{x}}}_{i}^{T}\hat{{\mathbf{x}}}_{i}+2\,\hat{{\mathbf{x}}}_{i}^{T}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\\
=&\;2\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\left(\cos(\theta)-1\right)+\left(\left\|{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\right)\sin^{2}(\theta)+2\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\left(1-\cos(\theta)\right)\\
=&\;\left(\left\|{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\right)\sin^{2}(\theta).
\end{align*}

∎
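
The identity of Lemma 9 also admits a quick numerical sanity check. The NumPy sketch below (not part of the proof) draws a random orthonormal ${\mathbf{U}}$, constructs ${\mathbf{x}}_{i},{\mathbf{x}}_{j}$ satisfying the assumption ${\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}=0$, builds ${\mathbf{V}}$ and a planar rotation ${\mathbf{R}}(\theta)$ as in the lemma, and compares ${\mathbf{x}}_{i}^{T}{\mathbf{U}}_{\theta}{\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{i}-{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}$ (the quantity the proof identifies with $f_{i}({\mathbf{U}})-f_{i}({\mathbf{U}}_{\theta})$) with the right-hand side. The dimensions, seed, and the sign convention of ${\mathbf{R}}(\theta)$ are illustrative assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, theta = 6, 3, 0.37

# Random orthonormal U (d x r).
U, _ = np.linalg.qr(rng.standard_normal((d, r)))

# Pick x_j arbitrarily and force x_i to satisfy the assumption x_i^T U U^T x_j = 0.
x_j = rng.standard_normal(d)
p_j = U @ (U.T @ x_j)                      # projection of x_j onto range(U)
x_i = rng.standard_normal(d)
x_i -= (x_i @ p_j) / (p_j @ p_j) * p_j     # remove the component along U U^T x_j

v1 = x_i / np.linalg.norm(x_i)
v2 = p_j / np.linalg.norm(p_j)

# Complete (v1, v2) to an orthonormal basis V of R^d by Gram-Schmidt.
V = np.zeros((d, d))
V[:, 0], V[:, 1] = v1, v2
for k in range(2, d):
    w = rng.standard_normal(d)
    w -= V[:, :k] @ (V[:, :k].T @ w)
    V[:, k] = w / np.linalg.norm(w)

# Planar rotation R(theta) on the first two coordinates, chosen so that
# R(theta)^T e_1 = cos(theta) e_1 - sin(theta) e_2 as in the proof above.
R = np.eye(d)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]

U_theta = V @ R @ V.T @ U

# Both sides of the identity proved above.
lhs = x_i @ U_theta @ U_theta.T @ x_i - x_i @ U @ U.T @ x_i
rhs = (x_i @ x_i - np.linalg.norm(U.T @ x_i) ** 2) * np.sin(theta) ** 2
print(np.isclose(lhs, rhs))                # expected: True
```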

Lemma 10.

Let ${\mathbf{U}}\in\mathcal{O}^{d\times r}$, let ${\mathbf{R}}(\theta)=G(\theta,i,j)$ be a Givens rotation over the axes $i,j$, and let ${\mathbf{U}}^{\prime}={\mathbf{R}}(\theta){\mathbf{U}}$. Then

$${\mathbf{e}}_{i}^{T}{\mathbf{U}}^{\prime}{\mathbf{U}}^{\prime T}{\mathbf{e}}_{i}-{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}=\sin\left(2\theta\right){\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{j}+\sin^{2}\left(\theta\right)\left({\mathbf{e}}_{j}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{j}-{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}\right).$$
Proof.

Observe that

$${\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}=\sum_{l=1}^{r}{\mathbf{U}}_{il}^{2}.$$

Since ${\mathbf{R}}(\theta){\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{R}}(\theta)^{T}={\mathbf{U}}^{\prime}{\mathbf{U}}^{\prime T}$, we also have

\begin{align*}
{\mathbf{e}}_{i}^{T}{\mathbf{U}}^{\prime}{\mathbf{U}}^{\prime T}{\mathbf{e}}_{i}=&\;\sum_{l=1}^{r}\left[{\mathbf{U}}^{T}{\mathbf{R}}(\theta)^{T}{\mathbf{e}}_{i}\right]_{l}^{2}\\
=&\;\sum_{l=1}^{r}\left[{\mathbf{U}}^{T}\left(\cos\left(\theta\right){\mathbf{e}}_{i}+\sin\left(\theta\right){\mathbf{e}}_{j}\right)\right]_{l}^{2}\\
=&\;\sum_{l=1}^{r}\left(\cos^{2}\left(\theta\right){\mathbf{U}}_{il}^{2}+2\cos\left(\theta\right)\sin\left(\theta\right){\mathbf{U}}_{il}{\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right){\mathbf{U}}_{jl}^{2}\right).
\end{align*}

So:

\begin{align*}
{\mathbf{e}}_{i}^{T}{\mathbf{U}}^{\prime}{\mathbf{U}}^{\prime T}{\mathbf{e}}_{i}-{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}=&\;\sum_{l=1}^{r}\left(2\cos\left(\theta\right)\sin\left(\theta\right){\mathbf{U}}_{il}{\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right){\mathbf{U}}_{jl}^{2}-\left(1-\cos^{2}\left(\theta\right)\right){\mathbf{U}}_{il}^{2}\right)\\
=&\;\sum_{l=1}^{r}\left(2\cos\left(\theta\right)\sin\left(\theta\right){\mathbf{U}}_{il}{\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right){\mathbf{U}}_{jl}^{2}-\sin^{2}\left(\theta\right){\mathbf{U}}_{il}^{2}\right)\\
=&\;\sum_{l=1}^{r}\left(2\cos\left(\theta\right)\sin\left(\theta\right){\mathbf{U}}_{il}{\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right)\left({\mathbf{U}}_{jl}^{2}-{\mathbf{U}}_{il}^{2}\right)\right)\\
=&\;2\cos\left(\theta\right)\sin\left(\theta\right)\sum_{l=1}^{r}{\mathbf{U}}_{il}{\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right)\sum_{l=1}^{r}\left({\mathbf{U}}_{jl}^{2}-{\mathbf{U}}_{il}^{2}\right)\\
=&\;\sin\left(2\theta\right){\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{j}+\sin^{2}\left(\theta\right)\left({\mathbf{e}}_{j}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{j}-{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}\right).
\end{align*}

∎
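
As with Lemma 9, this identity can be verified numerically. The short NumPy sketch below (not part of the proof) forms a Givens rotation $G(\theta,i,j)$ with the convention ${\mathbf{R}}(\theta)^{T}{\mathbf{e}}_{i}=\cos(\theta){\mathbf{e}}_{i}+\sin(\theta){\mathbf{e}}_{j}$ used above and compares both sides; the dimensions, indices, and seed are arbitrary choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, i, j, theta = 5, 2, 0, 3, 0.8

# Random orthonormal U (d x r).
U, _ = np.linalg.qr(rng.standard_normal((d, r)))

# Givens rotation G(theta, i, j) acting in the (e_i, e_j) plane, built so that
# R(theta)^T e_i = cos(theta) e_i + sin(theta) e_j, matching the proof above.
R = np.eye(d)
R[i, i] = R[j, j] = np.cos(theta)
R[i, j], R[j, i] = np.sin(theta), -np.sin(theta)

P = U @ U.T                  # projector before the rotation
Pp = (R @ U) @ (R @ U).T     # projector after the rotation

lhs = Pp[i, i] - P[i, i]
rhs = np.sin(2 * theta) * P[i, j] + np.sin(theta) ** 2 * (P[j, j] - P[i, i])
print(np.isclose(lhs, rhs))  # expected: True
```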