Fair Principal Component Analysis and Filter Design

Gad Zalcberg⁽¹⁾ and Ami Wiesel

{}^{(1)\ (2)}

(1) School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
(2) Google Research, Israel

Abstract

We consider Fair Principal Component Analysis (FPCA) and search for a low dimensional subspace that spans multiple target vectors in a fair manner. FPCA is defined as a non-concave maximization of the worst projected target norm within a given set. The problem arises in filter design in signal processing, and when incorporating fairness into dimensionality reduction schemes. The state of the art approach to FPCA is via semidefinite programming followed by rank reduction methods. Instead, we propose to address FPCA using simple sub-gradient descent. We analyze the landscape of the underlying optimization in the case of orthogonal targets. We prove that the landscape is benign and that all local minima are globally optimal. Interestingly, the SDR approach leads to sub-optimal solutions in this orthogonal case. Finally, we discuss the equivalence between orthogonal FPCA and the design of normalized tight frames.

Index Terms:

Dimensionality Reduction, SDP, Fairness, Normalized Tight Frame, PCA

I Introduction

Dimensionality reduction is a fundamental problem in signal processing and machine learning. In particular, Principal Component Analysis (PCA) is among the most popular data science tool. It involves a non-concave maximization but has a tight semidefinite relaxation (SDR). Its optimization landscape, saddle points and extreme points are all well understood and it is routinely solved using scalable first order methods [1]. PCA maximizes the average performance across a given set of vector targets. In many settings, worst case metrics are preferred in order to ensure fairness and equal performance across all targets. This gives rise to Fair PCA (FPCA) [2] which will be defined formally in the next section. Unfortunately, changing the average PCA objective to a worst case FPCA objective results in an NP-hard problem [2] which is poorly understood. There is a growing body of works on convex relaxations via SDR for FPCA [2, 3], but these methods do not scale well and are inapplicable to many realistic settings. Therefore, the goal of this paper to consider scalable first order solutions to FPCA and shed more light on the landscape of this important optimization problem.

Due to the significance of PCA it is non-trivial to track the origins of FPCA. In the context of filter design, FPCA with rank one constraints is known as multicast beamforming and there is a huge body of literature on this topic, e.g., [4, 5, 6]. In the modern context of fairness in machine learning, FPCA was considered in [7, 8, 2, 3]. It was shown that SDR with an iterative rounding technique provides near optimal performance when the rank is much larger than the squared root of the number of targets. More generally, by interpreting the worst case operator as an $L_{\infty}$ norm, FPCA is a special case of $L_{2,p}$ norm optimizations. Classical PCA corresponds to $L_{2,2}$ . Robust PCA algorithms as [9] rely on $L_{2,1}$ , and FPCA is the other extreme using $L_{2,\infty}$ . Most of these works capitalize on the use of SDR that leads to conic optimizations with provable performance guarantees. Finally, [10] proposed a different definition to fairness in PCA via multi-objective optimization. They developed a first order method for attaining random solutions on the PCA Pareto frontier. FPCA defined above may be considered as one of these points.

SDR and nuclear norm relaxations are currently state of the art in a wide range of subspace recovery problems. Unfortunately, SDR is known to scale poorly to high dimensions. Therefore, there is a growing body of works on first order solutions to semidefinite programs. The main trick is to factorize the low rank matrix and show that the landscape of the resulting non-convex objective is benign [11, 12, 13, 14, 15, 16, 17]. The SDR of FPCA involves two types of linear matrix inequalities and still poses a challenge. Therefore, we first reformulate the problem and then apply sub-gradient descent on the factorized formulation.

The main contribution of this paper is the observation that the landscape of the factorized FPCA optimization is benign when the targets are orthogonal. This is the case in which dimensionality reduction is most lossy. Yet, we show that it is easy from an optimization perspective. The maximization is non-concave but every (non-connected) local minima is globally optimal. Surprisingly, we show that this case is challenging for SDR. Its objective is tight but it is not trivial to project its solution onto the feasible set. Numerical experiments with synthetic data suggest that these properties also hold in more realistic near-orthogonal settings. Finally, a direct corollary of our analysis is an equivalence between orthogonal FPCA and the design of finite normalized tight frames [18]. This characterization may be useful in future works on data-driven normalized tight frame design.

Notations:

We used bold uppercase letters (e.g. P) for matrices, bold lowercase letters (e.g. ${\mathbf{v}}$ ) for vectors and non-bold letters (e.g. $n$ ) for scalars. We used pythonic notation for indices of matrices: ${\mathbf{U}}_{i:}$ for the $i$ ’th row, ${\mathbf{U}}_{:j}$ for the $j$ ’th column and ${\mathbf{U}}_{ij}$ for the $i,j$ ’th entry of matrix. The set of $d\times r$ ( $r\leq d$ ) semi-orthogonal matrices (matrices with orthonormal columns) is denoted by $\mathcal{O}^{d\times r}$ , the set of positive semidefinite matrices by $\mathbb{S}^{d}_{+}$ , and the set of $d\times d$ projection matrices of rank at most $r$ by $\mathcal{P}^{d}_{r}$ (and $\mathcal{P}^{d}:=\mathcal{P}^{d}_{d}$ ). Given a function $f:A\rightarrow\mathbb{R}^{m}$ , ${\mathbf{U}}\in A$ we define the set of indices $\mathcal{I}_{\mathbf{U}}:=\arg\max_{i}f_{i}({\mathbf{U}})$ . Finally we define a projection operator onto the set of projection matrices of rank at most $r$ : $\Pi_{r}:\mathbb{S}^{d}_{+}\rightarrow\mathcal{P}^{d}_{r}$ as follows: Let ${\mathbf{P}}={\mathbf{U}}{\bf\Sigma}{\mathbf{U}}^{T}$ (EVD decomposition), where: ${\mathbf{U}}=({\mathbf{u}}_{1},...,{\mathbf{u}}_{d})$ then: $\Pi_{r}[{\mathbf{P}}]:=({\mathbf{u}}_{1},...,{\mathbf{u}}_{r})({\mathbf{u}}_{1},...,{\mathbf{u}}_{r})^{T}$ .

II Problem formulation

The goal of this paper is to identify a low dimensional subspace that maximizes the smallest norm of a given set of projected targets. More specifically, let $\{{\mathbf{x}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d}$ be the set of targets, we consider the problem:

{\rm{FPCA:}}\quad\begin{array}[]{ll}\max_{{\bf{P}}\in\mathcal{P}^{d}}&\min_{i\in[n]}{\mathbf{x}}_{i}^{T}{\bf{P}}{\mathbf{x}}_{i}\\ {\rm{s.t.}}&{\rm{rank}}\left({\bf{P}}\right)\leq r\end{array}

(1)

Our motivation to FPCA arises in the context of filter design for detection. We are interested in the design of a linear sampling device from ${\mathbb{R}}^{d}$ to ${\mathbb{R}}^{r}$ that will allow detection of $n$ known targets denoted by $\{{\mathbf{x}}_{i}\}_{i=1}^{n}$ . The motivation for using a small rank is that the cost of power, space and/or time resources typically scales with $r$ . Detection accuracy in additive white Gaussian noise decreases exponentially with the received signal to noise ratio (SNR), and it is therefore natural to maximize the worst SNR across all the targets. Hopefully, this will lead to a fair solution with equal norms for all the targets.

FPCA with $r=1$ is concerned with the design of a single beaamforming filter, and is equivalent to multicast downlink transmit beamforming [4, 5]

\begin{array}[]{ll}\max_{{\mathbf{u}}\in\mathbb{R}^{d}}&\min_{i\in[n]}({\mathbf{x}}_{i}^{T}{\mathbf{u}})^{2}\\ {\rm{s.t.}}&\left\|{\mathbf{u}}\right\|\leq 1\end{array}

(2)

Practical systems typically satisfy $r<n\ll d$ , e.g., the design of a few expensive sensors that downsample a high resolution digital signal (or even an infinite dimension analog signal). Without loss of optimality, we assume a first stage of dimensionality reduction via PCA that results in effective dimensions such that $n=d$ .

As detailed in the introduction, FPCA was also recently introduced in the context of fair machine learning. There, it is more natural to assume a block structure. The targets are divided into $n$ blocks, denoted by $d\times b_{i}$ matrices ${\mathbf{X}}_{i}$ , and fairness needs to be respected with respect to properties as gender or race [7, 2, 3]:

{\rm{FPCA}}^{\rm{blocks}}\quad\begin{array}[]{ll}\max_{{\bf{P}}\in\mathcal{P}^{d}}&\min_{i\in[n]}{\rm{Tr}}\left({\mathbf{X}}_{i}^{T}{\bf{P}}{\mathbf{X}}_{i}\right)\\ {\rm{s.t.}}&{\rm{rank}}\left({\bf{P}}\right)\leq r\end{array}

(3)

Throughout this paper, we will consider the simpler non-block FPCA formulation corresponding to filter design. Preliminary experiments suggest that most of the results also hold in the block case.

FPCA is known to be NP-hard [4, 2]. The state of the art approach to FPCA is SDR. Specifically, we relax the rank constraint by its convex hull, the nuclear norm, and the projection constraint by linear matrix inequalities [5, 2]. This yields the SDP:

{\rm{SDR:}}\quad\begin{array}[]{ll}\max_{{\bf{P}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}{\mathbf{x}}_{i}^{T}{\bf{P}}{\mathbf{x}}_{i}\\ {\rm{s.t.}}&{\rm{Trace}}\left({\bf{P}}\right)\leq r\\ &{\bf{0}}\preceq{\bf{P}}\preceq{\bf{I}}\end{array}

(4)

The computational complexity of solving an SDR using an Interior Point method is intractable for most applications, but [2] propose a practical and efficient multiplicative weight update. Unfortunately, the optimal solution to SDR might not be a feasible projection, and ${\mathbf{P}}_{\rm{SDR}}$ may have any rank. Due to the relaxation, SDR always results in an upper bound on FPCA. To obtain a feasible approximation, it is customary to define

{\mathbf{P}}_{\rm{PrSDR}}=\Pi_{r}[{\mathbf{P}}_{\rm{SDR}}]

(5)

PrSDR is a feasible projection matrix of rank $r$ , and is therefore a lower bound on FPCA. Better approximations may be obtained via randomized procedures [5]. Recently, an iterative rounding technique was proven to provide a $\left(1-\frac{O(\sqrt{n})}{r}\right)$ approximation [2]. This result is near optimal in the block case where it is reasonable to assume $r\gg\sqrt{n}$ . It is less applicable to filter design where $n$ is large and smaller ranks are required.

The goal of this paper is to provide a scalable, yet accurate solution to FPCA, without the need for additional rank reduction schemes. Motivated by the growing success of simple gradient based methods in complex optimization problems, e.g., deep learning, we consider the application of sub-gradient descent to FPCA and analyze its optimization landscape.

III Algorithm

In this section, we propose an alternative and more scalable approach for solving SDR. The two optimization challenges in FPCA are the projection and rank constraints. We confront the first challenge by reformulating the problem using a quadratic objective, and the second by decomposing the projection matrix using its low rank factors. Together, we define factorized FPCA:

{\rm{F-FPCA:}}\quad\begin{array}[]{ll}\min_{{\mathbf{U}}\in\mathbb{R}^{d\times r}}&\max_{{i\in[n]}}\quad f_{i}({\mathbf{U}})\end{array}

(6)

where

f_{i}({\mathbf{U}})=\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}

(7)

The formal equivalence between (1) and (6) is stated below.

Proposition 1.

Let ${\mathbf{U}}$ be a globally optimal solution to F-FPCA in (6). Then, ${\mathbf{P}}=\Pi[{{\mathbf{U}}}{{\mathbf{U}}}^{T}]$ is a globally optimal solution to FPCA in (1).

Before proving the proposition, we note that the projection $\Pi[{\mathbf{U}}{\mathbf{U}}^{T}]$ is only needed in order to handle a degenerate case in which the dimension of the subspace spanned by the targets is smaller than $r$ . Typically, this projection is not needed and ${\mathbf{U}}{\mathbf{U}}^{T}$ is feasible.

Proof.

We rely on the observation that F-FPCA has an optimal solution with orthogonal matrix, and for orthogonal matrix we have:

-f_{i}({\mathbf{U}})={\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}

In addition the function ${\mathbf{U}}\mapsto{\mathbf{U}}{\mathbf{U}}^{T}$ is a surjective function from $\mathcal{O}^{d\times r}$ to $\mathcal{P}_{r}\setminus\mathcal{P}_{r-1}$ , so the optimization over both sets is equivalent. More details are available in the Appendix. ∎

The advantage of solving F-FPCA rather than FPCA is that it forces a low rank solution via an unconstrained optimization. A member of the sub-gradient of F-FPCA objective can be computed in $O(drn)$ . In particular, Algorithm 1 describes a promising sub-gradient descent method for its minimization.

The obvious downside of using F-FPCA is its non-convexity that may cause descent algorithms to converge to bad stationary points. Its convergence analysis is more difficult due to the non-smooth maximum function. Nonetheless, in the next section, we prove that there are no bad local minima when the targets are orthogonal. This is also demonstrated in the experimental section where we show the advantages of F-FPCA in terms of accuracy.

Relation to other low rank optimization papers: We note in passing that there is a large body of literature on global optimality properties of low rank optimizations [16, 17]. These provide sufficient conditions for convergence to global optimum in factorized formulations, e.g., Restricted Strong Convexity and Smoothness. Observe that these guarantees require the existence of a low rank optimal solution in the original problem. These conditions do not hold in FPCA, and therefore our analysis below takes a different approach.

Algorithm 1 F-FPCA via sub-gradient descent

\{{\mathbf{x}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d}

r\in\mathbb{N}

\eta

{\mathbf{P}}\in\mathcal{P}^{d},{\rm{rank}}\left({\bf{P}}\right)\leq r

t\leftarrow 0

2: draw

{\mathbf{U}}\in\mathbb{R}^{d\times r}

randomly.

3: repeat

t\leftarrow t+1

\hat{i}\leftarrow\arg\max_{{i\in[n]}}\left\|{\mathbf{x}}_{{i}}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{{i}}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}

{\mathbf{U}}\leftarrow{\mathbf{U}}-\frac{\eta}{t}\left({\mathbf{x}}_{\hat{i}}{\mathbf{x}}_{\hat{i}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}+{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{\hat{i}}{\mathbf{x}}_{\hat{i}}^{T}-2{\mathbf{x}}_{\hat{i}}{\mathbf{x}}_{\hat{i}}^{T}\right){\mathbf{U}}

7: until convergence

8: return

{\mathbf{P}}=\Pi\left[{\mathbf{U}}{\mathbf{U}}^{T}\right]

IV Analysis - the orthogonal case

In this section, we analyze the FPCA in the special case of orthogonal targets. As explained, FPCA is NP-hard and we do not expect a scalable and accurate solution for arbitray targets. Interestingly, our analysis shows that the problem becomes significantly easier when the targets are orthogonal. This is the case for example when the targets are randomly generated and the number of targets is much smaller than their dimension.

We will use the following assumptions:

A1:

The targets $\{{\mathbf{x}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d}$ are orthogonal vectors.
A2:

The problem is not degenerate in the sense that

$\frac{r}{n}H<\min_{i}\left\|{\mathbf{x}}_{i}\right\|^{2}$

where

$H=\frac{n}{\sum_{i=1}^{n}\frac{1}{\left\|{\mathbf{x}}_{i}\right\|^{2}}}$

(the harmonic mean of the squared norms of $\{{\mathbf{x}}_{i}\}^{n}_{i=1}$ ).

Assumption A1 is the main property that simplifies the landscape and allows a tractable solution and analysis. On the other hand, assumption A2 is a technical condition that prevents a trivial degenerate solution based on the norms of the targets.

Using these assumptions, we have the following results.

Proposition 2.

Under assumptions A1-A2, any local minimizer of F-FPCA is a global maximizer of FPCA and FPCA $=\frac{r}{n}H$ .

Proof.

The proof consists of the following lemmas (proofs in the appendix):

Lemma 1.

Under assumptions A1-A2, let ${\mathbf{U}}\in\mathbb{R}^{r\times d}$ be a local minimizer of F-FPCA, then ${\mathbf{U}}\in\mathcal{O}^{d\times r}$ .

Lemma 2.

Under assumptions A1-A2, let ${\mathbf{U}}\in\mathcal{O}^{d\times r}$ a local minimizer of F-FPCA, then: $f=f_{i}({\mathbf{U}})=f_{j}({\mathbf{U}})$ for all $i,j\in[n]$ .

Intuitively, if the property in Lemma 2 is violated, then ${\mathbf{U}}$ can be infinitesimally changed in a manner that decreases the correlation of ${\mathbf{U}}$ with some direction ${\mathbf{w}}$ such that ${\mathbf{w}}\perp{\mathbf{x}}_{j}$ for all $j\in\mathcal{I}_{\mathbf{U}}$ . We can decrease the value of $f_{i}$ for some $i\in\mathcal{I}_{\mathbf{U}}$ without harming the objective function using a sequence of Givens rotations with respect to the pairs $\{{\mathbf{w}},{\mathbf{x}}_{i}\}$ for each $i\in\mathcal{I}_{\mathbf{U}}$ . After decreasing $f_{i}$ for all $i\in\mathcal{I}_{\mathbf{U}}$ the objective will also be decreased.

Finally, in order to prove global optimality we define:

\displaystyle{\mathbf{X}}=\left(\frac{{\mathbf{x}}_{1}}{\left\|{\mathbf{x}}_{1}\right\|},...,\frac{{\mathbf{x}}_{n}}{\left\|{\mathbf{x}}_{n}\right\|}\right),\quad\hat{{\mathbf{U}}}={\mathbf{X}}^{T}{\mathbf{U}}

(8)

If $f=f_{i}({\mathbf{U}})=f_{j}({\mathbf{U}})$ :

\left\|\hat{{\mathbf{U}}}_{i:}\right\|^{2}=-\frac{f_{i}({\mathbf{U}})}{\left\|{\mathbf{x}}_{i}\right\|^{2}}=-\frac{f}{\left\|{\mathbf{x}}_{i}\right\|^{2}}

We have:

\displaystyle r=\left\|\hat{{\mathbf{U}}}\right\|^{2}_{F}=\sum_{i=1}^{n}\left\|\hat{{\mathbf{U}}}_{i:}\right\|^{2}=-\sum_{i=1}^{n}\frac{f}{\left\|{\mathbf{x}}_{i}\right\|^{2}}

Rearranging yields $f=-\frac{r}{n}H$ . Together with the equivalence in Proposition 1 we conclude that all local minima yield an identical objective of $\frac{r}{n}H$ which is globally optimal.∎

Proposition 2 justifies the use of Algorithm 1 when the targets are orthogonal. Numerical results in the next section suggest that bad local minima are rare even in more realistic near-orthogonal scenarios.

Given the favourable properties of F-FPCA in the orthogonal case, it is interesting to analyze the performance of SDR in this case.

Proposition 3.

Under assumptions A1-A2, SDR is tight and its optimal objective value is

{\rm{SDR}}=\frac{r}{n}H.

However, the optimal solution may be full rank and infeasible for FPCA.

Proof.

See Appendix. ∎

Observe that the rank constraint is hard, and a rank reduction procedure such as PrSDR is necessary for finding a feasible solution. The iterative rounding algorithm of [2] relies on finding an extreme point solution, and guarantees an upper bound on its rank. The bound is not always tight. For example, their algorithm fails to find an optimal low rank solution in the orthogonal case. On the other hand, Algorithm 1 easily finds the global solution.

Finally, we conclude this section by noting an interesting relation between FPCA with orthogonal targets and the design of Finite Tight Frames [18]. Recall the following definition:

Definition 1.

•

Let $\{{\mathbf{u}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{r}$ . If $span(\{{\mathbf{u}}_{i}\}_{i=1}^{n})=\mathbb{R}^{r}$ then $\{{\mathbf{u}}_{i}\}_{i=1}^{n}$ is frame for $\mathbb{R}^{r}$ .
•

A frame $\{{\mathbf{u}}_{i}\}_{i=1}^{n}$ is tight with frame bound A if $\forall{\mathbf{v}}\in\mathbb{R}^{n}$ :

${\mathbf{v}}=\frac{1}{A}\sum_{i=1}^{n}\left<{\mathbf{v}},{\mathbf{u}}_{i}\right>{\mathbf{u}}_{i}$
•

A frame $\{{\mathbf{u}}_{i}\}_{i=1}^{n}$ is a ’Normalized Tight Frame’ if $\{{\mathbf{u}}_{i}\}_{i=1}^{n}$ is tight frame and $\|{\mathbf{u}}_{i}\|=1$ for all $i$ .

A straight forward consequence is the following result.

Corollary 1.

Under assumptions A1-A2, if ${\mathbf{U}}$ is an optimal solution for F-FPCA, then ${\mathbf{U}}^{T}$ is a tight frame. In particular, if the targets are the standard basis, then $\frac{d}{r}{\mathbf{U}}^{T}$ is a normalized tight frame.

Sketch of proof (the proof in the appendix): As proved before, the solution of F-FPCA is in $\mathcal{O}^{d\times r}$ and the transposition of any ${\mathbf{U}}\in\mathcal{O}^{d\times r}$ is a tight frame. The second part is true since the optimal solution of F-FPCA is satisfied for all k: $\left\|{\mathbf{x}}_{k}^{T}{\mathbf{U}}\right\|^{2}=\frac{r}{n}H$ . For the standard basis we get for all $i,j$ : $\left\|{\mathbf{U}}^{T}{\mathbf{e}}_{i}\right\|=\left\|{\mathbf{U}}^{T}{\mathbf{e}}_{j}\right\|$ i.e. the norm of all rows of ${\mathbf{U}}$ are equals.

It is well known that normalized tight frames can be derived as minimizers of frame potential functions [18]. The corollary provides an alternative derivation via FPCA with different targets ${\mathbf{x}}_{i}$ . Depending on the properties of the targets, this allows a flexible data-driven design that will be pursued in future work.

V Experimental results

In this section, we illustrate the efficacy of the different algorithms using numerical experiments. We compare the following competitors:

•

SDR - a (possibly infeasible) upper bound defined as the solution to (4) via CVXPY [19, 20].
•

PIRSDR - the projection of SDR onto the feasible set using eigenvalue decomposition. Before project the solution, the iterative rounding rank reduction from [2] was performed.
•

F-FPCA - the solution to (6) via Algorithm 1 with a random initialization.
•

F-FPCAi - the solution to (6) via Algorithm 1 with PIRSDR initialization.

To allow easy comparison, we normalize the results by the value of SDR, so that a ratio of $1$ corresponds to a tight solution.

V-A Synthetic simulations

We begin with experiments on synthetic targets with independent, zero mean, unit variance, Gaussian random variables. This is clearly a simplistic setting but it allows control over the different parameters $r$ , $n$ and $d$ . Each of the experiments was performed $15$ times and we report the average performance.

Rank effect: The first experiment is presented in Fig. 1 and illustrates the dependency on the rank $r$ . It is easy to see that even with very small rank, the gap between the upper and lower bounds vanishes. We conclude that in this non-orthogonal setting, the landscape of FPCA is benign as long as the rank is not very small.

Refer to caption — Figure 1: Quality of approximation as a function of the rank ( $d=200,n=50$ )

Orthogonality effect: The second experiment is presented in Fig. 2 and addresses the effect of orthogonality. As explained, the targets are drawn randomly and they tend to orthogonality as $d$ increases. Our analysis proved that the gap should vanish when the targets are exactly orthogonal. The numerical results suggest that this is also true for more realistic and near-orthogonal targets. The optimality gap clearly decreases as $d$ increases.

V-B Minerals dataset

In order to illustrate the performance in a more realistic setting we also considered a real world dataset. We consider the design of hyperspectral matched filters for detection of known minerals. We downloaded spectral signatures of minerals from the Spectral Library of United States Geological Survey (USGS). We experimented with $114$ different minerals, each with $480$ bands in the range $0.01\mu-3\mu$ . Some of the measurements were missing and their corresponding bands were omitted. We then performed PCA and reduced the dimension from $421$ to $\mathbb{R}^{114}$ . These vectors were normalized and then centered. Fig. 3 provides the signatures of the first minerals before and after the pre-processing. Finally, we performed fair dimension reduction to $r=1...6$ . Fig. 4 summarizes the quality of the approximation of the different algorithms. As before, it is easy to see that F-FPCA is near optimal at very small ranks. Interestingly, PIRSDR is beneficial as an initialization but shows inferior and non-monotonic performance on its own. As expected, all the algorithms easily attain the optimal performance at higher values of $r$ .

V-C Credit dataset

Next, we continue to the credit dataset [21] that was considered for block-FPCA in [7, 2, 3]. Following these works, we consider the functions

f_{i}({\mathbf{U}})=\left\|{\mathbf{X}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{X}}_{i}\right\|^{2}-PCA({\mathbf{X}}_{i})

(9)

where ${\mathbf{X}}_{i}=\sum_{j=1}^{n_{i}}{\mathbf{x}}_{j}^{i}{\mathbf{x}}_{j}^{iT}$ and $PCA$ is the objective of the standard PCA function of ${\mathbf{X}}_{i}$ which is independent of ${\mathbf{U}}$ . The results in Fig 6 are identical to those in [2] with our additional F-FPCA algorithm. PIRSDR achieves the SDR lower bound at all ranks excepts $r=7$ . Remarkably, F-FPCA is optimal in this specific setting and attains the bound for all $r$ without exception. Apparently, the landscape of the credit dataset is benign. We emphasize that this is pure luck and we can easily find other non-orthogonal examples with spurious local minima. In real applications, we recommend running F-FPCA using multiple initializations and choosing the best solution.

VI Conclusion

In this paper, we suggested to tackle the problem of fairness in linear dimension reduction by simply using first order methods over a non-convex optimization problem. We provided an analysis of the landscape of this problem in the special case where the targets are orthogonal to each other. We also provided experimental results which support our approach by showing that sub gradient descent is successful also in the near orthogonal case and real world data.

There are many interesting extensions to this paper that are worth pursuing. Analysis of the near-orthogonal case is still an open question. In addition, a drawback of our approach is the non smoothness of the landscape which might prevent the use of standard convergence bounds for first order methods. This can be treated by approximating the $L_{2,\infty}$ in our formulation by log-sum-exp or $L_{2,p}$ norm for $p<\infty$ functions. Experimental results show that our results can be extended to the block case that is more relevant to machine learning. Finally, we only considered the case of classical linear dimension reduction. Future work may focus on extensions to non-linear methods and tensor decompositions.

Appendix A Proof of Proposition 1

Let ${\mathbf{O}}{\mathbf{D}}{\mathbf{O}}^{T}$ a truncated EVD decomposition of ${\mathbf{U}}{\mathbf{U}}^{T}$ , then:

		$\displaystyle\left\\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}-\left\\|{\mathbf{x}}_{i}\right\\|^{2}$
	$\displaystyle=$	$\displaystyle\left\\|{\mathbf{x}}_{i}\right\\|^{2}+{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-2{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-\left\\|{\mathbf{x}}_{i}\right\\|^{2}$
	$\displaystyle=$	$\displaystyle{\mathbf{x}}_{i}^{T}{\mathbf{O}}{\mathbf{D}}^{2}{\mathbf{O}}^{T}{\mathbf{x}}_{i}-2{\mathbf{x}}_{i}^{T}{\mathbf{O}}{\mathbf{D}}{\mathbf{O}}^{T}{\mathbf{x}}_{i}$
	$\displaystyle=$	$\displaystyle{\mathbf{x}}_{i}^{T}{\mathbf{O}}({\mathbf{D}}^{2}-2{\mathbf{D}}){\mathbf{O}}^{T}{\mathbf{x}}_{i}$
	$\displaystyle=$	$\displaystyle\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{2}-2{\mathbf{D}}_{ll})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{i}\right>^{2}$

Observe that this function is minimized when ${\mathbf{D}}_{ll}=1$ for all $l\leq r$ , so:

\displaystyle\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}\geq-{\mathbf{x}}_{i}^{T}\Pi[{\mathbf{U}}{\mathbf{U}}^{T}]{\mathbf{x}}_{i}

So F-FPCA is equivalent to the following problem (over the orthogonal matrices):

\displaystyle\begin{array}[]{ll}\min_{{\mathbf{U}}\in\mathcal{O}^{d\times r}}&\max_{{i\in[n]}}\quad\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}\end{array}

Now for any orthogonal matrix ${\mathbf{U}}$ we get:

\displaystyle\begin{array}[]{ll}=\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}\\ =\left\|{\mathbf{x}}_{i}\right\|^{2}+{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-2{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-\left\|{\mathbf{x}}_{i}\right\|^{2}\\ ={\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}-2{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\\ =-{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\\ \end{array}

Finally, observe that:

•

${\mathbf{U}}$ is a feasible solution for the problem above iff ${\mathbf{P}}={\mathbf{U}}{\mathbf{U}}^{T}$ is a feasible solution for FPCA.
•

The objective function of FPCA in ${\mathbf{P}}$ is equal to the objective function of the problem above in ${\mathbf{U}}$ (multiplied by $-1$ ).

So we conclude that the problems are equivalent.

Appendix B Proof of Lemma 1

We begin with the following lemma:

Lemma 3.

Let ${\mathbf{U}}\in\mathcal{O}^{d\times r}$ . If A2 holds then $\forall i\in\mathcal{I}_{\mathbf{U}}:\quad\left\|{\mathbf{x}}_{i}\right\|>\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|$ .

Proof.

Assume in contradiction that there exists $k\in\mathcal{I}_{\mathbf{U}}$ ( $\mathcal{I}_{\mathbf{U}}:=\arg\max_{i}f_{i}({\mathbf{U}})$ ) such that: $\left\|{\mathbf{x}}_{k}\right\|=\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{k}\right\|$ , and let $j\in\arg\min_{{\mathbf{x}}_{i}}\left\|{\mathbf{x}}_{i}\right\|$ . We get for all $i$ :

f_{i}({\mathbf{U}})\leq f_{k}({\mathbf{U}}){\leq}-\left\|{\mathbf{x}}_{k}\right\|^{2}\leq-\left\|{\mathbf{x}}_{j}\right\|^{2}

Now recall the definition of $\hat{{\mathbf{U}}}$ in (8) and observe that:

	$\displaystyle r=\left\\|\hat{{\mathbf{U}}}\right\\|^{2}_{F}{=}\sum_{i=1}^{n}\left\\|\hat{{\mathbf{U}}}_{i:}\right\\|^{2}=\sum_{i=1}^{n}\frac{-f_{i}({\mathbf{U}})}{\left\\|{\mathbf{x}}_{i}\right\\|^{2}}\geq\sum_{i=1}^{n}\frac{\left\\|{\mathbf{x}}_{j}\right\\|^{2}}{\left\\|{\mathbf{x}}_{i}\right\\|^{2}}$
	$\displaystyle\Rightarrow\left\\|{\mathbf{x}}_{j}\right\\|^{2}\leq\frac{r}{\sum_{i=1}^{n}\frac{1}{\left\\|{\mathbf{x}}_{i}\right\\|^{2}}}$

This means that A2 does not hold. ∎

We will now show that if ${\mathbf{U}}$ is not orthogonal, then we can decrease either the size of $\mathcal{I}_{\mathbf{U}}$ or the value of $\max_{i}f_{i}({\mathbf{U}})$ by choosing an arbitrarily close ${\mathbf{U}}^{\prime}$ .

Lemma 4.

Let ${\mathbf{U}}\notin\mathcal{O}^{d\times r}$ , then for any $\epsilon>0$ there exists a ${\mathbf{U}}^{\prime}$ such that:

1.

$\left\|{\mathbf{U}}-{\mathbf{U}}^{\prime}\right\|\leq\epsilon$ .
2.

Either $|\mathcal{I}_{{\mathbf{U}}}|>|\mathcal{I}_{{\mathbf{U}}^{\prime}}|$ , or $\max_{i}f_{i}({\mathbf{U}})>\max_{i}f_{i}({\mathbf{U}}^{\prime})$ .

Proof.

Let ${\mathbf{O}}{\mathbf{D}}{\mathbf{O}}^{T}$ an EVD decomposition of ${\mathbf{U}}{\mathbf{U}}^{T}$ , then:

\displaystyle\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}=\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{2}-2{\mathbf{D}}_{ll})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{i}\right>^{2}

Due to ${\mathbf{U}}\notin\mathcal{O}^{d\times r}$ , there is an $\hat{l}\leq r$ such that ${{\mathbf{D}}}_{\hat{l},\hat{l}}\neq 1$ , and an $i$ such that $\left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{i}\right>\neq 0$ . Observe that: $h_{i}({{\mathbf{D}}}_{\hat{l},\hat{l}})=({{\mathbf{D}}}_{\hat{l},\hat{l}}^{2}-2{{\mathbf{D}}}_{\hat{l},\hat{l}})\left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{i}\right>^{2}$ has a local minimum only in ${{\mathbf{D}}}_{\hat{l},\hat{l}}=1$ . Therefore, define ${\mathbf{U}}^{{}^{\prime}}={\mathbf{O}}{\mathbf{D}}^{\prime}{\mathbf{O}}^{T}$ where:

{{\mathbf{D}}}_{\hat{l},\hat{l}}^{\prime}=\begin{cases}{{\mathbf{D}}}_{\hat{l},\hat{l}}^{\prime}={{\mathbf{D}}}_{\hat{l},\hat{l}}-\epsilon&\quad{{\mathbf{D}}}_{\hat{l},\hat{l}}>1\\ {{\mathbf{D}}}_{\hat{l},\hat{l}}^{\prime}={{\mathbf{D}}}_{\hat{l},\hat{l}}+\epsilon&\quad{{\mathbf{D}}}_{\hat{l},\hat{l}}<1\end{cases}

and we get $f_{j}({\mathbf{U}})>f_{j}({\mathbf{U}}^{\prime})$ for all $j$ such that $\left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{j}\right>^{2}\neq 0$ .

If there exists an $\hat{l}$ such that ${{\mathbf{D}}}_{\hat{l},\hat{l}}\neq 1$ and $\left|\left\{j|\left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{j}\right>^{2}\neq 0\right\}\cap\mathcal{I}_{\mathbf{U}}\right|>0$ then we are done.

Otherwise, pick some $\hat{l}$ with ${{\mathbf{D}}}_{\hat{l},\hat{l}}\neq 1$ , and after the procedure above take ${\mathbf{x}}_{k}\in\mathcal{I}_{\mathbf{U}}$ and define ${\mathbf{x}}^{\perp}_{k}$ the projection of ${\mathbf{x}}_{k}$ onto ${\rm{Im}}({\mathbf{U}})^{\perp}$ (by Lemma 3 ${\mathbf{x}}_{k}^{\perp}\neq 0$ ). Define ${\mathbf{O}}^{\prime}$ by adding $\epsilon{\mathbf{x}}^{\perp}_{k}$ to the $\hat{l}^{\prime}th$ singular vector ${\mathbf{O}}_{:\hat{l}}$ of ${\mathbf{U}}^{\prime}$ and define ${\mathbf{U}}^{\prime\prime}={\mathbf{O}}^{\prime}{\mathbf{D}}^{\prime}{\mathbf{O}}^{\prime T}$ . Now we get that for all $i\in\mathcal{I}_{{\mathbf{U}}^{\prime}}$ :

		$\displaystyle\left<{\mathbf{O}}_{:\hat{l}}^{\prime},{\mathbf{x}}_{i}\right>^{2}=\left(\left<{\mathbf{O}}_{:\hat{l}}+\epsilon{\mathbf{x}}_{k}^{\perp},{\mathbf{x}}_{i}\right>\right)^{2}$
	$\displaystyle=$	$\displaystyle\left(\left<{\mathbf{O}}_{:\hat{l}},{\mathbf{x}}_{i}\right>+\epsilon\left<{\mathbf{x}}_{k}^{\perp},{\mathbf{x}}_{i}\right>\right)^{2}=\epsilon^{2}\left<{\mathbf{x}}_{k}^{\perp},{\mathbf{x}}_{i}\right>^{2}\geq 0$
	$\displaystyle\Rightarrow$	$\displaystyle\left\\|{\mathbf{x}}_{i}-{\mathbf{U}}^{\prime\prime}{\mathbf{U}}^{\prime\prime T}{\mathbf{x}}_{i}\right\\|^{2}-\left\\|{\mathbf{x}}_{i}\right\\|^{2}=\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l}^{\prime},{\mathbf{x}}_{i}\right>^{2}$
	$\displaystyle\leq$	$\displaystyle\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{i}\right>^{2}$

Similarly, for ${\mathbf{x}}_{k}$ we get:

		$\displaystyle\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l}^{\prime},{\mathbf{x}}_{k}\right>^{2}$
	$\displaystyle=$	$\displaystyle\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{k}\right>^{2}+\epsilon({\mathbf{D}}_{\hat{l}\hat{l}}^{\prime 2}-2{\mathbf{D}}_{\hat{l}\hat{l}}^{\prime})\left<{\mathbf{x}}_{k}^{\perp},{\mathbf{x}}_{k}\right>^{2}$
	$\displaystyle<$	$\displaystyle\sum_{l=1}^{r}({\mathbf{D}}_{ll}^{\prime 2}-2{\mathbf{D}}_{ll}^{\prime})\left<{\mathbf{O}}_{:l},{\mathbf{x}}_{k}\right>^{2}$

as required. ∎

We can now apply Lemma 4 iteratively as follows. Let ${\mathbf{U}}\notin\mathcal{O}^{d\times r}$ , and let $\epsilon>0$ . By Lemma 4:

•

There is ${\mathbf{U}}_{1}$ with $\left\|{\mathbf{U}}_{1}-{\mathbf{U}}\right\|\leq\frac{\epsilon}{n}$ , s.t.: $|\mathcal{I}_{{\mathbf{U}}}|>|\mathcal{I}_{{\mathbf{U}}_{1}}|$ .
•

There is ${\mathbf{U}}_{2}$ with $\left\|{\mathbf{U}}_{2}-{\mathbf{U}}_{1}\right\|\leq\frac{\epsilon}{n}$ , s.t.: $|\mathcal{I}_{{\mathbf{U}}_{1}}|>|\mathcal{I}_{{\mathbf{U}}_{2}}|$ .
•

…
•

There is ${\mathbf{U}}^{\prime}$ with $\left\|{\mathbf{U}}^{\prime}-{\mathbf{U}}_{K}\right\|\leq\frac{\epsilon}{n}$ , s.t.: $\max_{i}f_{i}({\mathbf{U}}_{K})>\max_{i}f_{i}({\mathbf{U}}^{\prime})$ .

Finally, observe that $K+1\leq|\mathcal{I}_{U}|$ , so $\left\|{\mathbf{U}}-{\mathbf{U}}^{\prime}\right\|\leq\epsilon\frac{K+1}{n}\leq\epsilon$ and we can find arbitrarily close ${\mathbf{U}}^{\prime}$ such that $\max_{i}f_{i}({\mathbf{U}}^{\prime})<\max_{i}f_{i}({\mathbf{U}})$ i.e. ${\mathbf{U}}$ is not a local minimizer.

Appendix C Proof of Lemma 2

We begin with the following lemma that states that we can utilize the orthogonality of the targets in order to infinitesimally change ${\mathbf{U}}$ in a manner that increases the value of $f_{j}$ for some $j\notin\mathcal{I}_{\mathbf{U}}$ , decreases the value of $f_{i}$ for some $i\in\mathcal{I}_{\mathbf{U}}$ and does not change the value of $f_{k}$ for $k\in\mathcal{I}_{{\mathbf{U}}}\setminus\{i\}$ .

Lemma 5.

Let ${\mathbf{U}}\in\mathcal{O}^{d\times r}$ such that there exist $j$ with $f_{j}({\mathbf{U}})<\max_{l\in[n]}f_{l}({\mathbf{U}})$ . Then, there exists an $i\in\mathcal{I}_{\mathbf{U}}$ such that: for any $\epsilon>0$ there exist ${\mathbf{U}}_{\theta}$ such that:

1.

$\left\|{\mathbf{U}}-{\mathbf{U}}_{\theta}\right\|\leq\epsilon$ .
2.

$\forall k\in[n]\setminus\mathcal{I}_{\mathbf{U}}:\quad f_{k}({\mathbf{U}}_{\theta})<f_{i}({\mathbf{U}})$
3.

$\forall k\in\mathcal{I}_{\mathbf{U}}\setminus\left\{i\right\}:\qquad f_{k}({\mathbf{U}})=f_{k}({\mathbf{U}}_{\theta})$ .
4.

$f_{i}({\mathbf{U}}_{\theta})<f_{i}({\mathbf{U}})$

Proof.

Define ${\mathbf{R}}(\theta)$ , a Given Rotation (for some $\theta$ ) over the $1,2$ axes, i.e.:

{\mathbf{R}}(\theta)_{ij}=\begin{cases}\cos\theta&ij=11\quad or\quad ij=22\\ \sin\theta&ij=12\\ -\sin\theta&ij=21\\ {\mathbf{I}}_{ij}&else\end{cases}

along with two orthogonal vectors

	$\displaystyle{\mathbf{v}}_{1}$	$\displaystyle=$	$\displaystyle\frac{{\mathbf{x}}_{i}}{\left\\|{\mathbf{x}}_{i}\right\\|}$
	$\displaystyle{\mathbf{v}}_{2}$	$\displaystyle=$	$\displaystyle\begin{array}[]{lcr}\begin{cases}\frac{{\mathbf{x}}_{j}}{\left\\|{\mathbf{x}}_{j}\right\\|}&\exists i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\neq 0\\ \frac{{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}}{\left\\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\right\\|}&{\rm{else}}\end{cases}\end{array}$		(11)

and an orthonormal basis for their orthogonal complement in $\mathbb{R}^{d}$ : ${\mathbf{V}}=\left({\mathbf{v}}_{1},...{\mathbf{v}}_{d}\right)$ . Now define: ${\mathbf{U}}_{\theta}={\mathbf{V}}{\mathbf{R}}(\theta){\mathbf{V}}^{T}{\mathbf{U}}$ and we get:

1+2 is true, since:

$h_{1}\left(\theta\right):={\mathbf{U}}_{\theta},\quad h_{2}(\theta):=f_{j}({\mathbf{U}}_{\theta})$ are continuous functions.

3 is true, since:

For all $k\in\mathcal{I}_{U}\setminus\left\{i\right\}:{\mathbf{x}}_{k}\perp{\mathbf{v}}_{1},{\mathbf{v}}_{2}$ thus ${\mathbf{R}}(\theta){\mathbf{V}}^{T}{\mathbf{x}}_{k}={\mathbf{I}}{\mathbf{V}}^{T}{\mathbf{x}}_{k}$ and:

	$\displaystyle{\mathbf{x}}_{k}^{T}{\mathbf{U}}_{\theta}{\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{k}=$	$\displaystyle{\mathbf{x}}_{k}^{T}{\mathbf{V}}{\mathbf{R}}(\theta){\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{R}}(\theta)^{T}{\mathbf{V}}^{T}{\mathbf{x}}_{k}$
	$\displaystyle=$	$\displaystyle{\mathbf{x}}_{k}^{T}{\mathbf{V}}{\mathbf{I}}{\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{I}}{\mathbf{V}}^{T}{\mathbf{x}}_{k}$
	$\displaystyle=$	$\displaystyle{\mathbf{x}}_{k}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{k}$

In order to show 4 we use the equality in (12) (In the next page, proof is in the appendix, since it is quite technical):

\displaystyle f_{i}({\mathbf{U}}_{\theta})-f_{i}({\mathbf{U}})=\begin{cases}\sin\left(2\theta\right){\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{j}+\sin^{2}\left(\theta\right)\left({\mathbf{e}}_{j}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{j}-{\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{i}\right)&\exists i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\neq 0\\ (\left\|{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}){\sin(\theta)^{2}}&else\end{cases}

(12)

Now, if $\exists i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\neq 0$ :

•

If: ${\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}<0$ then for any $\frac{\pi}{2}>\theta>0$ : $\sin\left(2\theta\right){\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{j}<0$ .
•

If: ${\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}>0$ then for any $-\frac{\pi}{2}<\theta<0$ : $\sin\left(2\theta\right){\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\hat{{\mathbf{U}}}^{T}{\mathbf{e}}_{j}<0$ .

and since $\sin(\theta)^{2}=o(\sin(2\theta))$ , for small enough $|\theta|$ we get: $f_{i}({\mathbf{U}})<f_{i}({\mathbf{U}}_{\theta})$ .

On the other hand, By Lemma 3 $\left\|{\mathbf{x}}_{i}\right\|^{2}>\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}$ , so if $\forall i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}=0$ then:

(\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}){\sin(\theta)^{2}}<0

∎

Armed with these results, we proceed to the rest of the proof of Lemma 2. Assume $f_{i}({\mathbf{U}})<\max_{l}f_{l}({\mathbf{U}})$ for some $i$ . By Lemma 5, let $\epsilon>0$ , then:

•

There is ${\mathbf{U}}_{1}$ with $\left\|{\mathbf{U}}_{1}-{\mathbf{U}}\right\|\leq\frac{\epsilon}{K}$ , s.t.: $\max_{i}f_{i}({\mathbf{U}}_{1})=\max_{i}f_{i}({\mathbf{U}})$ , and $|\mathcal{I}_{{\mathbf{U}}_{1}}|=|\mathcal{I}_{{\mathbf{U}}}|-1$ .
•

…
•

There is ${\mathbf{U}}_{K}$ with $\left\|{\mathbf{U}}_{K}-{\mathbf{U}}_{K-1}\right\|\leq\frac{\epsilon}{K}$ , s.t.: $\max_{i}f_{i}({\mathbf{U}}_{K})<\max_{i}f_{i}({\mathbf{U}}_{K-1})$ .

Finally, observe that $\left\|{\mathbf{U}}-{\mathbf{U}}_{K}\right\|\leq{\epsilon}$ and $K\leq|\mathcal{I}_{U}|$ , so we can find arbitrarily close ${\mathbf{U}}_{K}$ such that $\max_{i}f_{i}({\mathbf{U}}_{K})<\max_{i}f_{i}({\mathbf{U}})$ i.e. ${\mathbf{U}}$ is not a local maximizer.

Appendix D Proof of Proposition 3

Proof.

Given $\{{\mathbf{x}}_{i}\}_{i=1}^{n}$ , and recall the definition of ${\mathbf{X}}$ in (8). In the SDR problem we solve:

		$\displaystyle\begin{array}[]{ll}\max_{{\bf{P}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}{\mathbf{x}}_{i}^{T}{\bf{P}}{\mathbf{x}}_{i}\\ {\rm{s.t.}}&{\rm{Trace}}\left({\bf{P}}\right)\leq r\\ &{\bf{0}}\preceq{\bf{P}}\preceq{\bf{I}}\end{array}$
	$\displaystyle=$	$\displaystyle\begin{array}[]{ll}\max_{{\bf{P}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}\left\\|{\mathbf{x}}_{i}\right\\|^{2}{\mathbf{e}}_{i}^{T}{\mathbf{X}}{\mathbf{P}}{\mathbf{X}}^{T}{\mathbf{e}}_{i}\\ {\rm{s.t.}}&{\rm{Trace}}\left({\bf{P}}\right)\leq r\\ &{\bf{0}}\preceq{\bf{P}}\preceq{\bf{I}}\end{array}$
	$\displaystyle=$	$\displaystyle\begin{array}[]{ll}\max_{{\bf{P}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}\left\\|{\mathbf{x}}_{i}\right\\|^{2}[{\mathbf{X}}{\mathbf{P}}{\mathbf{X}}^{T}]_{ii}\\ {\rm{s.t.}}&{\rm{Trace}}\left({\mathbf{X}}{\mathbf{P}}{\mathbf{X}}^{T}\right)\leq r\\ &{\bf{0}}\preceq{\mathbf{X}}{\mathbf{P}}{\mathbf{X}}^{T}\preceq{\bf{I}}\end{array}$
	$\displaystyle=$	$\displaystyle\begin{array}[]{ll}\max_{\hat{{\mathbf{P}}}\in\mathbb{S}^{d}_{+}}&\min_{i\in[n]}\left\\|{\mathbf{x}}_{i}\right\\|^{2}\hat{{\mathbf{P}}}_{ii}\\ {\rm{s.t.}}&{\rm{Trace}}\left(\hat{{\mathbf{P}}}\right)\leq r\\ &{\bf{0}}\preceq\hat{{\mathbf{P}}}\preceq{\bf{I}}\end{array}$

where ${\mathbf{X}}{\mathbf{P}}{\mathbf{X}}^{T}=\hat{{\mathbf{P}}}$ and we have used the orthogonality of ${\mathbf{X}}$ .

Now, define $\hat{{\mathbf{P}}}$ as a diagonal matrix with $\hat{{\mathbf{P}}}_{ii}=\frac{r}{n}\cdot\frac{H}{\left\|{\mathbf{x}}_{i}\right\|^{2}}$ .

It is easy to verify that $\hat{{\mathbf{P}}}$ is feasible and yields an objective of $\frac{r}{n}H$ . Now let $\hat{{\mathbf{P}}}^{\prime}$ such that $\min_{k}\left\|{\mathbf{x}}_{k}\right\|^{2}\hat{{\mathbf{P}}}_{kk}^{\prime}>\frac{r}{n}H$ , then:

		$\displaystyle\forall i:\left\\|{\mathbf{x}}_{i}\right\\|^{2}\hat{{\mathbf{P}}}_{ii}^{\prime}>\frac{r}{n}H$
	$\displaystyle\Rightarrow$	$\displaystyle\forall i:\hat{{\mathbf{P}}}_{ii}^{\prime}>\frac{\frac{r}{n}H}{\left\\|{\mathbf{x}}_{i}\right\\|^{2}}$
	$\displaystyle\Rightarrow$	$\displaystyle{\rm{Trace}}(\hat{{\mathbf{P}}}^{\prime})=\sum_{i=1}^{n}\hat{{\mathbf{P}}}_{ii}^{\prime}>\sum_{i=1}^{n}\frac{\frac{r}{n}H}{\left\\|{\mathbf{x}}_{i}\right\\|^{2}}=r$

so $\hat{{\mathbf{P}}}^{\prime}$ is not feasible and we conclude that $\hat{{\mathbf{P}}}$ is optimal for SDR. By Proposition 2 the optimal value for F-FPCA is $-\frac{r}{n}H$ so by Proposition 1 the value of FPCA is $\frac{r}{n}H$ so SDR=FPCA (but this solution might not be low rank, as in our positive definite construction). ∎

Appendix E Proof of Corollary 1

We start the proof of the proposition by the observation that tight frame is characterized by the standard basis:

Lemma 6.

A frame $\{{\mathbf{u}}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{r}$ is tight with frame bound A if and only if $\forall{\mathbf{e}}_{i}$ in the standard basis of $\mathbb{R}^{r}$ :

{\mathbf{e}}_{i}=\frac{1}{A}\sum_{i=1}^{n}\left<{\mathbf{e}}_{i},{\mathbf{u}}_{i}\right>{\mathbf{u}}_{i}

Proof.

Observe that if $\;\forall{\mathbf{e}}_{i}\in\left\{{\mathbf{e}}_{i}\right\}_{i=1}^{r}$ we have: ${\mathbf{e}}_{i}=\frac{1}{A}\sum_{j=1}^{n}\left\langle{\mathbf{e}}_{i},{\mathbf{u}}_{j}\right\rangle{\mathbf{u}}_{j}$ than we get for all ${\mathbf{v}}\in\mathbb{R}^{r}$ :

		$\displaystyle{\mathbf{v}}=\sum_{j=1}^{r}{\mathbf{v}}_{j}{\mathbf{e}}_{j}=\sum_{j=1}^{r}{\mathbf{v}}_{j}\frac{1}{A}\sum_{i=1}^{n}\left\langle{\mathbf{e}}_{j},{\mathbf{u}}_{i}\right\rangle{\mathbf{u}}_{i}$
	$\displaystyle=$	$\displaystyle\frac{1}{A}\sum_{i=1}^{n}\left\langle\sum_{j=1}^{r}{\mathbf{v}}_{j}{\mathbf{e}}_{j},{\mathbf{u}}_{i}\right\rangle{\mathbf{u}}_{i}=\frac{1}{A}\sum_{i=1}^{n}\left\langle{\mathbf{v}},{\mathbf{u}}_{i}\right\rangle{\mathbf{u}}_{i}$

∎

Now we use the observation above to claim that tight frame is actual the transposition of semi orthogonal matrix:

Lemma 7.

Let ${\mathbf{U}}^{T}=({\mathbf{u}}_{1},...,{\mathbf{u}}_{n})\in\mathbb{R}^{r\times n}$ . $\{{\mathbf{u}}_{i}\}_{i=1}^{n}$ is a tight frame with frame bound $A$ iff ${\mathbf{U}}$ has orthogonal columns with norm $\sqrt{A}$ .

Proof.

Consider equality 1 below:

		$\displaystyle{\mathbf{U}}^{T}{\mathbf{U}}=\left({\mathbf{u}}_{1},...,{\mathbf{u}}_{n}\right)\left({\mathbf{u}}_{1},...,{\mathbf{u}}_{n}\right)^{T}$
	$\displaystyle=$	$\displaystyle\left(\begin{array}[]{c}\\ \sum_{j=1}^{n}{\mathbf{u}}_{j}^{1}{\mathbf{u}}_{j},\\ \\ \end{array}...,\begin{array}[]{c}\\ \sum_{j=1}^{n}{\mathbf{u}}_{j}^{r}{\mathbf{u}}_{j}\\ \\ \end{array}\right)$
	$\displaystyle=^{1}$	$\displaystyle\left(\begin{array}[]{c}\\ A{\mathbf{e}}_{1}\\ \\ \end{array}...\begin{array}[]{c}\\ A{\mathbf{e}}_{r}\\ \\ \end{array}\right)=A{\mathbf{I}}$

Observe that ${\mathbf{U}}{\mathbf{U}}^{T}=A{\mathbf{I}}$ iff ${\mathbf{U}}$ has orthogonal columns with norm $\sqrt{A}$ . Equality 1 also holds iff $A{\mathbf{e}}_{i}=\sum_{j=1}^{n}{\mathbf{u}}_{j}^{i}{\mathbf{u}}_{j}=\sum_{j=1}^{n}\left\langle{\mathbf{e}}_{i},{\mathbf{u}}_{j}\right\rangle{\mathbf{u}}_{j}$ which holds iff $\{{\mathbf{u}}_{i}\}_{i=1}^{n}$ is a tight frame with frame bound $A$ (by Lemma 6), so we conclude that the conditions are equivalent.

∎

Lemma 8.

If for all k: $\left\|{\mathbf{x}}_{k}^{T}{\mathbf{U}}\right\|^{2}=\frac{r}{n}H$ , then: $\left\|{\mathbf{U}}\right\|^{2}=r$ .

Proof.

Recall the definition of $\hat{{\mathbf{U}}}$ in (8) and observe that:

		$\displaystyle\left\\|{\mathbf{U}}\right\\|_{F}^{2}=\left\\|\hat{{\mathbf{U}}}\right\\|_{F}^{2}=\sum_{i=1}^{n}\left\\|{\mathbf{e}}_{i}^{T}\hat{{\mathbf{U}}}\right\\|^{2}=\sum_{i=1}^{n}\frac{1}{\left\\|{{\mathbf{x}}}_{i}\right\\|^{2}}\left\\|{{\mathbf{x}}}_{i}^{T}{{\mathbf{U}}}\right\\|^{2}$
	$\displaystyle=$	$\displaystyle\sum_{i=1}^{n}\frac{1}{\left\\|{{\mathbf{x}}}_{i}\right\\|^{2}}r\left({\sum_{i=1}^{n}\frac{1}{\left\\|{\mathbf{x}}_{i}\right\\|^{2}}}\right)^{-1}=r$

∎

Now, let ${\mathbf{U}}\in\mathbb{R}^{d\times r}$ an optimal solution for F-FPCA, by Lemma 1 the columns of ${\mathbf{U}}$ are orthonormal, so by Lemma 7 ${\mathbf{U}}$ is tight frame and by Proposition 2 we have for all $k$ :

\left\|{\mathbf{x}}_{k}^{T}{\mathbf{U}}\right\|^{2}=f_{k}({\mathbf{U}})=\frac{r}{n}H

On the other hand, let ${\mathbf{U}}^{T}$ a tight frame as above, by Lemma 7 the columns of ${\mathbf{U}}$ are orthogonal and have the same norm. By Lemma 8 $\left\|{\mathbf{U}}\right\|_{F}^{2}=r$ so the columns has unit norms, i.e. the columns are orthonormal and for all $i\in[n]$ :

\displaystyle f_{i}({\mathbf{U}})=-{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}=-\frac{r}{n}H

i.e. $\min_{i}\left\|{\mathbf{x}}_{i}-{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{x}}_{i}\right\|^{2}=-\frac{r}{n}H$ which is the optimal target. Finally, if $\forall i:\;{\mathbf{x}}_{i}={\mathbf{e}}_{i}$ , then F-FPCA is reduced to the problem of finding ’normalized tight frame’.

Acknowledgment

The authors would like to thank Uri Okon who initiated this research and defined the problem, as well as Gal Elidan. This work was partially supported by ISF grant 1339/15.

References

[1] J. Sun, Q. Qu, and J. Wright, “When are nonconvex problems not scary?” arXiv preprint arXiv:1510.06096, 2015.
[2] J. Morgenstern, S. Samadi, M. Singh, U. Tantipongpipat, and S. Vempala, “Fair dimensionality reduction and iterative rounding for sdps,” arXiv preprint arXiv:1902.11281, 2019.
[3] S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala, “The price of fair pca: One extra dimension,” in Advances in Neural Information Processing Systems, 2018, pp. 10 976–10 987.
[4] N. D. Sidiropoulos, T. N. Davidson, and Z.-Q. Luo, “Transmit beamforming for physical-layer multicasting,” IEEE Trans. Signal Processing, vol. 54, no. 6-1, pp. 2239–2251, 2006.
[5] W.-K. K. Ma, “Semidefinite relaxation of quadratic optimization problems and applications,” IEEE Signal Processing Magazine, vol. 1053, no. 5888/10, 2010.
[6] A. Cheriyadat and L. M. Bruce, “Why principal component analysis is not an appropriate feature extraction method for hyperspectral data,” in IGARSS 2003. 2003 IEEE International Geoscience and Remote Sensing Symposium. Proceedings (IEEE Cat. No. 03CH37477), vol. 6. IEEE, 2003, pp. 3420–3422.
[7] M. Olfat and A. Aswani, “Convex formulations for fair principal component analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 663–670.
[8] W. Bian and D. Tao, “Max-min distance analysis by using sequential sdp relaxation for dimension reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 1037–1050, 2010.
[9] G. Lerman, M. B. McCoy, J. A. Tropp, and T. Zhang, “Robust computation of linear models by convex relaxation,” Foundations of Computational Mathematics, vol. 15, no. 2, pp. 363–410, 2015.
[10] M. M. Kamani, F. Haddadpour, R. Forsati, and M. Mahdavi, “Efficient fair principal component analysis,” arXiv preprint arXiv:1911.04931, 2019.
[11] S. Burer and R. D. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.
[12] ——, “Local minima and convergence in low-rank semidefinite programming,” Mathematical Programming, vol. 103, no. 3, pp. 427–444, 2005.
[13] N. Boumal, V. Voroninski, and A. S. Bandeira, “Deterministic guarantees for burer-monteiro factorizations of smooth semidefinite programs,” Communications on Pure and Applied Mathematics, 2018.
[14] N. Boumal, V. Voroninski, and A. Bandeira, “The non-convex burer-monteiro approach works on smooth semidefinite programs,” in Advances in Neural Information Processing Systems, 2016, pp. 2757–2765.
[15] D. Cifuentes, “Burer-monteiro guarantees for general semidefinite programs,” arXiv preprint arXiv:1904.07147, 2019.
[16] Z. Zhu, Q. Li, G. Tang, and M. B. Wakin, “Global optimality in low-rank matrix optimization,” IEEE Transactions on Signal Processing, vol. 66, no. 13, pp. 3614–3628, 2018.
[17] Q. Li, Z. Zhu, and G. Tang, “The non-convex geometry of low-rank matrix optimization,” Information and Inference: A Journal of the IMA, vol. 8, no. 1, pp. 51–96, 2018.
[18] J. J. Benedetto and M. Fickus, “Finite normalized tight frames,” Advances in Computational Mathematics, vol. 18, no. 2-4, pp. 357–385, 2003.
[19] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
[20] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd, “A rewriting system for convex optimization problems,” Journal of Control and Decision, vol. 5, no. 1, pp. 42–60, 2018.
[21] I.-C. Yeh and C.-h. Lien, “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients,” Expert Systems with Applications, vol. 36, no. 2, pp. 2473–2480, 2009.

Appendix F Proof of equation (11)

Lemma 9.

Let ${\mathbf{U}}\in\mathcal{O}^{d\times r}$ and assume $\forall i\in\mathcal{I}_{\mathbf{U}}:\;{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}=0$ . Define:

{\mathbf{v}}_{2}=\frac{{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}}{\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\right\|},\;\;{\mathbf{v}}_{1}=\frac{{\mathbf{x}}_{i}}{\left\|{\mathbf{x}}_{i}\right\|}

for some $i\in\mathcal{I}_{{\mathbf{U}}}$ and complete these vectors to orthonormal basis: ${\mathbf{V}}=\left({\mathbf{v}}_{1},...{\mathbf{v}}_{d}\right)$ , and define: ${\mathbf{U}}_{\theta}={\mathbf{V}}{\mathbf{R}}(\theta){\mathbf{V}}^{T}{\mathbf{U}}$ , then:

\displaystyle f_{i}({\mathbf{U}})-f_{i}({\mathbf{U}}_{\theta})=(\left\|{\mathbf{x}}_{i}\right\|^{2}-\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}){\sin(\theta)^{2}}

Proof.

	$\displaystyle f_{i}({\mathbf{U}})-f_{i}({\mathbf{U}}_{\theta})=$	$\displaystyle{\mathbf{x}}_{i}^{T}{\mathbf{U}}_{\theta}{\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{i}-{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}$
	$\displaystyle=$	$\displaystyle\left({\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{i}-{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right)^{T}\left({\mathbf{U}}_{\theta}^{T}{\mathbf{x}}_{i}+{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right)$
	$\displaystyle=$	$\displaystyle\left({\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{R}}(\theta)^{T}{\mathbf{V}}^{T}{\mathbf{x}}_{i}-{\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{V}}^{T}{\mathbf{x}}_{i}\right)^{T}\left({\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{R}}(\theta)^{T}{\mathbf{V}}^{T}{\mathbf{x}}_{i}+{\mathbf{U}}^{T}{\mathbf{V}}{\mathbf{V}}^{T}{\mathbf{x}}_{i}\right)$
	$\displaystyle=$	$\displaystyle{\mathbf{x}}_{i}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)^{T}{\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}+{\mathbf{I}}\right){\mathbf{V}}^{T}{\mathbf{x}}_{i}$
	$\displaystyle=$	$\displaystyle{\mathbf{x}}_{i}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)^{T}{\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right){\mathbf{V}}^{T}{\mathbf{x}}_{i}+{\mathbf{x}}_{i}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)^{T}{\mathbf{V}}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{V}}(2{\mathbf{I}}){\mathbf{V}}^{T}{\mathbf{x}}_{i}$
	$\displaystyle=$	$\displaystyle\underbrace{{\mathbf{x}}_{i}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)^{T}{\mathbf{V}}^{T}{\mathbf{U}}}_{\hat{x}_{i}^{T}}\underbrace{{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right){\mathbf{V}}^{T}{\mathbf{x}}_{i}}_{\hat{x}_{i}}+2\underbrace{{\mathbf{x}}_{i}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)^{T}{\mathbf{V}}^{T}{\mathbf{U}}}_{\hat{x}_{i}^{T}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}$

\begin{split}{\hat{{\mathbf{x}}}_{i}}=&{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right){\mathbf{V}}^{T}{\mathbf{x}}_{i}\\ =&{{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right)\left<{\mathbf{v}}_{1},{\mathbf{x}}_{i}\right>{\mathbf{e}}_{1}}\\ =&\left\|{\mathbf{x}}_{i}\right\|{{\mathbf{U}}^{T}{\mathbf{V}}\left({\mathbf{R}}(\theta)^{T}-{\mathbf{I}}\right){\mathbf{e}}_{1}}\\ =&\left\|{\mathbf{x}}_{i}\right\|{{\mathbf{U}}^{T}{\mathbf{V}}\left(\cos(\theta){\mathbf{e}}_{1}-\sin(\theta){\mathbf{e}}_{2}-{\mathbf{e}}_{1}\right)}\\ =&\left\|{\mathbf{x}}_{i}\right\|\left({{\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)}-{{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)}\right)\\ =&{{\mathbf{U}}^{T}{\mathbf{x}}_{i}\left(\cos(\theta)-1\right)}-\left\|{\mathbf{x}}_{i}\right\|{{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)}\end{split}\quad\quad\begin{split}\hat{{\mathbf{x}}}_{i}^{T}{\mathbf{U}}^{T}{\mathbf{x}}_{i}=&{\mathbf{x}}_{i}^{T}{\mathbf{U}}\left({{\mathbf{U}}^{T}{\mathbf{x}}_{i}\left(\cos(\theta)-1\right)}-\left\|{\mathbf{x}}_{i}\right\|{{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)}\right)\\ =&\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\left(\cos(\theta)-1\right)-\left\|{\mathbf{x}}_{i}\right\|{\frac{{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}}{\left\|{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{j}\right\|}\sin(\theta)}\\ =&\left\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\|^{2}\left(\cos(\theta)-1\right)\end{split}

	$\displaystyle\hat{{\mathbf{x}}}_{i}^{T}\hat{{\mathbf{x}}}_{i}=$	$\displaystyle\left\\|{\mathbf{x}}_{i}\right\\|^{2}\left({{\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)}-{{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)}\right)^{T}\left({{\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)}-{{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)}\right)$
	$\displaystyle=$	$\displaystyle\left\\|{\mathbf{x}}_{i}\right\\|^{2}\left({{\mathbf{v}}_{1}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)^{2}}+{{\mathbf{v}}_{2}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{2}\sin(\theta)^{2}}-2\sin(\theta){\mathbf{v}}_{2}^{T}{\mathbf{U}}{{\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)}\right)$
	$\displaystyle=$	$\displaystyle\left\\|{\mathbf{x}}_{i}\right\\|^{2}\left({{\mathbf{v}}_{1}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{v}}_{1}\left(\cos(\theta)-1\right)^{2}}+{{\mathbf{v}}_{2}^{T}{\mathbf{v}}_{2}\sin(\theta)^{2}}\right)$
	$\displaystyle=$	$\displaystyle{\mathbf{x}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{x}}_{i}\left(\cos(\theta)-1\right)^{2}+\left\\|{\mathbf{x}}_{i}\right\\|^{2}{\sin(\theta)^{2}}$
	$\displaystyle=$	$\displaystyle\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}\cos(\theta)^{2}-2\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}\cos(\theta)+\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}+(\left\\|{\mathbf{x}}_{i}\right\\|^{2}-\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}){\sin(\theta)^{2}}+\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}{\sin(\theta)^{2}}$
	$\displaystyle=$	$\displaystyle\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}-2\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}\cos(\theta)+\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}+(\left\\|{\mathbf{x}}_{i}\right\\|^{2}-\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}){\sin(\theta)^{2}}$
	$\displaystyle=$	$\displaystyle(\left\\|{\mathbf{x}}_{i}\right\\|^{2}-\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}){\sin(\theta)^{2}}+2\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}(1-\cos(\theta))$

So finally:

	$\displaystyle f_{i}({\mathbf{U}})-f_{i}({\mathbf{U}}_{\theta})=$	$\displaystyle\hat{{\mathbf{x}}}_{i}^{T}\hat{{\mathbf{x}}}_{i}+2\hat{{\mathbf{x}}}_{i}^{T}{\mathbf{U}}^{T}{\mathbf{x}}_{i}$
	$\displaystyle=$	$\displaystyle 2\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}\left(\cos(\theta)-1\right)+(\left\\|{\mathbf{x}}_{i}\right\\|^{2}-\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}){\sin(\theta)^{2}}+2\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}(1-\cos(\theta))$
	$\displaystyle=$	$\displaystyle(\left\\|{\mathbf{x}}_{i}\right\\|^{2}-\left\\|{\mathbf{U}}^{T}{\mathbf{x}}_{i}\right\\|^{2}){\sin(\theta)^{2}}$

. ∎

Lemma 10.

Let ${\mathbf{U}}\in\mathcal{O}^{d\times r}$ , ${\mathbf{R}}(\theta)=G(\theta,i,j)$ (a Given rotation over the axes i,j), ${\mathbf{U}}^{{}^{\prime}}={\mathbf{R}}(\theta){\mathbf{U}}$ .

\displaystyle{\mathbf{e}}_{i}^{T}{\mathbf{U}}^{{}^{\prime}}{\mathbf{U}}^{{}^{\prime}T}{\mathbf{e}}_{i}-{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}=\sin\left(2\theta\right){\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{j}+\sin^{2}\left(\theta\right)\left({\mathbf{e}}_{j}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{j}-{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}\right)

Proof.

Observe that:

\displaystyle{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}=

\displaystyle\sum_{l=1}^{r}{\mathbf{U}}_{il}^{2}

Since ${\mathbf{R}}(\theta){\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{R}}(\theta)^{T}={\mathbf{U}}^{{}^{\prime}}{\mathbf{U}}^{{}^{\prime}T}$ , we also have:

	$\displaystyle{\mathbf{e}}_{i}^{T}{\mathbf{U}}^{{}^{\prime}}{\mathbf{U}}^{{}^{\prime}T}{\mathbf{e}}_{i}=$	$\displaystyle\sum_{l=1}^{r}\left[{\mathbf{U}}^{T}{\mathbf{R}}(\theta)^{T}{\mathbf{e}}_{i}\right]_{l}^{2}$
	$\displaystyle=$	$\displaystyle\sum_{l=1}^{r}\left[{\mathbf{U}}^{T}\left(\cos\left(\theta\right){\mathbf{e}}_{i}+\sin\left(\theta\right){\mathbf{e}}_{j}\right)\right]_{l}^{2}$
	$\displaystyle=$	$\displaystyle\sum_{l=1}^{r}\left(\cos\left(\theta\right)^{2}{\mathbf{U}}_{il}^{2}+2\cos\left(\theta\right){\mathbf{U}}_{il}\sin\left(\theta\right){\mathbf{U}}_{jl}+\sin\left(\theta\right)^{2}{\mathbf{U}}_{jl}^{2}\right)$

So:

	$\displaystyle{\mathbf{e}}_{i}^{T}{\mathbf{U}}^{{}^{\prime}}{\mathbf{U}}^{{}^{\prime}T}{\mathbf{e}}_{i}-{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}=$	$\displaystyle\sum_{l=1}^{r}\left(2\cos\left(\theta\right){\mathbf{U}}_{il}\sin\left(\theta\right){\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right){\mathbf{U}}_{jl}^{2}-\left(1-\cos^{2}\left(\theta\right)\right){\mathbf{U}}_{il}^{2}\right)$
	$\displaystyle=$	$\displaystyle\sum_{l=1}^{r}\left(2\cos\left(\theta\right){\mathbf{U}}_{il}\sin\left(\theta\right){\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right){\mathbf{U}}_{jl}^{2}-\sin^{2}\left(\theta\right){\mathbf{U}}_{il}^{2}\right)$
	$\displaystyle=$	$\displaystyle\sum_{l=1}^{r}\left(2\cos\left(\theta\right){\mathbf{U}}_{il}\sin\left(\theta\right){\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right)\left({\mathbf{U}}_{jl}^{2}-{\mathbf{U}}_{il}^{2}\right)\right)$
	$\displaystyle=$	$\displaystyle 2\cos\left(\theta\right)\sin\left(\theta\right)\sum_{l=1}^{r}{\mathbf{U}}_{il}{\mathbf{U}}_{jl}+\sin^{2}\left(\theta\right)\sum_{l=1}^{r}\left({\mathbf{U}}_{jl}^{2}-{\mathbf{U}}_{il}^{2}\right)$
	$\displaystyle=$	$\displaystyle\sin\left(2\theta\right){\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{j}+\sin^{2}\left(\theta\right)\left({\mathbf{e}}_{j}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{j}-{\mathbf{e}}_{i}^{T}{\mathbf{U}}{\mathbf{U}}^{T}{\mathbf{e}}_{i}\right)$

∎

		$\displaystyle\forall i:\left\\|{\mathbf{x}}_{i}\right\\|^{2}\hat{{\mathbf{P}}}_{ii}^{\prime}>\frac{r}{n}H$
	$\displaystyle\Rightarrow$	$\displaystyle\forall i:\hat{{\mathbf{P}}}_{ii}^{\prime}>\frac{\frac{r}{n}H}{\left\\|{\mathbf{x}}_{i}\right\\|^{2}}$
	$\displaystyle\Rightarrow$	$\displaystyle{\rm{Trace}}(\hat{{\mathbf{P}}}^{\prime})=\sum_{i=1}^{n}\hat{{\mathbf{P}}}_{ii}^{\prime}>\sum_{i=1}^{n}\frac{\frac{r}{n}H}{\left\\|{\mathbf{x}}_{i}\right\\|^{2}}=r$