
On Approximation Algorithm for Orthogonal Low-Rank Tensor Approximation

Yuning Yang College of Mathematics and Information Science, Guangxi University, Nanning, 530004, China ([email protected]).
Abstract

The goal of this work is to fill a gap in [Yang, SIAM J. Matrix Anal. Appl., 41 (2020), 1797–1825]. In that work, an approximation procedure was proposed for orthogonal low-rank tensor approximation; however, the approximation lower bound was established only when the number of orthonormal factors is one. To fill this gap, by further exploring the multilinearity and orthogonality of the problem, we introduce a modified approximation algorithm. An approximation lower bound is established, in either a deterministic or an expected sense, regardless of how many orthonormal factors there are. In addition, a major feature of the new algorithm is its flexibility to allow either deterministic or randomized procedures for solving a key step for each latent orthonormal factor involved in the algorithm. This feature can reduce the computation of large SVDs, making the algorithm more efficient. Some numerical studies are provided to validate the usefulness of the proposed algorithm.

Keywords: tensor; orthogonality; approximation algorithm; approximation bound; polar decomposition

1 Introduction

Tensors (hypermatrices) nowadays play an important role in signal processing, data analysis, statistics, and machine learning, and tensor decompositions/approximations are key to these applications [19, 6, 4, 35]. The canonical polyadic (CP) decomposition factorizes the data tensor into a sum of rank-1 components. Such a decomposition can be unique under mild assumptions [20, 32, 36]; however, the degeneracy of the associated optimization problem requires one to impose additional constraints, such as nonnegativity, angularity, and orthogonality [18, 26, 25]. Orthogonal CP decomposition allows one or more of the latent factors to be orthonormal [18, 11], giving the flexibility to model applications arising from image processing, joint SVD, independent component analysis, DS-CDMA systems, and so on [34, 5, 40, 30, 8, 37].

Most existing algorithms for orthogonal low-rank tensor decomposition/approximation are iterative [3, 39, 44, 29, 11, 45, 15, 23, 22, 33], just to name a few. To find a good feasible initializer for such iterative algorithms, [45, Procedure 3.1] developed an approximation procedure, which finds the orthonormal factors by means of the celebrated HOSVD [9], and then computes the non-orthonormal ones by solving tensor rank-1 approximation problems. Numerical tests showed that, with the help of this procedure, alternating least squares (ALS) can recover the latent factors better.

However, a gap was left: the approximation lower bound of [45, Procedure 3.1] was established only when the number of orthonormal factors is one. Denote by $t$ the number of latent orthonormal factors and by $d$ the order of the data tensor under consideration. The difficulty in extending the lower bound analysis of [45, Procedure 3.1] to arbitrary $1<t\leq d$ lies in that the underlying HOSVD is essentially designed for the Tucker format, which yields cross-terms that cause trouble in the analysis of the current CP format.

To fill this gap, in this work, by further exploring the multilinearity and orthogonality of the problem, we devise a new approximation algorithm for orthogonal low-rank tensor CP approximation, which is a modification of [45, Procedure 3.1]. In fact, we make the new algorithm somewhat more general, allowing either deterministic or randomized procedures to solve a key step for each latent orthonormal factor involved in the algorithm. Following a certain principle, three procedures for this step are considered, two deterministic and one randomized, leading to three versions of the approximation algorithm. The approximation lower bound is derived, in either a deterministic or an expected sense, no matter how many latent orthonormal factors there are. The most expensive operation of the proposed algorithm is an SVD of the mode-$d$ unfolding matrix, making the algorithm more efficient than [45, Procedure 3.1]. Numerical studies will show the efficiency of the introduced algorithm and its usefulness in recovery and clustering; it compares favorably with [45, Procedure 3.1].

Note that approximation algorithms have been widely proposed for tensor approximations. For example, HOSVD, sequentially truncated HOSVD [42], and hierarchical HOSVD [10] are approximation algorithms for the Tucker format. On the other hand, several randomized approximation algorithms have been proposed in recent years for the Tucker format or the t-SVD [27, 2, 47, 1]. The solution quality of the aforementioned algorithms was measured via error bounds. In the context of tensor best rank-1 approximation problems, deterministic or randomized approximation algorithms were developed [13, 12, 38, 7, 17], where the solution quality was measured by approximation bounds; our results extend these to rank-$R$ approximations.

The rest of this work is organized as follows. Preliminaries are given in Sect. 2. The approximation algorithm is introduced in Sect. 3, while the approximation bound and its analysis are derived in Sect. 4. Sect. 5 provides numerical results and Sect. 6 draws some conclusions.

2 Preliminaries

Throughout this work, vectors are written as boldface lowercase letters ($\mathbf{x},\mathbf{y},\ldots$), matrices as italic capitals ($A,B,\ldots$), and tensors as calligraphic capitals ($\mathcal{A},\mathcal{B},\ldots$). $\mathbb{R}^{n_{1}\times\cdots\times n_{d}}$ denotes the space of $n_{1}\times\cdots\times n_{d}$ real tensors. For two tensors $\mathcal{A},\mathcal{B}$ of the same size, their inner product $\langle\mathcal{A},\mathcal{B}\rangle$ is given by the sum of the entry-wise products. The Frobenius (or Hilbert–Schmidt) norm of $\mathcal{A}$ is defined by $\|\mathcal{A}\|_{F}=\langle\mathcal{A},\mathcal{A}\rangle^{1/2}$; $\otimes$ denotes the outer product; in particular, for $\mathbf{u}_{j}\in\mathbb{R}^{n_{j}}$, $j=1,\ldots,d$, $\mathbf{u}_{1}\otimes\cdots\otimes\mathbf{u}_{d}$ denotes a rank-1 tensor in $\mathbb{R}^{n_{1}\times\cdots\times n_{d}}$, written as $\bigotimes^{d}_{j=1}\mathbf{u}_{j}$ for short. For $R$ vectors of size $n_{j}$, namely $\mathbf{u}_{j,1},\ldots,\mathbf{u}_{j,R}$, we usually collect them into a matrix $U_{j}=[\mathbf{u}_{j,1},\ldots,\mathbf{u}_{j,R}]\in\mathbb{R}^{n_{j}\times R}$, $j=1,\ldots,d$.

The mode-$j$ unfolding of $\mathcal{A}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d}}$, denoted $A_{(j)}$, is a matrix in $\mathbb{R}^{n_{j}\times\prod^{d}_{k\neq j}n_{k}}$. The product of a tensor $\mathcal{A}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d}}$ and a vector $\mathbf{u}_{j}\in\mathbb{R}^{n_{j}}$ with respect to the $j$-th mode, the tensor–vector product $\mathcal{A}\times_{j}\mathbf{u}_{j}^{\top}$, is a tensor in $\mathbb{R}^{n_{1}\times\cdots\times n_{j-1}\times n_{j+1}\times\cdots\times n_{d}}$.

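To fix ideas, the following small NumPy sketch (ours, for illustration; the helper names unfold and tvprod are not from the paper) forms a mode-$j$ unfolding and a mode-$j$ tensor–vector product, and checks that unfolding preserves the Frobenius norm.

```python
import numpy as np

def unfold(A, j):
    """Mode-j unfolding (0-based j): an n_j x (product of the other dimensions) matrix."""
    return np.moveaxis(A, j, 0).reshape(A.shape[j], -1)

def tvprod(A, u, j):
    """Tensor-vector product A x_j u^T: contracts mode j of A with the vector u."""
    return np.tensordot(A, u, axes=([j], [0]))

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))      # a third-order tensor
u = rng.standard_normal(4)

M = unfold(A, 1)                        # 4 x 15 matrix
B = tvprod(A, u, 1)                     # 3 x 5 tensor: the contracted mode is removed
print(M.shape, B.shape)                 # (4, 15) (3, 5)

# Unfolding only rearranges entries, so the Frobenius norm is unchanged.
assert np.isclose(np.linalg.norm(M), np.linalg.norm(A))
```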
Given a $d$-th order tensor $\mathcal{A}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d}}$ and a positive integer $R$, the approximate CP decomposition under consideration can be written as [11]:

\begin{split}&\min~\left\|\mathcal{A}-\sum^{R}_{i=1}\sigma_{i}\bigotimes^{d}_{j=1}\mathbf{u}_{j,i}\right\|_{F}^{2}\\ &~{\rm s.t.}~~\mathbf{u}_{j,i}^{\top}\mathbf{u}_{j,i}=1,\ j=1,\ldots,d-t,\ 1\leq i\leq R,\\ &\phantom{~{\rm s.t.}~~}U_{j}^{\top}U_{j}=I,\ j=d-t+1,\ldots,d,\ \sigma_{i}\in\mathbb{R},\end{split}   (2.1)

where $1\leq t\leq d$. This model means that the last $t$ latent factor matrices of $\mathcal{A}$ are assumed to be orthonormal, while the first $d-t$ ones are not; however, due to the presence of the scalars $\sigma_{i}$, each $\mathbf{u}_{j,i}$ with $j=1,\ldots,d-t$ can be normalized. These lead to the two types of constraints in (2.1). Using orthogonality, (2.1) can be equivalently formulated as a maximization problem [11]:

\begin{split}&\max~G(U_{1},\ldots,U_{d}):=\sum^{R}_{i=1}\left\langle\mathcal{A},\bigotimes^{d}_{j=1}\mathbf{u}_{j,i}\right\rangle^{2}\\ &~{\rm s.t.}~~\mathbf{u}_{j,i}^{\top}\mathbf{u}_{j,i}=1,\ j=1,\ldots,d-t,\ 1\leq i\leq R,\\ &\phantom{~{\rm s.t.}~~}U_{j}^{\top}U_{j}=I,\ j=d-t+1,\ldots,d,\end{split}   (2.2)

where the variables $\sigma_{i}$ have been eliminated. The approximation algorithms and the corresponding analysis are mainly focused on this maximization model.

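As an illustration of the objective in (2.2), a short NumPy sketch (ours; the helper name objective_G is hypothetical) that evaluates $G(U_{1},\ldots,U_{d})$ for given factor matrices might look as follows.

```python
import numpy as np

def objective_G(A, factors):
    """G(U_1, ..., U_d) = sum_i <A, u_{1,i} o ... o u_{d,i}>^2 for factor matrices of shape (n_j, R)."""
    d, R = A.ndim, factors[0].shape[1]
    G = 0.0
    for i in range(R):
        s = A
        for j in range(d):
            # contract the leading remaining mode with u_{j,i}; after d contractions s is a scalar
            s = np.tensordot(s, factors[j][:, i], axes=([0], [0]))
        G += float(s) ** 2
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4, 5))
Us = [np.linalg.qr(rng.standard_normal((n, 2)))[0] for n in A.shape]  # feasible orthonormal factors, R = 2
print(objective_G(A, Us))
```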
Some necessary definitions and properties are introduced in the following.

Let $V\in\mathbb{R}^{m\times n}$, $m\geq n$. The polar decomposition of $V$ decomposes it into two matrices $U\in\mathbb{R}^{m\times n}$ and $H\in\mathbb{R}^{n\times n}$ such that $V=UH$, where $U$ is columnwise orthonormal and $H$ is a symmetric positive semidefinite matrix. Its relation with the SVD is given in the following.

Theorem 2.1 (cf. [14]).

Let $V\in\mathbb{R}^{m\times n}$, $m\geq n$. Then there exist $U\in\mathbb{R}^{m\times n}$ and a unique symmetric positive semidefinite matrix $H\in\mathbb{R}^{n\times n}$ such that

V=UH,\quad U^{\top}U=I\in\mathbb{R}^{n\times n}.

$(U,H)$ is the polar decomposition of $V$. If ${\rm rank}(V)=n$, then $H$ is symmetric positive definite and $U$ is uniquely determined.

Furthermore, let $H=Q\Lambda Q^{\top}$, with $Q,\Lambda\in\mathbb{R}^{n\times n}$, be the eigenvalue decomposition of $H$, namely, $Q^{\top}Q=QQ^{\top}=I$ and $\Lambda={\rm diag}(\lambda_{1},\ldots,\lambda_{n})$ is a diagonal matrix with $\lambda_{1}\geq\cdots\geq\lambda_{n}\geq 0$. Then, setting $P:=UQ$, we have $U=PQ^{\top}$, and $V=P\Lambda Q^{\top}$ is a reduced SVD of $V$.

Write $V=[\mathbf{v}_{1},\ldots,\mathbf{v}_{n}]$ and $U=[\mathbf{u}_{1},\ldots,\mathbf{u}_{n}]$, where $\mathbf{v}_{i},\mathbf{u}_{i}\in\mathbb{R}^{m}$, $i=1,\ldots,n$. The following lemmas are useful in the approximation bound analysis.

Lemma 2.1.

Let $U\in\mathbb{R}^{m\times n}$ and $H\in\mathbb{R}^{n\times n}$ be the two matrices generated by the polar decomposition of $V\in\mathbb{R}^{m\times n}$ such that $V=UH$, where $U$ is columnwise orthonormal and $H$ is symmetric positive semidefinite. Let $\mathbf{v}_{i}$ and $\mathbf{u}_{i}$ be defined as above. Then it holds that

\left\langle\mathbf{u}_{i},\mathbf{v}_{i}\right\rangle\geq 0,\quad i=1,\ldots,n.
Proof.

Since $V=UH$ and $U^{\top}U=I$, we have $U^{\top}V=H$, which means that $\left\langle\mathbf{u}_{i},\mathbf{v}_{i}\right\rangle$ is exactly the $i$-th diagonal entry of $H$. As $H$ is positive semidefinite, its diagonal entries are nonnegative. The required result follows. ∎

Denote by $\left\|\cdot\right\|_{*}$ the nuclear norm of a matrix, i.e., the sum of its singular values. From Theorem 2.1 we easily see the following:

Lemma 2.2.

Let $U$ and $V$ be defined as in Theorem 2.1. Then $\left\langle U,V\right\rangle=\left\|V\right\|_{*}$.

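As a concrete illustration of Theorem 2.1 and Lemmas 2.1–2.2, here is a minimal NumPy sketch (ours; polar_decomp is a hypothetical helper, not the paper's code) that computes the polar decomposition via a reduced SVD and checks both lemmas numerically.

```python
import numpy as np

def polar_decomp(V):
    """Polar decomposition V = U H via the reduced SVD V = P diag(s) Q^T (Theorem 2.1)."""
    P, s, Qt = np.linalg.svd(V, full_matrices=False)
    U = P @ Qt                      # columnwise orthonormal factor
    H = Qt.T @ np.diag(s) @ Qt      # symmetric positive semidefinite factor
    return U, H

rng = np.random.default_rng(2)
V = rng.standard_normal((6, 3))
U, H = polar_decomp(V)

assert np.allclose(U @ H, V) and np.allclose(U.T @ U, np.eye(3))
# Lemma 2.1: <u_i, v_i> = H_ii >= 0;   Lemma 2.2: <U, V> equals the nuclear norm of V.
assert np.all(np.einsum('ij,ij->j', U, V) >= -1e-12)
assert np.isclose(np.sum(U * V), np.linalg.svd(V, compute_uv=False).sum())
```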
3 Approximation Algorithm

The approximation algorithm proposed in this section inherits certain ideas of [45, Procedure 3.1], and so we briefly recall its strategy here. [45, Procedure 3.1] obtains an approximate solution to (2.2) as follows: one first applies the truncated HOSVD [9] to get $U_{d},\ldots,U_{d-t+1}$; specifically, for $j=d,\ldots,d-t+1$, one unfolds $\mathcal{A}$ into the matrix $A_{(j)}\in\mathbb{R}^{n_{j}\times\prod^{d}_{l\neq j}n_{l}}$ and then performs the truncated SVD to get its leading $R$ left singular vectors, denoted $\mathbf{u}_{j,1},\ldots,\mathbf{u}_{j,R}$, which form the $j$-th factor $U_{j}=[\mathbf{u}_{j,1},\ldots,\mathbf{u}_{j,R}]$. Once $U_{d},\ldots,U_{d-t+1}$ are obtained, one then approximately solves $R$ tensor best rank-1 approximation problems for the data tensors $\mathcal{A}\bigotimes^{d}_{j=d-t+1}\mathbf{u}_{j,i}:=\mathcal{A}\times_{d-t+1}\mathbf{u}_{d-t+1,i}^{\top}\times\cdots\times_{d}\mathbf{u}_{d,i}^{\top}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d-t}}$, $i=1,\ldots,R$, namely,

\begin{split}&\max~\left\langle\mathcal{A}\bigotimes^{d}_{j=d-t+1}\mathbf{u}_{j,i},\ \bigotimes^{d-t}_{j=1}\mathbf{u}_{j,i}\right\rangle\\ &~~{\rm s.t.}~~\mathbf{u}_{j,i}^{\top}\mathbf{u}_{j,i}=1,\ j=1,\ldots,d-t,\end{split}   (3.3)

for $i=1,\ldots,R$.

Such a strategy of obtaining an approximate solution to (2.2) works well empirically, while the approximation bound was established only when $t=1$ [45, Proposition 3.2]. We find that it is difficult to derive the approximation bound for general tensors when $t\geq 2$; this is because we unavoidably encounter the tensor $\mathcal{A}\times_{d-t+1}U_{d-t+1}^{\top}\cdots\times_{d}U_{d}^{\top}$ during the analysis. Tensors of this kind might be easier to deal with in the Tucker model [9]; however, they consist of cross-terms $\mathcal{A}\times_{d-t+1}\mathbf{u}_{d-t+1,i_{d-t+1}}^{\top}\cdots\times_{d}\mathbf{u}_{d,i_{d}}^{\top}$ with $i_{d-t+1}\neq\cdots\neq i_{d}$ that are not welcome in problem (2.2).

[Figure 1 (workflow): an SVD of the mode-3 unfolding gives $U_{3}=[\mathbf{u}_{3,1},\ldots,\mathbf{u}_{3,R}]$; the contractions $\times_{3}\mathbf{u}_{3,i}^{\top}$ give $M_{2,1},\ldots,M_{2,R}$, from which the vectors $\mathbf{v}_{2,1},\ldots,\mathbf{v}_{2,R}$ are extracted to form $V_{2}$, and a polar decomposition (PD) gives $U_{2}$; the contractions $\times_{2}\mathbf{u}_{2,i}^{\top}\times_{3}\mathbf{u}_{3,i}^{\top}$ give $\mathbf{v}_{1,1},\ldots,\mathbf{v}_{1,R}$, which form $V_{1}$, and a PD gives $U_{1}$.]
Figure 1: Workflow of the proposed Algorithm 1 for approximately solving (2.2) when $d=3$ and $t=3$. PD is short for polar decomposition.

In view of this difficulty, we modify [45, Procedure 3.1] so that it is more suitable for problem (2.2). The new algorithm is still designed under the guidance of problem (3.3): we first compute the orthonormal factors $U_{d},\ldots,U_{d-t+1}$, and then find $U_{1},\ldots,U_{d-t}$ by solving $R$ tensor rank-1 approximation problems. We differ from [45] in the computation of the orthonormal factors $U_{d-1},\ldots,U_{d-t+1}$. Taking (2.2) with $d=t=3$ as an example, we state our idea in what follows, with the workflow illustrated in Fig. 1.

The computation of $U_{3}$ is the same as in [45]: we let $U_{3}:=[\mathbf{u}_{3,1},\ldots,\mathbf{u}_{3,R}]$, where $\mathbf{u}_{3,1},\ldots,\mathbf{u}_{3,R}$ are the leading $R$ left singular vectors of the unfolding matrix $A_{(3)}\in\mathbb{R}^{n_{3}\times n_{1}n_{2}}$.

Algorithm 1 Approximation algorithm for (2.2)
0:  Input: $\mathcal{A}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d}}$, $d\geq 3$, $R\geq 1$, $t\geq 1$
1:  Compute $\mathbf{u}_{d,1},\ldots,\mathbf{u}_{d,R}$ as the leading $R$ left unit singular vectors of the unfolding matrix $A_{(d)}\in\mathbb{R}^{n_{d}\times\prod^{d-1}_{k=1}n_{k}}$. Denote $U_{d}:=[\mathbf{u}_{d,1},\ldots,\mathbf{u}_{d,R}]$
2:  for $j=d-1:-1:d-t+1$ do
3:     % splitting step
4:     for $i=1:R$ do
5:        Compute the $j$-th order tensor
           $\mathcal{B}_{j,i}:=\mathcal{A}\times_{j+1}\mathbf{u}_{j+1,i}^{\top}\times_{j+2}\cdots\times_{d}\mathbf{u}_{d,i}^{\top}\in\mathbb{R}^{n_{1}\times\cdots\times n_{j}}$
6:        if $j>1$ then
7:           Unfold $\mathcal{B}_{j,i}$ into the matrix $M_{j,i}$ as
              $M_{j,i}:=\texttt{reshape}\big(\mathcal{B}_{j,i},\,n_{j},\,\prod^{j-1}_{k=1}n_{k}\big)\in\mathbb{R}^{n_{j}\times\prod^{j-1}_{k=1}n_{k}}$
8:           Extract a vector $\mathbf{v}_{j,i}$ from $M_{j,i}$ as
              $\mathbf{v}_{j,i}:=\texttt{get\_v\_from\_M}(M_{j,i})\in\mathbb{R}^{n_{j}}$   % to be introduced later
9:        else
10:          $\mathbf{v}_{j,i}:=\mathcal{B}_{j,i}\in\mathbb{R}^{n_{j}}$
11:       end if
12:    end for
13:    % gathering step
14:    Denote $V_{j}:=[\mathbf{v}_{j,1},\ldots,\mathbf{v}_{j,R}]\in\mathbb{R}^{n_{j}\times R}$
15:    Compute the polar decomposition of $V_{j}$ to obtain the orthonormal $U_{j}$:
        $U_{j}=[\mathbf{u}_{j,1},\ldots,\mathbf{u}_{j,R}]:=\texttt{polar\_decomp}(V_{j})$
16: end for                           % end of the computation of orthonormal factors
17: if $t<d$ then
18:    for $i=1:R$ do
19:       Compute a rank-1 approximation solution to the tensor $\mathcal{A}\times_{d-t+1}\mathbf{u}_{d-t+1,i}^{\top}\times\cdots\times_{d}\mathbf{u}_{d,i}^{\top}=\mathcal{B}_{d-t+1,i}\times_{d-t+1}\mathbf{u}_{d-t+1,i}^{\top}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d-t}}$, with each $\mathbf{u}_{j,i}$ normalized:
           $(\mathbf{u}_{1,i},\ldots,\mathbf{u}_{d-t,i})=\texttt{rank1approx}(\mathcal{A}\times_{d-t+1}\mathbf{u}_{d-t+1,i}^{\top}\times\cdots\times_{d}\mathbf{u}_{d,i}^{\top})$
20:    end for
21:    Collect $U_{j}:=[\mathbf{u}_{j,1},\ldots,\mathbf{u}_{j,R}]$, $j=1,\ldots,d-t$
22: end if                          % end of the computation of non-orthonormal factors
23: return  $U_{1},\ldots,U_{d}$

Finding $U_{2}$ is a bit more involved than in [45]; it can be divided into a splitting step and a gathering step. In the splitting step, with $\mathbf{u}_{3,1},\ldots,\mathbf{u}_{3,R}$ at hand, we first compute $R$ matrices $M_{2,1},\ldots,M_{2,R}$, with

M_{2,i}:=\mathcal{A}\times_{3}\mathbf{u}_{3,i}^{\top}\in\mathbb{R}^{n_{2}\times n_{1}},\quad i=1,\ldots,R.

Then, using a procedure get_v_from_M that will be specified later, we compute $R$ vectors $\mathbf{v}_{2,i}$ of size $n_{2}$, one from each $M_{2,i}$. The principle of get_v_from_M is to extract as much of the information in $M_{2,i}$ as possible while satisfying certain theoretical bounds. We will consider three versions of this procedure, which will be detailed later. In the gathering step, we first denote $V_{2}:=[\mathbf{v}_{2,1},\ldots,\mathbf{v}_{2,R}]\in\mathbb{R}^{n_{2}\times R}$. As $V_{2}$ may not be orthonormal, it is not feasible for (2.2). We thus apply the polar decomposition to $V_{2}$ to obtain the orthonormal matrix $U_{2}=[\mathbf{u}_{2,1},\ldots,\mathbf{u}_{2,R}]$.

Computing $U_{1}$ is similar to computing $U_{2}$. In the splitting step, with $U_{3}$ and $U_{2}$ at hand, we first compute $M_{1,i}:=\mathcal{A}\times_{2}\mathbf{u}_{2,i}^{\top}\times_{3}\mathbf{u}_{3,i}^{\top}\in\mathbb{R}^{n_{1}}$, $i=1,\ldots,R$. Since the $M_{1,i}$'s are already vectors, we directly let $\mathbf{v}_{1,i}=M_{1,i}$. In the gathering step, we let $V_{1}:=[\mathbf{v}_{1,1},\ldots,\mathbf{v}_{1,R}]$ and perform the polar decomposition to get $U_{1}$.

For general $d\geq 3$, we can similarly apply the above splitting and gathering steps to obtain $U_{d-1},U_{d-2},\ldots$ sequentially. In fact, the design of the splitting and gathering steps follows the principle that

\cdots\geq\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}\geq\alpha\sum^{R}_{i=1}\left\langle\mathbf{u}_{j+1,i},\mathbf{v}_{j+1,i}\right\rangle^{2}\geq\cdots,   (3.4)

which will be studied in detail in the next section; here the factor $\alpha\in(0,1)$. If $t<d$, i.e., there exist non-orthonormal factors in (2.2), then with $U_{d-t+1},\ldots,U_{d}$ at hand, we compute $U_{1},\ldots,U_{d-t}$ by approximately solving the $R$ tensor rank-1 approximation problems (3.3). The whole algorithm is summarized in Algorithm 1.

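To make the splitting and gathering steps concrete, the following NumPy sketch (ours) implements Algorithm 1 for the fully orthonormal case $t=d$ with Procedure A in the splitting step; the function names and the unfolding ordering are our own choices, so this is an illustration rather than a reference implementation.

```python
import numpy as np

def polar_orthonormal(V):
    """Orthonormal polar factor of V (n x R, n >= R): U = P Q^T from the reduced SVD of V."""
    P, _, Qt = np.linalg.svd(V, full_matrices=False)
    return P @ Qt

def approx_orthogonal_cp(A, R):
    """Sketch of Algorithm 1 with t = d (all factors orthonormal), using Procedure A for get_v_from_M."""
    d, n = A.ndim, A.shape
    U = [None] * d
    # Step 1: U_d from the leading R left singular vectors of the mode-d unfolding A_(d).
    A_d = np.moveaxis(A, d - 1, 0).reshape(n[d - 1], -1)
    U[d - 1] = np.linalg.svd(A_d, full_matrices=False)[0][:, :R]
    # Splitting + gathering steps for j = d-1, ..., 1 (0-based axes d-2, ..., 0).
    B = [A for _ in range(R)]        # B[i] becomes B_{j,i} = A x_{j+1} u_{j+1,i}^T ... x_d u_{d,i}^T
    for j in range(d - 2, -1, -1):
        V = np.zeros((n[j], R))
        for i in range(R):
            # recursive update of Remark 3.1: contract the current last mode with u_{j+1,i}
            B[i] = np.tensordot(B[i], U[j + 1][:, i], axes=([j + 1], [0]))
            if j > 0:
                M = np.moveaxis(B[i], j, 0).reshape(n[j], -1)    # M_{j,i}: unfold B_{j,i}
                y = np.linalg.svd(M, full_matrices=False)[2][0]  # Procedure A: leading right singular vector
                V[:, i] = M @ y
            else:
                V[:, i] = B[i]                                   # B_{1,i} is already a vector
        U[j] = polar_orthonormal(V)                              # gathering step: polar decomposition
    return U

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 7, 8))
Us = approx_orthogonal_cp(A, R=3)
print([u.shape for u in Us], np.allclose(Us[0].T @ Us[0], np.eye(3)))
```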
Some remarks on Algorithm 1 are given as follows.

Remark 3.1.
  1. When $t=1$, Algorithm 1 and [45, Procedure 3.1] are exactly the same if they take the same rank-1 approximation procedure.

  2. The computation of $\mathcal{B}_{j,i}$ can be made more efficient: since $\mathcal{B}_{j,i}=\mathcal{A}\times_{j+1}\mathbf{u}_{j+1,i}^{\top}\times_{j+2}\cdots\times_{d}\mathbf{u}_{d,i}^{\top}$ and $\mathcal{B}_{j-1,i}=\mathcal{A}\times_{j}\mathbf{u}_{j,i}^{\top}\times_{j+1}\cdots\times_{d}\mathbf{u}_{d,i}^{\top}$, we have the relation

     \mathcal{B}_{j-1,i}=\mathcal{B}_{j,i}\times_{j}\mathbf{u}_{j,i}^{\top}.

     This recursive computation of $\mathcal{B}_{j,i}$ reduces its computational complexity.

  3. The splitting step can be computed in parallel.

  4. If we take $R=1$ and $t=d$ in problem (2.2), it becomes a best rank-1 approximation problem. Therefore, Algorithm 1 with $R=1$ and $t=d$ is capable of solving such a problem.

On the procedure get_v_from_M

How to obtain $\mathbf{v}_{j,i}$ from $M_{j,i}$ is important both in theory and in practice. To retain as much information of $M_{j,i}$ in $\mathbf{v}_{j,i}$ as possible, we expect that

\left\|\mathbf{v}_{j,i}\right\|^{2}\geq\beta\left\|M_{j,i}\right\|_{F}^{2}   (3.5)

for some factor $\beta\in(0,1)$. We provide three simple procedures to achieve this, two deterministic and one randomized. To simplify the notation we omit the subscripts $j$ and $i$. The matrix $M$ in the procedures is of size $n\times m$, $\mathbf{v}\in\mathbb{R}^{n}$, and $\mathbf{y}\in\mathbb{R}^{m}$.

The first idea is straightforward: we let $\mathbf{v}$ be the leading left singular vector of $M$ times the leading singular value of $M$:

Procedure $\mathbf{v}=\texttt{get\_v\_from\_M}(M)$   (A)
1. Compute the leading right unit singular vector of $M$, denoted $\mathbf{y}$;
2. Return $\mathbf{v}=M\mathbf{y}$.

The second one first picks the row of $M$ with the largest magnitude and then normalizes it to yield $\mathbf{y}$; finally it lets $\mathbf{v}=M\mathbf{y}$:

Procedure $\mathbf{v}=\texttt{get\_v\_from\_M}(M)$   (B)
1. Denote $M^{k}$ as the $k$-th row of $M$. Let $M^{\bar{k}}$ be the row with the largest magnitude, i.e., $\|M^{\bar{k}}\|=\max_{1\leq k\leq n}\|M^{k}\|$; if multiple $\bar{k}$ satisfy the above, choose the first one;
2. Denote $\mathbf{y}:=M^{\bar{k}}/\|M^{\bar{k}}\|$;
3. Return $\mathbf{v}=M\mathbf{y}$.

The third one is similar to the second, but to be more efficient, we randomly and uniformly pick a row of $M$, normalize it to yield $\mathbf{y}$, and then let $\mathbf{v}=M\mathbf{y}$:

Procedure $\mathbf{v}=\texttt{get\_v\_from\_M}(M)$   (C)
1. Randomly and uniformly choose $\bar{k}\in\{1,\ldots,n\}$;
2. Denote $\mathbf{y}:=M^{\bar{k}}/\|M^{\bar{k}}\|$;
3. Return $\mathbf{v}=M\mathbf{y}$.

We postpone the analysis of (3.5) for each procedure to the next section. We also remark that the vector $\mathbf{y}$ in each procedure is important for the approximation bound analysis.

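A compact NumPy sketch of the three procedures (ours, for illustration; the bound checked at the end is the one established later in Lemma 4.3):

```python
import numpy as np

def get_v_from_M(M, procedure="A", rng=None):
    """Sketch of Procedures A, B, and C: extract v = M y from an n x m matrix M."""
    if procedure == "A":                      # leading right unit singular vector
        y = np.linalg.svd(M, full_matrices=False)[2][0]
    elif procedure == "B":                    # row of largest magnitude, normalized
        k = int(np.argmax(np.linalg.norm(M, axis=1)))
        y = M[k] / np.linalg.norm(M[k])
    else:                                     # "C": a uniformly random row, normalized
        rng = rng or np.random.default_rng()
        k = int(rng.integers(M.shape[0]))
        y = M[k] / np.linalg.norm(M[k])
    return M @ y

# Procedures A and B deterministically satisfy ||v||^2 >= ||M||_F^2 / n (Procedure C only in expectation).
M = np.random.default_rng(4).standard_normal((5, 12))
for p in "AB":
    v = get_v_from_M(M, p)
    assert v @ v >= np.linalg.norm(M) ** 2 / M.shape[0]
```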
On the procedure rank1approx

To be more general, the rank1approx procedure in Algorithm 1 can be any efficient approximation algorithm for tensor best rank-1 approximation problems, such as [13, Algorithm 1 with DR 2], [48, Sect. 5], [45, Procedure 3.3], or even Algorithm 1 itself (as mentioned in Remark 3.1). For the purpose of approximation bound analysis, we require rank1approx to have an approximation bound characterization, i.e., for any $m$-th order data tensor $\mathcal{C}\in\mathbb{R}^{n_{1}\times\cdots\times n_{m}}$ (assuming that $n_{1}\leq\cdots\leq n_{m}$ and $m\geq 1$), the normalized solution $(\mathbf{x}_{1},\ldots,\mathbf{x}_{m})$ returned by rank1approx admits an approximation bound of the form

\left\langle\mathcal{C},\bigotimes^{m}_{j=1}\mathbf{x}_{j}\right\rangle\geq\frac{\left\|\mathcal{C}\right\|_{F}}{\zeta(m)},\quad\forall\,\mathcal{C}\in\mathbb{R}^{n_{1}\times\cdots\times n_{m}},~m\geq 1,   (3.6)

where $1/\zeta(m)\leq 1$ represents the approximation factor. For [13, Algorithm 1 with DR 2], it can be checked from [13, Sect. 3.1] that when $m\geq 3$,

\zeta(m)=\sqrt{\prod^{m-2}_{j=1}n_{j}}\cdot\sqrt{n_{1}};

for [45, Procedure 3.3], it follows from [45, Proposition 3.1] that

\zeta(m)=\left\{\begin{array}{ll}\sqrt{\prod^{m-1}_{j=1}n_{j}\cdot\prod^{m/2-2}_{j=1}n_{2j+1}\cdot n_{2}^{-1}}\sqrt{n_{m-1}n_{m}},&m~{\rm even~and}~m\geq 4,\\ \sqrt{\prod^{m-1}_{j=2}n_{j}\cdot\prod^{(m+1)/2-2}_{j=1}n_{2j}}\sqrt{n_{m-1}n_{m}},&m~{\rm odd~and}~m\geq 3;\end{array}\right.

when $m=1,2$, namely, when $\mathcal{C}$ is a vector or a matrix, we also have $\zeta(2)=\sqrt{n_{1}}$ and $\zeta(1)=1$ for both algorithms.

Computational complexity of Algorithm 1

We analyze the computational complexity of Algorithm 1. To simplify the presentation, we consider $n_{1}=\cdots=n_{d}=n$. Computing $U_{d}$ is a truncated SVD, whose complexity is $O(n^{2}\cdot n^{d-1})$. In the computation of $U_{d-1}$, computing $\mathcal{B}_{d-1,i}$ and $M_{d-1,i}$ requires $O(n^{d})$ flops; since $M_{d-1,i}\in\mathbb{R}^{n\times n^{d-2}}$, computing $\mathbf{v}_{d-1,i}$ requires $O(n^{2}\cdot n^{d-2})$ flops if Procedure A is used, while it takes $O(n\cdot n^{d-2})$ flops for Procedures B and C. In any case, this is dominated by $O(n^{d})$. Thus the complexity of the splitting step is $O(Rn^{d})$. The main effort of the gathering step is the polar decomposition of $V_{d-1}\in\mathbb{R}^{n\times R}$, with complexity $O(R^{2}n)$. Hence computing $U_{d-1}$ requires $O(Rn^{d})+O(R^{2}n)$ flops.

The complexity of computing $U_{d-2}$ is similar: by Remark 3.1, computing $\mathcal{B}_{d-2,i}$ and $M_{d-2,i}$ needs $O(n^{d-1})$ flops. The total complexity of computing $U_{d-2}$ is $O(Rn^{d-1})+O(R^{2}n)$.

For $U_{d-3},\ldots,U_{d-t+1}$ the analysis is analogous. Denote by $O(\texttt{rank1approx}(\cdot))$ the complexity of the rank-1 approximation procedure, depending on which procedure is used. Then the updating of $U_{1},\ldots,U_{d-t}$ has complexity

O(Rn^{d-t+1})+O(R\cdot\texttt{rank1approx}(\mathcal{B}_{d-t+1,i}\times_{d-t+1}\mathbf{u}_{d-t+1,i}^{\top})),

where the first term comes from computing $\mathcal{B}_{d-t+1,i}$. Summarizing the above discussions, we have:

Proposition 3.1 (Computational complexity of Algorithm 1).

Assume that $n_{1}=n_{2}=\cdots=n_{d}=n$. The computational complexity of Algorithm 1 is

O(n^{d+1})+\sum^{t+1}_{j=2}O(Rn^{d-j+2})+O(tR^{2}n)+O(R\cdot\texttt{rank1approx}(\mathcal{B}_{d-t+1,i}\times_{d-t+1}\mathbf{u}_{d-t+1,i}^{\top}));

in particular, if $t=d$, then it is

O(n^{d+1})+\sum^{d+1}_{j=2}O(Rn^{d-j+2})+O(dR^{2}n),

which is dominated by $O(n^{d+1})$.

Note that [45, Procedure 3.1] requires $t$ truncated SVDs of $n\times n^{d-1}$ unfolding matrices, while Algorithm 1 performs only one SVD of this size plus additional operations of smaller size. This makes Algorithm 1 more efficient, which will be confirmed by the numerical observations.

4 Approximation Bound Analysis

This section is organized as follows: approximation bound results for general tensors are presented in Sect. 4.1 and discussed in Sect. 4.2, and then the results are improved for nearly orthogonal tensors in Sect. 4.3.

To begin with, we need some preparations. Recall that there is an intermediate variable $\mathbf{y}$ in Procedures A–C. This variable is important in the analysis. To distinguish the $\mathbf{y}$'s with respect to each $i$ and $j$, when obtaining $\mathbf{v}_{j,i}$ we denote the associated $\mathbf{y}$ as $\mathbf{y}_{j,i}$, which is of size $\prod^{j-1}_{k=1}n_{k}$. From the procedures, we see that $\mathbf{v}_{j,i}=M_{j,i}\mathbf{y}_{j,i}$. Since $M_{j,i}=\texttt{reshape}\big(\mathcal{B}_{j,i},n_{j},\prod^{j-1}_{k=1}n_{k}\big)$ and $\mathcal{B}_{j,i}=\mathcal{A}\times_{j+1}\mathbf{u}_{j+1,i}^{\top}\times_{j+2}\cdots\times_{d}\mathbf{u}_{d,i}^{\top}$, we have the following expression for $\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle$:

\begin{split}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle&=\left\langle\mathbf{u}_{j,i},M_{j,i}\mathbf{y}_{j,i}\right\rangle\\ &=\left\langle M^{\top}_{j,i}\mathbf{u}_{j,i},\ \mathbf{y}_{j,i}\right\rangle\\ &=\left\langle\mathcal{A}\times_{j}\mathbf{u}_{j,i}^{\top}\times_{j+1}\cdots\times_{d}\mathbf{u}^{\top}_{d,i},\ \mathcal{Y}_{j,i}\right\rangle,\end{split}

where we denote

\mathcal{Y}_{j,i}:=\texttt{reshape}(\mathbf{y}_{j,i},n_{1},\ldots,n_{j-1})\in\mathbb{R}^{n_{1}\times\cdots\times n_{j-1}}.   (4.8)

Note that $\left\|\mathcal{Y}_{j,i}\right\|_{F}=1$ because, according to Procedures A–C, $\mathbf{y}_{j,i}$ is a unit vector.

The above expression is only well-defined when $j\geq 2$ (see Algorithm 1). To make it consistent when $j=1$ (which happens when $t=d$), we set $\mathcal{Y}_{1,i}=\mathbf{y}_{1,i}=1$; we also denote $M_{1,i}:=\mathcal{B}_{1,i}=\mathbf{v}_{1,i}$ accordingly. Then it is clear that the above expression for $\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle$ still makes sense when $j=1$.

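As a quick sanity check of this identity under one fixed reshape convention (ours, not necessarily the one used in the paper), consider:

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((3, 4, 5))                      # plays the role of B_{j,i} with n_j = 5 (last mode)
u = rng.standard_normal(5)                              # plays the role of u_{j,i}
y = rng.standard_normal(3 * 4); y /= np.linalg.norm(y)  # unit vector y_{j,i}

M = np.moveaxis(B, 2, 0).reshape(5, -1)                 # M_{j,i} = reshape(B_{j,i}, n_j, n_1 n_2)
Y = y.reshape(3, 4)                                     # Y_{j,i} = reshape(y_{j,i}, n_1, n_2)

lhs = u @ (M @ y)                                       # <u_{j,i}, M_{j,i} y_{j,i}>
rhs = np.sum(np.tensordot(B, u, axes=([2], [0])) * Y)   # <B_{j,i} x_j u_{j,i}^T, Y_{j,i}>
assert np.isclose(lhs, rhs)
```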
On the other hand, $V_{j}$ in the algorithm is only defined for $j=d-t+1,\ldots,d-1$. For convenience, we denote

V_{d}=[\mathbf{v}_{d,1},\ldots,\mathbf{v}_{d,R}]\in\mathbb{R}^{n_{d}\times R}\quad{\rm with}\quad\mathbf{v}_{d,i}:=\lambda_{i}(A_{(d)})\,\mathbf{u}_{d,i},~i=1,\ldots,R,   (4.9)

where $\lambda_{i}(A_{(d)})$ denotes the $i$-th largest singular value of $A_{(d)}$.

Another notation to be defined is $\mathcal{B}_{d-t,i}$. Note that in the algorithm, $\mathcal{B}_{j,i}$ is only defined when $j\geq d-t+1$. We thus similarly define $\mathcal{B}_{d-t,i}:=\mathcal{A}\times_{d-t+1}\mathbf{u}_{d-t+1,i}^{\top}\times_{d-t+2}\cdots\times_{d}\mathbf{u}_{d,i}^{\top}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d-t}}$, $i=1,\ldots,R$. When $t=d$, the $\mathcal{B}_{0,i}$'s are scalars.

4.1 Approximation bound for general tensors

Since for $t=1$, Algorithm 1 coincides with [45, Procedure 3.1] when they take the same best rank-1 approximation procedure, similarly to [45, Proposition 3.2] we have, for $t=1$:

G(U_{1},\ldots,U_{d})=\sum^{R}_{i=1}\left\langle\mathcal{A}\times_{d}\mathbf{u}_{d,i}^{\top},\bigotimes^{d-1}_{j=1}\mathbf{u}_{j,i}\right\rangle^{2}\geq\sum^{R}_{i=1}\frac{\left\|\mathcal{A}\times_{d}\mathbf{u}^{\top}_{d,i}\right\|_{F}^{2}}{\zeta(d-1)^{2}}=\sum^{R}_{i=1}\frac{\lambda_{i}(A_{(d)})^{2}}{\zeta(d-1)^{2}},   (4.10)

where the inequality comes from (3.6), and the last equality is due to the fact that the unfolding of $\mathcal{A}\times_{d}\mathbf{u}_{d,i}^{\top}$ into a vector is exactly $\mathbf{v}_{d,i}$ defined in (4.9).

Thus the remaining part is mainly focused on the $t\geq 2$ cases. To make the analysis more readable, we first present an overview and an informal analysis of the approximation bound, under the assumption that (3.4) holds with a factor $\alpha_{j}$, and then we prove the validity of (3.4) and detail the associated factor. The formal approximation bound results are stated at the end of this subsection.

Lemma 4.1 (Informal approximation bound of Algorithm 1).

Let $2\leq t\leq d$ and let $U_{1},\ldots,U_{d}$ be generated by Algorithm 1. If for each $d-t+1\leq j\leq d-1$ there holds

\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}\geq\alpha_{j}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j+1,i},\mathbf{v}_{j+1,i}\right\rangle^{2},   (4.11)

where $\alpha_{j}\in(0,1]$, then we have the following approximation bound:

\begin{split}G(U_{1},\ldots,U_{d})&\geq\frac{1}{\zeta(d-t)^{2}}\prod^{d-1}_{j=d-t+1}\alpha_{j}\cdot\sum^{R}_{i=1}\lambda_{i}(A_{(d)})^{2}\\ &\geq\frac{1}{\zeta(d-t)^{2}}\prod^{d-1}_{j=d-t+1}\alpha_{j}\cdot G_{\max},\end{split}

where $1/\zeta(d-t)$ is the rank-1 approximation ratio for $n_{1}\times\cdots\times n_{d-t}$ tensors, as noted in (3.6), and $G_{\max}$ denotes the optimal value of (2.2).

Proof.

To make the analysis below consistent with the case $t=d$, which does not require performing rank-1 approximations, we denote $\zeta(0)=1$ and $\bigotimes^{0}_{j=1}\mathbf{u}_{j,i}=1$. It then holds that

\begin{split}G(U_{1},\ldots,U_{d})&=\sum^{R}_{i=1}\left\langle\mathcal{A},\bigotimes^{d}_{j=1}\mathbf{u}_{j,i}\right\rangle^{2}\\ &=\sum^{R}_{i=1}\left\langle\mathcal{B}_{d-t,i},\bigotimes^{d-t}_{j=1}\mathbf{u}_{j,i}\right\rangle^{2}\\ &\geq\frac{1}{\zeta(d-t)^{2}}\sum^{R}_{i=1}\left\|\mathcal{B}_{d-t,i}\right\|_{F}^{2}\\ &\geq\frac{1}{\zeta(d-t)^{2}}\sum^{R}_{i=1}\left\langle\mathcal{B}_{d-t,i},\mathcal{Y}_{d-t+1,i}\right\rangle^{2},\end{split}

where the first inequality follows from the setting (3.6), and the second one is due to the facts that $\left\|\mathcal{B}_{d-t,i}\right\|_{F}=\max_{\left\|\mathcal{Y}\right\|_{F}=1}\left\langle\mathcal{B}_{d-t,i},\mathcal{Y}\right\rangle$, that $\left\|\mathcal{Y}_{d-t+1,i}\right\|_{F}=1$, and that $\mathcal{Y}_{d-t+1,i}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d-t}}$ has the same size as $\mathcal{B}_{d-t,i}$; recall that $\mathcal{Y}_{j,i}$ was defined in (4.8). From the definition of $\mathcal{B}_{d-t,i}$ and the expression of $\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle$ derived before (4.8), we have

\begin{split}\left\langle\mathcal{B}_{d-t,i},\mathcal{Y}_{d-t+1,i}\right\rangle&=\left\langle\mathcal{A}\times_{d-t+1}\mathbf{u}_{d-t+1,i}^{\top}\times_{d-t+2}\cdots\times_{d}\mathbf{u}_{d,i}^{\top},\ \mathcal{Y}_{d-t+1,i}\right\rangle\\ &=\left\langle\mathbf{u}_{d-t+1,i},\mathbf{v}_{d-t+1,i}\right\rangle.\end{split}

It thus follows from (4.11) that

\begin{split}G(U_{1},\ldots,U_{d})&\geq\frac{1}{\zeta(d-t)^{2}}\sum^{R}_{i=1}\left\langle\mathbf{u}_{d-t+1,i},\mathbf{v}_{d-t+1,i}\right\rangle^{2}\\ &\geq\cdots\\ &\geq\frac{1}{\zeta(d-t)^{2}}\prod^{d-1}_{j=d-t+1}\alpha_{j}\cdot\sum^{R}_{i=1}\left\langle\mathbf{u}_{d,i},\mathbf{v}_{d,i}\right\rangle^{2}.\end{split}

From (4.9) and the fact that $\left\|\mathbf{u}_{d,i}\right\|=1$, we see that $\sum^{R}_{i=1}\left\langle\mathbf{u}_{d,i},\mathbf{v}_{d,i}\right\rangle^{2}=\sum^{R}_{i=1}\lambda_{i}(A_{(d)})^{2}$, which is clearly an upper bound on $G_{\max}$. Thus the required approximation bound follows. ∎

The above lemma gives an overview of the analysis of the approximation bound. It also shows that, to make the analysis go through, the chain inequality (4.11) plays a central role. Thus the remaining task of this section is to establish (4.11) with an explicit factor $\alpha_{j}$.

4.1.1 Chain inequality (4.11) analysis

To establish the connection between $\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle$ and $\left\langle\mathbf{u}_{j+1,i},\mathbf{v}_{j+1,i}\right\rangle$, we need to evaluate the performance of get_v_from_M; namely, we need to estimate (3.5). As there are three choices of get_v_from_M, we first simply assume that (3.5) holds with a factor $\beta_{j}$, and later analyze it in detail.

Lemma 4.2 (Informal chain inequality).

Let $2\leq t\leq d$ and let $U_{1},\ldots,U_{d}$ be generated by Algorithm 1. If for each $d-t+1\leq j\leq d-1$ there holds

\left\|\mathbf{v}_{j,i}\right\|^{2}\geq\beta_{j}\left\|M_{j,i}\right\|_{F}^{2},\quad i=1,\ldots,R,   (4.12)

where $\beta_{j}\in(0,1]$ (if $j=1$ then (4.12) holds with $\beta_{j}=1$), then for $j=d-t+1,\ldots,d-1$ and $i=1,\ldots,R$,

\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}\geq\frac{\beta_{j}}{R}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j+1,i},\mathbf{v}_{j+1,i}\right\rangle^{2}.
Remark 4.1.

When $j=1$ (this happens when $t=d$), according to Algorithm 1, $\mathbf{v}_{1,i}=\mathcal{B}_{1,i}=M_{1,i}$ for each $i$, and so (4.12) holds with $\beta_{1}=1$.

Proof.

Since the orthonormal $U_{j}$'s, $j=d-t+1,\ldots,d-1$, are given by the polar decomposition of $V_{j}$, from Lemma 2.1 we have $\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle\geq 0$. Using the relation $a_{1}^{2}+\cdots+a_{R}^{2}\geq\frac{1}{R}\left(a_{1}+\cdots+a_{R}\right)^{2}$ for nonnegative $a_{i}$'s, we have

\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}\geq\frac{1}{R}\left(\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle\right)^{2}=\frac{1}{R}\left\langle U_{j},V_{j}\right\rangle^{2}

for each $j$. Lemma 2.2 shows that $\left\langle U_{j},V_{j}\right\rangle=\left\|V_{j}\right\|_{*}$; the norm relationship gives $\left\|V_{j}\right\|_{*}\geq\left\|V_{j}\right\|_{F}$. Thus

\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}\geq\frac{1}{R}\left\|V_{j}\right\|_{F}^{2}.

We distinguish the cases $j\leq d-2$ and $j=d-1$. For the former case,

\begin{split}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}&\geq\frac{1}{R}\left\|V_{j}\right\|_{F}^{2}=\frac{1}{R}\sum^{R}_{i=1}\left\|\mathbf{v}_{j,i}\right\|^{2}\\ &\geq\frac{\beta_{j}}{R}\sum^{R}_{i=1}\left\|M_{j,i}\right\|_{F}^{2}\qquad{\rm(by~(4.12))}\\ &=\frac{\beta_{j}}{R}\sum^{R}_{i=1}\left\|\mathcal{A}\times_{j+1}\mathbf{u}^{\top}_{j+1,i}\times_{j+2}\cdots\times_{d}\mathbf{u}^{\top}_{d,i}\right\|_{F}^{2}\\ &\geq\frac{\beta_{j}}{R}\sum^{R}_{i=1}\left\langle\mathcal{A}\times_{j+1}\mathbf{u}^{\top}_{j+1,i}\times_{j+2}\cdots\times_{d}\mathbf{u}^{\top}_{d,i},\ \mathcal{Y}_{j+1,i}\right\rangle^{2}\\ &=\frac{\beta_{j}}{R}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j+1,i},\mathbf{v}_{j+1,i}\right\rangle^{2},\end{split}

where the equality in the third line follows from the fact that $M_{j,i}$ is the unfolding of $\mathcal{B}_{j,i}=\mathcal{A}\times_{j+1}\mathbf{u}^{\top}_{j+1,i}\times_{j+2}\cdots\times_{d}\mathbf{u}^{\top}_{d,i}$, and the third inequality uses the fact that $\left\|\mathcal{Y}_{j+1,i}\right\|_{F}=1$, where $\mathcal{Y}_{j,i}$ was defined in (4.8).

When $j=d-1$, since $\mathcal{Y}_{d,i}$ is not defined, there is a slight difference after the third line of the above chain of inequalities. In this case, we have

\begin{split}\sum^{R}_{i=1}\left\langle\mathbf{u}_{d-1,i},\mathbf{v}_{d-1,i}\right\rangle^{2}&\geq\frac{\beta_{d-1}}{R}\sum^{R}_{i=1}\left\|\mathcal{A}\times_{d}\mathbf{u}^{\top}_{d,i}\right\|_{F}^{2}\\ &=\frac{\beta_{d-1}}{R}\sum^{R}_{i=1}\lambda_{i}(A_{(d)})^{2}=\frac{\beta_{d-1}}{R}\sum^{R}_{i=1}\left\langle\mathbf{u}_{d,i},\mathbf{v}_{d,i}\right\rangle^{2},\end{split}

where the first inequality comes from an analysis similar to the chain above, the first equality follows from the definition of $U_{d}$, and the last one is due to the definition of $V_{d}$ in (4.9). The required inequality thus follows. ∎

The remaining task is to specify (4.12). Denote by $\mathop{\mathsf{E}}$ the expectation operator.

Lemma 4.3.

For $2\leq j\leq d-1$, if $\mathbf{v}_{j,i}$ is generated from $M_{j,i}$ by Procedure A or B, then it holds that

\left\|\mathbf{v}_{j,i}\right\|^{2}\geq\frac{1}{n_{j}}\left\|M_{j,i}\right\|_{F}^{2};

if $\mathbf{v}_{j,i}$ is generated by Procedure C, then it holds that

\mathop{\mathsf{E}}\left\|\mathbf{v}_{j,i}\right\|^{2}\geq\frac{1}{n_{j}}\left\|M_{j,i}\right\|_{F}^{2}.
Proof.

If $\mathbf{v}_{j,i}$ is generated by Procedure A, we have

\left\|\mathbf{v}_{j,i}\right\|^{2}=\lambda_{\max}^{2}(M_{j,i})\geq\frac{1}{n_{j}}\left\|M_{j,i}\right\|_{F}^{2},

where $\lambda_{\max}(\cdot)$ denotes the largest singular value of a matrix.

If $\mathbf{v}_{j,i}$ is generated by Procedure B, we have

\left\|\mathbf{v}_{j,i}\right\|^{2}=\left\|M_{j,i}\mathbf{y}_{j,i}\right\|^{2}=\sum^{n_{j}}_{k=1}\left\langle M_{j,i}^{k},\frac{M_{j,i}^{\bar{k}}}{\|M_{j,i}^{\bar{k}}\|}\right\rangle^{2}\geq\left\|M_{j,i}^{\bar{k}}\right\|^{2}\geq\frac{1}{n_{j}}\left\|M_{j,i}\right\|_{F}^{2},

where we recall that $M_{j,i}^{k}$ denotes the $k$-th row of $M_{j,i}$, and $M_{j,i}^{\bar{k}}$ is the row with the largest magnitude.

If $\mathbf{v}_{j,i}$ is generated by Procedure C, since $\bar{k}$ is uniformly and randomly chosen from $\{1,\ldots,n_{j}\}$, the random variable $\left\|\mathbf{v}_{j,i}\right\|^{2}$ takes, with equal probability, the values $\sum^{n_{j}}_{k=1}\left\langle M_{j,i}^{k},M_{j,i}^{s}/\|M_{j,i}^{s}\|\right\rangle^{2}$, $s=1,\ldots,n_{j}$, i.e.,

\mathop{\mathsf{Prob}}\left\{\left\|\mathbf{v}_{j,i}\right\|^{2}=\sum^{n_{j}}_{k=1}\left\langle M_{j,i}^{k},\frac{M_{j,i}^{s}}{\|M_{j,i}^{s}\|}\right\rangle^{2}\right\}=\frac{1}{n_{j}},\quad s=1,\ldots,n_{j}.

Therefore,

\mathop{\mathsf{E}}\left\|\mathbf{v}_{j,i}\right\|^{2}=\frac{1}{n_{j}}\sum^{n_{j}}_{s=1}\sum^{n_{j}}_{k=1}\left\langle M_{j,i}^{k},\frac{M_{j,i}^{s}}{\|M_{j,i}^{s}\|}\right\rangle^{2}\geq\frac{1}{n_{j}}\sum^{n_{j}}_{s=1}\left\|M_{j,i}^{s}\right\|^{2}=\frac{1}{n_{j}}\left\|M_{j,i}\right\|_{F}^{2}.

The proof has been completed. ∎
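For Procedure C the bound only holds in expectation; a quick Monte Carlo illustration (ours) of the last estimate in Lemma 4.3:

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((5, 40))          # an n x m matrix with n = 5
n = M.shape[0]

# Procedure C: v = M y with y a uniformly chosen normalized row of M.
samples = []
for _ in range(20000):
    k = rng.integers(n)
    y = M[k] / np.linalg.norm(M[k])
    samples.append(np.sum((M @ y) ** 2))

print(np.mean(samples), np.linalg.norm(M) ** 2 / n)   # E||v||^2 versus the lower bound ||M||_F^2 / n
```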

4.1.2 Putting the pieces together

Combining the previous analysis, we can state our formal results.

Lemma 4.4 (Formal chain inequality).

Let $2\leq t\leq d$ and let $U_{1},\ldots,U_{d}$ be generated by Algorithm 1. Let $\beta_{j}$ be such that

\beta_{j}=n_{j}^{-1}~{\rm if~}2\leq j\leq d-1,\quad{\rm and}\quad\beta_{j}=1~{\rm if~}j=1.

For $j=d-t+1,\ldots,d-1$ and $i=1,\ldots,R$, if the $\mathbf{v}_{j,i}$'s are generated deterministically by Procedure A or B, then

\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}\geq\frac{\beta_{j}}{R}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j+1,i},\mathbf{v}_{j+1,i}\right\rangle^{2};

if the $\mathbf{v}_{j,i}$'s are generated randomly by Procedure C, then

\mathop{\mathsf{E}}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}\geq\frac{\beta_{j}}{R}\mathop{\mathsf{E}}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j+1,i},\mathbf{v}_{j+1,i}\right\rangle^{2}.
Proof.

Lemma 4.3 tells us that the factor $\beta_{j}$ in the assumption of Lemma 4.2 can be taken as $1/n_{j}$ for $2\leq j\leq d-1$ (and as $1$ when $j=1$ by Remark 4.1). Thus the results follow by combining Lemmas 4.2 and 4.3. ∎

Remark 4.2.

The expectation here is taken sequentially with respect to $j=d-1,\ldots,d-t+1$; i.e., it is in fact

\mathop{\mathsf{E}}_{V_{d-1}}\cdots\mathop{\mathsf{E}}_{V_{j}}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}\geq\frac{\beta_{j}}{R}\mathop{\mathsf{E}}_{V_{d-1}}\cdots\mathop{\mathsf{E}}_{V_{j+1}}\sum^{R}_{i=1}\left\langle\mathbf{u}_{j+1,i},\mathbf{v}_{j+1,i}\right\rangle^{2}.

Based on (4.10), Lemmas 4.1 and 4.4, the main results are stated as follows.

Theorem 4.1 (Formal approximation bound).

Let $1\leq t\leq d$ and let $U_{1},\ldots,U_{d}$ be generated by Algorithm 1. Let $\beta_{j}$ be defined as in Lemma 4.4. For $j=d-t+1,\ldots,d-1$ and $i=1,\ldots,R$, if the $\mathbf{v}_{j,i}$'s are generated deterministically by Procedure A or B, then

\begin{split}G(U_{1},\ldots,U_{d})&\geq\frac{1}{\zeta(d-t)^{2}}\prod^{d-1}_{j=d-t+1}\frac{\beta_{j}}{R}\cdot\sum^{R}_{i=1}\lambda_{i}(A_{(d)})^{2}\\ &\geq\frac{1}{\zeta(d-t)^{2}}\prod^{d-1}_{j=d-t+1}\frac{\beta_{j}}{R}\cdot G_{\max};\end{split}

if the $\mathbf{v}_{j,i}$'s are generated randomly by Procedure C, then

\begin{split}\mathop{\mathsf{E}}G(U_{1},\ldots,U_{d})&\geq\frac{1}{\zeta(d-t)^{2}}\prod^{d-1}_{j=d-t+1}\frac{\beta_{j}}{R}\cdot\sum^{R}_{i=1}\lambda_{i}(A_{(d)})^{2}\\ &\geq\frac{1}{\zeta(d-t)^{2}}\prod^{d-1}_{j=d-t+1}\frac{\beta_{j}}{R}\cdot G_{\max}.\end{split}

The ratio $\frac{1}{\zeta(d-t)^{2}}\prod^{d-1}_{j=d-t+1}\frac{\beta_{j}}{R}$ is

\left\{\begin{array}{ll}\frac{1}{R^{d-1}\prod^{d-1}_{j=2}n_{j}},&t=d,\\ \frac{1}{R^{t-1}\zeta(d-t)^{2}\prod^{d-1}_{j=d-t+1}n_{j}},&1\leq t<d.\end{array}\right.

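For instance, specializing the above ratio to third-order tensors ($d=3$) gives, as a worked instance of the formula (using $\zeta(1)=1$ and $\zeta(2)=\sqrt{n_{1}}$ from the preceding subsection),

\frac{1}{\zeta(3-t)^{2}}\prod^{2}_{j=4-t}\frac{\beta_{j}}{R}=\left\{\begin{array}{ll}\frac{1}{R^{2}n_{2}},&t=3,\\ \frac{1}{R\,n_{2}},&t=2,\\ \frac{1}{n_{1}},&t=1.\end{array}\right.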
Reducing to the best rank-1 approximation problem $\max_{\|\mathbf{u}_{j}\|=1,1\leq j\leq d}\left\langle\mathcal{A},\bigotimes^{d}_{j=1}\mathbf{u}_{j}\right\rangle$ (one can remove the square in the objective function), i.e., $R=1$, $t=d$, from Theorem 4.1 we have the following corollary.

Corollary 4.1.

Let $\mathbf{u}_{j}$, $1\leq j\leq d$, be generated by Algorithm 1 with $R=1$ and $t=d$. Then the approximation ratio is $\frac{1}{\sqrt{\prod^{d-1}_{j=2}n_{j}}}$ (either in the deterministic or in the expected sense).

The above ratio is of the same order as those of [13, 48]. It is not the best known to date [38, 12]; however, Algorithm 1 is more practical, and even in the context of rank-1 approximation problems, Algorithm 1 with Procedures B and C is new (we discuss this in the last paragraph of the next subsection).

4.2 Discussions

(a) Plots of the real ratio $\big(\sum^{R}_{i=1}\|\mathbf{v}_{j,i}\|^{2}/\|M_{j,i}\|_{F}^{2}\big)/R$ (in red) and the theoretical ratio $1/n$ (in black) for fourth-order tensors. $n$ varies from 10 to 100. Left: $j=1$; right: $j=2$.
(b) Plots of the real ratio $\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}/\|V_{j}\|_{F}^{2}$ (in red) and the theoretical ratio $1/R$ (in black), where $\mathcal{A}\in\mathbb{R}^{100\times 100\times 100\times 100}$. $R$ varies from 10 to 100. Left: $j=1$; right: $j=2$.
(c) Plots of the real ratio versus the theoretical ratio. Left: $\mathcal{A}\in\mathbb{R}^{n\times n\times n\times n}$ with $n$ varying from 10 to 100; right: $\mathcal{A}\in\mathbb{R}^{100\times 100\times 100\times 100}$ with $R$ varying from 10 to 100. In the figures, red: real ratio $\frac{G(U_{1},\ldots,U_{d})}{\sum^{R}_{i=1}\lambda_{i}(A_{(d)})^{2}}$; black with right-arrow markers: $\frac{1}{n^{2}}$; black with diamond markers: $\frac{1}{R^{3}n^{2}}$.
Figure 2: Real ratio versus theoretical ratio.

Several discussions concerning the approximation bound in Theorem 4.1 are provided in this subsection.

Firstly, Theorem 4.1 derives the worst-case theoretical approximation bound for general tensors. Looking into Lemma 4.2, we see that the approximation ratio comes from two factors: the lower bound of $\|\mathbf{v}_{j,i}\|^{2}/\|M_{j,i}\|_{F}^{2}$ and the lower bound of $\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}/\|V_{j}\|_{F}^{2}$. The former reaches the worst bound $1/n_{j}$ when $M_{j,i}$ is row-wise orthogonal with each row of equal magnitude; the latter attains the worst bound $1/R$ when $V_{j}$ is of rank 1 with each column of equal magnitude. These two worst cases rarely happen simultaneously, meaning that it is possible to further improve the theoretical lower bound. Currently, we cannot figure this out. Instead, we show the real ratio versus the theoretical ratio via an example as follows.

We generate tensors $\mathcal{A}\in\mathbb{R}^{n\times n\times n\times n}$ whose entries obey the standard normal distribution. We set $t=4$ for the problem. $n$ varies from 10 to 100, and for each $n$, 50 instances are generated. We use Procedure A in the algorithm. In Fig. 2(a), we plot the curve of the real ratio $\big(\sum^{R}_{i=1}\|\mathbf{v}_{j,i}\|^{2}/\|M_{j,i}\|_{F}^{2}\big)/R$ and the curve of the theoretical ratio $1/n$ for $j=1,2$. The figure shows that although the real ratio is better, it still approaches the theoretical one as $n$ increases. In Fig. 2(b), we fix the size of the tensor as $\mathcal{A}\in\mathbb{R}^{100\times 100\times 100\times 100}$ and vary $R$ from 10 to 100. We plot the real ratio $\sum^{R}_{i=1}\left\langle\mathbf{u}_{j,i},\mathbf{v}_{j,i}\right\rangle^{2}/\|V_{j}\|_{F}^{2}$ and the theoretical one $1/R$. The figure shows that the real ratio is far better, while it also decreases as $R$ increases.

To further investigate the influence of $n$ and $R$ on the bound, we also plot the overall real ratio $\frac{G(U_{1},\ldots,U_{d})}{\sum^{R}_{i=1}\lambda_{i}(A_{(d)})^{2}}$ and the theoretical one $1/(R^{3}n^{2})$ in Fig. 2(c). The tensors are the same as above, where in the left panel we vary $n$ from 10 to 100 with $R=10$ fixed; in the right one, $R$ varies from 10 to 100 with $n=100$ fixed. The real ratio is in red, and the theoretical ratio is in black with diamond markers. We see that the real ratio decreases as $n$ increases, confirming the observation of Fig. 2(a); moreover, the real ratio is almost parallel to the theoretical ratio, showing that the theoretical ratio is reasonable up to a constant (if we view $R$ as a constant). The right panel shows that when $R$ increases, the real ratio does not decrease as much as the theoretical one, indicating that the term $1/R^{3}$ might be too loose in the theoretical ratio. Thus we also plot the curve of $1/n^{2}$, in black with right-arrow markers. We see that $1/n^{2}$ still lies under the curve of the real ratio, meaning that it may be possible to improve the theoretical ratio so that it does not depend on $R$. This still needs further research.

Secondly, one may wonder whether estimates of the form $\left\|\mathcal{A}-\mathcal{A}^{\rm est}\right\|_{F}\leq\alpha\left\|\mathcal{A}-\mathcal{A}^{\rm best}\right\|_{F}$ can be established for the proposed algorithm for some factor $\alpha$, as for the truncated HOSVD [9]; here $\mathcal{A}^{\rm est}$ is constructed from $U^{0}_{1},\ldots,U^{0}_{d}$, and $\mathcal{A}^{\rm best}$ denotes the optimal solution. This seems to be difficult for a general tensor in the current CP format. The reason may be that, unlike the HOSVD format, in which every tensor can be written [9, Theorem 2] and for which the truncated HOSVD gives a good approximation, not every tensor admits a CP decomposition with orthonormal factors. Therefore, estimating $\left\|\mathcal{A}-\mathcal{A}^{\rm est}\right\|_{F}$ is not easy for a general tensor.

Thirdly, in the algorithm, other procedures for computing $\mathbf{v}_{j,i}$ from $M_{j,i}$ are possible. Another randomized example is given in Appendix A. Lemma A.1 shows that the factor $\beta_{j}$ of this procedure is worse than those of Procedures A–C; however, this procedure is expected to be more efficient.

Finally, when this work was almost finished, we realized that in problem (2.2), if $R=1$, $t=d$, and Procedure A is taken in Algorithm 1, then the algorithm boils down essentially to [13, Algorithm 1 with DR 2] for best rank-1 approximation problems. In this regard, Algorithm 1 generalizes [13, Algorithm 2] to the orthogonal rank-$R$ approximation case, and involves more procedures than Procedure A.

4.3 Approximation results for nearly orthogonal tensors

We briefly consider the following type of tensors in this subsection:

Assumption 4.1.
\mathcal{A}=\sum^{R}_{i=1}\sigma_{i}\bigotimes^{d}_{j=1}\mathbf{u}_{j,i}+\mathcal{E},   (4.14)

where the last $t$ $U_{j}$'s are orthonormal, and the first $d-t$ $U_{j}$'s are columnwise normalized. In addition, we assume that $\sigma_{1}>\sigma_{2}>\cdots>\sigma_{R}>0$. $\mathcal{E}$ denotes the noise tensor.

The case that $\sigma_{i_{1}}=\sigma_{i_{2}}$ for some $1\leq i_{1}<i_{2}\leq R$ needs more assumptions and analysis, which is out of the scope of this work and will be studied in the future.

Running Algorithm 1 on $\mathcal{A}$ with $t\geq 2$, it is easy to obtain the following results; for completeness, we provide a short discussion. Denote by $U_{-d}\in\mathbb{R}^{\prod^{d-1}_{j=1}n_{j}\times R}$ the matrix whose $i$-th column is the vectorization of $\bigotimes^{d-1}_{j=1}\mathbf{u}_{j,i}$.

Proposition 4.1.

Denote by $(U^{0}_{1},\ldots,U^{0}_{d})$ the factors generated by Algorithm 1, where the procedure get_v_from_M takes Procedure A, B, or C. Suppose that Assumption 4.1 holds with $t\geq 2$ and that we set the same $t$ in the algorithm; if in addition $\mathcal{E}=0$, then

G(U_{1}^{0},\ldots,U^{0}_{d})=\sum^{R}_{i=1}\sigma_{i}^{2}=G_{\max},\quad{\rm and}\quad U^{0}_{j}=U_{j},~j=1,\ldots,d,

where for $U^{0}_{j}=U_{j}$ we ignore the sign, i.e., it in fact holds that $\mathbf{u}_{j,i}^{0}=\pm\mathbf{u}_{j,i}$ for each $j$ and $i$.

Proof.

To achieve this, we first show that during the algorithm,

Vj=UjΣ,j=dt+1,,d,V_{j}=U_{j}\cdot\Sigma,~{}j=d-t+1,\ldots,d, (4.15)

where Σ:=diag(σ1,,σR)\Sigma:={\rm diag}(\sigma_{1},\ldots,\sigma_{R}). If (4.15) holds, then since Uj0U_{j}^{0} is given by the polar decomposition of VjV_{j}, by Assumption 4.1, it follows

Uj0=Uj,j=dt+1,,d.U^{0}_{j}=U_{j},~{}j=d-t+1,\ldots,d. (4.16)

(4.15) can be proved by induction starting from j=d. When j=d, it is in fact given by (4.9). To see this, we only need to show that \lambda_{i}(A_{(d)})=\sigma_{i}, where we recall that \lambda_{i}(\cdot) denotes the i-th largest singular value of a matrix. Under Assumption 4.1 with t\geq 2, U_{-d} is orthonormal, so that U_{d}\Sigma U^{\top}_{-d} is a reduced SVD of A_{(d)}, and the result follows. Assume now that (4.15) holds for j=d,d-1,\ldots,m; then U^{0}_{j}=U_{j}, j=d,d-1,\ldots,m. Therefore,

\mathcal{B}_{m-1,i}=\mathcal{A}\times_{m}\mathbf{u}^{0\top}_{m,i}\times_{m+1}\mathbf{u}^{0\top}_{m+1,i}\times\cdots\times_{d}\mathbf{u}_{d,i}^{0\top}=\sigma_{i}\bigotimes^{m-1}_{j=1}\nolimits\mathbf{u}_{j,i},

which is a rank-1 tensor. Since M_{m-1,i} is the unfolding of \mathcal{B}_{m-1,i}, it can be seen that \mathbf{y}_{m-1,i} generated by Procedures A, B, C is just the vectorization of \bigotimes^{m-2}_{j=1}\mathbf{u}_{j,i}, and so \mathbf{v}_{m-1,i}=\sigma_{i}\mathbf{u}_{m-1,i}, which shows that (4.15) holds when j=m-1. Thus (4.15) and (4.16) are valid for j=d-t+1,\ldots,d.

Since \mathcal{A}\times_{d-t+1}\mathbf{u}^{0\top}_{d-t+1,i}\times_{d-t+2}\mathbf{u}^{0\top}_{d-t+2,i}\times\cdots\times_{d}\mathbf{u}_{d,i}^{0\top} is a rank-1 tensor, the rank-1 approximation procedure yields \mathbf{u}^{0}_{j,i}=\mathbf{u}_{j,i}, j=1,\ldots,d-t. Hence, G(U^{0}_{1},\ldots,U^{0}_{d})=G(U_{1},\ldots,U_{d})=\sum^{R}_{i=1}\sigma_{i}^{2}. This completes the proof. ∎
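
To make the key step of the proof concrete, the following minimal Matlab sketch (illustrative sizes; it is not part of Algorithm 1) builds a noiseless tensor satisfying Assumption 4.1 with d=4 and t=2, and checks the case j=d of (4.15): the singular values of the mode-d unfolding are exactly the \sigma_{i}'s and its leading left singular vectors recover U_{d} up to sign.

% Minimal sketch: verify (4.15)-(4.16) for j = d on a noiseless Assumption 4.1 tensor.
n = 10; R = 4; d = 4;                              % illustrative sizes
sigma = sort(rand(R,1)+0.5,'descend');             % distinct positive weights
U = cell(d,1);
U{1} = randn(n,R); U{1} = U{1}./vecnorm(U{1});     % normalized (non-orthonormal) factors
U{2} = randn(n,R); U{2} = U{2}./vecnorm(U{2});
U{3} = orth(randn(n,R)); U{4} = orth(randn(n,R));  % last t = 2 factors orthonormal
A = zeros(n,n,n,n);
for i = 1:R                                        % rank-R CP construction, E = 0
    v = kron(U{4}(:,i), kron(U{3}(:,i), kron(U{2}(:,i), U{1}(:,i))));
    A = A + sigma(i)*reshape(v, [n n n n]);
end
Ad = reshape(permute(A,[4 1 2 3]), n, []);         % mode-d unfolding A_(d)
[P,S,~] = svd(Ad,'econ');
disp(norm(diag(S(1:R,1:R)) - sigma))               % ~0: singular values are the sigma_i's
disp(norm(abs(P(:,1:R)'*U{4}) - eye(R)))           % ~0: leading left sing. vecs equal U_d up to sign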

From the above discussions, and by matrix perturbation theory, it is not hard to obtain the perturbed version of Proposition 4.1.

Proposition 4.2.

Under the setting of Proposition 4.1, where now \mathcal{E} is small enough, we have G(U10,,Ud0)=GmaxO()G(U_{1}^{0},\ldots,U^{0}_{d})=G_{\max}-O(\left\|\mathcal{E}\right\|), where the constant behind the big O is nonnegative, and

min{𝐮j,i0+𝐮j,i,𝐮j,i0𝐮j,i}=O(),j=1,,d,i=1,,R.\min\{\left\|\mathbf{u}_{j,i}^{0}+\mathbf{u}_{j,i}\right\|,\left\|\mathbf{u}^{0}_{j,i}-\mathbf{u}_{j,i}\right\|\}=O(\left\|\mathcal{E}\right\|),~{}j=1,\ldots,d,~{}i=1,\ldots,R.

It remains to consider the case t=1 in Assumption 4.1, i.e., only U_{d} is orthonormal. In general, without additional assumptions on U_{j}, j=1,\ldots,d-1, it is hard to obtain approximation results. For instance, if A=PQ^{\top} where P is orthonormal but Q is not, then it is in general hard to recover P and Q. We thus make the following incoherence assumption:

Assumption 4.2.

There is at least one U_{j}, 1\leq j\leq d-1, that is incoherent with modulus 0\leq\delta<1, i.e.,

1jd1,with|𝐮j,i1,𝐮j,i2|δ,i1i2.\exists 1\leq j\leq d-1,~{}{\rm with}~{}\left|\left\langle\mathbf{u}_{j,i_{1}},\mathbf{u}_{j,i_{2}}\right\rangle\right|\leq\delta,~{}\forall i_{1}\neq i_{2}.

It is clear that UjU_{j} is a nearly orthonormal matrix if δ\delta is small enough. In what follows, we assume without loss of generality that Ud1U_{d-1} satisfies Assumption 4.2. We immediately have:

Proposition 4.3.

If U_{d-1} is incoherent with modulus \delta, then U_{-d} is also incoherent with modulus \delta. This is because the i-th column of U_{-d} is the vectorization of \bigotimes^{d-1}_{j=1}\mathbf{u}_{j,i}, so that \left\langle\mathbf{u}_{-d,i_{1}},\mathbf{u}_{-d,i_{2}}\right\rangle=\prod^{d-1}_{j=1}\left\langle\mathbf{u}_{j,i_{1}},\mathbf{u}_{j,i_{2}}\right\rangle, whose magnitude is at most \delta: the factor with j=d-1 is at most \delta in magnitude, while the remaining factors are at most 1 in magnitude since the columns are normalized.

We consider the expression of A(d)A(d)A_{(d)}A^{\top}_{(d)}:

A(d)A(d)\displaystyle A_{(d)}A^{\top}_{(d)} =\displaystyle= UdΣUdUdΣUd+O()\displaystyle U_{d}\Sigma U_{-d}^{\top}U_{-d}\Sigma U_{d}^{\top}+O(\mathcal{E})
=\displaystyle= UdΣ2Ud+UdΣ[0𝐮d,1𝐮d,2𝐮d,1𝐮d,R00𝐮d,R𝐮d,1𝐮d,R𝐮d,R10]ΣUd+O()\displaystyle U_{d}\Sigma^{2}U_{d}^{\top}+U_{d}\Sigma\left[\begin{smallmatrix}0&\mathbf{u}_{-d,1}^{\top}\mathbf{u}_{-d,2}&\cdots&\mathbf{u}_{-d,1}^{\top}\mathbf{u}_{-d,R}\\ \vdots&0&\ddots&\vdots\\ \vdots&\ddots&0&\vdots\\ \mathbf{u}_{-d,R}^{\top}\mathbf{u}_{-d,1}&\cdots&\mathbf{u}_{-d,R}^{\top}\mathbf{u}_{-d,R-1}&0\end{smallmatrix}\right]\Sigma U_{d}^{\top}+O(\mathcal{E})
=\displaystyle= UdΣ2Ud+O(δ)+O(),\displaystyle U_{d}\Sigma^{2}U_{d}^{\top}+O(\delta)+O(\mathcal{E}),

namely, A_{(d)}A_{(d)}^{\top} is a perturbation of the eigendecomposition U_{d}\Sigma^{2}U_{d}^{\top}, provided that \delta and \mathcal{E} are small enough. If Assumption 4.1 holds, by matrix perturbation theory, the above implies that

min{𝐮d,i0+𝐮d,i,𝐮d,i0𝐮d,i}=O(δ)+O(),i=1,,R,\min\{\left\|\mathbf{u}_{d,i}^{0}+\mathbf{u}_{d,i}\right\|,\left\|\mathbf{u}^{0}_{d,i}-\mathbf{u}_{d,i}\right\|\}=O(\delta)+O(\left\|\mathcal{E}\right\|),~{}i=1,\ldots,R,

where Ud0U^{0}_{d} is generated by the algorithm. It then follows that 𝒜×d𝐮d,i0\mathcal{A}\times_{d}\mathbf{u}_{d,i}^{0\top} is a nearly rank-1 tensor with perturbation O(δ)+O()O(\delta)+O(\mathcal{E}), and so

min{𝐮j,i0+𝐮j,i,𝐮j,i0𝐮j,i}=O(δ)+O(),j=1,,d1.\min\{\left\|\mathbf{u}_{j,i}^{0}+\mathbf{u}_{j,i}\right\|,\left\|\mathbf{u}^{0}_{j,i}-\mathbf{u}_{j,i}\right\|\}=O(\delta)+O(\left\|\mathcal{E}\right\|),~{}j=1,\ldots,d-1.
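
The following Matlab sketch (illustrative sizes, d=3 and \mathcal{E}=0; the nearly orthonormal U_{2} below, an orthonormal matrix plus a small perturbation with renormalized columns, is just one simple way to realize a small modulus \delta) illustrates this: the leading left singular vectors of A_{(d)} remain close to U_{d} up to sign when \delta is small.

% Sketch: with U_{d-1} nearly orthonormal (small coherence) and E = 0, the
% leading left singular vectors of the mode-d unfolding are close to U_d.
n = 30; R = 5; delta0 = 1e-2;
sigma = sort(rand(R,1)+0.5,'descend');
U1 = randn(n,R);  U1 = U1./vecnorm(U1);            % normalized factor
U2 = orth(randn(n,R)) + delta0*randn(n,R);         % nearly orthonormal factor U_{d-1}
U2 = U2./vecnorm(U2);
U3 = orth(randn(n,R));                             % the orthonormal factor U_d
A = zeros(n,n,n);
for i = 1:R
    A = A + sigma(i)*reshape(kron(U3(:,i),kron(U2(:,i),U1(:,i))), [n n n]);
end
Ad = reshape(permute(A,[3 1 2]), n, []);           % mode-d unfolding
[P,~,~] = svd(Ad,'econ');
disp(norm(abs(P(:,1:R)'*U3) - eye(R)))             % small; grows with delta and with the gaps of sigma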

We thus have a result similar to Proposition 4.2:

Proposition 4.4.

Denote by (U^{0}_{1},\ldots,U^{0}_{d}) the factors generated by Algorithm 1, where Assumption 4.1 holds with t=1 and, furthermore, Assumption 4.2 holds. If \mathcal{E} and \delta are small enough, and we also set t=1 in the algorithm, then G(U_{1}^{0},\ldots,U^{0}_{d})=G_{\max}-O(\delta)-O(\left\|\mathcal{E}\right\|), where the constants behind the big O are nonnegative, and

min{𝐮j,i0+𝐮j,i,𝐮j,i0𝐮j,i}=O(δ)+O(),j=1,,d,i=1,,R.\min\{\left\|\mathbf{u}_{j,i}^{0}+\mathbf{u}_{j,i}\right\|,\left\|\mathbf{u}^{0}_{j,i}-\mathbf{u}_{j,i}\right\|\}=O(\delta)+O(\left\|\mathcal{E}\right\|),~{}j=1,\ldots,d,~{}i=1,\ldots,R.

5 Numerical Study

We evaluate the performance of Algorithm 1 with Procedures A, B, and C, and compare them with [45, Procedure 3.1]. All the computations are conducted on an Intel i7-7770 CPU desktop computer with 32 GB of RAM. The supporting software is Matlab R2019b. The Matlab package Tensorlab [43] is employed for tensor operations.

Performance of approximation algorithms
Figure 3: Performance of Alg. 1 (A) (red, star markers), Alg. 1 (B) (magenta, circle markers), Alg. 1 (C) (blue, square markers), and [45, Procedure 3.1] (black, right-arrow markers) for approximately solving (2.2), where \mathcal{A}\in\mathbb{R}^{n\times n\times n\times n} is randomly generated and n varies from 10 to 100. Panels: (a) t=2, (b) t=3, (c) t=4; in each panel, left: G(U_{1},\ldots,U_{d}) versus n; right: CPU time versus n.

In this part, we compare Algorithm 1 equipped with Procedures A, B, and C against [45, Procedure 3.1]. They are respectively denoted as Alg. 1 (A), Alg. 1 (B), Alg. 1 (C), and Yang2020. All the algorithms employ the same rank1approx procedure [45, Procedure 3.2] for solving the rank-1 approximation subproblems. The tensors \mathcal{A}\in\mathbb{R}^{n\times n\times n\times n} are randomly generated with every entry drawn from the standard normal distribution. The presented results are averaged over 50 instances for each case.

We first fix R=10 and vary n from 10 to 100. The results are depicted in Fig. 3. Alg. 1 (A) is in red with star markers, Alg. 1 (B) in magenta with circle markers, Alg. 1 (C) in blue with square markers, and Yang2020 in black with right-arrow markers. The left panels show the curves of the objective value G(U_{1},\ldots,U_{d}) versus n, from which we observe that Alg. 1 (A) performs the best, followed by Alg. 1 (B); this is because, by using the SVD, Procedure A retains more information in \mathbf{v} than the other procedures. Yang2020 does not perform as well as Algorithm 1, and its performance deteriorates as t increases; this was also pointed out in [45, Table 3.1], whereas the performance of Algorithm 1 is more stable. We therefore conclude that Alg. 1 better exploits the structure of the problem than Yang2020. The right panels show the CPU time versus n, from which we see that Yang2020 needs more time than Alg. 1, as Yang2020 requires performing t SVDs of size n\times n^{d-1} (assuming n_{1}=\cdots=n_{d}=n), while the most expensive step of Alg. 1 is a single SVD of this size. Among Procedures A, B, and C, Alg. 1 (A) is clearly the most expensive of the three, followed by Alg. 1 (B); the randomized version, i.e., Alg. 1 (C), is the cheapest.
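
This cost comparison can be illustrated directly; the Matlab sketch below (illustrative sizes; the exact timings depend on the machine) times a single SVD of an n\times n^{d-1} matrix against t such SVDs.

% Sketch: one SVD of an n x n^{d-1} matrix versus t such SVDs (d = 4, t = 3).
n = 60; d = 4; t = 3; X = randn(n, n^(d-1));
tic; [~,~,~] = svd(X,'econ'); time_one = toc;                   % the dominant cost of Alg. 1
tic; for k = 1:t, [~,~,~] = svd(X,'econ'); end; time_t = toc;   % roughly the corresponding cost in Yang2020
disp([time_one, time_t])                                        % the second is about t times the first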

Figure 4: Performance of Alg. 1 (A) (red, star markers), Alg. 1 (B) (magenta, circle markers), Alg. 1 (C) (blue, square markers), and [45, Procedure 3.1] (black, right-arrow markers) for approximately solving (2.2), where \mathcal{A}\in\mathbb{R}^{80\times 80\times 80\times 80} is randomly generated and R varies from 5 to 80.

We then fix n=80, t=3, and vary R from 5 to 80. The results are plotted in Fig. 4, from which we can still observe that Alg. 1 (A) performs the best in terms of the objective value. Considering the CPU time, Alg. 1 (B) and Alg. 1 (C) are considerably faster; this shows the advantage of SVD-free procedures for computing the vectors \mathbf{v}.

Table 1: ϵ\epsilon-ALS initialized by different strategies. t=3t=3 and R=10R=10 cases. ‘rel.err0’ and ‘time0’ respectively represent the relative error and CPU time evaluated at the initializers.
nn Yang2020 Alg. 1 (A) Alg. 1 (B) Alg. 1 (C) Random
60 rel.err.(rel.err0) 0.0139 (0.0366) 0.0140 (0.0295) 0.0137 (0.0348) 0.0218 (0.0680) 0.0291
iter. 7.32 6.26 8.70 8.04 18.72
time(time0) 0.59 (0.17) 0.52 (0.15) 0.56 (0.09) 0.53 (0.08) 0.92
70 rel.err.(rel.err0) 0.0125 (0.0255) 0.0125 (0.0177) 0.0190 (0.0252) 0.0197 (0.0489) 0.0285
iter. 4.5 4.3 4.4 4.7 32.9
time(time0) 0.72 (0.36) 0.62 (0.26) 0.52 (0.16) 0.54 (0.16) 2.71
80 rel.err.(rel.err0) 0.0191 (0.0239) 0.0186 (0.0216) 0.0189 (0.0218) 0.0114 (0.0343) 0.0193
iter. 19.2 9.2 9.7 24.4 35.08
time(time0) 3.16 (0.62) 1.77 (0.36) 1.73 (0.26) 3.50 (0.25) 4.58
90 rel.err.(rel.err0) 0.0030 (0.0096) 0.0030 (0.0057) 0.0030 (0.0109) 0.0030 (0.0314) 0.0108
iter. 5.5 4.9 7.6 10.4 32.1
time(time0) 2.48 (1.02) 1.88 (0.54) 2.24 (0.40) 2.75 (0.36) 6.41
Performance of approximation algorithms plus ALS for factor recovery

The tensors are generated similarly to [45, 39]: \mathcal{A}=\mathcal{B}/\left\|\mathcal{B}\right\|+\beta\cdot\mathcal{N}/\|\mathcal{N}\|, where \mathcal{B}=\sum^{R}_{i=1}\sigma_{i}\bigotimes^{d}_{j=1}\mathbf{u}_{j,i}, \mathcal{N} is an unstructured tensor, and \beta denotes the noise level. Here we set \beta=0.1, R=10, d=4, and n varies from 60 to 90. \sigma_{i}, U_{j}, and \mathcal{N} are randomly drawn from a uniform distribution on [-1,1]. The last t U_{j}'s are then orthonormalized by using the orth function, while the first (d-t) ones have their columns normalized. We only test the t=3 and t=4 cases. We compare \epsilon-ALS [45] with \epsilon=10^{-8} initialized by the initializers generated respectively by Yang2020, Alg. 1 (A), Alg. 1 (B), and Alg. 1 (C); \epsilon-ALS with random initializers is used as a baseline. The stopping criterion for \epsilon-ALS is \sum^{d}_{j=1}\left\|U^{k+1}_{j}-U^{k}_{j}\right\|_{F}/\left\|U^{k}_{j}\right\|_{F}\leq 10^{-5} or k\geq 2000. The relative error is defined as follows [39, 45]:

rel.err=j=1dUjUjoutΠjF/UjF,{\rm rel.err}=\sum^{d}_{j=1}\nolimits\left\|U_{j}-U^{\rm out}_{j}\cdot\Pi_{j}\right\|_{F}/\left\|U_{j}\right\|_{F},

where \Pi_{j}=\arg\min_{\Pi\in\boldsymbol{\Pi}}\left\|U_{j}-U^{\rm out}_{j}\cdot\Pi\right\|_{F}, \boldsymbol{\Pi} denotes the set of permutation matrices, and the U^{\rm out}_{j}'s are the factors output by the iterative algorithm. Finding the permutation can be done efficiently by the Hungarian algorithm [21]. The presented results are averaged over 50 instances for each case.
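
For illustration, the relative error above can be evaluated as in the following Matlab sketch; the assignment step is delegated to matchpairs (available in Matlab since R2019a), which we assume to be an acceptable stand-in for a hand-coded Hungarian algorithm, and Utrue, Uout are hypothetical cell arrays holding the true and output factors.

function err = rel_err(Utrue, Uout)                % a sketch, not the code used in the paper
% Relative error with the best column permutation per mode, cf. the formula above.
d = numel(Utrue); err = 0;
for j = 1:d
    R = size(Utrue{j},2); C = zeros(R);
    for i1 = 1:R                                   % C(i1,i2) = || u_{j,i1} - u^out_{j,i2} ||^2
        for i2 = 1:R
            C(i1,i2) = norm(Utrue{j}(:,i1) - Uout{j}(:,i2))^2;
        end
    end
    M = matchpairs(C, 1e10);                       % linear assignment (Hungarian-type) solver
    p = zeros(1,R); p(M(:,1)) = M(:,2);            % column i1 of U_j is matched to column p(i1) of U_j^out
    err = err + norm(Utrue{j} - Uout{j}(:,p),'fro')/norm(Utrue{j},'fro');
end
end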

Table 2: ϵ\epsilon-ALS initialized by different strategies. t=4t=4 and R=10R=10 cases. ‘rel.err0’ and ‘time0’ respectively represent the relative error and CPU time evaluated at the initializers.
nn Yang2020 Alg. 1 (A) Alg. 1 (B) Alg. 1 (C) Random
60 rel.err.(rel.err0) 0.0211 (0.0328) 0.0214 (0.0263) 0.0133 (0.0320) 0.0209 (0.0658) 0.0706
iter. 13.4 12.0 11.3 18.6 36.7
time(time0) 0.88 (0.19) 0.79 (0.16) 0.69 (0.09) 1.01 (0.08) 1.75
70 rel.err.(rel.err0) 0.0129 (0.0300) 0.0129 (0.0209) 0.0126 (0.0318) 0.0132 (0.0560) 0.0369
iter. 31.6 23.3 18.6 28.9 45.9
time(time0) 3.16 (0.36) 2.21 (0.23) 1.76 (0.16) 2.57 (0.14) 4.03
80 rel.err.(rel.err0) 0.0279 (0.0347) 0.0276 (0.0294) 0.0279 (0.0312) 0.0280 (0.0477) 0.0361
iter. 16.8 9.6 13.3 18.5 24.8
time(time0) 2.96 (0.62) 1.80 (0.36) 2.19 (0.26) 2.78 (0.25) 3.30
90 rel.err.(rel.err0) 0.0029 (0.0064) 0.0029 (0.0041) 0.0029 (0.0079) 0.0029 (0.0414) 0.0194
iter. 30.6 11.5 20.1 16.9 43.5
time(time0) 6.95 (0.98) 3.00 (0.50) 4.46 (0.38) 3.86 (0.35) 8.31

The results for t=3 and t=4 are reported in Tables 1 and 2, in which 'rel.err0' and 'time0' respectively represent the relative error and CPU time evaluated at the initializers. From the tables, we first observe that \epsilon-ALS initialized by Yang2020 and Alg. 1 outperforms that with random initializations, considering both the relative error and the computational time. Comparing Alg. 1 with Yang2020, we see that in most cases, \epsilon-ALS initialized by Alg. 1 performs comparably or slightly better. Comparing among \epsilon-ALS initialized by the three versions of Alg. 1, Alg. 1 (C) is slightly worse in terms of time; this is because, although Alg. 1 (C) is the most efficient initializer, \epsilon-ALS initialized by it needs more iterations than the others. Therefore, it would be necessary to improve the efficiency of the iterative algorithms.

Figure 5: Comparisons of the relative error of \epsilon-ALS initialized by Alg. 1 (A) (red, star markers), Alg. 1 (B) (magenta, circle markers), [45, Procedure 3.1] (black, right-arrow markers), and random initializers (black, diamond markers). Left panel: n=100, \beta=0.1, t=4, while R varies from 10 to 100. Right panel: n=50, t=4, R=10, while the noise level \beta varies from 0.05 to 1.

Next, we fix d=t=4, n=100, \beta=0.1, and vary R from 10 to 100; we also fix n=50, R=10, and vary \beta from 0.05 to 1. The relative errors of the different methods are depicted in Fig. 5, from which we still observe that \epsilon-ALS initialized by the approximation algorithms performs better than that with random initializers, and the iterative algorithm initialized by Alg. 1 is comparable with that initialized by Yang2020.

CP approximation for clustering

Tensor CP approximation for clustering works as follows. Suppose that we have N samples \mathcal{A}_{1},\ldots,\mathcal{A}_{N}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d}}, d\geq 2. For a given parameter R, and for unknown variables A\in\mathbb{R}^{N\times R}, U_{j}\in\mathbb{R}^{n_{j}\times R}, j=1,\ldots,d, one first solves the following problem:

mink=1N𝒜ki=1Raikj=1d𝐮j,iF2s.t.A=(aik)N×R,UjUj=I,1jd,\begin{split}&\min~{}\sum^{N}_{k=1}\left\|\mathcal{A}_{k}-\sum^{R}_{i=1}\nolimits a^{k}_{i}\bigotimes^{d}_{j=1}\nolimits\mathbf{u}_{j,i}\right\|_{F}^{2}\\ &~{}{\rm s.t.}~{}~{}A=(a^{k}_{i})\in\mathbb{R}^{N\times R},U_{j}^{\top}U_{j}=I,1\leq j\leq d,\end{split} (5.17)

where in a^{k}_{i}, k represents the row index while i is the column index. Write \mathbf{a}^{k}:=[a^{k}_{1},\ldots,a^{k}_{R}]^{\top}\in\mathbb{R}^{R}, i.e., the transpose of the k-th row of A. The above problem means that one finds a common subspace of \mathbb{R}^{n_{1}\times\cdots\times n_{d}}, spanned by the basis \bigotimes^{d}_{j=1}\mathbf{u}_{j,i}, i=1,\ldots,R, and projects the samples onto this subspace. The \mathbf{a}^{k} can be seen as a representation of \mathcal{A}_{k}, and this can be regarded as a dimension reduction procedure [34, 30]. By stacking the \mathcal{A}_{k}'s into a (d+1)-th order tensor \mathcal{A} with \mathcal{A}(k,:,\ldots,:)=\mathcal{A}_{k}, and denoting by \mathbf{a}_{i} the i-th column of A, the objective function of (5.17) can be written as \left\|\mathcal{A}-\sum^{R}_{i=1}\mathbf{a}_{i}\otimes\mathbf{u}_{1,i}\otimes\cdots\otimes\mathbf{u}_{d,i}\right\|_{F}^{2}, and so (5.17) reduces to (2.2). Once A=[\mathbf{a}_{1},\ldots,\mathbf{a}_{R}] is obtained, we use K-means for clustering, with the transposes of the rows of A, namely the \mathbf{a}^{k}'s, being the new samples. The cluster error is defined as follows: let K\geq 2 be the given number of clusters, \phi_{0}:\mathbb{R}^{n_{1}\times\cdots\times n_{d}}\rightarrow\{1,\ldots,K\} be the true clustering mapping, and \phi be the estimated one. Denote by {\rm card}(\cdot) the cardinality of a set and by I(\cdot) the indicator function. Then [41]

{\rm cluster~{}err.}:=\tbinom{N}{2}^{-1}{\rm card}\left(\{i,j\}:I\left(\phi(\mathcal{A}_{i})=\phi(\mathcal{A}_{j})\right)\neq I\left(\phi_{0}(\mathcal{A}_{i})=\phi_{0}(\mathcal{A}_{j})\right),i<j\right).
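
For concreteness, the cluster error can be computed from the two label vectors as in the following Matlab sketch (a straightforward O(N^2) implementation; phi0 and phi are assumed to hold the true and estimated labels of the N samples).

function err = cluster_err(phi0, phi)              % a sketch of the definition above
N = numel(phi0); mis = 0;
for i = 1:N-1
    for j = i+1:N                                  % compare the "same cluster" indicators on each pair
        mis = mis + ((phi(i)==phi(j)) ~= (phi0(i)==phi0(j)));
    end
end
err = mis/nchoosek(N,2);
end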

We solve (5.17) by \epsilon-ALS initialized by Alg. 1 (A), Alg. 1 (B), and by random initialization, with R=30. For the first two initializations, the maximum number of \epsilon-ALS iterations is set to 10, while it is 1000 for the random initialization. We also compare them with the vanilla K-means, namely, the original samples \mathcal{A}_{k} are first vectorized and then clustered by K-means. The dataset COIL-100 (http://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php) is used, which consists of 100 objects, each containing 72 images of size 128\times 128 viewed from different angles. In our experiment, each time we randomly select K\in\{5,7,9,11,15,20\} objects and randomly select M=50 images from each object, resulting in a third-order tensor \mathcal{A}\in\mathbb{R}^{50K\times 128\times 128} that consists of 50K samples. For each case we run 50 instances and present the averaged results in Table 3.

Table 3: CP approximation for clustering via first solving (5.17) by Alg. 1 (A) + ϵ\epsilon-ALS, Alg. 1 (B) + ϵ\epsilon-ALS, or random + ϵ\epsilon-ALS; the iterative algorithm of the first two stops within 1010 iterations while that for the third is 10001000; then the KK-means is performed to the reduced samples 𝐚k\mathbf{a}^{k}’s. We also use the vanilla KK-means as the baseline.
Alg. 1 (A) Alg. 1 (B) Random vanilla KK-means
KK cluster err. time cluster err. time cluster err. time cluster err. time
5 1.10E-01 0.33 1.12E-01 0.24 1.44E-01 6.25 1.33E-01 0.44
7 9.60E-02 0.40 1.04E-01 0.31 1.64E-01 5.12 1.08E-01 0.79
9 8.24E-02 0.50 9.12E-02 0.42 1.40E-01 11.74 1.00E-01 1.25
11 7.48E-02 0.58 8.08E-02 0.50 1.21E-01 15.26 8.54E-02 1.76
15 6.31E-02 0.78 6.92E-02 0.68 9.56E-02 14.86 7.25E-02 3.23
20 5.36E-02 1.04 5.23E-02 0.98 7.84E-02 24.14 5.56E-02 5.58

Considering the clustering error, we observe from the table that, armed with the approximation algorithm, the CP-approximation-based method achieves better performance than the vanilla K-means, and Alg. 1 (A) is slightly better than Alg. 1 (B). However, when starting from a random initializer, the CP-approximation-based method is worse than the vanilla K-means. This shows the usefulness of the introduced approximation algorithm. Considering the computational time, the first two methods are also better than the vanilla K-means, because \epsilon-ALS is stopped within 10 iterations and because the dimension of each reduced sample is R=30, while that for the vanilla K-means is 128^{2}. We also observe that Random + \epsilon-ALS usually cannot stop within 1000 iterations, making it the slowest one.

Figure 6: Cluster error and CPU time of CP approximation for clustering, for one instance with K=5 and R varying from 10 to 100. Left: cluster error versus R; right: CPU time versus R.

We next show the influence of R on the performance. We fix an instance with K=5 and vary R from 10 to 100. The cluster error and CPU time of Alg. 1 (A) + \epsilon-ALS are plotted in Fig. 6. We see that the cluster error does not change much as R varies, while around R=30 it is slightly smaller than in the other cases; this observation, together with the CPU time shown in the right panel, explains why we chose R=30 in our experiments.

6 Conclusions

In [45], an approximation procedure was proposed for low-rank tensor approximation with orthonormal factors by combining the truncated HOSVD and best rank-1 approximation. The approximation bound was only theoretically established when the number of latent orthonormal factors is one. To fill this gap, a modified approximation algorithm was developed in this work. It allows either deterministic or randomized procedures to solve a key step of each latent orthonormal factor in the algorithm, giving some flexibility. The approximation bound, either in the deterministic or the expected sense, has been established no matter how many orthonormal factors there are. Moreover, compared with [45, Procedure 3.1], which requires t SVDs of size n\times n^{d-1}, the introduced approximation algorithm involves only one SVD of this size (plus other operations of smaller size), making it more efficient. Numerical tests were provided to verify the usefulness of the algorithm, and its performance compares favorably with [45, Procedure 3.1]. Future work is to study approaches for finding global solutions. A possible way is to use convex relaxation, as has been done for rank-1 approximation [28, 16, 46]. However, extending the ideas from rank-1 approximation to rank-R approximation is not straightforward; this will be our next focus. Another possible research thread is to extend the notion of the best rank-1 approximation ratio of a tensor space [31, 24] to the best rank-R approximation ratio setting (via problem (2.2)) and study its properties.

Acknowledgement

This work was supported by NSFC Grant 11801100 and the Fok Ying Tong Education Foundation Grant 171094. We thank Mr. Xianpeng Mao for his help with the experiments.

References

  • [1] S. Ahmadi-Asl, A. Cichocki, A. H. Phan, I. Oseledets, S. Abukhovich, and T. Tanaka. Randomized algorithms for computation of Tucker decomposition and higher order SVD (HOSVD). arXiv preprint arXiv:2001.07124, 2020.
  • [2] M. Che, Y. Wei, and H. Yan. The computation of low multilinear rank approximations of tensors via power scheme and random projection. SIAM J. Matrix Anal. Appl., 41(2):605–636, 2020.
  • [3] J. Chen and Y. Saad. On the tensor SVD and the optimal low rank orthogonal approximation of tensors. SIAM J. Matrix Anal. Appl., 30(4):1709–1734, 2009.
  • [4] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Process. Mag., 32(2):145–163, 2015.
  • [5] P. Comon. Independent component analysis, a new concept? Signal process., 36(3):287–314, 1994.
  • [6] P. Comon. Tensors: a brief introduction. IEEE Signal Process. Mag., 31(3):44–53, 2014.
  • [7] A. P. da Silva, P. Comon, and A. L. F. de Almeida. A finite algorithm to compute rank-1 tensor approximations. IEEE Signal Process. Letters, 23(7):959–963, 2016.
  • [8] L. De Lathauwer. A short introduction to tensor-based methods for factor analysis and blind source separation. In Proc. of the IEEE Int. Symp. on Image and Signal Processing and Analysis (ISPA 2011), pages 558–563. IEEE, 2011.
  • [9] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21:1253–1278, 2000.
  • [10] L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl., 31:2029–2054, 2010.
  • [11] Y. Guan and D. Chu. Numerical computation for orthogonal low-rank approximation of tensors. SIAM J. Matrix Anal. Appl., 40(3):1047–1065, 2019.
  • [12] S. He, B. Jiang, Z. Li, and S. Zhang. Probability bounds for polynomial functions in random variables. Math. Oper. Res., 39(3):889–907, 2014.
  • [13] S. He, Z. Li, and S. Zhang. Approximation algorithms for homogeneous polynomial optimization with quadratic constraints. Math. Program., 125:353–383, 2010.
  • [14] N. J. Higham. Computing the polar decomposition—with applications. SIAM J. Sci. Stat. Comput., 7(4):1160–1174, 1986.
  • [15] S. Hu and K. Ye. Linear convergence of an alternating polar decomposition method for low rank orthogonal tensor approximations. arXiv preprint arXiv:1912.04085, 2019.
  • [16] B. Jiang, S. Ma, and S. Zhang. Tensor principal component analysis via convex optimization. Math. Program., 150:423–457, 2014.
  • [17] E. Kofidis and P. Regalia. On the best rank-1 approximation of higher-order supersymmetric tensors. SIAM J. Matrix Anal. Appl., 23:863–884, 2002.
  • [18] T. G. Kolda. Orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl., 23(1):243–255, 2001.
  • [19] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51:455–500, 2009.
  • [20] J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Agebra appl., 18(2):95–138, 1977.
  • [21] H. W. Kuhn. The Hungarian method for the assignment problem. Nav. Res. logist. Q., 2(1-2):83–97, 1955.
  • [22] J. Li, K. Usevich, and P. Comon. Globally convergent Jacobi-type algorithms for simultaneous orthogonal symmetric tensor diagonalization. SIAM J. Matrix Anal. Appl., 39(1):1–22, 2018.
  • [23] J. Li and S. Zhang. Polar decomposition based algorithms on the product of stiefel manifolds with applications in tensor approximation. arXiv preprint arXiv:1912.10390, 2019.
  • [24] Z. Li, Y. Nakatsukasa, T. Soma, and A. Uschmajew. On orthogonal tensors and best rank-one approximation ratio. SIAM J. Matrix Anal. Appl., 39(1):400–425, 2018.
  • [25] L.-H. Lim and P. Comon. Nonnegative approximations of nonnegative tensors. J. Chemom., 23(7-8):432–441, 2009.
  • [26] L.-H. Lim and P. Comon. Blind multilinear identification. IEEE Trans. Inf. Theory, 60(2):1260–1280, 2014.
  • [27] R. Minster, A. K. Saibaba, and M. E. Kilmer. Randomized algorithms for low-rank tensor decompositions in the tucker format. SIAM J. Math. Data Sci., 2(1):189–215, 2020.
  • [28] J. Nie and L. Wang. Semidefinite relaxations for best rank-1 tensor approximations. SIAM J. Matrix Anal. Appl., 35(3):1155–1179, 2014.
  • [29] J. Pan and M. K. Ng. Symmetric orthogonal approximation to symmetric tensors with applications to image reconstruction. Numer. Linear Algebra Appl., 25(5):e2180, 2018.
  • [30] B. Pesquet-Popescu, J.-C. Pesquet, and A. P. Petropulu. Joint singular value decomposition-a new tool for separable representation of images. In ICIP, volume 2, pages 569–572. IEEE, 2001.
  • [31] L. Qi. The best rank-one approximation ratio of a tensor space. SIAM J. Matrix Anal. Appl., 32(2):430–442, 2011.
  • [32] Y. Qi, P. Comon, and L.-H. Lim. Semialgebraic geometry of nonnegative tensor rank. SIAM J. Matrix Anal. Appl., 37(4):1556–1580, 2016.
  • [33] B. Savas and L.-H. Lim. Quasi-newton methods on Grassmannians and multilinear approximations of tensors. SIAM J. Sci. Comput., 32(6):3352–3393, 2010.
  • [34] A. Shashua and A. Levin. Linear image coding for regression and classification using the tensor-rank principle. In CVPR, volume 1, pages I–I. IEEE, 2001.
  • [35] N. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process., 65(13):3551–3582, 2017.
  • [36] N. D. Sidiropoulos and R. Bro. On the uniqueness of multilinear decomposition of N-way arrays. J. Chemom., 14(3):229–239, 2000.
  • [37] N. D. Sidiropoulos, G. B. Giannakis, and R. Bro. Blind PARAFAC receivers for DS-CDMA systems. IEEE Trans. Signal Process., 48(3):810–823, 2000.
  • [38] A. M.-C. So. Deterministic approximation algorithms for sphere constrained homogeneous polynomial optimization problems. Math. Program., 129(2):357–382, 2011.
  • [39] M. Sørensen, L. De Lathauwer, P. Comon, S. Icart, and L. Deneire. Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM J. Matrix Anal. Appl., 33(4):1190–1213, 2012.
  • [40] M. Sørensen, L. De Lathauwer, and L. Deneire. PARAFAC with orthogonality in one mode and applications in DS-CDMA systems. In Proc. of the IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP 2010), pages 4142–4145. IEEE, 2010.
  • [41] W. Sun, J. Wang, and Y. Fang. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electron. J. Stat., 6:148–167, 2012.
  • [42] N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM J. Sci. Comput., 34(2):A1027–A1052, 2012.
  • [43] N. Vervliet, O. Debals, L. Sorber, M. Van Barel, and L. De Lathauwer. Tensorlab 3.0, Mar. 2016. Available online.
  • [44] L. Wang, M. T. Chu, and B. Yu. Orthogonal low rank tensor approximation: Alternating least squares method and its global convergence. SIAM J. Matrix Anal. and Appl., 36(1):1–19, 2015.
  • [45] Y. Yang. The epsilon-alternating least squares for orthogonal low-rank tensor approximation and its global convergence. SIAM J. Matrix Anal. Appl., 41(4):1797–1825, 2020.
  • [46] Y. Yang, Y. Feng, X. Huang, and J. A. K. Suykens. Rank-1 tensor properties with applications to a class of tensor optimization problems. SIAM J. Optim., 26(1):171–196, 2016.
  • [47] J. Zhang, A. K. Saibaba, M. E. Kilmer, and S. Aeron. A randomized tensor singular value decomposition based on the t-product. Numer. Linear Algebra Appl., 25(5):e2179, 2018.
  • [48] X. Zhang, L. Qi, and Y. Ye. The cubic spherical optimization problems. Math. Comput., 81(279):1513–1525, 2012.

Appendix A Another get_v_from_M Procedure

Procedure \mathbf{v}=get_v_from_M(M) (Procedure D)
1. Randomly and uniformly draw a vector \mathbf{y}\in\mathbb{S}^{m-1}, where \mathbb{S}^{m-1} denotes the unit sphere in \mathbb{R}^{m};
2. Return \mathbf{v}=M\mathbf{y}.
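
A Matlab sketch of Procedure D is as follows; step 1 is realized by normalizing a Gaussian vector, which is uniformly distributed on the unit sphere.

function v = get_v_from_M_D(M)                     % Procedure D, a two-line sketch
m = size(M,2);
y = randn(m,1); y = y/norm(y);                     % y uniform on the unit sphere S^{m-1}
v = M*y;
end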

Lemma A.1.

Let \mathbf{v}\in\mathbb{R}^{n} be generated from M\in\mathbb{R}^{n\times m} by Procedure D. Then it holds that \operatorname{\mathsf{E}}\left\|\mathbf{v}\right\|^{2}=\frac{1}{m}\left\|M\right\|_{F}^{2}.

Proof.

Write the rows of M as \mathbf{a}_{1}^{\top},\ldots,\mathbf{a}_{n}^{\top}. Then \left\|\mathbf{v}\right\|^{2}=\left\|M\mathbf{y}\right\|^{2}=\sum^{n}_{i=1}\left\langle\mathbf{a}_{i},\mathbf{y}\right\rangle^{2}; taking expectation and applying Lemma A.3 (which relies on Lemma A.2) to each term gives \operatorname{\mathsf{E}}\left\|\mathbf{v}\right\|^{2}=\frac{1}{m}\sum^{n}_{i=1}\left\|\mathbf{a}_{i}\right\|^{2}=\frac{1}{m}\left\|M\right\|_{F}^{2}. ∎
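
A quick Monte Carlo sanity check of Lemma A.1 (a Matlab sketch with arbitrary illustrative sizes):

% Empirically, the sample mean of ||M*y||^2 should be close to ||M||_F^2 / m.
n = 20; m = 50; M = randn(n,m); T = 1e5; s = 0;
for trial = 1:T
    y = randn(m,1); y = y/norm(y);                 % Procedure D's random direction
    s = s + norm(M*y)^2;
end
disp([s/T, norm(M,'fro')^2/m])                     % the two numbers should nearly coincide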

Lemma A.2.

Let 𝐲\mathbf{y} be randomly and uniformly drawn from the unit sphere 𝕊m1\mathbb{S}^{m-1} in m\mathbb{R}^{m}. Then it holds that

𝖤(yk)2=1/m,𝖤yk1yk2=0,\operatorname*{\mathop{\mathsf{E}}}\mathbf{(}y^{k})^{2}={1}/{m},~{}\operatorname*{\mathop{\mathsf{E}}}y^{k_{1}}y^{k_{2}}=0,

where k,k1,k2=1,,mk,k_{1},k_{2}=1,\ldots,m, k1k2k_{1}\neq k_{2}, and yky^{k} denotes the kk-th entry of 𝐲\mathbf{y}.

Proof.

The first property comes from [12, p. 896]. The second one follows from symmetry. Let \mathbf{z}\in\mathbb{R}^{m} be defined such that z^{k}=y^{k} for all k\neq k_{1}, and z^{k_{1}}=-y^{k_{1}}. Then \mathbf{z} is also uniformly distributed on the unit sphere, which means that \operatorname{\mathsf{E}}z^{k_{1}}z^{k_{2}}=\operatorname{\mathsf{E}}y^{k_{1}}y^{k_{2}}; on the other hand, \operatorname{\mathsf{E}}z^{k_{1}}z^{k_{2}}=-\operatorname{\mathsf{E}}y^{k_{1}}y^{k_{2}} by the definition of \mathbf{z}, and hence \operatorname{\mathsf{E}}y^{k_{1}}y^{k_{2}}=0. Since k_{1} and k_{2} are arbitrary, the result follows. ∎

Lemma A.3.

Let 𝐲\mathbf{y} be randomly and uniformly drawn from 𝕊m1\mathbb{S}^{m-1}. For any constant vector 𝐚m\mathbf{a}\in\mathbb{R}^{m}, it holds that 𝖤𝐚,𝐲2=m1𝐚2.\operatorname*{\mathop{\mathsf{E}}}\left\langle\mathbf{a},\mathbf{y}\right\rangle^{2}=m^{-1}{\left\|\mathbf{a}\right\|^{2}}.

Proof.

We have from Lemma A.2 that

\operatorname{\mathsf{E}}\left\langle\mathbf{a},\mathbf{y}\right\rangle^{2}=\operatorname{\mathsf{E}}\sum^{m}_{k_{1}=1}\nolimits\sum^{m}_{k_{2}=1}\nolimits a^{k_{1}}y^{k_{1}}a^{k_{2}}y^{k_{2}}=\sum^{m}_{k=1}\nolimits(a^{k})^{2}\operatorname{\mathsf{E}}(y^{k})^{2}+\sum_{k_{1}\neq k_{2}}\nolimits a^{k_{1}}a^{k_{2}}\operatorname{\mathsf{E}}y^{k_{1}}y^{k_{2}}=m^{-1}\left\|\mathbf{a}\right\|^{2}. ∎