On Approximation Algorithm for Orthogonal Low-Rank Tensor Approximation
Abstract
The goal of this work is to fill a gap in [Yang, SIAM J. Matrix Anal. Appl., 41 (2020), 1797–1825]. In that work, an approximation procedure was proposed for orthogonal low-rank tensor approximation; however, the approximation lower bound was established only when the number of orthonormal factors is one. To fill this gap, by further exploring the multilinearity and orthogonality of the problem, we introduce a modified approximation algorithm. An approximation lower bound is established, in either the deterministic or the expected sense, no matter how many orthonormal factors there are. In addition, a major feature of the new algorithm is its flexibility in allowing either deterministic or randomized procedures to solve a key step for each latent orthonormal factor involved in the algorithm. This feature can reduce the computation of large SVDs, making the algorithm more efficient. Some numerical studies are provided to validate the usefulness of the proposed algorithm.
Keywords: tensor; orthogonality; approximation algorithm; approximation bound; polar decomposition
1 Introduction
Tensors (hypermatrices) nowadays play an important role in signal processing, data analysis, statistics, and machine learning, key to which are tensor decompositions/approximations [19, 6, 4, 35]. Canonical polyadic (CP) decomposition factorizes the data tensor into a sum of rank-1 components. Such a decomposition can be unique under mild assumptions [20, 32, 36]; however, the degeneracy of its associated optimization problem requires one to impose additional constraints, such as nonnegativity, angularity, and orthogonality [18, 26, 25]. Orthogonal CP decomposition allows one or more of the latent factors to be orthonormal [18, 11], giving flexibility to model applications arising from image processing, joint SVD, independent component analysis, DS-CDMA systems, and so on [34, 5, 40, 30, 8, 37].
Most existing algorithms for solving orthogonal low-rank tensor decomposition/approximation are iterative algorithms [3, 39, 44, 29, 11, 45, 15, 23, 22, 33], just to name a few. To find a good feasible initializer for iterative algorithms, an approximation procedure was developed in [45, Procedure 3.1], which finds the orthonormal factors by means of the celebrated HOSVD [9] and then computes the non-orthonormal ones by solving tensor rank-1 approximation problems. Some numerical tests showed that, with the help of this procedure, the alternating least squares (ALS) method can recover the latent factors better.
However, a gap was left: the approximation lower bound of [45, Procedure 3.1] was established only when the number of orthonormal factors is one. Denote by $t$ the number of latent orthonormal factors and by $d$ the order of the data tensor under consideration. The difficulty in extending the lower bound analysis of [45, Procedure 3.1] to arbitrary $t$ lies in that the underlying HOSVD is essentially designed for the Tucker format, which yields cross-terms that cause trouble in the analysis of the current CP format.
To fill this gap, in this work, by further exploring the multilinearity and orthogonality of the problem, we devise a new approximation algorithm for orthogonal low-rank tensor CP approximation, which is a modification of [45, Procedure 3.1]. In fact, we make the new algorithm somewhat more general, allowing either deterministic or randomized procedures to solve a key step for each latent orthonormal factor involved in the algorithm. Following a certain principle, three procedures for this step are considered, two deterministic and one randomized, leading to three versions of the approximation algorithm. The approximation lower bound is derived, either in the deterministic or in the expected sense, no matter how many latent orthonormal factors there are. The most expensive operation of the proposed algorithm is a single SVD of a mode unfolding matrix, making the algorithm more efficient than [45, Procedure 3.1]. Numerical studies show the efficiency of the introduced algorithm and its usefulness in recovery and clustering, and that it compares favorably with [45, Procedure 3.1].
Note that approximation algorithms have been widely proposed for tensor approximations. For example, the HOSVD, the sequentially truncated HOSVD [42], and the hierarchical HOSVD [10] are also approximation algorithms for the Tucker format. On the other hand, several randomized approximation algorithms have been proposed in recent years for the Tucker format or the t-SVD [27, 2, 47, 1]. The solution quality of the aforementioned algorithms was measured via error bounds. In the context of tensor best rank-1 approximation problems, deterministic or randomized approximation algorithms were developed [13, 12, 38, 7, 17], where the solution quality was measured by approximation bounds; our results extend these to rank-$R$ approximations.
2 Preliminaries
Throughout this work, vectors are written as boldface lowercase letters, matrices as italic capital letters, and tensors as calligraphic capital letters. $\mathbb{R}^{n_1\times\cdots\times n_d}$ denotes the space of $d$-th order real tensors. For two tensors of the same size, their inner product $\langle\cdot,\cdot\rangle$ is given by the sum of their entry-wise products. The Frobenius (or Hilbert–Schmidt) norm of a tensor $\mathcal{A}$ is defined by $\|\mathcal{A}\|_F=\langle\mathcal{A},\mathcal{A}\rangle^{1/2}$; $\circ$ denotes the outer product; in particular, for vectors $u^j\in\mathbb{R}^{n_j}$, $j=1,\ldots,d$, $u^1\circ\cdots\circ u^d$ denotes a rank-1 tensor in $\mathbb{R}^{n_1\times\cdots\times n_d}$, which we write as $\bigotimes_{j=1}^{d}u^j$ for short. For $R$ vectors of size $n_j$, namely $u_1^j,\ldots,u_R^j$, we usually collect them into a matrix $U_j=[u_1^j,\ldots,u_R^j]\in\mathbb{R}^{n_j\times R}$.
The mode-$j$ unfolding of $\mathcal{A}\in\mathbb{R}^{n_1\times\cdots\times n_d}$, denoted as $\mathcal{A}_{(j)}$, is a matrix in $\mathbb{R}^{n_j\times\prod_{k\neq j}n_k}$. The product between a tensor and a vector with respect to the $j$-th mode, the tensor-vector product $\mathcal{A}\times_j u^\top$ with $u\in\mathbb{R}^{n_j}$, is a tensor in $\mathbb{R}^{n_1\times\cdots\times n_{j-1}\times n_{j+1}\times\cdots\times n_d}$.
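For concreteness, here is a minimal NumPy sketch of the mode unfolding and the tensor-vector product just defined (the helper names and the ordering of the remaining modes in the unfolding are our own choices; other conventions exist):

```python
import numpy as np

def unfold(A, mode):
    # Mode unfolding: bring the chosen mode to the front and flatten the rest.
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)

def tensor_vec_product(A, u, mode):
    # Contract A with the vector u along the given mode; the result has one fewer mode.
    return np.tensordot(A, u, axes=([mode], [0]))

A = np.random.randn(3, 4, 5)
u = np.random.randn(5)
print(unfold(A, 2).shape)                 # (5, 12)
print(tensor_vec_product(A, u, 2).shape)  # (3, 4)
```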
Given a $d$-th order tensor $\mathcal{A}\in\mathbb{R}^{n_1\times\cdots\times n_d}$ and a positive integer $R$, the approximate CP decomposition under consideration can be written as [11]:
$$\min_{\sigma_i,\,u_i^j}\ \Big\|\mathcal{A}-\sum_{i=1}^{R}\sigma_i\,u_i^1\circ\cdots\circ u_i^d\Big\|_F^2\quad\text{s.t.}\ \ \|u_i^j\|=1,\ 1\le j\le d-t;\ \ U_j^\top U_j=I,\ d-t+1\le j\le d,\qquad(2.1)$$
where $1\le t\le d$. This model means that the last $t$ latent factor matrices of $\mathcal{A}$ are assumed to be orthonormal, while the first $d-t$ ones are not; however, due to the presence of the scalars $\sigma_i$, the columns $u_i^j$ with $j\le d-t$ can be normalized. These lead to the two types of constraints in (2.1). Using orthogonality, (2.1) can be equivalently formulated as a maximization problem [11]:
$$\max_{u_i^j}\ \sum_{i=1}^{R}\big\langle\mathcal{A},\,u_i^1\circ\cdots\circ u_i^d\big\rangle^2\quad\text{s.t.}\ \ \|u_i^j\|=1,\ 1\le j\le d-t;\ \ U_j^\top U_j=I,\ d-t+1\le j\le d,\qquad(2.2)$$
where the variables $\sigma_i$ have been eliminated; indeed, by the orthonormality of the last $t$ factors, the rank-1 terms in (2.1) are mutually orthogonal, so the optimal scalars are $\sigma_i=\langle\mathcal{A},u_i^1\circ\cdots\circ u_i^d\rangle$. The approximation algorithms and the corresponding analysis mainly focus on this maximization model.
Some necessary definitions and properties are introduced in the following.
Let $B\in\mathbb{R}^{n\times m}$, $n\ge m$. The polar decomposition of $B$ decomposes it into two matrices $Q\in\mathbb{R}^{n\times m}$ and $H\in\mathbb{R}^{m\times m}$ such that $B=QH$, where $Q$ is columnwisely orthonormal and $H$ is a symmetric positive semidefinite matrix. Its relation with the SVD is given in the following.
Theorem 2.1 (cf. [14]).
Let $B\in\mathbb{R}^{n\times m}$, $n\ge m$. Then there exist a columnwisely orthonormal $Q\in\mathbb{R}^{n\times m}$ and a unique symmetric positive semidefinite matrix $H\in\mathbb{R}^{m\times m}$ such that
$B=QH$
is the polar decomposition of $B$. If $\operatorname{rank}(B)=m$, then $H$ is symmetric positive definite and $Q$ is uniquely determined.
Furthermore, let $H=V\Sigma V^\top$ be the eigenvalue decomposition of $H$, namely, $V\in\mathbb{R}^{m\times m}$ is orthogonal and $\Sigma$ is a diagonal matrix with $\Sigma_{11}\ge\cdots\ge\Sigma_{mm}\ge 0$. Then $\sigma_i(B)=\Sigma_{ii}$, $i=1,\ldots,m$, and $B=(QV)\Sigma V^\top$ is a reduced SVD of $B$.
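As an illustration of the SVD-based construction in Theorem 2.1, the following is a small NumPy sketch (the function name is ours): the thin SVD $B=U\Sigma V^\top$ gives $Q=UV^\top$ and $H=V\Sigma V^\top$.

```python
import numpy as np

def polar_decomposition(B):
    # B = Q @ H with Q columnwisely orthonormal and H symmetric positive semidefinite.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    Q = U @ Vt
    H = Vt.T @ np.diag(s) @ Vt
    return Q, H

B = np.random.randn(6, 3)
Q, H = polar_decomposition(B)
print(np.allclose(Q @ H, B), np.allclose(Q.T @ Q, np.eye(3)))
```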
Write $B=[b_1,\ldots,b_m]$ and $Q=[q_1,\ldots,q_m]$, where $b_i,q_i\in\mathbb{R}^{n}$, $i=1,\ldots,m$. The following lemmas are useful in the approximation bound analysis.
Lemma 2.1.
Let $Q$ and $H$ be the two matrices generated by the polar decomposition of $B$ such that $B=QH$, where $Q$ is columnwisely orthonormal and $H$ is symmetric positive semidefinite. Let $b_i$ and $q_i$ be defined as above. Then it holds that $\langle b_i,q_i\rangle=H_{ii}\ge 0$, $i=1,\ldots,m$.
Proof.
Since $B=QH$ and $Q^\top Q=I$, we have $Q^\top B=H$, which means that $\langle q_i,b_i\rangle$ is exactly the $i$-th diagonal entry of $H$. As $H$ is positive semidefinite, its diagonal entries are nonnegative. The required result follows. ∎
Denote by $\|\cdot\|_*$ the nuclear norm of a matrix, i.e., the sum of its singular values. From Theorem 2.1 we easily see the following:
Lemma 2.2.
Let $B$ and $H$ be defined as in Theorem 2.1. Then $\operatorname{tr}(H)=\|B\|_*$.
3 Approximation Algorithm
The approximation algorithm proposed in this section inherits certain ideas of [45, Procedure 3.1], and so we briefly recall its strategy here. [45, Procedure 3.1] obtains an approximate solution to (2.2) as follows: one first applies the truncated HOSVD [9] to get the orthonormal factors; specifically, for each $j=d-t+1,\ldots,d$, one unfolds $\mathcal{A}$ into the matrix $\mathcal{A}_{(j)}$ and then performs the truncated SVD to get its $R$ leading left singular vectors, which form the $j$-th factor $U_j$. Once $U_{d-t+1},\ldots,U_d$ are obtained, one then approximately solves $R$ tensor best rank-1 approximation problems, given the data tensors $\mathcal{A}\times_{d-t+1}(u_i^{d-t+1})^\top\cdots\times_d(u_i^d)^\top$, $i=1,\ldots,R$, namely,
$$\max_{\|u_i^1\|=\cdots=\|u_i^{d-t}\|=1}\ \big\langle\mathcal{A}\times_{d-t+1}(u_i^{d-t+1})^\top\cdots\times_d(u_i^d)^\top,\ u_i^1\circ\cdots\circ u_i^{d-t}\big\rangle\qquad(3.3)$$
for $i=1,\ldots,R$.
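To make the recalled strategy concrete, the following NumPy sketch shows only the truncated-HOSVD step that produces an orthonormal factor from an unfolding (helper names are ours; the rank-1 subproblems are discussed below):

```python
import numpy as np

def unfold(A, mode):
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)

def hosvd_factor(A, mode, R):
    # R leading left singular vectors of the mode unfolding, as in the truncated HOSVD [9].
    U, _, _ = np.linalg.svd(unfold(A, mode), full_matrices=False)
    return U[:, :R]
```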
Such a strategy for obtaining an approximate solution to (2.2) works well empirically, while the approximation bound was established only when $t=1$ [45, Proposition 3.2]. We find it difficult to derive the approximation bound for general tensors when $t\ge 2$; this is because we unavoidably encounter certain contracted tensors during the analysis. Such tensors might be easier to deal with in the Tucker model [9]; however, they consist of cross-terms that are not welcome in problem (2.2).
In view of this difficulty, we will modify [45, Procedure 3.1] so that it is more suitable for problem (2.2). The new algorithm is still designed under the guidance of problem (3.3): we first compute the orthonormal factors $U_{d-t+1},\ldots,U_d$, and then find the non-orthonormal ones by solving tensor rank-1 approximation problems. We differ from [45] in the computation of the orthonormal factors. Taking (2.2) with $d=t=3$ as an example, we state our idea in what follows, with the workflow illustrated in Fig. 1.
The computation of $U_3$ is the same as in [45]: we let $U_3=[u_1^3,\ldots,u_R^3]$, where $u_1^3,\ldots,u_R^3$ are the $R$ leading left singular vectors of the unfolding matrix $\mathcal{A}_{(3)}$.
Finding $U_2$ is a bit more complicated than in [45]; it can be divided into a splitting step and a gathering step. In the splitting step, with $U_3$ at hand, we first compute the matrices $M_i$, $i=1,\ldots,R$, where $M_i\in\mathbb{R}^{n_2\times n_1}$ is the mode-2 unfolding of $\mathcal{A}\times_3(u_i^3)^\top$.
Then, using a procedure get_v_from_M that will be specified later, we compute a vector $v_i$ of size $n_2$ from $M_i$ for each $i$. The principle of get_v_from_M is to extract as much information from $M_i$ as possible while satisfying certain theoretical bounds. We will consider three versions of such a procedure, which will be detailed later. In the gathering step, we first denote $V=[v_1,\ldots,v_R]$. As $V$ may not be columnwisely orthonormal, it is not feasible. We thus propose to apply the polar decomposition to $V$ to obtain the orthonormal matrix $U_2$.
Computing $U_1$ is similar to computing $U_2$. In the splitting step, with $U_3$ and $U_2$ at hand, we first compute the vectors $\mathcal{A}\times_2(u_i^2)^\top\times_3(u_i^3)^\top$, $i=1,\ldots,R$. Since these are already vectors, we directly take them as the $v_i$'s. In the gathering step, we let $V=[v_1,\ldots,v_R]$ and perform the polar decomposition to get $U_1$.
For general $d$ and $t$, we can similarly apply the above splitting and gathering steps to obtain $U_d,U_{d-1},\ldots,U_{d-t+1}$ sequentially. In fact, the design of the splitting and gathering steps follows the principle that
(3.4) |
which, together with the associated factor, will be studied in detail in the next section. If $t<d$, i.e., there exist non-orthonormal constraints in (2.2), then with $U_{d-t+1},\ldots,U_d$ at hand, we compute the remaining factors via approximately solving the $R$ tensor rank-1 approximation problems (3.3). The whole algorithm is summarized in Algorithm 1.
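The following NumPy sketch illustrates one splitting + gathering step for the $d=t=3$ example above. It reflects our reading that $M_i$ is the mode-2 unfolding of $\mathcal{A}\times_3(u_i^3)^\top$, and it uses Procedure A (introduced below) in the role of get_v_from_M; all function names are ours, and this is not a full implementation of Algorithm 1.

```python
import numpy as np

def unfold(A, mode):
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)

def polar_orthonormal(V):
    # Gathering step: the orthonormal polar factor of V.
    U, _, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ Vt

def get_v_from_M_A(M):
    # Procedure A: leading left singular vector scaled by the leading singular value.
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return s[0] * U[:, 0]

def next_orthonormal_factor(A, U3):
    # Splitting: M_i is A contracted with u_i^3 along mode 3, arranged with n2 rows.
    # Gathering: polar decomposition of [v_1, ..., v_R].
    V = [get_v_from_M_A(np.tensordot(A, U3[:, i], axes=([2], [0])).T)
         for i in range(U3.shape[1])]
    return polar_orthonormal(np.column_stack(V))

n1, n2, n3, R = 5, 6, 7, 3
A = np.random.randn(n1, n2, n3)
U3 = np.linalg.svd(unfold(A, 2), full_matrices=False)[0][:, :R]  # R leading left singular vectors
U2 = next_orthonormal_factor(A, U3)
print(U2.shape, np.allclose(U2.T @ U2, np.eye(R)))
```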
Some remarks on Algorithm 1 are given as follows.
Remark 3.1.
- 1.
- 2. The computation in the splitting step can be made more efficient: the quantities needed for the current factor can be formed recursively from those already computed for the previous factor, which reduces the computational complexity.
- 3. The splitting step can be computed in parallel.
- 4.
On the procedure get_v_from_M
How to obtain $v$ from $M$ is important both in theory and in practice. To retain as much information of $M$ in $v$ as possible, we expect that
(3.5) |
for some factor. We thus provide three simple procedures to achieve this, two deterministic and one randomized. To simplify the notation, we omit the subscripts. The matrix $M$ in the procedures is of size $p\times q$.
The first idea is straightforward: we let $v$ be the leading left singular vector of $M$ times the leading singular value of $M$:
(A) 1. Compute the leading right unit singular vector of $M$, denoted as $y$; 2. Return $v=My$.
The second one is to first pick the row of $M$ with the largest magnitude and normalize it to yield $y$; finally, let $v=My$:
(B) 1. Denote $m_k$ as the $k$-th row of $M$. Let $m_{k^*}$ be the row with the largest magnitude, i.e., $\|m_{k^*}\|\ge\|m_k\|$ for all $k$; if multiple rows attain the largest magnitude, choose the first one; 2. Denote $y=m_{k^*}^\top/\|m_{k^*}\|$; 3. Return $v=My$.
The third one is similar to the second one, but to be more efficient, we randomly and uniformly pick a row of $M$, normalize it to yield $y$, and then let $v=My$:
(C) 1. Randomly and uniformly choose $k\in\{1,\ldots,p\}$; 2. Denote $y=m_k^\top/\|m_k\|$; 3. Return $v=My$.
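A minimal NumPy rendering of Procedures A–C as stated above (function names are ours; zero rows are not guarded against in this sketch):

```python
import numpy as np

def procedure_A(M):
    # Leading left singular vector of M scaled by the leading singular value,
    # i.e., M @ y with y the leading right unit singular vector.
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return s[0] * U[:, 0]

def procedure_B(M):
    # Normalize the row of M with the largest norm (first one in case of ties); return M @ y.
    k = int(np.argmax(np.linalg.norm(M, axis=1)))
    y = M[k] / np.linalg.norm(M[k])
    return M @ y

def procedure_C(M, rng=np.random.default_rng()):
    # Normalize a uniformly random row of M; return M @ y.
    k = rng.integers(M.shape[0])
    y = M[k] / np.linalg.norm(M[k])
    return M @ y
```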
We postpone the analysis of (3.5) for each procedure to the next section. We also remark that the factor achieved in each procedure is important for the approximation bound analysis.
On the procedure rank1approx
To be more general, the rank1approx procedure in Algorithm 1 can be any efficient approximation algorithm for solving tensor best rank-1 approximation problems, such as [13, Algorithm 1 with DR 2], [48, Sect. 5], [45, Procedure 3.3], or even Algorithm 1 itself (as mentioned in Remark 3.1). For the purpose of approximation bound analysis, we require rank1approx to have an approximation bound characterization, i.e., for any data tensor of the relevant order, the normalized solution returned by rank1approx admits an approximation bound of the form:
(3.6) |
where the factor is specified below. For [13, Algorithm 1 with DR 2], it can be checked from [13, Sect. 3.1] that when ,
for [45, Procedure 3.3], it follows from [45, Proposition 3.1] that
when the data is a vector or a matrix, corresponding factors are also available for both of the two algorithms.
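As an illustration only (not a reproduction of any of the cited procedures), here is one simple sequential scheme of this flavor in NumPy: it peels off one unit factor per mode via the leading left singular vector of the current mode-1 unfolding and then contracts; the function name and details are our own assumptions.

```python
import numpy as np

def rank1approx(B):
    # Returns unit vectors (u^1, ..., u^d) and the attained value <B, u^1 o ... o u^d>.
    factors = []
    T = np.asarray(B, dtype=float)
    while T.ndim > 1:
        U, _, _ = np.linalg.svd(T.reshape(T.shape[0], -1), full_matrices=False)
        u = U[:, 0]
        factors.append(u)
        T = np.tensordot(u, T, axes=([0], [0]))   # contract the current first mode
    last = T / np.linalg.norm(T)
    factors.append(last)
    return factors, float(np.dot(T, last))        # the value equals the norm of the final vector
```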
Computational complexity of Algorithm 1
We analyze the computational complexity of Algorithm 1. To simplify the presentation, we consider . Computing is a truncated SVD, whose complexity is . In the computation of , computing and require flops; since , computing requires flops if Procedure A is used, while it takes flops for Procedures B and C. In any case, this is dominated by . Thus the complexity of the splitting step is . The main effort of the gathering step is the polar decomposition of , with complexity . Hence computing requires flops.
The complexity of computing is similar: from Remark 3.1, computing and needs flops. The total complexity of is
For the analysis is analogous. Denote as the complexity of the rank-1 approximation procedure, depending on which procedure is used. Then the updating of has complexity
where the first term comes from computing . Summarizing the above discussions, we have
4 Approximation Bound Analysis
This section is organized as follows: approximation bound results for general tensors are presented in Sect. 4.1, some discussions are given in Sect. 4.2, and the results are then improved for nearly orthogonal tensors in Sect. 4.3.
To begin with, we need some preparations. Recall that there is an intermediate variable in Procedures A–C. This variable is important in the analysis. To distinguish with respect to each and , when obtaining , we denote the associated as , which is of size . From the procedures, we see that . Since and , we have the following expression of :
where we denote
(4.8) |
Note that because according to Procedures A–C, is a unit vector.
The above expression is only well-defined when (see Algorithm 1). To make it consistent when (which happens when ), we set ; we also denote accordingly. Then it is clear that (4) still makes sense when .
On the other hand, in the algorithm is only defined when . For convenience, we denote
(4.9) |
where denotes the -th largest singular value of .
Another notation to be defined is . Note that in the algorithm, is only defined when . We thus similarly define , . When , ’s are scalars.
4.1 Approximation bound for general tensors
Since, when $t=1$, Algorithm 1 coincides with [45, Procedure 3.1] if they take the same best rank-1 approximation procedure, similarly to [45, Proposition 3.2] we have that, when $t=1$:
(4.10) |
where the inequality comes from (3.6), and the last equality is due to the fact that the unfolding of the corresponding tensor into a vector is exactly the quantity defined in (4.9).
Thus the remaining part mainly focuses on the cases $t\ge 2$. To make the analysis more readable, we first present an overview and an informal analysis of the approximation bound, under the setting that (3.4) holds with a certain factor, and then we prove the validity of (3.4) and detail the associated factor. The formal approximation bound results are stated at the end of this subsection.
Lemma 4.1 (Informal approximation bound of Algorithm 1).
Let and let be generated by Algorithm 1. If for each , there holds
(4.11) |
where , then we have the following approximation bound:
where is the rank-1 approximation ratio of tensors, as noted in (3.6).
Proof.
To make the analysis below consistent with the case $t=d$, which does not require performing rank-1 approximations, we denote and . It then holds that
where the first inequality follows from the setting (3.6), and the second one is due to that , and that and has the same size as ; recall that was defined in (4.8). From the definition of and (4), we have
It thus follows from (4.11) that
From (4.9) and that , we see that which is clearly an upper bound of . Thus the required approximation bound follows. ∎
The above lemma gives an overview of the analysis of the approximation bound. It also shows that to make the analysis go through, the chain inequality (4.11) plays a central role. Thus the remaining task of this section is to establish (4.11) with an explicit factor .
4.1.1 Chain inequality (4.11) analysis
To establish the connection between and , we need to evaluate the performance of get_v_from_M, namely, we need to estimate (3.5). As there are three choices of get_v_from_M, we first simply assume that (3.5) holds with a certain factor, and later we analyze it in detail.
Lemma 4.2 (Informal chain inequality).
Remark 4.1.
Proof.
Since the orthonormal ’s, , are given by the polar decomposition of , from Lemma 2.1, we have . Using the relation that for nonnegative ’s, we have
for each . Lemma 2.2 shows that ; the norm relationship gives that . Thus
We distinguish the cases that and . For the former case,
where the equality in the third line follows from that is the unfolding of , and the third inequality uses the fact that where was defined in (4.8).
When , since is not defined, there is a slight difference after the third line of (4.1.1). In this case, we have
where the first inequality comes from a similar analysis as (4.1.1), the first equality follows from the definition of , and the last one is due to the definition of in (4.9). The required inequality thus follows. ∎
The remaining task is to specify (4.12). Denote as the expectation operator.
Lemma 4.3.
Proof.
If is generated by Procedure B, we have
where we recall that denotes the -th row of , and is the row with the largest magnitude.
If is generated by Procedure C, since is uniformly and randomly chosen from , this means that with equal probability, the random variable takes the value , , i.e.,
Therefore,
The proof has been completed. ∎
4.1.2 Putting the pieces together
Combining the previous analysis, we can state our formal results.
Lemma 4.4 (Formal chain inequality).
Proof.
Remark 4.2.
The expectation here is taken sequentially with respect to , i.e., it is in fact
Theorem 4.1 (Formal approximation bound).
Reducing to the best rank-1 approximation problem (one can remove the square in the objective function), i.e., , from Theorem 4.1 we have the following corollary.
Corollary 4.1.
Let , be generated by Algorithm 1 with and . Then the approximation ratio is (either in deterministic or expected sense).
4.2 Discussions
Several discussions concerning the approximation bound in Theorem 4.1 are provided in this subsection.
Firstly, Theorem 4.1 derives the worst-case theoretical approximation bound for general tensors. Looking into Lemma 4.2, we see that the approximation ratio comes from two parts for each factor: the lower bound of and the lower bound of . The former reaches the worst bound when is row-wisely orthogonal with each row having equal magnitude; the latter attains the worst bound when is of rank-1 and each column admits equal magnitude. These two worst cases rarely happen simultaneously, meaning that it is possible to further improve the theoretical lower bound. Currently, we cannot achieve this. Instead, we compare the real ratio and the theoretical ratio via an example as follows.
We generate tensors , where every entry obeys the standard normal distribution. We set for the problem. varies from to , where for each , instances are generated. We use Procedure A in the algorithm. In Fig. 2(a), we plot the curve of the real ratio and curve of the theoretical ratio when . The figure shows that although the real ratio is better, it still approaches the theoretical one as increases. In Fig. 2(b), we fix the size of the tensor as , and vary from to . We plot the real ratio and the theoretical one . The figure shows that the real ratio is far better, while it also decreases as increases.
To further investigate the influence of and on the bound, we also plot the whole real ratio and the theoretical one in Fig. 2(c). The tensors are the same as above, where in the left panel, we vary from to with fixed; in the right one, varies from to with fixed. The real ratio is in red, and the theoretical ratio is in black with diamond markers. We see that the real ratio decreases as increases, confirming the observation of Fig. 2(a); moreover, the real ratio is almost parallel to the theoretical ratio, showing that the theoretical ratio is reasonable up to a constant (if we view as a constant). The right panel shows that when increases, the real ratio does not decrease as much as the theoretical one, showing that the term might be too loose in the theoretical ratio. Thus we also plot the curve of , which is in black with right-arrow markers. We see that it still lies under the curve of the real ratio, meaning that it is possible to improve the theoretical ratio in the absence of . This still needs further research.
Secondly, one may wonder whether error estimations of the kind available for the truncated HOSVD [9] can be established for the proposed algorithm, where the approximation constructed from the computed factors is compared against the optimal solution. This seems to be difficult for a general tensor in the current CP format. The reason may be that, unlike the HOSVD format, in which every tensor can be represented [9, Theorem 2] and for which the truncated HOSVD gives a good approximation, not every tensor admits a CP decomposition with orthonormal factors. Therefore, such an estimation is not easy for a general tensor.
Thirdly, in the algorithm, other procedures for computing from are possible. Another randomized example is given in Appendix A. Lemma A.1 shows that the factor of this procedure is worse than those of Procedures A-C. However, it is expected that this procedure might be more efficient.
Finally, when this work was almost finished, we realized that in problem (2.2), if , , and Procedure A is taken in Algorithm 1, then the algorithm boils down essentially to [13, Algorithm 1 with DR 2] for best rank-1 approximation problems. In this regard, Algorithm 1 generalizes [13, Algorithm 2] to orthogonal rank-$R$ approximation cases, and involves more procedures besides Procedure A.
4.3 Approximation results for nearly orthogonal tensors
We briefly consider the following type of tensors in this subsection:
Assumption 4.1.
(4.14) |
where the last ’s are orthonormal, and the first ’s are columnwisely normalized. In addition, we assume that . denotes the noise tensor.
The cases that for some need more assumptions and analysis, which is out of the scope of this work, and will be studied in the future.
Running Algorithm 1 on with , it is easy to obtain the following results, where for completeness, we provide a short discussion. Denote as the matrix whose -th column is the vectorization of .
Proposition 4.1.
Proof.
To achieve this, we first show that during the algorithm,
(4.15) |
where . If (4.15) holds, then since is given by the polar decomposition of , by Assumption 4.1, it follows
(4.16) |
(4.15) can be proved by induction method starting from . When , this is in fact given by (4.9). To see this, we only need to show that where we recall that denotes the -th largest singular value of a matrix. Under Assumption 4.1 when , is orthonormal, giving that is a reduced SVD of , and the results follow. Assume that (4.15) holds when ; then . Therefore,
which is a rank-1 tensor. Since is the unfolding of , it can be seen that generated by Procedures A, B, C is just the vectorization of , and so , which demonstrates that (4.15) holds when . Thus (4.15) and (4.16) are valid for .
Since is of rank-1, the rank-1 approximation procedure generates that , . Hence, . This completes the proof. ∎
From the above discussions, and by matrix perturbation theory, it is not hard to obtain the perturbed version of Proposition 4.1.
Proposition 4.2.
Under the setting of Proposition 4.1, where now is small enough, we have , where the constant behind the big O is nonnegative, and
It remains to consider the case that in Assumption 4.1, i.e., only is orthonormal. In general, if there are no additional assumptions on , , then it is hard to obtain approximation results. For instance, if where is orthonormal but is not, then it is in general hard to recover and . We thus make the following incoherence assumption:
Assumption 4.2.
There is at least one , which is incoherent with modulus , i.e.,
It is clear that is a nearly orthonormal matrix if is small enough. In what follows, we assume without loss of generality that satisfies Assumption 4.2. We immediately have:
Proposition 4.3.
If is incoherent with modulus , then is also incoherent with modulus .
We consider the expression of :
namely, is a perturbation of the eigen-decomposition , given that and are small enough. If Assumption 4.1 holds, by matrix perturbation theory, the above implies that
where is generated by the algorithm. It then follows that is a nearly rank-1 tensor with perturbation , and so
We thus have a result similar to Proposition 4.2:
5 Numerical Study
We evaluate the performance of Algorithm 1 with Procedures A, B, and C, and compare them with [45, Procedure 3.1]. All the computations are conducted on an Intel i7-7770 CPU desktop computer with 32 GB of RAM. The supporting software is Matlab R2019b. The Matlab package Tensorlab [43] is employed for tensor operations.
Performance of approximation algorithms
We compare Algorithm 1 equipped with Procedures A, B, and C against [45, Procedure 3.1] in this part. They are respectively marked as Alg. 1 (A), Alg. 1 (B), Alg. 1 (C), and Yang2020. All the algorithms employ the same rank1approx procedure [45, Procedure 3.2] for solving the rank-1 approximation subproblems. The tensors are randomly generated with every entry obeying the standard normal distribution. The presented results are averaged over instances for each case.
We first fix , and vary from to . The results are depicted in Fig. 3. Alg. 1 (A) is in red with star markers, Alg. 1 (B) is in magenta with circle markers, Alg. 1 (C) is in blue with square markers, and Yang2020 is in black with right-arrow markers. The left panels show the curves of the objective values versus , from which we observe that Alg. 1 (A) performs the best, followed by Alg. 1 (B); this is because, by using the SVD, Procedure A retains more information than the other procedures. Yang2020 does not perform as well as Algorithm 1, and its performance gets worse as increases; this has also been pointed out in [45, Table 3.1], while the performance of Algorithm 1 is more stable. Therefore, we conclude that Alg. 1 exploits the structure of the problem more than Yang2020. The right panels show the CPU time versus , from which we see that Yang2020 needs more time than Alg. 1, as Yang2020 requires performing SVDs of size (assuming ), while the most expensive step of Alg. 1 is only one SVD of this size. Considering Procedures A, B, and C, there is no doubt that Alg. 1 (A) is the most expensive one among the three, followed by Alg. 1 (B). The randomized version, i.e., Alg. 1 (C), is clearly the cheapest one.
We then fix , , and vary from to . The results are plotted in Fig. 4, from which we can still observe that Alg. 1 (A) performs the best in terms of the objective value. Considering the CPU time, Alg. 1 (B) and Alg. 1 (C) seem to be far better; this shows the superiority of SVD-free procedures for computing the vector .
|    |                     | Yang2020        | Alg. 1 (A)      | Alg. 1 (B)      | Alg. 1 (C)      | Random |
|----|---------------------|-----------------|-----------------|-----------------|-----------------|--------|
| 60 | rel.err.(rel.err0)  | 0.0139 (0.0366) | 0.0140 (0.0295) | 0.0137 (0.0348) | 0.0218 (0.0680) | 0.0291 |
|    | iter.               | 7.32            | 6.26            | 8.70            | 8.04            | 18.72  |
|    | time(time0)         | 0.59 (0.17)     | 0.52 (0.15)     | 0.56 (0.09)     | 0.53 (0.08)     | 0.92   |
| 70 | rel.err.(rel.err0)  | 0.0125 (0.0255) | 0.0125 (0.0177) | 0.0190 (0.0252) | 0.0197 (0.0489) | 0.0285 |
|    | iter.               | 4.5             | 4.3             | 4.4             | 4.7             | 32.9   |
|    | time(time0)         | 0.72 (0.36)     | 0.62 (0.26)     | 0.52 (0.16)     | 0.54 (0.16)     | 2.71   |
| 80 | rel.err.(rel.err0)  | 0.0191 (0.0239) | 0.0186 (0.0216) | 0.0189 (0.0218) | 0.0114 (0.0343) | 0.0193 |
|    | iter.               | 19.2            | 9.2             | 9.7             | 24.4            | 35.08  |
|    | time(time0)         | 3.16 (0.62)     | 1.77 (0.36)     | 1.73 (0.26)     | 3.50 (0.25)     | 4.58   |
| 90 | rel.err.(rel.err0)  | 0.0030 (0.0096) | 0.0030 (0.0057) | 0.0030 (0.0109) | 0.0030 (0.0314) | 0.0108 |
|    | iter.               | 5.5             | 4.9             | 7.6             | 10.4            | 32.1   |
|    | time(time0)         | 2.48 (1.02)     | 1.88 (0.54)     | 2.24 (0.40)     | 2.75 (0.36)     | 6.41   |
Performance of approximation algorithms plus ALS for factor recovery
The tensors are generated similarly to [45, 39]: , where , is an unstructured tensor, and denotes the noise level. Here we set , , , and varies from to . , , and are randomly drawn from a uniform distribution in . The last factors are then orthonormalized by using the orth function, while the first ones are columnwisely normalized. We only test the and cases. We compare ε-ALS [45] initialized by different initializers generated respectively by Yang2020, Alg. 1 (A), Alg. 1 (B), and Alg. 1 (C). ε-ALS initialized by random initializers is used as a baseline. The stopping criterion for ε-ALS is or . The relative error is defined as follows [39, 45]:
where denotes the set of permutation matrices, and the ’s are the factors output by the iterative algorithm. Finding the optimal permutation can be done efficiently by the Hungarian algorithm [21]. The presented results are averaged over instances for each case.
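For illustration, the following sketch matches estimated columns to true columns with the Hungarian algorithm (SciPy's linear_sum_assignment) and reports one possible permutation-matched relative error; the exact normalization used in the paper may differ, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_rel_err(U_est, U_true):
    # Match columns by maximizing absolute correlation, fix the sign ambiguity,
    # then report a relative recovery error.
    C = np.abs(U_true.T @ U_est)
    _, col = linear_sum_assignment(-C)            # Hungarian algorithm (maximization)
    U_perm = U_est[:, col]
    U_perm = U_perm * np.sign(np.sum(U_true * U_perm, axis=0))
    return np.linalg.norm(U_perm - U_true) / np.linalg.norm(U_true)
```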
|    |                     | Yang2020        | Alg. 1 (A)      | Alg. 1 (B)      | Alg. 1 (C)      | Random |
|----|---------------------|-----------------|-----------------|-----------------|-----------------|--------|
| 60 | rel.err.(rel.err0)  | 0.0211 (0.0328) | 0.0214 (0.0263) | 0.0133 (0.0320) | 0.0209 (0.0658) | 0.0706 |
|    | iter.               | 13.4            | 12.0            | 11.3            | 18.6            | 36.7   |
|    | time(time0)         | 0.88 (0.19)     | 0.79 (0.16)     | 0.69 (0.09)     | 1.01 (0.08)     | 1.75   |
| 70 | rel.err.(rel.err0)  | 0.0129 (0.0300) | 0.0129 (0.0209) | 0.0126 (0.0318) | 0.0132 (0.0560) | 0.0369 |
|    | iter.               | 31.6            | 23.3            | 18.6            | 28.9            | 45.9   |
|    | time(time0)         | 3.16 (0.36)     | 2.21 (0.23)     | 1.76 (0.16)     | 2.57 (0.14)     | 4.03   |
| 80 | rel.err.(rel.err0)  | 0.0279 (0.0347) | 0.0276 (0.0294) | 0.0279 (0.0312) | 0.0280 (0.0477) | 0.0361 |
|    | iter.               | 16.8            | 9.6             | 13.3            | 18.5            | 24.8   |
|    | time(time0)         | 2.96 (0.62)     | 1.80 (0.36)     | 2.19 (0.26)     | 2.78 (0.25)     | 3.30   |
| 90 | rel.err.(rel.err0)  | 0.0029 (0.0064) | 0.0029 (0.0041) | 0.0029 (0.0079) | 0.0029 (0.0414) | 0.0194 |
|    | iter.               | 30.6            | 11.5            | 20.1            | 16.9            | 43.5   |
|    | time(time0)         | 6.95 (0.98)     | 3.00 (0.50)     | 4.46 (0.38)     | 3.86 (0.35)     | 8.31   |
The results when and are reported in Tables 1 and 2, in which ‘rel.err0’ and ‘time0’ respectively represent the relative error and the CPU time evaluated at the initializers. From the tables, we first observe that ε-ALS initialized by Yang2020 and Alg. 1 outperforms that with random initializations, considering both the relative error and the computational time. Comparing Alg. 1 with Yang2020, we can see that in most cases, ε-ALS initialized by Alg. 1 performs comparably or slightly better. Comparing among ε-ALS initialized by the three versions of Alg. 1, Alg. 1 (C) is slightly worse in terms of time; this is because, although Alg. 1 (C) is the most efficient, ε-ALS initialized by it needs more iterations than the others. Therefore, it would be necessary to improve the efficiency of the iterative algorithms.
Next, we fix , , , and vary from to ; we also fix , , and vary from to . Plots of the relative error of the different methods are depicted in Fig. 5, from which we still observe that ε-ALS initialized by the approximation algorithms performs better than that with random initializers, and the iterative algorithm initialized by Alg. 1 is comparable with that initialized by Yang2020.
CP approximation for clustering
Tensor CP approximation for clustering works as follows. Suppose that we have $m$ samples $S_j\in\mathbb{R}^{n_1\times n_2}$, $j=1,\ldots,m$. For a given parameter $R$, and for unknown variables $U_1\in\mathbb{R}^{n_1\times R}$, $U_2\in\mathbb{R}^{n_2\times R}$, and $B\in\mathbb{R}^{m\times R}$, one first solves the following problem:
$$\min_{U_1,U_2,B}\ \sum_{j=1}^{m}\Big\|S_j-\sum_{i=1}^{R}b_{ji}\,u_i^1\circ u_i^2\Big\|_F^2\quad\text{s.t.}\ \ U_1^\top U_1=U_2^\top U_2=I,\qquad(5.17)$$
where in $B=(b_{ji})$, $j$ represents the row index while $i$ is the column one. Write $b_j$ for the transpose of the $j$-th row of $B$. The above problem means that one finds a common subspace in $\mathbb{R}^{n_1\times n_2}$, spanned by the basis $u_i^1\circ u_i^2$, $i=1,\ldots,R$, and projects the samples onto this subspace; $b_j$ can then be seen as a representation of $S_j$. This can be regarded as a dimension reduction procedure [34, 30]. By stacking the $S_j$'s into a third-order tensor with the samples along the third mode and letting the columns of $B$ play the role of the third factor, the objective function of (5.17) can be written in the form of (2.1), and so (5.17) can be reduced to (2.2). Once $B$ is obtained, we use $k$-means for clustering, with the transposes of the rows of $B$, namely the $b_j$'s, being the new samples. The cluster error is defined as follows: let the given clusters, the true clustering mapping, and the estimated one be denoted accordingly; denote by $|\cdot|$ the cardinality of a set and by $\mathbb{1}(\cdot)$ the indicator function. Then [41]
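A sketch of the reduction-then-clustering step, assuming that columnwisely orthonormal factors $U_1$ and $U_2$ have already been computed by some orthogonal CP routine (e.g., Algorithm 1 followed by the iterative algorithm): because the rank-1 matrices $u_i^1(u_i^2)^\top$ are mutually orthonormal, the least-squares representation of a sample is given by its inner products with these basis matrices. The function name and the use of scikit-learn's KMeans are our own choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_and_cluster(samples, U1, U2, k):
    # samples: (m, n1, n2); U1: (n1, R); U2: (n2, R), both with orthonormal columns.
    # B[j, i] = <S_j, u_i^1 (u_i^2)^T> = (u_i^1)' S_j u_i^2 is the reduced representation.
    B = np.einsum('jpq,pi,qi->ji', samples, U1, U2)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(B)
    return B, labels
```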
We solve (5.17) by ε-ALS initialized by Alg. 1 (A), Alg. 1 (B), and by random initialization, with for the problem. For the first two initializations, the maximum number of iterations of ε-ALS is set to , while it is for the random initialization. We also compare them with vanilla $k$-means, namely, the original samples $S_j$ are first vectorized and then clustered by $k$-means. The dataset COIL-100 (http://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php) is used, which consists of 100 objects, each containing 72 images of size 128×128 viewed from different angles. In our experiment, we each time randomly select objects, and for each object randomly select images, resulting in a third-order tensor that consists of samples. For each case we run instances and present the averaged results in Table 3.
|    | Alg. 1 (A)   |      | Alg. 1 (B)   |      | Random       |       | vanilla k-means |      |
|----|--------------|------|--------------|------|--------------|-------|-----------------|------|
|    | cluster err. | time | cluster err. | time | cluster err. | time  | cluster err.    | time |
| 5  | 1.10E-01     | 0.33 | 1.12E-01     | 0.24 | 1.44E-01     | 6.25  | 1.33E-01        | 0.44 |
| 7  | 9.60E-02     | 0.40 | 1.04E-01     | 0.31 | 1.64E-01     | 5.12  | 1.08E-01        | 0.79 |
| 9  | 8.24E-02     | 0.50 | 9.12E-02     | 0.42 | 1.40E-01     | 11.74 | 1.00E-01        | 1.25 |
| 11 | 7.48E-02     | 0.58 | 8.08E-02     | 0.50 | 1.21E-01     | 15.26 | 8.54E-02        | 1.76 |
| 15 | 6.31E-02     | 0.78 | 6.92E-02     | 0.68 | 9.56E-02     | 14.86 | 7.25E-02        | 3.23 |
| 20 | 5.36E-02     | 1.04 | 5.23E-02     | 0.98 | 7.84E-02     | 24.14 | 5.56E-02        | 5.58 |
Considering the clustering error, we observe from the table that, armed with the approximation algorithm, the CP-approximation-based method achieves better performance than vanilla $k$-means, while Alg. 1 (A) is slightly better than Alg. 1 (B). However, when starting from a random initializer, the CP-approximation-based method is worse than vanilla $k$-means. This shows the usefulness of the introduced approximation algorithm. Considering the computational time, we see that the first two methods are also better than vanilla $k$-means, because ε-ALS stops within the allowed number of iterations, and because the dimension of each sample after reduction is much smaller than that used by vanilla $k$-means. We also observe that Random + ε-ALS usually cannot stop within the allowed iterations, making it the slowest one.
We next show the influence of $R$ on the performance. We fix an instance with , and vary $R$ from to . The cluster error and CPU time of Alg. 1 (A) + ε-ALS are plotted in Fig. 6. We see that the cluster error does not change much as $R$ varies, while around the chosen value the cluster error seems to be slightly better than in the other cases; this observation, together with the CPU time shown in the right panel, explains why we choose this value of $R$ in our experiment.
6 Conclusions
In [45], an approximation procedure was proposed for low-rank tensor approximation with orthonormal factors by combining the truncated HOSVD and best rank-1 approximations. The approximation bound was only theoretically established when the number of latent orthonormal factors is one. To fill this gap, a modified approximation algorithm was developed in this work. It allows either deterministic or randomized procedures to solve a key step for each latent orthonormal factor in the algorithm, giving some flexibility. The approximation bound, either in the deterministic or the expected sense, has been established no matter how many orthonormal factors there are. Moreover, compared with [45, Procedure 3.1], which requires an SVD of a full unfolding for every orthonormal factor, the introduced approximation algorithm involves only one SVD of this size (plus other operations of smaller size), making it more efficient. Numerical tests were provided to verify the usefulness of the algorithm, and its performance compares favorably with [45, Procedure 3.1]. A future direction is to study approaches for finding global solutions. A possible way is to use convex relaxation, as has been done for rank-1 approximation [28, 16, 46]. However, extending the ideas from rank-1 approximation to rank-$R$ approximation is not straightforward; this will be our next focus. Another possible research thread is to extend the notion of the best rank-1 approximation ratio of a tensor space [31, 24] to the best rank-$R$ approximation ratio setting (via problem (2.2)) and study its properties.
Acknowledgement
This work was supported by NSFC Grant 11801100 and the Fok Ying Tong Education Foundation Grant 171094. We thank Mr. Xianpeng Mao for the help in the experiments.
References
- [1] S. Ahmadi-Asl, A. Cichocki, A. H. Phan, I. Oseledets, S. Abukhovich, and T. Tanaka. Randomized algorithms for computation of Tucker decomposition and higher order SVD (HOSVD). arXiv preprint arXiv:2001.07124, 2020.
- [2] M. Che, Y. Wei, and H. Yan. The computation of low multilinear rank approximations of tensors via power scheme and random projection. SIAM J. Matrix Anal. Appl., 41(2):605–636, 2020.
- [3] J. Chen and Y. Saad. On the tensor SVD and the optimal low rank orthogonal approximation of tensors. SIAM J. Matrix Anal. Appl., 30(4):1709–1734, 2009.
- [4] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Process. Mag., 32(2):145–163, 2015.
- [5] P. Comon. Independent component analysis, a new concept? Signal process., 36(3):287–314, 1994.
- [6] P. Comon. Tensors: a brief introduction. IEEE Signal Process. Mag., 31(3):44–53, 2014.
- [7] A. P. da Silva, P. Comon, and A. L. F. de Almeida. A finite algorithm to compute rank-1 tensor approximations. IEEE Signal Process. Letters, 23(7):959–963, 2016.
- [8] L. De Lathauwer. A short introduction to tensor-based methods for factor analysis and blind source separation. In Proc. of the IEEE Int. Symp. on Image and Signal Processing and Analysis (ISPA 2011), pages 558–563. IEEE, 2011.
- [9] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21:1253–1278, 2000.
- [10] L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl., 31:2029–2054, 2010.
- [11] Y. Guan and D. Chu. Numerical computation for orthogonal low-rank approximation of tensors. SIAM J. Matrix Anal. Appl., 40(3):1047–1065, 2019.
- [12] S. He, B. Jiang, Z. Li, and S. Zhang. Probability bounds for polynomial functions in random variables. Math. Oper. Res., 39(3):889–907, 2014.
- [13] S. He, Z. Li, and S. Zhang. Approximation algorithms for homogeneous polynomial optimization with quadratic constraints. Math. Program., 125:353–383, 2010.
- [14] N. J. Higham. Computing the polar decomposition—with applications. SIAM J. Sci. Stat. Comput., 7(4):1160–1174, 1986.
- [15] S. Hu and K. Ye. Linear convergence of an alternating polar decomposition method for low rank orthogonal tensor approximations. arXiv preprint arXiv:1912.04085, 2019.
- [16] B. Jiang, S. Ma, and S. Zhang. Tensor principal component analysis via convex optimization. Math. Program., 150:423–457, 2014.
- [17] E. Kofidis and P. Regalia. On the best rank-1 approximation of higher-order supersymmetric tensors. SIAM J. Matrix Anal. Appl., 23:863–884, 2002.
- [18] T. G. Kolda. Orthogonal tensor decompositions. SIAM J. Matrix Anal. Appl., 23(1):243–255, 2001.
- [19] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51:455–500, 2009.
- [20] J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2):95–138, 1977.
- [21] H. W. Kuhn. The Hungarian method for the assignment problem. Nav. Res. logist. Q., 2(1-2):83–97, 1955.
- [22] J. Li, K. Usevich, and P. Comon. Globally convergent Jacobi-type algorithms for simultaneous orthogonal symmetric tensor diagonalization. SIAM J. Matrix Anal. Appl., 39(1):1–22, 2018.
- [23] J. Li and S. Zhang. Polar decomposition based algorithms on the product of stiefel manifolds with applications in tensor approximation. arXiv preprint arXiv:1912.10390, 2019.
- [24] Z. Li, Y. Nakatsukasa, T. Soma, and A. Uschmajew. On orthogonal tensors and best rank-one approximation ratio. SIAM J. Matrix Anal. Appl., 39(1):400–425, 2018.
- [25] L.-H. Lim and P. Comon. Nonnegative approximations of nonnegative tensors. J. Chemom., 23(7-8):432–441, 2009.
- [26] L.-H. Lim and P. Comon. Blind multilinear identification. IEEE Trans. Inf. Theory, 60(2):1260–1280, 2014.
- [27] R. Minster, A. K. Saibaba, and M. E. Kilmer. Randomized algorithms for low-rank tensor decompositions in the Tucker format. SIAM J. Math. Data Sci., 2(1):189–215, 2020.
- [28] J. Nie and L. Wang. Semidefinite relaxations for best rank-1 tensor approximations. SIAM J. Matrix Anal. Appl., 35(3):1155–1179, 2014.
- [29] J. Pan and M. K. Ng. Symmetric orthogonal approximation to symmetric tensors with applications to image reconstruction. Numer. Linear Algebra Appl., 25(5):e2180, 2018.
- [30] B. Pesquet-Popescu, J.-C. Pesquet, and A. P. Petropulu. Joint singular value decomposition - a new tool for separable representation of images. In ICIP, volume 2, pages 569–572. IEEE, 2001.
- [31] L. Qi. The best rank-one approximation ratio of a tensor space. SIAM J. Matrix Anal. Appl., 32(2):430–442, 2011.
- [32] Y. Qi, P. Comon, and L.-H. Lim. Semialgebraic geometry of nonnegative tensor rank. SIAM J. Matrix Anal. Appl., 37(4):1556–1580, 2016.
- [33] B. Savas and L.-H. Lim. Quasi-newton methods on Grassmannians and multilinear approximations of tensors. SIAM J. Sci. Comput., 32(6):3352–3393, 2010.
- [34] A. Shashua and A. Levin. Linear image coding for regression and classification using the tensor-rank principle. In CVPR, volume 1, pages I–I. IEEE, 2001.
- [35] N. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Trans. Signal Process., 65(13):3551–3582, 2017.
- [36] N. D. Sidiropoulos and R. Bro. On the uniqueness of multilinear decomposition of N-way arrays. J. Chemom., 14(3):229–239, 2000.
- [37] N. D. Sidiropoulos, G. B. Giannakis, and R. Bro. Blind PARAFAC receivers for DS-CDMA systems. IEEE Trans. Signal Process., 48(3):810–823, 2000.
- [38] A. M.-C. So. Deterministic approximation algorithms for sphere constrained homogeneous polynomial optimization problems. Math. Program., 129(2):357–382, 2011.
- [39] M. Sørensen, L. De Lathauwer, P. Comon, S. Icart, and L. Deneire. Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM J. Matrix Anal. Appl., 33(4):1190–1213, 2012.
- [40] M. Sørensen, L. De Lathauwer, and L. Deneire. PARAFAC with orthogonality in one mode and applications in DS-CDMA systems. In Proc. of the IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP 2010), pages 4142–4145. IEEE, 2010.
- [41] W. Sun, J. Wang, and Y. Fang. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electron. J. Stat., 6:148–167, 2012.
- [42] N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM J. Sci. Comput., 34(2):A1027–A1052, 2012.
- [43] N. Vervliet, O. Debals, L. Sorber, M. Van Barel, and L. De Lathauwer. Tensorlab 3.0, Mar. 2016. Available online.
- [44] L. Wang, M. T. Chu, and B. Yu. Orthogonal low rank tensor approximation: Alternating least squares method and its global convergence. SIAM J. Matrix Anal. and Appl., 36(1):1–19, 2015.
- [45] Y. Yang. The epsilon-alternating least squares for orthogonal low-rank tensor approximation and its global convergence. SIAM J. Matrix Anal. Appl., 41(4):1797–1825, 2020.
- [46] Y. Yang, Y. Feng, X. Huang, and J. A. K. Suykens. Rank-1 tensor properties with applications to a class of tensor optimization problems. SIAM J. Optim., 26(1):171–196, 2016.
- [47] J. Zhang, A. K. Saibaba, M. E. Kilmer, and S. Aeron. A randomized tensor singular value decomposition based on the t-product. Numer. Linear Algebra Appl., 25(5):e2179, 2018.
- [48] X. Zhang, L. Qi, and Y. Ye. The cubic spherical optimization problems. Math. Comput., 81(279):1513–1525, 2012.
Appendix A Another get_v_from_M Procedure
(D) 1. Randomly and uniformly draw a vector $y$ from the unit sphere in $\mathbb{R}^q$; 2. Return $v=My$.
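A one-line NumPy sketch of Procedure D (a normalized Gaussian vector is uniformly distributed on the unit sphere; the function name is ours):

```python
import numpy as np

def procedure_D(M, rng=np.random.default_rng()):
    # Draw y uniformly from the unit sphere in R^q and return M @ y.
    y = rng.standard_normal(M.shape[1])
    y /= np.linalg.norm(y)
    return M @ y
```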
Lemma A.1.
Let be generated from by Procedure D. Then it holds that .
Lemma A.2.
Let be randomly and uniformly drawn from the unit sphere in . Then it holds that
where , , and denotes the -th entry of .
Proof.
The first property comes from [12, p. 896]. The second one uses symmetry. Let be defined such that for all , and . Then is also uniformly distributed on the unit sphere, which means that . By the definition of , this means that . Since and can be arbitrary, the results follow. ∎
Lemma A.3.
Let be randomly and uniformly drawn from . For any constant vector , it holds that