
Low-rank Tensor Learning with
Nonconvex Overlapped Nuclear Norm Regularization

Quanming Yao ([email protected])
Department of Electronic Engineering, Tsinghua University

Yaqing Wang ([email protected])
Baidu Research

Bo Han ([email protected])
Department of Computer Science, Hong Kong Baptist University

James T. Kwok ([email protected])
Department of Computer Science and Engineering, Hong Kong University of Science and Technology

Corresponding Author.
Abstract

Nonconvex regularization has been popularly used in low-rank matrix learning. However, extending it for low-rank tensor learning is still computationally expensive. To address this problem, we develop an efficient solver for use with a nonconvex extension of the overlapped nuclear norm regularizer. Based on the proximal average algorithm, the proposed algorithm can avoid expensive tensor folding/unfolding operations. A special “sparse plus low-rank” structure is maintained throughout the iterations, and allows fast computation of the individual proximal steps. Empirical convergence is further improved with the use of adaptive momentum. We provide convergence guarantees to critical points on smooth losses and also on objectives satisfying the Kurdyka-Łojasiewicz condition. While the optimization problem is nonconvex and nonsmooth, we show that its critical points still have good statistical performance on the tensor completion problem. Experiments on various synthetic and real-world data sets show that the proposed algorithm is efficient in both time and space and more accurate than the existing state-of-the-art.

Keywords: Low-rank tensor, Proximal algorithm, Proximal average algorithm, Nonconvex regularization, Overlapped nuclear norm.

1 Introduction

Tensors can be seen as higher-order generalizations of matrices and are widely used for describing multilinear relationships in the data. They have been popularly applied in areas such as computer vision, data mining and machine learning (Kolda and Bader, 2009; Zhao et al., 2016; Song et al., 2017; Papalexakis et al., 2017; Hong et al., 2020; Janzamin et al., 2020). For example, color images (Liu et al., 2013), hyper-spectral images (Signoretto et al., 2011; He et al., 2019), and knowledge graphs (Nickel et al., 2015; Lacroix et al., 2018) can be naturally represented as third-order tensors, while color videos can be seen as fourth-order tensors (Candès et al., 2011; Bengua et al., 2017). In YouTube, users can follow each other and belong to the same subscribed channels. By treating the channel as the third dimension, the users’ co-subscription network can also be represented as a third-order tensor (Lei et al., 2009).

In many applications, only a few entries in the tensor are observed. For example, each YouTube user usually only interacts with a few other users (Lei et al., 2009; Davis et al., 2011), and in knowledge graphs, we can only have a few labeled edges describing the relations between entities (Nickel et al., 2015; Lacroix et al., 2018). Tensor completion, which aims at filling in this partially observed tensor, has attracted a lot of recent interest (Rendle and Schmidt-Thieme, 2010; Signoretto et al., 2011; Bahadori et al., 2014; Cichocki et al., 2015).

In the related task of matrix completion, the underlying matrix is often assumed to be low-rank (Candès and Recht, 2009), as its rows/columns share similar characteristics. The nuclear norm, which is the tightest convex envelope of the matrix rank (Boyd and Vandenberghe, 2009), is popularly used as its surrogate in low-rank matrix completion (Cai et al., 2010; Mazumder et al., 2010). In tensor completion, the low-rank assumption also captures relatedness in the different tensor dimensions (Tomioka et al., 2010; Acar et al., 2011; Song et al., 2017; Hong et al., 2020). However, tensors are more complicated than matrices. Indeed, even the computation of tensor rank is NP-hard (Hillar and Lim, 2013). In recent years, many convex relaxations based on the matrix nuclear norm have been proposed for tensors. Examples include the tensor trace norm (Cheng et al., 2016), overlapped nuclear norm (Tomioka et al., 2010; Gandy et al., 2011), and latent nuclear norm (Tomioka et al., 2010). Among these convex relaxations, the overlapped nuclear norm is the most popular as it (i) can be evaluated exactly by performing SVD on the unfolded matrices (Cheng et al., 2016), (ii) has better low-rank approximation (Tomioka et al., 2010), and (iii) can lead to exact recovery (Tomioka et al., 2011; Tomioka and Suzuki, 2013; Mu et al., 2014).

The (overlapped) nuclear norm penalizes all singular values equally. Intuitively, larger singular values are more informative and should be penalized less (Mazumder et al., 2010; Lu et al., 2016b; Yao et al., 2019b). To alleviate this problem in low-rank matrix learning, various adaptive nonconvex regularizers have been recently introduced. Examples include the capped-$\ell_{1}$ norm (Zhang, 2010b), log-sum-penalty (LSP) (Candès et al., 2008), truncated nuclear norm (TNN) (Hu et al., 2013), smoothed-capped-absolute-deviation (SCAD) (Fan and Li, 2001) and minimax concave penalty (MCP) (Zhang, 2010a). All of these assign smaller penalties to the larger singular values. This leads to better recovery performance in many applications such as image recovery (Lu et al., 2016b; Gu et al., 2017) and collaborative filtering (Yao et al., 2019b), and lower statistical errors on various matrix completion and regression problems (Gui et al., 2016; Mazumder et al., 2020).

Motivated by the success of adaptive nonconvex regularizers in low-rank matrix learning, there are recent works that apply nonconvex regularization in learning low-rank tensors. For example, the TNN regularizer is used with the overlapped nuclear norm regularizer on video processing (Xue et al., 2018) and traffic data processing (Chen et al., 2020). In this paper, we propose a general nonconvex variant of the overlapped nuclear norm regularizer for low-rank tensor completion. Unlike the standard convex tensor completion problem, the resulting optimization problem is nonconvex and more difficult to solve. Previous algorithms in (Xue et al., 2018; Chen et al., 2020) are computationally expensive, and have neither convergence results nor statistical guarantees.

To address this issue, based on the proximal average algorithm (Bauschke et al., 2008), we develop an efficient solver with much smaller time and space complexities. The keys to its success are (i) avoiding expensive tensor folding/unfolding, (ii) maintaining a “sparse plus low-rank” structure on the iterates, and (iii) incorporating adaptive momentum (Li and Lin, 2015; Li et al., 2017; Yao et al., 2017). Convergence guarantees to critical points are provided under the usual smoothness assumption on the loss, and further under the Kurdyka-Łojasiewicz condition (Attouch et al., 2013) on the whole learning objective.

Besides, we study its statistical guarantees, and show that critical points of the proposed objective can have small statistical errors under the restricted strong convexity condition (Agarwal et al., 2010). Informally, for tensor completion with unknown noise, we show that the recovery error can be bounded as $\|\mathscr{X}^{*}-\tilde{\mathscr{X}}\|_{F}\leq\mathcal{O}(\lambda\kappa_{0}\sum_{i=1}^{M}\sqrt{k_{i}})$ (see Theorem 16), where $\mathcal{O}$ omits constant terms, $\mathscr{X}^{*}$ (resp. $\tilde{\mathscr{X}}$) is the ground-truth (resp. recovered) tensor, $M$ is the tensor order, and $k_{i}$ is the rank of the unfolded matrix on the $i$th mode. When additive Gaussian noise is assumed, we show that the recovery error also depends linearly on the noise level $\sigma$ (see Corollary 17) and on $\sqrt{\log I^{\pi}/\|\bm{\Omega}\|_{1}}$, where $I^{\pi}$ is the tensor size and $\|\bm{\Omega}\|_{1}$ is the number of observed elements (see Corollary 18).

We further extend the algorithm for use with the Laplacian regularizer (as in spatial-temporal analysis) and with non-smooth losses (as in robust tensor completion). Experiments on a variety of synthetic and real-world data sets (including images, videos, hyper-spectral images, social networks, knowledge graphs and spatial-temporal climate observation records) show that the proposed algorithm is more efficient and has much better empirical performance than other low-rank tensor regularization and decomposition methods.

Difference with the Conference Version

A preliminary conference version of this work (Yao et al., 2019a) was published in ICML-2019. The main differences from the conference version are as follows.

  • 1) Only third-order tensors and the square loss are considered in (Yao et al., 2019a), while the proposed algorithm here, which is enabled by Proposition 4, can work on tensors of arbitrary order. The difficulties of extending to higher-order tensors are also discussed after Proposition 4.

  • 2) A statistical guarantee of the proposed model for the tensor completion problem is added in Section 3.5, which shows that tensors that are not too spiky can be recovered. We also show how the recovery performance depends on the noise level, tensor ranks, and the number of observations.

  • 3) In Section 4, we enable the proposed method to work with robust loss functions (which are nonconvex and nonsmooth) and the Laplacian regularizer. These allow the proposed algorithm to be applied to a wider range of application scenarios, such as knowledge graph completion, spatial-temporal analysis, and robust video recovery.

  • 4) Extensive experiments are added. Specifically, the quality of the obtained critical points is studied in Section 5.1.3; application to knowledge graphs in Section 5.3, application to robust video completion in Section 5.4, and application to spatial-temporal data in Section 5.5.

Notation

Vectors are denoted by lowercase boldface, matrices by uppercase boldface, and tensors by Euler script letters.

For a matrix $\bm{A}\in\mathbb{R}^{m\times n}$ (without loss of generality, we assume that $m\geq n$), $\sigma_{i}(\bm{A})$ denotes its $i$th singular value, its nuclear norm is $\|\bm{A}\|_{*}=\sum_{i}\sigma_{i}(\bm{A})$, and $\|\bm{A}\|_{\infty}$ returns its largest singular value.

For tensors, we follow the notation in (Kolda and Bader, 2009). For an $M$-order tensor $\mathscr{X}\in\mathbb{R}^{I^{1}\times\dots\times I^{M}}$ (without loss of generality, we assume $I^{1}\geq\dots\geq I^{M}$), its $(i_{1},\dots,i_{M})$th entry is $\mathscr{X}_{i_{1}\dots i_{M}}$. One can unfold $\mathscr{X}$ along its $d$th mode to obtain the matrix $\bm{X}_{\langle d\rangle}\in\mathbb{R}^{I^{d}\times(I^{\pi}/I^{d})}$ with $I^{\pi}=\prod_{i=1}^{M}I^{i}$, whose $(i_{d},j)$ entry is $\mathscr{X}_{i_{1}\dots i_{M}}$ with $j=1+\sum_{l=1,l\neq d}^{M}(i_{l}-1)\prod_{m=1,m\neq d}^{l-1}I^{m}$. One can also fold a matrix $\bm{X}$ back into a tensor $\mathscr{X}=\bm{X}^{\langle d\rangle}$, with $\mathscr{X}_{i_{1}\dots i_{M}}=\bm{X}_{i_{d}j}$ and $j$ as defined above. A slice of a tensor $\mathscr{X}$ is a matrix obtained by fixing all but two of $\mathscr{X}$'s indices. The inner product of two $M$-order tensors $\mathscr{X}$ and $\mathscr{Y}$ is $\langle\mathscr{X},\mathscr{Y}\rangle=\sum_{i_{1}=1}^{I^{1}}\dots\sum_{i_{M}=1}^{I^{M}}\mathscr{X}_{i_{1}\dots i_{M}}\mathscr{Y}_{i_{1}\dots i_{M}}$, the Frobenius norm is $\|\mathscr{X}\|_{F}=\sqrt{\langle\mathscr{X},\mathscr{X}\rangle}$, and $\|\mathscr{X}\|_{\max}$ returns the maximum absolute value of the elements in $\mathscr{X}$.
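
To make the unfolding/folding convention above concrete, the following is a minimal numpy sketch (our own, not part of the paper) of the mode-$d$ unfold and fold; the column ordering follows the Kolda-Bader (column-major) index formula for $j$, and the names `unfold`/`fold` are ours.

```python
import numpy as np

def unfold(X, d):
    # Mode-d unfolding: move mode d to the front, then flatten the remaining
    # modes in column-major (Fortran) order, matching the index formula for j.
    return np.reshape(np.moveaxis(X, d, 0), (X.shape[d], -1), order='F')

def fold(Xd, d, shape):
    # Inverse of unfold: reshape back to (I^d, remaining modes in original order),
    # then move mode d back to its original position.
    full_shape = (shape[d],) + tuple(s for i, s in enumerate(shape) if i != d)
    return np.moveaxis(np.reshape(Xd, full_shape, order='F'), 0, d)

X = np.random.randn(4, 3, 2)
assert np.allclose(fold(unfold(X, 1), 1, X.shape), X)
```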

For a proper and lower-semi-continuous function $f$, $\partial f$ denotes its Fréchet subdifferential (Attouch et al., 2013).

Finally, $P_{\bm{\Omega}}(\cdot)$ is the observation operator, i.e., given a binary tensor $\bm{\Omega}\in\{0,1\}^{I^{1}\times\dots\times I^{M}}$ and an arbitrary tensor $\mathscr{X}\in\mathbb{R}^{I^{1}\times\dots\times I^{M}}$, $[P_{\bm{\Omega}}(\mathscr{X})]_{i_{1}\dots i_{M}}=\mathscr{X}_{i_{1}\dots i_{M}}$ if $\bm{\Omega}_{i_{1}\dots i_{M}}=1$, and $[P_{\bm{\Omega}}(\mathscr{X})]_{i_{1}\dots i_{M}}=0$ otherwise.

2 Related Works

2.1 Low-Rank Matrix Learning

Learning of a low-rank matrix $\bm{X}\in\mathbb{R}^{m\times n}$ can be formulated as the following optimization problem:

$\min\nolimits_{\bm{X}}f(\bm{X})+\lambda r(\bm{X})$,   (1)

where $r$ is a low-rank regularizer, $\lambda\geq 0$ is a hyperparameter, and $f$ is a loss function that is $\rho$-Lipschitz smooth (i.e., $\|\nabla f(\bm{X})-\nabla f(\bm{Y})\|_{F}\leq\rho\|\bm{X}-\bm{Y}\|_{F}$ for any $\bm{X},\bm{Y}$). Existing methods for low-rank matrix learning generally fall into three types: (i) nuclear norm minimization; (ii) nonconvex regularization; and (iii) matrix factorization.

2.1.1 Nuclear Norm Minimization

A common choice for $r$ is the nuclear norm regularizer. Using the proximal algorithm (Parikh and Boyd, 2013) on (1), the iterate at iteration $t$ is given by $\bm{X}_{t+1}=\text{prox}_{\frac{\lambda}{\tau}\|\cdot\|_{*}}(\bm{Z}_{t})$, where

$\bm{Z}_{t}=\bm{X}_{t}-\frac{1}{\tau}\nabla f(\bm{X}_{t})$.   (2)

Here, $\tau>\rho$ controls the stepsize ($1/\tau$), and

$\text{prox}_{\frac{\lambda}{\tau}\|\cdot\|_{*}}(\bm{Z})=\arg\min\nolimits_{\bm{X}}\frac{1}{2}\|\bm{X}-\bm{Z}\|_{F}^{2}+\frac{\lambda}{\tau}\|\bm{X}\|_{*}$   (3)

is the proximal step. The following Lemma shows that $\text{prox}_{\frac{\lambda}{\tau}\|\cdot\|_{*}}(\bm{Z})$ can be obtained by shrinking the singular values of $\bm{Z}$, which encourages $\bm{X}_{t}$ to be low-rank.

Lemma 1

(Cai et al., 2010) $\text{prox}_{\lambda\|\cdot\|_{*}}(\bm{Z})=\bm{U}(\bm{\Sigma}-\lambda\bm{I})_{+}\bm{V}^{\top}$, where $\bm{U}\bm{\Sigma}\bm{V}^{\top}$ is the SVD of $\bm{Z}$, and $[(\bm{X})_{+}]_{ij}=\max(\bm{X}_{ij},0)$.
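
As an illustration of Lemma 1, the following is a minimal numpy sketch (our own, not part of the paper) of the singular value thresholding operator:

```python
import numpy as np

def prox_nuclear(Z, lam):
    # Singular value thresholding (Lemma 1): shrink each singular value by lam.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

Z = np.random.randn(50, 30)
X = prox_nuclear(Z, lam=1.0)  # typically low-rank after thresholding
```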

A special class of low-rank matrix learning problems is matrix completion, which attempts to find a low-rank matrix that agrees with the observations in the data $\bm{O}$:

$\min\nolimits_{\bm{X}}\frac{1}{2}\|P_{\bm{\Omega}}(\bm{X}-\bm{O})\|_{F}^{2}+\lambda\|\bm{X}\|_{*}$.   (4)

Here, positions of the observed elements in $\bm{O}$ are indicated by $1$'s in the binary matrix $\bm{\Omega}$. Setting $f(\bm{X})=\frac{1}{2}\|P_{\bm{\Omega}}(\bm{X}-\bm{O})\|_{F}^{2}$ in (1), $\bm{Z}_{t}$ in (2) becomes:

$\bm{Z}_{t}=\bm{X}_{t}-\frac{1}{\tau}P_{\bm{\Omega}}(\bm{X}_{t}-\bm{O})$.   (5)

Note that $\bm{X}_{t}$ is low-rank and $\frac{1}{\tau}P_{\bm{\Omega}}(\bm{X}_{t}-\bm{O})$ is sparse. $\bm{Z}_{t}$ thus has a “sparse plus low-rank” structure. This allows the SVD computation in Lemma 1 to be much more efficient (Mazumder et al., 2010). Specifically, on using the power method to compute $\bm{Z}_{t}$'s SVD, most effort is spent on multiplications of the forms $\bm{Z}_{t}\bm{b}$ and $\bm{a}^{\top}\bm{Z}_{t}$ (where $\bm{a}\in\mathbb{R}^{m}$ and $\bm{b}\in\mathbb{R}^{n}$). Let $\bm{X}_{t}$ in (5) be low-rank factorized as $\bm{U}_{t}\bm{V}_{t}^{\top}$, where $\bm{U}_{t}\in\mathbb{R}^{m\times k_{t}}$ and $\bm{V}_{t}\in\mathbb{R}^{n\times k_{t}}$ with rank $k_{t}$. Computing

$\bm{Z}_{t}\bm{b}=\bm{U}_{t}(\bm{V}_{t}^{\top}\bm{b})-\frac{1}{\tau}P_{\bm{\Omega}}(\bm{X}_{t}-\bm{O})\bm{b}$   (6)

takes $O((m+n)k_{t}+\|\bm{\Omega}\|_{1})$ time. Usually, $k_{t}\ll n$ and $\|\bm{\Omega}\|_{1}\ll mn$. Thus, this is much faster than directly multiplying $\bm{Z}_{t}$ and $\bm{b}$, which takes $O(mn)$ time. The same holds for computing $\bm{a}^{\top}\bm{Z}_{t}$. The proximal step in (3) then takes a total of $O((m+n)k_{t}k_{t+1}+\|\bm{\Omega}\|_{1}k_{t+1})$ time, while a direct computation without utilizing the “sparse plus low-rank” structure takes $O(mnk_{t+1})$ time. Besides, as only $P_{\bm{\Omega}}(\bm{X}_{t})$ and the factorized form of $\bm{X}_{t}$ need to be kept, the space complexity is reduced from $O(mn)$ to $O((m+n)k_{t}+\|\bm{\Omega}\|_{1})$.
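
The following is a minimal numpy/scipy sketch (our own illustration, with hypothetical variable names) of the efficient matrix-vector products behind (6): the low-rank part is kept in factored form and the sparse residual is stored as a scipy sparse matrix.

```python
import numpy as np
from scipy import sparse

m, n, k = 1000, 800, 10
U, V = np.random.randn(m, k), np.random.randn(n, k)   # X_t = U V^T (low-rank part)
S = sparse.random(m, n, density=0.01, format='csr')   # stands for (1/tau) P_Omega(X_t - O)

def Z_matvec(b):
    # Z_t b = U (V^T b) - S b : O((m+n)k + nnz(S)) instead of O(mn)
    return U @ (V.T @ b) - S @ b

def Z_rmatvec(a):
    # a^T Z_t, returned as a length-n vector
    return V @ (U.T @ a) - S.T @ a

b = np.random.randn(n)
Z_dense = U @ V.T - S.toarray()
assert np.allclose(Z_matvec(b), Z_dense @ b)
```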

2.1.2 Nonconvex Low-Rank Regularizer

Instead of using a convex $r$ in (1), the following nonconvex regularizer has been commonly used (Gui et al., 2016; Lu et al., 2016b; Gu et al., 2017; Yao et al., 2019b):

$\phi(\bm{X})=\sum\nolimits_{i=1}^{n}\kappa(\sigma_{i}(\bm{X}))$,   (7)

where $\kappa$ is nonconvex and possibly nonsmooth. We make the following assumption on $\kappa$.

Assumption 1

$\kappa(\alpha)$ is a concave and non-decreasing function on $\alpha\geq 0$, with $\kappa(0)=0$ and $\lim_{\alpha\rightarrow 0^{+}}\kappa^{\prime}(\alpha)=\kappa_{0}$ for a positive constant $\kappa_{0}$.

Table 1 shows the $\kappa$'s corresponding to the popular nonconvex regularizers: the capped-$\ell_{1}$ penalty (Zhang, 2010b), log-sum-penalty (LSP) (Candès et al., 2008), truncated nuclear norm (TNN) (Hu et al., 2013), smoothed-capped-absolute-deviation (SCAD) (Fan and Li, 2001), and minimax concave penalty (MCP) (Zhang, 2010a). These nonconvex regularizers have similar statistical guarantees (Gui et al., 2016), and perform empirically better than the convex nuclear norm regularizer (Lu et al., 2016b; Yao et al., 2019b). The proximal algorithm can also be used, and converges to a critical point (Attouch et al., 2013). Analogous to Lemma 1, the underlying proximal step

$\text{prox}_{\frac{\lambda}{\tau}\phi}(\bm{Z})=\arg\min\nolimits_{\bm{X}}\frac{1}{2}\|\bm{X}-\bm{Z}\|_{F}^{2}+\frac{\lambda}{\tau}\phi(\bm{X})$   (8)

can be obtained as follows.

Lemma 2

(Lu et al., 2016b) $\text{prox}_{\lambda\phi}(\bm{Z})=\bm{U}\,\text{Diag}(y_{1},\dots,y_{n})\bm{V}^{\top}$, where $\bm{U}\bm{\Sigma}\bm{V}^{\top}$ is the SVD of $\bm{Z}$, and $y_{i}=\arg\min\nolimits_{y\geq 0}\frac{1}{2}(y-\sigma_{i}(\bm{Z}))^{2}+\lambda\kappa(y)$.

Table 1: Common examples of $\kappa(\sigma_{i}(\bm{X}))$. Here, $\theta$ is a constant. For the capped-$\ell_{1}$, LSP and MCP, $\theta>0$; for SCAD, $\theta>2$; and for TNN, $\theta$ is a positive integer.

regularizer | $\kappa(\sigma_{i}(\bm{X}))$
nuclear norm (Candès and Recht, 2009) | $\sigma_{i}(\bm{X})$
capped-$\ell_{1}$ (Zhang, 2010b) | $\min(\sigma_{i}(\bm{X}),\theta)$
LSP (Candès et al., 2008) | $\log(\frac{\sigma_{i}(\bm{X})}{\theta}+1)$
TNN (Hu et al., 2013) | $\sigma_{i}(\bm{X})$ if $i>\theta$; $0$ otherwise
SCAD (Fan and Li, 2001) | $\sigma_{i}(\bm{X})$ if $\sigma_{i}(\bm{X})\leq 1$; $\frac{2\theta\sigma_{i}(\bm{X})-\sigma_{i}(\bm{X})^{2}-1}{2(\theta-1)}$ if $1<\sigma_{i}(\bm{X})\leq\theta$; $\frac{(\theta+1)^{2}}{2}$ otherwise
MCP (Zhang, 2010a) | $\sigma_{i}(\bm{X})-\frac{\sigma_{i}(\bm{X})^{2}}{2\theta}$ if $\sigma_{i}(\bm{X})\leq\theta$; $\frac{\theta^{2}}{2}$ otherwise
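
To make Lemma 2 and Table 1 concrete, here is a minimal numpy sketch (our own, using the LSP penalty as an example) of the generalized singular value thresholding in (8). The scalar subproblem for each $y_i$ is solved by comparing the objective at $y=0$ and at the nonnegative stationary points; this is only an illustration, not the paper's implementation.

```python
import numpy as np

def prox_lsp_scalar(sigma, lam, theta):
    # Solve min_{y>=0} 0.5*(y - sigma)^2 + lam*log(y/theta + 1).
    # Stationary points satisfy y^2 + (theta - sigma)*y + (lam - sigma*theta) = 0.
    cands = [0.0]
    disc = (sigma + theta) ** 2 - 4.0 * lam
    if disc >= 0:
        for r in ((sigma - theta) + np.sqrt(disc), (sigma - theta) - np.sqrt(disc)):
            y = r / 2.0
            if y > 0:
                cands.append(y)
    obj = lambda y: 0.5 * (y - sigma) ** 2 + lam * np.log(y / theta + 1.0)
    return min(cands, key=obj)

def prox_nonconvex(Z, lam, theta):
    # Generalized SVT (Lemma 2): replace each singular value by its scalar prox.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    y = np.array([prox_lsp_scalar(si, lam, theta) for si in s])
    return U @ np.diag(y) @ Vt
```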

2.1.3 Matrix Factorization

Note that the aforementioned regularizers require access to individual singular values. As computing the singular values of an $m\times n$ matrix (with $m\geq n$) via SVD takes $O(mn^{2})$ time, this can be costly for a large matrix. Even when rank-$k$ truncated SVD is used, the computation cost is still $O(mnk)$. To reduce the computational burden, factored low-rank regularizers have been proposed (Srebro et al., 2005; Mazumder et al., 2010). Specifically, (1) is rewritten in a factored form as

$\min\nolimits_{\bm{W},\bm{H}}f(\bm{W}\bm{H}^{\top})+\lambda\,h(\bm{W},\bm{H})$,   (9)

where $\bm{X}$ is factorized as $\bm{W}\bm{H}^{\top}$ with $\bm{W}\in\mathbb{R}^{m\times k}$ and $\bm{H}\in\mathbb{R}^{n\times k}$, $h$ is a regularizer on $\bm{W}$ and $\bm{H}$, and $\lambda\geq 0$ is a hyperparameter. When $\lambda=0$, this reduces to matrix factorization (Vandereycken, 2013; Boumal and Absil, 2015; Tu et al., 2016; Wang et al., 2017). After factorization, gradient descent or alternating minimization is usually used for optimization. When certain conditions (such as proper initialization, restricted strong convexity (RSC) (Negahban and Wainwright, 2012), or the restricted isometry property (RIP) (Candès and Tao, 2005)) are met, statistical guarantees can be obtained (Zheng and Lafferty, 2015; Tu et al., 2016; Wang et al., 2017).

Note that in Table 1, the nuclear norm is the only regularizer $r(\bm{X})$ that has an equivalent factored form $h(\bm{W},\bm{H})$. For a matrix $\bm{X}$ with ground-truth rank no larger than $k$, it has been shown that the nuclear norm can be rewritten in a factored form as (Srebro et al., 2005)

$\|\bm{X}\|_{*}=\min\nolimits_{\bm{X}=\bm{W}\bm{H}^{\top}}\frac{1}{2}\big(\|\bm{W}\|_{F}^{2}+\|\bm{H}\|_{F}^{2}\big)$.

However, the other nonconvex regularizers need to penalize individual singular values, and so cannot be written in factored form.
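
As a quick numerical illustration (our own) of the variational form of the nuclear norm above, the balanced factorization $\bm{W}=\bm{U}\bm{\Sigma}^{1/2}$, $\bm{H}=\bm{V}\bm{\Sigma}^{1/2}$ built from the SVD attains the minimum:

```python
import numpy as np

X = np.random.randn(40, 30) @ np.random.randn(30, 20)   # a (generically) rank-20 matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U * np.sqrt(s)         # U Sigma^{1/2}
H = Vt.T * np.sqrt(s)      # V Sigma^{1/2}

nuclear_norm = s.sum()
factored_value = 0.5 * (np.linalg.norm(W, 'fro')**2 + np.linalg.norm(H, 'fro')**2)
assert np.isclose(nuclear_norm, factored_value)   # both equal the sum of singular values
```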

2.2 Low-Rank Tensor Learning

An $M$-order tensor $\mathscr{X}$ has rank one if it can be written as the outer product of $M$ vectors, i.e., $\mathscr{X}=\bm{x}^{1}\circ\bm{x}^{2}\circ\dots\circ\bm{x}^{M}$, where $\circ$ denotes the outer product (i.e., $\mathscr{X}_{i_{1},\dots,i_{M}}=\bm{x}^{1}_{i_{1}}\bm{x}^{2}_{i_{2}}\cdots\bm{x}^{M}_{i_{M}}$). In general, the rank of a tensor $\mathscr{X}$ is the smallest number of rank-one tensors that generate $\mathscr{X}$ as their sum (Kolda and Bader, 2009).

To impose a low-rank structure on tensors, factorization methods (such as the Tucker / CP (Kolda and Bader, 2009; Hong et al., 2020), tensor-train (Oseledets, 2011) and tensor ring (Zhao et al., 2016) decompositions) have been used for low-rank tensor learning. These methods assume that the tensor can be decomposed into low-rank factor matrices (Kolda and Bader, 2009), which are then learned by alternating least squares or coordinate descent (Acar et al., 2011; Xu et al., 2013; Balazevic et al., 2019). Kressner et al. (2014) proposed to utilize the Riemannian structure on the manifold of tensors with fixed multilinear rank, and then perform nonlinear conjugate gradient descent. This can be sped up by preconditioning (Kasai and Mishra, 2016). However, these models suffer from the problem of local minima, and have no theoretical guarantee on the convergence rate. Moreover, their per-iteration cost depends on the product of all the mode ranks, and so can be expensive. Thus, they may lead to worse approximations and inferior performance (Tomioka et al., 2011; Liu et al., 2013; Guo et al., 2017).

Due to the above limitations, the nuclear norm, which has been commonly used in low-rank matrix learning, has been extended to the learning of low-rank tensors (Tomioka et al., 2010; Signoretto et al., 2011; Gu et al., 2014; Yuan and Zhang, 2016; Zhang and Aeron, 2017). The most commonly used low-rank tensor regularizer is the following (convex) overlapped nuclear norm:

Definition 3

(Tomioka et al., 2010) The overlapped nuclear norm of an $M$-order tensor $\mathscr{X}$ is $\|\mathscr{X}\|_{\text{overlap}}=\sum_{i=1}^{M}\lambda_{i}\|\mathscr{X}_{\langle i\rangle}\|_{*}$, where $\{\lambda_{i}\geq 0\}$ are hyperparameters.

Note that the nuclear norm is a convex envelope of the matrix rank (Candès and Recht, 2009). Similarly, $\|\mathscr{X}\|_{\text{overlap}}$ is a convex envelope of the tensor rank (Tomioka et al., 2010, 2011). Empirically, $\|\mathscr{X}\|_{\text{overlap}}$ performs better than the other nuclear norm variants in many tensor applications, such as image inpainting (Liu et al., 2013) and multi-relational link prediction (Guo et al., 2017). On the theoretical side, let $\mathscr{X}^{*}$ be the ground-truth tensor, and $\mathscr{X}$ be the tensor obtained by solving the overlapped nuclear norm regularized problem. The statistical error between $\mathscr{X}^{*}$ and $\mathscr{X}$ has been established for tensor decomposition (Tomioka et al., 2011) and robust tensor decomposition (Gu et al., 2014). Specifically, under the restricted strong convexity (RSC) condition (Negahban and Wainwright, 2012), $\|\mathscr{X}^{*}-\mathscr{X}\|_{F}$ can be bounded by $O(\sigma\sum_{i=1}^{M}\sqrt{k_{i}})$, where $\sigma$ is the noise level and $k_{i}$ is the rank of $\mathscr{X}^{*}_{\langle i\rangle}$. Furthermore, when $\sigma=0$ (no noise), exact recovery is guaranteed.
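
As an illustration (our own sketch), the overlapped nuclear norm in Definition 3 can be evaluated exactly by summing the nuclear norms of the mode-wise unfoldings:

```python
import numpy as np

def overlapped_nuclear_norm(X, lams):
    # Definition 3: sum_i lambda_i * || X_<i> ||_*, with X_<i> the mode-i unfolding.
    total = 0.0
    for i, lam in enumerate(lams):
        Xi = np.reshape(np.moveaxis(X, i, 0), (X.shape[i], -1), order='F')  # mode-i unfolding
        total += lam * np.linalg.svd(Xi, compute_uv=False).sum()
    return total

X = np.random.randn(10, 8, 6)
print(overlapped_nuclear_norm(X, lams=[1.0, 1.0, 1.0]))
```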

2.3 Proximal Average (PA) Algorithm

Let $\mathcal{H}$ be a Hilbert space of $\mathscr{X}$, which can be a scalar/vector/matrix/tensor variable. Consider the following optimization problem:

$\min\nolimits_{\mathscr{X}\in\mathcal{H}}F(\mathscr{X})=f(\mathscr{X})+\sum\nolimits_{i=1}^{K}\lambda_{i}\,g_{i}(\mathscr{X})$,   (10)

where $f$ is the loss and each $g_{i}$ is a regularizer with hyperparameter $\lambda_{i}$. Often, $g(\mathscr{X})=\sum_{i=1}^{K}\lambda_{i}g_{i}(\mathscr{X})$ is complicated, and its proximal step does not have a simple solution. Hence, the proximal algorithm cannot be used efficiently. However, it is possible that the proximal step for each individual $g_{i}$ can be obtained easily. For example, let $g_{1}(\bm{X})=\|\bm{X}\|_{1}$ and $g_{2}(\bm{X})=\|\bm{X}\|_{*}$. The closed-form solution of the proximal step for $g_{1}$ (resp. $g_{2}$) is given by the soft-thresholding operator (Efron et al., 2004) (resp. the singular value thresholding operator (Cai et al., 2010)). However, no closed-form solution exists for the proximal step with $g_{1}+g_{2}$.

In this case, the proximal average (PA) algorithm (Bauschke et al., 2008) can be used instead. The PA algorithm generates the $\mathscr{X}_{t}$'s as

$\mathscr{X}_{t}=\sum\nolimits_{i=1}^{K}\mathscr{Y}_{t}^{i}$,   (11)
$\mathscr{Z}_{t}=\mathscr{X}_{t}-\frac{1}{\tau}\nabla f(\mathscr{X}_{t})$,   (12)
$\mathscr{Y}_{t+1}^{i}=\text{prox}_{\frac{\lambda_{i}g_{i}}{\tau}}(\mathscr{Z}_{t}),\quad i=1,\dots,K$.   (13)

As each individual proximal step in (13) is easy, the PA algorithm can be significantly faster than the proximal algorithm (Yu, 2013; Zhong and Kwok, 2014; Yu et al., 2015; Shen et al., 2017). When both $f$ and the $g_{i}$'s are convex, the PA algorithm converges to an optimal solution of (10) with a proper choice of stepsize $\tau$ (Yu, 2013; Shen et al., 2017). Recently, the PA algorithm has also been extended to nonconvex $f$ and $g_{i}$'s (Zhong and Kwok, 2014; Yu et al., 2015). Moreover, $\tau$ can be changed adaptively to obtain empirically faster convergence (Shen et al., 2017).
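
For intuition, here is a minimal numpy sketch (our own) of the PA iterations (11)-(13) for the example above with $g_1=\|\cdot\|_1$ and $g_2=\|\cdot\|_*$ on a matrix completion loss; the variable names and the fixed iteration count are illustrative only.

```python
import numpy as np

def prox_l1(Z, lam):                     # soft-thresholding
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

def prox_nuclear(Z, lam):                # singular value thresholding
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def pa_iterations(O, Omega, lam1, lam2, tau, T=100):
    # O: observed matrix; Omega: 0/1 mask; f(X) = 0.5*||P_Omega(X - O)||_F^2.
    Y = [np.zeros_like(O), np.zeros_like(O)]      # Y^1, Y^2
    for _ in range(T):
        X = Y[0] + Y[1]                           # (11) combination step
        Z = X - (Omega * (X - O)) / tau           # (12) gradient step on observed entries
        Y[0] = prox_l1(Z, lam1 / tau)             # (13) proximal step for g_1
        Y[1] = prox_nuclear(Z, lam2 / tau)        # (13) proximal step for g_2
    return Y[0] + Y[1]
```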

3 Proposed Algorithm

Analogous to the low-rank matrix completion problem in (1), we consider the following low-rank tensor completion problem with a nonconvex extension of the overlapped nuclear norm:

$\min_{\mathscr{X}}\sum\nolimits_{\bm{\Omega}_{i_{1}\dots i_{M}}=1}\ell(\mathscr{X}_{i_{1}\dots i_{M}},\mathscr{O}_{i_{1}\dots i_{M}})+\sum\nolimits_{i=1}^{D}\lambda_{i}\,\phi(\mathscr{X}_{\langle i\rangle})$.   (14)

Here, the observed elements are in $\mathscr{O}$, $\mathscr{X}$ is the tensor to be recovered, $\ell(\cdot,\cdot)$ is a smooth loss function, and $\phi$ is a nonconvex regularizer of the form in (7). Unlike the overlapped nuclear norm in Definition 3, here we only sum over $D\leq M$ modes. This is useful when some modes are already small (e.g., the number of bands in color images), and so do not need to be low-rank regularized. When $D=M$ and $\kappa(\alpha)=\alpha$ in (7), problem (14) reduces to (convex) overlapped nuclear norm regularization. When $D=1$ and $\ell$ is the square loss, (14) reduces to the matrix completion problem:

$\min\nolimits_{\bm{X}\in\mathbb{R}^{I^{1}\times(I^{\pi}/I^{1})}}\frac{1}{2}\|P_{\bm{\Omega}}(\bm{X}-\mathscr{O}_{\langle 1\rangle})\|_{F}^{2}+\lambda_{1}\phi(\bm{X})$,

which can be solved by the proximal algorithm as in (Lu et al., 2016b; Yao et al., 2019b). In the sequel, we only consider $D>1$.

3.1 Issues with Existing Solvers

First, consider the case where $\kappa$ in (7) is convex. While $D$ may not be equal to $M$, it can easily be shown that existing optimization solvers (Tomioka et al., 2010; Boyd et al., 2011; Liu et al., 2013) can still be used. However, when $\kappa$ is nonconvex, the fast low-rank tensor completion (FaLRTC) solver (Liu et al., 2013) cannot be applied, as the dual of (14) cannot be derived. Tomioka et al. (2010) used the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) solver for the overlapped nuclear norm. Recently, it was used in (Chen et al., 2020) to solve a special case of (14), in which $\phi$ is the truncated nuclear norm (TNN) regularizer (see Table 1). Specifically, (14) is first reformulated as

$\min\nolimits_{\mathscr{X}}\sum\nolimits_{\bm{\Omega}_{i_{1}\dots i_{M}}=1}\ell(\mathscr{X}_{i_{1}\dots i_{M}},\mathscr{O}_{i_{1}\dots i_{M}})+\sum\nolimits_{i=1}^{D}\lambda_{i}\,\phi(\bm{X}_{i})\quad\text{s.t.}\quad\bm{X}_{i}=\mathscr{X}_{\langle i\rangle},\;i=1,\dots,D$.

Using ADMM, it then generates iterates as

$\mathscr{X}_{t+1}=\arg\min\nolimits_{\mathscr{X}}\sum\nolimits_{\bm{\Omega}_{i_{1}\dots i_{M}}=1}\ell(\mathscr{X}_{i_{1}\dots i_{M}},\mathscr{O}_{i_{1}\dots i_{M}})+\frac{\zeta}{2}\sum\nolimits_{i=1}^{D}\|(\bm{X}_{i})_{t}-\mathscr{X}_{\langle i\rangle}+\frac{1}{\zeta}(\bm{M}_{i})_{t}\|_{F}^{2}$,   (15)
$(\bm{X}_{i})_{t+1}=\text{prox}_{\frac{\lambda_{i}}{\zeta}\phi}\big((\mathscr{X}_{\langle i\rangle})_{t+1}+\frac{1}{\zeta}(\bm{M}_{i})_{t}\big),\quad i=1,\dots,D$,   (16)
$(\bm{M}_{i})_{t+1}=(\bm{M}_{i})_{t}+\frac{1}{\zeta}\big((\bm{X}_{i})_{t+1}-(\mathscr{X}_{\langle i\rangle})_{t+1}\big),\quad i=1,\dots,D$,   (17)

where the $\bm{M}_{i}$'s are the dual variables, and $\zeta>0$. The proximal step in (16) can be computed from Lemma 2. Convergence of this ADMM procedure is guaranteed in (Hong et al., 2016; Wang et al., 2019). However, it does not utilize the sparsity induced by $\bm{\Omega}$. Moreover, as the tensor $\mathscr{X}$ needs to be folded and unfolded repeatedly, the iterative procedure is expensive, taking $O(I^{\pi})$ space and $O(I^{\pi}\sum_{i=1}^{D}I^{i})$ time per iteration.

On the other hand, the proximal algorithm (Section 2.1) cannot be easily used, as the proximal step for $\sum_{i=1}^{D}\lambda_{i}\phi(\mathscr{X}_{\langle i\rangle})$ is not simple in general.

3.2 Structure-aware Proximal Average Iterations

Note that $\phi$ in (7) admits a difference-of-convex decomposition (Hartman, 1959; Le Thi and Tao, 2005), i.e., $\phi$ can be decomposed as $\phi=\phi_{1}-\phi_{2}$ where $\phi_{1}$ and $\phi_{2}$ are convex (Yao et al., 2019b). The proximal average (PA) algorithm (Section 2.3) has recently been extended to nonconvex $f$ and $g_{i}$'s, where each $g_{i}$ admits a difference-of-convex decomposition (Zhong and Kwok, 2014). Hence, as (14) is of the form in (10), one can generate the PA iterates as:

$\mathscr{X}_{t}=\sum\nolimits_{i=1}^{D}\mathscr{Y}^{i}_{t}$,   (18)
$\mathscr{Z}_{t}=\mathscr{X}_{t}-\frac{1}{\tau}\xi(\mathscr{X}_{t})$,   (19)
$\mathscr{Y}^{i}_{t+1}=\big[\text{prox}_{\frac{\lambda_{i}\phi}{\tau}}([\mathscr{Z}_{t}]_{\langle i\rangle})\big]^{\langle i\rangle},\quad i=1,\dots,D$,   (20)

where $\xi(\mathscr{X}_{t})$ (the gradient of the loss) is a sparse tensor with

$\big[\xi(\mathscr{X}_{t})\big]_{i_{1}\dots i_{M}}=\begin{cases}0&(i_{1},\dots,i_{M})\not\in\bm{\Omega}\\ \ell^{\prime}([\mathscr{X}_{t}]_{i_{1}\dots i_{M}},\mathscr{O}_{i_{1}\dots i_{M}})&(i_{1},\dots,i_{M})\in\bm{\Omega}\end{cases}$.   (21)

In (20), each individual proximal step can be computed using Lemma 2. However, tensor folding and unfolding are still required. A direct application of the PA algorithm is as expensive as using ADMM (see Table 2).

In the following, we show that by utilizing the “sparse plus low-rank” structure, the PA iterations can be computed efficiently without tensor folding/unfolding. In the earlier conference version (Yao et al., 2019a), we only considered the case $M=3$. Here, we extend this to $M\geq 3$ by noting that the coordinate format of sparse tensors can naturally handle tensors of arbitrary order (Section 3.2.1) and that the proximal step can be performed without tensor folding/unfolding (Section 3.2.2).

3.2.1 Efficient Computations of $\mathscr{X}_{t}$ and $\mathscr{Z}_{t}$ in (18), (19)

First, rewrite (20) as $\mathscr{Y}^{i}_{t+1}=[\bm{Y}^{i}_{t+1}]^{\langle i\rangle}$, where $\bm{Y}^{i}_{t+1}=\text{prox}_{\frac{\lambda_{i}\phi}{\tau}}(\bm{Z}^{i}_{t})$ and $\bm{Z}^{i}_{t}=[\mathscr{Z}_{t}]_{\langle i\rangle}$. Recall that the $\bm{Y}_{t}^{i}$ obtained from the proximal step is low-rank. Let its rank be $k^{i}_{t}$. Hence, $\bm{Y}_{t}^{i}$ can be represented as $\bm{U}_{t}^{i}(\bm{V}_{t}^{i})^{\top}$, where $\bm{U}_{t}^{i}\in\mathbb{R}^{I^{i}\times k_{t}^{i}}$ and $\bm{V}_{t}^{i}\in\mathbb{R}^{(I^{\pi}/I^{i})\times k_{t}^{i}}$. In each PA iteration, we avoid constructing the dense $\mathscr{Y}_{t}^{i}$ by storing the above low-rank factorized form of $\bm{Y}_{t}^{i}$ instead. Similarly, we also avoid constructing $\mathscr{X}_{t}$ in (18) by storing it implicitly as

$\mathscr{X}_{t}=\sum\nolimits_{i=1}^{D}\big(\bm{U}_{t}^{i}(\bm{V}_{t}^{i})^{\top}\big)^{\langle i\rangle}$.   (22)

$\mathscr{Z}_{t}$ in (19) can then be rewritten as

$\mathscr{Z}_{t}=\sum\nolimits_{i=1}^{D}\big(\bm{U}_{t}^{i}(\bm{V}_{t}^{i})^{\top}\big)^{\langle i\rangle}-\frac{1}{\tau}\xi(\mathscr{X}_{t})$.   (23)

The sparse tensor $\xi(\mathscr{X}_{t})$ in (21) can be constructed efficiently by using the coordinate format (Bader and Kolda, 2007), in which the $p$th nonzero element of a sparse $M$-order tensor is represented as $(i_{p}^{1},\dots,i_{p}^{M},v_{p})$, where $i_{p}^{1},\dots,i_{p}^{M}$ are the indices on each mode and $v_{p}$ is the value. Using (22), each $[\xi(\mathscr{X}_{t})]_{i_{1}\dots i_{M}}$ can be computed by finding the corresponding rows in $\{\bm{U}_{t}^{i},\bm{V}_{t}^{i}\}$, as shown in Algorithm 1. This takes $O(\sum_{i=1}^{D}k_{t}^{i})$ time.

Algorithm 1 Computing the $p$th element $v_{p}$ with index $(i_{p}^{1},\dots,i_{p}^{M})$ in $\xi(\mathscr{X}_{t})$.
0:  factorizations $\{\bm{U}_{t}^{1}(\bm{V}_{t}^{1})^{\top},\dots,\bm{U}_{t}^{D}(\bm{V}_{t}^{D})^{\top}\}$ of $\bm{Y}_{t}^{1},\dots,\bm{Y}_{t}^{D}$, and observed elements in $P_{\bm{\Omega}}(\mathscr{O})$;
1:  $v_{p}\leftarrow 0$;
2:  for $d=1,\dots,D$ do
3:     $\bm{u}^{\top}\leftarrow$ $i^{d}_{p}$th row of $\bm{U}_{t}^{d}$;
4:     $\bm{v}^{\top}\leftarrow$ $j$th row of $\bm{V}_{t}^{d}$, where $j=1+\sum_{l=1,l\neq d}^{M}(i^{l}_{p}-1)\prod_{m=1,m\neq d}^{l-1}I^{m}$;
5:     $v_{p}\leftarrow v_{p}+\bm{u}^{\top}\bm{v}$;
6:  end for
7:  $v_{p}\leftarrow\ell^{\prime}(v_{p},\mathscr{O}_{i_{p}^{1}\dots i_{p}^{M}})$;
8:  return  $v_{p}$.
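
A minimal numpy sketch (our own, assuming the Kolda-Bader column ordering for $j$ and 0-based indices) of Algorithm 1 is given below; `factors` holds the pairs $(\bm{U}_t^d,\bm{V}_t^d)$ and `loss_grad` is the scalar derivative $\ell'$.

```python
import numpy as np

def xi_element(idx, factors, O_value, loss_grad, dims):
    # idx: 0-based indices (i_p^1, ..., i_p^M); factors: list of (U^d, V^d), d = 1..D.
    v = 0.0
    for d, (U, V) in enumerate(factors):
        # 0-based column index j of this element in the mode-d unfolding
        # (the 0-based version of j = 1 + sum_{l!=d} (i_l-1) * prod_{m<l, m!=d} I^m).
        j, stride = 0, 1
        for l, i_l in enumerate(idx):
            if l == d:
                continue
            j += i_l * stride
            stride *= dims[l]
        v += U[idx[d], :] @ V[j, :]
    return loss_grad(v, O_value)

# Example with the square loss, whose derivative is l'(x, o) = x - o.
square_loss_grad = lambda x, o: x - o
```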

3.2.2 Efficient Computation of $\mathscr{Y}^{i}_{t+1}$ in (20)

Recall that the proximal step in (20) requires SVD, which involves matrix multiplications of the forms $\bm{a}^{\top}(\mathscr{Z}_{t})_{\langle i\rangle}$ (where $\bm{a}\in\mathbb{R}^{I^{i}}$) and $(\mathscr{Z}_{t})_{\langle i\rangle}\bm{b}$ (where $\bm{b}\in\mathbb{R}^{I^{\pi}/I^{i}}$). Using the “sparse plus low-rank” structure in (23), these can be computed as

$\bm{a}^{\top}(\mathscr{Z}_{t})_{\langle i\rangle}=\big(\bm{a}^{\top}\bm{U}_{t}^{i}\big)(\bm{V}_{t}^{i})^{\top}+\sum\nolimits_{j\neq i}\bm{a}^{\top}\big[(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{\langle j\rangle}\big]_{\langle i\rangle}-\frac{1}{\tau}\bm{a}^{\top}[\xi(\mathscr{X}_{t})]_{\langle i\rangle}$,   (24)

and

$(\mathscr{Z}_{t})_{\langle i\rangle}\bm{b}=\bm{U}_{t}^{i}\big[(\bm{V}_{t}^{i})^{\top}\bm{b}\big]+\sum\nolimits_{j\neq i}\big[(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{\langle j\rangle}\big]_{\langle i\rangle}\bm{b}-\frac{1}{\tau}[\xi(\mathscr{X}_{t})]_{\langle i\rangle}\bm{b}$.   (25)

The first terms in (24) and (25) can be computed easily in $O((I^{\pi}/I^{i}+I^{i})k_{t}^{i})$ space and time. The last terms ($\bm{a}^{\top}[\xi(\mathscr{X}_{t})]_{\langle i\rangle}$ and $[\xi(\mathscr{X}_{t})]_{\langle i\rangle}\bm{b}$) are sparse, and can be computed in $O(\|\bm{\Omega}\|_{1})$ space and time by using sparse tensor packages such as the Tensor Toolbox (Bader and Kolda, 2007). However, a direct computation of the $\bm{a}^{\top}[(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{\langle j\rangle}]_{\langle i\rangle}$ and $[(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{\langle j\rangle}]_{\langle i\rangle}\bm{b}$ terms involves tensor folding/unfolding, and is expensive. By examining how elements are ordered by folding/unfolding, the following shows that these multiplications can indeed be computed efficiently without explicit folding/unfolding.

Proposition 4

Let $\bm{U}\in\mathbb{R}^{I^{j}\times k}$, $\bm{V}\in\mathbb{R}^{(I^{\pi}/I^{j})\times k}$, and $\bm{u}_{p}$ (resp. $\bm{v}_{p}$) be the $p$th column of $\bm{U}$ (resp. $\bm{V}$). For any $i\neq j$, $\bm{a}\in\mathbb{R}^{I^{i}}$ and $\bm{b}\in\mathbb{R}^{I^{\pi}/I^{i}}$, we have

$\bm{a}^{\top}\big[(\bm{U}\bm{V}^{\top})^{\langle j\rangle}\big]_{\langle i\rangle}=\sum\nolimits_{p=1}^{k}\bm{u}_{p}^{\top}\otimes\big[\bm{a}^{\top}\text{mat}(\bm{v}_{p};I^{i},\bar{I}^{ij})\big]$,   (26)
$\big[(\bm{U}\bm{V}^{\top})^{\langle j\rangle}\big]_{\langle i\rangle}\bm{b}=\sum\nolimits_{p=1}^{k}\text{mat}\big(\bm{v}_{p};I^{i},\bar{I}^{ij}\big)\,\text{mat}\big(\bm{b};\bar{I}^{ij},I^{j}\big)\,\bm{u}_{p}$,   (27)

where $\otimes$ is the Kronecker product, $\bar{I}^{ij}=I^{\pi}/(I^{i}I^{j})$, and $\text{mat}(\bm{x};a,b)$ reshapes a vector $\bm{x}$ of length $ab$ into a matrix of size $a\times b$.

In the earlier conference version (Yao et al., 2019a), it is Proposition 3.2 there (not the proposed algorithm) that limits the usage to $M=3$. Without Proposition 4, the algorithm would suffer from expensive computation cost, and thus would have no efficiency advantage over a simple use of the PA algorithm. Specifically, when $M=3$, the vector $\bm{v}_{p}$ has $I^{\pi}/I^{j}$ elements, which is exactly the product of the sizes of the two remaining modes, so there is only one sensible way to map it back to a matrix, and the mat operation in the conference version needs no size parameters. When $M>3$, however, $\bm{v}_{p}$ still has $I^{\pi}/I^{j}$ elements, but these can be grouped into matrices of several different sizes, so one has to verify that the idea from the conference version still goes through. As a result, the mat operation here carries two extra parameters that specify the proper size of the reshaped matrix.

Remark 5

For a second-order tensor (i.e., the matrix case with $M=2$), Proposition 4 becomes

$\bm{a}^{\top}\big[(\bm{U}\bm{V}^{\top})^{\langle 1\rangle}\big]_{\langle 2\rangle}=\sum\nolimits_{p=1}^{k}(\bm{a}^{\top}\bm{v}_{p})\bm{u}_{p}^{\top}\quad\text{and}\quad\big[(\bm{U}\bm{V}^{\top})^{\langle 1\rangle}\big]_{\langle 2\rangle}\bm{b}=\sum\nolimits_{p=1}^{k}\bm{v}_{p}(\bm{b}^{\top}\bm{u}_{p})$.

With the usual square loss (i.e., $\sum_{\bm{\Omega}_{i_{1}\dots i_{M}}=1}\ell(\mathscr{X}_{i_{1}\dots i_{M}},\mathscr{O}_{i_{1}\dots i_{M}})$ in (14) equals $\frac{1}{2}\|P_{\bm{\Omega}}(\mathscr{X}-\mathscr{O})\|_{F}^{2}$), (25) then reduces to (6) when $D=1$. When $D=2$, $\sum_{i=1}^{D}\lambda_{i}\,\phi(\mathscr{X}_{\langle i\rangle})$ in (14) becomes $\lambda_{1}\phi(\bm{X})+\lambda_{2}\phi(\bm{X}^{\top})=(\lambda_{1}+\lambda_{2})\phi(\bm{X})$, which is the same as the corresponding regularizer when $D=1$. Hence, the reduction from (25) to (6) still holds.
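
For the matrix case in Remark 5, the identities are easy to check numerically, since the mode-2 unfolding of $(\bm{U}\bm{V}^{\top})^{\langle 1\rangle}$ is simply $\bm{V}\bm{U}^{\top}$; a small sketch (our own):

```python
import numpy as np

I1, I2, k = 7, 5, 3
U, V = np.random.randn(I1, k), np.random.randn(I2, k)
a, b = np.random.randn(I2), np.random.randn(I1)

# The mode-2 unfolding of the matrix U V^T is its transpose V U^T.
lhs_row = a @ (V @ U.T)                                   # a^T [(U V^T)^<1>]_<2>
rhs_row = sum((a @ V[:, p]) * U[:, p] for p in range(k))  # sum_p (a^T v_p) u_p^T
assert np.allclose(lhs_row, rhs_row)

lhs_col = (V @ U.T) @ b                                   # [(U V^T)^<1>]_<2> b
rhs_col = sum(V[:, p] * (b @ U[:, p]) for p in range(k))  # sum_p v_p (b^T u_p)
assert np.allclose(lhs_col, rhs_col)
```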

3.2.3 Time and Space Complexities

A direct computation of $\bm{a}^{\top}[(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{\langle j\rangle}]_{\langle i\rangle}$ takes $O(k_{t}^{i}I^{\pi})$ time and $O(I^{\pi})$ space. By using the computation in Proposition 4, these are reduced to $O((\frac{1}{I^{i}}+\frac{1}{I^{j}})k_{t}^{i}I^{\pi})$ time and $O((\frac{1}{I^{j}}+\frac{1}{I^{i}})k_{t}^{i}I^{\pi})$ space. The same holds for $[(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{\langle j\rangle}]_{\langle i\rangle}\bm{b}$. The step-by-step breakdown is as follows.

For (26):
operation | time | space
reshaping $\text{mat}(\bm{v}_{p};I^{j},\bar{I}^{ij})$ | $O(I^{\pi}/I^{i})$ | $O(I^{\pi}/I^{i})$
multiplication $\bm{a}^{\top}(\cdot)$ | $O(I^{\pi}/I^{i})$ | $O(I^{\pi}/I^{i})$
Kronecker product $\bm{u}_{p}^{\top}\otimes(\cdot)$ | $O(I^{\pi}/I^{j})$ | $O(I^{\pi}/I^{j})$
summation $\sum_{p=1}^{k_{t}^{i}}(\cdot)$ | $O(k_{t}^{i}I^{\pi}/I^{j})$ | $O((\frac{1}{I^{j}}+\frac{1}{I^{i}})k_{t}^{i}I^{\pi})$
total for (26) | $O((\frac{1}{I^{j}}+\frac{k_{t}^{i}}{I^{j}})I^{\pi})$ | $O((\frac{1}{I^{j}}+\frac{1}{I^{i}})k_{t}^{i}I^{\pi})$

For (27):
operation | time | space
reshaping $\text{mat}(\bm{b};\bar{I}^{ij},I^{i})$ | $O(I^{\pi}/I^{j})$ | $O(I^{\pi}/I^{j})$
multiplication $(\cdot)\bm{u}_{p}$ | $O(I^{\pi}/I^{i})$ | $O(I^{\pi}/I^{j})$
reshaping $\text{mat}(\bm{v}_{p};I^{j},\bar{I}^{ij})$ | $O(I^{\pi}/I^{i})$ | $O(I^{\pi}/I^{i})$
multiplication $\text{mat}(\bm{v}_{p};I^{j},\bar{I}^{ij})(\cdot)$ | $O(I^{\pi}/I^{j})$ | $O(I^{\pi}/I^{i})$
summation $\sum_{p=1}^{k_{t}^{i}}(\cdot)$ | $O(k_{t}^{i}I^{i})$ | $O(k_{t}^{i}I^{i})$
total for (27) | $O((\frac{1}{I^{j}}+\frac{k_{t}^{i}}{I^{j}})I^{\pi})$ | $O((\frac{1}{I^{j}}+\frac{1}{I^{i}})k_{t}^{i}I^{\pi})$

Combining the above, and noting that we have to keep the factorized form $\bm{U}_{t}^{i}(\bm{V}_{t}^{i})^{\top}$ of $\bm{Y}_{t}^{i}$, computing all the proximal steps in (20) takes

$O\big(\sum\nolimits_{i=1}^{D}\sum\nolimits_{j\neq i}(\frac{1}{I^{i}}+\frac{1}{I^{j}})k_{t}^{i}I^{\pi}+\|\bm{\Omega}\|_{1}\big)$   (28)

space and

$O\big(\sum\nolimits_{i=1}^{D}\sum\nolimits_{j\neq i}(\frac{1}{I^{i}}+\frac{1}{I^{j}})k_{t}^{i}k_{t+1}^{i}I^{\pi}+\|\bm{\Omega}\|_{1}(k_{t}^{i}+k_{t+1}^{i})\big)$   (29)

time. Empirically, as will be seen in the experimental results in Section 5.1.2, $k_{t}^{i},k_{t+1}^{i}\ll I^{i}$. Hence, (28) and (29) are much smaller than the complexities of a direct use of PA and ADMM in Section 3.1 (Table 2).

Table 2: Comparison of the proposed NORT with PA and ADMM for (14) in Section 3.1.

algorithm | time per iteration | space | adaptive momentum
PA (Zhong and Kwok, 2014) | $O(I^{\pi}\sum_{i=1}^{D}I^{i})$ | $O(I^{\pi})$ | ×
ADMM (Chen et al., 2020) | $O(I^{\pi}\sum_{i=1}^{D}I^{i})$ | $O(I^{\pi})$ | ×
NORT (Algorithm 2) | see (29) | see (28) | ✓

3.3 Use of Adaptive Momentum

The PA algorithm uses only first-order information, and can be slow to converge empirically (Parikh and Boyd, 2013). To address this problem, we adopt adaptive momentum, which uses historical iterates to speed up convergence. This has been popularly used in stochastic gradient descent (Duchi et al., 2011; Kingma and Ba, 2014), proximal algorithms (Li and Lin, 2015; Yao et al., 2017; Li et al., 2017), cubic regularization (Wang et al., 2020), and zero-order black-box optimization (Chen et al., 2019). Here, we adopt the adaptive scheme in (Li et al., 2017).

The resultant procedure, which will be called the NOnconvex Regularized Tensor (NORT) algorithm, is shown in Algorithm 2. When the extrapolation step $\bar{\mathscr{X}}_{t}$ achieves a lower function value (step 4), the momentum $\gamma_{t}$ is increased to further exploit the opportunity for acceleration; otherwise, $\gamma_{t}$ is decayed (step 7). When step 5 is performed, $\mathscr{V}_{t}=\mathscr{X}_{t}+\gamma_{t}(\mathscr{X}_{t}-\mathscr{X}_{t-1})$, and $\mathscr{Z}_{t}$ in step 9 becomes

$\mathscr{Z}_{t}=(1+\gamma_{t})\sum\nolimits_{i=1}^{D}\big(\bm{U}_{t}^{i}(\bm{V}_{t}^{i})^{\top}\big)^{\langle i\rangle}-\gamma_{t}\sum\nolimits_{i=1}^{D}\big(\bm{U}_{t-1}^{i}(\bm{V}_{t-1}^{i})^{\top}\big)^{\langle i\rangle}-\frac{1}{\tau}\xi(\mathscr{V}_{t})$,   (30)

which still has the “sparse plus low-rank” structure. When step 7 is performed, $\mathscr{V}_{t}=\mathscr{X}_{t}$, and the resultant $\mathscr{Z}_{t}$ is obviously “sparse plus low-rank”. Thus, the more efficient reformulations in Proposition 4 can be applied in computing the proximal steps at step 11. Note that the rank of $\bm{X}_{t+1}^{i}$ in step 11 is determined implicitly by the proximal step. As $\mathscr{X}_{t}$ and $\mathscr{Z}_{t}$ are represented implicitly in factorized forms, $\mathscr{V}_{t}$ and $\bar{\mathscr{X}}_{t}$ (in step 3) do not need to be constructed explicitly. As a result, the resultant time and space complexities are the same as those in Section 3.2.3.

Algorithm 2 NOnconvex Regularized Tensor (NORT) Algorithm.
1:  Initialize $\tau>\rho+D\kappa_{0}$, $\gamma_{1},p\in(0,1]$, $\mathscr{X}_{0}=\mathscr{X}_{1}=0$ and $t=1$;
2:  while not converged do
3:     $\bar{\mathscr{X}}_{t}\leftarrow\mathscr{X}_{t}+\gamma_{t}(\mathscr{X}_{t}-\mathscr{X}_{t-1})$;
4:     if $F(\bar{\mathscr{X}}_{t})\leq F(\mathscr{X}_{t})$ then
5:        $\mathscr{V}_{t}\leftarrow\bar{\mathscr{X}}_{t}$ and $\gamma_{t+1}\leftarrow\min(\frac{\gamma_{t}}{p},1)$;
6:     else
7:        $\mathscr{V}_{t}\leftarrow\mathscr{X}_{t}$ and $\gamma_{t+1}\leftarrow p\gamma_{t}$;
8:     end if
9:     $\mathscr{Z}_{t}\leftarrow\mathscr{V}_{t}-\frac{1}{\tau}\xi(\mathscr{V}_{t})$; // compute $\xi(\mathscr{V}_{t})$ using Algorithm 1
10:     for $i=1,\dots,D$ do
11:        $\bm{X}^{i}_{t+1}\leftarrow\text{prox}_{\frac{\lambda_{i}\phi}{\tau}}((\mathscr{Z}_{t})_{\langle i\rangle})$; // keep in the factorized form $\bm{U}_{t+1}^{i}(\bm{V}_{t+1}^{i})^{\top}$
12:     end for // implicitly construct $\mathscr{X}_{t+1}\leftarrow\sum_{i=1}^{D}\big(\bm{U}_{t+1}^{i}(\bm{V}_{t+1}^{i})^{\top}\big)^{\langle i\rangle}$;
13:     $t\leftarrow t+1$;
14:  end while
15:  return  $\mathscr{X}_{t}$.
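
The adaptive-momentum logic of steps 3-8 can be summarized by the following small Python sketch (our own, operating on generic numpy iterates and a callable objective $F$; it omits the factorized representation for brevity):

```python
def adaptive_momentum_step(F, X_prev, X_curr, gamma, p):
    # Steps 3-8 of Algorithm 2: try the extrapolated point; if it does not
    # decrease F, fall back to X_curr and shrink the momentum.
    X_bar = X_curr + gamma * (X_curr - X_prev)      # step 3: extrapolation
    if F(X_bar) <= F(X_curr):                       # step 4
        V, gamma_next = X_bar, min(gamma / p, 1.0)  # step 5: accept, grow momentum
    else:
        V, gamma_next = X_curr, p * gamma           # step 7: reject, decay momentum
    return V, gamma_next
```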

3.4 Convergence Properties

In this section, we analyze the convergence properties of the proposed algorithm. As can be seen from (14), we have $f(\mathscr{X})=\sum_{\bm{\Omega}_{i_{1}\dots i_{M}}=1}\ell(\mathscr{X}_{i_{1}\dots i_{M}},\mathscr{O}_{i_{1}\dots i_{M}})$ here. Moreover, throughout this section, we assume that the loss $f$ is (Lipschitz-)smooth.

Note that existing proofs for the PA algorithm (Yu, 2013; Zhong and Kwok, 2014; Yu et al., 2015) cannot be used directly, as adaptive momentum has not previously been combined with the PA algorithm on nonconvex problems (see Table 2), and these proofs do not involve tensor folding/unfolding operations. Our proof strategy still follows the three main steps in proving convergence of PA:

  1. Show that the proximal average step with the $g_{i}$'s in (13) corresponds to a regularizer;

  2. Show that this regularizer, when combined with the loss $f$ in (10), serves as a good approximation of the original objective $F$;

  3. Show that the proposed algorithm finds critical points of this approximate optimization problem.

First, the following Proposition shows that the average step in (18) and the proximal steps in (20) together correspond to a new regularizer $\bar{g}_{\tau}$.

Proposition 6

For any $\tau>0$, $\sum_{i=1}^{D}[\text{prox}_{\frac{1}{\tau}\lambda_{i}\phi}([\mathscr{Z}]_{\langle i\rangle})]^{\langle i\rangle}=\text{prox}_{\frac{1}{\tau}\bar{g}_{\tau}}(\mathscr{Z})$, where

$\bar{g}_{\tau}(\mathscr{X})=\tau\big[\min\nolimits_{\{\bm{X}_{d}\}:\sum_{d=1}^{D}\bm{X}_{d}^{\langle d\rangle}=\mathscr{X}}\sum\nolimits_{d=1}^{D}\big(\frac{1}{2}\|\bm{X}_{d}\|_{F}^{2}+\frac{\lambda_{d}}{\tau}\phi(\bm{X}_{d})\big)-\frac{D}{2}\|\mathscr{X}\|_{F}^{2}\big]$.

Analogous to (10), let the objective corresponding to the regularizer $\bar{g}_{\tau}$ be

$F_{\tau}(\mathscr{X})=f(\mathscr{X})+\bar{g}_{\tau}(\mathscr{X})$.   (31)

The following bounds the difference between the optimal values ($F^{\min}$ and $F_{\tau}^{\min}$, respectively) of the objectives $F$ in (10) and $F_{\tau}$. It thus shows that $F_{\tau}$ serves as an approximation of $F$, which is controlled by $\tau$.

Proposition 7

$0\leq F^{\min}-F_{\tau}^{\min}\leq\frac{\kappa_{0}^{2}}{2\tau}\sum_{i=1}^{D}\lambda_{i}^{2}$, where $\kappa_{0}$ is defined in Assumption 1.

Before showing the convergence of the proposed algorithm, the following Proposition first gives a condition for being a critical point of $F_{\tau}(\mathscr{X})$.

Proposition 8

If there exists $\tau>0$ such that $\tilde{\mathscr{X}}=\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\tilde{\mathscr{X}}-\frac{1}{\tau}\nabla f(\tilde{\mathscr{X}}))$, then $\tilde{\mathscr{X}}$ is a critical point of $F_{\tau}$.

Finally, we show how convergence to critical points can be ensured by the proposed algorithm, under a smoothness assumption on the loss $f$ (Section 3.4.1) and under the Kurdyka-Łojasiewicz condition on the approximate objective $F_{\tau}$ (Section 3.4.2).

3.4.1 With Smoothness Assumption on Loss $f$

The following shows that Algorithm 2 converges to a critical point (Theorem 9).

Theorem 9

The sequence $\{\mathscr{X}_{t}\}$ generated by Algorithm 2 has at least one limit point, and all limit points are critical points of $F_{\tau}(\mathscr{X})$.

Proof [Sketch; details are in Appendix B.5.] The main idea is as follows. First, we show that (i) if step 5 is performed, then $F_{\tau}(\mathscr{X}_{t+1})\leq F_{\tau}(\mathscr{X}_{t})-\frac{\eta}{2}\|\mathscr{X}_{t+1}-\bar{\mathscr{X}}_{t}\|_{F}^{2}$; and (ii) if step 7 is performed, then $F_{\tau}(\mathscr{X}_{t+1})\leq F_{\tau}(\mathscr{X}_{t})-\frac{\eta}{2}\|\mathscr{X}_{t+1}-\mathscr{X}_{t}\|_{F}^{2}$. Combining the two conditions, we obtain

$\frac{2}{\eta}(F_{\tau}(\mathscr{X}_{1})-F_{\tau}(\mathscr{X}_{T+1}))\geq\sum\nolimits_{j\in\chi_{1}(T)}\|\mathscr{X}_{j+1}-\bar{\mathscr{X}}_{j}\|_{F}^{2}+\sum\nolimits_{j\in\chi_{2}(T)}\|\mathscr{X}_{j+1}-\mathscr{X}_{j}\|_{F}^{2}$,

where $\chi_{1}(T)$ and $\chi_{2}(T)$ partition $\{1,\dots,T\}$ such that step 5 is performed when $j\in\chi_{1}(T)$, and step 7 is performed when $j\in\chi_{2}(T)$. Finally, as $T\rightarrow\infty$, we consider three cases: (i) $\chi_{1}(\infty)$ is finite and $\chi_{2}(\infty)$ is infinite; (ii) $\chi_{1}(\infty)$ is infinite and $\chi_{2}(\infty)$ is finite; and (iii) both $\chi_{1}(\infty)$ and $\chi_{2}(\infty)$ are infinite. Let $\tilde{\mathscr{X}}$ be a limit point of $\{\mathscr{X}_{t}\}$, and $\{\mathscr{X}_{j_{t}}\}$ be a subsequence that converges to $\tilde{\mathscr{X}}$. In all three cases, we show that

$\lim\limits_{j_{t}\rightarrow\infty}\|\mathscr{X}_{j_{t}+1}-\mathscr{X}_{j_{t}}\|_{F}^{2}=\|\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\tilde{\mathscr{X}}-\frac{1}{\tau}\nabla f(\tilde{\mathscr{X}}))-\tilde{\mathscr{X}}\|_{F}^{2}=0$.

By Proposition 8, $\tilde{\mathscr{X}}$ is thus a critical point. Since no particular limit point is singled out in this argument, all limit points are critical points.  

Recall that $\mathscr{X}_{t+1}$ is generated from $\mathscr{V}_{t}$ in steps 9-12, and that $\mathscr{X}_{t+1}=\mathscr{V}_{t}$ indicates convergence to a critical point (Proposition 8). Thus, we can measure the convergence of Algorithm 2 by $\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\|_{F}$. Corollary 10 shows that a rate of $O(1/T)$ can be obtained, which is also the best possible rate for first-order methods on general nonconvex problems (Nesterov, 2013; Ghadimi and Lan, 2016).

Corollary 10

$\min\nolimits_{t=1,\dots,T}\frac{1}{2}\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\|_{F}^{2}\leq\frac{1}{\eta T}\big[F_{\tau}(\mathscr{X}_{1})-F^{\min}_{\tau}\big]$, where $\eta=\tau-\rho-DL$.

Remark 11

A larger $\tau$ leads to a better approximation of the original objective $F$ (Proposition 7). However, it also makes the stepsize $1/\tau$ smaller (step 11 in Algorithm 2), and thus leads to slower convergence (Corollary 10).

3.4.2 With Kurdyka-Łojasiewicz Condition on the Approximate Objective $F_{\tau}$

In Section 3.4.1, we showed convergence results when f is smooth and g is of the form in (7). In this section, we instead consider the Kurdyka-Łojasiewicz (KL) condition (Attouch et al., 2013; Bolte et al., 2014) on F_{\tau}, which has been popularly used in nonconvex optimization, particularly for gradient descent (Attouch et al., 2013) and proximal gradient algorithms (Bolte et al., 2014; Li and Lin, 2015; Li et al., 2017). For example, the class of semi-algebraic functions satisfies the KL condition. More examples can be found in (Bolte et al., 2010, 2014).

Definition 12

A function h: \mathbb{R}^{n}\rightarrow(-\infty,\infty] has the uniformized KL property if for every compact set \mathcal{S}\subseteq\text{dom}(h) on which h is constant, there exist \epsilon, c>0 such that for all \bar{\bm{u}}\in\mathcal{S} and all \bm{u}\in\{\bm{u}:\min\nolimits_{\bm{v}\in\mathcal{S}}\left\|\bm{u}-\bm{v}\right\|_{2}\leq\epsilon\}\cap\{\bm{u}:h(\bar{\bm{u}})<h(\bm{u})<h(\bar{\bm{u}})+c\}, one has \psi^{\prime}\left(h(\bm{u})-h(\bar{\bm{u}})\right)\min\nolimits_{\bm{v}\in\partial h(\bm{u})}\left\|\bm{v}\right\|_{2}>1, where \psi(\alpha)=\frac{C\alpha^{x}}{x} for some C>0, \alpha\in[0,c) and x\in(0,1].

Since the KL property (Attouch et al., 2013; Bolte et al., 2014) does not require h to be smooth or convex, it allows convergence analysis under the nonconvex and nonsmooth setting. However, it cannot replace the smoothness assumption in Section 3.4.1, as there exist smooth functions that fail to meet the KL condition (Section 4.3 of (Bolte et al., 2010)).

The following Theorem gives the convergence rate of Algorithm 2 under the uniformized KL property.

Theorem 13

Assume that FτF_{\tau} in (31) has the uniformized KL property, and let rt=Fτ(𝒳t)Fτminr_{t}=F_{\tau}(\mathscr{X}_{t})-F_{\tau}^{\min}. For a sufficiently large t0t_{0},

  • a) If x in Definition 12 equals 1, then r_{t}=0 for all t\geq t_{0};
  • b) If x\in[\frac{1}{2},1), then r_{t}\leq(\frac{d_{1}C^{2}}{1+d_{1}C^{2}})^{t-t_{0}}r_{t_{0}}, where d_{1}=2(\tau+\rho)^{2}/\eta;
  • c) If x\in(0,\frac{1}{2}), then r_{t}\leq(\frac{C}{(t-t_{0})d_{2}(1-2x)})^{1/(1-2x)}, where d_{2}=\min\big{(}\frac{1}{2d_{1}C},\frac{C}{1-2x}(2^{\frac{2x-1}{2x-2}}-1)r_{t_{0}}\big{)}.

Proof [Sketch, details are in Appendix B.7.] The proof idea generally follows that of (Bolte et al., 2014), with a special treatment of \mathscr{V}_{t} here. First, we show

limtmin𝒰tFτ(𝒳t)𝒰tFlimt(τ+ρ)𝒳t+1𝒱tF=0.\displaystyle\lim\limits_{t\rightarrow\infty}\min\nolimits_{\mathscr{U}_{t}\in\partial F_{\tau}(\mathscr{X}_{t})}\left\|\mathscr{U}_{t}\right\|_{F}\leq\lim\limits_{t\rightarrow\infty}(\tau+\rho)\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}=0.

Next, using the KL condition, we have

1ψ(Fτ(𝒳t+1)Fτmin)(τ+ρ)𝒳t+1𝒱tF.\displaystyle 1\leq\psi^{\prime}\left(F_{\tau}(\mathscr{X}_{t+1})-F_{\tau}^{\min}\right)(\tau+\rho)\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}.

Then, let rt=Fτ(𝒳t)Fτminr_{t}=F_{\tau}(\mathscr{X}_{t})-F_{\tau}^{\min}. From its definition, we have

rtrt+1Fτ(𝒱t)Fτ(𝒳t+1).\displaystyle r_{t}-r_{t+1}\geq F_{\tau}(\mathscr{V}_{t})-F_{\tau}(\mathscr{X}_{t+1}).

Combining the above three inequalities, we obtain

12(τ+ρ)2η[ψ(rt+1)]2(rtrt+1).\displaystyle 1\leq\frac{2(\tau+\rho)^{2}}{\eta}\left[\psi^{\prime}(r_{t+1})\right]^{2}(r_{t}-r_{t+1}).

Since \psi(\alpha)=\frac{C\alpha^{x}}{x}, we have \psi^{\prime}(\alpha)=C\alpha^{x-1}. The above inequality then becomes 1\leq d_{1}C^{2}r_{t+1}^{2x-2}(r_{t}-r_{t+1}), where d_{1}=\frac{2(\tau+\rho)^{2}}{\eta}. It is shown in (Bolte et al., 2014; Li and Lin, 2015; Li et al., 2017) that a sequence \{r_{t}\} satisfying this inequality converges to zero at the rates stated in the Theorem.

In Corollary 10 and Theorem 13, the convergence rates do not depend on p, and thus do not reflect the effect of momentum. Empirically, the proposed algorithm does converge faster when momentum is used, as will be shown in Section 5. This also agrees with previous studies in (Duchi et al., 2011; Kingma and Ba, 2014; Li and Lin, 2015; Li et al., 2017; Yao et al., 2017).

3.5 Statistical Guarantees

Statistical properties of nonconvex regularization have been studied in the context of sparse and low-rank matrix learning. For example, the SCAD (Fan and Li, 2001), MCP (Zhang, 2010a) and capped-\ell_{1} (Zhang, 2010b) penalties have been shown to be better than the convex \ell_{1}-regularizer on sparse learning problems; and SCAD, MCP and LSP have been shown to be better than the convex nuclear norm in matrix completion (Gui et al., 2016). However, these results cannot be extended to the tensor completion problem here, as the nonconvex overlapped nuclear norm regularizer in (10) is not separable. Statistical analysis of tensor completion has been carried out for CP and Tucker decompositions (Mu et al., 2014), tensor ring decomposition (Huang et al., 2020), the convex overlapped nuclear norm (Tomioka et al., 2011), and the tensor nuclear norm (Yuan and Zhang, 2016; Cheng et al., 2016). These works show that tensor completion is possible under the incoherence condition when the number of observations is sufficiently large. In comparison, in this section, we (i) use the restricted strong convexity condition (Agarwal et al., 2010; Negahban et al., 2012) instead of the incoherence condition, and (ii) study nonconvex overlapped nuclear norm regularization.

3.5.1 Controlling the Spikiness and Rank

In the following, we assume that elements in \bm{\Omega} are drawn i.i.d. from the uniform distribution. When the sample size \left\|\mathbf{\Omega}\right\|_{1}\ll I^{\pi}, however, tensor completion is not always possible. Take the special case of matrix completion as an example. If \mathscr{X} is an almost-zero matrix with only one element equal to 1, it cannot be recovered unless that nonzero element is observed. As the matrix gets larger, the probability of observing the nonzero element vanishes, and so P_{\bm{\Omega}}\left(\mathscr{X}\right)=\mathbf{0} with high probability (Candès and Recht, 2009; Negahban and Wainwright, 2012).

To exclude tensors that are too “spiky” and allow tensor completion, we introduce

mspike(𝒳)=Iπ𝒳max/𝒳F,\displaystyle m_{\text{spike}}(\mathscr{X})=\sqrt{I^{\pi}}\left\|\mathscr{X}\right\|_{\max}/\left\|\mathscr{X}\right\|_{F}, (32)

which is an extension of the measure \sqrt{I^{1}I^{2}}\left\|\bm{X}\right\|_{\max}/\left\|\bm{X}\right\|_{F} in (Negahban and Wainwright, 2012; Gu et al., 2014) for matrices. Note that m_{\text{spike}}(\mathscr{X}) is invariant to the scale of \mathscr{X} and 1\leq m_{\text{spike}}(\mathscr{X})\leq\sqrt{I^{\pi}}. Moreover, m_{\text{spike}}(\mathscr{X})=1 when all elements in \mathscr{X} have the same value (least spiky), and m_{\text{spike}}(\mathscr{X})=\sqrt{I^{\pi}} when \mathscr{X} has only one nonzero element (spikiest). Similarly, to measure how close \mathscr{X} is to being low-rank, we use

mrank(𝒳)=i=1Dαi𝒳i/𝒳F,\displaystyle m_{\text{rank}}(\mathscr{X})=\sum\nolimits_{i=1}^{D}\alpha_{i}\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{*}/\left\|\mathscr{X}\right\|_{F}, (33)

where the \alpha_{i}=\lambda_{i}/\sum_{d=1}^{D}\lambda_{d}'s are pre-defined constants depending on the penalty strength. This also extends the measure \left\|\bm{X}\right\|_{*}/\left\|\bm{X}\right\|_{F} in (Negahban and Wainwright, 2012; Gu et al., 2014) on matrices. Note that m_{\text{rank}}(\mathscr{X})\leq\sum_{i=1}^{D}\alpha_{i}\sqrt{\text{rank}(\mathscr{X}_{{\left\langle i\right\rangle}})}, with equality holding when all nonzero singular values of the \mathscr{X}_{{\left\langle i\right\rangle}}'s are the same. The target tensor \mathscr{X} should thus have small m_{\text{spike}}(\mathscr{X}) and m_{\text{rank}}(\mathscr{X}). In (14), assume for simplicity that D=M and \lambda_{i}=\lambda for i=1,\dots,M. We then have the following constrained version of (14):

min𝒳12P𝛀(𝒳𝒪)F2+λr(𝒳)s.t.𝒳maxC,\displaystyle\min\nolimits_{\mathscr{X}}\frac{1}{2}\left\|P_{{\bm{\bm{\Omega}}}}\left(\mathscr{X}-\mathscr{O}\right)\right\|_{F}^{2}+\lambda r(\mathscr{X})\quad\text{s.t.}\quad\left\|\mathscr{X}\right\|_{\max}\leq C, (34)

where r(\mathscr{X})=\sum\nolimits_{i=1}^{D}\phi(\mathscr{X}_{{\left\langle i\right\rangle}}) encourages \mathscr{X} to be low-rank (i.e., small m_{\text{rank}}), and the constraint on \left\|\mathscr{X}\right\|_{\max} prevents \mathscr{X} from being spiky (i.e., small m_{\text{spike}}).
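To make the two measures concrete, the following is a minimal NumPy sketch (an illustration on our part, not the paper's Matlab code) that evaluates m_{\text{spike}} in (32) and m_{\text{rank}} in (33); the mode-i unfolding convention (mode i as rows) and the uniform weights \alpha_{i}=1/D are illustrative assumptions.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` unfolding: move that axis to the front and flatten the rest."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def m_spike(X):
    """Spikiness measure (32): sqrt(prod(I)) * ||X||_max / ||X||_F."""
    return np.sqrt(X.size) * np.abs(X).max() / np.linalg.norm(X)

def m_rank(X, alphas=None):
    """Rank measure (33): sum_i alpha_i * ||X_<i>||_* / ||X||_F."""
    D = X.ndim
    if alphas is None:          # uniform weights, an illustrative default
        alphas = np.ones(D) / D
    nuclear = [np.linalg.svd(unfold(X, i), compute_uv=False).sum() for i in range(D)]
    return sum(a * n for a, n in zip(alphas, nuclear)) / np.linalg.norm(X)

# a low-rank, non-spiky example: outer product of random Gaussian vectors
a, b, c = np.random.randn(20), np.random.randn(30), np.random.randn(40)
X = np.einsum('i,j,k->ijk', a, b, c)
print(m_spike(X), m_rank(X))  # m_rank is 1 here, since every unfolding has rank 1
```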

3.5.2 Restricted Strong Convexity (RSC)

Following (Tomioka et al., 2011; Negahban and Wainwright, 2012; Loh and Wainwright, 2015; Zhu et al., 2018), we introduce the restricted strong convexity (RSC) condition.

Definition 14

(Restricted strong convexity (RSC) condition (Agarwal et al., 2010)) Let Δ\Delta be an arbitrary MM-order tensor. It satisfies the RSC condition if there exist constants α1,α2>0\alpha_{1},\alpha_{2}>0 and τ1,τ20\tau_{1},\tau_{2}\geq 0 such that

P𝛀(Δ)F2{α1ΔF2τ1logIπ𝛀1(i=1MΔi)2if ΔF1α2ΔF2τ2logIπ𝛀1(i=1MΔi)otherwise.\displaystyle\left\|P_{{\bm{\bm{\Omega}}}}\left(\Delta\right)\right\|_{F}^{2}\geq\begin{cases}\alpha_{1}\left\|\Delta\right\|_{F}^{2}-\tau_{1}\frac{\log I^{\pi}}{\|{\bm{\bm{\Omega}}}\|_{1}}\big{(}\sum_{i=1}^{M}\left\|\Delta_{{\left\langle i\right\rangle}}\right\|_{*}\big{)}^{2}&\text{if }\left\|\Delta\right\|_{F}\leq 1\\ \alpha_{2}\left\|\Delta\right\|_{F}^{2}-\tau_{2}\sqrt{\frac{\log I^{\pi}}{\|{\bm{\bm{\Omega}}}\|_{1}}}\big{(}\sum_{i=1}^{M}\left\|\Delta_{{\left\langle i\right\rangle}}\right\|_{*}\big{)}&\text{otherwise}\end{cases}. (35)

Let di=12(Ii+IπIi)d_{i}=\frac{1}{2}(I_{i}+\frac{I^{\pi}}{I_{i}}) for i=1,,Mi=1,\dots,M. Consider the set of tensors parameterized by n,γ0n,\gamma\geq 0:

𝒞~(n,γ)={𝒳I1××IM,𝒳0|mspike(𝒳)mrank(𝒳)1γLmini=1,,Mndilogdi},\displaystyle\tilde{\mathcal{C}}(n,\gamma)=\left\{\mathscr{X}\in\mathbb{R}^{I_{1}\times\dots\times I_{M}},\mathscr{X}\not=0\;|\;m_{\text{spike}}(\mathscr{X})\cdot m_{\text{rank}}(\mathscr{X})\leq\frac{1}{\gamma L}\min_{i=1,\dots,M}\sqrt{\frac{n}{d_{i}\log d_{i}}}\right\},

where L is a positive constant. The following Lemma shows that the RSC condition holds when the low-rank tensor is not too spiky. If the RSC condition does not hold, the tensor may be too hard to recover.

Lemma 15

There exist constants c_{0}, c_{1}, c_{2}, c_{3}\geq 0 such that for all \Delta\in\tilde{\mathcal{C}}(\left\|\bm{\Omega}\right\|_{1},c_{0}), where \left\|\bm{\Omega}\right\|_{1}>c_{3}\max_{i=1,\dots,M}(d_{i}\log d_{i}), we have

P𝛀(Δ)F𝛀118ΔF{1128Lmspike(Δ)𝛀1},\displaystyle\frac{\left\|P_{{\bm{\bm{\Omega}}}}\left(\Delta\right)\right\|_{F}}{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}\geq\frac{1}{8}\left\|\Delta\right\|_{F}\left\{1-\frac{128L\cdot m_{\text{spike}}(\Delta)}{\sqrt{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}}\right\}, (36)

with a high probability of at least 1maxi=1,,Mc1exp(c2dilogdi)1-\max_{i=1,\dots,M}c_{1}\exp(-c_{2}d_{i}\log d_{i}).

Another condition commonly used in low-rank matrix/tensor learning is incoherence (Candès and Recht, 2009; Mu et al., 2014; Yuan and Zhang, 2016), which prevents information of the row/column spaces of the matrix/tensor from being too concentrated in a few rows/columns. However, as discussed in (Negahban and Wainwright, 2012), the RSC condition is less restrictive than the incoherence condition, and can better describe “spikiness” (details are in Appendix A). Thus, we adopt the RSC instead of the incoherence condition here.

3.5.3 Main results

Let 𝒳I1××IM\mathscr{X}^{*}\in\mathbb{R}^{I_{1}\times\dots\times I_{M}} be the ground-truth tensor, and 𝒳~\tilde{\mathscr{X}} be an estimate of 𝒳\mathscr{X}^{*} obtained as a critical point of (34). The following bounds the distance between 𝒳\mathscr{X}^{*} and 𝒳~\tilde{\mathscr{X}}.

Theorem 16

Assume that \kappa is differentiable, and that the RSC condition holds with 3\kappa_{0}M/4<\alpha_{1}. Assume also that there exists a positive constant R>0 such that \sum_{i=1}^{M}\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{*}\leq R, and that \lambda satisfies

4κ0max(maxi[P𝛀(𝒳𝒪)]i,α2logIπ/𝛀1)λα24Rκ0,\displaystyle\frac{4}{\kappa_{0}}\max\left(\max_{i}\left\|\left[P_{{\bm{\bm{\Omega}}}}\left(\mathscr{X}^{*}-\mathscr{O}\right)\right]_{{\left\langle i\right\rangle}}\right\|_{\infty},\alpha_{2}\sqrt{\log I^{\pi}/\left\|{\bm{\bm{\Omega}}}\right\|_{1}}\right)\leq\lambda\leq\frac{\alpha_{2}}{4R\kappa_{0}}, (37)

where 𝛀1max(τ12,τ22)16R2log(Iπ)α22\|{\bm{\bm{\Omega}}}\|_{1}\geq\max\left(\tau_{1}^{2},\tau_{2}^{2}\right)\frac{16R^{2}\log(I^{\pi})}{\alpha_{2}^{2}}. Then,

𝒳𝒳~Fλκ0cvavi=1Mki,\displaystyle\big{\|}\mathscr{X}^{*}-\tilde{\mathscr{X}}\big{\|}_{F}\leq\frac{\lambda\kappa_{0}c_{v}}{a_{v}}\sum\nolimits_{i=1}^{M}\sqrt{k_{i}}, (38)

where av=α13Mκ04a_{v}=\alpha_{1}-\frac{3M\kappa_{0}}{4}, cv=112Mc_{v}=1-\frac{1}{2M}, and kik_{i} is the rank of 𝒳i\mathscr{X}^{*}_{{\left\langle i\right\rangle}}.

Proof [Sketch, details are in Appendix B.9.2.] The general idea of this proof is inspired by (Loh and Wainwright, 2015). Note, however, that Loh and Wainwright (2015) use different mathematical tools, as they consider sparse vectors with separable dimensions, while we consider overlapped tensor regularization with coupled singular values. There are three main steps:

  • Let 𝒱~=𝒳~𝒳\tilde{\mathscr{V}}=\tilde{\mathscr{X}}-\mathscr{X}^{*}. We prove by contradiction that 𝒱~F1\|\tilde{\mathscr{V}}\|_{F}\leq 1. Thus, we only need to consider the first condition in (35).

  • Let hi(𝒳)=ϕ(𝒳i)h_{i}(\mathscr{X})\!=\!\phi(\mathscr{X}_{{\left\langle i\right\rangle}}). From Assumption 1, we have that hi(𝒳)+μ2𝒳F2h_{i}(\mathscr{X})+\frac{\mu}{2}\left\|\mathscr{X}\right\|_{F}^{2} is convex. Using this together with the first condition in (35), we obtain

    (α1μM2)𝒱~F2\displaystyle\left(\alpha_{1}-\frac{\mu M}{2}\right)\|\tilde{\mathscr{V}}\|_{F}^{2} λi=1M(hi(𝒳)hi(𝒳~))+λκ02i=1M𝒱~i.\displaystyle\leq\lambda\sum\nolimits_{i=1}^{M}\big{(}h_{i}(\mathscr{X}^{*})-h_{i}(\tilde{\mathscr{X}})\big{)}+\frac{\lambda\kappa_{0}}{2}\sum\nolimits_{i=1}^{M}\|\tilde{\mathscr{V}}_{{\left\langle i\right\rangle}}\|_{*}.
  • Using the above inequality and properties of hih_{i}, we obtain

    av𝒱~F2λi=1Mbvhi(𝒳)cvhi(𝒳~),\displaystyle a_{v}\|\tilde{\mathscr{V}}\|_{F}^{2}\leq\lambda\sum\nolimits_{i=1}^{M}b_{v}h_{i}(\mathscr{X}^{*})-c_{v}h_{i}(\tilde{\mathscr{X}}),

    where av=α13M4κ0a_{v}=\alpha_{1}-\frac{3M}{4}\kappa_{0}, bv=1+12Mb_{v}=1+\frac{1}{2M} and cv=112Mc_{v}=1-\frac{1}{2M}. Finally, using Lemma 30 in Appendix B.9.1 on the above inequality, we have 𝒱~Fλκ0cvavi=1Mki\|\tilde{\mathscr{V}}\|_{F}\leq\frac{\lambda\kappa_{0}c_{v}}{a_{v}}\sum\nolimits_{i=1}^{M}\sqrt{k_{i}}.

 

Since 𝒳maxC\left\|\mathscr{X}\right\|_{\max}\leq C in (34), we have i=1M𝒳ii=1Mki(Ii+IπIi)C\sum_{i=1}^{M}\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{*}\leq\sum_{i=1}^{M}\sqrt{k_{i}(I_{i}+\frac{I^{\pi}}{I_{i}})}C (as 𝑿k𝑿Fmnk𝑿max\left\|{\bm{\bm{X}}}\right\|_{*}\leq\sqrt{k}\left\|{\bm{\bm{X}}}\right\|_{F}\leq\sqrt{mnk}\left\|{\bm{\bm{X}}}\right\|_{\max} for a rank-kk matrix 𝑿m×n{\bm{\bm{X}}}\in\mathbb{R}^{m\times n}). Thus, in Theorem 16, we can take R=i=1MR=\sum_{i=1}^{M} ki(Ii+Iπ/Ii)C\sqrt{k_{i}(I_{i}+I^{\pi}/I_{i})}C, which is finite and cannot be arbitrarily large. While we do not have access to the ground-truth 𝒳\mathscr{X}^{*} in practice, Theorem 16 shows that the critical point 𝒳~\tilde{\mathscr{X}} can be bounded by a finite distance from 𝒳\mathscr{X}^{*}, which means that an arbitrary critical point may not be bad. From (38), we can also see that the error 𝒳𝒳~F\|\mathscr{X}^{*}-\tilde{\mathscr{X}}\|_{F} increases with the tensor order MM and rank kik_{i}. This is reasonable as tensors with higher orders or larger ranks are usually harder to estimate. Besides, recall that κ0\kappa_{0} in Assumption 1 reflects how nonconvex the function κ(α)\kappa(\alpha) is; while α1\alpha_{1} in Definition 14 measures strong convexity. Thus, these two quantities play opposing roles in (38). Specifically, a larger α1\alpha_{1} leads to a larger ava_{v}, and subsequently smaller 𝒳𝒳~F\|\mathscr{X}^{*}-\tilde{\mathscr{X}}\|_{F}; whereas a larger κ0\kappa_{0} leads to a larger λκ0cvav\frac{\lambda\kappa_{0}c_{v}}{a_{v}}, and subsequently larger 𝒳𝒳~F\|\mathscr{X}^{*}-\tilde{\mathscr{X}}\|_{F}.

Finally, note that the range for \lambda in (37) can be empty, which means that there may be no \lambda ensuring Theorem 16. To understand when this can happen, consider the two extreme cases:

  • C1.

    There is no noise in the observations, i.e., P𝛀(𝒳𝒪)P_{{\bm{\bm{\Omega}}}}\left(\mathscr{X}^{*}-\mathscr{O}\right) =0=0: In this case, (37) reduces to

    4α2κ0logIπ/𝛀1λα24Rκ0.\displaystyle\frac{4\alpha_{2}}{\kappa_{0}}\sqrt{\log I^{\pi}/\left\|{\bm{\bm{\Omega}}}\right\|_{1}}\leq\lambda\leq\frac{\alpha_{2}}{4R\kappa_{0}}.

    Thus, such a λ\lambda may not exist when the number of observations 𝛀1\left\|\bm{\Omega}\right\|_{1} is too small.

  • C2.

    All elements are observed: we then have \left\|P_{\bm{\Omega}}\left(\Delta\right)\right\|_{F}=\left\|\Delta\right\|_{F}, and so \alpha_{1}=\alpha_{2}=1 and \tau_{1}=\tau_{2}=0 in Definition 14. Suppose further that the noise is not too small, in the sense that \sqrt{\log I^{\pi}/\left\|\bm{\Omega}\right\|_{1}}\leq\max_{i}\|\left[\mathscr{X}^{*}-\mathscr{O}\right]_{{\left\langle i\right\rangle}}\|_{\infty}. Then, (37) reduces to

    4κ0maxi[𝒳𝒪]iλ14Rκ0.\displaystyle\frac{4}{\kappa_{0}}\max_{i}\left\|\left[\mathscr{X}^{*}-\mathscr{O}\right]_{{\left\langle i\right\rangle}}\right\|_{\infty}\leq\lambda\leq\frac{1}{4R\kappa_{0}}.

    Thus, such a λ\lambda may not exist when the noise is too high.

Overall, when λ\lambda does not exist, it is likely that the tensor completion problem is too hard to have good recovery performance.

On the other hand, there are cases where \lambda always exists. For example, when \mathscr{O}=\mathscr{X}^{*}=0, we have R=0. The requirement on \lambda then becomes \frac{4\alpha_{2}}{\kappa_{0}}\sqrt{\log I^{\pi}/\left\|\bm{\Omega}\right\|_{1}}\leq\lambda\leq+\infty, and such a \lambda always exists.

3.5.4 Dependencies on Noise Level and Number of Observations

In this section, we demonstrate how the noise level affects (38). We assume that the observations are contaminated by additive Gaussian noise, i.e.,

𝒪i1iM={𝒳i1iM+ξi1iMif𝛀i1iM=10otherwise,\displaystyle\mathscr{O}_{i_{1}\dots i_{M}}=\begin{cases}\mathscr{X}^{*}_{i_{1}\dots i_{M}}+\xi_{i_{1}\dots i_{M}}&\text{if}\quad{\bm{\bm{\Omega}}}_{i_{1}\dots i_{M}}=1\\ 0&\text{otherwise}\end{cases}, (39)

where ξi1iM\xi_{i_{1}\dots i_{M}} is a random variable following the normal distribution 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}). The effects of the noise level σ\sigma and number of observations in 𝛀{\bm{\bm{\Omega}}} are shown in Corollaries 17 and 18, respectively, which can be derived from Theorem 16.

Corollary 17

Let =𝒪𝒳\mathscr{E}=\mathscr{O}-\mathscr{X}^{*} and λ=b1maxi[P𝛀()]i\lambda=b_{1}\max_{i}\|\left[P_{{\bm{\bm{\Omega}}}}\left(\mathscr{E}\right)\right]_{{\left\langle i\right\rangle}}\|_{\infty}. When 𝛀1\left\|{\bm{\bm{\Omega}}}\right\|_{1} is sufficiently large and b1[4κ0,α24Rκ0maxi[P𝛀()]i]b_{1}\in[\frac{4}{\kappa_{0}},\frac{\alpha_{2}}{4R\kappa_{0}\max_{i}\|\left[P_{{\bm{\bm{\Omega}}}}\left(\mathscr{E}\right)\right]_{{\left\langle i\right\rangle}}\|_{\infty}}] (to ensure λ\lambda satisfies (37)), then 𝔼[𝒳𝒳~F]σκ0cvIπavi=1Mki\mathbb{E}[\|\mathscr{X}^{*}-\tilde{\mathscr{X}}\|_{F}]\leq\sigma\frac{\kappa_{0}c_{v}\sqrt{I^{\pi}}}{a_{v}}\sum\nolimits_{i=1}^{M}\sqrt{k_{i}}.

Corollary 17 shows that the recovery error decreases as the noise level \sigma gets smaller, and we can expect exact recovery when \sigma=0, which is empirically verified in Section 5.1.4. When \kappa(\alpha)=\alpha, r(\mathscr{X}) becomes the convex overlapped nuclear norm. In this case, Theorem 2 in (Tomioka et al., 2011) shows that the recovery error can be bounded as \big{\|}\mathscr{X}^{*}-\tilde{\mathscr{X}}\big{\|}_{F}\leq O(\sigma\sum\nolimits_{i=1}^{M}\sqrt{k_{i}}). Thus, Corollary 17 can be seen as an extension of Theorem 2 in (Tomioka et al., 2011) to the nonconvex case.

Corollary 18

Let λ=b3logIπ/𝛀1\lambda=b_{3}\sqrt{\log I^{\pi}/\left\|{\bm{\bm{\Omega}}}\right\|_{1}}. Suppose that the noise level σ\sigma is sufficiently small and b3[4,1(4RlogIπ/𝛀1)]b_{3}\in\left[4,\frac{1}{(4R\sqrt{\log I^{\pi}/\left\|{\bm{\bm{\Omega}}}\right\|_{1}})}\right] (to ensure λ\lambda satisfies (37)). Then, 𝒳𝒳~Fb3κ0cvavlogIπ𝛀1i=1Mki\big{\|}\mathscr{X}^{*}\!-\!\tilde{\mathscr{X}}\big{\|}_{F}\!\leq\!\frac{b_{3}\kappa_{0}c_{v}}{a_{v}}\sqrt{\frac{\log I^{\pi}}{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}}\sum\nolimits_{i=1}^{M}\sqrt{k_{i}}.

Corollary 18 shows that the recovery error decays as 𝛀1\sqrt{\left\|{\bm{\bm{\Omega}}}\right\|_{1}} gets larger. Such a dependency on the number of observed elements is the same as in matrix completion problems with nonconvex regularization (Gui et al., 2016). Corollary 18 can be seen as an extension of Corollary 3.6 in (Gui et al., 2016) to the tensor case.

4 Extensions

In this section, we show how the proposed NORT algorithm in Section 3 can be extended for robust tensor completion (Section 4.1) and tensor completion with graph Laplacian regularization (Section 4.2).

4.1 Robust Tensor Completion

In tensor completion applications such as video recovery and shadow removal, the observed data often contain outliers (Candès et al., 2011; Lu et al., 2016a). Instead of the square loss, more robust losses, such as the \ell_{1} loss (Candès et al., 2011; Lu et al., 2013; Gu et al., 2014) and the capped-\ell_{1} loss (Jiang et al., 2015), are then preferred.

In the following, we assume that the loss is of the form (a)=κ(|a|)\ell(a)=\kappa_{\ell}(|a|), where κ\kappa_{\ell} is smooth and satisfies Assumption 1. The optimization problem then becomes

min𝒳F(𝒳)=𝛀i1iM=1κ(|𝒳i1iM𝒪i1iM|)+i=1Dλiϕ(𝒳i).\displaystyle\min\nolimits_{\mathscr{X}}F_{\ell}(\mathscr{X})=\sum\nolimits_{\mathbf{\Omega}_{i_{1}...i_{M}}=1}\kappa_{\ell}\left(|\mathscr{X}_{i_{1}\dots i_{M}}-\mathscr{O}_{i_{1}\dots i_{M}}|\right)+\sum\nolimits_{i=1}^{D}\lambda_{i}\phi(\mathscr{X}_{{\left\langle i\right\rangle}}). (40)

Since κ(|a|)\kappa_{\ell}(|a|) is non-differentiable at a=0a=0, Algorithm 2 cannot be directly used. Motivated by smoothing the 1\ell_{1} loss with the Huber loss (Huber, 1964) and the difference-of-convex decomposition of κ\kappa_{\ell} (Le Thi and Tao, 2005; Yao and Kwok, 2018), we propose to smoothly approximate κ(|a|)\kappa_{\ell}(|a|) by

κ~(|a|;δ)=κ0~(|a|;δ)+(κ(|a|)κ0|a|),\displaystyle\tilde{\kappa}_{\ell}(|a|;\delta)=\kappa_{0}\cdot\tilde{\ell}(|a|;\delta)+\Big{(}\kappa_{\ell}(|a|)-\kappa_{0}\cdot|a|\Big{)}, (41)

where κ0\kappa_{0} is in Assumption 1, δ\delta is a smoothing parameter, and ~\tilde{\ell} is the Huber loss (Huber, 1964):

~(a;δ)={|a||a|δ12δa2+12δotherwise.\displaystyle\tilde{\ell}(a;\delta)=\begin{cases}|a|&|a|\geq\delta\\ \frac{1}{2\delta}a^{2}+\frac{1}{2}\delta&\text{otherwise}\end{cases}.

The following Proposition shows that κ~\tilde{\kappa}_{\ell} is smooth, and a small δ\delta ensures that it is a close approximation to κ\kappa_{\ell}.

Proposition 19

κ~(|a|;δ)\tilde{\kappa}_{\ell}(|a|;\delta) is differentiable and limδ0κ~(|a|;δ)=κ(|a|)\lim_{\delta\rightarrow 0}\tilde{\kappa}_{\ell}(|a|;\delta)=\kappa_{\ell}(|a|).
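As a sanity check on (41), the following small Python sketch (ours, for illustration only) evaluates the smoothed loss \tilde{\kappa}_{\ell}(|a|;\delta). Taking \kappa_{\ell}(\alpha)=\alpha (the \ell_{1} loss) with \kappa_{0}=1 is an assumed special case: the correction term then vanishes, the construction reduces exactly to the Huber loss, and the gap to |a| shrinks with \delta, consistent with Proposition 19.

```python
import numpy as np

def huber(a, delta):
    """Huber smoothing of |a| as defined above."""
    a = np.abs(a)
    return np.where(a >= delta, a, a**2 / (2 * delta) + delta / 2)

def smoothed_robust_loss(a, delta, kappa, kappa0):
    """kappa_tilde(|a|; delta) in (41): kappa0 * huber(|a|) + (kappa(|a|) - kappa0 * |a|)."""
    a = np.abs(a)
    return kappa0 * huber(a, delta) + (kappa(a) - kappa0 * a)

# Illustrative special case: kappa(alpha) = alpha with kappa0 = 1 recovers the Huber loss,
# whose maximum deviation from |a| is delta / 2 (attained at a = 0).
a = np.linspace(-2, 2, 9)
for delta in (0.5, 0.1, 0.01):
    gap = np.max(np.abs(smoothed_robust_loss(a, delta, lambda x: x, 1.0) - np.abs(a)))
    print(delta, gap)
```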

Problem (40) is then transformed to

min𝒳𝛀i1iM=1κ~(|𝒳i1iM𝒪i1iM|;δ)+i=1Dλiϕ(𝒳i).\displaystyle\min\nolimits_{\mathscr{X}}\sum\nolimits_{\mathbf{\Omega}_{i_{1}...i_{M}}=1}\tilde{\kappa}_{\ell}(|\mathscr{X}_{i_{1}...i_{M}}-\mathscr{O}_{i_{1}...i_{M}}|;\delta)+\sum\nolimits_{i=1}^{D}\lambda_{i}\phi(\mathscr{X}_{{\left\langle i\right\rangle}}). (42)

In Algorithm 3, we gradually reduce the smoothing factor in step 3, and use Algorithm 2 to solve the smoothed problem (42) in each iteration.

Algorithm 3 Smoothing NORT for (40).
1:  Initialize δ0(0,1)\delta_{0}\in(0,1) and s=1s=1;
2:  while not converged do
3:     transform to problem (42) with κ~\tilde{\kappa}_{\ell} using δ=(δ0)s\delta=(\delta_{0})^{s};
4:     obtain 𝒳s\mathscr{X}_{s} by solving the smoothed objective with Algorithm 2;
5:     s=s+1s=s+1;
6:  end while
7:  return  𝒳s\mathscr{X}_{s}.
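The continuation scheme in Algorithm 3 is simple to implement once a solver for the smoothed problem (42) is available. In the sketch below, `nort_solve` is a hypothetical placeholder standing in for Algorithm 2 applied to the smoothed objective; warm-starting from the previous iterate and the relative-change stopping test are our illustrative choices, not part of the algorithm's specification.

```python
import numpy as np

def smoothing_nort(nort_solve, delta0=0.5, tol=1e-4, max_outer=20):
    """Continuation loop of Algorithm 3: shrink the smoothing parameter geometrically
    and solve the smoothed problem (42) at each outer iteration."""
    X_prev, X = None, None
    for s in range(1, max_outer + 1):
        delta = delta0 ** s                   # step 3: delta = (delta_0)^s
        X = nort_solve(delta=delta, init=X)   # step 4: placeholder for Algorithm 2
        if X_prev is not None and \
           np.linalg.norm(X - X_prev) <= tol * max(np.linalg.norm(X_prev), 1e-12):
            break                             # illustrative stopping test
        X_prev = X
    return X
```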

Convergence of Algorithm 3 is ensured in Theorem 20. However, the statistical guarantee in Section 3.5 does not hold as the robust loss is not smooth.

Theorem 20

The sequence \{\mathscr{X}_{s}\} generated from Algorithm 3 has at least one limit point, and all limit points are critical points of F_{\ell\tau}(\mathscr{X})=\sum\nolimits_{\mathbf{\Omega}_{i_{1}...i_{M}}=1}\kappa_{\ell}\left(|\mathscr{X}_{i_{1}\dots i_{M}}-\mathscr{O}_{i_{1}\dots i_{M}}|\right)+\bar{g}_{\tau}(\mathscr{X}).

4.2 Tensor Completion with Graph Laplacian Regularization

The graph Laplacian regularizer is often used in tensor completion (Narita et al., 2012; Song et al., 2017). For example, in Section 5.5, we will consider an application in spatial-temporal analysis (Bahadori et al., 2014), namely, climate prediction based on meteorological records. The spatial-temporal data is represented by a 3-order tensor \mathscr{O}\in\mathbb{R}^{I^{1}\times I^{2}\times I^{3}}, where I^{1} is the number of locations, I^{2} is the number of time stamps, and I^{3} is the number of variables corresponding to climate observations (such as temperature and precipitation). Usually, observations are only available at a few stations, and slices in \mathscr{O} corresponding to the unobserved locations are missing. Learning these entries can then be formulated as a tensor completion problem. To allow generalization to the unobserved locations, correlations among locations have to be leveraged. This can be achieved by using the graph Laplacian regularizer (Belkin et al., 2006) on a graph G whose nodes are the locations (Bahadori et al., 2014). Let the affinity matrix of G be \bm{A}\in\mathbb{R}^{I^{1}\times I^{1}}, and the corresponding graph Laplacian matrix be \bm{G}=\bm{D}-\bm{A}, where \bm{D} is diagonal with D_{ii}=\sum_{j}A_{ij}. As the spatial locations are stored along the tensor's first dimension, the graph Laplacian regularizer is defined as h(\mathscr{X}_{{\left\langle 1\right\rangle}})=\text{Tr}(\mathscr{X}_{{\left\langle 1\right\rangle}}^{\top}\bm{G}\mathscr{X}_{{\left\langle 1\right\rangle}}), which encourages nearby stations to have similar observations. When \bm{G}=\bm{I}, it reduces to the commonly used Frobenius-norm regularizer \left\|\mathscr{X}\right\|_{F}^{2} (Hsieh et al., 2015). With the regularizer h(\mathscr{X}_{{\left\langle 1\right\rangle}}), problem (14) is then extended to:

min𝒳𝛀i1iM=1(𝒳i1iM,𝒪i1iM)+i=1Dλiϕ(𝒳i)+μh(𝒳1),\displaystyle\min\nolimits_{\mathscr{X}}\sum\nolimits_{\mathbf{\Omega}_{i_{1}...i_{M}}=1}\ell\left(\mathscr{X}_{i_{1}\dots i_{M}},\mathscr{O}_{i_{1}\dots i_{M}}\right)+\sum\nolimits_{i=1}^{D}\lambda_{i}\,\phi(\mathscr{X}_{{\left\langle i\right\rangle}})+\mu\,h(\mathscr{X}_{{\left\langle 1\right\rangle}}), (43)

where μ\mu is a hyperparameter.
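To illustrate, a short NumPy sketch of the graph Laplacian regularizer h(\mathscr{X}_{\langle 1\rangle}) in (43) is given below; the random symmetric affinity matrix and the mode-1 unfolding convention (first axis as rows) are illustrative assumptions, not the meteorological data used in Section 5.5.

```python
import numpy as np

def graph_laplacian(A):
    """G = D - A, with D the diagonal degree matrix, D_ii = sum_j A_ij."""
    return np.diag(A.sum(axis=1)) - A

def laplacian_reg(X, G):
    """h(X_<1>) = Tr(X_<1>^T G X_<1>), with locations along the first tensor dimension."""
    X1 = X.reshape(X.shape[0], -1)   # mode-1 unfolding (first axis as rows)
    return np.trace(X1.T @ G @ X1)

# toy example: I1 = 5 locations, I2 = 4 time stamps, I3 = 3 variables
A = np.random.rand(5, 5); A = (A + A.T) / 2; np.fill_diagonal(A, 0)   # symmetric affinities
G = graph_laplacian(A)
X = np.random.randn(5, 4, 3)
print(laplacian_reg(X, G))   # equals (1/2) * sum_{i,j} A_ij * ||row_i(X_<1>) - row_j(X_<1>)||^2
```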

Using the PA algorithm, it can be easily seen that the updates in (18)-(20) for 𝒳t\mathscr{X}_{t} and 𝒴t\mathscr{Y}_{t} remain the same, but that for 𝒵t\mathscr{Z}_{t} becomes

\mathscr{Z}_{t}=\mathscr{X}_{t}-\frac{1}{\tau}\xi(\mathscr{X}_{t})-\mu[\bm{G}(\mathscr{X}_{t})_{{\left\langle 1\right\rangle}}]^{{\left\langle 1\right\rangle}}.

To maintain efficiency of NORT, the key is to exploit the low-rank structures. Using (22), 𝒵t\mathscr{Z}_{t} can be written as

𝒵t\displaystyle\mathscr{Z}_{t} =\displaystyle= i=1D(𝑼ti(𝑽ti))i1τξ(𝒳t)μ[𝑮𝒳1]1.\displaystyle\sum\nolimits_{i=1}^{D}(\bm{U}_{t}^{i}(\bm{V}_{t}^{i})^{\top})^{{\left\langle i\right\rangle}}-\frac{1}{\tau}\xi(\mathscr{X}_{t})-\mu[\bm{G}\mathscr{X}_{{\left\langle 1\right\rangle}}]^{{\left\langle 1\right\rangle}}. (44)

𝑮𝒳1\bm{G}\mathscr{X}_{{\left\langle 1\right\rangle}} can also be rewritten in low-rank form as

𝑮𝒳1=(𝑮𝑼t1)(𝑽t1)+𝑮j1[(𝑼tj(𝑽tj))j]1.\displaystyle\bm{G}\mathscr{X}_{{\left\langle 1\right\rangle}}=(\bm{G}\bm{U}_{t}^{1})(\bm{V}_{t}^{1})^{\top}+\bm{G}\sum\nolimits_{j\neq 1}\big{[}(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{{\left\langle j\right\rangle}}\big{]}_{{\left\langle 1\right\rangle}}.

For matrix multiplications of the forms 𝒂(𝒵t)i\bm{a}^{\top}(\mathscr{Z}_{t})_{{\left\langle i\right\rangle}} and (𝒵t)i𝒃(\mathscr{Z}_{t})_{{\left\langle i\right\rangle}}\bm{b} involved in the SVD of the proximal step, we have

𝒂(𝒵t)i=(𝒂(𝑰μ𝑮)𝑼ti)(𝑽ti)+ji𝒂(𝑰μ𝑮)[(𝑼tj(𝑽tj))j]i1τ𝒂[ξ(𝒳t)]i,\displaystyle\!\!\bm{a}^{\top}(\mathscr{Z}_{t})_{{\left\langle i\right\rangle}}\!=\!(\bm{a}^{\top}(\bm{I}\!-\!\mu\bm{G})\bm{U}_{t}^{i})(\bm{V}_{t}^{i})^{\top}\!\!\!+\!\sum\nolimits_{j\neq i}\bm{a}^{\top}(\bm{I}\!-\!\mu\bm{G})\big{[}(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{{\left\langle j\right\rangle}}\big{]}_{{\left\langle i\right\rangle}}\!\!-\!\frac{1}{\tau}\bm{a}^{\top}[\xi(\mathscr{X}_{t})]_{{\left\langle i\right\rangle}},\!\! (45)

and

(𝒵t)i𝒃=(𝑰μ𝑮)𝑼ti[(𝑽ti)𝒃]+(𝑰μ𝑮)ji[(𝑼tj(𝑽tj))j]i𝒃1τ[ξ(𝒳t)]i𝒃.\displaystyle(\mathscr{Z}_{t})_{{\left\langle i\right\rangle}}\bm{b}=(\bm{I}\!-\!\mu\bm{G})\bm{U}_{t}^{i}\big{[}(\bm{V}_{t}^{i})^{\top}\bm{b}\big{]}+(\bm{I}-\mu\bm{G})\sum\nolimits_{j\neq i}\big{[}(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{{\left\langle j\right\rangle}}\big{]}_{{\left\langle i\right\rangle}}\bm{b}-\frac{1}{\tau}[\xi(\mathscr{X}_{t})]_{{\left\langle i\right\rangle}}\bm{b}. (46)

Thus, one can still leverage the efficient computational procedures in Proposition 4 to compute 𝒂^[(𝑼tj(𝑽tj))j]i\bm{\hat{a}}^{\top}[(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{{\left\langle j\right\rangle}}]_{{\left\langle i\right\rangle}}, where 𝒂^=𝒂(𝑰μ𝑮)\bm{\hat{a}}^{\top}\!\!\!=\!\bm{a}^{\top}\!(\bm{I}\!-\!\mu\bm{G}) in (45), and [(𝑼tj(𝑽tj))j]i𝒃[(\bm{U}_{t}^{j}(\bm{V}_{t}^{j})^{\top})^{{\left\langle j\right\rangle}}]_{{\left\langle i\right\rangle}}\bm{b} in (46).

By taking f(\mathscr{X})=\sum\nolimits_{\mathbf{\Omega}_{i_{1}...i_{M}}=1}\ell\left(\mathscr{X}_{i_{1}\dots i_{M}},\mathscr{O}_{i_{1}\dots i_{M}}\right)+\mu\,h(\mathscr{X}_{{\left\langle 1\right\rangle}}), it is easy to see that the convergence analysis in Section 3.4 and the statistical analysis in Section 3.5 still hold.

5 Experiments

In this section, experiments are performed on both synthetic (Section 5.1) and real-world data sets (Sections 5.2-5.5), using a PC with Intel-i9 CPU and 32GB memory. To reduce statistical variation, all results are averaged over five repetitions.

5.1 Synthetic Data

We follow the setup in (Song et al., 2017). First, we generate a 3-order tensor (i.e., M=3M=3) 𝒪¯=i=1rgsi(𝒂i𝒃i𝒄i)\bar{\mathscr{O}}=\sum_{i=1}^{r_{g}}s_{i}(\bm{a}_{i}\circ\bm{b}_{i}\circ\bm{c}_{i}), where 𝒂iI1\bm{a}_{i}\in\mathbb{R}^{I^{1}}, 𝒃iI2\bm{b}_{i}\in\mathbb{R}^{I^{2}} and 𝒄iI3\bm{c}_{i}\in\mathbb{R}^{I^{3}}, \circ denotes the outer product (i.e., [𝒂𝒃𝒄]ijk=aibjck[\bm{a}\circ\bm{b}\circ\bm{c}]_{ijk}=a_{i}b_{j}c_{k}). rgr_{g} denotes the ground-truth rank and is set to 5, with all kik_{i}’s equal to rg=5r_{g}=5. All elements in 𝒂i\bm{a}_{i}’s, 𝒃i\bm{b}_{i}’s, 𝒄i\bm{c}_{i}’s and sis_{i}’s are sampled independently from the standard normal distribution. Each element of 𝒪¯\bar{\mathscr{O}} is then corrupted by noise from 𝒩(0,0.012)\mathcal{N}(0,0.01^{2}) to form 𝒪\mathscr{O}. A total of 𝛀1=I3rgi=13Iilog(Iπ)\|{\bm{\bm{\Omega}}}\|_{1}=\frac{I^{3}}{r_{g}}\sum_{i=1}^{3}I^{i}\log(I^{\pi}) random elements are observed from 𝒪\mathscr{O}. We use 50%50\% of them for training, and the remaining 50%50\% for validation. Testing is evaluated on the unobserved elements in 𝒪¯\mathscr{\bar{O}}.
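For reference, the synthetic data generation just described can be sketched in NumPy as follows (our illustrative re-implementation, not the Matlab code used in the experiments); the testing RMSE of Section 5.1.1 is included for completeness.

```python
import numpy as np

def make_synthetic(I1, I2, I3, r_g=5, noise_std=0.01, seed=0):
    """Generate the rank-r_g ground truth O_bar = sum_i s_i (a_i o b_i o c_i),
    its noisy version O, and a random observation mask Omega."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I1, r_g))
    B = rng.standard_normal((I2, r_g))
    C = rng.standard_normal((I3, r_g))
    s = rng.standard_normal(r_g)
    O_bar = np.einsum('r,ir,jr,kr->ijk', s, A, B, C)
    O = O_bar + noise_std * rng.standard_normal(O_bar.shape)
    I_pi = I1 * I2 * I3
    n_obs = int(I3 / r_g * (I1 + I2 + I3) * np.log(I_pi))   # number of observed entries
    mask = np.zeros(I_pi, dtype=bool)
    mask[rng.choice(I_pi, size=n_obs, replace=False)] = True
    return O_bar, O, mask.reshape(O.shape)

def test_rmse(X, O_bar, Omega):
    """Testing RMSE of Section 5.1.1, computed on the unobserved entries."""
    unobserved = ~Omega
    return np.linalg.norm((X - O_bar)[unobserved]) / np.sqrt(unobserved.sum())

# small example (the experiments use hat{c} = 200 and 400)
O_bar, O, Omega = make_synthetic(100, 100, 100)
```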

We use the square loss and three nonconvex penalties: capped-1\ell_{1} (Zhang, 2010a), LSP (Candès et al., 2008) and TNN (Hu et al., 2013). The following methods are compared:

  • PA-APG (Yu, 2013), which solves the convex overlapped nuclear norm minimization problem;

  • GDPAN (Zhong and Kwok, 2014), which directly applies the PA algorithm to (14) as described in (18)-(20);

  • LRTC (Chen et al., 2020), which uses ADMM (Boyd et al., 2011) on (14) as described in (15)-(17); and

  • The proposed NORT algorithm (Algorithm 2), and its slower variant without adaptive momentum (denoted “sNORT”). Recall from Corollary 10 that τ\tau has to be larger than ρ+Dκ0\rho+D\kappa_{0}. However, a large τ\tau leads to slow convergence (Remark 11). Hence, we set τ=1.01(ρ+Dκ0)\tau=1.01(\rho+D\kappa_{0}). Moreover, as in (Li et al., 2017), we set γ1=0.1\gamma_{1}=0.1 and p=0.5p=0.5 in Algorithm 2.

All algorithms are implemented in Matlab, with sparse tensor and matrix operations performed via Mex files in C. All hyperparameters (including the \lambda_{i}'s in (14) and the hyperparameters of the baselines) are tuned by grid search on the validation set. Training is stopped early if the relative change of the objective in consecutive iterations is smaller than 10^{-4}, or when the maximum of 2000 iterations is reached.

5.1.1 Recovery Performance Comparison

In this experiment, we set I1=I2=I3=c^I^{1}=I^{2}=I^{3}=\hat{c}, where c^=200\hat{c}=200 and 400400. Following (Lu et al., 2016b; Yao et al., 2017, 2019b), performance is evaluated by the (i) root-mean-square-error on the unobserved elements: RMSE=P𝛀¯(𝒳𝒪¯)F/𝛀¯10.5\text{RMSE}=\left\|P_{\bar{{\bm{\bm{\Omega}}}}}(\mathscr{X}-\bar{\mathscr{O}})\right\|_{F}/\left\|\bar{{\bm{\bm{\Omega}}}}\right\|_{1}^{0.5}, where 𝒳\mathscr{X} is the low-rank tensor recovered, and 𝛀¯\bar{{\bm{\bm{\Omega}}}} contains the unobserved elements in 𝒪¯\bar{\mathscr{O}}; (ii) CPU time; and (iii) space, which is measured as the memory used by MATLAB when running each algorithm.

Results on RMSE and space are shown in Table 3. We can see that the nonconvex regularizers (capped-1\ell_{1}, LSP and TNN, with methods GDPAN, LRTC, sNORT and NORT) all yield almost the same RMSE, which is much lower than that of using the convex nuclear norm regularizer in PA-APG. As for the space required, sNORT and NORT take orders of magnitude smaller space than the others. LRTC takes the largest space due to the use of multiple auxiliary and dual variables. Convergence of the optimization objective is shown in Figure 1. As can be seen, NORT is the fastest, followed by sNORT and GDPAN, while LRTC is the slowest. These demonstrate the benefits of avoiding repeated tensor folding/unfolding operations and faster convergence of the proximal average algorithm.

Table 3: Testing RMSE and space required for the synthetic data.
c^=200\hat{c}=200 (sparsity:4.77%4.77\%) c^=400\hat{c}=400 (sparsity:2.70%2.70\%)
RMSE space (MB) RMSE space (MB)
convex PA-APG 0.0110±\pm0.0007 600.8±\pm70.4 0.0098±\pm0.0001 4804.5±\pm598.2
GDPAN 0.0010±\pm0.0001 423.1±\pm11.4 0.0006±\pm0.0001 3243.3±\pm489.6
nonconvex LRTC 0.0010±\pm0.0001 698.9±\pm21.5 0.0006±\pm0.0001 5870.6±\pm514.0
(capped-1\ell_{1}) sNORT 0.0010±\pm0.0001 10.1±\pm0.1 0.0006±\pm0.0001 44.6±\pm0.3
NORT 0.0009±\pm0.0001 14.4±\pm0.1 0.0006±\pm0.0001 66.3±\pm0.6
GDPAN 0.0010±\pm0.0001 426.9±\pm9.7 0.0006±\pm0.0001 3009.3±\pm376.2
nonconvex LRTC 0.0010±\pm0.0001 714.0±\pm24.1 0.0006±\pm0.0001 5867.7±\pm529.1
(LSP) sNORT 0.0010±\pm0.0001 10.8±\pm0.1 0.0006±\pm0.0001 44.6±\pm0.2
NORT 0.0010±\pm0.0001 14.0±\pm0.1 0.0006±\pm0.0001 62.1±\pm0.5
GDPAN 0.0010±\pm0.0001 427.3±\pm10.1 0.0006±\pm0.0001 3009.2±\pm412.2
nonconvex LRTC 0.0010±\pm0.0001 759.0±\pm24.3 0.0006±\pm0.0001 5865.5±\pm519.3
(TNN) sNORT 0.0010±\pm0.0001 10.2±\pm0.1 0.0006±\pm0.0001 44.7±\pm0.2
NORT 0.0010±\pm0.0001 14.4±\pm0.2 0.0006±\pm0.0001 63.1±\pm0.6
Figure 1: Convergence of the objective versus number of iterations (top) and CPU time (bottom) on the synthetic data (with \hat{c}=400). Panels: (a) capped-\ell_{1}; (b) LSP; (c) TNN.

5.1.2 Ranks during Iteration

Unlike factorization methods which explicitly constrain the iterate’s rank, in NORT (Algorithm 2), this is only implicitly controlled by the nonconvex regularizer. As shown in Table 2, having a large rank during the iteration may affect the efficiency of NORT. Figure 2 shows the ranks of (𝒵t)i(\mathscr{Z}_{t})_{{\left\langle i\right\rangle}} and 𝑿t+1i{\bm{\bm{X}}}_{t+1}^{i} at step 11 of Algorithm 2. As can be seen, the ranks of the iterates remain small compared with the tensor size (c^=400\hat{c}=400). Moreover, the ranks of 𝑿t+11{\bm{\bm{X}}}_{t+1}^{1}, 𝑿t+12{\bm{\bm{X}}}_{t+1}^{2}, and 𝑿t+13{\bm{\bm{X}}}_{t+1}^{3} all converge to the true rank (i.e., 55) of the ground-truth tensor.

Figure 2: Ranks of \{(\mathscr{Z}_{t})_{{\left\langle i\right\rangle}},\bm{X}_{t+1}^{i}\}_{i=1,2,3} versus number of iterations on synthetic data (with \hat{c}=400). Panels: (a) capped-\ell_{1}; (b) LSP; (c) TNN.

5.1.3 Quality of Critical Points

In this experiment, we empirically validate the statistical performance of the critical points analyzed in Theorem 16. Note that \mathscr{X}_{0} and \mathscr{X}_{1} are initialized as the zero tensor in Algorithm 2, and \mathscr{X}_{t} is implicitly stored as a summation of D factorized matrices in (22). Here, to obtain different runs, we instead randomly generate \mathscr{X}_{0}=\mathscr{X}_{1}=\sum\nolimits_{i=1}^{D}\big{(}\bm{u}^{i}(\bm{v}^{i})^{\top}\big{)}^{{\left\langle i\right\rangle}}, where elements in the \bm{u}^{i}'s and \bm{v}^{i}'s follow \mathcal{N}(0,1). The statistical error is measured as the distance \|\mathscr{X}_{t}-\mathscr{X}^{*}\|_{F}^{2} between the NORT iterate \mathscr{X}_{t} (Algorithm 2) and the underlying ground-truth \mathscr{X}^{*}, while the optimization error is measured as the distance \|\mathscr{X}_{t}-\dot{\mathscr{X}}\|_{F}^{2} between \mathscr{X}_{t} and the globally optimal solution \dot{\mathscr{X}} of (14). We use the same experimental setup as in Section 5.1.1. As the exact \dot{\mathscr{X}} is not known, it is approximated by the \tilde{\mathscr{X}} that attains the lowest training objective value over 20 repetitions.

Figure 3 shows the statistical error versus optimization error obtained by NORT with the (smooth) LSP regularizer and (nonsmooth) capped-1\ell_{1} regularizer. While both the statistical and optimization errors decrease with more iterations, the statistical error is generally larger than the optimization error since we may not have exact recovery when noise is present. Moreover, the optimization errors for different runs terminate at different values, indicating that NORT indeed converges to different local solutions. However, all these have similar statistical errors, which validates Theorem 16. Finally, while the capped-1\ell_{1} regularizer does not satisfy Assumption 1 (which is required by Theorem 16), Figure 3(b) still shows a similar pattern as Figure 3(a). This helps explain the good empirical performance obtained by the capped-1\ell_{1} regularizer (Jiang et al., 2015; Lu et al., 2016b; Yao et al., 2019b).

Figure 3: Statistical error (red) and optimization error (black) versus the number of NORT iterations (with \hat{c}=400) from 20 runs of NORT (with different random seeds). Panels: (a) LSP; (b) capped-\ell_{1}.

5.1.4 Effects of Noise Level and Number of Observations

In this section, we show the effects of noise level σ\sigma and number of observed elements 𝛀1\left\|\mathbf{\Omega}\right\|_{1} on the testing RMSE and training time. We use the same experimental setup as in Section 5.1.1. Since PA-APG is much worse (see Table 3) while LRTC and sNORT are slower than NORT (see Figure 1), we only use GDPAN as comparison baseline.

Figure 4(a) shows the testing RMSE with σ\sigma at different 𝛀1\left\|\mathbf{\Omega}\right\|_{1}’s (here, we plot s=𝛀1/Iπs=\left\|{\bm{\bm{\Omega}}}\right\|_{1}/I^{\pi}). As can be seen, the curves show a linear dependency on σ\sigma when 𝛀1\left\|{\bm{\bm{\Omega}}}\right\|_{1} is sufficiently large, which agrees with Corollary 17. Figure 4(b) shows the testing RMSE versus logIπ/𝛀1\sqrt{\log I^{\pi}/\left\|{\bm{\bm{\Omega}}}\right\|_{1}} at different σ\sigma’s. As can be seen, there is a linear dependency when the noise level σ\sigma is small, which agrees with Corollary 18. Finally, note that NORT and GDPAN obtain very similar testing RMSEs as both solve the same objective (but with different algorithms).

Figure 4: Effect of noise level and number of observations on the testing RMSE on the synthetic data (with \hat{c}=400): (a) different noise levels; (b) different numbers of observations. Note that NORT and GDPAN obtain similar performance and their curves overlap with each other.

Figure 5 shows the effects of the noise level on the convergence of the testing RMSE versus (training) CPU time. As can be seen, the testing RMSE generally terminates at a higher level when the noise level gets larger, and NORT is much faster than GDPAN under all noise levels.

Figure 5: Effects of the noise level on the convergence on synthetic data (with \hat{c}=400, s=2.5\%): (a) NORT; (b) GDPAN.

Figure 6 shows the effects of the number of observations on the convergence of the testing RMSE versus (training) CPU time. First, NORT is much faster than GDPAN under various numbers of observations. Moreover, when s gets smaller and the tensor completion problem becomes more ill-posed, both NORT and GDPAN need more iterations, and thus more time, to converge.

Figure 6: Effect of the number of observations on the convergence on synthetic data (with \hat{c}=400, \sigma=10^{-2}): (a) NORT; (b) GDPAN.

5.1.5 Effects of Tensor Order and Rank

In this experiment, we use a similar experimental setup as in Section 5.1.1, except that the tensor order M is varied from 2 to 5. As high-order tensors have large memory requirements, we always set I^{1}=I^{2}=I^{3}=\hat{c}=400, and set I^{4}=5 when M=4 and I^{4}=I^{5}=5 when M=5. Figure 7(a) shows the testing RMSE versus M. As can be seen, the error grows almost linearly, which agrees with Theorem 16. Moreover, note that at M=5, GDPAN runs out of memory because it needs to maintain dense tensors in each iteration.

Figure 7(b) shows the testing RMSE w.r.t. rg\sqrt{r_{g}} (where rgr_{g} is the ground-truth tensor rank). As can be seen, the error grows linearly w.r.t. rg\sqrt{r_{g}}, which again agrees with Theorem 16.

Figure 7: Effects of tensor order and ground-truth rank on the testing RMSE on the synthetic data (with \hat{c}=400): (a) effect of tensor order M (r_{g}=5); (b) effect of ground-truth rank r_{g} at M=3 (the corresponding r_{g} values are 5, 10, 15 and 20). Note that NORT and GDPAN obtain similar performance and their curves overlap with each other.

Figure 8(a) shows the convergence of testing RMSE versus (training) CPU time at different tensor orders. As can be seen, while both GDPAN and NORT need more time to converge for higher-order tensors, NORT is consistently faster than GDPAN. Figure 8(b) shows the convergence of testing RMSE at different ground-truth ranks. As can be seen, while NORT is still faster than GDPAN at different ground-truth tensor ranks (rgr_{g}), the relative speedup gets smaller when rgr_{g} gets larger. This is because NORT needs to construct sparse tensors (e.g., Algorithm 1) before using them for multiplications, and empirically, the handling of sparse tensors requires more time on memory addressing as the rank increases (Bader and Kolda, 2007).

Figure 8: Effects of tensor order and ground-truth rank on the convergence on the synthetic data (with \hat{c}=400): (a) effect of tensor order M (r_{g}=5); (b) effect of ground-truth rank (M=3). GDPAN runs out of memory when M=5.

5.2 Tensor Completion Applications

Table 4: Algorithms compared on the real-world data sets.
algorithm model basic solver
convex ADMM (Tomioka et al., 2010) ADMM
FaLRTC (Liu et al., 2013) overlapped nuclear norm accelerated proximal algorithm on dual problem
PA-APG (Yu, 2013) accelerated PA algorithm
FFW (Guo et al., 2017) latent nuclear norm efficient Frank-Wolfe algorithm
TR-MM (Nimishakavi et al., 2018) squared latent nuclear norm Riemannian optimization on dual problem
TenNN (Zhang and Aeron, 2017) tensor-SVD ADMM
factorization RP (Kasai and Mishra, 2016) Turker decomposition Riemannian preconditioning
TMac (Xu et al., 2013) multiple matrices factorization alternative minimization
CP-WOPT (Hong et al., 2020) CP decomposition nonlinear conjugate gradient
TMac-TT (Bengua et al., 2017) tensor-train decomposition alternative minimization
TRLRF (Yuan et al., 2019) tensor-ring decomposition ADMM
non-convex GDPAN (Zhong and Kwok, 2014) nonconvex PA algorithm
LRTC (Chen et al., 2020) nonconvex overlapped nuclear norm regularization ADMM
NORT (Algorithm 2) proposed algorithm

In this section, we use the square loss. As different nonconvex regularizers have similar performance, we will only use LSP in the sequel. The proposed NORT algorithm is compared with the following (we use our own implementations of LRTC, PA-APG and GDPAN, as their codes are not publicly available):

  1. (i)

    algorithms for various convex regularizers, including ADMM (Boyd et al., 2011; https://web.stanford.edu/~boyd/papers/admm/), PA-APG (Yu, 2013), FaLRTC (Liu et al., 2013; https://github.com/andrewssobral/mctc4bmi/tree/master/algs_tc/LRTC), FFW (Guo et al., 2017; https://github.com/quanmingyao/FFWTensor), TR-MM (Nimishakavi et al., 2018; https://github.com/madhavcsa/Low-Rank-Tensor-Completion), and TenNN (Zhang and Aeron, 2017; http://www.ece.tufts.edu/~shuchin/software.html);

  2. (ii)

    factorization-based algorithms, including RP (Kasai and Mishra, 2016), TMac (Xu et al., 2013), CP-WOPT (Hong et al., 2020), TMac-TT (Bengua et al., 2017) and TRLRF (Yuan et al., 2019); and

  3. (iii)

    algorithms that can handle nonconvex regularizers, including GDPAN (Zhong and Kwok, 2014) and LRTC (Chen et al., 2020).

More details are in Table 4. We do not compare with (i) sNORT, as it has already been shown to be slower than NORT; (ii) iterative hard thresholding (Rauhut et al., 2017), as its code is not publicly available and the more recent TMac-TT solves the same problem; (iii) the method in (Bahadori et al., 2014), as it can only deal with cokriging and forecasting problems.

Unless otherwise specified, performance is evaluated by (i) root-mean-squared-error on the unobserved elements: RMSE=P𝛀(𝒳𝒪)F/𝛀10.5\text{RMSE}=\left\|P_{{\bm{\bm{\Omega}}}^{\bot}}(\mathscr{X}-\mathscr{O})\right\|_{F}/\left\|{\bm{\bm{\Omega}}}^{\bot}\right\|_{1}^{0.5}, where 𝒳\mathscr{X} is the low-rank tensor recovered, and 𝛀{\bm{\bm{\Omega}}}^{\bot} contains the unobserved elements in 𝒪\mathscr{O}; and (ii) CPU time.

5.2.1 Color Images

We use the Windows, Tree and Rice images from (Hu et al., 2013), which are resized to 1000×1000×31000\times 1000\times 3 (Figure 9). Each pixel is normalized to [0,1][0,1]. We randomly sample 5% of the pixels for training, which are then corrupted by Gaussian noise 𝒩(0,0.012)\mathcal{N}(0,0.01^{2}); and another 5% clean pixels are used for validation. The remaining unseen clean pixels are used for testing. Hyperparameters of the various methods are tuned by using the validation set.

Figure 9: Color images used in the experiments: (a) Windows; (b) Tree; (c) Rice. All are of size 1000\times 1000\times 3.

Table 5 shows the RMSE results. As can be seen, the best convex methods (PA-APG and FaLRTC) are based on the overlapped nuclear norm. This agrees with our motivation to build a nonconvex regularizer based on the overlapped nuclear norm. GDPAN, LRTC and NORT have similar RMSEs, which are lower than those by convex regularization and the factorization approach. Convergence of the testing RMSE is shown in Figure 10. As can be seen, while ADMM solves the same convex model as PA-APG and FaLRTC, it has slower convergence. FFW, RP and TR-MM are very fast but their testing RMSEs are higher than that of NORT. By utilizing the “sparse plus low-rank” structure and adaptive momentum, NORT is more efficient than GDPAN and LRTC.

Table 5: Testing RMSEs on color images. For all images 5% of the total pixels, which are corrupted by Gaussian noise 𝒩(0,0.012)\mathcal{N}(0,0.01^{2}), are used for training.
dataset Rice Tree Windows
convex ADMM 0.0680±\pm0.0003 0.0915±\pm0.0005 0.0709±\pm0.0004
PA-APG 0.0583±\pm0.0016 0.0488±\pm0.0007 0.0585±\pm0.0002
FaLRTC 0.0576±\pm0.0004 0.0494±\pm0.0011 0.0567±\pm0.0005
FFW 0.0634±\pm0.0003 0.0599±\pm0.0005 0.0772±\pm0.0004
TR-MM 0.0596±\pm0.0005 0.0515±\pm0.0011 0.0634±\pm0.0002
TenNN 0.0647±\pm0.0004 0.0562±\pm0.0004 0.0586±\pm0.0003
factorization RP 0.0541±\pm0.0011 0.0575±\pm0.0010 0.0388±\pm0.0026
TMac 0.1923±\pm0.0005 0.1750±\pm0.0006 0.1313±\pm0.0005
CP-WOPT 0.0912±\pm0.0086 0.0750±\pm0.0060 0.0964±\pm0.0102
TMac-TT 0.0729±\pm0.0022 0.0665±\pm0.0147 0.1045±\pm0.0107
TRLRF 0.0640±\pm0.0004 0.0780±\pm0.0048 0.0588±\pm0.0035
nonconvex GDPAN 0.0467±\pm0.0002 0.0394±\pm0.0006 0.0306±\pm0.0007
LRTC 0.0468±\pm0.0001 0.0392±\pm0.0006 0.0304±\pm0.0008
NORT 0.0468±\pm0.0001 0.0386±\pm0.0009 0.0297±\pm0.0007
Figure 10: Testing RMSE versus CPU time on color images: (a) Rice; (b) Tree; (c) Windows. Top: All methods; Bottom: For improved clarity, methods which are too slow or with too poor performance are removed.

Finally, Table 6 compares NORT with PA-APG and RP, which are the best convex-regularization-based and factorization-based algorithms, respectively, as observed in Table 5. Table 6 shows the testing RMSEs at different noise levels σ\sigma’s. As can be seen, the testing RMSEs of all methods increase as σ\sigma increases. NORT has lower RMSEs at all σ\sigma settings. This is because natural images may not be exactly low-rank, and adaptive penalization of the singular values can better preserve the spectrum. A similar observation has also been made for nonconvex regularization on images (Yao et al., 2019b; Lu et al., 2016b). However, when the noise level becomes very high (σ=0.1\sigma=0.1 with pixel values in [0,1][0,1]), though NORT is still the best, its testing RMSE is not small.

Table 6: Testing RMSEs on image Tree at different noise levels σ\sigma. The percentage followed by the marker \uparrow indicates the relative increase of testing RMSE compared with NORT.
σ=0.001\sigma=0.001 σ=0.01\sigma=0.01 σ=0.1\sigma=0.1
(convex) PA-APG 0.0149 (35.8%\uparrow) 0.0488 (24.6%\uparrow) 0.1749 (18.6% \uparrow)
(factorization) RP 0.0139 (26.0%\uparrow) 0.0575 (15.6%\uparrow) 0.1623 (10.1% \uparrow)
(nonconvex) NORT 0.0110 0.0386 0.1474

5.2.2 Remote Sensing Data

Experiments are performed on three hyper-spectral images (Figure 11): Cabbage (1312\times 432\times 49), Scene (1312\times 951\times 49) and Female (592\times 409\times 148). The Cabbage and Scene images are from https://sites.google.com/site/hyperspectralcolorimaging/dataset, while the Female images are downloaded from http://www.imageval.com/scene-database-4-faces-3-meters/. The third dimension is for the spectral bands of the images.

Figure 11: Hyperspectral images used in the experiment: (a) Cabbage; (b) Scene; (c) Female.

We use the same setup as in Section 5.2.1, and hyperparameters are tuned on the validation set. ADMM, TenNN, GDPAN, LRTC, TMac-TT and TRLRF are slow and so are not compared. Results are shown in Table 7. Again, NORT achieves much lower testing RMSE than the convex regularization and factorization approaches. Figure 12 shows convergence of the testing RMSE. As can be seen, NORT is the fastest.

Table 7: Testing RMSEs on remote sensing data.
Cabbage Scene Female
convex PA-APG 0.0913±\pm0.0006 0.1965±\pm0.0002 0.1157±\pm0.0003
FaLRTC 0.0909±\pm0.0002 0.1920±\pm0.0001 0.1133±\pm0.0004
FFW 0.0962±\pm0.0004 0.2037±\pm0.0002 0.2096±\pm0.0006
TR-MM 0.0959±\pm0.0001 0.1965±\pm0.0002 0.1397±\pm0.0006
factorization RP 0.0491±\pm0.0011 0.1804±\pm0.0005 0.0647±\pm0.0003
TMac 0.4919±\pm0.0059 0.5970±\pm0.0029 1.9897±\pm0.0006
CP-WOPT 0.1846±\pm0.0514 0.4811±\pm0.0082 0.1868±\pm0.0013
nonconvex NORT 0.0376±\pm0.0004 0.1714±\pm0.0012 0.0592±\pm0.0002
Figure 12: Testing RMSE versus CPU time on remote sensing data: (a) Cabbage; (b) Female; (c) Scene.

5.2.3 Social Networks

In this experiment, we consider multi-relational link prediction (Guo et al., 2017) as a tensor completion problem. Experiments are performed on the YouTube data set (http://leitang.net/data/youtube-data.tar.gz) (Lei et al., 2009), which contains 15,088 users and five types of user interactions. It thus forms a 15088\times 15088\times 5 tensor, with a total of 27,257,790 nonzero elements. Besides the full set, we also experiment with a YouTube subset obtained by randomly selecting 1,000 users (leading to 12,101 observations). We use 50\% of the observations for training, another 25\% for validation and the rest for testing. Table 8 shows the testing RMSE, and Figure 13 shows the convergence. As can be seen, NORT achieves smaller RMSE and is also much faster.

Table 8: Testing RMSEs on YouTube data sets. FaLRTC, PA-APG, TR-MM and CP-WOPT are slow, and thus not run on the full set.
subset full set
convex FaLRTC 0.657±\pm0.060
PA-APG 0.651±\pm0.047
FFW 0.697±\pm0.054 0.395±\pm0.001
TR-MM 0.670±\pm0.098
factorization RP 0.522±\pm0.038 0.410±\pm0.001
TMac 0.795±\pm0.033 0.611±\pm0.007
CP-WOPT 0.785±\pm0.040
nonconvex NORT 0.482±\pm0.030 0.370±\pm0.001
Figure 13: Testing RMSE versus CPU time on YouTube: (a) subset; (b) full set.

5.3 Link Prediction in Knowledge Graph

Knowledge Graph (KG) (Nickel et al., 2015; Toutanova et al., 2015) is an active research topic in data mining and machine learning. Let \mathcal{E} be the entity set and \mathcal{R} be the relation set. In a KG, nodes are the entities, while edges are relations representing the triplets 𝒮={(h,r,t)}\mathcal{S}=\{(h,r,t)\}, where hh\in\mathcal{E} is the head entity, tt\in\mathcal{E} is the tail entity, and rr\in\mathcal{R} is the relation between hh and tt.

KGs have many downstream applications, such as link prediction and triplet classification. It is common to store KGs as tensors, and to solve KG learning tasks with tensor methods (Lacroix et al., 2018; Balazevic et al., 2019). Take link prediction as an example. The KG can be seen as a 3-order incomplete tensor \mathscr{O}\in\{\pm 1\}^{I^{1}\times I^{2}\times I^{3}}, where I^{1}=I^{2}=|\mathcal{E}| and I^{3}=|\mathcal{R}|. \mathscr{O}_{i_{1}i_{2}i_{3}}=1 when entities i_{1} and i_{2} have the relation i_{3}, and -1 otherwise. Let \bm{\Omega} be a mask tensor denoting the observed values in \mathscr{O}, i.e., \bm{\Omega}_{i_{1}i_{2}i_{3}}=1 if \mathscr{O}_{i_{1}i_{2}i_{3}} is observed and 0 otherwise. The task is to predict the elements in \mathscr{O} that are not observed. Since \mathscr{O} is binary, it is common to use the log loss as \ell(\cdot,\cdot) in (14). The objective then becomes:

min𝒳(i1i2i3)𝛀log(1+exp(𝒳i1i2i3𝒪i1i2i3))+i=1Dλiϕ(𝒳i).\displaystyle\min\nolimits_{\mathscr{X}}\sum\nolimits_{(i_{1}i_{2}i_{3})\in{\bm{\bm{\Omega}}}}\log(1+\exp(-\mathscr{X}_{i_{1}i_{2}i_{3}}\mathscr{O}_{i_{1}i_{2}i_{3}}))+\sum\nolimits_{i=1}^{D}\lambda_{i}\phi(\mathscr{X}_{{\left\langle i\right\rangle}}). (47)

In step 9 of Algorithm 2, it is easy to see that

[ξ(𝒳t)]i1i2i3={𝒪i1i2i3exp(𝒳i1i2i3𝒪i1i2i3)1+exp(𝒳i1i2i3𝒪i1i2i3)(i1i2i3)𝛀0(i1i2i3)𝛀.\displaystyle[\xi(\mathscr{X}_{t})]_{i_{1}i_{2}i_{3}}=\left\{\begin{array}[]{cc}\frac{-\mathscr{O}_{i_{1}i_{2}i_{3}}\cdot\exp(-\mathscr{X}_{i_{1}i_{2}i_{3}}\mathscr{O}_{i_{1}i_{2}i_{3}})}{1+\exp(-\mathscr{X}_{i_{1}i_{2}i_{3}}\mathscr{O}_{i_{1}i_{2}i_{3}})}&(i_{1}i_{2}i_{3})\in{\bm{\bm{\Omega}}}\\ 0&(i_{1}i_{2}i_{3})\notin{\bm{\bm{\Omega}}}\end{array}\right..
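For concreteness, \xi(\mathscr{X}_{t}) above is simply the log-loss gradient restricted to the observed entries; a NumPy sketch (our illustration, using dense arrays and a boolean mask purely for simplicity) is:

```python
import numpy as np

def logloss_grad(X, O, Omega):
    """Gradient of sum_{Omega=1} log(1 + exp(-X * O)) with respect to X:
    -O * exp(-X*O) / (1 + exp(-X*O)) on the observed entries, and 0 elsewhere."""
    G = -O / (1.0 + np.exp(X * O))   # algebraically equal to the expression above
    return np.where(Omega, G, 0.0)

# toy check on a 4 x 4 x 2 tensor with roughly 30% observed entries
rng = np.random.default_rng(0)
O = np.sign(rng.standard_normal((4, 4, 2)))
Omega = rng.random((4, 4, 2)) < 0.3
print(logloss_grad(np.zeros((4, 4, 2)), O, Omega))   # -O/2 on observed entries, 0 elsewhere
```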

Experiments are performed on two benchmark data sets: WN18RR (Dettmers et al., 2018; https://github.com/TimDettmers/ConvE) and FB15k-237 (Toutanova et al., 2015; https://www.microsoft.com/en-us/download/details.aspx?id=52312), which are subsets of WN18 and FB15k (Bordes et al., 2013), respectively. WN18 is a subset of WordNet (Miller, 1995), and FB15k is a subset of the Freebase database (Bollacker et al., 2008). To avoid test leakage, WN18RR and FB15k-237 do not contain near-duplicate and inverse-duplicate relations. Hence, link prediction on WN18RR and FB15k-237 is harder but more recommended than that on WN18 and FB15k (Dettmers et al., 2018). To form the entity set \mathcal{E}, we keep the top 500 (head and tail) entities that appear most frequently in the relations (r's). Relations that do not link to any of these 500 entities are removed, and those remaining form the relation set \mathcal{R}. Following the public splits on entities in \mathcal{E} and relations in \mathcal{R} (Han et al., 2018), we split the observed triplets in \mathcal{S} into a training set \mathcal{S}_{\text{train}}, validation set \mathcal{S}_{\text{val}} and testing set \mathcal{S}_{\text{test}}. For each observed triplet (h,r,t)\in\mathcal{S}_{\text{train}}, we sample a negative triplet from \mathcal{\hat{S}}_{(h,r,t)}=\{(\hat{h},r,t)\notin\mathcal{S}|\hat{h}\in\mathcal{E}\}\cap\{(h,r,\hat{t})\notin\mathcal{S}|\hat{t}\in\mathcal{E}\}, avoiding duplicate negative triplets during sampling. We then represent the KGs by tensors \mathscr{O} of size 500\times 500\times 8 for WN18RR and 500\times 500\times 39 for FB15k-237, with the corresponding mask tensors \bm{\Omega}.

Following (Bordes et al., 2013; Dettmers et al., 2018), performance is evaluated on the testing triplets in 𝛀¯\bar{{\bm{\bm{\Omega}}}} by the following metrics: (i) mean reciprocal ranking: MRR=1/𝛀¯0(i1i2i3)𝛀¯\text{MRR}=1/\|\bar{{\bm{\bm{\Omega}}}}\|_{0}\sum_{(i_{1}i_{2}i_{3})\in\bar{{\bm{\bm{\Omega}}}}} 1/ranki31/\text{rank}_{i_{3}}, where ranki3\text{rank}_{i_{3}} is the ranking of score 𝒳i1i2i3\mathscr{X}_{i_{1}i_{2}i_{3}} among {𝒳i1i2j}\{\mathscr{X}_{i_{1}i_{2}j}\} with j=1,,||j=1,\dots,|\mathcal{R}| in descending order; (ii) Hits@1=1/𝛀¯0(i1i2i3)𝛀¯𝕀(ranki31)@1=1/\|\bar{{\bm{\bm{\Omega}}}}\|_{0}\sum_{(i_{1}i_{2}i_{3})\in\bar{{\bm{\bm{\Omega}}}}}\mathbb{I}(\text{rank}_{i_{3}}\leq 1), where 𝕀(c)\mathbb{I}(c) is the indicator function which returns 1 if the constraint cc is satisfied and 0 otherwise; and (iii) Hits@3=1/𝛀¯0(i1i2i3)𝛀¯𝕀(ranki33)@3=1/\|\bar{{\bm{\bm{\Omega}}}}\|_{0}\sum_{(i_{1}i_{2}i_{3})\in\bar{{\bm{\bm{\Omega}}}}}\mathbb{I}(\text{rank}_{i_{3}}\leq 3). For these three metrics, the higher the better.
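These metrics can be computed directly from the score tensor; a minimal sketch (with illustrative names, ranking scores over the relation mode as described above, and breaking ties optimistically for simplicity) is:

```python
import numpy as np

def ranking_metrics(X, test_triplets, ks=(1, 3)):
    # X: predicted score tensor; test_triplets: list of test indices (i1, i2, i3).
    rr, hits = [], {k: [] for k in ks}
    for (i1, i2, i3) in test_triplets:
        scores = X[i1, i2, :]                        # scores over all relations
        rank = 1 + int(np.sum(scores > scores[i3]))  # rank of the true relation
        rr.append(1.0 / rank)
        for k in ks:
            hits[k].append(float(rank <= k))
    return np.mean(rr), {k: np.mean(v) for k, v in hits.items()}
```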

The aforementioned algorithms are designed for the square loss, but not for the log loss in (47). We adapt the gradient-based algorithms PA-APG, ADMM and CP-WOPT, as only the gradient calculation for (47) needs to be changed. As a further baseline, we implement the classic Tucker decomposition (Tucker, 1966; Kolda and Bader, 2009) to optimize (47). While RP (Kasai and Mishra, 2016) is the state-of-the-art Tucker-type algorithm, it relies on Riemannian preconditioning and cannot be easily modified to handle losses other than the square loss.

Results on WN18RR and FB15k-237 are shown in Tables 9 and 10, respectively. As can be seen, NORT again obtains the best ranking results. Figure 14 shows the convergence of the testing MRR versus CPU time; NORT is about two orders of magnitude faster than the other methods.

Table 9: Testing performance on the WN18RR data set.
Type Method MRR Hits@1 Hits@3
convex ADMM 0.362±\pm0.029 0.156±\pm0.024 0.422±\pm0.038
PA-APG 0.399±\pm0.017 0.203±\pm0.023 0.500±\pm0.038
factorization Tucker 0.439±\pm0.013 0.309±\pm0.016 0.438±\pm0.026
CP-WOPT 0.417±\pm0.018 0.266±\pm0.027 0.453±\pm0.019
nonconvex NORT 0.523±\pm0.022 0.375±\pm0.033 0.578±\pm0.024
Table 10: Testing performance on the FB15k-237 data set.
Type Method MRR Hits@1 Hits@3
convex ADMM 0.466±\pm0.006 0.411±\pm0.006 0.452±\pm0.011
PA-APG 0.514±\pm0.013 0.463±\pm0.015 0.590±\pm0.016
factorization Tucker 0.471±\pm0.018 0.355±\pm0.017 0.465±\pm0.015
CP-WOPT 0.420±\pm0.021 0.373±\pm0.015 0.488±\pm0.014
nonconvex NORT 0.677±\pm0.007 0.609±\pm0.007 0.698±\pm0.011
[Figure 14: Testing MRR versus CPU time on the WN18RR and FB15k-237 data sets. Panels: (a) WN18RR; (b) FB15k-237.]

5.4 Robust Tensor Completion

In this section, we apply the proposed method to robust video tensor completion. Three videos (Eagle: http://youtu.be/ufnf_q_3Ofg, Friends: http://youtu.be/xmLZsEfXEgE, and Logo: http://youtu.be/L5HQoFIaT4I) from (Indyk et al., 2019) are used. Example frames are shown in Figure 15. For each video, 200 consecutive 360\times 640 frames are downloaded from YouTube, and the pixel values are normalized to [0,1]. Each video is then represented as a fourth-order tensor \bar{\mathscr{O}} of size 360\times 640\times 3\times 200. This clean tensor \bar{\mathscr{O}} is corrupted by a noise tensor \mathscr{N} to form \mathscr{O}. \mathscr{N} is a sparse random tensor with approximately 1% nonzero elements; each nonzero entry is first drawn uniformly from [0,1], and then multiplied by 5 times the maximum value of \bar{\mathscr{O}}. Hyperparameters are chosen based on the performance on the first 100 noisy frames. Denoising performance is measured by the RMSE between the clean tensor \bar{\mathscr{O}} and the reconstructed tensor \mathscr{X} on the last 100 frames.
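A minimal sketch of this corruption step is given below; the variable names are illustrative, and the exact way the sparse support is drawn is an assumption.

```python
import numpy as np

def corrupt(O_bar, density=0.01, scale=5.0, seed=0):
    # Add sparse, large, positive noise to roughly `density` of the entries.
    rng = np.random.default_rng(seed)
    support = rng.random(O_bar.shape) < density           # ~1% of the entries
    magnitude = rng.random(O_bar.shape) * scale * O_bar.max()
    N = support * magnitude                               # sparse noise tensor
    return O_bar + N, N
```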

[Figure 15: Example image frames in the videos. Panels: (a) Eagle; (b) Friends; (c) Logo.]

For robust tensor completion, we take RTDGC (Gu et al., 2014) as the baseline, which adopts the \ell_{1} loss and the overlapped nuclear norm in (40) (i.e., \kappa_{\ell}(x)=x and \phi is the nuclear norm). As the resulting objective is nonsmooth, RTDGC uses ADMM (Boyd et al., 2011) for the optimization, which handles the robust loss and the low-rank regularizer separately. As discussed in Section 4.1, we use the smoothing NORT (Algorithm 3, with \delta_{0}=0.9) to optimize (42), the smoothed version of (40). Table 11 shows the RMSE results. As can be seen, NORT obtains better denoising performance than RTDGC, which again validates the efficacy of nonconvex low-rank learning. Figure 16 shows the convergence of the testing RMSE. NORT attains a lower RMSE and converges much faster, as folding/unfolding operations are avoided.

Table 11: Testing RMSEs on the videos.
Type Method Eagle Friends Logo
convex RTDGC 0.122±\pm0.007 0.128±\pm0.005 0.112±\pm0.008
nonconvex NORT 0.090±\pm0.003 0.075±\pm0.002 0.088±\pm0.004
[Figure 16: Testing RMSE versus CPU time on the videos. Panels: (a) Eagle; (b) Friends; (c) Logo.]

5.5 Spatial-temporal Data

In this experiment, we predict climate observations for locations that do not have any records. This is formulated as the regularized tensor completion problem in (43). We use the square loss, with the graph Laplacian regularizer constructed as in (43).

We use the CCDS and USHCN data sets from (Bahadori et al., 2014). CCDS (https://viterbi-web.usc.edu/~liu32/data/NA-1990-2002-Monthly.csv) contains monthly observations of 17 variables (such as carbon dioxide and temperature) at 125 stations from January 1990 to December 2001. USHCN (http://www.ncdc.noaa.gov/oa/climate/research/ushcn) contains monthly observations of 4 variables (minimum, maximum and average temperature, and total precipitation) at 1218 stations from January 1919 to November 2019. As discussed in Section 4.2, these records are collectively represented by a 3-order tensor \mathscr{O}\in\mathbb{R}^{I^{1}\times I^{2}\times I^{3}}, where I^{1} is the number of locations, I^{2} is the number of recorded time stamps, and I^{3} is the number of climate variables. Consequently, CCDS is represented as a 125\times 156\times 17 tensor and USHCN as a 1218\times 1211\times 4 tensor. The affinity matrix is denoted \bm{A}, with \bm{A}_{ij} being the similarity s(i,j)=\exp(-2b_{ij}) between locations i and j, where b_{ij} is the Haversine distance between them. Following (Bahadori et al., 2014), we normalize the data to zero mean and unit variance, then randomly sample 10% of the locations for training, another 10% for validation, and the rest for testing.
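For concreteness, the affinity matrix can be constructed from the station coordinates as sketched below; the distance unit for b_{ij} is not specified here, so kilometres are used purely as an assumption, and the function names are illustrative.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle (Haversine) distance between two (latitude, longitude) points.
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = p2 - p1, np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2.0 * radius_km * np.arcsin(np.sqrt(a))

def affinity_matrix(lats, lons):
    # A_ij = s(i, j) = exp(-2 * b_ij), with b_ij the Haversine distance.
    n = len(lats)
    A = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = np.exp(-2.0 * haversine(lats[i], lons[i], lats[j], lons[j]))
    return A
```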

Algorithms FaLRTC, FFW, TR-MM, RP and TMac cannot be directly used for this graph-Laplacian-regularized tensor completion problem, while PA-APG, ADMM, Tucker and CP-WOPT can be adapted by modifying the gradient calculation; we use these four as baselines in this section. In addition, we compare with a greedy algorithm (denoted “Greedy”; this method is denoted “ORTHOGONAL” in (Bahadori et al., 2014) and obtains the best results there), which successively adds rank-1 matrices to approximate the mode-n unfolding under a rank constraint. For the factorization-based algorithms Tucker and CP-WOPT, the graph Laplacian regularizer h takes the corresponding factor matrix, rather than \mathscr{X}_{{\left\langle 1\right\rangle}}, as the input. Specifically, recall that Tucker factorizes \mathscr{X} into [\mathscr{G};{\bm{B}}^{1},{\bm{B}}^{2},{\bm{B}}^{3}], where \mathscr{G}\in\mathbb{R}^{k^{1}\times k^{2}\times k^{3}}, {\bm{B}}^{i}\in\mathbb{R}^{I^{i}\times k^{i}} for i=1,2,3, and the k^{i}'s are hyperparameters. When k^{1}=k^{2}=k^{3} and \mathscr{G} is superdiagonal, this reduces to the CP-WOPT decomposition. The graph Laplacian regularizer is then constructed as h({\bm{B}}^{1}) to leverage location proximity (a sketch is given below). As an additional baseline, we also experiment with a NORT variant that does not use the Laplacian regularizer (denoted “NORT-no-Lap”).
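As an illustration, a common instantiation of such a regularizer on the location factor matrix {\bm{B}}^{1} is the Laplacian quadratic form sketched below; whether h takes exactly this form is an assumption, since h is only specified via (43).

```python
import numpy as np

def laplacian(A):
    # Graph Laplacian L = D - A of the affinity matrix A.
    return np.diag(A.sum(axis=1)) - A

def laplacian_reg(B1, L):
    # h(B1) = tr(B1^T L B1) = 0.5 * sum_{i,j} A_ij * ||B1[i] - B1[j]||^2, with gradient.
    value = np.trace(B1.T @ L @ B1)
    grad = 2.0 * (L @ B1)            # L is symmetric
    return value, grad
```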

Table 12: Testing RMSEs on CCDS and USHCN data sets.
Type Method CCDS USHCN
convex ADMM 0.890±\pm0.016 0.691±\pm0.005
PA-APG 0.866±\pm0.014 0.680±\pm0.009
factorization Tucker 0.856±\pm0.026 0.647±\pm0.006
CP-WOPT 0.887±\pm0.018 0.688±\pm0.009
rank constraint Greedy 0.871±\pm0.008 0.658±\pm0.012
nonconvex NORT-no-Lap 0.997±\pm0.001 1.391±\pm0.001
NORT 0.793±\pm0.002 0.583±\pm0.012

Table 12 shows the RMSE results. Again, NORT obtains the lowest testing RMSEs. Moreover, when the Laplacian regularizer is not used, the testing RMSE is much higher, demonstrating that the missing slices cannot be reliably completed without it. Figure 17 shows the convergence. As can be seen, NORT is orders of magnitude faster than the other algorithms, and the gaps in both accuracy and speed are more pronounced on the larger USHCN data set. Further, note from Figures 17(a) and 17(b) that although NORT-no-Lap has converged in terms of the training objective, it fails to decrease the testing RMSE (Figures 17(c) and 17(d)). This again validates the efficacy of the graph Laplacian regularizer.

[Figure 17: Convergence of the training objective (bottom) and testing RMSE (top) versus CPU time on the spatial-temporal data. Panels: (a) CCDS; (b) USHCN; (c) CCDS; (d) USHCN.]

6 Conclusion

In this paper, we propose a low-rank tensor completion model with nonconvex regularization. An efficient nonconvex proximal average algorithm is developed, which maintains the “sparse plus low-rank” structure throughout the iterations and incorporates adaptive momentum. Convergence to critical points is guaranteed, and the obtained critical points can have small statistical errors. The algorithm is further extended to nonsmooth losses and additional regularizers, demonstrating its broad applicability. Experiments on a variety of synthetic and real data sets show that the proposed algorithm is more efficient and more accurate than the existing state-of-the-art.

In the future, we will extend the proposed algorithm to the simultaneous completion of multiple tensors, e.g., collaborative tensor completion (Zheng et al., 2013) and coupled tensor completion (Wimalawarne et al., 2018). It would also be interesting to study how the proposed algorithm can be efficiently parallelized on GPUs and in distributed computing environments (Phipps and Kolda, 2019).

Acknowledgement

BH was supported by the RGC Early Career Scheme No. 22200720 and the NSFC Young Scientists Fund No. 62006202.

Appendix

Appendix A Comparison with Incoherence Condition

The matrix incoherence condition (Candès and Recht, 2009; Candès et al., 2011; Negahban and Wainwright, 2012) is stated in terms of the singular value decomposition {\bm{X}}=\bm{U}\bm{\Sigma}\bm{V}^{\top}\in\mathbb{R}^{m\times n}, where \bm{U}\in\mathbb{R}^{m\times r} (resp. \bm{V}\in\mathbb{R}^{n\times r}) contains the left (resp. right) singular vectors and \bm{\Sigma}\in\mathbb{R}^{r\times r} is the diagonal matrix of singular values. The purpose of this condition is to ensure that the left and right singular vectors are not aligned with the standard basis (i.e., the vectors {\bm{e}}_{i} whose ith entry is 1 and whose other entries are 0). Typically, this condition is stated as

maxj=1,,m|[𝑼𝑼]jj|μ0rm,andmaxj=1,,n|[𝑽𝑽]jj|μ0rn,\displaystyle\max_{j=1,\dots,m}\left|[\bm{U}\bm{U}^{\top}]_{jj}\right|\leq\mu_{0}\frac{r}{m},\quad\text{and}\quad\max_{j=1,\dots,n}\left|[\bm{V}\bm{V}^{\top}]_{jj}\right|\leq\mu_{0}\frac{r}{n}, (48)

for some constant \mu_{0}>0. Note that (48) does not depend on the singular values of {\bm{X}}. However, this condition can be restrictive in realistic settings where the underlying matrix is contaminated by noise, in which case the observed matrix can have small singular values. Therefore, we need to impose conditions related to the singular values, and (32) shows such a dependency. An example matrix satisfying the matrix RSC condition but not the incoherence condition is given in Section 3.4.2 of (Negahban and Wainwright, 2012). As a result, the RSC condition, which involves the singular values, is less restrictive than the incoherence condition and can better describe “spikiness”.

Appendix B Proofs

B.1 Proposition 4

Proof  For simplicity, we consider the case where \bm{U}\in\mathbb{R}^{I^{j}\times k} (resp. \bm{V}\in\mathbb{R}^{(\frac{I^{\pi}}{I^{j}})\times k}) has only a single column \bm{u}\in\mathbb{R}^{I^{j}} (resp. \bm{v}\in\mathbb{R}^{\frac{I^{\pi}}{I^{j}}}). We need to fold \bm{u}\bm{v}^{\top} along the jth mode and then unfold the result along the ith mode. Considering the structure of \mathscr{X}=(\bm{u}\bm{v}^{\top})^{{\left\langle j\right\rangle}}, we can express its mode-j unfolding as

𝒳j=[𝐮𝒗1,,𝐮𝒗Iπ(IiIj)]Ij×IπIj,\displaystyle\mathscr{X}_{{\left\langle j\right\rangle}}=\big{[}\mathbf{u}\bm{v}_{1}^{\top},...,\mathbf{u}\bm{v}_{\frac{I^{\pi}}{(I^{i}I^{j})}}^{\top}\big{]}\in\mathbb{R}^{I^{j}\times\frac{I^{\pi}}{I^{j}}},

where \bm{v}=[\bm{v}_{1};\dots;\bm{v}_{\frac{I^{\pi}}{I^{i}I^{j}}}] with each \bm{v}_{p}\in\mathbb{R}^{I^{i}}. When \mathscr{X} is unfolded along the ith mode, the resulting matrix is

[𝒗1𝒖,,𝒗Iπ(IiIj)𝒖]Ii×IπIi.\displaystyle\big{[}\bm{v}_{1}\bm{u}^{\top},\dots,\bm{v}_{\frac{I^{\pi}}{(I^{i}I^{j})}}\bm{u}^{\top}\big{]}\in\mathbb{R}^{I^{i}\times\frac{I^{\pi}}{I^{i}}}. (49)

Thus,

\displaystyle\bm{a}^{\top}\big[\bm{v}_{1}\bm{u}^{\top},\dots,\bm{v}_{\frac{I^{\pi}}{I^{i}I^{j}}}\bm{u}^{\top}\big]=\big[(\bm{a}^{\top}\bm{v}_{1})\bm{u}^{\top},\dots,(\bm{a}^{\top}\bm{v}_{\frac{I^{\pi}}{I^{i}I^{j}}})\bm{u}^{\top}\big]=\left(\bm{a}^{\top}\text{mat}\left(\bm{v};I^{i},\bar{I}^{ij}\right)\right)\otimes\bm{u}^{\top}. (50)

Similarly, let 𝒃=[𝒃1;;𝒃IπIiIj]\bm{b}=\big{[}\bm{b}_{1};\dots;\bm{b}_{\frac{I^{\pi}}{I^{i}I^{j}}}\big{]}, where each 𝒃pIj\bm{b}_{p}\in\mathbb{R}^{I^{j}}. From (49), we have

\displaystyle\big[\bm{v}_{1}\bm{u}^{\top},\dots,\bm{v}_{\frac{I^{\pi}}{I^{i}I^{j}}}\bm{u}^{\top}\big]\bm{b}=\sum\nolimits_{p=1}^{\frac{I^{\pi}}{I^{i}I^{j}}}\bm{v}_{p}(\bm{u}^{\top}\bm{b}_{p})=[\bm{v}_{1},\dots,\bm{v}_{\frac{I^{\pi}}{I^{i}I^{j}}}]\begin{bmatrix}\bm{u}^{\top}\bm{b}_{1}\\ \vdots\\ \bm{u}^{\top}\bm{b}_{\frac{I^{\pi}}{I^{i}I^{j}}}\end{bmatrix}=[\bm{v}_{1},\dots,\bm{v}_{\frac{I^{\pi}}{I^{i}I^{j}}}][\bm{b}_{1},\dots,\bm{b}_{\frac{I^{\pi}}{I^{i}I^{j}}}]^{\top}\bm{u}=\text{mat}\left(\bm{v};I^{i},\bar{I}^{ij}\right)\text{mat}\left(\bm{b};\bar{I}^{ij},I^{j}\right)\bm{u}. (51)

When \bm{U} (resp. \bm{V}) has k columns, combining the fact that \bm{U}\bm{V}^{\top}=\sum_{p=1}^{k}\bm{u}_{p}\bm{v}_{p}^{\top} with (50) and (51), we obtain (26) and (27).
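Identities (50) and (51) can be checked numerically on a small example; the sketch below uses arbitrary sizes and random vectors, and follows the block structure in (49) directly rather than any particular tensor-unfolding routine.

```python
import numpy as np

rng = np.random.default_rng(0)
Ii, Ij, P = 4, 3, 5                      # P plays the role of I^pi / (I^i I^j)

u = rng.standard_normal(Ij)              # single column of U
v_blocks = [rng.standard_normal(Ii) for _ in range(P)]   # v = [v_1; ...; v_P]
a = rng.standard_normal(Ii)
b_blocks = [rng.standard_normal(Ij) for _ in range(P)]   # b = [b_1; ...; b_P]

# Mode-i unfolding in (49): [v_1 u^T, ..., v_P u^T].
M = np.hstack([np.outer(v_p, u) for v_p in v_blocks])

# (50): a^T M = (a^T mat(v; I^i, P)) kron u^T.
V = np.column_stack(v_blocks)            # mat(v; I^i, P), columns v_p
print(np.allclose(a @ M, np.kron(a @ V, u)))             # True

# (51): M b = mat(v; I^i, P) mat(b; P, I^j) u.
b = np.concatenate(b_blocks)
B = np.vstack(b_blocks)                  # mat(b; P, I^j), rows b_p^T
print(np.allclose(M @ b, V @ (B @ u)))                   # True
```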

B.2 Proposition 6

Proof  Define λ¯d=λd/τ\bar{\lambda}_{d}=\lambda_{d}/\tau, then

d=1Dmin𝑿d12𝑿d𝒵dF2+λ¯dϕ(𝑿d),\displaystyle\sum\nolimits_{d=1}^{D}\min\nolimits_{{\bm{\bm{X}}}_{d}}\frac{1}{2}\left\|{\bm{\bm{X}}}_{d}-\mathscr{Z}_{{\left\langle d\right\rangle}}\right\|_{F}^{2}+\bar{\lambda}_{d}\,\phi({\bm{\bm{X}}}_{d}),
=min{𝑿d}D2𝒵F2<𝒵,d=1D𝑿dd>+D2d=1D𝑿dF2+d=1Dλ¯dϕ(𝑿d),\displaystyle\!\!\!=\min\nolimits_{\{{\bm{\bm{X}}}_{d}\}}\frac{D}{2}\left\|\mathscr{Z}\right\|_{F}^{2}-\big{<}\mathscr{Z},\sum\nolimits_{d=1}^{D}{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}\big{>}+\frac{D}{2}\sum\nolimits_{d=1}^{D}\left\|{\bm{\bm{X}}}_{d}\right\|_{F}^{2}+\sum\nolimits_{d=1}^{D}\bar{\lambda}_{d}\phi({\bm{\bm{X}}}_{d}),
=min{𝑿d}D2𝒵d=1D𝑿ddF2D2d=1D𝑿ddF2+d=1D[12𝑿dF2+λ¯dϕ(𝑿d)].\displaystyle\!\!\!=\min\nolimits_{\{{\bm{\bm{X}}}_{d}\}}\frac{D}{2}\left\|\mathscr{Z}\!-\!\sum\nolimits_{d=1}^{D}{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}\right\|_{F}^{2}\!\!\!-\frac{D}{2}\left\|\sum\nolimits_{d=1}^{D}{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}\right\|_{F}^{2}\!\!\!+\sum\nolimits_{d=1}^{D}\big{[}\frac{1}{2}\left\|{\bm{\bm{X}}}_{d}\right\|_{F}^{2}\!+\!\bar{\lambda}_{d}\phi({\bm{\bm{X}}}_{d})\big{]}. (52)

Next, we introduce an extra parameter as 𝒳=d=1D𝑿dd\mathscr{X}=\sum\nolimits_{d=1}^{D}{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}, and express (52) as

min{𝑿d}:𝒳=d=1D𝑿ddD2𝒵𝒳F2D2d=1D𝑿ddF2+d=1D[12𝑿dF2+λ¯dϕ(𝑿d)],\displaystyle\min\nolimits_{\{{\bm{\bm{X}}}_{d}\}:\mathscr{X}=\sum\nolimits_{d=1}^{D}{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}}\frac{D}{2}\left\|\mathscr{Z}-\mathscr{X}\right\|_{F}^{2}-\frac{D}{2}\left\|\sum\nolimits_{d=1}^{D}{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}\right\|_{F}^{2}+\sum\nolimits_{d=1}^{D}\left[\frac{1}{2}\left\|{\bm{\bm{X}}}_{d}\right\|_{F}^{2}+\bar{\lambda}_{d}\phi({\bm{\bm{X}}}_{d})\right],
=min𝒳{D2𝒵𝒳F2+min{𝑿d}:d=1D𝑿dd=𝒳d=1D[12𝑿dF2+λ¯dϕ(𝑿d)]D2𝒳F2}.\displaystyle=\!\min_{\mathscr{X}}\left\{\frac{D}{2}\left\|\mathscr{Z}-\mathscr{X}\right\|_{F}^{2}+\min_{\{{\bm{\bm{X}}}_{d}\}:\sum\nolimits_{d=1}^{D}{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}=\mathscr{X}}\sum\nolimits_{d=1}^{D}\left[\frac{1}{2}\left\|{\bm{\bm{X}}}_{d}\right\|_{F}^{2}\!+\!\bar{\lambda}_{d}\phi({\bm{\bm{X}}}_{d})\right]\!-\!\frac{D}{2}\left\|\mathscr{X}\right\|_{F}^{2}\right\}. (53)

Thus, the minimizer of the above problem is

\displaystyle\arg\min\nolimits_{\mathscr{X}}\frac{1}{2}\left\|\mathscr{Z}-\mathscr{X}\right\|_{F}^{2}+\frac{1}{\tau}\bar{g}_{\tau}(\mathscr{X})=\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{Z}),

where g¯τ(𝒳)\bar{g}_{\tau}(\mathscr{X}) is defined as

g¯τ(𝒳)\displaystyle\bar{g}_{\tau}(\mathscr{X}) =τ[min{𝑿d}d=1D(12𝑿dF2+λ¯dϕ(𝑿d))D2𝒳F2],\displaystyle=\tau\left[\min\nolimits_{\{{\bm{\bm{X}}}_{d}\}}\sum\nolimits_{d=1}^{D}\big{(}\frac{1}{2}\left\|{\bm{\bm{X}}}_{d}\right\|_{F}^{2}+\bar{\lambda}_{d}\phi({\bm{\bm{X}}}_{d})\big{)}-\frac{D}{2}\left\|\mathscr{X}\right\|_{F}^{2}\right], (54)
 s.t. d=1D𝑿dd=𝒳.\displaystyle\text{\;s.t.\;}\sum\nolimits_{d=1}^{D}{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}=\mathscr{X}.

Thus, there exists \bar{g}_{\tau} such that \text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{Z})=\sum\nolimits_{d=1}^{D}\big[\text{prox}_{\bar{\lambda}_{d}\phi}(\mathscr{Z}_{{\left\langle d\right\rangle}})\big]^{{\left\langle d\right\rangle}}.

B.3 Proposition 7

Let g(\mathscr{X})=\sum_{d=1}^{D}\lambda_{d}\phi(\mathscr{X}_{{\left\langle d\right\rangle}}). Before proving Proposition 7, we first extend Proposition 2 in (Zhong and Kwok, 2014) with the following auxiliary lemma.

B.3.1 Auxiliary Lemma

Lemma 21

0g(𝒳)g¯τ(𝒳)κ022τd=1Dλd20\leq g(\mathscr{X})-\bar{g}_{\tau}(\mathscr{X})\leq\frac{\kappa_{0}^{2}}{2\tau}\sum\nolimits_{d=1}^{D}\lambda_{d}^{2}.

Proof  From the definition of g¯τ\bar{g}_{\tau} in (54), if 𝒳=𝑿11==𝑿DD\mathscr{X}={\bm{\bm{X}}}^{{\left\langle 1\right\rangle}}_{1}=...={\bm{\bm{X}}}^{{\left\langle D\right\rangle}}_{D}, we have

g¯τ(𝒳)\displaystyle\bar{g}_{\tau}(\mathscr{X}) τ(d=1D(12𝑿ddF2+λ¯dϕ(𝑿d))D2𝒳F2),\displaystyle\leq\tau\big{(}\sum\nolimits_{d=1}^{D}\big{(}\frac{1}{2}\left\|{\bm{\bm{X}}}_{d}^{{\left\langle d\right\rangle}}\right\|_{F}^{2}\!+\bar{\lambda}_{d}\phi({\bm{\bm{X}}}_{d})\big{)}\!-\!\frac{D}{2}\left\|\mathscr{X}\right\|_{F}^{2}\big{)},
=d=1Dλdϕ(𝑿d)=d=1Dλdϕ(𝒳d)=g(𝒳).\displaystyle=\sum\nolimits_{d=1}^{D}\lambda_{d}\phi({\bm{\bm{X}}}_{d})=\sum\nolimits_{d=1}^{D}\lambda_{d}\phi(\mathscr{X}_{{\left\langle d\right\rangle}})=g(\mathscr{X}).

Thus, g(𝒳)g¯τ(𝒳)0g(\mathscr{X})-\bar{g}_{\tau}(\mathscr{X})\geq 0. Next, we prove the “\leq” part in the Lemma. Note that

sup𝑿dλdϕ(𝑿d)τmin𝒀(12𝒀𝑿dF2+λ¯dϕ(𝒀)),\displaystyle\sup\nolimits_{{\bm{\bm{X}}}_{d}}\lambda_{d}\phi({\bm{\bm{X}}}_{d})-\tau\min\nolimits_{{\bm{\bm{Y}}}}\big{(}\frac{1}{2}\left\|{\bm{\bm{Y}}}-{\bm{\bm{X}}}_{d}\right\|_{F}^{2}+\bar{\lambda}_{d}\phi({\bm{\bm{Y}}})\big{)},
=sup𝑿d,𝒀λdϕ(𝑿d)τ2𝒀𝑿dF2λdϕ(𝒀).\displaystyle=\sup\nolimits_{{\bm{\bm{X}}}_{d},{\bm{\bm{Y}}}}\lambda_{d}\phi({\bm{\bm{X}}}_{d})-\frac{\tau}{2}\left\|{\bm{\bm{Y}}}-{\bm{\bm{X}}}_{d}\right\|_{F}^{2}-\lambda_{d}\phi({\bm{\bm{Y}}}). (55)

Since \phi is \kappa_{0}-Lipschitz continuous, letting \alpha=\left\|{\bm{Y}}-{\bm{X}}_{d}\right\|_{F}, we have

\displaystyle(55)=\sup\nolimits_{{\bm{X}}_{d},{\bm{Y}}}\lambda_{d}\left[\phi({\bm{X}}_{d})-\phi({\bm{Y}})\right]-\frac{\tau}{2}\left\|{\bm{Y}}-{\bm{X}}_{d}\right\|_{F}^{2}\leq\sup\nolimits_{{\bm{X}}_{d},{\bm{Y}}}\lambda_{d}\kappa_{0}\left\|{\bm{Y}}-{\bm{X}}_{d}\right\|_{F}-\frac{\tau}{2}\left\|{\bm{Y}}-{\bm{X}}_{d}\right\|_{F}^{2}=\sup\nolimits_{\alpha}\big[\lambda_{d}\kappa_{0}\alpha-\frac{\tau}{2}\alpha^{2}\big]=\sup\nolimits_{\alpha}-\frac{\tau}{2}\big[\alpha-\frac{\lambda_{d}\kappa_{0}}{\tau}\big]^{2}+\frac{\lambda_{d}^{2}\kappa_{0}^{2}}{2\tau}=\frac{\lambda_{d}^{2}\kappa_{0}^{2}}{2\tau}. (56)

Next, we have

g(𝒳)g¯τ(𝒳)\displaystyle g(\mathscr{X})-\bar{g}_{\tau}(\mathscr{X}) g(𝒳)τ(min𝒴12𝒳𝒴F2+1τg¯τ(𝒴)),\displaystyle\leq g(\mathscr{X})-\tau\big{(}\min\nolimits_{\mathscr{Y}}\frac{1}{2}\left\|\mathscr{X}-\mathscr{Y}\right\|_{F}^{2}+\frac{1}{\tau}\bar{g}_{\tau}(\mathscr{Y})\big{)}, (57)
=d=1Dλdϕ(𝒳d)τd=1D(min{𝒀d}12𝒳d𝒀dF2+λdτϕ(𝒀d)),\displaystyle=\sum\nolimits_{d=1}^{D}\lambda_{d}\phi(\mathscr{X}_{{\left\langle d\right\rangle}})-\tau\sum\nolimits_{d=1}^{D}\big{(}\min\nolimits_{\{{\bm{\bm{Y}}}_{d}\}}\frac{1}{2}\left\|\mathscr{X}_{{\left\langle d\right\rangle}}-{\bm{\bm{Y}}}_{d}\right\|_{F}^{2}+\frac{\lambda_{d}}{\tau}\phi({\bm{\bm{Y}}}_{d})\big{)}, (58)
sup𝒳d=1Dλdϕ(𝒳d)τd=1D(min{𝒀d}12𝒳d𝒀dF2+λdτϕ(𝒀d)),\displaystyle\leq\sup\nolimits_{\mathscr{X}}\sum\nolimits_{d=1}^{D}\lambda_{d}\phi(\mathscr{X}_{{\left\langle d\right\rangle}})-\tau\sum\nolimits_{d=1}^{D}\big{(}\min\nolimits_{\{{\bm{\bm{Y}}}_{d}\}}\frac{1}{2}\left\|\mathscr{X}_{{\left\langle d\right\rangle}}-{\bm{\bm{Y}}}_{d}\right\|_{F}^{2}+\frac{\lambda_{d}}{\tau}\phi({\bm{\bm{Y}}}_{d})\big{)},
d=1Dλd2κ022τ.\displaystyle\leq\sum\nolimits_{d=1}^{D}\frac{\lambda_{d}^{2}\kappa_{0}^{2}}{2\tau}. (59)

Note that (57) comes from the fact that

min𝒴12𝒳𝒴F2+1τg¯τ(𝒴)12𝒳𝒳F2+1τg¯τ(𝒳)=1τg¯τ(𝒳),\displaystyle\min\nolimits_{\mathscr{Y}}\frac{1}{2}\left\|\mathscr{X}-\mathscr{Y}\right\|_{F}^{2}+\frac{1}{\tau}\bar{g}_{\tau}(\mathscr{Y})\leq\frac{1}{2}\left\|\mathscr{X}-\mathscr{X}\right\|_{F}^{2}+\frac{1}{\tau}\bar{g}_{\tau}(\mathscr{X})=\frac{1}{\tau}\bar{g}_{\tau}(\mathscr{X}),

(58) follows from the definition of \bar{g}_{\tau} in Proposition 6, and (59) follows from (56).

B.3.2 Proof of Proposition 7

Proof  From Lemma 21, we have

min𝒳F(𝒳)min𝒳Fτ(𝒳)min𝒳F(𝒳)Fτ(𝒳)=g(𝒳)g¯τ(𝒳)0.\displaystyle\min_{\mathscr{X}}F(\mathscr{X})-\min_{\mathscr{X}}F_{\tau}(\mathscr{X})\geq\min_{\mathscr{X}}F(\mathscr{X})-F_{\tau}(\mathscr{X})=g(\mathscr{X})-\bar{g}_{\tau}(\mathscr{X})\geq 0.

Let 𝒳1=argmin𝒳F(𝒳)\mathscr{X}_{1}=\arg\min\nolimits_{\mathscr{X}}F(\mathscr{X}) and 𝒳τ=argmin𝒳Fτ(𝒳)\mathscr{X}_{\tau}=\arg\min\nolimits_{\mathscr{X}}F_{\tau}(\mathscr{X}). Then, we have

min𝒳F(𝒳)min𝒳Fτ(𝒳)=F(𝒳1)Fτ(𝒳τ)F(𝒳τ)Fτ(𝒳τ)=g(𝒳τ)g¯τ(𝒳τ)κ022τd=1Dλd2.\displaystyle\min_{\mathscr{X}}F(\mathscr{X})\!-\!\min_{\mathscr{X}}F_{\tau}(\mathscr{X})=\!F(\mathscr{X}_{1})\!-\!F_{\tau}(\mathscr{X}_{\tau})\leq\!F(\mathscr{X}_{\tau})\!-\!F_{\tau}(\mathscr{X}_{\tau})\!=\!g(\mathscr{X}_{\tau})\!-\!\bar{g}_{\tau}(\mathscr{X}_{\tau})\!\leq\!\frac{\kappa_{0}^{2}}{2\tau}\sum\nolimits_{d=1}^{D}\lambda_{d}^{2}.

Thus, 0minFminFτκ022τd=1Dλd20\leq\min F-\min F_{\tau}\leq\frac{\kappa_{0}^{2}}{2\tau}\sum\nolimits_{d=1}^{D}\lambda_{d}^{2}.  

B.4 Proposition 8

Proof  The proof of this proposition can also be found in (Zhong and Kwok, 2014); we include it here for completeness. Recall that

proxg¯ττ(𝒳~f(𝒳~)/τ)=argmin𝒳12𝒳(𝒳~1τf(𝒳~))F2+1τg¯τ(𝒳).\displaystyle\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\tilde{\mathscr{X}}-\nabla f(\tilde{\mathscr{X}})/\tau)=\arg\min_{\mathscr{X}}\frac{1}{2}\left\|\mathscr{X}-\left(\tilde{\mathscr{X}}-\frac{1}{\tau}\nabla f(\tilde{\mathscr{X}})\right)\right\|_{F}^{2}+\frac{1}{\tau}\bar{g}_{\tau}(\mathscr{X}).

Let 𝒵=proxg¯ττ(𝒳~f(𝒳~)/τ)\mathscr{Z}=\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\tilde{\mathscr{X}}-\nabla f(\tilde{\mathscr{X}})/\tau). Thus,

\displaystyle 0\in\mathscr{Z}-\left(\tilde{\mathscr{X}}-\frac{1}{\tau}\nabla f(\tilde{\mathscr{X}})\right)+\frac{1}{\tau}\partial\bar{g}_{\tau}(\mathscr{Z}).

When \mathscr{Z}=\tilde{\mathscr{X}}, we have 0\in\nabla f(\tilde{\mathscr{X}})+\partial\bar{g}_{\tau}(\tilde{\mathscr{X}}). Thus, \tilde{\mathscr{X}} is a critical point of F_{\tau}.

B.5 Theorem 9

First, we introduce the following Lemmas, which are basic properties for the proximal step.

Lemma 22

(Parikh and Boyd, 2013) Let τ>ρ+Dκ0\tau>\rho+D\kappa_{0} and η=τρ+Dκ0\eta=\tau-\rho+D\kappa_{0}. Then, Fτ(proxg¯ττ(𝒳))Fτ(𝒳)η2𝒳proxg¯ττ(𝒳)F2F_{\tau}(\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{X}))\leq F_{\tau}(\mathscr{X})-\frac{\eta}{2}\big{\|}\mathscr{X}-\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{X})\big{\|}_{F}^{2}.

Lemma 23

(Parikh and Boyd, 2013) If 𝒳=proxg¯ττ(𝒳1τf(𝒳))\mathscr{X}\!=\!\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{X}\!-\!\frac{1}{\tau}\nabla f(\mathscr{X})), then 𝒳\mathscr{X} is a critical point of FτF_{\tau}.

Lemma 24

(Hare and Sagastizábal, 2009) The proximal map proxg¯ττ(𝒳)\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{X}) is continuous.

Proof (of Theorem 9) Recall that \text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{X})=\sum\nolimits_{i=1}^{D}\big[\text{prox}_{\frac{\lambda_{i}\phi}{\tau}}(\mathscr{X}_{{\left\langle i\right\rangle}})\big]^{{\left\langle i\right\rangle}}. From Lemma 22,

  • If step 7 is performed, we have

    Fτ(𝒳t+1)Fτ(𝒱t)η2𝒳t+1𝒱tF2Fτ(𝒳t)η2𝒳t+1𝒳tF2.\displaystyle F_{\tau}(\mathscr{X}_{t+1})\leq F_{\tau}(\mathscr{V}_{t})-\frac{\eta}{2}\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}^{2}\leq F_{\tau}(\mathscr{X}_{t})-\frac{\eta}{2}\left\|\mathscr{X}_{t+1}-\mathscr{X}_{t}\right\|_{F}^{2}. (60)
  • If step 5 is performed,

    Fτ(𝒳t+1)Fτ(𝒱t)η2𝒳t+1𝒱tF2\displaystyle F_{\tau}(\mathscr{X}_{t+1})\leq F_{\tau}(\mathscr{V}_{t})-\frac{\eta}{2}\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}^{2} Fτ(𝒳¯t)η2𝒳t+1𝒳¯tF2,\displaystyle\leq F_{\tau}(\bar{\mathscr{X}}_{t})-\frac{\eta}{2}\left\|\mathscr{X}_{t+1}-\bar{\mathscr{X}}_{t}\right\|_{F}^{2},
    Fτ(𝒳t)η2𝒳t+1𝒳¯tF2.\displaystyle\leq F_{\tau}(\mathscr{X}_{t})-\frac{\eta}{2}\left\|\mathscr{X}_{t+1}-\bar{\mathscr{X}}_{t}\right\|_{F}^{2}. (61)

Combining (60) and (61), we have

\displaystyle\frac{2}{\eta}(F_{\tau}(\mathscr{X}_{1})-F_{\tau}(\mathscr{X}_{T+1}))\geq\sum\nolimits_{t\in\chi_{1}(T)}\left\|\mathscr{X}_{t+1}-\bar{\mathscr{X}}_{t}\right\|_{F}^{2}+\sum\nolimits_{t\in\chi_{2}(T)}\left\|\mathscr{X}_{t+1}-\mathscr{X}_{t}\right\|_{F}^{2}, (62)

where \chi_{1}(T) and \chi_{2}(T) form a partition of \{1,\dots,T\} such that step 5 is performed when t\in\chi_{1}(T), and step 7 is performed when t\in\chi_{2}(T). As F_{\tau} is bounded from below and \lim_{\left\|\mathscr{X}\right\|_{F}\rightarrow\infty}F_{\tau}(\mathscr{X})=\infty, taking T=\infty in (62), we have

\displaystyle\sum\nolimits_{t\in\chi_{1}(\infty)}\left\|\mathscr{X}_{t+1}-\mathscr{Y}_{t}\right\|_{F}^{2}+\sum\nolimits_{t\in\chi_{2}(\infty)}\left\|\mathscr{X}_{t+1}-\mathscr{X}_{t}\right\|_{F}^{2}=c,

where c2η[Fτ(𝒳1)Fτmin]c\leq\frac{2}{\eta}\left[F_{\tau}(\mathscr{X}_{1})-F_{\tau}^{\text{min}}\right] is a positive constant. Thus, the sequence {𝒳t}\{\mathscr{X}_{t}\} is bounded, and it must have limit points. Besides, one of the following three cases must hold.

  1. \chi_{1}(\infty) is finite and \chi_{2}(\infty) is infinite. Let \tilde{\mathscr{X}} be a limit point of \{\mathscr{X}_{t}\}, and \{\mathscr{X}_{j_{t}}\} be a subsequence that converges to \tilde{\mathscr{X}}. In this case, using Lemma 24, we have

    limjt𝒳jt+1𝒳jtF2\displaystyle\lim\limits_{j_{t}\rightarrow\infty}\left\|\mathscr{X}_{j_{t}+1}-\mathscr{X}_{j_{t}}\right\|_{F}^{2} =limjtproxg¯ττ(𝒳jt1τf(𝒳jt))𝒳jtF2,\displaystyle=\lim\limits_{j_{t}\rightarrow\infty}\left\|\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{X}_{j_{t}}-\frac{1}{\tau}\nabla f(\mathscr{X}_{j_{t}}))-\mathscr{X}_{j_{t}}\right\|_{F}^{2},
    =proxg¯ττ(𝒳~1τf(𝒳~))𝒳~F2=0.\displaystyle=\left\|\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\tilde{\mathscr{X}}-\frac{1}{\tau}\nabla f(\tilde{\mathscr{X}}))-\tilde{\mathscr{X}}\right\|_{F}^{2}=0.

    Thus, 𝒳~=proxg¯ττ(𝒳~1τf(𝒳~))\tilde{\mathscr{X}}=\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\tilde{\mathscr{X}}-\frac{1}{\tau}\nabla f(\tilde{\mathscr{X}})), and 𝒳~\tilde{\mathscr{X}} is a critical point of FτF_{\tau} from Lemma 23.

  2. \chi_{1}(\infty) is infinite and \chi_{2}(\infty) is finite. Let \tilde{\mathscr{X}} be a limit point of \{\mathscr{X}_{t}\}, and \{\mathscr{X}_{j_{t}}\} be a subsequence that converges to \tilde{\mathscr{X}}. In this case, we have

    limjt𝒳jt+1𝒴jtF2\displaystyle\lim\limits_{j_{t}\rightarrow\infty}\left\|\mathscr{X}_{j_{t}+1}-\mathscr{Y}_{j_{t}}\right\|_{F}^{2} =limjtproxg¯ττ(𝒳jt1τf(𝒳jt))𝒴jtF2,\displaystyle=\lim\limits_{j_{t}\rightarrow\infty}\left\|\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{X}_{j_{t}}-\frac{1}{\tau}\nabla f(\mathscr{X}_{j_{t}}))-\mathscr{Y}_{j_{t}}\right\|_{F}^{2},
    =proxg¯ττ(𝒳~1τf(𝒳~))𝒳~F2=0.\displaystyle=\left\|\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\tilde{\mathscr{X}}-\frac{1}{\tau}\nabla f(\tilde{\mathscr{X}}))-\tilde{\mathscr{X}}\right\|_{F}^{2}=0.

    Thus, 𝒳~=proxg¯ττ(𝒳~1τf(𝒳~))\tilde{\mathscr{X}}=\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\tilde{\mathscr{X}}-\frac{1}{\tau}\nabla f(\tilde{\mathscr{X}})), and 𝒳~\tilde{\mathscr{X}} is a critical point of FτF_{\tau} from Lemma 23.

  3. Both \chi_{1}(\infty) and \chi_{2}(\infty) are infinite. In this case, the arguments in the above two cases apply to the corresponding subsequences, and the limit points are again critical points of F_{\tau}.

Thus, all limit points of {𝒳t}\{\mathscr{X}_{t}\} are critical points of FτF_{\tau}.  

B.6 Corollary 10

This corollary can be easily derived from the proof of Theorem 9.

Proof  Since 𝒳t+1=proxg¯ττ(𝒱t1τf(𝒱t))\mathscr{X}_{t+1}=\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{V}_{t}-\frac{1}{\tau}\nabla f(\mathscr{V}_{t})), conclusion (i) directly follows from Lemma 23. From (62), we have

min1,,T𝒳t+1𝒱tF2\displaystyle\min\nolimits_{1,\dots,T}\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}^{2} 1Tt=1T𝒳t+1𝒱tF2,\displaystyle\leq\frac{1}{T}\sum\nolimits_{t=1\dots T}\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}^{2},
2ηT(Fτ(𝒳1)Fτ(𝒳T+1))2ηT(Fτ(𝒳1)Fτmin).\displaystyle\leq\frac{2}{\eta T}(F_{\tau}(\mathscr{X}_{1})-F_{\tau}(\mathscr{X}_{T+1}))\leq\frac{2}{\eta T}(F_{\tau}(\mathscr{X}_{1})-F_{\tau}^{\min}).

Thus, we obtain Conclusion (ii).  

B.7 Theorem 13

We first bound Fτ\partial F_{\tau} in Lemma 25, then prove Theorem 13.

Lemma 25

For iterations in Algorithm 2, we have min𝒰tFτ(𝒳t)𝒰tF(τ+ρ)𝒳t+1𝒱tF\min\nolimits_{\mathscr{U}_{t}\in\partial F_{\tau}(\mathscr{X}_{t})}\left\|\mathscr{U}_{t}\right\|_{F}\!\leq\!(\tau+\rho)\left\|\mathscr{X}_{t+1}\!-\!\mathscr{V}_{t}\right\|_{F}.

Proof  Since 𝒳t+1\mathscr{X}_{t+1} is generated from the proximal step, i.e., 𝒳t+1=proxg¯ττ(𝒱t1τf(𝒱))\mathscr{X}_{t+1}=\text{prox}_{\frac{\bar{g}_{\tau}}{\tau}}(\mathscr{V}_{t}-\frac{1}{\tau}\nabla f(\mathscr{V})), from its optimality condition, we have

𝒳t+1(𝒱t1τf(𝒱t))+1τg¯τ(𝒳t+1)𝟎.\displaystyle\mathscr{X}_{t+1}-\big{(}\mathscr{V}_{t}-\frac{1}{\tau}\nabla f(\mathscr{V}_{t})\big{)}+\frac{1}{\tau}\partial\bar{g}_{\tau}(\mathscr{X}_{t+1})\ni\bm{0}.

Let \mathscr{U}_{t}=\tau\left[\mathscr{V}_{t}-\mathscr{X}_{t+1}\right]-\left[\nabla f(\mathscr{V}_{t})-\nabla f(\mathscr{X}_{t+1})\right]. We have

\displaystyle\mathscr{U}_{t}\in\nabla f(\mathscr{X}_{t+1})+\partial\bar{g}_{\tau}(\mathscr{X}_{t+1})=\partial F_{\tau}(\mathscr{X}_{t+1}).

Thus, 𝒰tFτ𝒳t+1𝒱tF+f(𝒱t)f(𝒳t+1)F(τ+ρ)𝒳t+1𝒱tF\left\|\mathscr{U}_{t}\right\|_{F}\leq\tau\left\|\mathscr{X}_{t+1}\!-\!\mathscr{V}_{t}\right\|_{F}\!+\!\left\|\nabla f(\mathscr{V}_{t})\!-\!\nabla f(\mathscr{X}_{t+1})\right\|_{F}\leq\left(\tau+\rho\right)\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}.  

Proof (of Theorem 13). From Theorem 9, we have \lim\limits_{t\rightarrow\infty}F_{\tau}(\mathscr{X}_{t})=F_{\tau}^{\min}. Then, from Lemma 25, we have

limtmin𝒰tFτ(𝒳t)𝒰tFlimt(τ+ρ)𝒳t+1𝒱tF=0.\displaystyle\lim\limits_{t\rightarrow\infty}\min\nolimits_{\mathscr{U}_{t}\in\partial F_{\tau}(\mathscr{X}_{t})}\!\left\|\mathscr{U}_{t}\right\|_{F}\!\leq\!\lim\limits_{t\rightarrow\infty}(\tau\!+\!\rho)\left\|\mathscr{X}_{t+1}\!-\!\mathscr{V}_{t}\right\|_{F}\!=\!0.

Thus, for any ϵ\epsilon, c>0c>0 and t>t0t>t_{0} where t0t_{0} is a sufficiently large positive integer, we have

𝒳t{𝒳|min𝒰Fτ(𝒳)𝒰Fϵ,Fτmin<Fτ(𝒳)<Fτmin+c}.\displaystyle\mathscr{X}_{t}\in\left\{\mathscr{X}\,|\min\nolimits_{\mathscr{U}\in\partial F_{\tau}(\mathscr{X})}\left\|\mathscr{U}\right\|_{F}\leq\epsilon,F^{\min}_{\tau}<F_{\tau}(\mathscr{X})<F_{\tau}^{\min}+c\right\}.

Then, the uniformized KL property implies for all tt0t\geq t_{0},

1\displaystyle 1 ψ(Fτ(𝒳t+1)Fτmin)min𝒰tFτ(𝒳t)𝒰tF,\displaystyle\leq\psi^{\prime}\left(F_{\tau}(\mathscr{X}_{t+1})-F_{\tau}^{\min}\right)\min\nolimits_{\mathscr{U}_{t}\in\partial F_{\tau}(\mathscr{X}_{t})}\left\|\mathscr{U}_{t}\right\|_{F},
=ψ(Fτ(𝒳t+1)Fτmin)(τ+ρ)𝒳t+1𝒱tF.\displaystyle=\psi^{\prime}\left(F_{\tau}(\mathscr{X}_{t+1})-F_{\tau}^{\min}\right)(\tau+\rho)\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}. (63)

Moreover, from Lemma 22, we have

𝒳t+1𝒱tF22η[Fτ(𝒱t)Fτ(𝒳t+1)].\displaystyle\left\|\mathscr{X}_{t+1}\!-\!\mathscr{V}_{t}\right\|_{F}^{2}\leq\frac{2}{\eta}\left[F_{\tau}(\mathscr{V}_{t})-F_{\tau}(\mathscr{X}_{t+1})\right]. (64)

Let rt=Fτ(𝒳t)Fτminr_{t}=F_{\tau}(\mathscr{X}_{t})-F_{\tau}^{\min}, we have

rtrt+1\displaystyle r_{t}-r_{t+1} =Fτ(𝒳t)Fτmin[Fτ(𝒳t+1)Fτmin],\displaystyle=F_{\tau}(\mathscr{X}_{t})-F_{\tau}^{\min}-\left[F_{\tau}(\mathscr{X}_{t+1})-F_{\tau}^{\min}\right],
Fτ(𝒱t)Fτmin[Fτ(𝒳t+1)Fτmin]=Fτ(𝒱t)Fτ(𝒳t+1).\displaystyle\geq F_{\tau}(\mathscr{V}_{t})-F_{\tau}^{\min}-\left[F_{\tau}(\mathscr{X}_{t+1})-F_{\tau}^{\min}\right]=F_{\tau}(\mathscr{V}_{t})-F_{\tau}(\mathscr{X}_{t+1}). (65)

Combine (63), (64) and (65), we have

1\displaystyle 1 [ψ(rt)]2(τ+ρ)2𝒳t+1𝒱tF2,\displaystyle\leq\left[\psi^{\prime}(r_{t})\right]^{2}(\tau+\rho)^{2}\left\|\mathscr{X}_{t+1}-\mathscr{V}_{t}\right\|_{F}^{2},
2(τ+ρ)2η[ψ(rt)]2[Fτ(𝒱t)Fτ(𝒳t+1)]2(τ+ρ)2η[ψ(rt+1)]2(rtrt+1).\displaystyle\leq\frac{2(\tau+\rho)^{2}}{\eta}\left[\psi^{\prime}(r_{t})\right]^{2}\left[F_{\tau}(\mathscr{V}_{t})-F_{\tau}(\mathscr{X}_{t+1})\right]\leq\frac{2(\tau+\rho)^{2}}{\eta}\left[\psi^{\prime}(r_{t+1})\right]^{2}(r_{t}-r_{t+1}). (66)

Since \psi(\alpha)=\frac{C}{x}\alpha^{x}, we have \psi^{\prime}(\alpha)=C\alpha^{x-1}, and (66) becomes 1\leq d_{1}C^{2}r_{t+1}^{2x-2}(r_{t}-r_{t+1}), where d_{1}=\frac{2(\tau+\rho)^{2}}{\eta}. Finally, it is shown in (Bolte et al., 2014; Li and Lin, 2015; Li et al., 2017) that a sequence \{r_{t}\} satisfying the above inequality converges to zero at the different rates stated in the Theorem.

B.8 Lemma 15

First, we introduce the following Lemma.

Lemma 26 (Theorem 1 in (Negahban and Wainwright, 2012))

Consider a matrix 𝐗m×n{\bm{\bm{X}}}\in\mathbb{R}^{m\times n}. Let d=12(m+n)d=\frac{1}{2}\left(m+n\right) and mrank(𝐗)=𝐗𝐗Fm_{\text{rank}}({\bm{\bm{X}}})=\frac{\left\|{\bm{\bm{X}}}\right\|_{*}}{\left\|{\bm{\bm{X}}}\right\|_{F}}. Define a constraint set 𝒞\mathcal{C} (with parameters c0,nc_{0},n) as

𝒞(n,c0)={𝑿m×n,𝑿0|mspike(𝑿)mrank(𝑿)1c0L𝛀1dlogd},\displaystyle\mathcal{C}(n,c_{0})=\left\{{\bm{\bm{X}}}\in\mathbb{R}^{m\times n},{\bm{\bm{X}}}\not=0\;|\;m_{\text{spike}}({\bm{\bm{X}}})\cdot m_{\text{rank}}({\bm{\bm{X}}})\leq\frac{1}{c_{0}L}\sqrt{\frac{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}{d\log d}}\right\},

where LL is a constant. There are constants (c0,c1,c2,c3)(c_{0},c_{1},c_{2},c_{3}) such that when 𝛀1>c3max(dlogd)\left\|{\bm{\bm{\Omega}}}\right\|_{1}>c_{3}\max(d\log d), we have

P𝛀(𝑿)F𝛀118𝑿F{1128Lmspike(𝑿)𝛀1},𝑿𝒞(𝛀1,c0),\displaystyle\frac{\left\|P_{{\bm{\bm{\Omega}}}}\left({\bm{\bm{X}}}\right)\right\|_{F}}{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}\geq\frac{1}{8}\left\|{\bm{\bm{X}}}\right\|_{F}\left\{1-\frac{128L\cdot m_{\text{spike}}({\bm{\bm{X}}})}{\sqrt{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}}\right\},\quad\forall{\bm{\bm{X}}}\in\mathcal{C}(\left\|{\bm{\bm{\Omega}}}\right\|_{1},c_{0}),

with probability at least 1-c_{1}\exp(-c_{2}d\log d).

Proof  (of Lemma 15) For an Mth-order tensor \Delta, applying Lemma 26 to each unfolded matrix \Delta_{{\left\langle i\right\rangle}} (i=1,\dots,M), we have

P𝛀(Δi)F𝛀118ΔF{1128Lmspike(Δi)𝛀1},\displaystyle\frac{\left\|P_{{\bm{\bm{\Omega}}}}\left(\Delta_{{\left\langle i\right\rangle}}\right)\right\|_{F}}{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}\geq\frac{1}{8}\left\|\Delta\right\|_{F}\left\{1-\frac{128L\cdot m_{\text{spike}}(\Delta_{{\left\langle i\right\rangle}})}{\sqrt{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}}\right\}, (67)

for all \Delta_{{\left\langle i\right\rangle}}\in\mathcal{C}^{i}(\left\|{\bm{\Omega}}\right\|_{1},c_{0}). Note that the L.H.S. of (67) is the same for all i=1,\dots,M. Thus, to ensure that (67) holds for every mode, we take the intersection of the constraint sets \mathcal{C}^{i}(\left\|{\bm{\Omega}}\right\|_{1},c_{0}) over all i, which leads to

{𝒳I1××IM,𝒳0|mspike(𝒳)mrank(𝒳i)1c0L𝛀1dilogdi}.\displaystyle\left\{\mathscr{X}\in\mathbb{R}^{I_{1}\times...\times I_{M}},\mathscr{X}\not=0\;|\;m_{\text{spike}}(\mathscr{X})\cdot m_{\text{rank}}(\mathscr{X}_{{\left\langle i\right\rangle}})\leq\frac{1}{c_{0}L}\sqrt{\frac{\left\|{\bm{\bm{\Omega}}}\right\|_{1}}{d_{i}\log d_{i}}}\right\}. (68)

Recall that mrank(𝒳)=1Mi=1Mmrank(𝒳i)m_{\text{rank}}(\mathscr{X})=\frac{1}{M}\sum\nolimits_{i=1}^{M}m_{\text{rank}}(\mathscr{X}_{{\left\langle i\right\rangle}}) as defined in (33). Thus, 𝒞~(n,c0)\tilde{\mathcal{C}}(n,c_{0}) is a subset of (68). As a result, (36) holds.  

B.9 Theorem 16

Here, we first introduce some auxiliary lemmas in Appendix B.9.1, which will be used to prove Theorem 16 in Appendix B.9.2.

B.9.1 Auxiliary Lemmas

Lemma 27 (Lemma 4 in (Loh and Wainwright, 2015))

For κ\kappa in Assumption 1, we have

  • (i). The function \alpha\rightarrow\frac{\kappa(\alpha)}{\alpha} is nonincreasing on \alpha>0;

  • (ii). The derivative of \kappa is upper bounded by \kappa_{0};

  • (iii). The function \alpha\rightarrow\kappa(\alpha)+\frac{\alpha^{2}c}{2} is convex only when c\geq\kappa_{0};

  • (iv). \lambda|\alpha|\leq\lambda\kappa(|\alpha|)+\frac{\alpha^{2}\kappa_{0}}{2}.

Lemma 28

\left\langle\mathscr{X},\mathscr{Y}\right\rangle\leq\min\nolimits_{i=1,\dots,M}\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{\infty}\left\|\mathscr{Y}_{{\left\langle i\right\rangle}}\right\|_{*}.

Proof  First, we have \left\langle\mathscr{X},\mathscr{Y}\right\rangle=\left\langle\mathscr{X}_{{\left\langle i\right\rangle}},\mathscr{Y}_{{\left\langle i\right\rangle}}\right\rangle for all i\in\{1,\dots,M\}. Then, since \left\|\cdot\right\|_{\infty} and \left\|\cdot\right\|_{*} are dual norms of each other, \left\langle\mathscr{X}_{{\left\langle i\right\rangle}},\mathscr{Y}_{{\left\langle i\right\rangle}}\right\rangle\leq\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{\infty}\left\|\mathscr{Y}_{{\left\langle i\right\rangle}}\right\|_{*}. Thus, \left\langle\mathscr{X},\mathscr{Y}\right\rangle\leq\min\nolimits_{i=1,\dots,M}\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{\infty}\left\|\mathscr{Y}_{{\left\langle i\right\rangle}}\right\|_{*}.

Lemma 29

For all i\in\{1,\dots,M\}, we have \left\|\mathscr{X}\right\|_{F}\leq\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{*} and \left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{*}\leq\sqrt{\min(I^{i},\frac{I^{\pi}}{I^{i}})}\left\|\mathscr{X}\right\|_{F}.

Proof  Note that 𝒳F=𝒳iF\left\|\mathscr{X}\right\|_{F}=\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{F} and 𝒳iF𝒳i\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{F}\leq\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{*}, thus 𝒳F𝒳i\left\|\mathscr{X}\right\|_{F}\leq\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{*}. Then, since 𝑿min(p,q)𝑿F\left\|{\bm{\bm{X}}}\right\|_{*}\leq\sqrt{\min(p,q)}\left\|{\bm{\bm{X}}}\right\|_{F} for a matrix 𝑿{\bm{\bm{X}}} of size p×qp\times q, we have 𝒳imin(Ii,Iπ/Ii)𝒳iF\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{*}\leq\sqrt{\min(I^{i},I^{\pi}/I^{i})}\left\|\mathscr{X}_{{\left\langle i\right\rangle}}\right\|_{F} =min(Ii,Iπ/Ii)𝒳F=\sqrt{\min(I^{i},I^{\pi}/I^{i})}\left\|\mathscr{X}\right\|_{F}.  
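These inequalities can be verified numerically on a random example; the sketch below uses an arbitrary 3rd-order tensor, and the reshape used for the mode-i unfolding only permutes columns, which changes neither the Frobenius nor the nuclear norm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))
I_pi = X.size

for i in range(X.ndim):
    Xi = np.moveaxis(X, i, 0).reshape(X.shape[i], -1)   # a mode-i unfolding
    fro = np.linalg.norm(X)                             # ||X||_F
    nuc = np.linalg.norm(Xi, ord='nuc')                 # ||X_<i>||_*
    bound = np.sqrt(min(X.shape[i], I_pi // X.shape[i]))
    assert fro <= nuc + 1e-9 and nuc <= bound * fro + 1e-9
print("Lemma 29 holds on this example.")
```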

Lemma 30

Define hi(𝒳)=ϕ(𝒳i)h_{i}(\mathscr{X})=\phi(\mathscr{X}_{{\left\langle i\right\rangle}}). Let Φk(𝐀)\varPhi_{k}(\bm{A}) produce the best rank kk approximation to matrix 𝐀\bm{A} and Ψk(𝐀)=𝐀Φk(𝐀)\varPsi_{k}(\bm{A})=\bm{A}-\varPhi_{k}(\bm{A}). Suppose εi>0\varepsilon_{i}>0 for i{1,,M}i\in\{1,...,M\} are constants such that εihi(Φki(𝒜i))hi(Ψki(𝒜i))0\varepsilon_{i}h_{i}(\varPhi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}}))-h_{i}(\varPsi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}}))\geq 0. Then,

εihi(Φki(𝒜i))hi(Ψki(𝒜i))κ0(εiΦki(𝒜i)Ψki(𝒜i)).\displaystyle\varepsilon_{i}h_{i}(\varPhi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}}))-h_{i}(\varPsi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}}))\leq\kappa_{0}(\varepsilon_{i}\left\|\varPhi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right\|_{*}-\left\|\varPsi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right\|_{*}). (69)

Moreover, if 𝒳i\mathscr{X}^{*}_{{\left\langle i\right\rangle}} is of rank kik_{i}, for any tensor 𝒳\mathscr{X} satisfying εihi(𝒳i)hi(𝒳i)0\varepsilon_{i}h_{i}(\mathscr{X}^{*}_{{\left\langle i\right\rangle}})-h_{i}(\mathscr{X}_{{\left\langle i\right\rangle}})\geq 0 and εi>1\varepsilon_{i}>1, we have

εihi(𝒳i)hi(𝒳i)κ0(εiΦki(𝒱i)Ψki(𝒱i)),\displaystyle\varepsilon_{i}h_{i}(\mathscr{X}^{*}_{{\left\langle i\right\rangle}})\!-\!h_{i}(\mathscr{X}_{{\left\langle i\right\rangle}})\leq\kappa_{0}(\varepsilon_{i}\left\|\varPhi_{k_{i}}(\mathscr{V}_{{\left\langle i\right\rangle}})\right\|_{*}\!-\!\left\|\varPsi_{k_{i}}(\mathscr{V}_{{\left\langle i\right\rangle}})\right\|_{*}), (70)

where 𝒱=𝒳𝒳\mathscr{V}=\mathscr{X}^{*}-\mathscr{X}.

Proof  We first prove (69). Let h(α)=ακ(α)h(\alpha)=\frac{\alpha}{\kappa(\alpha)} on α>0\alpha>0. From Lemma 27, we know h(α)h(\alpha) is a non-decreasing function. Therefore,

Ψki(𝒜i)\displaystyle\left\|\varPsi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right\|_{*} =j=ki+1κ(σj(𝒜i))h(σj(𝒜i)),\displaystyle=\sum\nolimits_{j=k_{i}+1}\kappa\left(\sigma_{j}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right)h\left(\sigma_{j}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right),
h(σ1(𝒜i))j=ki+1κ(σj(𝒜i))=h(σ1(𝒜i))hi(Ψki(𝒜i)).\displaystyle\leq h\left(\sigma_{1}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right)\sum\nolimits_{j=k_{i}+1}\kappa\left(\sigma_{j}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right)=h\left(\sigma_{1}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right)\cdot h_{i}\left(\varPsi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right). (71)

Again, using non-decreasing property of hh, we have

hi(Φki(𝒜i))h(σki+1(𝒜i))\displaystyle h_{i}\left(\varPhi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right)h\left(\sigma_{k_{i}+1}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right) =h(σki+1(𝒜i))j=1kiκ(σj(𝒜i)),\displaystyle=h\left(\sigma_{k_{i}+1}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right)\sum\nolimits_{j=1}^{k_{i}}\kappa\left(\sigma_{j}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right),
j=1kiκ(σj(𝒜i))h(σj(𝒜i))=Φki(𝒜i).\displaystyle\leq\sum\nolimits_{j=1}^{k_{i}}\kappa\left(\sigma_{j}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right)h\left(\sigma_{j}\left(\mathscr{A}_{{\left\langle i\right\rangle}}\right)\right)=\left\|\varPhi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right\|_{*}. (72)

Note that h(α)1/κ0h(\alpha)\geq 1/\kappa_{0} from Lemma 27, and combining (71) and (72), we have

0εihi(Φki(𝒜i))hi(Ψki(𝒜i))\displaystyle 0\leq\varepsilon_{i}\cdot h_{i}(\varPhi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}}))-h_{i}(\varPsi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})) (εiΦki(𝒜i)Ψki(𝒜i))/h(σ1(𝒜i))\displaystyle\leq\big{(}\varepsilon_{i}\left\|\varPhi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right\|_{*}-\left\|\varPsi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right\|_{*}\big{)}/h\left(\sigma_{1}(\mathscr{A}_{{\left\langle i\right\rangle}})\right)
κ0(εiΦki(𝒜i)Ψki(𝒜i)).\displaystyle\leq\kappa_{0}\big{(}\varepsilon_{i}\left\|\varPhi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right\|_{*}-\left\|\varPsi_{k_{i}}(\mathscr{A}_{{\left\langle i\right\rangle}})\right\|_{*}\big{)}.

Thus, (69) is obtained. Next, we prove (70). The triangle inequality and subadditivity of hih_{i} (see Lemma 5 in (Loh and Wainwright, 2015)) imply that

0εihi(𝒳i)hi(𝒳i)\displaystyle 0\leq\varepsilon_{i}\cdot h_{i}(\mathscr{X}^{*}_{{\left\langle i\right\rangle}})-h_{i}(\mathscr{X}_{{\left\langle i\right\rangle}}) =εihi(Φmi(𝒳i))hi(Φmi(𝒳i))hi(Ψmi(𝒳i)),\displaystyle=\varepsilon_{i}\cdot h_{i}(\varPhi_{m_{i}}(\mathscr{X}^{*}_{{\left\langle i\right\rangle}}))-h_{i}(\varPhi_{m_{i}}(\mathscr{X}_{{\left\langle i\right\rangle}}))-h_{i}(\varPsi_{m_{i}}(\mathscr{X}_{{\left\langle i\right\rangle}})),
εihi(Φmi(𝒱i))hi(Ψmi(𝒱i)),\displaystyle\leq\varepsilon_{i}\cdot h_{i}\left(\varPhi_{m_{i}}\left(\mathscr{V}_{{\left\langle i\right\rangle}}\right)\right)-h_{i}\left(\varPsi_{m_{i}}\left(\mathscr{V}_{{\left\langle i\right\rangle}}\right)\right),
κ0(εiΦki(𝒱i)Ψki(𝒱i)).\displaystyle\leq\kappa_{0}\big{(}\varepsilon_{i}\left\|\varPhi_{k_{i}}(\mathscr{V}_{{\left\langle i\right\rangle}})\right\|_{*}-\left\|\varPsi_{k_{i}}(\mathscr{V}_{{\left\langle i\right\rangle}})\right\|_{*}\big{)}.

Thus, (70) is obtained.  

Lemma 31

ϕ(𝑿)κ0\left\|\partial\phi({\bm{\bm{X}}})\right\|_{\infty}\leq\kappa_{0} where ϕ\phi is defined in (7).

Proof  Let 𝑿{\bm{\bm{X}}} be of size m×nm\times n with mnm\leq n, and SVD of 𝑿{\bm{\bm{X}}} be 𝑼𝚺𝑽\bm{U}\bm{\Sigma}\bm{V}^{\top} where 𝚺= Diag(σ1,,σm)\bm{\Sigma}=\text{\,Diag}\left(\sigma_{1},\dots,\sigma_{m}\right). From Theorem 3.7 in (Lewis and Sendov, 2005), we have

ϕ(𝑿)=𝑼 Diag(κ(σ1),,κ(σm))𝑽.\displaystyle\partial\phi({\bm{\bm{X}}})=\bm{U}\text{\,Diag}\left(\kappa^{\prime}(\sigma_{1}),...,\kappa^{\prime}(\sigma_{m})\right)\bm{V}^{\top}.

From Lemma 27, we have κ(σ1)κ(σ2)κ0\kappa^{\prime}(\sigma_{1})\leq\kappa^{\prime}(\sigma_{2})\leq...\leq\kappa_{0}. Since 𝑿\left\|{\bm{\bm{X}}}\right\|_{\infty} returns the maximum singular value of 𝑿{\bm{\bm{X}}}, we have ϕ(𝑿)κ(σm)κ0\left\|\partial\phi({\bm{\bm{X}}})\right\|_{\infty}\leq\kappa^{\prime}(\sigma_{m})\leq\kappa_{0}.  

Lemma 32

ϕ(𝑿)+κ02𝑿F2\phi({\bm{\bm{X}}})+\frac{\kappa_{0}}{2}\left\|{\bm{\bm{X}}}\right\|_{F}^{2} is convex.

Proof  Using the definition of ϕ\phi in (7) and the fact 𝑿F2=iσi(𝑿)\left\|{\bm{\bm{X}}}\right\|_{F}^{2}=\sum_{i}\sigma_{i}({\bm{\bm{X}}}), we have

γ(𝑿)=ϕ(𝑿)+κ0/2𝑿F2=iψ(σi(𝑿)),\displaystyle\gamma({\bm{\bm{X}}})=\phi({\bm{\bm{X}}})+\kappa_{0}/2\left\|{\bm{\bm{X}}}\right\|_{F}^{2}=\sum\nolimits_{i}\psi(\sigma_{i}({\bm{\bm{X}}})),

where ψ(α)=κ(α)+κ0α2/2\psi(\alpha)=\kappa(\alpha)+\kappa_{0}\alpha^{2}/2. Since ψ(α)\psi(\alpha) is convex (Lemma 27), γ(𝑿)\gamma({\bm{\bm{X}}}) is convex (using Proposition 6.1 in (Lewis and Sendov, 2005)).  

B.9.2 Proof of Theorem 16

Proof  Part 1). Let \tilde{\mathscr{V}}=\tilde{\mathscr{X}}-\mathscr{X}^{*}. We begin by proving \|\tilde{\mathscr{V}}\|_{F}\leq 1. Suppose, on the contrary, that \|\tilde{\mathscr{V}}\|_{F}>1. Then the second condition in (35) holds, i.e.,

f(𝒳~)f(𝒳),𝒱~α2𝒱~F2τ2logIπ/𝛀1i=1MΔi.\displaystyle\left\langle\nabla f(\tilde{\mathscr{X}})-\nabla f(\mathscr{X}^{*}),\tilde{\mathscr{V}}\right\rangle\geq\alpha_{2}\|\tilde{\mathscr{V}}\|_{F}^{2}-\tau_{2}\sqrt{\log I^{\pi}/\|{\bm{\bm{\Omega}}}\|_{1}}\sum\nolimits_{i=1}^{M}\left\|\Delta_{{\left\langle i\right\rangle}}\right\|_{*}. (73)

Since 𝒳~\tilde{\mathscr{X}} is a first-order critical point, then

f(𝒳~)+r(𝒳~),𝒳𝒳~0,\displaystyle\left\langle\nabla f(\tilde{\mathscr{X}})+\partial r(\tilde{\mathscr{X}}),\mathscr{X}-\tilde{\mathscr{X}}\right\rangle\geq 0, (74)
f(𝒳~)+r(𝒳~)𝟎.\displaystyle\nabla f(\tilde{\mathscr{X}})+\partial r(\tilde{\mathscr{X}})\ni\bm{0}. (75)

Taking 𝒳=𝒳\mathscr{X}=\mathscr{X}^{*}, from (74), we have

f(𝒳~)+r(𝒳~),𝒱~0.\displaystyle\left\langle\nabla f(\tilde{\mathscr{X}})+\partial r(\tilde{\mathscr{X}}),-\tilde{\mathscr{V}}\right\rangle\geq 0. (76)

Combining (73) and (76), we have

r(𝒳~)f(𝒳),𝒱~α2𝒱~F2τ2logIπ/𝛀1i=1MΔi.\displaystyle\left\langle-\partial r(\tilde{\mathscr{X}})-\nabla f\left(\mathscr{X}^{*}\right),\tilde{\mathscr{V}}\right\rangle\geq\alpha_{2}\|\tilde{\mathscr{V}}\|_{F}^{2}-\tau_{2}\sqrt{\log I^{\pi}/\|{\bm{\bm{\Omega}}}\|_{1}}\sum\nolimits_{i=1}^{M}\left\|\Delta_{{\left\langle i\right\rangle}}\right\|_{*}. (77)

Let v~i=𝒱~i\tilde{v}_{i}=\|\tilde{\mathscr{V}}_{{\left\langle i\right\rangle}}\|_{*} and v~=i=1Mv~i\tilde{v}=\sum\nolimits_{i=1}^{M}\tilde{v}_{i}. For the L.H.S of (77),

r(𝒳~)+f(𝒳),𝒱~\displaystyle\left\langle\partial r(\tilde{\mathscr{X}})+\nabla f\left(\mathscr{X}^{*}\right),\tilde{\mathscr{V}}\right\rangle =f(𝒳),𝒱~+λi=1Mϕ(𝒳i),𝒱~i\displaystyle=\left\langle\nabla f\left(\mathscr{X}^{*}\right),\tilde{\mathscr{V}}\right\rangle+\lambda\sum\nolimits_{i=1}^{M}\left\langle\partial\phi(\mathscr{X}_{{\left\langle i\right\rangle}}),\tilde{\mathscr{V}}_{{\left\langle i\right\rangle}}\right\rangle (78)
maxi[f(𝒳)]iv~i+λi=1Mϕ(𝒳i)v~i,\displaystyle\leq\max_{i}\left\|[\nabla f\left(\mathscr{X}^{*}\right)]_{{\left\langle i\right\rangle}}\right\|_{\infty}\tilde{v}_{i}+\lambda\sum\nolimits_{i=1}^{M}\left\|\partial\phi(\mathscr{X}_{{\left\langle i\right\rangle}})\right\|_{\infty}\tilde{v}_{i}, (79)

Next, note that the following inequalities hold.

  • From the left part of (37) in Theorem 16, we have maxi[f(𝒳)]iλκ04\max_{i}\left\|[\nabla f\left(\mathscr{X}^{*}\right)]_{{\left\langle i\right\rangle}}\right\|_{\infty}\leq\frac{\lambda\kappa_{0}}{4}.

  • From Lemma 31, we have ϕ(𝑿)κ0\left\|\partial\phi({\bm{\bm{X}}})\right\|_{\infty}\leq\kappa_{0}.

Combining with (79), we have

r(𝒳~)+f(𝒳),𝒱~λκ04+3λκ0=13λκ04.\displaystyle\left\langle\partial r(\tilde{\mathscr{X}})+\nabla f\left(\mathscr{X}^{*}\right),\tilde{\mathscr{V}}\right\rangle\leq\frac{\lambda\kappa_{0}}{4}+3\lambda\kappa_{0}=\frac{13\lambda\kappa_{0}}{4}. (80)

Combining (77) and (80), then rearranging terms, we have

𝒱~F1α2(τ2logIπ/𝛀1+λκ0)v~1α2(τ2logIπ/𝛀1+13λκ04)R.\displaystyle\|\tilde{\mathscr{V}}\|_{F}\leq\frac{1}{\alpha_{2}}\left(\tau_{2}\sqrt{\log I^{\pi}/\|{\bm{\bm{\Omega}}}\|_{1}}+\lambda\kappa_{0}\right)\tilde{v}\leq\frac{1}{\alpha_{2}}\left(\tau_{2}\sqrt{\log I^{\pi}/\|{\bm{\bm{\Omega}}}\|_{1}}+\frac{13\lambda\kappa_{0}}{4}\right)R.

Finally, using the assumptions on \|{\bm{\Omega}}\|_{1} and \lambda, we have \|\tilde{\mathscr{V}}\|_{F}\leq 1, which contradicts the assumption made at the beginning of Part 1). Thus, \|\tilde{\mathscr{V}}\|_{F}\leq 1 must hold.

Part 2). Let hi(𝒳)=ϕ(𝒳i)h_{i}(\mathscr{X})\!=\!\phi(\mathscr{X}_{{\left\langle i\right\rangle}}). Since the function hi(𝒳)+μ/2𝒳F2h_{i}(\mathscr{X})\!+\!\mu/2\left\|\mathscr{X}\right\|_{F}^{2} is convex (Lemma 32), we have

hi(𝒳~),𝒳𝒳~h~i(𝒳~).\displaystyle\left\langle\partial h_{i}(\tilde{\mathscr{X}}),\mathscr{X}^{*}-\tilde{\mathscr{X}}\right\rangle\leq\tilde{h}_{i}(\tilde{\mathscr{X}}). (81)

where h~i(𝒳~)=hi(𝒳)hi(𝒳~)+L2𝒳~𝒳F2\tilde{h}_{i}(\tilde{\mathscr{X}})=h_{i}(\mathscr{X}^{*})-h_{i}(\tilde{\mathscr{X}})+\frac{L}{2}\|\tilde{\mathscr{X}}-\mathscr{X}^{*}\|_{F}^{2}. From the first condition in (35), we have

\displaystyle\langle\nabla f(\tilde{\mathscr{X}})-\nabla f\left(\mathscr{X}^{*}\right),\tilde{\mathscr{V}}\rangle\geq\alpha_{1}\|\tilde{\mathscr{V}}\|_{F}^{2}-\tau_{1}\frac{\log I^{\pi}}{\|{\bm{\Omega}}\|_{1}}\tilde{v}^{2}. (82)

Combining (74) and (82), we have

\displaystyle\alpha_{1}\|\tilde{\mathscr{V}}\|_{F}^{2}-\tau_{1}\frac{\log I^{\pi}}{\|{\bm{\Omega}}\|_{1}}\tilde{v}^{2}\leq-\left\langle\partial r(\tilde{\mathscr{X}}),\tilde{\mathscr{V}}\right\rangle-\left\langle\nabla f(\mathscr{X}^{*}),\tilde{\mathscr{V}}\right\rangle=-\lambda\sum\nolimits_{i=1}^{M}\left\langle\partial h_{i}(\tilde{\mathscr{X}}),\tilde{\mathscr{V}}\right\rangle-\left\langle\nabla f(\mathscr{X}^{*}),\tilde{\mathscr{V}}\right\rangle.

Together with (81), we have

α1𝒱~F2τ1logIπ𝛀1v~2\displaystyle\alpha_{1}\|\tilde{\mathscr{V}}\|_{F}^{2}-\tau_{1}\frac{\log I^{\pi}}{\|{\bm{\bm{\Omega}}}\|_{1}}\tilde{v}^{2} λi=1Mh~i(𝒳~)f(𝒳),𝒱~,\displaystyle\leq\lambda\sum\nolimits_{i=1}^{M}\tilde{h}_{i}(\tilde{\mathscr{X}})-\left\langle\nabla f(\mathscr{X}^{*}),\tilde{\mathscr{V}}\right\rangle,
λi=1Mh~i(𝒳~)+maxi[f(𝒳)]iv~i\displaystyle\leq\lambda\sum\nolimits_{i=1}^{M}\tilde{h}_{i}(\tilde{\mathscr{X}})+\max_{i}\left\|[\nabla f(\mathscr{X}^{*})]_{{\left\langle i\right\rangle}}\right\|_{\infty}\tilde{v}_{i}
λi=1Mh~i(𝒳~)+maxi[f(𝒳)]iv~,\displaystyle\leq\lambda\sum\nolimits_{i=1}^{M}\tilde{h}_{i}(\tilde{\mathscr{X}})+\max_{i}\left\|[\nabla f(\mathscr{X}^{*})]_{{\left\langle i\right\rangle}}\right\|_{\infty}\tilde{v},

where the second inequality is from Lemma 28. Rearranging items in the above inequality, we have

(α1μM2)𝒱~F2λi=1M(hi(𝒳)hi(𝒳~))+(maxi[f(𝒳)]i+τ1logIπ𝛀1v~)v~.\displaystyle\big{(}\alpha_{1}-\frac{\mu M}{2}\big{)}\|\tilde{\mathscr{V}}\|_{F}^{2}\!\leq\!\lambda\sum\nolimits_{i=1}^{M}\big{(}h_{i}(\mathscr{X}^{*})\!-\!h_{i}(\tilde{\mathscr{X}})\big{)}\!+\!\left(\max_{i}\left\|[\nabla f(\mathscr{X}^{*})]_{{\left\langle i\right\rangle}}\right\|_{\infty}\!+\!\tau_{1}\frac{\log I^{\pi}}{\|{\bm{\bm{\Omega}}}\|_{1}}\tilde{v}\right)\tilde{v}. (83)

Note that from the Assumption in Theorem 16, we have the following inequalities.

  • maxi[f(𝒳)]iκ0λ/4\max_{i}\big{\|}\left[\nabla f(\mathscr{X}^{*})\right]_{{\left\langle i\right\rangle}}\big{\|}_{\infty}\leq\kappa_{0}\lambda/4.

  • Since 𝛀116R2max(τ12,τ22)log(Iπ)/α22\|{\bm{\bm{\Omega}}}\|_{1}\geq 16R^{2}\max\left(\tau_{1}^{2},\tau_{2}^{2}\right)\log(I^{\pi})/\alpha_{2}^{2} and α2logIπ/𝛀1κ0λ/4\alpha_{2}\sqrt{\log I^{\pi}/\|{\bm{\bm{\Omega}}}\|_{1}}\leq\kappa_{0}\lambda/4, then

    τ1logIπ𝛀1v~τ1α2logIπ𝛀1v~α2logIπ𝛀1τ1α2α22logIπ16v~2τ12v~α2logIπ𝛀1λκ04.\displaystyle\frac{\tau_{1}\log I^{\pi}}{\|{\bm{\bm{\Omega}}}\|_{1}}\tilde{v}\leq\frac{\tau_{1}}{\alpha_{2}}\sqrt{\frac{\log I^{\pi}}{\|{\bm{\bm{\Omega}}}\|_{1}}}\tilde{v}\cdot\alpha_{2}\sqrt{\frac{\log I^{\pi}}{\|{\bm{\bm{\Omega}}}\|_{1}}}\leq\frac{\tau_{1}}{\alpha_{2}}\sqrt{\frac{\alpha_{2}^{2}\log I^{\pi}}{16\tilde{v}^{2}\tau_{1}^{2}}}\tilde{v}\cdot\alpha_{2}\sqrt{\frac{\log I^{\pi}}{\|{\bm{\bm{\Omega}}}\|_{1}}}\leq\frac{\lambda\kappa_{0}}{4}.

Combining the above inequalities with (83), we further have

(α1μM2)𝒱~F2\displaystyle\big{(}\alpha_{1}-\frac{\mu M}{2}\big{)}\|\tilde{\mathscr{V}}\|_{F}^{2} λi=1M(hi(𝒳)hi(𝒳~))+λκ02v~.\displaystyle\leq\lambda\sum\nolimits_{i=1}^{M}\big{(}h_{i}(\mathscr{X}^{*})-h_{i}(\tilde{\mathscr{X}})\big{)}+\frac{\lambda\kappa_{0}}{2}\tilde{v}. (84)

Part 3). Combining (84) and Lemma 27, as well as the subadditivity of hih_{i}, we have

\begin{align}
\Big(\alpha_{1}-\frac{LM}{2}\Big)\|\tilde{\mathscr{V}}\|_{F}^{2}
&\leq\lambda\sum\nolimits_{i=1}^{M}\big(h_{i}(\mathscr{X}^{*})-h_{i}(\tilde{\mathscr{X}})\big)+\frac{\lambda\kappa_{0}}{2}\Big(\frac{\sum\nolimits_{i=1}^{M}h_{i}(\tilde{\mathscr{V}})}{LM}+\frac{LM}{2\lambda\kappa_{0}}\|\tilde{\mathscr{V}}\|_{F}^{2}\Big)\nonumber\\
&\leq\lambda\sum\nolimits_{i=1}^{M}\big(h_{i}(\mathscr{X}^{*})-h_{i}(\tilde{\mathscr{X}})\big)+\frac{\lambda\sum\nolimits_{i=1}^{M}\big(h_{i}(\mathscr{X}^{*})+h_{i}(\tilde{\mathscr{X}})\big)}{2D}+\frac{LM}{4}\|\tilde{\mathscr{V}}\|_{F}^{2}.
\tag{85}
\end{align}

Next, define

\begin{align*}
a_{v}=\alpha_{1}-\frac{3M}{4}\kappa_{0},\quad b_{v}=1+\frac{1}{2M},\quad c_{v}=1-\frac{1}{2M}.
\end{align*}

Rearranging terms in (85), we have

\begin{align}
a_{v}\|\tilde{\mathscr{V}}\|_{F}^{2}\leq\lambda\sum\nolimits_{i=1}^{M}\big(b_{v}h_{i}(\mathscr{X}^{*})-c_{v}h_{i}(\tilde{\mathscr{X}})\big).
\tag{86}
\end{align}

From Lemma 30, we have

\begin{align}
b_{v}h_{i}(\mathscr{X}^{*})-c_{v}h_{i}(\tilde{\mathscr{X}})
\leq L\big(b_{v}\big\|\varPhi_{k_{i}}(\tilde{\mathscr{V}}_{\left\langle i\right\rangle})\big\|_{*}-c_{v}\big\|\varPsi_{k_{i}}(\tilde{\mathscr{V}}_{\left\langle i\right\rangle})\big\|_{*}\big).
\tag{87}
\end{align}

In addition, we have the cone condition

\begin{align}
\big\|\varPhi_{k_{i}}(\tilde{\mathscr{V}}_{\left\langle i\right\rangle})\big\|_{*}\leq\frac{c_{v}}{b_{v}}\big\|\varPsi_{k_{i}}(\tilde{\mathscr{V}}_{\left\langle i\right\rangle})\big\|_{*}.
\tag{88}
\end{align}

Combining (86), (87) and (88), we have

\begin{align*}
a_{v}\|\tilde{\mathscr{V}}\|_{F}^{2}
&\leq\lambda\kappa_{0}\sum\nolimits_{i=1}^{M}\big(b_{v}\big\|\varPhi_{k_{i}}(\tilde{\mathscr{V}}_{\left\langle i\right\rangle})\big\|_{*}-c_{v}\big\|\varPsi_{k_{i}}(\tilde{\mathscr{V}}_{\left\langle i\right\rangle})\big\|_{*}\big)\\
&\leq\lambda\kappa_{0}\sum\nolimits_{i=1}^{M}b_{v}\big\|\varPhi_{k_{i}}(\tilde{\mathscr{V}}_{\left\langle i\right\rangle})\big\|_{*}
\leq\lambda\kappa_{0}\sum\nolimits_{i=1}^{M}c_{v}\sqrt{k_{i}}\,\|\tilde{\mathscr{V}}\|_{F},
\end{align*}

where the last inequality comes from Lemma 29. Since $a_{v}>0$ by assumption, we conclude that

\begin{align*}
\|\tilde{\mathscr{V}}\|_{F}\leq\frac{\lambda\kappa_{0}c_{v}}{a_{v}}\sum\nolimits_{i=1}^{M}\sqrt{k_{i}},
\end{align*}

which proves the theorem.  
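
As a concrete reading of this bound (an illustrative special case, not part of the original statement), consider a third-order tensor with $M=3$ unfoldings and a common target rank $k_{i}=k$ for all $i$. The bound then specializes to

\begin{align*}
\|\tilde{\mathscr{V}}\|_{F}\leq\frac{3\lambda\kappa_{0}c_{v}\sqrt{k}}{a_{v}},
\end{align*}

so the estimation error grows linearly with $\lambda$ but only as the square root of the target rank.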

B.10 Corollary 17

Proof  When the noise level is sufficiently small, (37) reduces to

\begin{align}
\sqrt{\log I^{\pi}/\left\|\bm{\Omega}\right\|_{1}}\leq\lambda\leq\frac{1}{4R\kappa_{0}}.
\tag{89}
\end{align}

Let $\lambda=b_{1}\max_{i}\|[P_{\bm{\Omega}}(\mathscr{E})]_{\left\langle i\right\rangle}\|_{\infty}$, where $b_{1}\in\big[\frac{4}{\kappa_{0}},\frac{\alpha_{2}}{4R\kappa_{0}\max_{i}\|[P_{\bm{\Omega}}(\mathscr{E})]_{\left\langle i\right\rangle}\|_{\infty}}\big]$. It is easy to check that (89) holds. Then, from Theorem 16, we have

\begin{align}
\big\|\mathscr{X}^{*}-\tilde{\mathscr{X}}\big\|_{F}\leq b_{1}\max_{i}\big\|[P_{\bm{\Omega}}(\mathscr{E})]_{\left\langle i\right\rangle}\big\|_{\infty}\cdot\frac{\kappa_{0}c_{v}}{a_{v}}\sum\nolimits_{i=1}^{M}\sqrt{k_{i}}.
\tag{90}
\end{align}

Next, note that

\begin{align}
\mathbb{E}\left[\big\|[P_{\bm{\Omega}}(\mathscr{E})]_{\left\langle i\right\rangle}\big\|_{\infty}\right]
\leq\mathbb{E}\left[\big\|[P_{\bm{\Omega}}(\mathscr{E})]_{\left\langle i\right\rangle}\big\|_{F}\right]
=\mathbb{E}\left[\|\xi\cdot\bm{\Omega}\|_{F}\right]
=\sigma\|\bm{\Omega}\|_{F}\leq\sigma\sqrt{I^{\pi}}.
\tag{91}
\end{align}
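
For completeness, a short sketch of the last two steps in (91), under the assumption that $\bm{\Omega}$ is a binary observation mask and that the noise $\xi$ has i.i.d. zero-mean entries with standard deviation $\sigma$ on its support: the step $\mathbb{E}[\|\xi\cdot\bm{\Omega}\|_{F}]=\sigma\|\bm{\Omega}\|_{F}$ can be read as the upper bound given by Jensen's inequality, and the final step uses that $\bm{\Omega}$ is binary,

\begin{align*}
\mathbb{E}\left[\|\xi\cdot\bm{\Omega}\|_{F}\right]\leq\sqrt{\mathbb{E}\left[\|\xi\cdot\bm{\Omega}\|_{F}^{2}\right]}=\sigma\|\bm{\Omega}\|_{F},
\qquad
\|\bm{\Omega}\|_{F}=\sqrt{\|\bm{\Omega}\|_{1}}\leq\sqrt{I^{\pi}}.
\end{align*}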

Combining (90) and (91), we then have

\begin{align*}
\mathbb{E}\left[\big\|\mathscr{X}^{*}-\tilde{\mathscr{X}}\big\|_{F}\right]\leq\frac{\sigma\kappa_{0}c_{v}\sqrt{I^{\pi}}}{a_{v}}\sum\nolimits_{i=1}^{M}\sqrt{k_{i}}.
\end{align*}

 

B.11 Corollary 18

Proof  When $\left\|\bm{\Omega}\right\|_{1}$ is sufficiently large, (37) reduces to

\begin{align}
4\sqrt{\log I^{\pi}/\left\|\bm{\Omega}\right\|_{1}}\leq\lambda\leq\frac{1}{4R}.
\tag{92}
\end{align}

Let $\lambda=b_{3}\sqrt{\frac{\log I^{\pi}}{\left\|\bm{\Omega}\right\|_{1}}}$, where $b_{3}\in\big[4,\frac{1}{4R\sqrt{\log I^{\pi}/\left\|\bm{\Omega}\right\|_{1}}}\big]$. It is easy to check that (92) holds. Then, from Theorem 16, we have

\begin{align*}
\big\|\mathscr{X}^{*}-\tilde{\mathscr{X}}\big\|_{F}\leq b_{3}\sqrt{\frac{\log I^{\pi}}{\left\|\bm{\Omega}\right\|_{1}}}\cdot\frac{\kappa_{0}c_{v}}{a_{v}}\sum\nolimits_{i=1}^{M}\sqrt{k_{i}}.
\end{align*}

 

B.12 Proposition 19

Proof  First, from Proposition 1 in (Yao and Kwok, 2018), we know that the function $\kappa_{\ell}(|a|)-\kappa_{0}\cdot|a|$ is smooth. Since $\tilde{\ell}$ is also smooth, $\tilde{\kappa}_{\ell}$ is differentiable. Finally, note that $\lim_{\delta\rightarrow 0}\tilde{\ell}(a;\delta)=|a|$. Then, we have

\begin{align*}
\lim\nolimits_{\delta\rightarrow 0}\tilde{\kappa}_{\ell}(|a|;\delta)
&=\lim\nolimits_{\delta\rightarrow 0}\big[\kappa_{0}\cdot\tilde{\ell}(|a|;\delta)+\left(\kappa_{\ell}(|a|)-\kappa_{0}\cdot|a|\right)\big]\\
&=\kappa_{0}|a|+\left(\kappa_{\ell}(|a|)-\kappa_{0}|a|\right)=\kappa_{\ell}(|a|).
\end{align*}

Thus, the Proposition holds.  
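
As a concrete illustration of this limit (a sketch under an assumed choice of smoothing, consistent with the subgradient $\frac{a}{\delta}\kappa_{0}$ used in the proof of Theorem 20 below, but not necessarily the paper's exact $\tilde{\ell}$): take the Huber-type smoothing $\tilde{\ell}(a;\delta)=\frac{a^{2}}{2\delta}$ for $|a|\leq\delta$ and $|a|-\frac{\delta}{2}$ otherwise. Then $\sup_{a}\big|\tilde{\ell}(a;\delta)-|a|\big|=\frac{\delta}{2}$, and consequently

\begin{align*}
\big|\tilde{\kappa}_{\ell}(|a|;\delta)-\kappa_{\ell}(|a|)\big|
=\kappa_{0}\,\big|\tilde{\ell}(|a|;\delta)-|a|\big|
\leq\frac{\kappa_{0}\delta}{2}\rightarrow 0
\quad\text{as}\;\delta\rightarrow 0,
\end{align*}

i.e., for this choice of smoothing the convergence in the proposition is in fact uniform in $a$.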

B.13 Theorem 20

Proof  First, by the definition of $\tilde{\kappa}_{\ell}$ in (41), when $|a|\leq\delta$, we have

\begin{align*}
\lim\limits_{\delta\rightarrow 0}\partial\tilde{\kappa}_{\ell}(a;\delta)=\frac{a}{\delta}\kappa_{0}\in
\begin{cases}
[0,\kappa_{0}) & \text{if}\;a\geq 0\\
(-\kappa_{0},0) & \text{otherwise}
\end{cases}.
\end{align*}

Thus,

\begin{align}
\lim\nolimits_{\delta\rightarrow 0}\partial\tilde{\kappa}_{\ell}(a;\delta)=\partial\kappa_{\ell}(|a|).
\tag{93}
\end{align}

Define $\tilde{F}_{\tau}(\mathscr{X};\delta)=\sum\nolimits_{(i_{1}\dots i_{M})\in\bm{\Omega}}\tilde{\ell}\left(\mathscr{X}_{i_{1}\dots i_{M}}-\mathscr{O}_{i_{1}\dots i_{M}};\delta\right)+\sum\nolimits_{i=1}^{D}\lambda_{i}\phi(\mathscr{X}_{\left\langle i\right\rangle})$. Since $\mathscr{X}_{s}$ is obtained by solving (42) at step 4 of Algorithm 3, we have $\mathscr{X}_{s}\in\partial\tilde{F}_{\tau}(\mathscr{X};(\delta_{0})^{s})$. Taking $s\rightarrow\infty$ and using (93), we have $\lim_{s\rightarrow\infty}\mathscr{X}_{s}\in\lim_{s\rightarrow\infty}\partial\tilde{F}_{\tau}(\mathscr{X};(\delta_{0})^{s})=\lim_{\delta\rightarrow 0}\partial\tilde{F}_{\tau}(\mathscr{X};\delta)=\partial\tilde{F}_{\tau}(\mathscr{X})$. Thus, Theorem 20 holds.  

References

  • Acar et al. (2011) E. Acar, D.M. Dunlavy, T. Kolda, and M. Mørup. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, 2011.
  • Agarwal et al. (2010) A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
  • Attouch et al. (2013) H. Attouch, J. Bolte, and B. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.
  • Bader and Kolda (2007) B. Bader and T. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, 2007.
  • Bahadori et al. (2014) M. Bahadori, Q. Yu, and Y. Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems, pages 3491–3499, 2014.
  • Balazevic et al. (2019) I. Balazevic, C. Allen, and T. Hospedales. TuckER: Tensor factorization for knowledge graph completion. In Conference on Empirical Methods in Natural Language Processing, pages 5188–5197, 2019.
  • Bauschke et al. (2008) H. Bauschke, R. Goebel, Y. Lucet, and X. Wang. The proximal average: Basic theory. SIAM Journal on Optimization, 19(2):766–785, 2008.
  • Belkin et al. (2006) M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
  • Bengua et al. (2017) J. Bengua, H. Phien, H. Tuan, and M. Do. Efficient tensor completion for color image and video recovery: Low-rank tensor train. IEEE Transactions on Image Processing, 26(5):2466–2479, 2017.
  • Bollacker et al. (2008) K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In ACM SIGMOD International Conference on Management of Data, pages 1247–1250, 2008.
  • Bolte et al. (2010) J. Bolte, A. Daniilidis, O. Ley, and L. Mazet. Characterizations of Łojasiewicz inequalities and applications. Transactions of the American Mathematical Society, 362(6):3319–3363, 2010.
  • Bolte et al. (2014) J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
  • Bordes et al. (2013) A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
  • Boumal and Absil (2015) N. Boumal and P.-A. Absil. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200–239, 2015.
  • Boyd and Vandenberghe (2009) S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2009.
  • Boyd et al. (2011) S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  • Cai et al. (2010) J.-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
  • Candès and Recht (2009) E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
  • Candès and Tao (2005) E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
  • Candès et al. (2008) E. J. Candès, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted $\ell_{1}$ minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.
  • Candès et al. (2011) E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
  • Chen et al. (2019) X. Chen, S. Liu, K. Xu, X. Li, X. Lin, M. Hong, and D. Cox. ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization. In Advances in Neural Information Processing Systems, pages 7204–7215, 2019.
  • Chen et al. (2020) X. Chen, J. Yang, and L. Sun. A nonconvex low-rank tensor completion model for spatiotemporal traffic data imputation. Transportation Research Part C: Emerging Technologies, 117:102673, 2020.
  • Cheng et al. (2016) H. Cheng, Y. Yu, X. Zhang, E. Xing, and D. Schuurmans. Scalable and sound low-rank tensor learning. In International Conference on Artificial Intelligence and Statistics, pages 1114–1123, 2016.
  • Cichocki et al. (2015) A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
  • Davis et al. (2011) D. Davis, R. Lichtenwalter, and N. V. Chawla. Multi-relational link prediction in heterogeneous information networks. In International Conference on Advances in Social Networks Analysis and Mining, pages 281–288, 2011.
  • Dettmers et al. (2018) T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. Convolutional 2D knowledge graph embeddings. In AAAI Conference on Artificial Intelligence, 2018.
  • Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Efron et al. (2004) B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
  • Fan and Li (2001) J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • Gandy et al. (2011) S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.
  • Ghadimi and Lan (2016) S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
  • Gu et al. (2014) Q. Gu, H. Gui, and J. Han. Robust tensor decomposition with gross corruption. In Advances in Neural Information Processing Systems, pages 1422–1430, 2014.
  • Gu et al. (2017) S. Gu, Q. Xie, D. Meng, W. Zuo, X. Feng, and L. Zhang. Weighted nuclear norm minimization and its applications to low level vision. International Journal of Computer Vision, 121(2):183–208, 2017.
  • Gui et al. (2016) H. Gui, J. Han, and Q. Gu. Towards faster rates and oracle property for low-rank matrix estimation. In International Conference on Machine Learning, pages 2300–2309, 2016.
  • Guo et al. (2017) X. Guo, Q. Yao, and J. Kwok. Efficient sparse low-rank tensor completion using the Frank-Wolfe algorithm. In AAAI Conference on Artificial Intelligence, pages 1948–1954, 2017.
  • Han et al. (2018) X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun, and J. Li. OpenKE: An open toolkit for knowledge embedding. In Conference on Empirical Methods in Natural Language Processing, pages 139–144, 2018.
  • Hare and Sagastizábal (2009) W. Hare and C. Sagastizábal. Computing proximal points of nonconvex functions. Mathematical Programming, 116(1-2):221–258, 2009.
  • Hartman (1959) P. Hartman. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3):707–713, 1959.
  • He et al. (2019) W. He, Q. Yao, C. C. Li, N. Yokoya, and Q. Zhao. Non-local meets global: An integrated paradigm for hyperspectral denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6861–6870, 2019.
  • Hillar and Lim (2013) C. J. Hillar and L.-H. Lim. Most tensor problems are NP-hard. Journal of the ACM, 60(6), 2013.
  • Hong et al. (2020) D. Hong, T. Kolda, and J. A. Duersch. Generalized canonical polyadic tensor decomposition. SIAM Review, 62(1):133–163, 2020.
  • Hong et al. (2016) M. Hong, Z.-Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
  • Hsieh et al. (2015) C.-J. Hsieh, N. Natarajan, and I. Dhillon. PU learning for matrix completion. In International Conference on Machine Learning, pages 2445–2453, 2015.
  • Hu et al. (2013) Y. Hu, D. Zhang, J. Ye, X. Li, and X. He. Fast and accurate matrix completion via truncated nuclear norm regularization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2117–2130, 2013.
  • Huang et al. (2020) H. Huang, Y. Liu, J. Liu, and C. Zhu. Provable tensor ring completion. Signal Processing, 171:107486, 2020.
  • Huber (1964) P. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, pages 73–101, 1964.
  • Indyk et al. (2019) P. Indyk, A. Vakilian, and Y. Yuan. Learning-based low-rank approximations. In Advances in Neural Information Processing Systems, pages 7402–7412, 2019.
  • Janzamin et al. (2020) M. Janzamin, R. Ge, J. Kossaifi, and A. Anandkumar. Spectral learning on matrices and tensors. Foundations and Trends in Machine Learning, 2020.
  • Jiang et al. (2015) W. Jiang, F. Nie, and H. Huang. Robust dictionary learning with capped $\ell_{1}$-norm. In International Joint Conference on Artificial Intelligence, 2015.
  • Kasai and Mishra (2016) H. Kasai and B. Mishra. Low-rank tensor completion: A Riemannian manifold preconditioning approach. In International Conference on Machine Learning, pages 1012–1021, 2016.
  • Kingma and Ba (2014) D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
  • Kolda and Bader (2009) T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
  • Kressner et al. (2014) D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54(2):447–468, 2014.
  • Lacroix et al. (2018) T. Lacroix, N. Usunier, and G. Obozinski. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, 2018.
  • Le Thi and Tao (2005) H. A. Le Thi and P. D. Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1-4):23–46, 2005.
  • Lei et al. (2009) T. Lei, X. Wang, and H. Liu. Uncoverning groups via heterogeneous interaction analysis. In IEEE International Conference on Data Mining, pages 503–512, 2009.
  • Lewis and Sendov (2005) A. S. Lewis and H. S. Sendov. Nonsmooth analysis of singular values. Set-Valued Analysis, 13(3):243–264, 2005.
  • Li and Lin (2015) H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pages 379–387, 2015.
  • Li et al. (2017) Q. Li, Y. Zhou, Y. Liang, and P. Varshney. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In International Conference on Machine Learning, pages 2111–2119, 2017.
  • Liu et al. (2013) J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.
  • Loh and Wainwright (2015) P. Loh and M. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616, 2015.
  • Lu et al. (2013) C. Lu, J. Shi, and J. Jia. Online robust dictionary learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 415–422, 2013.
  • Lu et al. (2016a) C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan. Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5249–5257, 2016a.
  • Lu et al. (2016b) C. Lu, J. Tang, S. Yan, and Z. Lin. Nonconvex nonsmooth low rank minimization via iteratively reweighted nuclear norm. IEEE Transactions on Image Processing, 25(2):829–839, 2016b.
  • Mazumder et al. (2010) R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
  • Mazumder et al. (2020) R. Mazumder, D. F. Saldana, and H. Weng. Matrix completion with nonconvex regularization: Spectral operators and scalable algorithms. Statistics and Computing, 30:1113–1138, 2020.
  • Miller (1995) G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
  • Mu et al. (2014) C. Mu, B. Huang, J. Wright, and D. Goldfarb. Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73–81, 2014.
  • Narita et al. (2012) A. Narita, K. Hayashi, R. Tomioka, and H. Kashima. Tensor factorization using auxiliary information. Data Mining and Knowledge Discovery, 25(2):298–324, 2012.
  • Negahban and Wainwright (2012) S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(May):1665–1697, 2012.
  • Negahban et al. (2012) S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
  • Nesterov (2013) Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
  • Nickel et al. (2015) M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2015.
  • Nimishakavi et al. (2018) M. Nimishakavi, P. Jawanpuria, and B. Mishra. A dual framework for low-rank tensor completion. In Advances in Neural Information Processing Systems, 2018.
  • Oseledets (2011) I. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
  • Papalexakis et al. (2017) E. Papalexakis, C. Faloutsos, and N. Sidiropoulos. Tensors for data mining and data fusion: Models, applications, and scalable algorithms. ACM Transactions on Intelligent Systems and Technology, 8(2):16, 2017.
  • Parikh and Boyd (2013) N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
  • Phipps and Kolda (2019) E. Phipps and T. Kolda. Software for sparse tensor decomposition on emerging computing architectures. SIAM Journal on Scientific Computing, 41:C269–C290, 2019.
  • Rauhut et al. (2017) H. Rauhut, R. Schneider, and Ž. Stojanac. Low rank tensor recovery via iterative hard thresholding. Linear Algebra and its Applications, 523:220–262, 2017.
  • Rendle and Schmidt-Thieme (2010) S. Rendle and L. Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In ACM International Conference on Web Search and Data Mining, pages 81–90, 2010.
  • Shen et al. (2017) L. Shen, W. Liu, J. Huang, Y.-G. Jiang, and S. Ma. Adaptive proximal average approximation for composite convex minimization. In AAAI Conference on Artificial Intelligence, 2017.
  • Signoretto et al. (2011) M. Signoretto, R. Van de Plas, B. De Moor, and J. Suykens. Tensor versus matrix completion: A comparison with application to spectral data. Signal Processing Letter, 18(7):403–406, 2011.
  • Song et al. (2017) Q. Song, H. Ge, J. Caverlee, and X. Hu. Tensor completion algorithms in big data analytics. ACM Transactions on Knowledge Discovery from Data, 2017.
  • Srebro et al. (2005) N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2005.
  • Tomioka and Suzuki (2013) R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. In Advances in Neural Information Processing Systems, pages 1331–1339, 2013.
  • Tomioka et al. (2010) R. Tomioka, K. Hayashi, and H. Kashima. Estimation of low-rank tensors via convex optimization. Technical report, arXiv preprint, 2010.
  • Tomioka et al. (2011) R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems, pages 972–980, 2011.
  • Toutanova et al. (2015) K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon. Representing text for joint embedding of text and knowledge bases. In Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, 2015.
  • Tu et al. (2016) S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank solutions of linear matrix equations via procrustes flow. In International Conference on Machine Learning, 2016.
  • Tucker (1966) L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
  • Vandereycken (2013) B. Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
  • Wang et al. (2017) L. Wang, X. Zhang, and Q. Gu. A unified computational and statistical framework for nonconvex low-rank matrix estimation. In International Conference on Artificial Intelligence and Statistics, 2017.
  • Wang et al. (2019) Y. Wang, W. Yin, and J. Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63, 2019.
  • Wang et al. (2020) Z. Wang, Y. Zhou, Y. Liang, and G. Lan. Cubic regularization with momentum for nonconvex optimization. In Uncertainty in Artificial Intelligence, pages 313–322, 2020.
  • Wimalawarne et al. (2018) K. Wimalawarne, M. Yamada, and H. Mamitsuka. Convex coupled matrix and tensor completion. Neural Computation, 30:3095–3127, 2018.
  • Xu et al. (2013) Y. Xu, R. Hao, W. Yin, and Z. Su. Parallel matrix factorization for low-rank tensor completion. Inverse Problems & Imaging, 9(2):601–624, 2013.
  • Xue et al. (2018) S. Xue, W. Qiu, F. Liu, and X. Jin. Low-rank tensor completion by truncated nuclear norm regularization. International Conference on Pattern Recognition, pages 2600–2605, 2018.
  • Yao and Kwok (2018) Q. Yao and J. Kwok. Efficient learning with a family of nonconvex regularizers by redistributing nonconvexity. Journal of Machine Learning Research, 18(1):6574–6625, 2018.
  • Yao et al. (2017) Q. Yao, J. Kwok, F. Gao, W. Chen, and T.-Y. Liu. Efficient inexact proximal gradient algorithm for nonconvex problems. In International Joint Conference on Artificial Intelligence, pages 3308–3314, 2017.
  • Yao et al. (2019a) Q. Yao, J. Kwok, and B. Han. Efficient nonconvex regularized tensor completion with structure-aware proximal iterations. In International Conference on Machine Learning, pages 7035–7044, 2019a.
  • Yao et al. (2019b) Q. Yao, J. Kwok, T. Wang, and T.-Y. Liu. Large-scale low-rank matrix learning with nonconvex regularizers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019b.
  • Yu (2013) Y.-L. Yu. Better approximation and faster algorithm using the proximal average. In Advances in Neural Information Processing Systems, pages 458–466, 2013.
  • Yu et al. (2015) Y.-L. Yu, X. Zheng, M. Marchetti-Bowick, and E. Xing. Minimizing nonconvex non-separable functions. In International Conference on Artificial Intelligence and Statistics, pages 1107–1115, 2015.
  • Yuan et al. (2019) L. Yuan, C. Li, D. Mandic, J. Cao, and Q. Zhao. Tensor ring decomposition with rank minimization on latent space: An efficient approach for tensor completion. AAAI Conference on Artificial Intelligence, 2019.
  • Yuan and Zhang (2016) M. Yuan and C.-H. Zhang. On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031–1068, 2016.
  • Zhang (2010a) C. H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010a.
  • Zhang (2010b) T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010b.
  • Zhang and Aeron (2017) Z. Zhang and S. Aeron. Exact tensor completion using t-SVD. IEEE Transactions on Signal Processing, 65(6):1511–1526, 2017.
  • Zhao et al. (2016) Q. Zhao, G. Zhou, S. Xie, L. Zhang, and A. Cichocki. Tensor ring decomposition. Technical report, arXiv, 2016.
  • Zheng and Lafferty (2015) Q. Zheng and J. Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Advances in Neural Information Processing Systems, 2015.
  • Zheng et al. (2013) X. Zheng, H. Ding, H. Mamitsuka, and S. Zhu. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
  • Zhong and Kwok (2014) W. Zhong and J. Kwok. Gradient descent with proximal average for nonconvex and composite regularization. In AAAI Conference on Artificial Intelligence, pages 2206–2212, 2014.
  • Zhu et al. (2018) Z. Zhu, Q. Li, G. Tang, and M. Wakin. Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.