
ROBUST AND OPTIMAL TENSOR ESTIMATION VIA ROBUST GRADIENT DESCENT

Xiaoyu Zhang, Di Wang, Guodong Li, and Defeng Sun
School of Mathematical Sciences, Tongji University, [email protected]
School of Mathematical Sciences, Shanghai Jiao Tong University, [email protected]
Department of Statistics and Actuarial Science, University of Hong Kong, [email protected]
Department of Applied Mathematics, Hong Kong Polytechnic University, [email protected]
Abstract

Low-rank tensor models are widely used in statistics and machine learning. However, most existing methods rely heavily on the assumption that data follows a sub-Gaussian distribution. To address the challenges associated with heavy-tailed distributions encountered in real-world applications, we propose a novel robust estimation procedure based on truncated gradient descent for general low-rank tensor models. We establish the computational convergence of the proposed method and derive optimal statistical rates under heavy-tailed distributional settings of both covariates and noise for various low-rank models. Notably, the statistical error rates are governed by a local moment condition, which captures the distributional properties of tensor variables projected onto certain low-dimensional local regions. Furthermore, we present numerical results to demonstrate the effectiveness of our method.

MSC subject classifications: 62F35, 62H12, 62J12, 62H25.
Keywords: gradient descent, heavy-tailed distribution, nonconvex optimization, robustness, tensor decomposition.

1 Introduction

1.1 Low-rank tensor modeling

Tensor models in statistics and machine learning have gained significant attention in recent years for analyzing complex multidimensional data. Applications of tensors can be found in various fields, including biomedical imaging analysis (Zhou, Li and Zhu, 2013; Li et al., 2018; Wu et al., 2022), recommender systems (Bi, Qu and Shen, 2018; Tarzanagh and Michailidis, 2022), and time series analysis (Chen, Yang and Zhang, 2022; Wang et al., 2022), where the data are naturally represented as third or higher-order tensors. Tensor decompositions and low-rank structures are prevalent in these models, as they facilitate dimension reduction, provide interpretable representations, and enhance computational efficiency (Kolda and Bader, 2009; Bi et al., 2021).

In this paper, we consider a general framework of tensor learning, where the loss function is denoted by ¯(𝓐;z)\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z), with ¯\overline{\mathcal{L}} being a differentiable loss function, 𝓐\mathscr{A} a dd-th order parameter tensor, and zz a random sample drawn from the population. This framework encompasses a wide range of models in statistics and machine learning, including tensor linear regression, tensor generalized linear regression, and tensor PCA, among others. For dimension reduction, the parameter tensor 𝓐\mathscr{A} is assumed to have a Tucker decomposition (Tucker, 1966)

𝓐=𝓢×1𝐔1×2𝐔2×d𝐔d=𝓢×j=1d𝐔j,\mbox{\boldmath$\mathscr{A}$}=\mbox{\boldmath$\mathscr{S}$}\times_{1}\mathbf{U}_{1}\times_{2}\mathbf{U}_{2}\cdots\times_{d}\mathbf{U}_{d}=\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}, (1.1)

which is one of the most commonly used low-rank structures in statistical tensor learning (see related definitions and notations in Section 2).

Substantial progress has been made recently in developing estimation methods and efficient algorithms for various low-rank tensor models. However, low-rank tensor problems are often highly nonconvex, making them challenging to solve directly. A common strategy to address this issue is convex relaxation by replacing the low-rank constraint with a tensor nuclear norm and solving the problem in the full tensor space, i.e. the space of 𝓐\mathscr{A} (Tomioka and Suzuki, 2013; Yuan and Zhang, 2016; Raskutti, Yuan and Chen, 2019). While these convex methods come with provable statistical guarantees, they suffer from heavy computational burdens and prohibitive storage requirements. As a result, they are often infeasible for high-dimensional tensor estimation in many practical applications.

Another estimation approach is nonconvex optimization, which operates directly on the parsimonious decomposition form in (1.1). This approach leads to scalable algorithms with more affordable computational complexity. Among the various developments in nonconvex optimization techniques, gradient descent and its variants have been extensively studied and can be applied to a wide range of low-rank tensor models. Recent advances have provided both computational and statistical guarantees for gradient descent algorithms, as well as corresponding initialization methods (Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022; Tong et al., 2022; Dong et al., 2023). In terms of statistical performance, minimax optimal error rates have been established for commonly used low-rank tensor models under Gaussian or sub-Gaussian distributional assumptions, which are crucial technical conditions for managing high-dimensional data (Raskutti, Yuan and Chen, 2019; Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022).

However, in many real applications, tensor data are often contaminated by outliers and heavy-tailed noise, violating the stringent Gaussian or sub-Gaussian assumptions. Empirical studies have shown that biomedical imaging data often exhibit non-Gaussian and heavy-tailed distributions (Beggs and Plenz, 2003; Friedman et al., 2012; Roberts, Boonstra and Breakspear, 2015). For example, in Section 6, we analyze chest computed tomography (CT) images of patients with and without COVID-19, and plot the kurtosis of image pixels in Figure 1. Among 22,500 pixels analyzed, 1,169 pixels in the COVID-19 samples and 351 in the non-COVID-19 samples exhibit heavy-tailed distributions, with sample kurtosis greater than eight. These findings highlight the prevalence of heavy-tailed behavior in real-world biomedical datasets. As recent studies (Wang, Zhang and Mai, 2023; Wei et al., 2023) have emphasized, conventional methods fail to produce reliable estimates when the data follow such heavy-tailed distributions. Therefore, it is crucial to develop estimation methods that are not only computationally efficient and statistically optimal but also robust to the heavy-tailed distributions encountered in real-world tensor data.

Figure 1: Histograms of kurtosis for COVID and non-COVID CT image pixels

The growing interest in robust estimation methods for high-dimensional low-rank matrix and tensor models underscores the pressing need for solutions that can handle heavy-tailed data. In terms of methodology to achieve robustness, the existing works can be broadly classified into two approaches: loss robustification and data robustification. The celebrated Huber regression method (Huber, 1964; Fan, Li and Wang, 2017; Sun, Zhou and Fan, 2020) exemplifies the first approach, where the standard least squares loss is replaced with a more robust variant. For instance, Tan, Sun and Witten (2023) applied adaptive Huber regression with regularization to sparse reduced-rank regression in the presence of heavy-tailed noise, and developed an ADMM algorithm for convex optimization. Shen et al. (2023) employed the least absolute deviation (LAD) and Huber loss functions for low-rank matrix and tensor trace regression, and proposed a Riemannian subgradient algorithm in the nonconvex optimization framework. While these loss-robustification methods provide robust control over residuals, they focus solely on the residuals’ deviations and do not address the heavy-tailedness of the covariates. Moreover, robust loss functions like LAD and Huber loss cannot be easily generalized to more complex tensor models beyond linear trace regression.

Alternatively, Fan, Wang and Zhu (2021) proposed a robust low-rank matrix estimation procedure via data robustification. This method applies appropriate shrinkage to the data, constructs robust moment estimators from the shrunk data, and ultimately derives a robust estimate for the low-rank parameter matrix. Building on this idea, subsequent works by Wang and Tsay (2023) and Lu et al. (2024) extended the data robustification framework to time series models and spatio-temporal models, respectively. The primary objective of data robustification is to mitigate the influence of samples with large deviations, thereby producing a robust estimate. However, when applied to low-rank matrix and tensor models, the data robustification procedure has limitations. Specifically, it overlooks the inherent structure of the model and fails to exploit the low-rank decomposition. As shown in Section 4, not all information in the data contributes effectively to estimating the tensor decomposition. Consequently, the data robustification approach may be suboptimal for low-rank tensor estimation.

For general low-rank tensor models, we introduce a new robust estimation procedure via gradient robustification. Specifically, we replace the partial gradients in the gradient descent algorithm with their robust alternatives by truncating gradients that exhibit large deviations. This modification improves the accuracy of gradient estimation by mitigating the influence of outliers and noise. By robustifying the partial gradients with respect to the components in the tensor decomposition, this method effectively leverages the low-rank structure, making it applicable to a variety of low-rank models, similar to other gradient-based methods.

For various low-rank tensor models, we demonstrate that instead of the original data, low-dimensional multilinear transformations of the data are employed in the partial gradients. This allows us to work with more compact and informative representations of the data, leading to enhanced computational and statistical properties. As a result, the statistical error rates of this gradient-based method are shown to be linked to a new concept called local moment, which characterizes the distributional properties of the tensor data projected onto certain low-dimensional subspaces along each tensor direction. This leads to potentially much sharper statistical rates compared to traditional methods. Furthermore, since covariates and noise contribute to the partial gradients, the robustification of gradients is particularly effective in handling the heavy-tailed distributions of both covariates and noise.

However, it is important to note that the robust gradient alternatives may not correspond to the partial gradients of any standard robust loss function, which poses challenges in analyzing their convergence. To address this, we develop an algorithm-based convergence analysis for the proposed robust gradient descent method. Additionally, we establish minimax-optimal statistical error rates under local moment conditions for several commonly used low-rank tensor models.

The main contributions of this paper are three-fold:

  • (1)

    We introduce a novel and general estimation procedure based on robust gradient descent. This method is computationally scalable and applicable to a wide range of tensor problems, including both supervised tasks (e.g., tensor regression and classification) and unsupervised tasks (e.g., tensor PCA).

  • (2)

    The robust methodology is shown to effectively handle the heavy-tailed distributions and is proven to achieve optimal statistical error rates under relaxed distributional assumptions on both covariates and noise. Specifically, for 0<ϵ10<\epsilon\leq 1, we only require finite second moments for the covariates and (1+ϵ)(1+\epsilon)-th moments for noise in tensor linear regression, finite second moments for the covariates in tensor logistic regression, and finite (1+ϵ)(1+\epsilon)-th moments for noise in tensor PCA.

  • (3)

    For heavy-tailed low-rank tensor models, we introduce the concept of local moment, a technical innovation that enables a more precise characterization of the effects of heavy-tailed distributions. This results in sharper statistical error rates compared to those derived from global moment conditions.

1.2 Related literature

This paper is related to a large body of literature on nonconvex methods for low-rank matrix and tensor estimation. The gradient descent algorithm and its variants have been extensively studied for low-rank matrix models (Netrapalli et al., 2014; Chen and Wainwright, 2015; Tu et al., 2016; Wang, Zhang and Gu, 2017; Ma et al., 2018) and low-rank tensor models (Xu, Zhang and Gu, 2017; Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022; Tong, Ma and Chi, 2022; Tong et al., 2022). For simplicity, we focus on the robust alternatives to the standard gradient descent, although the proposed technique can be extended to other gradient-based methods. Robust gradient methods have also been explored for low-dimensional statistical models in convex optimization (Prasad et al., 2020). For low-rank matrix recovery, the median truncation gradient descent has been proposed to handle arbitrary outliers in the data (Li et al., 2020), which focuses on the linear compressed sensing problem without random noise. Our paper differs from the existing work as we consider the general low-rank tensor estimation framework under the heavy-tailed distribution setting.

Robust estimation against heavy-tailed distributions is another emerging area of research in high-dimensional statistics. Various robust MM-estimators have been proposed for mean estimation (Catoni, 2012; Bubeck, Cesa-Bianchi and Lugosi, 2013; Devroye et al., 2016) and high-dimensional linear regression (Fan, Li and Wang, 2017; Loh, 2017; Sun, Zhou and Fan, 2020; Wang et al., 2020). More recently, robust methods for low-rank matrix and tensor estimation have been developed in Fan, Wang and Zhu (2021), Tan, Sun and Witten (2023), Wang and Tsay (2023) and Shen et al. (2023). Compared to these existing methods, our proposed approach can achieve the same or even better convergence rates under more relaxed local distribution assumptions on both covariates and noise.

1.3 Organization of the paper

The remainder of this paper is organized as follows. In Section 2, we provide a review of relevant definitions and notations for low-rank tensor estimation. In Section 3, we introduce the robust gradient descent method and discuss its convergence properties. In Section 4, we apply the proposed methodology to three popular tensor models in heavy-tailed distribution settings. Simulation experiments and real data examples are presented in Sections 5 and 6 to validate the performance of the proposed methods. Finally, Section 7 concludes with a discussion of the findings and future directions. The proofs of the main results are relegated to Appendices A and B.

2 Tensor Algebra and Notation

Tensors, or multi-dimensional arrays, are higher-order extensions of matrices. A multi-dimensional array 𝓧p1××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}} is called a dd-th order tensor. Throughout this paper, we denote vectors by boldface small letters (e.g. 𝐱\mathbf{x}, 𝐲\mathbf{y}), matrices by boldface capital letters (e.g. 𝐗\mathbf{X}, 𝐘\mathbf{Y}), and tensors by boldface Euler capital letters (e.g. 𝓧\mathscr{X}, 𝓨\mathscr{Y}), respectively.

Tensor matricization is the process of reordering the elements of a tensor into a matrix. For any dd-th-order tensor 𝓧p1××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}}, its mode-kk matricization is denoted as 𝓧(k)pk×pk\mbox{\boldmath$\mathscr{X}$}_{(k)}\in\mathbb{R}^{p_{k}\times p_{-k}}, with pk==1,kdpp_{-k}=\prod_{\ell=1,\ell\neq k}^{d}p_{\ell}, whose (ik,j)(i_{k},j)-th element is mapped to the (i1,,id)(i_{1},\dots,i_{d})-th element of 𝓧\mathscr{X}, where j=1+s=1,skd(is1)Js(k)j=1+\sum_{s=1,s\neq k}^{d}(i_{s}-1)J_{s}^{(k)} with Js(k)==1,ks1pJ_{s}^{(k)}=\prod_{\ell=1,\ell\neq k}^{s-1}p_{\ell} and p0=1p_{0}=1.
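To make the index mapping concrete, below is a minimal Python/NumPy sketch (ours, not from the paper) of the mode-k matricization; the function name unfold and the zero-based mode index are our own choices. Since the column index above lets earlier modes vary fastest, the operation corresponds to a Fortran-order reshape after moving mode k to the front.

import numpy as np

def unfold(X, k):
    # Mode-k matricization X_(k): move mode k to the front, then reshape in
    # Fortran order so that earlier remaining modes vary fastest along the columns.
    return np.reshape(np.moveaxis(X, k, 0), (X.shape[k], -1), order="F")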

Next, we review three types of multiplications for tensors. For any tensor 𝓧p1××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}} and matrix 𝐘qk×pk\mathbf{Y}\in\mathbb{R}^{q_{k}\times p_{k}} with 1kd1\leq k\leq d, the mode-kk multiplication 𝓧×k𝐘\mbox{\boldmath$\mathscr{X}$}\times_{k}\mathbf{Y} produces a tensor in p1××pk1×qk×pk+1××pd\mathbb{R}^{p_{1}\times\cdots\times p_{k-1}\times q_{k}\times p_{k+1}\times\cdots\times p_{d}} defined by

(𝓧×k𝐘)i1ik1jik+1id=ik=1pk𝓧i1id𝐘jik.\left(\mbox{\boldmath$\mathscr{X}$}\times_{k}\mathbf{Y}\right)_{i_{1}\dots i_{k-1}ji_{k+1}\dots i_{d}}=\sum_{i_{k}=1}^{p_{k}}\mbox{\boldmath$\mathscr{X}$}_{i_{1}\dots i_{d}}\mathbf{Y}_{ji_{k}}. (2.1)

For any two tensors 𝓧p1×p2××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times p_{2}\times\cdots\times p_{d}} and 𝓨p1×p2××pd0\mbox{\boldmath$\mathscr{Y}$}\in\mathbb{R}^{p_{1}\times p_{2}\times\cdots\times p_{d_{0}}} with dd0d\geq d_{0}, their generalized inner product 𝓧,𝓨\langle\mbox{\boldmath$\mathscr{X}$},\mbox{\boldmath$\mathscr{Y}$}\rangle is the (dd0)(d-d_{0})-th-order tensor in pd0+1××pd\mathbb{R}^{p_{d_{0}+1}\times\dots\times p_{d}} defined by

𝓧,𝓨id0+1id=i1=1p1i2=1p2id0=1pd0𝓧i1i2id0id0+1id𝓨i1i2id0,\langle\mbox{\boldmath$\mathscr{X}$},\mbox{\boldmath$\mathscr{Y}$}\rangle_{i_{d_{0}+1}\dots i_{d}}=\sum_{i_{1}=1}^{p_{1}}\sum_{i_{2}=1}^{p_{2}}\dots\sum_{i_{d_{0}}=1}^{p_{d_{0}}}\mbox{\boldmath$\mathscr{X}$}_{i_{1}i_{2}\dots i_{d_{0}}i_{d_{0}+1}\dots i_{d}}\mbox{\boldmath$\mathscr{Y}$}_{i_{1}i_{2}\dots i_{d_{0}}}, (2.2)

for 1id0+1pd0+1,,1idpd1\leq i_{d_{0}+1}\leq p_{d_{0}+1},\dots,1\leq i_{d}\leq p_{d}. In particular, when d=d0d=d_{0}, it reduces to the conventional real-valued inner product. Additionally, the Frobenius norm of any tensor 𝓧\mathscr{X} is defined as 𝓧F=𝓧,𝓧\|\mbox{\boldmath$\mathscr{X}$}\|_{\text{F}}=\sqrt{\langle\mbox{\boldmath$\mathscr{X}$},\mbox{\boldmath$\mathscr{X}$}\rangle}. For any two tensors 𝓧p1××pd1\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d_{1}}} and 𝓨q1××qd2\mbox{\boldmath$\mathscr{Y}$}\in\mathbb{R}^{q_{1}\times\cdots\times q_{d_{2}}}, their outer product 𝓧𝓨\mbox{\boldmath$\mathscr{X}$}\circ\mbox{\boldmath$\mathscr{Y}$} is the p1××pd1×q1××qd2p_{1}\times\cdots\times p_{d_{1}}\times q_{1}\times\cdots\times q_{d_{2}} tensor defined by

(𝓧𝓨)i1id1j1jd2=𝓧i1id1𝓨j1jd2,(\mbox{\boldmath$\mathscr{X}$}\circ\mbox{\boldmath$\mathscr{Y}$})_{i_{1}\dots i_{d_{1}}j_{1}\dots j_{d_{2}}}=\mbox{\boldmath$\mathscr{X}$}_{i_{1}\dots i_{d_{1}}}\mbox{\boldmath$\mathscr{Y}$}_{j_{1}\dots j_{d_{2}}}, (2.3)

for 1ikpk1\leq i_{k}\leq p_{k}, 1jq1\leq j_{\ell}\leq q_{\ell}, 1kd11\leq k\leq d_{1}, and 1d21\leq\ell\leq d_{2}.
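The three products above can be written compactly with NumPy's tensordot; the helper names below (mode_prod, gen_inner, outer) are ours, and the sketch assumes the conventions of (2.1)-(2.3).

import numpy as np

def mode_prod(X, Y, k):
    # Mode-k multiplication X x_k Y as in (2.1), with Y of shape (q_k, p_k):
    # contract Y's second axis with mode k of X, then move the new axis back to position k.
    return np.moveaxis(np.tensordot(Y, X, axes=(1, k)), 0, k)

def gen_inner(X, Y):
    # Generalized inner product <X, Y> as in (2.2): sum over the first d0 modes of X,
    # where d0 is the order of Y; the result is a (d - d0)-th order tensor.
    d0 = Y.ndim
    return np.tensordot(X, Y, axes=(list(range(d0)), list(range(d0))))

def outer(X, Y):
    # Outer product X o Y as in (2.3).
    return np.multiply.outer(X, Y)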

For any tensor 𝓧p1××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}}, its Tucker ranks (r1,,rd)(r_{1},\dots,r_{d}) are defined as the matrix ranks of its matricizations, i.e. rj=rank(𝓧(j))r_{j}=\text{rank}(\mbox{\boldmath$\mathscr{X}$}_{(j)}), for j=1,,dj=1,\dots,d. Note that rjr_{j}’s are analogous to the row and column ranks of a matrix, but they are not necessarily equal for third- and higher-order tensors. If 𝓧\mathscr{X} has Tucker ranks (r1,,rd)(r_{1},\dots,r_{d}), then 𝓧\mathscr{X} has the following Tucker decomposition (Tucker, 1966; De Lathauwer, De Moor and Vandewalle, 2000)

𝓧=𝓨×1𝐘1×2𝐘2×d𝐘d=𝓨×j=1d𝐘j,\mbox{\boldmath$\mathscr{X}$}=\mbox{\boldmath$\mathscr{Y}$}\times_{1}\mathbf{Y}_{1}\times_{2}\mathbf{Y}_{2}\cdots\times_{d}\mathbf{Y}_{d}=\mbox{\boldmath$\mathscr{Y}$}\times_{j=1}^{d}\mathbf{Y}_{j}, (2.4)

where each 𝐘jpj×rj\mathbf{Y}_{j}\in\mathbb{R}^{p_{j}\times r_{j}} is the factor matrix for j=1,,dj=1,\dots,d, and 𝓨r1××rd\mbox{\boldmath$\mathscr{Y}$}\in\mathbb{R}^{r_{1}\times\cdots\times r_{d}} is the core tensor. If 𝓧\mathscr{X} has the Tucker decomposition in (2.4), then we have the following results for its matricizations:

𝓧(k)=𝐘k𝓨(k)(𝐘d𝐘k+1𝐘k1𝐘1)=𝐘k𝓨(k)(jk𝐘j),k=1,,d,\mbox{\boldmath$\mathscr{X}$}_{(k)}=\mathbf{Y}_{k}\mbox{\boldmath$\mathscr{Y}$}_{(k)}(\mathbf{Y}_{d}\otimes\cdots\otimes\mathbf{Y}_{k+1}\otimes\mathbf{Y}_{k-1}\otimes\cdots\otimes\mathbf{Y}_{1})^{\top}=\mathbf{Y}_{k}\mbox{\boldmath$\mathscr{Y}$}_{(k)}(\otimes_{j\neq k}\mathbf{Y}_{j})^{\top},~{}~{}k=1,\dots,d, (2.5)

where jk\otimes_{j\neq k} denotes the matrix Kronecker product taken in the reverse order of the indices.
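As a quick numerical check of (2.4)-(2.5), the following sketch (reusing unfold and mode_prod from the sketches above; the dimensions are arbitrary) builds a random Tucker tensor and verifies the matricization identity with the reverse-order Kronecker product.

import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((2, 3, 4))                                 # core tensor
Ys = [rng.standard_normal((p, r)) for p, r in [(5, 2), (6, 3), (7, 4)]]

X = S
for k, Y in enumerate(Ys):
    X = mode_prod(X, Y, k)                                          # X = S x_1 Y_1 x_2 Y_2 x_3 Y_3

for k in range(3):
    K = np.ones((1, 1))
    for j, Y in enumerate(Ys):
        if j != k:
            K = np.kron(Y, K)                                       # Y_d kron ... kron Y_1, skipping Y_k
    assert np.allclose(unfold(X, k), Ys[k] @ unfold(S, k) @ K.T)    # identity (2.5)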

Throughout this paper, we use CC to denote a generic positive constant. For any two real-valued sequences xkx_{k} and yky_{k}, we write xkykx_{k}\gtrsim y_{k} if there exists a constant C>0C>0 such that xkCykx_{k}\geq Cy_{k} for all kk. Additionally, we write xkykx_{k}\asymp y_{k} if xkykx_{k}\gtrsim y_{k} and ykxky_{k}\gtrsim x_{k}. For a generic matrix 𝐗\mathbf{X}, we let 𝐗\mathbf{X}^{\top}, 𝐗F\|\mathbf{X}\|_{\text{F}}, 𝐗\|\mathbf{X}\|, vec(𝐗)\text{vec}(\mathbf{X}) and σj(𝐗)\sigma_{j}(\mathbf{X}) denote its transpose, Frobenius norm, operator norm, vectorization, and the jj-th largest singular value, respectively. For any real symmetric matrix 𝐗\mathbf{X}, let λmin(𝐗)\lambda_{\min}(\mathbf{X}) and λmax(𝐗)\lambda_{\max}(\mathbf{X}) denote its minimum and maximum eigenvalues.

3 Methodology

3.1 Gradient descent with robust gradient estimates

We consider a general estimation framework with the loss function ¯(𝓐;zi)\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) for parameter tensor 𝓐\mathscr{A} and random observation ziz_{i}. Suppose the parameter tensor admits a Tucker low-rank decomposition 𝓐=𝓢×j=1d𝐔j\mbox{\boldmath$\mathscr{A}$}=\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}, defined in (2.4), where 𝓢r1×r2××rd\mbox{\boldmath$\mathscr{S}$}\in\mathbb{R}^{r_{1}\times r_{2}\times\dots\times r_{d}} is the core tensor and each 𝐔jpj×rj\mathbf{U}_{j}\in\mathbb{R}^{p_{j}\times r_{j}} is the factor matrix. Throughout the paper, we assume that the order dd is fixed and the ranks (r1,r2,,rd)(r_{1},r_{2},\cdots,r_{d}) are known. Given the tensor decomposition, we define the loss function with respect to the decomposition components as

(𝓢,𝐔1,,𝐔d;zi)=¯(𝓢×j=1d𝐔j;zi),and n(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n(𝓢,𝐔1,,𝐔d;zi).\begin{split}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})&=\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j};z_{i}),\\ \text{and }\mathcal{L}_{n}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})&=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\cdots,\mathbf{U}_{d};z_{i}).\end{split} (3.1)

A standard method to estimate the components in the tensor decomposition is to minimize the following regularized loss function

n(𝓢,𝐔1,,𝐔d;𝒟n)+a2j=1d𝐔j𝐔jb2𝐈rjF2,\mathcal{L}_{n}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})+\frac{a}{2}\sum_{j=1}^{d}\|\mathbf{U}_{j}^{\top}\mathbf{U}_{j}-b^{2}\mathbf{I}_{r_{j}}\|_{\text{F}}^{2}, (3.2)

where a,b>0a,b>0 are tuning parameters, and the additional regularization term 𝐔j𝐔jb2𝐈rjF2\|\mathbf{U}_{j}^{\top}\mathbf{U}_{j}-b^{2}\mathbf{I}_{r_{j}}\|_{\text{F}}^{2} prevents the factor matrix 𝐔j\mathbf{U}_{j} from being singular, while also ensuring that the scaling among all factor matrices is balanced (Han, Willett and Zhang, 2022). This regularized loss minimization problem can be efficiently solved using gradient descent. Han, Willett and Zhang (2022) demonstrated that the gradient descent approach is computationally efficient. Moreover, with a suitable choice of initial values, the estimation error is proportional to

sup𝓣F=1,rank(𝓣(k))rk,1kd|1ni=1n¯(𝓐;zi),𝓣|,\sup_{\|\scalebox{0.7}{\mbox{\boldmath$\mathscr{T}$}}\|_{\text{F}}=1,\text{rank}(\scalebox{0.7}{\mbox{\boldmath$\mathscr{T}$}}_{(k)})\leq r_{k},1\leq k\leq d}\left|\left\langle\frac{1}{n}\sum_{i=1}^{n}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i}),\mbox{\boldmath$\mathscr{T}$}\right\rangle\right|, (3.3)

where 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} is the ground truth of the parameter tensor satisfying 𝔼¯(𝓐;zi)=𝟎\mathbb{E}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})=\mathbf{0}, as 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} minimizes the risk function 𝔼¯(𝓐;zi)\mathbb{E}\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) if the expectation exists. The error essentially depends on the intrinsic dimension of the low-rank tensors and the distribution of the data ziz_{i}. When the distribution of ziz_{i} is heavy-tailed, controlling the convergence rate becomes challenging, which may lead to suboptimal estimation performance.

To motivate our robust estimation method, it is important to note that the partial gradients of the loss function have a form of sample means

𝐔kn(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n𝐔k(𝓢,𝐔1,,𝐔d;zi),\nabla_{\mathbf{U}_{k}}\mathcal{L}_{n}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})=\frac{1}{n}\sum_{i=1}^{n}\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i}), (3.4)

for k=1,,dk=1,\dots,d, and

𝓢n(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n𝓢(𝓢,𝐔1,,𝐔d;zi).\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}_{n}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})=\frac{1}{n}\sum_{i=1}^{n}\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i}). (3.5)

It is well known that the sample mean is sensitive to outliers, as it can be significantly influenced by even a small fraction of extreme values. When some of ziz_{i}’s are outliers, the gradient descent approach may fail to be robust, leading to sub-optimal estimates, particularly in the presence of heavy-tailed distributions.

Therefore, we use the robust gradient estimates as alternatives to the standard gradients in vanilla gradient descent. For any given 𝓢,𝐔1,,𝐔d\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}, we may construct some robust gradient functions, denoted by 𝓖0(𝓢,𝐔1,,𝐔d)\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}) and 𝐆k(𝓢,𝐔1,,𝐔d)\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}), for k=1,,dk=1,\dots,d, to replace the partial gradients in standard gradient descent. Given initial values (𝓢(0),𝐔1(0),,𝐔d(0))(\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\cdots,\mathbf{U}_{d}^{(0)}), a step size η>0\eta>0, and the number of iterations TT, we propose the following gradient descent algorithm, summarized in Algorithm 1, using the robust gradient functions.

Algorithm 1 Robust gradient descent for (𝓢,𝐔1,,𝐔d)(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})

initialize: 𝓢(0),𝐔1(0),,𝐔d(0)\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\dots,\mathbf{U}_{d}^{(0)}, a,b>0a,b>0, step size η>0\eta>0, and number of iterations TT

for t=0t=0 to T1T-1

for k=1k=1 to dd

𝐔k(t+1)𝐔k(t)η𝐆k(𝓢(t),𝐔1(t),,𝐔d(t))ηa𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)\mathbf{U}_{k}^{(t+1)}\leftarrow\mathbf{U}_{k}^{(t)}-\eta\cdot\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$}^{(t)},\mathbf{U}_{1}^{(t)},\dots,\mathbf{U}_{d}^{(t)})-\eta a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})

end for
         𝓢(t+1)𝓢(t)η𝓖0(𝓢(t),𝐔1(t),,𝐔d(t))\mbox{\boldmath$\mathscr{S}$}^{(t+1)}\leftarrow\mbox{\boldmath$\mathscr{S}$}^{(t)}-\eta\cdot\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$}^{(t)},\mathbf{U}_{1}^{(t)},\dots,\mathbf{U}_{d}^{(t)})

end for
return
𝓐(T)=𝓢(T)×1𝐔1(T)×d𝐔d(T)\mbox{\boldmath$\mathscr{A}$}^{(T)}=\mbox{\boldmath$\mathscr{S}$}^{(T)}\times_{1}\mathbf{U}_{1}^{(T)}\dots\times_{d}\mathbf{U}_{d}^{(T)}
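A minimal Python/NumPy sketch of Algorithm 1 is given below; the robust gradient functions G0 and Gks are passed in as callables (for instance, the truncated estimators of Section 3.3), and all function names are our own rather than part of the paper.

import numpy as np

def tucker_compose(S, Us):
    # A = S x_1 U_1 ... x_d U_d
    A = S
    for k, U in enumerate(Us):
        A = np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)
    return A

def robust_gd(S0, Us0, G0, Gks, eta, a, b, T):
    # S0: initial core tensor; Us0: list of initial factor matrices U_k;
    # G0(S, Us): robust gradient for the core; Gks[k](S, Us): robust gradient for U_k;
    # eta: step size; a, b: regularization parameters; T: number of iterations.
    S, Us = S0.copy(), [U.copy() for U in Us0]
    for _ in range(T):
        new_Us = []
        for k, U in enumerate(Us):
            reg = U @ (U.T @ U - b ** 2 * np.eye(U.shape[1]))       # balancing regularizer
            new_Us.append(U - eta * Gks[k](S, Us) - eta * a * reg)
        S = S - eta * G0(S, Us)       # core update uses the iterate-t factors
        Us = new_Us
    return tucker_compose(S, Us), S, Us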

3.2 Local convergence analysis

The robust gradient functions may not be the partial gradients of a robust loss function. When the gradients are replaced by their robust alternatives, Algorithm 1 is not designed to solve a traditional minimization problem. As a result, studying its convergence becomes challenging. To address this and establish both computational and statistical guarantees, we introduce some notations and conditions.

Definition 3.1 (Restricted correlated gradient).

The loss function ¯\overline{\mathcal{L}} satisfies the restricted correlated gradient (RCG) condition: for any 𝓐\mathscr{A} such that rank(𝓐(k))rk\textup{rank}(\mbox{\boldmath$\mathscr{A}$}_{(k)})\leq r_{k}, 1kd1\leq k\leq d,

𝔼¯(𝓐;zi),𝓐𝓐α2𝓐𝓐F2+12β𝔼¯(𝓐;z)F2,\langle\mathbb{E}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}),\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\rangle\geq\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+\frac{1}{2\beta}\|\mathbb{E}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z)\|_{\textup{F}}^{2}, (3.6)

where the RCG parameters α\alpha and β\beta satisfy 0<αβ0<\alpha\leq\beta.

The RCG condition implies that for any low-rank tensor 𝓐\mathscr{A}, the expectation of the gradient ¯(𝓐;zi)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) is positively correlated with the optimal descent direction 𝓐𝓐\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}. If the loss function has a finite expectation 𝔼¯(𝓐;zi)\mathbb{E}\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}), this condition is implied by the restricted strong convexity and smoothness conditions of the risk function 𝔼¯(𝓐;zi)\mathbb{E}\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) (Bubeck, 2015; Jain et al., 2017). However, in the setting of heavy-tailed distributions, the expectation of ¯(𝓐;zi)\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) may not exist, making it more appropriate to consider the RCG condition on the gradient rather than the loss function (see tensor linear regression in Remark 4.1 as an example). Furthermore, compared to the vanilla gradient descent method (Han, Willett and Zhang, 2022), the RCG condition is imposed on the expectation of the loss gradient, which simplifies the technical analysis under heavy-tailed distributions.

The robust gradient functions are expected to exhibit desirable stability against outliers. As an alternative to the sample mean, 𝐆k(𝓢,𝐔1,,𝐔d)\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}) can be viewed as a robust estimator of 𝔼[𝐔k(𝓢,𝐔1,,𝐔d)]\mathbb{E}[\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})]. Similarly, 𝓖0(𝓢,𝐔1,,𝐔d)\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}) can be viewed as a mean estimator of 𝔼[𝓢(𝓢,𝐔1,,𝐔d)]\mathbb{E}[\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})]. Formally, we define the following condition for the stability of the robust gradients.

Definition 3.2 (Stability of robust gradients).

Given (𝓢,𝐔1,,𝐔d)(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}), the robust gradient functions are stable if there exist positive constants ϕ\phi and ξk\xi_{k}, for 0kd0\leq k\leq d, such that

𝐆k(𝓢,𝐔1,,𝐔d)𝔼𝐔k(𝓢,𝐔1,,𝐔d)F2ϕ𝓢×j=1d𝐔j𝓐F2+ξk2,and 𝓖0(𝓢,𝐔1,,𝐔d)𝔼𝓢(𝓢,𝐔1,,𝐔d)F2ϕ𝓢×j=1d𝐔j𝓐F2+ξ02.\begin{split}\|\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\cdots,\mathbf{U}_{d})-\mathbb{E}\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})\|_{\textup{F}}^{2}&\leq\phi\|\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+\xi_{k}^{2},\\ \text{and }\|\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})-\mathbb{E}\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})\|_{\textup{F}}^{2}&\leq\phi\|\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+\xi_{0}^{2}.\end{split} (3.7)

The universal parameter ϕ\phi controls the performance of all gradient functions in the presence of an inaccurate Tucker decomposition, while the constants ξk\xi_{k}’s represent the estimation accuracy of the robust gradient estimators. A similar definition for robust estimation via convex optimization is considered in Prasad et al. (2020).

For the ground truth 𝓐\mbox{\boldmath$\mathscr{A}$}^{*}, denote its largest and smallest singular values across all directions by σ¯=max1kd𝓐(k)\bar{\sigma}=\max_{1\leq k\leq d}\|\mbox{\boldmath$\mathscr{A}$}^{*}_{(k)}\| and σ¯=min1kdσrk(𝓐(k))\underline{\sigma}=\min_{1\leq k\leq d}\sigma_{r_{k}}(\mbox{\boldmath$\mathscr{A}$}^{*}_{(k)}). The condition number of 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} is then given by κ=σ¯/σ¯\kappa=\bar{\sigma}/\underline{\sigma}. To address the unidentifiability issue in Tucker decomposition, we consider the component-wise estimation error under rotation for the estimate (𝓢,𝐔1,,𝐔d)(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}), following Han, Willett and Zhang (2022). The error is defined as

Err(𝓢,𝐔1,,𝐔d)=min𝐎k𝕆rk,1kd{𝓢𝓢×j=1d𝐎kF2+k=1d𝐔k𝐔k𝐎kF2},\text{Err}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})=\min_{\mathbf{O}_{k}\in\mathbb{O}^{r_{k}},1\leq k\leq d}\left\{\|\mbox{\boldmath$\mathscr{S}$}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{k}^{\top}\|_{\text{F}}^{2}+\sum_{k=1}^{d}\|\mathbf{U}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\text{F}}^{2}\right\}, (3.8)

where the true decomposition satisfies 𝐔k=b\|\mathbf{U}_{k}^{*}\|=b and the orthogonal matrices 𝐎k\mathbf{O}_{k}’s account for the unidentification of the Tucker decomposition. For the tt-th iteration of Algorithm 1, where t=0,1,,Tt=0,1,\dots,T, denote the estimated parameter by 𝓐(t)=𝓢(t)×j=1d𝐔j(t)\mbox{\boldmath$\mathscr{A}$}^{(t)}=\mbox{\boldmath$\mathscr{S}$}^{(t)}\times_{j=1}^{d}\mathbf{U}_{j}^{(t)}. The corresponding estimation error is then given by Err(𝓢(t),𝐔1(t),,𝐔d(t))\text{Err}(\mbox{\boldmath$\mathscr{S}$}^{(t)},\mathbf{U}_{1}^{(t)},\dots,\mathbf{U}_{d}^{(t)}).

Given the conditions and notations outlined above, we present the local convergence analysis for the gradient descent iterations with stable robust gradient functions.

Theorem 3.3 (Local convergence rate).

Suppose that the loss function ¯\overline{\mathcal{L}} satisfies the RCG condition with parameters α\alpha and β\beta as in Definition 3.1, and that the robust gradient functions at each step tt satisfy the stability condition with parameters ϕ\phi and ξk\xi_{k} as in Definition 3.2, for all k=0,1,,dk=0,1,\dots,d and t=1,2,,Tt=1,2,\dots,T. If the initial estimation error satisfies Err(𝓢(0),𝐔1(0),,𝐔d(0))αβ1σ¯2/(d+1)κ2\textup{Err}(\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\dots,\mathbf{U}_{d}^{(0)})\lesssim\alpha\beta^{-1}\bar{\sigma}^{2/(d+1)}\kappa^{-2}, ϕα2κ4σ¯2d/(d+1)\phi\lesssim\alpha^{2}\kappa^{-4}\bar{\sigma}^{2d/(d+1)}, aακ2σ¯(2d2)/(d+1)a\asymp\alpha\kappa^{-2}\bar{\sigma}^{(2d-2)/(d+1)}, bσ¯1/(d+1)b\asymp\bar{\sigma}^{1/(d+1)}, and ηαβ1κ2\eta\asymp\alpha\beta^{-1}\kappa^{2}, then the estimation errors at the tt-th iteration, for t=1,2,,Tt=1,2,\dots,T, satisfy

Err(𝓢(t),𝐔1(t),,𝐔d(t))(1Cαβ1κ2)tErr(𝓢(0),𝐔1(0),,𝐔d(0))+Cα2σ¯4d/(d+1)κ4k=0dξk2,\begin{split}\textup{Err}(\mbox{\boldmath$\mathscr{S}$}^{(t)},\mathbf{U}_{1}^{(t)},\dots,\mathbf{U}_{d}^{(t)})\leq&(1-C\alpha\beta^{-1}\kappa^{-2})^{t}\cdot\textup{Err}(\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\dots,\mathbf{U}_{d}^{(0)})\\ &+C\alpha^{-2}\bar{\sigma}^{-4d/(d+1)}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2},\end{split} (3.9)

and

𝓐(t)𝓐F2κ2(1Cαβ1κ2)t𝓐(0)𝓐F2+σ¯2d/(d+1)α2κ4k=0dξk2.\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}\lesssim\kappa^{2}(1-C\alpha\beta^{-1}\kappa^{-2})^{t}\cdot\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+\bar{\sigma}^{-2d/(d+1)}\alpha^{-2}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2}. (3.10)

Theorem 3.3 establishes the linear convergence of the robust gradient descent iterates under certain sufficient conditions. In both upper bounds provided, the first terms correspond to optimization errors that decay exponentially with the number of iterations, which reflects the improvement in the solution as the gradient descent progresses. The second terms, on the other hand, capture the statistical errors, which depend on the estimation accuracy of the robust gradient functions. These statistical errors arise due to the approximation of gradients in the presence of noise or outliers, and their magnitude is influenced by the robustness of the gradient estimator. Thus, to guarantee fast convergence, it is crucial to construct a stable robust gradient estimator, which is the focus of the next subsection. The corresponding theoretical analysis for each specific statistical model will be provided in Section 4, where we will discuss the performance of the estimator under different conditions.

3.3 Robust gradient estimation via entrywise truncation

In this subsection, we propose a general robust gradient estimation method. The partial gradient of the risk function is the expectation of the partial gradient of the loss function, which suggests that robust gradient estimation can be framed as a mean estimation problem. In this context, Fan, Wang and Zhu (2021) introduced a simple robust mean estimation procedure based on data truncation. They demonstrated that the truncated estimator achieves the optimal rate under certain bounded moment conditions. Motivated by their work, we apply the truncation method to estimate the partial gradients in our setting.

For any matrix 𝐌p×q\mathbf{M}\in\mathbb{R}^{p\times q}, we define the entrywise truncation operator T(,):p×q×+p×q\text{T}(\cdot,\cdot):\mathbb{R}^{p\times q}\times\mathbb{R}^{+}\to\mathbb{R}^{p\times q}, such that

T(𝐌,τ)j,k=sgn(𝐌j,k)min(|𝐌j,k|,τ),\text{T}(\mathbf{M},\tau)_{j,k}=\text{sgn}(\mathbf{M}_{j,k})\min(|\mathbf{M}_{j,k}|,\tau), (3.11)

for 1jp1\leq j\leq p and 1kq1\leq k\leq q, where sgn()\text{sgn}(\cdot) denotes the sign function. This operator truncates each entry of the matrix 𝐌\mathbf{M} to be no larger than the threshold τ\tau in absolute value, while preserving its sign. Similarly, we can define the entrywise truncation operator T(𝓣,τ)\text{T}(\mbox{\boldmath$\mathscr{T}$},\tau) for any tensor 𝓣\mathscr{T} with truncation parameter τ\tau. The truncation parameter τ\tau plays a critical role in balancing the trade-off between truncation bias and robustness. A smaller value of τ\tau will lead to stronger truncation, reducing the influence of outliers but potentially introducing more bias, while a larger τ\tau will preserve more of the data’s original values, reducing bias but making the estimator more sensitive to noise.
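In code, the entrywise truncation operator amounts to a sign-preserving clip; a minimal sketch (the function name truncate is ours):

import numpy as np

def truncate(M, tau):
    # Entrywise truncation T(M, tau) in (3.11): shrink each entry to have
    # absolute value at most tau while preserving its sign.
    return np.sign(M) * np.minimum(np.abs(M), tau)   # equivalently np.clip(M, -tau, tau)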

We consider the entrywise truncation gradient estimators. For 1kd1\leq k\leq d, the estimator for the gradient with respect to 𝐔k\mathbf{U}_{k} is given by

𝐆k(𝓢,𝐔1,,𝐔d;τ)=1ni=1nT(𝐔k(𝓢,𝐔1,,𝐔d;zi),τ)=1ni=1nT(¯(𝓢×j=1d𝐔j;zi)(k)(jk𝐔j)𝓢(k),τ).\begin{split}\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i}),\tau)\\ =&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j};z_{i})_{(k)}(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau).\end{split} (3.12)

Similarly, for the core tensor 𝓢\mathscr{S}, the truncation-based estimator is defined as

𝓖0(𝓢,𝐔1,,𝐔d;τ)=1ni=1nT(𝓢(𝓢,𝐔1,,𝐔d;zi),τ)=1ni=1nT(¯(𝓢×j=1d𝐔j;zi)×j=1d𝐔j,τ).\begin{split}\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i}),\tau)\\ =&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau).\end{split} (3.13)

Note that the truncation-based robust gradient estimator is generally applicable to a wide range of tensor models. In Sections 4 and 5, we will show both theoretically and numerically that the entrywise truncation using a single parameter τ\tau can achieve optimal estimation performance under various distributional assumptions.
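A sketch of the truncated estimators (3.12)-(3.13) is given below, reusing unfold and tucker_compose from the earlier sketches and truncate from above; grad_loss(A, z) is an assumed user-supplied function returning the full loss gradient for one sample, and kron_skip builds the reverse-order Kronecker product of the factors excluding mode k.

import numpy as np

def kron_skip(Us, k):
    # Reverse-order Kronecker product U_d kron ... kron U_1, skipping U_k.
    K = np.ones((1, 1))
    for j, U in enumerate(Us):
        if j != k:
            K = np.kron(U, K)
    return K

def robust_partial_grads(S, Us, samples, grad_loss, tau):
    d = len(Us)
    A = tucker_compose(S, Us)
    GU = [np.zeros_like(U) for U in Us]
    GS = np.zeros_like(S)
    Vs = [kron_skip(Us, k) @ unfold(S, k).T for k in range(d)]        # V_k in Section 3.3
    for z in samples:
        G = grad_loss(A, z)                                           # full gradient, same shape as A
        for k in range(d):
            GU[k] += truncate(unfold(G, k) @ Vs[k], tau)              # estimator (3.12)
        GS += truncate(tucker_compose(G, [U.T for U in Us]), tau)     # estimator (3.13)
    return [g / len(samples) for g in GU], GS / len(samples)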

We briefly outline the idea for proving the stability of the truncated gradient estimator. A complete proof for each statistical application is provided in the supplementary material. For 1kd1\leq k\leq d, we define 𝐕k=(jk𝐔j)𝓢(k)\mathbf{V}_{k}=(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}, and the error term for the gradient estimator can be expressed as

𝐆k(𝓢,𝐔1,,𝐔d;τ)𝔼[𝐔k(𝓢,𝐔1,,𝐔d)]=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[¯(𝓐;zi)(k)𝐕k]=Tk,1+Tk,2+Tk,3+Tk,4,\begin{split}&\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)-\mathbb{E}[\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})]\\ =&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}]=T_{k,1}+T_{k,2}+T_{k,3}+T_{k,4},\end{split} (3.14)

where the terms Tk,1T_{k,1}, Tk,2T_{k,2}, Tk,3T_{k,3}, and Tk,4T_{k,4} are defined as follows:

Tk,1=𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]𝔼[¯(𝓐;zi)(k)𝐕k],Tk,2=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)],Tk,3=𝔼[¯(𝓐;zi)(k)𝐕k]𝔼[¯(𝓐;zi)(k)𝐕k]+𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)],and Tk,4=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]+𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)].\begin{split}T_{k,1}=&\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}^{\top},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}],\\ T_{k,2}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}^{\top},\tau)],\\ T_{k,3}=&\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}]\\ &+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)]-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)],\\ \text{and }T_{k,4}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)\\ &-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)]+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)].\end{split} (3.15)

Here, Tk,1T_{k,1} is the truncation bias at the ground truth 𝓐\mbox{\boldmath$\mathscr{A}$}^{*}, and Tk,2T_{k,2} represents the deviation of the truncated estimation around its expectation. As each truncated gradient, T(¯(𝓐;zi)(k)𝐕k,τ)\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau), is a bounded variable, we can apply the Bernstein inequality (Wainwright, 2019) to achieve a sub-Gaussian-type concentration without the Gaussian distributional assumption on the data. The truncation parameter τ\tau controls the magnitude of Tk,1F\|T_{k,1}\|_{\text{F}} and Tk,2F\|T_{k,2}\|_{\text{F}}, and an optimal τ\tau gives Tk,1FTk,2Fξk\|T_{k,1}\|_{\text{F}}\asymp\|T_{k,2}\|_{\text{F}}\asymp\xi_{k}. For Tk,3T_{k,3}, given some regularity conditions, we can obtain an upper bound for the truncation bias of the second-order approximation error in Tk,3F\|T_{k,3}\|_{\text{F}}. Similarly, as T(¯(𝓐;zi)(k)𝐕k,τ)T(¯(𝓐;zi)(k)𝐕k,τ)\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau) is bounded, we can also achieve a sub-Gaussian-type concentration and show that Tk,3FTk,4Fϕ1/2𝓐𝓐F\|T_{k,3}\|_{\text{F}}\asymp\|T_{k,4}\|_{\text{F}}\lesssim\phi^{1/2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}. Hence, we can show that i=14Tk,iF2ϕ𝓐𝓐F2+ξk2\sum_{i=1}^{4}\|T_{k,i}\|_{\text{F}}^{2}\lesssim\phi\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\xi_{k}^{2}.

By controlling the truncation bias, deviation, and approximation errors, we demonstrate that the truncated gradient estimator is stable and achieves optimal performance under certain conditions. A similar approach can be applied to the gradient with respect to the core tensor 𝓢\mathscr{S}, establishing the stability of the corresponding robust estimator.

3.4 Implementation and initialization

The optimal statistical convergence rates of the proposed estimator depend critically on the optimal value of the truncation parameter τ\tau, which varies according to the model dimension, sample size, and other relevant parameters. To select τ\tau in a data-driven manner, we propose using cross-validation to evaluate the estimates produced by the robust gradient descent algorithm for different values of τ\tau.

To initialize Algorithm 1, we may disregard the low-rank structure of 𝓐\mathscr{A} initially, and find the estimate 𝓐~\mathscr{\widetilde{A}}. Specifically, robust gradient descent can be applied to update 𝓐\mathscr{A}, as summarized in Algorithm 2, where the robust gradient is

g(𝓐;τ)=1ni=1nT(¯(𝓐;zi),τ),g(\mbox{\boldmath$\mathscr{A}$};\tau^{\prime})=\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}),\tau^{\prime}), (3.16)

with another truncation parameter τ\tau^{\prime} and step size η\eta^{\prime}. Once we have 𝓐~\mathscr{\widetilde{A}}, we apply HOSVD (higher-order singular value decomposition) or HOOI (higher-order orthogonal iteration) (De Lathauwer, De Moor and Vandewalle, 2000) to 𝓐~\mathscr{\widetilde{A}} to obtain the initial values.

Algorithm 2 Robust gradient descent for 𝓐\mathscr{A}

initialize: 𝓐(0)=𝟎\mbox{\boldmath$\mathscr{A}$}^{(0)}=\mathbf{0}, step size η>0\eta^{\prime}>0, and number of iterations TT

for t=0t=0 to T1T-1

𝓐(t+1)𝓐(t)ηg(𝓐(t);τ)\mbox{\boldmath$\mathscr{A}$}^{(t+1)}\leftarrow\mbox{\boldmath$\mathscr{A}$}^{(t)}-\eta^{\prime}\cdot g(\mbox{\boldmath$\mathscr{A}$}^{(t)};\tau^{\prime})

end for
return
𝓐~=𝓐(T)\mbox{\boldmath$\mathscr{\widetilde{A}}$}=\mbox{\boldmath$\mathscr{A}$}^{(T)}
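Below is a sketch of the initialization step, combining Algorithm 2 with HOSVD and reusing truncate, unfold, and tucker_compose from the earlier sketches; grad_loss and the rescaling by b are our own assumptions, not specifications from the paper.

import numpy as np

def robust_gd_full(samples, grad_loss, shape, eta_prime, tau_prime, T):
    # Algorithm 2: robust gradient descent on the full tensor A with the
    # entrywise-truncated gradient g(A; tau') in (3.16).
    A = np.zeros(shape)
    for _ in range(T):
        g = sum(truncate(grad_loss(A, z), tau_prime) for z in samples) / len(samples)
        A = A - eta_prime * g
    return A

def hosvd_init(A_tilde, ranks, b=1.0):
    # HOSVD of the unstructured estimate: the leading left singular vectors of each
    # mode-k matricization give U_k^(0); factors may be rescaled so that ||U_k|| = b.
    Us = []
    for k, r in enumerate(ranks):
        Uk, _, _ = np.linalg.svd(unfold(A_tilde, k), full_matrices=False)
        Us.append(b * Uk[:, :r])
    # core chosen so that S x_1 U_1 ... x_d U_d reproduces the HOSVD approximation
    S = tucker_compose(A_tilde, [U.T / b ** 2 for U in Us])
    return S, Us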

4 Applications to Tensor Models

In this section, we apply the proposed robust gradient descent algorithm, utilizing entrywise truncated gradient estimators, to three statistical tensor models: tensor linear regression, tensor logistic regression, and tensor PCA. For tensor linear regression, both the covariates and the noise are assumed to follow heavy-tailed distributions. In the case of tensor logistic regression and tensor PCA, we assume heavy-tailed covariates for the former and heavy-tailed noise for the latter. In the following, we let p¯=max1jdpj\bar{p}=\max_{1\leq j\leq d}p_{j} be the maximum dimension across all modes, and let deff=k=1dpkrk+k=1drkd_{\text{eff}}=\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k} be the effective dimension of the low-rank tensor, which corresponds to the total number of parameters in the Tucker decomposition.

4.1 Heavy-tailed tensor linear regression

The first statistical model we consider is tensor linear regression:

𝓨i=𝓐,𝓧i+𝓔i,i=1,2,,n,\mbox{\boldmath$\mathscr{Y}$}_{i}=\langle\mbox{\boldmath$\mathscr{A}$}^{*},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle+\mbox{\boldmath$\mathscr{E}$}_{i},\quad i=1,2,\dots,n, (4.1)

where 𝓧ip1×p2××pd0\mbox{\boldmath$\mathscr{X}$}_{i}\in\mathbb{R}^{p_{1}\times p_{2}\times\cdots\times p_{d_{0}}}, 𝓨i,𝓔ipd0+1×pd0+2××pd\mbox{\boldmath$\mathscr{Y}$}_{i},\mbox{\boldmath$\mathscr{E}$}_{i}\in\mathbb{R}^{p_{d_{0}+1}\times p_{d_{0}+2}\times\cdots\times p_{d}}, and 𝓐p1×p2××pd\mbox{\boldmath$\mathscr{A}$}^{*}\in\mathbb{R}^{p_{1}\times p_{2}\times\cdots\times p_{d}} with 0d0d0\leq d_{0}\leq d. The symbol ,\langle\cdot,\cdot\rangle represents the generalized inner product of tensors, defined in (2.2). When d=d0d=d_{0}, the response 𝓨i\mbox{\boldmath$\mathscr{Y}$}_{i} is a scalar, and ,\langle\cdot,\cdot\rangle reduces to the conventional inner product. We assume that the samples zi=(𝓧i,𝓨i)z_{i}=(\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{Y}$}_{i}) are independent across i=1,,ni=1,\dots,n, and 𝔼[𝓔i|𝓧i]=𝟎\mathbb{E}[\mbox{\boldmath$\mathscr{E}$}_{i}|\mbox{\boldmath$\mathscr{X}$}_{i}]=\mathbf{0}.

Model (4.1) encompasses multivariate regression, multi-response regression, and matrix trace regression as special cases. It has broad applicability in various statistics and machine learning contexts, such as multi-response tensor regression (Raskutti, Yuan and Chen, 2019), matrix compressed sensing (Candes and Plan, 2011), autoregressive time series modeling (Wang et al., 2022), matrix and tensor completion (Negahban and Wainwright, 2012; Cai et al., 2019), and others.

For model (4.1), it is common to consider the least squares loss function:

(𝓢,𝐔1,,𝐔d;𝓨i,𝓧i)=12𝓨i𝓢×j=1d𝐔j,𝓧iF2.\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mbox{\boldmath$\mathscr{Y}$}_{i},\mbox{\boldmath$\mathscr{X}$}_{i})=\frac{1}{2}\|\mbox{\boldmath$\mathscr{Y}$}_{i}-\langle\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle\|_{\text{F}}^{2}. (4.2)

For any given (𝓢,𝐔1,,𝐔d)(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}), its partial gradients with respect to each zi=(𝓨i,𝓧i)z_{i}=(\mbox{\boldmath$\mathscr{Y}$}_{i},\mbox{\boldmath$\mathscr{X}$}_{i}) are

𝐔k(𝓢,𝐔1,,𝐔d;zi)=[𝓧i(𝓐,𝓧i𝓨i)](k)(jk𝐔j)𝓢(k),1kd,and 𝓢(𝓢,𝐔1,,𝐔d;zi)=[𝓧i(𝓐,𝓧i𝓨i)]×j=1d𝐔j.\begin{split}\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})&=[\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i})]_{(k)}(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},~{}~{}1\leq k\leq d,\\ \text{and }\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})&=[\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i})]\times_{j=1}^{d}\mathbf{U}_{j}^{\top}.\end{split} (4.3)

By applying the entrywise truncation operator from Section 3.3, the robust gradients with truncation parameter τ\tau are given by

𝐆k(𝓢,𝐔1,,𝐔d;τ)=1ni=1nT([𝓧i(𝓐,𝓧i𝓨i)](k)(jk𝐔j)𝓢(k),τ)\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)=\frac{1}{n}\sum_{i=1}^{n}\text{T}\left([\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i})]_{(k)}(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau\right) (4.4)

for k=1,,dk=1,\dots,d, and

𝓖0(𝓢,𝐔1,,𝐔d;τ)=1ni=1nT([𝓧i(𝓐,𝓧i𝓨i)]×j=1d𝐔j,τ).\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)=\frac{1}{n}\sum_{i=1}^{n}\text{T}\left([\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i})]\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau\right). (4.5)
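For the scalar-response case (d0 = d), the per-sample full gradient in (4.3) is simply the covariate tensor times the residual, so the robust gradients (4.4) and (4.5) can be obtained by plugging the following function into the robust_partial_grads sketch of Section 3.3 (the names and interface are ours):

import numpy as np

def regression_grad_loss(A, z):
    # Per-sample gradient of the least squares loss (4.2) with scalar response:
    # grad = X_i * (<A, X_i> - y_i), cf. (4.3).
    X, y = z
    return X * (np.sum(A * X) - y)

# usage sketch:
# GU, GS = robust_partial_grads(S, Us, samples, regression_grad_loss, tau)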
Remark 4.1.

For model (4.1), the loss function ¯(𝓐;zi)=𝓨i𝓐,𝓧iF2/2\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})=\|\mbox{\boldmath$\mathscr{Y}$}_{i}-\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle\|_{\textup{F}}^{2}/2 has a finite expectation if both 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} have finite second moments. In the following, we assume that 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} has a finite (1+ϵ)(1+\epsilon)-th moment for some constant ϵ(0,1]\epsilon\in(0,1]. Under this assumption, we only require that the gradient ¯(𝓐;zi)=𝓧i(𝓐,𝓧i𝓨i)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})=\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}) has a finite expectation.

Remark 4.2.

Adaptive Huber regression is a widely-used robust estimation method for high-dimensional linear regression (Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023). The loss function is given by

H(𝓢,𝐔1,,𝐔d;𝒟n)=12i=1nν(𝓨i𝓢×j=1d𝐔j,𝓧i),\mathcal{L}_{\textup{H}}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})=\frac{1}{2}\sum_{i=1}^{n}\ell_{\nu}(\mbox{\boldmath$\mathscr{Y}$}_{i}-\langle\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle), (4.6)

where ν(𝓣)=i1,i2,,idν(𝓣i1i2id)\ell_{\nu}(\mbox{\boldmath$\mathscr{T}$})=\sum_{i_{1},i_{2},\dots,i_{d}}\ell_{\nu}(\mbox{\boldmath$\mathscr{T}$}_{i_{1}i_{2}\dots i_{d}}) for any tensor 𝓣\mathscr{T}, ν(x)=x21(|x|ν)+(2ν|x|ν2)1(|x|>ν)\ell_{\nu}(x)=x^{2}\cdot 1(|x|\leq\nu)+(2\nu|x|-\nu^{2})\cdot 1(|x|>\nu) is the Huber loss, and ν>0\nu>0 is the robustness parameter. The partial gradients of H\mathcal{L}_{\textup{H}} are

𝐔kH(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n[𝓧iT(𝓐,𝓧i𝓨i,ν)](k)(jk𝐔j)𝓢(k),and 𝓢H(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n[𝓧iT(𝓐,𝓧i𝓨i,ν)]×j=1d𝐔j.\begin{split}\nabla_{\mathbf{U}_{k}}\mathcal{L}_{\textup{H}}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})&=\frac{1}{n}\sum_{i=1}^{n}[\mbox{\boldmath$\mathscr{X}$}_{i}\circ\textup{T}(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i},\nu)]_{(k)}(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\\ \text{and }\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}_{\textup{H}}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})&=\frac{1}{n}\sum_{i=1}^{n}[\mbox{\boldmath$\mathscr{X}$}_{i}\circ\textup{T}(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i},\nu)]\times_{j=1}^{d}\mathbf{U}_{j}^{\top}.\end{split} (4.7)

In comparison to the standard Huber regression, where the gradients bound (𝓐,𝓧i𝓨i)(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}) directly, the entrywise truncated gradients in (4.4) and (4.5) control the deviation of the term 𝓧i(𝓐,𝓧i𝓨i)\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}). This modification enables us to better handle heavy-tailedness in both the covariates and the noise.

By applying the Tucker decomposition 𝓐=𝓢×j=1d𝐔j\mbox{\boldmath$\mathscr{A}$}=\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j} and utilizing properties of tensor inner products, the partial gradients can be rewritten as follows. For k=1,,d0k=1,\dots,d_{0},

𝐔k(𝓢,𝐔1,,𝐔d;zi)=[(𝓧i×j=1,jkd0𝐔j)(𝓢×j=d0+1d𝐔j𝐔j,𝓧i×j=1d0𝐔j𝓨i×j=1dd0𝐔d0+j)](k)𝓢(k).\begin{split}&\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})=\big{[}\big{(}\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}^{\top}\big{)}\circ\\ &~{}~{}~{}\big{(}\langle\mbox{\boldmath$\mathscr{S}$}\times_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top}\big{)}\big{]}_{(k)}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}.\end{split} (4.8)

For k=d0+1,,dk=d_{0}+1,\dots,d,

𝐔k(𝓢,𝐔1,,𝐔d;zi)=[(𝓧i×j=1d0𝐔j)(𝓢×k𝐔k×j=d0+1,jkd𝐔j𝐔j,𝓧i×j=1d0𝐔j𝓨i×j=1,jkd0dd0𝐔d0+j)](k)𝓢(k).\begin{split}&\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})=\big{[}\big{(}\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\big{)}\circ\\ &~{}~{}~{}\big{(}\langle\mbox{\boldmath$\mathscr{S}$}\times_{k}\mathbf{U}_{k}\times_{j=d_{0}+1,j\neq k}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}\times_{j=1,j\neq k-d_{0}}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top}\big{)}\big{]}_{(k)}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}.\end{split} (4.9)

Finally, the gradient with respect to the core tensor 𝓢\mathscr{S} is

𝓢(𝓢,𝐔1,,𝐔d;zi)=(𝓧i×j=1d0𝐔j)(𝓢×j=d0+1d𝐔j𝐔j,𝓧i×j=1d0𝐔j𝓨i×j=1dd0𝐔d0+j).\begin{split}&\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})\\ &=\big{(}\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\big{)}\circ\big{(}\langle\mbox{\boldmath$\mathscr{S}$}\times_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top}\big{)}.\end{split} (4.10)

In the above expressions, dimension reduction is applied to high-dimensional data 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓨i\mbox{\boldmath$\mathscr{Y}$}_{i} via the factor matrices 𝐔j\mathbf{U}_{j}’s. This allows for efficient computation of the partial gradients. More importantly, the transformed data, such as 𝓧i×j=1d0𝐔j\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top} and 𝓨i×j=1dd0𝐔d0+j\mbox{\boldmath$\mathscr{Y}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top}, appear in the partial gradients rather than the original high-dimensional data. This transformation can be seen as reducing the effective dimension of the data used in gradient calculation. As a result, when all 𝐔k\mathbf{U}_{k}’s are close to their ground truth values and lie within certain local regions, the gradients rely only on partial information of the data. This phenomenon motivates us to consider the local bounded moment conditions for the distributions of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} when projected onto certain directions.
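The dimension reduction above amounts to repeated mode products with the factor matrices. The following is a minimal numpy sketch of this projection step; the helper names are ours, and the example dimensions are hypothetical.

```python
import numpy as np

def mode_product(T, M, mode):
    """Mode-`mode` product of tensor T with matrix M of shape (r, p_mode)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def project_all_modes(T, factors):
    """Compute T x_1 U_1^T x_2 U_2^T ... for factor matrices U_k of shape (p_k, r_k)."""
    for k, U in enumerate(factors):
        T = mode_product(T, U.T, k)
    return T

# Example: a 50 x 40 x 30 covariate tensor reduced to a 3 x 3 x 3 core-sized tensor.
rng = np.random.default_rng(0)
X_i = rng.standard_normal((50, 40, 30))
factors = [np.linalg.qr(rng.standard_normal((p, 3)))[0] for p in (50, 40, 30)]
X_low = project_all_modes(X_i, factors)
print(X_low.shape)   # (3, 3, 3)
```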

For any matrix 𝐌p×q\mathbf{M}\in\mathbb{R}^{p\times q}, consider the projection matrix onto its column space, 𝒫𝐌=𝐌(𝐌𝐌)𝐌\mathcal{P}_{\mathbf{M}}=\mathbf{M}(\mathbf{M}^{\top}\mathbf{M})^{\dagger}\mathbf{M}^{\top}, where \dagger denotes the Moore–Penrose pseudo-inverse. For any vector 𝐯p\mathbf{v}\in\mathbb{R}^{p}, the angle between 𝐯\mathbf{v} and the column space of 𝐌\mathbf{M} is arccos(𝒫𝐌𝐯2/𝐯2)\arccos(\|\mathcal{P}_{\mathbf{M}}\mathbf{v}\|_{2}/\|\mathbf{v}\|_{2}). For each 𝐔k\mathbf{U}_{k}^{*}, define the local set of unit-length vectors with sinθ\sin\theta radius δ\delta as

𝒱(𝐔k,δ)={𝐯pk:𝐯2=1andsinarccos(𝒫𝐔k𝐯2)δ}.\mathcal{V}(\mathbf{U}_{k}^{*},\delta)=\{\mathbf{v}\in\mathbb{R}^{p_{k}}:\|\mathbf{v}\|_{2}=1~{}\text{and}~{}\sin\arccos(\|\mathcal{P}_{\mathbf{U}^{*}_{k}}\mathbf{v}\|_{2})\leq\delta\}. (4.11)
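As a concrete check of membership in (4.11), the small numpy sketch below (function names are ours) computes the sin θ distance of a vector to the column space of a factor matrix:

```python
import numpy as np

def sin_theta(v, U):
    """sin of the principal angle between vector v and the column space of U."""
    P = U @ np.linalg.pinv(U.T @ U) @ U.T          # projection P_U
    v = v / np.linalg.norm(v)
    cos_theta = np.linalg.norm(P @ v)              # ||P_U v||_2 for a unit vector v
    return np.sqrt(max(0.0, 1.0 - cos_theta**2))

def in_local_set(v, U, delta):
    """True if v / ||v|| belongs to the local set V(U, delta) in (4.11)."""
    return sin_theta(v, U) <= delta
```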

By this definition, we have the following assumptions on the distributions of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i}.

Assumption 4.3.

The vectorized covariate vec(𝓧)\textup{vec}(\mbox{\boldmath$\mathscr{X}$}) has mean 𝟎\mathbf{0} and a positive definite covariance matrix 𝚺x\mathbf{\Sigma}_{x} satisfying 0<αxλmin(𝚺x)λmax(𝚺x)βx0<\alpha_{x}\leq\lambda_{\min}(\mathbf{\Sigma}_{x})\leq\lambda_{\max}(\mathbf{\Sigma}_{x})\leq\beta_{x}. For some ϵ(0,1]\epsilon\in(0,1] and δ[0,1]\delta\in[0,1], conditioned on any 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i}, the random noise 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} has the local (1+ϵ)(1+\epsilon)-th moment bound

Me,1+ϵ,δ:=max1kdd0(sup𝐯j𝒱(𝐔d0+j,δ)𝔼[|𝓔i×j=1dd0𝐯j|1+ϵ|𝓧i],sup𝐯j𝒱(𝐔d0+j,δ),1lpd0+k𝔼[|𝓔i×j=1,jkdd0𝐯j×k𝐜l|1+ϵ|𝓧i]),\begin{split}M_{e,1+\epsilon,\delta}:=\max_{1\leq k\leq d-d_{0}}&\Bigg{(}\sup_{\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{d_{\scalebox{0.4}{0}}+j}^{*},\delta)}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right],\\ &~{}~{}~{}\sup_{\begin{subarray}{c}\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{d_{\scalebox{0.4}{0}}+j}^{*},\delta),1\leq l\leq p_{d_{\scalebox{0.4}{0}}+k}\end{subarray}}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1,j\neq k}^{d-d_{0}}\mathbf{v}_{j}^{\top}\times_{k}\mathbf{c}_{l}^{\top}\right|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right]\Bigg{)},\end{split} (4.12)

where 𝐜l\mathbf{c}_{l} is the coordinate vector whose ll-th element is one and others are zero. Additionally, 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} has the local (1+ϵ)(1+\epsilon)-th moment bound

Mx,1+ϵ,δ:=max1kd0(sup𝐯j𝒱(𝐔j,δ)𝔼[|𝓧i×j=1d0𝐯j|1+ϵ],sup𝐯j𝒱(𝐔j,δ),1lpk𝔼[|𝓧i×j=1,jkd0𝐯j×k𝐜l|1+ϵ]).\begin{split}M_{x,1+\epsilon,\delta}:=\max_{1\leq k\leq d_{0}}&\Bigg{(}\sup_{\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}\right],\\ &~{}~{}~{}\sup_{\begin{subarray}{c}\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta),1\leq l\leq p_{k}\end{subarray}}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{v}_{j}^{\top}\times_{k}\mathbf{c}_{l}^{\top}\right|^{1+\epsilon}\right]\Bigg{)}.\end{split} (4.13)

We call Me,1+ϵ,δM_{e,1+\epsilon,\delta} and Mx,1+ϵ,δM_{x,1+\epsilon,\delta} the local moment bounds because they reflect the distributional properties of 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} and 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} projected onto the local regions around the column spaces of the 𝐔k\mathbf{U}_{k}^{*}’s. These local moment bound assumptions essentially limit how far the random noise and covariates can deviate in certain directions, thereby controlling their tail behavior. When δ=1\delta=1, the local moment bounds become

Me,1+ϵ,1=sup𝐯j2=1𝔼[|𝓔i×j=1dd0𝐯j|1+ϵ|𝓧i]M_{e,1+\epsilon,1}=\sup_{\|\mathbf{v}_{j}\|_{2}=1}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right] (4.14)

and

Mx,1+ϵ,1=sup𝐯j2=1𝔼[|𝓧i×j=1d0𝐯j|1+ϵ],M_{x,1+\epsilon,1}=\sup_{\|\mathbf{v}_{j}\|_{2}=1}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}\right], (4.15)

which are still different from and could be much smaller than the global (1+ϵ)(1+\epsilon)-th moments for the vectorized data, i.e., sup𝐯2=1𝔼[|vec(𝓧i)𝐯|1+ϵ]\sup_{\|\mathbf{v}\|_{2}=1}\mathbb{E}[|\text{vec}(\mbox{\boldmath$\mathscr{X}$}_{i})^{\top}\mathbf{v}|^{1+\epsilon}] and sup𝐯2=1𝔼[|vec(𝓔i)𝐯|1+ϵ]\sup_{\|\mathbf{v}\|_{2}=1}\mathbb{E}[|\text{vec}(\mbox{\boldmath$\mathscr{E}$}_{i})^{\top}\mathbf{v}|^{1+\epsilon}].

The second moment condition for 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and (1+ϵ)(1+\epsilon)-th moment condition for 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} in Assumption 4.3 offer a relaxation of the commonly-used Gaussian and sub-Gaussian conditions in tensor regression literature. Notably, in Assumption 4.3, the elements of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} or 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} are not required to be independent or uncorrelated. The parameters αx\alpha_{x}, βx\beta_{x}, Mx,1+ϵ,δM_{x,1+\epsilon,\delta} and Me,1+ϵ,δM_{e,1+\epsilon,\delta} are used to quantify how the distributions of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} affect the rate of convergence, and they are allowed to vary with tensor dimensions. For tensor regression in (4.1), define Meff,δ=Mx,1+ϵ,δMe,1+ϵ,δM_{\text{eff},\delta}=M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta} as the effective (1+ϵ)(1+\epsilon)-th local moment bound.

Remark 4.4.

The local moment condition is imposed only on directions near the subspaces spanned by the columns of the 𝐔k\mathbf{U}_{k}^{*}’s. Consequently, the local moment can be much smaller than the global moment commonly considered in the robust estimation literature (Fan, Li and Wang, 2017; Fan, Wang and Zhu, 2021; Wang and Tsay, 2023).

To highlight the significance of the local moment concept, consider the following example. Let 𝐱=(𝐱1,𝐱2)\mathbf{x}=(\mathbf{x}_{1}^{\top},\mathbf{x}_{2}^{\top})^{\top} be a random vector, where 𝐱1=(Y1,Y2,,Yp)p\mathbf{x}_{1}=(Y_{1},Y_{2},\dots,Y_{p})^{\top}\in\mathbb{R}^{p}, 𝐱2=(Yp+1,Yp+1,,Yp+1)p\mathbf{x}_{2}=(Y_{p+1},Y_{p+1},\dots,Y_{p+1})^{\top}\in\mathbb{R}^{p}, and Y1,Y2,,Yp+1i.i.d.N(0,1)Y_{1},Y_{2},\dots,Y_{p+1}\sim_{i.i.d.}N(0,1). Suppose that the ground truth is 𝐮=(p1/2𝟏p,𝟎p)\mathbf{u}^{*}=(p^{-1/2}\mathbf{1}_{p}^{\top},\mathbf{0}_{p}^{\top})^{\top}. In this case, the global second moment of 𝐱\mathbf{x} is given by

Mx,2,1=sup𝐯2=1𝔼[|𝐯𝐱|2]=𝔼[|p1/2𝟏p𝐱2|2]=p,M_{x,2,1}=\sup_{\|\mathbf{v}\|_{2}=1}\mathbb{E}\left[|\mathbf{v}^{\top}\mathbf{x}|^{2}\right]=\mathbb{E}\left[|p^{-1/2}\mathbf{1}_{p}^{\top}\mathbf{x}_{2}|^{2}\right]=p, (4.16)

which diverges as the dimension pp increases. However, when δp1/2\delta\leq p^{-1/2}, the local second moment with radius δ\delta, defined as

Mx,2,δ=max{sup𝐯𝒱(𝐮,δ)𝔼[|𝐯𝐱|2],max1j2p𝔼[|𝐜j𝐱|2]}M_{x,2,\delta}=\max\left\{\sup_{\mathbf{v}\in\mathcal{V}(\mathbf{u}^{*},\delta)}\mathbb{E}\left[|\mathbf{v}^{\top}\mathbf{x}|^{2}\right],\max_{1\leq j\leq 2p}\mathbb{E}\left[|\mathbf{c}_{j}^{\top}\mathbf{x}|^{2}\right]\right\} (4.17)

remains bounded and is not greater than 2. This example illustrates that if the radius of the local region can be sufficiently controlled, the local moment bound can be significantly smaller than the global moment. As a result, the error rate associated with the local moment can be much sharper, leading to more precise theoretical results.
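A quick Monte-Carlo sketch of this example (with a hypothetical dimension p and sample size of our own choosing) illustrates the gap between the global and local second moments:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 50, 100_000

Y = rng.standard_normal((n, p + 1))
# x = (x1, x2): x1 has i.i.d. N(0,1) entries, x2 repeats Y_{p+1} p times.
x = np.concatenate([Y[:, :p], np.repeat(Y[:, [p]], p, axis=1)], axis=1)

u_star = np.concatenate([np.full(p, p ** -0.5), np.zeros(p)])   # ground-truth direction
v_worst = np.concatenate([np.zeros(p), np.full(p, p ** -0.5)])  # attains the global moment

print(np.mean((x @ v_worst) ** 2))   # approximately p (global second moment)
print(np.mean((x @ u_star) ** 2))    # approximately 1 (local direction around u*)
```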

Denote the estimator obtained by the robust gradient descent algorithm with gradient truncation parameter τ\tau as 𝓐^(τ)\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau), and the corresponding estimation error by Err(𝓢^,𝐔^1,,𝐔^d)\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d}) as in (3.8). Based on the local bounded moment conditions, we have the following guarantees.

Theorem 4.5.

For tensor linear regression in (4.1), suppose Assumption 4.3 holds with some ϵ(0,1]\epsilon\in(0,1] and δmin{σ¯1/(d+1)Err(0)+κ2αx1σ¯1deff1/2[Meff,δlog(p¯)/n]ϵ/(1+ϵ),1}\delta\geq\min\{\bar{\sigma}^{-1/(d+1)}\sqrt{\textup{Err}^{(0)}}+\kappa^{2}\alpha_{x}^{-1}\bar{\sigma}^{-1}d_{\textup{eff}}^{1/2}[M_{\textup{eff},\delta}\log(\bar{p})/n]^{\epsilon/(1+\epsilon)},1\}. If τσ¯d/(d+1)[nMeff,δ/log(p¯)]1/(1+ϵ)\tau\asymp\bar{\sigma}^{d/(d+1)}[nM_{\textup{eff},\delta}/\log(\bar{p})]^{1/(1+\epsilon)}, n(κ4βx3αx2σ¯)1+ϵlog(p¯)Meff,δ1+κ4βx2αx2log(p¯)n\gtrsim(\kappa^{4}\beta_{x}^{3}\alpha_{x}^{-2}\bar{\sigma})^{1+\epsilon}\log(\bar{p})M_{\textup{eff},\delta}^{-1}+\kappa^{4}\beta_{x}^{2}\alpha_{x}^{-2}\log(\bar{p}), and the conditions in Theorem 3.3 hold with α=αx/2\alpha=\alpha_{x}/2 and β=βx/2\beta=\beta_{x}/2, then with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})), after sufficient iterations of Algorithm 1, we have the following error bounds

Err(𝓢^,𝐔^1,,𝐔^d)κ4αx2σ¯2d/(d+1)deff[Meff,δ1/ϵlog(p¯)n]2ϵ/(1+ϵ)\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d})\lesssim\kappa^{4}\alpha_{x}^{-2}\bar{\sigma}^{-2d/(d+1)}d_{\textup{eff}}\left[\frac{M_{\textup{eff},\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{2\epsilon/(1+\epsilon)} (4.18)

and

𝓐^(τ)𝓐Fκ2αx1deff1/2[Meff,δ1/ϵlog(p¯)n]ϵ/(1+ϵ).\|\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau)-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}\lesssim\ \kappa^{2}\alpha_{x}^{-1}d_{\textup{eff}}^{1/2}\left[\frac{M_{\textup{eff},\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}. (4.19)

Theorem 4.5 shows that, under certain regularity conditions, the error rates in (4.18) and (4.19) are proportional to deffMeff,δ2/(1+ϵ)[log(p¯)/n]2ϵ/(1+ϵ)d_{\textup{eff}}M_{\textup{eff},\delta}^{2/(1+\epsilon)}[\log(\bar{p})/n]^{2\epsilon/(1+\epsilon)} and deff1/2Meff,δ1/(1+ϵ)[log(p¯)/n]ϵ/(1+ϵ)d_{\textup{eff}}^{1/2}M_{\textup{eff},\delta}^{1/(1+\epsilon)}[\log(\bar{p})/n]^{\epsilon/(1+\epsilon)}, respectively. The results essentially depend on the rate of the truncation parameter τ\tau, whose optimal value is proportional to [nMeff,δ/log(p¯)]1/(1+ϵ)[nM_{\text{eff},\delta}/\log(\bar{p})]^{1/(1+\epsilon)}. The convergence rates exhibit a smooth phase transition. When ϵ=1\epsilon=1, i.e., when 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} has a finite second local moment, the upper bound in (4.19) matches the result under the Gaussian distributional assumption for low-rank matrix regression (Negahban and Wainwright, 2011) and tensor regression (Han, Willett and Zhang, 2022; Zhang et al., 2020). When 0<ϵ<10<\epsilon<1, the convergence rate in (4.19) slows from deffMeff,δ1/ϵ[log(p¯)/n]1/2\sqrt{d_{\text{eff}}}M_{\text{eff},\delta}^{1/\epsilon}[\log(\bar{p})/n]^{1/2} to deffMeff,δ1/ϵ[log(p¯)/n]ϵ/(1+ϵ)\sqrt{d_{\text{eff}}}M_{\text{eff},\delta}^{1/\epsilon}[\log(\bar{p})/n]^{\epsilon/(1+\epsilon)}. Furthermore, the parameter δ\delta controls the radius of the local region. If nn is sufficiently large and the initial error Err(0)<σ¯2/(d+1)\text{Err}^{(0)}<\bar{\sigma}^{2/(d+1)}, then the local radius δ\delta is smaller than 1, and the convergence rates are governed by the local moment bounds.
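For intuition, the order of the bound in (4.19) can be tabulated as a function of the moment order; the helper below is a hypothetical illustration of ours with all constants suppressed, not a quantity computed in the paper.

```python
import numpy as np

def frob_rate(n, p_bar, eps, M_eff, d_eff, kappa=1.0, alpha_x=1.0):
    """Order of the Frobenius-norm bound in (4.19), constants suppressed."""
    return kappa**2 / alpha_x * np.sqrt(d_eff) * (
        M_eff ** (1.0 / eps) * np.log(p_bar) / n) ** (eps / (1.0 + eps))

# Phase transition: the exponent eps / (1 + eps) decreases from 1/2 as eps drops below 1.
for eps in (1.0, 0.5, 0.2):
    print(eps, frob_rate(n=5000, p_bar=100, eps=eps, M_eff=1.0, d_eff=60))
```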

Remark 4.6.

The convergence rate phase transition with respect to the moment order is also observed in heavy-tailed vector regression (Sun, Zhou and Fan, 2020), matrix regression (Tan, Sun and Witten, 2023), and time series autoregression (Wang and Tsay, 2023). For d=1d=1 and 2, the convergence rates obtained in Theorem 4.5 for ϵ(0,1]\epsilon\in(0,1] are minimax rate-optimal for vector-valued and matrix-valued regression models (Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023). For d3d\geq 3, the rate in (4.19) with ϵ=1\epsilon=1 is also minimax rate-optimal (Raskutti, Yuan and Chen, 2019).

Remark 4.7.

We highlight two key differences between our theoretical results and those in the existing literature on heavy-tailed regression models. First, we relax the distributional condition on the covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i}. In high-dimensional Huber regression literature, the covariates are typically assumed to be sub-Gaussian or bounded (Fan, Li and Wang, 2017; Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023; Shen et al., 2023). In contrast, the proposed method, based on gradient robustification, can handle both heavy-tailed covariates and noise. Second, our analysis relies on local moment conditions, making our results potentially much sharper than those based on global moment conditions. Numerical results which verify the advantages of our method over competing methods are provided in Section 5.

4.2 Heavy-tailed tensor logistic regression

For the generalized linear model, conditioned on the tensor covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i}, the response variable yiy_{i} follows the distribution

(yi|𝓧i)exp{yi𝓧i,𝓐Φ(𝓧i,𝓐)c(γ)},i=1,2,,n.\mathbb{P}(y_{i}|\mbox{\boldmath$\mathscr{X}$}_{i})\propto\exp\left\{\frac{y_{i}\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle-\Phi(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{c(\gamma)}\right\},~{}~{}i=1,2,\dots,n. (4.20)

Consequently, the loss function is given by

¯(𝓐;zi)=Φ(𝓧i,𝓐)yi𝓧i,𝓐.\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})=\Phi(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)-y_{i}\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle. (4.21)

A widely studied instance of the generalized linear models is logistic regression, where Φ(t)=log(1+exp(t))\Phi(t)=\log(1+\exp(t)). In this case, the gradient of the loss function becomes

¯(𝓐;zi)=(exp(𝓧i,𝓐)1+exp(𝓧i,𝓐)yi)𝓧i.\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})=\left(\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)}-y_{i}\right)\mbox{\boldmath$\mathscr{X}$}_{i}. (4.22)

Here, yiy_{i} is the binary variable with probability

(yi=1|𝓧i)=exp(𝓧i,𝓐)1+exp(𝓧i,𝓐)and(yi=0|𝓧i)=1exp(𝓧i,𝓐)1+exp(𝓧i,𝓐).\mathbb{P}(y_{i}=1|\mbox{\boldmath$\mathscr{X}$}_{i})=\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}~{}~{}\text{and}~{}~{}\mathbb{P}(y_{i}=0|\mbox{\boldmath$\mathscr{X}$}_{i})=1-\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}. (4.23)
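Since the residual term in (4.22) is bounded between -1 and 1, heavy tails enter the gradient only through 𝓧i. The following is a minimal sketch of an entrywise-truncated version of this gradient; the function names are ours and the sketch is a simplification relative to the factorized updates used by the algorithm.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def truncated_logistic_gradient(X, y, A, tau):
    """Entrywise-truncated average of the per-sample gradients in (4.22)."""
    n = X.shape[0]
    margin = np.tensordot(X, A, axes=A.ndim)            # <X_i, A>
    resid = sigmoid(margin) - y                          # bounded in (-1, 1)
    per_sample = X * resid.reshape((n,) + (1,) * A.ndim)
    per_sample = np.sign(per_sample) * np.minimum(np.abs(per_sample), tau)
    return per_sample.mean(axis=0)
```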

In the robust estimation literature for logistic regression, since yiy_{i} is a binary variable, it is often assumed that the covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follows a heavy-tailed distribution. For example, the finite fourth moment condition has been considered for vector-valued covariates in the context of heavy-tailed logistic regression (Prasad et al., 2020; Zhu and Zhou, 2021; Han, Tsay and Wu, 2023). Tensor logistic regression has been widely applied to classification tasks in neuroimaging analysis (see e.g. Zhou, Li and Zhu, 2013; Li et al., 2018; Wu et al., 2022). Recent empirical studies have shown that neuroimaging data follow heavy-tailed distributions (Beggs and Plenz, 2003; Friedman et al., 2012; Roberts, Boonstra and Breakspear, 2015). Consequently, there is a growing practical need to study robust estimation methods for tensor logistic regression.

Given the low-rank structure, the partial gradients of the logistic loss function are

𝐔k(𝓢,𝐔1,,𝐔d;zi)=(exp(𝓧i×j=1d𝐔j,𝓢)1+exp(𝓧i×j=1d𝐔j,𝓢)yi)(𝓧i×j=1,jkd𝐔j)(k)𝓢(k),\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})=\left(\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\mbox{\boldmath$\mathscr{S}$}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\mbox{\boldmath$\mathscr{S}$}\rangle)}-y_{i}\right)(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top})_{(k)}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}, (4.24)

for 1kd1\leq k\leq d, and

𝓢(𝓢,𝐔1,,𝐔d;zi)=(exp(𝓧i×j=1d𝐔j,𝓢)1+exp(𝓧i×j=1d𝐔j,𝓢)yi)(𝓧i×j=1d𝐔j).\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})=\left(\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\mbox{\boldmath$\mathscr{S}$}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\mbox{\boldmath$\mathscr{S}$}\rangle)}-y_{i}\right)(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}). (4.25)

As in tensor linear regression, the partial gradients for tensor logistic regression involve the multilinear transformations of the covariates, namely 𝓧i×j=1d𝐔j\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top} and 𝓧i×j=1,jkd𝐔j\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top}. Therefore, to derive the optimal convergence rate, it is essential to characterize the distributional properties of the transformed covariates. In this work, we assume that the covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} has a finite second moment and introduce the concept of the local second moment bound to obtain a sharp statistical convergence rate.

Assumption 4.8.

The vectorized covariate vec(𝓧i)\textup{vec}(\mbox{\boldmath$\mathscr{X}$}_{i}) has mean 𝟎\mathbf{0} and a positive definite covariance matrix 𝚺x\mathbf{\Sigma}_{x} satisfying 0<αxλmin(𝚺x)λmax(𝚺x)βx0<\alpha_{x}\leq\lambda_{\min}(\mathbf{\Sigma}_{x})\leq\lambda_{\max}(\mathbf{\Sigma}_{x})\leq\beta_{x}. Additionally, for some δ[0,1]\delta\in[0,1], the covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} satisfies the local second moment bound

Mx,2,δ:=max1kd(sup𝐯j𝒱(𝐔j,δ)𝔼[|𝓧i×j=1d𝐯j|2],sup𝐯j𝒱(𝐔j,δ)1lpk𝔼[|𝓧i×j=1,jkd𝐯j×k𝐜l|2]).\begin{split}&M_{x,2,\delta}\\ &:=\max_{1\leq k\leq d}\left(\sup_{\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{v}_{j}^{\top}\right|^{2}\right],\sup_{\begin{subarray}{c}\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)\\ 1\leq l\leq p_{k}\end{subarray}}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d}\mathbf{v}_{j}^{\top}\times_{k}\mathbf{c}_{l}^{\top}\right|^{2}\right]\right).\end{split} (4.26)

By definition, it is clear that Mx,2,δM_{x,2,\delta} could be smaller, and in many cases much smaller, than βx\beta_{x}. However, the statistical convergence rate of the estimator is governed by the local moment bound Mx,2,δM_{x,2,\delta}, while βx\beta_{x} plays a role in determining the sample size requirement and the computational convergence rate. Specifically, the statistical guarantees for the robust estimator with truncation parameter τ\tau, denoted as 𝓐^(τ)\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau), depend on Mx,2,δM_{x,2,\delta}.

Theorem 4.9.

For low-rank tensor logistic regression, suppose Assumption 4.8 holds with some δmin{σ¯1/(d+1)Err(0)+Cκ2αx1σ¯1deffMx,2,δlog(p¯)/n,1}\delta\geq\min\{\bar{\sigma}^{-1/(d+1)}\sqrt{\textup{Err}^{(0)}}+C\kappa^{2}\alpha_{x}^{-1}\bar{\sigma}^{-1}\sqrt{d_{\textup{eff}}M_{x,2,\delta}\log(\bar{p})/n},1\}. If nκ4βx2αx2log(p¯)n\gtrsim\kappa^{4}\beta_{x}^{2}\alpha_{x}^{-2}\log(\bar{p}), τσ¯d/(d+1)[nMx,2,δ/log(p¯)]1/2\tau\asymp\bar{\sigma}^{d/(d+1)}[nM_{x,2,\delta}/\log(\bar{p})]^{1/2}, and the other conditions in Theorem 3.3 hold with α=αx/2\alpha=\alpha_{x}/2 and β=βx/2\beta=\beta_{x}/2, then with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})), after sufficient iterations of Algorithm 1, we have the following error bounds

Err(𝓢^,𝐔^1,,𝐔^d)κ4αx2σ¯2d/(d+1)deffMx,2,δlog(p¯)/n\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d})\lesssim\kappa^{4}\alpha_{x}^{-2}\bar{\sigma}^{-2d/(d+1)}d_{\textup{eff}}M_{x,2,\delta}\log(\bar{p})/n (4.27)

and

𝓐^(τ)𝓐Fκ2αx1deff1/2Mx,2,δlog(p¯)/n.\|\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau)-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}\lesssim\kappa^{2}\alpha_{x}^{-1}d_{\textup{eff}}^{1/2}\sqrt{M_{x,2,\delta}\log(\bar{p})/n}. (4.28)

Theorem 4.9 provides the statistical convergence rates for heavy-tailed tensor logistic regression under the local second moment condition on the covariates. Importantly, the convergence rate of the robust gradient descent is shown to match the rate for low-rank tensor logistic regression when using the vanilla gradient descent algorithm with Gaussian design (Chen, Raskutti and Yuan, 2019). This demonstrates that our method can effectively handle heavy-tailed covariates, ensuring robust and optimal estimation in such settings. Similar to tensor linear regression, if the initial error satisfies Err(0)σ¯2/(d+1)\textup{Err}^{(0)}\leq\bar{\sigma}^{2/(d+1)} and the sample size nn is sufficiently large, then the local radius δ\delta is smaller than one. Under these conditions, the statistical convergence rates depend on the local second moment of the covariates, which governs the precision of the estimator.

Remark 4.10.

Compared to existing works on robust estimation for heavy-tailed logistic regression with vector covariates (Prasad et al., 2020; Zhu and Zhou, 2021; Han, Tsay and Wu, 2023), our proposed method offers a significant relaxation by replacing the finite fourth moment condition with a second moment condition. This relaxation makes the method more flexible and applicable to a broader range of scenarios, particularly when high-order moments (such as the fourth moment) do not exist or are difficult to ensure.

Remark 4.11.

The low-rank structure imposed on the tensor coefficient allows our approach to leverage the local second moment condition. This results in a much sharper convergence rate, compared to previous methods that typically do not exploit such structural properties. The local moment condition, which governs the behavior of the covariates in the partial gradients, contributes to a more precise characterization of gradient deviation, leading to better statistical guarantees.

4.3 Heavy-tailed tensor PCA

Another important statistical model for tensor data is tensor principal component analysis (PCA). Specifically, we consider

𝓨=𝓐+𝓔=𝓢×j=1d𝐔j+𝓔,\mbox{\boldmath$\mathscr{Y}$}=\mbox{\boldmath$\mathscr{A}$}^{*}+\mbox{\boldmath$\mathscr{E}$}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{U}_{j}^{*}+\mbox{\boldmath$\mathscr{E}$}, (4.29)

where 𝓨p1××pd\mbox{\boldmath$\mathscr{Y}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}} is the tensor-valued observation, 𝓐=𝓢×j=1d𝐔j\mbox{\boldmath$\mathscr{A}$}^{*}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{U}_{j}^{*} is the low-rank true signal, and 𝓔\mathscr{E} represents the random noise, assumed to be mean-zero. In the existing literature, much attention has been focused on the Gaussian or sub-Gaussian noise settings (Richard and Montanari, 2014; Zhang and Han, 2019; Han, Willett and Zhang, 2022).

In this subsection, we assume that random noise 𝓔\mathscr{E} is heavy-tailed. We propose applying the robust gradient descent method with truncated gradient estimators to estimate the low-rank signal in the presence of such heavy-tailed noise. The loss function for tensor PCA is (𝓢,𝐔1,,𝐔d;𝓨)=𝓨𝓢×j=1d𝐔jF2/2\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mbox{\boldmath$\mathscr{Y}$})=\|\mbox{\boldmath$\mathscr{Y}$}-\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}\|_{\text{F}}^{2}/2. The partial gradient with respect to 𝐔k\mathbf{U}_{k} is

𝐔k(𝓢,𝐔1,,𝐔d)=(𝓢×j=1,jkd𝐔j𝐔j×k𝐔k𝓨×j=1,jkd𝐔j)(k)𝓢(k)\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})=(\mbox{\boldmath$\mathscr{S}$}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j}\times_{k}\mathbf{U}_{k}-\mbox{\boldmath$\mathscr{Y}$}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top})_{(k)}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} (4.30)

for any k=1,2,,dk=1,2,\dots,d. Similarly, the partial gradient with respect to the core tensor 𝓢\mathscr{S} is

𝓢(𝓢,𝐔1,,𝐔d)=𝓢×j=1d𝐔j𝐔j𝓨×j=1d𝐔j.\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})=\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j}-\mbox{\boldmath$\mathscr{Y}$}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}. (4.31)
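A minimal numpy sketch of the two partial gradients (4.30)-(4.31) is given below; the helper names are ours, and the gradients are written before any truncation is applied.

```python
import numpy as np

def mode_product(T, M, mode):
    """Mode-`mode` product of tensor T with matrix M of shape (r, p_mode)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def unfold(T, mode):
    """Mode-`mode` matricization of T."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def pca_gradients(Y, S, U_list):
    """Untruncated partial gradients (4.30)-(4.31) of the tensor PCA loss."""
    d = len(U_list)
    # Gradient w.r.t. the core: S x_j U_j^T U_j - Y x_j U_j^T.
    S_term, Y_term = S, Y
    for j, U in enumerate(U_list):
        S_term = mode_product(S_term, U.T @ U, j)
        Y_term = mode_product(Y_term, U.T, j)
    grad_S = S_term - Y_term

    grad_U = []
    for k in range(d):
        A_term, B_term = S, Y
        for j, U in enumerate(U_list):
            if j == k:
                A_term = mode_product(A_term, U, k)          # mode k: r_k -> p_k
            else:
                A_term = mode_product(A_term, U.T @ U, j)
                B_term = mode_product(B_term, U.T, j)
        diff = A_term - B_term
        grad_U.append(unfold(diff, k) @ unfold(S, k).T)      # shape (p_k, r_k)
    return grad_S, grad_U
```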

These gradients are computed using multilinear transformations of the data, specifically 𝓨×j=1,jkd𝐔j\mbox{\boldmath$\mathscr{Y}$}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top} and 𝓨×j=1d𝐔j\mbox{\boldmath$\mathscr{Y}$}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}, which are the data projected onto the respective factor spaces. To ensure robustness against heavy-tailed noise, we introduce the local (1+ϵ)(1+\epsilon)-th moment condition for the noise tensor 𝓔\mathscr{E}. The condition is necessary for the gradient-based optimization method to converge effectively in the presence of heavy-tailed noise.

Assumption 4.12.

For some ϵ(0,1]\epsilon\in(0,1] and δ[0,1]\delta\in[0,1], 𝓔\mathscr{E} has the local (1+ϵ)(1+\epsilon)-th moment

Me,1+ϵ,δ:=max1kd(sup𝐯j𝒱(𝐔j,δ)𝔼[|𝓔×j=1d𝐯j|1+ϵ],sup𝐯j𝒱(𝐔j,δ)1lpk𝔼[|𝓔×j=1,jkd𝐯j×k𝐜l|1+ϵ]).\begin{split}&M_{e,1+\epsilon,\delta}\\ &:=\max_{1\leq k\leq d}\left(\sup_{\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}\times_{j=1}^{d}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}\right],\sup_{\begin{subarray}{c}\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)\\ 1\leq l\leq p_{k}\end{subarray}}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}\times_{j=1,j\neq k}^{d}\mathbf{v}_{j}^{\top}\times_{k}\mathbf{c}_{l}^{\top}\right|^{1+\epsilon}\right]\right).\end{split} (4.32)

In contrast to many existing statistical analyses for tensor PCA, our method does not require the entries of the random noise 𝓔\mathscr{E} to be independent or identically distributed. This relaxation is a key feature of our approach, allowing it to handle more general noise structures, including those with dependencies and heavy tails.

For the estimator obtained by the robust gradient descent, denoted as 𝓐^(τ)\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau), as well as the estimation error Err(𝓢^,𝐔^1,,𝐔^d)\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d}), we have the following convergence rates.

Theorem 4.13.

For tensor PCA in (4.29), suppose Assumption 4.12 holds with some ϵ(0,1]\epsilon\in(0,1] and δmin(σ¯1/(d+1)Err(0)+Cσ¯1deff1/2Me,1+ϵ,δ1/(1+ϵ),1)\delta\geq{\min}(\bar{\sigma}^{-1/(d+1)}\sqrt{\textup{Err}^{(0)}}+C\bar{\sigma}^{-1}d_{\textup{eff}}^{1/2}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)},1). If σ¯/Me,1+ϵ,δ1/(1+ϵ)p¯\underline{\sigma}/M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\gtrsim\sqrt{\bar{p}}, τκ2/ϵσ¯d/(d+1)Me,1+ϵ,δ1/(1+ϵ)\tau\asymp\kappa^{2/\epsilon}\bar{\sigma}^{d/(d+1)}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}, and other conditions in Theorem 3.3 hold with α=β=1/2\alpha=\beta=1/2, then with probability at least 1Cexp(Cp¯)1-C\exp(-C\bar{p}), after sufficient iterations of Algorithm 1, we have the following error bounds

Err(𝓢^,𝐔^1,,𝐔^d)σ¯2d/(d+1)deffMe,1+ϵ,δ2/(1+ϵ)\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d})\lesssim\bar{\sigma}^{-2d/(d+1)}d_{\textup{eff}}M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)} (4.33)

and

𝓐^(τ)𝓐Fdeff1/2Me,1+ϵ,δ1/(1+ϵ).\|\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau)-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}\lesssim d_{\textup{eff}}^{1/2}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}. (4.34)

Under the local (1+ϵ)(1+\epsilon)-th moment condition for the noise tensor 𝓔\mathscr{E}, the convergence rate of the proposed robust gradient descent method is shown to be comparable to that of vanilla gradient descent and achieves minimax optimality (Han, Willett and Zhang, 2022). Specifically, for ϵ=1\epsilon=1, the signal-to-noise ratio (SNR) condition σ¯/Me,1+ϵ,δ1/2p¯\underline{\sigma}/M^{1/2}_{e,1+\epsilon,\delta}\gtrsim\sqrt{\bar{p}} is identical to the SNR condition under the sub-Gaussian noise setting (Zhang and Xia, 2018). This result demonstrates that the robust gradient descent method is capable of effectively handling heavy-tailed noise, while still achieving minimax-optimal estimation performance. Furthermore, in a manner similar to tensor linear regression and logistic regression, if the signal strength satisfies σ¯p¯\bar{\sigma}\gtrsim\sqrt{\bar{p}} and the initial error is sufficiently small, i.e., Err(0)<σ¯2/(d+1)\textup{Err}^{(0)}<\bar{\sigma}^{2/(d+1)}, the local radius δ\delta is strictly smaller than one.

Remark 4.14.

The proposed robust gradient estimation approach offers several key advantages. First, unlike many existing methods that assume the noise follows a sub-Gaussian distribution, our approach relaxes this assumption by requiring only a (1+ϵ)(1+\epsilon)-th moment condition on the noise tensor 𝓔\mathscr{E}. This broader assumption allows the method to handle more general noise distributions, including those with heavier tails. Second, our method also accommodates the possibility that the elements of the noise tensor 𝓔\mathscr{E} may exhibit strong correlations. However, it is important to note that the correlation structure does not directly affect the estimation performance. Instead, the noise that influences the estimation is only the part projected onto the local regions, characterized by Me,1+ϵ,δM_{e,1+\epsilon,\delta}. This localized effect allows the method to remain robust even when the global correlation in the noise is strong. Third, we allow Me,1+ϵ,δM_{e,1+\epsilon,\delta} to diverge to infinity, meaning that the method can tolerate increasingly larger noise values in certain regions. However, for reliable estimation performance, the SNR must satisfy the condition σ¯/Me,1+ϵ,δ1/2p¯\underline{\sigma}/M_{e,1+\epsilon,\delta}^{1/2}\gtrsim\sqrt{\bar{p}}. This condition ensures that, even as the noise may increase in magnitude, the method still achieves optimal estimation performance as long as the signal is sufficiently strong.

5 Simulation Experiments

In this section, we present two simulation experiments. The first experiment is designed to verify the efficacy of the proposed method and highlight its advantages over other competitors. By comparing the performance of the proposed approach with that of alternative estimators, we demonstrate its superior robustness and accuracy. The second one aims to demonstrate the local moment effect in statistical convergence rates. Specifically, this experiment examines how the relaxation of the sub-Gaussian assumption and the use of local moment conditions influence the convergence behavior, providing empirical evidence for the theoretical results.

5.1 Experiment 1

We consider four models, and for each of them, we explore how different covariate and noise distributions affect the performance across diverse scenarios.

  • Model I (multi-response linear regression):

    𝐲i=(𝐀)𝐱i+𝐞i,i=1,,500,\mathbf{y}_{i}=(\mathbf{A}^{*})^{\top}\mathbf{x}_{i}+\mathbf{e}_{i},~{}~{}i=1,\dots,500, (5.1)

    where 𝐱i15\mathbf{x}_{i}\in\mathbb{R}^{15}, 𝐲i,𝐞i10\mathbf{y}_{i},\mathbf{e}_{i}\in\mathbb{R}^{10}, and 𝐀15×10\mathbf{A}^{*}\in\mathbb{R}^{15\times 10} is a rank-3 matrix. Entries of 𝐱i\mathbf{x}_{i} and 𝐞i\mathbf{e}_{i} are i.i.d., and four distributional cases are adopted for model I: (1) xi,jN(0,1)x_{i,j}\sim N(0,1) and ei,kN(0,1)e_{i,k}\sim N(0,1); (2) xi,jN(0,1)x_{i,j}\sim N(0,1) and ei,kt1.5e_{i,k}\sim t_{1.5}; (3) xi,jt~1.5(20)x_{i,j}\sim\widetilde{t}_{1.5}(20) and ei,kN(0,1)e_{i,k}\sim N(0,1); and (4) xi,jt~1.5(20)x_{i,j}\sim\widetilde{t}_{1.5}(20) and ei,kt1.5e_{i,k}\sim t_{1.5}. Here, t~ν(τ)\widetilde{t}_{\nu}(\tau) is the bounded tt distribution with ν\nu degrees of freedom and bound τ\tau, i.e., Xt~ν(τ)X\sim\widetilde{t}_{\nu}(\tau) if X=T(Y,τ)X=\text{T}(Y,\tau) where YtνY\sim t_{\nu}.

  • Model II (tensor linear regression):

    yi=𝓐,𝓧i+ei,i=1,,500,y_{i}=\langle\mbox{\boldmath$\mathscr{A}$}^{*},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle+e_{i},~{}~{}i=1,\dots,500, (5.2)

    where 𝓧i12×10×8\mbox{\boldmath$\mathscr{X}$}_{i}\in\mathbb{R}^{12\times 10\times 8}, yi,eiy_{i},e_{i}\in\mathbb{R}, and 𝓐12×10×8\mbox{\boldmath$\mathscr{A}$}^{*}\in\mathbb{R}^{12\times 10\times 8} has Tucker ranks (2,2,2)(2,2,2). Entries of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} are i.i.d., and four distributional cases are adopted: (1) (𝓧i)j1j2j3N(0,1)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim N(0,1) and eiN(0,1)e_{i}\sim N(0,1); (2) (𝓧i)j1j2j3N(0,1)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim N(0,1) and eit1.5e_{i}\sim t_{1.5}; (3) (𝓧i)j1j2j3t~1.5(20)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim\widetilde{t}_{1.5}(20) and eiN(0,1)e_{i}\sim N(0,1); and (4) (𝓧i)j1j2j3t~1.5(20)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim\widetilde{t}_{1.5}(20) and eit1.5e_{i}\sim t_{1.5}.

  • Model III (tensor logistic regression):

    (yi=1|𝓧i)=exp(𝓧i,𝓐)1+exp(𝓧i,𝓐),i=1,,500,\mathbb{P}(y_{i}=1|\mbox{\boldmath$\mathscr{X}$}_{i})=\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)},~{}~{}i=1,\dots,500, (5.3)

    where 𝓧i12×10×8\mbox{\boldmath$\mathscr{X}$}_{i}\in\mathbb{R}^{12\times 10\times 8}, yi{0,1}y_{i}\in\{0,1\}, and 𝓐12×10×8\mbox{\boldmath$\mathscr{A}$}^{*}\in\mathbb{R}^{12\times 10\times 8} has Tucker ranks (2,2,2)(2,2,2). Entries of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} are i.i.d., and two distributional cases are adopted: (1) (𝓧i)j1j2j3N(0,1)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim N(0,1) and (2) (𝓧i)j1j2j3t~1.5(20)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim\widetilde{t}_{1.5}(20).

  • Model IV (tensor PCA): 𝓨=𝓐+𝓔,\mbox{\boldmath$\mathscr{Y}$}=\mbox{\boldmath$\mathscr{A}$}^{*}+\mbox{\boldmath$\mathscr{E}$}, where 𝓨,𝓔12×10×8\mbox{\boldmath$\mathscr{Y}$},\mbox{\boldmath$\mathscr{E}$}\in\mathbb{R}^{12\times 10\times 8}, and 𝓐12×10×8\mbox{\boldmath$\mathscr{A}$}^{*}\in\mathbb{R}^{12\times 10\times 8} has Tucker ranks (2,2,2)(2,2,2). Entries of 𝓔\mathscr{E} are i.i.d. and two distributional cases are adopted: (1) 𝓔j1j2j3N(0,1)\mbox{\boldmath$\mathscr{E}$}_{j_{1}j_{2}j_{3}}\sim N(0,1) and (2) 𝓔j1j2j3t1.5\mbox{\boldmath$\mathscr{E}$}_{j_{1}j_{2}j_{3}}\sim t_{1.5}.

For Model I, 𝐀\mathbf{A}^{*} has singular values (10,8,6)(10,8,6), and for Models II-IV, the Tucker decomposition of 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} in (1.1) has a super-diagonal core tensor 𝓢=diag(10,8)\mbox{\boldmath$\mathscr{S}$}=\text{diag}(10,8). For all models, the factor matrices 𝐔j\mathbf{U}_{j}’s are generated randomly as the first rjr_{j} left singular vectors of a random Gaussian ensemble. The first distributional case of each model represents the light-tailed setting, while the others are heavy-tailed, as at least part of the data follows a heavy-tailed distribution.
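To make the simulation settings concrete, the following is a minimal sketch of the data-generating mechanism for Model II, Case 4; the helper names are ours, and the exact factor-generation details in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

def bounded_t(size, df=1.5, bound=20.0):
    """Bounded t distribution: truncate Y ~ t_df entrywise at +/- bound."""
    y = rng.standard_t(df, size=size)
    return np.sign(y) * np.minimum(np.abs(y), bound)

def random_tucker(dims=(12, 10, 8), ranks=(2, 2, 2), diag=(10.0, 8.0)):
    """Low-Tucker-rank coefficient tensor with super-diagonal core diag(10, 8)."""
    S = np.zeros(ranks)
    for i, s in enumerate(diag):
        S[i, i, i] = s
    A = S
    for k, (p, r) in enumerate(zip(dims, ranks)):
        U = np.linalg.qr(rng.standard_normal((p, r)))[0]     # orthonormal p x r factor
        A = np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)
    return A

# Model II, Case 4: heavy-tailed covariates and heavy-tailed noise.
n, dims = 500, (12, 10, 8)
A_star = random_tucker(dims)
X = bounded_t((n,) + dims)
y = np.tensordot(X, A_star, axes=3) + rng.standard_t(1.5, size=n)
```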

We apply the proposed robust gradient descent algorithm, as well as the vanilla gradient descent and adaptive Huber regression in Remark 4.2 (except for Model III) as competitors, to the data generated from each model. The initial value of 𝓐\mathscr{A} is obtained by Algorithm 2, and HOSVD is applied to obtain 𝓢(0),𝐔1(0),,𝐔d(0)\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\dots,\mathbf{U}_{d}^{(0)}. The hyperparameters are set as follows: a=b=1a=b=1, step size η=103\eta=10^{-3}, number of iterations T=200T=200, and the truncation parameter τ\tau is selected via five-fold cross-validation. For each model and distributional setting, the data generation and estimation procedure is replicated 500 times. The estimation errors across these 500 replications are presented in Figure 2, which summarizes the results of the experiment.

Figure 2: Estimation errors in logarithm (upper bar for the 75th percentile, dot for the median, and lower bar for the 25th percentile) of the three estimation methods for all models and cases.

According to Figure 2, when the data follow the light-tailed distribution (Case 1) in all models, the performances of the three estimation methods are nearly identical. However, in the heavy-tailed cases, the performance of vanilla gradient descent deteriorates significantly, with estimation errors much larger than those of the other two methods. Overall, the robust gradient descent method consistently yields the smallest estimation errors among the three methods. Specifically, when the covariates follow heavy-tailed distributions (i.e., Cases 3 and 4 for Models I and II), the robust gradient descent method outperforms adaptive Huber regression, producing significantly smaller estimation errors. When the covariates are normally distributed and the noise follows a heavy-tailed distribution (i.e., Case 2 for Models I and II), the performances of the robust gradient descent and adaptive Huber regression methods are similar. These numerical findings support the methodological insights presented in Remark 4.2 and are in line with the theoretical results discussed in Remark 4.7, confirming the robustness and efficiency of the proposed method in handling heavy-tailed data.

5.2 Experiment 2

We consider Model I in (5.1) and Model II in (5.2) in the second experiment. To numerically verify the local moment condition, we introduce a row-wise sparsity structure for the factor matrices. Specifically, we define each 𝐔k\mathbf{U}_{k} as 𝐔k=[𝐔~k,𝟎3×(pk5)]\mathbf{U}_{k}=[\widetilde{\mathbf{U}}_{k}^{\top},\mathbf{0}_{3\times(p_{k}-5)}]^{\top}, for k=1,2k=1,2, where 𝐔~k5×3\widetilde{\mathbf{U}}_{k}\in\mathbb{R}^{5\times 3} is a randomly generated matrix with orthonormal columns, following a similar procedure as in the first experiment.
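A minimal sketch of this row-sparse factor construction (the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def row_sparse_factor(p, r=3, active=5):
    """p x r factor with orthonormal columns supported on the first `active` rows."""
    U_tilde = np.linalg.qr(rng.standard_normal((active, r)))[0]   # 5 x 3 block
    return np.vstack([U_tilde, np.zeros((p - active, r))])

U1 = row_sparse_factor(15)   # Model I: p_1 = 15
U2 = row_sparse_factor(10)   # Model I: p_2 = 10
```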

In Model I, we have 𝐔1𝐱i=𝐔~1𝐱~i\mathbf{U}_{1}^{\top}\mathbf{x}_{i}=\widetilde{\mathbf{U}}_{1}^{\top}\widetilde{\mathbf{x}}_{i} and 𝐔2𝐞i=𝐔~1𝐞~i\mathbf{U}_{2}^{\top}\mathbf{e}_{i}=\widetilde{\mathbf{U}}_{1}^{\top}\widetilde{\mathbf{e}}_{i}, where 𝐱~i\widetilde{\mathbf{x}}_{i} and 𝐞~i\widetilde{\mathbf{e}}_{i} are the sub-vectors of 𝐱i\mathbf{x}_{i} and 𝐞i\mathbf{e}_{i}, respectively, containing their first five elements. In other words, if we project the covariates or noise onto a local region around the column spaces of 𝐔1\mathbf{U}_{1}^{*} and 𝐔2\mathbf{U}_{2}^{*}, respectively, the transformed distribution is only related to the first five elements. Hence, we consider the following vectors 𝐱i,15N(0,1)\mathbf{x}_{i,1}\in\mathbb{R}^{5}\sim N(0,1), 𝐱i,25t~1.5(20)\mathbf{x}_{i,2}\in\mathbb{R}^{5}\sim\widetilde{t}_{1.5}(20), 𝐱i,310N(0,1)\mathbf{x}_{i,3}\in\mathbb{R}^{10}\sim N(0,1), 𝐱i,410t~1.5(20)\mathbf{x}_{i,4}\in\mathbb{R}^{10}\sim\widetilde{t}_{1.5}(20), 𝐞i,1,𝐞i,25N(0,1)\mathbf{e}_{i,1},\mathbf{e}_{i,2}\in\mathbb{R}^{5}\sim N(0,1), and 𝐞i,35t1.5\mathbf{e}_{i,3}\in\mathbb{R}^{5}\sim t_{1.5}, where 𝐯D\mathbf{v}\sim D represents that the elements of 𝐯\mathbf{v} are independent and follow the distribution DD. Seven distributional cases are considered: (1) 𝐱i=(𝐱i,1,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,2)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,2}^{\top})^{\top}; (2) 𝐱i=(𝐱i,1,𝐱i,2)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,2}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,2)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,2}^{\top})^{\top}; (3) 𝐱i=(𝐱i,1,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,3)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,3}^{\top})^{\top}; (4) 𝐱i=(𝐱i,1,𝐱i,4)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,4}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,3)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,3}^{\top})^{\top}; (5) 𝐱i=(𝐱i,2,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,2}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,2)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,2}^{\top})^{\top}; (6) 𝐱i=(𝐱i,1,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,3,𝐞i,1)\mathbf{e}_{i}=(\mathbf{e}_{i,3}^{\top},\mathbf{e}_{i,1}^{\top})^{\top}; and (7) 𝐱i=(𝐱i,2,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,2}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,3,𝐞i,1)\mathbf{e}_{i}=(\mathbf{e}_{i,3}^{\top},\mathbf{e}_{i,1}^{\top})^{\top}. By definition, for some sufficiently small δ\delta, the local moments Mx,2,δM_{x,2,\delta} and Me,1+ϵ,δM_{e,1+\epsilon,\delta} remain unchanged in the first four settings even with varying distributions of 𝐱i\mathbf{x}_{i} and 𝐞i\mathbf{e}_{i}. In the last three settings, the local moments of covariates or noise get larger.

In Model II, by the row-wise sparse structure of 𝐔k\mathbf{U}_{k}’s, 𝓧i×j=1d𝐔j=𝓧~i×j=1d𝐔~j\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}=\widetilde{\mbox{\boldmath$\mathscr{X}$}}_{i}\times_{j=1}^{d}\widetilde{\mathbf{U}}_{j}^{\top}, where 𝓧~i5×5×5\widetilde{\mbox{\boldmath$\mathscr{X}$}}_{i}\in\mathbb{R}^{5\times 5\times 5} is sub-tensor of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} containing (𝓧i)j1j2j3(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}} for 1jk51\leq j_{k}\leq 5 and 1k31\leq k\leq 3. Hence, we consider four distributional cases for 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i}: (1) all elements 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follow N(0,1)N(0,1) distribution; (2) all elements of 𝓧~i\widetilde{\mbox{\boldmath$\mathscr{X}$}}_{i} follow N(0,1)N(0,1) distribution, and the other elements of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follow t~1.5(20)\widetilde{t}_{1.5}(20) distribution; (3) all elements of 𝓧~i\widetilde{\mbox{\boldmath$\mathscr{X}$}}_{i} follow t~1.5(20)\widetilde{t}_{1.5}(20) distribution, and the other elements of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follow N(0,1)N(0,1) distribution; and (4) all elements of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follow t~1.5(20)\widetilde{t}_{1.5}(20) distribution. Similarly to model I, the local moment Mx,2,δM_{x,2,\delta} is the same in cases 1 and 2, and gets large in cases 3 and 4.

Figure 3: Estimation errors (upper bar for the 75th percentile, dot for the median, and lower bar for the 25th percentile) of robust gradient descent for all models and cases.

For each model and case, we apply the proposed robust gradient descent method and replicate the procedure 500 times. The estimation errors over the 500 replications are presented in Figure 3. For Model I, the estimation errors in Cases 1-4 are almost identical, but those in the other cases are significantly larger. Similarly, for Model II, the errors are nearly the same in Cases 1 and 2, and increase substantially in Cases 3 and 4. These numerical findings in the second experiment confirm that the estimation errors are closely tied to the local moment bounds of the covariate and noise distributions, as discussed in Theorem 4.5.

6 Real Data Example: Chest CT Images

In this section, we apply the proposed robust gradient descent (RGD) estimation approach to the publicly available COVID-CT dataset (Yang et al., 2020). The dataset consists of 317 COVID-19 positive chest CT scans and 397 negative scans, selected from four open-access databases. Each scan is a 150×150150\times 150 greyscale matrix with a binary label indicating the disease status. The histograms of kurtosis for the 150×150150\times 150 pixels are presented in Figure 1, which provide strong evidence of heavy-tailed distributions, a characteristic commonly observed in biomedical imaging data.

To identify COVID-positive samples based on chest CT scans, we employ a low-rank tensor logistic regression model with d=2d=2 and p1=p2=150p_{1}=p_{2}=150. For dimension reduction, we use a low-rank structure with r1=r2=5r_{1}=r_{2}=5. As in the first experiment (Section 5), we apply both the proposed robust gradient descent (RGD) algorithm and the vanilla gradient descent (VGD) algorithm to estimate the Tucker decomposition in (1.1). For training, we randomly select 200 positive scans and 250 negative scans, using the remaining data as testing data to evaluate the classification performance of the estimated models.

Using each estimation method, we classify the testing data into four categories: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The performance metrics used for evaluation include: precision rate: P=TP/(TP+FP)P=\text{TP}/(\text{TP}+\text{FP}); recall rate: R=TP/(TP+FN)R=\text{TP}/(\text{TP}+\text{FN}); and F1 score: F1=2/(P1+R1)F_{1}=2/(P^{-1}+R^{-1}). The precision, recall, and F1 scores for the RGD method are reported in Table 1, alongside the performance of the VGD method as a benchmark.
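For reference, a minimal sketch of these evaluation metrics for binary labels in {0, 1} (the helper name is ours):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 score for binary predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 / (1.0 / precision + 1.0 / recall)
    return precision, recall, f1
```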

The results demonstrate that the RGD method outperforms the VGD method in terms of all three evaluation metrics: precision, recall, and F1 score. This highlights the ability of the proposed method to provide reliable and stable statistical inference in real-world applications, especially in the challenging domain of COVID-19 diagnosis from chest CT scans.

Table 1: Classification performance of VGD and RGD on chest CT images
Method Precision Recall F1 Score
VGD 0.898 0.829 0.862
RGD 0.954 0.880 0.916

7 Conclusion and Discussion

We propose a unified and general robust estimation framework for low-rank tensor models, which combines robust gradient estimators via element-wise truncation with gradient descent updates. This method is shown to possess three key properties: computational efficiency, statistical optimality, and distributional robustness. We apply this framework to a variety of tensor models, including tensor linear regression, tensor logistic regression, and tensor PCA. Under a mild local moment condition for the covariates and noise distributions, the proposed method achieves minimax optimal statistical error rates, even in the presence of heavy-tailed distributions.

The proposed robust gradient descent framework can be easily integrated with other types of robust gradient functions, such as median of means and rank-based methods. Furthermore, while we focus on heavy-tailed distributions, the framework can also be adapted to handle other scenarios with outliers, including Huber’s ϵ\epsilon-contamination model and data with measurement errors.

While we have primarily considered Tucker low-rank tensor models in this paper, the proposed method is highly versatile and can be extended to other tensor models with different low-dimensional structures. In addition to Tucker ranks, there are various definitions of tensor ranks and corresponding low-rank tensor models (e.g., canonical polyadic decomposition and matrix product state models) (Kolda and Bader, 2009). Given the broad applicability of gradient descent algorithms, the robust gradient descent approach can be leveraged for these alternative models as well. Moreover, sparsity is often a key component in low-rank tensor models, facilitating dimension reduction and enabling variable selection (Zhang and Han, 2019; Wang et al., 2022). The robust gradient descent method can be utilized for robust sparse tensor decomposition, further enhancing its utility in high-dimensional settings.

In nonconvex optimization, initialization plays a crucial role in determining the convergence rate and the required number of iterations. Specifically, the radius δ\delta in the local moment condition depends on the initialization strategy. Hence, developing computationally efficient and statistically robust initialization methods tailored to each statistical model is of significant interest for future work (Jain et al., 2017).

References

  • Beggs and Plenz (2003) {barticle}[author] \bauthor\bsnmBeggs, \bfnmJohn M\binitsJ. M. and \bauthor\bsnmPlenz, \bfnmDietmar\binitsD. (\byear2003). \btitleNeuronal avalanches in neocortical circuits. \bjournalJournal of Neuroscience \bvolume23 \bpages11167–11177. \endbibitem
  • Bi, Qu and Shen (2018) {barticle}[author] \bauthor\bsnmBi, \bfnmXuan\binitsX., \bauthor\bsnmQu, \bfnmAnnie\binitsA. and \bauthor\bsnmShen, \bfnmXiaotong\binitsX. (\byear2018). \btitleMultilayer tensor factorization with applications to recommender systems. \bjournalAnnals of Statistics \bvolume46 \bpages3308–3333. \endbibitem
  • Bi et al. (2021) {barticle}[author] \bauthor\bsnmBi, \bfnmXuan\binitsX., \bauthor\bsnmTang, \bfnmXiwei\binitsX., \bauthor\bsnmYuan, \bfnmYubai\binitsY., \bauthor\bsnmZhang, \bfnmYanqing\binitsY. and \bauthor\bsnmQu, \bfnmAnnie\binitsA. (\byear2021). \btitleTensors in statistics. \bjournalAnnual Review of Statistics and Its Application \bvolume8 \bpages345–368. \endbibitem
  • Bubeck (2015) {barticle}[author] \bauthor\bsnmBubeck, \bfnmSébastien\binitsS. (\byear2015). \btitleConvex optimization: Algorithms and complexity. \bjournalFoundations and Trends® in Machine Learning \bvolume8 \bpages231–357. \endbibitem
  • Bubeck, Cesa-Bianchi and Lugosi (2013) {barticle}[author] \bauthor\bsnmBubeck, \bfnmSébastien\binitsS., \bauthor\bsnmCesa-Bianchi, \bfnmNicolo\binitsN. and \bauthor\bsnmLugosi, \bfnmGábor\binitsG. (\byear2013). \btitleBandits with heavy tail. \bjournalIEEE Transactions on Information Theory \bvolume59 \bpages7711–7717. \endbibitem
  • Cai et al. (2019) {barticle}[author] \bauthor\bsnmCai, \bfnmChangxiao\binitsC., \bauthor\bsnmLi, \bfnmGen\binitsG., \bauthor\bsnmPoor, \bfnmH Vincent\binitsH. V. and \bauthor\bsnmChen, \bfnmYuxin\binitsY. (\byear2019). \btitleNonconvex low-rank tensor completion from noisy data. \bjournalAdvances in Neural Information Processing Systems \bvolume32. \endbibitem
  • Candes and Plan (2011) {barticle}[author] \bauthor\bsnmCandes, \bfnmEmmanuel J\binitsE. J. and \bauthor\bsnmPlan, \bfnmYaniv\binitsY. (\byear2011). \btitleTight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. \bjournalIEEE Transactions on Information Theory \bvolume57 \bpages2342–2359. \endbibitem
  • Catoni (2012) {barticle}[author] \bauthor\bsnmCatoni, \bfnmOlivier\binitsO. (\byear2012). \btitleChallenging the empirical mean and empirical variance: a deviation study. \bjournalAnnales de l’IHP Probabilités et Statistiques \bvolume48 \bpages1148–1185. \endbibitem
  • Chen, Raskutti and Yuan (2019) {barticle}[author] \bauthor\bsnmChen, \bfnmHan\binitsH., \bauthor\bsnmRaskutti, \bfnmGarvesh\binitsG. and \bauthor\bsnmYuan, \bfnmMing\binitsM. (\byear2019). \btitleNon-convex projected gradient descent for generalized low-rank tensor regression. \bjournalJournal of Machine Learning Research \bvolume20 \bpages1–37. \endbibitem
  • Chen and Wainwright (2015) {barticle}[author] \bauthor\bsnmChen, \bfnmYudong\binitsY. and \bauthor\bsnmWainwright, \bfnmMartin J\binitsM. J. (\byear2015). \btitleFast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. \bjournalarXiv preprint arXiv:1509.03025. \endbibitem
  • Chen, Yang and Zhang (2022) {barticle}[author] \bauthor\bsnmChen, \bfnmRong\binitsR., \bauthor\bsnmYang, \bfnmDan\binitsD. and \bauthor\bsnmZhang, \bfnmCun-Hui\binitsC.-H. (\byear2022). \btitleFactor models for high-dimensional tensor time series. \bjournalJournal of the American Statistical Association \bvolume117 \bpages94–116. \endbibitem
  • De Lathauwer, De Moor and Vandewalle (2000) {barticle}[author] \bauthor\bsnmDe Lathauwer, \bfnmLieven\binitsL., \bauthor\bsnmDe Moor, \bfnmBart\binitsB. and \bauthor\bsnmVandewalle, \bfnmJoos\binitsJ. (\byear2000). \btitleA multilinear singular value decomposition. \bjournalSIAM Journal on Matrix Analysis and Applications \bvolume21 \bpages1253–1278. \endbibitem
  • Devroye et al. (2016) {barticle}[author] \bauthor\bsnmDevroye, \bfnmLuc\binitsL., \bauthor\bsnmLerasle, \bfnmMatthieu\binitsM., \bauthor\bsnmLugosi, \bfnmGabor\binitsG., \bauthor\bsnmOlivetra, \bfnmRoberto I\binitsR. I. \betalet al. (\byear2016). \btitleSub-Gaussian mean estimators. \bjournalAnnals OF Statistics \bvolume44 \bpages2695. \endbibitem
  • Dong et al. (2023) {barticle}[author] \bauthor\bsnmDong, \bfnmHarry\binitsH., \bauthor\bsnmTong, \bfnmTian\binitsT., \bauthor\bsnmMa, \bfnmCong\binitsC. and \bauthor\bsnmChi, \bfnmYuejie\binitsY. (\byear2023). \btitleFast and provable tensor robust principal component analysis via scaled gradient descent. \bjournalInformation and Inference: A Journal of the IMA \bvolume12 \bpages1716–1758. \endbibitem
  • Fan, Li and Wang (2017) {barticle}[author] \bauthor\bsnmFan, \bfnmJianqing\binitsJ., \bauthor\bsnmLi, \bfnmQuefeng\binitsQ. and \bauthor\bsnmWang, \bfnmYuyan\binitsY. (\byear2017). \btitleEstimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. \bjournalJournal of the Royal Statistical Society. Series B, Statistical methodology \bvolume79 \bpages247. \endbibitem
  • Fan, Wang and Zhu (2021) {barticle}[author] \bauthor\bsnmFan, \bfnmJianqing\binitsJ., \bauthor\bsnmWang, \bfnmWeichen\binitsW. and \bauthor\bsnmZhu, \bfnmZiwei\binitsZ. (\byear2021). \btitleA shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. \bjournalAnnals of Statistics \bvolume49 \bpages1239. \endbibitem
  • Friedman et al. (2012) {barticle}[author] \bauthor\bsnmFriedman, \bfnmNir\binitsN., \bauthor\bsnmIto, \bfnmShinya\binitsS., \bauthor\bsnmBrinkman, \bfnmBraden AW\binitsB. A., \bauthor\bsnmShimono, \bfnmMasanori\binitsM., \bauthor\bsnmDeVille, \bfnmRE Lee\binitsR. L., \bauthor\bsnmDahmen, \bfnmKarin A\binitsK. A., \bauthor\bsnmBeggs, \bfnmJohn M\binitsJ. M. and \bauthor\bsnmButler, \bfnmThomas C\binitsT. C. (\byear2012). \btitleUniversal critical dynamics in high resolution neuronal avalanche data. \bjournalPhysical Review Letters \bvolume108 \bpages208102. \endbibitem
  • Han, Tsay and Wu (2023) {barticle}[author] \bauthor\bsnmHan, \bfnmYuefeng\binitsY., \bauthor\bsnmTsay, \bfnmRuey S\binitsR. S. and \bauthor\bsnmWu, \bfnmWei Biao\binitsW. B. (\byear2023). \btitleHigh dimensional generalized linear models for temporal dependent data. \bjournalBernoulli \bvolume29 \bpages105–131. \endbibitem
  • Han, Willett and Zhang (2022) {barticle}[author] \bauthor\bsnmHan, \bfnmRungang\binitsR., \bauthor\bsnmWillett, \bfnmRebecca\binitsR. and \bauthor\bsnmZhang, \bfnmAnru R\binitsA. R. (\byear2022). \btitleAn optimal statistical and computational framework for generalized tensor estimation. \bjournalThe Annals of Statistics \bvolume50 \bpages1–29. \endbibitem
  • Huber (1964) {barticle}[author] \bauthor\bsnmHuber, \bfnmPeter J\binitsP. J. (\byear1964). \btitleRobust Estimation of a Location Parameter. \bjournalThe Annals of Mathematical Statistics \bpages73–101. \endbibitem
  • Jain et al. (2017) {barticle}[author] \bauthor\bsnmJain, \bfnmPrateek\binitsP., \bauthor\bsnmKar, \bfnmPurushottam\binitsP. \betalet al. (\byear2017). \btitleNon-convex optimization for machine learning. \bjournalFoundations and Trends® in Machine Learning \bvolume10 \bpages142–363. \endbibitem
  • Kolda and Bader (2009) {barticle}[author] \bauthor\bsnmKolda, \bfnmTamara G\binitsT. G. and \bauthor\bsnmBader, \bfnmBrett W\binitsB. W. (\byear2009). \btitleTensor decompositions and applications. \bjournalSIAM Review \bvolume51 \bpages455–500. \endbibitem
  • Li et al. (2018) {barticle}[author] \bauthor\bsnmLi, \bfnmXiaoshan\binitsX., \bauthor\bsnmXu, \bfnmDa\binitsD., \bauthor\bsnmZhou, \bfnmHua\binitsH. and \bauthor\bsnmLi, \bfnmLexin\binitsL. (\byear2018). \btitleTucker tensor regression and neuroimaging analysis. \bjournalStatistics in Biosciences \bvolume10 \bpages520–545. \endbibitem
  • Li et al. (2020) {barticle}[author] \bauthor\bsnmLi, \bfnmYuanxin\binitsY., \bauthor\bsnmChi, \bfnmYuejie\binitsY., \bauthor\bsnmZhang, \bfnmHuishuai\binitsH. and \bauthor\bsnmLiang, \bfnmYingbin\binitsY. (\byear2020). \btitleNon-convex low-rank matrix recovery with arbitrary outliers via median-truncated gradient descent. \bjournalInformation and Inference: A Journal of the IMA \bvolume9 \bpages289–325. \endbibitem
  • Loh (2017) {barticle}[author] \bauthor\bsnmLoh, \bfnmPo-Ling\binitsP.-L. (\byear2017). \btitleStatistical consistency and asymptotic normality for high-dimensional robust M-estimators. \bjournalThe Annals of Statistics \bvolume45 \bpages866–896. \endbibitem
  • Lu et al. (2024) {barticle}[author] \bauthor\bsnmLu, \bfnmYin\binitsY., \bauthor\bsnmTao, \bfnmChunbai\binitsC., \bauthor\bsnmWang, \bfnmDi\binitsD., \bauthor\bsnmUddin, \bfnmGazi Salah\binitsG. S., \bauthor\bsnmWu, \bfnmLibo\binitsL. and \bauthor\bsnmZhu, \bfnmXuening\binitsX. (\byear2024). \btitleRobust estimation for dynamic spatial autoregression models with nearly optimal rates. \bjournalAvailable at SSRN 4873355. \endbibitem
  • Ma et al. (2018) {binproceedings}[author] \bauthor\bsnmMa, \bfnmCong\binitsC., \bauthor\bsnmWang, \bfnmKaizheng\binitsK., \bauthor\bsnmChi, \bfnmYuejie\binitsY. and \bauthor\bsnmChen, \bfnmYuxin\binitsY. (\byear2018). \btitleImplicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In \bbooktitleInternational Conference on Machine Learning \bpages3345–3354. \bpublisherPMLR. \endbibitem
  • Negahban and Wainwright (2011) {barticle}[author] \bauthor\bsnmNegahban, \bfnmSahand\binitsS. and \bauthor\bsnmWainwright, \bfnmMartin J\binitsM. J. (\byear2011). \btitleEstimation of (near) low-rank matrices with noise and high-dimensional scaling. \bjournalAnnals of Statistics \bvolume39 \bpages1069–1097. \endbibitem
  • Negahban and Wainwright (2012) {barticle}[author] \bauthor\bsnmNegahban, \bfnmSahand\binitsS. and \bauthor\bsnmWainwright, \bfnmMartin J\binitsM. J. (\byear2012). \btitleRestricted strong convexity and weighted matrix completion: Optimal bounds with noise. \bjournalJournal of Machine Learning Research \bvolume13 \bpages1665–1697. \endbibitem
  • Netrapalli et al. (2014) {barticle}[author] \bauthor\bsnmNetrapalli, \bfnmPraneeth\binitsP., \bauthor\bsnmUN, \bfnmNiranjan\binitsN., \bauthor\bsnmSanghavi, \bfnmSujay\binitsS., \bauthor\bsnmAnandkumar, \bfnmAnimashree\binitsA. and \bauthor\bsnmJain, \bfnmPrateek\binitsP. (\byear2014). \btitleNon-convex robust PCA. \bjournalAdvances in Neural Information Processing Systems \bvolume27. \endbibitem
  • Prasad et al. (2020) {barticle}[author] \bauthor\bsnmPrasad, \bfnmAdarsh\binitsA., \bauthor\bsnmSuggala, \bfnmArun Sai\binitsA. S., \bauthor\bsnmBalakrishnan, \bfnmSivaraman\binitsS. and \bauthor\bsnmRavikumar, \bfnmPradeep\binitsP. (\byear2020). \btitleRobust estimation via robust gradient estimation. \bjournalJournal of the Royal Statistical Society Series B: Statistical Methodology \bvolume82 \bpages601–627. \endbibitem
  • Raskutti, Yuan and Chen (2019) {barticle}[author] \bauthor\bsnmRaskutti, \bfnmGarvesh\binitsG., \bauthor\bsnmYuan, \bfnmMing\binitsM. and \bauthor\bsnmChen, \bfnmHan\binitsH. (\byear2019). \btitleConvex regularization for high-dimensional multi-response tensor regression. \bjournalAnnals of Statistics \bvolume47 \bpages1554-1584. \endbibitem
  • Richard and Montanari (2014) {barticle}[author] \bauthor\bsnmRichard, \bfnmEmile\binitsE. and \bauthor\bsnmMontanari, \bfnmAndrea\binitsA. (\byear2014). \btitleA statistical model for tensor PCA. \bjournalAdvances in Neural Information Processing Systems \bvolume27. \endbibitem
  • Roberts, Boonstra and Breakspear (2015) {barticle}[author] \bauthor\bsnmRoberts, \bfnmJames A\binitsJ. A., \bauthor\bsnmBoonstra, \bfnmTjeerd W\binitsT. W. and \bauthor\bsnmBreakspear, \bfnmMichael\binitsM. (\byear2015). \btitleThe heavy tail of the human brain. \bjournalCurrent Opinion in Neurobiology \bvolume31 \bpages164–172. \endbibitem
  • Shen et al. (2023) {barticle}[author] \bauthor\bsnmShen, \bfnmYinan\binitsY., \bauthor\bsnmLi, \bfnmJingyang\binitsJ., \bauthor\bsnmCai, \bfnmJian-Feng\binitsJ.-F. and \bauthor\bsnmXia, \bfnmDong\binitsD. (\byear2023). \btitleComputationally efficient and statistically optimal robust low-rank matrix and tensor estimation. \bjournalarXiv preprint arXiv:2203.00953. \endbibitem
  • Sun, Zhou and Fan (2020) {barticle}[author] \bauthor\bsnmSun, \bfnmQiang\binitsQ., \bauthor\bsnmZhou, \bfnmWen-Xin\binitsW.-X. and \bauthor\bsnmFan, \bfnmJianqing\binitsJ. (\byear2020). \btitleAdaptive Huber regression. \bjournalJournal of the American Statistical Association \bvolume115 \bpages254–265. \endbibitem
  • Tan, Sun and Witten (2023) {barticle}[author] \bauthor\bsnmTan, \bfnmKean Ming\binitsK. M., \bauthor\bsnmSun, \bfnmQiang\binitsQ. and \bauthor\bsnmWitten, \bfnmDaniela\binitsD. (\byear2023). \btitleSparse reduced rank Huber regression in high dimensions. \bjournalJournal of the American Statistical Association \bvolume118 \bpages2383–2393. \endbibitem
  • Tarzanagh and Michailidis (2022) {barticle}[author] \bauthor\bsnmTarzanagh, \bfnmDavoud Ataee\binitsD. A. and \bauthor\bsnmMichailidis, \bfnmGeorge\binitsG. (\byear2022). \btitleRegularized and smooth double core tensor factorization for heterogeneous data. \bjournalJournal of Machine Learning Research \bvolume23 \bpages1–49. \endbibitem
  • Tomioka and Suzuki (2013) {binproceedings}[author] \bauthor\bsnmTomioka, \bfnmRyota\binitsR. and \bauthor\bsnmSuzuki, \bfnmTaiji\binitsT. (\byear2013). \btitleConvex tensor decomposition via structured Schatten norm regularization. In \bbooktitleAdvances in Neural Information Processing Systems (NIPS) \bpages1331–1339. \endbibitem
  • Tong, Ma and Chi (2022) {binproceedings}[author] \bauthor\bsnmTong, \bfnmTian\binitsT., \bauthor\bsnmMa, \bfnmCong\binitsC. and \bauthor\bsnmChi, \bfnmYuejie\binitsY. (\byear2022). \btitleAccelerating ill-conditioned robust low-rank tensor regression. In \bbooktitleICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) \bpages9072–9076. \bpublisherIEEE. \endbibitem
  • Tong et al. (2022) {barticle}[author] \bauthor\bsnmTong, \bfnmTian\binitsT., \bauthor\bsnmMa, \bfnmCong\binitsC., \bauthor\bsnmPrater-Bennette, \bfnmAshley\binitsA., \bauthor\bsnmTripp, \bfnmErin\binitsE. and \bauthor\bsnmChi, \bfnmYuejie\binitsY. (\byear2022). \btitleScaling and scalability: Provable nonconvex low-rank tensor estimation from incomplete measurements. \bjournalJournal of Machine Learning Research \bvolume23 \bpages1–77. \endbibitem
  • Tu et al. (2016) {binproceedings}[author] \bauthor\bsnmTu, \bfnmStephen\binitsS., \bauthor\bsnmBoczar, \bfnmRoss\binitsR., \bauthor\bsnmSimchowitz, \bfnmMax\binitsM., \bauthor\bsnmSoltanolkotabi, \bfnmMahdi\binitsM. and \bauthor\bsnmRecht, \bfnmBen\binitsB. (\byear2016). \btitleLow-rank solutions of linear matrix equations via procrustes flow. In \bbooktitleInternational Conference on Machine Learning \bpages964–973. \bpublisherPMLR. \endbibitem
  • Tucker (1966) {barticle}[author] \bauthor\bsnmTucker, \bfnmLedyard R\binitsL. R. (\byear1966). \btitleSome mathematical notes on three-mode factor analysis. \bjournalPsychometrika \bvolume31 \bpages279–311. \endbibitem
  • Wainwright (2019) {bbook}[author] \bauthor\bsnmWainwright, \bfnmMartin J\binitsM. J. (\byear2019). \btitleHigh-dimensional statistics: A non-asymptotic viewpoint \bvolume48. \bpublisherCambridge University Press. \endbibitem
  • Wang and Tsay (2023) {barticle}[author] \bauthor\bsnmWang, \bfnmDi\binitsD. and \bauthor\bsnmTsay, \bfnmRuey S\binitsR. S. (\byear2023). \btitleRate-optimal robust estimation of high-dimensional vector autoregressive models. \bjournalThe Annals of Statistics \bvolume51 \bpages846–877. \endbibitem
  • Wang, Zhang and Gu (2017) {binproceedings}[author] \bauthor\bsnmWang, \bfnmLingxiao\binitsL., \bauthor\bsnmZhang, \bfnmXiao\binitsX. and \bauthor\bsnmGu, \bfnmQuanquan\binitsQ. (\byear2017). \btitleA unified computational and statistical framework for nonconvex low-rank matrix estimation. In \bbooktitleArtificial Intelligence and Statistics \bpages981–990. \bpublisherPMLR. \endbibitem
  • Wang, Zhang and Mai (2023) {barticle}[author] \bauthor\bsnmWang, \bfnmNing\binitsN., \bauthor\bsnmZhang, \bfnmXin\binitsX. and \bauthor\bsnmMai, \bfnmQing\binitsQ. (\byear2023). \btitleHigh-dimensional tensor response regression using the t-distribution. \bjournalarXiv preprint arXiv:2306.12125. \endbibitem
  • Wang et al. (2020) {barticle}[author] \bauthor\bsnmWang, \bfnmLan\binitsL., \bauthor\bsnmPeng, \bfnmBo\binitsB., \bauthor\bsnmBradic, \bfnmJelena\binitsJ., \bauthor\bsnmLi, \bfnmRunze\binitsR. and \bauthor\bsnmWu, \bfnmYunan\binitsY. (\byear2020). \btitleA tuning-free robust and efficient approach to high-dimensional regression. \bjournalJournal of the American Statistical Association \bpages1–44. \endbibitem
  • Wang et al. (2022) {barticle}[author] \bauthor\bsnmWang, \bfnmDi\binitsD., \bauthor\bsnmZheng, \bfnmYao\binitsY., \bauthor\bsnmLian, \bfnmHeng\binitsH. and \bauthor\bsnmLi, \bfnmGuodong\binitsG. (\byear2022). \btitleHigh-dimensional vector autoregressive time series modeling via tensor decomposition. \bjournalJournal of the American Statistical Association \bvolume117 \bpages1338–1356. \endbibitem
  • Wei et al. (2023) {barticle}[author] \bauthor\bsnmWei, \bfnmBo\binitsB., \bauthor\bsnmPeng, \bfnmLimin\binitsL., \bauthor\bsnmGuo, \bfnmYing\binitsY., \bauthor\bsnmManatunga, \bfnmAmita\binitsA. and \bauthor\bsnmStevens, \bfnmJennifer\binitsJ. (\byear2023). \btitleTensor response quantile regression with neuroimaging data. \bjournalBiometrics \bvolume79 \bpages1947–1958. \endbibitem
  • Wu et al. (2022) {barticle}[author] \bauthor\bsnmWu, \bfnmYing\binitsY., \bauthor\bsnmChen, \bfnmDan\binitsD., \bauthor\bsnmLi, \bfnmChaoqian\binitsC. and \bauthor\bsnmTang, \bfnmNiansheng\binitsN. (\byear2022). \btitleBayesian tensor logistic regression with applications to neuroimaging data analysis of Alzheimer’s disease. \bjournalStatistical Methods in Medical Research \bvolume31 \bpages2368–2382. \endbibitem
  • Xu, Zhang and Gu (2017) {binproceedings}[author] \bauthor\bsnmXu, \bfnmPan\binitsP., \bauthor\bsnmZhang, \bfnmTingting\binitsT. and \bauthor\bsnmGu, \bfnmQuanquan\binitsQ. (\byear2017). \btitleEfficient algorithm for sparse tensor-variate gaussian graphical models via gradient descent. In \bbooktitleArtificial Intelligence and Statistics \bpages923–932. \bpublisherPMLR. \endbibitem
  • Yang et al. (2020) {barticle}[author] \bauthor\bsnmYang, \bfnmXingyi\binitsX., \bauthor\bsnmHe, \bfnmXuehai\binitsX., \bauthor\bsnmZhao, \bfnmJinyu\binitsJ., \bauthor\bsnmZhang, \bfnmYichen\binitsY., \bauthor\bsnmZhang, \bfnmShanghang\binitsS. and \bauthor\bsnmXie, \bfnmPengtao\binitsP. (\byear2020). \btitleCOVID-CT-Dataset: A CT scan dataset about COVID-19. \bjournalarXiv preprint arXiv:2003.13865. \endbibitem
  • Yuan and Zhang (2016) {barticle}[author] \bauthor\bsnmYuan, \bfnmMing\binitsM. and \bauthor\bsnmZhang, \bfnmCun-Hui\binitsC.-H. (\byear2016). \btitleOn tensor completion via nuclear norm minimization. \bjournalFoundations of Computational Mathematics \bvolume16 \bpages1031–1068. \endbibitem
  • Zhang and Han (2019) {barticle}[author] \bauthor\bsnmZhang, \bfnmAnru\binitsA. and \bauthor\bsnmHan, \bfnmRungang\binitsR. (\byear2019). \btitleOptimal sparse singular value decomposition for high-dimensional high-order data. \bjournalJournal of the American Statistical Association. \endbibitem
  • Zhang and Xia (2018) {barticle}[author] \bauthor\bsnmZhang, \bfnmAnru\binitsA. and \bauthor\bsnmXia, \bfnmDong\binitsD. (\byear2018). \btitleTensor SVD: Statistical and Computational Limits. \bjournalIEEE Transactions on Information Theory \bvolume64 \bpages7311-7338. \endbibitem
  • Zhang et al. (2020) {barticle}[author] \bauthor\bsnmZhang, \bfnmAnru R\binitsA. R., \bauthor\bsnmLuo, \bfnmYuetian\binitsY., \bauthor\bsnmRaskutti, \bfnmGarvesh\binitsG. and \bauthor\bsnmYuan, \bfnmMing\binitsM. (\byear2020). \btitleISLET: Fast and optimal low-rank tensor regression via importance sketching. \bjournalSIAM Journal on Mathematics of Data Science \bvolume2 \bpages444–479. \endbibitem
  • Zhou, Li and Zhu (2013) {barticle}[author] \bauthor\bsnmZhou, \bfnmHua\binitsH., \bauthor\bsnmLi, \bfnmLexin\binitsL. and \bauthor\bsnmZhu, \bfnmHongtu\binitsH. (\byear2013). \btitleTensor regression with applications in neuroimaging data analysis. \bjournalJournal of the American Statistical Association \bvolume108 \bpages540–552. \endbibitem
  • Zhu and Zhou (2021) {binproceedings}[author] \bauthor\bsnmZhu, \bfnmZiwei\binitsZ. and \bauthor\bsnmZhou, \bfnmWenjing\binitsW. (\byear2021). \btitleTaming heavy-tailed features by shrinkage. In \bbooktitleInternational Conference on Artificial Intelligence and Statistics \bpages3268–3276. \bpublisherPMLR. \endbibitem

Appendix A Convergence Analysis of Robust Gradient Descent

A.1 Proofs of Theorem 3.3

The proof consists of five steps. In the first step, we introduce the notation and the regularity conditions used in the subsequent steps. In the second to fourth steps, we establish the convergence analysis of the estimation errors. Finally, in the last step, we verify the conditions introduced in the first step recursively. Throughout the proof, we let C_{d}>0 denote a generic constant depending only on d.

Step 1. (Notations and conditions)

We first introduce the notations used in the proof. At step tt, we simplify the notations of the robust gradient estimators to 𝓖0(t)\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)} and 𝐆k(t)\mathbf{G}_{k}^{(t)}, for k=1,,dk=1,\dots,d and t=1,,Tt=1,\dots,T. Denote 𝐕k(t)=(jk𝐔j(t))𝓢(k)(t)\mathbf{V}_{k}^{(t)}=(\otimes_{j\neq k}\mathbf{U}_{j}^{(t)})\mbox{\boldmath$\mathscr{S}$}^{(t)\top}_{(k)},

𝚫k(t)=𝐆k(t)𝔼[k(t)]=𝐆k(t)𝔼[(𝓐(t))(k)𝐕k(t)],\mathbf{\Delta}_{k}^{(t)}=\mathbf{G}_{k}^{(t)}-\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]=\mathbf{G}_{k}^{(t)}-\mathbb{E}[\nabla\mathcal{L}(\mbox{\boldmath$\mathscr{A}$}^{(t)})_{(k)}\mathbf{V}_{k}^{(t)}], (A.1)

and

𝚫0(t)=𝓖0(t)𝔼[0(t)]=𝓖0(t)𝔼[(𝓐(t))×j=1d𝐔j(t)],\mathbf{\Delta}_{0}^{(t)}=\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)}-\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]=\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)}-\mathbb{E}[\nabla\mathcal{L}(\mbox{\boldmath$\mathscr{A}$}^{(t)})\times_{j=1}^{d}\mathbf{U}_{j}^{(t)\top}], (A.2)

as the robust gradient estimation errors. By the stability of the robust gradients, 𝚫k(t)F2ϕ𝓐(t)𝓐F2+ξk2\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}\leq\phi\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|^{2}_{\text{F}}+\xi_{k}^{2}, for all k=0,1,,dk=0,1,\dots,d and t=1,2,,Tt=1,2,\dots,T. In addition, we assume bσ¯1/(d+1)b\asymp\bar{\sigma}^{1/(d+1)}, as required in Theorem 3.3.

Let 𝓐=𝓢×k=1d𝐔k\mbox{\boldmath$\mathscr{A}$}^{*}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{U}_{k}^{*} such that 𝐔k𝐔k=b2𝐈rk\mathbf{U}_{k}^{*\top}\mathbf{U}_{k}^{*}=b^{2}\mathbf{I}_{r_{k}}, for k=1,,dk=1,\dots,d. Define 𝕆r={𝐌r×r:𝐌𝐌=𝐈r}\mathbb{O}_{r}=\{\mathbf{M}\in\mathbb{R}^{r\times r}:\mathbf{M}^{\top}\mathbf{M}=\mathbf{I}_{r}\} as the set of r×rr\times r orthogonal matrices. For each step t=0,1,,Tt=0,1,\dots,T, we define

Err(t)=min𝐎k𝕆rk,1kd{k=1d𝐔k(t)𝐔k𝐎kF2+𝓢(t)𝓢×j=1d𝐎jF2},\text{Err}^{(t)}=\min_{\mathbf{O}_{k}\in\mathbb{O}_{r_{k}},1\leq k\leq d}\left\{\sum_{k=1}^{d}\|\mathbf{U}^{(t)}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\textup{F}}^{2}+\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{\top}\|^{2}_{\textup{F}}\right\}, (A.3)

and

(𝐎1(t),,𝐎d(t))=argmin𝐎k𝕆rk,1kd{k=1d𝐔k(t)𝐔k𝐎kF2+𝓢(t)𝓢×j=1d𝐎jF2}.(\mathbf{O}_{1}^{(t)},\cdots,\mathbf{O}_{d}^{(t)})=\operatorname*{arg\,min}_{{\mathbf{O}_{k}\in\mathbb{O}_{r_{k}},1\leq k\leq d}}\left\{\sum_{k=1}^{d}\|\mathbf{U}^{(t)}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\textup{F}}^{2}+\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{\top}\|^{2}_{\textup{F}}\right\}. (A.4)

Here, Err(t)\text{Err}^{(t)} collects the combined estimation errors for all tensor decomposition components at step tt, and 𝐎k(t)\mathbf{O}_{k}^{(t)}’s are the optimal rotations used to handle the non-identifiability of the Tucker decomposition.
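As a concrete illustration of (A.3)–(A.4), the following is a minimal numerical sketch in Python, with illustrative function names not taken from the paper: each factor is aligned to its target by the orthogonal Procrustes rotation, that is, the SVD-based minimizer of \|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\text{F}} alone. Since this ignores the coupling with the core term in the joint minimization (A.4), the returned quantity is an upper bound on \text{Err}^{(t)}.

import numpy as np

def procrustes(U_hat, U_star):
    # Orthogonal Procrustes: the rotation O minimizing ||U_hat - U_star @ O||_F.
    W, _, Vt = np.linalg.svd(U_star.T @ U_hat)
    return W @ Vt

def combined_error(U_hats, U_stars, S_hat, S_star):
    # Factor-by-factor alignment; yields an upper bound on Err^{(t)} defined in (A.3).
    Os = [procrustes(U, Us) for U, Us in zip(U_hats, U_stars)]
    err = sum(np.linalg.norm(U - Us @ O) ** 2
              for U, Us, O in zip(U_hats, U_stars, Os))
    S_rot = S_star
    for k, O in enumerate(Os):
        # mode-k product S_rot x_k O_k^T: rotate the true core into the estimated basis
        S_rot = np.moveaxis(np.tensordot(O.T, S_rot, axes=(1, k)), 0, k)
    return err + np.linalg.norm(S_hat - S_rot) ** 2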

Next, we discuss some additional conditions used in the convergence analysis. To ease presentation, we first assume that these conditions hold and verify them in the last step.

(C1) For any t=0,1,,Tt=0,1,\dots,T and k=1,2,,dk=1,2,\dots,d, 𝓢(k)(t)Cσ¯bd\|\mbox{\boldmath$\mathscr{S}$}_{(k)}^{(t)}\|\leq C\bar{\sigma}b^{-d} and 𝐔k(t)Cb\|\mathbf{U}_{k}^{(t)}\|\leq Cb for some absolute constant greater than one. Hence, 𝐕k(t)𝓢(k)(t)jk𝐔j(t)Cdσ¯b1\|\mathbf{V}_{k}^{(t)}\|\leq\|\mbox{\boldmath$\mathscr{S}$}_{(k)}^{(t)}\|\cdot\prod_{j\neq k}\|\mathbf{U}_{j}^{(t)}\|\leq C_{d}\bar{\sigma}b^{-1}.

(C2) For any t=0,1,,Tt=0,1,\dots,T, Err(t)Cαβ1b2κ2\text{Err}^{(t)}\leq C\alpha\beta^{-1}b^{2}\kappa^{-2}.

Step 2. (Descent of Err(t)\text{Err}^{(t)})

By definition of Err(t)\text{Err}^{(t)} and 𝐎k(t)\mathbf{O}_{k}^{(t)}’s,

Err(t+1)=k=1d𝐔k(t+1)𝐔k𝐎k(t+1)F2+𝓢(t+1)𝓢×j=1d𝐎j(t+1)F2k=1d𝐔k(t+1)𝐔k𝐎k(t)F2+𝓢(t+1)𝓢×j=1d𝐎j(t)F2.\begin{split}\text{Err}^{(t+1)}&=\sum_{k=1}^{d}\left\|\mathbf{U}_{k}^{(t+1)}-\mathbf{U}^{*}_{k}\mathbf{O}_{k}^{(t+1)}\right\|^{2}_{\textup{F}}+\left\|\mbox{\boldmath$\mathscr{S}$}^{(t+1)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{(t+1)\top}\right\|^{2}_{\textup{F}}\\ &\leq\sum_{k=1}^{d}\left\|\mathbf{U}_{k}^{(t+1)}-\mathbf{U}^{*}_{k}\mathbf{O}_{k}^{(t)}\right\|^{2}_{\textup{F}}+\left\|\mbox{\boldmath$\mathscr{S}$}^{(t+1)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{(t)\top}\right\|^{2}_{\textup{F}}.\end{split} (A.5)

For each k=1,,dk=1,\cdots,d, since 𝐔k(t+1)=𝐔k(t)η𝐆k(t)aη𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)\mathbf{U}_{k}^{(t+1)}=\mathbf{U}_{k}^{(t)}-\eta\mathbf{G}_{k}^{(t)}-a\eta\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}), we have that for any ζ>0\zeta>0,

𝐔k(t+1)𝐔k𝐎k(t)F2=𝐔k(t)𝐔k𝐎k(t)η(𝐆k(t)+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2=𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))η𝚫k(t)F2𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2+η2𝚫k(t)F2+2η𝚫k(t)F𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F(1+ζ)𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2+(1+ζ1)η2𝚫k(t)F2,\begin{split}&\|\mathbf{U}^{(t+1)}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\textup{F}}^{2}\\ =&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbf{G}_{k}^{(t)}+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\textup{F}}^{2}\\ =&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))-\eta\mathbf{\Delta}_{k}^{(t)}\|_{\textup{F}}^{2}\\ \leq&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}^{2}+\eta^{2}\|\mathbf{\Delta}_{k}^{(t)}\|_{\textup{F}}^{2}\\ &+2\eta\|\mathbf{\Delta}_{k}^{(t)}\|_{\textup{F}}\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}\\ \leq&(1+\zeta)\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}^{2}\\ &+(1+\zeta^{-1})\eta^{2}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2},\end{split} (A.6)

where the last inequality follows from the elementary inequality 2xy\leq\zeta x^{2}+\zeta^{-1}y^{2}, valid for any \zeta>0.

For the first term on the right hand side in (A.6), we have the following decomposition

𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2=𝐔k(t)𝐔k𝐎k(t)F2+η2𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)F22η𝐔k(t)𝐔k𝐎k(t),𝔼[k(t)]2ηa𝐔k(t)𝐔k𝐎k(t),𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk).\begin{split}&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}^{2}\\ =&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\text{F}}^{2}+\eta^{2}\|\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\|_{\text{F}}^{2}\\ &-2\eta\langle\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]\rangle-2\eta a\langle\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\rangle.\end{split} (A.7)

Here, by condition (C1), the second term in (A.7) can be bounded by

𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)F22𝔼[¯(𝓐(t))](k)𝐕k(t)F2+2a2𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)F22𝐕k(t)2𝔼[¯(𝓐(t))]F2+2a2𝐔k(t)2𝐔k(t)𝐔k(t)b2𝐈rkF2Cdb2σ¯2𝔼[¯(𝓐(t))]F2+Ca2b2𝐔k(t)𝐔k(t)b2𝐈rkF2.\begin{split}&\|\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\|_{\text{F}}^{2}\\ \leq&2\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]_{(k)}\mathbf{V}_{k}^{(t)}\|_{\text{F}}^{2}+2a^{2}\|\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\|_{\text{F}}^{2}\\ \leq&2\|\mathbf{V}_{k}^{(t)}\|^{2}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+2a^{2}\|\mathbf{U}_{k}^{(t)}\|^{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}\\ \leq&C_{d}b^{-2}\bar{\sigma}^{2}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+Ca^{2}b^{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}.\end{split} (A.8)

The third term in (A.7) can be rewritten as

\begin{split}&\langle\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]\rangle\\ =&\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{(t)}\times_{j\neq k}\mathbf{U}_{j}^{(t)}\times_{k}\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})]\rangle\\ =&\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})]\rangle,\end{split} (A.9)

where 𝓐k(t):=𝓢(t)×jk𝐔j(t)×k𝐔k𝐎k(t)\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)}:=\mbox{\boldmath$\mathscr{S}$}^{(t)}\times_{j\neq k}\mathbf{U}_{j}^{(t)}\times_{k}\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}. For the fourth term in (A.7), we have

𝐔k(t)𝐔k𝐎k(t),𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)=𝐔k(t)𝐔k(t)𝐔k(t)𝐔k𝐎k(t),𝐔k(t)𝐔k(t)b2𝐈rk=12𝐔k(t)𝐔k(t)𝐔k𝐔k,𝐔k(t)𝐔k(t)b2𝐈rk+12𝐔k𝐔k2𝐔k(t)𝐔k𝐎k(t)+𝐔k(t)𝐔k(t),𝐔k(t)𝐔k(t)b2𝐈rk=12𝐔k(t)𝐔k(t)b2𝐈rkF2+12(𝐔k𝐎k(t)𝐔k(t))(𝐔k𝐎k(t)𝐔k(t)),𝐔k(t)𝐔k(t)b2𝐈rk12𝐔k(t)𝐔k(t)b2𝐈rkF212𝐔k𝐎k(t)𝐔k(t)F2𝐔k(t)𝐔k(t)b2𝐈rkF12𝐔k(t)𝐔k(t)b2𝐈rkF214𝐔k𝐎k(t)𝐔k(t)F414𝐔k(t)𝐔k(t)b2𝐈rkF214𝐔k(t)𝐔k(t)b2𝐈rkF2Err(t)4𝐔k(t)𝐔k𝐎k(t)F2,\begin{split}&\left\langle\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\right\rangle\\ =&\left\langle\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\rangle\\ =&\frac{1}{2}\left\langle\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*\top}\mathbf{U}_{k}^{*},\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\rangle\\ &+\frac{1}{2}\left\langle\mathbf{U}_{k}^{*\top}\mathbf{U}_{k}^{*}-2\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}+\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)},\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\rangle\\ =&\frac{1}{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}\\ &+\frac{1}{2}\left\langle(\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)})^{\top}(\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)}),\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\rangle\\ \geq&\frac{1}{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{1}{2}\|\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)}\|_{\text{F}}^{2}\cdot\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}\\ \geq&\frac{1}{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{1}{4}\|\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)}\|_{\text{F}}^{4}-\frac{1}{4}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}\\ \geq&\frac{1}{4}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{\text{Err}^{(t)}}{4}\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\text{F}}^{2},\end{split} (A.10)

where we use the fact that 𝐔k𝐎k(t)𝐔k(t)F2Err(t)\|\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)}\|_{\text{F}}^{2}\leq\text{Err}^{(t)}.

Hence, for any k=1,2,,dk=1,2,\dots,d,

𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2𝐔k(t)𝐔k𝐎k(t)F22Qk,1(t)η+Qk,2(t)η2,\begin{split}&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}^{2}\\ \leq&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\text{F}}^{2}-2Q_{k,1}^{(t)}\eta+Q_{k,2}^{(t)}\eta^{2},\end{split} (A.11)

where

Qk,1(t)=𝓐(t)𝓐k(t),𝔼[¯(𝓐(t))]+a4𝐔k(t)𝐔k(t)b2𝐈rkF2aErr(t)4𝐔k(t)𝐔k𝐎k(t)F2Q_{k,1}^{(t)}=\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\rangle+\frac{a}{4}\left\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\|_{\text{F}}^{2}-\frac{a\text{Err}^{(t)}}{4}\left\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\right\|_{\text{F}}^{2} (A.12)

and

Qk,2(t)=Cdb2σ¯2𝔼[¯(𝓐(t))]F2+Ca2b2𝐔k(t)𝐔k(t)b2𝐈rkF2.Q_{k,2}^{(t)}=C_{d}b^{-2}\bar{\sigma}^{2}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+Ca^{2}b^{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}. (A.13)

Similarly, for any ζ>0\zeta>0,

𝓢~(t+1)𝓢×k=1d𝐎k(t)F2=𝓢(t)η𝓖0(t)𝓢×k=1d𝐎k(t)F2=𝓢(t)𝓢×k=1d𝐎k(t)η𝔼[0(t)]η𝚫0(t)F2(1+ζ)𝓢(t)𝓢×k=1d𝐎k(t)η𝔼[0(t)]F2+η2(1+ζ1)𝚫0(t)F2,\begin{split}&\|\widetilde{\mbox{\boldmath$\mathscr{S}$}}^{(t+1)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}\|_{\text{F}}^{2}=\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\eta\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}\|_{\text{F}}^{2}\\ =&\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}-\eta\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]-\eta\mathbf{\Delta}_{0}^{(t)}\|_{\text{F}}^{2}\\ \leq&(1+\zeta)\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}-\eta\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]\|_{\text{F}}^{2}+\eta^{2}(1+\zeta^{-1})\|\mathbf{\Delta}_{0}^{(t)}\|_{\text{F}}^{2},\end{split} (A.14)

and

𝓢(t)𝓢×k=1d𝐎k(t)η𝔼[0(t)]F2𝓢(t)𝓢×k=1d𝐎k(t)F22Q0,1(t)η+Q0,2(t)η2,\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}-\eta\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]\|_{\text{F}}^{2}\leq\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}\|_{\text{F}}^{2}-2Q_{0,1}^{(t)}\eta+Q_{0,2}^{(t)}\eta^{2}, (A.15)

where

Q0,1(t)=𝓐(t)𝓐0(t),𝔼[¯(𝓐(t))] with 𝓐0(t)=𝓢×k=1d𝐔k(t)𝐎k(t)Q_{0,1}^{(t)}=\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}_{0}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\rangle\text{ with }\mbox{\boldmath$\mathscr{A}$}_{0}^{(t)}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{U}_{k}^{(t)}\mathbf{O}_{k}^{(t)\top}

and Q0,2(t)=Cdb2d𝔼[¯(𝓐(t))]F2Q_{0,2}^{(t)}=C_{d}b^{2d}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}.

Hence, combining the above results, we have

Err(t+1)(1+ζ){Err(t)2ηk=0dQk,1(t)+η2k=0dQk,2(t)}+(1+ζ1)η2k=0d𝚫k(t)F2.\text{Err}^{(t+1)}\leq(1+\zeta)\left\{\text{Err}^{(t)}-2\eta\sum_{k=0}^{d}Q^{(t)}_{k,1}+\eta^{2}\sum_{k=0}^{d}Q_{k,2}^{(t)}\right\}+(1+\zeta^{-1})\eta^{2}\sum_{k=0}^{d}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}. (A.16)
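In code, the factor and core updates analyzed in this step amount to a gradient step plus the balancing regularizer on each factor. The following is a minimal sketch, where the robust gradients \mathbf{G}_{k}^{(t)} and \mbox{\boldmath$\mathscr{G}$}_{0}^{(t)} are taken as given (e.g., produced by the truncated averaging analyzed in Appendix B) and the function names are illustrative only.

import numpy as np

def update_factor(U_k, G_k, eta, a, b):
    # U_k^{(t+1)} = U_k^{(t)} - eta*G_k^{(t)} - a*eta*U_k^{(t)} (U_k^{(t)T} U_k^{(t)} - b^2 I_{r_k})
    r_k = U_k.shape[1]
    balance = U_k @ (U_k.T @ U_k - b ** 2 * np.eye(r_k))
    return U_k - eta * G_k - a * eta * balance

def update_core(S, G_0, eta):
    # S^{(t+1)} = S^{(t)} - eta * G_0^{(t)}, cf. (A.14)
    return S - eta * G_0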

Step 3. (Lower bound of k=0dQk,1(t)\sum_{k=0}^{d}Q^{(t)}_{k,1})

By definition of Qk,1(t)Q^{(t)}_{k,1} for k=0,,dk=0,\dots,d, we have

k=0dQk,1(t)=(d+1)𝓐(t)k=0d𝓐k(t),𝔼[¯(𝓐(t))]+ak=1d{14𝐔k(t)𝐔k(t)b2𝐈rkF2Err(t)4𝐔k(t)𝐔k𝐎k(t)F2}.\begin{split}\sum_{k=0}^{d}Q_{k,1}^{(t)}=&\left\langle(d+1)\mbox{\boldmath$\mathscr{A}$}^{(t)}-\sum_{k=0}^{d}\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\right\rangle\\ &+a\sum_{k=1}^{d}\left\{\frac{1}{4}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{\text{Err}^{(t)}}{4}\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\text{F}}^{2}\right\}.\end{split} (A.17)

For the first term, by the RCG condition on \overline{\mathcal{L}}, the Cauchy–Schwarz inequality, and Young's inequality,

(d+1)𝓐(t)k=0d𝓐k(t),𝔼[¯(𝓐(t))]=𝓐(t)𝓐+𝓗,𝔼[¯(𝓐(t))]=𝓐(t)𝓐,𝔼[¯(𝓐(t))]𝔼[¯(𝓐)]+𝓗,𝔼[¯(𝓐(t))]α2𝓐(t)𝓐F2+12β𝔼[¯(𝓐(t))]F2𝓗F𝔼[¯(𝓐(t))]Fα2𝓐(t)𝓐F2+12β𝔼[¯(𝓐(t))]F214β𝔼[¯(𝓐(t))]F2β𝓗F2=α2𝓐(t)𝓐F2+14β𝔼[¯(𝓐(t))]F2β𝓗F2\begin{split}&\left\langle(d+1)\mbox{\boldmath$\mathscr{A}$}^{(t)}-\sum_{k=0}^{d}\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\right\rangle=\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}+\mbox{\boldmath$\mathscr{H}$},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\rangle\\ =&\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})]\rangle+\langle\mbox{\boldmath$\mathscr{H}$},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\rangle\\ \geq&\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{1}{2\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}\cdot\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}\\ \geq&\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{1}{2\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-\beta\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}^{2}\\ =&\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-\beta\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}^{2}\end{split} (A.18)

where 𝓗\mathscr{H} is the higher-order perturbation term in

𝓐=𝓐0(t)+k=1d(𝓐k(t)𝓐(t))+𝓗.\mbox{\boldmath$\mathscr{A}$}^{*}=\mbox{\boldmath$\mathscr{A}$}_{0}^{(t)}+\sum_{k=1}^{d}(\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{(t)})+\mbox{\boldmath$\mathscr{H}$}. (A.19)

By Lemma A.2, we have 𝓗FCdb2σ¯Err(t)\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}\leq C_{d}b^{-2}\bar{\sigma}\text{Err}^{(t)}. Hence, by Lemma A.1, k=0dQk,1(t)\sum_{k=0}^{d}Q_{k,1}^{(t)} can be lower bounded by

k=0dQk,1(t)α2𝓐(t)𝓐F2+14β𝔼[¯(𝓐(t))]F2Cdβb4σ¯2(Err(t))2+a4k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2a4(Err(t))2{Cαb2dκ2Cdβb4σ¯2Err(t)aErr(t)4}Err(t)+14β𝔼[¯(𝓐(t))]F2+(a4Cdαb2d2κ2)k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2Cαb2dκ2Err(t)+14β𝔼[¯(𝓐(t))]F2+(a4Cdαb2d2κ2)k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2.\begin{split}&\sum_{k=0}^{d}Q_{k,1}^{(t)}\geq\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-C_{d}\beta b^{-4}\bar{\sigma}^{2}(\text{Err}^{(t)})^{2}\\ &+\frac{a}{4}\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{a}{4}(\text{Err}^{(t)})^{2}\\ \geq&\left\{C\alpha b^{2d}\kappa^{-2}-C_{d}\beta b^{-4}\bar{\sigma}^{2}\text{Err}^{(t)}-\frac{a\text{Err}^{(t)}}{4}\right\}\text{Err}^{(t)}\\ &+\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+\left(\frac{a}{4}-C_{d}\alpha b^{2d-2}\kappa^{-2}\right)\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}\\ \geq&~{}C\alpha b^{2d}\kappa^{-2}\text{Err}^{(t)}+\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+\left(\frac{a}{4}-C_{d}\alpha b^{2d-2}\kappa^{-2}\right)\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}.\end{split} (A.20)

Step 4. (Convergence analysis)

We have the following bound for k=0dQk,2(t)\sum_{k=0}^{d}Q_{k,2}^{(t)}

k=0dQk,2(t)Cdb2d𝔼[¯(𝓐(t))]F2+3a2b2k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2.\begin{split}\sum_{k=0}^{d}Q_{k,2}^{(t)}&\leq C_{d}b^{2d}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+3a^{2}b^{2}\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}.\end{split} (A.21)

Combining the results above, we have

Err(t)2ηk=0dQk,1(t)+η2k=0dQk,2(t)(1Cαb2dκ2η)Err(t)+(Cdb2dη2η4β)𝔼[¯(𝓐(t))]F2+(3a2b2η2+Cdαb2d2κ2ηaη4)k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2.\begin{split}&\text{Err}^{(t)}-2\eta\sum_{k=0}^{d}Q_{k,1}^{(t)}+\eta^{2}\sum_{k=0}^{d}Q_{k,2}^{(t)}\\ \leq&\left(1-C\alpha b^{2d}\kappa^{-2}\eta\right)\text{Err}^{(t)}+\left(C_{d}b^{2d}\eta^{2}-\frac{\eta}{4\beta}\right)\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}\\ &+\left(3a^{2}b^{2}\eta^{2}+C_{d}\alpha b^{2d-2}\kappa^{-2}\eta-\frac{a\eta}{4}\right)\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}.\end{split} (A.22)

Taking η=η0b2dβ1\eta=\eta_{0}b^{-2d}\beta^{-1} and a=C0b2d2ακ2a=C_{0}b^{2d-2}\alpha\kappa^{-2} for some sufficiently small constants η0\eta_{0} and C0C_{0}, we have

Err(t)2ηk=0dQk,1(t)+η2k=0dQk,2(t)(1Cαβ1κ2)Err(t)\text{Err}^{(t)}-2\eta\sum_{k=0}^{d}Q_{k,1}^{(t)}+\eta^{2}\sum_{k=0}^{d}Q_{k,2}^{(t)}\leq(1-C\alpha\beta^{-1}\kappa^{-2})\text{Err}^{(t)} (A.23)

and

Err(t+1)(1+ζ)(1η0αβ1κ2)Err(t)+(1+ζ1)η2k=0d𝚫k(t)F2.\text{Err}^{(t+1)}\leq(1+\zeta)(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2})\text{Err}^{(t)}+(1+\zeta^{-1})\eta^{2}\sum_{k=0}^{d}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}. (A.24)

Taking ζ=η0αβ1κ2/2\zeta=\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2, we have

Err(t+1)(1η0αβ1κ2/2)Err(t)+Cα1β1σ¯4d/(d+1)κ2k=0d𝚫k(t)F2.\text{Err}^{(t+1)}\leq(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2)\text{Err}^{(t)}+C\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\sum_{k=0}^{d}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}. (A.25)
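Here, the constants in (A.25) follow from a short calculation: with \zeta=\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2, \eta=\eta_{0}b^{-2d}\beta^{-1}, b\asymp\bar{\sigma}^{1/(d+1)}, and using \alpha\leq\beta and \kappa\geq1,

(1+\zeta)(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2})\leq 1-\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2\quad\text{and}\quad(1+\zeta^{-1})\eta^{2}\lesssim\alpha^{-1}\beta\kappa^{2}\cdot\eta_{0}^{2}b^{-4d}\beta^{-2}\asymp\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}.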

By stability of the robust gradient estimators, for k=0,1,,dk=0,1,\dots,d and t=1,2,,Tt=1,2,\dots,T,

𝚫k(t)F2ϕ𝓐(t)𝓐F2+ξk2.\|\mathbf{\Delta}^{(t)}_{k}\|_{\text{F}}^{2}\leq\phi\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\xi_{k}^{2}. (A.26)

Hence, as ϕα2κ4σ¯2d/(d+1)\phi\lesssim\alpha^{2}\kappa^{-4}\bar{\sigma}^{2d/(d+1)}, we have

Err(t+1)(1η0αβ1κ2/2)Err(t)+Cdα1β1σ¯4d/(d+1)κ2(ϕ𝓐(t)𝓐F2+k=0dξk2)(1η0αβ1κ2/2+Cdα1β1σ¯2d/(d+1)κ2ϕ)Err(t)+Cα1β1σ¯4d/(d+1)κ2k=0dξk2(1Cαβ1κ2)Err(t)+Cα1β1σ¯4d/(d+1)κ2k=0dξk2(1Cαβ1κ2)t+1Err(0)+Cα2σ¯4d/(d+1)κ4k=0dξk2.\begin{split}\text{Err}^{(t+1)}&\leq(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2)\text{Err}^{(t)}+C_{d}\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\left(\phi\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\sum_{k=0}^{d}\xi_{k}^{2}\right)\\ &\leq(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2+C_{d}\alpha^{-1}\beta^{-1}\bar{\sigma}^{-2d/(d+1)}\kappa^{2}\phi)\text{Err}^{(t)}+C\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\sum_{k=0}^{d}\xi_{k}^{2}\\ &\leq(1-C\alpha\beta^{-1}\kappa^{-2})\text{Err}^{(t)}+C\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\sum_{k=0}^{d}\xi_{k}^{2}\\ &\leq(1-C\alpha\beta^{-1}\kappa^{-2})^{t+1}\text{Err}^{(0)}+C\alpha^{-2}\bar{\sigma}^{-4d/(d+1)}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2}.\end{split} (A.27)

We apply Lemma A.1 again and obtain

\begin{split}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}&\leq C\bar{\sigma}^{2d/(d+1)}\text{Err}^{(t)}\\ &\leq C\bar{\sigma}^{2d/(d+1)}(1-C\alpha\beta^{-1}\kappa^{-2})^{t}\text{Err}^{(0)}+C\bar{\sigma}^{-2d/(d+1)}\alpha^{-2}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2}\\ &\leq C\kappa^{2}(1-C\alpha\beta^{-1}\kappa^{-2})^{t}\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+C\bar{\sigma}^{-2d/(d+1)}\alpha^{-2}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2}.\end{split} (A.28)
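The display (A.27)–(A.28) is the usual pattern of linear convergence up to a statistical error floor. A scalar numerical sketch of this recursion is given below; the contraction rate rho and the additive term c are illustrative values only, standing in for C\alpha\beta^{-1}\kappa^{-2} and the statistical error term, respectively.

# Scalar illustration of Err^{(t+1)} <= (1 - rho) * Err^{(t)} + c:
# geometric decay of the error until it reaches the floor c / rho.
rho, c = 0.1, 1e-4   # illustrative values, not quantities from the paper
err = 1.0            # Err^{(0)}
for t in range(200):
    err = (1 - rho) * err + c
print(err, c / rho)  # err ends up essentially at the floor c / rho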

Step 5. (Verification of conditions)

Finally, we show the conditions (C1) and (C2) hold for all t=1,2,t=1,2,\dots. By Lemma A.1, we have

Err(0)C(α/β)b2κ2Cb2.\text{Err}^{(0)}\leq C(\alpha/\beta)b^{2}\kappa^{-2}\leq Cb^{2}. (A.29)

By the recursive relationship in (A.27), an induction argument shows that \text{Err}^{(t)}\leq C\alpha\beta^{-1}b^{2}\kappa^{-2}\leq Cb^{2} for all t=1,2,\dots,T under the conditions of Theorem 3.3, so that condition (C2) is maintained throughout. This further implies that

𝐔k(t)𝐔k+𝐔k(t)𝐔k𝐎k(t)Cb,k=1,2,,d,\|\mathbf{U}_{k}^{(t)}\|\leq\|\mathbf{U}_{k}^{*}\|+\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|\leq Cb,~{}~{}k=1,2,\dots,d, (A.30)

and

\max_{k}\|\mbox{\boldmath$\mathscr{S}$}^{(t)}_{(k)}\|\leq\max_{k}\|\mbox{\boldmath$\mathscr{S}$}^{*}_{(k)}\|+\max_{k}\|\mbox{\boldmath$\mathscr{S}$}_{(k)}^{(t)}-(\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{(t)\top})_{(k)}\|\leq C\bar{\sigma}b^{-d}, (A.31)

which completes the convergence analysis.

A.2 Auxiliary lemmas

The first lemma establishes the equivalence between \|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2} and the combined error E; it is adapted from Lemma E.2 in Han, Willett and Zhang (2022) and presented here for self-containedness, so its proof is omitted.

Lemma A.1.

Suppose \mbox{\boldmath$\mathscr{A}$}^{*}=[\![\mbox{\boldmath$\mathscr{S}$}^{*};\mathbf{U}_{1}^{*},\dots,\mathbf{U}_{d}^{*}]\!], \mathbf{U}_{k}^{*\top}\mathbf{U}_{k}^{*}=b^{2}\mathbf{I}_{r_{k}} for k=1,\dots,d, \bar{\sigma}=\max_{k}\|\mbox{\boldmath$\mathscr{A}$}^{*}_{(k)}\|_{\textup{sp}}, and \underline{\sigma}=\min_{k}\sigma_{r_{k}}(\mbox{\boldmath$\mathscr{A}$}^{*}_{(k)}). Let \mbox{\boldmath$\mathscr{A}$}=[\![\mbox{\boldmath$\mathscr{S}$};\mathbf{U}_{1},\dots,\mathbf{U}_{d}]\!] be another Tucker low-rank tensor with \mathbf{U}_{k}\in\mathbb{R}^{p_{k}\times r_{k}}, \|\mathbf{U}_{k}\|\leq(1+c_{0})b, and \max_{k}\|\mbox{\boldmath$\mathscr{S}$}_{(k)}\|\leq(1+c_{0})\bar{\sigma}b^{-d} for some c_{0}>0. Define

E:=\min_{\mathbf{O}_{k}\in\mathbb{O}_{r_{k}},1\leq k\leq d}\left\{\sum_{k=1}^{d}\|\mathbf{U}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\textup{F}}^{2}+\left\|\mbox{\boldmath$\mathscr{S}$}-[\![\mbox{\boldmath$\mathscr{S}$}^{*};\mathbf{O}_{1}^{\top},\dots,\mathbf{O}_{d}^{\top}]\!]\right\|_{\textup{F}}^{2}\right\}. (A.32)

Then, we have

Eb2d(C+C1b2d+2σ¯2)𝓐𝓐F2+2b2C1k=1d𝐔k𝐔kb2𝐈rkF2,and 𝓐𝓐F2Cb2d(C+C2σ¯2b2(d+1))E,\begin{split}&E\leq b^{-2d}(C+C_{1}b^{2d+2}\underline{\sigma}^{-2})\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+2b^{-2}C_{1}\sum_{k=1}^{d}\|\mathbf{U}_{k}^{\top}\mathbf{U}_{k}-b^{2}\mathbf{I}_{r_{k}}\|_{\textup{F}}^{2},\\ \text{and }&\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}\leq Cb^{2d}(C+C_{2}\bar{\sigma}^{2}b^{-2(d+1)})E,\end{split} (A.33)

where C1,C2>0C_{1},C_{2}>0 are some constants related to c0c_{0}.

The second lemma provides an upper bound for the second- and higher-order terms in the perturbation of a tensor Tucker decomposition; it is a higher-order generalization of Lemma E.3 in Han, Willett and Zhang (2022).

Lemma A.2.

Suppose that \mbox{\boldmath$\mathscr{A}$}^{*}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{U}_{k}^{*} and \mbox{\boldmath$\mathscr{A}$}=\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{U}_{k} with \|\mathbf{U}_{k}\|\asymp\|\mathbf{U}_{k}^{*}\|\asymp b and \|\mbox{\boldmath$\mathscr{S}$}_{(k)}\|\asymp\|\mbox{\boldmath$\mathscr{S}$}_{(k)}^{*}\|\asymp\bar{\sigma}b^{-d}. For \mathbf{O}_{k}\in\mathbb{O}_{r_{k}}, 1\leq k\leq d, let \mbox{\boldmath$\mathscr{H}$}=\mbox{\boldmath$\mathscr{A}$}^{*}-\mbox{\boldmath$\mathscr{A}$}_{0}-\sum_{k=1}^{d}(\mbox{\boldmath$\mathscr{A}$}_{k}-\mbox{\boldmath$\mathscr{A}$}) and \textup{Err}=\sum_{k=1}^{d}\|\mathbf{U}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\textup{F}}^{2}+\|\mbox{\boldmath$\mathscr{S}$}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{\top}\|_{\textup{F}}^{2}, where \mbox{\boldmath$\mathscr{A}$}_{0} and \mbox{\boldmath$\mathscr{A}$}_{k} are defined as in the proof of Theorem 3.3 with the superscript (t) omitted. Then, \|\mbox{\boldmath$\mathscr{H}$}\|_{\textup{F}}\leq C_{d}b^{-2}\bar{\sigma}\textup{Err}.

Proof.

We have that

𝓗Fjk𝓢×i=j,k(𝐔i𝐔i𝐎i)×ij,k𝐔j𝐎jF+jkl𝓢×i=j,k,l(𝐔i𝐔i𝐎i)×ij,k,l𝐔j𝐎jF++j𝓢×ij(𝐔i𝐔i𝐎i)×i=j𝐔j𝐎jF+𝓢×i=1d(𝐔i𝐔i𝐎i)Fjk(𝓢×k=1d𝐎k𝓢)×i=j,k(𝐔i𝐔i𝐎i)×ij,k𝐔j𝐎jF+jkl(𝓢×k=1d𝐎k𝓢)×i=j,k,l(𝐔i𝐔i𝐎i)×ij,k,l𝐔j𝐎jF++j(𝓢×k=1d𝐎k𝓢)×ij(𝐔i𝐔i𝐎i)×i=j𝐔j𝐎jF+(𝓢×k=1d𝐎k𝓢)×i=1d(𝐔i𝐔i𝐎i)F(d2)B2B1d2B3+(d3)B2B1d3B33/2++dB2B1B3(d1)/2+B2B3d/2+(d2)B1d2B33/2+(d3)B1d3B32++dB1B3d/2+B3(d+1)/2Cdb2σ¯Err,\begin{split}\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}&\leq\sum_{j\neq k}\left\|\mbox{\boldmath$\mathscr{S}$}^{*}\times_{i=j,k}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i\neq j,k}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\right\|_{\text{F}}\\ &+\sum_{j\neq k\neq l}\left\|\mbox{\boldmath$\mathscr{S}$}^{*}\times_{i=j,k,l}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i\neq j,k,l}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\right\|_{\text{F}}\\ &+\cdots+\sum_{j}\|\mbox{\boldmath$\mathscr{S}$}^{*}\times_{i\neq j}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i=j}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\|_{\text{F}}+\|\mbox{\boldmath$\mathscr{S}$}^{*}\times_{i=1}^{d}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\|_{\text{F}}\\ &\leq\sum_{j\neq k}\left\|(\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{O}_{k}-\mbox{\boldmath$\mathscr{S}$}^{*})\times_{i=j,k}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i\neq j,k}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\right\|_{\text{F}}\\ &+\sum_{j\neq k\neq l}\left\|(\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{O}_{k}-\mbox{\boldmath$\mathscr{S}$}^{*})\times_{i=j,k,l}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i\neq j,k,l}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\right\|_{\text{F}}\\ &+\cdots+\sum_{j}\|(\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{O}_{k}-\mbox{\boldmath$\mathscr{S}$}^{*})\times_{i\neq j}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i=j}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\|_{\text{F}}\\ &+\|(\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{O}_{k}-\mbox{\boldmath$\mathscr{S}$}^{*})\times_{i=1}^{d}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\|_{\text{F}}\\ &\leq\binom{d}{2}B_{2}B_{1}^{d-2}B_{3}+\binom{d}{3}B_{2}B_{1}^{d-3}B_{3}^{3/2}+\cdots+dB_{2}B_{1}B_{3}^{(d-1)/2}+B_{2}B_{3}^{d/2}\\ &+\binom{d}{2}B_{1}^{d-2}B_{3}^{3/2}+\binom{d}{3}B_{1}^{d-3}B_{3}^{2}+\cdots+dB_{1}B_{3}^{d/2}+B_{3}^{(d+1)/2}\leq C_{d}b^{-2}\bar{\sigma}\textup{Err},\end{split} (A.34)

where B1=maxk(𝐔k,𝐔k)B_{1}=\max_{k}(\|\mathbf{U}^{*}_{k}\|,\|\mathbf{U}_{k}\|), B2=maxk(𝓢(k),𝓢(k))B_{2}=\max_{k}(\|\mbox{\boldmath$\mathscr{S}$}^{*}_{(k)}\|,\|\mbox{\boldmath$\mathscr{S}$}_{(k)}\|), and B3=maxk(𝐔k𝐔k𝐎kF2,𝓢𝓢×k=1d𝐎kF2)B_{3}=\max_{k}(\|\mathbf{U}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\text{F}}^{2},\|\mbox{\boldmath$\mathscr{S}$}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}\|_{\text{F}}^{2}).

Appendix B Statistical Analysis of Robust Gradient

The most essential part of the statistical analysis is to prove that the robust gradient estimators are stable. For 1\leq k\leq d, the robust gradient estimator with respect to \mathbf{U}_{k} is \mathbf{G}_{k}=n^{-1}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau), where \mathbf{V}_{k}=(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}. Note that

𝐆kk=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[¯(𝓐;zi)(k)𝐕k]=Tk,1+Tk,2+Tk,3+Tk,4,\begin{split}&\mathbf{G}_{k}-\nabla_{k}\mathcal{R}=\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}]\\ =&~{}T_{k,1}+T_{k,2}+T_{k,3}+T_{k,4},\end{split} (B.1)

where

Tk,1=𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]𝔼[¯(𝓐;zi)(k)𝐕k],Tk,2=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)],Tk,3=𝔼[¯(𝓐;zi)(k)𝐕k]𝔼[¯(𝓐;zi)(k)𝐕k]+𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)],Tk,4=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]+𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]\begin{split}T_{k,1}=&\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}],\\ T_{k,2}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)],\\ T_{k,3}=&\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}]\\ &+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)]-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)],\\ T_{k,4}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)\\ &-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)]+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)]\end{split} (B.2)

Similarly, for 𝓢\mathscr{S}, its robust gradient estimator is

𝓖0=1ni=1nT(¯(𝓐;zi)×j=1d𝐔j,τ).\mbox{\boldmath$\mathscr{G}$}_{0}=\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau).

We can also decompose \mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}] into four components,

\begin{split}&\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]=\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}]\\ =&T_{0,1}+T_{0,2}+T_{0,3}+T_{0,4},\end{split} (B.3)

where

T0,1=𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)]𝔼[¯(𝓐;zi)×j=1d𝐔j],T0,2=1ni=1nT(¯(𝓐;zi)×j=1d𝐔j,τ)𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)],T0,3=𝔼[¯(𝓐;zi)×j=1d𝐔j]𝔼[¯(𝓐;zi)×j=1d𝐔j]+𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)]𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)],T0,4=1ni=1nT(¯(𝓐;zi)×j=1d𝐔j,τ)1ni=1nT(¯(𝓐;zi)×j=1d𝐔j,τ)𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)]+𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)].\begin{split}T_{0,1}=&\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}],\\ T_{0,2}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)],\\ T_{0,3}=&\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}]\\ &+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)]-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)],\\ T_{0,4}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)-\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)\\ &-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)]+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)].\end{split} (B.4)

To prove the stability of the robust gradient estimators, it suffices to derive suitable upper bounds on \|T_{k,j}\|_{\text{F}} for 0\leq k\leq d and 1\leq j\leq 4.
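Before proceeding, a minimal sketch of the truncated averaging defining these estimators may be helpful; it assumes that \text{T}(\cdot,\tau) acts entrywise, replacing each entry x by \text{sign}(x)\min(|x|,\tau), and treats the per-sample projected gradients as given. Function names are illustrative only.

import numpy as np

def truncate(M, tau):
    # Entrywise truncation T(M, tau): clip each entry of M to the interval [-tau, tau].
    return np.clip(M, -tau, tau)

def robust_gradient(sample_grads, tau):
    # n^{-1} sum_i T(g_i, tau): average of the truncated per-sample gradients,
    # where g_i is the projected gradient for a factor or for the core tensor.
    return np.mean([truncate(g, tau) for g in sample_grads], axis=0)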

B.1 Proof of Theorem 4.5

Proof.

The proof consists of seven steps. In the first six steps, we prove the stability of the robust gradient estimators for a generic 1\leq t\leq T and hence omit the superscript (t) for simplicity. Specifically, in the first four steps, we derive the upper bounds for \|T_{k,1}\|_{\text{F}},\dots,\|T_{k,4}\|_{\text{F}}, respectively, for 1\leq k\leq d_{0}. In the fifth and sixth steps, we extend the proof to the terms for d_{0}+1\leq k\leq d and to the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions. Throughout the first six steps, we assume that, for each 1\leq k\leq d, \|\mathbf{U}_{k}\|\asymp\bar{\sigma}^{1/(d+1)} and \|\sin\theta(\mathbf{U}_{k},\mathbf{U}_{k}^{*})\|\leq\delta, and we verify these assumptions in the last step.

Step 1. (Bound Tk,1F\|T_{k,1}\|_{\text{F}} for 1kd01\leq k\leq d_{0})

For any 1kd01\leq k\leq d_{0}, we let rk=r1r2rd0/rkr_{k}^{\prime}=r_{1}r_{2}\cdots r_{d_{0}}/r_{k} and

¯(𝓐;zi)(k)𝐕k=[(𝓧i×j=1,jkd0𝐔j)(k)vec(𝓔i×j=1dd0𝐔d0+j)]𝓢(k).\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}=[(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}^{\top})_{(k)}\otimes\text{vec}(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top})^{\top}]\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}. (B.5)

Denote the columns of 𝓢(k)\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} as 𝓢(k)=[𝐬k,1,sk,2,,𝐬k,rk]\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}=[\mathbf{s}_{k,1},s_{k,2},\dots,\mathbf{s}_{k,r_{k}}] such that vec(𝐒k,j)=𝐬k,j\text{vec}(\mathbf{S}_{k,j})=\mathbf{s}_{k,j}. The (l,m)(l,m)-th entry of n1i=1n¯(𝓐;zi)(k)𝐕kn^{-1}\sum_{i=1}^{n}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k} is

1ni=1n([(𝓧i×j=1,jkd0𝐔j)(k)vec(𝓔i×j=1dd0𝐔d0+j)]𝐬k,m)l=1ni=1n[(𝓧i×j=1,jkd0𝐔j)(k)𝐒k,mvec(𝓔i×j=1dd0𝐔d0+j)]l=1ni=1n[(𝓧i)(k)(j=1,jkd0𝐔j)𝐒k,m(j=d0+1d𝐔j)(𝐞i)]l=1ni=1n𝐜l(𝓧i)(k)(j=1,jkd0𝐔j)𝐒k,m(j=d0+1d𝐔j)(𝐞i),\begin{split}&\frac{1}{n}\sum_{i=1}^{n}\left(\left[(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}^{\top})_{(k)}\otimes\text{vec}(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top})^{\top}\right]\mathbf{s}_{k,m}\right)_{l}\\ &=\frac{1}{n}\sum_{i=1}^{n}\left[(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}^{\top})_{(k)}\mathbf{S}_{k,m}\text{vec}(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top})\right]_{l}\\ &=\frac{1}{n}\sum_{i=1}^{n}\left[(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}(\otimes_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j})\mathbf{S}_{k,m}(\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top})(-\mathbf{e}_{i})\right]_{l}\\ &=\frac{1}{n}\sum_{i=1}^{n}\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}(\otimes_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j})\mathbf{S}_{k,m}(\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top})(-\mathbf{e}_{i}),\end{split} (B.6)

where 𝐜l\mathbf{c}_{l} is the coordinate vector whose ll-th entry is one and the others are zero.

For the fixed 𝐔j\mathbf{U}_{j}’s, let 𝐌k,1=(j=1,jkd0𝐔j)/j=1,jkd0𝐔j\mathbf{M}_{k,1}=(\otimes_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j})/\|\otimes_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}\| and 𝐜l(𝓧i)(k)𝐌k,1=(wk,l,1(i),wk,l,2(i),,wk,l,rk(i))\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k,1}=(w^{(i)}_{k,l,1},w^{(i)}_{k,l,2},\dots,w^{(i)}_{k,l,r_{k}^{\prime}}). Similarly, let 𝐌k,2=(j=d0+1d𝐔j)/j=d0+1d𝐔j\mathbf{M}_{k,2}=(\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top})/\|\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top}\| and 𝐌k,2𝐞i=(zk,1(i),zk,2(i),,zk,rd0+1rd0+2rd(i))\mathbf{M}_{k,2}\mathbf{e}_{i}=(z^{(i)}_{k,1},z^{(i)}_{k,2},\dots,z^{(i)}_{k,r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}})^{\top}. By Assumption 4.3, 𝔼|wk,l,j(i)|1+ϵMx,1+ϵ,δ\mathbb{E}|w^{(i)}_{k,l,j}|^{1+\epsilon}\leq M_{x,1+\epsilon,\delta} and 𝔼|zk,m(i)|1+ϵMe,1+ϵ,δ\mathbb{E}|z^{(i)}_{k,{m^{\prime}}}|^{1+\epsilon}\leq M_{e,1+\epsilon,\delta}, for j=1,2,,rkj=1,2,\dots,r_{k}^{\prime}, l=1,2,,pkl=1,2,\dots,p_{k}, and m=1,2,,rd0+1rd0+2rd{m^{\prime}}=1,2,\dots,r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}. Let 𝐌k,3,m=𝐒k,m/𝐒k,m\mathbf{M}_{k,3,m}=\mathbf{S}_{k,m}/\|\mathbf{S}_{k,m}\| and 𝐌k,3,m𝐌k,2𝐞i=(zk,m,1(i),zk,m,2(i),,zk,m,rk(i))\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}\mathbf{e}_{i}=(z_{k,m,1}^{(i)},z_{k,m,2}^{(i)},\dots,z_{k,m,r_{k}^{\prime}}^{(i)}), for m=1,2,,rkm=1,2,\dots,r_{k}. Then, 𝔼[|zk,m,j(i)|1+ϵ|𝓧i]Me,1+ϵ,δ\mathbb{E}[|z_{k,m,j}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}]\lesssim M_{e,1+\epsilon,\delta}. Let vk,j,l,m(i)=wk,l,j(i)zk,m,j(i)v^{(i)}_{k,j,l,m}=w^{(i)}_{k,l,j}z^{(i)}_{k,m,j}, which satisfies that

𝔼[|vk,j,l,m(i)|1+ϵ]=𝔼[|wk,j,l(i)|1+ϵ𝔼[|zk,m,j(i)|1+ϵ|𝓧i]]Mx,1+ϵ,δMe,1+ϵ,δ\mathbb{E}\left[|v_{k,j,l,m}^{(i)}|^{1+\epsilon}\right]=\mathbb{E}\left[|w_{k,j,l}^{(i)}|^{1+\epsilon}\cdot\mathbb{E}\left[|z_{k,m,j}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right]\right]\lesssim M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta} (B.7)

Let vk,l,m(i)=j=1rkvk,j,l,m(i)v^{(i)}_{k,l,m}=\sum_{j=1}^{r_{k}^{\prime}}v_{k,j,l,m}^{(i)} and 𝔼[|vk,l,m(i)|1+ϵ]Mx,1+ϵ,δMe,1+ϵ,δ\mathbb{E}[|v^{(i)}_{k,l,m}|^{1+\epsilon}]\lesssim M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}.

We first bound the bias, namely Tk,1T_{k,1} in (B.3). We have that

Tk,1F2σ¯2dd+1l=1pkm=1rk|𝔼[vk,l,m(i)]𝔼[T(vk,l,m(i),τk)]|2,\|T_{k,1}\|_{\text{F}}^{2}\asymp\bar{\sigma}^{\frac{2d}{d+1}}\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}\left|\mathbb{E}[v^{(i)}_{k,l,m}]-\mathbb{E}[\text{T}(v^{(i)}_{k,l,m},\tau_{k})]\right|^{2}, (B.8)

where τk=τj=1,jkd𝐔j1(max1mrk𝐒k,m)1[nMx,1+ϵ,δMe,1+ϵ,δ/log(p¯)]1/(1+ϵ)\tau_{k}=\tau\cdot\|\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j}\|^{-1}\cdot(\max_{1\leq m\leq r_{k}}\|\mathbf{S}_{k,m}\|)^{-1}\asymp[nM_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}/\log(\bar{p})]^{1/(1+\epsilon)}.

For any l=1,2,,pkl=1,2,\dots,p_{k} and m=1,2,,rkm=1,2,\dots,r_{k}, by definition of the truncation operator T(,)\text{T}(\cdot,\cdot) and Markov’s inequality,

|𝔼[vk,l,m(i)]𝔼[T(vk,l,m(i),τk)]|𝔼[|vk,l,m(i)|1{|vk,l,m(i)|τk}]𝔼[|vk,l,m(i)|1+ϵ]1/(1+ϵ)(|vk,l,m(i)|τk)ϵ/(1+ϵ)𝔼[|vk,l,m(i)|1+ϵ]1/(1+ϵ)(𝔼[|vk,l,m(i)|1+ϵ]τk1+ϵ)ϵ/(1+ϵ)Mx,1+ϵ,δMe,1+ϵ,δτkϵ[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ)\begin{split}&\left|\mathbb{E}\left[v^{(i)}_{k,l,m}\right]-\mathbb{E}\left[\text{T}(v^{(i)}_{k,l,m},\tau_{k})\right]\right|\leq\mathbb{E}\left[|v^{(i)}_{k,l,m}|\cdot 1\{|v^{(i)}_{k,l,m}|\geq\tau_{k}\}\right]\\ \leq&\mathbb{E}\left[|v^{(i)}_{k,l,m}|^{1+\epsilon}\right]^{1/(1+\epsilon)}\cdot\mathbb{P}(|v^{(i)}_{k,l,m}|\geq\tau_{k})^{\epsilon/(1+\epsilon)}\\ \leq&\mathbb{E}\left[|v^{(i)}_{k,l,m}|^{1+\epsilon}\right]^{1/(1+\epsilon)}\cdot\left(\frac{\mathbb{E}\left[|v^{(i)}_{k,l,m}|^{1+\epsilon}\right]}{\tau_{k}^{1+\epsilon}}\right)^{\epsilon/(1+\epsilon)}\\ \asymp&~{}M_{x,1+\epsilon,\delta}\cdot M_{e,1+\epsilon,\delta}\cdot\tau_{k}^{-\epsilon}\asymp\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}\end{split} (B.9)

with truncation parameter

τk[nMx,1+ϵ,δMe,1+ϵ,δlog(p¯)]1/(1+ϵ).\tau_{k}\asymp\left[\frac{nM_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}}{\log(\bar{p})}\right]^{1/(1+\epsilon)}. (B.10)

Hence, for k=1,,d0k=1,\dots,d_{0},

Tk,1Fσ¯d/(d+1)pkrk[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ1+ϵ.\left\|T_{k,1}\right\|_{\text{F}}\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{\epsilon}{1+\epsilon}}. (B.11)
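To make the truncation mechanism of this step concrete, the following minimal sketch (illustrative only: the heavy-tailed sample, the values of n, \epsilon, and \log(\bar{p}), and the omitted moment constants are hypothetical and not part of the proof) implements the entrywise operator T(\cdot,\tau) and compares the plain and truncated sample means at a truncation level of the order in (B.10).

import numpy as np

def truncate(v, tau):
    # Entrywise truncation operator T(v, tau) = sign(v) * min(|v|, tau).
    return np.sign(v) * np.minimum(np.abs(v), tau)

rng = np.random.default_rng(1)
n, eps, log_pbar = 2000, 0.5, np.log(50.0)   # illustrative n, epsilon, and log(p_bar)
v = rng.standard_t(df=1.6, size=n)           # heavy-tailed entries with a finite (1 + eps)-th moment
tau = (n / log_pbar) ** (1.0 / (1.0 + eps))  # truncation level of the order in (B.10), moment constants set to one

print("plain mean:", v.mean(), "truncated mean:", truncate(v, tau).mean())

The truncated average trades a bias of order \tau_{k}^{-\epsilon} (up to moment constants) for Bernstein-type concentration, which is exactly the split into the bias term T_{k,1} and the deviation term T_{k,2} carried out in Steps 1 and 2.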

Step 2. (Bound Tk,2F\|T_{k,2}\|_{\text{F}} for 1kd01\leq k\leq d_{0})

For Tk,2T_{k,2} in (B.3), similarly to Tk,1T_{k,1},

Tk,2F2σ¯2dd+1l,m|1ni=1nT(vk,l,m(i),τk)𝔼[T(vk,l,m(i),τk)]|2.\begin{split}&\|T_{k,2}\|_{\text{F}}^{2}\asymp\bar{\sigma}^{\frac{2d}{d+1}}\sum_{l,m}\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}[\text{T}(v^{(i)}_{k,l,m},\tau_{k})]\right|^{2}.\end{split} (B.12)

For each i=1,2,,ni=1,2,\dots,n, it can be checked that

𝔼[T(vk,l,m(i),τk)2]τk1ϵ𝔼[|vk,l,m(i)|1+ϵ]τk1ϵMx,1+ϵ,δMe,1+ϵ,δ.\mathbb{E}\left[\text{T}(v^{(i)}_{k,l,m},\tau_{k})^{2}\right]\leq\tau_{k}^{1-\epsilon}\cdot\mathbb{E}\left[|v^{(i)}_{k,l,m}|^{1+\epsilon}\right]\asymp\tau_{k}^{1-\epsilon}\cdot M_{x,1+\epsilon,\delta}\cdot M_{e,1+\epsilon,\delta}. (B.13)

Thus, we have the upper bound for the variance

var(T(vk,l,m(i),τk))𝔼[T(vk,l,m(i),τk)2]τk1ϵMx,1+ϵ,δMe,1+ϵ,δ.\text{var}(\text{T}(v^{(i)}_{k,l,m},\tau_{k}))\leq\mathbb{E}\left[\text{T}(v^{(i)}_{k,l,m},\tau_{k})^{2}\right]\lesssim\tau_{k}^{1-\epsilon}\cdot M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}. (B.14)

Also, for any q=3,4,q=3,4,\dots, the higher-order moments satisfy that

𝔼[|T(vk,l,m(i),τk)𝔼[T(vk,l,m(i),τk)]|q](2τk)q2𝔼[(T(vk,l,m(i),τk)𝔼[T(vk,l,m(i),τk)])2].\begin{split}&\mathbb{E}\left[\left|\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}[\text{T}(v^{(i)}_{k,l,m},\tau_{k})]\right|^{q}\right]\leq(2\tau_{k})^{q-2}\cdot\mathbb{E}\left[\left(\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}[\text{T}(v^{(i)}_{k,l,m},\tau_{k})]\right)^{2}\right].\end{split} (B.15)

By Bernstein’s inequality, for any 1lpk1\leq l\leq p_{k}, 1mrk1\leq m\leq r_{k}, and 0<tτkϵMx,1+ϵ,δMe,1+ϵ,δ0<t\lesssim\tau_{k}^{-\epsilon}M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta},

(|1ni=1nT(vk,l,m(i),τk)𝔼T(vk,l,m(i),τk)|t)2exp(nt24τk1ϵMx,1+ϵ,δMe,1+ϵ,δ).\begin{split}&\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}\text{T}(v^{(i)}_{k,l,m},\tau_{k})\right|\geq t\right)\leq 2\exp\left(-\frac{nt^{2}}{4\tau_{k}^{1-\epsilon}M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}}\right).\end{split} (B.16)
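For the choice of t made below, a direct calculation with \tau_{k} as in (B.10) shows that

t\asymp\tau_{k}^{-\epsilon}M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}\qquad\text{and}\qquad\frac{nt^{2}}{\tau_{k}^{1-\epsilon}M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}}\asymp\log(\bar{p}),

so this t lies, up to constants, at the boundary of the admissible range in (B.16), and the resulting exponent is of order \log(\bar{p}).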

Let t=CMx,1+ϵ,δ1/(1+ϵ)Me,1+ϵ,δ1/(1+ϵ)log(p¯)ϵ/(1+ϵ)nϵ/(1+ϵ)t=CM_{x,1+\epsilon,\delta}^{1/(1+\epsilon)}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\log(\bar{p})^{\epsilon/(1+\epsilon)}n^{-\epsilon/(1+\epsilon)}. Therefore, we have

(|1ni=1nT(vk,l,m(i),τk)𝔼T(vk,l,m(i),τk)|C[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ))Cexp(Clog(p¯))\begin{split}&\mathbb{P}\Bigg{(}\Bigg{|}\frac{1}{n}\sum_{i=1}^{n}\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}\text{T}(v^{(i)}_{k,l,m},\tau_{k})\Bigg{|}\\ &~{}~{}~{}~{}~{}~{}~{}\geq C\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}\Bigg{)}\leq C\exp\left(-C\log(\bar{p})\right)\end{split} (B.17)

and

(max1lpk1mrk|1ni=1nT(vk,l,m(i),τk)𝔼T(vk,l,m(i),τk)|C[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ))Cpkrkexp(Clog(p¯))Cexp(Clog(p¯)).\begin{split}&\mathbb{P}\left(\max_{\begin{subarray}{c}1\leq l\leq p_{k}\\ 1\leq m\leq r_{k}\end{subarray}}\Bigg{|}\frac{1}{n}\sum_{i=1}^{n}\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}\text{T}(v^{(i)}_{k,l,m},\tau_{k})\Bigg{|}\geq C\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}\right)\\ &\leq Cp_{k}r_{k}\exp\left(-C\log(\bar{p})\right)\leq C\exp(-C\log(\bar{p})).\end{split} (B.18)

Hence, for 1kd01\leq k\leq d_{0}, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]Fσ¯d/(d+1)pkrk[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ).\begin{split}&\left\|\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}.\end{split} (B.19)

Step 3. (Bound Tk,3F\|T_{k,3}\|_{\text{F}} for 1kd01\leq k\leq d_{0})

The tensor linear regression can be rewritten as

vec(𝓨i)=𝐲i=𝐀𝐱i+𝐞i=mat(𝓐)vec(𝓧i)+vec(𝓔i).\text{vec}(\mbox{\boldmath$\mathscr{Y}$}_{i})=\mathbf{y}_{i}=\mathbf{A}^{*}\mathbf{x}_{i}+\mathbf{e}_{i}=\text{mat}(\mbox{\boldmath$\mathscr{A}$}^{*})\text{vec}(\mbox{\boldmath$\mathscr{X}$}_{i})+\text{vec}(\mbox{\boldmath$\mathscr{E}$}_{i}). (B.20)

Accordingly, we denote the matricizations of 𝓐\mbox{\boldmath$\mathscr{A}$} and 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} by 𝐀\mathbf{A} and 𝐀\mathbf{A}^{*}, respectively.
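As a quick numerical sanity check of this vectorization (illustrative only: the dimensions, the split into d_{0}=2 covariate modes, and numpy's row-major reshape as the vec(\cdot) convention are assumptions of the sketch, not part of the proof), the contraction form and the matricized form in (B.20) agree entrywise.

import numpy as np

rng = np.random.default_rng(0)
p_in, p_out = (3, 4), (2, 5)             # covariate dimensions (d0 = 2) and response dimensions
A = rng.standard_normal(p_in + p_out)    # coefficient tensor with the covariate modes first
X = rng.standard_normal(p_in)            # covariate tensor X_i
E = rng.standard_normal(p_out)           # noise tensor E_i

# Tensor form: contract the covariate modes of A with X_i, then add noise.
Y = np.tensordot(X, A, axes=([0, 1], [0, 1])) + E

# Matricized form: mat(A*) maps vec(X_i) to vec(Y_i) - vec(E_i).
A_mat = A.reshape(np.prod(p_in), np.prod(p_out)).T
assert np.allclose(Y.ravel(), A_mat @ X.ravel() + E.ravel())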

As in step 1,

𝔼[¯(𝓐;zi)(k)𝐕k¯(𝓐;zi)(k)𝐕k]F2𝐕k2𝔼[(𝐀𝐀)𝐱i𝐱i]F2σ¯2d/(d+1)βx2𝓐𝓐F2.\begin{split}&\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}-\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}]\|_{\text{F}}^{2}\\ \leq&\|\mathbf{V}_{k}\|^{2}\cdot\|\mathbb{E}[(\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|_{\text{F}}^{2}\lesssim~{}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.21)
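The last bound in (B.21) uses that, for the squared loss, the gradient difference equals ((\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i})\mathbf{x}_{i}^{\top} in matricized form; reading \beta_{x} as the smoothness constant of the population loss (consistent with the restricted strong smoothness invoked in Step 7, so that \|\mathbb{E}[\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|\lesssim\beta_{x}),

\|\mathbb{E}[(\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|_{\text{F}}=\|(\mathbf{A}-\mathbf{A}^{*})\mathbb{E}[\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|_{\text{F}}\leq\|\mathbb{E}[\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|\cdot\|\mathbf{A}-\mathbf{A}^{*}\|_{\text{F}}\lesssim\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}.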

For 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k}, let the (l,m)(l,m)-th entry of ¯(𝓐;zi)(k)𝐕k\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k} and ¯(𝓐;zi)(k)𝐕k\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k} be xk,l,m(i)x_{k,l,m}^{(i)} and yk,l,m(i)y_{k,l,m}^{(i)}, respectively. Let 𝔼[|xk,l,m(i)yk,l,m(i)|2]=sk,l,m2\mathbb{E}[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|^{2}]=s_{k,l,m}^{2}, and then l=1pkm=1rksk,l,m2σ¯2d/(d+1)βx2𝓐𝓐F2\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.

Similarly to vk,l,m(i)v^{(i)}_{k,l,m}, let 𝐌k,3,m𝐌k,2(𝐀𝐀)𝐱i=(rk,m,1(i),rk,m,2(i),,rk,m,rk(i))\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}(\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}=(r_{k,m,1}^{(i)},r_{k,m,2}^{(i)},\dots,r_{k,m,r_{k}^{\prime}}^{(i)})^{\top}. Note that 𝐀σ¯\|\mathbf{A}\|\lesssim\bar{\sigma} and 𝐀σ¯\|\mathbf{A}^{*}\|\lesssim\bar{\sigma}, and hence 𝐀𝐀σ¯\|\mathbf{A}-\mathbf{A}^{*}\|\lesssim\bar{\sigma}. Then,

𝐜l(𝓧i)(k)𝐌k,1𝐌k,3,m𝐌k,2((𝐀𝐀)𝐱i𝐞i)=𝐜l(𝓧i)(k)𝐌k,1𝐌k,3,m𝐌k,2𝐞i+𝐜l(𝓧i)(k)𝐌k,1𝐌k,3,m𝐌k,2(𝐀𝐀)𝐱i=j=1rkwk,l,j(i)zk,m,j(i)+j=1rkwk,j,l(i)rk,m,j(i):=vk,l,m(i)+uk,l,m(i),\begin{split}&\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k,1}\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}((\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}-\mathbf{e}_{i})\\ =&-\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k,1}\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}\mathbf{e}_{i}+\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k,1}\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}(\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}\\ =&\sum_{j=1}^{r_{k}^{\prime}}w_{k,l,j}^{(i)}z_{k,m,j}^{(i)}+\sum_{j=1}^{r_{k}^{\prime}}w_{k,j,l}^{(i)}r_{k,m,j}^{(i)}:=v_{k,l,m}^{(i)}+u_{k,l,m}^{(i)},\end{split} (B.22)

where 𝔼[|vk,l,m(i)|1+ϵ]Mx,1+ϵ,δMe,1+ϵ,δ\mathbb{E}[|v_{k,l,m}^{(i)}|^{1+\epsilon}]\lesssim M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta} and 𝔼[|uk,l,m(i)|]σ¯βx\mathbb{E}[|u_{k,l,m}^{(i)}|]\lesssim\bar{\sigma}\beta_{x}.

For any 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k},

|𝔼[xk,l,m(i)yk,l,m(i)]𝔼[T(xk,l,m(i),τ)T(yk,l,m(i),τ)]|𝔼[|xk,l,m(i)yk,l,m(i)|1{|vk,l,m(i)+uk,l,m(i)|τk}]𝔼[|xk,l,m(i)yk,l,m(i)|(1{|vk,l,m(i)|τk/2}+1{|uk,l,m(i)|τk/2})]𝔼[|xk,l,m(i)yk,l,m(i)|2]1/2[(|vk,l,m(i)|τk/2)1/2+(|uk,l,m(i)|τk/2)1/2]sk,l,m([𝔼|vk,l,m(i)|1+ϵ(τk/2)1+ϵ]1/2+[𝔼|uk,l,m(i)|τk/2]1/2)sk,l,m[log(p¯)1/2n1/2+σ¯1/2βx1/2log(p¯)12+2ϵMx,1+ϵ,δ12+2ϵMe,1+ϵ,δ12+2ϵn12+2ϵ]:=sk,l,mϕδ1/2.\begin{split}&\left|\mathbb{E}[x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}]-\mathbb{E}[\text{T}(x_{k,l,m}^{(i)},\tau)-\text{T}(y_{k,l,m}^{(i)},\tau)]\right|\\ \leq&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|\cdot 1\{|v^{(i)}_{k,l,m}+u^{(i)}_{k,l,m}|\geq\tau_{k}\}\right]\\ \leq&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|\cdot(1\{|v^{(i)}_{k,l,m}|\geq\tau_{k}/2\}+1\{|u^{(i)}_{k,l,m}|\geq\tau_{k}/2\})\right]\\ \leq&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|^{2}\right]^{1/2}\cdot\left[\mathbb{P}\left(|v^{(i)}_{k,l,m}|\geq\tau_{k}/2\right)^{1/2}+\mathbb{P}\left(|u^{(i)}_{k,l,m}|\geq\tau_{k}/2\right)^{1/2}\right]\\ \leq&~{}s_{k,l,m}\cdot\left(\left[\frac{\mathbb{E}|v_{k,l,m}^{(i)}|^{1+\epsilon}}{(\tau_{k}/2)^{1+\epsilon}}\right]^{1/2}+\left[\frac{\mathbb{E}|u_{k,l,m}^{(i)}|}{\tau_{k}/2}\right]^{1/2}\right)\\ \asymp&~{}s_{k,l,m}\left[\log(\bar{p})^{1/2}n^{-1/2}+\bar{\sigma}^{1/2}\beta_{x}^{1/2}\log(\bar{p})^{\frac{1}{2+2\epsilon}}M_{x,1+\epsilon,\delta}^{\frac{-1}{2+2\epsilon}}M_{e,1+\epsilon,\delta}^{\frac{-1}{2+2\epsilon}}n^{\frac{-1}{2+2\epsilon}}\right]:=s_{k,l,m}\phi_{\delta}^{1/2}.\end{split} (B.23)

Hence,

Tk,3F2=l=1pkm=1rk|𝔼[xk,l,m(i)yk,l,m(i)]𝔼[T(xk,l,m(i),τ)T(yk,l,m(i),τ)]|2ϕδl=1pkm=1rksk,l,m2ϕδσ¯2d/(d+1\begin{split}\|T_{k,3}\|_{\text{F}}^{2}=&\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}\left|\mathbb{E}[x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}]-\mathbb{E}[\text{T}(x_{k,l,m}^{(i)},\tau)-\text{T}(y_{k,l,m}^{(i)},\tau)]\right|^{2}\\ \lesssim&~{}\phi_{\delta}\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.24)
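For later reference, squaring the quantity \phi_{\delta} introduced in (B.23) gives, up to constants,

\phi_{\delta}\asymp\frac{\log(\bar{p})}{n}+\bar{\sigma}\beta_{x}\left[\frac{\log(\bar{p})}{nM_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}}\right]^{1/(1+\epsilon)},

since (a+b)^{2}\asymp a^{2}+b^{2} for nonnegative a and b.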

Step 4. (Bound Tk,4F\|T_{k,4}\|_{\text{F}} for 1kd01\leq k\leq d_{0})

Let T(¯(𝓐;zi)(k)𝐕k,τ)T(¯(𝓐;zi)(k)𝐕k,τ)=𝐙k(i)={zk,j,l(i)}1jpk,1lrk\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)=\mathbf{Z}^{(i)}_{k}=\{z^{(i)}_{k,j,l}\}_{1\leq j\leq p_{k},1\leq l\leq r_{k}}. Then,

Tk,4F2=j=1pkl=1rk(1ni=1nzk,j,l(i)𝔼[zk,j,l(i)])2\|T_{k,4}\|_{\text{F}}^{2}=\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}\left(\frac{1}{n}\sum_{i=1}^{n}z^{(i)}_{k,j,l}-\mathbb{E}[z_{k,j,l}^{(i)}]\right)^{2} (B.25)

Note that var(zk,j,l(i))=sk,j,l2\text{var}(z_{k,j,l}^{(i)})=s_{k,j,l}^{2} and that j=1pkl=1rksk,j,l2Cσ¯2d/(d+1)βx2𝓐𝓐F2\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}s^{2}_{k,j,l}\leq C\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. Also, |zk,j,l(i)|2τ|z_{k,j,l}^{(i)}|\leq 2\tau. Similarly to the term Tk,2T_{k,2}, by Bernstein’s inequality, for any 0<t<τ1sk,j,l20<t<\tau^{-1}s_{k,j,l}^{2}, 1jpk1\leq j\leq p_{k} and 1lrk1\leq l\leq r_{k},

(|1ni=1nzk,j,l(i)𝔼[zk,j,l(i)]|t)2exp(nt22sk,j,l2).\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}z_{k,j,l}^{(i)}-\mathbb{E}[z_{k,j,l}^{(i)}]\right|\geq t\right)\leq 2\exp\left(-\frac{nt^{2}}{2s^{2}_{k,j,l}}\right). (B.26)

Letting t=Csk,j,llog(p¯)/nt=Cs_{k,j,l}\sqrt{\log(\bar{p})/n}, we have that

(1jpk1lrk|1ni=1nzk,j,l(i)𝔼[zk,j,l(i)]|Csk,j,llog(p¯)/n)Cexp(Clog(p¯)).\mathbb{P}\left(\cup_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left|\frac{1}{n}\sum_{i=1}^{n}z_{k,j,l}^{(i)}-\mathbb{E}[z_{k,j,l}^{(i)}]\right|\geq Cs_{k,j,l}\sqrt{\log(\bar{p})/n}\right)\leq C\exp\left(-C\log(\bar{p})\right). (B.27)

Therefore, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Tk,4F2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2.\|T_{k,4}\|_{\text{F}}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.28)

Combining these results, for any 1kd01\leq k\leq d_{0}, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝐆k𝔼[k]F2ϕδσ¯2d/(d+1)βx2𝓐𝓐F2+σ¯2d/(d+1)(pkrk)[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ.\begin{split}&\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\\ &\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{2d/(d+1)}(p_{k}r_{k})\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}.\end{split} (B.29)

Step 5. (Extension to Tk,1,,Tk,4T_{k,1},\dots,T_{k,4} for d0+1kdd_{0}+1\leq k\leq d)

For d0+1kdd_{0}+1\leq k\leq d, we let rk=rd0+1rd0+2rd/rkr_{k}^{\prime}=r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}/r_{k} and

¯(𝓐;zi)(k)𝐕k=[(𝓔i×j=1,jkd0dd0𝐔d0+j)(kd0)vec(𝓧i×j=1d0𝐔j)]𝓢(k).\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}=[(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1,j\neq k-d_{0}}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top})_{(k-d_{0})}\otimes\text{vec}(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})^{\top}]\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}. (B.30)

The (l,m)(l,m)-th entry of n1i=1n¯(𝓐;zi)(k)𝐕kn^{-1}\sum_{i=1}^{n}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k} is

1ni=1n([(𝓔i×j=1,jkd0dd0𝐔j)(kd0)vec(𝓧i×j=1d0𝐔j)]𝐬k,m)l=1ni=1n𝐜l(𝓔i)(kd0)(j=d0+1,jkd𝐔j)𝐒k,m(j=1d0𝐔j)𝐱i.\begin{split}&\frac{1}{n}\sum_{i=1}^{n}([(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1,j\neq k-d_{0}}^{d-d_{0}}\mathbf{U}_{j}^{\top})_{(k-d_{0})}\otimes\text{vec}(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})^{\top}]\mathbf{s}_{k,m})_{l}\\ &=\frac{1}{n}\sum_{i=1}^{n}\mathbf{c}_{l}^{\top}(-\mbox{\boldmath$\mathscr{E}$}_{i})_{(k-d_{0})}(\otimes_{j=d_{0}+1,j\neq k}^{d}\mathbf{U}_{j})\cdot\mathbf{S}_{k,m}(\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})\mathbf{x}_{i}.\end{split} (B.31)

Let 𝐌k,1=(j=d0+1,jkd𝐔j)/j=d0+1,jkd𝐔j\mathbf{M}_{k,1}=(\otimes_{j=d_{0}+1,j\neq k}^{d}\mathbf{U}_{j})/\|\otimes_{j=d_{0}+1,j\neq k}^{d}\mathbf{U}_{j}\| and 𝐜l(𝓔i)(kd0)𝐌k,1=(uk,l,1(i),uk,l,2(i),,uk,l,rk(i))\mathbf{c}_{l}^{\top}(-\mbox{\boldmath$\mathscr{E}$}_{i})_{(k-d_{0})}\mathbf{M}_{k,1}=(u_{k,l,1}^{(i)},u_{k,l,2}^{(i)},\dots,u_{k,l,r_{k}^{\prime}}^{(i)}). Let 𝐌k,2=(j=1d0𝐔j)/j=1d0𝐔j\mathbf{M}_{k,2}=(\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})/\|\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\| and 𝐌k,2𝐱i=(sk,1(i),sk,2(i),,sk,r1r2rd0(i))\mathbf{M}_{k,2}\mathbf{x}_{i}=(s_{k,1}^{(i)},s_{k,2}^{(i)},\dots,s_{k,r_{1}r_{2}\cdots r_{d_{0}}}^{(i)})^{\top}.

By Assumption 4.3, 𝔼[|uk,l,j(i)|1+ϵ|𝓧i]Me,1+ϵ,δ\mathbb{E}[|u_{k,l,j}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}]\leq M_{e,1+\epsilon,\delta} and 𝔼[|sk,j(i)|1+ϵ]Mx,1+ϵ,δ\mathbb{E}[|s_{k,j^{\prime}}^{(i)}|^{1+\epsilon}]\leq M_{x,1+\epsilon,\delta}, for j=1,2,,r1r2rd0j^{\prime}=1,2,\dots,r_{1}r_{2}\cdots r_{d_{0}} and l=1,2,,pkl=1,2,\dots,p_{k}. Let 𝐌k,3,m=𝐒k,m/𝐒k,m\mathbf{M}_{k,3,m}=\mathbf{S}_{k,m}/\|\mathbf{S}_{k,m}\| and 𝐌k,3,m𝐌k,2𝐱i=(sk,m,1(i),sk,m,2(i),,sk,m,rk(i))\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}\mathbf{x}_{i}=(s_{k,m,1}^{(i)},s_{k,m,2}^{(i)},\dots,s_{k,m,r_{k}^{\prime}}^{(i)}), where 𝔼[|sk,m,j(i)|1+ϵ]Mx,1+ϵ,δ\mathbb{E}[|s_{k,m,j}^{(i)}|^{1+\epsilon}]\lesssim M_{x,1+\epsilon,\delta}. Let rk,j,l,m(i)=uk,l,j(i)sk,m,j(i)r_{k,j,l,m}^{(i)}=u_{k,l,j}^{(i)}s_{k,m,j}^{(i)}. Hence, following the same arguments as in Steps 1 and 2, we have

𝔼[T(¯(𝓐)(k)𝐕k𝓢(k),τ)]𝔼[¯(𝓐)(k)𝐕k𝓢(k)]Fσ¯d/(d+1)pkrk[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ).\begin{split}&\left\|\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})_{(k)}\mathbf{V}_{k}^{\top}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}.\end{split} (B.32)

Also, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

1ni=1nT(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)𝔼[T(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)]Fσ¯d/(d+1)pkrk[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ).\begin{split}&\left\|\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}.\end{split} (B.33)

Moreover,

Tk,3F2ϕδσ¯2d/(d+1)βx2𝓐𝓐F2,\begin{split}&\|T_{k,3}\|_{\text{F}}^{2}\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2},\end{split} (B.34)

and with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Tk,4F2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2.\|T_{k,4}\|_{\text{F}}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.35)

Therefore, for d0+1kdd_{0}+1\leq k\leq d, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝐆k𝔼[k]F2ϕδσ¯2dd+1βx2𝓐𝓐F2+σ¯2dd+1(pkrk)[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ.\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\phi_{\delta}\bar{\sigma}^{\frac{2d}{d+1}}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{\frac{2d}{d+1}}(p_{k}r_{k})\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}. (B.36)

Step 6. (Extension to core tensor)

For the partial gradient with respect to the core tensor 𝓢\mathscr{S}, we have

¯(𝓐;zi)×j=1d𝐔j=(𝓧i×j=1d0𝐔j)(𝓔i×j=d0+1d𝐔j).\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}=(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})\circ(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top}). (B.37)

Let 𝐌0,1=j=1d0𝐔j/j=1d0𝐔j\mathbf{M}_{0,1}=\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}/\|\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}\| and 𝐌0,1𝐱i=(w0,1(i),,w0,r1r2rd0(i))\mathbf{M}_{0,1}^{\top}\mathbf{x}_{i}=(w_{0,1}^{(i)},\dots,w_{0,r_{1}r_{2}\cdots r_{d_{0}}}^{(i)})^{\top}, and let 𝐌0,2=j=d0+1d𝐔j/j=d0+1d𝐔j\mathbf{M}_{0,2}=\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}/\|\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}\| and 𝐌0,2𝐞i=(z0,1(i),,z0,rd0+1rd0+2rd(i))\mathbf{M}_{0,2}^{\top}\mathbf{e}_{i}=(z_{0,1}^{(i)},\dots,z_{0,r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}}^{(i)})^{\top}. By Assumption 4.3, 𝔼[|w0,j(i)|1+ϵ]Mx,1+ϵ,δ\mathbb{E}[|w_{0,j}^{(i)}|^{1+\epsilon}]\leq M_{x,1+\epsilon,\delta} and 𝔼[|z0,m(i)|1+ϵ|𝓧i]Me,1+ϵ,δ\mathbb{E}[|z_{0,m}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}]\leq M_{e,1+\epsilon,\delta}, for all j=1,2,,r1r2rd0j=1,2,\dots,r_{1}r_{2}\cdots r_{d_{0}} and m=1,2,,rd0+1rd0+2rdm=1,2,\dots,r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}. Let v0,j,m(i)=w0,j(i)z0,m(i)v_{0,j,m}^{(i)}=w_{0,j}^{(i)}z_{0,m}^{(i)}.
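As in (B.7), conditioning on the covariates gives the entrywise moment bound

\mathbb{E}\left[|v_{0,j,m}^{(i)}|^{1+\epsilon}\right]=\mathbb{E}\left[|w_{0,j}^{(i)}|^{1+\epsilon}\cdot\mathbb{E}\left[|z_{0,m}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right]\right]\lesssim M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta},

which is the bound behind the estimates for T_{0,1} and T_{0,2} below.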

In a similar fashion, we can show that with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

T0,1Fσ¯d/(d+1)r1r2rd[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ),T0,2Fσ¯d/(d+1)r1r2rd[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ),T0,3FCϕδ1/2σ¯d/(d+1)βx𝓐𝓐F,and T0,4FClog(p¯)nσ¯d/(d+1)βx𝓐𝓐F.\begin{split}\|T_{0,1}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{r_{1}r_{2}\cdots r_{d}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)},\\ \|T_{0,2}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{r_{1}r_{2}\cdots r_{d}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)},\\ \|T_{0,3}\|_{\text{F}}&\lesssim C\phi_{\delta}^{1/2}\bar{\sigma}^{d/(d+1)}\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}},\\ \text{and }\|T_{0,4}\|_{\text{F}}&\lesssim C\sqrt{\frac{\log(\bar{p})}{n}}\bar{\sigma}^{d/(d+1)}\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}.\end{split} (B.38)

Hence, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝓖0𝔼[0]F2ϕδσ¯2dd+1βx2𝓐𝓐F2+σ¯2dd+1k=1drk[Me,1+ϵ,δ1/ϵMx,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ.\|\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\phi_{\delta}\bar{\sigma}^{\frac{2d}{d+1}}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{\frac{2d}{d+1}}\prod_{k=1}^{d}r_{k}\left[\frac{M_{e,1+\epsilon,\delta}^{1/\epsilon}M_{x,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}. (B.39)

Step 7. (Verify the conditions and conclude the proof)

In the last step, we apply the results above to Theorem 3.3. First, we verify that the conditions in Theorem 3.3 hold. Under Assumption 4.3, by Lemma 3.11 in Bubeck (2015), the RCG condition in Definition 3.2 is implied by restricted strong convexity and smoothness with α=αx\alpha=\alpha_{x} and β=βx\beta=\beta_{x}.

Next, we show the stability of the robust gradient estimators for all t=1,2,,Tt=1,2,\dots,T. By matrix perturbation theory, if 𝓐(0)𝓐Fαx/βxσ¯κ2δ\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}\leq\sqrt{\alpha_{x}/\beta_{x}}\underline{\sigma}\kappa^{-2}\delta, we have sinΘ(𝐔k(0),𝐔k)δ\|\sin\Theta(\mathbf{U}_{k}^{(0)},\mathbf{U}_{k}^{*})\|\leq\delta for all k=1,,dk=1,\dots,d. After a finite number of iterations, CTC_{T}, with probability at least 1CTexp(Clog(p¯))1-C_{T}\exp(-C\log(\bar{p})), we can have sinΘ(𝐔k(CT),𝐔k)δ<(42)1\|\sin\Theta(\mathbf{U}_{k}^{(C_{T})},\mathbf{U}_{k}^{*})\|\leq\delta^{\prime}<(4\sqrt{2})^{-1}.

For any lkl\neq k and any tensor 𝓑p1××pd\mbox{\boldmath$\mathscr{B}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}}, (𝓑×jk𝐔j)(l)=𝐔l𝓑(l)(jl𝐔j)(\mbox{\boldmath$\mathscr{B}$}\times_{j\neq k}\mathbf{U}_{j}^{\top})_{(l)}=\mathbf{U}_{l}^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime}), where 𝐔j=𝐔j\mathbf{U}_{j}^{\prime}=\mathbf{U}_{j} for jkj\neq k and 𝐔k=𝐈rk\mathbf{U}_{k}^{\prime}=\mathbf{I}_{r_{k}}. For any 𝐔l𝒞(𝐔l,δ)\mathbf{U}_{l}\in\mathcal{C}(\mathbf{U}_{l}^{*},\delta^{\prime}), we have 𝐔l𝐔l𝐎l2sinΘ(𝐔l,𝐔l)2δ\|\mathbf{U}_{l}-\mathbf{U}^{*}_{l}\mathbf{O}_{l}\|\leq\sqrt{2}\|\sin\Theta(\mathbf{U}_{l},\mathbf{U}_{l}^{*})\|\leq\sqrt{2}\delta^{\prime} for some 𝐎l𝕆rk×rk\mathbf{O}_{l}\in\mathbb{O}^{r_{k}\times r_{k}}. Let 𝚫l=𝐔l𝐔l𝐎l\mathbf{\Delta}_{l}=\mathbf{U}_{l}-\mathbf{U}^{*}_{l}\mathbf{O}_{l} and decompose 𝚫l=𝚫l,1+𝚫l,2\mathbf{\Delta}_{l}=\mathbf{\Delta}_{l,1}+\mathbf{\Delta}_{l,2} where 𝚫l,1,𝚫l,2=0\langle\mathbf{\Delta}_{l,1},\mathbf{\Delta}_{l,2}\rangle=0 and 𝚫l,1/𝚫l,1,𝚫l,2/𝚫l,2𝒞(𝐔l,δ)\mathbf{\Delta}_{l,1}/\|\mathbf{\Delta}_{l,1}\|,\mathbf{\Delta}_{l,2}/\|\mathbf{\Delta}_{l,2}\|\in\mathcal{C}(\mathbf{U}_{l}^{*},\delta^{\prime}). Thus, we have 𝚫l,12δ\|\mathbf{\Delta}_{l,1}\|\leq\sqrt{2}\delta^{\prime} and 𝚫l,22δ\|\mathbf{\Delta}_{l,2}\|\leq\sqrt{2}\delta^{\prime}.

Denote ξ=sup𝐔l𝒞(𝐔l,δ)𝐔l𝓑(l)(jl𝐔j)F\xi=\sup_{\mathbf{U}_{l}\in\mathcal{C}(\mathbf{U}_{l}^{*},\delta^{\prime})}\|\mathbf{U}_{l}^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}. Then, since

𝐔l𝓑(l)(jl𝐔j)F(𝐔l𝐎l)𝓑(l)(jl𝐔j)F+𝚫l𝓑(l)(jl𝐔j)F(𝐔l𝐎l)𝓑(l)(jl𝐔j)F+𝚫l,1(𝚫l,1/𝚫l,1)𝓑(l)(jl𝐔j)F+𝚫l,2(𝚫l,2/𝚫l,2)𝓑(l)(jl𝐔j)F,\begin{split}&\|\mathbf{U}_{l}^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}\\ &\leq\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}+\|\mathbf{\Delta}_{l}^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}\\ &\leq\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}+\|\mathbf{\Delta}_{l,1}\|\cdot\|(\mathbf{\Delta}_{l,1}/\|\mathbf{\Delta}_{l,1}\|)^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}\\ &+\|\mathbf{\Delta}_{l,2}\|\cdot\|(\mathbf{\Delta}_{l,2}/\|\mathbf{\Delta}_{l,2}\|)^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}},\end{split} (B.40)

we have that

ξ(𝐔l𝐎l)𝓑(l)(jl𝐔j)F+(𝚫l,1+𝚫l,2)ξ,\xi\leq\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}+(\|\mathbf{\Delta}_{l,1}\|+\|\mathbf{\Delta}_{l,2}\|)\xi, (B.41)

that is, taking δ=1/8\delta^{\prime}=1/8,

ξ(122δ)1(𝐔l𝐎l)𝓑(l)(jl𝐔j)F2(𝐔l𝐎l)𝓑(l)(jl𝐔j)F.\xi\leq(1-2\sqrt{2}\delta^{\prime})^{-1}\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}\leq 2\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}. (B.42)
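The last inequality is purely numeric: with \delta^{\prime}=1/8,

1-2\sqrt{2}\delta^{\prime}=1-\frac{\sqrt{2}}{4}\approx 0.65\geq\frac{1}{2},\qquad\text{so that}\qquad(1-2\sqrt{2}\delta^{\prime})^{-1}\leq 2.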

Hence, for each iteration t=1,2,,Tt=1,2,\dots,T, combining the results in Steps 1 to 6, we have that, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})), for any k=1,2,,dk=1,2,\dots,d,

𝐆k(t)𝔼[k(t)]F2ϕδσ¯2d/(d+1)βx2𝓐(t)𝓐F2+σ¯2d/(d+1)(pkrk)[Me,1+ϵ,δ1/ϵMx,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ\begin{split}&\|\mathbf{G}_{k}^{(t)}-\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]\|_{\text{F}}^{2}\\ &\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{2d/(d+1)}(p_{k}r_{k})\left[\frac{M_{e,1+\epsilon,\delta}^{1/\epsilon}M_{x,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}\end{split} (B.43)

and

𝓖0(t)𝔼[0(t)]F2ϕδσ¯2d/(d+1)βx2𝓐(t)𝓐F2+σ¯2d/(d+1)k=1drk[Me,1+ϵ,δ1/ϵMx,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ.\begin{split}&\|\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)}-\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]\|_{\text{F}}^{2}\\ &\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{2d/(d+1)}\prod_{k=1}^{d}r_{k}\left[\frac{M_{e,1+\epsilon,\delta}^{1/\epsilon}M_{x,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}.\end{split} (B.44)

As nβxαx2κ4log(p¯)+σ¯(1+ϵ)2(βx/αx)2+2ϵlog(p¯)κ4+4ϵ(Mx,1+ϵ,δMe,1+ϵ,δ)1n\gtrsim\beta_{x}\alpha_{x}^{-2}\kappa^{4}\log(\bar{p})+\bar{\sigma}^{(1+\epsilon)^{2}}(\beta_{x}/\alpha_{x})^{2+2\epsilon}\log(\bar{p})\kappa^{4+4\epsilon}(M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta})^{-1}, plugging these into Theorem 3.3, we have that for all t=1,2,,Tt=1,2,\dots,T and k=1,2,,dk=1,2,\dots,d,

Err(t)(1η0αxβx1κ2/2)tErr(0)+Cαx2σ¯4d/(d+1)κ2k=0d𝚫k(t)F2Err(0)+Cαx2σ¯2d/(d+1)κ4(k=1drk+k=1dpkrk)[Me,1+ϵ,δ1/ϵMx,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ\begin{split}\text{Err}^{(t)}&\leq(1-\eta_{0}\alpha_{x}\beta_{x}^{-1}\kappa^{-2}/2)^{t}\text{Err}^{(0)}+C\alpha_{x}^{-2}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\sum_{k=0}^{d}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}\\ &\leq\text{Err}^{(0)}+C\alpha_{x}^{-2}\bar{\sigma}^{-2d/(d+1)}\kappa^{4}\left(\prod_{k=1}^{d}r_{k}+\sum_{k=1}^{d}p_{k}r_{k}\right)\left[\frac{M_{e,1+\epsilon,\delta}^{1/\epsilon}M_{x,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}\end{split} (B.45)

and

𝓐(t)𝓐F2κ2(1Cαxβx1κ2)t𝓐(0)𝓐F2+κ4αx2(k=1dpkrk+k=1drk)[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]2ϵ/(1+ϵ).\begin{split}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}&\lesssim\kappa^{2}(1-C\alpha_{x}\beta_{x}^{-1}\kappa^{-2})^{t}\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}\\ &+\kappa^{4}\alpha_{x}^{-2}\left(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k}\right)\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{2\epsilon/(1+\epsilon)}.\end{split} (B.46)

Finally, for all t=1,2,,Tt=1,2,\dots,T and k=1,2,,dk=1,2,\dots,d,

sinΘ(𝐔k(t),𝐔k)2σ¯2/(d+1)Err(t)σ¯2d+1Err(0)+Cκ4αx2σ¯2deff[Meff,δlog(p¯)n]2ϵ1+ϵδ2.\|\sin\Theta(\mathbf{U}_{k}^{(t)},\mathbf{U}_{k}^{*})\|^{2}\leq\bar{\sigma}^{-2/(d+1)}\text{Err}^{(t)}\leq\bar{\sigma}^{\frac{-2}{d+1}}\text{Err}^{(0)}+C\kappa^{4}\alpha_{x}^{-2}\bar{\sigma}^{-2}d_{\text{eff}}\left[\frac{M_{\text{eff},\delta}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}\leq\delta^{2}. (B.47)

B.2 Proof of Theorem 4.9

Proof.

The proof consists of six steps. In the first five steps, we prove the stability of the robust gradient estimators for the general 1tT1\leq t\leq T and, hence, we omit the notation (t)(t) for simplicity. Specifically, in the first four steps, we give the upper bounds for Tk,1F,,Tk,4F\|T_{k,1}\|_{\text{F}},\dots,\|T_{k,4}\|_{\text{F}}, respectively, for 1kd1\leq k\leq d. In the fifth step, we extend the proof to the terms for the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions. Throughout the first five steps, we assume that for each 1kd1\leq k\leq d, 𝐔kσ¯1/(d+1)\|\mathbf{U}_{k}\|\asymp\bar{\sigma}^{1/(d+1)} and sinΘ(𝐔k,𝐔k)δ\|\sin\Theta(\mathbf{U}_{k},\mathbf{U}_{k}^{*})\|\leq\delta and will verify them in the last step.

Step 1. (Bound Tk,1F\|T_{k,1}\|_{\text{F}})

For any 1kd1\leq k\leq d, we let rk=r1r2rd/rkr_{k}^{\prime}=r_{1}r_{2}\cdots r_{d}/r_{k} and

¯(𝓐;zi)(k)(j=1,jkd𝐔j)𝓢(k)=(exp(𝓧i,𝓐)1+exp(𝓧i,𝓐)yi)(𝓧i)(k)(j=1,jkd𝐔j)𝓢(k).\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}(\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}=\left(\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}-y_{i}\right)(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}(\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}. (B.48)
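For intuition, the following minimal sketch (illustrative only: the dimensions, the simulated data, and the unfolding convention are assumptions of the sketch, not part of the theorem) computes the logistic score q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i} and the corresponding mode-k gradient factor appearing in (B.48) for a single sample.

import numpy as np

rng = np.random.default_rng(2)
p = (4, 3, 2)                              # illustrative tensor dimensions (d = 3)
A_star = rng.standard_normal(p) / 10.0     # true coefficient tensor (weak signal)
X_i = rng.standard_normal(p)               # covariate tensor X_i
q_i = 1.0 / (1.0 + np.exp(-np.sum(X_i * A_star)))  # q_i(A*) = exp(<X_i, A*>) / (1 + exp(<X_i, A*>))
y_i = rng.binomial(1, q_i)                         # binary response

k = 1                                              # an arbitrary mode (zero-based here)
X_k = np.moveaxis(X_i, k, 0).reshape(p[k], -1)     # a mode-k unfolding of X_i
grad_k = (q_i - y_i) * X_k                         # (q_i(A*) - y_i) (X_i)_(k), the factor in (B.48) before projection
print(grad_k.shape)                                # (p_k, product of the remaining dimensions)

Since the score q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i} is bounded by one in absolute value, only second moments of the projected covariates enter the analysis in this subsection, as used in (B.49).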

Let 𝐌k=(j=1,jkd𝐔j)/j=1,jkd𝐔j\mathbf{M}_{k}=(\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j})/\|\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j}\| and 𝐜l(𝓧i)(k)𝐌k=(wk,l,1(i),wk,l,2(i),,wk,l,rk(i))\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k}=(w^{(i)}_{k,l,1},w^{(i)}_{k,l,2},\dots,w^{(i)}_{k,l,r_{k}^{\prime}}). By Assumption 4.8, 𝔼[|wk,l,j(i)|2]Mx,2,δ\mathbb{E}[|w_{k,l,j}^{(i)}|^{2}]\leq M_{x,2,\delta} for l=1,2,,pkl=1,2,\dots,p_{k} and j=1,2,,rkj=1,2,\dots,r_{k}^{\prime}. Let 𝐍k=𝓢(k)/𝓢(k)\mathbf{N}_{k}=\mbox{\boldmath$\mathscr{S}$}_{(k)}/\|\mbox{\boldmath$\mathscr{S}$}_{(k)}\| and 𝐜l(𝓧i)(k)𝐌k𝐍k=(zk,l,1(i),zk,l,2(i),,zk,l,rk(i))\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k}\mathbf{N}_{k}^{\top}=(z_{k,l,1}^{(i)},z_{k,l,2}^{(i)},\dots,z_{k,l,r_{k}}^{(i)}). Then, 𝔼[|zk,l,j(i)|2]Mx,2,δ\mathbb{E}[|z^{(i)}_{k,l,j}|^{2}]\lesssim M_{x,2,\delta}. Also, denote qi(𝓐)=exp(𝓧i,𝓐)/[1+exp(𝓧i,𝓐)]q_{i}(\mbox{\boldmath$\mathscr{A}$})=\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)/[1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)] for any 𝓐\mathscr{A}.

We first bound the bias Tk,1F\|T_{k,1}\|_{\text{F}}. Note that |qi(𝓐)yi|1|q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i}|\leq 1 and

𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]=𝔼[𝔼[|qi(𝓐)yi|2|𝓧i]|zk,l,j(i)|2]Mx,2,δ.\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]=\mathbb{E}\left[\mathbb{E}\left[|q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i}|^{2}|\mbox{\boldmath$\mathscr{X}$}_{i}\right]\cdot|z_{k,l,j}^{(i)}|^{2}\right]\leq M_{x,2,\delta}. (B.49)

Let τk=τ/j=1,jkd𝐔j[nMx,2,δ/log(p¯)]1/2\tau_{k}=\tau/\|\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j}\|\asymp[nM_{x,2,\delta}/\log(\bar{p})]^{1/2}. Then,

Tk,1F2σ¯2d/(d+1)l=1pkj=1rk|𝔼[(qi(𝓐)yi)zk,l,j(i)]𝔼[T((qi(𝓐)yi)zk,l,j(i),τk)]|2.\|T_{k,1}\|_{\text{F}}^{2}\asymp\bar{\sigma}^{2d/(d+1)}\sum_{l=1}^{p_{k}}\sum_{j=1}^{r_{k}}\left|\mathbb{E}\left[(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}\right]-\mathbb{E}\left[\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right]\right|^{2}. (B.50)

By Hölder’s inequality and Markov’s inequality,

|𝔼[(qi(𝓐)yi)zk,l,j(i)]𝔼[T((qi(𝓐)yi)zk,l,j(i),τk)]|𝔼[|(qi(𝓐)yi)zk,l,j(i)|1{|(qi(𝓐)yi)zk,l,j(i)|τk}]𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]1/2(|(qi(𝓐)yi)zk,l,j(i)|τk)1/2𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]1/2(𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]τk2)1/2Mx,2,δτk1[Mx,2,δlog(p¯)n]1/2.\begin{split}&\left|\mathbb{E}\left[(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}\right]-\mathbb{E}\left[\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right]\right|\\ \leq&\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\cdot 1\{|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\}\right]\\ \leq&\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]^{1/2}\cdot\mathbb{P}(|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k})^{1/2}\\ \leq&\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]^{1/2}\cdot\left(\frac{\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]}{\tau_{k}^{2}}\right)^{1/2}\\ \asymp&~{}M_{x,2,\delta}\cdot\tau_{k}^{-1}\asymp\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{1/2}.\end{split} (B.51)

Hence, we have

𝔼[T(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)]𝔼[¯(𝓐;zi)(k)𝐕k𝓢(k)]Fσ¯dd+1pkrk[Mx,2,δlog(p¯)n]12.\begin{split}&\left\|\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}^{\top}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}^{\top}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{\frac{d}{d+1}}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{\frac{1}{2}}.\end{split} (B.52)

Step 2. (Bound Tk,2F\|T_{k,2}\|_{\text{F}})

For Tk,2T_{k,2} in (B.3), it can be checked that

𝔼[T((qi(𝓐)yi)zk,l,j(i),τk)2]𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]Mx,2,δ.\begin{split}&\mathbb{E}[\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k})^{2}]\leq\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]\lesssim M_{x,2,\delta}.\end{split} (B.53)

Thus, var(T((qi(𝓐)yi)zk,l,j(i),τk))𝔼[T((qi(𝓐)yi)zk,l,j(i),τk)2]Mx,2,δ\text{var}(\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k}))\leq\mathbb{E}[\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k})^{2}]\lesssim M_{x,2,\delta}. Also, for any s=3,4,s=3,4,\dots, the higher-order moments satisfy that

𝔼[(T((q(𝓧i)yi)zk,l,j(i),τk)𝔼[T((q(𝓧i)yi)zk,l,j(i),τk)])s](2τk)s2𝔼[(T((q(𝓧i)yi)zk,l,j(i),τk)𝔼[T((q(𝓧i)yi)zk,l,j(i),τk)])2].\begin{split}&\mathbb{E}\left[(\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}[\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})])^{s}\right]\\ &\leq(2\tau_{k})^{s-2}\mathbb{E}\left[(\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}[\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})])^{2}\right].\end{split} (B.54)

By Bernstein’s inequality, for any 0<t<(2τk)1Mx,2,δ0<t<(2\tau_{k})^{-1}M_{x,2,\delta},

(|1ni=1nT((q(𝓧i)yi)zk,l,j(i),τk)𝔼T((q(𝓧i)yi)zk,l,j(i),τk)|>t)2exp(nt24Mx,2,δ)\begin{split}&\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right|>t\right)\\ &\leq 2\exp\left(-\frac{nt^{2}}{4M_{x,2,\delta}}\right)\end{split} (B.55)

Let t=CMx,2,δ1/2log(p¯)1/2n1/2t=CM_{x,2,\delta}^{1/2}\log(\bar{p})^{1/2}n^{-1/2}. Therefore, we have

(|1ni=1nT((q(𝓧i)yi)zk,l,j(i),τk)𝔼T((q(𝓧i)yi)zk,l,j(i),τk)|>C[Mx,2,δlog(p¯)n]12)Cexp(Clog(p¯))\begin{split}&\mathbb{P}\Bigg{(}\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right|\\ &~{}~{}~{}~{}~{}~{}>C\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{\frac{1}{2}}\Bigg{)}\leq C\exp(-C\log(\bar{p}))\end{split} (B.56)

and

(max1jpk1lrk|1ni=1nT((q(𝓧i)yi)zk,l,j(i),τk)𝔼[T((q(𝓧i)yi)zk,l,j(i),τk)]|>C[Mx,2,δlog(p¯)n]12)Cpkrkexp(Clog(p¯))Cexp(Clog(p¯)).\begin{split}\mathbb{P}\Bigg{(}&\max_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}\left[\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right]\right|\\ &~{}~{}~{}>C\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{\frac{1}{2}}\Bigg{)}\leq Cp_{k}r_{k}\exp(-C\log(\bar{p}))\leq C\exp(-C\log(\bar{p})).\end{split} (B.57)

Therefore, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]Fσ¯d/(d+1)pkrk[Mx,2,δlog(p¯)n]1/2.\begin{split}&\left\|\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{1/2}.\end{split} (B.58)

Step 3. (Bound Tk,3F\|T_{k,3}\|_{\text{F}})

By a Taylor expansion of ¯()\overline{\mathcal{L}}(\cdot),

𝔼[¯(𝓐;zi)(k)]𝐕k𝓢(k)𝔼[¯(𝓐;zi)(k)]𝐕k𝓢(k)F2𝐕k𝓢(k)2𝔼[¯(𝓐;zi)]𝔼[¯(𝓐;zi)]F2σ¯2d/(d+1)βx2𝓐𝓐F2.\begin{split}&\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}]\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}]\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}\|_{\text{F}}^{2}\\ &~{}\leq\|\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}\|^{2}\cdot\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})]\|_{\text{F}}^{2}\\ &~{}\lesssim\bar{\sigma}^{2d/(d+1)}\beta^{2}_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.59)

For 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k}, let the (l,m)(l,m)-th entry of ¯(𝓐;zi)(k)𝐕k𝓢(k)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} and ¯(𝓐;zi)(k)𝐕k𝓢(k)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} be xk,l,m(i)x_{k,l,m}^{(i)} and yk,l,m(i)y_{k,l,m}^{(i)}, respectively. Let 𝔼[|xk,l,m(i)yk,l,m(i)|2]=sk,l,m2\mathbb{E}[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|^{2}]=s_{k,l,m}^{2}, and then l=1pkm=1rksk,l,m2σ¯2d/(d+1)βx2𝓐𝓐F2\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\bar{\sigma}^{2d/(d+1)}\beta^{2}_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.

For any 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k},

|𝔼[xk,l,m(i)yk,l,m(i)]𝔼[T(xk,l,m(i),τ)T(yk,l,m(i),τ)]|𝔼[|xk,l,m(i)yk,l,m(i)|(1{|(qi(𝓐)yi)zk,l,j(i)|τk}+1{|(qi(𝓐)yi)zk,l,j(i)|τk})]𝔼[|xk,l,m(i)yk,l,m(i)|2]1/2[(|(qi(𝓐)yi)zk,l,j(i)|τk)1/2+(|(qi(𝓐)yi)zk,l,j(i)|τk)1/2]sk,l,m[𝔼|zk,l,j|2τk2]1/2sk,l,mlog(p¯)/n.\begin{split}&\left|\mathbb{E}[x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}]-\mathbb{E}[\text{T}(x_{k,l,m}^{(i)},\tau)-\text{T}(y_{k,l,m}^{(i)},\tau)]\right|\\ \lesssim&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|\cdot(1\{|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\}+1\{|(q_{i}(\mbox{\boldmath$\mathscr{A}$})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\})\right]\\ \leq&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|^{2}\right]^{1/2}\cdot\\ &\left[\mathbb{P}\left(|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\right)^{1/2}+\mathbb{P}\left(|(q_{i}(\mbox{\boldmath$\mathscr{A}$})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\right)^{1/2}\right]\\ \lesssim&~{}s_{k,l,m}\cdot\left[\frac{\mathbb{E}|z_{k,l,j}|^{2}}{\tau_{k}^{2}}\right]^{1/2}\asymp s_{k,l,m}\sqrt{\log(\bar{p})/n}.\end{split} (B.60)

Hence,

Tk,3F2=l=1pkm=1rk|𝔼[xk,l,m(i)yk,l,m(i)]𝔼[T(xk,l,m(i),τ)T(yk,l,m(i),τ)]|2log(p¯)nl=1pkm=1rksk,l,m2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2.\begin{split}\|T_{k,3}\|_{\text{F}}^{2}=&\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}\left|\mathbb{E}[x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}]-\mathbb{E}[\text{T}(x_{k,l,m}^{(i)},\tau)-\text{T}(y_{k,l,m}^{(i)},\tau)]\right|^{2}\\ \lesssim&~{}\frac{\log(\bar{p})}{n}\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.61)

Step 4. (Bound Tk,4F\|T_{k,4}\|_{\text{F}})

Let T(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)T(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)=𝐙k(i)={zk,j,l(i)}1jpk,1lrk\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)-\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)=\mathbf{Z}^{(i)}_{k}=\{z^{(i)}_{k,j,l}\}_{1\leq j\leq p_{k},1\leq l\leq r_{k}}. Then,

Tk,4F2=j=1pkl=1rk(1ni=1nzk,j,l(i)𝔼[zk,j,l(i)])2\|T_{k,4}\|_{\text{F}}^{2}=\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}\left(\frac{1}{n}\sum_{i=1}^{n}z^{(i)}_{k,j,l}-\mathbb{E}[z_{k,j,l}^{(i)}]\right)^{2} (B.62)

Note that var(zk,j,l(i))=sk,j,l2\text{var}(z_{k,j,l}^{(i)})=s_{k,j,l}^{2} and that j=1pkl=1rksk,j,l2Cσ¯2d/(d+1)βx2𝓐𝓐F2\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}s^{2}_{k,j,l}\leq C\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. Also, |zk,j,l(i)|2τ|z_{k,j,l}^{(i)}|\leq 2\tau. Similarly to the term Tk,2T_{k,2}, by Bernstein’s inequality, for any 0<t<τ1sk,j,l20<t<\tau^{-1}s_{k,j,l}^{2}, 1jpk1\leq j\leq p_{k} and 1lrk1\leq l\leq r_{k},

(|1ni=1nzk,j,l(i)𝔼[zk,j,l(i)]|t)2exp(nt22sk,j,l2).\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}z_{k,j,l}^{(i)}-\mathbb{E}[z_{k,j,l}^{(i)}]\right|\geq t\right)\leq 2\exp\left(-\frac{nt^{2}}{2s^{2}_{k,j,l}}\right). (B.63)

Letting t=Csk,j,llog(p¯)/nt=Cs_{k,j,l}\sqrt{\log(\bar{p})/n}, we have that

(1jpk1lrk|1ni=1nzk,j,l(i)𝔼[zk,j,l(i)]|Csk,j,llog(p¯)/n)Cexp(Clog(p¯)).\mathbb{P}\left(\cup_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left|\frac{1}{n}\sum_{i=1}^{n}z_{k,j,l}^{(i)}-\mathbb{E}[z_{k,j,l}^{(i)}]\right|\geq Cs_{k,j,l}\sqrt{\log(\bar{p})/n}\right)\leq C\exp\left(-C\log(\bar{p})\right). (B.64)

Therefore, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Tk,4F2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2.\|T_{k,4}\|_{\text{F}}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.65)

Combining these results, for any 1kd1\leq k\leq d, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝐆k𝔼[k]F2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2+σ¯2d/(d+1)pkrkMx,2,δlog(p¯)n.\begin{split}&\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\\ \lesssim&~{}\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{\bar{\sigma}^{2d/(d+1)}p_{k}r_{k}M_{x,2,\delta}\log(\bar{p})}{n}.\end{split} (B.66)

Step 5. (Extension to core tensor)

For the partial gradient with respect to the core tensor 𝓢\mathscr{S}, we have

¯(𝓐;zi)×j=1d𝐔j=(qi(𝓐)yi)𝓧i×j=1d𝐔j.\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}=(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})\cdot\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}. (B.67)

Let 𝐌0=j=1d𝐔j/j=1d𝐔j\mathbf{M}_{0}=\otimes_{j=1}^{d}\mathbf{U}_{j}/\|\otimes_{j=1}^{d}\mathbf{U}_{j}\| and 𝐌0𝐱i=(w0,1(i),,w0,r1r2rd(i))\mathbf{M}_{0}^{\top}\mathbf{x}_{i}=(w_{0,1}^{(i)},\dots,w_{0,r_{1}r_{2}\cdots r_{d}}^{(i)})^{\top}. By Assumption 4.8, 𝔼|w0,j|2Mx,2,δ\mathbb{E}|w_{0,j}|^{2}\leq M_{x,2,\delta} for all j=1,2,,r1r2rdj=1,2,\dots,r_{1}r_{2}\cdots r_{d}. In a similar fashion, we can show that with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

T0,1Fσ¯d/(d+1)r1r2rdMx,2,δlog(p¯)n,T0,2Fσ¯d/(d+1)r1r2rdMx,2,δlog(p¯)n,T0,3Fσ¯d/(d+1)log(p¯)nβx𝓐𝓐F,and T0,4Fσ¯d/(d+1)log(p¯)nβx𝓐𝓐F.\begin{split}\|T_{0,1}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\frac{r_{1}r_{2}\cdots r_{d}M_{x,2,\delta}\log(\bar{p})}{n}},\\ \|T_{0,2}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\frac{r_{1}r_{2}\cdots r_{d}M_{x,2,\delta}\log(\bar{p})}{n}},\\ \|T_{0,3}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\frac{\log(\bar{p})}{n}}\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}},\\ \text{and }\|T_{0,4}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\frac{\log(\bar{p})}{n}}\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}.\end{split} (B.68)

Hence, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝓖0𝔼[0]F2log(p¯)nσ¯2dd+1βx2𝓐𝓐F2+σ¯2dd+1(k=1drk)Mx,2,δlog(p¯)n.\|\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{\frac{2d}{d+1}}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{\frac{2d}{d+1}}\frac{(\prod_{k=1}^{d}r_{k})M_{x,2,\delta}\log(\bar{p})}{n}. (B.69)

Step 6. (Verify the conditions and conclude the proof)

Under Assumption 4.8, we can calculate the Hessian 2¯(𝓐)\nabla^{2}\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}) and easily check that the RCG condition holds.

Next, we can plug the results in the first five steps into Theorem 3.3. Using the same argument in the proof of Theorem 4.5, we have that with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Err(t)Err(0)+Cκ4αx2σ¯2d/(d+1)deffMx,2,δlog(p¯)n,\text{Err}^{(t)}\leq\text{Err}^{(0)}+C\kappa^{4}\alpha_{x}^{-2}\bar{\sigma}^{-2d/(d+1)}\frac{d_{\text{eff}}M_{x,2,\delta}\log(\bar{p})}{n}, (B.70)

and

𝓐(t)𝓐F2κ2(1Cαxβx1κ2)t𝓐(0)𝓐F2+κ4αx2(k=1dpkrk+k=1drk)[Mx,2,δlog(p¯)n],\begin{split}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}&\lesssim\kappa^{2}(1-C\alpha_{x}\beta_{x}^{-1}\kappa^{-2})^{t}\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}\\ &+\kappa^{4}\alpha_{x}^{-2}\left(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k}\right)\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right],\end{split} (B.71)

for all t=1,2,,Tt=1,2,\dots,T.

Finally, for all t=1,2,,Tt=1,2,\dots,T and k=1,2,,dk=1,2,\dots,d,

sinΘ(𝐔k(t),𝐔k)σ¯1d+1Err(t)σ¯1d+1Err(0)+Cκ2αx1σ¯1deff1/2[Meff,δlog(p¯)n]1/2δ.\|\sin\Theta(\mathbf{U}_{k}^{(t)},\mathbf{U}_{k}^{*})\|\leq\bar{\sigma}^{\frac{-1}{d+1}}\sqrt{\text{Err}^{(t)}}\leq\bar{\sigma}^{\frac{-1}{d+1}}\sqrt{\text{Err}^{(0)}}+C\kappa^{2}\alpha_{x}^{-1}\bar{\sigma}^{-1}d_{\text{eff}}^{1/2}\left[\frac{M_{\text{eff},\delta}\log(\bar{p})}{n}\right]^{1/2}\leq\delta. (B.72)

B.3 Proof of Theorem 4.13

The proof consists of six steps. In the first five steps, we prove the stability of the robust gradient estimators for the general 1tT1\leq t\leq T and, hence, we omit the notation (t)(t) for simplicity. Specifically, in the first four steps, we give the upper bounds for Tk,1F,,Tk,4F\|T_{k,1}\|_{\text{F}},\dots,\|T_{k,4}\|_{\text{F}}, respectively, for 1kd1\leq k\leq d. In the fifth step, we extend the proof to the terms for the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions. Throughout the first five steps, we assume that for each 1kd1\leq k\leq d, 𝐔kσ¯1/(d+1)\|\mathbf{U}_{k}\|\asymp\bar{\sigma}^{1/(d+1)} and sinΘ(𝐔k,𝐔k)δ\|\sin\Theta(\mathbf{U}_{k},\mathbf{U}_{k}^{*})\|\leq\delta and will verify them in the last step.

Step 1. (Bound Tk,1F\|T_{k,1}\|_{\text{F}})

Let 𝐌k,1=(j=1,jkd𝐔j)/j=1,jkd𝐔j\mathbf{M}_{k,1}=(\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j})/\|\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j}\| and its columns as 𝐌k,1=[𝐦k,1,𝐦k,2,,𝐦k,rk]\mathbf{M}_{k,1}=[\mathbf{m}_{k,1},\mathbf{m}_{k,2},\dots,\mathbf{m}_{k,r_{k}^{\prime}}]. Let zk,j,l=𝐜j𝓔(k)𝐦k,lz_{k,j,l}=-\mathbf{c}_{j}^{\top}\mbox{\boldmath$\mathscr{E}$}_{(k)}\mathbf{m}_{k,l}, and by Assumption 4.8, 𝔼[|zk,j,l|1+ϵ]Me,1+ϵ,δ\mathbb{E}[|z_{k,j,l}|^{1+\epsilon}]\leq M_{e,1+\epsilon,\delta}. Let 𝐌k,2=𝓢(k)/𝓢(k)\mathbf{M}_{k,2}=\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}/\|\mbox{\boldmath$\mathscr{S}$}_{(k)}\| and, hence, we have 𝐜j𝓔(k)𝐌k,1𝐌k,2𝐜m=wk,j,m-\mathbf{c}_{j}^{\top}\mbox{\boldmath$\mathscr{E}$}_{(k)}\mathbf{M}_{k,1}\mathbf{M}_{k,2}\mathbf{c}_{m}=w_{k,j,m}, for 1jpk1\leq j\leq p_{k} and 1mrk1\leq m\leq r_{k}, satisfying 𝔼[|wk,j,m|1+ϵ]Me,1+ϵ,δ\mathbb{E}[|w_{k,j,m}|^{1+\epsilon}]\lesssim M_{e,1+\epsilon,\delta}.

We first bound the bias, for any 1jpk1\leq j\leq p_{k} and 1mrk1\leq m\leq r_{k},

|𝔼[wk,j,m]𝔼[T(wk,j,m,τk)]|𝔼[|wk,j,m|1{|wk,j,m|τk}]𝔼[|wk,j,m|1+ϵ]1/(1+ϵ)(|wk,j,m|τk)ϵ/(1+ϵ)𝔼[|wk,j,m|1+ϵ]1/(1+ϵ)(𝔼[|wk,j,m|1+ϵ]τk1+ϵ)ϵ/(1+ϵ)Me,1+ϵ,δτkϵMe,1+ϵ,δ1/(1+ϵ)κ2\begin{split}&|\mathbb{E}[w_{k,j,m}]-\mathbb{E}[\text{T}(w_{k,j,m},\tau_{k})]|\leq\mathbb{E}[|w_{k,j,m}|\cdot 1\{|w_{k,j,m}|\geq\tau_{k}\}]\\ \leq&\mathbb{E}\left[|w_{k,j,m}|^{1+\epsilon}\right]^{1/(1+\epsilon)}\cdot\mathbb{P}(|w_{k,j,m}|\geq\tau_{k})^{\epsilon/(1+\epsilon)}\\ \leq&\mathbb{E}\left[|w_{k,j,m}|^{1+\epsilon}\right]^{1/(1+\epsilon)}\cdot\left(\frac{\mathbb{E}[|w_{k,j,m}|^{1+\epsilon}]}{\tau_{k}^{1+\epsilon}}\right)^{\epsilon/(1+\epsilon)}\\ \lesssim&M_{e,1+\epsilon,\delta}\cdot\tau_{k}^{-\epsilon}\asymp M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\cdot\kappa^{-2}\end{split} (B.73)

with the truncation parameter τkMe,1+ϵ,δ1/(1+ϵ)κ2/ϵ\tau_{k}\asymp M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\cdot\kappa^{2/\epsilon}. Hence,

Tk,1Fσ¯d/(d+1)pkrkMe,1+ϵ,δ1/(1+ϵ)κ2.\|T_{k,1}\|_{\text{F}}\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2}. (B.74)
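Note that the stated order of \tau_{k} is exactly the one that makes the truncation bias in (B.73) match the target level: solving M_{e,1+\epsilon,\delta}\tau_{k}^{-\epsilon}\asymp M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2} for \tau_{k} gives

\tau_{k}\asymp\left(M_{e,1+\epsilon,\delta}^{\epsilon/(1+\epsilon)}\kappa^{2}\right)^{1/\epsilon}=M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{2/\epsilon}.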

Step 2. (Bound Tk,2F\|T_{k,2}\|_{\text{F}})

Note that

𝔼[T(wk,j,m,τk)2]τk1ϵ𝔼[|wk,j,m|1+ϵ]τk1ϵMe,1+ϵ,δ.\mathbb{E}\left[\text{T}(w_{k,j,m},\tau_{k})^{2}\right]\leq\tau_{k}^{1-\epsilon}\cdot\mathbb{E}[|w_{k,j,m}|^{1+\epsilon}]\asymp\tau_{k}^{1-\epsilon}\cdot M_{e,1+\epsilon,\delta}. (B.75)

Thus, we have var(T(wk,j,m,τk))𝔼[T(wk,j,m,τk)2]τk1ϵMe,1+ϵ,δ\text{var}(\text{T}(w_{k,j,m},\tau_{k}))\leq\mathbb{E}[\text{T}(w_{k,j,m},\tau_{k})^{2}]\leq\tau_{k}^{1-\epsilon}M_{e,1+\epsilon,\delta}. Also, for any s=3,4,s=3,4,\dots, the higher-order moments satisfy that

𝔼[(T(wk,j,m,τk)𝔼[T(wk,j,m,τk)])s](2τk)s2𝔼[(T(wk,j,m,τk)𝔼[T(wk,j,m,τk)])2].\mathbb{E}\left[(\text{T}(w_{k,j,m},\tau_{k})-\mathbb{E}[\text{T}(w_{k,j,m},\tau_{k})])^{s}\right]\leq(2\tau_{k})^{s-2}\mathbb{E}\left[(\text{T}(w_{k,j,m},\tau_{k})-\mathbb{E}[\text{T}(w_{k,j,m},\tau_{k})])^{2}\right]. (B.76)

By Bernstein’s inequality, for any 0<tτkϵMe,1+ϵ,δ0<t\leq\tau_{k}^{-\epsilon}M_{e,1+\epsilon,\delta},

(|T(wk,j,m,τk)𝔼[T(wk,j,m,τk)]|t)2exp(t24τk1ϵMe,1+ϵ,δ).\mathbb{P}(|T(w_{k,j,m},\tau_{k})-\mathbb{E}[T(w_{k,j,m},\tau_{k})]|\geq t)\leq 2\exp\left(-\frac{t^{2}}{4\tau_{k}^{1-\epsilon}M_{e,1+\epsilon,\delta}}\right). (B.77)

Letting t=CMe,1+ϵ,δ1/(1+ϵ)κ2t=CM_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2}, since σ¯/Me,1+ϵ,δ1/(1+ϵ)p¯\underline{\sigma}/M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\gtrsim\sqrt{\bar{p}} we have

(|T(wk,j,m,τk)𝔼[T(wk,j,m,τk)]|CMe,1+ϵ,δ1/(1+ϵ)κ2)Cexp(Clog(p¯))\mathbb{P}(|T(w_{k,j,m},\tau_{k})-\mathbb{E}[T(w_{k,j,m},\tau_{k})]|\geq CM_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2})\leq C\exp\left(-C\log(\bar{p})\right) (B.78)

and

(max1jpk1mrk|T(wk,j,m,τk)𝔼[T(wk,j,m,τk)]|CMe,1+ϵ,δ1/(1+ϵ)κ2)Cexp(Clog(p¯)).\mathbb{P}\left(\max_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq m\leq r_{k}\end{subarray}}|T(w_{k,j,m},\tau_{k})-\mathbb{E}[T(w_{k,j,m},\tau_{k})]|\geq CM_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2}\right)\leq C\exp\left(-C\log(\bar{p})\right). (B.79)

Hence, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Tk,2Fσ¯d/(d+1)pkrkMe,1+ϵ,δ1/(1+ϵ)κ2.\|T_{k,2}\|_{\text{F}}\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2}. (B.80)

Step 3. (Bound Tk,3F\|T_{k,3}\|_{\text{F}})

Clearly, we have

¯(𝓐;𝓨)(k)𝐕k𝓢(k)¯(𝓐;𝓨)(k)𝐕k𝓢(k)F2σ¯2d/(d+1)𝓐𝓐F2.\|\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};\mbox{\boldmath$\mathscr{Y}$})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}-\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};\mbox{\boldmath$\mathscr{Y}$})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}\|_{\text{F}}^{2}\lesssim\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.

For 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k}, let the (l,m)(l,m)-th entry of ¯(𝓐;𝓨)(k)𝐕k𝓢(k)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};\mbox{\boldmath$\mathscr{Y}$})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} and ¯(𝓐;𝓨)(k)𝐕k𝓢(k)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};\mbox{\boldmath$\mathscr{Y}$})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} be vk,l,mv_{k,l,m} and uk,l,mu_{k,l,m}, respectively. Let (𝔼[vk,l,m(i)uk,l,m(i)])2=sk,l,m2(\mathbb{E}[v_{k,l,m}^{(i)}-u_{k,l,m}^{(i)}])^{2}=s_{k,l,m}^{2}, and then l=1pkm=1rksk,l,m2σ¯2d/(d+1)𝓐𝓐F2\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.

For any 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k},

|𝔼[vk,l,muk,l,m]𝔼[T(vk,l,m,τ)T(uk,l,m,τ)]|𝔼[|vk,l,muk,l,m|1{|wk,l,m|τk}]𝔼[|vk,l,muk,l,m(i)|2]1/2(|wk,l,m|τk)1/2sk,l,m[𝔼|wk,l,m|1+ϵτk1+ϵ]1/2sk,l,mκ2.\begin{split}&\left|\mathbb{E}[v_{k,l,m}-u_{k,l,m}]-\mathbb{E}[\text{T}(v_{k,l,m},\tau)-\text{T}(u_{k,l,m},\tau)]\right|\\ \lesssim&~{}\mathbb{E}\left[|v_{k,l,m}-u_{k,l,m}|\cdot 1\{|w_{k,l,m}|\geq\tau_{k}\}\right]\\ \leq&~{}\mathbb{E}\left[|v_{k,l,m}-u_{k,l,m}^{(i)}|^{2}\right]^{1/2}\cdot\mathbb{P}\left(|w_{k,l,m}|\geq\tau_{k}\right)^{1/2}\\ \leq&~{}s_{k,l,m}\cdot\left[\frac{\mathbb{E}|w_{k,l,m}|^{1+\epsilon}}{\tau_{k}^{1+\epsilon}}\right]^{1/2}\asymp s_{k,l,m}\cdot\kappa^{-2}.\end{split} (B.81)

Hence,

Tk,3F2=l=1pkm=1rk|𝔼[vk,l,muk,l,m]𝔼[T(vk,l,m,τ)T(uk,l,m,τ)]|2κ4l=1pkm=1rksk,l,m2κ4σ¯2d/(d+1)𝓐𝓐F2.\begin{split}\|T_{k,3}\|_{\text{F}}^{2}=&\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}\left|\mathbb{E}[v_{k,l,m}-u_{k,l,m}]-\mathbb{E}[\text{T}(v_{k,l,m},\tau)-\text{T}(u_{k,l,m},\tau)]\right|^{2}\\ \lesssim&~{}\kappa^{-4}\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.82)

Step 4. (Bound $\|T_{k,4}\|_{\text{F}}$)

Let $\mathbf{Z}_{k}=\{z_{k,j,l}\}_{1\leq j\leq p_{k},1\leq l\leq r_{k}}=\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau_{k})-\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau_{k})$. Then,

\|T_{k,4}\|_{\text{F}}^{2}=\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}\left(z_{k,j,l}-\mathbb{E}[z_{k,j,l}]\right)^{2}. (B.83)

Note that $\text{var}(z_{k,j,l})\leq s_{k,j,l}^{2}$ and that $\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}s^{2}_{k,j,l}\leq C\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}$. Moreover, $|z_{k,j,l}|\leq 2\tau_{k}$. Similarly to the term $T_{k,2}$, by Bernstein's inequality, for any $0<t<\tau_{k}^{-2}s_{k,j,l}^{2}$, $1\leq j\leq p_{k}$ and $1\leq l\leq r_{k}$,

\mathbb{P}\left(\left|z_{k,j,l}-\mathbb{E}[z_{k,j,l}]\right|\geq t\right)\leq 2\exp\left(-\frac{t^{2}}{2s^{2}_{k,j,l}}\right). (B.84)

Letting $t=C\kappa^{-2}s_{k,j,l}$ and applying a union bound over all $(j,l)$, we have

\mathbb{P}\left(\bigcup_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left\{\left|z_{k,j,l}-\mathbb{E}[z_{k,j,l}]\right|\geq C\kappa^{-2}s_{k,j,l}\right\}\right)\leq C\exp\left(-C\log(\bar{p})\right). (B.85)
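To spell out (B.85): with $t=C\kappa^{-2}s_{k,j,l}$, each entry obeys (B.84), and a union bound over the $p_{k}r_{k}$ entries gives the stated probability provided $\kappa^{-4}\gtrsim\log(\bar{p})$, which we take as implied by the choice of $\kappa$ earlier in the proof, so that the factor $p_{k}r_{k}$ is absorbed into the exponent:

\mathbb{P}\left(\bigcup_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left\{\left|z_{k,j,l}-\mathbb{E}[z_{k,j,l}]\right|\geq C\kappa^{-2}s_{k,j,l}\right\}\right)\leq 2p_{k}r_{k}\exp\left(-C\kappa^{-4}\right)\leq C\exp\left(-C\log(\bar{p})\right).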

Therefore, with probability at least $1-C\exp(-C\log(\bar{p}))$,

\|T_{k,4}\|_{\text{F}}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.86)
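For completeness, (B.86) follows on this event by summing the squared entrywise deviations in (B.83) and invoking the bound on $\sum_{j,l}s_{k,j,l}^{2}$ noted above:

\|T_{k,4}\|_{\text{F}}^{2}\leq C\kappa^{-4}\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}s_{k,j,l}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.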

Combining these results, for any $1\leq k\leq d$, with probability at least $1-C\exp(-C\log(\bar{p}))$,

\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}p_{k}r_{k}M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}+\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.87)
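Here (B.87) combines the bounds on the four terms via the triangle inequality, assuming (as set up earlier in the proof) the decomposition $\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]=T_{k,1}+T_{k,2}+T_{k,3}+T_{k,4}$; squaring and using $(a+b)^{2}\leq 2a^{2}+2b^{2}$ gives

\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\|T_{k,1}\|_{\text{F}}^{2}+\|T_{k,2}\|_{\text{F}}^{2}+\|T_{k,3}\|_{\text{F}}^{2}+\|T_{k,4}\|_{\text{F}}^{2},

and the stated rate then follows from the bounds obtained in the preceding steps.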

Step 5. (Extension to the core tensor)

In a similar fashion, we can show that, with probability at least $1-C\exp(-C\log(\bar{p}))$,

\begin{split}\|T_{0,1}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\prod_{k=1}^{d}r_{k}}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2},\\ \|T_{0,2}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\prod_{k=1}^{d}r_{k}}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2},\\ \|T_{0,3}\|_{\text{F}}^{2}&\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2},\\ \text{and }\|T_{0,4}\|_{\text{F}}^{2}&\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.88)

Hence, with high probability,

\|\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\left(\prod_{k=1}^{d}r_{k}\right)M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}+\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.89)
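As in the mode-$k$ case, (B.89) is obtained from (B.88) through the triangle inequality, assuming the analogous decomposition $\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]=T_{0,1}+T_{0,2}+T_{0,3}+T_{0,4}$:

\|\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\|T_{0,1}\|_{\text{F}}^{2}+\|T_{0,2}\|_{\text{F}}^{2}+\|T_{0,3}\|_{\text{F}}^{2}+\|T_{0,4}\|_{\text{F}}^{2}.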

Step 6. (Verify the conditions and conclude the proof)

By definition, the RCG condition is satisfied automatically. Plugging the results of the first five steps into Theorem 3.3 and using the same argument as in the proof of Theorem 4.5, we have that, with probability at least $1-C\exp(-C\log(\bar{p}))$,

\text{Err}^{(t)}\leq\text{Err}^{(0)}+C\bar{\sigma}^{-2d/(d+1)}\left(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k}\right)M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}, (B.90)

and

\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}\lesssim(1-C\kappa^{-2})^{t}\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\left(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k}\right)M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}, (B.91)

for all $t=1,2,\dots,T$.
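As a remark (not needed for the proof), the two terms in (B.91) admit the usual optimization-versus-statistical-error reading: since $(1-C\kappa^{-2})^{t}\leq\exp(-Ct\kappa^{-2})$, the geometrically decaying first term is dominated by the second whenever

t\gtrsim\kappa^{2}\log\left(\frac{\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}}{(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k})M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}}\right).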

Finally, for all $t=1,2,\dots,T$ and $k=1,2,\dots,d$,

\|\sin\Theta(\mathbf{U}_{k}^{(t)},\mathbf{U}_{k}^{*})\|\leq\bar{\sigma}^{-\frac{1}{d+1}}\sqrt{\text{Err}^{(t)}}\leq\bar{\sigma}^{-\frac{1}{d+1}}\sqrt{\text{Err}^{(0)}}+C\bar{\sigma}^{-1}d_{\text{eff}}^{1/2}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\leq\delta. (B.92)