
ROBUST AND OPTIMAL TENSOR ESTIMATION VIA ROBUST GRADIENT DESCENT

Xiaoyu Zhang, Di Wang, Guodong Li, and Defeng Sun
School of Mathematical Sciences, Tongji University, [email protected]
School of Mathematical Sciences, Shanghai Jiao Tong University, [email protected]
Department of Statistics and Actuarial Science, University of Hong Kong, [email protected]
Department of Applied Mathematics, Hong Kong Polytechnic University, [email protected]
Abstract

Low-rank tensor models are widely used in statistics and machine learning. However, most existing methods rely heavily on the assumption that data follows a sub-Gaussian distribution. To address the challenges associated with heavy-tailed distributions encountered in real-world applications, we propose a novel robust estimation procedure based on truncated gradient descent for general low-rank tensor models. We establish the computational convergence of the proposed method and derive optimal statistical rates under heavy-tailed distributional settings of both covariates and noise for various low-rank models. Notably, the statistical error rates are governed by a local moment condition, which captures the distributional properties of tensor variables projected onto certain low-dimensional local regions. Furthermore, we present numerical results to demonstrate the effectiveness of our method.

MSC subject classifications: 62F35, 62H12, 62J12, 62H25.
Keywords: gradient descent, heavy-tailed distribution, nonconvex optimization, robustness, tensor decomposition.

1 Introduction

1.1 Low-rank tensor modeling

Tensor models in statistics and machine learning have gained significant attention in recent years for analyzing complex multidimensional data. Applications of tensors can be found in various fields, including biomedical imaging analysis (Zhou, Li and Zhu, 2013; Li et al., 2018; Wu et al., 2022), recommender systems (Bi, Qu and Shen, 2018; Tarzanagh and Michailidis, 2022), and time series analysis (Chen, Yang and Zhang, 2022; Wang et al., 2022), where the data are naturally represented as third or higher-order tensors. Tensor decompositions and low-rank structures are prevalent in these models, as they facilitate dimension reduction, provide interpretable representations, and enhance computational efficiency (Kolda and Bader, 2009; Bi et al., 2021).

In this paper, we consider a general framework of tensor learning, where the loss function is denoted by ¯(𝓐;z)\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z), with ¯\overline{\mathcal{L}} being a differentiable loss function, 𝓐\mathscr{A} a dd-th order parameter tensor, and zz a random sample drawn from the population. This framework encompasses a wide range of models in statistics and machine learning, including tensor linear regression, tensor generalized linear regression, and tensor PCA, among others. For dimension reduction, the parameter tensor 𝓐\mathscr{A} is assumed to have a Tucker decomposition (Tucker, 1966)

𝓐=𝓢×1𝐔1×2𝐔2×d𝐔d=𝓢×j=1d𝐔j,\mbox{\boldmath$\mathscr{A}$}=\mbox{\boldmath$\mathscr{S}$}\times_{1}\mathbf{U}_{1}\times_{2}\mathbf{U}_{2}\cdots\times_{d}\mathbf{U}_{d}=\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}, (1.1)

which is one of the most commonly used low-rank structures in statistical tensor learning (see related definitions and notations in Section 2).

Substantial progress has been made recently in developing estimation methods and efficient algorithms for various low-rank tensor models. However, low-rank tensor problems are often highly nonconvex, making them challenging to solve directly. A common strategy to address this issue is convex relaxation by replacing the low-rank constraint with a tensor nuclear norm and solving the problem in the full tensor space, i.e. the space of 𝓐\mathscr{A} (Tomioka and Suzuki, 2013; Yuan and Zhang, 2016; Raskutti, Yuan and Chen, 2019). While these convex methods come with provable statistical guarantees, they suffer from heavy computational burdens and prohibitive storage requirements. As a result, they are often infeasible for high-dimensional tensor estimation in many practical applications.

Another estimation approach is nonconvex optimization, which operates directly on the parsimonious decomposition form in (1.1). This approach leads to scalable algorithms with more affordable computational complexity. Among the various developments in nonconvex optimization techniques, gradient descent and its variants have been extensively studied and can be applied to a wide range of low-rank tensor models. Recent advances have provided both computational and statistical guarantees for gradient descent algorithms, as well as corresponding initialization methods (Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022; Tong et al., 2022; Dong et al., 2023). In terms of statistical performance, minimax optimal error rates have been established for commonly used low-rank tensor models under Gaussian or sub-Gaussian distributional assumptions, which are crucial technical conditions for managing high-dimensional data (Raskutti, Yuan and Chen, 2019; Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022).

However, in many real applications, tensor data are often contaminated by outliers and heavy-tailed noise, violating the stringent Gaussian or sub-Gaussian assumptions. Empirical studies have shown that biomedical imaging data often exhibit non-Gaussian and heavy-tailed distributions (Beggs and Plenz, 2003; Friedman et al., 2012; Roberts, Boonstra and Breakspear, 2015). For example, in Section 6, we analyze chest computed tomography (CT) images of patients with and without COVID-19, and plot the kurtosis of image pixels in Figure 1. Among 22,500 pixels analyzed, 1,169 pixels in the COVID-19 samples and 351 in the non-COVID-19 samples exhibit heavy-tailed distributions, with sample kurtosis greater than eight. These findings highlight the prevalence of heavy-tailed behavior in real-world biomedical datasets. As recent studies (Wang, Zhang and Mai, 2023; Wei et al., 2023) have emphasized, conventional methods fail to produce reliable estimates when the data follow such heavy-tailed distributions. Therefore, it is crucial to develop estimation methods that are not only computationally efficient and statistically optimal but also robust to the heavy-tailed distributions encountered in real-world tensor data.

Figure 1: Histograms of kurtosis for COVID and non-COVID CT image pixels

The growing interest in robust estimation methods for high-dimensional low-rank matrix and tensor models underscores the pressing need for solutions that can handle heavy-tailed data. In terms of methodology to achieve robustness, the existing works can be broadly classified into two approaches: loss robustification and data robustification. The celebrated Huber regression method (Huber, 1964; Fan, Li and Wang, 2017; Sun, Zhou and Fan, 2020) exemplifies the first approach, where the standard least squares loss is replaced with a more robust variant. For instance, Tan, Sun and Witten (2023) applied adaptive Huber regression with regularization to sparse reduced-rank regression in the presence of heavy-tailed noise, and developed an ADMM algorithm for convex optimization. Shen et al. (2023) employed the least absolute deviation (LAD) and Huber loss functions for low-rank matrix and tensor trace regression, and proposed a Riemannian subgradient algorithm in the nonconvex optimization framework. While these loss-robustification methods provide robust control over residuals, they focus solely on the residuals’ deviations and do not address the heavy-tailedness of the covariates. Moreover, robust loss functions like LAD and Huber loss cannot be easily generalized to more complex tensor models beyond linear trace regression.

Alternatively, Fan, Wang and Zhu (2021) proposed a robust low-rank matrix estimation procedure via data robustification. This method applies appropriate shrinkage to the data, constructs robust moment estimators from the shrunk data, and ultimately derives a robust estimate for the low-rank parameter matrix. Building on this idea, subsequent works by Wang and Tsay (2023) and Lu et al. (2024) extended the data robustification framework to time series models and spatio-temporal models, respectively. The primary objective of data robustification is to mitigate the influence of samples with large deviations, thereby producing a robust estimate. However, when applied to low-rank matrix and tensor models, the data robustification procedure has limitations. Specifically, it overlooks the inherent structure of the model and fails to exploit the low-rank decomposition. As shown in Section 4, not all information in the data contributes effectively to estimating the tensor decomposition. Consequently, the data robustification approach may be suboptimal for low-rank tensor estimation.

For general low-rank tensor models, we introduce a new robust estimation procedure via gradient robustification. Specifically, we replace the partial gradients in the gradient descent algorithm with their robust alternatives by truncating gradients that exhibit large deviations. This modification improves the accuracy of gradient estimation by mitigating the influence of outliers and noise. By robustifying the partial gradients with respect to the components in the tensor decomposition, this method effectively leverages the low-rank structure, making it applicable to a variety of low-rank models, similar to other gradient-based methods.

For various low-rank tensor models, we demonstrate that instead of the original data, low-dimensional multilinear transformations of the data are employed in the partial gradients. This allows us to work with more compact and informative representations of the data, leading to enhanced computational and statistical properties. As a result, the statistical error rates of this gradient-based method are shown to be linked to a new concept called local moment, which characterizes the distributional properties of the tensor data projected onto certain low-dimensional subspaces along each tensor direction. This leads to potentially much sharper statistical rates compared to traditional methods. Furthermore, since covariates and noise contribute to the partial gradients, the robustification of gradients is particularly effective in handling the heavy-tailed distributions of both covariates and noise.

However, it is important to note that the robust gradient alternatives may not correspond to the partial gradients of any standard robust loss function, which poses challenges in analyzing their convergence. To address this, we develop an algorithm-based convergence analysis for the proposed robust gradient descent method. Additionally, we establish minimax-optimal statistical error rates under local moment conditions for several commonly used low-rank tensor models.

The main contributions of this paper are three-fold:

  • (1)

    We introduce a novel and general estimation procedure based on robust gradient descent. This method is computationally scalable and applicable to a wide range of tensor problems, including both supervised tasks (e.g., tensor regression and classification) and unsupervised tasks (e.g., tensor PCA).

  • (2)

    The robust methodology is shown to effectively handle the heavy-tailed distributions and is proven to achieve optimal statistical error rates under relaxed distributional assumptions on both covariates and noise. Specifically, for 0<ϵ10<\epsilon\leq 1, we only require finite second moments for the covariates and (1+ϵ)(1+\epsilon)-th moments for noise in tensor linear regression, finite second moments for the covariates in tensor logistic regression, and finite (1+ϵ)(1+\epsilon)-th moments for noise in tensor PCA.

  • (3)

    For heavy-tailed low-rank tensor models, we introduce the concept of local moment, a technical innovation that enables a more precise characterization of the effects of heavy-tailed distributions. This results in sharper statistical error rates compared to those derived from global moment conditions.

1.2 Related literature

This paper is related to a large body of literature on nonconvex methods for low-rank matrix and tensor estimation. The gradient descent algorithm and its variants have been extensively studied for low-rank matrix models (Netrapalli et al., 2014; Chen and Wainwright, 2015; Tu et al., 2016; Wang, Zhang and Gu, 2017; Ma et al., 2018) and low-rank tensor models (Xu, Zhang and Gu, 2017; Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022; Tong, Ma and Chi, 2022; Tong et al., 2022). For simplicity, we focus on the robust alternatives to the standard gradient descent, although the proposed technique can be extended to other gradient-based methods. Robust gradient methods have also been explored for low-dimensional statistical models in convex optimization (Prasad et al., 2020). For low-rank matrix recovery, the median truncation gradient descent has been proposed to handle arbitrary outliers in the data (Li et al., 2020), which focuses on the linear compressed sensing problem without random noise. Our paper differs from the existing work as we consider the general low-rank tensor estimation framework under the heavy-tailed distribution setting.

Robust estimation against heavy-tailed distributions is another emerging area of research in high-dimensional statistics. Various robust MM-estimators have been proposed for mean estimation (Catoni, 2012; Bubeck, Cesa-Bianchi and Lugosi, 2013; Devroye et al., 2016) and high-dimensional linear regression (Fan, Li and Wang, 2017; Loh, 2017; Sun, Zhou and Fan, 2020; Wang et al., 2020). More recently, robust methods for low-rank matrix and tensor estimation have been developed in Fan, Wang and Zhu (2021), Tan, Sun and Witten (2023), Wang and Tsay (2023) and Shen et al. (2023). Compared to these existing methods, our proposed approach can achieve the same or even better convergence rates under more relaxed local distribution assumptions on both covariates and noise.

1.3 Organization of the paper

The remainder of this paper is organized as follows. In Section 2, we provide a review of relevant definitions and notations for low-rank tensor estimation. In Section 3, we introduce the robust gradient descent method and discuss its convergence properties. In Section 4, we apply the proposed methodology to three popular tensor models in heavy-tailed distribution settings. Simulation experiments and real data examples are presented in Sections 5 and 6 to validate the performance of the proposed methods. Finally, Section 7 concludes with a discussion of the findings and future directions. The proofs of the main results are relegated to Appendices A and B.

2 Tensor Algebra and Notation

Tensors, or multi-dimensional arrays, are higher-order extensions of matrices. A multi-dimensional array 𝓧p1××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}} is called a dd-th order tensor. Throughout this paper, we denote vectors by boldface small letters (e.g. 𝐱\mathbf{x}, 𝐲\mathbf{y}), matrices by boldface capital letters (e.g. 𝐗\mathbf{X}, 𝐘\mathbf{Y}), and tensors by boldface Euler capital letters (e.g. 𝓧\mathscr{X}, 𝓨\mathscr{Y}), respectively.

Tensor matricization is the process of reordering the elements of a tensor into a matrix. For any dd-th-order tensor 𝓧p1××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}}, its mode-kk matricization is denoted as 𝓧(k)pk×pk\mbox{\boldmath$\mathscr{X}$}_{(k)}\in\mathbb{R}^{p_{k}\times p_{-k}}, with pk==1,kdpp_{-k}=\prod_{\ell=1,\ell\neq k}^{d}p_{\ell}, whose (ik,j)(i_{k},j)-th element is mapped to the (i1,,id)(i_{1},\dots,i_{d})-th element of 𝓧\mathscr{X}, where j=1+s=1,skd(is1)Js(k)j=1+\sum_{s=1,s\neq k}^{d}(i_{s}-1)J_{s}^{(k)} with Js(k)==1,ks1pJ_{s}^{(k)}=\prod_{\ell=1,\ell\neq k}^{s-1}p_{\ell} and p0=1p_{0}=1.
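To make the index mapping concrete, below is a minimal Python/NumPy sketch (ours, not from the paper) of the mode-k matricization; the function name unfold and the zero-based mode index are our own choices. Since the column index above lets earlier modes vary fastest, the operation corresponds to a Fortran-order reshape after moving mode k to the front.

import numpy as np

def unfold(X, k):
    # Mode-k matricization X_(k): move mode k to the front, then reshape in
    # Fortran order so that earlier remaining modes vary fastest along the columns.
    return np.reshape(np.moveaxis(X, k, 0), (X.shape[k], -1), order="F")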

Next, we review three types of multiplications for tensors. For any tensor 𝓧p1××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}} and matrix 𝐘qk×pk\mathbf{Y}\in\mathbb{R}^{q_{k}\times p_{k}} with 1kd1\leq k\leq d, the mode-kk multiplication 𝓧×k𝐘\mbox{\boldmath$\mathscr{X}$}\times_{k}\mathbf{Y} produces a tensor in p1××pk1×qk×pk+1××pd\mathbb{R}^{p_{1}\times\cdots\times p_{k-1}\times q_{k}\times p_{k+1}\times\cdots\times p_{d}} defined by

(𝓧×k𝐘)i1ik1jik+1id=ik=1pk𝓧i1id𝐘jik.\left(\mbox{\boldmath$\mathscr{X}$}\times_{k}\mathbf{Y}\right)_{i_{1}\dots i_{k-1}ji_{k+1}\dots i_{d}}=\sum_{i_{k}=1}^{p_{k}}\mbox{\boldmath$\mathscr{X}$}_{i_{1}\dots i_{d}}\mathbf{Y}_{ji_{k}}. (2.1)

For any two tensors 𝓧p1×p2××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times p_{2}\times\cdots\times p_{d}} and 𝓨p1×p2××pd0\mbox{\boldmath$\mathscr{Y}$}\in\mathbb{R}^{p_{1}\times p_{2}\times\cdots\times p_{d_{0}}} with dd0d\geq d_{0}, their generalized inner product 𝓧,𝓨\langle\mbox{\boldmath$\mathscr{X}$},\mbox{\boldmath$\mathscr{Y}$}\rangle is the (dd0)(d-d_{0})-th-order tensor in pd0+1××pd\mathbb{R}^{p_{d_{0}+1}\times\dots\times p_{d}} defined by

𝓧,𝓨id0+1id=i1=1p1i2=1p2id0=1pd0𝓧i1i2id0id0+1id𝓨i1i2id0,\langle\mbox{\boldmath$\mathscr{X}$},\mbox{\boldmath$\mathscr{Y}$}\rangle_{i_{d_{0}+1}\dots i_{d}}=\sum_{i_{1}=1}^{p_{1}}\sum_{i_{2}=1}^{p_{2}}\dots\sum_{i_{d_{0}}=1}^{p_{d_{0}}}\mbox{\boldmath$\mathscr{X}$}_{i_{1}i_{2}\dots i_{d_{0}}i_{d_{0}+1}\dots i_{d}}\mbox{\boldmath$\mathscr{Y}$}_{i_{1}i_{2}\dots i_{d_{0}}}, (2.2)

for 1id0+1pd0+1,,1idpd1\leq i_{d_{0}+1}\leq p_{d_{0}+1},\dots,1\leq i_{d}\leq p_{d}. In particular, when d=d0d=d_{0}, it reduces to the conventional real-valued inner product. Additionally, the Frobenius norm of any tensor 𝓧\mathscr{X} is defined as 𝓧F=𝓧,𝓧\|\mbox{\boldmath$\mathscr{X}$}\|_{\text{F}}=\sqrt{\langle\mbox{\boldmath$\mathscr{X}$},\mbox{\boldmath$\mathscr{X}$}\rangle}. For any two tensors 𝓧p1××pd1\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d_{1}}} and 𝓨q1××qd2\mbox{\boldmath$\mathscr{Y}$}\in\mathbb{R}^{q_{1}\times\cdots\times q_{d_{2}}}, their outer product 𝓧𝓨\mbox{\boldmath$\mathscr{X}$}\circ\mbox{\boldmath$\mathscr{Y}$} is the p1××pd1×q1××qd2p_{1}\times\cdots\times p_{d_{1}}\times q_{1}\times\cdots\times q_{d_{2}} tensor defined by

(𝓧𝓨)i1id1j1jd2=𝓧i1id1𝓨j1jd2,(\mbox{\boldmath$\mathscr{X}$}\circ\mbox{\boldmath$\mathscr{Y}$})_{i_{1}\dots i_{d_{1}}j_{1}\dots j_{d_{2}}}=\mbox{\boldmath$\mathscr{X}$}_{i_{1}\dots i_{d_{1}}}\mbox{\boldmath$\mathscr{Y}$}_{j_{1}\dots j_{d_{2}}}, (2.3)

for 1ikpk1\leq i_{k}\leq p_{k}, 1jq1\leq j_{\ell}\leq q_{\ell}, 1kd11\leq k\leq d_{1}, and 1d21\leq\ell\leq d_{2}.
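The three products above can be written compactly with NumPy's tensordot; the helper names below (mode_prod, gen_inner, outer) are ours, and the sketch assumes the conventions of (2.1)-(2.3).

import numpy as np

def mode_prod(X, Y, k):
    # Mode-k multiplication X x_k Y as in (2.1), with Y of shape (q_k, p_k):
    # contract Y's second axis with mode k of X, then move the new axis back to position k.
    return np.moveaxis(np.tensordot(Y, X, axes=(1, k)), 0, k)

def gen_inner(X, Y):
    # Generalized inner product <X, Y> as in (2.2): sum over the first d0 modes of X,
    # where d0 is the order of Y; the result is a (d - d0)-th order tensor.
    d0 = Y.ndim
    return np.tensordot(X, Y, axes=(list(range(d0)), list(range(d0))))

def outer(X, Y):
    # Outer product X o Y as in (2.3).
    return np.multiply.outer(X, Y)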

For any tensor 𝓧p1××pd\mbox{\boldmath$\mathscr{X}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}}, its Tucker ranks (r1,,rd)(r_{1},\dots,r_{d}) are defined as the matrix ranks of its matricizations, i.e. rj=rank(𝓧(j))r_{j}=\text{rank}(\mbox{\boldmath$\mathscr{X}$}_{(j)}), for j=1,,dj=1,\dots,d. Note that rjr_{j}’s are analogous to the row and column ranks of a matrix, but they are not necessarily equal for third- and higher-order tensors. If 𝓧\mathscr{X} has Tucker ranks (r1,,rd)(r_{1},\dots,r_{d}), then 𝓧\mathscr{X} has the following Tucker decomposition (Tucker, 1966; De Lathauwer, De Moor and Vandewalle, 2000)

𝓧=𝓨×1𝐘1×2𝐘2×d𝐘d=𝓨×j=1d𝐘j,\mbox{\boldmath$\mathscr{X}$}=\mbox{\boldmath$\mathscr{Y}$}\times_{1}\mathbf{Y}_{1}\times_{2}\mathbf{Y}_{2}\cdots\times_{d}\mathbf{Y}_{d}=\mbox{\boldmath$\mathscr{Y}$}\times_{j=1}^{d}\mathbf{Y}_{j}, (2.4)

where each 𝐘jpj×rj\mathbf{Y}_{j}\in\mathbb{R}^{p_{j}\times r_{j}} is the factor matrix for j=1,,dj=1,\dots,d, and 𝓨r1××rd\mbox{\boldmath$\mathscr{Y}$}\in\mathbb{R}^{r_{1}\times\cdots\times r_{d}} is the core tensor. If 𝓧\mathscr{X} has the Tucker decomposition in (2.4), then we have the following results for its matricizations:

𝓧(k)=𝐘k𝓨(k)(𝐘d𝐘k+1𝐘k1𝐘1)=𝐘k𝓨(k)(jk𝐘j),k=1,,d,\mbox{\boldmath$\mathscr{X}$}_{(k)}=\mathbf{Y}_{k}\mbox{\boldmath$\mathscr{Y}$}_{(k)}(\mathbf{Y}_{d}\otimes\cdots\otimes\mathbf{Y}_{k+1}\otimes\mathbf{Y}_{k-1}\otimes\cdots\otimes\mathbf{Y}_{1})^{\top}=\mathbf{Y}_{k}\mbox{\boldmath$\mathscr{Y}$}_{(k)}(\otimes_{j\neq k}\mathbf{Y}_{j})^{\top},~{}~{}k=1,\dots,d, (2.5)

where jk\otimes_{j\neq k} denotes the matrix Kronecker product taken in the reverse order of the indices.
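As a quick numerical check of (2.4)-(2.5), the following sketch (reusing unfold and mode_prod from the sketches above; the dimensions are arbitrary) builds a random Tucker tensor and verifies the matricization identity with the reverse-order Kronecker product.

import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((2, 3, 4))                                 # core tensor
Ys = [rng.standard_normal((p, r)) for p, r in [(5, 2), (6, 3), (7, 4)]]

X = S
for k, Y in enumerate(Ys):
    X = mode_prod(X, Y, k)                                          # X = S x_1 Y_1 x_2 Y_2 x_3 Y_3

for k in range(3):
    K = np.ones((1, 1))
    for j, Y in enumerate(Ys):
        if j != k:
            K = np.kron(Y, K)                                       # Y_d kron ... kron Y_1, skipping Y_k
    assert np.allclose(unfold(X, k), Ys[k] @ unfold(S, k) @ K.T)    # identity (2.5)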

Throughout this paper, we use CC to denote a generic positive constant. For any two real-valued sequences xkx_{k} and yky_{k}, we write xkykx_{k}\gtrsim y_{k} if there exists a constant C>0C>0 such that xkCykx_{k}\geq Cy_{k} for all kk. Additionally, we write xkykx_{k}\asymp y_{k} if xkykx_{k}\gtrsim y_{k} and ykxky_{k}\gtrsim x_{k}. For a generic matrix 𝐗\mathbf{X}, we let 𝐗\mathbf{X}^{\top}, 𝐗F\|\mathbf{X}\|_{\text{F}}, 𝐗\|\mathbf{X}\|, vec(𝐗)\text{vec}(\mathbf{X}) and σj(𝐗)\sigma_{j}(\mathbf{X}) denote its transpose, Frobenius norm, operator norm, vectorization, and the jj-th largest singular value, respectively. For any real symmetric matrix 𝐗\mathbf{X}, let λmin(𝐗)\lambda_{\min}(\mathbf{X}) and λmax(𝐗)\lambda_{\max}(\mathbf{X}) denote its minimum and maximum eigenvalues.

3 Methodology

3.1 Gradient descent with robust gradient estimates

We consider a general estimation framework with the loss function ¯(𝓐;zi)\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) for parameter tensor 𝓐\mathscr{A} and random observation ziz_{i}. Suppose the parameter tensor admits a Tucker low-rank decomposition 𝓐=𝓢×j=1d𝐔j\mbox{\boldmath$\mathscr{A}$}=\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}, defined in (2.4), where 𝓢r1×r2××rd\mbox{\boldmath$\mathscr{S}$}\in\mathbb{R}^{r_{1}\times r_{2}\times\dots\times r_{d}} is the core tensor and each 𝐔jpj×rj\mathbf{U}_{j}\in\mathbb{R}^{p_{j}\times r_{j}} is the factor matrix. Throughout the paper, we assume that the order dd is fixed and the ranks (r1,r2,,rd)(r_{1},r_{2},\cdots,r_{d}) are known. Given the tensor decomposition, we define the loss function with respect to the decomposition components as

(𝓢,𝐔1,,𝐔d;zi)=¯(𝓢×j=1d𝐔j;zi),and n(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n(𝓢,𝐔1,,𝐔d;zi).\begin{split}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})&=\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j};z_{i}),\\ \text{and }\mathcal{L}_{n}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})&=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\cdots,\mathbf{U}_{d};z_{i}).\end{split} (3.1)

A standard method to estimate the components in the tensor decomposition is to minimize the following regularized loss function

n(𝓢,𝐔1,,𝐔d;𝒟n)+a2j=1d𝐔j𝐔jb2𝐈rjF2,\mathcal{L}_{n}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})+\frac{a}{2}\sum_{j=1}^{d}\|\mathbf{U}_{j}^{\top}\mathbf{U}_{j}-b^{2}\mathbf{I}_{r_{j}}\|_{\text{F}}^{2}, (3.2)

where a,b>0a,b>0 are tuning parameters, and the additional regularization term 𝐔j𝐔jb2𝐈rjF2\|\mathbf{U}_{j}^{\top}\mathbf{U}_{j}-b^{2}\mathbf{I}_{r_{j}}\|_{\text{F}}^{2} prevents the factor matrix 𝐔j\mathbf{U}_{j} from being singular, while also ensuring that the scaling among all factor matrices is balanced (Han, Willett and Zhang, 2022). This regularized loss minimization problem can be efficiently solved using gradient descent. Han, Willett and Zhang (2022) demonstrated that the gradient descent approach is computationally efficient. Moreover, with a suitable choice of initial values, the estimation error is proportional to

sup𝓣F=1,rank(𝓣(k))rk,1kd|1ni=1n¯(𝓐;zi),𝓣|,\sup_{\|\scalebox{0.7}{\mbox{\boldmath$\mathscr{T}$}}\|_{\text{F}}=1,\text{rank}(\scalebox{0.7}{\mbox{\boldmath$\mathscr{T}$}}_{(k)})\leq r_{k},1\leq k\leq d}\left|\left\langle\frac{1}{n}\sum_{i=1}^{n}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i}),\mbox{\boldmath$\mathscr{T}$}\right\rangle\right|, (3.3)

where 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} is the ground truth of the parameter tensor satisfying 𝔼¯(𝓐;zi)=𝟎\mathbb{E}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})=\mathbf{0}, as 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} minimizes the risk function 𝔼¯(𝓐;zi)\mathbb{E}\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) if the expectation exists. The error essentially depends on the intrinsic dimension of the low-rank tensors and the distribution of the data ziz_{i}. When the distribution of ziz_{i} is heavy-tailed, controlling the convergence rate becomes challenging, which may lead to suboptimal estimation performance.

To motivate our robust estimation method, it is important to note that the partial gradients of the loss function have a form of sample means

𝐔kn(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n𝐔k(𝓢,𝐔1,,𝐔d;zi),\nabla_{\mathbf{U}_{k}}\mathcal{L}_{n}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})=\frac{1}{n}\sum_{i=1}^{n}\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i}), (3.4)

for k=1,,dk=1,\dots,d, and

𝓢n(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n𝓢(𝓢,𝐔1,,𝐔d;zi).\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}_{n}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})=\frac{1}{n}\sum_{i=1}^{n}\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i}). (3.5)

It is well known that the sample mean is sensitive to outliers, as it can be significantly influenced by even a small fraction of extreme values. When some of ziz_{i}’s are outliers, the gradient descent approach may fail to be robust, leading to sub-optimal estimates, particularly in the presence of heavy-tailed distributions.

Therefore, we use the robust gradient estimates as alternatives to the standard gradients in vanilla gradient descent. For any given 𝓢,𝐔1,,𝐔d\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}, we may construct some robust gradient functions, denoted by 𝓖0(𝓢,𝐔1,,𝐔d)\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}) and 𝐆k(𝓢,𝐔1,,𝐔d)\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}), for k=1,,dk=1,\dots,d, to replace the partial gradients in standard gradient descent. Given initial values (𝓢(0),𝐔1(0),,𝐔d(0))(\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\cdots,\mathbf{U}_{d}^{(0)}), a step size η>0\eta>0, and the number of iterations TT, we propose the following gradient descent algorithm, summarized in Algorithm 1, using the robust gradient functions.

Algorithm 1 Robust gradient descent for (𝓢,𝐔1,,𝐔d)(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})

initialize: 𝓢(0),𝐔1(0),,𝐔d(0)\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\dots,\mathbf{U}_{d}^{(0)}, a,b>0a,b>0, step size η>0\eta>0, and number of iterations TT

for t=0t=0 to T1T-1

for k=1k=1 to dd

𝐔k(t+1)𝐔k(t)η𝐆k(𝓢(t),𝐔1(t),,𝐔d(t))ηa𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)\mathbf{U}_{k}^{(t+1)}\leftarrow\mathbf{U}_{k}^{(t)}-\eta\cdot\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$}^{(t)},\mathbf{U}_{1}^{(t)},\dots,\mathbf{U}_{d}^{(t)})-\eta a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})

end for
         𝓢(t+1)𝓢(t)η𝓖0(𝓢(t),𝐔1(t),,𝐔d(t))\mbox{\boldmath$\mathscr{S}$}^{(t+1)}\leftarrow\mbox{\boldmath$\mathscr{S}$}^{(t)}-\eta\cdot\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$}^{(t)},\mathbf{U}_{1}^{(t)},\dots,\mathbf{U}_{d}^{(t)})

end for
return
𝓐(T)=𝓢(T)×1𝐔1(T)×d𝐔d(T)\mbox{\boldmath$\mathscr{A}$}^{(T)}=\mbox{\boldmath$\mathscr{S}$}^{(T)}\times_{1}\mathbf{U}_{1}^{(T)}\dots\times_{d}\mathbf{U}_{d}^{(T)}
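A minimal Python/NumPy sketch of Algorithm 1 is given below; the robust gradient functions G0 and Gks are passed in as callables (for instance, the truncated estimators of Section 3.3), and all function names are our own rather than part of the paper.

import numpy as np

def tucker_compose(S, Us):
    # A = S x_1 U_1 ... x_d U_d
    A = S
    for k, U in enumerate(Us):
        A = np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)
    return A

def robust_gd(S0, Us0, G0, Gks, eta, a, b, T):
    # S0: initial core tensor; Us0: list of initial factor matrices U_k;
    # G0(S, Us): robust gradient for the core; Gks[k](S, Us): robust gradient for U_k;
    # eta: step size; a, b: regularization parameters; T: number of iterations.
    S, Us = S0.copy(), [U.copy() for U in Us0]
    for _ in range(T):
        new_Us = []
        for k, U in enumerate(Us):
            reg = U @ (U.T @ U - b ** 2 * np.eye(U.shape[1]))       # balancing regularizer
            new_Us.append(U - eta * Gks[k](S, Us) - eta * a * reg)
        S = S - eta * G0(S, Us)       # core update uses the iterate-t factors
        Us = new_Us
    return tucker_compose(S, Us), S, Us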

3.2 Local convergence analysis

The robust gradient functions may not be the partial gradients of a robust loss function. When the gradients are replaced by their robust alternatives, Algorithm 1 is not designed to solve a traditional minimization problem. As a result, studying its convergence becomes challenging. To address this and establish both computational and statistical guarantees, we introduce some notations and conditions.

Definition 3.1 (Restricted correlated gradient).

The loss function ¯\overline{\mathcal{L}} satisfies the restricted correlated gradient (RCG) condition: for any 𝓐\mathscr{A} such that rank(𝓐(k))rk\textup{rank}(\mbox{\boldmath$\mathscr{A}$}_{(k)})\leq r_{k}, 1kd1\leq k\leq d,

𝔼¯(𝓐;zi),𝓐𝓐α2𝓐𝓐F2+12β𝔼¯(𝓐;z)F2,\langle\mathbb{E}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}),\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\rangle\geq\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+\frac{1}{2\beta}\|\mathbb{E}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z)\|_{\textup{F}}^{2}, (3.6)

where the RCG parameters α\alpha and β\beta satisfy 0<αβ0<\alpha\leq\beta.

The RCG condition implies that for any low-rank tensor 𝓐\mathscr{A}, the expectation of the gradient ¯(𝓐;zi)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) is positively correlated with the optimal descent direction 𝓐𝓐\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}. If the loss function has a finite expectation 𝔼¯(𝓐;zi)\mathbb{E}\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}), this condition is implied by the restricted strong convexity and smoothness conditions of the risk function 𝔼¯(𝓐;zi)\mathbb{E}\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) (Bubeck, 2015; Jain et al., 2017). However, in the setting of heavy-tailed distributions, the expectation of ¯(𝓐;zi)\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}) may not exist, making it more appropriate to consider the RCG condition on the gradient rather than the loss function (see tensor linear regression in Remark 4.1 as an example). Furthermore, compared to the vanilla gradient descent method (Han, Willett and Zhang, 2022), the RCG condition is imposed on the expectation of the loss gradient, which simplifies the technical analysis under heavy-tailed distributions.

The robust gradient functions are expected to exhibit desirable stability against outliers. As an alternative to the sample mean, 𝐆k(𝓢,𝐔1,,𝐔d)\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}) can be viewed as a robust estimator of 𝔼[𝐔k(𝓢,𝐔1,,𝐔d)]\mathbb{E}[\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})]. Similarly, 𝓖0(𝓢,𝐔1,,𝐔d)\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}) can be viewed as a mean estimator of 𝔼[𝓢(𝓢,𝐔1,,𝐔d)]\mathbb{E}[\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})]. Formally, we define the following condition for the stability of the robust gradients.

Definition 3.2 (Stability of robust gradients).

Given (𝓢,𝐔1,,𝐔d)(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}), the robust gradient functions are stable if there exist positive constants ϕ\phi and ξk\xi_{k}, for 0kd0\leq k\leq d, such that

𝐆k(𝓢,𝐔1,,𝐔d)𝔼𝐔k(𝓢,𝐔1,,𝐔d)F2ϕ𝓢×j=1d𝐔j𝓐F2+ξk2,and 𝓖0(𝓢,𝐔1,,𝐔d)𝔼𝓢(𝓢,𝐔1,,𝐔d)F2ϕ𝓢×j=1d𝐔j𝓐F2+ξ02.\begin{split}\|\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\cdots,\mathbf{U}_{d})-\mathbb{E}\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})\|_{\textup{F}}^{2}&\leq\phi\|\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+\xi_{k}^{2},\\ \text{and }\|\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})-\mathbb{E}\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})\|_{\textup{F}}^{2}&\leq\phi\|\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+\xi_{0}^{2}.\end{split} (3.7)

The universal parameter ϕ\phi controls the performance of all gradient functions in the presence of an inaccurate Tucker decomposition, while the constants ξk\xi_{k}’s represent the estimation accuracy of the robust gradient estimators. A similar definition for robust estimation via convex optimization is considered in Prasad et al. (2020).

For the ground truth 𝓐\mbox{\boldmath$\mathscr{A}$}^{*}, denote its largest and smallest singular values across all directions by σ¯=max1kd𝓐(k)\bar{\sigma}=\max_{1\leq k\leq d}\|\mbox{\boldmath$\mathscr{A}$}^{*}_{(k)}\| and σ¯=min1kdσrk(𝓐(k))\underline{\sigma}=\min_{1\leq k\leq d}\sigma_{r_{k}}(\mbox{\boldmath$\mathscr{A}$}^{*}_{(k)}). The condition number of 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} is then given by κ=σ¯/σ¯\kappa=\bar{\sigma}/\underline{\sigma}. To address the unidentifiability issue in Tucker decomposition, we consider the component-wise estimation error under rotation for the estimate (𝓢,𝐔1,,𝐔d)(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}), following Han, Willett and Zhang (2022). The error is defined as

Err(𝓢,𝐔1,,𝐔d)=min𝐎k𝕆rk,1kd{𝓢𝓢×j=1d𝐎kF2+k=1d𝐔k𝐔k𝐎kF2},\text{Err}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})=\min_{\mathbf{O}_{k}\in\mathbb{O}^{r_{k}},1\leq k\leq d}\left\{\|\mbox{\boldmath$\mathscr{S}$}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{k}^{\top}\|_{\text{F}}^{2}+\sum_{k=1}^{d}\|\mathbf{U}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\text{F}}^{2}\right\}, (3.8)

where the true decomposition satisfies 𝐔k=b\|\mathbf{U}_{k}^{*}\|=b and the orthogonal matrices 𝐎k\mathbf{O}_{k}’s account for the unidentification of the Tucker decomposition. For the tt-th iteration of Algorithm 1, where t=0,1,,Tt=0,1,\dots,T, denote the estimated parameter by 𝓐(t)=𝓢(t)×j=1d𝐔j(t)\mbox{\boldmath$\mathscr{A}$}^{(t)}=\mbox{\boldmath$\mathscr{S}$}^{(t)}\times_{j=1}^{d}\mathbf{U}_{j}^{(t)}. The corresponding estimation error is then given by Err(𝓢(t),𝐔1(t),,𝐔d(t))\text{Err}(\mbox{\boldmath$\mathscr{S}$}^{(t)},\mathbf{U}_{1}^{(t)},\dots,\mathbf{U}_{d}^{(t)}).

Given the conditions and notations outlined above, we present the local convergence analysis for the gradient descent iterations with stable robust gradient functions.

Theorem 3.3 (Local convergence rate).

Suppose that the loss function ¯\overline{\mathcal{L}} satisfies the RCG condition with parameters α\alpha and β\beta as in Definition 3.1, and that the robust gradient functions at each step tt satisfy the stability condition with parameters ϕ\phi and ξk\xi_{k} as in Definition 3.2, for all k=0,1,,dk=0,1,\dots,d and t=1,2,,Tt=1,2,\dots,T. If the initial estimation error satisfies Err(𝓢(0),𝐔1(0),,𝐔d(0))αβ1σ¯2/(d+1)κ2\textup{Err}(\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\dots,\mathbf{U}_{d}^{(0)})\lesssim\alpha\beta^{-1}\bar{\sigma}^{2/(d+1)}\kappa^{-2}, ϕα2κ4σ¯2d/(d+1)\phi\lesssim\alpha^{2}\kappa^{-4}\bar{\sigma}^{2d/(d+1)}, aακ2σ¯(2d2)/(d+1)a\asymp\alpha\kappa^{-2}\bar{\sigma}^{(2d-2)/(d+1)}, bσ¯1/(d+1)b\asymp\bar{\sigma}^{1/(d+1)}, and ηαβ1κ2\eta\asymp\alpha\beta^{-1}\kappa^{2}, then the estimation errors at the tt-th iteration, for t=1,2,,Tt=1,2,\dots,T, satisfy

Err(𝓢(t),𝐔1(t),,𝐔d(t))(1Cαβ1κ2)tErr(𝓢(0),𝐔1(0),,𝐔d(0))+Cα2σ¯4d/(d+1)κ4k=0dξk2,\begin{split}\textup{Err}(\mbox{\boldmath$\mathscr{S}$}^{(t)},\mathbf{U}_{1}^{(t)},\dots,\mathbf{U}_{d}^{(t)})\leq&(1-C\alpha\beta^{-1}\kappa^{-2})^{t}\cdot\textup{Err}(\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\dots,\mathbf{U}_{d}^{(0)})\\ &+C\alpha^{-2}\bar{\sigma}^{-4d/(d+1)}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2},\end{split} (3.9)

and

𝓐(t)𝓐F2κ2(1Cαβ1κ2)t𝓐(0)𝓐F2+σ¯2d/(d+1)α2κ4k=0dξk2.\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}\lesssim\kappa^{2}(1-C\alpha\beta^{-1}\kappa^{-2})^{t}\cdot\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+\bar{\sigma}^{-2d/(d+1)}\alpha^{-2}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2}. (3.10)

Theorem 3.3 establishes the linear convergence of the robust gradient descent iterates under certain sufficient conditions. In both upper bounds provided, the first terms correspond to optimization errors that decay exponentially with the number of iterations, which reflects the improvement in the solution as the gradient descent progresses. The second terms, on the other hand, capture the statistical errors, which depend on the estimation accuracy of the robust gradient functions. These statistical errors arise due to the approximation of gradients in the presence of noise or outliers, and their magnitude is influenced by the robustness of the gradient estimator. Thus, to guarantee fast convergence, it is crucial to construct a stable robust gradient estimator, which is the focus of the next subsection. The corresponding theoretical analysis for each specific statistical model will be provided in Section 4, where we will discuss the performance of the estimator under different conditions.

3.3 Robust gradient estimation via entrywise truncation

In this subsection, we propose a general robust gradient estimation method. The partial gradient of the risk function is the expectation of the partial gradient of the loss function, which suggests that robust gradient estimation can be framed as a mean estimation problem. In this context, Fan, Wang and Zhu (2021) introduced a simple robust mean estimation procedure based on data truncation. They demonstrated that the truncated estimator achieves the optimal rate under certain bounded moment conditions. Motivated by their work, we apply the truncation method to estimate the partial gradients in our setting.

For any matrix 𝐌p×q\mathbf{M}\in\mathbb{R}^{p\times q}, we define the entrywise truncation operator T(,):p×q×+p×q\text{T}(\cdot,\cdot):\mathbb{R}^{p\times q}\times\mathbb{R}^{+}\to\mathbb{R}^{p\times q}, such that

T(𝐌,τ)j,k=sgn(𝐌j,k)min(|𝐌j,k|,τ),\text{T}(\mathbf{M},\tau)_{j,k}=\text{sgn}(\mathbf{M}_{j,k})\min(|\mathbf{M}_{j,k}|,\tau), (3.11)

for 1jp1\leq j\leq p and 1kq1\leq k\leq q, where sgn()\text{sgn}(\cdot) denotes the sign function. This operator truncates each entry of the matrix 𝐌\mathbf{M} to be no larger than the threshold τ\tau in absolute value, while preserving its sign. Similarly, we can define the entrywise truncation operator T(𝓣,τ)\text{T}(\mbox{\boldmath$\mathscr{T}$},\tau) for any tensor 𝓣\mathscr{T} with truncation parameter τ\tau. The truncation parameter τ\tau plays a critical role in balancing the trade-off between truncation bias and robustness. A smaller value of τ\tau will lead to stronger truncation, reducing the influence of outliers but potentially introducing more bias, while a larger τ\tau will preserve more of the data’s original values, reducing bias but making the estimator more sensitive to noise.
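In code, the entrywise truncation operator amounts to a sign-preserving clip; a minimal sketch (the function name truncate is ours):

import numpy as np

def truncate(M, tau):
    # Entrywise truncation T(M, tau) in (3.11): shrink each entry to have
    # absolute value at most tau while preserving its sign.
    return np.sign(M) * np.minimum(np.abs(M), tau)   # equivalently np.clip(M, -tau, tau)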

We consider the entrywise truncation gradient estimators. For 1kd1\leq k\leq d, the estimator for the gradient with respect to 𝐔k\mathbf{U}_{k} is given by

𝐆k(𝓢,𝐔1,,𝐔d;τ)=1ni=1nT(𝐔k(𝓢,𝐔1,,𝐔d;zi),τ)=1ni=1nT(¯(𝓢×j=1d𝐔j;zi)(k)(jk𝐔j)𝓢(k),τ).\begin{split}\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i}),\tau)\\ =&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j};z_{i})_{(k)}(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau).\end{split} (3.12)

Similarly, for the core tensor 𝓢\mathscr{S}, the truncation-based estimator is defined as

𝓖0(𝓢,𝐔1,,𝐔d;τ)=1ni=1nT(𝓢(𝓢,𝐔1,,𝐔d;zi),τ)=1ni=1nT(¯(𝓢×j=1d𝐔j;zi)×j=1d𝐔j,τ).\begin{split}\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i}),\tau)\\ =&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau).\end{split} (3.13)

Note that the truncation-based robust gradient estimator is generally applicable to a wide range of tensor models. In Sections 4 and 5, we will show both theoretically and numerically that the entrywise truncation using a single parameter τ\tau can achieve optimal estimation performance under various distributional assumptions.
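A sketch of the truncated estimators (3.12)-(3.13) is given below, reusing unfold and tucker_compose from the earlier sketches and truncate from above; grad_loss(A, z) is an assumed user-supplied function returning the full loss gradient for one sample, and kron_skip builds the reverse-order Kronecker product of the factors excluding mode k.

import numpy as np

def kron_skip(Us, k):
    # Reverse-order Kronecker product U_d kron ... kron U_1, skipping U_k.
    K = np.ones((1, 1))
    for j, U in enumerate(Us):
        if j != k:
            K = np.kron(U, K)
    return K

def robust_partial_grads(S, Us, samples, grad_loss, tau):
    d = len(Us)
    A = tucker_compose(S, Us)
    GU = [np.zeros_like(U) for U in Us]
    GS = np.zeros_like(S)
    Vs = [kron_skip(Us, k) @ unfold(S, k).T for k in range(d)]        # V_k in Section 3.3
    for z in samples:
        G = grad_loss(A, z)                                           # full gradient, same shape as A
        for k in range(d):
            GU[k] += truncate(unfold(G, k) @ Vs[k], tau)              # estimator (3.12)
        GS += truncate(tucker_compose(G, [U.T for U in Us]), tau)     # estimator (3.13)
    return [g / len(samples) for g in GU], GS / len(samples)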

We briefly outline the idea for proving the stability of the truncated gradient estimator. A complete proof for each statistical application is provided in the supplementary material. For 1kd1\leq k\leq d, we define 𝐕k=(jk𝐔j)𝓢(k)\mathbf{V}_{k}=(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}, and the error term for the gradient estimator can be expressed as

𝐆k(𝓢,𝐔1,,𝐔d;τ)𝔼[𝐔k(𝓢,𝐔1,,𝐔d)]=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[¯(𝓐;zi)(k)𝐕k]=Tk,1+Tk,2+Tk,3+Tk,4,\begin{split}&\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)-\mathbb{E}[\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})]\\ =&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}]=T_{k,1}+T_{k,2}+T_{k,3}+T_{k,4},\end{split} (3.14)

where the terms Tk,1T_{k,1}, Tk,2T_{k,2}, Tk,3T_{k,3}, and Tk,4T_{k,4} are defined as follows:

Tk,1=𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]𝔼[¯(𝓐;zi)(k)𝐕k],Tk,2=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)],Tk,3=𝔼[¯(𝓐;zi)(k)𝐕k]𝔼[¯(𝓐;zi)(k)𝐕k]+𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)],and Tk,4=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]+𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)].\begin{split}T_{k,1}=&\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}^{\top},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}],\\ T_{k,2}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}^{\top},\tau)],\\ T_{k,3}=&\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}]\\ &+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)]-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)],\\ \text{and }T_{k,4}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)\\ &-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)]+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)].\end{split} (3.15)

Here, Tk,1T_{k,1} is the truncation bias at the ground truth 𝓐\mbox{\boldmath$\mathscr{A}$}^{*}, and Tk,2T_{k,2} represents the deviation of the truncated estimation around its expectation. As each truncated gradient, T(¯(𝓐;zi)(k)𝐕k,τ)\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau), is a bounded variable, we can apply the Bernstein inequality (Wainwright, 2019) to achieve a sub-Gaussian-type concentration without the Gaussian distributional assumption on the data. The truncation parameter τ\tau controls the magnitude of Tk,1F\|T_{k,1}\|_{\text{F}} and Tk,2F\|T_{k,2}\|_{\text{F}}, and an optimal τ\tau gives Tk,1FTk,2Fξk\|T_{k,1}\|_{\text{F}}\asymp\|T_{k,2}\|_{\text{F}}\asymp\xi_{k}. For Tk,3T_{k,3}, given some regularity conditions, we can obtain an upper bound for the truncation bias of the second-order approximation error in Tk,3F\|T_{k,3}\|_{\text{F}}. Similarly, as T(¯(𝓐;zi)(k)𝐕k,τ)T(¯(𝓐;zi)(k)𝐕k,τ)\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau) is bounded, we can also achieve a sub-Gaussian-type concentration and show that Tk,3FTk,4Fϕ1/2𝓐𝓐F\|T_{k,3}\|_{\text{F}}\asymp\|T_{k,4}\|_{\text{F}}\lesssim\phi^{1/2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}. Hence, we can show that i=14Tk,iF2ϕ𝓐𝓐F2+ξk2\sum_{i=1}^{4}\|T_{k,i}\|_{\text{F}}^{2}\lesssim\phi\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\xi_{k}^{2}.

By controlling the truncation bias, deviation, and approximation errors, we demonstrate that the truncated gradient estimator is stable and achieves optimal performance under certain conditions. A similar approach can be applied to the gradient with respect to the core tensor 𝓢\mathscr{S}, establishing the stability of the corresponding robust estimator.

3.4 Implementation and initialization

The optimal statistical convergence rates of the proposed estimator depend critically on the optimal value of the truncation parameter τ\tau, which varies according to the model dimension, sample size, and other relevant parameters. To select τ\tau in a data-driven manner, we propose using cross-validation to evaluate the estimates produced by the robust gradient descent algorithm for different values of τ\tau.

To initialize Algorithm 1, we may disregard the low-rank structure of 𝓐\mathscr{A} initially, and find the estimate 𝓐~\mathscr{\widetilde{A}}. Specifically, robust gradient descent can be applied to update 𝓐\mathscr{A}, as summarized in Algorithm 2, where the robust gradient is

g(𝓐;τ)=1ni=1nT(¯(𝓐;zi),τ),g(\mbox{\boldmath$\mathscr{A}$};\tau^{\prime})=\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i}),\tau^{\prime}), (3.16)

with another truncation parameter τ\tau^{\prime} and step size η\eta^{\prime}. Once we have 𝓐~\mathscr{\widetilde{A}}, we apply HOSVD (higher-order singular value decomposition) or HOOI (higher-order orthogonal iteration) (De Lathauwer, De Moor and Vandewalle, 2000) to 𝓐~\mathscr{\widetilde{A}} to obtain the initial values.

Algorithm 2 Robust gradient descent for 𝓐\mathscr{A}

initialize: 𝓐(0)=𝟎\mbox{\boldmath$\mathscr{A}$}^{(0)}=\mathbf{0}, step size η>0\eta^{\prime}>0, and number of iterations TT

for t=0t=0 to T1T-1

𝓐(t+1)𝓐(t)ηg(𝓐(t);τ)\mbox{\boldmath$\mathscr{A}$}^{(t+1)}\leftarrow\mbox{\boldmath$\mathscr{A}$}^{(t)}-\eta^{\prime}\cdot g(\mbox{\boldmath$\mathscr{A}$}^{(t)};\tau^{\prime})

end for
return
𝓐~=𝓐(T)\mbox{\boldmath$\mathscr{\widetilde{A}}$}=\mbox{\boldmath$\mathscr{A}$}^{(T)}
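Below is a sketch of the initialization step, combining Algorithm 2 with HOSVD and reusing truncate, unfold, and tucker_compose from the earlier sketches; grad_loss and the rescaling by b are our own assumptions, not specifications from the paper.

import numpy as np

def robust_gd_full(samples, grad_loss, shape, eta_prime, tau_prime, T):
    # Algorithm 2: robust gradient descent on the full tensor A with the
    # entrywise-truncated gradient g(A; tau') in (3.16).
    A = np.zeros(shape)
    for _ in range(T):
        g = sum(truncate(grad_loss(A, z), tau_prime) for z in samples) / len(samples)
        A = A - eta_prime * g
    return A

def hosvd_init(A_tilde, ranks, b=1.0):
    # HOSVD of the unstructured estimate: the leading left singular vectors of each
    # mode-k matricization give U_k^(0); factors may be rescaled so that ||U_k|| = b.
    Us = []
    for k, r in enumerate(ranks):
        Uk, _, _ = np.linalg.svd(unfold(A_tilde, k), full_matrices=False)
        Us.append(b * Uk[:, :r])
    # core chosen so that S x_1 U_1 ... x_d U_d reproduces the HOSVD approximation
    S = tucker_compose(A_tilde, [U.T / b ** 2 for U in Us])
    return S, Us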

4 Applications to Tensor Models

In this section, we apply the proposed robust gradient descent algorithm, utilizing entrywise truncated gradient estimators, to three statistical tensor models: tensor linear regression, tensor logistic regression, and tensor PCA. For tensor linear regression, both the covariates and the noise are assumed to follow heavy-tailed distributions. In the case of tensor logistic regression and tensor PCA, we assume heavy-tailed covariates for the former and heavy-tailed noise for the latter. In the following, we let p¯=max1jdpj\bar{p}=\max_{1\leq j\leq d}p_{j} be the maximum dimension across all modes, and let deff=k=1dpkrk+k=1drkd_{\text{eff}}=\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k} be the effective dimension of the low-rank tensor, which corresponds to the total number of parameters in the Tucker decomposition.

4.1 Heavy-tailed tensor linear regression

The first statistical model we consider is tensor linear regression:

𝓨i=𝓐,𝓧i+𝓔i,i=1,2,,n,\mbox{\boldmath$\mathscr{Y}$}_{i}=\langle\mbox{\boldmath$\mathscr{A}$}^{*},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle+\mbox{\boldmath$\mathscr{E}$}_{i},\quad i=1,2,\dots,n, (4.1)

where 𝓧ip1×p2××pd0\mbox{\boldmath$\mathscr{X}$}_{i}\in\mathbb{R}^{p_{1}\times p_{2}\times\cdots\times p_{d_{0}}}, 𝓨i,𝓔ipd0+1×pd0+2××pd\mbox{\boldmath$\mathscr{Y}$}_{i},\mbox{\boldmath$\mathscr{E}$}_{i}\in\mathbb{R}^{p_{d_{0}+1}\times p_{d_{0}+2}\times\cdots\times p_{d}}, and 𝓐p1×p2××pd\mbox{\boldmath$\mathscr{A}$}^{*}\in\mathbb{R}^{p_{1}\times p_{2}\times\cdots\times p_{d}} with 0d0d0\leq d_{0}\leq d. The symbol ,\langle\cdot,\cdot\rangle represents the generalized inner product of tensors, defined in (2.2). When d=d0d=d_{0}, the response 𝓨i\mbox{\boldmath$\mathscr{Y}$}_{i} is a scalar, and ,\langle\cdot,\cdot\rangle reduces to the conventional inner product. We assume that the samples zi=(𝓧i,𝓨i)z_{i}=(\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{Y}$}_{i}) are independent across i=1,,ni=1,\dots,n, and 𝔼[𝓔i|𝓧i]=𝟎\mathbb{E}[\mbox{\boldmath$\mathscr{E}$}_{i}|\mbox{\boldmath$\mathscr{X}$}_{i}]=\mathbf{0}.

Model (4.1) encompasses multivariate regression, multi-response regression, and matrix trace regression as special cases. It has broad applicability in various statistics and machine learning contexts, such as multi-response tensor regression (Raskutti, Yuan and Chen, 2019), matrix compressed sensing (Candes and Plan, 2011), autoregressive time series modeling (Wang et al., 2022), matrix and tensor completion (Negahban and Wainwright, 2012; Cai et al., 2019), and others.

For model (4.1), it is common to consider the least squares loss function:

(𝓢,𝐔1,,𝐔d;𝓨i,𝓧i)=12𝓨i𝓢×j=1d𝐔j,𝓧iF2.\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mbox{\boldmath$\mathscr{Y}$}_{i},\mbox{\boldmath$\mathscr{X}$}_{i})=\frac{1}{2}\|\mbox{\boldmath$\mathscr{Y}$}_{i}-\langle\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle\|_{\text{F}}^{2}. (4.2)

For any given (𝓢,𝐔1,,𝐔d)(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d}), its partial gradients with respect to each zi=(𝓨i,𝓧i)z_{i}=(\mbox{\boldmath$\mathscr{Y}$}_{i},\mbox{\boldmath$\mathscr{X}$}_{i}) are

𝐔k(𝓢,𝐔1,,𝐔d;zi)=[𝓧i(𝓐,𝓧i𝓨i)](k)(jk𝐔j)𝓢(k),1kd,and 𝓢(𝓢,𝐔1,,𝐔d;zi)=[𝓧i(𝓐,𝓧i𝓨i)]×j=1d𝐔j.\begin{split}\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})&=[\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i})]_{(k)}(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},~{}~{}1\leq k\leq d,\\ \text{and }\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})&=[\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i})]\times_{j=1}^{d}\mathbf{U}_{j}^{\top}.\end{split} (4.3)

By applying the entrywise truncation operator from Section 3.3, the robust gradients with truncation parameter τ\tau are given by

𝐆k(𝓢,𝐔1,,𝐔d;τ)=1ni=1nT([𝓧i(𝓐,𝓧i𝓨i)](k)(jk𝐔j)𝓢(k),τ)\mathbf{G}_{k}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)=\frac{1}{n}\sum_{i=1}^{n}\text{T}\left([\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i})]_{(k)}(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau\right) (4.4)

for k=1,,dk=1,\dots,d, and

𝓖0(𝓢,𝐔1,,𝐔d;τ)=1ni=1nT([𝓧i(𝓐,𝓧i𝓨i)]×j=1d𝐔j,τ).\mbox{\boldmath$\mathscr{G}$}_{0}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\tau)=\frac{1}{n}\sum_{i=1}^{n}\text{T}\left([\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i})]\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau\right). (4.5)
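For the scalar-response case (d0 = d), the per-sample full gradient in (4.3) is simply the covariate tensor times the residual, so the robust gradients (4.4) and (4.5) can be obtained by plugging the following function into the robust_partial_grads sketch of Section 3.3 (the names and interface are ours):

import numpy as np

def regression_grad_loss(A, z):
    # Per-sample gradient of the least squares loss (4.2) with scalar response:
    # grad = X_i * (<A, X_i> - y_i), cf. (4.3).
    X, y = z
    return X * (np.sum(A * X) - y)

# usage sketch:
# GU, GS = robust_partial_grads(S, Us, samples, regression_grad_loss, tau)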
Remark 4.1.

For model (4.1), the loss function ¯(𝓐;zi)=𝓨i𝓐,𝓧iF2/2\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})=\|\mbox{\boldmath$\mathscr{Y}$}_{i}-\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle\|_{\textup{F}}^{2}/2 has a finite expectation if both 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} have finite second moments. In the following, we assume that 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} has a finite (1+ϵ)(1+\epsilon)-th moment for some constant ϵ(0,1]\epsilon\in(0,1]. Under this assumption, we only require that the gradient ¯(𝓐;zi)=𝓧i(𝓐,𝓧i𝓨i)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})=\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}) has a finite expectation.

Remark 4.2.

Adaptive Huber regression is a widely-used robust estimation method for high-dimensional linear regression (Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023). The loss function is given by

H(𝓢,𝐔1,,𝐔d;𝒟n)=12i=1nν(𝓨i𝓢×j=1d𝐔j,𝓧i),\mathcal{L}_{\textup{H}}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})=\frac{1}{2}\sum_{i=1}^{n}\ell_{\nu}(\mbox{\boldmath$\mathscr{Y}$}_{i}-\langle\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle), (4.6)

where ν(𝓣)=i1,i2,,idν(𝓣i1i2id)\ell_{\nu}(\mbox{\boldmath$\mathscr{T}$})=\sum_{i_{1},i_{2},\dots,i_{d}}\ell_{\nu}(\mbox{\boldmath$\mathscr{T}$}_{i_{1}i_{2}\dots i_{d}}) for any tensor 𝓣\mathscr{T}, ν(x)=x21(|x|ν)+(2ν|x|ν2)1(|x|>ν)\ell_{\nu}(x)=x^{2}\cdot 1(|x|\leq\nu)+(2\nu|x|-\nu^{2})\cdot 1(|x|>\nu) is the Huber loss, and ν>0\nu>0 is the robustness parameter. The partial gradients of H\mathcal{L}_{\textup{H}} are

𝐔kH(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n[𝓧iT(𝓐,𝓧i𝓨i,ν)](k)(jk𝐔j)𝓢(k),and 𝓢H(𝓢,𝐔1,,𝐔d;𝒟n)=1ni=1n[𝓧iT(𝓐,𝓧i𝓨i,ν)]×j=1d𝐔j.\begin{split}\nabla_{\mathbf{U}_{k}}\mathcal{L}_{\textup{H}}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})&=\frac{1}{n}\sum_{i=1}^{n}[\mbox{\boldmath$\mathscr{X}$}_{i}\circ\textup{T}(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i},\nu)]_{(k)}(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\\ \text{and }\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}_{\textup{H}}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mathcal{D}_{n})&=\frac{1}{n}\sum_{i=1}^{n}[\mbox{\boldmath$\mathscr{X}$}_{i}\circ\textup{T}(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i},\nu)]\times_{j=1}^{d}\mathbf{U}_{j}^{\top}.\end{split} (4.7)

In comparison to the standard Huber regression, where the gradients bound (𝓐,𝓧i𝓨i)(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}) directly, the entrywise truncated gradients in (4.4) and (4.5) control the deviation of the term 𝓧i(𝓐,𝓧i𝓨i)\mbox{\boldmath$\mathscr{X}$}_{i}\circ(\langle\mbox{\boldmath$\mathscr{A}$},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}). This modification enables us to better handle heavy-tailedness in both the covariates and the noise.

By applying the Tucker decomposition 𝓐=𝓢×j=1d𝐔j\mbox{\boldmath$\mathscr{A}$}=\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j} and utilizing properties of tensor inner products, the partial gradients can be rewritten as follows. For k=1,,d0k=1,\dots,d_{0},

𝐔k(𝓢,𝐔1,,𝐔d;zi)=[(𝓧i×j=1,jkd0𝐔j)(𝓢×j=d0+1d𝐔j𝐔j,𝓧i×j=1d0𝐔j𝓨i×j=1dd0𝐔d0+j)](k)𝓢(k).\begin{split}&\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})=\big{[}\big{(}\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}^{\top}\big{)}\circ\\ &~{}~{}~{}\big{(}\langle\mbox{\boldmath$\mathscr{S}$}\times_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top}\big{)}\big{]}_{(k)}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}.\end{split} (4.8)

For k=d0+1,,dk=d_{0}+1,\dots,d,

𝐔k(𝓢,𝐔1,,𝐔d;zi)=[(𝓧i×j=1d0𝐔j)(𝓢×k𝐔k×j=d0+1,jkd𝐔j𝐔j,𝓧i×j=1d0𝐔j𝓨i×j=1,jkd0dd0𝐔d0+j)](k)𝓢(k).\begin{split}&\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})=\big{[}\big{(}\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\big{)}\circ\\ &~{}~{}~{}\big{(}\langle\mbox{\boldmath$\mathscr{S}$}\times_{k}\mathbf{U}_{k}\times_{j=d_{0}+1,j\neq k}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}\times_{j=1,j\neq k-d_{0}}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top}\big{)}\big{]}_{(k)}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}.\end{split} (4.9)

Finally, the gradient with respect to the core tensor 𝓢\mathscr{S} is

𝓢(𝓢,𝐔1,,𝐔d;zi)=(𝓧i×j=1d0𝐔j)(𝓢×j=d0+1d𝐔j𝐔j,𝓧i×j=1d0𝐔j𝓨i×j=1dd0𝐔d0+j).\begin{split}&\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})\\ &=\big{(}\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\big{)}\circ\big{(}\langle\mbox{\boldmath$\mathscr{S}$}\times_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j},\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\rangle-\mbox{\boldmath$\mathscr{Y}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top}\big{)}.\end{split} (4.10)

In the above expressions, dimension reduction is applied to high-dimensional data 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓨i\mbox{\boldmath$\mathscr{Y}$}_{i} via the factor matrices 𝐔j\mathbf{U}_{j}’s. This allows for efficient computation of the partial gradients. More importantly, the transformed data, such as 𝓧i×j=1d0𝐔j\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top} and 𝓨i×j=1dd0𝐔d0+j\mbox{\boldmath$\mathscr{Y}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top}, appear in the partial gradients rather than the original high-dimensional data. This transformation can be seen as reducing the effective dimension of the data used in gradient calculation. As a result, when all 𝐔k\mathbf{U}_{k}’s are close to their ground truth values and lie within certain local regions, the gradients rely only on partial information of the data. This phenomenon motivates us to consider the local bounded moment conditions for the distributions of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} when projected onto certain directions.
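The dimension reduction above amounts to repeated mode products with the factor matrices. The following is a minimal numpy sketch of this projection step; the helper names are ours, and the example dimensions are hypothetical.

```python
import numpy as np

def mode_product(T, M, mode):
    """Mode-`mode` product of tensor T with matrix M of shape (r, p_mode)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def project_all_modes(T, factors):
    """Compute T x_1 U_1^T x_2 U_2^T ... for factor matrices U_k of shape (p_k, r_k)."""
    for k, U in enumerate(factors):
        T = mode_product(T, U.T, k)
    return T

# Example: a 50 x 40 x 30 covariate tensor reduced to a 3 x 3 x 3 core-sized tensor.
rng = np.random.default_rng(0)
X_i = rng.standard_normal((50, 40, 30))
factors = [np.linalg.qr(rng.standard_normal((p, 3)))[0] for p in (50, 40, 30)]
X_low = project_all_modes(X_i, factors)
print(X_low.shape)   # (3, 3, 3)
```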

For any matrix 𝐌p×q\mathbf{M}\in\mathbb{R}^{p\times q}, consider the projection matrix onto its column space, 𝒫𝐌=𝐌(𝐌𝐌)𝐌\mathcal{P}_{\mathbf{M}}=\mathbf{M}(\mathbf{M}^{\top}\mathbf{M})^{\dagger}\mathbf{M}^{\top}, where \dagger denotes the Moore–Penrose pseudo-inverse. For any vector 𝐯p\mathbf{v}\in\mathbb{R}^{p}, the angle between 𝐯\mathbf{v} and the column space of 𝐌\mathbf{M} is arccos(𝒫𝐌𝐯2/𝐯2)\arccos(\|\mathcal{P}_{\mathbf{M}}\mathbf{v}\|_{2}/\|\mathbf{v}\|_{2}). For each 𝐔k\mathbf{U}_{k}^{*}, define the local set of unit-length vectors with sinθ\sin\theta radius δ\delta as

𝒱(𝐔k,δ)={𝐯pk:𝐯2=1andsinarccos(𝒫𝐔k𝐯2)δ}.\mathcal{V}(\mathbf{U}_{k}^{*},\delta)=\{\mathbf{v}\in\mathbb{R}^{p_{k}}:\|\mathbf{v}\|_{2}=1~{}\text{and}~{}\sin\arccos(\|\mathcal{P}_{\mathbf{U}^{*}_{k}}\mathbf{v}\|_{2})\leq\delta\}. (4.11)
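As a concrete check of membership in (4.11), the small numpy sketch below (function names are ours) computes the sin θ distance of a vector to the column space of a factor matrix:

```python
import numpy as np

def sin_theta(v, U):
    """sin of the principal angle between vector v and the column space of U."""
    P = U @ np.linalg.pinv(U.T @ U) @ U.T          # projection P_U
    v = v / np.linalg.norm(v)
    cos_theta = np.linalg.norm(P @ v)              # ||P_U v||_2 for a unit vector v
    return np.sqrt(max(0.0, 1.0 - cos_theta**2))

def in_local_set(v, U, delta):
    """True if v / ||v|| belongs to the local set V(U, delta) in (4.11)."""
    return sin_theta(v, U) <= delta
```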

By this definition, we have the following assumptions on the distributions of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i}.

Assumption 4.3.

The vectorized covariate vec(𝓧)\textup{vec}(\mbox{\boldmath$\mathscr{X}$}) has mean 𝟎\mathbf{0} and a positive definite covariance matrix 𝚺x\mathbf{\Sigma}_{x} satisfying 0<αxλmin(𝚺x)λmax(𝚺x)βx0<\alpha_{x}\leq\lambda_{\min}(\mathbf{\Sigma}_{x})\leq\lambda_{\max}(\mathbf{\Sigma}_{x})\leq\beta_{x}. For some ϵ(0,1]\epsilon\in(0,1] and δ[0,1]\delta\in[0,1], conditioned on any 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i}, the random noise 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} has the local (1+ϵ)(1+\epsilon)-th moment bound

Me,1+ϵ,δ:=max1kdd0(sup𝐯j𝒱(𝐔d0+j,δ)𝔼[|𝓔i×j=1dd0𝐯j|1+ϵ|𝓧i],sup𝐯j𝒱(𝐔d0+j,δ),1lpd0+k𝔼[|𝓔i×j=1,jkdd0𝐯j×k𝐜l|1+ϵ|𝓧i]),\begin{split}M_{e,1+\epsilon,\delta}:=\max_{1\leq k\leq d-d_{0}}&\Bigg{(}\sup_{\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{d_{\scalebox{0.4}{0}}+j}^{*},\delta)}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right],\\ &~{}~{}~{}\sup_{\begin{subarray}{c}\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{d_{\scalebox{0.4}{0}}+j}^{*},\delta),1\leq l\leq p_{d_{\scalebox{0.4}{0}}+k}\end{subarray}}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1,j\neq k}^{d-d_{0}}\mathbf{v}_{j}^{\top}\times_{k}\mathbf{c}_{l}^{\top}\right|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right]\Bigg{)},\end{split} (4.12)

where 𝐜l\mathbf{c}_{l} is the coordinate vector whose ll-th element is one and others are zero. Additionally, 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} has the local (1+ϵ)(1+\epsilon)-th moment bound

Mx,1+ϵ,δ:=max1kd0(sup𝐯j𝒱(𝐔j,δ)𝔼[|𝓧i×j=1d0𝐯j|1+ϵ],sup𝐯j𝒱(𝐔j,δ),1lpk𝔼[|𝓧i×j=1,jkd0𝐯j×k𝐜l|1+ϵ]).\begin{split}M_{x,1+\epsilon,\delta}:=\max_{1\leq k\leq d_{0}}&\Bigg{(}\sup_{\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}\right],\\ &~{}~{}~{}\sup_{\begin{subarray}{c}\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta),1\leq l\leq p_{k}\end{subarray}}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{v}_{j}^{\top}\times_{k}\mathbf{c}_{l}^{\top}\right|^{1+\epsilon}\right]\Bigg{)}.\end{split} (4.13)

We call Me,1+ϵ,δM_{e,1+\epsilon,\delta} and Mx,1+ϵ,δM_{x,1+\epsilon,\delta} the local moment bounds because they reflect the distributional properties of 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} and 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} projected onto the local regions around the column spaces of the 𝐔k\mathbf{U}_{k}^{*}’s. These local moment bound assumptions essentially limit how far the random noise and covariates can deviate in certain directions, thereby controlling their tail behavior. When δ=1\delta=1, the local moment bounds become

Me,1+ϵ,1=sup𝐯j2=1𝔼[|𝓔i×j=1dd0𝐯j|1+ϵ|𝓧i]M_{e,1+\epsilon,1}=\sup_{\|\mathbf{v}_{j}\|_{2}=1}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right] (4.14)

and

Mx,1+ϵ,1=sup𝐯j2=1𝔼[|𝓧i×j=1d0𝐯j|1+ϵ],M_{x,1+\epsilon,1}=\sup_{\|\mathbf{v}_{j}\|_{2}=1}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}\right], (4.15)

which are still different from and could be much smaller than the global (1+ϵ)(1+\epsilon)-th moments for the vectorized data, i.e., sup𝐯2=1𝔼[|vec(𝓧i)𝐯|1+ϵ]\sup_{\|\mathbf{v}\|_{2}=1}\mathbb{E}[|\text{vec}(\mbox{\boldmath$\mathscr{X}$}_{i})^{\top}\mathbf{v}|^{1+\epsilon}] and sup𝐯2=1𝔼[|vec(𝓔i)𝐯|1+ϵ]\sup_{\|\mathbf{v}\|_{2}=1}\mathbb{E}[|\text{vec}(\mbox{\boldmath$\mathscr{E}$}_{i})^{\top}\mathbf{v}|^{1+\epsilon}].

The second moment condition for 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and (1+ϵ)(1+\epsilon)-th moment condition for 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} in Assumption 4.3 offer a relaxation of the commonly-used Gaussian and sub-Gaussian conditions in tensor regression literature. Notably, in Assumption 4.3, the elements of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} or 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} are not required to be independent or uncorrelated. The parameters αx\alpha_{x}, βx\beta_{x}, Mx,1+ϵ,δM_{x,1+\epsilon,\delta} and Me,1+ϵ,δM_{e,1+\epsilon,\delta} are used to quantify how the distributions of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} and 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} affect the rate of convergence, and they are allowed to vary with tensor dimensions. For tensor regression in (4.1), define Meff,δ=Mx,1+ϵ,δMe,1+ϵ,δM_{\text{eff},\delta}=M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta} as the effective (1+ϵ)(1+\epsilon)-th local moment bound.

Remark 4.4.

The local moment condition is imposed only on directions near the subspaces spanned by the columns of the 𝐔k\mathbf{U}_{k}^{*}’s. Consequently, the local moment can be much smaller than the global moment commonly considered in the robust estimation literature (Fan, Li and Wang, 2017; Fan, Wang and Zhu, 2021; Wang and Tsay, 2023).

To highlight the significance of the local moment concept, consider the following example. Let 𝐱=(𝐱1,𝐱2)\mathbf{x}=(\mathbf{x}_{1}^{\top},\mathbf{x}_{2}^{\top})^{\top} be a random vector, where 𝐱1=(Y1,Y2,,Yp)p\mathbf{x}_{1}=(Y_{1},Y_{2},\dots,Y_{p})^{\top}\in\mathbb{R}^{p}, 𝐱2=(Yp+1,Yp+1,,Yp+1)p\mathbf{x}_{2}=(Y_{p+1},Y_{p+1},\dots,Y_{p+1})^{\top}\in\mathbb{R}^{p}, and Y1,Y2,,Yp+1i.i.d.N(0,1)Y_{1},Y_{2},\dots,Y_{p+1}\sim_{i.i.d.}N(0,1). Suppose that the ground truth is 𝐮=(p1/2𝟏p,𝟎p)\mathbf{u}^{*}=(p^{-1/2}\mathbf{1}_{p}^{\top},\mathbf{0}_{p}^{\top})^{\top}. In this case, the global second moment of 𝐱\mathbf{x} is given by

Mx,2,1=sup𝐯2=1𝔼[|𝐯𝐱|2]=𝔼[|p1/2𝟏p𝐱2|2]=p,M_{x,2,1}=\sup_{\|\mathbf{v}\|_{2}=1}\mathbb{E}\left[|\mathbf{v}^{\top}\mathbf{x}|^{2}\right]=\mathbb{E}\left[|p^{-1/2}\mathbf{1}_{p}^{\top}\mathbf{x}_{2}|^{2}\right]=p, (4.16)

which diverges as the dimension pp increases. However, when δp1/2\delta\leq p^{-1/2}, the local second moment with radius δ\delta, defined as

Mx,2,δ=max{sup𝐯𝒱(𝐮,δ)𝔼[|𝐯𝐱|2],max1j2p𝔼[|𝐜j𝐱|2]}M_{x,2,\delta}=\max\left\{\sup_{\mathbf{v}\in\mathcal{V}(\mathbf{u}^{*},\delta)}\mathbb{E}\left[|\mathbf{v}^{\top}\mathbf{x}|^{2}\right],\max_{1\leq j\leq 2p}\mathbb{E}\left[|\mathbf{c}_{j}^{\top}\mathbf{x}|^{2}\right]\right\} (4.17)

remains bounded and is not greater than 2. This example illustrates that if the radius of the local region can be sufficiently controlled, the local moment bound can be significantly smaller than the global moment. As a result, the error rate associated with the local moment can be much sharper, leading to more precise theoretical results.
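A quick Monte-Carlo sketch of this example (with a hypothetical dimension p and sample size of our own choosing) illustrates the gap between the global and local second moments:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 50, 100_000

Y = rng.standard_normal((n, p + 1))
# x = (x1, x2): x1 has i.i.d. N(0,1) entries, x2 repeats Y_{p+1} p times.
x = np.concatenate([Y[:, :p], np.repeat(Y[:, [p]], p, axis=1)], axis=1)

u_star = np.concatenate([np.full(p, p ** -0.5), np.zeros(p)])   # ground-truth direction
v_worst = np.concatenate([np.zeros(p), np.full(p, p ** -0.5)])  # attains the global moment

print(np.mean((x @ v_worst) ** 2))   # approximately p (global second moment)
print(np.mean((x @ u_star) ** 2))    # approximately 1 (local direction around u*)
```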

Denote the estimator obtained by the robust gradient descent algorithm with gradient truncation parameter τ\tau as 𝓐^(τ)\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau), and the corresponding estimation error by Err(𝓢^,𝐔^1,,𝐔^d)\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d}) as in (3.8). Based on the local bounded moment conditions, we have the following guarantees.

Theorem 4.5.

For tensor linear regression in (4.1), suppose Assumption 4.3 holds with some ϵ(0,1]\epsilon\in(0,1] and δmin{σ¯1/(d+1)Err(0)+κ2αx1σ¯1deff1/2[Meff,δlog(p¯)/n]ϵ/(1+ϵ),1}\delta\geq\min\{\bar{\sigma}^{-1/(d+1)}\sqrt{\textup{Err}^{(0)}}+\kappa^{2}\alpha_{x}^{-1}\bar{\sigma}^{-1}d_{\textup{eff}}^{1/2}[M_{\textup{eff},\delta}\log(\bar{p})/n]^{\epsilon/(1+\epsilon)},1\}. If τσ¯d/(d+1)[nMeff,δ/log(p¯)]1/(1+ϵ)\tau\asymp\bar{\sigma}^{d/(d+1)}[nM_{\textup{eff},\delta}/\log(\bar{p})]^{1/(1+\epsilon)}, n(κ4βx3αx2σ¯)1+ϵlog(p¯)Meff,δ1+κ4βx2αx2log(p¯)n\gtrsim(\kappa^{4}\beta_{x}^{3}\alpha_{x}^{-2}\bar{\sigma})^{1+\epsilon}\log(\bar{p})M_{\textup{eff},\delta}^{-1}+\kappa^{4}\beta_{x}^{2}\alpha_{x}^{-2}\log(\bar{p}), and the conditions in Theorem 3.3 hold with α=αx/2\alpha=\alpha_{x}/2 and β=βx/2\beta=\beta_{x}/2, then with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})), after sufficient iterations of Algorithm 1, we have the following error bounds

Err(𝓢^,𝐔^1,,𝐔^d)κ4αx2σ¯2d/(d+1)deff[Meff,δ1/ϵlog(p¯)n]2ϵ/(1+ϵ)\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d})\lesssim\kappa^{4}\alpha_{x}^{-2}\bar{\sigma}^{-2d/(d+1)}d_{\textup{eff}}\left[\frac{M_{\textup{eff},\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{2\epsilon/(1+\epsilon)} (4.18)

and

𝓐^(τ)𝓐Fκ2αx1deff1/2[Meff,δ1/ϵlog(p¯)n]ϵ/(1+ϵ).\|\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau)-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}\lesssim\ \kappa^{2}\alpha_{x}^{-1}d_{\textup{eff}}^{1/2}\left[\frac{M_{\textup{eff},\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}. (4.19)

Theorem 4.5 shows that, under certain regularity conditions, the error rates in (4.18) and (4.19) are proportional to deffMeff,δ2/(1+ϵ)[log(p¯)/n]2ϵ/(1+ϵ)d_{\textup{eff}}M_{\textup{eff},\delta}^{2/(1+\epsilon)}[\log(\bar{p})/n]^{2\epsilon/(1+\epsilon)} and deff1/2Meff,δ1/(1+ϵ)[log(p¯)/n]ϵ/(1+ϵ)d_{\textup{eff}}^{1/2}M_{\textup{eff},\delta}^{1/(1+\epsilon)}[\log(\bar{p})/n]^{\epsilon/(1+\epsilon)}, respectively. The results essentially depend on the rate of the truncation parameter τ\tau, whose optimal value is proportional to [nMeff,δ/log(p¯)]1/(1+ϵ)[nM_{\text{eff},\delta}/\log(\bar{p})]^{1/(1+\epsilon)}. The convergence rates exhibit a smooth phase transition. When ϵ=1\epsilon=1, i.e., when 𝓔i\mbox{\boldmath$\mathscr{E}$}_{i} has a finite second local moment, the upper bound in (4.19) matches the result under the Gaussian distributional assumption for low-rank matrix regression (Negahban and Wainwright, 2011) and tensor regression (Han, Willett and Zhang, 2022; Zhang et al., 2020). When 0<ϵ<10<\epsilon<1, the convergence rate in (4.19) slows from deffMeff,δ1/ϵ[log(p¯)/n]1/2\sqrt{d_{\text{eff}}}M_{\text{eff},\delta}^{1/\epsilon}[\log(\bar{p})/n]^{1/2} to deffMeff,δ1/ϵ[log(p¯)/n]ϵ/(1+ϵ)\sqrt{d_{\text{eff}}}M_{\text{eff},\delta}^{1/\epsilon}[\log(\bar{p})/n]^{\epsilon/(1+\epsilon)}. Furthermore, the parameter δ\delta controls the radius of the local region. If nn is sufficiently large and the initial error Err(0)<σ¯2/(d+1)\text{Err}^{(0)}<\bar{\sigma}^{2/(d+1)}, then the local radius δ\delta is smaller than 1, and the convergence rates are governed by the local moment bounds.
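For intuition, the order of the bound in (4.19) can be tabulated as a function of the moment order; the helper below is a hypothetical illustration of ours with all constants suppressed, not a quantity computed in the paper.

```python
import numpy as np

def frob_rate(n, p_bar, eps, M_eff, d_eff, kappa=1.0, alpha_x=1.0):
    """Order of the Frobenius-norm bound in (4.19), constants suppressed."""
    return kappa**2 / alpha_x * np.sqrt(d_eff) * (
        M_eff ** (1.0 / eps) * np.log(p_bar) / n) ** (eps / (1.0 + eps))

# Phase transition: the exponent eps / (1 + eps) decreases from 1/2 as eps drops below 1.
for eps in (1.0, 0.5, 0.2):
    print(eps, frob_rate(n=5000, p_bar=100, eps=eps, M_eff=1.0, d_eff=60))
```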

Remark 4.6.

The convergence rate phase transition with respect to the moment order is also observed in heavy-tailed vector regression (Sun, Zhou and Fan, 2020), matrix regression (Tan, Sun and Witten, 2023), and time series autoregression (Wang and Tsay, 2023). For d=1d=1 and 2, the convergence rates obtained in Theorem 4.5 for ϵ(0,1]\epsilon\in(0,1] are minimax rate-optimal for vector-valued and matrix-valued regression models (Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023). For d3d\geq 3, the rate in (4.19) with ϵ=1\epsilon=1 is also minimax rate-optimal (Raskutti, Yuan and Chen, 2019).

Remark 4.7.

We highlight two key differences between our theoretical results and those in the existing literature on heavy-tailed regression models. First, we relax the distributional condition on the covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i}. In high-dimensional Huber regression literature, the covariates are typically assumed to be sub-Gaussian or bounded (Fan, Li and Wang, 2017; Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023; Shen et al., 2023). In contrast, the proposed method, based on gradient robustification, can handle both heavy-tailed covariates and noise. Second, our analysis relies on local moment conditions, making our results potentially much sharper than those based on global moment conditions. Numerical results which verify the advantages of our method over competing methods are provided in Section 5.

4.2 Heavy-tailed tensor logistic regression

For the generalized linear model, conditioned on the tensor covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i}, the response variable yiy_{i} follows the distribution

(yi|𝓧i)exp{yi𝓧i,𝓐Φ(𝓧i,𝓐)c(γ)},i=1,2,,n.\mathbb{P}(y_{i}|\mbox{\boldmath$\mathscr{X}$}_{i})\propto\exp\left\{\frac{y_{i}\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle-\Phi(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{c(\gamma)}\right\},~{}~{}i=1,2,\dots,n. (4.20)

Consequently, the loss function is given by

¯(𝓐;zi)=Φ(𝓧i,𝓐)yi𝓧i,𝓐.\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})=\Phi(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)-y_{i}\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle. (4.21)

A widely studied instance of the generalized linear models is logistic regression, where Φ(t)=log(1+exp(t))\Phi(t)=\log(1+\exp(t)). In this case, the gradient of the loss function becomes

¯(𝓐;zi)=(exp(𝓧i,𝓐)1+exp(𝓧i,𝓐)yi)𝓧i.\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})=\left(\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)}-y_{i}\right)\mbox{\boldmath$\mathscr{X}$}_{i}. (4.22)

Here, yiy_{i} is the binary variable with probability

(yi=1|𝓧i)=exp(𝓧i,𝓐)1+exp(𝓧i,𝓐)and(yi=0|𝓧i)=1exp(𝓧i,𝓐)1+exp(𝓧i,𝓐).\mathbb{P}(y_{i}=1|\mbox{\boldmath$\mathscr{X}$}_{i})=\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}~{}~{}\text{and}~{}~{}\mathbb{P}(y_{i}=0|\mbox{\boldmath$\mathscr{X}$}_{i})=1-\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}. (4.23)
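Since the residual term in (4.22) is bounded between -1 and 1, heavy tails enter the gradient only through 𝓧i. The following is a minimal sketch of an entrywise-truncated version of this gradient; the function names are ours and the sketch is a simplification relative to the factorized updates used by the algorithm.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def truncated_logistic_gradient(X, y, A, tau):
    """Entrywise-truncated average of the per-sample gradients in (4.22)."""
    n = X.shape[0]
    margin = np.tensordot(X, A, axes=A.ndim)            # <X_i, A>
    resid = sigmoid(margin) - y                          # bounded in (-1, 1)
    per_sample = X * resid.reshape((n,) + (1,) * A.ndim)
    per_sample = np.sign(per_sample) * np.minimum(np.abs(per_sample), tau)
    return per_sample.mean(axis=0)
```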

In the robust estimation literature for logistic regression, since yiy_{i} is a binary variable, it is often assumed that the covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follows a heavy-tailed distribution. For example, the finite fourth moment condition has been considered for vector-valued covariates in the context of heavy-tailed logistic regression (Prasad et al., 2020; Zhu and Zhou, 2021; Han, Tsay and Wu, 2023). Tensor logistic regression has been widely applied to classification tasks in neuroimaging analysis (see e.g. Zhou, Li and Zhu, 2013; Li et al., 2018; Wu et al., 2022). Recent empirical studies have shown that neuroimaging data follow heavy-tailed distributions (Beggs and Plenz, 2003; Friedman et al., 2012; Roberts, Boonstra and Breakspear, 2015). Consequently, there is a growing practical need to study robust estimation methods for tensor logistic regression.

Given the low-rank structure, the partial gradients of the logistic loss function are

𝐔k(𝓢,𝐔1,,𝐔d;zi)=(exp(𝓧i×j=1d𝐔j,𝓢)1+exp(𝓧i×j=1d𝐔j,𝓢)yi)(𝓧i×j=1,jkd𝐔j)(k)𝓢(k),\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})=\left(\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\mbox{\boldmath$\mathscr{S}$}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\mbox{\boldmath$\mathscr{S}$}\rangle)}-y_{i}\right)(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top})_{(k)}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}, (4.24)

for 1kd1\leq k\leq d, and

𝓢(𝓢,𝐔1,,𝐔d;zi)=(exp(𝓧i×j=1d𝐔j,𝓢)1+exp(𝓧i×j=1d𝐔j,𝓢)yi)(𝓧i×j=1d𝐔j).\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};z_{i})=\left(\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\mbox{\boldmath$\mathscr{S}$}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\mbox{\boldmath$\mathscr{S}$}\rangle)}-y_{i}\right)(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}). (4.25)

As in tensor linear regression, the partial gradients for tensor logistic regression involve the multilinear transformations of the covariates, namely 𝓧i×j=1d𝐔j\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top} and 𝓧i×j=1,jkd𝐔j\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top}. Therefore, to derive the optimal convergence rate, it is essential to characterize the distributional properties of the transformed covariates. In this work, we assume that the covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} has a finite second moment and introduce the concept of the local second moment bound to obtain a sharp statistical convergence rate.

Assumption 4.8.

The vectorized covariate vec(𝓧i)\textup{vec}(\mbox{\boldmath$\mathscr{X}$}_{i}) has mean 𝟎\mathbf{0} and a positive definite covariance matrix 𝚺x\mathbf{\Sigma}_{x} satisfying 0<αxλmin(𝚺x)λmax(𝚺x)βx0<\alpha_{x}\leq\lambda_{\min}(\mathbf{\Sigma}_{x})\leq\lambda_{\max}(\mathbf{\Sigma}_{x})\leq\beta_{x}. Additionally, for some δ[0,1]\delta\in[0,1], the covariate 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} satisfies the local second moment bound

Mx,2,δ:=max1kd(sup𝐯j𝒱(𝐔j,δ)𝔼[|𝓧i×j=1d𝐯j|2],sup𝐯j𝒱(𝐔j,δ)1lpk𝔼[|𝓧i×j=1,jkd𝐯j×k𝐜l|2]).\begin{split}&M_{x,2,\delta}\\ &:=\max_{1\leq k\leq d}\left(\sup_{\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{v}_{j}^{\top}\right|^{2}\right],\sup_{\begin{subarray}{c}\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)\\ 1\leq l\leq p_{k}\end{subarray}}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d}\mathbf{v}_{j}^{\top}\times_{k}\mathbf{c}_{l}^{\top}\right|^{2}\right]\right).\end{split} (4.26)

By definition, it is clear that Mx,2,δM_{x,2,\delta} could be smaller, and in many cases much smaller, than βx\beta_{x}. However, the statistical convergence rate of the estimator is governed by the local moment bound Mx,2,δM_{x,2,\delta}, while βx\beta_{x} plays a role in determining the sample size requirement and the computational convergence rate. Specifically, the statistical guarantees for the robust estimator with truncation parameter τ\tau, denoted as 𝓐^(τ)\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau), depend on Mx,2,δM_{x,2,\delta}.

Theorem 4.9.

For low-rank tensor logistic regression, suppose Assumption 4.8 holds with some δmin{σ¯1/(d+1)Err(0)+Cκ2αx1σ¯1deffMx,2,δlog(p¯)/n,1}\delta\geq\min\{\bar{\sigma}^{-1/(d+1)}\sqrt{\textup{Err}^{(0)}}+C\kappa^{2}\alpha_{x}^{-1}\bar{\sigma}^{-1}\sqrt{d_{\textup{eff}}M_{x,2,\delta}\log(\bar{p})/n},1\}. If nκ4βx2αx2log(p¯)n\gtrsim\kappa^{4}\beta_{x}^{2}\alpha_{x}^{-2}\log(\bar{p}), τσ¯d/(d+1)[nMx,2,δ/log(p¯)]1/2\tau\asymp\bar{\sigma}^{d/(d+1)}[nM_{x,2,\delta}/\log(\bar{p})]^{1/2}, and the other conditions in Theorem 3.3 hold with α=αx/2\alpha=\alpha_{x}/2 and β=βx/2\beta=\beta_{x}/2, then with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})), after sufficient iterations of Algorithm 1, we have the following error bounds

Err(𝓢^,𝐔^1,,𝐔^d)κ4αx2σ¯2d/(d+1)deffMx,2,δlog(p¯)/n\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d})\lesssim\kappa^{4}\alpha_{x}^{-2}\bar{\sigma}^{-2d/(d+1)}d_{\textup{eff}}M_{x,2,\delta}\log(\bar{p})/n (4.27)

and

𝓐^(τ)𝓐Fκ2αx1deff1/2Mx,2,δlog(p¯)/n.\|\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau)-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}\lesssim\kappa^{2}\alpha_{x}^{-1}d_{\textup{eff}}^{1/2}\sqrt{M_{x,2,\delta}\log(\bar{p})/n}. (4.28)

Theorem 4.9 provides the statistical convergence rates for heavy-tailed tensor logistic regression under the local second moment condition on the covariates. Importantly, the convergence rate of the robust gradient descent is shown to match the rate for low-rank tensor logistic regression when using the vanilla gradient descent algorithm with Gaussian design (Chen, Raskutti and Yuan, 2019). This demonstrates that our method can effectively handle heavy-tailed covariates, ensuring robust and optimal estimation in such settings. Similar to tensor linear regression, if the initial error satisfies Err(0)σ¯2/(d+1)\textup{Err}^{(0)}\leq\bar{\sigma}^{2/(d+1)} and the sample size nn is sufficiently large, then the local radius δ\delta is smaller than one. Under these conditions, the statistical convergence rates depend on the local second moment of the covariates, which governs the precision of the estimator.

Remark 4.10.

Compared to existing works on robust estimation for heavy-tailed logistic regression with vector covariates (Prasad et al., 2020; Zhu and Zhou, 2021; Han, Tsay and Wu, 2023), our proposed method offers a significant relaxation by replacing the finite fourth moment condition with a second moment condition. This relaxation makes the method more flexible and applicable to a broader range of scenarios, particularly when high-order moments (such as the fourth moment) do not exist or are difficult to ensure.

Remark 4.11.

The low-rank structure imposed on the tensor coefficient allows our approach to leverage the local second moment condition. This results in a much sharper convergence rate, compared to previous methods that typically do not exploit such structural properties. The local moment condition, which governs the behavior of the covariates in the partial gradients, contributes to a more precise characterization of gradient deviation, leading to better statistical guarantees.

4.3 Heavy-tailed tensor PCA

Another important statistical model for tensor data is tensor principal component analysis (PCA). Specifically, we consider

𝓨=𝓐+𝓔=𝓢×j=1d𝐔j+𝓔,\mbox{\boldmath$\mathscr{Y}$}=\mbox{\boldmath$\mathscr{A}$}^{*}+\mbox{\boldmath$\mathscr{E}$}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{U}_{j}^{*}+\mbox{\boldmath$\mathscr{E}$}, (4.29)

where 𝓨p1××pd\mbox{\boldmath$\mathscr{Y}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}} is the tensor-valued observation, 𝓐=𝓢×j=1d𝐔j\mbox{\boldmath$\mathscr{A}$}^{*}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{U}_{j}^{*} is the low-rank true signal, and 𝓔\mathscr{E} represents the random noise, assumed to be mean-zero. In the existing literature, much attention has been focused on the Gaussian or sub-Gaussian noise settings (Richard and Montanari, 2014; Zhang and Han, 2019; Han, Willett and Zhang, 2022).

In this subsection, we assume that random noise 𝓔\mathscr{E} is heavy-tailed. We propose applying the robust gradient descent method with truncated gradient estimators to estimate the low-rank signal in the presence of such heavy-tailed noise. The loss function for tensor PCA is (𝓢,𝐔1,,𝐔d;𝓨)=𝓨𝓢×j=1d𝐔jF2/2\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d};\mbox{\boldmath$\mathscr{Y}$})=\|\mbox{\boldmath$\mathscr{Y}$}-\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}\|_{\text{F}}^{2}/2. The partial gradient with respect to 𝐔k\mathbf{U}_{k} is

𝐔k(𝓢,𝐔1,,𝐔d)=(𝓢×j=1,jkd𝐔j𝐔j×k𝐔k𝓨×j=1,jkd𝐔j)(k)𝓢(k)\nabla_{\mathbf{U}_{k}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})=(\mbox{\boldmath$\mathscr{S}$}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j}\times_{k}\mathbf{U}_{k}-\mbox{\boldmath$\mathscr{Y}$}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top})_{(k)}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} (4.30)

for any k=1,2,,dk=1,2,\dots,d. Similarly, the partial gradient with respect to the core tensor 𝓢\mathscr{S} is

𝓢(𝓢,𝐔1,,𝐔d)=𝓢×j=1d𝐔j𝐔j𝓨×j=1d𝐔j.\nabla_{\scalebox{0.7}{\mbox{\boldmath$\mathscr{S}$}}}\mathcal{L}(\mbox{\boldmath$\mathscr{S}$},\mathbf{U}_{1},\dots,\mathbf{U}_{d})=\mbox{\boldmath$\mathscr{S}$}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}\mathbf{U}_{j}-\mbox{\boldmath$\mathscr{Y}$}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}. (4.31)
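A minimal numpy sketch of the two partial gradients (4.30)-(4.31) is given below; the helper names are ours, and the gradients are written before any truncation is applied.

```python
import numpy as np

def mode_product(T, M, mode):
    """Mode-`mode` product of tensor T with matrix M of shape (r, p_mode)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

def unfold(T, mode):
    """Mode-`mode` matricization of T."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def pca_gradients(Y, S, U_list):
    """Untruncated partial gradients (4.30)-(4.31) of the tensor PCA loss."""
    d = len(U_list)
    # Gradient w.r.t. the core: S x_j U_j^T U_j - Y x_j U_j^T.
    S_term, Y_term = S, Y
    for j, U in enumerate(U_list):
        S_term = mode_product(S_term, U.T @ U, j)
        Y_term = mode_product(Y_term, U.T, j)
    grad_S = S_term - Y_term

    grad_U = []
    for k in range(d):
        A_term, B_term = S, Y
        for j, U in enumerate(U_list):
            if j == k:
                A_term = mode_product(A_term, U, k)          # mode k: r_k -> p_k
            else:
                A_term = mode_product(A_term, U.T @ U, j)
                B_term = mode_product(B_term, U.T, j)
        diff = A_term - B_term
        grad_U.append(unfold(diff, k) @ unfold(S, k).T)      # shape (p_k, r_k)
    return grad_S, grad_U
```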

These gradients are computed using multilinear transformations of the data, specifically 𝓨×j=1,jkd𝐔j\mbox{\boldmath$\mathscr{Y}$}\times_{j=1,j\neq k}^{d}\mathbf{U}_{j}^{\top} and 𝓨×j=1d𝐔j\mbox{\boldmath$\mathscr{Y}$}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}, which are the data projected onto the respective factor spaces. To ensure robustness against heavy-tailed noise, we introduce the local (1+ϵ)(1+\epsilon)-th moment condition for the noise tensor 𝓔\mathscr{E}. The condition is necessary for the gradient-based optimization method to converge effectively in the presence of heavy-tailed noise.

Assumption 4.12.

For some ϵ(0,1]\epsilon\in(0,1] and δ[0,1]\delta\in[0,1], 𝓔\mathscr{E} has the local (1+ϵ)(1+\epsilon)-th moment

Me,1+ϵ,δ:=max1kd(sup𝐯j𝒱(𝐔j,δ)𝔼[|𝓔×j=1d𝐯j|1+ϵ],sup𝐯j𝒱(𝐔j,δ)1lpk𝔼[|𝓔×j=1,jkd𝐯j×k𝐜l|1+ϵ]).\begin{split}&M_{e,1+\epsilon,\delta}\\ &:=\max_{1\leq k\leq d}\left(\sup_{\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}\times_{j=1}^{d}\mathbf{v}_{j}^{\top}\right|^{1+\epsilon}\right],\sup_{\begin{subarray}{c}\mathbf{v}_{j}\in\mathcal{V}(\mathbf{U}_{j}^{*},\delta)\\ 1\leq l\leq p_{k}\end{subarray}}\mathbb{E}\left[\left|\mbox{\boldmath$\mathscr{E}$}\times_{j=1,j\neq k}^{d}\mathbf{v}_{j}^{\top}\times_{k}\mathbf{c}_{l}^{\top}\right|^{1+\epsilon}\right]\right).\end{split} (4.32)

In contrast to many existing statistical analyses for tensor PCA, our method does not require the entries of the random noise 𝓔\mathscr{E} to be independent or identically distributed. This relaxation is a key feature of our approach, allowing it to handle more general noise structures, including those with dependencies and heavy tails.

For the estimator obtained by the robust gradient descent, denoted as 𝓐^(τ)\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau), as well as the estimation error Err(𝓢^,𝐔^1,,𝐔^d)\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d}), we have the following convergence rates.

Theorem 4.13.

For tensor PCA in (4.29), suppose Assumption 4.12 holds with some ϵ(0,1]\epsilon\in(0,1] and δmin(σ¯1/(d+1)Err(0)+Cσ¯1deff1/2Me,1+ϵ,δ1/(1+ϵ),1)\delta\geq{\min}(\bar{\sigma}^{-1/(d+1)}\sqrt{\textup{Err}^{(0)}}+C\bar{\sigma}^{-1}d_{\textup{eff}}^{1/2}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)},1). If σ¯/Me,1+ϵ,δ1/(1+ϵ)p¯\underline{\sigma}/M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\gtrsim\sqrt{\bar{p}}, τκ2/ϵσ¯d/(d+1)Me,1+ϵ,δ1/(1+ϵ)\tau\asymp\kappa^{2/\epsilon}\bar{\sigma}^{d/(d+1)}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}, and other conditions in Theorem 3.3 hold with α=β=1/2\alpha=\beta=1/2, then with probability at least 1Cexp(Cp¯)1-C\exp(-C\bar{p}), after sufficient iterations of Algorithm 1, we have the following error bounds

Err(𝓢^,𝐔^1,,𝐔^d)σ¯2d/(d+1)deffMe,1+ϵ,δ2/(1+ϵ)\textup{Err}(\widehat{\mbox{\boldmath$\mathscr{S}$}},\widehat{\mathbf{U}}_{1},\dots,\widehat{\mathbf{U}}_{d})\lesssim\bar{\sigma}^{-2d/(d+1)}d_{\textup{eff}}M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)} (4.33)

and

𝓐^(τ)𝓐Fdeff1/2Me,1+ϵ,δ1/(1+ϵ).\|\mbox{\boldmath$\mathscr{\widehat{A}}$}(\tau)-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}\lesssim d_{\textup{eff}}^{1/2}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}. (4.34)

Under the local (1+ϵ)(1+\epsilon)-th moment condition for the noise tensor 𝓔\mathscr{E}, the convergence rate of the proposed robust gradient descent method is shown to be comparable to that of vanilla gradient descent and achieves minimax optimality (Han, Willett and Zhang, 2022). Specifically, for ϵ=1\epsilon=1, the signal-to-noise ratio (SNR) condition σ¯/Me,1+ϵ,δ1/2p¯\underline{\sigma}/M^{1/2}_{e,1+\epsilon,\delta}\gtrsim\sqrt{\bar{p}} is identical to the SNR condition under the sub-Gaussian noise setting (Zhang and Xia, 2018). This result demonstrates that the robust gradient descent method is capable of effectively handling heavy-tailed noise, while still achieving minimax-optimal estimation performance. Furthermore, in a manner similar to tensor linear regression and logistic regression, if the signal strength satisfies σ¯p¯\bar{\sigma}\gtrsim\sqrt{\bar{p}} and the initial error is sufficiently small, i.e., Err(0)<σ¯2/(d+1)\textup{Err}^{(0)}<\bar{\sigma}^{2/(d+1)}, the local radius δ\delta is strictly smaller than one.

Remark 4.14.

The proposed robust gradient estimation approach offers several key advantages. First, unlike many existing methods that assume the noise follows a sub-Gaussian distribution, our approach relaxes this assumption by requiring only a (1+ϵ)(1+\epsilon)-th moment condition on the noise tensor 𝓔\mathscr{E}. This broader assumption allows the method to handle more general noise distributions, including those with heavier tails. Second, our method also accommodates the possibility that the elements of the noise tensor 𝓔\mathscr{E} may exhibit strong correlations. However, it is important to note that the correlation structure does not directly affect the estimation performance. Instead, the noise that influences the estimation is only the part projected onto the local regions, characterized by Me,1+ϵ,δM_{e,1+\epsilon,\delta}. This localized effect allows the method to remain robust even when the global correlation in the noise is strong. Third, we allow Me,1+ϵ,δM_{e,1+\epsilon,\delta} to diverge to infinity, meaning that the method can tolerate increasingly larger noise values in certain regions. However, for reliable estimation performance, the SNR must satisfy the condition σ¯/Me,1+ϵ,δ1/2p¯\underline{\sigma}/M_{e,1+\epsilon,\delta}^{1/2}\gtrsim\sqrt{\bar{p}}. This condition ensures that, even as the noise may increase in magnitude, the method still achieves optimal estimation performance as long as the signal is sufficiently strong.

5 Simulation Experiments

In this section, we present two simulation experiments. The first experiment is designed to verify the efficacy of the proposed method and highlight its advantages over other competitors. By comparing the performance of the proposed approach with that of alternative estimators, we demonstrate its superior robustness and accuracy. The second one aims to demonstrate the local moment effect in statistical convergence rates. Specifically, this experiment examines how the relaxation of the sub-Gaussian assumption and the use of local moment conditions influence the convergence behavior, providing empirical evidence for the theoretical results.

5.1 Experiment 1

We consider four models, and for each of them, we explore how different covariate and noise distributions affect the performance across diverse scenarios.

  • Model I (multi-response linear regression):

    𝐲i=(𝐀)𝐱i+𝐞i,i=1,,500,\mathbf{y}_{i}=(\mathbf{A}^{*})^{\top}\mathbf{x}_{i}+\mathbf{e}_{i},~{}~{}i=1,\dots,500, (5.1)

    where 𝐱i15\mathbf{x}_{i}\in\mathbb{R}^{15}, 𝐲i,𝐞i10\mathbf{y}_{i},\mathbf{e}_{i}\in\mathbb{R}^{10}, and 𝐀15×10\mathbf{A}^{*}\in\mathbb{R}^{15\times 10} is a rank-3 matrix. Entries of 𝐱i\mathbf{x}_{i} and 𝐞i\mathbf{e}_{i} are i.i.d., and four distributional cases are adopted for model I: (1) xi,jN(0,1)x_{i,j}\sim N(0,1) and ei,kN(0,1)e_{i,k}\sim N(0,1); (2) xi,jN(0,1)x_{i,j}\sim N(0,1) and ei,kt1.5e_{i,k}\sim t_{1.5}; (3) xi,jt~1.5(20)x_{i,j}\sim\widetilde{t}_{1.5}(20) and ei,kN(0,1)e_{i,k}\sim N(0,1); and (4) xi,jt~1.5(20)x_{i,j}\sim\widetilde{t}_{1.5}(20) and ei,kt1.5e_{i,k}\sim t_{1.5}. Here, t~ν(τ)\widetilde{t}_{\nu}(\tau) is the bounded tt distribution with ν\nu degrees of freedom and bound τ\tau, i.e., Xt~ν(τ)X\sim\widetilde{t}_{\nu}(\tau) if X=T(Y,τ)X=\text{T}(Y,\tau) where YtνY\sim t_{\nu}.

  • Model II (tensor linear regression):

    yi=𝓐,𝓧i+ei,i=1,,500,y_{i}=\langle\mbox{\boldmath$\mathscr{A}$}^{*},\mbox{\boldmath$\mathscr{X}$}_{i}\rangle+e_{i},~{}~{}i=1,\dots,500, (5.2)

    where 𝓧i12×10×8\mbox{\boldmath$\mathscr{X}$}_{i}\in\mathbb{R}^{12\times 10\times 8}, yi,eiy_{i},e_{i}\in\mathbb{R}, and 𝓐12×10×8\mbox{\boldmath$\mathscr{A}$}^{*}\in\mathbb{R}^{12\times 10\times 8} has Tucker ranks (2,2,2)(2,2,2). Entries of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} are i.i.d., and four distributional cases are adopted: (1) (𝓧i)j1j2j3N(0,1)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim N(0,1) and eiN(0,1)e_{i}\sim N(0,1); (2) (𝓧i)j1j2j3N(0,1)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim N(0,1) and eit1.5e_{i}\sim t_{1.5}; (3) (𝓧i)j1j2j3t~1.5(20)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim\widetilde{t}_{1.5}(20) and eiN(0,1)e_{i}\sim N(0,1); and (4) (𝓧i)j1j2j3t~1.5(20)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim\widetilde{t}_{1.5}(20) and eit1.5e_{i}\sim t_{1.5}.

  • Model III (tensor logistic regression):

    (yi=1|𝓧i)=exp(𝓧i,𝓐)1+exp(𝓧i,𝓐),i=1,,500,\mathbb{P}(y_{i}=1|\mbox{\boldmath$\mathscr{X}$}_{i})=\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)},~{}~{}i=1,\dots,500, (5.3)

    where 𝓧i12×10×8\mbox{\boldmath$\mathscr{X}$}_{i}\in\mathbb{R}^{12\times 10\times 8}, yi{0,1}y_{i}\in\{0,1\}, and 𝓐12×10×8\mbox{\boldmath$\mathscr{A}$}^{*}\in\mathbb{R}^{12\times 10\times 8} has Tucker ranks (2,2,2)(2,2,2). Entries of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} are i.i.d., and two distributional cases are adopted: (1) (𝓧i)j1j2j3N(0,1)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim N(0,1) and (2) (𝓧i)j1j2j3t~1.5(20)(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}}\sim\widetilde{t}_{1.5}(20).

  • Model IV (tensor PCA): 𝓨=𝓐+𝓔,\mbox{\boldmath$\mathscr{Y}$}=\mbox{\boldmath$\mathscr{A}$}^{*}+\mbox{\boldmath$\mathscr{E}$}, where 𝓨,𝓔12×10×8\mbox{\boldmath$\mathscr{Y}$},\mbox{\boldmath$\mathscr{E}$}\in\mathbb{R}^{12\times 10\times 8}, and 𝓐12×10×8\mbox{\boldmath$\mathscr{A}$}^{*}\in\mathbb{R}^{12\times 10\times 8} has Tucker ranks (2,2,2)(2,2,2). Entries of 𝓔\mathscr{E} are i.i.d. and two distributional cases are adopted: (1) 𝓔j1j2j3N(0,1)\mbox{\boldmath$\mathscr{E}$}_{j_{1}j_{2}j_{3}}\sim N(0,1) and (2) 𝓔j1j2j3t1.5\mbox{\boldmath$\mathscr{E}$}_{j_{1}j_{2}j_{3}}\sim t_{1.5}.

For Model I, 𝐀\mathbf{A}^{*} has singular values (10,8,6)(10,8,6), and for Models II-IV, the Tucker decomposition of 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} in (1.1) has a super-diagonal core tensor 𝓢=diag(10,8)\mbox{\boldmath$\mathscr{S}$}=\text{diag}(10,8). For all models, the factor matrices 𝐔j\mathbf{U}_{j}’s are generated randomly as the first rjr_{j} left singular vectors of a random Gaussian ensemble. The first distributional case of each model represents the light-tailed setting, while the others are heavy-tailed, as at least part of the data follows a heavy-tailed distribution.
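To make the simulation settings concrete, the following is a minimal sketch of the data-generating mechanism for Model II, Case 4; the helper names are ours, and the exact factor-generation details in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

def bounded_t(size, df=1.5, bound=20.0):
    """Bounded t distribution: truncate Y ~ t_df entrywise at +/- bound."""
    y = rng.standard_t(df, size=size)
    return np.sign(y) * np.minimum(np.abs(y), bound)

def random_tucker(dims=(12, 10, 8), ranks=(2, 2, 2), diag=(10.0, 8.0)):
    """Low-Tucker-rank coefficient tensor with super-diagonal core diag(10, 8)."""
    S = np.zeros(ranks)
    for i, s in enumerate(diag):
        S[i, i, i] = s
    A = S
    for k, (p, r) in enumerate(zip(dims, ranks)):
        U = np.linalg.qr(rng.standard_normal((p, r)))[0]     # orthonormal p x r factor
        A = np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)
    return A

# Model II, Case 4: heavy-tailed covariates and heavy-tailed noise.
n, dims = 500, (12, 10, 8)
A_star = random_tucker(dims)
X = bounded_t((n,) + dims)
y = np.tensordot(X, A_star, axes=3) + rng.standard_t(1.5, size=n)
```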

We apply the proposed robust gradient descent algorithm, as well as the vanilla gradient descent and adaptive Huber regression in Remark 4.2 (except for Model III) as competitors, to the data generated from each model. The initial value of 𝓐\mathscr{A} is obtained by Algorithm 2, and HOSVD is applied to obtain 𝓢(0),𝐔1(0),,𝐔d(0)\mbox{\boldmath$\mathscr{S}$}^{(0)},\mathbf{U}_{1}^{(0)},\dots,\mathbf{U}_{d}^{(0)}. The hyperparameters are set as follows: a=b=1a=b=1, step size η=103\eta=10^{-3}, number of iterations T=200T=200, and the truncation parameter τ\tau is selected via five-fold cross-validation. For each model and distributional setting, the data generation and estimation procedure is replicated 500 times. The estimation errors across these 500 replications are presented in Figure 2, which summarizes the results of the experiment.

Figure 2: Estimation errors in logarithm (upper bar for the 75th percentile, dot for the median, and lower bar for the 25th percentile) of the three estimation methods for all models and cases.

According to Figure 2, when the data follow the light-tailed distribution (Case 1) in all models, the performances of the three estimation methods are nearly identical. However, in the heavy-tailed cases, the performance of vanilla gradient descent deteriorates significantly, with estimation errors much larger than those of the other two methods. Overall, the robust gradient descent method consistently yields the smallest estimation errors among the three methods. Specifically, when the covariates follow heavy-tailed distributions (i.e., Cases 3 and 4 for Models I and II), the robust gradient descent method outperforms adaptive Huber regression, producing significantly smaller estimation errors. When the covariates are normally distributed and the noise follows a heavy-tailed distribution (i.e., Case 2 for Models I and II), the performances of the robust gradient descent and adaptive Huber regression methods are similar. These numerical findings support the methodological insights presented in Remark 4.2 and are in line with the theoretical results discussed in Remark 4.7, confirming the robustness and efficiency of the proposed method in handling heavy-tailed data.

5.2 Experiment 2

We consider Model I in (5.1) and Model II in (5.2) in the second experiment. To numerically verify the local moment condition, we introduce a row-wise sparsity structure for the factor matrices. Specifically, we define each 𝐔k\mathbf{U}_{k} as 𝐔k=[𝐔~k,𝟎3×(pk5)]\mathbf{U}_{k}=[\widetilde{\mathbf{U}}_{k}^{\top},\mathbf{0}_{3\times(p_{k}-5)}]^{\top}, for k=1,2k=1,2, where 𝐔~k5×3\widetilde{\mathbf{U}}_{k}\in\mathbb{R}^{5\times 3} is a randomly generated matrix with orthonormal columns, following a similar procedure as in the first experiment.
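A minimal sketch of this row-sparse factor construction (the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def row_sparse_factor(p, r=3, active=5):
    """p x r factor with orthonormal columns supported on the first `active` rows."""
    U_tilde = np.linalg.qr(rng.standard_normal((active, r)))[0]   # 5 x 3 block
    return np.vstack([U_tilde, np.zeros((p - active, r))])

U1 = row_sparse_factor(15)   # Model I: p_1 = 15
U2 = row_sparse_factor(10)   # Model I: p_2 = 10
```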

In Model I, we have 𝐔1𝐱i=𝐔~1𝐱~i\mathbf{U}_{1}^{\top}\mathbf{x}_{i}=\widetilde{\mathbf{U}}_{1}^{\top}\widetilde{\mathbf{x}}_{i} and 𝐔2𝐞i=𝐔~1𝐞~i\mathbf{U}_{2}^{\top}\mathbf{e}_{i}=\widetilde{\mathbf{U}}_{1}^{\top}\widetilde{\mathbf{e}}_{i}, where 𝐱~i\widetilde{\mathbf{x}}_{i} and 𝐞~i\widetilde{\mathbf{e}}_{i} are the sub-vectors of 𝐱i\mathbf{x}_{i} and 𝐞i\mathbf{e}_{i}, respectively, containing their first five elements. In other words, if we project the covariates or noise onto a local region around the column spaces of 𝐔1\mathbf{U}_{1}^{*} and 𝐔2\mathbf{U}_{2}^{*}, respectively, the transformed distribution is only related to the first five elements. Hence, we consider the following vectors 𝐱i,15N(0,1)\mathbf{x}_{i,1}\in\mathbb{R}^{5}\sim N(0,1), 𝐱i,25t~1.5(20)\mathbf{x}_{i,2}\in\mathbb{R}^{5}\sim\widetilde{t}_{1.5}(20), 𝐱i,310N(0,1)\mathbf{x}_{i,3}\in\mathbb{R}^{10}\sim N(0,1), 𝐱i,410t~1.5(20)\mathbf{x}_{i,4}\in\mathbb{R}^{10}\sim\widetilde{t}_{1.5}(20), 𝐞i,1,𝐞i,25N(0,1)\mathbf{e}_{i,1},\mathbf{e}_{i,2}\in\mathbb{R}^{5}\sim N(0,1), and 𝐞i,35t1.5\mathbf{e}_{i,3}\in\mathbb{R}^{5}\sim t_{1.5}, where 𝐯D\mathbf{v}\sim D represents that the elements of 𝐯\mathbf{v} are independent and follow the distribution DD. Seven distributional cases are considered: (1) 𝐱i=(𝐱i,1,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,2)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,2}^{\top})^{\top}; (2) 𝐱i=(𝐱i,1,𝐱i,2)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,2}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,2)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,2}^{\top})^{\top}; (3) 𝐱i=(𝐱i,1,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,3)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,3}^{\top})^{\top}; (4) 𝐱i=(𝐱i,1,𝐱i,4)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,4}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,3)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,3}^{\top})^{\top}; (5) 𝐱i=(𝐱i,2,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,2}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,1,𝐞i,2)\mathbf{e}_{i}=(\mathbf{e}_{i,1}^{\top},\mathbf{e}_{i,2}^{\top})^{\top}; (6) 𝐱i=(𝐱i,1,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,1}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,3,𝐞i,1)\mathbf{e}_{i}=(\mathbf{e}_{i,3}^{\top},\mathbf{e}_{i,1}^{\top})^{\top}; and (7) 𝐱i=(𝐱i,2,𝐱i,3)\mathbf{x}_{i}=(\mathbf{x}_{i,2}^{\top},\mathbf{x}_{i,3}^{\top})^{\top}, 𝐞i=(𝐞i,3,𝐞i,1)\mathbf{e}_{i}=(\mathbf{e}_{i,3}^{\top},\mathbf{e}_{i,1}^{\top})^{\top}. By definition, for some sufficiently small δ\delta, the local moments Mx,2,δM_{x,2,\delta} and Me,1+ϵ,δM_{e,1+\epsilon,\delta} remain unchanged in the first four settings even with varying distributions of 𝐱i\mathbf{x}_{i} and 𝐞i\mathbf{e}_{i}. In the last three settings, the local moments of covariates or noise get larger.

In Model II, by the row-wise sparse structure of 𝐔k\mathbf{U}_{k}’s, 𝓧i×j=1d𝐔j=𝓧~i×j=1d𝐔~j\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}=\widetilde{\mbox{\boldmath$\mathscr{X}$}}_{i}\times_{j=1}^{d}\widetilde{\mathbf{U}}_{j}^{\top}, where 𝓧~i5×5×5\widetilde{\mbox{\boldmath$\mathscr{X}$}}_{i}\in\mathbb{R}^{5\times 5\times 5} is sub-tensor of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} containing (𝓧i)j1j2j3(\mbox{\boldmath$\mathscr{X}$}_{i})_{j_{1}j_{2}j_{3}} for 1jk51\leq j_{k}\leq 5 and 1k31\leq k\leq 3. Hence, we consider four distributional cases for 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i}: (1) all elements 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follow N(0,1)N(0,1) distribution; (2) all elements of 𝓧~i\widetilde{\mbox{\boldmath$\mathscr{X}$}}_{i} follow N(0,1)N(0,1) distribution, and the other elements of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follow t~1.5(20)\widetilde{t}_{1.5}(20) distribution; (3) all elements of 𝓧~i\widetilde{\mbox{\boldmath$\mathscr{X}$}}_{i} follow t~1.5(20)\widetilde{t}_{1.5}(20) distribution, and the other elements of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follow N(0,1)N(0,1) distribution; and (4) all elements of 𝓧i\mbox{\boldmath$\mathscr{X}$}_{i} follow t~1.5(20)\widetilde{t}_{1.5}(20) distribution. Similarly to model I, the local moment Mx,2,δM_{x,2,\delta} is the same in cases 1 and 2, and gets large in cases 3 and 4.

Figure 3: Estimation errors (upper bar for the 75th percentile, dot for the median, and lower bar for the 25th percentile) of robust gradient descent for all models and cases.

For each model and case, we apply the proposed robust gradient descent method and replicate the procedure 500 times. The estimation errors over the 500 replications are presented in Figure 3. For Model I, the estimation errors in Cases 1-4 are almost identical, but those in the other cases are significantly larger. Similarly, for Model II, the errors are nearly the same in Cases 1 and 2, and increase substantially in Cases 3 and 4. These numerical findings in the second experiment confirm that the estimation errors are closely tied to the local moment bounds of the covariate and noise distributions, as discussed in Theorem 4.5.

6 Real Data Example: Chest CT Images

In this section, we apply the proposed robust gradient descent (RGD) estimation approach to the publicly available COVID-CT dataset (Yang et al., 2020). The dataset consists of 317 COVID-19 positive chest CT scans and 397 negative scans, selected from four open-access databases. Each scan is a 150×150150\times 150 greyscale matrix with a binary label indicating the disease status. The histograms of kurtosis for the 150×150150\times 150 pixels are presented in Figure 1, which provide strong evidence of heavy-tailed distributions, a characteristic commonly observed in biomedical imaging data.

To identify COVID-positive samples based on chest CT scans, we employ a low-rank tensor logistic regression model with d=2d=2 and p1=p2=150p_{1}=p_{2}=150. For dimension reduction, we use a low-rank structure with r1=r2=5r_{1}=r_{2}=5. As in the first experiment (Section 5), we apply both the proposed robust gradient descent (RGD) algorithm and the vanilla gradient descent (VGD) algorithm to estimate the Tucker decomposition in (1.1). For training, we randomly select 200 positive scans and 250 negative scans, using the remaining data as testing data to evaluate the classification performance of the estimated models.

Using each estimation method, we classify the testing data into four categories: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The performance metrics used for evaluation include: precision rate: P=TP/(TP+FP)P=\text{TP}/(\text{TP}+\text{FP}); recall rate: R=TP/(TP+FN)R=\text{TP}/(\text{TP}+\text{FN}); and F1 score: F1=2/(P1+R1)F_{1}=2/(P^{-1}+R^{-1}). The precision, recall, and F1 scores for the RGD method are reported in Table 1, alongside the performance of the VGD method as a benchmark.
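For reference, a minimal sketch of these evaluation metrics for binary labels in {0, 1} (the helper name is ours):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 score for binary predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 / (1.0 / precision + 1.0 / recall)
    return precision, recall, f1
```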

The results demonstrate that the RGD method outperforms the VGD method in terms of all three evaluation metrics: precision, recall, and F1 score. This highlights the ability of the proposed method to provide reliable and stable statistical inference in real-world applications, especially in the challenging domain of COVID-19 diagnosis from chest CT scans.

Table 1: Classification performance of VGD and RGD on chest CT images
Method Precision Recall F1 Score
VGD 0.898 0.829 0.862
RGD 0.954 0.880 0.916

7 Conclusion and Discussion

We propose a unified and general robust estimation framework for low-rank tensor models, which combines robust gradient estimators via element-wise truncation with gradient descent updates. This method is shown to possess three key properties: computational efficiency, statistical optimality, and distributional robustness. We apply this framework to a variety of tensor models, including tensor linear regression, tensor logistic regression, and tensor PCA. Under a mild local moment condition for the covariates and noise distributions, the proposed method achieves minimax optimal statistical error rates, even in the presence of heavy-tailed distributions.

The proposed robust gradient descent framework can be easily integrated with other types of robust gradient functions, such as median of means and rank-based methods. Furthermore, while we focus on heavy-tailed distributions, the framework can also be adapted to handle other scenarios with outliers, including Huber’s ϵ\epsilon-contamination model and data with measurement errors.

While we have primarily considered Tucker low-rank tensor models in this paper, the proposed method is highly versatile and can be extended to other tensor models with different low-dimensional structures. In addition to Tucker ranks, there are various definitions of tensor ranks and corresponding low-rank tensor models (e.g., canonical polyadic decomposition and matrix product state models) (Kolda and Bader, 2009). Given the broad applicability of gradient descent algorithms, the robust gradient descent approach can be leveraged for these alternative models as well. Moreover, sparsity is often a key component in low-rank tensor models, facilitating dimension reduction and enabling variable selection (Zhang and Han, 2019; Wang et al., 2022). The robust gradient descent method can be utilized for robust sparse tensor decomposition, further enhancing its utility in high-dimensional settings.

In nonconvex optimization, initialization plays a crucial role in determining the convergence rate and the required number of iterations. Specifically, the radius δ\delta in the local moment condition depends on the initialization strategy. Hence, developing computationally efficient and statistically robust initialization methods tailored to each statistical model is of significant interest for future work (Jain et al., 2017).

References

  • Beggs and Plenz (2003) {barticle}[author] \bauthor\bsnmBeggs, \bfnmJohn M\binitsJ. M. and \bauthor\bsnmPlenz, \bfnmDietmar\binitsD. (\byear2003). \btitleNeuronal avalanches in neocortical circuits. \bjournalJournal of Neuroscience \bvolume23 \bpages11167–11177. \endbibitem
  • Bi, Qu and Shen (2018) {barticle}[author] \bauthor\bsnmBi, \bfnmXuan\binitsX., \bauthor\bsnmQu, \bfnmAnnie\binitsA. and \bauthor\bsnmShen, \bfnmXiaotong\binitsX. (\byear2018). \btitleMultilayer tensor factorization with applications to recommender systems. \bjournalAnnals of Statistics \bvolume46 \bpages3308–3333. \endbibitem
  • Bi et al. (2021) {barticle}[author] \bauthor\bsnmBi, \bfnmXuan\binitsX., \bauthor\bsnmTang, \bfnmXiwei\binitsX., \bauthor\bsnmYuan, \bfnmYubai\binitsY., \bauthor\bsnmZhang, \bfnmYanqing\binitsY. and \bauthor\bsnmQu, \bfnmAnnie\binitsA. (\byear2021). \btitleTensors in statistics. \bjournalAnnual Review of Statistics and Its Application \bvolume8 \bpages345–368. \endbibitem
  • Bubeck (2015) {barticle}[author] \bauthor\bsnmBubeck, \bfnmSébastien\binitsS. (\byear2015). \btitleConvex optimization: Algorithms and complexity. \bjournalFoundations and Trends® in Machine Learning \bvolume8 \bpages231–357. \endbibitem
  • Bubeck, Cesa-Bianchi and Lugosi (2013) {barticle}[author] \bauthor\bsnmBubeck, \bfnmSébastien\binitsS., \bauthor\bsnmCesa-Bianchi, \bfnmNicolo\binitsN. and \bauthor\bsnmLugosi, \bfnmGábor\binitsG. (\byear2013). \btitleBandits with heavy tail. \bjournalIEEE Transactions on Information Theory \bvolume59 \bpages7711–7717. \endbibitem
  • Cai et al. (2019) {barticle}[author] \bauthor\bsnmCai, \bfnmChangxiao\binitsC., \bauthor\bsnmLi, \bfnmGen\binitsG., \bauthor\bsnmPoor, \bfnmH Vincent\binitsH. V. and \bauthor\bsnmChen, \bfnmYuxin\binitsY. (\byear2019). \btitleNonconvex low-rank tensor completion from noisy data. \bjournalAdvances in Neural Information Processing Systems \bvolume32. \endbibitem
  • Candes and Plan (2011) {barticle}[author] \bauthor\bsnmCandes, \bfnmEmmanuel J\binitsE. J. and \bauthor\bsnmPlan, \bfnmYaniv\binitsY. (\byear2011). \btitleTight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. \bjournalIEEE Transactions on Information Theory \bvolume57 \bpages2342–2359. \endbibitem
  • Catoni (2012) {barticle}[author] \bauthor\bsnmCatoni, \bfnmOlivier\binitsO. (\byear2012). \btitleChallenging the empirical mean and empirical variance: a deviation study. \bjournalAnnales de l’IHP Probabilités et Statistiques \bvolume48 \bpages1148–1185. \endbibitem
  • Chen, Raskutti and Yuan (2019) {barticle}[author] \bauthor\bsnmChen, \bfnmHan\binitsH., \bauthor\bsnmRaskutti, \bfnmGarvesh\binitsG. and \bauthor\bsnmYuan, \bfnmMing\binitsM. (\byear2019). \btitleNon-convex projected gradient descent for generalized low-rank tensor regression. \bjournalJournal of Machine Learning Research \bvolume20 \bpages1–37. \endbibitem
  • Chen and Wainwright (2015) {barticle}[author] \bauthor\bsnmChen, \bfnmYudong\binitsY. and \bauthor\bsnmWainwright, \bfnmMartin J\binitsM. J. (\byear2015). \btitleFast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. \bjournalarXiv preprint arXiv:1509.03025. \endbibitem
  • Chen, Yang and Zhang (2022) {barticle}[author] \bauthor\bsnmChen, \bfnmRong\binitsR., \bauthor\bsnmYang, \bfnmDan\binitsD. and \bauthor\bsnmZhang, \bfnmCun-Hui\binitsC.-H. (\byear2022). \btitleFactor models for high-dimensional tensor time series. \bjournalJournal of the American Statistical Association \bvolume117 \bpages94–116. \endbibitem
  • De Lathauwer, De Moor and Vandewalle (2000) {barticle}[author] \bauthor\bsnmDe Lathauwer, \bfnmLieven\binitsL., \bauthor\bsnmDe Moor, \bfnmBart\binitsB. and \bauthor\bsnmVandewalle, \bfnmJoos\binitsJ. (\byear2000). \btitleA multilinear singular value decomposition. \bjournalSIAM Journal on Matrix Analysis and Applications \bvolume21 \bpages1253–1278. \endbibitem
  • Devroye et al. (2016) {barticle}[author] \bauthor\bsnmDevroye, \bfnmLuc\binitsL., \bauthor\bsnmLerasle, \bfnmMatthieu\binitsM., \bauthor\bsnmLugosi, \bfnmGabor\binitsG., \bauthor\bsnmOlivetra, \bfnmRoberto I\binitsR. I. \betalet al. (\byear2016). \btitleSub-Gaussian mean estimators. \bjournalAnnals OF Statistics \bvolume44 \bpages2695. \endbibitem
  • Dong et al. (2023) {barticle}[author] \bauthor\bsnmDong, \bfnmHarry\binitsH., \bauthor\bsnmTong, \bfnmTian\binitsT., \bauthor\bsnmMa, \bfnmCong\binitsC. and \bauthor\bsnmChi, \bfnmYuejie\binitsY. (\byear2023). \btitleFast and provable tensor robust principal component analysis via scaled gradient descent. \bjournalInformation and Inference: A Journal of the IMA \bvolume12 \bpages1716–1758. \endbibitem
  • Fan, Li and Wang (2017) {barticle}[author] \bauthor\bsnmFan, \bfnmJianqing\binitsJ., \bauthor\bsnmLi, \bfnmQuefeng\binitsQ. and \bauthor\bsnmWang, \bfnmYuyan\binitsY. (\byear2017). \btitleEstimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. \bjournalJournal of the Royal Statistical Society. Series B, Statistical methodology \bvolume79 \bpages247. \endbibitem
  • Fan, Wang and Zhu (2021) {barticle}[author] \bauthor\bsnmFan, \bfnmJianqing\binitsJ., \bauthor\bsnmWang, \bfnmWeichen\binitsW. and \bauthor\bsnmZhu, \bfnmZiwei\binitsZ. (\byear2021). \btitleA shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. \bjournalAnnals of Statistics \bvolume49 \bpages1239. \endbibitem
  • Friedman et al. (2012) {barticle}[author] \bauthor\bsnmFriedman, \bfnmNir\binitsN., \bauthor\bsnmIto, \bfnmShinya\binitsS., \bauthor\bsnmBrinkman, \bfnmBraden AW\binitsB. A., \bauthor\bsnmShimono, \bfnmMasanori\binitsM., \bauthor\bsnmDeVille, \bfnmRE Lee\binitsR. L., \bauthor\bsnmDahmen, \bfnmKarin A\binitsK. A., \bauthor\bsnmBeggs, \bfnmJohn M\binitsJ. M. and \bauthor\bsnmButler, \bfnmThomas C\binitsT. C. (\byear2012). \btitleUniversal critical dynamics in high resolution neuronal avalanche data. \bjournalPhysical Review Letters \bvolume108 \bpages208102. \endbibitem
  • Han, Tsay and Wu (2023) {barticle}[author] \bauthor\bsnmHan, \bfnmYuefeng\binitsY., \bauthor\bsnmTsay, \bfnmRuey S\binitsR. S. and \bauthor\bsnmWu, \bfnmWei Biao\binitsW. B. (\byear2023). \btitleHigh dimensional generalized linear models for temporal dependent data. \bjournalBernoulli \bvolume29 \bpages105–131. \endbibitem
  • Han, Willett and Zhang (2022) {barticle}[author] \bauthor\bsnmHan, \bfnmRungang\binitsR., \bauthor\bsnmWillett, \bfnmRebecca\binitsR. and \bauthor\bsnmZhang, \bfnmAnru R\binitsA. R. (\byear2022). \btitleAn optimal statistical and computational framework for generalized tensor estimation. \bjournalThe Annals of Statistics \bvolume50 \bpages1–29. \endbibitem
  • Huber (1964) {barticle}[author] \bauthor\bsnmHuber, \bfnmPeter J\binitsP. J. (\byear1964). \btitleRobust Estimation of a Location Parameter. \bjournalThe Annals of Mathematical Statistics \bpages73–101. \endbibitem
  • Jain et al. (2017) {barticle}[author] \bauthor\bsnmJain, \bfnmPrateek\binitsP., \bauthor\bsnmKar, \bfnmPurushottam\binitsP. \betalet al. (\byear2017). \btitleNon-convex optimization for machine learning. \bjournalFoundations and Trends® in Machine Learning \bvolume10 \bpages142–363. \endbibitem
  • Kolda and Bader (2009) {barticle}[author] \bauthor\bsnmKolda, \bfnmTamara G\binitsT. G. and \bauthor\bsnmBader, \bfnmBrett W\binitsB. W. (\byear2009). \btitleTensor decompositions and applications. \bjournalSIAM Review \bvolume51 \bpages455–500. \endbibitem
  • Li et al. (2018) {barticle}[author] \bauthor\bsnmLi, \bfnmXiaoshan\binitsX., \bauthor\bsnmXu, \bfnmDa\binitsD., \bauthor\bsnmZhou, \bfnmHua\binitsH. and \bauthor\bsnmLi, \bfnmLexin\binitsL. (\byear2018). \btitleTucker tensor regression and neuroimaging analysis. \bjournalStatistics in Biosciences \bvolume10 \bpages520–545. \endbibitem
  • Li et al. (2020) {barticle}[author] \bauthor\bsnmLi, \bfnmYuanxin\binitsY., \bauthor\bsnmChi, \bfnmYuejie\binitsY., \bauthor\bsnmZhang, \bfnmHuishuai\binitsH. and \bauthor\bsnmLiang, \bfnmYingbin\binitsY. (\byear2020). \btitleNon-convex low-rank matrix recovery with arbitrary outliers via median-truncated gradient descent. \bjournalInformation and Inference: A Journal of the IMA \bvolume9 \bpages289–325. \endbibitem
  • Loh (2017) {barticle}[author] \bauthor\bsnmLoh, \bfnmPo-Ling\binitsP.-L. (\byear2017). \btitleStatistical consistency and asymptotic normality for high-dimensional robust M-estimators. \bjournalThe Annals of Statistics \bvolume45 \bpages866–896. \endbibitem
  • Lu et al. (2024) {barticle}[author] \bauthor\bsnmLu, \bfnmYin\binitsY., \bauthor\bsnmTao, \bfnmChunbai\binitsC., \bauthor\bsnmWang, \bfnmDi\binitsD., \bauthor\bsnmUddin, \bfnmGazi Salah\binitsG. S., \bauthor\bsnmWu, \bfnmLibo\binitsL. and \bauthor\bsnmZhu, \bfnmXuening\binitsX. (\byear2024). \btitleRobust estimation for dynamic spatial autoregression models with nearly optimal rates. \bjournalAvailable at SSRN 4873355. \endbibitem
  • Ma et al. (2018) {binproceedings}[author] \bauthor\bsnmMa, \bfnmCong\binitsC., \bauthor\bsnmWang, \bfnmKaizheng\binitsK., \bauthor\bsnmChi, \bfnmYuejie\binitsY. and \bauthor\bsnmChen, \bfnmYuxin\binitsY. (\byear2018). \btitleImplicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In \bbooktitleInternational Conference on Machine Learning \bpages3345–3354. \bpublisherPMLR. \endbibitem
  • Negahban and Wainwright (2011) {barticle}[author] \bauthor\bsnmNegahban, \bfnmSahand\binitsS. and \bauthor\bsnmWainwright, \bfnmMartin J\binitsM. J. (\byear2011). \btitleEstimation of (near) low-rank matrices with noise and high-dimensional scaling. \bjournalAnnals of Statistics \bvolume39 \bpages1069–1097. \endbibitem
  • Negahban and Wainwright (2012) {barticle}[author] \bauthor\bsnmNegahban, \bfnmSahand\binitsS. and \bauthor\bsnmWainwright, \bfnmMartin J\binitsM. J. (\byear2012). \btitleRestricted strong convexity and weighted matrix completion: Optimal bounds with noise. \bjournalJournal of Machine Learning Research \bvolume13 \bpages1665–1697. \endbibitem
  • Netrapalli et al. (2014) {barticle}[author] \bauthor\bsnmNetrapalli, \bfnmPraneeth\binitsP., \bauthor\bsnmUN, \bfnmNiranjan\binitsN., \bauthor\bsnmSanghavi, \bfnmSujay\binitsS., \bauthor\bsnmAnandkumar, \bfnmAnimashree\binitsA. and \bauthor\bsnmJain, \bfnmPrateek\binitsP. (\byear2014). \btitleNon-convex robust PCA. \bjournalAdvances in Neural Information Processing Systems \bvolume27. \endbibitem
  • Prasad et al. (2020) {barticle}[author] \bauthor\bsnmPrasad, \bfnmAdarsh\binitsA., \bauthor\bsnmSuggala, \bfnmArun Sai\binitsA. S., \bauthor\bsnmBalakrishnan, \bfnmSivaraman\binitsS. and \bauthor\bsnmRavikumar, \bfnmPradeep\binitsP. (\byear2020). \btitleRobust estimation via robust gradient estimation. \bjournalJournal of the Royal Statistical Society Series B: Statistical Methodology \bvolume82 \bpages601–627. \endbibitem
  • Raskutti, Yuan and Chen (2019) {barticle}[author] \bauthor\bsnmRaskutti, \bfnmGarvesh\binitsG., \bauthor\bsnmYuan, \bfnmMing\binitsM. and \bauthor\bsnmChen, \bfnmHan\binitsH. (\byear2019). \btitleConvex regularization for high-dimensional multi-response tensor regression. \bjournalAnnals of Statistics \bvolume47 \bpages1554-1584. \endbibitem
  • Richard and Montanari (2014) {barticle}[author] \bauthor\bsnmRichard, \bfnmEmile\binitsE. and \bauthor\bsnmMontanari, \bfnmAndrea\binitsA. (\byear2014). \btitleA statistical model for tensor PCA. \bjournalAdvances in Neural Information Processing Systems \bvolume27. \endbibitem
  • Roberts, Boonstra and Breakspear (2015) {barticle}[author] \bauthor\bsnmRoberts, \bfnmJames A\binitsJ. A., \bauthor\bsnmBoonstra, \bfnmTjeerd W\binitsT. W. and \bauthor\bsnmBreakspear, \bfnmMichael\binitsM. (\byear2015). \btitleThe heavy tail of the human brain. \bjournalCurrent Opinion in Neurobiology \bvolume31 \bpages164–172. \endbibitem
  • Shen et al. (2023) {barticle}[author] \bauthor\bsnmShen, \bfnmYinan\binitsY., \bauthor\bsnmLi, \bfnmJingyang\binitsJ., \bauthor\bsnmCai, \bfnmJian-Feng\binitsJ.-F. and \bauthor\bsnmXia, \bfnmDong\binitsD. (\byear2023). \btitleComputationally efficient and statistically optimal robust low-rank matrix and tensor estimation. \bjournalarXiv preprint arXiv:2203.00953. \endbibitem
  • Sun, Zhou and Fan (2020) {barticle}[author] \bauthor\bsnmSun, \bfnmQiang\binitsQ., \bauthor\bsnmZhou, \bfnmWen-Xin\binitsW.-X. and \bauthor\bsnmFan, \bfnmJianqing\binitsJ. (\byear2020). \btitleAdaptive Huber regression. \bjournalJournal of the American Statistical Association \bvolume115 \bpages254–265. \endbibitem
  • Tan, Sun and Witten (2023) {barticle}[author] \bauthor\bsnmTan, \bfnmKean Ming\binitsK. M., \bauthor\bsnmSun, \bfnmQiang\binitsQ. and \bauthor\bsnmWitten, \bfnmDaniela\binitsD. (\byear2023). \btitleSparse reduced rank Huber regression in high dimensions. \bjournalJournal of the American Statistical Association \bvolume118 \bpages2383–2393. \endbibitem
  • Tarzanagh and Michailidis (2022) {barticle}[author] \bauthor\bsnmTarzanagh, \bfnmDavoud Ataee\binitsD. A. and \bauthor\bsnmMichailidis, \bfnmGeorge\binitsG. (\byear2022). \btitleRegularized and smooth double core tensor factorization for heterogeneous data. \bjournalJournal of Machine Learning Research \bvolume23 \bpages1–49. \endbibitem
  • Tomioka and Suzuki (2013) {binproceedings}[author] \bauthor\bsnmTomioka, \bfnmRyota\binitsR. and \bauthor\bsnmSuzuki, \bfnmTaiji\binitsT. (\byear2013). \btitleConvex tensor decomposition via structured Schatten norm regularization. In \bbooktitleAdvances in Neural Information Processing Systems (NIPS) \bpages1331–1339. \endbibitem
  • Tong, Ma and Chi (2022) {binproceedings}[author] \bauthor\bsnmTong, \bfnmTian\binitsT., \bauthor\bsnmMa, \bfnmCong\binitsC. and \bauthor\bsnmChi, \bfnmYuejie\binitsY. (\byear2022). \btitleAccelerating ill-conditioned robust low-rank tensor regression. In \bbooktitleICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) \bpages9072–9076. \bpublisherIEEE. \endbibitem
  • Tong et al. (2022) {barticle}[author] \bauthor\bsnmTong, \bfnmTian\binitsT., \bauthor\bsnmMa, \bfnmCong\binitsC., \bauthor\bsnmPrater-Bennette, \bfnmAshley\binitsA., \bauthor\bsnmTripp, \bfnmErin\binitsE. and \bauthor\bsnmChi, \bfnmYuejie\binitsY. (\byear2022). \btitleScaling and scalability: Provable nonconvex low-rank tensor estimation from incomplete measurements. \bjournalJournal of Machine Learning Research \bvolume23 \bpages1–77. \endbibitem
  • Tu et al. (2016) {binproceedings}[author] \bauthor\bsnmTu, \bfnmStephen\binitsS., \bauthor\bsnmBoczar, \bfnmRoss\binitsR., \bauthor\bsnmSimchowitz, \bfnmMax\binitsM., \bauthor\bsnmSoltanolkotabi, \bfnmMahdi\binitsM. and \bauthor\bsnmRecht, \bfnmBen\binitsB. (\byear2016). \btitleLow-rank solutions of linear matrix equations via procrustes flow. In \bbooktitleInternational Conference on Machine Learning \bpages964–973. \bpublisherPMLR. \endbibitem
  • Tucker (1966) {barticle}[author] \bauthor\bsnmTucker, \bfnmLedyard R\binitsL. R. (\byear1966). \btitleSome mathematical notes on three-mode factor analysis. \bjournalPsychometrika \bvolume31 \bpages279–311. \endbibitem
  • Wainwright (2019) {bbook}[author] \bauthor\bsnmWainwright, \bfnmMartin J\binitsM. J. (\byear2019). \btitleHigh-dimensional statistics: A non-asymptotic viewpoint \bvolume48. \bpublisherCambridge University Press. \endbibitem
  • Wang and Tsay (2023) {barticle}[author] \bauthor\bsnmWang, \bfnmDi\binitsD. and \bauthor\bsnmTsay, \bfnmRuey S\binitsR. S. (\byear2023). \btitleRate-optimal robust estimation of high-dimensional vector autoregressive models. \bjournalThe Annals of Statistics \bvolume51 \bpages846–877. \endbibitem
  • Wang, Zhang and Gu (2017) {binproceedings}[author] \bauthor\bsnmWang, \bfnmLingxiao\binitsL., \bauthor\bsnmZhang, \bfnmXiao\binitsX. and \bauthor\bsnmGu, \bfnmQuanquan\binitsQ. (\byear2017). \btitleA unified computational and statistical framework for nonconvex low-rank matrix estimation. In \bbooktitleArtificial Intelligence and Statistics \bpages981–990. \bpublisherPMLR. \endbibitem
  • Wang, Zhang and Mai (2023) {barticle}[author] \bauthor\bsnmWang, \bfnmNing\binitsN., \bauthor\bsnmZhang, \bfnmXin\binitsX. and \bauthor\bsnmMai, \bfnmQing\binitsQ. (\byear2023). \btitleHigh-dimensional tensor response regression using the t-distribution. \bjournalarXiv preprint arXiv:2306.12125. \endbibitem
  • Wang et al. (2020) {barticle}[author] \bauthor\bsnmWang, \bfnmLan\binitsL., \bauthor\bsnmPeng, \bfnmBo\binitsB., \bauthor\bsnmBradic, \bfnmJelena\binitsJ., \bauthor\bsnmLi, \bfnmRunze\binitsR. and \bauthor\bsnmWu, \bfnmYunan\binitsY. (\byear2020). \btitleA tuning-free robust and efficient approach to high-dimensional regression. \bjournalJournal of the American Statistical Association \bpages1–44. \endbibitem
  • Wang et al. (2022) {barticle}[author] \bauthor\bsnmWang, \bfnmDi\binitsD., \bauthor\bsnmZheng, \bfnmYao\binitsY., \bauthor\bsnmLian, \bfnmHeng\binitsH. and \bauthor\bsnmLi, \bfnmGuodong\binitsG. (\byear2022). \btitleHigh-dimensional vector autoregressive time series modeling via tensor decomposition. \bjournalJournal of the American Statistical Association \bvolume117 \bpages1338–1356. \endbibitem
  • Wei et al. (2023) {barticle}[author] \bauthor\bsnmWei, \bfnmBo\binitsB., \bauthor\bsnmPeng, \bfnmLimin\binitsL., \bauthor\bsnmGuo, \bfnmYing\binitsY., \bauthor\bsnmManatunga, \bfnmAmita\binitsA. and \bauthor\bsnmStevens, \bfnmJennifer\binitsJ. (\byear2023). \btitleTensor response quantile regression with neuroimaging data. \bjournalBiometrics \bvolume79 \bpages1947–1958. \endbibitem
  • Wu et al. (2022) {barticle}[author] \bauthor\bsnmWu, \bfnmYing\binitsY., \bauthor\bsnmChen, \bfnmDan\binitsD., \bauthor\bsnmLi, \bfnmChaoqian\binitsC. and \bauthor\bsnmTang, \bfnmNiansheng\binitsN. (\byear2022). \btitleBayesian tensor logistic regression with applications to neuroimaging data analysis of Alzheimer’s disease. \bjournalStatistical Methods in Medical Research \bvolume31 \bpages2368–2382. \endbibitem
  • Xu, Zhang and Gu (2017) {binproceedings}[author] \bauthor\bsnmXu, \bfnmPan\binitsP., \bauthor\bsnmZhang, \bfnmTingting\binitsT. and \bauthor\bsnmGu, \bfnmQuanquan\binitsQ. (\byear2017). \btitleEfficient algorithm for sparse tensor-variate gaussian graphical models via gradient descent. In \bbooktitleArtificial Intelligence and Statistics \bpages923–932. \bpublisherPMLR. \endbibitem
  • Yang et al. (2020) {barticle}[author] \bauthor\bsnmYang, \bfnmXingyi\binitsX., \bauthor\bsnmHe, \bfnmXuehai\binitsX., \bauthor\bsnmZhao, \bfnmJinyu\binitsJ., \bauthor\bsnmZhang, \bfnmYichen\binitsY., \bauthor\bsnmZhang, \bfnmShanghang\binitsS. and \bauthor\bsnmXie, \bfnmPengtao\binitsP. (\byear2020). \btitleCOVID-CT-Dataset: A CT scan dataset about COVID-19. \bjournalarXiv preprint arXiv:2003.13865. \endbibitem
  • Yuan and Zhang (2016) {barticle}[author] \bauthor\bsnmYuan, \bfnmMing\binitsM. and \bauthor\bsnmZhang, \bfnmCun-Hui\binitsC.-H. (\byear2016). \btitleOn tensor completion via nuclear norm minimization. \bjournalFoundations of Computational Mathematics \bvolume16 \bpages1031–1068. \endbibitem
  • Zhang and Han (2019) {barticle}[author] \bauthor\bsnmZhang, \bfnmAnru\binitsA. and \bauthor\bsnmHan, \bfnmRungang\binitsR. (\byear2019). \btitleOptimal sparse singular value decomposition for high-dimensional high-order data. \bjournalJournal of the American Statistical Association. \endbibitem
  • Zhang and Xia (2018) {barticle}[author] \bauthor\bsnmZhang, \bfnmAnru\binitsA. and \bauthor\bsnmXia, \bfnmDong\binitsD. (\byear2018). \btitleTensor SVD: Statistical and Computational Limits. \bjournalIEEE Transactions on Information Theory \bvolume64 \bpages7311-7338. \endbibitem
  • Zhang et al. (2020) {barticle}[author] \bauthor\bsnmZhang, \bfnmAnru R\binitsA. R., \bauthor\bsnmLuo, \bfnmYuetian\binitsY., \bauthor\bsnmRaskutti, \bfnmGarvesh\binitsG. and \bauthor\bsnmYuan, \bfnmMing\binitsM. (\byear2020). \btitleISLET: Fast and optimal low-rank tensor regression via importance sketching. \bjournalSIAM Journal on Mathematics of Data Science \bvolume2 \bpages444–479. \endbibitem
  • Zhou, Li and Zhu (2013) {barticle}[author] \bauthor\bsnmZhou, \bfnmHua\binitsH., \bauthor\bsnmLi, \bfnmLexin\binitsL. and \bauthor\bsnmZhu, \bfnmHongtu\binitsH. (\byear2013). \btitleTensor regression with applications in neuroimaging data analysis. \bjournalJournal of the American Statistical Association \bvolume108 \bpages540–552. \endbibitem
  • Zhu and Zhou (2021) {binproceedings}[author] \bauthor\bsnmZhu, \bfnmZiwei\binitsZ. and \bauthor\bsnmZhou, \bfnmWenjing\binitsW. (\byear2021). \btitleTaming heavy-tailed features by shrinkage. In \bbooktitleInternational Conference on Artificial Intelligence and Statistics \bpages3268–3276. \bpublisherPMLR. \endbibitem

Appendix A Convergence Analysis of Robust Gradient Descent

A.1 Proofs of Theorem 3.3

The proof consists of five steps. In the first step, we introduce the notation and the regularity conditions used in the subsequent steps. In the second to fourth steps, we establish the convergence analysis of the estimation errors. Finally, in the last step, we verify the conditions introduced in the first step recursively. Throughout the proof, we let C_{d}>0 denote a generic constant depending only on d.

Step 1. (Notations and conditions)

We first introduce the notations used in the proof. At step tt, we simplify the notations of the robust gradient estimators to 𝓖0(t)\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)} and 𝐆k(t)\mathbf{G}_{k}^{(t)}, for k=1,,dk=1,\dots,d and t=1,,Tt=1,\dots,T. Denote 𝐕k(t)=(jk𝐔j(t))𝓢(k)(t)\mathbf{V}_{k}^{(t)}=(\otimes_{j\neq k}\mathbf{U}_{j}^{(t)})\mbox{\boldmath$\mathscr{S}$}^{(t)\top}_{(k)},

𝚫k(t)=𝐆k(t)𝔼[k(t)]=𝐆k(t)𝔼[(𝓐(t))(k)𝐕k(t)],\mathbf{\Delta}_{k}^{(t)}=\mathbf{G}_{k}^{(t)}-\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]=\mathbf{G}_{k}^{(t)}-\mathbb{E}[\nabla\mathcal{L}(\mbox{\boldmath$\mathscr{A}$}^{(t)})_{(k)}\mathbf{V}_{k}^{(t)}], (A.1)

and

𝚫0(t)=𝓖0(t)𝔼[0(t)]=𝓖0(t)𝔼[(𝓐(t))×j=1d𝐔j(t)],\mathbf{\Delta}_{0}^{(t)}=\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)}-\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]=\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)}-\mathbb{E}[\nabla\mathcal{L}(\mbox{\boldmath$\mathscr{A}$}^{(t)})\times_{j=1}^{d}\mathbf{U}_{j}^{(t)\top}], (A.2)

as the robust gradient estimation errors. By the stability of the robust gradients, 𝚫k(t)F2ϕ𝓐(t)𝓐F2+ξk2\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}\leq\phi\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|^{2}_{\text{F}}+\xi_{k}^{2}, for all k=0,1,,dk=0,1,\dots,d and t=1,2,,Tt=1,2,\dots,T. In addition, we assume bσ¯1/(d+1)b\asymp\bar{\sigma}^{1/(d+1)}, as required in Theorem 3.3.

Let 𝓐=𝓢×k=1d𝐔k\mbox{\boldmath$\mathscr{A}$}^{*}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{U}_{k}^{*} such that 𝐔k𝐔k=b2𝐈rk\mathbf{U}_{k}^{*\top}\mathbf{U}_{k}^{*}=b^{2}\mathbf{I}_{r_{k}}, for k=1,,dk=1,\dots,d. Define 𝕆r={𝐌r×r:𝐌𝐌=𝐈r}\mathbb{O}_{r}=\{\mathbf{M}\in\mathbb{R}^{r\times r}:\mathbf{M}^{\top}\mathbf{M}=\mathbf{I}_{r}\} as the set of r×rr\times r orthogonal matrices. For each step t=0,1,,Tt=0,1,\dots,T, we define

Err(t)=min𝐎k𝕆rk,1kd{k=1d𝐔k(t)𝐔k𝐎kF2+𝓢(t)𝓢×j=1d𝐎jF2},\text{Err}^{(t)}=\min_{\mathbf{O}_{k}\in\mathbb{O}_{r_{k}},1\leq k\leq d}\left\{\sum_{k=1}^{d}\|\mathbf{U}^{(t)}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\textup{F}}^{2}+\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{\top}\|^{2}_{\textup{F}}\right\}, (A.3)

and

(𝐎1(t),,𝐎d(t))=argmin𝐎k𝕆rk,1kd{k=1d𝐔k(t)𝐔k𝐎kF2+𝓢(t)𝓢×j=1d𝐎jF2}.(\mathbf{O}_{1}^{(t)},\cdots,\mathbf{O}_{d}^{(t)})=\operatorname*{arg\,min}_{{\mathbf{O}_{k}\in\mathbb{O}_{r_{k}},1\leq k\leq d}}\left\{\sum_{k=1}^{d}\|\mathbf{U}^{(t)}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\textup{F}}^{2}+\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{\top}\|^{2}_{\textup{F}}\right\}. (A.4)

Here, Err(t)\text{Err}^{(t)} collects the combined estimation errors for all tensor decomposition components at step tt, and 𝐎k(t)\mathbf{O}_{k}^{(t)}’s are the optimal rotations used to handle the non-identifiability of the Tucker decomposition.
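As a concrete illustration of (A.3)–(A.4), the following is a minimal numerical sketch in Python, with illustrative function names not taken from the paper: each factor is aligned to its target by the orthogonal Procrustes rotation, that is, the SVD-based minimizer of \|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\text{F}} alone. Since this ignores the coupling with the core term in the joint minimization (A.4), the returned quantity is an upper bound on \text{Err}^{(t)}.

import numpy as np

def procrustes(U_hat, U_star):
    # Orthogonal Procrustes: the rotation O minimizing ||U_hat - U_star @ O||_F.
    W, _, Vt = np.linalg.svd(U_star.T @ U_hat)
    return W @ Vt

def combined_error(U_hats, U_stars, S_hat, S_star):
    # Factor-by-factor alignment; yields an upper bound on Err^{(t)} defined in (A.3).
    Os = [procrustes(U, Us) for U, Us in zip(U_hats, U_stars)]
    err = sum(np.linalg.norm(U - Us @ O) ** 2
              for U, Us, O in zip(U_hats, U_stars, Os))
    S_rot = S_star
    for k, O in enumerate(Os):
        # mode-k product S_rot x_k O_k^T: rotate the true core into the estimated basis
        S_rot = np.moveaxis(np.tensordot(O.T, S_rot, axes=(1, k)), 0, k)
    return err + np.linalg.norm(S_hat - S_rot) ** 2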

Next, we discuss some additional conditions used in the convergence analysis. To ease presentation, we first assume that these conditions hold and verify them in the last step.

(C1) For any t=0,1,,Tt=0,1,\dots,T and k=1,2,,dk=1,2,\dots,d, 𝓢(k)(t)Cσ¯bd\|\mbox{\boldmath$\mathscr{S}$}_{(k)}^{(t)}\|\leq C\bar{\sigma}b^{-d} and 𝐔k(t)Cb\|\mathbf{U}_{k}^{(t)}\|\leq Cb for some absolute constant greater than one. Hence, 𝐕k(t)𝓢(k)(t)jk𝐔j(t)Cdσ¯b1\|\mathbf{V}_{k}^{(t)}\|\leq\|\mbox{\boldmath$\mathscr{S}$}_{(k)}^{(t)}\|\cdot\prod_{j\neq k}\|\mathbf{U}_{j}^{(t)}\|\leq C_{d}\bar{\sigma}b^{-1}.

(C2) For any t=0,1,,Tt=0,1,\dots,T, Err(t)Cαβ1b2κ2\text{Err}^{(t)}\leq C\alpha\beta^{-1}b^{2}\kappa^{-2}.

Step 2. (Descent of Err(t)\text{Err}^{(t)})

By definition of Err(t)\text{Err}^{(t)} and 𝐎k(t)\mathbf{O}_{k}^{(t)}’s,

Err(t+1)=k=1d𝐔k(t+1)𝐔k𝐎k(t+1)F2+𝓢(t+1)𝓢×j=1d𝐎j(t+1)F2k=1d𝐔k(t+1)𝐔k𝐎k(t)F2+𝓢(t+1)𝓢×j=1d𝐎j(t)F2.\begin{split}\text{Err}^{(t+1)}&=\sum_{k=1}^{d}\left\|\mathbf{U}_{k}^{(t+1)}-\mathbf{U}^{*}_{k}\mathbf{O}_{k}^{(t+1)}\right\|^{2}_{\textup{F}}+\left\|\mbox{\boldmath$\mathscr{S}$}^{(t+1)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{(t+1)\top}\right\|^{2}_{\textup{F}}\\ &\leq\sum_{k=1}^{d}\left\|\mathbf{U}_{k}^{(t+1)}-\mathbf{U}^{*}_{k}\mathbf{O}_{k}^{(t)}\right\|^{2}_{\textup{F}}+\left\|\mbox{\boldmath$\mathscr{S}$}^{(t+1)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{(t)\top}\right\|^{2}_{\textup{F}}.\end{split} (A.5)

For each k=1,,dk=1,\cdots,d, since 𝐔k(t+1)=𝐔k(t)η𝐆k(t)aη𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)\mathbf{U}_{k}^{(t+1)}=\mathbf{U}_{k}^{(t)}-\eta\mathbf{G}_{k}^{(t)}-a\eta\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}), we have that for any ζ>0\zeta>0,

𝐔k(t+1)𝐔k𝐎k(t)F2=𝐔k(t)𝐔k𝐎k(t)η(𝐆k(t)+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2=𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))η𝚫k(t)F2𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2+η2𝚫k(t)F2+2η𝚫k(t)F𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F(1+ζ)𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2+(1+ζ1)η2𝚫k(t)F2,\begin{split}&\|\mathbf{U}^{(t+1)}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\textup{F}}^{2}\\ =&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbf{G}_{k}^{(t)}+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\textup{F}}^{2}\\ =&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))-\eta\mathbf{\Delta}_{k}^{(t)}\|_{\textup{F}}^{2}\\ \leq&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}^{2}+\eta^{2}\|\mathbf{\Delta}_{k}^{(t)}\|_{\textup{F}}^{2}\\ &+2\eta\|\mathbf{\Delta}_{k}^{(t)}\|_{\textup{F}}\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}\\ \leq&(1+\zeta)\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}^{2}\\ &+(1+\zeta^{-1})\eta^{2}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2},\end{split} (A.6)

where the last inequality follows from the elementary inequality 2xy\leq\zeta x^{2}+\zeta^{-1}y^{2}, valid for any \zeta>0.

For the first term on the right hand side in (A.6), we have the following decomposition

𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2=𝐔k(t)𝐔k𝐎k(t)F2+η2𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)F22η𝐔k(t)𝐔k𝐎k(t),𝔼[k(t)]2ηa𝐔k(t)𝐔k𝐎k(t),𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk).\begin{split}&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}^{2}\\ =&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\text{F}}^{2}+\eta^{2}\|\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\|_{\text{F}}^{2}\\ &-2\eta\langle\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]\rangle-2\eta a\langle\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\rangle.\end{split} (A.7)

Here, by condition (C1), the second term in (A.7) can be bounded by

𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)F22𝔼[¯(𝓐(t))](k)𝐕k(t)F2+2a2𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)F22𝐕k(t)2𝔼[¯(𝓐(t))]F2+2a2𝐔k(t)2𝐔k(t)𝐔k(t)b2𝐈rkF2Cdb2σ¯2𝔼[¯(𝓐(t))]F2+Ca2b2𝐔k(t)𝐔k(t)b2𝐈rkF2.\begin{split}&\|\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\|_{\text{F}}^{2}\\ \leq&2\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]_{(k)}\mathbf{V}_{k}^{(t)}\|_{\text{F}}^{2}+2a^{2}\|\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\|_{\text{F}}^{2}\\ \leq&2\|\mathbf{V}_{k}^{(t)}\|^{2}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+2a^{2}\|\mathbf{U}_{k}^{(t)}\|^{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}\\ \leq&C_{d}b^{-2}\bar{\sigma}^{2}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+Ca^{2}b^{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}.\end{split} (A.8)

The third term in (A.7) can be rewritten as

\begin{split}&\langle\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]\rangle\\ =&\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{(t)}\times_{j\neq k}\mathbf{U}_{j}^{(t)}\times_{k}\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})]\rangle\\ =&\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})]\rangle,\end{split} (A.9)

where 𝓐k(t):=𝓢(t)×jk𝐔j(t)×k𝐔k𝐎k(t)\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)}:=\mbox{\boldmath$\mathscr{S}$}^{(t)}\times_{j\neq k}\mathbf{U}_{j}^{(t)}\times_{k}\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}. For the fourth term in (A.7), we have

𝐔k(t)𝐔k𝐎k(t),𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk)=𝐔k(t)𝐔k(t)𝐔k(t)𝐔k𝐎k(t),𝐔k(t)𝐔k(t)b2𝐈rk=12𝐔k(t)𝐔k(t)𝐔k𝐔k,𝐔k(t)𝐔k(t)b2𝐈rk+12𝐔k𝐔k2𝐔k(t)𝐔k𝐎k(t)+𝐔k(t)𝐔k(t),𝐔k(t)𝐔k(t)b2𝐈rk=12𝐔k(t)𝐔k(t)b2𝐈rkF2+12(𝐔k𝐎k(t)𝐔k(t))(𝐔k𝐎k(t)𝐔k(t)),𝐔k(t)𝐔k(t)b2𝐈rk12𝐔k(t)𝐔k(t)b2𝐈rkF212𝐔k𝐎k(t)𝐔k(t)F2𝐔k(t)𝐔k(t)b2𝐈rkF12𝐔k(t)𝐔k(t)b2𝐈rkF214𝐔k𝐎k(t)𝐔k(t)F414𝐔k(t)𝐔k(t)b2𝐈rkF214𝐔k(t)𝐔k(t)b2𝐈rkF2Err(t)4𝐔k(t)𝐔k𝐎k(t)F2,\begin{split}&\left\langle\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}})\right\rangle\\ =&\left\langle\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)},\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\rangle\\ =&\frac{1}{2}\left\langle\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*\top}\mathbf{U}_{k}^{*},\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\rangle\\ &+\frac{1}{2}\left\langle\mathbf{U}_{k}^{*\top}\mathbf{U}_{k}^{*}-2\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}+\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)},\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\rangle\\ =&\frac{1}{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}\\ &+\frac{1}{2}\left\langle(\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)})^{\top}(\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)}),\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\rangle\\ \geq&\frac{1}{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{1}{2}\|\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)}\|_{\text{F}}^{2}\cdot\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}\\ \geq&\frac{1}{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{1}{4}\|\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)}\|_{\text{F}}^{4}-\frac{1}{4}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}\\ \geq&\frac{1}{4}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{\text{Err}^{(t)}}{4}\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\text{F}}^{2},\end{split} (A.10)

where we use the fact that 𝐔k𝐎k(t)𝐔k(t)F2Err(t)\|\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\mathbf{U}_{k}^{(t)}\|_{\text{F}}^{2}\leq\text{Err}^{(t)}.

Hence, for any k=1,2,,dk=1,2,\dots,d,

𝐔k(t)𝐔k𝐎k(t)η(𝔼[k(t)]+a𝐔k(t)(𝐔k(t)𝐔k(t)b2𝐈rk))F2𝐔k(t)𝐔k𝐎k(t)F22Qk,1(t)η+Qk,2(t)η2,\begin{split}&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}-\eta(\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]+a\mathbf{U}_{k}^{(t)}(\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}))\|_{\text{F}}^{2}\\ \leq&\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\text{F}}^{2}-2Q_{k,1}^{(t)}\eta+Q_{k,2}^{(t)}\eta^{2},\end{split} (A.11)

where

Qk,1(t)=𝓐(t)𝓐k(t),𝔼[¯(𝓐(t))]+a4𝐔k(t)𝐔k(t)b2𝐈rkF2aErr(t)4𝐔k(t)𝐔k𝐎k(t)F2Q_{k,1}^{(t)}=\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\rangle+\frac{a}{4}\left\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\right\|_{\text{F}}^{2}-\frac{a\text{Err}^{(t)}}{4}\left\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\right\|_{\text{F}}^{2} (A.12)

and

Qk,2(t)=Cdb2σ¯2𝔼[¯(𝓐(t))]F2+Ca2b2𝐔k(t)𝐔k(t)b2𝐈rkF2.Q_{k,2}^{(t)}=C_{d}b^{-2}\bar{\sigma}^{2}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+Ca^{2}b^{2}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}. (A.13)

Similarly, for any ζ>0\zeta>0,

𝓢~(t+1)𝓢×k=1d𝐎k(t)F2=𝓢(t)η𝓖0(t)𝓢×k=1d𝐎k(t)F2=𝓢(t)𝓢×k=1d𝐎k(t)η𝔼[0(t)]η𝚫0(t)F2(1+ζ)𝓢(t)𝓢×k=1d𝐎k(t)η𝔼[0(t)]F2+η2(1+ζ1)𝚫0(t)F2,\begin{split}&\|\widetilde{\mbox{\boldmath$\mathscr{S}$}}^{(t+1)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}\|_{\text{F}}^{2}=\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\eta\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}\|_{\text{F}}^{2}\\ =&\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}-\eta\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]-\eta\mathbf{\Delta}_{0}^{(t)}\|_{\text{F}}^{2}\\ \leq&(1+\zeta)\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}-\eta\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]\|_{\text{F}}^{2}+\eta^{2}(1+\zeta^{-1})\|\mathbf{\Delta}_{0}^{(t)}\|_{\text{F}}^{2},\end{split} (A.14)

and

𝓢(t)𝓢×k=1d𝐎k(t)η𝔼[0(t)]F2𝓢(t)𝓢×k=1d𝐎k(t)F22Q0,1(t)η+Q0,2(t)η2,\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}-\eta\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]\|_{\text{F}}^{2}\leq\|\mbox{\boldmath$\mathscr{S}$}^{(t)}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{(t)\top}\|_{\text{F}}^{2}-2Q_{0,1}^{(t)}\eta+Q_{0,2}^{(t)}\eta^{2}, (A.15)

where

Q0,1(t)=𝓐(t)𝓐0(t),𝔼[¯(𝓐(t))] with 𝓐0(t)=𝓢×k=1d𝐔k(t)𝐎k(t)Q_{0,1}^{(t)}=\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}_{0}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\rangle\text{ with }\mbox{\boldmath$\mathscr{A}$}_{0}^{(t)}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{U}_{k}^{(t)}\mathbf{O}_{k}^{(t)\top}

and Q0,2(t)=Cdb2d𝔼[¯(𝓐(t))]F2Q_{0,2}^{(t)}=C_{d}b^{2d}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}.

Hence, combining the above results, we have

Err(t+1)(1+ζ){Err(t)2ηk=0dQk,1(t)+η2k=0dQk,2(t)}+(1+ζ1)η2k=0d𝚫k(t)F2.\text{Err}^{(t+1)}\leq(1+\zeta)\left\{\text{Err}^{(t)}-2\eta\sum_{k=0}^{d}Q^{(t)}_{k,1}+\eta^{2}\sum_{k=0}^{d}Q_{k,2}^{(t)}\right\}+(1+\zeta^{-1})\eta^{2}\sum_{k=0}^{d}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}. (A.16)
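In code, the factor and core updates analyzed in this step amount to a gradient step plus the balancing regularizer on each factor. The following is a minimal sketch, where the robust gradients \mathbf{G}_{k}^{(t)} and \mbox{\boldmath$\mathscr{G}$}_{0}^{(t)} are taken as given (e.g., produced by the truncated averaging analyzed in Appendix B) and the function names are illustrative only.

import numpy as np

def update_factor(U_k, G_k, eta, a, b):
    # U_k^{(t+1)} = U_k^{(t)} - eta*G_k^{(t)} - a*eta*U_k^{(t)} (U_k^{(t)T} U_k^{(t)} - b^2 I_{r_k})
    r_k = U_k.shape[1]
    balance = U_k @ (U_k.T @ U_k - b ** 2 * np.eye(r_k))
    return U_k - eta * G_k - a * eta * balance

def update_core(S, G_0, eta):
    # S^{(t+1)} = S^{(t)} - eta * G_0^{(t)}, cf. (A.14)
    return S - eta * G_0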

Step 3. (Lower bound of k=0dQk,1(t)\sum_{k=0}^{d}Q^{(t)}_{k,1})

By definition of Qk,1(t)Q^{(t)}_{k,1} for k=0,,dk=0,\dots,d, we have

k=0dQk,1(t)=(d+1)𝓐(t)k=0d𝓐k(t),𝔼[¯(𝓐(t))]+ak=1d{14𝐔k(t)𝐔k(t)b2𝐈rkF2Err(t)4𝐔k(t)𝐔k𝐎k(t)F2}.\begin{split}\sum_{k=0}^{d}Q_{k,1}^{(t)}=&\left\langle(d+1)\mbox{\boldmath$\mathscr{A}$}^{(t)}-\sum_{k=0}^{d}\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\right\rangle\\ &+a\sum_{k=1}^{d}\left\{\frac{1}{4}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{\text{Err}^{(t)}}{4}\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|_{\text{F}}^{2}\right\}.\end{split} (A.17)

For the first term, by the RCG condition on \overline{\mathcal{L}}, the Cauchy–Schwarz inequality, and Young's inequality,

(d+1)𝓐(t)k=0d𝓐k(t),𝔼[¯(𝓐(t))]=𝓐(t)𝓐+𝓗,𝔼[¯(𝓐(t))]=𝓐(t)𝓐,𝔼[¯(𝓐(t))]𝔼[¯(𝓐)]+𝓗,𝔼[¯(𝓐(t))]α2𝓐(t)𝓐F2+12β𝔼[¯(𝓐(t))]F2𝓗F𝔼[¯(𝓐(t))]Fα2𝓐(t)𝓐F2+12β𝔼[¯(𝓐(t))]F214β𝔼[¯(𝓐(t))]F2β𝓗F2=α2𝓐(t)𝓐F2+14β𝔼[¯(𝓐(t))]F2β𝓗F2\begin{split}&\left\langle(d+1)\mbox{\boldmath$\mathscr{A}$}^{(t)}-\sum_{k=0}^{d}\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\right\rangle=\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}+\mbox{\boldmath$\mathscr{H}$},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\rangle\\ =&\langle\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})]\rangle+\langle\mbox{\boldmath$\mathscr{H}$},\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\rangle\\ \geq&\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{1}{2\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}\cdot\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}\\ \geq&\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{1}{2\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-\beta\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}^{2}\\ =&\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-\beta\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}^{2}\end{split} (A.18)

where 𝓗\mathscr{H} is the higher-order perturbation term in

𝓐=𝓐0(t)+k=1d(𝓐k(t)𝓐(t))+𝓗.\mbox{\boldmath$\mathscr{A}$}^{*}=\mbox{\boldmath$\mathscr{A}$}_{0}^{(t)}+\sum_{k=1}^{d}(\mbox{\boldmath$\mathscr{A}$}_{k}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{(t)})+\mbox{\boldmath$\mathscr{H}$}. (A.19)

By Lemma A.2, we have 𝓗FCdb2σ¯Err(t)\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}\leq C_{d}b^{-2}\bar{\sigma}\text{Err}^{(t)}. Hence, by Lemma A.1, k=0dQk,1(t)\sum_{k=0}^{d}Q_{k,1}^{(t)} can be lower bounded by

k=0dQk,1(t)α2𝓐(t)𝓐F2+14β𝔼[¯(𝓐(t))]F2Cdβb4σ¯2(Err(t))2+a4k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2a4(Err(t))2{Cαb2dκ2Cdβb4σ¯2Err(t)aErr(t)4}Err(t)+14β𝔼[¯(𝓐(t))]F2+(a4Cdαb2d2κ2)k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2Cαb2dκ2Err(t)+14β𝔼[¯(𝓐(t))]F2+(a4Cdαb2d2κ2)k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2.\begin{split}&\sum_{k=0}^{d}Q_{k,1}^{(t)}\geq\frac{\alpha}{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}-C_{d}\beta b^{-4}\bar{\sigma}^{2}(\text{Err}^{(t)})^{2}\\ &+\frac{a}{4}\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}-\frac{a}{4}(\text{Err}^{(t)})^{2}\\ \geq&\left\{C\alpha b^{2d}\kappa^{-2}-C_{d}\beta b^{-4}\bar{\sigma}^{2}\text{Err}^{(t)}-\frac{a\text{Err}^{(t)}}{4}\right\}\text{Err}^{(t)}\\ &+\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+\left(\frac{a}{4}-C_{d}\alpha b^{2d-2}\kappa^{-2}\right)\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}\\ \geq&~{}C\alpha b^{2d}\kappa^{-2}\text{Err}^{(t)}+\frac{1}{4\beta}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+\left(\frac{a}{4}-C_{d}\alpha b^{2d-2}\kappa^{-2}\right)\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}.\end{split} (A.20)

Step 4. (Convergence analysis)

We have the following bound for k=0dQk,2(t)\sum_{k=0}^{d}Q_{k,2}^{(t)}

k=0dQk,2(t)Cdb2d𝔼[¯(𝓐(t))]F2+3a2b2k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2.\begin{split}\sum_{k=0}^{d}Q_{k,2}^{(t)}&\leq C_{d}b^{2d}\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}+3a^{2}b^{2}\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}.\end{split} (A.21)

Combining the results above, we have

Err(t)2ηk=0dQk,1(t)+η2k=0dQk,2(t)(1Cαb2dκ2η)Err(t)+(Cdb2dη2η4β)𝔼[¯(𝓐(t))]F2+(3a2b2η2+Cdαb2d2κ2ηaη4)k=1d𝐔k(t)𝐔k(t)b2𝐈rkF2.\begin{split}&\text{Err}^{(t)}-2\eta\sum_{k=0}^{d}Q_{k,1}^{(t)}+\eta^{2}\sum_{k=0}^{d}Q_{k,2}^{(t)}\\ \leq&\left(1-C\alpha b^{2d}\kappa^{-2}\eta\right)\text{Err}^{(t)}+\left(C_{d}b^{2d}\eta^{2}-\frac{\eta}{4\beta}\right)\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{(t)})]\|_{\text{F}}^{2}\\ &+\left(3a^{2}b^{2}\eta^{2}+C_{d}\alpha b^{2d-2}\kappa^{-2}\eta-\frac{a\eta}{4}\right)\sum_{k=1}^{d}\|\mathbf{U}_{k}^{(t)\top}\mathbf{U}_{k}^{(t)}-b^{2}\mathbf{I}_{r_{k}}\|_{\text{F}}^{2}.\end{split} (A.22)

Taking η=η0b2dβ1\eta=\eta_{0}b^{-2d}\beta^{-1} and a=C0b2d2ακ2a=C_{0}b^{2d-2}\alpha\kappa^{-2} for some sufficiently small constants η0\eta_{0} and C0C_{0}, we have

Err(t)2ηk=0dQk,1(t)+η2k=0dQk,2(t)(1Cαβ1κ2)Err(t)\text{Err}^{(t)}-2\eta\sum_{k=0}^{d}Q_{k,1}^{(t)}+\eta^{2}\sum_{k=0}^{d}Q_{k,2}^{(t)}\leq(1-C\alpha\beta^{-1}\kappa^{-2})\text{Err}^{(t)} (A.23)

and

Err(t+1)(1+ζ)(1η0αβ1κ2)Err(t)+(1+ζ1)η2k=0d𝚫k(t)F2.\text{Err}^{(t+1)}\leq(1+\zeta)(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2})\text{Err}^{(t)}+(1+\zeta^{-1})\eta^{2}\sum_{k=0}^{d}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}. (A.24)

Taking ζ=η0αβ1κ2/2\zeta=\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2, we have

Err(t+1)(1η0αβ1κ2/2)Err(t)+Cα1β1σ¯4d/(d+1)κ2k=0d𝚫k(t)F2.\text{Err}^{(t+1)}\leq(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2)\text{Err}^{(t)}+C\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\sum_{k=0}^{d}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}. (A.25)
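Here, the constants in (A.25) follow from a short calculation: with \zeta=\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2, \eta=\eta_{0}b^{-2d}\beta^{-1}, b\asymp\bar{\sigma}^{1/(d+1)}, and using \alpha\leq\beta and \kappa\geq1,

(1+\zeta)(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2})\leq 1-\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2\quad\text{and}\quad(1+\zeta^{-1})\eta^{2}\lesssim\alpha^{-1}\beta\kappa^{2}\cdot\eta_{0}^{2}b^{-4d}\beta^{-2}\asymp\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}.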

By stability of the robust gradient estimators, for k=0,1,,dk=0,1,\dots,d and t=1,2,,Tt=1,2,\dots,T,

𝚫k(t)F2ϕ𝓐(t)𝓐F2+ξk2.\|\mathbf{\Delta}^{(t)}_{k}\|_{\text{F}}^{2}\leq\phi\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\xi_{k}^{2}. (A.26)

Hence, as ϕα2κ4σ¯2d/(d+1)\phi\lesssim\alpha^{2}\kappa^{-4}\bar{\sigma}^{2d/(d+1)}, we have

Err(t+1)(1η0αβ1κ2/2)Err(t)+Cdα1β1σ¯4d/(d+1)κ2(ϕ𝓐(t)𝓐F2+k=0dξk2)(1η0αβ1κ2/2+Cdα1β1σ¯2d/(d+1)κ2ϕ)Err(t)+Cα1β1σ¯4d/(d+1)κ2k=0dξk2(1Cαβ1κ2)Err(t)+Cα1β1σ¯4d/(d+1)κ2k=0dξk2(1Cαβ1κ2)t+1Err(0)+Cα2σ¯4d/(d+1)κ4k=0dξk2.\begin{split}\text{Err}^{(t+1)}&\leq(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2)\text{Err}^{(t)}+C_{d}\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\left(\phi\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\sum_{k=0}^{d}\xi_{k}^{2}\right)\\ &\leq(1-\eta_{0}\alpha\beta^{-1}\kappa^{-2}/2+C_{d}\alpha^{-1}\beta^{-1}\bar{\sigma}^{-2d/(d+1)}\kappa^{2}\phi)\text{Err}^{(t)}+C\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\sum_{k=0}^{d}\xi_{k}^{2}\\ &\leq(1-C\alpha\beta^{-1}\kappa^{-2})\text{Err}^{(t)}+C\alpha^{-1}\beta^{-1}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\sum_{k=0}^{d}\xi_{k}^{2}\\ &\leq(1-C\alpha\beta^{-1}\kappa^{-2})^{t+1}\text{Err}^{(0)}+C\alpha^{-2}\bar{\sigma}^{-4d/(d+1)}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2}.\end{split} (A.27)

We apply Lemma A.1 again and obtain

\begin{split}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}&\leq C\bar{\sigma}^{2d/(d+1)}\text{Err}^{(t)}\\ &\leq C\bar{\sigma}^{2d/(d+1)}(1-C\alpha\beta^{-1}\kappa^{-2})^{t}\text{Err}^{(0)}+C\bar{\sigma}^{-2d/(d+1)}\alpha^{-2}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2}\\ &\leq C\kappa^{2}(1-C\alpha\beta^{-1}\kappa^{-2})^{t}\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+C\bar{\sigma}^{-2d/(d+1)}\alpha^{-2}\kappa^{4}\sum_{k=0}^{d}\xi_{k}^{2}.\end{split} (A.28)
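The display (A.27)–(A.28) is the usual pattern of linear convergence up to a statistical error floor. A scalar numerical sketch of this recursion is given below; the contraction rate rho and the additive term c are illustrative values only, standing in for C\alpha\beta^{-1}\kappa^{-2} and the statistical error term, respectively.

# Scalar illustration of Err^{(t+1)} <= (1 - rho) * Err^{(t)} + c:
# geometric decay of the error until it reaches the floor c / rho.
rho, c = 0.1, 1e-4   # illustrative values, not quantities from the paper
err = 1.0            # Err^{(0)}
for t in range(200):
    err = (1 - rho) * err + c
print(err, c / rho)  # err ends up essentially at the floor c / rho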

Step 5. (Verification of conditions)

Finally, we show the conditions (C1) and (C2) hold for all t=1,2,t=1,2,\dots. By Lemma A.1, we have

Err(0)C(α/β)b2κ2Cb2.\text{Err}^{(0)}\leq C(\alpha/\beta)b^{2}\kappa^{-2}\leq Cb^{2}. (A.29)

By the recursive relationship in (A.27), an induction argument shows that \text{Err}^{(t)}\leq C\alpha\beta^{-1}b^{2}\kappa^{-2}\leq Cb^{2} for all t=1,2,\dots,T under the conditions of Theorem 3.3, so that condition (C2) is maintained throughout. This further implies that

𝐔k(t)𝐔k+𝐔k(t)𝐔k𝐎k(t)Cb,k=1,2,,d,\|\mathbf{U}_{k}^{(t)}\|\leq\|\mathbf{U}_{k}^{*}\|+\|\mathbf{U}_{k}^{(t)}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}^{(t)}\|\leq Cb,~{}~{}k=1,2,\dots,d, (A.30)

and

\max_{k}\|\mbox{\boldmath$\mathscr{S}$}^{(t)}_{(k)}\|\leq\max_{k}\|\mbox{\boldmath$\mathscr{S}$}^{*}_{(k)}\|+\max_{k}\|\mbox{\boldmath$\mathscr{S}$}_{(k)}^{(t)}-(\mbox{\boldmath$\mathscr{S}$}^{*}\times_{j=1}^{d}\mathbf{O}_{j}^{(t)\top})_{(k)}\|\leq C\bar{\sigma}b^{-d}, (A.31)

which completes the convergence analysis.

A.2 Auxiliary lemmas

The first lemma establishes the equivalence between \|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2} and the combined error E; it is adapted from Lemma E.2 in Han, Willett and Zhang (2022) and presented here for self-containedness, so its proof is omitted.

Lemma A.1.

Suppose \mbox{\boldmath$\mathscr{A}$}^{*}=[\![\mbox{\boldmath$\mathscr{S}$}^{*};\mathbf{U}_{1}^{*},\dots,\mathbf{U}_{d}^{*}]\!], \mathbf{U}_{k}^{*\top}\mathbf{U}_{k}^{*}=b^{2}\mathbf{I}_{r_{k}} for k=1,\dots,d, \bar{\sigma}=\max_{k}\|\mbox{\boldmath$\mathscr{A}$}^{*}_{(k)}\|_{\textup{sp}}, and \underline{\sigma}=\min_{k}\sigma_{r_{k}}(\mbox{\boldmath$\mathscr{A}$}^{*}_{(k)}). Let \mbox{\boldmath$\mathscr{A}$}=[\![\mbox{\boldmath$\mathscr{S}$};\mathbf{U}_{1},\dots,\mathbf{U}_{d}]\!] be another Tucker low-rank tensor with \mathbf{U}_{k}\in\mathbb{R}^{p_{k}\times r_{k}}, \|\mathbf{U}_{k}\|\leq(1+c_{0})b, and \max_{k}\|\mbox{\boldmath$\mathscr{S}$}_{(k)}\|\leq(1+c_{0})\bar{\sigma}b^{-d} for some c_{0}>0. Define

E:=\min_{\mathbf{O}_{k}\in\mathbb{O}_{r_{k}},1\leq k\leq d}\left\{\sum_{k=1}^{d}\|\mathbf{U}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\textup{F}}^{2}+\left\|\mbox{\boldmath$\mathscr{S}$}-[\![\mbox{\boldmath$\mathscr{S}$}^{*};\mathbf{O}_{1}^{\top},\dots,\mathbf{O}_{d}^{\top}]\!]\right\|_{\textup{F}}^{2}\right\}. (A.32)

Then, we have

Eb2d(C+C1b2d+2σ¯2)𝓐𝓐F2+2b2C1k=1d𝐔k𝐔kb2𝐈rkF2,and 𝓐𝓐F2Cb2d(C+C2σ¯2b2(d+1))E,\begin{split}&E\leq b^{-2d}(C+C_{1}b^{2d+2}\underline{\sigma}^{-2})\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}+2b^{-2}C_{1}\sum_{k=1}^{d}\|\mathbf{U}_{k}^{\top}\mathbf{U}_{k}-b^{2}\mathbf{I}_{r_{k}}\|_{\textup{F}}^{2},\\ \text{and }&\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\textup{F}}^{2}\leq Cb^{2d}(C+C_{2}\bar{\sigma}^{2}b^{-2(d+1)})E,\end{split} (A.33)

where C1,C2>0C_{1},C_{2}>0 are some constants related to c0c_{0}.

The second lemma provides an upper bound for the second- and higher-order terms in the perturbation of a tensor Tucker decomposition; it is a higher-order generalization of Lemma E.3 in Han, Willett and Zhang (2022).

Lemma A.2.

Suppose that \mbox{\boldmath$\mathscr{A}$}^{*}=\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{U}_{k}^{*} and \mbox{\boldmath$\mathscr{A}$}=\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{U}_{k} with \|\mathbf{U}_{k}\|\asymp\|\mathbf{U}_{k}^{*}\|\asymp b and \|\mbox{\boldmath$\mathscr{S}$}_{(k)}\|\asymp\|\mbox{\boldmath$\mathscr{S}$}_{(k)}^{*}\|\asymp\bar{\sigma}b^{-d}. For \mathbf{O}_{k}\in\mathbb{O}_{r_{k}}, 1\leq k\leq d, let \mbox{\boldmath$\mathscr{H}$}=\mbox{\boldmath$\mathscr{A}$}^{*}-\mbox{\boldmath$\mathscr{A}$}_{0}-\sum_{k=1}^{d}(\mbox{\boldmath$\mathscr{A}$}_{k}-\mbox{\boldmath$\mathscr{A}$}) and \textup{Err}=\sum_{k=1}^{d}\|\mathbf{U}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\textup{F}}^{2}+\|\mbox{\boldmath$\mathscr{S}$}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}^{\top}\|_{\textup{F}}^{2}, where \mbox{\boldmath$\mathscr{A}$}_{0} and \mbox{\boldmath$\mathscr{A}$}_{k} are defined as in the proof of Theorem 3.3 with the superscript (t) omitted. Then, \|\mbox{\boldmath$\mathscr{H}$}\|_{\textup{F}}\leq C_{d}b^{-2}\bar{\sigma}\textup{Err}.

Proof.

We have that

𝓗Fjk𝓢×i=j,k(𝐔i𝐔i𝐎i)×ij,k𝐔j𝐎jF+jkl𝓢×i=j,k,l(𝐔i𝐔i𝐎i)×ij,k,l𝐔j𝐎jF++j𝓢×ij(𝐔i𝐔i𝐎i)×i=j𝐔j𝐎jF+𝓢×i=1d(𝐔i𝐔i𝐎i)Fjk(𝓢×k=1d𝐎k𝓢)×i=j,k(𝐔i𝐔i𝐎i)×ij,k𝐔j𝐎jF+jkl(𝓢×k=1d𝐎k𝓢)×i=j,k,l(𝐔i𝐔i𝐎i)×ij,k,l𝐔j𝐎jF++j(𝓢×k=1d𝐎k𝓢)×ij(𝐔i𝐔i𝐎i)×i=j𝐔j𝐎jF+(𝓢×k=1d𝐎k𝓢)×i=1d(𝐔i𝐔i𝐎i)F(d2)B2B1d2B3+(d3)B2B1d3B33/2++dB2B1B3(d1)/2+B2B3d/2+(d2)B1d2B33/2+(d3)B1d3B32++dB1B3d/2+B3(d+1)/2Cdb2σ¯Err,\begin{split}\|\mbox{\boldmath$\mathscr{H}$}\|_{\text{F}}&\leq\sum_{j\neq k}\left\|\mbox{\boldmath$\mathscr{S}$}^{*}\times_{i=j,k}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i\neq j,k}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\right\|_{\text{F}}\\ &+\sum_{j\neq k\neq l}\left\|\mbox{\boldmath$\mathscr{S}$}^{*}\times_{i=j,k,l}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i\neq j,k,l}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\right\|_{\text{F}}\\ &+\cdots+\sum_{j}\|\mbox{\boldmath$\mathscr{S}$}^{*}\times_{i\neq j}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i=j}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\|_{\text{F}}+\|\mbox{\boldmath$\mathscr{S}$}^{*}\times_{i=1}^{d}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\|_{\text{F}}\\ &\leq\sum_{j\neq k}\left\|(\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{O}_{k}-\mbox{\boldmath$\mathscr{S}$}^{*})\times_{i=j,k}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i\neq j,k}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\right\|_{\text{F}}\\ &+\sum_{j\neq k\neq l}\left\|(\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{O}_{k}-\mbox{\boldmath$\mathscr{S}$}^{*})\times_{i=j,k,l}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i\neq j,k,l}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\right\|_{\text{F}}\\ &+\cdots+\sum_{j}\|(\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{O}_{k}-\mbox{\boldmath$\mathscr{S}$}^{*})\times_{i\neq j}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\times_{i=j}\mathbf{U}_{j}^{*}\mathbf{O}_{j}\|_{\text{F}}\\ &+\|(\mbox{\boldmath$\mathscr{S}$}\times_{k=1}^{d}\mathbf{O}_{k}-\mbox{\boldmath$\mathscr{S}$}^{*})\times_{i=1}^{d}(\mathbf{U}_{i}-\mathbf{U}^{*}_{i}\mathbf{O}_{i})\|_{\text{F}}\\ &\leq\binom{d}{2}B_{2}B_{1}^{d-2}B_{3}+\binom{d}{3}B_{2}B_{1}^{d-3}B_{3}^{3/2}+\cdots+dB_{2}B_{1}B_{3}^{(d-1)/2}+B_{2}B_{3}^{d/2}\\ &+\binom{d}{2}B_{1}^{d-2}B_{3}^{3/2}+\binom{d}{3}B_{1}^{d-3}B_{3}^{2}+\cdots+dB_{1}B_{3}^{d/2}+B_{3}^{(d+1)/2}\leq C_{d}b^{-2}\bar{\sigma}\textup{Err},\end{split} (A.34)

where B1=maxk(𝐔k,𝐔k)B_{1}=\max_{k}(\|\mathbf{U}^{*}_{k}\|,\|\mathbf{U}_{k}\|), B2=maxk(𝓢(k),𝓢(k))B_{2}=\max_{k}(\|\mbox{\boldmath$\mathscr{S}$}^{*}_{(k)}\|,\|\mbox{\boldmath$\mathscr{S}$}_{(k)}\|), and B3=maxk(𝐔k𝐔k𝐎kF2,𝓢𝓢×k=1d𝐎kF2)B_{3}=\max_{k}(\|\mathbf{U}_{k}-\mathbf{U}_{k}^{*}\mathbf{O}_{k}\|_{\text{F}}^{2},\|\mbox{\boldmath$\mathscr{S}$}-\mbox{\boldmath$\mathscr{S}$}^{*}\times_{k=1}^{d}\mathbf{O}_{k}\|_{\text{F}}^{2}).

Appendix B Statistical Analysis of Robust Gradient

The most essential part of the statistical analysis is to prove that the robust gradient estimators are stable. For 1\leq k\leq d, the robust gradient estimator with respect to \mathbf{U}_{k} is \mathbf{G}_{k}=n^{-1}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau), where \mathbf{V}_{k}=(\otimes_{j\neq k}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}. Note that

𝐆kk=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[¯(𝓐;zi)(k)𝐕k]=Tk,1+Tk,2+Tk,3+Tk,4,\begin{split}&\mathbf{G}_{k}-\nabla_{k}\mathcal{R}=\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}]\\ =&~{}T_{k,1}+T_{k,2}+T_{k,3}+T_{k,4},\end{split} (B.1)

where

Tk,1=𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]𝔼[¯(𝓐;zi)(k)𝐕k],Tk,2=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)],Tk,3=𝔼[¯(𝓐;zi)(k)𝐕k]𝔼[¯(𝓐;zi)(k)𝐕k]+𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)],Tk,4=1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]+𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]\begin{split}T_{k,1}=&\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}],\\ T_{k,2}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)],\\ T_{k,3}=&\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}]\\ &+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)]-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)],\\ T_{k,4}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)\\ &-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)]+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)]\end{split} (B.2)

Similarly, for 𝓢\mathscr{S}, its robust gradient estimator is

𝓖0=1ni=1nT(¯(𝓐;zi)×j=1d𝐔j,τ).\mbox{\boldmath$\mathscr{G}$}_{0}=\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau).

We can also decompose \mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}] into four components,

\begin{split}&\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]=\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}]\\ =&T_{0,1}+T_{0,2}+T_{0,3}+T_{0,4},\end{split} (B.3)

where

T0,1=𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)]𝔼[¯(𝓐;zi)×j=1d𝐔j],T0,2=1ni=1nT(¯(𝓐;zi)×j=1d𝐔j,τ)𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)],T0,3=𝔼[¯(𝓐;zi)×j=1d𝐔j]𝔼[¯(𝓐;zi)×j=1d𝐔j]+𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)]𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)],T0,4=1ni=1nT(¯(𝓐;zi)×j=1d𝐔j,τ)1ni=1nT(¯(𝓐;zi)×j=1d𝐔j,τ)𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)]+𝔼[T(¯(𝓐;zi)×j=1d𝐔j,τ)].\begin{split}T_{0,1}=&\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}],\\ T_{0,2}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)],\\ T_{0,3}=&\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}]\\ &+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)]-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)],\\ T_{0,4}=&\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)-\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)\\ &-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)]+\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top},\tau)].\end{split} (B.4)

To prove the stability of the robust gradient estimators, it suffices to derive suitable upper bounds on \|T_{k,j}\|_{\text{F}} for 0\leq k\leq d and 1\leq j\leq 4.
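Before proceeding, a minimal sketch of the truncated averaging defining these estimators may be helpful; it assumes that \text{T}(\cdot,\tau) acts entrywise, replacing each entry x by \text{sign}(x)\min(|x|,\tau), and treats the per-sample projected gradients as given. Function names are illustrative only.

import numpy as np

def truncate(M, tau):
    # Entrywise truncation T(M, tau): clip each entry of M to the interval [-tau, tau].
    return np.clip(M, -tau, tau)

def robust_gradient(sample_grads, tau):
    # n^{-1} sum_i T(g_i, tau): average of the truncated per-sample gradients,
    # where g_i is the projected gradient for a factor or for the core tensor.
    return np.mean([truncate(g, tau) for g in sample_grads], axis=0)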

B.1 Proof of Theorem 4.5

Proof.

The proof consists of seven steps. In the first six steps, we prove the stability of the robust gradient estimators for a generic 1\leq t\leq T and hence omit the superscript (t) for simplicity. Specifically, in the first four steps, we derive the upper bounds for \|T_{k,1}\|_{\text{F}},\dots,\|T_{k,4}\|_{\text{F}}, respectively, for 1\leq k\leq d_{0}. In the fifth and sixth steps, we extend the proof to the terms for d_{0}+1\leq k\leq d and to the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions. Throughout the first six steps, we assume that, for each 1\leq k\leq d, \|\mathbf{U}_{k}\|\asymp\bar{\sigma}^{1/(d+1)} and \|\sin\theta(\mathbf{U}_{k},\mathbf{U}_{k}^{*})\|\leq\delta, and we verify these assumptions in the last step.

Step 1. (Bound Tk,1F\|T_{k,1}\|_{\text{F}} for 1kd01\leq k\leq d_{0})

For any 1kd01\leq k\leq d_{0}, we let rk=r1r2rd0/rkr_{k}^{\prime}=r_{1}r_{2}\cdots r_{d_{0}}/r_{k} and

¯(𝓐;zi)(k)𝐕k=[(𝓧i×j=1,jkd0𝐔j)(k)vec(𝓔i×j=1dd0𝐔d0+j)]𝓢(k).\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}=[(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}^{\top})_{(k)}\otimes\text{vec}(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top})^{\top}]\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}. (B.5)

Denote the columns of 𝓢(k)\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} as 𝓢(k)=[𝐬k,1,sk,2,,𝐬k,rk]\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}=[\mathbf{s}_{k,1},s_{k,2},\dots,\mathbf{s}_{k,r_{k}}] such that vec(𝐒k,j)=𝐬k,j\text{vec}(\mathbf{S}_{k,j})=\mathbf{s}_{k,j}. The (l,m)(l,m)-th entry of n1i=1n¯(𝓐;zi)(k)𝐕kn^{-1}\sum_{i=1}^{n}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k} is

1ni=1n([(𝓧i×j=1,jkd0𝐔j)(k)vec(𝓔i×j=1dd0𝐔d0+j)]𝐬k,m)l=1ni=1n[(𝓧i×j=1,jkd0𝐔j)(k)𝐒k,mvec(𝓔i×j=1dd0𝐔d0+j)]l=1ni=1n[(𝓧i)(k)(j=1,jkd0𝐔j)𝐒k,m(j=d0+1d𝐔j)(𝐞i)]l=1ni=1n𝐜l(𝓧i)(k)(j=1,jkd0𝐔j)𝐒k,m(j=d0+1d𝐔j)(𝐞i),\begin{split}&\frac{1}{n}\sum_{i=1}^{n}\left(\left[(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}^{\top})_{(k)}\otimes\text{vec}(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top})^{\top}\right]\mathbf{s}_{k,m}\right)_{l}\\ &=\frac{1}{n}\sum_{i=1}^{n}\left[(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}^{\top})_{(k)}\mathbf{S}_{k,m}\text{vec}(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top})\right]_{l}\\ &=\frac{1}{n}\sum_{i=1}^{n}\left[(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}(\otimes_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j})\mathbf{S}_{k,m}(\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top})(-\mathbf{e}_{i})\right]_{l}\\ &=\frac{1}{n}\sum_{i=1}^{n}\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}(\otimes_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j})\mathbf{S}_{k,m}(\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top})(-\mathbf{e}_{i}),\end{split} (B.6)

where 𝐜l\mathbf{c}_{l} is the coordinate vector whose ll-th entry is one and the others are zero.

For the fixed 𝐔j\mathbf{U}_{j}’s, let 𝐌k,1=(j=1,jkd0𝐔j)/j=1,jkd0𝐔j\mathbf{M}_{k,1}=(\otimes_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j})/\|\otimes_{j=1,j\neq k}^{d_{0}}\mathbf{U}_{j}\| and 𝐜l(𝓧i)(k)𝐌k,1=(wk,l,1(i),wk,l,2(i),,wk,l,rk(i))\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k,1}=(w^{(i)}_{k,l,1},w^{(i)}_{k,l,2},\dots,w^{(i)}_{k,l,r_{k}^{\prime}}). Similarly, let 𝐌k,2=(j=d0+1d𝐔j)/j=d0+1d𝐔j\mathbf{M}_{k,2}=(\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top})/\|\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top}\| and 𝐌k,2𝐞i=(zk,1(i),zk,2(i),,zk,rd0+1rd0+2rd(i))\mathbf{M}_{k,2}\mathbf{e}_{i}=(z^{(i)}_{k,1},z^{(i)}_{k,2},\dots,z^{(i)}_{k,r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}})^{\top}. By Assumption 4.3, 𝔼|wk,l,j(i)|1+ϵMx,1+ϵ,δ\mathbb{E}|w^{(i)}_{k,l,j}|^{1+\epsilon}\leq M_{x,1+\epsilon,\delta} and 𝔼|zk,m(i)|1+ϵMe,1+ϵ,δ\mathbb{E}|z^{(i)}_{k,{m^{\prime}}}|^{1+\epsilon}\leq M_{e,1+\epsilon,\delta}, for j=1,2,,rkj=1,2,\dots,r_{k}^{\prime}, l=1,2,,pkl=1,2,\dots,p_{k}, and m=1,2,,rd0+1rd0+2rd{m^{\prime}}=1,2,\dots,r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}. Let 𝐌k,3,m=𝐒k,m/𝐒k,m\mathbf{M}_{k,3,m}=\mathbf{S}_{k,m}/\|\mathbf{S}_{k,m}\| and 𝐌k,3,m𝐌k,2𝐞i=(zk,m,1(i),zk,m,2(i),,zk,m,rk(i))\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}\mathbf{e}_{i}=(z_{k,m,1}^{(i)},z_{k,m,2}^{(i)},\dots,z_{k,m,r_{k}^{\prime}}^{(i)}), for m=1,2,,rkm=1,2,\dots,r_{k}. Then, 𝔼[|zk,m,j(i)|1+ϵ|𝓧i]Me,1+ϵ,δ\mathbb{E}[|z_{k,m,j}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}]\lesssim M_{e,1+\epsilon,\delta}. Let vk,j,l,m(i)=wk,l,j(i)zk,m,j(i)v^{(i)}_{k,j,l,m}=w^{(i)}_{k,l,j}z^{(i)}_{k,m,j}, which satisfies that

𝔼[|vk,j,l,m(i)|1+ϵ]=𝔼[|wk,j,l(i)|1+ϵ𝔼[|zk,m,j(i)|1+ϵ|𝓧i]]Mx,1+ϵ,δMe,1+ϵ,δ\mathbb{E}\left[|v_{k,j,l,m}^{(i)}|^{1+\epsilon}\right]=\mathbb{E}\left[|w_{k,j,l}^{(i)}|^{1+\epsilon}\cdot\mathbb{E}\left[|z_{k,m,j}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right]\right]\lesssim M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta} (B.7)

Let vk,l,m(i)=j=1rkvk,j,l,m(i)v^{(i)}_{k,l,m}=\sum_{j=1}^{r_{k}^{\prime}}v_{k,j,l,m}^{(i)} and 𝔼[|vk,l,m(i)|1+ϵ]Mx,1+ϵ,δMe,1+ϵ,δ\mathbb{E}[|v^{(i)}_{k,l,m}|^{1+\epsilon}]\lesssim M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}.

We first bound the bias, namely Tk,1T_{k,1} in (B.3). We have that

Tk,1F2σ¯2dd+1l=1pkm=1rk|𝔼[vk,l,m(i)]𝔼[T(vk,l,m(i),τk)]|2,\|T_{k,1}\|_{\text{F}}^{2}\asymp\bar{\sigma}^{\frac{2d}{d+1}}\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}\left|\mathbb{E}[v^{(i)}_{k,l,m}]-\mathbb{E}[\text{T}(v^{(i)}_{k,l,m},\tau_{k})]\right|^{2}, (B.8)

where τk=τj=1,jkd𝐔j1(max1mrk𝐒k,m)1[nMx,1+ϵ,δMe,1+ϵ,δ/log(p¯)]1/(1+ϵ)\tau_{k}=\tau\cdot\|\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j}\|^{-1}\cdot(\max_{1\leq m\leq r_{k}}\|\mathbf{S}_{k,m}\|)^{-1}\asymp[nM_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}/\log(\bar{p})]^{1/(1+\epsilon)}.

For any l=1,2,,pkl=1,2,\dots,p_{k} and m=1,2,,rkm=1,2,\dots,r_{k}, by definition of the truncation operator T(,)\text{T}(\cdot,\cdot) and Markov’s inequality,

|𝔼[vk,l,m(i)]𝔼[T(vk,l,m(i),τk)]|𝔼[|vk,l,m(i)|1{|vk,l,m(i)|τk}]𝔼[|vk,l,m(i)|1+ϵ]1/(1+ϵ)(|vk,l,m(i)|τk)ϵ/(1+ϵ)𝔼[|vk,l,m(i)|1+ϵ]1/(1+ϵ)(𝔼[|vk,l,m(i)|1+ϵ]τk1+ϵ)ϵ/(1+ϵ)Mx,1+ϵ,δMe,1+ϵ,δτkϵ[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ)\begin{split}&\left|\mathbb{E}\left[v^{(i)}_{k,l,m}\right]-\mathbb{E}\left[\text{T}(v^{(i)}_{k,l,m},\tau_{k})\right]\right|\leq\mathbb{E}\left[|v^{(i)}_{k,l,m}|\cdot 1\{|v^{(i)}_{k,l,m}|\geq\tau_{k}\}\right]\\ \leq&\mathbb{E}\left[|v^{(i)}_{k,l,m}|^{1+\epsilon}\right]^{1/(1+\epsilon)}\cdot\mathbb{P}(|v^{(i)}_{k,l,m}|\geq\tau_{k})^{\epsilon/(1+\epsilon)}\\ \leq&\mathbb{E}\left[|v^{(i)}_{k,l,m}|^{1+\epsilon}\right]^{1/(1+\epsilon)}\cdot\left(\frac{\mathbb{E}\left[|v^{(i)}_{k,l,m}|^{1+\epsilon}\right]}{\tau_{k}^{1+\epsilon}}\right)^{\epsilon/(1+\epsilon)}\\ \asymp&~{}M_{x,1+\epsilon,\delta}\cdot M_{e,1+\epsilon,\delta}\cdot\tau_{k}^{-\epsilon}\asymp\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}\end{split} (B.9)

with truncation parameter

τk[nMx,1+ϵ,δMe,1+ϵ,δlog(p¯)]1/(1+ϵ).\tau_{k}\asymp\left[\frac{nM_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}}{\log(\bar{p})}\right]^{1/(1+\epsilon)}. (B.10)

Hence, for k=1,,d0k=1,\dots,d_{0},

Tk,1Fσ¯d/(d+1)pkrk[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ1+ϵ.\left\|T_{k,1}\right\|_{\text{F}}\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{\epsilon}{1+\epsilon}}. (B.11)
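To make the truncation mechanism of this step concrete, the following minimal sketch (illustrative only: the heavy-tailed sample, the values of n, \epsilon, and \log(\bar{p}), and the omitted moment constants are hypothetical and not part of the proof) implements the entrywise operator T(\cdot,\tau) and compares the plain and truncated sample means at a truncation level of the order in (B.10).

import numpy as np

def truncate(v, tau):
    # Entrywise truncation operator T(v, tau) = sign(v) * min(|v|, tau).
    return np.sign(v) * np.minimum(np.abs(v), tau)

rng = np.random.default_rng(1)
n, eps, log_pbar = 2000, 0.5, np.log(50.0)   # illustrative n, epsilon, and log(p_bar)
v = rng.standard_t(df=1.6, size=n)           # heavy-tailed entries with a finite (1 + eps)-th moment
tau = (n / log_pbar) ** (1.0 / (1.0 + eps))  # truncation level of the order in (B.10), moment constants set to one

print("plain mean:", v.mean(), "truncated mean:", truncate(v, tau).mean())

The truncated average trades a bias of order \tau_{k}^{-\epsilon} (up to moment constants) for Bernstein-type concentration, which is exactly the split into the bias term T_{k,1} and the deviation term T_{k,2} carried out in Steps 1 and 2.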

Step 2. (Bound Tk,2F\|T_{k,2}\|_{\text{F}} for 1kd01\leq k\leq d_{0})

For Tk,2T_{k,2} in (B.3), similarly to Tk,1T_{k,1},

Tk,2F2σ¯2dd+1l,m|1ni=1nT(vk,l,m(i),τk)𝔼[T(vk,l,m(i),τk)]|2.\begin{split}&\|T_{k,2}\|_{\text{F}}^{2}\asymp\bar{\sigma}^{\frac{2d}{d+1}}\sum_{l,m}\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}[\text{T}(v^{(i)}_{k,l,m},\tau_{k})]\right|^{2}.\end{split} (B.12)

For each i=1,2,,ni=1,2,\dots,n, it can be checked that

𝔼[T(vk,l,m(i),τk)2]τk1ϵ𝔼[|vk,l,m(i)|1+ϵ]τk1ϵMx,1+ϵ,δMe,1+ϵ,δ.\mathbb{E}\left[\text{T}(v^{(i)}_{k,l,m},\tau_{k})^{2}\right]\leq\tau_{k}^{1-\epsilon}\cdot\mathbb{E}\left[|v^{(i)}_{k,l,m}|^{1+\epsilon}\right]\asymp\tau_{k}^{1-\epsilon}\cdot M_{x,1+\epsilon,\delta}\cdot M_{e,1+\epsilon,\delta}. (B.13)

Thus, we have the upper bound for the variance

var(T(vk,l,m(i),τk))𝔼[T(vk,l,m(i),τk)2]τk1ϵMx,1+ϵ,δMe,1+ϵ,δ.\text{var}(\text{T}(v^{(i)}_{k,l,m},\tau_{k}))\leq\mathbb{E}\left[\text{T}(v^{(i)}_{k,l,m},\tau_{k})^{2}\right]\lesssim\tau_{k}^{1-\epsilon}\cdot M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}. (B.14)

Also, for any q=3,4,q=3,4,\dots, the higher-order moments satisfy that

𝔼[|T(vk,l,m(i),τk)𝔼[T(vk,l,m(i),τk)]|q](2τk)q2𝔼[(T(vk,l,m(i),τk)𝔼[T(vk,l,m(i),τk)])2].\begin{split}&\mathbb{E}\left[\left|\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}[\text{T}(v^{(i)}_{k,l,m},\tau_{k})]\right|^{q}\right]\leq(2\tau_{k})^{q-2}\cdot\mathbb{E}\left[\left(\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}[\text{T}(v^{(i)}_{k,l,m},\tau_{k})]\right)^{2}\right].\end{split} (B.15)

By Bernstein’s inequality, for any 1lpk1\leq l\leq p_{k}, 1mrk1\leq m\leq r_{k}, and 0<tτkϵMx,1+ϵ,δMe,1+ϵ,δ0<t\lesssim\tau_{k}^{-\epsilon}M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta},

(|1ni=1nT(vk,l,m(i),τk)𝔼T(vk,l,m(i),τk)|t)2exp(nt24τk1ϵMx,1+ϵ,δMe,1+ϵ,δ).\begin{split}&\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}\text{T}(v^{(i)}_{k,l,m},\tau_{k})\right|\geq t\right)\leq 2\exp\left(-\frac{nt^{2}}{4\tau_{k}^{1-\epsilon}M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}}\right).\end{split} (B.16)
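For the choice of t made below, a direct calculation with \tau_{k} as in (B.10) shows that

t\asymp\tau_{k}^{-\epsilon}M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}\qquad\text{and}\qquad\frac{nt^{2}}{\tau_{k}^{1-\epsilon}M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}}\asymp\log(\bar{p}),

so this t lies, up to constants, at the boundary of the admissible range in (B.16), and the resulting exponent is of order \log(\bar{p}).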

Let t=CMx,1+ϵ,δ1/(1+ϵ)Me,1+ϵ,δ1/(1+ϵ)log(p¯)ϵ/(1+ϵ)nϵ/(1+ϵ)t=CM_{x,1+\epsilon,\delta}^{1/(1+\epsilon)}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\log(\bar{p})^{\epsilon/(1+\epsilon)}n^{-\epsilon/(1+\epsilon)}. Therefore, we have

(|1ni=1nT(vk,l,m(i),τk)𝔼T(vk,l,m(i),τk)|C[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ))Cexp(Clog(p¯))\begin{split}&\mathbb{P}\Bigg{(}\Bigg{|}\frac{1}{n}\sum_{i=1}^{n}\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}\text{T}(v^{(i)}_{k,l,m},\tau_{k})\Bigg{|}\\ &~{}~{}~{}~{}~{}~{}~{}\geq C\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}\Bigg{)}\leq C\exp\left(-C\log(\bar{p})\right)\end{split} (B.17)

and

(max1lpk1mrk|1ni=1nT(vk,l,m(i),τk)𝔼T(vk,l,m(i),τk)|C[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ))Cpkrkexp(Clog(p¯))Cexp(Clog(p¯)).\begin{split}&\mathbb{P}\left(\max_{\begin{subarray}{c}1\leq l\leq p_{k}\\ 1\leq m\leq r_{k}\end{subarray}}\Bigg{|}\frac{1}{n}\sum_{i=1}^{n}\text{T}(v^{(i)}_{k,l,m},\tau_{k})-\mathbb{E}\text{T}(v^{(i)}_{k,l,m},\tau_{k})\Bigg{|}\geq C\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}\right)\\ &\leq Cp_{k}r_{k}\exp\left(-C\log(\bar{p})\right)\leq C\exp(-C\log(\bar{p})).\end{split} (B.18)

Hence, for 1kd01\leq k\leq d_{0}, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]Fσ¯d/(d+1)pkrk[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ).\begin{split}&\left\|\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}.\end{split} (B.19)

Step 3. (Bound Tk,3F\|T_{k,3}\|_{\text{F}} for 1kd01\leq k\leq d_{0})

The tensor linear regression can be rewritten as

vec(𝓨i)=𝐲i=𝐀𝐱i+𝐞i=mat(𝓐)vec(𝓧i)+vec(𝓔i).\text{vec}(\mbox{\boldmath$\mathscr{Y}$}_{i})=\mathbf{y}_{i}=\mathbf{A}^{*}\mathbf{x}_{i}+\mathbf{e}_{i}=\text{mat}(\mbox{\boldmath$\mathscr{A}$}^{*})\text{vec}(\mbox{\boldmath$\mathscr{X}$}_{i})+\text{vec}(\mbox{\boldmath$\mathscr{E}$}_{i}). (B.20)

Accordingly, we denote the matricizations of 𝓐\mbox{\boldmath$\mathscr{A}$} and 𝓐\mbox{\boldmath$\mathscr{A}$}^{*} by 𝐀\mathbf{A} and 𝐀\mathbf{A}^{*}, respectively.
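As a quick numerical sanity check of this vectorization (illustrative only: the dimensions, the split into d_{0}=2 covariate modes, and numpy's row-major reshape as the vec(\cdot) convention are assumptions of the sketch, not part of the proof), the contraction form and the matricized form in (B.20) agree entrywise.

import numpy as np

rng = np.random.default_rng(0)
p_in, p_out = (3, 4), (2, 5)             # covariate dimensions (d0 = 2) and response dimensions
A = rng.standard_normal(p_in + p_out)    # coefficient tensor with the covariate modes first
X = rng.standard_normal(p_in)            # covariate tensor X_i
E = rng.standard_normal(p_out)           # noise tensor E_i

# Tensor form: contract the covariate modes of A with X_i, then add noise.
Y = np.tensordot(X, A, axes=([0, 1], [0, 1])) + E

# Matricized form: mat(A*) maps vec(X_i) to vec(Y_i) - vec(E_i).
A_mat = A.reshape(np.prod(p_in), np.prod(p_out)).T
assert np.allclose(Y.ravel(), A_mat @ X.ravel() + E.ravel())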

As in step 1,

𝔼[¯(𝓐;zi)(k)𝐕k¯(𝓐;zi)(k)𝐕k]F2𝐕k2𝔼[(𝐀𝐀)𝐱i𝐱i]F2σ¯2d/(d+1)βx2𝓐𝓐F2.\begin{split}&\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}-\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}]\|_{\text{F}}^{2}\\ \leq&\|\mathbf{V}_{k}\|^{2}\cdot\|\mathbb{E}[(\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|_{\text{F}}^{2}\lesssim~{}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.21)
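The last bound in (B.21) uses that, for the squared loss, the gradient difference equals ((\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i})\mathbf{x}_{i}^{\top} in matricized form; reading \beta_{x} as the smoothness constant of the population loss (consistent with the restricted strong smoothness invoked in Step 7, so that \|\mathbb{E}[\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|\lesssim\beta_{x}),

\|\mathbb{E}[(\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|_{\text{F}}=\|(\mathbf{A}-\mathbf{A}^{*})\mathbb{E}[\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|_{\text{F}}\leq\|\mathbb{E}[\mathbf{x}_{i}\mathbf{x}_{i}^{\top}]\|\cdot\|\mathbf{A}-\mathbf{A}^{*}\|_{\text{F}}\lesssim\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}.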

For 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k}, let the (l,m)(l,m)-th entry of ¯(𝓐;zi)(k)𝐕k\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k} and ¯(𝓐;zi)(k)𝐕k\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k} be xk,l,m(i)x_{k,l,m}^{(i)} and yk,l,m(i)y_{k,l,m}^{(i)}, respectively. Let 𝔼[|xk,l,m(i)yk,l,m(i)|2]=sk,l,m2\mathbb{E}[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|^{2}]=s_{k,l,m}^{2}, and then l=1pkm=1rksk,l,m2σ¯2d/(d+1)βx2𝓐𝓐F2\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.

Similarly to vk,l,m(i)v^{(i)}_{k,l,m}, let 𝐌k,3,m𝐌k,2(𝐀𝐀)𝐱i=(rk,m,1(i),rk,m,2(i),,rk,m,rk(i))\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}(\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}=(r_{k,m,1}^{(i)},r_{k,m,2}^{(i)},\dots,r_{k,m,r_{k}^{\prime}}^{(i)})^{\top}. Note that 𝐀σ¯\|\mathbf{A}\|\lesssim\bar{\sigma} and 𝐀σ¯\|\mathbf{A}^{*}\|\lesssim\bar{\sigma}, and hence 𝐀𝐀σ¯\|\mathbf{A}-\mathbf{A}^{*}\|\lesssim\bar{\sigma}. Then,

𝐜l(𝓧i)(k)𝐌k,1𝐌k,3,m𝐌k,2((𝐀𝐀)𝐱i𝐞i)=𝐜l(𝓧i)(k)𝐌k,1𝐌k,3,m𝐌k,2𝐞i+𝐜l(𝓧i)(k)𝐌k,1𝐌k,3,m𝐌k,2(𝐀𝐀)𝐱i=j=1rkwk,l,j(i)zk,m,j(i)+j=1rkwk,j,l(i)rk,m,j(i):=vk,l,m(i)+uk,l,m(i),\begin{split}&\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k,1}\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}((\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}-\mathbf{e}_{i})\\ =&-\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k,1}\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}\mathbf{e}_{i}+\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k,1}\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}(\mathbf{A}-\mathbf{A}^{*})\mathbf{x}_{i}\\ =&\sum_{j=1}^{r_{k}^{\prime}}w_{k,l,j}^{(i)}z_{k,m,j}^{(i)}+\sum_{j=1}^{r_{k}^{\prime}}w_{k,j,l}^{(i)}r_{k,m,j}^{(i)}:=v_{k,l,m}^{(i)}+u_{k,l,m}^{(i)},\end{split} (B.22)

where 𝔼[|vk,l,m(i)|1+ϵ]Mx,1+ϵ,δMe,1+ϵ,δ\mathbb{E}[|v_{k,l,m}^{(i)}|^{1+\epsilon}]\lesssim M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta} and 𝔼[|uk,l,m(i)|]σ¯βx\mathbb{E}[|u_{k,l,m}^{(i)}|]\lesssim\bar{\sigma}\beta_{x}.

For any 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k},

|𝔼[xk,l,m(i)yk,l,m(i)]𝔼[T(xk,l,m(i),τ)T(yk,l,m(i),τ)]|𝔼[|xk,l,m(i)yk,l,m(i)|1{|vk,l,m(i)+uk,l,m(i)|τk}]𝔼[|xk,l,m(i)yk,l,m(i)|(1{|vk,l,m(i)|τk/2}+1{|uk,l,m(i)|τk/2})]𝔼[|xk,l,m(i)yk,l,m(i)|2]1/2[(|vk,l,m(i)|τk/2)1/2+(|uk,l,m(i)|τk/2)1/2]sk,l,m([𝔼|vk,l,m(i)|1+ϵ(τk/2)1+ϵ]1/2+[𝔼|uk,l,m(i)|τk/2]1/2)sk,l,m[log(p¯)1/2n1/2+σ¯1/2βx1/2log(p¯)12+2ϵMx,1+ϵ,δ12+2ϵMe,1+ϵ,δ12+2ϵn12+2ϵ]:=sk,l,mϕδ1/2.\begin{split}&\left|\mathbb{E}[x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}]-\mathbb{E}[\text{T}(x_{k,l,m}^{(i)},\tau)-\text{T}(y_{k,l,m}^{(i)},\tau)]\right|\\ \leq&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|\cdot 1\{|v^{(i)}_{k,l,m}+u^{(i)}_{k,l,m}|\geq\tau_{k}\}\right]\\ \leq&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|\cdot(1\{|v^{(i)}_{k,l,m}|\geq\tau_{k}/2\}+1\{|u^{(i)}_{k,l,m}|\geq\tau_{k}/2\})\right]\\ \leq&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|^{2}\right]^{1/2}\cdot\left[\mathbb{P}\left(|v^{(i)}_{k,l,m}|\geq\tau_{k}/2\right)^{1/2}+\mathbb{P}\left(|u^{(i)}_{k,l,m}|\geq\tau_{k}/2\right)^{1/2}\right]\\ \leq&~{}s_{k,l,m}\cdot\left(\left[\frac{\mathbb{E}|v_{k,l,m}^{(i)}|^{1+\epsilon}}{(\tau_{k}/2)^{1+\epsilon}}\right]^{1/2}+\left[\frac{\mathbb{E}|u_{k,l,m}^{(i)}|}{\tau_{k}/2}\right]^{1/2}\right)\\ \asymp&~{}s_{k,l,m}\left[\log(\bar{p})^{1/2}n^{-1/2}+\bar{\sigma}^{1/2}\beta_{x}^{1/2}\log(\bar{p})^{\frac{1}{2+2\epsilon}}M_{x,1+\epsilon,\delta}^{\frac{-1}{2+2\epsilon}}M_{e,1+\epsilon,\delta}^{\frac{-1}{2+2\epsilon}}n^{\frac{-1}{2+2\epsilon}}\right]:=s_{k,l,m}\phi_{\delta}^{1/2}.\end{split} (B.23)

Hence,

Tk,3F2=l=1pkm=1rk|𝔼[xk,l,m(i)yk,l,m(i)]𝔼[T(xk,l,m(i),τ)T(yk,l,m(i),τ)]|2ϕδl=1pkm=1rksk,l,m2ϕδσ¯2d/(d+1\begin{split}\|T_{k,3}\|_{\text{F}}^{2}=&\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}\left|\mathbb{E}[x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}]-\mathbb{E}[\text{T}(x_{k,l,m}^{(i)},\tau)-\text{T}(y_{k,l,m}^{(i)},\tau)]\right|^{2}\\ \lesssim&~{}\phi_{\delta}\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.24)
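For later reference, squaring the quantity \phi_{\delta} introduced in (B.23) gives, up to constants,

\phi_{\delta}\asymp\frac{\log(\bar{p})}{n}+\bar{\sigma}\beta_{x}\left[\frac{\log(\bar{p})}{nM_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta}}\right]^{1/(1+\epsilon)},

since (a+b)^{2}\asymp a^{2}+b^{2} for nonnegative a and b.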

Step 4. (Bound Tk,4F\|T_{k,4}\|_{\text{F}} for 1kd01\leq k\leq d_{0})

Let T(¯(𝓐;zi)(k)𝐕k,τ)T(¯(𝓐;zi)(k)𝐕k,τ)=𝐙k(i)={zk,j,l(i)}1jpk,1lrk\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)=\mathbf{Z}^{(i)}_{k}=\{z^{(i)}_{k,j,l}\}_{1\leq j\leq p_{k},1\leq l\leq r_{k}}. Then,

Tk,4F2=j=1pkl=1rk(1ni=1nzk,j,l(i)𝔼[zk,j,l(i)])2\|T_{k,4}\|_{\text{F}}^{2}=\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}\left(\frac{1}{n}\sum_{i=1}^{n}z^{(i)}_{k,j,l}-\mathbb{E}[z_{k,j,l}^{(i)}]\right)^{2} (B.25)

Note that var(zk,j,l(i))=sk,j,l2\text{var}(z_{k,j,l}^{(i)})=s_{k,j,l}^{2} and that j=1pkl=1rksk,j,l2Cσ¯2d/(d+1)βx2𝓐𝓐F2\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}s^{2}_{k,j,l}\leq C\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. Also, |zk,j,l(i)|2τ|z_{k,j,l}^{(i)}|\leq 2\tau. Similarly to the term Tk,2T_{k,2}, by Bernstein’s inequality, for any 0<t<τ1sk,j,l20<t<\tau^{-1}s_{k,j,l}^{2}, 1jpk1\leq j\leq p_{k} and 1lrk1\leq l\leq r_{k},

(|1ni=1nzk,j,l(i)𝔼[zk,j,l(i)]|t)2exp(nt22sk,j,l2).\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}z_{k,j,l}^{(i)}-\mathbb{E}[z_{k,j,l}^{(i)}]\right|\geq t\right)\leq 2\exp\left(-\frac{nt^{2}}{2s^{2}_{k,j,l}}\right). (B.26)

Letting t=Csk,j,llog(p¯)/nt=Cs_{k,j,l}\sqrt{\log(\bar{p})/n}, we have that

(1jpk1lrk|1ni=1nzk,j,l(i)𝔼[zk,j,l(i)]|Csk,j,llog(p¯)/n)Cexp(Clog(p¯)).\mathbb{P}\left(\cup_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left|\frac{1}{n}\sum_{i=1}^{n}z_{k,j,l}^{(i)}-\mathbb{E}[z_{k,j,l}^{(i)}]\right|\geq Cs_{k,j,l}\sqrt{\log(\bar{p})/n}\right)\leq C\exp\left(-C\log(\bar{p})\right). (B.27)

Therefore, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Tk,4F2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2.\|T_{k,4}\|_{\text{F}}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.28)

Combining these results, for any 1kd01\leq k\leq d_{0}, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝐆k𝔼[k]F2ϕδσ¯2d/(d+1)βx2𝓐𝓐F2+σ¯2d/(d+1)(pkrk)[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ.\begin{split}&\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\\ &\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{2d/(d+1)}(p_{k}r_{k})\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}.\end{split} (B.29)

Step 5. (Extension to Tk,1,,Tk,4T_{k,1},\dots,T_{k,4} for d0+1kdd_{0}+1\leq k\leq d)

For d0+1kdd_{0}+1\leq k\leq d, we let rk=rd0+1rd0+2rd/rkr_{k}^{\prime}=r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}/r_{k} and

¯(𝓐;zi)(k)𝐕k=[(𝓔i×j=1,jkd0dd0𝐔d0+j)(kd0)vec(𝓧i×j=1d0𝐔j)]𝓢(k).\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}=[(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1,j\neq k-d_{0}}^{d-d_{0}}\mathbf{U}_{d_{0}+j}^{\top})_{(k-d_{0})}\otimes\text{vec}(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})^{\top}]\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}. (B.30)

The (l,m)(l,m)-th entry of n1i=1n¯(𝓐;zi)(k)𝐕kn^{-1}\sum_{i=1}^{n}\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k} is

1ni=1n([(𝓔i×j=1,jkd0dd0𝐔j)(kd0)vec(𝓧i×j=1d0𝐔j)]𝐬k,m)l=1ni=1n𝐜l(𝓔i)(kd0)(j=d0+1,jkd𝐔j)𝐒k,m(j=1d0𝐔j)𝐱i.\begin{split}&\frac{1}{n}\sum_{i=1}^{n}([(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=1,j\neq k-d_{0}}^{d-d_{0}}\mathbf{U}_{j}^{\top})_{(k-d_{0})}\otimes\text{vec}(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})^{\top}]\mathbf{s}_{k,m})_{l}\\ &=\frac{1}{n}\sum_{i=1}^{n}\mathbf{c}_{l}^{\top}(-\mbox{\boldmath$\mathscr{E}$}_{i})_{(k-d_{0})}(\otimes_{j=d_{0}+1,j\neq k}^{d}\mathbf{U}_{j})\cdot\mathbf{S}_{k,m}(\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})\mathbf{x}_{i}.\end{split} (B.31)

Let 𝐌k,1=(j=d0+1,jkd𝐔j)/j=d0+1,jkd𝐔j\mathbf{M}_{k,1}=(\otimes_{j=d_{0}+1,j\neq k}^{d}\mathbf{U}_{j})/\|\otimes_{j=d_{0}+1,j\neq k}^{d}\mathbf{U}_{j}\| and 𝐜l(𝓔i)(kd0)𝐌k,1=(uk,l,1(i),uk,l,2(i),,uk,l,rk(i))\mathbf{c}_{l}^{\top}(-\mbox{\boldmath$\mathscr{E}$}_{i})_{(k-d_{0})}\mathbf{M}_{k,1}=(u_{k,l,1}^{(i)},u_{k,l,2}^{(i)},\dots,u_{k,l,r_{k}^{\prime}}^{(i)}). Let 𝐌k,2=(j=1d0𝐔j)/j=1d0𝐔j\mathbf{M}_{k,2}=(\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})/\|\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top}\| and 𝐌k,2𝐱i=(sk,1(i),sk,2(i),,sk,r1r2rd0(i))\mathbf{M}_{k,2}\mathbf{x}_{i}=(s_{k,1}^{(i)},s_{k,2}^{(i)},\dots,s_{k,r_{1}r_{2}\cdots r_{d_{0}}}^{(i)})^{\top}.

By Assumption 4.3, 𝔼[|uk,l,j(i)|1+ϵ|𝓧i]Me,1+ϵ,δ\mathbb{E}[|u_{k,l,j}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}]\leq M_{e,1+\epsilon,\delta} and 𝔼[|sk,j(i)|1+ϵ]Mx,1+ϵ,δ\mathbb{E}[|s_{k,j^{\prime}}^{(i)}|^{1+\epsilon}]\leq M_{x,1+\epsilon,\delta}, for j=1,2,,r1r2rd0j^{\prime}=1,2,\dots,r_{1}r_{2}\cdots r_{d_{0}} and l=1,2,,pkl=1,2,\dots,p_{k}. Let 𝐌k,3,m=𝐒k,m/𝐒k,m\mathbf{M}_{k,3,m}=\mathbf{S}_{k,m}/\|\mathbf{S}_{k,m}\| and 𝐌k,3,m𝐌k,2𝐱i=(sk,m,1(i),sk,m,2(i),,sk,m,rk(i))\mathbf{M}_{k,3,m}\mathbf{M}_{k,2}\mathbf{x}_{i}=(s_{k,m,1}^{(i)},s_{k,m,2}^{(i)},\dots,s_{k,m,r_{k}^{\prime}}^{(i)}), where 𝔼[|sk,m,j(i)|1+ϵ]Mx,1+ϵ,δ\mathbb{E}[|s_{k,m,j}^{(i)}|^{1+\epsilon}]\lesssim M_{x,1+\epsilon,\delta}. Let rk,j,l,m(i)=uk,l,j(i)sk,m,j(i)r_{k,j,l,m}^{(i)}=u_{k,l,j}^{(i)}s_{k,m,j}^{(i)}. Hence, following the same arguments as in Steps 1 and 2, we have

𝔼[T(¯(𝓐)(k)𝐕k𝓢(k),τ)]𝔼[¯(𝓐)(k)𝐕k𝓢(k)]Fσ¯d/(d+1)pkrk[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ).\begin{split}&\left\|\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})_{(k)}\mathbf{V}_{k}^{\top}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}.\end{split} (B.32)

Also, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

1ni=1nT(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)𝔼[T(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)]Fσ¯d/(d+1)pkrk[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ).\begin{split}&\left\|\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)}.\end{split} (B.33)

Moreover,

Tk,3F2ϕδσ¯2d/(d+1)βx2𝓐𝓐F2,\begin{split}&\|T_{k,3}\|_{\text{F}}^{2}\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2},\end{split} (B.34)

and with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Tk,4F2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2.\|T_{k,4}\|_{\text{F}}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.35)

Therefore, for d0+1kdd_{0}+1\leq k\leq d, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝐆k𝔼[k]F2ϕδσ¯2dd+1βx2𝓐𝓐F2+σ¯2dd+1(pkrk)[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ.\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\phi_{\delta}\bar{\sigma}^{\frac{2d}{d+1}}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{\frac{2d}{d+1}}(p_{k}r_{k})\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}. (B.36)

Step 6. (Extension to core tensor)

For the partial gradient with respect to the core tensor 𝓢\mathscr{S}, we have

¯(𝓐;zi)×j=1d𝐔j=(𝓧i×j=1d0𝐔j)(𝓔i×j=d0+1d𝐔j).\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}=(\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d_{0}}\mathbf{U}_{j}^{\top})\circ(-\mbox{\boldmath$\mathscr{E}$}_{i}\times_{j=d_{0}+1}^{d}\mathbf{U}_{j}^{\top}). (B.37)

Let 𝐌0,1=j=1d0𝐔j/j=1d0𝐔j\mathbf{M}_{0,1}=\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}/\|\otimes_{j=1}^{d_{0}}\mathbf{U}_{j}\| and 𝐌0,1𝐱i=(w0,1(i),,w0,r1r2rd0(i))\mathbf{M}_{0,1}^{\top}\mathbf{x}_{i}=(w_{0,1}^{(i)},\dots,w_{0,r_{1}r_{2}\cdots r_{d_{0}}}^{(i)})^{\top}, and let 𝐌0,2=j=d0+1d𝐔j/j=d0+1d𝐔j\mathbf{M}_{0,2}=\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}/\|\otimes_{j=d_{0}+1}^{d}\mathbf{U}_{j}\| and 𝐌0,2𝐞i=(z0,1(i),,z0,rd0+1rd0+2rd(i))\mathbf{M}_{0,2}^{\top}\mathbf{e}_{i}=(z_{0,1}^{(i)},\dots,z_{0,r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}}^{(i)})^{\top}. By Assumption 4.3, 𝔼[|w0,j(i)|1+ϵ]Mx,1+ϵ,δ\mathbb{E}[|w_{0,j}^{(i)}|^{1+\epsilon}]\leq M_{x,1+\epsilon,\delta} and 𝔼[|z0,m(i)|1+ϵ|𝓧i]Me,1+ϵ,δ\mathbb{E}[|z_{0,m}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}]\leq M_{e,1+\epsilon,\delta}, for all j=1,2,,r1r2rd0j=1,2,\dots,r_{1}r_{2}\cdots r_{d_{0}} and m=1,2,,rd0+1rd0+2rdm=1,2,\dots,r_{d_{0}+1}r_{d_{0}+2}\cdots r_{d}. Let v0,j,m(i)=w0,j(i)z0,m(i)v_{0,j,m}^{(i)}=w_{0,j}^{(i)}z_{0,m}^{(i)}.
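As in (B.7), conditioning on the covariates gives the entrywise moment bound

\mathbb{E}\left[|v_{0,j,m}^{(i)}|^{1+\epsilon}\right]=\mathbb{E}\left[|w_{0,j}^{(i)}|^{1+\epsilon}\cdot\mathbb{E}\left[|z_{0,m}^{(i)}|^{1+\epsilon}|\mbox{\boldmath$\mathscr{X}$}_{i}\right]\right]\lesssim M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta},

which is the bound behind the estimates for T_{0,1} and T_{0,2} below.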

In a similar fashion, we can show that with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

T0,1Fσ¯d/(d+1)r1r2rd[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ),T0,2Fσ¯d/(d+1)r1r2rd[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]ϵ/(1+ϵ),T0,3FCϕδ1/2σ¯d/(d+1)βx𝓐𝓐F,and T0,4FClog(p¯)nσ¯d/(d+1)βx𝓐𝓐F.\begin{split}\|T_{0,1}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{r_{1}r_{2}\cdots r_{d}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)},\\ \|T_{0,2}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{r_{1}r_{2}\cdots r_{d}}\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\epsilon/(1+\epsilon)},\\ \|T_{0,3}\|_{\text{F}}&\lesssim C\phi_{\delta}^{1/2}\bar{\sigma}^{d/(d+1)}\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}},\\ \text{and }\|T_{0,4}\|_{\text{F}}&\lesssim C\sqrt{\frac{\log(\bar{p})}{n}}\bar{\sigma}^{d/(d+1)}\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}.\end{split} (B.38)

Hence, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝓖0𝔼[0]F2ϕδσ¯2dd+1βx2𝓐𝓐F2+σ¯2dd+1k=1drk[Me,1+ϵ,δ1/ϵMx,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ.\|\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\phi_{\delta}\bar{\sigma}^{\frac{2d}{d+1}}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{\frac{2d}{d+1}}\prod_{k=1}^{d}r_{k}\left[\frac{M_{e,1+\epsilon,\delta}^{1/\epsilon}M_{x,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}. (B.39)

Step 7. (Verify the conditions and conclude the proof)

In the last step, we apply the results above to Theorem 3.3. First, we verify that the conditions in Theorem 3.3 hold. Under Assumption 4.3, by Lemma 3.11 in Bubeck (2015), the RCG condition in Definition 3.2 is implied by restricted strong convexity and smoothness with α=αx\alpha=\alpha_{x} and β=βx\beta=\beta_{x}.

Next, we show the stability of the robust gradient estimators for all t=1,2,,Tt=1,2,\dots,T. By matrix perturbation theory, if 𝓐(0)𝓐Fαx/βxσ¯κ2δ\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}\leq\sqrt{\alpha_{x}/\beta_{x}}\underline{\sigma}\kappa^{-2}\delta, we have sinΘ(𝐔k(0),𝐔k)δ\|\sin\Theta(\mathbf{U}_{k}^{(0)},\mathbf{U}_{k}^{*})\|\leq\delta for all k=1,,dk=1,\dots,d. After a finite number of iterations, CTC_{T}, with probability at least 1CTexp(Clog(p¯))1-C_{T}\exp(-C\log(\bar{p})), we can have sinΘ(𝐔k(CT),𝐔k)δ<(42)1\|\sin\Theta(\mathbf{U}_{k}^{(C_{T})},\mathbf{U}_{k}^{*})\|\leq\delta^{\prime}<(4\sqrt{2})^{-1}.

For any lkl\neq k and any tensor 𝓑p1××pd\mbox{\boldmath$\mathscr{B}$}\in\mathbb{R}^{p_{1}\times\cdots\times p_{d}}, (𝓑×jk𝐔j)(l)=𝐔l𝓑(l)(jl𝐔j)(\mbox{\boldmath$\mathscr{B}$}\times_{j\neq k}\mathbf{U}_{j}^{\top})_{(l)}=\mathbf{U}_{l}^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime}), where 𝐔j=𝐔j\mathbf{U}_{j}^{\prime}=\mathbf{U}_{j} for jkj\neq k and 𝐔k=𝐈rk\mathbf{U}_{k}^{\prime}=\mathbf{I}_{r_{k}}. For any 𝐔l𝒞(𝐔l,δ)\mathbf{U}_{l}\in\mathcal{C}(\mathbf{U}_{l}^{*},\delta^{\prime}), we have 𝐔l𝐔l𝐎l2sinΘ(𝐔l,𝐔l)2δ\|\mathbf{U}_{l}-\mathbf{U}^{*}_{l}\mathbf{O}_{l}\|\leq\sqrt{2}\|\sin\Theta(\mathbf{U}_{l},\mathbf{U}_{l}^{*})\|\leq\sqrt{2}\delta^{\prime} for some 𝐎l𝕆rk×rk\mathbf{O}_{l}\in\mathbb{O}^{r_{k}\times r_{k}}. Let 𝚫l=𝐔l𝐔l𝐎l\mathbf{\Delta}_{l}=\mathbf{U}_{l}-\mathbf{U}^{*}_{l}\mathbf{O}_{l} and decompose 𝚫l=𝚫l,1+𝚫l,2\mathbf{\Delta}_{l}=\mathbf{\Delta}_{l,1}+\mathbf{\Delta}_{l,2} where 𝚫l,1,𝚫l,2=0\langle\mathbf{\Delta}_{l,1},\mathbf{\Delta}_{l,2}\rangle=0 and 𝚫l,1/𝚫l,1,𝚫l,2/𝚫l,2𝒞(𝐔l,δ)\mathbf{\Delta}_{l,1}/\|\mathbf{\Delta}_{l,1}\|,\mathbf{\Delta}_{l,2}/\|\mathbf{\Delta}_{l,2}\|\in\mathcal{C}(\mathbf{U}_{l}^{*},\delta^{\prime}). Thus, we have 𝚫l,12δ\|\mathbf{\Delta}_{l,1}\|\leq\sqrt{2}\delta^{\prime} and 𝚫l,22δ\|\mathbf{\Delta}_{l,2}\|\leq\sqrt{2}\delta^{\prime}.

Denote ξ=sup𝐔l𝒞(𝐔l,δ)𝐔l𝓑(l)(jl𝐔j)F\xi=\sup_{\mathbf{U}_{l}\in\mathcal{C}(\mathbf{U}_{l}^{*},\delta^{\prime})}\|\mathbf{U}_{l}^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}. Then, since

𝐔l𝓑(l)(jl𝐔j)F(𝐔l𝐎l)𝓑(l)(jl𝐔j)F+𝚫l𝓑(l)(jl𝐔j)F(𝐔l𝐎l)𝓑(l)(jl𝐔j)F+𝚫l,1(𝚫l,1/𝚫l,1)𝓑(l)(jl𝐔j)F+𝚫l,2(𝚫l,2/𝚫l,2)𝓑(l)(jl𝐔j)F,\begin{split}&\|\mathbf{U}_{l}^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}\\ &\leq\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}+\|\mathbf{\Delta}_{l}^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}\\ &\leq\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}+\|\mathbf{\Delta}_{l,1}\|\cdot\|(\mathbf{\Delta}_{l,1}/\|\mathbf{\Delta}_{l,1}\|)^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}\\ &+\|\mathbf{\Delta}_{l,2}\|\cdot\|(\mathbf{\Delta}_{l,2}/\|\mathbf{\Delta}_{l,2}\|)^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}},\end{split} (B.40)

we have that

ξ(𝐔l𝐎l)𝓑(l)(jl𝐔j)F+(𝚫l,1+𝚫l,2)ξ,\xi\leq\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}+(\|\mathbf{\Delta}_{l,1}\|+\|\mathbf{\Delta}_{l,2}\|)\xi, (B.41)

that is, taking δ=1/8\delta^{\prime}=1/8,

ξ(122δ)1(𝐔l𝐎l)𝓑(l)(jl𝐔j)F2(𝐔l𝐎l)𝓑(l)(jl𝐔j)F.\xi\leq(1-2\sqrt{2}\delta^{\prime})^{-1}\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}\leq 2\|(\mathbf{U}^{*}_{l}\mathbf{O}_{l})^{\top}\mbox{\boldmath$\mathscr{B}$}_{(l)}(\otimes_{j\neq l}\mathbf{U}_{j}^{\prime})\|_{\text{F}}. (B.42)
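The last inequality is purely numeric: with \delta^{\prime}=1/8,

1-2\sqrt{2}\delta^{\prime}=1-\frac{\sqrt{2}}{4}\approx 0.65\geq\frac{1}{2},\qquad\text{so that}\qquad(1-2\sqrt{2}\delta^{\prime})^{-1}\leq 2.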

Hence, for each iteration t=1,2,,Tt=1,2,\dots,T, combining the results in Steps 1 to 6, we have that, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})), for any k=1,2,,dk=1,2,\dots,d,

𝐆k(t)𝔼[k(t)]F2ϕδσ¯2d/(d+1)βx2𝓐(t)𝓐F2+σ¯2d/(d+1)(pkrk)[Me,1+ϵ,δ1/ϵMx,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ\begin{split}&\|\mathbf{G}_{k}^{(t)}-\mathbb{E}[\nabla_{k}\mathcal{L}^{(t)}]\|_{\text{F}}^{2}\\ &\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{2d/(d+1)}(p_{k}r_{k})\left[\frac{M_{e,1+\epsilon,\delta}^{1/\epsilon}M_{x,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}\end{split} (B.43)

and

𝓖0(t)𝔼[0(t)]F2ϕδσ¯2d/(d+1)βx2𝓐(t)𝓐F2+σ¯2d/(d+1)k=1drk[Me,1+ϵ,δ1/ϵMx,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ.\begin{split}&\|\mbox{\boldmath$\mathscr{G}$}_{0}^{(t)}-\mathbb{E}[\nabla_{0}\mathcal{L}^{(t)}]\|_{\text{F}}^{2}\\ &\lesssim\phi_{\delta}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{2d/(d+1)}\prod_{k=1}^{d}r_{k}\left[\frac{M_{e,1+\epsilon,\delta}^{1/\epsilon}M_{x,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}.\end{split} (B.44)

As nβxαx2κ4log(p¯)+σ¯(1+ϵ)2(βx/αx)2+2ϵlog(p¯)κ4+4ϵ(Mx,1+ϵ,δMe,1+ϵ,δ)1n\gtrsim\beta_{x}\alpha_{x}^{-2}\kappa^{4}\log(\bar{p})+\bar{\sigma}^{(1+\epsilon)^{2}}(\beta_{x}/\alpha_{x})^{2+2\epsilon}\log(\bar{p})\kappa^{4+4\epsilon}(M_{x,1+\epsilon,\delta}M_{e,1+\epsilon,\delta})^{-1}, plugging these into Theorem 3.3, we have that for all t=1,2,,Tt=1,2,\dots,T and k=1,2,,dk=1,2,\dots,d,

Err(t)(1η0αxβx1κ2/2)tErr(0)+Cαx2σ¯4d/(d+1)κ2k=0d𝚫k(t)F2Err(0)+Cαx2σ¯2d/(d+1)κ4(k=1drk+k=1dpkrk)[Me,1+ϵ,δ1/ϵMx,1+ϵ,δ1/ϵlog(p¯)n]2ϵ1+ϵ\begin{split}\text{Err}^{(t)}&\leq(1-\eta_{0}\alpha_{x}\beta_{x}^{-1}\kappa^{-2}/2)^{t}\text{Err}^{(0)}+C\alpha_{x}^{-2}\bar{\sigma}^{-4d/(d+1)}\kappa^{2}\sum_{k=0}^{d}\|\mathbf{\Delta}_{k}^{(t)}\|_{\text{F}}^{2}\\ &\leq\text{Err}^{(0)}+C\alpha_{x}^{-2}\bar{\sigma}^{-2d/(d+1)}\kappa^{4}\left(\prod_{k=1}^{d}r_{k}+\sum_{k=1}^{d}p_{k}r_{k}\right)\left[\frac{M_{e,1+\epsilon,\delta}^{1/\epsilon}M_{x,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}\end{split} (B.45)

and

𝓐(t)𝓐F2κ2(1Cαxβx1κ2)t𝓐(0)𝓐F2+κ4αx2(k=1dpkrk+k=1drk)[Mx,1+ϵ,δ1/ϵMe,1+ϵ,δ1/ϵlog(p¯)n]2ϵ/(1+ϵ).\begin{split}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}&\lesssim\kappa^{2}(1-C\alpha_{x}\beta_{x}^{-1}\kappa^{-2})^{t}\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}\\ &+\kappa^{4}\alpha_{x}^{-2}\left(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k}\right)\left[\frac{M_{x,1+\epsilon,\delta}^{1/\epsilon}M_{e,1+\epsilon,\delta}^{1/\epsilon}\log(\bar{p})}{n}\right]^{2\epsilon/(1+\epsilon)}.\end{split} (B.46)

Finally, for all t=1,2,,Tt=1,2,\dots,T and k=1,2,,dk=1,2,\dots,d,

sinΘ(𝐔k(t),𝐔k)2σ¯2/(d+1)Err(t)σ¯2d+1Err(0)+Cκ4αx2σ¯2deff[Meff,δlog(p¯)n]2ϵ1+ϵδ2.\|\sin\Theta(\mathbf{U}_{k}^{(t)},\mathbf{U}_{k}^{*})\|^{2}\leq\bar{\sigma}^{-2/(d+1)}\text{Err}^{(t)}\leq\bar{\sigma}^{\frac{-2}{d+1}}\text{Err}^{(0)}+C\kappa^{4}\alpha_{x}^{-2}\bar{\sigma}^{-2}d_{\text{eff}}\left[\frac{M_{\text{eff},\delta}\log(\bar{p})}{n}\right]^{\frac{2\epsilon}{1+\epsilon}}\leq\delta^{2}. (B.47)

B.2 Proof of Theorem 4.9

Proof.

The proof consists of six steps. In the first five steps, we prove the stability of the robust gradient estimators for the general 1tT1\leq t\leq T and, hence, we omit the notation (t)(t) for simplicity. Specifically, in the first four steps, we give the upper bounds for Tk,1F,,Tk,4F\|T_{k,1}\|_{\text{F}},\dots,\|T_{k,4}\|_{\text{F}}, respectively, for 1kd1\leq k\leq d. In the fifth step, we extend the proof to the terms for the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions. Throughout the first five steps, we assume that for each 1kd1\leq k\leq d, 𝐔kσ¯1/(d+1)\|\mathbf{U}_{k}\|\asymp\bar{\sigma}^{1/(d+1)} and sinΘ(𝐔k,𝐔k)δ\|\sin\Theta(\mathbf{U}_{k},\mathbf{U}_{k}^{*})\|\leq\delta and will verify them in the last step.

Step 1. (Bound Tk,1F\|T_{k,1}\|_{\text{F}})

For any 1kd1\leq k\leq d, we let rk=r1r2rd/rkr_{k}^{\prime}=r_{1}r_{2}\cdots r_{d}/r_{k} and

¯(𝓐;zi)(k)(j=1,jkd𝐔j)𝓢(k)=(exp(𝓧i,𝓐)1+exp(𝓧i,𝓐)yi)(𝓧i)(k)(j=1,jkd𝐔j)𝓢(k).\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}(\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}=\left(\frac{\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}{1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}^{*}\rangle)}-y_{i}\right)(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}(\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j})\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}. (B.48)
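For intuition, the following minimal sketch (illustrative only: the dimensions, the simulated data, and the unfolding convention are assumptions of the sketch, not part of the theorem) computes the logistic score q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i} and the corresponding mode-k gradient factor appearing in (B.48) for a single sample.

import numpy as np

rng = np.random.default_rng(2)
p = (4, 3, 2)                              # illustrative tensor dimensions (d = 3)
A_star = rng.standard_normal(p) / 10.0     # true coefficient tensor (weak signal)
X_i = rng.standard_normal(p)               # covariate tensor X_i
q_i = 1.0 / (1.0 + np.exp(-np.sum(X_i * A_star)))  # q_i(A*) = exp(<X_i, A*>) / (1 + exp(<X_i, A*>))
y_i = rng.binomial(1, q_i)                         # binary response

k = 1                                              # an arbitrary mode (zero-based here)
X_k = np.moveaxis(X_i, k, 0).reshape(p[k], -1)     # a mode-k unfolding of X_i
grad_k = (q_i - y_i) * X_k                         # (q_i(A*) - y_i) (X_i)_(k), the factor in (B.48) before projection
print(grad_k.shape)                                # (p_k, product of the remaining dimensions)

Since the score q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i} is bounded by one in absolute value, only second moments of the projected covariates enter the analysis in this subsection, as used in (B.49).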

Let 𝐌k=(j=1,jkd𝐔j)/j=1,jkd𝐔j\mathbf{M}_{k}=(\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j})/\|\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j}\| and 𝐜l(𝓧i)(k)𝐌k=(wk,l,1(i),wk,l,2(i),,wk,l,rk(i))\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k}=(w^{(i)}_{k,l,1},w^{(i)}_{k,l,2},\dots,w^{(i)}_{k,l,r_{k}^{\prime}}). By Assumption 4.8, 𝔼[|wk,l,j(i)|2]Mx,2,δ\mathbb{E}[|w_{k,l,j}^{(i)}|^{2}]\leq M_{x,2,\delta} for l=1,2,,pkl=1,2,\dots,p_{k} and j=1,2,,rkj=1,2,\dots,r_{k}^{\prime}. Let 𝐍k=𝓢(k)/𝓢(k)\mathbf{N}_{k}=\mbox{\boldmath$\mathscr{S}$}_{(k)}/\|\mbox{\boldmath$\mathscr{S}$}_{(k)}\| and 𝐜l(𝓧i)(k)𝐌k𝐍k=(zk,l,1(i),zk,l,2(i),,zk,l,rk(i))\mathbf{c}_{l}^{\top}(\mbox{\boldmath$\mathscr{X}$}_{i})_{(k)}\mathbf{M}_{k}\mathbf{N}_{k}^{\top}=(z_{k,l,1}^{(i)},z_{k,l,2}^{(i)},\dots,z_{k,l,r_{k}}^{(i)}). Then, 𝔼[|zk,l,j(i)|2]Mx,2,δ\mathbb{E}[|z^{(i)}_{k,l,j}|^{2}]\lesssim M_{x,2,\delta}. Also, denote qi(𝓐)=exp(𝓧i,𝓐)/[1+exp(𝓧i,𝓐)]q_{i}(\mbox{\boldmath$\mathscr{A}$})=\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)/[1+\exp(\langle\mbox{\boldmath$\mathscr{X}$}_{i},\mbox{\boldmath$\mathscr{A}$}\rangle)] for any 𝓐\mathscr{A}.

We first bound the bias Tk,1F\|T_{k,1}\|_{\text{F}}. Note that |qi(𝓐)yi|1|q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i}|\leq 1 and

𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]=𝔼[𝔼[|qi(𝓐)yi|2|𝓧i]|zk,l,j(i)|2]Mx,2,δ.\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]=\mathbb{E}\left[\mathbb{E}\left[|q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i}|^{2}|\mbox{\boldmath$\mathscr{X}$}_{i}\right]\cdot|z_{k,l,j}^{(i)}|^{2}\right]\leq M_{x,2,\delta}. (B.49)

Let τk=τ/j=1,jkd𝐔j[nMx,2,δ/log(p¯)]1/2\tau_{k}=\tau/\|\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j}\|\asymp[nM_{x,2,\delta}/\log(\bar{p})]^{1/2}. Then,

Tk,1F2σ¯2d/(d+1)l=1pkj=1rk|𝔼[(qi(𝓐)yi)zk,l,j(i)]𝔼[T((qi(𝓐)yi)zk,l,j(i),τk)]|2.\|T_{k,1}\|_{\text{F}}^{2}\asymp\bar{\sigma}^{2d/(d+1)}\sum_{l=1}^{p_{k}}\sum_{j=1}^{r_{k}}\left|\mathbb{E}\left[(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}\right]-\mathbb{E}\left[\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right]\right|^{2}. (B.50)

By Hölder’s inequality and Markov’s inequality,

|𝔼[(qi(𝓐)yi)zk,l,j(i)]𝔼[T((qi(𝓐)yi)zk,l,j(i),τk)]|𝔼[|(qi(𝓐)yi)zk,l,j(i)|1{|(qi(𝓐)yi)zk,l,j(i)|τk}]𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]1/2(|(qi(𝓐)yi)zk,l,j(i)|τk)1/2𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]1/2(𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]τk2)1/2Mx,2,δτk1[Mx,2,δlog(p¯)n]1/2.\begin{split}&\left|\mathbb{E}\left[(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}\right]-\mathbb{E}\left[\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right]\right|\\ \leq&\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\cdot 1\{|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\}\right]\\ \leq&\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]^{1/2}\cdot\mathbb{P}(|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k})^{1/2}\\ \leq&\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]^{1/2}\cdot\left(\frac{\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]}{\tau_{k}^{2}}\right)^{1/2}\\ \asymp&~{}M_{x,2,\delta}\cdot\tau_{k}^{-1}\asymp\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{1/2}.\end{split} (B.51)

Hence, we have

𝔼[T(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)]𝔼[¯(𝓐;zi)(k)𝐕k𝓢(k)]Fσ¯dd+1pkrk[Mx,2,δlog(p¯)n]12.\begin{split}&\left\|\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}^{\top}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}^{\top}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{\frac{d}{d+1}}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{\frac{1}{2}}.\end{split} (B.52)

Step 2. (Bound Tk,2F\|T_{k,2}\|_{\text{F}})

For Tk,2T_{k,2} in (B.3), it can be checked that

𝔼[T((qi(𝓐)yi)zk,l,j(i),τk)2]𝔼[|(qi(𝓐)yi)zk,l,j(i)|2]Mx,2,δ.\begin{split}&\mathbb{E}[\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k})^{2}]\leq\mathbb{E}\left[|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|^{2}\right]\lesssim M_{x,2,\delta}.\end{split} (B.53)

Thus, var(T((qi(𝓐)yi)zk,l,j(i),τk))𝔼[T((qi(𝓐)yi)zk,l,j(i),τk)2]Mx,2,δ\text{var}(\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k}))\leq\mathbb{E}[\text{T}((q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)},\tau_{k})^{2}]\lesssim M_{x,2,\delta}. Also, for any s=3,4,s=3,4,\dots, the higher-order moments satisfy that

𝔼[(T((q(𝓧i)yi)zk,l,j(i),τk)𝔼[T((q(𝓧i)yi)zk,l,j(i),τk)])s](2τk)s2𝔼[(T((q(𝓧i)yi)zk,l,j(i),τk)𝔼[T((q(𝓧i)yi)zk,l,j(i),τk)])2].\begin{split}&\mathbb{E}\left[(\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}[\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})])^{s}\right]\\ &\leq(2\tau_{k})^{s-2}\mathbb{E}\left[(\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}[\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})])^{2}\right].\end{split} (B.54)

By Bernstein’s inequality, for any 0<t<(2τk)1Mx,2,δ0<t<(2\tau_{k})^{-1}M_{x,2,\delta},

(|1ni=1nT((q(𝓧i)yi)zk,l,j(i),τk)𝔼T((q(𝓧i)yi)zk,l,j(i),τk)|>t)2exp(nt24Mx,2,δ)\begin{split}&\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right|>t\right)\\ &\leq 2\exp\left(-\frac{nt^{2}}{4M_{x,2,\delta}}\right)\end{split} (B.55)

Let t=CMx,2,δ1/2log(p¯)1/2n1/2t=CM_{x,2,\delta}^{1/2}\log(\bar{p})^{1/2}n^{-1/2}. Therefore, we have

(|1ni=1nT((q(𝓧i)yi)zk,l,j(i),τk)𝔼T((q(𝓧i)yi)zk,l,j(i),τk)|>C[Mx,2,δlog(p¯)n]12)Cexp(Clog(p¯))\begin{split}&\mathbb{P}\Bigg{(}\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right|\\ &~{}~{}~{}~{}~{}~{}>C\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{\frac{1}{2}}\Bigg{)}\leq C\exp(-C\log(\bar{p}))\end{split} (B.56)

and

(max1jpk1lrk|1ni=1nT((q(𝓧i)yi)zk,l,j(i),τk)𝔼[T((q(𝓧i)yi)zk,l,j(i),τk)]|>C[Mx,2,δlog(p¯)n]12)Cpkrkexp(Clog(p¯))Cexp(Clog(p¯)).\begin{split}\mathbb{P}\Bigg{(}&\max_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left|\frac{1}{n}\sum_{i=1}^{n}\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})-\mathbb{E}\left[\text{T}((q(\mbox{\boldmath$\mathscr{X}$}_{i})-y_{i})z_{k,l,j}^{(i)},\tau_{k})\right]\right|\\ &~{}~{}~{}>C\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{\frac{1}{2}}\Bigg{)}\leq Cp_{k}r_{k}\exp(-C\log(\bar{p}))\leq C\exp(-C\log(\bar{p})).\end{split} (B.57)

Therefore, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

1ni=1nT(¯(𝓐;zi)(k)𝐕k,τ)𝔼[T(¯(𝓐;zi)(k)𝐕k,τ)]Fσ¯d/(d+1)pkrk[Mx,2,δlog(p¯)n]1/2.\begin{split}&\left\|\frac{1}{n}\sum_{i=1}^{n}\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)-\mathbb{E}[\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k},\tau)]\right\|_{\text{F}}\\ &\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right]^{1/2}.\end{split} (B.58)

Step 3. (Bound Tk,3F\|T_{k,3}\|_{\text{F}})

By a Taylor expansion of ¯()\overline{\mathcal{L}}(\cdot),

𝔼[¯(𝓐;zi)(k)]𝐕k𝓢(k)𝔼[¯(𝓐;zi)(k)]𝐕k𝓢(k)F2𝐕k𝓢(k)2𝔼[¯(𝓐;zi)]𝔼[¯(𝓐;zi)]F2σ¯2d/(d+1)βx2𝓐𝓐F2.\begin{split}&\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}]\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}]\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}\|_{\text{F}}^{2}\\ &~{}\leq\|\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}\|^{2}\cdot\|\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})]-\mathbb{E}[\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})]\|_{\text{F}}^{2}\\ &~{}\lesssim\bar{\sigma}^{2d/(d+1)}\beta^{2}_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.59)

For 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k}, let the (l,m)(l,m)-th entry of ¯(𝓐;zi)(k)𝐕k𝓢(k)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} and ¯(𝓐;zi)(k)𝐕k𝓢(k)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} be xk,l,m(i)x_{k,l,m}^{(i)} and yk,l,m(i)y_{k,l,m}^{(i)}, respectively. Let 𝔼[|xk,l,m(i)yk,l,m(i)|2]=sk,l,m2\mathbb{E}[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|^{2}]=s_{k,l,m}^{2}, and then l=1pkm=1rksk,l,m2σ¯2d/(d+1)βx2𝓐𝓐F2\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\bar{\sigma}^{2d/(d+1)}\beta^{2}_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.

For any 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k},

|𝔼[xk,l,m(i)yk,l,m(i)]𝔼[T(xk,l,m(i),τ)T(yk,l,m(i),τ)]|𝔼[|xk,l,m(i)yk,l,m(i)|(1{|(qi(𝓐)yi)zk,l,j(i)|τk}+1{|(qi(𝓐)yi)zk,l,j(i)|τk})]𝔼[|xk,l,m(i)yk,l,m(i)|2]1/2[(|(qi(𝓐)yi)zk,l,j(i)|τk)1/2+(|(qi(𝓐)yi)zk,l,j(i)|τk)1/2]sk,l,m[𝔼|zk,l,j|2τk2]1/2sk,l,mlog(p¯)/n.\begin{split}&\left|\mathbb{E}[x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}]-\mathbb{E}[\text{T}(x_{k,l,m}^{(i)},\tau)-\text{T}(y_{k,l,m}^{(i)},\tau)]\right|\\ \lesssim&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|\cdot(1\{|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\}+1\{|(q_{i}(\mbox{\boldmath$\mathscr{A}$})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\})\right]\\ \leq&~{}\mathbb{E}\left[|x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}|^{2}\right]^{1/2}\cdot\\ &\left[\mathbb{P}\left(|(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\right)^{1/2}+\mathbb{P}\left(|(q_{i}(\mbox{\boldmath$\mathscr{A}$})-y_{i})z_{k,l,j}^{(i)}|\geq\tau_{k}\right)^{1/2}\right]\\ \lesssim&~{}s_{k,l,m}\cdot\left[\frac{\mathbb{E}|z_{k,l,j}|^{2}}{\tau_{k}^{2}}\right]^{1/2}\asymp s_{k,l,m}\sqrt{\log(\bar{p})/n}.\end{split} (B.60)

Hence,

Tk,3F2=l=1pkm=1rk|𝔼[xk,l,m(i)yk,l,m(i)]𝔼[T(xk,l,m(i),τ)T(yk,l,m(i),τ)]|2log(p¯)nl=1pkm=1rksk,l,m2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2.\begin{split}\|T_{k,3}\|_{\text{F}}^{2}=&\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}\left|\mathbb{E}[x_{k,l,m}^{(i)}-y_{k,l,m}^{(i)}]-\mathbb{E}[\text{T}(x_{k,l,m}^{(i)},\tau)-\text{T}(y_{k,l,m}^{(i)},\tau)]\right|^{2}\\ \lesssim&~{}\frac{\log(\bar{p})}{n}\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.61)

Step 4. (Bound Tk,4F\|T_{k,4}\|_{\text{F}})

Let T(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)T(¯(𝓐;zi)(k)𝐕k𝓢(k),τ)=𝐙k(i)={zk,j,l(i)}1jpk,1lrk\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)-\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau)=\mathbf{Z}^{(i)}_{k}=\{z^{(i)}_{k,j,l}\}_{1\leq j\leq p_{k},1\leq l\leq r_{k}}. Then,

Tk,4F2=j=1pkl=1rk(1ni=1nzk,j,l(i)𝔼[zk,j,l(i)])2\|T_{k,4}\|_{\text{F}}^{2}=\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}\left(\frac{1}{n}\sum_{i=1}^{n}z^{(i)}_{k,j,l}-\mathbb{E}[z_{k,j,l}^{(i)}]\right)^{2} (B.62)

Note that var(zk,j,l(i))=sk,j,l2\text{var}(z_{k,j,l}^{(i)})=s_{k,j,l}^{2} and that j=1pkl=1rksk,j,l2Cσ¯2d/(d+1)βx2𝓐𝓐F2\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}s^{2}_{k,j,l}\leq C\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. Also, |zk,j,l(i)|2τ|z_{k,j,l}^{(i)}|\leq 2\tau. Similarly to the term Tk,2T_{k,2}, by Bernstein’s inequality, for any 0<t<τ1sk,j,l20<t<\tau^{-1}s_{k,j,l}^{2}, 1jpk1\leq j\leq p_{k} and 1lrk1\leq l\leq r_{k},

(|1ni=1nzk,j,l(i)𝔼[zk,j,l(i)]|t)2exp(nt22sk,j,l2).\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^{n}z_{k,j,l}^{(i)}-\mathbb{E}[z_{k,j,l}^{(i)}]\right|\geq t\right)\leq 2\exp\left(-\frac{nt^{2}}{2s^{2}_{k,j,l}}\right). (B.63)

Letting t=Csk,j,llog(p¯)/nt=Cs_{k,j,l}\sqrt{\log(\bar{p})/n}, we have that

(1jpk1lrk|1ni=1nzk,j,l(i)𝔼[zk,j,l(i)]|Csk,j,llog(p¯)/n)Cexp(Clog(p¯)).\mathbb{P}\left(\cup_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left|\frac{1}{n}\sum_{i=1}^{n}z_{k,j,l}^{(i)}-\mathbb{E}[z_{k,j,l}^{(i)}]\right|\geq Cs_{k,j,l}\sqrt{\log(\bar{p})/n}\right)\leq C\exp\left(-C\log(\bar{p})\right). (B.64)

Therefore, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Tk,4F2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2.\|T_{k,4}\|_{\text{F}}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.65)

Combining these results, for any 1kd1\leq k\leq d, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝐆k𝔼[k]F2log(p¯)nσ¯2d/(d+1)βx2𝓐𝓐F2+σ¯2d/(d+1)pkrkMx,2,δlog(p¯)n.\begin{split}&\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\\ \lesssim&~{}\frac{\log(\bar{p})}{n}\bar{\sigma}^{2d/(d+1)}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\frac{\bar{\sigma}^{2d/(d+1)}p_{k}r_{k}M_{x,2,\delta}\log(\bar{p})}{n}.\end{split} (B.66)

Step 5. (Extension to core tensor)

For the partial gradient with respect to the core tensor 𝓢\mathscr{S}, we have

¯(𝓐;zi)×j=1d𝐔j=(qi(𝓐)yi)𝓧i×j=1d𝐔j.\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})\times_{j=1}^{d}\mathbf{U}_{j}^{\top}=(q_{i}(\mbox{\boldmath$\mathscr{A}$}^{*})-y_{i})\cdot\mbox{\boldmath$\mathscr{X}$}_{i}\times_{j=1}^{d}\mathbf{U}_{j}^{\top}. (B.67)

Let 𝐌0=j=1d𝐔j/j=1d𝐔j\mathbf{M}_{0}=\otimes_{j=1}^{d}\mathbf{U}_{j}/\|\otimes_{j=1}^{d}\mathbf{U}_{j}\| and 𝐌0𝐱i=(w0,1(i),,w0,r1r2rd(i))\mathbf{M}_{0}^{\top}\mathbf{x}_{i}=(w_{0,1}^{(i)},\dots,w_{0,r_{1}r_{2}\cdots r_{d}}^{(i)})^{\top}. By Assumption 4.8, 𝔼|w0,j|2Mx,2,δ\mathbb{E}|w_{0,j}|^{2}\leq M_{x,2,\delta} for all j=1,2,,r1r2rdj=1,2,\dots,r_{1}r_{2}\cdots r_{d}. In a similar fashion, we can show that with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

T0,1Fσ¯d/(d+1)r1r2rdMx,2,δlog(p¯)n,T0,2Fσ¯d/(d+1)r1r2rdMx,2,δlog(p¯)n,T0,3Fσ¯d/(d+1)log(p¯)nβx𝓐𝓐F,and T0,4Fσ¯d/(d+1)log(p¯)nβx𝓐𝓐F.\begin{split}\|T_{0,1}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\frac{r_{1}r_{2}\cdots r_{d}M_{x,2,\delta}\log(\bar{p})}{n}},\\ \|T_{0,2}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\frac{r_{1}r_{2}\cdots r_{d}M_{x,2,\delta}\log(\bar{p})}{n}},\\ \|T_{0,3}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\frac{\log(\bar{p})}{n}}\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}},\\ \text{and }\|T_{0,4}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\frac{\log(\bar{p})}{n}}\beta_{x}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}.\end{split} (B.68)

Hence, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

𝓖0𝔼[0]F2log(p¯)nσ¯2dd+1βx2𝓐𝓐F2+σ¯2dd+1(k=1drk)Mx,2,δlog(p¯)n.\|\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\frac{\log(\bar{p})}{n}\bar{\sigma}^{\frac{2d}{d+1}}\beta_{x}^{2}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\bar{\sigma}^{\frac{2d}{d+1}}\frac{(\prod_{k=1}^{d}r_{k})M_{x,2,\delta}\log(\bar{p})}{n}. (B.69)

Step 6. (Verify the conditions and conclude the proof)

Under Assumption 4.8, we can calculate the Hessian 2¯(𝓐)\nabla^{2}\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}) and easily check that the RCG condition holds.

Next, we can plug the results in the first five steps into Theorem 3.3. Using the same argument in the proof of Theorem 4.5, we have that with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Err(t)Err(0)+Cκ4αx2σ¯2d/(d+1)deffMx,2,δlog(p¯)n,\text{Err}^{(t)}\leq\text{Err}^{(0)}+C\kappa^{4}\alpha_{x}^{-2}\bar{\sigma}^{-2d/(d+1)}\frac{d_{\text{eff}}M_{x,2,\delta}\log(\bar{p})}{n}, (B.70)

and

𝓐(t)𝓐F2κ2(1Cαxβx1κ2)t𝓐(0)𝓐F2+κ4αx2(k=1dpkrk+k=1drk)[Mx,2,δlog(p¯)n],\begin{split}\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}&\lesssim\kappa^{2}(1-C\alpha_{x}\beta_{x}^{-1}\kappa^{-2})^{t}\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}\\ &+\kappa^{4}\alpha_{x}^{-2}\left(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k}\right)\left[\frac{M_{x,2,\delta}\log(\bar{p})}{n}\right],\end{split} (B.71)

for all t=1,2,,Tt=1,2,\dots,T.

Finally, for all t=1,2,,Tt=1,2,\dots,T and k=1,2,,dk=1,2,\dots,d,

sinΘ(𝐔k(t),𝐔k)σ¯1d+1Err(t)σ¯1d+1Err(0)+Cκ2αx1σ¯1deff1/2[Meff,δlog(p¯)n]1/2δ.\|\sin\Theta(\mathbf{U}_{k}^{(t)},\mathbf{U}_{k}^{*})\|\leq\bar{\sigma}^{\frac{-1}{d+1}}\sqrt{\text{Err}^{(t)}}\leq\bar{\sigma}^{\frac{-1}{d+1}}\sqrt{\text{Err}^{(0)}}+C\kappa^{2}\alpha_{x}^{-1}\bar{\sigma}^{-1}d_{\text{eff}}^{1/2}\left[\frac{M_{\text{eff},\delta}\log(\bar{p})}{n}\right]^{1/2}\leq\delta. (B.72)

B.3 Proof of Theorem 4.13

The proof consists of six steps. In the first five steps, we prove the stability of the robust gradient estimators for the general 1tT1\leq t\leq T and, hence, we omit the notation (t)(t) for simplicity. Specifically, in the first four steps, we give the upper bounds for Tk,1F,,Tk,4F\|T_{k,1}\|_{\text{F}},\dots,\|T_{k,4}\|_{\text{F}}, respectively, for 1kd1\leq k\leq d. In the fifth step, we extend the proof to the terms for the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions. Throughout the first five steps, we assume that for each 1kd1\leq k\leq d, 𝐔kσ¯1/(d+1)\|\mathbf{U}_{k}\|\asymp\bar{\sigma}^{1/(d+1)} and sinΘ(𝐔k,𝐔k)δ\|\sin\Theta(\mathbf{U}_{k},\mathbf{U}_{k}^{*})\|\leq\delta and will verify them in the last step.

Step 1. (Bound Tk,1F\|T_{k,1}\|_{\text{F}})

Let 𝐌k,1=(j=1,jkd𝐔j)/j=1,jkd𝐔j\mathbf{M}_{k,1}=(\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j})/\|\otimes_{j=1,j\neq k}^{d}\mathbf{U}_{j}\| and its columns as 𝐌k,1=[𝐦k,1,𝐦k,2,,𝐦k,rk]\mathbf{M}_{k,1}=[\mathbf{m}_{k,1},\mathbf{m}_{k,2},\dots,\mathbf{m}_{k,r_{k}^{\prime}}]. Let zk,j,l=𝐜j𝓔(k)𝐦k,lz_{k,j,l}=-\mathbf{c}_{j}^{\top}\mbox{\boldmath$\mathscr{E}$}_{(k)}\mathbf{m}_{k,l}, and by Assumption 4.8, 𝔼[|zk,j,l|1+ϵ]Me,1+ϵ,δ\mathbb{E}[|z_{k,j,l}|^{1+\epsilon}]\leq M_{e,1+\epsilon,\delta}. Let 𝐌k,2=𝓢(k)/𝓢(k)\mathbf{M}_{k,2}=\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}/\|\mbox{\boldmath$\mathscr{S}$}_{(k)}\| and, hence, we have 𝐜j𝓔(k)𝐌k,1𝐌k,2𝐜m=wk,j,m-\mathbf{c}_{j}^{\top}\mbox{\boldmath$\mathscr{E}$}_{(k)}\mathbf{M}_{k,1}\mathbf{M}_{k,2}\mathbf{c}_{m}=w_{k,j,m}, for 1jpk1\leq j\leq p_{k} and 1mrk1\leq m\leq r_{k}, satisfying 𝔼[|wk,j,m|1+ϵ]Me,1+ϵ,δ\mathbb{E}[|w_{k,j,m}|^{1+\epsilon}]\lesssim M_{e,1+\epsilon,\delta}.

We first bound the bias, for any 1jpk1\leq j\leq p_{k} and 1mrk1\leq m\leq r_{k},

|𝔼[wk,j,m]𝔼[T(wk,j,m,τk)]|𝔼[|wk,j,m|1{|wk,j,m|τk}]𝔼[|wk,j,m|1+ϵ]1/(1+ϵ)(|wk,j,m|τk)ϵ/(1+ϵ)𝔼[|wk,j,m|1+ϵ]1/(1+ϵ)(𝔼[|wk,j,m|1+ϵ]τk1+ϵ)ϵ/(1+ϵ)Me,1+ϵ,δτkϵMe,1+ϵ,δ1/(1+ϵ)κ2\begin{split}&|\mathbb{E}[w_{k,j,m}]-\mathbb{E}[\text{T}(w_{k,j,m},\tau_{k})]|\leq\mathbb{E}[|w_{k,j,m}|\cdot 1\{|w_{k,j,m}|\geq\tau_{k}\}]\\ \leq&\mathbb{E}\left[|w_{k,j,m}|^{1+\epsilon}\right]^{1/(1+\epsilon)}\cdot\mathbb{P}(|w_{k,j,m}|\geq\tau_{k})^{\epsilon/(1+\epsilon)}\\ \leq&\mathbb{E}\left[|w_{k,j,m}|^{1+\epsilon}\right]^{1/(1+\epsilon)}\cdot\left(\frac{\mathbb{E}[|w_{k,j,m}|^{1+\epsilon}]}{\tau_{k}^{1+\epsilon}}\right)^{\epsilon/(1+\epsilon)}\\ \lesssim&M_{e,1+\epsilon,\delta}\cdot\tau_{k}^{-\epsilon}\asymp M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\cdot\kappa^{-2}\end{split} (B.73)

with the truncation parameter τkMe,1+ϵ,δ1/(1+ϵ)κ2/ϵ\tau_{k}\asymp M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\cdot\kappa^{2/\epsilon}. Hence,

Tk,1Fσ¯d/(d+1)pkrkMe,1+ϵ,δ1/(1+ϵ)κ2.\|T_{k,1}\|_{\text{F}}\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2}. (B.74)
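Note that the stated order of \tau_{k} is exactly the one that makes the truncation bias in (B.73) match the target level: solving M_{e,1+\epsilon,\delta}\tau_{k}^{-\epsilon}\asymp M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2} for \tau_{k} gives

\tau_{k}\asymp\left(M_{e,1+\epsilon,\delta}^{\epsilon/(1+\epsilon)}\kappa^{2}\right)^{1/\epsilon}=M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{2/\epsilon}.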

Step 2. (Bound Tk,2F\|T_{k,2}\|_{\text{F}})

Note that

𝔼[T(wk,j,m,τk)2]τk1ϵ𝔼[|wk,j,m|1+ϵ]τk1ϵMe,1+ϵ,δ.\mathbb{E}\left[\text{T}(w_{k,j,m},\tau_{k})^{2}\right]\leq\tau_{k}^{1-\epsilon}\cdot\mathbb{E}[|w_{k,j,m}|^{1+\epsilon}]\asymp\tau_{k}^{1-\epsilon}\cdot M_{e,1+\epsilon,\delta}. (B.75)

Thus, we have var(T(wk,j,m,τk))𝔼[T(wk,j,m,τk)2]τk1ϵMe,1+ϵ,δ\text{var}(\text{T}(w_{k,j,m},\tau_{k}))\leq\mathbb{E}[\text{T}(w_{k,j,m},\tau_{k})^{2}]\leq\tau_{k}^{1-\epsilon}M_{e,1+\epsilon,\delta}. Also, for any s=3,4,s=3,4,\dots, the higher-order moments satisfy that

𝔼[(T(wk,j,m,τk)𝔼[T(wk,j,m,τk)])s](2τk)s2𝔼[(T(wk,j,m,τk)𝔼[T(wk,j,m,τk)])2].\mathbb{E}\left[(\text{T}(w_{k,j,m},\tau_{k})-\mathbb{E}[\text{T}(w_{k,j,m},\tau_{k})])^{s}\right]\leq(2\tau_{k})^{s-2}\mathbb{E}\left[(\text{T}(w_{k,j,m},\tau_{k})-\mathbb{E}[\text{T}(w_{k,j,m},\tau_{k})])^{2}\right]. (B.76)

By Bernstein’s inequality, for any 0<tτkϵMe,1+ϵ,δ0<t\leq\tau_{k}^{-\epsilon}M_{e,1+\epsilon,\delta},

(|T(wk,j,m,τk)𝔼[T(wk,j,m,τk)]|t)2exp(t24τk1ϵMe,1+ϵ,δ).\mathbb{P}(|T(w_{k,j,m},\tau_{k})-\mathbb{E}[T(w_{k,j,m},\tau_{k})]|\geq t)\leq 2\exp\left(-\frac{t^{2}}{4\tau_{k}^{1-\epsilon}M_{e,1+\epsilon,\delta}}\right). (B.77)

Letting t=CMe,1+ϵ,δ1/(1+ϵ)κ2t=CM_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2}, since σ¯/Me,1+ϵ,δ1/(1+ϵ)p¯\underline{\sigma}/M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\gtrsim\sqrt{\bar{p}} we have

(|T(wk,j,m,τk)𝔼[T(wk,j,m,τk)]|CMe,1+ϵ,δ1/(1+ϵ)κ2)Cexp(Clog(p¯))\mathbb{P}(|T(w_{k,j,m},\tau_{k})-\mathbb{E}[T(w_{k,j,m},\tau_{k})]|\geq CM_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2})\leq C\exp\left(-C\log(\bar{p})\right) (B.78)

and

(max1jpk1mrk|T(wk,j,m,τk)𝔼[T(wk,j,m,τk)]|CMe,1+ϵ,δ1/(1+ϵ)κ2)Cexp(Clog(p¯)).\mathbb{P}\left(\max_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq m\leq r_{k}\end{subarray}}|T(w_{k,j,m},\tau_{k})-\mathbb{E}[T(w_{k,j,m},\tau_{k})]|\geq CM_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2}\right)\leq C\exp\left(-C\log(\bar{p})\right). (B.79)

Hence, with probability at least 1Cexp(Clog(p¯))1-C\exp(-C\log(\bar{p})),

Tk,2Fσ¯d/(d+1)pkrkMe,1+ϵ,δ1/(1+ϵ)κ2.\|T_{k,2}\|_{\text{F}}\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{p_{k}r_{k}}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2}. (B.80)

Step 3. (Bound Tk,3F\|T_{k,3}\|_{\text{F}})

Clearly, we have

¯(𝓐;𝓨)(k)𝐕k𝓢(k)¯(𝓐;𝓨)(k)𝐕k𝓢(k)F2σ¯2d/(d+1)𝓐𝓐F2.\|\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};\mbox{\boldmath$\mathscr{Y}$})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}-\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};\mbox{\boldmath$\mathscr{Y}$})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top}\|_{\text{F}}^{2}\lesssim\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.

For 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k}, let the (l,m)(l,m)-th entry of ¯(𝓐;𝓨)(k)𝐕k𝓢(k)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};\mbox{\boldmath$\mathscr{Y}$})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} and ¯(𝓐;𝓨)(k)𝐕k𝓢(k)\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};\mbox{\boldmath$\mathscr{Y}$})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top} be vk,l,mv_{k,l,m} and uk,l,mu_{k,l,m}, respectively. Let (𝔼[vk,l,m(i)uk,l,m(i)])2=sk,l,m2(\mathbb{E}[v_{k,l,m}^{(i)}-u_{k,l,m}^{(i)}])^{2}=s_{k,l,m}^{2}, and then l=1pkm=1rksk,l,m2σ¯2d/(d+1)𝓐𝓐F2\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.

For any 1lpk1\leq l\leq p_{k} and 1mrk1\leq m\leq r_{k},

|𝔼[vk,l,muk,l,m]𝔼[T(vk,l,m,τ)T(uk,l,m,τ)]|𝔼[|vk,l,muk,l,m|1{|wk,l,m|τk}]𝔼[|vk,l,muk,l,m(i)|2]1/2(|wk,l,m|τk)1/2sk,l,m[𝔼|wk,l,m|1+ϵτk1+ϵ]1/2sk,l,mκ2.\begin{split}&\left|\mathbb{E}[v_{k,l,m}-u_{k,l,m}]-\mathbb{E}[\text{T}(v_{k,l,m},\tau)-\text{T}(u_{k,l,m},\tau)]\right|\\ \lesssim&~{}\mathbb{E}\left[|v_{k,l,m}-u_{k,l,m}|\cdot 1\{|w_{k,l,m}|\geq\tau_{k}\}\right]\\ \leq&~{}\mathbb{E}\left[|v_{k,l,m}-u_{k,l,m}^{(i)}|^{2}\right]^{1/2}\cdot\mathbb{P}\left(|w_{k,l,m}|\geq\tau_{k}\right)^{1/2}\\ \leq&~{}s_{k,l,m}\cdot\left[\frac{\mathbb{E}|w_{k,l,m}|^{1+\epsilon}}{\tau_{k}^{1+\epsilon}}\right]^{1/2}\asymp s_{k,l,m}\cdot\kappa^{-2}.\end{split} (B.81)

Hence,

Tk,3F2=l=1pkm=1rk|𝔼[vk,l,muk,l,m]𝔼[T(vk,l,m,τ)T(uk,l,m,τ)]|2κ4l=1pkm=1rksk,l,m2κ4σ¯2d/(d+1)𝓐𝓐F2.\begin{split}\|T_{k,3}\|_{\text{F}}^{2}=&\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}\left|\mathbb{E}[v_{k,l,m}-u_{k,l,m}]-\mathbb{E}[\text{T}(v_{k,l,m},\tau)-\text{T}(u_{k,l,m},\tau)]\right|^{2}\\ \lesssim&~{}\kappa^{-4}\sum_{l=1}^{p_{k}}\sum_{m=1}^{r_{k}}s_{k,l,m}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.82)

Step 4. (Bound $\|T_{k,4}\|_{\text{F}}$)

Let $\mathbf{Z}_{k}=\{z_{k,j,l}\}_{1\leq j\leq p_{k},1\leq l\leq r_{k}}=\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau_{k})-\text{T}(\nabla\overline{\mathcal{L}}(\mbox{\boldmath$\mathscr{A}$}^{*};z_{i})_{(k)}\mathbf{V}_{k}\mbox{\boldmath$\mathscr{S}$}_{(k)}^{\top},\tau_{k})$. Then,

\|T_{k,4}\|_{\text{F}}^{2}=\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}\left(z_{k,j,l}-\mathbb{E}[z_{k,j,l}]\right)^{2}. (B.83)

Note that $\text{var}(z_{k,j,l})\leq s_{k,j,l}^{2}$ and that $\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}s^{2}_{k,j,l}\leq C\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}$. Moreover, $|z_{k,j,l}|\leq 2\tau_{k}$. Similarly to the term $T_{k,2}$, by Bernstein's inequality, for any $0<t<\tau_{k}^{-2}s_{k,j,l}^{2}$, $1\leq j\leq p_{k}$ and $1\leq l\leq r_{k}$,

\mathbb{P}\left(\left|z_{k,j,l}-\mathbb{E}[z_{k,j,l}]\right|\geq t\right)\leq 2\exp\left(-\frac{t^{2}}{2s^{2}_{k,j,l}}\right). (B.84)

Letting $t=C\kappa^{-2}s_{k,j,l}$ and applying a union bound over all $(j,l)$, we have

\mathbb{P}\left(\bigcup_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left\{\left|z_{k,j,l}-\mathbb{E}[z_{k,j,l}]\right|\geq C\kappa^{-2}s_{k,j,l}\right\}\right)\leq C\exp\left(-C\log(\bar{p})\right). (B.85)
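To spell out (B.85): with $t=C\kappa^{-2}s_{k,j,l}$, each entry obeys (B.84), and a union bound over the $p_{k}r_{k}$ entries gives the stated probability provided $\kappa^{-4}\gtrsim\log(\bar{p})$, which we take as implied by the choice of $\kappa$ earlier in the proof, so that the factor $p_{k}r_{k}$ is absorbed into the exponent:

\mathbb{P}\left(\bigcup_{\begin{subarray}{c}1\leq j\leq p_{k}\\ 1\leq l\leq r_{k}\end{subarray}}\left\{\left|z_{k,j,l}-\mathbb{E}[z_{k,j,l}]\right|\geq C\kappa^{-2}s_{k,j,l}\right\}\right)\leq 2p_{k}r_{k}\exp\left(-C\kappa^{-4}\right)\leq C\exp\left(-C\log(\bar{p})\right).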

Therefore, with probability at least $1-C\exp(-C\log(\bar{p}))$,

\|T_{k,4}\|_{\text{F}}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.86)
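For completeness, (B.86) follows on this event by summing the squared entrywise deviations in (B.83) and invoking the bound on $\sum_{j,l}s_{k,j,l}^{2}$ noted above:

\|T_{k,4}\|_{\text{F}}^{2}\leq C\kappa^{-4}\sum_{j=1}^{p_{k}}\sum_{l=1}^{r_{k}}s_{k,j,l}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.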

Combining these results, for any $1\leq k\leq d$, with probability at least $1-C\exp(-C\log(\bar{p}))$,

\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}p_{k}r_{k}M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}+\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.87)
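Here (B.87) combines the bounds on the four terms via the triangle inequality, assuming (as set up earlier in the proof) the decomposition $\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]=T_{k,1}+T_{k,2}+T_{k,3}+T_{k,4}$; squaring and using $(a+b)^{2}\leq 2a^{2}+2b^{2}$ gives

\|\mathbf{G}_{k}-\mathbb{E}[\nabla_{k}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\|T_{k,1}\|_{\text{F}}^{2}+\|T_{k,2}\|_{\text{F}}^{2}+\|T_{k,3}\|_{\text{F}}^{2}+\|T_{k,4}\|_{\text{F}}^{2},

and the stated rate then follows from the bounds obtained in the preceding steps.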

Step 5. (Extension to the core tensor)

In a similar fashion, we can show that, with probability at least $1-C\exp(-C\log(\bar{p}))$,

\begin{split}\|T_{0,1}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\prod_{k=1}^{d}r_{k}}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2},\\ \|T_{0,2}\|_{\text{F}}&\lesssim\bar{\sigma}^{d/(d+1)}\sqrt{\prod_{k=1}^{d}r_{k}}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\kappa^{-2},\\ \|T_{0,3}\|_{\text{F}}^{2}&\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2},\\ \text{and }\|T_{0,4}\|_{\text{F}}^{2}&\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}.\end{split} (B.88)

Hence, with high probability,

\|\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\left(\prod_{k=1}^{d}r_{k}\right)M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}+\kappa^{-4}\bar{\sigma}^{2d/(d+1)}\|\mbox{\boldmath$\mathscr{A}$}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}. (B.89)
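As in the mode-$k$ case, (B.89) is obtained from (B.88) through the triangle inequality, assuming the analogous decomposition $\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]=T_{0,1}+T_{0,2}+T_{0,3}+T_{0,4}$:

\|\mbox{\boldmath$\mathscr{G}$}_{0}-\mathbb{E}[\nabla_{0}\mathcal{L}]\|_{\text{F}}^{2}\lesssim\|T_{0,1}\|_{\text{F}}^{2}+\|T_{0,2}\|_{\text{F}}^{2}+\|T_{0,3}\|_{\text{F}}^{2}+\|T_{0,4}\|_{\text{F}}^{2}.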

Step 6. (Verify the conditions and conclude the proof)

By definition, the RCG condition is satisfied automatically. Plugging the results of the first five steps into Theorem 3.3 and using the same argument as in the proof of Theorem 4.5, we have that, with probability at least $1-C\exp(-C\log(\bar{p}))$,

\text{Err}^{(t)}\leq\text{Err}^{(0)}+C\bar{\sigma}^{-2d/(d+1)}\left(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k}\right)M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}, (B.90)

and

\|\mbox{\boldmath$\mathscr{A}$}^{(t)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}\lesssim(1-C\kappa^{-2})^{t}\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}+\left(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k}\right)M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}, (B.91)

for all $t=1,2,\dots,T$.
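As a remark (not needed for the proof), the two terms in (B.91) admit the usual optimization-versus-statistical-error reading: since $(1-C\kappa^{-2})^{t}\leq\exp(-Ct\kappa^{-2})$, the geometrically decaying first term is dominated by the second whenever

t\gtrsim\kappa^{2}\log\left(\frac{\|\mbox{\boldmath$\mathscr{A}$}^{(0)}-\mbox{\boldmath$\mathscr{A}$}^{*}\|_{\text{F}}^{2}}{(\sum_{k=1}^{d}p_{k}r_{k}+\prod_{k=1}^{d}r_{k})M_{e,1+\epsilon,\delta}^{2/(1+\epsilon)}}\right).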

Finally, for all $t=1,2,\dots,T$ and $k=1,2,\dots,d$,

\|\sin\Theta(\mathbf{U}_{k}^{(t)},\mathbf{U}_{k}^{*})\|\leq\bar{\sigma}^{-\frac{1}{d+1}}\sqrt{\text{Err}^{(t)}}\leq\bar{\sigma}^{-\frac{1}{d+1}}\sqrt{\text{Err}^{(0)}}+C\bar{\sigma}^{-1}d_{\text{eff}}^{1/2}M_{e,1+\epsilon,\delta}^{1/(1+\epsilon)}\leq\delta. (B.92)