ROBUST AND OPTIMAL TENSOR ESTIMATION VIA ROBUST GRADIENT DESCENT
Abstract
Low-rank tensor models are widely used in statistics and machine learning. However, most existing methods rely heavily on the assumption that data follows a sub-Gaussian distribution. To address the challenges associated with heavy-tailed distributions encountered in real-world applications, we propose a novel robust estimation procedure based on truncated gradient descent for general low-rank tensor models. We establish the computational convergence of the proposed method and derive optimal statistical rates under heavy-tailed distributional settings of both covariates and noise for various low-rank models. Notably, the statistical error rates are governed by a local moment condition, which captures the distributional properties of tensor variables projected onto certain low-dimensional local regions. Furthermore, we present numerical results to demonstrate the effectiveness of our method.
1 Introduction
1.1 Low-rank tensor modeling
Tensor models in statistics and machine learning have gained significant attention in recent years for analyzing complex multidimensional data. Applications of tensors can be found in various fields, including biomedical imaging analysis (Zhou, Li and Zhu, 2013; Li et al., 2018; Wu et al., 2022), recommender systems (Bi, Qu and Shen, 2018; Tarzanagh and Michailidis, 2022), and time series analysis (Chen, Yang and Zhang, 2022; Wang et al., 2022), where the data are naturally represented as third or higher-order tensors. Tensor decompositions and low-rank structures are prevalent in these models, as they facilitate dimension reduction, provide interpretable representations, and enhance computational efficiency (Kolda and Bader, 2009; Bi et al., 2021).
In this paper, we consider a general framework of tensor learning, where the loss function is denoted by , with being a differentiable loss function, a -th order parameter tensor, and a random sample drawn from the population. This framework encompasses a wide range of models in statistics and machine learning, including tensor linear regression, tensor generalized linear regression, and tensor PCA, among others. For dimension reduction, the parameter tensor is assumed to have a Tucker decomposition (Tucker, 1966)
(1.1) |
which is one of the most commonly used low-rank structures in statistical tensor learning (see related definitions and notations in Section 2).
Substantial progress has been made recently in developing estimation methods and efficient algorithms for various low-rank tensor models. However, low-rank tensor problems are often highly nonconvex, making them challenging to solve directly. A common strategy to address this issue is convex relaxation by replacing the low-rank constraint with a tensor nuclear norm and solving the problem in the full tensor space, i.e. the space of (Tomioka and Suzuki, 2013; Yuan and Zhang, 2016; Raskutti, Yuan and Chen, 2019). While these convex methods come with provable statistical guarantees, they suffer from heavy computational burdens and prohibitive storage requirements. As a result, they are often infeasible for high-dimensional tensor estimation in many practical applications.
Another estimation approach is nonconvex optimization, which operates directly on the parsimonious decomposition form in (1.1). This approach leads to scalable algorithms with more affordable computational complexity. Among the various developments in nonconvex optimization techniques, gradient descent and its variants have been extensively studied and can be applied to a wide range of low-rank tensor models. Recent advances have provided both computational and statistical guarantees for gradient descent algorithms, as well as corresponding initialization methods (Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022; Tong et al., 2022; Dong et al., 2023). In terms of statistical performance, minimax optimal error rates have been established for commonly used low-rank tensor models under Gaussian or sub-Gaussian distributional assumptions, which are crucial technical conditions for managing high-dimensional data (Raskutti, Yuan and Chen, 2019; Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022).
However, in many real applications, tensor data are often contaminated by outliers and heavy-tailed noise, violating the stringent Gaussian or sub-Gaussian assumptions. Empirical studies have shown that biomedical imaging data often exhibit non-Gaussian and heavy-tailed distributions (Beggs and Plenz, 2003; Friedman et al., 2012; Roberts, Boonstra and Breakspear, 2015). For example, in Section 6, we analyze chest computed tomography (CT) images of patients with and without COVID-19, and plot the kurtosis of image pixels in Figure 1. Among the 22,500 pixels analyzed, 1,169 pixels in the COVID-19 samples and 351 pixels in the non-COVID-19 samples exhibit heavy-tailed distributions, with sample kurtosis greater than eight. These findings highlight the prevalence of heavy-tailed behavior in real-world biomedical datasets. As recent studies (Wang, Zhang and Mai, 2023; Wei et al., 2023) have emphasized, conventional methods fail to produce reliable estimates when the data follow such heavy-tailed distributions. Therefore, it is crucial to develop stable and robust estimation methods that are not only computationally efficient and statistically optimal but also robust against the challenges posed by heavy-tailed distributions in real-world tensor data.
Figure 1: Histograms of the sample kurtosis of chest CT image pixels for the COVID-19 and non-COVID-19 samples.
The growing interest in robust estimation methods for high-dimensional low-rank matrix and tensor models underscores the pressing need for solutions that can handle heavy-tailed data. In terms of methodology to achieve robustness, the existing works can be broadly classified into two approaches: loss robustification and data robustification. The celebrated Huber regression method (Huber, 1964; Fan, Li and Wang, 2017; Sun, Zhou and Fan, 2020) exemplifies the first approach, where the standard least squares loss is replaced with a more robust variant. For instance, Tan, Sun and Witten (2023) applied the adaptive Huber regression with regularization to sparse reduced-rank regression in the presence of heavy-tailed noise, and developed an ADMM algorithm for convex optimization. Shen et al. (2023) employed the least absolute deviation (LAD) and Huber loss functions for low-rank matrix and tensor trace regression, and proposed a Riemannian subgradient algorithm in the nonconvex optimization framework. While these loss-robustification methods provide robust control over residuals, they focus solely on the residuals’ deviations and do not address the heavy-tailedness of the covariates. Moreover, robust loss functions like LAD and Huber loss cannot be easily generalized to more complex tensor models beyond linear trace regression.
Alternatively, Fan, Wang and Zhu (2021) proposed a robust low-rank matrix estimation procedure via data robustification. This method applies appropriate shrinkage to the data, constructs robust moment estimators from the shrunk data, and ultimately derives a robust estimate for the low-rank parameter matrix. Building on this idea, subsequent works by Wang and Tsay (2023) and Lu et al. (2024) extended the data robustification framework to time series models and spatio-temporal models, respectively. The primary objective of data robustification is to mitigate the influence of samples with large deviations, thereby producing a robust estimate. However, when applied to low-rank matrix and tensor models, the data robustification procedure has limitations. Specifically, it overlooks the inherent structure of the model and fails to exploit the low-rank decomposition. As shown in Section 4, not all information in the data contributes effectively to estimating the tensor decomposition. Consequently, the data robustification approach may be suboptimal for low-rank tensor estimation.
For general low-rank tensor models, we introduce a new robust estimation procedure via gradient robustification. Specifically, we replace the partial gradients in the gradient descent algorithm with their robust alternatives by truncating gradients that exhibit large deviations. This modification improves the accuracy of gradient estimation by mitigating the influence of outliers and noise. By robustifying the partial gradients with respect to the components in the tensor decomposition, this method effectively leverages the low-rank structure, making it applicable to a variety of low-rank models, similar to other gradient-based methods.
For various low-rank tensor models, we demonstrate that instead of the original data, low-dimensional multilinear transformations of the data are employed in the partial gradients. This allows us to work with more compact and informative representations of the data, leading to enhanced computational and statistical properties. As a result, the statistical error rates of this gradient-based method are shown to be linked to a new concept called local moment, which characterizes the distributional properties of the tensor data projected onto certain low-dimensional subspaces along each tensor direction. This leads to potentially much sharper statistical rates compared to traditional methods. Furthermore, since covariates and noise contribute to the partial gradients, the robustification of gradients is particularly effective in handling the heavy-tailed distributions of both covariates and noise.
However, it is important to note that the robust gradient alternatives may not correspond to the partial gradients of any standard robust loss function, which poses challenges in analyzing their convergence. To address this, we develop an algorithm-based convergence analysis for the proposed robust gradient descent method. Additionally, we establish minimax-optimal statistical error rates under local moment conditions for several commonly used low-rank tensor models.
The main contributions of this paper are three-fold:
-
(1)
We introduce a novel and general estimation procedure based on robust gradient descent. This method is computationally scalable and applicable to a wide range of tensor problems, including both supervised tasks (e.g., tensor regression and classification) and unsupervised tasks (e.g., tensor PCA).
-
(2)
The robust methodology is shown to effectively handle the heavy-tailed distributions and is proven to achieve optimal statistical error rates under relaxed distributional assumptions on both covariates and noise. Specifically, for , we only require finite second moments for the covariates and -th moments for noise in tensor linear regression, finite second moments for the covariates in tensor logistic regression, and finite -th moments for noise in tensor PCA.
-
(3)
For heavy-tailed low-rank tensor models, we introduce the concept of local moment, a technical innovation that enables a more precise characterization of the effects of heavy-tailed distributions. This results in sharper statistical error rates compared to those derived from global moment conditions.
1.2 Related literature
This paper is related to a large body of literature on nonconvex methods for low-rank matrix and tensor estimation. The gradient descent algorithm and its variants have been extensively studied for low-rank matrix models (Netrapalli et al., 2014; Chen and Wainwright, 2015; Tu et al., 2016; Wang, Zhang and Gu, 2017; Ma et al., 2018) and low-rank tensor models (Xu, Zhang and Gu, 2017; Chen, Raskutti and Yuan, 2019; Han, Willett and Zhang, 2022; Tong, Ma and Chi, 2022; Tong et al., 2022). For simplicity, we focus on the robust alternatives to the standard gradient descent, although the proposed technique can be extended to other gradient-based methods. Robust gradient methods have also been explored for low-dimensional statistical models in convex optimization (Prasad et al., 2020). For low-rank matrix recovery, the median truncation gradient descent has been proposed to handle arbitrary outliers in the data (Li et al., 2020), which focuses on the linear compressed sensing problem without random noise. Our paper differs from the existing work as we consider the general low-rank tensor estimation framework under the heavy-tailed distribution setting.
Robust estimation against heavy-tailed distributions is another emerging area of research in high-dimensional statistics. Various robust M-estimators have been proposed for mean estimation (Catoni, 2012; Bubeck, Cesa-Bianchi and Lugosi, 2013; Devroye et al., 2016) and high-dimensional linear regression (Fan, Li and Wang, 2017; Loh, 2017; Sun, Zhou and Fan, 2020; Wang et al., 2020). More recently, robust methods for low-rank matrix and tensor estimation have been developed in Fan, Wang and Zhu (2021), Tan, Sun and Witten (2023), Wang and Tsay (2023) and Shen et al. (2023). Compared to these existing methods, our proposed approach can achieve the same or even better convergence rates under more relaxed local distribution assumptions on both covariates and noise.
1.3 Organization of the paper
The remainder of this paper is organized as follows. In Section 2, we provide a review of relevant definitions and notations for low-rank tensor estimation. In Section 3, we introduce the robust gradient descent method and discuss its convergence properties. In Section 4, we apply the proposed methodology to three popular tensor models in heavy-tailed distribution settings. Simulation experiments and real data examples are presented in Sections 5 and 6 to validate the performance of the proposed methods. Finally, Section 7 concludes with a discussion of the findings and future directions. The proofs of the main results are relegated to Appendices A and B.
2 Tensor Algebra and Notation
Tensors, or multi-dimensional arrays, are higher-order extensions of matrices. A multi-dimensional array is called a -th order tensor. Throughout this paper, we denote vectors by boldface small letters (e.g. , ), matrices by boldface capital letters (e.g. , ), and tensors by boldface Euler capital letters (e.g. , ), respectively.
Tensor matricization is the process of reordering the elements of a tensor into a matrix. For any -th-order tensor , its mode- matricization is denoted as , with , whose -th element is mapped to the -th element of , where with and .
Next, we review three types of multiplications for tensors. For any tensor and matrix with , the mode- multiplication produces a tensor in defined by
(2.1) |
For any two tensors and with , their generalized inner product is the -th-order tensor in defined by
(2.2) |
for . In particular, when , it reduces to the conventional real-valued inner product. Additionally, the Frobenius norm of any tensor is defined as . For any two tensors and , their outer product is the tensor defined by
(2.3) |
for , , , and .
For any tensor , its Tucker ranks are defined as the matrix ranks of its matricizations, i.e. , for . Note that ’s are analogous to the row and column ranks of a matrix, but they are not necessarily equal for third- and higher-order tensors. If has Tucker ranks , then has the following Tucker decomposition (Tucker, 1966; De Lathauwer, De Moor and Vandewalle, 2000)
(2.4) |
where each is the factor matrix for , and is the core tensor. If has the Tucker decomposition in (2.4), then we have the following results for its matricizations:
(2.5) |
where is matrix Kronecker product operating in the reverse order.
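To make these definitions concrete, the following numpy sketch (with illustrative function names) implements mode-k matricization and the mode-k product, and numerically checks the matricization identity above for a third-order Tucker decomposition. Note that it flattens the remaining modes in C (row-major) order, so the Kronecker factors appear in their natural order rather than the reverse order used in the text.

```python
import numpy as np

def unfold(T, k):
    # Mode-k matricization: bring mode k to the front, then flatten the rest (C order).
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def mode_prod(T, A, k):
    # Mode-k product: multiply the mode-k fibers of T by the matrix A.
    return np.moveaxis(np.tensordot(A, np.moveaxis(T, k, 0), axes=1), 0, k)

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 4))            # core tensor
U = [rng.standard_normal((5, 2)),             # factor matrices
     rng.standard_normal((6, 3)),
     rng.standard_normal((7, 4))]

# Tucker reconstruction: T = G x_1 U1 x_2 U2 x_3 U3
T = G
for k, Uk in enumerate(U):
    T = mode_prod(T, Uk, k)

# Matricization identity for the first mode; with C-order flattening the Kronecker
# product runs over the remaining factor matrices in their natural order.
lhs = unfold(T, 0)
rhs = U[0] @ unfold(G, 0) @ np.kron(U[1], U[2]).T
print(np.allclose(lhs, rhs))  # True
```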
Throughout this paper, we use to denote a generic positive constant. For any two real-valued sequences and , we write if there exists a constant such that for all . Additionally, we write if and . For a generic matrix , we let , , , and denote its transpose, Frobenius norm, operator norm, vectorization, and the -th largest singular value, respectively. For any real symmetric matrix , let and denote its minimum and maximum eigenvalues.
3 Methodology
3.1 Gradient descent with robust gradient estimates
We consider a general estimation framework with the loss function for parameter tensor and random observation . Suppose the parameter tensor admits a Tucker low-rank decomposition , defined in (2.4), where is the core tensor and each is the factor matrix. Throughout the paper, we assume that the order is fixed and the ranks are known. Given the tensor decomposition, we define the loss function with respect to the decomposition components as
(3.1) |
A standard method to estimate the components in the tensor decomposition is to minimize the following regularized loss function
(3.2) |
where are tuning parameters, and the additional regularization term prevents the factor matrix from being singular, while also ensuring that the scaling among all factor matrices is balanced (Han, Willett and Zhang, 2022). This regularized loss minimization problem can be efficiently solved using gradient descent. Han, Willett and Zhang (2022) demonstrated that the gradient descent approach is computationally efficient. Moreover, with a suitable choice of initial values, the estimation error is proportional to
(3.3) |
where is the ground truth of the parameter tensor satisfying , as minimizes the risk function if the expectation exists. The error essentially depends on the intrinsic dimension of the low-rank tensors and the distribution of the data . When the distribution of is heavy-tailed, controlling the convergence rate becomes challenging, which may lead to suboptimal estimation performance.
To motivate our robust estimation method, it is important to note that the partial gradients of the loss function have a form of sample means
(3.4) |
for , and
(3.5) |
It is well known that the sample mean is sensitive to outliers, as it can be significantly influenced by even a small fraction of extreme values. When some of ’s are outliers, the gradient descent approach may fail to be robust, leading to sub-optimal estimates, particularly in the presence of heavy-tailed distributions.
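As a quick numerical illustration of this sensitivity, and of the entrywise truncation idea developed in Section 3.3, the following snippet compares the sample mean with a truncated mean on symmetric heavy-tailed data; the threshold value is arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_t(df=1.5, size=10_000)   # symmetric heavy-tailed sample, true location 0

tau = 10.0
sample_mean = x.mean()                                              # fluctuates wildly across runs
truncated_mean = (np.sign(x) * np.minimum(np.abs(x), tau)).mean()   # stays close to 0
print(sample_mean, truncated_mean)
```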
Therefore, we use the robust gradient estimates as alternatives to the standard gradients in vanilla gradient descent. For any given , we may construct some robust gradient functions, denoted by and , for , to replace the partial gradients in standard gradient descent. Given initial values , a step size , and the number of iterations , we propose the following gradient descent algorithm, summarized in Algorithm 1, using the robust gradient functions.
initialize: , , step size , and number of iterations
for to
for to
end for
end for
return
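A minimal Python sketch of Algorithm 1 is given below, under simplifying assumptions: the robust gradient functions are supplied as callables (for instance, the truncated estimators of Section 3.3), a single constant step size is used, and the regularization term in (3.2) is omitted; all names are illustrative.

```python
def robust_gradient_descent(U0, G0, robust_grad_U, robust_grad_G, eta, T):
    """Gradient descent with robust gradient estimates (sketch of Algorithm 1).

    U0      -- list of d initial factor matrices
    G0      -- initial core tensor
    robust_grad_U(k, U, G) -- robust estimate of the partial gradient w.r.t. U[k]
    robust_grad_G(U, G)    -- robust estimate of the partial gradient w.r.t. G
    eta, T  -- step size and number of iterations
    """
    U, G = [Uk.copy() for Uk in U0], G0.copy()
    for _ in range(T):
        # compute all updates from the current iterate (U, G), then commit them
        new_U = [U[k] - eta * robust_grad_U(k, U, G) for k in range(len(U))]
        new_G = G - eta * robust_grad_G(U, G)
        U, G = new_U, new_G
    return U, G
```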
3.2 Local convergence analysis
The robust gradient functions may not be the partial gradients of a robust loss function. When the gradients are replaced by their robust alternatives, Algorithm 1 is not designed to solve a traditional minimization problem. As a result, studying its convergence becomes challenging. To address this and establish both computational and statistical guarantees, we introduce some notations and conditions.
Definition 3.1 (Restricted correlated gradient).
The loss function satisfies the restricted correlated gradient (RCG) condition: for any such that , ,
(3.6) |
where the RCG parameters and satisfy .
The RCG condition implies that for any low-rank tensor , the expectation of the gradient is positively correlated with the optimal descent direction . If the loss function has a finite expectation , this condition is implied by the restricted strong convexity and smoothness conditions of the risk function (Bubeck, 2015; Jain et al., 2017). However, in the setting of heavy-tailed distributions, the expectation of may not exist, making it more appropriate to consider the RCG condition on the gradient rather than the loss function (see tensor linear regression in Remark 4.1 as an example). Furthermore, compared to the vanilla gradient descent method (Han, Willett and Zhang, 2022), the RCG condition is imposed on the expectation of the loss gradient, which simplifies the technical analysis under heavy-tailed distributions.
The robust gradient functions are expected to exhibit desirable stability against outliers. As an alternative to the sample mean, can be viewed as a robust estimator of . Similarly, can be viewed as a mean estimator of . Formally, we define the following condition for the stability of the robust gradients.
Definition 3.2 (Stability of robust gradients).
Given , the robust gradient functions are stable if there exist positive constants and , for , such that
(3.7) |
The universal parameter controls the performance of all gradient functions in the presence of an inaccurate Tucker decomposition, while the constants ’s represent the estimation accuracy of the robust gradient estimators. A similar definition for robust estimation via convex optimization is considered in Prasad et al. (2020).
For the ground truth , denote its largest and smallest singular values across all directions by and . The condition number of is then given by . To address the unidentifiability issue in Tucker decomposition, we consider the component-wise estimation error under rotation for the estimate , following Han, Willett and Zhang (2022). The error is defined as
(3.8) |
where the true decomposition satisfies and the orthogonal matrices ’s account for the unidentifiability of the Tucker decomposition. For the -th iteration of Algorithm 1, where , denote the estimated parameter by . The corresponding estimation error is then given by .
Given the conditions and notations outlined above, we present the local convergence analysis for the gradient descent iterations with stable robust gradient functions.
Theorem 3.3 (Local convergence rate).
Suppose that the loss function satisfies the RCG condition with parameters and as in Definition 3.1, and that the robust gradient functions at each step satisfy the stability condition with parameters and as in Definition 3.2, for all and . If the initial estimation error satisfies , , , , and , then the estimation errors at the -th iteration, for , satisfy
(3.9) |
and
(3.10) |
Theorem 3.3 establishes the linear convergence of the robust gradient descent iterates under certain sufficient conditions. In both upper bounds provided, the first terms correspond to optimization errors that decay exponentially with the number of iterations, which reflects the improvement in the solution as the gradient descent progresses. The second terms, on the other hand, capture the statistical errors, which depend on the estimation accuracy of the robust gradient functions. These statistical errors arise due to the approximation of gradients in the presence of noise or outliers, and their magnitude is influenced by the robustness of the gradient estimator. Thus, to guarantee fast convergence, it is crucial to construct a stable robust gradient estimator, which is the focus of the next subsection. The corresponding theoretical analysis for each specific statistical model will be provided in Section 4, where we will discuss the performance of the estimator under different conditions.
3.3 Robust gradient estimation via entrywise truncation
In this subsection, we propose a general robust gradient estimation method. The partial gradient of the risk function is the expectation of the partial gradient of the loss function, which suggests that robust gradient estimation can be framed as a mean estimation problem. In this context, Fan, Wang and Zhu (2021) introduced a simple robust mean estimation procedure based on data truncation. They demonstrated that the truncated estimator achieves the optimal rate under certain bounded moment conditions. Motivated by their work, we apply the truncation method to estimate the partial gradients in our setting.
For any matrix , we define the entrywise truncation operator , such that
(3.11) |
for and , where denotes the sign function. This operator truncates each entry of the matrix to be no larger than the threshold in absolute value, while preserving its sign. Similarly, we can define the entrywise truncation operator for any tensor with truncation parameter . The truncation parameter plays a critical role in balancing the trade-off between truncation bias and robustness. A smaller value of will lead to stronger truncation, reducing the influence of outliers but potentially introducing more bias, while a larger will preserve more of the data’s original values, reducing bias but making the estimator more sensitive to noise.
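In code, the entrywise truncation operator is a one-liner; a numpy version (illustrative naming) that applies equally to vectors, matrices, and higher-order tensors is:

```python
import numpy as np

def truncate(X, tau):
    """Entrywise truncation: shrink each entry to the threshold tau in absolute value,
    while preserving its sign."""
    return np.sign(X) * np.minimum(np.abs(X), tau)
```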
We consider the entrywise truncation gradient estimators. For , the estimator for the gradient with respect to is given by
(3.12) |
Similarly, for the core tensor , the truncation-based estimator is defined as
(3.13) |
Note that the truncation-based robust gradient estimator is generally applicable to a wide range of tensor models. In Sections 4 and 5, we will show both theoretically and numerically that the entrywise truncation using a single parameter can achieve optimal estimation performance under various distributional assumptions.
We briefly outline the idea for proving the stability of the truncated gradient estimator. A complete proof for each statistical application is provided in the supplementary material. For , we define , and the error term for the gradient estimator can be expressed as
(3.14) |
where the terms , , , and are defined as follows:
(3.15) |
Here, is the truncation bias at the ground truth , and represents the deviation of the truncated estimation around its expectation. As each truncated gradient, , is a bounded variable, we can apply the Bernstein inequality (Wainwright, 2019) to achieve a sub-Gaussian-type concentration without the Gaussian distributional assumption on the data. The truncation parameter controls the magnitude of and , and an optimal gives . For , given some regularity conditions, we can obtain an upper bound for the truncation bias of the second-order approximation error in . Similarly, as is bounded, we can also achieve a sub-Gaussian-type concentration and show that . Hence, we can show that .
By controlling the truncation bias, deviation, and approximation errors, we demonstrate that the truncated gradient estimator is stable and achieves optimal performance under certain conditions. A similar approach can be applied to the gradient with respect to the core tensor , establishing the stability of the corresponding robust estimator.
3.4 Implementation and initialization
The optimal statistical convergence rates of the proposed estimator depend critically on the optimal value of the truncation parameter , which varies according to the model dimension, sample size, and other relevant parameters. To select in a data-driven manner, we propose using cross-validation to evaluate the estimates produced by the robust gradient descent algorithm for different values of .
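A minimal sketch of this cross-validation step is shown below; fit and validation_loss are placeholders for the model-specific robust gradient descent routine and evaluation criterion, and are assumptions rather than part of the proposed method.

```python
import numpy as np

def select_tau(data, taus, fit, validation_loss, n_folds=5, seed=0):
    """Pick the truncation parameter tau by n-fold cross-validation (sketch)."""
    n = len(data)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    cv_loss = []
    for tau in taus:
        losses = []
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            estimate = fit([data[i] for i in train], tau)     # robust gradient descent fit
            losses.append(validation_loss(estimate, [data[i] for i in fold]))
        cv_loss.append(np.mean(losses))
    return taus[int(np.argmin(cv_loss))]
```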
To initialize Algorithm 1, we may disregard the low-rank structure of initially, and find the estimate . Specifically, robust gradient descent can be applied to update , as summarized in Algorithm 2, where the robust gradient is
(3.16) |
with another truncation parameter and step size . Once we have , we apply HOSVD (higher-order singular value decomposition) or HOOI (higher-order orthogonal iteration) (De Lathauwer, De Moor and Vandewalle, 2000) to to obtain the initial values.
initialize: , step size , and number of iterations
for to
end for
return
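Given the unstructured estimate returned by Algorithm 2, the HOSVD step can be carried out as in the following sketch (illustrative names; HOOI would add a few extra refinement sweeps over the factors).

```python
import numpy as np

def hosvd(T0, ranks):
    """HOSVD of an unstructured estimate T0: the k-th factor matrix collects the top
    r_k left singular vectors of the mode-k matricization, and the core tensor is
    obtained by projecting T0 onto those factors."""
    d = T0.ndim
    U = []
    for k in range(d):
        Mk = np.moveaxis(T0, k, 0).reshape(T0.shape[k], -1)
        Uk, _, _ = np.linalg.svd(Mk, full_matrices=False)
        U.append(Uk[:, :ranks[k]])
    G = T0
    for k in range(d):
        G = np.moveaxis(np.tensordot(U[k].T, np.moveaxis(G, k, 0), axes=1), 0, k)
    return U, G

# initial values for Algorithm 1, given the output T0 of Algorithm 2:
# U_init, G_init = hosvd(T0, ranks=(r1, r2, r3))
```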
4 Applications to Tensor Models
In this section, we apply the proposed robust gradient descent algorithm, utilizing entrywise truncated gradient estimators, to three statistical tensor models: tensor linear regression, tensor logistic regression, and tensor PCA. For tensor linear regression, both the covariates and the noise are assumed to follow heavy-tailed distributions. In the case of tensor logistic regression and tensor PCA, we assume heavy-tailed covariates for the former and heavy-tailed noise for the latter. In the following, we let be the maximum dimension across all modes, and let be the effective dimension of the low-rank tensor, which corresponds to the total number of parameters in the Tucker decomposition.
4.1 Heavy-tailed tensor linear regression
The first statistical model we consider is tensor linear regression:
(4.1) |
where , , and with . The symbol represents the generalized inner product of tensors, defined in (2.2). When , the response is a scalar, and reduces to the conventional inner product. We assume that the samples are independent across , and .
Model (4.1) encompasses multivariate regression, multi-response regression, and matrix trace regression as special cases. It has broad applicability in various statistics and machine learning contexts, such as multi-response tensor regression (Raskutti, Yuan and Chen, 2019), matrix compressed sensing (Candes and Plan, 2011), autoregressive time series modeling (Wang et al., 2022), matrix and tensor completion (Negahban and Wainwright, 2012; Cai et al., 2019), and others.
For model (4.1), it is common to consider the least squares loss function:
(4.2) |
For any given , its partial gradients with respect to each are
(4.3) |
By applying the entrywise truncation operator from Section 3.3, the robust gradients with truncation parameter are given by
(4.4) |
for , and
(4.5) |
Remark 4.1.
For model (4.1), the loss function has a finite expectation if both and have finite second moments. In the following, we assume that has a finite moment for some constant . Under this assumption, we only require that the gradient has a finite expectation.
Remark 4.2.
Adaptive Huber regression is a widely-used robust estimation method for high-dimensional linear regression (Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023). The loss function is given by
(4.6) |
where for any tensor , is the Huber loss, and is the robustness parameter. The partial gradients of are
(4.7) |
In comparison to the standard Huber regression, where the gradients bound the residuals directly, the entrywise truncated gradients in (4.4) and (4.5) control the deviation of the term . This modification enables us to better handle heavy-tailedness in both the covariates and the noise.
By applying the Tucker decomposition and utilizing properties of tensor inner products, the partial gradients can be rewritten as follows. For ,
(4.8) |
For ,
(4.9) |
Finally, the gradient with respect to the core tensor is
(4.10) |
In the above expressions, dimension reduction is applied to high-dimensional data and via the factor matrices ’s. This allows for efficient computation of the partial gradients. More importantly, the transformed data, such as and , appear in the partial gradients rather than the original high-dimensional data. This transformation can be seen as reducing the effective dimension of the data used in gradient calculation. As a result, when all ’s are close to their ground truth values and lie within certain local regions, the gradients rely only on partial information of the data. This phenomenon motivates us to consider the local bounded moment conditions for the distributions of and when projected onto certain directions.
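The following sketch computes the entrywise-truncated partial gradient with respect to one factor matrix for tensor linear regression with a scalar response, using the projected covariates described above. It truncates each per-sample gradient contribution before averaging; the exact placement of the truncation in (4.4) may differ in detail, so this is an illustrative implementation rather than the definitive one.

```python
import numpy as np

def unfold(T, k):
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def mode_prod(T, A, k):
    return np.moveaxis(np.tensordot(A, np.moveaxis(T, k, 0), axes=1), 0, k)

def truncate(X, tau):
    return np.sign(X) * np.minimum(np.abs(X), tau)

def truncated_grad_factor(X, y, U, G, k, tau):
    """Entrywise-truncated partial gradient w.r.t. the k-th factor matrix.

    X -- covariate tensors of shape (n, p_1, ..., p_d); y -- scalar responses of shape (n,).
    Each per-sample gradient -(y_i - <X_i, A>) M_k(X_i x_{j!=k} U_j^T) M_k(G)^T is
    truncated entrywise before averaging."""
    d = G.ndim
    # full coefficient tensor A = G x_1 U_1 ... x_d U_d
    A = G
    for j in range(d):
        A = mode_prod(A, U[j], j)
    resid = y - np.tensordot(X, A, axes=d)        # residuals y_i - <X_i, A>
    Gk_T = unfold(G, k).T
    grad = 0.0
    for i in range(len(y)):
        Z = X[i]
        for j in range(d):
            if j != k:
                Z = mode_prod(Z, U[j].T, j)       # project X_i onto the other factor spaces
        grad = grad + truncate(-resid[i] * unfold(Z, k) @ Gk_T, tau)
    return grad / len(y)
```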
For any matrix , consider its column subspace projection matrix , where is the Moore–Penrose pseudo-inverse. For any vector , the angle between and its projection onto is . For each , define the local set of unit-length vectors with radius as
(4.11) |
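For concreteness, the projection matrix and the angle entering this definition can be computed as follows (illustrative names):

```python
import numpy as np

def projection(U):
    """Orthogonal projection onto the column space of U via the Moore-Penrose pseudo-inverse."""
    return U @ np.linalg.pinv(U)

def angle_to_colspace(w, U):
    """Angle between a vector w and its projection onto the column space of U."""
    Pw = projection(U) @ w
    c = w @ Pw / (np.linalg.norm(w) * np.linalg.norm(Pw))
    return np.arccos(np.clip(c, -1.0, 1.0))
```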
By this definition, we have the following assumptions on the distributions of and .
Assumption 4.3.
The vectorized covariate has the mean and positive definite variance satisfying . For some and , conditioned on any , the random noise has the local -th moment bound
(4.12) |
where is the coordinate vector whose -th element is one and others are zero. Additionally, has the local -th moment bound
(4.13) |
We call and the local moment bounds because they reflect the distributional properties of and transformed onto the local region around the column space of ’s. The local moment bound assumptions essentially limit how far the random noise and covariates can deviate in certain directions, thereby controlling their tail behavior. When , the local moment bounds become
(4.14) |
and
(4.15) |
which are still different from and could be much smaller than the global -th moments for the vectorized data, i.e., and .
The second moment condition for and -th moment condition for in Assumption 4.3 offer a relaxation of the commonly-used Gaussian and sub-Gaussian conditions in tensor regression literature. Notably, in Assumption 4.3, the elements of or are not required to be independent or uncorrelated. The parameters , , and are used to quantify how the distributions of and affect the rate of convergence, and they are allowed to vary with tensor dimensions. For tensor regression in (4.1), define as the effective -th local moment bound.
Remark 4.4.
The local moment condition is established on the certain directions near the subspaces spanned by the columns of . Consequently, the local moment can be much smaller than the global moment commonly discussed in the robust estimation literature (Fan, Li and Wang, 2017; Fan, Wang and Zhu, 2021; Wang and Tsay, 2023).
To highlight the significance of the local moment concept, consider the following example. Let be a random vector, where , , and . Suppose that the ground truth is . In this case, the global second moment of is given by
(4.16) |
which diverges as the dimension increases. However, when , the local second moment with radius , defined as
(4.17) |
remains bounded and is not greater than 2. This example illustrates that if the radius of the local region can be sufficiently controlled, the local moment bound can be significantly smaller than the global moment. As a result, the error rate associated with the local moment could be much sharper, leading to more precise theoretical results.
Denote the estimator obtained by the robust gradient descent algorithm with gradient truncation parameter as , and the corresponding estimation error by as in (3.8). Based on the local bounded moment conditions, we have the following guarantees.
Theorem 4.5.
Theorem 4.5 shows that, under certain regularity conditions, the error rates in (4.18) and (4.19) are proportional to and , respectively. The results essentially depend on the rate of the truncation parameter , whose optimal value is proportional to . The convergence rates exhibit a smooth phase transition. When , i.e. when has a finite second local moment, the upper bound in (4.19) matches the result under the Gaussian distributional assumption for low-rank matrix regression (Negahban and Wainwright, 2011) and tensor regression (Han, Willett and Zhang, 2022; Zhang et al., 2020). When , the convergence rate in (4.19) decreases from to . Furthermore, the parameter controls the radius of the local region. If is sufficiently large and the initial error , then the local radius is smaller than 1, and the convergence rates are associated with the local moment bounds.
Remark 4.6.
The convergence rate phase transition with respect to the moment order is also observed in heavy-tailed vector regression (Sun, Zhou and Fan, 2020), matrix regression (Tan, Sun and Witten, 2023), and time series autoregression (Wang and Tsay, 2023). For and 2, the convergence rates obtained in Theorem 4.5 for are minimax rate-optimal for vector-valued and matrix-valued regression models (Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023). For , the rate in (4.19) with is also minimax rate-optimal (Raskutti, Yuan and Chen, 2019).
Remark 4.7.
We highlight two key differences between our theoretical results and those in the existing literature on heavy-tailed regression models. First, we relax the distributional condition on the covariate . In high-dimensional Huber regression literature, the covariates are typically assumed to be sub-Gaussian or bounded (Fan, Li and Wang, 2017; Sun, Zhou and Fan, 2020; Tan, Sun and Witten, 2023; Shen et al., 2023). In contrast, the proposed method, based on gradient robustification, can handle both heavy-tailed covariates and noise. Second, our analysis relies on local moment conditions, making our results potentially much sharper than those based on global moment conditions. Numerical results which verify the advantages of our method over competing methods are provided in Section 5.
4.2 Heavy-tailed tensor logistic regression
For the generalized linear model, conditioned on the tensor covariate , the response variable follows the distribution
(4.20) |
Consequently, the loss function is given by
(4.21) |
A widely studied instance of the generalized linear models is logistic regression, where . In this case, the gradient of the loss function becomes
(4.22) |
Here, is the binary variable with probability
(4.23) |
In the robust estimation literature for logistic regression, since is a binary variable, it is often assumed that the covariate follows a heavy-tailed distribution. For example, the finite fourth moment condition has been considered for vector-valued covariates in the context of heavy-tailed logistic regression (Prasad et al., 2020; Zhu and Zhou, 2021; Han, Tsay and Wu, 2023). Tensor logistic regression has been widely applied to classification tasks in neuroimaging analysis (see e.g. Zhou, Li and Zhu, 2013; Li et al., 2018; Wu et al., 2022). Recent empirical studies have shown that neuroimaging data follow heavy-tailed distributions (Beggs and Plenz, 2003; Friedman et al., 2012; Roberts, Boonstra and Breakspear, 2015). Consequently, there is a growing practical need to study robust estimation methods for tensor logistic regression.
Given the low-rank structure, the partial gradients of the logistic loss function are
(4.24) |
for , and
(4.25) |
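A corresponding sketch of the truncated partial gradient for tensor logistic regression with binary responses in {0, 1} is given below; as before, the helper and function names and the placement of the truncation are illustrative.

```python
import numpy as np

def unfold(T, k):
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def mode_prod(T, A, k):
    return np.moveaxis(np.tensordot(A, np.moveaxis(T, k, 0), axes=1), 0, k)

def truncate(X, tau):
    return np.sign(X) * np.minimum(np.abs(X), tau)

def truncated_logistic_grad_factor(X, y, U, G, k, tau):
    """Entrywise-truncated partial gradient w.r.t. the k-th factor matrix for
    low-rank tensor logistic regression with responses y_i in {0, 1}."""
    d = G.ndim
    A = G
    for j in range(d):
        A = mode_prod(A, U[j], j)                  # A = G x_1 U_1 ... x_d U_d
    lin = np.tensordot(X, A, axes=d)               # linear predictors <X_i, A>
    p = 1.0 / (1.0 + np.exp(-lin))                 # fitted probabilities
    Gk_T = unfold(G, k).T
    grad = 0.0
    for i in range(len(y)):
        Z = X[i]
        for j in range(d):
            if j != k:
                Z = mode_prod(Z, U[j].T, j)        # projected covariate
        grad = grad + truncate((p[i] - y[i]) * unfold(Z, k) @ Gk_T, tau)
    return grad / len(y)
```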
As in tensor linear regression, the partial gradients for tensor logistic regression involve the multilinear transformations of the covariates, namely and . Therefore, to derive the optimal convergence rate, it is essential to characterize the distributional properties of the transformed covariates. In this work, we assume that the covariate has a finite second moment and introduce the concept of the local second moment bound to obtain a sharp statistical convergence rate.
Assumption 4.8.
The vectorized covariate has the mean and positive definite variance satisfying . Additionally, for some , the covariate satisfies the local second moment bound
(4.26) |
By definition, it is clear that could be smaller, and in many cases much smaller, than . However, the statistical convergence rate of the estimator is governed by the local moment bound , while plays a role in determining the sample size requirement and the computational convergence rate. Specifically, the statistical guarantees for the robust estimator with truncation parameter , denoted as , depend on .
Theorem 4.9.
Theorem 4.9 provides the statistical convergence rates for heavy-tailed tensor logistic regression under the local second moment condition on the covariates. Importantly, the convergence rate of the robust gradient descent is shown to match the rate for low-rank tensor logistic regression when using the vanilla gradient descent algorithm with Gaussian design (Chen, Raskutti and Yuan, 2019). This demonstrates that our method can effectively handle heavy-tailed covariates, ensuring robust and optimal estimation in such settings. Similar to tensor linear regression, if the initial error satisfies and the sample size is sufficiently large, then the local radius is smaller than one. Under these conditions, the statistical convergence rates depend on the local second moment of the covariates, which governs the precision of the estimator.
Remark 4.10.
Compared to existing works on robust estimation for heavy-tailed logistic regression with vector covariates (Prasad et al., 2020; Zhu and Zhou, 2021; Han, Tsay and Wu, 2023), our proposed method offers a significant relaxation by replacing the finite fourth moment condition with a second moment condition. This relaxation makes the method more flexible and applicable to a broader range of scenarios, particularly when high-order moments (such as the fourth moment) do not exist or are difficult to ensure.
Remark 4.11.
The low-rank structure imposed on the tensor coefficient allows our approach to leverage the local second moment condition. This results in a much sharper convergence rate, compared to previous methods that typically do not exploit such structural properties. The local moment condition, which governs the behavior of the covariates in the partial gradients, contributes to a more precise characterization of gradient deviation, leading to better statistical guarantees.
4.3 Heavy-tailed tensor PCA
Another important statistical model for tensor data is tensor principal component analysis (PCA). Specifically, we consider
(4.29) |
where is the tensor-valued observation, is the low-rank true signal, and represents the random noise, assumed to be mean-zero. In the existing literature, much attention has been focused on the Gaussian or sub-Gaussian noise settings (Richard and Montanari, 2014; Zhang and Han, 2019; Han, Willett and Zhang, 2022).
In this subsection, we assume that random noise is heavy-tailed. We propose applying the robust gradient descent method with truncated gradient estimators to estimate the low-rank signal in the presence of such heavy-tailed noise. The loss function for tensor PCA is . The partial gradient with respect to is
(4.30) |
for any . Similarly, the partial gradient with respect to the core tensor is
(4.31) |
These gradients are computed using multilinear transformations of the data, specifically and , which are the data projected onto the respective factor spaces. To ensure robustness against heavy-tailed noise, we introduce the local -th moment condition for the noise tensor . This condition is necessary for the gradient-based optimization method to converge effectively in the presence of heavy-tailed noise.
Assumption 4.12.
For some and , has the local -th moment
(4.32) |
In contrast to many existing statistical analyses for tensor PCA, our method does not require the entries of the random noise to be independent or identically distributed. This relaxation is a key feature of our approach, allowing it to handle more general noise structures, including those with dependencies and heavy tails.
For the estimator obtained by the robust gradient descent, denoted as , as well as the estimation error , we have the following convergence rates.
Theorem 4.13.
Under the local -th moment condition for the noise tensor , the convergence rate of the proposed robust gradient descent method is shown to be comparable to that of vanilla gradient descent and achieves minimax optimality (Han, Willett and Zhang, 2022). Specifically, for , the signal-to-noise ratio (SNR) condition is identical to the SNR condition under the sub-Gaussian noise setting (Zhang and Xia, 2018). This result demonstrates that the robust gradient descent method is capable of effectively handling heavy-tailed noise, while still achieving minimax-optimal estimation performance. Furthermore, in a manner similar to tensor linear regression and logistic regression, if the signal strength satisfies and the initial error is sufficiently small, i.e., , the local radius is strictly smaller than one.
Remark 4.14.
The proposed robust gradient estimation approach offers several key advantages. First, unlike many existing methods that assume the noise follows a sub-Gaussian distribution, our approach relaxes this assumption by requiring only a -th moment condition on the noise tensor . This broader assumption allows the method to handle more general noise distributions, including those with heavier tails. Second, our method also accommodates the possibility that the elements of the noise tensor may exhibit strong correlations. However, it is important to note that the correlation structure does not directly affect the estimation performance. Instead, the noise that influences the estimation is only the part projected onto the local regions, characterized by . This localized effect allows the method to remain robust even when the global correlation in the noise is strong. Third, we allow to diverge to infinity, meaning that the method can tolerate increasingly larger noise values in certain regions. However, for reliable estimation performance, the SNR must satisfy the condition . This condition ensures that, even as the noise may increase in magnitude, the method still achieves optimal estimation performance as long as the signal is sufficiently strong.
5 Simulation Experiments
In this section, we present two simulation experiments. The first experiment is designed to verify the efficacy of the proposed method and highlight its advantages over other competitors. By comparing the performance of the proposed approach with that of alternative estimators, we demonstrate its superior robustness and accuracy. The second one aims to demonstrate the local moment effect in statistical convergence rates. Specifically, this experiment examines how the relaxation of the sub-Gaussian assumption and the use of local moment conditions influence the convergence behavior, providing empirical evidence for the theoretical results.
5.1 Experiment 1
We consider four models, and for each of them, we explore how different covariate and noise distributions affect the performance across diverse scenarios.
-
•
Model I (multi-response linear regression):
(5.1) where , , and is a rank-3 matrix. Entries of and are i.i.d., and four distributional cases are adopted for model I: (1) and ; (2) and ; (3) and ; and (4) and . Here, is the bounded distribution with degrees of freedom and bound , i.e., if where .
-
•
Model II (tensor linear regression):
(5.2) where , , and has Tucker ranks . Entries of are i.i.d., and four distributional cases are adopted: (1) and ; (2) and ; (3) and ; and (4) and .
-
•
Model III (tensor logistic regression):
(5.3) where , , and has Tucker ranks . Entries of are i.i.d., and two distributional cases are adopted: (1) and (2) .
-
•
Model IV (tensor PCA): where , and has Tucker ranks . Entries of are i.i.d. and two distributional cases are adopted: (1) and (2) .
For Model I, has singular values , and for Models II-IV, the Tucker decomposition of in (1.1) has a super-diagonal core tensor . For all models, the factor matrices ’s are generated randomly as the first left singular vectors of a random Gaussian ensemble. The first distributional case of each model represents the light-tailed setting, while the others are heavy-tailed, as at least part of the data follows a heavy-tailed distribution.
We apply the proposed robust gradient descent algorithm, as well as the vanilla gradient descent and adaptive Huber regression in Remark 4.2 (except for Model III) as competitors, to the data generated from each model. Initial values are obtained by Algorithm 2, and HOSVD is then applied to obtain the initial decomposition components. The hyperparameters are set as follows: , step size , number of iterations , and the truncation parameter is selected via five-fold cross-validation. For each model and distributional setting, the data generation and estimation procedure is replicated 500 times. The estimation errors across these 500 replications are presented in Figure 2, which summarizes the results of the experiment.
Figure 2: Estimation errors over 500 replications for Models I–IV under each distributional case.
According to Figure 2, when the data follows the light-tailed distribution (Case 1) in all models, the performances of the three estimation methods are nearly identical. However, in the heavy-tailed cases, the performance of vanilla gradient descent deteriorates significantly, with estimation errors much larger than those of the other two methods. Overall, the robust gradient descent method consistently yields the smallest estimation errors among the three methods. Specifically, when the covariates follow heavy-tailed distributions (i.e. Cases 3 and 4 for Model I and Cases 3 and 4 for Model II), the robust gradient descent method outperforms adaptive Huber regression, producing significantly smaller estimation errors. When the covariates are normally distributed and the noise follows a heavy-tailed distribution (i.e. Case 2 for Models I and II), the performances of the robust gradient descent and adaptive Huber regression methods are similar. These numerical findings support the methodological insights presented in Remark 4.2 and are in line with the theoretical results discussed in Remark 4.7, confirming the robustness and efficiency of the proposed method in handling heavy-tailed data.
5.2 Experiment 2
We consider Model I in (5.1) and Model II in (5.2) in the second experiment. To numerically verify the local moment condition, we introduce a row-wise sparsity structure for the factor matrices. Specifically, we define each as , for , where is an orthonormal matrix generated randomly, following a similar procedure as in the first experiment.
In Model I, we have and , where and are the sub-vectors of and , respectively, containing their first five elements. In other words, if we project the covariates or noise onto a local region around the column spaces of and , respectively, the transformed distribution is only related to the first five elements. Hence, we consider the following vectors , , , , , and , where represents that the elements of are independent and follow the distribution . Seven distributional cases are considered: (1) , ; (2) , ; (3) , ; (4) , ; (5) , ; (6) , ; and (7) , . By definition, for some sufficiently small , the local moments and remain unchanged in the first four settings even with varying distributions of and . In the last three settings, the local moments of covariates or noise get larger.
In Model II, by the row-wise sparse structure of ’s, , where is the sub-tensor of containing for and . Hence, we consider four distributional cases for : (1) all elements follow distribution; (2) all elements of follow distribution, and the other elements of follow distribution; (3) all elements of follow distribution, and the other elements of follow distribution; and (4) all elements of follow distribution. Similar to Model I, the local moment is the same in Cases 1 and 2, and becomes larger in Cases 3 and 4.
Figure 3: Estimation errors over 500 replications for Models I and II under the distributional cases of the second experiment.
For each model and case, we apply the proposed robust gradient descent method and replicate the procedure for 500 times. The estimation errors over 500 replications are presented in Figure 3. For Model I, the estimation errors in Cases 1-4 are almost identical, but those in the other cases are significantly larger. Similarly, for Model II, the errors are nearly the same in Cases 1 and 2, and increase substantially in Cases 3 and 4. These numerical findings in the second experiment confirm that the estimation errors are closely tied to the local moment bounds of the covariates and noise distributions, as discussed in Theorem 4.5.
6 Real Data Example: Chest CT Images
In this section, we apply the proposed robust gradient descent (RGD) estimation approach to the publicly available COVID-CT dataset (Yang et al., 2020). The dataset consists of 317 COVID-19 positive chest CT scans and 397 negative scans, selected from four open-access databases. Each scan is a greyscale matrix with a binary label indicating the disease status. The histograms of kurtosis for the pixels are presented in Figure 1, which provide strong evidence of heavy-tailed distributions, a characteristic commonly observed in biomedical imaging data.
To identify COVID-positive samples based on chest CT scans, we employ a low-rank tensor logistic regression model with and . For dimension reduction, we use a low-rank structure with . As in the first experiment (Section 5), we apply both the proposed robust gradient descent (RGD) algorithm and the vanilla gradient descent (VGD) algorithm to estimate the Tucker decomposition in (1.1). For training, we randomly select 200 positive scans and 250 negative scans, using the remaining data as testing data to evaluate the classification performance of the estimated models.
Using each estimation method, we classify the testing data into four categories: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The performance metrics used for evaluation include the precision rate TP/(TP+FP), the recall rate TP/(TP+FN), and the F1 score 2·Precision·Recall/(Precision+Recall). The precision, recall, and F1 scores for the RGD method are reported in Table 1, alongside the performance of the VGD method as a benchmark.
The results demonstrate that the RGD method outperforms the VGD method in terms of all three evaluation metrics: precision, recall, and F1 score. This highlights the ability of the proposed method to provide reliable and stable statistical inference in real-world applications, especially in the challenging domain of COVID-19 diagnosis from chest CT scans.
Table 1: Classification performance of the VGD and RGD methods on the testing data.

Method | Precision | Recall | F1 Score
---|---|---|---
VGD | 0.898 | 0.829 | 0.862
RGD | 0.954 | 0.880 | 0.916
7 Conclusion and Discussion
We propose a unified and general robust estimation framework for low-rank tensor models, which combines robust gradient estimators via entrywise truncation with gradient descent updates. This method is shown to possess three key properties: computational efficiency, statistical optimality, and distributional robustness. We apply this framework to a variety of tensor models, including tensor linear regression, tensor logistic regression, and tensor PCA. Under a mild local moment condition for the covariates and noise distributions, the proposed method achieves minimax optimal statistical error rates, even in the presence of heavy-tailed distributions.
The proposed robust gradient descent framework can be easily integrated with other types of robust gradient functions, such as median-of-means and rank-based methods. Furthermore, while we focus on heavy-tailed distributions, the framework can also be adapted to handle other scenarios with outliers, including Huber’s ε-contamination model and data with measurement errors.
While we have primarily considered Tucker low-rank tensor models in this paper, the proposed method is highly versatile and can be extended to other tensor models with different low-dimensional structures. In addition to Tucker ranks, there are various definitions of tensor ranks and corresponding low-rank tensor models (e.g., canonical polyadic decomposition and matrix product state models) (Kolda and Bader, 2009). Given the broad applicability of gradient descent algorithms, the robust gradient descent approach can be leveraged for these alternative models as well. Moreover, sparsity is often a key component in low-rank tensor models, facilitating dimension reduction and enabling variable selection (Zhang and Han, 2019; Wang et al., 2022). The robust gradient descent method can be utilized for robust sparse tensor decomposition, further enhancing its utility in high-dimensional settings.
In nonconvex optimization, initialization plays a crucial role in determining the convergence rate and the required number of iterations. Specifically, the radius in the local moment condition depends on the initialization strategy. Hence, developing computationally efficient and statistically robust initialization methods tailored to each statistical model is of significant interest for future work (Jain et al., 2017).
References
- Beggs, J. M. and Plenz, D. (2003). Neuronal avalanches in neocortical circuits. Journal of Neuroscience 23, 11167–11177.
- Bi, X., Qu, A. and Shen, X. (2018). Multilayer tensor factorization with applications to recommender systems. Annals of Statistics 46, 3308–3333.
- Bi, X., Tang, X., Yuan, Y., Zhang, Y. and Qu, A. (2021). Tensors in statistics. Annual Review of Statistics and Its Application 8, 345–368.
- Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning 8, 231–357.
- Bubeck, S., Cesa-Bianchi, N. and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory 59, 7711–7717.
- Cai, C., Li, G., Poor, H. V. and Chen, Y. (2019). Nonconvex low-rank tensor completion from noisy data. Advances in Neural Information Processing Systems 32.
- Candes, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57, 2342–2359.
- Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Annales de l'IHP Probabilités et Statistiques 48, 1148–1185.
- Chen, H., Raskutti, G. and Yuan, M. (2019). Non-convex projected gradient descent for generalized low-rank tensor regression. Journal of Machine Learning Research 20, 1–37.
- Chen, Y. and Wainwright, M. J. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025.
- Chen, R., Yang, D. and Zhang, C.-H. (2022). Factor models for high-dimensional tensor time series. Journal of the American Statistical Association 117, 94–116.
- De Lathauwer, L., De Moor, B. and Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 21, 1253–1278.
- Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R. I. et al. (2016). Sub-Gaussian mean estimators. Annals of Statistics 44, 2695.
- Dong, H., Tong, T., Ma, C. and Chi, Y. (2023). Fast and provable tensor robust principal component analysis via scaled gradient descent. Information and Inference: A Journal of the IMA 12, 1716–1758.
- Fan, J., Li, Q. and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society, Series B: Statistical Methodology 79, 247.
- Fan, J., Wang, W. and Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. Annals of Statistics 49, 1239.
- Friedman, N., Ito, S., Brinkman, B. A. W., Shimono, M., DeVille, R. E. L., Dahmen, K. A., Beggs, J. M. and Butler, T. C. (2012). Universal critical dynamics in high resolution neuronal avalanche data. Physical Review Letters 108, 208102.
- Han, Y., Tsay, R. S. and Wu, W. B. (2023). High dimensional generalized linear models for temporal dependent data. Bernoulli 29, 105–131.
- Han, R., Willett, R. and Zhang, A. R. (2022). An optimal statistical and computational framework for generalized tensor estimation. The Annals of Statistics 50, 1–29.
- Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 73–101.
- Jain, P., Kar, P. et al. (2017). Non-convex optimization for machine learning. Foundations and Trends in Machine Learning 10, 142–363.
- Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review 51, 455–500.
- Li, X., Xu, D., Zhou, H. and Li, L. (2018). Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences 10, 520–545.
- Li, Y., Chi, Y., Zhang, H. and Liang, Y. (2020). Non-convex low-rank matrix recovery with arbitrary outliers via median-truncated gradient descent. Information and Inference: A Journal of the IMA 9, 289–325.
- Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics 45, 866–896.
- Lu, Y., Tao, C., Wang, D., Uddin, G. S., Wu, L. and Zhu, X. (2024). Robust estimation for dynamic spatial autoregression models with nearly optimal rates. Available at SSRN 4873355.
- Ma, C., Wang, K., Chi, Y. and Chen, Y. (2018). Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, 3345–3354. PMLR.
- Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics 39, 1069–1097.
- Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research 13, 1665–1697.
- Netrapalli, P., U N, N., Sanghavi, S., Anandkumar, A. and Jain, P. (2014). Non-convex robust PCA. Advances in Neural Information Processing Systems 27.
- Prasad, A., Suggala, A. S., Balakrishnan, S. and Ravikumar, P. (2020). Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society, Series B: Statistical Methodology 82, 601–627.
- Raskutti, G., Yuan, M. and Chen, H. (2019). Convex regularization for high-dimensional multi-response tensor regression. Annals of Statistics 47, 1554–1584.
- Richard, E. and Montanari, A. (2014). A statistical model for tensor PCA. Advances in Neural Information Processing Systems 27.
- Roberts, J. A., Boonstra, T. W. and Breakspear, M. (2015). The heavy tail of the human brain. Current Opinion in Neurobiology 31, 164–172.
- Shen, Y., Li, J., Cai, J.-F. and Xia, D. (2023). Computationally efficient and statistically optimal robust low-rank matrix and tensor estimation. arXiv preprint arXiv:2203.00953.
- Sun, Q., Zhou, W.-X. and Fan, J. (2020). Adaptive Huber regression. Journal of the American Statistical Association 115, 254–265.
- Tan, K. M., Sun, Q. and Witten, D. (2023). Sparse reduced rank Huber regression in high dimensions. Journal of the American Statistical Association 118, 2383–2393.
- Tarzanagh, D. A. and Michailidis, G. (2022). Regularized and smooth double core tensor factorization for heterogeneous data. Journal of Machine Learning Research 23, 1–49.
- Tomioka, R. and Suzuki, T. (2013). Convex tensor decomposition via structured Schatten norm regularization. In Advances in Neural Information Processing Systems (NIPS), 1331–1339.
- Tong, T., Ma, C. and Chi, Y. (2022). Accelerating ill-conditioned robust low-rank tensor regression. In ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 9072–9076. IEEE.
- Tong, T., Ma, C., Prater-Bennette, A., Tripp, E. and Chi, Y. (2022). Scaling and scalability: Provable nonconvex low-rank tensor estimation from incomplete measurements. Journal of Machine Learning Research 23, 1–77.
- Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M. and Recht, B. (2016). Low-rank solutions of linear matrix equations via Procrustes flow. In International Conference on Machine Learning, 964–973. PMLR.
- Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311.
- Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Vol. 48. Cambridge University Press.
- Wang, D. and Tsay, R. S. (2023). Rate-optimal robust estimation of high-dimensional vector autoregressive models. The Annals of Statistics 51, 846–877.
- Wang, L., Zhang, X. and Gu, Q. (2017). A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Artificial Intelligence and Statistics, 981–990. PMLR.
- Wang, N., Zhang, X. and Mai, Q. (2023). High-dimensional tensor response regression using the t-distribution. arXiv preprint arXiv:2306.12125.
- Wang, L., Peng, B., Bradic, J., Li, R. and Wu, Y. (2020). A tuning-free robust and efficient approach to high-dimensional regression. Journal of the American Statistical Association, 1–44.
- Wang, D., Zheng, Y., Lian, H. and Li, G. (2022). High-dimensional vector autoregressive time series modeling via tensor decomposition. Journal of the American Statistical Association 117, 1338–1356.
- Wei, B., Peng, L., Guo, Y., Manatunga, A. and Stevens, J. (2023). Tensor response quantile regression with neuroimaging data. Biometrics 79, 1947–1958.
- Wu, Y., Chen, D., Li, C. and Tang, N. (2022). Bayesian tensor logistic regression with applications to neuroimaging data analysis of Alzheimer's disease. Statistical Methods in Medical Research 31, 2368–2382.
- Xu, P., Zhang, T. and Gu, Q. (2017). Efficient algorithm for sparse tensor-variate Gaussian graphical models via gradient descent. In Artificial Intelligence and Statistics, 923–932. PMLR.
- Yang, X., He, X., Zhao, J., Zhang, Y., Zhang, S. and Xie, P. (2020). COVID-CT-Dataset: A CT scan dataset about COVID-19. arXiv preprint arXiv:2003.13865.
- Yuan, M. and Zhang, C.-H. (2016). On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics 16, 1031–1068.
- Zhang, A. and Han, R. (2019). Optimal sparse singular value decomposition for high-dimensional high-order data. Journal of the American Statistical Association.
- Zhang, A. and Xia, D. (2018). Tensor SVD: Statistical and computational limits. IEEE Transactions on Information Theory 64, 7311–7338.
- Zhang, A. R., Luo, Y., Raskutti, G. and Yuan, M. (2020). ISLET: Fast and optimal low-rank tensor regression via importance sketching. SIAM Journal on Mathematics of Data Science 2, 444–479.
- Zhou, H., Li, L. and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association 108, 540–552.
- Zhu, Z. and Zhou, W. (2021). Taming heavy-tailed features by shrinkage. In International Conference on Artificial Intelligence and Statistics, 3268–3276. PMLR.
Appendix A Convergence Analysis of Robust Gradient Descent
A.1 Proofs of Theorem 3.3
The proof consists of five steps. In the first step, we introduce the notation and the regularity conditions used in the subsequent steps. In the second to fourth steps, we establish the convergence analysis of the estimation errors. Finally, in the last step, we verify the conditions stated in the first step recursively. Throughout the proof, we let be a constant depending only on .
Step 1. (Notations and conditions)
We first introduce the notation used in the proof. At step , we abbreviate the robust gradient estimators as and , for and . Denote ,
(A.1) |
and
(A.2) |
as the robust gradient estimation errors. By the stability of the robust gradients, , for all and . In addition, we assume , as required in Theorem 3.3.
Let such that , for . Define as the set of orthogonal matrices. For each step , we define
(A.3) |
and
(A.4) |
Here, collects the combined estimation errors for all tensor decomposition components at step , and ’s are the optimal rotations used to handle the non-identifiability of the Tucker decomposition.
Next, we discuss some additional conditions used in the convergence analysis. To ease presentation, we first assume that these conditions hold and verify them in the last step.
(C1) For any and , and for some absolute constant greater than one. Hence, .
(C2) For any , .
Step 2. (Descent of )
By definition of and ’s,
(A.5) |
For each , since , we have that for any ,
(A.6) |
where the last inequality follows from the mean inequality.
For the first term on the right-hand side of (A.6), we have the following decomposition
(A.7) |
Here, by condition (C1), the second term in (A.7) can be bounded by
(A.8) |
The third term in (A.7) can be rewritten as
(A.9) |
where . For the fourth term in (A.7), we have
(A.10) |
where we use the fact that .
Hence, for any ,
(A.11) |
where
(A.12) |
and
(A.13) |
Similarly, for any ,
(A.14) |
and
(A.15) |
where
and .
Hence, combining the above results, we have
(A.16) |
Step 3. (Lower bound of )
By definition of for , we have
(A.17) |
For the first term, by RCG condition of and Cauchy’s inequality,
(A.18) |
where is the higher-order perturbation term in
(A.19) |
Step 4. (Convergence analysis)
We have the following bound for
(A.21) |
Combining the results above, we have
(A.22) |
Taking and for some sufficiently small constants and , we have
(A.23) |
and
(A.24) |
Taking , we have
(A.25) |
By stability of the robust gradient estimators, for and ,
(A.26) |
Hence, as , we have
(A.27) |
We apply Lemma A.1 again and obtain
(A.28) |
Step 5. (Verification of conditions)
A.2 Auxiliary lemmas
The first lemma shows the equivalence between and the combined error ; it is taken from Lemma E.2 in Han, Willett and Zhang (2022) and is presented here for completeness. The proof of Lemma A.1 can be found in Han, Willett and Zhang (2022) and is therefore omitted.
Lemma A.1.
Suppose , , for , , and . Let be another Tucker low-rank tensor with , , and for some . Define
(A.32) |
Then, we have
(A.33) |
where are some constants related to .
The second lemma provides an upper bound for the second- and higher-order terms in the perturbation of a tensor Tucker decomposition; it is a higher-order generalization of Lemma E.3 in Han, Willett and Zhang (2022).
Lemma A.2.
Suppose that and with and . For , , , where and . Then, .
Proof.
We have that
(A.34) |
where , , and .
∎
Appendix B Statistical Analysis of Robust Gradient
The most essential part of the statistical analysis is to show that the robust gradient estimators are stable. For , consider the robust gradient estimator with respect to . Note that
(B.1) |
where
(B.2) |
Similarly, for , its robust gradient estimator is
We can also decompose into four components,
(B.3) |
where
(B.4) |
To prove the stability of the robust gradient estimators, it suffices to give proper upper bounds of for and .
B.1 Proof of Theorem 4.5
Proof.
The proof consists of seven steps. In the first six steps, we prove the stability of the robust gradient estimators for the general and, hence, we omit the notation for simplicity. Specifically, in the first four steps, we give the upper bounds for , respectively, for . In the fifth and sixth steps, we extend the proof to the terms for and the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions.
Throughout the first six steps, we assume that for each , and and will verify them in the last step.
Step 1. (Bound for )
For any , we let and
(B.5) |
Denote the columns of as such that . The -th entry of is
(B.6) |
where is the coordinate vector whose -th entry is one and the others are zero.
For the fixed ’s, let and . Similarly, let and . By Assumption 4.3, and , for , , and . Let and , for . Then, . Let , which satisfies that
(B.7) |
Let and .
For any and , by definition of the truncation operator and Markov’s inequality,
(B.9) |
with truncation parameter
(B.10) |
Hence, for ,
(B.11) |
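The bias control in this step follows the standard truncation argument. In generic notation (a scalar variable Z with a finite (1+ε)-th moment, not the specific quantities of this proof), writing the truncation operator as T_τ(z) = sign(z) min(|z|, τ), the induced bias satisfies

\[
\bigl|\mathbb{E}[T_\tau(Z)] - \mathbb{E}[Z]\bigr|
\;\le\; \mathbb{E}\bigl[|Z|\,\mathbf{1}\{|Z| > \tau\}\bigr]
\;\le\; \frac{\mathbb{E}|Z|^{1+\epsilon}}{\tau^{\epsilon}},
\]

which is why the truncation parameter in (B.10) is chosen to balance this bias against the concentration of the truncated average.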
Step 2. (Bound for )
For in (B.3), similarly to ,
(B.12) |
For each , it can be checked that
(B.13) |
Thus, we have the upper bound for the variance
(B.14) |
Also, for any , the higher-order moments satisfy that
(B.15) |
By Bernstein’s inequality, for any , , and ,
(B.16) |
Let . Therefore, we have
(B.17) |
and
(B.18) |
Hence, for , with probability at least ,
(B.19) |
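The Bernstein step above is applied in its standard form: for independent, mean-zero variables Z_1, ..., Z_n (generic notation) with |Z_i| ≤ M and total variance \(\sum_i \mathbb{E} Z_i^2 \le v\),

\[
\mathbb{P}\Bigl(\Bigl|\sum_{i=1}^{n} Z_i\Bigr| \ge t\Bigr)
\;\le\; 2\exp\Bigl(-\frac{t^{2}/2}{v + Mt/3}\Bigr),
\qquad t > 0.
\]

The truncated summands are bounded by construction, which is what makes this exponential inequality applicable despite the heavy-tailed data.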
Step 3. (Bound for )
The tensor linear regression can be rewritten as
(B.20) |
Hence, we denote the matricizations of and by and .
As in step 1,
(B.21) |
For and , let the -th entry of and be and , respectively. Let , and then .
Similarly to , let . Note that and , and hence . Then,
(B.22) |
where and
For any and ,
(B.23) |
Hence,
(B.24) |
Step 4. (Bound for )
Let . Then,
(B.25) |
Note that and that . Also, . Similarly to the term , by Bernstein’s inequality, for any , and ,
(B.26) |
Letting , we have that
(B.27) |
Therefore, with probability at least ,
(B.28) |
Combining these results, for any , with probability ,
(B.29) |
Step 5. (Extension to for )
For , we let and
(B.30) |
The -th entry of is
(B.31) |
Let and . Let and .
By Assumption 4.3, and , for and . Let and , where . Let . Hence, we have that
(B.32) |
and with probability at least ,
(B.33) |
Moreover,
(B.34) |
and with probability at least ,
(B.35) |
Therefore, for , with probability at least ,
(B.36) |
Step 6. (Extension to core tensor)
For the partial gradient with respect to the core tensor , we have
(B.37) |
Let and , and let and . By Assumption 4.3, and , for all and . Let .
In a similar fashion, we can show that with probability at least ,
(B.38) |
Hence, with probability at least ,
(B.39) |
Step 7. (Verify the conditions and conclude the proof)
In the last step, we apply the results above to Theorem 3.3. First, we verify that the conditions in Theorem 3.3 hold. Under Assumption 4.3, by Lemma 3.11 in Bubeck (2015), the RCG condition in Definition 3.2 is implied by restricted strong convexity and strong smoothness with and .
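For reference, the cited result (Lemma 3.11 in Bubeck, 2015) states, up to notation, that if a function f is α-strongly convex and β-smooth, then for all x and y,

\[
\langle \nabla f(x) - \nabla f(y),\, x - y \rangle
\;\ge\; \frac{\alpha\beta}{\alpha+\beta}\,\|x - y\|^{2}
\;+\; \frac{1}{\alpha+\beta}\,\|\nabla f(x) - \nabla f(y)\|^{2};
\]

restricting this inequality to the relevant low-rank neighborhood gives the RCG-type condition used here.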
Next, we show the stability of the robust gradient estimators for all . By matrix perturbation theory, if , we have for all . After a finite number of iterations , with probability at least , we have .
For any and any tensor , , where for and . For any , we have for some . Let and decompose where and . Thus, we have and .
Denote . Then, since
(B.40) |
we have that
(B.41) |
that is, taking ,
(B.42) |
Hence, for the iterate , combining the results in steps 1 to 6, we have that with probability at least , for any
(B.43) |
and
(B.44) |
As , plugging these into Theorem 3.3, we have that for all and ,
(B.45) |
and
(B.46) |
Finally, for all and ,
(B.47) |
∎
B.2 Proof of Theorem 4.9
Proof.
The proof consists of six steps. In the first five steps, we prove the stability of the robust gradient estimators for the general and, hence, we omit the notation for simplicity. Specifically, in the first four steps, we give the upper bounds for , respectively, for . In the fifth step, we extend the proof to the terms for the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions.
Throughout the first five steps, we assume that for each , and and will verify them in the last step.
Step 1. (Bound )
For any , we let and
(B.48) |
Let and . By Assumption 4.8, for and . Let and . Then, . Also, denote for any .
We first bound the bias . Note that and
(B.49) |
Let . Then,
(B.50) |
By Hölder’s inequality and Markov’s inequality,
(B.51) |
Hence, we have
(B.52) |
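Written out in generic scalar notation, the Hölder–Markov chain invoked here reads: for a variable Z with finite (1+ε)-th moment and truncation level τ,

\[
\mathbb{E}\bigl[|Z|\,\mathbf{1}\{|Z| > \tau\}\bigr]
\;\le\; \bigl(\mathbb{E}|Z|^{1+\epsilon}\bigr)^{\frac{1}{1+\epsilon}}
\,\mathbb{P}\bigl(|Z| > \tau\bigr)^{\frac{\epsilon}{1+\epsilon}}
\;\le\; \frac{\mathbb{E}|Z|^{1+\epsilon}}{\tau^{\epsilon}},
\]

so the bias is again of order τ^{-ε} times the relevant local (1+ε)-th moment.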
Step 2. (Bound )
For in (B.3), it can be checked that
(B.53) |
Thus, . Also, for any , the higher-order moments satisfy that
(B.54) |
By Bernstein’s inequality, for any ,
(B.55) |
Let . Therefore, we have
(B.56) |
and
(B.57) |
Therefore, with probability at least ,
(B.58) |
Step 3. (Bound )
By the Taylor expansion of ,
(B.59) |
For and , let the -th entry of and be and , respectively. Let , and then .
For any and ,
(B.60) |
Hence,
(B.61) |
Step 4. (Bound )
Let . Then,
(B.62) |
Note that and that . Also, . Similarly to the term , by Bernstein’s inequality, for any , and ,
(B.63) |
Letting , we have that
(B.64) |
Therefore, with probability at least ,
(B.65) |
Combining these results, for any , with probability ,
(B.66) |
Step 5. (Extension to core tensor)
For the partial gradient with respect to the core tensor , we have
(B.67) |
Let and . By Assumption 4.8, for all . In a similar fashion, we can show that with probability at least ,
(B.68) |
Hence, with probability at least ,
(B.69) |
Step 6. (Verify the conditions and conclude the proof)
Under Assumption 4.8, we can calculate the Hessian and easily check that the RCG condition holds.
Next, we plug the results from the first five steps into Theorem 3.3. Using the same argument as in the proof of Theorem 4.5, we have that with probability at least ,
(B.70) |
and
(B.71) |
for all .
Finally, for all and ,
(B.72) |
∎
B.3 Proof of Theorem 4.13
The proof consists of six steps. In the first five steps, we prove the stability of the robust gradient estimators for the general and, hence, we omit the notation for simplicity. Specifically, in the first four steps, we give the upper bounds for , respectively, for . In the fifth step, we extend the proof to the terms for the core tensor. In the last step, we apply the results to the local convergence analysis in Theorem 3.3 and verify the corresponding conditions.
Throughout the first five steps, we assume that for each , and and will verify them in the last step.
Step 1. (Bound )
Let and denote its columns by . Let , and by Assumption 4.8, . Let and, hence, we have , for and , satisfying .
We first bound the bias, for any and ,
(B.73) |
with the truncation parameter . Hence,
(B.74) |
Step 2. (Bound )
Note that
(B.75) |
Thus, we have . Also, for any , the higher-order moments satisfy that
(B.76) |
By Bernstein’s inequality, for any ,
(B.77) |
Letting , since we have
(B.78) |
and
(B.79) |
Hence, with probability at least ,
(B.80) |
Step 3. (Bound )
Clearly, we have
For and , let the -th entry of and be and , respectively. Let , and then .
For any and ,
(B.81) |
Hence,
(B.82) |
Step 4. (Bound )
Let . Then,
(B.83) |
Note that and that . Also, . Similarly to the term , by Bernstein’s inequality, for any , and ,
(B.84) |
Letting , we have that
(B.85) |
Therefore, with probability at least ,
(B.86) |
Combining these results, for any , with probability at least ,
(B.87) |
Step 5. (Extension to core tensor)
In a similar fashion, we can show that with probability at least ,
(B.88) |
Hence, we have with high probability,
(B.89) |
Step 6. (Verify the conditions and conclude the proof)
By definition, the RCG condition is automatically satisfied. Next, we plug the results from the first five steps into Theorem 3.3. Using the same argument as in the proof of Theorem 4.5, we have that with probability at least ,
(B.90) |
and
(B.91) |
for all .
Finally, for all and ,
(B.92) |