This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Robust High-Dimensional Regression with Coefficient Thresholding and its Application to Imaging Data Analysis

Bingyuan Liu1, Qi Zhang1, Lingzhou Xue1, Peter X.-K. Song2, and Jian Kang2
1Pennsylvania State University and 2University of Michigan
(First Version: November 2019;
This Version: January 2021)
Abstract

It is of importance to develop statistical techniques to analyze high-dimensional data in the presence of both complex dependence and possible outliers in real-world applications such as imaging data analyses. We propose a new robust high-dimensional regression with coefficient thresholding, in which an efficient nonconvex estimation procedure is proposed through a thresholding function and the robust Huber loss. The proposed regularization method accounts for complex dependence structures in predictors and is robust against outliers in outcomes. Theoretically, we analyze rigorously the landscape of the population and empirical risk functions for the proposed method. The fine landscape enables us to establish both statistical consistency and computational convergence under the high-dimensional setting. Finite-sample properties of the proposed method are examined by extensive simulation studies. An illustration of real-world application concerns a scalar-on-image regression analysis for an association of psychiatric disorder measured by the general factor of psychopathology with features extracted from the task functional magnetic resonance imaging data in the Adolescent Brain Cognitive Development study.

Keywords— Landscape analysis; Nonconvex optimization; Thresholding function; Scalar-on-image regression.

1 Introduction

Regression analysis of high-dimensional data has been extensively studied in a number of research fields over the last three decades or so. To overcome the high-dimensionality, statistical researchers have proposed a variety of regularization methods to perform variable selection and parameter estimation simultaneously. Among these, the 0\ell_{0} regularization enjoys the oracle risk inequality (Barron et al., 1999) but it is impractical due to its NP-hard computational complexity. In contrast, the 1\ell_{1} regularization (Tibshirani, 1996) provides an effective convex relaxation of the 0\ell_{0} regularization and achieves variable selection consistency under the so-called irrepresentable condition (Zhao & Yu, 2006; Zou, 2006; Wainwright, 2009). The adaptive 1\ell_{1} regularization (Zou, 2006) and the folded concave regularization (Fan & Li, 2001; Zhang, 2010) relax the irrepresentable condition and improve the estimation and variable selection performance. The folded concave penalized estimation can be implemented through a sequence of the adaptive 1\ell_{1} penalized estimations and achieves the strong oracle property (Zou & Li, 2008; Fan et al., 2014).

Despite these important advances, existing methods, including the (adaptive) 1\ell_{1} regularization and folded concave regularization, do not work well when predictors are strongly correlated, which is the case especially in scalar-on-image regression analysis (Wang et al., 2017; Kang et al., 2018; He et al., 2018). This paper is motivated by the needs of analyzing the nn-back working memory task fMRI data in the Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018). The task-invoked fMRI imaging measures the blood oxygen level signal that is linked to personal neural activities when performing a specific task. The nn-back task is a commonly used approach to making assessment in psychology and cognitive neuroscience with a focus on working memory. One question of interest is to understand the association between the risk of developing psychiatry disorder and features related to functional brain activity. We use the 2-back versus 0-back contrast map statistics derived from the nn-back task fMRI data as imaging predictors. We aim at identifying important imaging biomarkers that are strongly associated with the general factor of psychopathology (GFP) or “p-factor”. GFP is a psychiatric disorder outcome used to evaluate the overall mental health of a subject. In this application, it is expected that the irrepresentable condition of the 1\ell_{1} regularization can be easily violated by strong dependence among high dimensional imaging predictors from fMRI data. To illustrate the presence of strong dependence among imaging predictors, in Figure 1, we plot the largest absolute value of correlation coefficients and the number of correlation coefficients that are 0.8\geq 0.8 or 0.8\leq-0.8 between brain regions. Obviously, there exists strong dependence among imaging predictors, so that existing methods may not have a satisfactory performance in the scalar-on-image analysis. See the simulation study in Section 4 for more details.

To effectively address potential technical challenges in the presence of such strongly correlated predictors, we consider a new approach based on the thresholding function technique. The rationale behind our idea is rooted in attractive performances given by various recently developed thresholding methods, including the hard-thresholding property of the 0\ell_{0} regularization shown in Fan & Lv (2013). They showed that the global minimizer in the thresholded parameter space enjoys the variable selection consistency. Thus, with proper thresholding of coefficients, it is possible to significantly relax the irrepresentable condition while to address the strong dependence in the scalar-on-image regression analysis. Recently, manifested by the potential power of the thresholding strategy, Kang et al. (2018) studied a new class of Bayesian nonparametric models based on the soft-thresholded Gaussian prior, and Sun et al. (2019) proposed a two-stage hard thresholding regression analysis that applies a hard thresholding function on the initial 1\ell_{1}-penalized estimator. To achieve the strong oracle property (Fan et al., 2014), Sun et al. (2019) required the initial solution is reasonably close to the true solution in aspect of 2\ell_{2} norm with high probability.

Refer to caption
Refer to caption
Figure 1: Illustration of the strong dependence structure among imaging predictors. Panel (a) shows the largest absolute value of correlation coefficients between regions, and Panel (b) shows the number of correlation coefficients that are 0.8\geq 0.8 or 0.8\leq-0.8 between regions.

Robustness against outliers occurring from heavy-tailed errors is of great importance in the scalar-on-image analysis. Due to various limitations of used instruments and quality control in data preprocessing, fMRI data often involves many potential outliers(Poldrack, 2012), compromising the stability and reliability of standard regression analyses. fMRI indirectly measures neural activity by assessing blood-oxgen-level-dependent signals and its signal-to-noise ratio is often low(Lindquist et al., 2008). Moreover, the complexity of fMRI techniques limits the capacity of unifying fMRI data preprocessing procedures (Bennett & Miller, 2010; Brown & Behrmann, 2017) to identify and remove outliers effectively. Standard regression analysis with contaminated data may lead to a high rate of false positives in inference, as shown in many empirical studies (Eklund et al., 2012, 2016). It is loudly advocated that potential outliers should be taken into account in the study of brain functional connectivity using fMRI data (Rosenberg et al., 2016). In the lens of robustness, many works have been proposed to study the high dimensional robust regression problem. El Karoui et al. (2013) studied the consistency of regression with a robust loss function such as the least absolute deviation (LAD). In a high-dimensional robust regression, Loh (2017) showed that the use of a robust loss can help achieve the optimal rate of regression coefficient estimation with independent zero-mean error terms. In addition, Loh (2018) showed that by calibrating with a scale estimator in the Huber loss, the regularized robust regression estimator can be further improved.

In the current literature of the high dimensional scalar-on-image regression, Goldsmith et al. (2014) introduced a single-site Gibbs sampler that incorporates spatial information in a Bayesian regression framework to perform the scalar-on-image regression. Li et al. (2015) introduced a joint Ising and Dirichlet process prior to develop a Bayesian stochastic search variable selection. Wang et al. (2017) proposed a generalized regression model in which the image is assumed to belong to the space of bounded total variation incorporating the piece-wise smooth nature of fMRI data. Motivated by these works, in this paper we first introduce a new integrated robust regression model with coefficient thresholding and then propose a penalized estimation procedure with provable theoretical guarantees, where the noise distribution is not restricted to be sub-Gaussian. Specifically, we propose to use a smooth thresholding function to approximate the discrete hard thresholding function and tackle the strong dependence of predictors together with the use of the smoothed Huber loss (Charbonnier et al., 1997) to achieve desirable robust estimation. We design a customized composite gradient descent algorithm to efficiently solve the nonconvex and nonsmooth optimization problem. The proposed coefficient thresholding method is capable of incorporating intrinsic group structures of high-dimensional imaging predictors and dealing with their strong spatial and functional dependencies. Moreover, the proposed method effectively improves robustness and reliability.

The proposed regression with the coefficient thresholding method results in a nonconvex objective function in optimization. In the current literature, it becomes an increasingly important research topic to obtain the statistical and computational guarantees for nonconvex optimization methods. The local linear approximation (LLA) approach (Zou & Li, 2008; Fan et al., 2014, 2018) and the Wirtinger flow method (Candes et al., 2015; Cai et al., 2016) directly have enabled to analyze the computed local solution. The restricted strong convexity (RSC) condition (Negahban et al., 2012; Negahban & Wainwright, 2012; Loh & Wainwright, 2013, 2017) and the null consistency condition (Zhang & Zhang, 2012) were used to prove the uniqueness of the sparse local solution. However, it still remains non-trivial to study theoretical properties of the proposed robust regression with coefficient thresholding. The nonconvex optimization cannot be directly solved by the LLA approach, and the RSC condition does not hold. Alternatively, following the seminal paper by Mei et al. (2018), we carefully study the landscape of the proposed method. We prove that the proposed nonconvex loss function has a fine landscape with high probability and also establish the uniform convergence of the directional gradient and restricted Hessian of the empirical risk function to their population counterparts. Thus, under some mild conditions, we can establish key statistical and computational guarantees. Let nn be the sample size, pp be the dimension of predictors, and ss the size of the support set of true parameters. Specifically, we prove that, with high probability, (i) any stationary solution achieves the oracle inequality under the 2\ell_{2} norm when nCslogpn\geq Cs\log p; (ii) the proposed nonconvex estimation procedure has a unique stationary solution that is also the global solution when nCslog2pn\geq Cs\log^{2}p; and (iii) the proposed composite gradient descent algorithm attains the desired stationary solution. We shall point out that both statistical and computational guarantees of the proposed method do not require a specific type of initial solutions.

The rest of this paper is organized as follows. Section 2 first revisits the irrepresentable condition and then proposes the robust regression with coefficient thresholding and nonconvex estimation. Section 3 studies theoretical properties of the proposed method, including both statistical guarantees and computational guarantees. The statistical guarantees include the landscape analysis of the nonconvex loss function and asymptotic properties of the proposed estimators, while the computational guarantees concern the convergence of the proposed algorithm. Simulation studies are presented in Section 4 and the real application is demonstrated in Section 5. Section 6 includes a few concluding remarks. All the remaining technical details are given in the supplementary file.

2 Methodology

We begin with revisiting the irrepresentable condition and its relaxations in Subsection 2.1. After that, Subsection 2.2 proposes a new high-dimensional robust regression with coefficient thresholding.

2.1 The Irrepresentable Condition and its Relaxations

In the high-dimensional linear regression 𝐲=𝐗𝜷+𝜺,{\bf{y}}={\bf{X}}\boldsymbol{\beta}^{\star}+\boldsymbol{\varepsilon}, 𝐲=(y1,y2,,yn)T{\bf{y}}=(y_{1},y_{2},\dots,y_{n})^{T} is an nn-dimensional response vector, 𝐗=(𝐱1,,𝐱p){\bf{X}}=({\bf{x}}_{1},\dots,{\bf{x}}_{p}) is an n×pn\times p deterministic design matrix, 𝜷\boldsymbol{\beta}^{\star} is a pp-dimensional true regression coefficient vector and 𝜺\boldsymbol{\varepsilon} is an error vector with mean zero. We wish to recover the true sparse vector 𝜷\boldsymbol{\beta}^{\star}, whose support S(𝜷)={j:βj0}S(\boldsymbol{\beta}^{\star})=\{j:\beta_{j}^{\star}\neq 0\} satisfies that its cardinality |S(𝜷)|p|S(\boldsymbol{\beta}^{\star})|\ll p. We use SS as the shorthand for the support set S(𝜷)S(\boldsymbol{\beta}^{\star}).

The irrepresentable condition is known to be sufficient and almost necessary for the variable selection consistency of the 1\ell_{1} penalization (Zhao & Yu, 2006; Zou, 2006; Wainwright, 2009). Specifically, the design matrix 𝐗{\bf{X}} should satisfy that

maxjSc|𝐱jT𝐗S(𝐗ST𝐗S)1sgn(𝜷S)|(1γ),\underset{j\in S^{c}}{\max}\ |{\bf{x}}_{j}^{T}{\bf{X}}_{S}({\bf{X}}_{S}^{T}{\bf{X}}_{S})^{-1}\mathrm{sgn}(\boldsymbol{\beta}_{S}^{\star})|\leq(1-\gamma),

for some incoherence constant γ[0,1)\gamma\in[0,1), This condition requires that the variables in the true support are weakly correlated with other variables that are not in the true support.

There are several versions of relaxation of the irrepresentable condition in the current literature. Zou (2006) used the adaptive weights w^j\hat{w}_{j}’s in the 1\ell_{1} penalization and relaxed the irrepresentable condition as:

maxjSc1w^j|𝐱jT𝐗S(𝐗ST𝐗S)1𝐰^Ssgn(𝜷S)|(1γ),\underset{j\in S^{c}}{\max}\ \frac{1}{\hat{w}_{j}}|{\bf{x}}_{j}^{T}{\bf{X}}_{S}({\bf{X}}_{S}^{T}{\bf{X}}_{S})^{-1}\mathbf{\hat{w}}_{S}\circ\mathrm{sgn}(\boldsymbol{\beta}_{S}^{\star})|\leq(1-\gamma),

where \circ denotes the Hadamard (or componentwise) product of two vectors. The folded concave penalization  (Fan & Li, 2001; Zhang, 2010) also relaxed the irrepresentable condition. Let pλ()p_{\lambda}(\cdot) be the folded concave penalty function. Given the local linear approximation and the current solution 𝜷~\tilde{\boldsymbol{\beta}}, the folded concave penalization relaxed this condition as:

maxjSc1|p˙λ(β~j)||𝐱jT𝐗S(𝐗ST𝐗S)1|p˙λ(𝜷~S)|sgn(𝜷S)|(1γ),\underset{j\in S^{c}}{\max}\ \frac{1}{|\dot{p}_{\lambda}(\tilde{\beta}_{j})|}|{\bf{x}}_{j}^{T}{\bf{X}}_{S}({\bf{X}}_{S}^{T}{\bf{X}}_{S})^{-1}|\dot{p}_{\lambda}(\tilde{\boldsymbol{\beta}}_{S})|\circ\mathrm{sgn}(\boldsymbol{\beta}_{S}^{\star})|\leq(1-\gamma),

where p˙λ()\dot{p}_{\lambda}(\cdot) is the subgradient of the function pλ()p_{\lambda}(\cdot). Both the adaptive 1\ell_{1} penalization and folded concave penalization utilized the differential penalty functions to relax the restrictive irrepresentable condition. The procedure of adaptive lasso and solving folded concave penalized problem using local linear approximation (LLA) both require a good initialization. Their dependence on the initial solution is inevitably affected by the irrepresentable condition. As a promising alternative, we consider the assignment of adaptive weights on the design matrix 𝐗{\bf{X}}, which down-weights the unimportant variables. Consider the ideal adaptive weight function g(𝜷):=(g(β1),,g(βp))g(\boldsymbol{\beta}):=(g(\beta_{1}),\dots,g(\beta_{p})), where g(βj)g(\beta_{j}) goes to 11 when jSj\in S and goes to 0, otherwise. We propose the following nonconvex estimation procedure:

min𝜷12ni=1n{yij=1pxijg(βj)βj}2+λ𝜷1.\min_{\boldsymbol{\beta}}\frac{1}{2n}\sum_{i=1}^{n}\{y_{i}-\sum_{j=1}^{p}x_{ij}g(\beta_{j})\beta_{j}\}^{2}+\lambda\|\boldsymbol{\beta}\|_{1}.

Let 𝐳j=g(𝜷)𝐱j{\bf{z}}_{j}=g(\boldsymbol{\beta})\circ{\bf{x}}_{j}. The irrepresentable condition is now relaxed in this paper as follows:

maxjSc|g(βj)𝐱jT𝐙S(𝐙ST𝐙S)1sgn(𝜷S)|(1γ).\underset{j\in S^{c}}{\max}\ |g(\beta_{j})\cdot{\bf{x}}_{j}^{T}{\bf Z}_{S}({\bf Z}_{S}^{T}{\bf Z}_{S})^{-1}\circ\mathrm{sgn}(\boldsymbol{\beta}_{S}^{\star})|\leq(1-\gamma).

Given the ideal adaptive weight function g(𝜷)g(\boldsymbol{\beta}), the proposed approach further relaxes the irrepresentable condition in comparison to those considered by existing methods. Unlike existing methods, we propose an integrated nonconvex estimation procedure without depending on initial solutions.

2.2 The Proposed Method

The weight function plays an important role in relaxing the irrepresentable condition. If we knew the oracle threshold η=minjS{|βj|}\eta^{\star}=\underset{j\in S}{\min}\{|\beta_{j}^{\star}|\}, the hard thresholding function I{|βj|η}}I\{|\beta_{j}|\geq\eta^{\star}\}\} would be the ideal choice for the weight function g(𝜷)g(\boldsymbol{\beta}). However, the discontinuity of I{|u|η}I\{|u|\geq\eta\} is challenging for associated optimization. To overcome, we consider a smooth approximation of I{|u|η}I\{|u|\geq\eta\} given by

gτ,η(u)=hτ(uη)+hτ(uη),g_{\tau,\eta}(u)=h_{\tau}(u-\eta)+h_{\tau}(-u-\eta),

where hτ(w)=1/2+arctan(w/τ)/πh_{\tau}(w)=1/2+\arctan(w/\tau)/\pi. Since hτ(w)I{w0}h_{\tau}(w)\to I\{w\geq 0\} as τ0\tau\to 0 when w0w\neq 0, we have gτ,η(u)I{|u|η}g_{\tau,\eta}(u)\to I\{|u|\geq\eta\} as τ0\tau\to 0 when u0u\neq 0. Figure 2 illustrates the smooth approximation of βgτ,η(β)\beta\circ g_{\tau,\eta}(\beta) to βI{|β|η}\beta\circ I\{|\beta|\geq\eta\} when τ\tau is small (e.g., τ=0.001,0.005,0.01).\tau=0.001,0.005,0.01).

Refer to caption
(a) τ=0.001\tau=0.001
Refer to caption
(b) τ=0.005\tau=0.005
Refer to caption
(c) τ=0.01\tau=0.01
Figure 2: The smooth approximation of βgτ,η(β)\beta\circ g_{\tau,\eta}(\beta) to βI{|β|η}\beta\circ I\{|\beta|\geq\eta\} with η=0.1\eta=0.1.

Given the above smooth thresholding function gτ,η()g_{\tau,\eta}(\cdot), we propose the robust regression with coefficient thresholding:

min𝜷1ni=1nL(yij=1pxijβjgτ,η(βj))subject to𝜷2r,\displaystyle\min_{\boldsymbol{\beta}}\frac{1}{n}\sum_{i=1}^{n}L(y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}g_{\tau,\eta}(\beta_{j}))\quad\text{subject to}\quad\|\boldsymbol{\beta}\|_{2}\leq r, (1)

where following Mei et al. (2018) and Loh & Wainwright (2017) we assume that the regression coefficients 𝜷\boldsymbol{\beta} are bounded in the Euclidean ball 𝐁p(r){𝜷p:𝜷2r}{\bf B}^{p}(r)\equiv\{\boldsymbol{\beta}\in\mathbb{R}^{p}:\ \|\boldsymbol{\beta}\|_{2}\leq r\}. Here, to tackle possible outliers, we choose L()L(\cdot) to be a differentiable and possibly nonconvex robust loss function such as the pseudo Huber loss (Charbonnier et al., 1997) or Tukey’s biweight loss. For the ease of presentation, throughout this paper, we use the pseudo Huber loss of the following form:

L(a)=ω2{1+(a/ω)21},a,ω.L(a)=\omega^{2}\{\sqrt{1+(a/\omega)^{2}}-1\},a\in\mathbb{R},\omega\in\mathbb{R}. (2)

Note that L(a)L(a) provides a smooth approximation of the Huber loss (Huber, 1964) and bridges over the 1\ell_{1} loss and the 2\ell_{2} loss. Specifically, L(a)L(a) is approximately an 2\ell_{2} loss a2/2a^{2}/2 when aa is small and an 1\ell_{1} loss with slope ω\omega when aa is large. It is worth pointing out that the utility of the pseudo Huber loss essentially adds a weight on each observation xix_{i} by w(xi)w(x_{i}) in the following form:

w(xi)=1+(yiyi^)2/w2(yiyi^)2,w(x_{i})=\frac{\sqrt{1+(y_{i}-\hat{y_{i}})^{2}/w^{2}}}{(y_{i}-\hat{y_{i}})^{2}},

where yi^\hat{y_{i}} is a fitted value. Hence, outliers are down-weighted to alleviate potential estimation bias.

To deal with a large number of predictors, we propose the following penalized estimation of high-dimensional robust regression with coefficient thresholding:

min𝜷{1ni=1nL(yij=1pxijβjgτ,η(βj))+pλ(|𝜷|)}subject to𝜷2r,\displaystyle\min_{\boldsymbol{\beta}}\left\{\frac{1}{n}\sum_{i=1}^{n}L(y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}g_{\tau,\eta}(\beta_{j}))+p_{\lambda}(|\boldsymbol{\beta}|)\right\}\quad\text{subject to}\quad\|\boldsymbol{\beta}\|_{2}\leq r, (3)

where pλ(|𝜷|)p_{\lambda}(|\boldsymbol{\beta}|) is a chosen penalty function such as the 1\ell_{1} penalization or folded concave penalization. In the scalar-on-image analysis, often we need to incorporate some known group information into pλ(|𝜷|)p_{\lambda}(|\boldsymbol{\beta}|). Suppose that the coefficient vector 𝜷\boldsymbol{\beta} is divided into BB separate subvectors 𝜷1,.,𝜷B\boldsymbol{\beta}_{1},.\cdots,\boldsymbol{\beta}_{B}. We can use the group penalty λb=1B𝜷b2\lambda\sum_{b=1}^{B}\|\boldsymbol{\beta}_{b}\|_{2} (Yuan & Lin, 2006) or sparse group penalty λ1𝜷1+λ2b=1B𝜷b2\lambda_{1}\|\boldsymbol{\beta}\|_{1}+\lambda_{2}\sum_{b=1}^{B}\|\boldsymbol{\beta}_{b}\|_{2} (Simon et al., 2013) in the penalized estimation (3). With pλ(|𝜷|)=λb=1B𝜷b2p_{\lambda}(|\boldsymbol{\beta}|)=\lambda\sum_{b=1}^{B}\|\boldsymbol{\beta}_{b}\|_{2}, the sparse robust regression with coefficient thresholding and group selection can be written as follows:

min𝜷{1ni=1nL(yij=1pxijβjgτ,η(βj))+λb=1B𝜷b2}subject to𝜷2r,\displaystyle\min_{\boldsymbol{\beta}}\left\{\frac{1}{n}\sum_{i=1}^{n}L(y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}g_{\tau,\eta}(\beta_{j}))+\lambda\sum_{b=1}^{B}\|\boldsymbol{\beta}_{b}\|_{2}\right\}\quad\text{subject to}\quad\|\boldsymbol{\beta}\|_{2}\leq r, (4)

which includes the 1\ell_{1} penalization as a special example with B=pB=p. The penalized robust regression with coefficient thresholding (3) and (4) can be efficiently solved by a customized composite gradient descent algorithm with provable guarantees, and the details will be presented in Subsection 3.3.

REMARK 1.

Both thresholding function and robust loss work together to relax the restrictive irrepresentable condition in the presence of possible outliers. Define 𝐖=(𝐰1,,𝐰p)\mathbf{W}=(\mathbf{w}_{1},\cdots,\mathbf{w}_{p}) as the weight matrix with 𝐰j=(w(x1j),,w(xnj))\mathbf{w}_{j}=(w(x_{1j}),\ldots,w(x_{nj})). The irrepresentable condition for the 1\ell_{1}-penalized robust loss can be written as

maxjSc|𝐰j𝐱jT𝐀S(𝐀ST𝐀S)1sgn(𝜷s)|(1γ),\underset{j\in S^{c}}{\max}\ |\mathbf{w}_{j}\circ{\bf{x}}_{j}^{T}{\bf A}_{S}({\bf A}_{S}^{T}{\bf A}_{S})^{-1}\cdot\mathrm{sgn}(\boldsymbol{\beta}_{s}^{\star})|\leq(1-\gamma),

where 𝐀=𝐖𝐗{\bf A}=\mathbf{W}\circ{\bf{X}}. We impose the weights 𝐖\mathbf{W} on the row vectors of design matrix 𝐗{\bf{X}}. Recall that the thresholding function adds the weights to the column vectors of 𝐗{\bf{X}}. Now, with pλ(|𝛃|)=λ𝛃1p_{\lambda}(|\boldsymbol{\beta}|)=\lambda\|\boldsymbol{\beta}\|_{1} in (3), we add the entry-specific weight 𝐖~=(w~ij)n×p\tilde{\mathbf{W}}=(\tilde{w}_{ij})_{n\times p} to 𝐗{\bf{X}} with

w~ij=w(xij)g(βj).\tilde{w}_{ij}=w(x_{ij})\circ g(\beta_{j}).

Thus, the irrepresentable condition is now relaxed as follows:

maxjSc|𝐰~j𝐱jT𝐙S(𝐙ST𝐙S)1sgn(𝜷s)|(1γ),\underset{j\in S^{c}}{\max}\ |\tilde{\mathbf{w}}_{j}\circ{\bf{x}}_{j}^{T}{\bf Z}_{S}({\bf Z}_{S}^{T}{\bf Z}_{S})^{-1}\cdot\mathrm{sgn}(\boldsymbol{\beta}_{s}^{\star})|\leq(1-\gamma),

where 𝐙=𝐖~𝐗{\bf Z}=\tilde{\mathbf{W}}\circ{\bf{X}}. By assigning column-wise weight, we allow strong dependence between variables in the true support and other variables; By assigning row-wise weight, we reduce the weights for outliers and enhance the robustness.

REMARK 2.

The proposed method can be considered as the simultaneous estimation of regression coefficients and adaptive weights to improve the adaptive 1\ell_{1} penalization (Zou, 2006), whose weights are usually solved from the initial solution or iterated solution. Let ξ=𝛃gτ,η(𝛃)=(ξ1,,ξp)T\mathbf{\xi}=\boldsymbol{\beta}\circ g_{\tau,\eta}(\boldsymbol{\beta})=(\xi_{1},\ldots,\xi_{p})^{T}. Since ξ\mathbf{\xi} is a continuous and injective function of 𝛃\boldsymbol{\beta} (say, ξ=G(𝛃)\mathbf{\xi}=G(\boldsymbol{\beta})), there is a unique 𝛃\boldsymbol{\beta} such that ξ=𝛃gτ,η(𝛃)\mathbf{\xi}=\boldsymbol{\beta}\circ g_{\tau,\eta}(\boldsymbol{\beta}) holds. Hence, we may write 𝛃=G1(ξ)\boldsymbol{\beta}=G^{-1}(\mathbf{\xi}). The minimization in the proposed high-dimensional robust regression with coefficient thresholding can be rewritten as follows:

minξ{1ni=1nL(yij=1pxijξj)+λj=1p|ξj|gτ,η(G1(ξ)j)}.\displaystyle\min_{\mathbf{\xi}}\left\{\frac{1}{n}\sum_{i=1}^{n}L(y_{i}-\sum_{j=1}^{p}x_{ij}\xi_{j})+{\lambda}\sum_{j=1}^{p}\frac{|\xi_{j}|}{g_{\tau,\eta}(G^{-1}({\mathbf{\xi}})_{j})}\right\}. (5)

However, it does not fall into the folded-concave penalized estimation (Fan & Li, 2001; Fan et al., 2014). The proposed method aims to address the presence of highly correlated predictors, whereas the nonconvex penalized estimation was motivated by correcting the bias of the 1\ell_{1}-penalized estimation. Also, solving (5) is extremely challenging, since the denominator gτ,η(G1(ξ)j)g_{\tau,\eta}(G^{-1}({\mathbf{\xi}})_{j}) can be arbitrarily small when ξj0\xi_{j}\to 0. In comparison, the thresholding function is bounded and the non-convex optimization in (4) is computationally tractable.

REMARK 3.

The proposed method is also related to Bayesian methods (Nakajima & West, 2013a, b, 2017; Kang et al., 2018; Ni et al., 2019; Cai et al., 2019). In the scalar-on-image regression, Kang et al. (2018) proposed a soft-thresholded Gaussian process (STGP) that enjoys posterior consistency for both parameter estimation and variable selection. In this Bayesian framework, 𝛃\boldsymbol{\beta} is assumed to follow an STGP, where the soft thresholding function is defined as

gη(𝜷)=sgn(𝜷)(|𝜷|λ)1{|𝜷|>λ}.g_{\eta}(\boldsymbol{\beta})=\mathrm{sgn}(\boldsymbol{\beta})(|\boldsymbol{\beta}|-\lambda)\textbf{1}_{\{|\boldsymbol{\beta}|>\lambda\}}. (6)

Thus the regression model can be written as follows:

𝐲=𝐗sgn(𝜷)(|𝜷|λ)1{|𝜷|>λ}(𝜷)+𝜺,\displaystyle{\bf{y}}={\bf{X}}\mathrm{sgn}(\boldsymbol{\beta})(|\boldsymbol{\beta}|-\lambda)\textbf{1}_{\{|\boldsymbol{\beta}|>\lambda\}}(\boldsymbol{\beta})+\boldsymbol{\varepsilon},

where 𝛃\boldsymbol{\beta} is a realization of a Gaussian process. Compared to Kang et al. (2018), our proposed method is more robust to possible outliers, and uses a very different approach to incorporate the thresholding function that down-weights unimportant variables and achieves sparsity. The proposed method also enjoys both statistical and computational guarantees as detailed in Section 3, whereas the convergence rate of the posterior computation algorithm in Kang et al. (2018) is still unclear.

3 Theoretical Properties

This section studies the theoretical properties of our proposed method. After establishing the connection of our proposed method with the thresholded parameter space in Subsection 3.1, we present the landscape analysis and asymptotic properties in Subsection 3.2, and the computational guarantee for an efficient composite gradient descent algorithm in Subsection 3.3.

3.1 Connection with the Thresholded Parameter Space

It is known that 0\ell_{0}-penalization enjoys the oracle risk inequality (Barron et al., 1999). Fan & Lv (2013) proved that the global solution of 0\ell_{0}-penalization, denoted by 𝜷^𝒪=(β^1o,,β^po)T\boldsymbol{\hat{\beta}}^{\mathcal{O}}=(\hat{\beta}^{o}_{1},\cdots,\hat{\beta}^{o}_{p})^{T}, satisfies the hard thresholding property that β^jo\hat{\beta}^{o}_{j} is either 0 or larger than a positive threshold whose value depends on 𝐗{\bf{X}} and 𝐲{\bf{y}} and tuning parameter λ\lambda. Specifically, Fan & Lv (2013) studied the oracle properties of regularization methods over the thresholded parameter space η\mathcal{B_{\eta}} defined as

η={𝜷p:𝜷2<r,𝜷0cn/logp,and for eachj,βj=0or|βj|η}.\mathcal{B_{\eta}}=\{\boldsymbol{\beta}\in\mathbb{R}^{p}:\ \|\boldsymbol{\beta}\|_{2}<r,\|\boldsymbol{\beta}\|_{0}\leq cn/\log p,\text{and for each}\,j,\ \beta_{j}=0\,\ \text{or}\,\ |\beta_{j}|\geq\eta\}. (7)

Given the thresholded parameter space η\mathcal{B_{\eta}}, we introduce the high-dimensional regression with coefficient hard-thresholding as follows:

min𝜷{1ni=1nL(yij=1pxijβjI{|βj|η})+pλ(|𝜷|)}subject to𝜷2r.\min_{\boldsymbol{\beta}}\left\{\frac{1}{n}\sum_{i=1}^{n}L(y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}I\{|\beta_{j}|\geq\eta\})+p_{\lambda}(|\boldsymbol{\beta}|)\right\}\quad\text{subject to}\quad\|\boldsymbol{\beta}\|_{2}\leq r. (8)

This thresholded parameter space guarantees that the hard thresholding property is satisfied with proper 2\ell_{2} and 0\ell_{0} norm upper bounds. Suppose that global solutions to both (3) and (8) exist. It is intuitive that if τ\tau is very small, an optimal solution to (3) should be “close” to that of (8). Apparently, gτ,η()g_{\tau,\eta}(\cdot) converges to I(||η)I(|\cdot|\geq\eta) pointwisely almost everywhere as τ0+\tau\mapsto 0^{+}. However it is known that almost surely pointwise convergence does not guarantee the convergence of minimizers (e.g., (Rockafellar & Wets, 2009, Section 7.A)). Here we provide the following proposition to show that the global solution of the proposed method (3) converges to the minimizer of coefficient hard-thresholding in (8) with additional assumptions.

PROPOSITION 1.

Suppose that {τk}\{\tau^{k}\} is a sequence of scale parameters in the weight function gτ,η()g_{\tau,\eta}(\cdot) such that τk\tau^{k} goes to 0 as kk\to\infty. Let 𝛃^k\boldsymbol{\hat{\beta}}^{k} be a global minimizer of (3) with τ=τk\tau=\tau^{k}. If

  • (i)

    η<min|βj|\eta<\min|\beta_{j}^{\star}| and 𝜷2r\|\boldsymbol{\beta}^{\star}\|_{2}\leq r

  • (ii)

    For any ϵ>0\epsilon>0, there exists δ>0\delta>0, such that

    max1jp(||β^jk|η|<δ)<ϵ,\displaystyle\max_{1\leq j\leq p}\mathbb{P}\left(||\hat{\beta}^{k}_{j}|-\eta|<\delta\right)<{\epsilon},

then with arbitrary high probability, the sequence {𝛃^k}\{\boldsymbol{\hat{\beta}}^{k}\} enjoys a cluster point 𝛃^\boldsymbol{\hat{\beta}}, which is the global minimizer of the high-dimensional regression with coefficient hard-thresholding in (8).

REMARK 4.

Conditions (i) and (ii) of Proposition 1 are mild in our setting. Condition (i) assumes that the threshold level is smaller than the minimal signal level and the 2\ell_{2} norm of true regression coefficients is bounded. Condition (ii) assumes the magnitude of the estimated β^k\hat{\beta}^{k} is bounded away from the threshold level η\eta with high probability, that is, P(||β^jk|η|δ)1ϵP\left(||\hat{\beta}^{k}_{j}|-\eta|\geq\delta\right)\geq 1-\epsilon, j\forall j. Note that the true regression coefficients βj\beta_{j}^{\star}’s are either zero or larger than the threshold level η\eta. As long as the estimator is consistent (see Theorem 2), Condition (ii) can be satisfied.

Given Proposition 1, we can prove that the total number of falsely discovered signs of the cluster point 𝜷^\boldsymbol{\hat{\beta}}, i.e., the cardinality card({j:sgn(𝜷^)sgn(𝜷)})\mathrm{card}(\{j:\mathrm{sgn}(\boldsymbol{\hat{\beta}})\neq\mathrm{sgn}(\boldsymbol{\beta}^{\star})\}), converges to zero for the folded concave penalization under the squared loss and sub-Gaussian tail conditions. Recall that the folded concave penalty function pλ(t)p_{\lambda}(t) satisfies that (i) it is increasing and concave in t[0,)t\in[0,\infty), (ii) pλ(0)=0p_{\lambda}(0)=0, and (iii) it is differentiable with p˙λ(0+)=aλ\dot{p}_{\lambda}(0_{+})=a\lambda for some a>0a>0. Define κc\kappa_{c} as the smallest possible positive integer such that there exists an n×κcn\times\kappa_{c} submatrix of n1/2𝐗n^{-1/2}{\bf{X}} having a singular value less than a given positive constant cc. κc\kappa_{c} is named robust spark in (Fan & Lv, 2013) to ensure model identifiability. Note that card({j:sgn(𝜷^)sgn(𝜷)})\mathrm{card}(\{j:\mathrm{sgn}(\boldsymbol{\hat{\beta}})\neq\mathrm{sgn}(\boldsymbol{\beta}^{\star})\}) is an upper bound on the total number of false positives and false negatives. The following proposition implies the variable selection consistency of the cluster point 𝜷^\boldsymbol{\hat{\beta}}.

PROPOSITION 2.

Suppose we re-scale each column vector of the design matrix 𝐗{\bf{X}} for each predictor to have 2\ell_{2}-norm n1/2n^{1/2}, and the error 𝛆\boldsymbol{\varepsilon} satisfies the sub-Gaussian tail condition with mean 0. logp=O(nc0)\log p=O(n^{c_{0}}) and s=o(n1c0)s=o(n^{1-{c_{0}}}) for some constant c0(0,1)c_{0}\in(0,1). And there exists constant cc such that s>κc/2s>\kappa_{c}/2. Let 𝒜\mathcal{A} be the support of 𝛃\boldsymbol{\beta}^{\star} and minimal signal strength minj𝒜|βj|>η\min_{j\in\mathcal{A}}|\beta_{j}^{\star}|>\eta. Define 𝛃^\boldsymbol{\hat{\beta}} as the global minimizer of the following optimization problem:

min𝜷{1ni=1n(yij=1pxijβjI{|βj|η})2+pλ(|𝜷|)}subject to𝜷2r,𝜷0κc/2.\displaystyle\min_{\boldsymbol{\beta}}\left\{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}I\{|\beta_{j}|\geq\eta\})^{2}+p_{\lambda}(|\boldsymbol{\beta}|)\right\}\quad\text{subject to}\quad\|\boldsymbol{\beta}\|_{2}\leq r,\|\boldsymbol{\beta}\|_{0}\leq\kappa_{c}/2. (9)

If the penalization parameter λ\lambda is chosen such that λ=c1logp/n=o(η)\lambda=c_{1}\sqrt{\log p/n}=o(\eta) for some constant c1c_{1}, then with probability 1O(pc2)1-O(p^{-c_{2}}), we have

card({j:sgn(𝜷^)sgn(𝜷)})Csλ2η2/(1Cλ2η2),\mathrm{card}(\{j:\mathrm{sgn}(\boldsymbol{\hat{\beta}})\neq\mathrm{sgn}(\boldsymbol{\beta}^{\star})\})\leq Cs\lambda^{2}\eta^{-2}/(1-C\lambda^{2}\eta^{-2}),

where c2c_{2} and CC are constants.

Propositions 1 and 2 indicate that the cluster point 𝜷^\boldsymbol{\hat{\beta}} enjoys the similar properties as those of the global solution given by the 0\ell_{0}-penalized methods. The proposed method mimics the 0\ell_{0}-penalization to remove irrelevant predictors and improve the estimation accuracy as τ0\tau\to 0.

In the sequel, we will show that with high probability our proposed method has a unique global minimizer (See Theorem 2(b)). In this aspect, our proposed regression with coefficient thresholding estimator provides a practical way to approximate the global minimizer over the thresholded space.

3.2 Statistical Guarantee

After presenting technical conditions, we first provide the landscape analysis and establish the uniform convergence from empirical risk to population risk in Subsection 3.2.1 and then prove the oracle inequalities for the unique minimizer of the proposed method in Subsection 3.2.2.

Now we introduce the notation to simplify our analysis. Let g(u)g(u) be a shorthand of gτ(u,η)g_{\tau}(u,\eta), and let G(𝜷)=(β1g(β1),β2g(β2),,βpg(βp))TG(\boldsymbol{\beta})=(\beta_{1}g(\beta_{1}),\beta_{2}g(\beta_{2}),\cdots,\beta_{p}g(\beta_{p}))^{T} on p\mathbb{R}^{p}. Given fixed τ\tau, we claim that G(𝜷)G(\boldsymbol{\beta}) is third continuously differentiable on its domain. After the direct calculations given in Lemma 1 of the supplementary file, we have the explicit upper and lower bounds for the derivatives of G(𝜷)G(\boldsymbol{\beta}), i.e.,

k0¯Ip×pDG(𝜷)k0¯Ip×p,\displaystyle\underline{k_{0}}I_{p\times p}\preccurlyeq D_{G}(\boldsymbol{\beta})\preccurlyeq\overline{k_{0}}I_{p\times p},\quad
DG2(𝜷)m0¯Ip×p×p,\displaystyle D^{2}_{G}(\boldsymbol{\beta})\preccurlyeq\overline{m_{0}}I_{p\times p\times p},\quad
DG3(𝜷)s0¯Ip×p×p×p,\displaystyle D^{3}_{G}(\boldsymbol{\beta})\preccurlyeq\overline{s_{0}}I_{p\times p\times p\times p},

where DG(𝜷):=G(𝜷)D_{G}(\boldsymbol{\beta}):=\nabla G(\boldsymbol{\beta}), DG2(𝜷):=DG(𝜷)D^{2}_{G}(\boldsymbol{\beta}):=\nabla D_{G}(\boldsymbol{\beta}), and DG3(𝜷):=DG2(𝜷)D^{3}_{G}(\boldsymbol{\beta}):=\nabla D^{2}_{G}(\boldsymbol{\beta}). Here, k0¯,k0¯,m0¯\underline{k_{0}},\overline{k_{0}},\overline{m_{0}} and s0¯\overline{s_{0}} are uniform constants independent of 𝜷\boldsymbol{\beta}. And ABA\preccurlyeq B represents that BAB-A is semi positive definite.

For the group lasso penalty, denote by 𝜷a=(𝜷1aT,,𝜷B1aT)T\boldsymbol{\beta}^{a}=(\boldsymbol{\beta}_{1}^{a^{T}},...,\boldsymbol{\beta}_{B_{1}}^{a^{T}})^{T} the non-zero groups and by 𝜷c=(𝜷1cT,,𝜷B2cT)T\boldsymbol{\beta}^{c}=(\boldsymbol{\beta}_{1}^{c^{T}},...,\boldsymbol{\beta}_{B_{2}}^{c^{T}})^{T} the zero groups, respectively. Following the structure of classical group lasso, let diad_{i}^{a} and dicd_{i}^{c} be the corresponding length of vector 𝜷ia\boldsymbol{\beta}_{i}^{a} and 𝜷ic\boldsymbol{\beta}_{i}^{c}, respectively. Let dmaxa=maxi{1,,B1}diad^{a}_{max}=\underset{i\in\{1,\dots,B_{1}\}}{\max}\ d_{i}^{a}, dmaxc=maxi{1,,B2}dicd^{c}_{max}=\underset{i\in\{1,\dots,B_{2}\}}{\max}\ d_{i}^{c}, and dmax=dmaxadmaxcd_{\max}=d^{a}_{\max}\vee d^{c}_{\max}, where dmaxd_{\max} is finite, not diverging with nn and pp.

We make the following assumptions on the distribution of predictor vector 𝐱{\bf{x}}, the true parameter vector 𝜷\boldsymbol{\beta}^{\star} and the distribution of random error 𝜺\boldsymbol{\varepsilon}.

ASSUMPTION 1.
  • (a)

    The predictor vector 𝐱{\bf{x}} is σ2\sigma^{2}-sub-Gaussian with mean zero and continuous density p()p(\cdot) in p\mathbb{R}^{p}, namely 𝔼[𝐱]=𝟎\mathbb{E}[{\bf{x}}]=\mathbf{0} and 𝔼{[exp(𝐮,𝐱)]}exp(σ2𝐮22/2)\mathbb{E}\{[\exp(\langle{\bf{u}},{\bf{x}}\rangle)]\}\leq\exp({\sigma^{2}\|{\bf{u}}\|^{2}_{2}}/{2}) for all 𝐮p{\bf{u}}\in\mathbb{R}^{p}.

  • (b)

    The feature vector 𝐱{\bf{x}} spans all directions in p\mathbb{R}^{p}, namely 𝔼[𝐱𝐱T]γσ2Ip×p\mathbb{E}[{\bf{x}}{\bf{x}}^{T}]\succcurlyeq\gamma\sigma^{2}I_{p\times p} for some 0<γ<10<\gamma<1.

  • (c)

    The true coefficient vector 𝜷\boldsymbol{\beta}^{\star} is sparse such that s0=supp(𝜷)=o(n)s_{0}=\mathrm{supp}(\boldsymbol{\beta}^{\star})=o(n) and 𝜷2r\|\boldsymbol{\beta}^{\star}\|_{2}\leq r. Also, the threshold index η\eta in weight function gτ,η(𝜷)g_{\tau,\eta}(\boldsymbol{\beta}) satisfies that ηminjS0{|βj|}\eta\leq\min_{j\in S_{0}}\{|\beta^{\star}_{j}|\}.

  • (d)

    The random error 𝜺\boldsymbol{\varepsilon} has a symmetric distribution whose density is strictly positive and decreasing on (0,)(0,\infty).

Assumption 1(a) presents the technical conditions on the design matrix. The same conditions were considered in Mei et al. (2018) for binary linear classification and robust regression. The sub-Gaussian assumption is a commonly used mild condition in high-dimensional regression. Assumption 1(b) imposes the sparsity on the true parameter vector 𝜷\boldsymbol{\beta}^{\star}. Assumption 1(c) permits the possibly heavy tails of the error. We allow the size of the true support set to diverge at rate o(n)o(n). Given the sparsity, it is reasonable to limit our theoretical analysis in the Euclidean ball Θn,p=𝐁p(r){𝜷p,𝜷2r}\Theta_{n,p}={\bf B}^{p}(r)\equiv\{\boldsymbol{\beta}\in\mathbb{R}^{p},\|\boldsymbol{\beta}\|_{2}\leq r\}, which can avoid unnecessary technical complications. Also, η\eta does not need to be a constant bounded away from 0 in the asymptotic analysis. Since G(𝜷)G(\boldsymbol{\beta}) is continuous and injective, there exists a unique 𝜷~\boldsymbol{\tilde{\beta}}^{\star}, such that 𝜷~2r/3\|\boldsymbol{\tilde{\beta}}^{\star}\|_{2}\leq r/3 and G(𝜷~)=𝜷G(\boldsymbol{\tilde{\beta}^{\star}})=\boldsymbol{\beta}^{\star}, i.e. 𝜷\boldsymbol{\beta}^{\star} is the thresholded version of 𝜷~\tilde{\boldsymbol{\beta}}^{\star}. In the sequel, we will study the statistical convergence to the surrogate 𝜷~\tilde{\boldsymbol{\beta}}^{\star}, which shares the same non-zero support with 𝜷\boldsymbol{\beta}^{\star}. In Theorem 2, we show that 𝜷^𝜷~2=Op(s0logpn)\|\boldsymbol{\hat{\beta}}-\boldsymbol{\tilde{\beta}}^{\star}\|_{2}=O_{p}(\sqrt{\frac{s_{0}\log p}{n}}). Then G(𝜷^)G(𝜷~)2=G(𝜷^)𝜷2=Op(s0logpn)\|G(\boldsymbol{\hat{\beta}})-G(\boldsymbol{\tilde{\beta}}^{\star})\|_{2}=\|G(\boldsymbol{\hat{\beta}})-\boldsymbol{\beta}^{\star}\|_{2}=O_{p}(\sqrt{\frac{s_{0}\log p}{n}}) given G()G(\cdot) is Lipschitz with fixed η\eta. In this way, we can use G(𝜷^)G(\boldsymbol{\hat{\beta}}) as an estimator for 𝜷\boldsymbol{\beta}^{\star}. Assumption 1(d) allows random error with heavier tails than the standard Gaussian distribution, and it suits for many applications with outliers. For example, in our simulation, our noise is a mixture distribution with a small variance Gaussian distribution and a large variance Gaussian distribution. Define φ()=L()\varphi(\cdot)=L^{\prime}(\cdot) and h()=𝔼𝜺[φ(+𝜺)]h(\cdot)=\mathbb{E}_{\boldsymbol{\varepsilon}}[\varphi(\cdot+\boldsymbol{\varepsilon})]. Assumption 1(d) can be relaxed as Assumption 1(d’) h(0)=𝔼𝜺[φ(𝜺)]0h(0)=\mathbb{E}_{\boldsymbol{\varepsilon}}[\varphi(\boldsymbol{\varepsilon})]\geq 0, which holds for the right skewed error or discrete error. Our theoretical results given in this paper still hold under Assumption 1(a)–(c) and 1(d’). See Remark 6 for more details.

3.2.1 The Landscape of Population Risk and Empirical Risk

We analyze the landscape of the population and empirical risk functions of the proposed methods. We can bound both gradient vector and Hessian matrix of the population risk

R(𝜷)=𝔼{L(yj=1pxjβjgτ,η(βj))}R(\boldsymbol{\beta})=\mathbb{E}\{L(y-\sum_{j=1}^{p}x_{j}\beta_{j}g_{\tau,\eta}(\beta_{j}))\}

and then prove the uniqueness of its stationary point in the following lemma. We let R(𝜷)\nabla R(\boldsymbol{\beta}) be the gradient of R(𝜷)R(\boldsymbol{\beta}) with respect to 𝜷\boldsymbol{\beta} and 2R(𝜷)\nabla^{2}R(\boldsymbol{\beta}) the Hessian matrix. Recall that 𝐁p(r)={𝜷p,𝜷2r}{\bf B}^{p}(r)=\{\boldsymbol{\beta}\in\mathbb{R}^{p},\|\boldsymbol{\beta}\|_{2}\leq r\}. Let 𝐁p(𝜷,ε0){\bf B}^{p}(\boldsymbol{\beta},\varepsilon_{0}) be the 2\ell_{2} ball with the center 𝜷\boldsymbol{\beta} and radius ε0>0\varepsilon_{0}>0.

LEMMA 1.

(Landscape of population risk) Under Assumption 1, the population risk function R(𝛃){R(\boldsymbol{\beta})} has the following properties:

  • (a)

    There exists an ε0>0\varepsilon_{0}>0 and constants 0<L¯0<L¯0<0<\underline{L}_{0}<\overline{L}_{0}<\infty, T0>0T_{0}>0 such that

    inf𝜷𝐁p(r)\𝐁p(𝜷~,ε0)R(𝜷)2L0¯,sup𝜷𝐁p(r)R(𝜷)2L0¯;\inf_{\boldsymbol{\beta}\in{\bf B}^{p}(r)\backslash{\bf B}^{p}(\boldsymbol{\tilde{\beta}}^{\star},\varepsilon_{0})}\|\nabla{R(\boldsymbol{\beta})}\|_{2}\geq\underline{L_{0}},\quad\sup_{\boldsymbol{\beta}\in{\bf B}^{p}(r)}\|\nabla{R(\boldsymbol{\beta})}\|_{2}\leq\overline{L_{0}}; (10)

    and for all 𝜷2r\|\boldsymbol{\beta}\|_{2}\leq r,

    𝜷𝜷~,R(𝜷)T0𝜷𝜷~22.\langle\boldsymbol{\beta}-\boldsymbol{\tilde{\beta}}^{\star},\nabla{R(\boldsymbol{\beta})}\rangle\geq T_{0}\|\boldsymbol{\beta}-\boldsymbol{\tilde{\beta}}^{\star}\|_{2}^{2}. (11)
  • (b)

    For same ε0\varepsilon_{0} in part (a), there exist constants 0<M¯0<M¯0<0<\underline{M}_{0}<\overline{M}_{0}<\infty such that

    inf𝜷𝐁p(𝜷~,ε0)λmin(2R(𝜷))M0¯,sup𝜷𝐁p(r)λmax(2R(𝜷))M0¯.\inf_{\boldsymbol{\beta}\in{\bf B}^{p}(\boldsymbol{\tilde{\beta}}^{\star},\varepsilon_{0})}\lambda_{\min}(\nabla^{2}{R(\boldsymbol{\beta})})\geq\underline{M_{0}},\quad\sup_{\boldsymbol{\beta}\in{\bf B}^{p}(r)}\lambda_{\max}(\nabla^{2}{R(\boldsymbol{\beta})})\leq\overline{M_{0}}. (12)
  • (c)

    The population risk R(𝜷){R(\boldsymbol{\beta})} has a unique stationary point 𝜷~\boldsymbol{\tilde{\beta}}^{\star}.

REMARK 5.

Lemma 1 analyze the landscape of population risk and establishes good properties of population risk function. Indeed, we first show that out of the ball 𝐁p(𝛃~,ε0){\bf B}^{p}(\boldsymbol{\boldsymbol{\tilde{\beta}}}^{\star},\varepsilon_{0}), gradient has a lower bound and thus no stationary point exists. In fact, we get a stronger result that no stationary point exists except 𝛃\boldsymbol{\beta}^{\star}. Secondly, inside the ball 𝐁p(𝛃~,ε0){\bf B}^{p}(\boldsymbol{\boldsymbol{\tilde{\beta}}}^{\star},\varepsilon_{0}), Hessian matrix of risk function has a lower bound, meaning that 𝛃\boldsymbol{\beta}^{\star} is a unique minimizer.

REMARK 6.

Lemma 1 still holds under Assumption 1(a)–(c) and 1(d’). In order to establish a lower bound for the gradient of population risk, we need conditions that h(z)>0h(z)>0 for all z>0z>0 and h(0)>0h^{\prime}(0)>0. When we consider the pseudo Huber loss, φ()\varphi(\cdot) is odd and φ(z)>0\varphi(z)>0 for all z>0z>0; φ()\varphi^{\prime}(\cdot) is even and φ>0\varphi^{\prime}>0. If the exchange of expectation and limit are allowed, h(0)=𝔼𝛆[φ(𝛆)]>0h^{\prime}(0)=\mathbb{E}_{\boldsymbol{\varepsilon}}[\varphi^{\prime}(\boldsymbol{\varepsilon})]>0 is always true. Thus, Lemma 1 and all the following theorems hold with the same proofs.

Next we follow Mei et al. (2018) to develop the uniform convergence results for gradient Rn(𝜷)\nabla R_{n}(\boldsymbol{\beta}) and Hessian 2Rn(𝜷)\nabla^{2}R_{n}(\boldsymbol{\beta}) of empirical risk, to study the landscape of the empirical risk function

R^n(𝜷)=1ni=1nL(yij=1pxijβjgτ,η(βj)).{\widehat{R}_{n}(\boldsymbol{\beta})}=\frac{1}{n}\sum_{i=1}^{n}L(y_{i}-\sum_{j=1}^{p}x_{ij}\beta_{j}g_{\tau,\eta}(\beta_{j})).
LEMMA 2.

(Landscape of empirical risk) Under Assumption 1, for any arbitrarily small δ\delta, there exists a constant C1>0C_{1}>0 depending on parameters (r,σ2,γ,δ,τ,η)(r,\sigma^{2},\gamma,\delta,\tau,\eta), independent of nn and pp, such that, if nC1plogpn\geq C_{1}p\,\text{log}\,p, the following properties hold for R^n(𝛃){\widehat{R}_{n}(\boldsymbol{\beta})}, with probability at least 1δ1-\delta:

  • (a)

    There exists an ε0>0\varepsilon_{0}>0 and constants 0<L¯0<L¯0<0<\underline{L}_{0}<\overline{L}_{0}<\infty, T0>0T_{0}>0 such that

    inf𝜷𝐁p(0,r)\𝐁p(𝜷~,ε0)R^n(𝜷)2L0¯/2,sup𝜷𝐁p(0,r)R^n(𝜷)2L0¯/2;\inf_{\boldsymbol{\beta}\in{\bf B}^{p}(0,r)\backslash{\bf B}^{p}(\boldsymbol{\boldsymbol{\tilde{\beta}}}^{\star},\varepsilon_{0})}\|\nabla{\widehat{R}_{n}(\boldsymbol{\beta})}\|_{2}\geq\underline{L_{0}}/2,\quad\sup_{\boldsymbol{\beta}\in{\bf B}^{p}(0,r)}\|\nabla{\widehat{R}_{n}(\boldsymbol{\beta})}\|_{2}\leq\overline{L_{0}}/2; (13)

    and for all 𝜷\boldsymbol{\beta} satisfying 𝜷𝜷~2ε0/2\|\boldsymbol{\beta}-\boldsymbol{\tilde{\beta}}^{\star}\|_{2}\geq\varepsilon_{0}/2 and 𝜷2r\|\boldsymbol{\beta}\|_{2}\leq r,

    𝜷𝜷~,R^n(𝜷)ε0T0𝜷𝜷~2/4.\langle\boldsymbol{\beta}-\boldsymbol{\tilde{\beta}}^{\star},\nabla{\widehat{R}_{n}(\boldsymbol{\beta})}\rangle\geq\varepsilon_{0}T_{0}\|\boldsymbol{\beta}-\boldsymbol{\tilde{\beta}}^{\star}\|_{2}/4. (14)
  • (b)

    For the same ε0\varepsilon_{0} in part (a), there exist constants 0<M¯0<M¯0<0<\underline{M}_{0}<\overline{M}_{0}<\infty such that

    inf𝜷𝐁p(𝜷~,ε0)λmin(2R^n(𝜷))M0¯/2,sup𝜷𝐁p(r)λmax(2R^n(𝜷))M0¯/2,\inf_{\boldsymbol{\beta}\in{\bf B}^{p}(\boldsymbol{\boldsymbol{\tilde{\beta}}}^{\star},\varepsilon_{0})}\lambda_{\min}(\nabla^{2}{\widehat{R}_{n}(\boldsymbol{\beta})})\geq\underline{M_{0}}/2,\quad\sup_{\boldsymbol{\beta}\in{\bf B}^{p}(r)}\lambda_{\max}(\nabla^{2}{\widehat{R}_{n}(\boldsymbol{\beta})})\leq\overline{M_{0}}/2, (15)
  • (c)

    The empirical risk function R^n(𝜷){\widehat{R}_{n}(\boldsymbol{\beta})} has a unique minimizer 𝜷^n\boldsymbol{\hat{\beta}}_{n} satisfying that

    𝜷^n𝜷~2C1plogn/n.\|\boldsymbol{\hat{\beta}}_{n}-\boldsymbol{\tilde{\beta}}^{\star}\|_{2}\leq C_{1}\sqrt{p\,\log\,n/n}.

It is worth pointing out that Part (c) of Lemma 2 implies the consistency of the unique minimizer of the empiricial risk function when the dimension pp diverges with the sample size nn but nC1plogpn\geq C_{1}p\,\text{log}\,p. In the sequel, we will establish the oracle inequality for the proposed penalized robust regression with coefficient thresholding when the dimension can be much larger than the sample size nn.

3.2.2 Oracle Inequality

Given the landscape analysis, we establish the oracle inequality under the ultra-high dimensional setting when pp is at the nearly exponential order of nn. Specifically, Theorem 1 shows that the sample directional gradient and restricted Hessian converges uniformly to their population counterparts.

THEOREM 1.

Under Assumption 1, for any arbitrarily small δ\delta, there exist constants C2C_{2} and C3C_{3} that depends on (r,σ2,γ,δ,τ,η)(r,\sigma^{2},\gamma,\delta,\tau,\eta) such that the following properties hold:

  • (a)

    The empirical directional gradient converges uniformly to population directional gradient along the direction (𝜷𝜷~)(\boldsymbol{\beta}-\boldsymbol{\tilde{\beta}}^{\star}), namely

    (sup𝜷𝐁p(r)/{𝟎}|R^n(𝜷)R(𝜷),𝜷𝜷~|𝜷𝜷~1σC2log(p)n)1δ.\mathbb{P}\left(\sup_{\boldsymbol{\beta}\in{\bf B}^{p}(r)/\{\mathbf{0}\}}\frac{|\langle\nabla{\widehat{R}_{n}(\boldsymbol{\beta})}-\nabla{R(\boldsymbol{\beta})},\boldsymbol{\beta}-\boldsymbol{\tilde{\beta}}^{\star}\rangle|}{\|\boldsymbol{\beta}-\boldsymbol{\tilde{\beta}}^{\star}\|_{1}}\leq\sigma\sqrt{\frac{C_{2}\log(p)}{n}}\right)\geq 1-\delta. (16)
  • (b)

    The empirical restricted Hessian converges uniformly to the population restricted Hessian in the set 𝐁p(r)𝐁0p(s0){\bf B}^{p}(r)\cap{\bf B}^{p}_{0}(s_{0}) for any s0ps_{0}\leq p. That is, if 𝜷2<r\|\boldsymbol{\beta}\|_{2}<r and 𝜷0<s0\|\boldsymbol{\beta}\|_{0}<s_{0}, as nC3s0logpn\geq C_{3}s_{0}\log p, we have

    (sup(𝜷,𝐯)Ω|𝐯,(2R^n(𝜷)2R(𝜷))𝐯|σ2C3s0log(p)n)1δ,\mathbb{P}\left(\sup_{(\boldsymbol{\beta},{\bf{v}})\in\Omega}|\langle{\bf{v}},(\nabla^{2}{\widehat{R}_{n}(\boldsymbol{\beta})}-\nabla^{2}{R(\boldsymbol{\beta})}){\bf{v}}\rangle|\leq\sigma^{2}\sqrt{\frac{C_{3}s_{0}\log(p)}{n}}\right)\geq 1-\delta, (17)

    where Ω1=𝐁p(r)𝐁0p(s0)\Omega_{1}={\bf B}^{p}(r)\cap{\bf B}^{p}_{0}(s_{0}), Ω2=𝐁p(1)𝐁0p(s0)\Omega_{2}={\bf B}^{p}(1)\cap{\bf B}^{p}_{0}(s_{0}) and Ω=Ω1×Ω2\Omega=\Omega_{1}\times\Omega_{2}.

Theorem 2 proves the oracle inequality of the global minimizer of the proposed high-dimensional robust regression with coefficient thresholding as well as its uniqueness with high probability.

THEOREM 2.

Under Assumption 1, for any arbitrarily small δ\delta, there exist constants C4C_{4}, C5C_{5} and CλC_{\lambda} that depends on (r,σ2,γ,η,δ,τ,M,t,dmaxa,dmaxc)(r,\sigma^{2},\gamma,\eta,\delta,\tau,M,t,d_{\max}^{a},d_{\max}^{c}) but are independent of n,p,s0n,p,s_{0}, such that as nC5s0logpn\geq C_{5}s_{0}\log p and λCλ(logp)/n\lambda\geq C_{\lambda}\sqrt{(\log p)/n}, the following properties hold with probability at least 1δ1-\delta:

  • (a)

    Any stationary point 𝜷^\boldsymbol{\hat{\beta}} of group-regularized risk minimization satisfies

    𝜷^𝜷~2C4(s0logp)/n+s0λ2.\|\boldsymbol{\hat{\beta}}-\boldsymbol{\tilde{\beta}}^{\star}\|_{2}\leq C_{4}\sqrt{(s_{0}\log p)/n+s_{0}\lambda^{2}}. (18)
  • (b)

    Furthermore, let rs=C4(s0logp)/n+s0λ2r_{s}=C_{4}\sqrt{(s_{0}\log p)/n+s_{0}\lambda^{2}}. There exists a constant C6C_{6}, for nC6log2pn\geq C_{6}\log^{2}p such that rsε0r_{s}\leq\varepsilon_{0}, the optimization problem (4) has a unique global minimizer 𝜷^\boldsymbol{\hat{\beta}}.

3.3 Computational Guarantee

Gradient descent algorithms do not work since the objective function is not differentiable at zero. We consider the composite gradient descent algorithm (Nesterov, 2013), which is computationally efficient for solving the non-smooth and nonconvex optimization and enjoys the desired convergence property. Specifically, solving the proposed robust regression with coefficient thresholding on proposed composite gradient descent algorithm consists of two key steps at each iteration: the gradient descent step and the 2\ell_{2}-ball projection step. To perform the step of gradient descent, we derive the gradient vector of the empirical risk function:

R^n(𝜷)=i=1nL(G(𝜷)T𝐱iyi)𝐱iTDG(𝜷),\nabla\widehat{R}_{n}(\boldsymbol{\beta})=\sum_{i=1}^{n}L^{{}^{\prime}}(G(\boldsymbol{\beta})^{T}{\bf{x}}_{i}-y_{i}){\bf{x}}_{i}^{T}D_{G}(\boldsymbol{\beta}),

where DG(𝜷)=diag(g(β1)+βjg(β1),,g(βp)+βjg(βp))D_{G}(\boldsymbol{\beta})=\text{diag}(g(\beta_{1})+\beta_{j}g^{\prime}(\beta_{1}),\cdots,g(\beta_{p})+\beta_{j}g^{\prime}(\beta_{p})) is a diagonal matrix. Given the previous iterated solution 𝜷^(k)\boldsymbol{\hat{\beta}}^{(k)}, we need to solve the following subproblem:

min𝜷{12𝜷(𝜷^(k)1hR^(𝜷^(k)))22+ληb=1B𝜷b2}.\min_{\boldsymbol{\beta}}\ \left\{\frac{1}{2}\|\boldsymbol{\beta}-(\boldsymbol{\hat{\beta}}^{(k)}-\frac{1}{h}\nabla\hat{R}(\boldsymbol{\hat{\beta}}^{(k)}))\|^{2}_{2}+\frac{\lambda}{\eta}\sum_{b=1}^{B}\|\boldsymbol{\beta}_{b}\|_{2}\right\}.

It is known that this subproblem has a closed-form solution through the following group-wise soft thresholding operator:

St(𝝃)=𝝃j𝝃j21(𝝃2t)+,S_{t}(\boldsymbol{\xi})=\boldsymbol{\xi}_{j}\cdot\|\boldsymbol{\xi}_{j}\|_{2}^{-1}\circ(\|\boldsymbol{\xi}\|_{2}-t)_{+},

where \circ denotes the Hadamard product. Thus, the gradient descent step can be solved as:

𝜷~(k+1)=Sλ/h(𝜷^(k)h(R^n(𝜷^(k)))).\boldsymbol{\tilde{\beta}}^{(k+1)}=S_{\lambda/h}(\boldsymbol{\hat{\beta}}^{(k)}-h\nabla(\widehat{R}_{n}(\boldsymbol{\hat{\beta}}^{(k)}))).

Next, in the second step, we perform the following projection onto the 2\ell_{2}-ball:

min𝜷𝜷𝜷~(k+1)2subject to 𝜷2r,\underset{\boldsymbol{\beta}}{\min}\ \|\boldsymbol{\beta}-\tilde{\boldsymbol{\beta}}^{(k+1)}\|_{2}\quad\text{subject to }\ \|\boldsymbol{\beta}\|_{2}\leq r,

which is given by the following closed-form projection:

πr(𝜷~(k+1))=min{𝜷~(k+1)2,r}𝜷~(k+1)2𝜷~(k+1).\pi_{r}(\boldsymbol{\tilde{\beta}}^{(k+1)})=\frac{\min\{\|\boldsymbol{\tilde{\beta}}^{(k+1)}\|_{2},r\}}{\|\boldsymbol{\tilde{\beta}}^{(k+1)}\|_{2}}\boldsymbol{\tilde{\beta}}^{(k+1)}.

To sum up, the proposed composite gradient descent algorithm can be proceeded as Algorithm 1.

Input: 𝜷(0)𝐁p(r)\boldsymbol{\beta}^{(0)}\in{\bf B}^{p}(r), step size hh, penalization parameter λ\lambda, and thresholding parameter η\eta
for k=0,1,2,k=0,1,2,... until convergence do
       𝜷~(k+1)=Sλ/h(𝜷^(k)h(R^n(𝜷^(k))))\boldsymbol{\tilde{\beta}}^{(k+1)}=S_{\lambda/h}(\boldsymbol{\hat{\beta}}^{(k)}-h\nabla(\widehat{R}_{n}(\boldsymbol{\hat{\beta}}^{(k)})))
       𝜷^(k+1)=πr(𝜷~(k+1))\boldsymbol{\hat{\beta}}^{(k+1)}=\pi_{r}(\boldsymbol{\tilde{\beta}}^{(k+1)})
end for
Algorithm 1 The proposed composite gradient descent algorithm.

At each iteration, the subproblem can be solved easily with the closed-form solution, and the computational complexity is on the quadratic order of dimension pp. The algorithmic convergence rate is presented in the following theorem.

THEOREM 3.

Let 𝛃^(k)\hat{\boldsymbol{\beta}}^{(k)} be the kk-th iterated solution of Algorithm 1. There exist constants crc_{r},chc_{h} and C7C_{7} are independent of (n,p,s0)(n,p,s_{0}) such that when we choose r>crr>c_{r} and h<chh<c_{h}, we can find one iteration kk that can be upper bounded by C71ϵ2C_{7}\frac{1}{\epsilon^{2}}, such that there exists a subgradient we denote as g(𝛃^b(k))𝛃^b(k)2g(\boldsymbol{\hat{\beta}}^{(k)}_{b})\in\partial\|\boldsymbol{\hat{\beta}}^{(k)}_{b}\|_{2}:

R^n(𝜷^(k))+λb=1Bg(𝜷^b(k))2ϵ\|\nabla\widehat{R}_{n}(\boldsymbol{\hat{\beta}}^{(k)})+\lambda\sum_{b=1}^{B}g(\boldsymbol{\hat{\beta}}^{(k)}_{b})\|_{2}\leq\epsilon

where 𝛃b2\partial\|\boldsymbol{\beta}_{b}\|_{2} denotes the sub-differential of the group penalty function (Rockafellar, 2015).

REMARK 7.

Theorem 3 provides a theoretical justification of the algorithmic convergence rate. More specifically, the proposed algorithm always converges to an approximate stationary solution (a.k.a. ϵ\epsilon-stationary solution) at finite sample sizes. In other words, for the solution solved after O(1ϵ2)O(\frac{1}{\epsilon^{2}}) iterations, the subgradient of the objective function is bounded by ϵ\epsilon under the 2\ell_{2} norm at finite sample sizes. When kk increases, the proposed algorithm will find the stationary solution that satisfies the subgradient optimality condition as ϵ0\epsilon\to 0. To better understand the algorithmic convergence in practice, we also provide the convergence plots and computational cost of the proposed algorithm in simulation studies in Section C of the supplementary file. Combining Theorems 2 and 3, we may obtain the convergence guarantee with high probability. From both theoretical and practical aspects, the proposed algorithm is computationally efficient and achieves the desired computational guarantee. Observing that the nice empirical gradient structure proved in Theorem 2, in Appendix Section C, we further provide Theorem 4, which is an extension of Theorem 3. It further shows that given the solution is sparse, the composite gradient descent algorithm guarantees a linear convergence rate of 𝛃\boldsymbol{\beta}, with respect to the number of iterations.

4 Simulation Studies

This section carefully examines the finite-sample performance of our proposed method in simulation studies. To this end, we compare the proposed robust estimator with coefficient thresholding (denoted by RCT) to Lasso, Adaptive Lasso (denoted by AdaLasso), SCAD and MCP penalized estimators in eight different linear regression models (Models 1–8) and with Lasso, Group Lasso (denoted by GLasso), and Sparse Group Lasso (denoted by SGL) in two Gaussian process regression models (Models 9 and 10). Models 9 and 10 mimic the scalar-on-image regression analysis. We implemented the Lasso and the Adaptive lasso estimators using R package ‘glmnet’, and chose the weight for adaptive lasso estimator based on the Lasso estimator. Group lasso is implemented using the method in (Yang, 2015). SCAD and MCP penalized estimators are implemented using R package ‘ncvreg’. And we also verify that the estimation results are consistency with R package ‘Picasso’.

We compare estimation performance based on the root-mean-square error (RMSE, that is 𝜷^𝜷2\|\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star}\|_{2}) and variable selection performance based on both false positive rate (FPR) and false negative rate (FNR). FPR=FP/(FP+TN)\text{FPR}={\text{FP}}/({\text{FP}+\text{TN}}), and FNR=FN/(FN+TP)\text{FNR}={\text{FN}}/({\text{FN}+\text{TP}}), where TN, TP, FP and FN represent the numbers of true negative, true positive, false positive and false negative respectively. Each measure is computed as the average over 5050 independent replications.

Before proceeding to simulation results, we discuss the choice of tuning parameters for the proposed method. The proposed method involves tuning over the penalization parameter λ\lambda, coefficient thresholding parameter η\eta, step size hh, and radius rr of the feasible region. Here, hh is chosen to be relatively small such that the algorithm does not encounter overflow issues. To simplify the tuning process, we set hh to be a small constant in our empirical studies; alternatively, hh can also be chosen according to an acceleration process in Nesterov (2013) to further achieve faster convergence. Also, rr is chosen to be relatively large to specify the appropriate feasible region, and η\eta can be specified as any arbitrary quantity that is smaller than the minimal signal strength. In addition, τ\tau is chosen to be properly small to make our soft-thresholding function mimic the true thresholding function. It is important to point out that the algorithmic convergence and numerical results are generally robust to the choices of η\eta, hh, and rr. In our simulation studies, we choose h=0.01h=0.01, r=20r=20, τ=0.01\tau=0.01. With these prespecified hh, rr, and τ\tau, we choose the penalization parameter λ\lambda and η\eta using the 55-folded cross-validation based on 1\ell_{1} prediction error. Additionally, η\eta can also be chosen as the 30%30\% quantile of the absolute values of non-zero coefficients of Lasso.

4.1 Linear Regression

First, we simulated data from the linear model y=𝐱𝜷+εy={\bf{x}}^{\prime}\boldsymbol{\beta}+\varepsilon where 𝐱N(0,𝚺){\bf{x}}\sim N(0,\boldsymbol{\Sigma}). We consider different correlation structures for 𝚺=(σij)p×p\boldsymbol{\Sigma}=(\sigma_{ij})_{p\times p} in the following six simulation models:

Model 1:σij=0.5|ij|,AR1(0.5)\displaystyle\text{Model 1}:\sigma_{ij}=0.5^{|i-j|},\ \qquad\qquad\text{AR1(0.5)}
Model 2:σij=0.6|ij|,AR1(0.6)\displaystyle\text{Model 2}:\sigma_{ij}=0.6^{|i-j|},\ \qquad\qquad\text{AR1(0.6)}
Model 3:σij=0.7|ij|,AR1(0.7)\displaystyle\text{Model 3}:\sigma_{ij}=0.7^{|i-j|},\ \qquad\qquad\text{AR1(0.7)}
Model 4:σij=0.4+0.6I(i=j),CS(0.4)\displaystyle\text{Model 4}:\sigma_{ij}=0.4+0.6I(i=j),\ \text{CS(0.4)}
Model 5:σij=0.5+0.5I(i=j),CS(0.5)\displaystyle\text{Model 5}:\sigma_{ij}=0.5+0.5I(i=j),\ \text{CS(0.5)}
Model 6:σij=0.6+0.4I(i=j),CS(0.6).\displaystyle\text{Model 6}:\sigma_{ij}=0.6+0.4I(i=j),\ \text{CS(0.6)}.

Models 1–3 have autoregressive (AR) correlation structures, in which the irrepresentable condition holds for Model 1 but fails for Models 2 and 3. Models 4-6 have the compound symmetry (CS) correlation structures and the irrepresentable condition fails in all these models.

We assume a mixture model for the error, ε0.9N(0,σ12)+0.1N(0,σ22)\varepsilon\sim 0.9N(0,\sigma_{1}^{2})+0.1N(0,\sigma_{2}^{2}), where σ22\sigma^{2}_{2} is set much larger than σ12\sigma^{2}_{1}. For Models 1–3, set σ22=10\sigma^{2}_{2}=10 and σ12=1,2\sigma^{2}_{1}=1,2 or 33 in case (a), (b) or (c), respectively. For Models 4–6, set σ22=3\sigma^{2}_{2}=3 and σ12=0.1\sigma^{2}_{1}=0.1,0.30.3 or 11 in case (a), (b) or (c), respectively. For all the models, we choose n=100n=100, p=2000p=2000 and 𝜷=(120,01980)\boldsymbol{\beta}^{\star}=(\textbf{1}_{20},\textbf{0}_{1980}) to create a high dimensional regime.

Table 1: Estimation and selection accuracy of different methods for Models 1–3
FPR FNR 2\ell_{2} loss FPR FNR 2\ell_{2} loss FPR FNR 2\ell_{2} loss
Model (1a) Model (2a) Model (3a)
Lasso 0.021 0.199 3.198 0.017 0.165 3.032 0.014 0.153 3.035
(0.009) (0.130) (0.475) (0.010) (0.100) (0.407) (0.008) (0.091) (0.391)
AdaLasso 0.020 0.212 3.787 0.016 0.180 3.716 0.014 0.156 3.739
(0.008) (0.143) (0.821) (0.007) (0.111) (0.816) (0.008) (0.111) (0.815)
SCAD 0.007 0.422 4.148 0.008 0.474 4.750 0.010 0.575 5.979
(0.006) (0.137) (0.950) (0.007) (0.123) (1.032) (0.006) (0.107) (1.054)
MCP 0.003 0.625 4.709 0.004 0.654 5.430 0.003 0.694 6.201
(0.002) (0.065) (0.537) (0.002) (0.060) (0.571) (0.002) (0.049) (0.536)
RCT 0.010 0.177 2.860 0.004 0.071 2.041 0.002 0.018 1.466
(0.008) (0.117) (0.888) (0.002) (0.112) (0.754) (0.002) (0.035) (0.502)
Model (1b) Model (2b) Model (3b)
Lasso 0.019 0.243 3.446 0.017 0.209 3.319 0.014 0.187 3.273
(0.010) (0.140) (0.372) (0.008) (0.101) (0.406) (0.008) (0.087) (0.422)
AdaLasso 0.018 0.256 4.134 0.016 0.330 4.027 0.013 0.199 4.062
(0.008) (0.130) (0.657) (0.008) (0.104) (0.687) (0.008) (0.107) (0.726)
SCAD 0.008 0.443 4.233 0.008 0.477 4.772 0.009 0.566 5.726
(0.005) (0.140) (0.998) (0.006) (0.156) (1.236) (0.007) (0.153) (1.393)
MCP 0.003 0.636 4.771 0.003 0.674 5.573 0.003 0.708 6.386
(0.002) (0.069) (0.627) (0.002) (0.060) (0.577) (0.002) (0.058) (0.667)
RCT 0.018 0.242 3.879 0.010 0.154 3.331 0.007 0.084 2.939
(0.016) (0.163) (0.735) (0.009) (0.098) (0.691) (0.005) (0.127) (0.755)
Model (1c) Model (2c) Model (3c)
Lasso 0.018 0.338 3.706 0.017 0.244 3.589 0.014 0.242 3.611
(0.010) (0.162) (0.383) (0.010) (0.106) (0.475) (0.010) (0.093) (0.450)
AdaLasso 0.019 0.195 4.148 0.016 0.258 4.469 0.012 0.243 4.459
(0.018) (0.156) (0.650) (0.010) (0.114) (0.590) (0.009) (0.105) (0.590)
SCAD 0.008 0.498 4.390 0.007 0.481 4.722 0.006 0.549 5.599
(0.005) (0.134) (1.034) (0.005) (0.147) (1.331) (0.006) (0.149) (1.612)
MCP 0.003 0.678 4.968 0.003 0.697 5.624 0.003 0.732 6.506
(0.001) (0.069) (0.716) (0.002) (0.054) (0.709) (0.002) (0.047) (0.823)
RCT 0.025 0.294 4.305 0.019 0.195 4.148 0.011 0.164 3.886
(0.156) (0.230) (0.574) (0.018) (0.156) (0.650) (0.011) (0.136) (0.688)
Table 2: Estimation and selection accuracy of different methods for Models 4–6
FPR FNR 2\ell_{2} loss FPR FNR 2\ell_{2} loss FPR FNR 2\ell_{2} loss
Model (4a) Model (5a) Model (6a)
Lasso 0.040 0.337 4.075 0.041 0.387 4.183 0.041 0.374 4.144
(0.004) (0.135) (0.501) (0.003) (0.126) (0.458) (0.004) (0.135) (0.422)
AdaLasso 0.033 0.369 4.035 0.033 0.421 3.726 0.032 0.443 4.512
(0.003) (0.138) (0.626) (0.003) (0.134) (0.578) (0.003) (0.135) (0.763)
SCAD 0.021 0.736 5.849 0.021 0.736 5.849 0.020 0.745 5.711
(0.007) (0.147) (1.238) (0.007) (0.147) (1.238) (0.009) (0.149) (1.151)
MCP 0.008 0.868 6.763 0.008 0.896 7.129 0.007 0.921 7.348
(0.003) (0.084) (0.558) (0.003) (0.065) (0.541) (0.003) (0.064) (0.510)
RCT 0.061 0.215 3.982 0.062 0.244 4.023 0.066 0.253 4.093
(0.007) (0.089) (0.265) (0.009) (0.093) (0.283) (0.009) (0.098) (0.314)
Model (4b) Model (5b) Model (6b)
Lasso 0.040 0.351 4.092 0.041 0.378 4.173 0.041 0.364 4.124
(0.004) (0.135) (0.448) (0.004) (0.125) (0.451) (0.003) (0.081) (0.235)
AdaLasso 0.033 0.377 4.083 0.033 0.411 4.283 0.033 0.459 4.553
(0.003) (0.139) (0.561) (0.003) (0.134) (0.565) (0.003) (0.145) (0.819)
SCAD 0.022 0.719 5.563 0.021 0.722 5.668 0.020 0.744 5.802
(0.022) (0.135) (1.178) (0.007) (0.138) (1.154) (0.008) (0.157) (1.229)
MCP 0.008 0.868 6.738 0.007 0.900 7.142 0.008 0.900 7.146
(0.003) (0.079) (0.600) (0.003) (0.065) (0.614) (0.004) (0.080) (0.719)
RCT 0.060 0.226 4.019 0.063 0.260 4.067 0.066 0.275 4.138
(0.008) (0.090) (0.249) (0.007) (0.191) (0.181) (0.007) (0.089) (0.255)
Model (4c) Model (5c) Model (6c)
Lasso 0.040 0.382 4.173 0.041 0.418 4.265 0.040 0.408 4.324
(0.003) (0.131) (0.385) (0.004) (0.133) (0.360) (0.003) (0.103) (0.352)
AdaLasso 0.034 0.411 4.264 0.033 0.446 4.347 0.032 0.478 4.633
(0.002) (0.135) (0.473) (0.003) (0.127) (0.446) (0.003) (0.140) (0.549)
SCAD 0.023 0.704 5.343 0.021 0.726 5.688 0.021 0.727 5.612
(0.006) (0.148) (1.168) (0.007) (0.164) (1.415) (0.008) (0.179) (1.390)
MCP 0.008 0.885 6.979 0.007 0.905 7.217 0.007 0.915 7.441
(0.004) (0.073) (0.531) (0.004) (0.069) (0.578) (0.027) (0.062) (0.411)
RCT 0.061 0.267 4.147 0.062 0.290 4.228 0.064 0.314 4.275
(0.007) (0.102) (0.239) (0.007) (0.173) (0.212) (0.008) (0.110) (0.228)

Tables 1 and 2 summarize the simulation results for Models 1–6. Several observations can be made from this table. First, in Models 1–3, as the auto correlation increases, it becomes clearer that our robust estimator with coefficient thresholding (RCT) outperforms both Lasso ,Adaptive Lasso (AdaLasso) and nonconvex penalized estimators (SCAD and MCP). Nonconvex penalized estimators do not work well on all these settings, since they tend to choose too-large penalization, and it results in a much higher false negative rate than other methods. The Lasso estimator misses many true signals due to the high correlated 𝐱{\bf{x}}, leading to that adaptive lasso also cannot perform well given the Lasso initials. Especially in Model 3, when the auto correlation is high, our thresholding estimator has much smaller FPR and FNR compared to lasso-type methods. Second, in case (a), where the effects of outliers are stronger than (b) and (c), our estimator achieves a relatively better performance than other two estimators thanks to the use of the pseudo Huber loss. Third, in more challenging Models 4–6, our estimator is able to identify more true predictors and well controls false positives. In summary, the proposed estimator is more favorable in high dimensional regression setting, especially when the predictors are highly correlated, such as AR1(0.7) model.

As we introduced in previous sections, nonconvex penalization such as MCP penalized regression does not require irrepresentable condition. However, in our simulation, we found that it does not work well in our setting. We find the major reason is that it does not work not as well as our estimator when dimension is too large comparable the sample size: n=100n=100, p=2000p=2000 in our case. Here we summarize the regression simulation settings for major nonconvex penalized regression papers. Breheny & Huang (2011) has n=100n=100 and p=500p=500 for linear regression setting. It has larger dimension of features under logistic regression setting, but nonconvex penalized estimator performs no better than 1\ell_{1} penalized estimator in the aspect of prediction. Zhang (2010) has n=300n=300 and p=2000p=2000, with features being generated independently, in which the irrepresentable condition is not violated. Fan et al. (2014) used a setting of n=100n=100 and p=1000p=1000 with AR1(0.5) covariance matrix, which does not violate the irrepresentable condition. Loh & Wainwright (2017) used p=512p=512 and various different sample sizes larger than 100100. Wang et al. (2014) used a setting of n=200n=200 and p=2000p=2000. In comparison to these reported simulation studies, our setting of n=100n=100, p=2000p=2000 appears to be more challenging, in which larger variances in the generative model add additional difficulty in the analysis. Therefore MCP/SCAD penalized estimators do not perform well in these settings considered in our simulation studies, which clearly demonstrate the advantage of our proposed method.

4.2 Gaussian Process Regression

We now report simulation results to mimic the scalar-on-image regression. In the simulation, we consider a two-dimensional image structure and the whole region is generated by Gaussian processes. In these models, let n=500n=500 and p=50×50p=50\times 50, and consider the following covariance structures:

Model 7(GP1(10)):σij=exp(si2sj210sisj2),\displaystyle\text{Model 7}\ (\text{GP1(10)}):\quad\sigma_{ij}=\exp(-\|s_{i}\|^{2}-\|s_{j}\|^{2}-10\|s_{i}-s_{j}\|^{2}),\
Model 8(GP1(5)):σij=exp(si2sj25sisj2),\displaystyle\text{Model 8}\ (\text{GP1(5)}):\quad\sigma_{ij}=\exp(-\|s_{i}\|^{2}-\|s_{j}\|^{2}-5\|s_{i}-s_{j}\|^{2}),

where sis_{i} and sjs_{j} are two points in the rectangle [1,1]×[1,1][-1,1]\times[-1,1]. And the coefficients with nonzero effects are in a circle of the graph, with radius 0.1 for the true parameters. The values of the nonzero effects are generated by a uniform distribution on [0.5,1][0.5,1], similar to the setting given in He et al. (2018). The errors follow the same mixture model, ε0.9N(0,σ12)+0.1N(0,σ22)\varepsilon\sim 0.9N(0,\sigma_{1}^{2})+0.1N(0,\sigma_{2}^{2}). For Models 7–8, set σ22=30\sigma^{2}_{2}=30, and refer to case (a), (b) and (c) with σ12=2,4\sigma^{2}_{1}=2,4 and 88, respectively. A typical realization of 𝐱{\bf{x}} for Model 7 is illustrated in Figure 3(a).

In this section, we also evaluate the soft-threholded Gaussian Process(STGP) estimator(Kang et al., 2018), which is a state-of-art scalar-on-image regression method. STGP assumes a spatial smootheness condition, which is satisfied by Model 7-10. The original STGP estimator requires knowledge of whether any two predictors are adjacent. Such information may be unavailable or cannot be implemented in many cases. For example, when the number of voxels are too large to be fully processed, we may have to preselect a subset of voxels or use certain proper dimension reduction techniques to overcome related computational difficulties. Given that all other methods do not need such adjacency information, to fairly compare STGP with other methods, we consider two STGP estimators in our comparison. That is, STGP represents the original STGP estimator; STGP(no-info) represents STGP estimator without using adjacency information.

Table 3: Estimation and selection accuracy of different methods for Models 7 and 8
FPR FNR 2\ell_{2} loss FPR FNR 2\ell_{2} loss
Model (7a) Model (8a)
Lasso 0.002 0.814 7.083 0.001 0.784 6.071
(0.002) (0.031) (1.561) (0.001) (0.068) (1.305)
AdaLasso 0.002 0.820 12.527 0.001 0.784 0.196
(0.031) (0.031) (0.041) (0.068) (0.068) (0.306)
MCP 0.007 0.918 7.526 0.007 0.918 7.532
(0.003) (0.060) (0.622) (0.003) (0.060) (0.655)
STGP(no-info) 0.001 0.435 2.729 0.002 0.461 2.584
(0.002) (0.161) (0.335) (0.006) (0.185) (0.621)
RCT 0.025 0.018 2.302 0.027 0.196 3.038
(0.001) (0.041) (0.342) (0.014) (0.306) (1.083)
Model (7b) Model (8b)
Lasso 0.002 0.825 7.328 0.001 0.805 6.639
(0.002) (0.051) (1.719) (0.001) (0.053) (0.429)
AdaLasso 0.003 0.815 12.995 0.001 0.807 10.936
(0.003) (0.095) (4.783) (0.001) (0.062) (6.530)
MCP 0.007 0.918 7.525 0.007 0.918 7.532
(0.003) (0.060) (0.619) (0.007) (0.060) (0.654)
STGP(no-info) 0.001 0.460 2.904 0.001 0.485 2.600
(0.003) (0.171) (0.635) (0.006) (0.192) (0.650)
RCT 0.030 0.053 2.520 0.030 0.172 2.949
(0.007) (0.033) (0.219) (0.013) (0.265) (0.828)
Model (7c) Model (8c)
Lasso 0.002 0.854 7.228 0.041 0.418 4.265
(0.001) (0.033) (1.222) (0.006) (0.050) (0.360)
AdaLasso 0.001 0.853 10.690 0.033 0.446 4.347
(0.001) (0.033) (2.728) (0.003) (0.127) (0.446)
MCP 0.007 0.918 7.524 0.007 0.918 7.535
(0.007) (0.060) (0.612) (0.003) (0.069) (0.675)
STGP(no-info) 0.001 0.484 2.753 0.003 0.476 2.561
(0.003) (0.210) (0.610) (0.008) (0.226) (0.577)
RCT 0.034 0.016 2.761 0.045 0.303 3.284
(0.008) (0.105) (0.298) (0.016) (0.320) (1.215)

As shown in Table 3, Lasso, Adaptive Lasso, SCAD and MCP fail to identify most of the true predictors and have a very high FNR, while our proposed estimator is consistently the best among all these models. It also indicates that the thresholding function helps us deal with these very complicated covariance structures of predictors. RCT also outperforms STGP(no-info), and has compatible performance with original STGP estimator, despite our estimator are not restricted with adjacency information.

We consider two more cases when the group lasso penalty is necessary to be applied. In Models 9–10, we partition the whole image space into 25 sub-regions with equal numbers of predictors in each region. The covariance structure inside each region is the same as Models 7–8. Correlations between different regions are 0.9. Then we randomly assign two regions with each non-zero regions of radius 0.13, which makes around 1/3 of the points within the selected region has non-zero effects. We assign all non-zero effects as 2. The covariance structures can be summarized as:

Model 9(GP25(10)):𝐱N(𝝁b,𝚺),σij=exp(si2sj210sisj2),\displaystyle\text{Model 9}\ (\text{GP25(10)}):\ {\bf{x}}\sim N(\boldsymbol{\mu}_{b},\boldsymbol{\Sigma}),\ \sigma_{ij}=\exp(-\|s_{i}\|^{2}-\|s_{j}\|^{2}-10\|s_{i}-s_{j}\|^{2}),
Model 10(GP25(5)):𝐱N(𝝁b,𝚺),σij=exp(si2sj25sisj2),\displaystyle\text{Model 10}\ (\text{GP25(5)}):{\bf{x}}\sim N(\boldsymbol{\mu}_{b},\boldsymbol{\Sigma}),\ \sigma_{ij}=\exp(-\|s_{i}\|^{2}-\|s_{j}\|^{2}-5\|s_{i}-s_{j}\|^{2}),

where 𝝁b\boldsymbol{\mu}_{b} is the mean for region bb, and (𝝁1,,𝝁25)N(0,𝚪)(\boldsymbol{\mu}_{1},\dots,\boldsymbol{\mu}_{25})\sim N(0,\boldsymbol{\Gamma}), 𝚪ij=0.90.1I(i=j)\boldsymbol{\Gamma}_{ij}=0.9-0.1I(i=j). For the noise term, we still set σ12=2,4\sigma^{2}_{1}=2,4 and 88 as case (a),(b) and (c), respectively, and σ22=30\sigma^{2}_{2}=30. For these two models, we compare performances of the Lasso, the group Lasso (GLasso), the sparse Group Lasso (SGL), STGP and the proposed estimator (RCT) in Table 4. As a key difference from Models 77 and 88, Models 99 and 1010 impose the group structure on both 𝐱{\bf{x}} and 𝜷\boldsymbol{\beta}. Figure 3(b) shows an illustration of the simulated imaging predictor 𝐱{\bf{x}} in Models 7 and 9.

Table 4: Estimation and selection accuracy of different methods for Models 9 and 10
FPR FNR R-FPR R-FNR 2\ell_{2} loss FPR FNR R-FPR R-FNR 2\ell_{2} loss
Model (9a) Model (10a)
Lasso 0.019 0.270 0.101 0 25.607 0.028 0.411 0.140 0 33.091
(0.004) (0.072) (0.091) (0) (3.235) (0.006) (0.100) (0.083) (0) (3.380)
GLasso 0.220 0.378 0.232 0 16.101 0.215 0.403 0.232 0 16.128
(0.055) (0.151) (0.094) (0) (0.626) (0.054) (0.141) (0.094) (0) (0.621)
SGL 0.115 0.010 0.135 0 12.507 0.126 0.012 0.126 0 13.366
(0.035) (0.022) (0.060) (0) (0.198) (0.037) (0.030) (0.063) (0) (0.523)
STGP 0.063 0 0.109 0 12.540 0.061 0 0.110 0 12.642
(0.010) (0) (0.060) (0) (0.281) (0.007) (0) (0.060) (0) (0.354)
RCT 0.059 0 0.087 0 13.114 0.064 0 0.111 0 13.505
(0.003) (0) (0.087) (0) (0.198) (0.007) (0) (0.091) (0) (0.247)
Model (9b) Model (10b)
Lasso 0.019 0.347 0.166 0 27.557 0.028 0.415 0.145 0 32.957
(0.004) (0.078) (0.088) (0) (3.128) (0.006) (0) (0.096) (0) (0.273)
GLasso 0.223 0.372 0.232 0 16.084 0.214 0.405 0.232 0 16.119
(0.053) (0.143) (0.094) (0) (0.618) (0.056) (0.140) (0.094) (0) (0.619)
SGL 0.130 0.012 0.170 0 12.656 0.128 0.020 0.165 0 13.353
(0.032) (0.026) (0.060) (0) (0.603) (0.037) (0.046) (0.091) (0) (0.615)
STGP 0.061 0 0.114 0 12.520 0.062 0 0.113 0 12.536
(0.007) (0) (0.062) (0) (0.378) (0.007) (0) (0.039) (0) (0.173)
RCT 0.066 0 0.161 0 13.269 0.065 0 0.120 0 13.670
(0.006) (0) (0.100) (0) (0.281) (0.006) (0) (0.096) (0) (0.273)
Model (9c) Model (10c)
Lasso 0.019 0.394 0.180 0 27.949 0.027 0.427 0.151 0 32.553
(0.004) (0.075) (0.102) (0) (3.328) (0.005) (0.098) (0.098) (0) (3.517)
GLasso 0.223 0.354 0.232 0 16.064 0.211 0.407 0.232 0 16.114
(0.058) (0.143) (0.094) (0) (0.619) (0.054) (0.137) (0.094) (0) (0.634)
SGL 0.128 0.013 0.165 0 12.869 0.135 0.016 0.213 0 13.566
(0.033) (0.020) (0.123) (0) (0.700) (0.040) (0.027) (0.170) (0) (0.848)
STGP 0.059 0 0.098 0 12.565 0.126 0.012 0.126 0 13.366
(0.006) (0) (0.064) (0) (0.362) (0.037) (0.030) (0.063) (0) (0.523)
RCT 0.067 0 0.157 0 13.581 0.059 0 0.088 0 12.655
(0.006) (0) (0.092) (0) (0.260) (0.007) (0) (0.066) (0) (0.324)
Refer to caption
(a) Model 7
Refer to caption
(b) Model 9
Figure 3: Plots of the simulated imaging predictors from the Gaussian process regression

In Table 4, we include the region level false positive rate (Region FPR) and region level false negative rate (region FNR) to measure the variable selection accuracy within each region. They are computed based on whether there is at least one variable in the region is selected. Compared with the group 1\ell_{1} penalty, the proposed method identified almost all the correct groups with zero false negatives and achieved a more satisfying FPR and the 2\ell_{2} loss. The proposed method is less likely to over-select either predictors or regions than the sparse group Lasso. It implies that the thresholding function effectively achieved a lower FPR. Again, RCT has compatible performance with original STGP estimator, despite our estimator are not restricted with adjacency information.

5 Application to Scalar-on-Image Analysis

This section applies the proposed method to analyze the 2-back versus 0-back contrast maps derived from the nn-back task fMRI imaging data in the Adolescent Brain Cognitive Development (ABCD) study (Casey et al., 2018). Our goal is to identify the important imaging features from the contrast maps that are strongly associated with the risk of psychairtric disorder, measured by the general factor of psychopathology (GFP) or “p-factor”. After the standard fMRI prepossessing steps, all the images are registered into the 2 mm standard Montreal Neurological Institute (MNI) space consisting of 160,990 voxels in the 90 Automated Anatomical Labeling (AAL) brain regions. With the missing values being removed, the data used in our analysis consists of 2,070 subjects. To reduce the dimension of the imaging data, we partition 90 AAL regions into 2,518 sub-regions with each region consisting of an average of 64 voxels. We refer to each subregion as a super-voxel. For each subject, we compute the average intensity values of the voxels within each super-voxel as its intensity. We consider those 2,518 super-voxel-wise intensity values as the potential imaging predictors.

There are several challenging issues in the scalar-on-image regression analysis of this dataset. First, the correlations between super-voxels across the 90 AAL regions can be very high and the correlation patterns are complex. In fact, there are 151,724 voxel pairs across these regions having a correlation larger than 0.8 (or less than 0.8-0.8), and 9,038 voxel pairs with a correlation larger than 0.9 (or less than 0.9-0.9). Figure 1 visualizes the region-wise correlation structures, where panel (a) shows the highest correlations between regions; and panel (b) counts the voxel pairs that have a correlation higher than 0.8 (or less than 0.8-0.8) in each corresponding region pair. Given the imaging predictors having such high and complicated covariance structures, the classical lasso or the group lasso method may fail to perform variable selection satisfactorily. In contrast, the proposed model with coefficient thresholding is developed to resolve this issue since it does not require the strong conditions on the design matrix. Second, the AAL brain atlas provides useful information on the brain structure and function that may be related to the risk of psychiatry disorder. It is of interest to integrate the AAL region partition as grouping information of imaging predictors to improve the accuracy of imaging feature selection. Third, the outcome variable “p-factor” has a right skewed marginal distribution (see Figure 4), thus its measurements are more likely to contain outliers. The existing non-robust scalar-on-image regression methods may produce inaccurate results. All the aforementioned challenging issues motivate the needs of developing our robust regression with coefficient thresholding and group penalty.

Refer to caption
Figure 4: Marginal distribution of ‘p-factor’ variable among all subjects

In our analysis, we adjust confounding effects by including a few predictors in the model: family size, gender, race, highest parents education, household marital status and household income level. Given the intrinsic group structure, we compare the performance of the proposed method and the STGP in this data analysis. As we mentioned, the fMRI data analysis generally suffers from the reliability issue due to its complex data structure and low signal-to-noise ratio (Bennett & Miller, 2010; Brown & Behrmann, 2017; Eklund et al., 2012, 2016). To evaluate the variable selection stability for both methods, we consider a bootstrap approach with 100 replications. In each replication, we sample nn observations with replacement, and fit the bootstrap samples using the best set of tuning parameters chosen by a five-fold cross-validation. Then we obtain the frequency of each super-voxel being selected over 100 replications as a measure of the selection stability, which can be used to fairly compare the regions that can be consistently selected against randomness, and thus ensure the reliability of the scientific findings in our analysis.

Refer to caption
(a) Soft Thresholded Gaussian Process
Refer to caption
(b) Robust Coefficient Thresholding
Figure 5: Illustration of the selection frequency and cardinality of 90 brain regions for STGP and the RCT. The X axis shows the maximum selection frequency, the bubble size is proportional to the number of super-voxels with the selection frequency being larger than 0.60.6, and the color indicates the proportion of super-voxels for each brain region.

RCT and STGP respectively select 124.5 and 245.3 super-voxels per replication on average. Figure 5 displays the bootstrap selection results, where the xx-axis represents the maximum selection frequency of super-voxels in each region. The circle size is proportional to the number of super-voxels with the corresponding selection frequency being larger than 0.6. The color represents the proportion of super-voxels being selected in each region. Despite a smaller number of super-voxels being selected in each bootstrap run, RCT consistently selects super-voxels in several important brain regions over bootstrap samples, while STGP identifies a less number of brain regions that contain selected super-voxels.

Table 5: Comparisons of stable selection regions between RCT and STGP for different levels of selection frequency threshold 
Selected frequency RCT STGP
0.6-0.7 Callcarine_L, Occipital_Mid_L, Frontal_Sup_Midial_R, Temporal_Mid_L
Parietal_Inf_L
0.7-0.8 Temporal_Mid_L SupraMarginal_R
0.8-0.9 Frontal_Mid_Orb_L, Precuneus_L N/A
>0.9 Front_Sup_L N/A

Table 5 summarizes the comparisons of selected regions from RCT and STGP by varying different thresholds of selection frequency from 0.6 to 0.9. Compared with STGP, RCT selects more stable regions for each level of selection frequency, indicating that our method produces more reliable selection results. In particular, containing at least one super-voxels with more than 60% selection frequency, seven and three regions are respectively identified by RCT and STGP. Among those regions, only one common region, i.e. the left temporal gyrus (Temporal_Mid_L), is detected by both methods, where RCT has a higher selection frequency (0.79) than STGP (0.69). The existing functional neuroimaging studies have indicated that the middle temporal gyrus is involved in language and semantic memory processing (Cabeza & Nyberg, 2000), and it is also related to mental diseases such as chronic schizophrenia (Onitsuka et al., 2004).

Among other selected regions, with more than 90% selection frequency, RCT consistently selects super-voxels in the left superior frontal gyrus (Frontal_Sup_L), while the selection frequency by STGP is below 60%. Superior frontal gyrus is known to be strongly related with working memory (Boisgueheneuc et al., 2006) which plays a critical role in attending to and analyzing incoming information. Deficits in working memory are associated with many cognitive and mental health challenges, such as anxiety and stress (Lukasik et al., 2019), which can be captured by the “p-factor”. The strong relationship between p-factor and working memory has been discovered by exisitng studies (Huang-Pollock et al., 2017).

In addition, RCT also identifies five more regions than STGP: precuneus, the left middle frontal gyrus (Frontal_Mid_Orb_L), the left alcarine fissure and surrounding cortex (Calcarine_L), the middle occipital gyrus (Occipital_Mid_L) and the left inferior parietal gyri (Parietal_Inf_L). Percuneus is well studied as a core of mind (Cavanna & Trimble, 2006), and it is highly related to posttraumatic stress disorder(PTSD) and other mental health issue (Geuze et al., 2007). Middle frontal gyrus is part of limbic system and known to be highly related to emotion (Sprooten et al., 2017). Inferior parietal lobule has been involved in the perception of emotions in facial stimuli, and interpretation of sensory information (Radua et al., 2010). Calcarine fissure is related to vision. Middle occipital gyrus is primarily responsible for object recognition. It would be interesting to further investigate how the brain activity in these regions influence on the p-factor.

To further demonstrate the proposed method providing more reliable scientific findings in comparison to STGP, we evaluate the prediction performance of the two methods by using the train-test data cross-validation. We randomly split the data into two parts with 80%80\% as the training data for model fitting and 20%20\% as the test data for computing the prediction error. We repeat this procedure for 50 times. The mean absolute prediction error of RCT is 0.464 with standard error 0.004; STGP has a mean absolute prediction error of 0.480 with standard error 0.038. We conclude that our proposed method improves the prediction performance of the p-factor using working memory contrast maps in the ABCD study, comparing with the state-of-the-art scalar-on-image regression method.

6 Conclusion

In this paper, we propose a novel high-dimensional robust regression with coefficient thresholding in the presence of complex dependencies among predictors and potential outliers. The proposed method uses the power of thresholding functions and the robust Huber loss to build an efficient nonconvex estimation procedure. We carefully analyze the landscape of the nonconvex loss function for the proposed method, which enables us to establish both statistical and computational consistency. We demonstrate the effectiveness and usefulness of the proposed method in simulation studies and a real application to imaging data analysis. In the future, it is interesting to investigate how to incorporate the spatial-temporal information of the imaging data into our proposed method. It is also important to study the statistical consistency of the near-stationary solution from the proposed gradient descent based algorithm under more general conditions.

References

  • (1)
  • Barron et al. (1999) Barron, A., Birgé, L. & Massart, P. (1999), ‘Risk bounds for model selection via penalization’, Probability Theory and Related Fields 113(3), 301–413.
  • Bennett & Miller (2010) Bennett, C. M. & Miller, M. B. (2010), ‘How reliable are the results from functional magnetic resonance imaging?’, Annals of the New York Academy of Sciences 1191(1), 133–155.
  • Boisgueheneuc et al. (2006) Boisgueheneuc, F. d., Levy, R., Volle, E., Seassau, M., Duffau, H., Kinkingnehun, S., Samson, Y., Zhang, S. & Dubois, B. (2006), ‘Functions of the left superior frontal gyrus in humans: a lesion study’, Brain 129(12), 3315–3328.
  • Breheny & Huang (2011) Breheny, P. & Huang, J. (2011), ‘Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection’, The annals of applied statistics 5(1), 232.
  • Brown & Behrmann (2017) Brown, E. N. & Behrmann, M. (2017), ‘Controversy in statistical analysis of functional magnetic resonance imaging data’, Proceedings of the National Academy of Sciences 114(17), E3368–E3369.
  • Cabeza & Nyberg (2000) Cabeza, R. & Nyberg, L. (2000), ‘Imaging cognition ii: An empirical review of 275 pet and fmri studies’, Journal of cognitive neuroscience 12(1), 1–47.
  • Cai et al. (2019) Cai, Q., Kang, J. & Yu, T. (2019), ‘Bayesian network marker selection via the thresholded graph laplacian gaussian prior’, Bayesian Analysis In Press.
  • Cai et al. (2016) Cai, T. T., Li, X. & Ma, Z. (2016), ‘Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow’, The Annals of Statistics 44(5), 2221–2251.
  • Candes et al. (2015) Candes, E. J., Li, X. & Soltanolkotabi, M. (2015), ‘Phase retrieval via wirtinger flow: Theory and algorithms’, IEEE Transactions on Information Theory 61(4), 1985–2007.
  • Casey et al. (2018) Casey, B., Cannonier, T., Conley, M. I., Cohen, A. O., Barch, D. M., Heitzeg, M. M., Soules, M. E., Teslovich, T., Dellarco, D. V., Garavan, H. et al. (2018), ‘The adolescent brain cognitive development (abcd) study: imaging acquisition across 21 sites’, Developmental cognitive neuroscience 32, 43–54.
  • Cavanna & Trimble (2006) Cavanna, A. E. & Trimble, M. R. (2006), ‘The precuneus: a review of its functional anatomy and behavioural correlates’, Brain 129(3), 564–583.
  • Charbonnier et al. (1997) Charbonnier, P., Blanc-Féraud, L., Aubert, G. & Barlaud, M. (1997), ‘Deterministic edge-preserving regularization in computed imaging’, IEEE Transactions on Image Processing 6(2), 298–311.
  • Eklund et al. (2012) Eklund, A., Andersson, M., Josephson, C., Johannesson, M. & Knutsson, H. (2012), ‘Does parametric fmri analysis with spm yield valid results? -— an empirical study of 1484 rest datasets’, NeuroImage 61(3), 565–578.
  • Eklund et al. (2016) Eklund, A., Nichols, T. E. & Knutsson, H. (2016), ‘Cluster failure: Why fmri inferences for spatial extent have inflated false-positive rates’, Proceedings of the national academy of sciences 113(28), 7900–7905.
  • El Karoui et al. (2013) El Karoui, N., Bean, D., Bickel, P. J., Lim, C. & Yu, B. (2013), ‘On robust regression with high-dimensional predictors’, Proceedings of the National Academy of Sciences 110(36), 14557–14562.
  • Fan & Li (2001) Fan, J. & Li, R. (2001), ‘Variable selection via nonconcave penalized likelihood and its oracle properties’, Journal of the American Statistical Association 96(456), 1348–1360.
  • Fan et al. (2018) Fan, J., Liu, H., Sun, Q. & Zhang, T. (2018), ‘I-lamm for sparse learning: Simultaneous control of algorithmic complexity and statistical error’, The Annals of Statistics 46(2), 814–841.
  • Fan et al. (2014) Fan, J., Xue, L. & Zou, H. (2014), ‘Strong oracle optimality of folded concave penalized estimation’, The Annals of Statistics 42(3), 819–849.
  • Fan & Lv (2013) Fan, Y. & Lv, J. (2013), ‘Asymptotic equivalence of regularization methods in thresholded parameter space’, Journal of the American Statistical Association 108(503), 1044–1061.
  • Geuze et al. (2007) Geuze, E., Vermetten, E., de Kloet, C. S. & Westenberg, H. G. (2007), ‘Precuneal activity during encoding in veterans with posttraumatic stress disorder’, Progress in brain research 167, 293–297.
  • Goldsmith et al. (2014) Goldsmith, J., Huang, L. & Crainiceanu, C. M. (2014), ‘Smooth scalar-on-image regression via spatial bayesian variable selection’, Journal of Computational and Graphical Statistics 23(1), 46–64.
  • He et al. (2018) He, K., Xu, H. & Kang, J. (2018), ‘A selective overview of feature screening methods with applications to neuroimaging data’, Wiley Interdisciplinary Reviews: Computational Statistics p. e1454.
  • Huang-Pollock et al. (2017) Huang-Pollock, C., Shapiro, Z., Galloway-Long, H. & Weigard, A. (2017), ‘Is poor working memory a transdiagnostic risk factor for psychopathology?’, Journal of abnormal child psychology 45(8), 1477–1490.
  • Huber (1964) Huber, P. J. (1964), ‘Robust estimation of a location parameter’, The Annals of Mathematical Statistics 53, 73–101.
  • Kang et al. (2018) Kang, J., Reich, B. J. & Staicu, A.-M. (2018), ‘Scalar-on-image regression via the soft-thresholded gaussian process’, Biometrika 105(1), 165–184.
  • Li et al. (2015) Li, F., Zhang, T., Wang, Q., Gonzalez, M. Z., Maresh, E. L. & Coan, J. A. (2015), ‘Spatial bayesian variable selection and grouping for high-dimensional scalar-on-image regression’, The Annals of Applied Statistics 9(2), 687–713.
  • Lindquist et al. (2008) Lindquist, M. A. et al. (2008), ‘The statistical analysis of fmri data’, Statistical science 23(4), 439–464.
  • Loh (2017) Loh, P.-L. (2017), ‘Statistical consistency and asymptotic normality for high-dimensional robust mm-estimators’, The Annals of Statistics 45(2), 866–896.
  • Loh (2018) Loh, P.-L. (2018), ‘Scale calibration for high-dimensional robust regression’, arXiv preprint arXiv:1811.02096 .
  • Loh & Wainwright (2013) Loh, P.-L. & Wainwright, M. J. (2013), Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima, in ‘Advances in Neural Information Processing Systems’, pp. 476–484.
  • Loh & Wainwright (2017) Loh, P.-L. & Wainwright, M. J. (2017), ‘Support recovery without incoherence: A case for nonconvex regularization’, The Annals of Statistics 45(6), 2455–2482.
  • Lukasik et al. (2019) Lukasik, K. M., Waris, O., Soveri, A., Lehtonen, M. & Laine, M. (2019), ‘The relationship of anxiety and stress with working memory performance in a large non-depressed sample’, Frontiers in psychology 10, 4.
  • Mei et al. (2018) Mei, S., Bai, Y. & Montanari, A. (2018), ‘The landscape of empirical risk for nonconvex losses’, The Annals of Statistics 46(6A), 2747–2774.
  • Nakajima & West (2013a) Nakajima, J. & West, M. (2013a), ‘Bayesian analysis of latent threshold dynamic models’, Journal of Business & Economic Statistics 31(2), 151–164.
  • Nakajima & West (2013b) Nakajima, J. & West, M. (2013b), ‘Bayesian dynamic factor models: Latent threshold approach’, Journal of Financial Econometrics 11, 116–153.
  • Nakajima & West (2017) Nakajima, J. & West, M. (2017), ‘Dynamics & sparsity in latent threshold factor models: A study in multivariate eeg signal processing’, Brazilian Journal of Probability and Statistics 31(4), 701–731.
  • Negahban et al. (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B. et al. (2012), ‘A unified framework for high-dimensional analysis of mm-estimators with decomposable regularizers’, Statistical Science 27(4), 538–557.
  • Negahban & Wainwright (2012) Negahban, S. & Wainwright, M. J. (2012), ‘Restricted strong convexity and weighted matrix completion: Optimal bounds with noise’, Journal of Machine Learning Research 13(May), 1665–1697.
  • Nesterov (2013) Nesterov, Y. (2013), ‘Gradient methods for minimizing composite functions’, Mathematical Programming 140(1), 125–161.
  • Ni et al. (2019) Ni, Y., Stingo, F. C. & Baladandayuthapani, V. (2019), ‘Bayesian graphical regression’, Journal of the American Statistical Association 114, 184–197.
  • Onitsuka et al. (2004) Onitsuka, T., Shenton, M. E., Salisbury, D. F., Dickey, C. C., Kasai, K., Toner, S. K., Frumin, M., Kikinis, R., Jolesz, F. A. & McCarley, R. W. (2004), ‘Middle and inferior temporal gyrus gray matter volume abnormalities in chronic schizophrenia: an mri study’, American Journal of Psychiatry 161(9), 1603–1611.
  • Poldrack (2012) Poldrack, R. A. (2012), ‘The future of fmri in cognitive neuroscience’, Neuroimage 62(2), 1216–1220.
  • Radua et al. (2010) Radua, J., Phillips, M. L., Russell, T., Lawrence, N., Marshall, N., Kalidindi, S., El-Hage, W., McDonald, C., Giampietro, V., Brammer, M. J. et al. (2010), ‘Neural response to specific components of fearful faces in healthy and schizophrenic adults’, Neuroimage 49(1), 939–946.
  • Rockafellar (2015) Rockafellar, R. T. (2015), Convex Analysis, Princeton University Press.
  • Rockafellar & Wets (2009) Rockafellar, R. T. & Wets, R. J.-B. (2009), Variational analysis, Vol. 317, Springer Science & Business Media.
  • Rosenberg et al. (2016) Rosenberg, M. D., Finn, E. S., Scheinost, D., Papademetris, X., Shen, X., Constable, R. T. & Chun, M. M. (2016), ‘A neuromarker of sustained attention from whole-brain functional connectivity’, Nature Neuroscience 19(1), 165.
  • Simon et al. (2013) Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. (2013), ‘A sparse-group lasso’, Journal of Computational and Graphical Statistics 22(2), 231–245.
  • Sprooten et al. (2017) Sprooten, E., Rasgon, A., Goodman, M., Carlin, A., Leibu, E., Lee, W. H. & Frangou, S. (2017), ‘Addressing reverse inference in psychiatric neuroimaging: Meta-analyses of task-related brain activation in common mental disorders’, Human brain mapping 38(4), 1846–1864.
  • Sun et al. (2019) Sun, Q., Jiang, B., Zhu, H. & Ibrahim, J. G. (2019), ‘Hard thresholding regression’, Scandinavian Journal of Statistics 46(1), 314–328.
  • Tibshirani (1996) Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288.
  • Wainwright (2009) Wainwright, M. J. (2009), ‘Sharp thresholds for high-dimensional and noisy sparsity recovery using _\ell\_{11}-constrained quadratic programming (lasso)’, IEEE Transactions on Information Theory 55(5), 2183–2202.
  • Wang et al. (2017) Wang, X., Zhu, H. & Initiative, A. D. N. (2017), ‘Generalized scalar-on-image regression models via total variation’, Journal of the American Statistical Association 112(519), 1156–1168.
  • Wang et al. (2014) Wang, Z., Liu, H. & Zhang, T. (2014), ‘Optimal computational and statistical rates of convergence for sparse nonconvex learning problems’, Annals of statistics 42(6), 2164.
  • Yang (2015) Yang, Y. (2015), ‘A unified algorithm for fitting penalized models with high dimensional data’.
  • Yuan & Lin (2006) Yuan, M. & Lin, Y. (2006), ‘Model selection and estimation in regression with grouped variables’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67.
  • Zhang (2010) Zhang, C.-H. (2010), ‘Nearly unbiased variable selection under minimax concave penalty’, The Annals of Statistics 38(2), 894–942.
  • Zhang & Zhang (2012) Zhang, C.-H. & Zhang, T. (2012), ‘A general theory of concave regularization for high-dimensional sparse estimation problems’, Statistical Science 27(4), 576–593.
  • Zhao & Yu (2006) Zhao, P. & Yu, B. (2006), ‘On model selection consistency of lasso’, Journal of Machine Learning Research 7(Nov), 2541–2563.
  • Zou (2006) Zou, H. (2006), ‘The adaptive lasso and its oracle properties’, Journal of the American Statistical Association 101(476), 1418–1429.
  • Zou & Li (2008) Zou, H. & Li, R. (2008), ‘One-step sparse estimates in nonconcave penalized likelihood models’, The Annals of Statistics 36(4), 1509–1533.