
Robust subgroup-classifier learning and testing in change-plane regressions

Xu Liu1, Jian Huang2, Yong Zhou3 and Xiao Zhang4,   
1Shanghai University of Finance and Economics, Shanghai, China
2The Hong Kong Polytechnic University, Hong Kong SAR, China
3East China Normal University, Shanghai, China
4The Chinese University of Hong Kong-Shenzhen, Shenzhen, China
CONTACT [email protected]; The Chinese University of Hong Kong-Shenzhen, China
Abstract

Considered here are robust subgroup-classifier learning and testing in change-plane regressions with heavy-tailed errors, which can identify subgroups as a basis for making optimal recommendations for individualized treatment. A new subgroup classifier is proposed by smoothing the indicator function, which is learned by minimizing the smoothed Huber loss. Nonasymptotic properties and the Bahadur representation of estimators are established, in which the proposed estimators of the grouping difference parameter and baseline parameter achieve sub-Gaussian tails. The hypothesis test considered here belongs to the class of test problems for which some parameters are not identifiable under the null hypothesis. The classic supremum of the squared score test statistic may lose power in practice when the dimension of the grouping parameter is large, so to overcome this drawback and make full use of the data’s heavy-tailed error distribution, a robust weighted average of the squared score test statistic is proposed, which achieves a closed form when an appropriate weight is chosen. Asymptotic distributions of the proposed robust test statistic are derived under the null and alternative hypotheses. The proposed robust subgroup classifier and test statistic perform well on finite samples, and their performances are shown further by applying them to a medical dataset. The proposed procedure leads to the immediate application of recommending optimal individualized treatments.


Keywords: Gaussian-type deviation; Heavy-tailed data; Nonstandard test; Robust classifier; Subgroup classifier; Subgroup detection.

1 Introduction

When studying the risk of a disease outcome, there could be heterogeneity across subgroups characterized by covariates, meaning that the same treatment may yield different treatment effects in different subpopulations. In the presence of population heterogeneity in classical models, learning the subgroup classifier and testing for the existence of subgroups associated with the risk-model heterogeneity are important for better understanding the different effects of predictors and better modeling the association of diseases with predictors. In precision medicine, this plays a core role in guiding personalized treatment for individuals in a population by identifying subgroups with different treatment effects on disease. There has been much previous research on learning the subgroup classifiers of individuals based on various models (Foster et al., 2011; Wei and Kosorok, 2018; Huang et al., 2020; Li et al., 2021; Zhang et al., 2022).

Before learning the subgroup classifier, it is necessary to test for the existence of subgroups of individuals to address the potential risk of finding false-positive subgroups. This necessity not only arises from the data themselves but is also intrinsic to statistics, because the nonexistence of subgroups causes the identifiability problem when learning the subgroup classifier. However, this test problem belongs to the class of nonstandard tests with loss of identifiability under the null hypothesis; see Wald, (1943); Andrews and Ploberger, (1994, 1995); Davies, (1977); Song et al., (2009); Liu et al., (2024); Kang et al., (2024), among others. Therefore, the focus herein is on learning the subgroup classifier and testing for the existence of subgroups simultaneously, which offers more information and the potential for making the best recommendations for optimal individualized treatments and guiding future treatment modification and development.

Let $\{{\boldsymbol{V}}_{i}=(y_{i},{\boldsymbol{X}}_{i},{\boldsymbol{Z}}_{i},{\boldsymbol{U}}_{i}),i=1,\cdots,n\}$ be the observed data, which are $n$ independent and identically distributed copies of ${\boldsymbol{V}}=(y,{\boldsymbol{X}},{\boldsymbol{Z}},{\boldsymbol{U}})$. Consider the regression model with change plane (Lee et al., 2011; Zhang et al., 2022; Mukherjee et al., 2022; Liu et al., 2024)

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}\geq 0)+\epsilon_{i}, \tag{1}
\]

where $\boldsymbol{\alpha}=(\alpha_{1},\cdots,\alpha_{p})^{\rm T}\in\Theta_{\alpha}\subseteq\mathbb{R}^{p}$, $\boldsymbol{\beta}=(\beta_{1},\cdots,\beta_{q})^{\rm T}\in\Theta_{\beta}\subseteq\mathbb{R}^{q}$, and $\boldsymbol{\gamma}=(\gamma_{1},\cdots,\gamma_{r})^{\rm T}\in\Theta_{\gamma}\subseteq\mathbb{R}^{r}$ are unknown parameters, and ${\mathrm{E}}(\epsilon_{i}|{\boldsymbol{X}}_{i},{\boldsymbol{Z}}_{i},{\boldsymbol{U}}_{i})=0$ and ${\mathrm{E}}(|\epsilon_{i}|^{2+\delta}|{\boldsymbol{X}}_{i},{\boldsymbol{Z}}_{i},{\boldsymbol{U}}_{i})=M_{\delta}<\infty$ for some $\delta\geq 0$. When $\delta=0$, the error has a finite second moment, denoted by $M_{0}={\mathrm{E}}(\epsilon_{i}^{2}|{\boldsymbol{X}}_{i},{\boldsymbol{Z}}_{i},{\boldsymbol{U}}_{i})$. For ease of expression, let $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$. Following Liu et al., (2024), ${\boldsymbol{U}}$ is called the grouping variable, $\boldsymbol{\gamma}$ the grouping parameter, ${\boldsymbol{Z}}$ the grouping difference variable, $\boldsymbol{\beta}$ the grouping difference parameter, ${\boldsymbol{X}}$ the baseline variable, and $\boldsymbol{\alpha}$ the baseline parameter. Herein, the indicator function ${\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)$ is called the subgroup classifier.

The technology for collecting and processing datasets has improved considerably in recent years, and one is now more likely to encounter heavy-tailed or low-quality data, thereby causing the typical assumption of a Gaussian or sub-Gaussian distribution to fail. Therefore, modeling non-Gaussian or heavy-tailed data raises new challenges relative to the classic methodology. Even for linear regression models with heavy-tailed errors, the ordinary least squares (OLS) estimators are suboptimal both theoretically and empirically. Instead, proposed herein is a robust subgroup-classification procedure based on the change-plane model (1) with heavy-tailed errors. This paper addresses two important problems for model (1) with heavy-tailed errors, i.e., subgroup-classifier learning (Section 2) and testing for whether subgroups exist (Section 3).

1.1 Robust subgroup-classifier learning

Li et al., (2021) considered the change-plane model (1) with Gaussian errors. Also, Zhang et al., (2022) investigated a quantile regression with a change plane and derived the asymptotic normality of the estimators of the grouping difference parameter and the grouping parameter. However, although quantile or median regression models require no Gaussian or sub-Gaussian assumption, they essentially estimate the conditional quantile or median regression instead of the conditional mean regression. If the mean regression is of interest in practice, then these procedures are not feasible unless the error distribution is symmetric around zero, an assumption that may be too strong and can cause a misspecification problem. See Fan et al., 2017b for some examples that demonstrate the distinction between conditional mean regression and conditional quantile or median regression.

Linear regression models with heavy-tailed errors are prevalent in the literature. Fan et al., 2017b proposed a robust estimator of high-dimensional mean regression in the absence of symmetry and light-tail assumptions. Zhou et al., (2018) provided a robust M-estimation procedure with applications to dependence-adjusted multiple testing. Sun et al., (2020) and Wang et al., (2021) studied adaptive Huber regression for linear regression models with heavy-tailed errors. Chen and Zhou, (2020) investigated robust inference via multiplier bootstrap in multiple response regression models, constructing robust bootstrap confidence sets and addressing large-scale simultaneous hypothesis testing problems.

Figure 1: Estimation errors of parameter $\boldsymbol{\beta}$ in $L_{2}$ norm (left) and accuracies of the estimated subgroup classifier (right) for heavy-tailed errors generated from the Pareto distribution.

Mukherjee et al., (2022) studied the change-plane problem under heavy-tailed errors when $\boldsymbol{\alpha}=0$ and ${\boldsymbol{X}}=1$, which is a special case of the change-plane model (1), and they left the general change-plane model with heavy-tailed errors for future work. Herein, a new robust procedure is introduced to estimate parameters and consequently to learn the subgroup classifier. Figure 1 shows boxplots of the estimation errors of the parameter $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ in $L_{2}$ norm and the accuracies of the estimated subgroup classifier, where the $L_{2}$ norm of the estimation error is defined as $\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|$ with the robust estimator $\hat{\boldsymbol{\theta}}_{\tau,h}$ of the true parameter $\boldsymbol{\theta}^{*}$, and the accuracy is defined as $\mbox{ACC}=1-n^{-1}\sum_{i=1}^{n}\left|{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\hat{\boldsymbol{\gamma}}_{\tau,h}\geq 0)-{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}^{*}\geq 0)\right|$ with the robust estimator $\hat{\boldsymbol{\gamma}}_{\tau,h}$ of the true parameter $\boldsymbol{\gamma}^{*}$. Here, the settings are $(p,q,r)=(3,3,3)$ and $n=(200,400,600)$ with 1000 repetitions; the heavy-tailed errors are generated from the Pareto distribution $Par(2,1)$ with shape parameter 2 and scale parameter 1; and $X_{1}=Z_{1}=1$, while $(X_{2},\cdots,X_{p})^{\rm T}=(Z_{2},\cdots,Z_{p})^{\rm T}$ and $(U_{2},\cdots,U_{r})^{\rm T}$ are generated independently from the multivariate normal distributions $N({\boldsymbol{0}}_{p-1},\sqrt{3}{\boldsymbol{I}}_{p-1})$ and $N({\boldsymbol{0}}_{r-1},\sqrt{3}{\boldsymbol{I}}_{r-1})$, respectively; see Section 4 for details. As in Zhang et al., (2022), the smooth function $K(u)=\{1+\exp(-u)\}^{-1}$ with smoothness parameter $h=\sqrt{\log(n)/n}$ is chosen, and the proposed robust estimation procedure (AHu) is compared with the method based on OLS (Li et al., 2021). Figure 1 sends the important message that in the presence of heavy tails, compared with the existing method (Li et al., 2021), the proposed robust estimators not only reduce the estimation error dramatically but also significantly improve the accuracy of the subgroup classifier.
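To make this design concrete, the following Python sketch generates one dataset from the setting above and computes ACC; the true parameter values, the law of $U_{1}$, and the centering of the Pareto errors are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p=3, r=3, alpha=None, beta=None, gamma=None):
    """Generate one sample from model (1) with Pareto-type errors (a sketch)."""
    alpha = np.ones(p) if alpha is None else alpha
    beta = np.ones(p) if beta is None else beta
    gamma = np.ones(r) if gamma is None else gamma         # gamma_1 normalized to 1
    X = np.c_[np.ones(n), rng.normal(0, 3 ** 0.25, (n, p - 1))]   # X1 = Z1 = 1
    Z = X.copy()                                           # (X2,...,Xp) = (Z2,...,Zp)
    U = np.c_[rng.standard_normal(n),                      # U1: assumed standard normal here
              rng.normal(0, 3 ** 0.25, (n, r - 1))]
    eps = rng.pareto(2.0, n) + 1.0 - 2.0                   # Pareto(2,1), centered at its mean 2
    y = X @ alpha + (Z @ beta) * (U @ gamma >= 0) + eps
    return y, X, Z, U

def acc(U, gamma_hat, gamma_true):
    """ACC = 1 - mean |1(U'gamma_hat >= 0) - 1(U'gamma_true >= 0)|."""
    return 1.0 - np.mean((U @ gamma_hat >= 0) != (U @ gamma_true >= 0))
```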

1.2 Robust subgroup testing

Another goal of this paper is to test for the existence of subgroups, i.e.,

\[
H_{0}:\boldsymbol{\beta}={\boldsymbol{0}}\quad\text{versus}\quad H_{1}:\boldsymbol{\beta}\neq{\boldsymbol{0}}. \tag{2}
\]

Note that the grouping parameter $\boldsymbol{\gamma}$ is not identifiable under the null hypothesis.

The classic Wald-type test or score-based test is powerful in standard test problems, in which there is no identifiability problem under either the null or the alternative hypothesis, but these common procedures are not feasible when nuisance parameters are present. Andrews and Ploberger, (1994) and Andrews and Ploberger, (1995) studied the weighted average exponential form, which was originally introduced by Wald, (1943). Davies, (1977) thoroughly investigated the supremum of the squared score test (SST) statistic for mixture models, which Song et al., (2009) and Kang et al., (2017) applied to semiparametric models with censored data.

All the aforementioned testing methods are optimal tests based on the weighted average power criterion. However, because these optimal tests take the weighted exponential average of the classical tests over the grouping parametric space $\Theta_{\gamma}$, they may not only perform poorly in practice when the dimension of $\Theta_{\gamma}$ is large but also entail a heavy computational burden for calculating the p-value or the critical value. Instead, Liu et al., (2024) introduced a new test statistic by taking the weighted average of the SST (WAST) over $\Theta_{\gamma}$ and removing both the inverse of the covariance and the cross-interaction terms to overcome the drawbacks of the SST. Thanks to its closed form, the WAST achieves more-accurate type-I errors and significantly improved power, with dramatically reduced computational time as a byproduct.

Figure 2: Powers of the test statistics from the proposed RWAST (rwast, red solid line), WAST (olsw, green dashed line), and SST (olss, blue dotted line) for $(p,q,r)=(3,3,3)$ with heavy-tailed errors generated from the Pareto distribution.

All the aforementioned test procedures require the important assumption of Gaussian or sub-Gaussian errors in the change-plane models, so none of them apply to heavy-tailed datasets. Therefore, proposed herein is a robust test procedure based on the WAST (Liu et al., 2024), called the robust WAST (RWAST). Figure 2 shows the power curves of the proposed RWAST, the WAST introduced by Liu et al., (2024), and the SST considered by Davies, (1977); Kang et al., (2017). A total of 1000 bootstrap samples is used, and the other settings are the same as those in Section 1.1. With the nominal significance level $\alpha=0.05$, the type-I errors of the three methods are $(\mbox{rwast},\mbox{olsw},\mbox{olss})=(0.042,0.065,0.031)$ for $n=200$, $(0.043,0.047,0.118)$ for $n=400$, and $(0.033,0.047,0.135)$ for $n=600$. It follows that RWAST controls the type-I errors well, whereas those of the SST based on the quadratic loss are larger and deviate far from 0.05. Figure 2 also shows that in the presence of heavy tails, the proposed RWAST achieves larger power than the WAST based on the ordinary quadratic loss.

In summary, from the demonstration examples in Section 1.1 and Section 1.2, compared with the existing nonrobust methods for heavy-tailed data, the proposed robust estimation procedure is characterized by lower estimation errors and higher accuracy, and the robust test procedure has more-accurate type-I errors and larger power.

1.3 Main contributions

The main contributions of this paper are summarized as follows. First, the robust estimation procedure for the change-plane model (1) with heavy-tailed errors is investigated carefully. The proposed robust estimator adapts to the sample size, the robustification parameter in the Huber loss, the smoothness parameter when approximating the indicator function in the subgroup classifier, and the moments of the errors. The sacrifices made in pursuit of robustness and smoothness are analyzed theoretically, with the bias involving the robustification parameter arising from the pursuit of robustness, and the one involving the smoothness parameter arising from the approximation to the indicator function. The nonasymptotic properties for the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are established, as well as those for the grouping parameter. The theoretical results reveal that the proposed estimators of the grouping difference parameter and baseline parameter have Gaussian-type deviations (Devroye et al., 2016). Also provided is the nonasymptotic Bahadur representation of the proposed robust estimators, which is convenient for deriving the classical asymptotic results needed for statistical inference such as hypothesis tests and the construction of confidence regions. Extensive simulation studies show that the proposed robust estimation procedure is superior to its several competitors.

Second, for the change-plane model with heavy-tailed errors, RWAST is proposed, which makes full use of the data’s heavy-tailed information and overcomes the drawbacks of loss of power in practice and the heavy computational burden of SST when the dimension of the grouping parameter is large. The asymptotic distributions of the proposed RWAST under the null and alternative hypotheses are established based on the theory of degenerate U-statistics. As with exponential average tests, the proposed asymptotic distributions are not standard (e.g., the normal or $\chi^{2}$ distribution), so a novel bootstrap method that is easily implemented and theoretically guaranteed is introduced to mimic the critical value or p-value. Comprehensive simulation studies conducted with finite sample sizes and for various heavy-tailed error distributions show the excellent performance of the proposed RWAST, which improves the power significantly and reduces the computational burden dramatically.

In summary, a novel robust estimator is proposed that adapts to the sample size, dimension, robustification parameter, moments, and smoothness parameter in pursuit of the optimal tradeoff among bias, robustness, and smoothness. To the best of the authors’ knowledge about change-plane analysis, the literature contains no nonasymptotic results with sub-Gaussian tails for the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$, and no nonasymptotic results with sub-exponential tails for the Bahadur representation of these parameters. Furthermore, a robust test procedure is proposed that improves on the WAST in the change-plane model with heavy-tailed errors.

1.4 Notation and organization of the paper

Here, some useful notation is introduced for convenience of expression. For a vector ${\boldsymbol{v}}\in\mathbb{R}^{d}$ and a square matrix $A=(a_{ij})\in\mathbb{R}^{d\times d}$, denote by $\|{\boldsymbol{v}}\|$ the Euclidean norm of ${\boldsymbol{v}}$ and by $\mbox{trace}(A)=\sum_{i=1}^{d}a_{ii}$ the trace of $A$; write $\|{\boldsymbol{v}}\|_{A}^{2}=\sum_{i,j}a_{ij}v_{i}v_{j}$ and ${\boldsymbol{v}}^{\otimes 2}={\boldsymbol{v}}{\boldsymbol{v}}^{\rm T}$. Denote by $\|A\|_{p}=\sup\{\|A{\boldsymbol{x}}\|_{p}:{\boldsymbol{x}}\in\mathbb{R}^{d},\|{\boldsymbol{x}}\|_{p}=1\}$ the induced operator norm of a matrix $A=(a_{ij})\in\mathbb{R}^{m\times d}$.

Denote by $\boldsymbol{\mathrm{P}}$ the ordinary probability measure such that $\boldsymbol{\mathrm{P}}f=\int f\,d\boldsymbol{\mathrm{P}}$ for any measurable function $f$, by $\mathbb{P}_{n}$ the empirical measure of a sample of random elements from $\boldsymbol{\mathrm{P}}$ such that $\mathbb{P}_{n}f=n^{-1}\sum_{i=1}^{n}f({\boldsymbol{V}}_{i})$, and by ${\mathbb{G}}_{n}$ the empirical process indexed by a class ${\cal F}$ of measurable functions such that ${\mathbb{G}}_{n}f=\sqrt{n}(\mathbb{P}_{n}-\boldsymbol{\mathrm{P}})f$ for any $f\in{\cal F}$. Let $L^{p}(Q)$ be the space of all measurable functions $f$ such that $\|f\|_{Q,p}:=(Q|f|^{p})^{1/p}<\infty$ for $p\in[1,\infty)$, with $\|f\|_{Q,\infty}$ denoting the essential supremum when $p=\infty$. Let $N(\epsilon,{\cal F},\|\cdot\|_{Q,2})$ be the $\epsilon$-covering number of ${\cal F}$ with respect to the $L^{2}(Q)$ seminorm $\|\cdot\|_{Q,2}$, where ${\cal F}$ is a class of measurable functions and $Q$ is a finite discrete measure.

The remainder of this paper is organized as follows. Section 2 provides the robust estimators for the grouping difference parameter and grouping parameter as well as the subgroup classifier, and theorems reveal that these estimators achieve Gaussian-type deviations. Also derived is the Bahadur representation of the robust estimators, and it is shown that the remainder of the Bahadur representation achieves sub-Gaussian tails. Section 3 presents the RWAST statistic and establishes its limiting distributions under the null and alternative hypotheses. Section 4 reports the results of simulation studies conducted to evaluate the finite-sample performance of the proposed methods with competitors in the change-plane models with heavy-tailed errors. The performance of the proposed methods is illustrated further by applying them to a medical dataset in Section 5. Finally, Section 6 concludes with remarks and further extensions. The proofs are provided in the Supplementary Material, and an R package named “wasthub” is available at https://github.com/xliusufe/wasthub.

2 Robust subgroup-classifier learning

In this section, the subgroup classifier is learned to partition subjects into two subgroups, and nonasymptotic properties are provided for the robust estimators, whose deviations achieve sub-Gaussian tails. To adapt to different magnitudes of errors and to robustify the estimation, the Huber loss (Huber, 1964; Fan et al., 2017b; Wang et al., 2021; Han et al., 2022) is considered, the definition of which begins this section.

Definition 2.1.

The Huber loss $L_{\tau}(u)$ (Huber, 1964) is defined as

\[
L_{\tau}(u)=\begin{cases}
u^{2}/2 & \text{if } |u|\leq\tau,\\
\tau|u|-\tau^{2}/2 & \text{otherwise,}
\end{cases} \tag{5}
\]

where $\tau>0$ is a tuning parameter called the robustification parameter (Sun et al., 2020; Chen and Zhou, 2020), which regulates the tradeoff between bias and robustness.

The Huber loss is a hybrid of the squared loss for small errors and the absolute loss for large errors. Denoting by $L(u)=u^{2}/2$ the ordinary quadratic loss, it is straightforward to see that $L(u)=\lim_{\tau\rightarrow\infty}L_{\tau}(u)$.
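In code, the loss and its first derivative $\psi_{\tau}(u)=\hbox{sgn}(u)\min\{|u|,\tau\}$ (used in Section 3) are one-liners; a minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def huber_loss(u, tau):
    """Huber loss L_tau(u): quadratic on |u| <= tau, linear beyond."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= tau, 0.5 * u ** 2, tau * np.abs(u) - 0.5 * tau ** 2)

def psi_tau(u, tau):
    """First derivative psi_tau(u) = sign(u) * min(|u|, tau)."""
    u = np.asarray(u, dtype=float)
    return np.sign(u) * np.minimum(np.abs(u), tau)
```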

2.1 Robust estimation

Rewrite model (1) as

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta}\geq 0)+\epsilon_{i}, \tag{6}
\]

where ${\boldsymbol{U}}_{i}=(U_{1i},{\boldsymbol{U}}_{2i}^{\rm T})^{\rm T}$ and $\boldsymbol{\eta}=\gamma_{1}^{-1}\boldsymbol{\gamma}_{-1}$ with $\boldsymbol{\gamma}_{-1}=(\gamma_{2},\cdots,\gamma_{r})^{\rm T}$. To avoid the identifiability problem for $\boldsymbol{\eta}$, $\boldsymbol{\beta}\neq{\boldsymbol{0}}$ is assumed in this section. Denote $\boldsymbol{\zeta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T},\boldsymbol{\eta}^{\rm T})^{\rm T}\in\Theta_{\zeta}$, where $\Theta_{\zeta}=\Theta_{\alpha}\times\Theta_{\beta}\times\Theta_{\eta}$ is the product space.

Because the indicator function ${\boldsymbol{1}}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}\geq 0)$ is not differentiable, it is natural to approximate it by a smooth function $K(u)$ satisfying

\[
\lim_{u\rightarrow+\infty}K(u)=1 \quad\mbox{and}\quad \lim_{u\rightarrow-\infty}K(u)=0.
\]

Note that this smooth function characterizes a cumulative distribution function rather than a density function; see Seo and Linton, (2007); Li et al., (2021); Zhang et al., (2022); Mukherjee et al., (2020) for more details. The literature contains many commonly used smooth functions, such as the cumulative distribution function of the standard normal distribution $K(u)=\Phi(u)$, the sigmoid function $K(u)=\{1+\exp(-u)\}^{-1}$, and the mixture of the cumulative distribution function and density of the standard normal distribution $K(u)=\Phi(u)+u\phi(u)$. Thus, model (6) can be approximated by

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta})+\epsilon_{i}, \tag{7}
\]

where $K_{h}(u)=K(u/h)$, and $h$ is a predetermined tuning parameter associated with $n$ satisfying $\lim_{n\rightarrow\infty}h=0$, called the smoothness parameter.
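A small sketch of these surrogates and of the rescaled $K_{h}$ (SciPy supplies $\Phi$ and $\phi$):

```python
import numpy as np
from scipy.stats import norm

def K_sigmoid(u):
    """Sigmoid surrogate K(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(u, dtype=float)))

def K_phi_mix(u):
    """Surrogate K(u) = Phi(u) + u * phi(u)."""
    u = np.asarray(u, dtype=float)
    return norm.cdf(u) + u * norm.pdf(u)

def K_h(u, h, K=K_sigmoid):
    """Smoothed indicator K_h(u) = K(u / h); tends to 1(u >= 0) as h -> 0."""
    return K(np.asarray(u, dtype=float) / h)
```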

For any $\tau>0$, let $\boldsymbol{\zeta}_{\tau,h}^{*}$ be the minimizer defined as

\[
\boldsymbol{\zeta}_{\tau,h}^{*}=\mathop{\rm argmin}_{\boldsymbol{\zeta}\in\Theta_{\zeta}}\boldsymbol{\mathrm{P}}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta})\big), \tag{8}
\]

which approximates the minimizer

\[
\boldsymbol{\zeta}_{\tau}^{*}=\mathop{\rm argmin}_{\boldsymbol{\zeta}\in\Theta_{\zeta}}\boldsymbol{\mathrm{P}}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}\geq 0)\big). \tag{9}
\]

Following Sun et al., (2020), $\boldsymbol{\zeta}_{\tau}^{*}$ is called the Huber coefficient, which usually differs from the true parameter $\boldsymbol{\zeta}^{*}$. Measured by $\|\boldsymbol{\zeta}_{\tau}^{*}-\boldsymbol{\zeta}^{*}\|$, the Huber error is caused by the robustification for heavy-tailed errors, while the distance $\|\boldsymbol{\zeta}_{\tau,h}^{*}-\boldsymbol{\zeta}^{*}\|$ is a consequence of both robustification and smoothing. Theorem 2 reveals that $\|\boldsymbol{\zeta}_{\tau,h}^{*}-\boldsymbol{\zeta}^{*}\|$ is controlled by both $\tau$ and $h$, with $h$ playing the role of a bandwidth as in nonparametric estimation.

Minimizing the empirical loss in (8) produces the robust estimator of interest, i.e.,

\[
\hat{\boldsymbol{\zeta}}_{\tau,h}=\mathop{\rm argmin}_{\boldsymbol{\zeta}\in\Theta_{\zeta}}\mathbb{P}_{n}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta})\big). \tag{10}
\]

From (8), (9), and (10), the total estimation error $\|\hat{\boldsymbol{\zeta}}_{\tau,h}-\boldsymbol{\zeta}^{*}\|$ can be decomposed into three parts, i.e.,

\[
\underbrace{\|\hat{\boldsymbol{\zeta}}_{\tau,h}-\boldsymbol{\zeta}^{*}\|}_{\text{total error}}\leq\underbrace{\|\hat{\boldsymbol{\zeta}}_{\tau,h}-\boldsymbol{\zeta}_{\tau,h}^{*}\|}_{\text{estimation error}}+\underbrace{\|\boldsymbol{\zeta}_{\tau,h}^{*}-\boldsymbol{\zeta}_{\tau}^{*}\|}_{\text{smooth error}}+\underbrace{\|\boldsymbol{\zeta}_{\tau}^{*}-\boldsymbol{\zeta}^{*}\|}_{\text{Huber error}}. \tag{11}
\]

It is natural to use an alternating strategy to obtain the estimate, denoted by $\hat{\boldsymbol{\zeta}}=(\hat{\boldsymbol{\alpha}}^{\rm T},\hat{\boldsymbol{\beta}}^{\rm T},\hat{\boldsymbol{\eta}}^{\rm T})^{\rm T}$. Specifically, the parameters $(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ and $\boldsymbol{\eta}$ can be estimated iteratively as follows. For given $\boldsymbol{\eta}^{(k)}$, $\boldsymbol{\alpha}^{(k+1)}$ and $\boldsymbol{\beta}^{(k+1)}$ are obtained by minimizing

\[
\mathbb{P}_{n}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}^{(k)})\big),
\]

and for given $\boldsymbol{\alpha}^{(k+1)}$ and $\boldsymbol{\beta}^{(k+1)}$, $\boldsymbol{\eta}^{(k+1)}$ is estimated by minimizing

\[
\mathbb{P}_{n}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}^{(k+1)}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}^{(k+1)}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta})\big).
\]

Iterating these two minimizers leads to the desired robust estimator. The above alternating strategy is summarized in Algorithm A in Appendix A of the Supplementary Material.
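A minimal sketch of this alternating scheme, reusing `huber_loss` and `K_sigmoid` from the sketches above (the zero initial values and the choice of optimizers are illustrative; Algorithm A in the Supplementary Material is the authors' full procedure):

```python
import numpy as np
from scipy.optimize import minimize

def fit_alternating(y, X, Z, U, tau, h, K=K_sigmoid, n_iter=20):
    """Alternately minimize the smoothed empirical Huber loss in (10)
    over (alpha, beta) and over eta."""
    p, q, r = X.shape[1], Z.shape[1], U.shape[1]
    theta = np.zeros(p + q)               # (alpha, beta)
    eta = np.zeros(r - 1)                 # eta = gamma_{-1} / gamma_1

    def smoothed_loss(th, et):
        w = K((U[:, 0] + U[:, 1:] @ et) / h)      # K_h(U1 + U2' eta)
        res = y - X @ th[:p] - (Z @ th[p:]) * w
        return huber_loss(res, tau).mean()

    for _ in range(n_iter):
        theta = minimize(lambda th: smoothed_loss(th, eta), theta, method="BFGS").x
        eta = minimize(lambda et: smoothed_loss(theta, et), eta, method="Nelder-Mead").x
    return theta[:p], theta[p:], eta
```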

2.2 Nonasymptotic properties

This section begins with the assumptions needed to establish the nonasymptotic properties. Let ${\bar{\boldsymbol{V}}}$ be ${\boldsymbol{V}}$ with $U_{1}$ removed, i.e., ${\bar{\boldsymbol{V}}}=(y,{\boldsymbol{X}},{\boldsymbol{Z}},{\boldsymbol{U}}_{2})$, and let $\bar{\omega}(\boldsymbol{\eta})=U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}$ and $\bar{\omega}=\bar{\omega}(\boldsymbol{\eta}^{*})$.

  1. (A1)

    The conditional random vectors ${\mathrm{E}}({\boldsymbol{X}}|{\boldsymbol{U}}_{2})\in\mathbb{R}^{p}$ and ${\mathrm{E}}({\boldsymbol{Z}}|{\boldsymbol{U}}_{2})\in\mathbb{R}^{q}$ given ${\boldsymbol{U}}_{2}$ are sub-Gaussian, and ${\boldsymbol{U}}_{2}$ is sub-Gaussian. There is a universal constant $K_{1}>0$ satisfying $\|{\mathrm{E}}({\boldsymbol{X}}|{\boldsymbol{U}}_{2})\|_{\psi_{2}}\leq K_{1}$, $\|{\mathrm{E}}({\boldsymbol{Z}}|{\boldsymbol{U}}_{2})\|_{\psi_{2}}\leq K_{1}$, and $\|{\boldsymbol{U}}_{2}\|_{\psi_{2}}\leq K_{1}$. For any ${\boldsymbol{u}}_{2}\in\mathbb{R}^{r-1}$, the matrices ${\mathrm{E}}({\boldsymbol{X}}{\boldsymbol{X}}^{\rm T}|{\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2})$, ${\mathrm{E}}({\boldsymbol{Z}}{\boldsymbol{Z}}^{\rm T}|{\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2})$, and ${\mathrm{E}}({\boldsymbol{U}}_{2}{\boldsymbol{U}}_{2}^{\rm T})$ are uniformly positive definite, and there is a universal constant $K_{0}>0$ satisfying $\lambda_{\min}({\mathrm{E}}({\boldsymbol{X}}{\boldsymbol{X}}^{\rm T}|{\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2}))\geq K_{0}$, $\lambda_{\min}({\mathrm{E}}({\boldsymbol{Z}}{\boldsymbol{Z}}^{\rm T}|{\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2}))\geq K_{0}$, and $\lambda_{\min}({\mathrm{E}}({\boldsymbol{U}}_{2}{\boldsymbol{U}}_{2}^{\rm T}))\geq K_{0}$.

  2. (A2)

    The error variable $\epsilon$ is independent of $({\boldsymbol{X}}^{\rm T},{\boldsymbol{Z}}^{\rm T},{\boldsymbol{U}}^{\rm T})^{\rm T}$ and satisfies ${\mathrm{E}}(\epsilon)=0$ and ${\mathrm{E}}(|\epsilon|^{2+\delta})=M_{\delta}<\infty$ with $\delta\geq 0$. Denote $M_{0}={\mathrm{E}}(|\epsilon|^{2})$.

  3. (A3)

    $0<{\mathrm{E}}[{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)]<1$ for any $\boldsymbol{\gamma}\in\Theta_{\gamma}$, and there is a constant $\delta_{u_{2}}>0$ satisfying $\sup_{\boldsymbol{\eta}\in\mathbb{S}^{r-1}}{\mathrm{E}}[{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}\,{\boldsymbol{1}}({\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}\geq 0)]\geq\delta_{u_{2}}$.

  4. (A4)

    For almost every ${\boldsymbol{u}}_{2}$, the density of $U_{1}$ conditional on ${\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2}$ is everywhere positive. The conditional density $f_{\varpi|\bar{{\boldsymbol{v}}}}(\varpi)$ of $\bar{\omega}$ given ${\bar{\boldsymbol{V}}}$ has a continuous derivative, and there is a constant $\kappa_{f}>0$ such that $f_{\varpi|\bar{{\boldsymbol{v}}}}(\varpi)$ and $|f_{\varpi|\bar{{\boldsymbol{v}}}}^{\prime}(\varpi)|$ are uniformly bounded from above by $\kappa_{f}$ over $(\varpi,\bar{{\boldsymbol{v}}})$. In addition, $f_{\varpi|\bar{{\boldsymbol{v}}}}(0)$ is uniformly bounded from below by $\delta_{f_{0}}$ over $\bar{{\boldsymbol{v}}}$, and $F_{\varpi|\bar{{\boldsymbol{v}}}}(0)$ is uniformly bounded from above by $\kappa_{F}$ over $\bar{{\boldsymbol{v}}}$, where $\delta_{f_{0}}>0$ and $0<\kappa_{F}<1$ are constants.

  5. (A5)

    The smooth function $K(\cdot)$ is twice differentiable and satisfies $K(-t)=1-K(t)$, and $K^{\prime}(\cdot)$ is symmetric around zero. Moreover, there is a universal constant $\kappa_{k}>0$ satisfying $\max\left\{\sup_{t\in\mathbb{R}}|K^{\prime}(t)|,\sup_{t\in\mathbb{R}}|K^{\prime\prime}(t)|,\int|K^{\prime}(t)|^{j}dt,\int|K^{\prime\prime}(t)|^{j}dt,\int|t||K^{\prime}(t)|^{j}dt\right\}\leq\kappa_{k}$ for $j=1,2$.

Remark 1.

Assumptions (A1)–(A5) are mild conditions for deriving the nonasymptotic bounds in Theorems 1–4 below. Assumption (A1) is the moment condition for the covariates. Assumption (A2) is imposed to control the moments of the error and to yield the adaptive nonasymptotic upper bounds; see Fan et al., 2017b; Wang et al., (2021); Han et al., (2022). Assumption (A3) is mild and easily verified in practice, and it is usually imposed in change-plane analysis; see Kang et al., (2017); Liu et al., (2024). Assumption (A4) is required to establish nonasymptotic properties in dealing with the indicator function; see Horowitz, (1993); Zhang et al., (2022). By Lemmas C1–C3 in Appendix C of the Supplementary Material, Assumption (A5) holds for commonly used smoothing functions such as (i) the cumulative distribution function of the standard normal distribution $K(u)=\Phi(u)$, (ii) the sigmoid function $K(u)=\{1+\exp(-u)\}^{-1}$, and (iii) the function $K(u)=\Phi(u)+u\phi(u)$; see Horowitz, (1993); Zhang et al., (2022).

Theorem 1.

Let $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$. If Assumptions (A1)–(A4) hold, then for some $\delta\geq 0$, the minimizer $\boldsymbol{\zeta}_{\tau}^{*}=((\boldsymbol{\theta}_{\tau}^{*})^{\rm T},(\boldsymbol{\eta}_{\tau}^{*})^{\rm T})^{\rm T}$ given in (9) satisfies

\[
\|\boldsymbol{\theta}_{\tau}^{*}-\boldsymbol{\theta}^{*}\|\leq 5K_{1}\frac{2M_{\delta}+C\|\boldsymbol{\theta}^{*}\|^{2+\delta}K_{1}^{2+3\delta/2}}{K_{0}(1-\kappa_{F})(1+\delta)\tau^{1+\delta}}
\]

and

\[
\|\boldsymbol{\eta}_{\tau}^{*}-\boldsymbol{\eta}^{*}\|\leq \frac{160K_{1}^{2}}{K_{0}\delta_{f_{0}}\delta_{u_{2}}\|\boldsymbol{\beta}^{*}\|^{2}}\left\{\frac{2M_{\delta}+C\|\boldsymbol{\theta}^{*}\|^{2+\delta}K_{1}^{2+3\delta/2}}{K_{0}(1-\kappa_{F})(1+\delta)\tau^{1+\delta}}\right\}^{2},
\]

where $C>0$ is a constant.

Theorem 1 states that the Huber error is of order $\tau^{-(1+\delta)}$ for $\|\boldsymbol{\theta}_{\tau}^{*}-\boldsymbol{\theta}^{*}\|$ and of order $\tau^{-(2+2\delta)}$ for $\|\boldsymbol{\eta}_{\tau}^{*}-\boldsymbol{\eta}^{*}\|$. As $\tau$ tends to infinity, the Huber loss becomes the ordinary quadratic loss, so the Huber error vanishes as expected. The next theorem studies the smooth error and the Huber error together.

Theorem 2.

If Assumptions (A1)–(A5) hold, then for some $\delta\geq 0$ and $h=o(1)$, the minimizer $\boldsymbol{\zeta}_{\tau,h}^{*}=((\boldsymbol{\theta}_{\tau,h}^{*})^{\rm T},(\boldsymbol{\eta}_{\tau,h}^{*})^{\rm T})^{\rm T}$ given in (8) satisfies

\[
\|\boldsymbol{\theta}_{\tau,h}^{*}-\boldsymbol{\theta}^{*}\|\leq 16K_{1}\left\{\frac{2M_{\delta}+C\|\boldsymbol{\theta}^{*}\|^{2+\delta}K_{1}^{2+3\delta/2}}{K_{0}(1-\kappa_{F})(1+\delta)\tau^{1+\delta}}+\frac{\kappa_{f}\kappa_{k}\|\boldsymbol{\beta}^{*}\|}{K_{0}(1-\kappa_{F})}h\right\}
\]

and

\[
\|\boldsymbol{\eta}_{\tau,h}^{*}-\boldsymbol{\eta}^{*}\|\leq\frac{64^{2}K_{1}^{2}\kappa_{f}\kappa_{k}}{\delta_{f_{0}}\delta_{u_{2}}\min\{\|\boldsymbol{\beta}^{*}\|,\|\boldsymbol{\beta}^{*}\|^{2}\}}\left\{\frac{2M_{\delta}+C\|\boldsymbol{\theta}^{*}\|^{2+\delta}K_{1}^{2+3\delta/2}}{K_{0}(1-\kappa_{F})(1+\delta)\tau^{1+\delta}}+\frac{\kappa_{f}\kappa_{k}\|\boldsymbol{\beta}^{*}\|}{K_{0}(1-\kappa_{F})}h\right\}^{2},
\]

where $C>0$ is a constant.

Theorem 2 reveals that $\|\boldsymbol{\theta}_{\tau,h}^{*}-\boldsymbol{\theta}^{*}\|$ and $\|\boldsymbol{\eta}_{\tau,h}^{*}-\boldsymbol{\eta}^{*}\|$ involve both the Huber error and the smooth error. The smoothness parameter $h$ can be of order $\tau^{-(1+\delta)}$, and because $K_{h}(t)$ approximates ${\boldsymbol{1}}(t\geq 0)$ as $h$ tends to zero, the upper bounds in Theorem 2 are of the same order as those in Theorem 1 when $h\rightarrow 0$. Thus, the deviations $\|\boldsymbol{\theta}_{\tau,h}^{*}-\boldsymbol{\theta}^{*}\|$ and $\|\boldsymbol{\eta}_{\tau,h}^{*}-\boldsymbol{\eta}^{*}\|$ are the sacrifices made in pursuit of robustification and smoothness. The next theorem provides the exponential-type deviations for the baseline parameter, the grouping difference parameter, and the grouping parameter.

Theorem 3.

If Assumptions (A1)–(A5) hold, then for some $\delta\geq 0$ and any $t>0$, when $h^{2}n/(p+q+r)\rightarrow\infty$ and $h=o(1)$, the estimator $\hat{\boldsymbol{\zeta}}_{\tau,h}=(\hat{\boldsymbol{\theta}}_{\tau,h}^{\rm T},\hat{\boldsymbol{\eta}}_{\tau,h}^{\rm T})^{\rm T}$ given in (10) satisfies, with probability at least $1-21\exp(-t)$,

\[
\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|\leq\frac{32}{K_{0}(1-\kappa_{F})}\,a(n,\tau) \tag{12}
\]

and

\[
\|\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*}\|\leq\frac{32}{K_{0}\|\boldsymbol{\beta}^{*}\|\sqrt{\delta_{f_{0}}\delta_{k}K_{0}(1-\kappa_{F})}}\,h^{1/2}a(n,\tau), \tag{13}
\]

where $\nu_{0}$ is a constant depending only on the constants $\{\kappa_{F},\kappa_{f},\kappa_{k},K_{1},\|\boldsymbol{\xi}^{*}\|\}$, and

\[
a(n,\tau)=\sqrt{3\nu_{0}(p+q+r+2t)/n}+\frac{2^{2+\delta}M_{\delta}\{(3+\delta)K_{1}\|\boldsymbol{\beta}^{*}\|\}^{2+\delta}}{\tau^{1+\delta}}.
\]

With an appropriate choice of $\tau$ when $\delta=0$, such as $\tau=O((n/t)^{1/2})$ with $t=\log(n)$, the nonasymptotic property of the sub-Gaussian estimator $\hat{\boldsymbol{\theta}}_{\tau,h}$ in Theorem 3 demonstrates that the deviation $\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|$ adapts to the sample size, dimension, robustification parameter $\tau$, and moments in pursuit of the optimal tradeoff between bias and robustness, and the deviation $\|\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*}\|$ adapts to the extra smoothness parameter $h$ to achieve smoothness. Adaptation to the robustification parameter $\tau$ arises from pursuing robustness for linear regression with heavy-tailed errors. The deviations $\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|$ and $\|\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*}\|$ coincide with those of the smoothed OLS estimator when $\tau\rightarrow\infty$, because $a(n,\tau)=\sqrt{3\nu_{0}(p+q+r+2t)/n}$ as $\tau\rightarrow\infty$. The next theorem derives the nonasymptotic Bahadur representation of the robust estimators given in (10).

Theorem 4.

Let ${\boldsymbol{W}}(\boldsymbol{\eta})=({\boldsymbol{X}}^{\rm T},{\boldsymbol{Z}}^{\rm T}{\boldsymbol{1}}(\bar{\omega}(\boldsymbol{\eta})\geq 0))^{\rm T}$ and ${\widetilde{\boldsymbol{W}}}_{h}(\boldsymbol{\eta})=({\boldsymbol{X}}^{\rm T},{\boldsymbol{Z}}^{\rm T}K_{h}(\bar{\omega}(\boldsymbol{\eta})))^{\rm T}$, where $\bar{\omega}(\boldsymbol{\eta})=U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}$. If the assumptions in Theorem 3 are satisfied, then with probability at least $1-32\exp(-t)$ we have

\[
\begin{split}
\bigg\|\Sigma_{W}^{1/2}(\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*})-&\Sigma_{W}^{-1/2}\mathbb{P}_{n}\left\{{\widetilde{\boldsymbol{W}}}_{h}(\boldsymbol{\eta}^{*})\psi_{\tau}(y-{\widetilde{\boldsymbol{W}}}_{h}(\boldsymbol{\eta}^{*})^{\rm T}\boldsymbol{\theta}^{*})\right\}\bigg\|\\
\leq&\left\{\nu_{1}\sqrt{(p+q+1+2t)/n}+\nu_{2}h^{1/2}\right\}a(n,\tau)
\end{split} \tag{14}
\]

and

\[
\begin{split}
\bigg\|\Sigma_{U_{2}}^{1/2}(\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*})-&h\Sigma_{U_{2}}^{-1/2}\mathbb{P}_{n}\left\{{\boldsymbol{U}}_{2}{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}^{*}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}^{*})\psi_{\tau}(y-{\widetilde{\boldsymbol{W}}}_{h}(\boldsymbol{\eta}^{*})^{\rm T}\boldsymbol{\theta}^{*})\right\}\bigg\|\\
\leq&\left\{\nu_{3}\sqrt{(r+2t)/(nh)}+\nu_{4}\left(h+\tau^{-(1+\delta)}\right)\right\}h^{1/2}a(n,\tau),
\end{split} \tag{15}
\]

where $a(n,\tau)$ is as defined in Theorem 3; $\nu_{1}>0$, $\nu_{2}>0$, $\nu_{3}>0$, and $\nu_{4}>0$ are constants depending only on the constants $\{\kappa_{F},\kappa_{f},\kappa_{k},K_{0},K_{1},\|\boldsymbol{\xi}^{*}\|\}$; and

\[
\Sigma_{W}=\boldsymbol{\mathrm{P}}\left\{{\boldsymbol{W}}(\boldsymbol{\eta}^{*}){\boldsymbol{W}}(\boldsymbol{\eta}^{*})^{\rm T}\right\},\quad \Sigma_{U_{2}}=K^{\prime}(0)\boldsymbol{\mathrm{P}}\left\{f_{\bar{\omega}|{\bar{\boldsymbol{v}}}}(0)({\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}^{*})^{2}{\boldsymbol{U}}_{2}{\boldsymbol{U}}_{2}^{\rm T}\right\}.
\]

Theorem 4 shows that the remainder of the Bahadur representation of $\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}$ achieves the rate $h^{1/2}a(n,\tau)$, which is the same as that for $\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*}$. Because of the rate restriction $h^{2}n/(p+q+r)\rightarrow\infty$ with $h=o(1)$, the remainder of the Bahadur representation in inequality (15) does not exhibit the sub-exponential behavior considered by Sun et al., (2020); Chen and Zhou, (2020). The reason is that model (1) involves a change plane approximated by a smooth function. To the best of the authors’ knowledge, this is the first time that this type of nonasymptotic Bahadur representation has been reported in the literature, especially for the robust estimator $\hat{\boldsymbol{\eta}}_{\tau,h}$ of the grouping parameter, with previous studies reporting only polynomial-type deviation bounds; see Liu et al., (2024); Zhang et al., (2022). The classical asymptotic results follow conveniently from the Bahadur representation, and Theorems 3 and 4 show that the robustification parameter $\tau$ and the smoothness parameter $h$ play the same role as the bandwidth in constructing classical nonparametric estimators.

2.3 Implementation

The theoretical properties in Section 2.2 guarantee that the robust estimation performs well with appropriate choices of the robustification parameter $\tau$ and the smoothness parameter $h$. Because the robustification parameter $\tau$ is treated as a tuning parameter that balances bias and robustness, it is natural to select an appropriate $\tau$ in practice by cross-validation (CV). However, as noted by Chen and Zhou, (2020) and Wang et al., (2021), because $M_{\delta}={\mathrm{E}}(|\epsilon|^{2+\delta})$ is typically unknown in practice, its empirical OLS estimator $\hat{M}_{\delta}=(n-p-q)^{-1}\sum_{i=1}^{n}(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\hat{\boldsymbol{\alpha}}_{\tau,h}-{\boldsymbol{Z}}_{i}^{\rm T}\hat{\boldsymbol{\beta}}_{\tau,h}{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\hat{\boldsymbol{\gamma}}_{\tau,h}\geq 0))^{2+\delta}$ performs poorly when the errors are heavy-tailed. Instead, there are two good alternatives (Chen and Zhou, 2020): (i) an adaptive technique based on Lepski's method (Lepskii, 1992) and (ii) a Huber-type method that solves a so-called censored equation (Hahn et al., 1990); see Chen and Zhou, (2020); Wang et al., (2021) for details. For the smoothness parameter $h$, CV is always a natural choice. Alternatively, in light of the theoretical conditions on $h$, the rule of thumb $h_{n}=c_{h}\hat{\sigma}_{u}\log(n)/\sqrt{n}$ suggested by Seo and Linton, (2007), Li et al., (2021), and Zhang et al., (2022) is recommended to reduce computation, where $c_{h}$ is a constant and $\hat{\sigma}_{u}=\sqrt{(n-r)^{-1}\sum_{i=1}^{n}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\hat{\boldsymbol{\eta}}^{ols})^{2}}$ is the estimated standard deviation of ${\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}$, with $(\hat{\boldsymbol{\alpha}}^{ols},\hat{\boldsymbol{\beta}}^{ols},\hat{\boldsymbol{\eta}}^{ols})$ being the estimator of $(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})$ obtained using the ordinary quadratic loss.
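As an illustration, the rule of thumb takes only a few lines (the default $c_{h}=1$ below is an illustrative assumption, not a value prescribed by the paper):

```python
import numpy as np

def rule_of_thumb_h(U, eta_ols, c_h=1.0):
    """Rule-of-thumb smoothness parameter h_n = c_h * sigma_u * log(n) / sqrt(n),
    where eta_ols is a pilot estimate from the ordinary quadratic loss."""
    n, r = U.shape
    w = U[:, 0] + U[:, 1:] @ eta_ols
    sigma_u = np.sqrt(np.sum(w ** 2) / (n - r))
    return c_h * sigma_u * np.log(n) / np.sqrt(n)
```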

Attention now turns to the implementation of subgroup-classifier learning, with the parameters $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ and $\boldsymbol{\eta}$ estimated iteratively as follows. Let $\ell_{\tau}(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})$ be the loss function for model (1), i.e.,

\[
\ell_{\tau}(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})=\sum_{i=1}^{n}L_{\tau}\left(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta}\geq 0)\right),
\]

and let the smoothed loss function be

\[
\tilde{\ell}_{\tau,h}(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})=\sum_{i=1}^{n}L_{\tau}\left(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta})\right). \tag{16}
\]

For given 𝜼(k)\boldsymbol{\eta}^{(k)}, one obtains 𝜶(k+1)\boldsymbol{\alpha}^{(k+1)} and 𝜷(k+1)\boldsymbol{\beta}^{(k+1)} by minimizing the smoothed loss function

\[
(\boldsymbol{\alpha}^{(k+1)},\boldsymbol{\beta}^{(k+1)})=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in\Theta_{\alpha},\boldsymbol{\beta}\in\Theta_{\beta}}\tilde{\ell}_{\tau,h}(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta}^{(k)}),
\]

and for given $\boldsymbol{\alpha}^{(k+1)}$ and $\boldsymbol{\beta}^{(k+1)}$, one estimates $\boldsymbol{\eta}^{(k+1)}$ by

\[
\boldsymbol{\eta}^{(k+1)}=\mathop{\rm argmin}_{\boldsymbol{\eta}\in\Theta_{\eta}}\tilde{\ell}_{\tau,h}(\boldsymbol{\alpha}^{(k+1)},\boldsymbol{\beta}^{(k+1)},\boldsymbol{\eta}).
\]

Iterating these two minimizers leads to the desired robust estimators. These steps are summarized, together with the multiplier bootstrap calibration, in Algorithm 1 in Appendix A of the Supplementary Material, which provides the strategy for estimating the confidence intervals of the estimators $\hat{\boldsymbol{\alpha}}$, $\hat{\boldsymbol{\beta}}$, and $\hat{\boldsymbol{\eta}}$.

Note that herein, $\boldsymbol{\alpha}^{(k+1)}$ and $\boldsymbol{\beta}^{(k+1)}$ for given $\boldsymbol{\eta}^{(k)}$ are obtained by the robust data-adaptive method proposed by Wang et al., (2021). Specifically, $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ is estimated and $\tau$ is calibrated simultaneously by solving the following system of equations:

\[
\left\{\begin{array}{l}
\sum_{i=1}^{n}\psi_{\tau}(y_{i}-{\boldsymbol{W}}_{i}^{\rm T}\boldsymbol{\theta}){\boldsymbol{W}}_{i}=0,\\
(\tau^{2}n)^{-1}\sum_{i=1}^{n}\min\{(y_{i}-{\boldsymbol{W}}_{i}^{\rm T}\boldsymbol{\theta})^{2},\tau^{2}\}-n^{-1}(d+z)=0,
\end{array}\right.
\]

where $d=p+q-1$, ${\boldsymbol{W}}_{i}=({\boldsymbol{X}}_{i}^{\rm T},{\boldsymbol{Z}}_{i}^{\rm T}{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta}^{(k)}\geq 0))^{\rm T}$, and $z=\log(n)$ as suggested by Wang et al., (2021). The initial values $\boldsymbol{\theta}^{(0)}=\boldsymbol{\theta}^{(ols)}$ and $\tau^{(0)}=\hat{\sigma}_{\epsilon}\sqrt{n/(d+z)}$ are set using the ordinary quadratic loss, where $\hat{\sigma}_{\epsilon}^{2}=(n-p-q-r)^{-1}\sum_{i=1}^{n}(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\hat{\boldsymbol{\alpha}}^{ols}-{\boldsymbol{Z}}_{i}^{\rm T}\hat{\boldsymbol{\beta}}^{ols}{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\hat{\boldsymbol{\eta}}^{ols}\geq 0))^{2}$, with $(\hat{\boldsymbol{\alpha}}^{ols},\hat{\boldsymbol{\beta}}^{ols},\hat{\boldsymbol{\eta}}^{ols})$ being the estimator of $(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})$ obtained using the ordinary quadratic loss.
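A fixed-point sketch of this joint calibration, with a simple iteratively reweighted least squares step standing in for the solver of Wang et al., (2021):

```python
import numpy as np

def calibrate_theta_tau(y, W, n_iter=50, tol=1e-8):
    """Alternately solve the Huber score equation for theta (via IRLS)
    and the censored second-moment equation for tau."""
    n, d1 = W.shape
    d, z = d1 - 1, np.log(n)                       # d = p + q - 1, z = log(n)
    theta = np.linalg.lstsq(W, y, rcond=None)[0]   # OLS initial value
    tau = np.std(y - W @ theta) * np.sqrt(n / (d + z))
    for _ in range(n_iter):
        res = y - W @ theta
        # IRLS weights: psi_tau(r) = w * r with w = min(1, tau / |r|)
        sw = np.sqrt(np.where(np.abs(res) <= tau, 1.0, tau / np.abs(res)))
        theta_new = np.linalg.lstsq(W * sw[:, None], y * sw, rcond=None)[0]
        # censored equation: sum min(r^2, tau^2) = tau^2 (d + z)
        res = y - W @ theta_new
        tau_new = np.sqrt(np.sum(np.minimum(res ** 2, tau ** 2)) / (d + z))
        done = np.linalg.norm(theta_new - theta) + abs(tau_new - tau) < tol
        theta, tau = theta_new, tau_new
        if done:
            break
    return theta, tau
```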

3 Robust subgroup testing

Before learning the subgroup classifier, it is of interest to test for the existence of subgroups, which guards against the identifiability problem of $\boldsymbol{\eta}$. This section considers the test problem (2). Recall the loss function in (9), i.e.,

\[
L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\big), \tag{17}
\]

the derivative of which with respect to $\boldsymbol{\beta}$ under the alternative hypothesis is

\[
\varphi({\boldsymbol{V}},\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma})={\boldsymbol{Z}}\,{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\,\psi_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\big),
\]

and with respect to $\boldsymbol{\alpha}$ under the null hypothesis is

\[
\varphi_{0}({\boldsymbol{V}},\boldsymbol{\alpha})={\boldsymbol{X}}\psi_{\tau}(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}),
\]

where $\psi_{\tau}(u)$ is the first derivative of the Huber loss (5), defined as $\psi_{\tau}(u)=\hbox{sgn}(u)\min\{|u|,\tau\}$.

3.1 Robust estimation under null hypothesis

Under the null hypothesis, model (1) reduces to the ordinary linear regression model with heavy-tailed errors, i.e.,

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+\epsilon_{i}. \tag{18}
\]

Parametric estimation in model (18) is well studied in the literature; see Huber, (1964, 1973); Fan et al., 2017b; Sun et al., (2020); Wang et al., (2021); Han et al., (2022); Chen and Zhou, (2020); Zhou et al., (2018), among others. Let $\hat{\boldsymbol{\alpha}}_{\tau}$ be the estimate of $\boldsymbol{\alpha}_{\tau}$ under the null hypothesis, i.e.,

\[
\hat{\boldsymbol{\alpha}}_{\tau}=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in\Theta_{\alpha}}\mathbb{P}_{n}L_{\tau}(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}). \tag{19}
\]

Before establishing the asymptotic properties of the robust estimator under the null hypothesis and of the test statistic, the following assumptions are required.

  1. (B1)

    The random vector ${\boldsymbol{X}}$ is sub-Gaussian, $J=\{\boldsymbol{\mathrm{P}}[{\boldsymbol{X}}{\boldsymbol{X}}^{\rm T}]\}^{-1}\in\mathbb{R}^{p\times p}$ is a finite and positive definite deterministic matrix, and there is a universal constant $K_{1}>0$ satisfying $\|{\boldsymbol{X}}\|_{\psi_{2}}\leq K_{1}$ and $\|{\boldsymbol{Z}}\|_{\psi_{2}}\leq K_{1}$.

  2. (B2)

    The error variable $\epsilon$ is independent of $({\boldsymbol{X}}^{\rm T},{\boldsymbol{Z}}^{\rm T},{\boldsymbol{U}}^{\rm T})^{\rm T}$ and satisfies $\boldsymbol{\mathrm{P}}(\epsilon)=0$ and $\boldsymbol{\mathrm{P}}(|\epsilon|^{2+\delta})=M_{\delta}<\infty$ with $\delta\geq 0$.

  3. (B3)

    $0<\boldsymbol{\mathrm{P}}[{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)]<1$ for any $\boldsymbol{\gamma}\in\Theta_{\gamma}\subseteq\mathbb{R}^{r}$.

Remark 2.

Assumption (B1) is the moment condition of covariates for establishing the nonasymptotic properties under the null hypothesis and deriving the asymptotic distributions. Assumption (B2) is the same as Assumption (A2), and Assumption (B3) is the same as Assumption (A3).

Lemma 1.

(Chen and Zhou, (2020)) If Assumptions (B1)–(B3) hold, then for any $t>0$ and $v\geq v_{2+\delta}^{1/(2+\delta)}$, the estimator $\hat{\boldsymbol{\alpha}}_{\tau}$ given in (19) with $\tau=v(\frac{n}{p+t})^{1/(2+\delta)}$ satisfies

\[
\boldsymbol{\mathrm{P}}\left\{\left\|J^{-1/2}(\hat{\boldsymbol{\alpha}}_{\tau}-\boldsymbol{\alpha}^{*})\right\|\geq c_{1}v\sqrt{\frac{p+t}{n}}\right\}\leq 2e^{-t}
\quad\mbox{and}\quad
\boldsymbol{\mathrm{P}}\left\{\left\|J^{-1/2}(\hat{\boldsymbol{\alpha}}_{\tau}-\boldsymbol{\alpha}^{*})-J^{-1/2}\mathbb{P}_{n}{\boldsymbol{X}}\psi_{\tau}(\epsilon)\right\|\geq c_{2}v\sqrt{\frac{p+t}{n}}\right\}\leq 2e^{-t} \tag{20}
\]

as long as $n\geq c_{3}(p+t)$, where $c_{1}$–$c_{3}$ are constants depending only on $K_{1}$.

3.2 Robust test statistic

As discussed by Liu et al., (2024), for any known $\boldsymbol{\gamma}\in\Theta_{\gamma}$, it is natural to consider an SST statistic for testing $\boldsymbol{\beta}={\boldsymbol{0}}$, i.e.,

\[
\tilde{T}_{n}(\boldsymbol{\gamma})=n^{-1}\|\mathbb{P}_{n}\varphi({\boldsymbol{V}},\hat{\boldsymbol{\alpha}}_{\tau},{\boldsymbol{0}},\boldsymbol{\gamma})\|^{2}_{{\tilde{V}}(\boldsymbol{\gamma})^{-1}}, \tag{21}
\]

where $\hat{\boldsymbol{\alpha}}_{\tau}$ is given in (19) and ${\tilde{V}}(\boldsymbol{\gamma})=\mathbb{P}_{n}\{\varphi({\boldsymbol{V}},\hat{\boldsymbol{\alpha}}_{\tau},{\boldsymbol{0}},\boldsymbol{\gamma})-\hat{G}(\boldsymbol{\gamma})\hat{J}\varphi_{0}({\boldsymbol{V}},\hat{\boldsymbol{\alpha}}_{\tau})\}^{\otimes 2}$. Here, ${\hat{G}}(\boldsymbol{\gamma})$ and ${\hat{J}}$ are consistent estimators of $G(\boldsymbol{\gamma})$ and $J$, respectively, where $J$ is as defined in Assumption (B1) and

\[
G(\boldsymbol{\gamma})=\boldsymbol{\mathrm{P}}\{{\boldsymbol{Z}}{\boldsymbol{X}}^{\rm T}{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0){\boldsymbol{1}}(|y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}^{*}|\leq\tau)\}\in\mathbb{R}^{q\times p}.
\]
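For a fixed $\boldsymbol{\gamma}$, the statistic can be computed by plug-in, as in the following sketch; the sample versions of $J$ and $G(\boldsymbol{\gamma})$ and the normalization of the averaged score to the $\chi^{2}_{q}$ scale of Lemma 2 are our reading of (21), not the authors' exact implementation.

```python
import numpy as np

def sst_statistic(y, X, Z, U, gamma, alpha_hat, tau):
    """Squared score test statistic for a fixed gamma (plug-in sketch)."""
    n = len(y)
    ind = (U @ gamma >= 0).astype(float)
    res = y - X @ alpha_hat
    psi = np.sign(res) * np.minimum(np.abs(res), tau)    # psi_tau(y - X'alpha_hat)
    phi = Z * (ind * psi)[:, None]                       # phi(V_i, alpha_hat, 0, gamma)
    phi0 = X * psi[:, None]                              # phi_0(V_i, alpha_hat)
    J_hat = np.linalg.inv(X.T @ X / n)                   # J = {P[XX']}^{-1}
    G_hat = (Z * (ind * (np.abs(res) <= tau))[:, None]).T @ X / n
    proj = phi - phi0 @ (G_hat @ J_hat).T                # centered score terms
    V_hat = proj.T @ proj / n                            # V_tilde(gamma)
    score = phi.mean(axis=0)
    return n * score @ np.linalg.solve(V_hat, score)     # ~ chi^2_q under H_0
```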
Lemma 2.

If Assumptions (B1)–(B3) hold, then for any fixed $\boldsymbol{\gamma}\in\Theta_{\gamma}$, $\tilde{T}_{n}(\boldsymbol{\gamma})$ converges in distribution to a $\chi^{2}$ distribution with $q$ degrees of freedom under $H_{0}$ as $n\rightarrow\infty$.

Although the unknown parameter $\boldsymbol{\gamma}$ prevents $\tilde{T}_{n}(\boldsymbol{\gamma})$ from being used directly in practice, Lemma 2 reveals that the asymptotic distribution of $\tilde{T}_{n}(\boldsymbol{\gamma})$ is free of the nuisance parameter $\boldsymbol{\gamma}$. Thus, the supremum and the weighted average of $\tilde{T}_{n}(\boldsymbol{\gamma})$ over $\boldsymbol{\gamma}$ should guarantee correct type-I errors, which motivates constructing the robust test statistic as a weighted average of $\tilde{T}_{n}(\boldsymbol{\gamma})$ over the parametric space $\Theta_{\gamma}$.

Fan et al., 2017a studied the supremum of the SST statistic $\tilde{T}_{n}(\boldsymbol{\gamma})$ over the grouping parameter $\boldsymbol{\gamma}$ for a semiparametric model, i.e.,

\[
\tilde{T}_{n}=\sup_{\boldsymbol{\gamma}\in\Theta_{\gamma}}\left\{n^{-1}\|\mathbb{P}_{n}\varphi({\boldsymbol{V}},\hat{\boldsymbol{\alpha}}_{\tau},{\boldsymbol{0}},\boldsymbol{\gamma})\|^{2}_{{\tilde{V}}(\boldsymbol{\gamma})^{-1}}\right\}. \tag{22}
\]

The test statistic $\tilde{T}_{n}$ has been investigated widely in the literature; see Andrews and Ploberger, (1994, 1995); Davies, (1977); Song et al., (2009); Shen and Qu, (2020); Liu et al., (2024). It is easy to extend the SST to model (1) with heavy-tailed errors according to the Bahadur representation in Theorem 4.

When the dimension of the parametric space $\Theta_{\gamma}$ is large, searching for the supremum over $\Theta_{\gamma}$ may cause $\tilde{T}_{n}$ to lose power in practice and is computationally time-consuming. To avoid these drawbacks, proposed in this section is a robust test procedure of the WAST type first introduced by Liu et al., (2024).

The proposed robust WAST (RWAST) statistic is

\[
T_{n}=\frac{1}{n(n-1)}\sum_{i\neq j}\omega_{ij}{\boldsymbol{Z}}_{i}^{\rm T}{\boldsymbol{Z}}_{j}\psi_{\tau}(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\hat{\boldsymbol{\alpha}}_{\tau})\psi_{\tau}(y_{j}-{\boldsymbol{X}}_{j}^{\rm T}\hat{\boldsymbol{\alpha}}_{\tau}),
\tag{23}
\]

where

\[
\omega_{ij}=\frac{1}{4}+\frac{1}{2\pi}\arctan\left(\frac{\varrho_{ij}}{\sqrt{1-\varrho_{ij}^{2}}}\right)\quad\mbox{if}~i\neq j,
\tag{24}
\]

and $\varrho_{ij}={\boldsymbol{U}}_{i}^{\rm T}{\boldsymbol{U}}_{j}(\|{\boldsymbol{U}}_{i}\|\|{\boldsymbol{U}}_{j}\|)^{-1}$. As noted by Liu et al., (2024), there is a Bayesian explanation for the weight $\omega_{ij}$. In fact, Lemma D1 in the Supplementary Material shows that

\[
\omega_{ij}=\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}\geq 0){\boldsymbol{1}}({\boldsymbol{U}}_{j}^{\rm T}\boldsymbol{\gamma}\geq 0)w(\boldsymbol{\gamma})d\boldsymbol{\gamma},
\]

where $w(\boldsymbol{\gamma})$ is the standard multivariate Gaussian density; any other weight satisfying $w(\boldsymbol{\gamma})\geq 0$ for all $\boldsymbol{\gamma}\in\Theta_{\gamma}$ and $\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}w(\boldsymbol{\gamma})d\boldsymbol{\gamma}=1$ could be chosen instead. In the Bayesian motivation, the grouping parameter $\boldsymbol{\gamma}$ has a prior with density $w(\boldsymbol{\gamma})$. Because the goal herein is to test for the existence of subgroups rather than to estimate the grouping parameter, no requirement is placed on the posterior distribution.

The choice of the weight affects the computation of the test statistic because of the numerical integration over $\mathbb{R}^{q}$. Taking the weight as the standard multivariate Gaussian density offers good performance in practice, as illustrated in the simulation studies in Section 4 and the case studies in Section 5. To assess sensitivity to the choice of weight in robust regression, numerical studies compared the closed-form $\omega_{ij}$ in (24) with the approximated $\omega_{ij}$ in (E.3) of Appendix E.3 in the Supplementary Material. Compared with the approximated $\omega_{ij}$, the test statistic with $\omega_{ij}$ in (24) has uniformly higher power and takes only 10% of the computation time when $N=10\,000$. These results strongly recommend the closed-form RWAST in (24), which can be computed directly as in the sketch below.
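Because (24) is available in closed form, $T_{n}$ reduces to simple matrix arithmetic. The following Python sketch evaluates (23) with the weights (24), using the identity $\arctan(\varrho/\sqrt{1-\varrho^{2}})=\arcsin(\varrho)$ for $|\varrho|\leq 1$; the function name and interface are illustrative conventions, not the authors' code.

```python
import numpy as np

def rwast(y, X, Z, U, alpha_hat, tau):
    """RWAST statistic T_n of (23) with the closed-form weights (24)."""
    n = len(y)
    s = np.clip(y - X @ alpha_hat, -tau, tau)          # psi_tau(y_i - X_i'alpha_hat)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)  # normalize the rows of U
    rho = np.clip(Un @ Un.T, -1.0, 1.0)                # varrho_ij, guarding rounding error
    W = 0.25 + np.arcsin(rho) / (2.0 * np.pi)          # weights (24)
    A = W * (Z @ Z.T) * np.outer(s, s)                 # summands of (23)
    np.fill_diagonal(A, 0.0)                           # restrict the sum to i != j
    return A.sum() / (n * (n - 1))
```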

To establish the asymptotic distribution of RWAST, additional notation is introduced below. Denote the kernel of a U-statistic under the null hypothesis by

\[
\begin{split}
h({\boldsymbol{V}}_{i},{\boldsymbol{V}}_{j})=&\,\omega_{ij}{\boldsymbol{Z}}_{i}^{\rm T}{\boldsymbol{Z}}_{j}\psi_{\tau}(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}^{*})\psi_{\tau}(y_{j}-{\boldsymbol{X}}_{j}^{\rm T}\boldsymbol{\alpha}^{*})\\
&+\varphi_{0}({\boldsymbol{V}}_{i},\boldsymbol{\alpha}^{*})^{\rm T}K_{j}+K_{i}^{\rm T}\varphi_{0}({\boldsymbol{V}}_{j},\boldsymbol{\alpha}^{*})+\varphi_{0}({\boldsymbol{V}}_{i},\boldsymbol{\alpha}^{*})^{\rm T}H\varphi_{0}({\boldsymbol{V}}_{j},\boldsymbol{\alpha}^{*}),
\end{split}
\tag{25}
\]

where $\varphi_{0}({\boldsymbol{V}},\boldsymbol{\alpha})={\boldsymbol{X}}\psi_{\tau}(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha})$, and

\[
H=\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}J^{\rm T}G(\boldsymbol{\gamma})^{\rm T}G(\boldsymbol{\gamma})J\,w(\boldsymbol{\gamma})d\boldsymbol{\gamma}
\quad\mbox{and}\quad
K_{i}=\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}J^{\rm T}G(\boldsymbol{\gamma})^{\rm T}\varphi({\boldsymbol{V}}_{i},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma})w(\boldsymbol{\gamma})d\boldsymbol{\gamma}
\]

with $\varphi({\boldsymbol{V}},\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma})={\boldsymbol{Z}}{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\psi_{\tau}(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0))$.

Theorem 5.

If Assumptions (B1)–(B3) hold, then under the null hypothesis, we have

\[
nT_{n}-\mu_{0}\stackrel{\cal{L}}{\longrightarrow}\nu,
\]

where $\stackrel{\cal{L}}{\longrightarrow}$ denotes convergence in distribution, $\mu_{0}=n\{\boldsymbol{\mathrm{P}}\psi_{\tau}(\epsilon)\}^{2}\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}\{\boldsymbol{\mathrm{P}}{\boldsymbol{Z}}{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\}^{2}w(\boldsymbol{\gamma})d\boldsymbol{\gamma}+\boldsymbol{\mathrm{P}}[\varphi_{0}({\boldsymbol{V}},\boldsymbol{\alpha}^{*})^{\rm T}H\varphi_{0}({\boldsymbol{V}},\boldsymbol{\alpha}^{*})]+2\boldsymbol{\mathrm{P}}[\varphi_{0}({\boldsymbol{V}}_{1},\boldsymbol{\alpha}^{*})^{\rm T}K_{1}]$, $\nu$ is a random variable of the form $\nu=\sum_{j=1}^{\infty}\lambda_{j}(\chi^{2}_{1j}-1)$, and $\chi^{2}_{11},\chi^{2}_{12},\cdots$ are independent $\chi^{2}_{1}$ variables, i.e., $\nu$ has the characteristic function

\[
\boldsymbol{\mathrm{P}}\left[e^{it\nu}\right]=\prod_{j=1}^{\infty}(1-2it\lambda_{j})^{-\frac{1}{2}}e^{-it\lambda_{j}}.
\]

Here, $i=\sqrt{-1}$ is the imaginary unit, and $\{\lambda_{j}\}$ are the eigenvalues of the kernel $h({\boldsymbol{v}}_{1},{\boldsymbol{v}}_{2})$ under $f({\boldsymbol{v}},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})$, i.e., they are the solutions of $\lambda_{j}g_{j}({\boldsymbol{v}}_{2})=\int_{0}^{\infty}h({\boldsymbol{v}}_{1},{\boldsymbol{v}}_{2})g_{j}({\boldsymbol{v}}_{1})f({\boldsymbol{v}}_{1},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})d{\boldsymbol{v}}_{1}$ for nonzero $g_{j}$, where $f({\boldsymbol{v}},\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma})$ is the density of ${\boldsymbol{V}}$.
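When a finite set of leading eigenvalues $\{\lambda_{j}\}$ is available, e.g., from an eigendecomposition of the empirical kernel matrix, the null distribution of $nT_{n}-\mu_{0}$ can be approximated by Monte Carlo simulation of a truncated series, as in the following sketch. The truncation is an assumption: its accuracy depends on the discarded tail of the eigenvalues, and the eigenvalues shown in the usage line are hypothetical.

```python
import numpy as np

def simulate_nu(lambdas, n_draws=100_000, seed=0):
    """Draw from the truncated series nu = sum_j lambda_j * (chi2_1j - 1)."""
    rng = np.random.default_rng(seed)
    chi2 = rng.chisquare(df=1, size=(n_draws, len(lambdas)))
    return (chi2 - 1.0) @ np.asarray(lambdas)

# Example: a 95% critical value from hypothetical leading eigenvalues.
draws = simulate_nu([0.9, 0.5, 0.2, 0.1])
critical_value = np.quantile(draws, 0.95)
```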

Investigated next is the power of the proposed test statistic under two types of alternative hypotheses under which subgroups exist. Considered first is the global alternative $H_{1g}$: $\boldsymbol{\beta}=\boldsymbol{\xi}$, where $\boldsymbol{\xi}\in\Theta_{\boldsymbol{\beta}}\backslash\{{\boldsymbol{0}}\}$ is fixed. Theorem 6 provides the asymptotic distribution of the test statistic $T_{n}$ under the global alternative.

Theorem 6.

If Assumptions (B1)–(B3) hold, then under the global alternative $H_{1g}$, we have

\[
\sqrt{n}(T_{n}-\mu_{1})\stackrel{\cal{L}}{\longrightarrow}\mathcal{N}(0,\sigma^{2}_{\xi}),
\]

where $\mu_{1}=\boldsymbol{\mathrm{P}}[h({\boldsymbol{V}}_{1},{\boldsymbol{V}}_{2})]$ and $\sigma^{2}_{\xi}=4\,\hbox{Var}(\boldsymbol{\mathrm{P}}[h({\boldsymbol{V}}_{1},{\boldsymbol{V}}_{2})|{\boldsymbol{V}}_{1}])$.

To derive the asymptotic distribution under the local alternative hypothesis $H_{1l}:\boldsymbol{\beta}=n^{-1/2}\boldsymbol{\xi}$, where $\boldsymbol{\xi}\in\Theta_{\beta}$ is a fixed vector, an additional assumption is required.

  1. (B4)

There is a positive function $b({\boldsymbol{v}},\boldsymbol{\xi})$ of ${\boldsymbol{v}}$ depending on $\boldsymbol{\alpha}^{*}$ and $\boldsymbol{\gamma}^{*}$ such that

\[
\left|\boldsymbol{\xi}^{\rm T}\frac{\partial f({\boldsymbol{v}};\boldsymbol{\alpha}^{*},r_{n}\boldsymbol{\xi},\boldsymbol{\gamma}^{*})/\partial\boldsymbol{\beta}}{f({\boldsymbol{v}};\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})}\right|\leq b({\boldsymbol{v}},\boldsymbol{\xi}),
\]

and $\boldsymbol{\mathrm{P}}[b({\boldsymbol{V}},\boldsymbol{\xi})^{2}]$ and $\boldsymbol{\mathrm{P}}[\phi_{k}({\boldsymbol{V}})^{2}b({\boldsymbol{V}},\boldsymbol{\xi})]$ for all $k=1,\cdots$ are bounded by $C_{\boldsymbol{\xi}}$, where $\boldsymbol{\xi}\in\Theta_{\beta}$, $r_{n}=o(1)$, $C_{\boldsymbol{\xi}}>0$ is a constant depending on $\boldsymbol{\xi}$, $\phi_{k}(\cdot)$ is as defined in Theorem 5, and ${\boldsymbol{V}}$ is generated from the null distribution with density $f({\boldsymbol{v}};\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})$.

Theorem 7.

If Assumptions (B1)–(B4) hold, then under the local alternative hypothesis $H_{1l}$, i.e., $\boldsymbol{\beta}=n^{-1/2}\boldsymbol{\xi}$ with a fixed vector $\boldsymbol{\xi}\in\Theta_{\beta}$, we have

\[
nT_{n}-\mu_{0}\stackrel{\cal{L}}{\longrightarrow}\nu,
\]

where $\mu_{0}$ is as defined in Theorem 5, $\nu$ is a random variable of the form $\nu=\sum_{j=1}^{\infty}\lambda_{j}(\chi^{2}_{1j}(\mu_{aj})-1)$, and $\chi^{2}_{11}(\mu_{a1}),\chi^{2}_{12}(\mu_{a2}),\cdots$ are independent noncentral $\chi^{2}_{1}$ variables, i.e., $\nu$ has the characteristic function

\[
\boldsymbol{\mathrm{P}}\left[e^{it\nu}\right]=\prod_{j=1}^{\infty}(1-2it\lambda_{j})^{-\frac{1}{2}}\exp\left(-it\lambda_{j}+\frac{it\lambda_{j}\mu_{aj}}{1-2it\lambda_{j}}\right).
\]

Here, $\{\lambda_{j}\}$ are the eigenvalues of the kernel $h({\boldsymbol{v}}_{1},{\boldsymbol{v}}_{2})$ defined in (25) under $f({\boldsymbol{v}},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})$, i.e., they are the solutions of $\lambda_{j}g_{j}({\boldsymbol{v}}_{2})=\int_{0}^{\infty}h({\boldsymbol{v}}_{1},{\boldsymbol{v}}_{2})g_{j}({\boldsymbol{v}}_{1})f({\boldsymbol{v}}_{1},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})d{\boldsymbol{v}}_{1}$ for nonzero $g_{j}$, and each noncentrality parameter of $\chi^{2}_{1j}(\mu_{aj})$ is

\[
\mu_{aj}=\boldsymbol{\mathrm{P}}\left[\phi_{j}({\boldsymbol{V}}_{0})\,\boldsymbol{\xi}^{\rm T}\partial\log f({\boldsymbol{V}}_{0},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})/\partial\boldsymbol{\beta}\right],\quad j=1,2,\cdots,
\]

where $\{\phi_{j}({\boldsymbol{v}})\}$ denotes the orthonormal eigenfunctions corresponding to the eigenvalues $\{\lambda_{j}\}$, and ${\boldsymbol{V}}_{0}$ is generated from the null distribution $f({\boldsymbol{v}},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})$.

Denote by $F_{\nu}$ the cumulative distribution function of $\nu$. It follows from Theorem 7 that the power function of $nT_{n}-\mu_{0}$ can be approximated theoretically by $F_{\nu}$, and the proof of Theorem 7 shows that $0<\boldsymbol{\mathrm{P}}h({\boldsymbol{V}}_{1},{\boldsymbol{V}}_{2})=\sum_{j=1}^{\infty}\lambda_{j}\mu_{aj}+o(1)$ under the local alternative hypothesis. The additional mean $\mu_{1}$ under $H_{1g}$ (or $\sum_{j=1}^{\infty}\lambda_{j}\mu_{aj}$ under $H_{1l}$) can be viewed as a measure of the difference between $H_{0}$ and $H_{1g}$ (or $H_{1l}$). However, $F_{\nu}$ is difficult to use directly in practice because it has no standard closed form. In Appendix B of the Supplementary Material, a novel bootstrap method is therefore recommended for calculating the critical value or p-value, whose asymptotic consistency is established in Theorem B1 in the Supplementary Material.
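As one concrete possibility, the sketch below implements the classical multiplier bootstrap for degenerate U-statistics, which approximates the law of $nT_{n}-\mu_{0}$ by randomly reweighting the kernel matrix. This illustrates the general recipe only; the specific bootstrap of Appendix B may differ in its details, and the interface here is our own convention.

```python
import numpy as np

def multiplier_bootstrap_pvalue(A, t_obs, B=5000, seed=0):
    """Multiplier bootstrap for a degenerate U-statistic.

    A     : n x n matrix of kernel evaluations h(V_i, V_j) with zero
            diagonal, e.g., built from the summands of (23) under H0.
    t_obs : observed centered statistic, e.g., n * T_n - mu0_hat.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    E = rng.standard_normal((B, n))              # i.i.d. N(0, 1) multipliers
    # T*_b = (1/n) sum_{i != j} e_i e_j h(V_i, V_j); the diagonal of A is zero.
    t_star = np.einsum('bi,ij,bj->b', E, A, E) / n
    return np.mean(t_star >= t_obs)              # bootstrap p-value
```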

4 Simulation studies

Consider the change-plane model (1)

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\beta}{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}\geq 0)+\epsilon_{i},
\]

where $X_{1i}=1$ and $U_{1i}=1$, and $(X_{2i},\cdots,X_{pi})^{\rm T}$ and $(U_{2i},\cdots,U_{ri})^{\rm T}$ are generated independently from the multivariate normal distributions $N({\boldsymbol{0}}_{p-1},\sqrt{2}{\boldsymbol{I}}_{p-1})$ and $N({\boldsymbol{0}}_{r-1},\sqrt{2}{\boldsymbol{I}}_{r-1})$, respectively. The error $\epsilon_{i}$ is generated from the following eight distributions: (i) $N(0,\sqrt{2})$; (ii) $t_{2}$; (iii) the Pareto distribution $Par(2,1)$ with shape parameter 2 and scale parameter 1; and (iv) the Weibull distribution $Weib(0.75,0.75)$ with shape parameter 0.75 and scale parameter 0.75. Owing to space limitations, the other four error distributions are deferred to Appendix E of the Supplementary Material: (v) Gaussian mixture; (vi) mixture of $t_{2}$ and Weibull $Weib(0.75,0.75)$; (vii) mixture of Pareto $Par(2,1)$ and $N(0,\sqrt{2})$; and (viii) mixture of lognormal $\exp(N(0,1))$ and $N(0,\sqrt{2})$. Under $H_{1}$, $(\gamma_{2},\cdots,\gamma_{r})^{\rm T}=(1,2,\cdots,2)^{\rm T}$ is set, and $\gamma_{1}$ is chosen as the negative of the 35th percentile of $U_{2}\gamma_{2}+\cdots+U_{r}\gamma_{r}$, so that ${\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}$ divides the population into two groups containing 65% and 35% of the observations, respectively. To save space, only the simulation results for robust subgroup-classifier learning are presented here; the performance of the robust subgroup test is reported in Appendix E.2 of the Supplementary Material. A data-generation sketch for this design is given below.
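For concreteness, a minimal data-generation sketch for this design follows. Two parameterization assumptions are flagged in the comments: $N(0,\sqrt{2})$ is read as having variance $\sqrt{2}$, and the Pareto draw uses NumPy's Lomax-based parameterization shifted to the standard $Par(2,1)$ support.

```python
import numpy as np

def generate_data(n, p, r, alpha, beta, gamma, err="t2", seed=None):
    """One sample from model (1) under the simulation design (Z = X here,
    so beta has length p)."""
    rng = np.random.default_rng(seed)
    sd = 2 ** 0.25                          # N(0, sqrt(2)) read as variance sqrt(2)
    X = np.column_stack([np.ones(n), rng.normal(0.0, sd, (n, p - 1))])
    U = np.column_stack([np.ones(n), rng.normal(0.0, sd, (n, r - 1))])
    if err == "gauss":
        eps = rng.normal(0.0, sd, n)
    elif err == "t2":
        eps = rng.standard_t(2, n)
    elif err == "pareto":                   # Par(2, 1): shape 2, scale 1
        eps = rng.pareto(2.0, n) + 1.0
    elif err == "weibull":                  # Weib(0.75, 0.75): shape and scale 0.75
        eps = 0.75 * rng.weibull(0.75, n)
    else:
        raise ValueError(f"unknown error distribution: {err}")
    y = X @ alpha + (X @ beta) * (U @ gamma >= 0) + eps
    return y, X, U
```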

The settings used herein are $\boldsymbol{\alpha}=(5,0.5,\cdots,0.5)^{\rm T}\in\mathbb{R}^{p}$ and $\boldsymbol{\beta}=(0.5,\cdots,0.5)^{\rm T}\in\mathbb{R}^{q}$, with the sigmoid function $K(u)=\{1+\exp(-u)\}^{-1}$ chosen as the smooth function. Finite-sample studies were also performed for other smooth functions $K(u)$ (such as $K(u)=\Phi(u)$ and $K(u)=\Phi(u)+u\phi(u)$), but the results were similar and so are omitted here. For comparison, three strategies are considered: (i) the proposed adaptive procedure (AHu); (ii) the classic Huber method (Hub), with the robustness parameter $\tau$ selected as $\tau_{0}\,\mbox{median}\{|{\boldsymbol{y}}-\mbox{median}({\boldsymbol{y}})|\}/\Phi^{-1}(0.75)$, where $\tau_{0}=1.345$ and ${\boldsymbol{y}}=(y_{1},\cdots,y_{n})^{\rm T}$, as suggested in Wang et al., (2021); and (iii) the estimation method based on the ordinary quadratic loss considered in Li et al., (2021) (OLS). Here, the subscript $n$ in AHu$_{n}$, Hub$_{n}$, and OLS$_{n}$ stands for the sample size, with $n=200,400,600$. The $\tau$ selection rule is sketched below.
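A small sketch of this tuning rule follows (using SciPy for the normal quantile). The absolute value in the median-absolute-deviation step is an assumption, matching the standard MAD convention.

```python
import numpy as np
from scipy.stats import norm

def huber_tau(y, tau0=1.345):
    """tau = tau0 * MAD with MAD = median|y - median(y)| / Phi^{-1}(0.75)."""
    mad = np.median(np.abs(y - np.median(y))) / norm.ppf(0.75)
    return tau0 * mad
```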

Figure 3 shows boxplots of the $L_{2}$-norm of the estimation errors of the parameter $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ under the different error distributions with $(p,q,r)=(3,3,3)$, where the error is measured by $L_{2}=\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|^{2}$ and $\hat{\boldsymbol{\theta}}_{\tau,h}$ is the robust estimator of the true parameter $\boldsymbol{\theta}^{*}$. Figure 4 shows boxplots of the accuracy (ACC) of subgroup identification under the same settings. Here, ACC is defined as

\[
\mbox{ACC}=1-n^{-1}\sum_{i=1}^{n}\left|{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\hat{\boldsymbol{\eta}}_{\tau,h}\geq 0)-{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta}^{*}\geq 0)\right|,
\]

with the robust estimator $\hat{\boldsymbol{\eta}}_{\tau,h}$ of the true parameter $\boldsymbol{\eta}^{*}$; a sketch of this metric follows.
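ACC simply compares the subgroup labels induced by the estimated and true change planes, as in this minimal sketch (the interface is illustrative):

```python
import numpy as np

def subgroup_accuracy(U1, U2, eta_hat, eta_star):
    """ACC: the proportion of observations placed on the same side of the
    estimated and true change planes."""
    g_hat = (U1 + U2 @ eta_hat >= 0)
    g_true = (U1 + U2 @ eta_star >= 0)
    return 1.0 - np.mean(g_hat != g_true)
```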

Figures 3 and 4 show that the $L_{2}$-norms of the estimation errors decrease and the accuracy of subgroup identification increases as the sample size $n$ grows, which verifies the established theoretical results. Compared with the ordinary quadratic loss, the proposed estimation procedure achieves a uniformly lower median $L_{2}$-norm of the estimation errors and higher accuracy of subgroup identification for all heavy-tailed distributions except the $t_{2}$ distribution. The three methods are comparable for symmetric distributions, such as the Gaussian and $t_{2}$ distributions. To save space, the corresponding boxplots for $(p,q,r)=(3,3,11),(6,6,3)$, and $(6,6,11)$ are shown in Figures E2–E4 and E6–E8 in Appendix E.1 of the Supplementary Material.

Figure 3: $L_{2}$-norm of estimation errors of the parameter $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ with four different error distributions. Here, $(p,q,r)=(3,3,3)$.

Figure 4: Accuracy of subgroup identification with four different error distributions. Here, $(p,q,r)=(3,3,3)$.

5 Case study

The proposed procedure is applied to skin cutaneous melanoma (SKCM) cancer data downloaded from The Cancer Genome Atlas (https://tcga-data.nci.nih.gov/). SKCM is one of the most aggressive types of cancer, and its incidence, mortality, and disease burden are increasing annually (Siegel et al., 2021; Xiong et al., 2019). CREG1, TMEM201, and CCL8 are believed to be skin-cancer susceptibility genes (Hu et al., 2020), and the goal is to identify subgroups in Breslow's thickness based on mutations of these susceptibility genes.

Consideration is given to three environmental factors: (i) gender, (ii) age at diagnosis, and (iii) Clark level at diagnosis (CLAD), all of which have been studied extensively in the literature. Studies such as that by Dickson and Gershenwald, (2011) have found that these three environmental factors have positive effects. After removing missing values, 253 subjects remain, and the SKCM data are modeled by the change-plane model (1):

\[
Y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\beta}{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}\geq 0)+\epsilon_{i},\quad i=1,\cdots,253,
\]

where ${\boldsymbol{X}}_{i}=(1,\mbox{AGE}_{i},\mbox{GENDER}_{i},\mbox{CLAD}_{i})^{\rm T}$ and ${\boldsymbol{U}}_{i}=(1,\mbox{CREG1}_{i},\mbox{TMEM201}_{i},\mbox{CCL8}_{i})^{\rm T}$. The three important genes CREG1, TMEM201, and CCL8 are highly correlated with cutaneous melanoma (Hu et al., 2020).

Based on $B=5000$ bootstrap samples and the Huber loss, the p-value of the proposed RWAST is $0.002$, which provides strong evidence for rejecting the null hypothesis. In contrast, based on the ordinary quadratic loss, the p-values of WAST and SST are 0.6454 and 0.0286, respectively, where $B=5000$ and $K=2000$. Therefore, WAST based on the ordinary quadratic loss provides no evidence against the null hypothesis, and the evidence from SST is considerably weaker than that from the proposed RWAST.

The parameter estimates are listed in Table 1. The indicator function ${\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\hat{\boldsymbol{\gamma}}\geq 0)$ partitions the population into two subgroups of 121 and 132 subjects based on the Huber loss, and of 91 and 162 subjects based on the ordinary quadratic loss. Therefore, there appears to be a subgroup with a higher chance of skin cancer based on mutations of the three genes CREG1, TMEM201, and CCL8.

Table 1: Estimates of the parameters when the null hypothesis is rejected. The estimates $\hat{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}$ correspond to the baseline covariates, and $\hat{\boldsymbol{\gamma}}$ to the grouping covariates.

Loss    Parameter                     Intercept   AGE      GENDER   CLAD
OLS     $\hat{\boldsymbol{\alpha}}$   −0.037      −0.038   0.002    0.050
        $\hat{\boldsymbol{\beta}}$    0.022       0.024    0.008    −0.007
Huber   $\hat{\boldsymbol{\alpha}}$   3.525       0.029    0.079    0.064
        $\hat{\boldsymbol{\beta}}$    0.012       0.014    −0.026   −0.045

Loss    Parameter                     Intercept   CREG1    TMEM201   CCL8
OLS     $\hat{\boldsymbol{\gamma}}$   1.000       0.179    −0.656    4.970
Huber   $\hat{\boldsymbol{\gamma}}$   1.000       0.384    −2.093    4.860

6 Conclusion

Considered herein were subgroup classification and subgroup tests for change-plane models with heavy-tailed errors, which help in (i) narrowing down populations for modeling and (ii) providing recommendations for optimal individualized treatments in practice. A novel subgroup classifier was proposed by smoothing the indicator function and minimizing a smoothed Huber loss. Nonasymptotic properties were derived, and a nonasymptotic Bahadur representation was provided, in which the estimators of the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ achieve sub-Gaussian tails.

The novel test statistic RWAST was introduced to test whether subgroups of individuals exist. Compared with WAST Liu et al., (2024) and SST Andrews and Ploberger, (1994, 1995); Song et al., (2009); Fan et al., 2017a based on ordinary change-plane regression with non-heavy-tailed errors, the proposed test statistic improves power significantly because it is robust, owing to the Huber loss, and avoids the drawbacks of taking the supremum over the parameter space $\Theta_{\gamma}$ when its dimension is large. Asymptotic distributions were derived under the null and alternative hypotheses. As studied by Liu et al., (2024) and Huang et al., (2020), it is straightforward to extend the proposed robust estimation procedure and RWAST to change-plane regressions with multiple change planes and heavy-tailed errors.

In the age of big data, it is of interest to consider high-dimensional change-plane regression models and to develop high-dimensional robust estimation procedures and test statistics. As noted by Liu et al., (2024), the proposed RWAST can be applied to change-plane regression with a high-dimensional grouping parameter $\boldsymbol{\gamma}$. However, providing estimation procedures for change-plane regression with high-dimensional $\boldsymbol{\gamma}$ remains an open problem. A possible strategy is to penalize the loss function with $\|\boldsymbol{\theta}\|_{1}$ under a sparsity assumption.

Supplementary Material

Appendix A includes Algorithm 1 for the implementation in Section 2.3. Appendix B provides the computation of the critical value in Section 3. Appendix C provides the proofs of Theorems 1–4 and the related lemmas. Appendix D provides the proofs of Theorems 5–7. Appendix E provides additional simulation studies illustrating the performance of the proposed estimation and test procedures.

Acknowledgments

This work was supported in part by the NSFC (12271329, 72331005).

References

  • Andrews and Ploberger, (1994) Andrews, D. W. K. and Ploberger, W. (1994). Optimal tests when a nuisance parameter is present only under the alternative. Econometrica, 62(6):1383–1414.
  • Andrews and Ploberger, (1995) Andrews, D. W. K. and Ploberger, W. (1995). Admissibility of the likelihood ratio test when a nuisance parameter is present only under the alternative. The Annals of Statistics, 23(5):1609–1629.
  • Chen and Zhou, (2020) Chen, X. and Zhou, W.-X. (2020). Robust inference via multiplier bootstrap. The Annals of Statistics, 48(3):1665–1691.
  • Davies, (1977) Davies, R. B. (1977). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika, 64(2):247–254.
  • Devroye et al., (2016) Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725.
  • Dickson and Gershenwald, (2011) Dickson, P. V. and Gershenwald, J. E. (2011). Staging and prognosis of cutaneous melanoma. Surgical oncology clinics of North America, 20(1):1–17.
  • Fan et al., (2017a) Fan, A., Song, R., and Lu, W. (2017a). Change-plane analysis for subgroup detection and sample size calculation. Journal of the American Statistical Association, 112:769–778.
  • Fan et al., (2017b) Fan, J., Li, Q., and Wang, Y. (2017b). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):247–265.
  • Foster et al., (2011) Foster, J. C., Taylor, J., and Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880.
  • Hahn et al., (1990) Hahn, M. G., Kuelbs, J., and Weiner, D. C. (1990). The asymptotic joint distribution of self-normalized censored sums and sums of squares. The Annals of Probability, 18(3):1284–1341.
  • Han et al., (2022) Han, D., Huang, J., Lin, Y., and Shen, G. (2022). Robust post-selection inference of high-dimensional mean regression with heavy-tailed asymmetric or heteroskedastic errors. Journal of Econometrics, 230(2):416–431.
  • Horowitz, (1993) Horowitz, J. L. (1993). Optimal rates of convergence of parameter estimators in the binary response model with weak distributional assumptions. Econometric Theory, 9(1):1–18.
  • Hu et al., (2020) Hu, B., Wei, Q., Zhou, C., Ju, M., Wang, L., Chen, L., Li, Z., Wei, M., He, M., and Zhao, L. (2020). Analysis of immune subtypes based on immunogenomic profiling identifies prognostic signature for cutaneous melanoma. International Immunopharmacology, 89:107162.
  • Huang et al., (2020) Huang, Y., Cho, J., and Fong, Y. (2020). Threshold-based subgroup testing in logistic regression models in two-phase sampling designs. Applied Statistics, 70(2).
  • Huber, (1964) Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.
  • Huber, (1973) Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5):799–821.
  • Kang et al., (2024) Kang, C., Cho, H., Song, R., Banerjee, M., Laber, E. B., and Kosorok, M. R. (2024). Inference for change-plane regression. arXiv.org/abs/2206.06140.
  • Kang et al., (2017) Kang, S., Lu, W., and Song, R. (2017). Subgroup detection and sample size calculation with proportional hazards regression for survival data. Statistics in Medicine, 36:4646–4659.
  • Lee et al., (2011) Lee, S., Seo, M. H., and Shin, Y. (2011). Testing for threshold effects in regression models. Journal of the American Statistical Association, 106(493):220–231.
  • Lepskii, (1992) Lepskii, O. V. (1992). Asymptotically minimax adaptive estimation. I: Upper bounds. Optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):682–697.
  • Li et al., (2021) Li, J., Li, Y., Jin, B., and Kosorok, M. R. (2021). Multithreshold change plane model: Estimation theory and applications in subgroup identification. Statistics in Medicine, 40(15):3440–3459.
  • Liu et al., (2024) Liu, X., Huang, J., Zhou, Y., Zhang, F., and Ren, P. (2024). Efficient subgroup testing in change-plane models. http://arxiv.org/abs/2408.00602.
  • Mukherjee et al., (2020) Mukherjee, D., Banerjee, M., Mukherjee, D., and Ritov, Y. (2020). Asymptotic normality of a linear threshold estimator in fixed dimension with near-optimal rate. arXiv preprint arXiv:2001.06955.
  • Mukherjee et al., (2022) Mukherjee, D., Banerjee, M., and Ritov, Y. (2022). On robust learning in the canonical change point problem under heavy tailed errors in finite and growing dimensions. Electronic Journal of Statistics, 16(1):1153–1252.
  • Seo and Linton, (2007) Seo, M. H. and Linton, O. (2007). A smoothed least squares estimator for threshold regression models. Journal of Econometrics, 141(2):704–735.
  • Shen and Qu, (2020) Shen, J. and Qu, A. (2020). Subgroup analysis based on structured mixed-effects models for longitudinal data. Journal of Biopharmaceutical Statistics, 30(4):607–622.
  • Siegel et al., (2021) Siegel, R. L., Miller, K. D., Fuchs, H. E., and Jemal, A. (2021). Cancer statistics, 2021. CA: A Cancer Journal for Clinicians, 71(1):7–33.
  • Song et al., (2009) Song, R., Kosorok, M. R., and Fine, J. P. (2009). On asymptotically optimal tests under loss of identifiability in semiparametric models. The Annals of Statistics, 37:2409–2444.
  • Sun et al., (2020) Sun, Q., Zhou, W.-X., and Fan, J. (2020). Adaptive huber regression. Journal of the American Statistical Association, 115(529):254–265. PMID: 33139964.
  • Wald, (1943) Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3):426–482.
  • Wang et al., (2021) Wang, L., Zheng, C., Zhou, W., and Zhou, W.-X. (2021). A new principle for tuning-free huber regression. Statistica Sinica, 31:2153–2177.
  • Wei and Kosorok, (2018) Wei, S. and Kosorok, M. R. (2018). The change-plane Cox model. Biometrika, 105(4):891–903.
  • Xiong et al., (2019) Xiong, J., Bing, Z., and Guo, S. (2019). Observed survival interval: A supplement to tcga pan-cancer clinical data resource. Cancers, 11(3).
  • Zhang et al., (2022) Zhang, Y., Wang, J. W., and Zhu, Z. (2022). Single-index thresholding in quantile regression. Journal of the American Statistical Association, 117(540):2222–2237.
  • Zhou et al., (2018) Zhou, W.-X., Bose, K., Fan, J., and Liu, H. (2018). A new perspective on robust M-estimation: Finite sample theory and applications to dependence-adjusted multiple testing. The Annals of Statistics, 46(5):1904–1931.