
Robust subgroup-classifier learning and testing in change-plane regressions

Xu Liu1, Jian Huang2, Yong Zhou3 and Xiao Zhang4,   
1Shanghai University of Finance and Economics, Shanghai, China
2The Hong Kong Polytechnic University, Hong Kong SAR, China
3East China Normal University, Shanghai, China
4The Chinese University of Hong Kong-Shenzhen, Shenzhen, China
CONTACT [email protected]; The Chinese University of Hong Kong-Shenzhen, China
Abstract

Considered here are robust subgroup-classifier learning and testing in change-plane regressions with heavy-tailed errors, which can identify subgroups as a basis for making optimal recommendations for individualized treatment. A new subgroup classifier is proposed by smoothing the indicator function, which is learned by minimizing the smoothed Huber loss. Nonasymptotic properties and the Bahadur representation of estimators are established, in which the proposed estimators of the grouping difference parameter and baseline parameter achieve sub-Gaussian tails. The hypothesis test considered here belongs to the class of test problems for which some parameters are not identifiable under the null hypothesis. The classic supremum of the squared score test statistic may lose power in practice when the dimension of the grouping parameter is large, so to overcome this drawback and make full use of the data’s heavy-tailed error distribution, a robust weighted average of the squared score test statistic is proposed, which achieves a closed form when an appropriate weight is chosen. Asymptotic distributions of the proposed robust test statistic are derived under the null and alternative hypotheses. The proposed robust subgroup classifier and test statistic perform well on finite samples, and their performances are shown further by applying them to a medical dataset. The proposed procedure leads to the immediate application of recommending optimal individualized treatments.


Keywords: Gaussian-type deviation; Heavy-tailed data; Nonstandard test; Robust classifier; Subgroup classifier; Subgroup detection.

1 Introduction

When studying the risk of a disease outcome, there could be heterogeneity across subgroups characterized by covariates, meaning that the same treatment may yield different treatment effects in different subpopulations. In the presence of population heterogeneity in classical models, learning the subgroup classifier and testing for the existence of subgroups associated with the risk-model heterogeneity are important for better understanding the different effects of predictors and better modeling the association of diseases with predictors. In precision medicine, this plays a core role in guiding personalized treatment for individuals in a population by identifying subgroups with different treatment effects on disease. There has been much previous research on learning the subgroup classifiers of individuals based on various models (Foster et al., 2011; Wei and Kosorok, 2018; Huang et al., 2020; Li et al., 2021; Zhang et al., 2022).

Before learning the subgroup classifier, it is necessary to test for the existence of subgroups of individuals to address the potential risk of finding false-positive subgroups. This necessity not only arises from the data themselves but is also intrinsic to statistics, because the nonexistence of subgroups causes the identifiability problem when learning the subgroup classifier. However, this test problem belongs to the class of nonstandard tests with loss of identifiability under the null hypothesis; see Wald, (1943); Andrews and Ploberger, (1994, 1995); Davies, (1977); Song et al., (2009); Liu et al., (2024); Kang et al., (2024), among others. Therefore, the focus herein is on learning the subgroup classifier and testing for the existence of subgroups simultaneously, which offers more information and the potential for making the best recommendations for optimal individualized treatments and guiding future treatment modification and development.

Let $\{{\boldsymbol{V}}_{i}=(y_{i},{\boldsymbol{X}}_{i},{\boldsymbol{Z}}_{i},{\boldsymbol{U}}_{i}),i=1,\cdots,n\}$ be the observed data, which are $n$ independent and identically distributed copies of ${\boldsymbol{V}}=(y,{\boldsymbol{X}},{\boldsymbol{Z}},{\boldsymbol{U}})$. Consider the regression model with change plane (Lee et al., 2011; Zhang et al., 2022; Mukherjee et al., 2022; Liu et al., 2024)

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}\geq 0)+\epsilon_{i}, \tag{1}
\]

where $\boldsymbol{\alpha}=(\alpha_{1},\cdots,\alpha_{p})^{\rm T}\in\Theta_{\alpha}\subseteq\mathbb{R}^{p}$, $\boldsymbol{\beta}=(\beta_{1},\cdots,\beta_{q})^{\rm T}\in\Theta_{\beta}\subseteq\mathbb{R}^{q}$, and $\boldsymbol{\gamma}=(\gamma_{1},\cdots,\gamma_{r})^{\rm T}\in\Theta_{\gamma}\subseteq\mathbb{R}^{r}$ are unknown parameters, and ${\mathrm{E}}(\epsilon_{i}|{\boldsymbol{X}}_{i},{\boldsymbol{Z}}_{i},{\boldsymbol{U}}_{i})=0$ and ${\mathrm{E}}(|\epsilon_{i}|^{2+\delta}|{\boldsymbol{X}}_{i},{\boldsymbol{Z}}_{i},{\boldsymbol{U}}_{i})=M_{\delta}<\infty$ for some $\delta\geq 0$. When $\delta=0$, the error has a finite second moment, denoted by $M_{0}={\mathrm{E}}(\epsilon_{i}^{2}|{\boldsymbol{X}}_{i},{\boldsymbol{Z}}_{i},{\boldsymbol{U}}_{i})$. For ease of expression, let $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$. Following Liu et al., (2024), ${\boldsymbol{U}}$ is called the grouping variable, $\boldsymbol{\gamma}$ the grouping parameter, ${\boldsymbol{Z}}$ the grouping difference variable, $\boldsymbol{\beta}$ the grouping difference parameter, ${\boldsymbol{X}}$ the baseline variable, and $\boldsymbol{\alpha}$ the baseline parameter. Herein, the indicator function ${\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)$ is called the subgroup classifier.

The technology for collecting and processing datasets has improved considerably in recent years, and one is now more likely to encounter heavy-tailed or low-quality data, thereby causing the typical assumption of a Gaussian or sub-Gaussian distribution to fail. Therefore, modeling non-Gaussian or heavy-tailed data raises new challenges relative to the classic methodology. Even for linear regression models with heavy-tailed errors, the ordinary least squares (OLS) estimators are suboptimal both theoretically and empirically. Instead, proposed herein is a robust subgroup-classification procedure based on the change-plane model (1) with heavy-tailed errors. This paper addresses two important problems for model (1) with heavy-tailed errors, i.e., subgroup-classifier learning (Section 2) and testing for whether subgroups exist (Section 3).

1.1 Robust subgroup-classifier learning

Li et al., (2021) considered the change-plane model (1) with Gaussian errors. Also, Zhang et al., (2022) investigated a quantile regression with a change plane and derived the asymptotic normality of the estimators of the grouping difference parameter and the grouping parameter. However, although quantile or median regression models require no Gaussian or sub-Gaussian assumption, they essentially estimate the conditional quantile or median regression instead of the conditional mean regression. If the mean regression is of interest in practice, then these procedures are not feasible unless the error distribution is symmetric around zero, an assumption that may be too strong and can cause a misspecification problem. See Fan et al., 2017b for some examples that demonstrate the distinction between conditional mean regression and conditional quantile or median regression.

Linear regression models with heavy-tailed errors are prevalent in the literature. Fan et al., 2017b proposed a robust estimator of high-dimensional mean regression in the absence of symmetry and light-tail assumptions. Zhou et al., (2018) provided a robust M-estimation procedure with applications to dependence-adjusted multiple testing. Sun et al., (2020) and Wang et al., (2021) studied adaptive Huber regression for linear regression models with heavy-tailed errors. Chen and Zhou, (2020) investigated robust inference via multiplier bootstrap in multiple response regression models, constructing robust bootstrap confidence sets and addressing large-scale simultaneous hypothesis testing problems.

Figure 1: Estimation errors of parameter $\boldsymbol{\beta}$ in $L_{2}$ norm (left) and accuracies of the estimated subgroup classifier (right) for heavy-tailed errors generated from the Pareto distribution.

Mukherjee et al., (2022) studied the change-plane problem under heavy-tailed errors when $\boldsymbol{\alpha}=0$ and ${\boldsymbol{X}}=1$, which is a special case of the change-plane model (1), and they left the general change-plane model with heavy-tailed errors for future work. Herein, a new robust procedure is introduced to estimate parameters and consequently to learn the subgroup classifier. Figure 1 shows boxplots of the estimation errors of the parameter $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ in $L_{2}$ norm and the accuracies of the estimated subgroup classifier, where the $L_{2}$ norm of the estimation error is defined as $\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|$ with the robust estimator $\hat{\boldsymbol{\theta}}_{\tau,h}$ of the true parameter $\boldsymbol{\theta}^{*}$, and the accuracy is defined as $\mbox{ACC}=1-n^{-1}\sum_{i=1}^{n}\left|{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\hat{\boldsymbol{\gamma}}_{\tau,h}\geq 0)-{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}^{*}\geq 0)\right|$ with the robust estimator $\hat{\boldsymbol{\gamma}}_{\tau,h}$ of the true parameter $\boldsymbol{\gamma}^{*}$. Here, the settings are $(p,q,r)=(3,3,3)$ and $n=(200,400,600)$ with 1000 repetitions; the heavy-tailed errors are generated from the Pareto distribution $Par(2,1)$ with shape parameter 2 and scale parameter 1; and $X_{1}=Z_{1}=1$, while $(X_{2},\cdots,X_{p})^{\rm T}=(Z_{2},\cdots,Z_{p})^{\rm T}$ and $(U_{2},\cdots,U_{r})^{\rm T}$ are generated independently from the multivariate normal distributions $N({\boldsymbol{0}}_{p-1},\sqrt{3}{\boldsymbol{I}}_{p-1})$ and $N({\boldsymbol{0}}_{r-1},\sqrt{3}{\boldsymbol{I}}_{r-1})$, respectively; see Section 4 for details. As in Zhang et al., (2022), the smooth function $K(u)=\{1+\exp(-u)\}^{-1}$ with smoothness parameter $h=\sqrt{\log(n)/n}$ is chosen, and the proposed robust estimation procedure (AHu) is compared with the method based on OLS (Li et al., 2021). Figure 1 sends the important message that in the presence of heavy tails, compared with the existing method (Li et al., 2021), the proposed robust estimators not only reduce the estimation error dramatically but also significantly improve the accuracy of the subgroup classifier.
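To make this design concrete, the following Python sketch generates one dataset from the setting above and computes ACC; the true parameter values, the law of $U_{1}$, and the centering of the Pareto errors are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p=3, r=3, alpha=None, beta=None, gamma=None):
    """Generate one sample from model (1) with Pareto-type errors (a sketch)."""
    alpha = np.ones(p) if alpha is None else alpha
    beta = np.ones(p) if beta is None else beta
    gamma = np.ones(r) if gamma is None else gamma         # gamma_1 normalized to 1
    X = np.c_[np.ones(n), rng.normal(0, 3 ** 0.25, (n, p - 1))]   # X1 = Z1 = 1
    Z = X.copy()                                           # (X2,...,Xp) = (Z2,...,Zp)
    U = np.c_[rng.standard_normal(n),                      # U1: assumed standard normal here
              rng.normal(0, 3 ** 0.25, (n, r - 1))]
    eps = rng.pareto(2.0, n) + 1.0 - 2.0                   # Pareto(2,1), centered at its mean 2
    y = X @ alpha + (Z @ beta) * (U @ gamma >= 0) + eps
    return y, X, Z, U

def acc(U, gamma_hat, gamma_true):
    """ACC = 1 - mean |1(U'gamma_hat >= 0) - 1(U'gamma_true >= 0)|."""
    return 1.0 - np.mean((U @ gamma_hat >= 0) != (U @ gamma_true >= 0))
```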

1.2 Robust subgroup testing

Another goal of this paper is to test for the existence of subgroups, i.e.,

\[
H_{0}:\boldsymbol{\beta}={\boldsymbol{0}}\quad\text{versus}\quad H_{1}:\boldsymbol{\beta}\neq{\boldsymbol{0}}. \tag{2}
\]

Note that the grouping parameter $\boldsymbol{\gamma}$ is not identifiable under the null hypothesis.

The classic Wald-type test or score-based test is powerful in standard test problems, in which there is no identifiability problem under either the null or the alternative hypothesis, but these common procedures are not feasible when nuisance parameters are present. Andrews and Ploberger, (1994) and Andrews and Ploberger, (1995) studied the weighted average exponential form, which was originally introduced by Wald, (1943). Davies, (1977) thoroughly investigated the supremum of the squared score test (SST) statistic for mixture models, which Song et al., (2009) and Kang et al., (2017) applied to semiparametric models with censored data.

All the aforementioned testing methods are optimal tests based on the weighted average power criterion. However, because these optimal tests take the weighted exponential average of the classical tests over the grouping parametric space $\Theta_{\gamma}$, they may not only perform poorly in practice when the dimension of $\Theta_{\gamma}$ is large but also entail a heavy computational burden for calculating the p-value or the critical value. Instead, Liu et al., (2024) introduced a new test statistic by taking the weighted average of the SST (WAST) over $\Theta_{\gamma}$ and removing both the inverse of the covariance and the cross-interaction terms to overcome the drawbacks of the SST. Thanks to its closed form, the WAST achieves more-accurate type-I errors and significantly improved power, with dramatically reduced computational time as a byproduct.

Figure 2: Powers of the test statistics from the proposed RWAST (rwast, red solid line), WAST (olsw, green dashed line), and SST (olss, blue dotted line) for $(p,q,r)=(3,3,3)$ with heavy-tailed errors generated from the Pareto distribution.

All the aforementioned test procedures require the important assumption of Gaussian or sub-Gaussian errors in the change-plane models, so none of them apply to heavy-tailed datasets. Therefore, proposed herein is a robust test procedure based on the WAST (Liu et al., 2024), called the robust WAST (RWAST). Figure 2 shows the power curves of the proposed RWAST, the WAST introduced by Liu et al., (2024), and the SST considered by Davies, (1977); Kang et al., (2017). A total of 1000 bootstrap samples is used, and the other settings are the same as those in Section 1.1. With the nominal significance level $\alpha=0.05$, the type-I errors of the three methods are $(\mbox{rwast},\mbox{olsw},\mbox{olss})=(0.042,0.065,0.031)$ for $n=200$, $(0.043,0.047,0.118)$ for $n=400$, and $(0.033,0.047,0.135)$ for $n=600$. It follows that RWAST controls the type-I errors well, whereas those of the SST based on the quadratic loss are larger and deviate far from 0.05. Figure 2 also shows that in the presence of heavy tails, the proposed RWAST achieves larger power than the WAST based on the ordinary quadratic loss.

In summary, from the demonstration examples in Section 1.1 and Section 1.2, compared with the existing nonrobust methods for heavy-tailed data, the proposed robust estimation procedure is characterized by lower estimation errors and higher accuracy, and the robust test procedure has more-accurate type-I errors and larger power.

1.3 Main contributions

The main contributions of this paper are summarized as follows. First, the robust estimation procedure for the change-plane model (1) with heavy-tailed errors is investigated carefully. The proposed robust estimator adapts to the sample size, the robustification parameter in the Huber loss, the smoothness parameter when approximating the indicator function in the subgroup classifier, and the moments of the errors. The sacrifices made in pursuit of robustness and smoothness are analyzed theoretically, with the bias involving the robustification parameter arising from the pursuit of robustness, and the one involving the smoothness parameter arising from the approximation to the indicator function. The nonasymptotic properties for the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are established, as well as those for the grouping parameter. The theoretical results reveal that the proposed estimators of the grouping difference parameter and baseline parameter have Gaussian-type deviations (Devroye et al., 2016). Also provided is the nonasymptotic Bahadur representation of the proposed robust estimators, which is convenient for deriving the classical asymptotic results needed for statistical inference such as hypothesis tests and the construction of confidence regions. Extensive simulation studies show that the proposed robust estimation procedure is superior to its several competitors.

Second, for the change-plane model with heavy-tailed errors, RWAST is proposed, which makes full use of the data’s heavy-tailed information and overcomes the drawbacks of loss of power in practice and the heavy computational burden of SST when the dimension of the grouping parameter is large. The asymptotic distributions of the proposed RWAST under the null and alternative hypotheses are established based on the theory of degenerate U-statistics. As with exponential average tests, the proposed asymptotic distributions are not standard (e.g., the normal or $\chi^{2}$ distribution), so a novel bootstrap method that is easily implemented and theoretically guaranteed is introduced to mimic the critical value or p-value. Comprehensive simulation studies conducted with finite sample sizes and for various heavy-tailed error distributions show the excellent performance of the proposed RWAST, which improves the power significantly and reduces the computational burden dramatically.

In summary, a novel robust estimator is proposed that adapts to the sample size, dimension, robustification parameter, moments, and smoothness parameter in pursuit of the optimal tradeoff among bias, robustness, and smoothness. To the best of the authors’ knowledge about change-plane analysis, the literature contains no nonasymptotic results with sub-Gaussian tails for the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$, and no nonasymptotic results with sub-exponential tails for the Bahadur representation of these parameters. Furthermore, a robust test procedure is proposed that improves on the WAST in the change-plane model with heavy-tailed errors.

1.4 Notation and organization of the paper

Here, some useful notation is introduced for convenience of expression. For a vector ${\boldsymbol{v}}\in\mathbb{R}^{d}$ and a square matrix $A=(a_{ij})\in\mathbb{R}^{d\times d}$, denote by $\|{\boldsymbol{v}}\|$ the Euclidean norm of ${\boldsymbol{v}}$ and by $\mbox{trace}(A)=\sum_{i=1}^{d}a_{ii}$ the trace of $A$; write $\|{\boldsymbol{v}}\|_{A}^{2}=\sum_{i,j}a_{ij}v_{i}v_{j}$ and ${\boldsymbol{v}}^{\otimes 2}={\boldsymbol{v}}{\boldsymbol{v}}^{\rm T}$. Denote by $\|A\|_{p}=\sup\{\|A{\boldsymbol{x}}\|_{p}:{\boldsymbol{x}}\in\mathbb{R}^{d},\|{\boldsymbol{x}}\|_{p}=1\}$ the induced operator norm of a matrix $A=(a_{ij})\in\mathbb{R}^{m\times d}$.

Denote by $\boldsymbol{\mathrm{P}}$ the ordinary probability measure such that $\boldsymbol{\mathrm{P}}f=\int f\,d\boldsymbol{\mathrm{P}}$ for any measurable function $f$, by $\mathbb{P}_{n}$ the empirical measure of a sample of random elements from $\boldsymbol{\mathrm{P}}$ such that $\mathbb{P}_{n}f=n^{-1}\sum_{i=1}^{n}f({\boldsymbol{V}}_{i})$, and by ${\mathbb{G}}_{n}$ the empirical process indexed by a class ${\cal F}$ of measurable functions such that ${\mathbb{G}}_{n}f=\sqrt{n}(\mathbb{P}_{n}-\boldsymbol{\mathrm{P}})f$ for any $f\in{\cal F}$. Let $L^{p}(Q)$ be the space of all measurable functions $f$ such that $\|f\|_{Q,p}:=(Q|f|^{p})^{1/p}<\infty$ for $p\in[1,\infty)$, with $\|f\|_{Q,\infty}$ denoting the essential supremum when $p=\infty$. Let $N(\epsilon,{\cal F},\|\cdot\|_{Q,2})$ be the $\epsilon$-covering number of ${\cal F}$ with respect to the $L^{2}(Q)$ seminorm $\|\cdot\|_{Q,2}$, where ${\cal F}$ is a class of measurable functions and $Q$ is a finite discrete measure.

The remainder of this paper is organized as follows. Section 2 provides the robust estimators for the grouping difference parameter and grouping parameter as well as the subgroup classifier, and theorems reveal that these estimators achieve Gaussian-type deviations. Also derived is the Bahadur representation of the robust estimators, and it is shown that the remainder of the Bahadur representation achieves sub-Gaussian tails. Section 3 presents the RWAST statistic and establishes its limiting distributions under the null and alternative hypotheses. Section 4 reports the results of simulation studies conducted to evaluate the finite-sample performance of the proposed methods with competitors in the change-plane models with heavy-tailed errors. The performance of the proposed methods is illustrated further by applying them to a medical dataset in Section 5. Finally, Section 6 concludes with remarks and further extensions. The proofs are provided in the Supplementary Material, and an R package named “wasthub” is available at https://github.com/xliusufe/wasthub.

2 Robust subgroup-classifier learning

In this section, the subgroup classifier is learned to partition subjects into two subgroups, and nonasymptotic properties are provided for the robust estimators, whose deviations achieve sub-Gaussian tails. To adapt to different magnitudes of errors and to robustify the estimation, the Huber loss (Huber, 1964; Fan et al., 2017b; Wang et al., 2021; Han et al., 2022) is considered, the definition of which begins this section.

Definition 2.1.

The Huber loss $L_{\tau}(u)$ (Huber, 1964) is defined as

\[
L_{\tau}(u)=\begin{cases}
u^{2}/2 & \text{if } |u|\leq\tau,\\
\tau|u|-\tau^{2}/2 & \text{otherwise,}
\end{cases} \tag{5}
\]

where $\tau>0$ is a tuning parameter called the robustification parameter (Sun et al., 2020; Chen and Zhou, 2020), which regulates the tradeoff between bias and robustness.

The Huber loss is a hybrid of the squared loss for small errors and the absolute loss for large errors. Denoting by $L(u)=u^{2}/2$ the ordinary quadratic loss, it is straightforward to see that $L(u)=\lim_{\tau\rightarrow\infty}L_{\tau}(u)$.
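In code, the loss and its first derivative $\psi_{\tau}(u)=\hbox{sgn}(u)\min\{|u|,\tau\}$ (used in Section 3) are one-liners; a minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def huber_loss(u, tau):
    """Huber loss L_tau(u): quadratic on |u| <= tau, linear beyond."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= tau, 0.5 * u ** 2, tau * np.abs(u) - 0.5 * tau ** 2)

def psi_tau(u, tau):
    """First derivative psi_tau(u) = sign(u) * min(|u|, tau)."""
    u = np.asarray(u, dtype=float)
    return np.sign(u) * np.minimum(np.abs(u), tau)
```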

2.1 Robust estimation

Rewrite model (1) as

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta}\geq 0)+\epsilon_{i}, \tag{6}
\]

where ${\boldsymbol{U}}_{i}=(U_{1i},{\boldsymbol{U}}_{2i}^{\rm T})^{\rm T}$ and $\boldsymbol{\eta}=\gamma_{1}^{-1}\boldsymbol{\gamma}_{-1}$ with $\boldsymbol{\gamma}_{-1}=(\gamma_{2},\cdots,\gamma_{r})^{\rm T}$. To avoid the identifiability problem for $\boldsymbol{\eta}$, $\boldsymbol{\beta}\neq{\boldsymbol{0}}$ is assumed in this section. Denote $\boldsymbol{\zeta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T},\boldsymbol{\eta}^{\rm T})^{\rm T}\in\Theta_{\zeta}$, where $\Theta_{\zeta}=\Theta_{\alpha}\times\Theta_{\beta}\times\Theta_{\eta}$ is the product space.

Because the indicator function ${\boldsymbol{1}}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}\geq 0)$ is not differentiable, it is natural to approximate it by a smooth function $K(u)$ satisfying

\[
\lim_{u\rightarrow+\infty}K(u)=1 \quad\mbox{and}\quad \lim_{u\rightarrow-\infty}K(u)=0.
\]

Note that this smooth function characterizes a cumulative distribution function rather than a density function; see Seo and Linton, (2007); Li et al., (2021); Zhang et al., (2022); Mukherjee et al., (2020) for more details. The literature contains many commonly used smooth functions, such as the cumulative distribution function of the standard normal distribution $K(u)=\Phi(u)$, the sigmoid function $K(u)=\{1+\exp(-u)\}^{-1}$, and the mixture of the cumulative distribution function and density of the standard normal distribution $K(u)=\Phi(u)+u\phi(u)$. Thus, model (6) can be approximated by

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta})+\epsilon_{i}, \tag{7}
\]

where $K_{h}(u)=K(u/h)$, and $h$ is a predetermined tuning parameter associated with $n$ satisfying $\lim_{n\rightarrow\infty}h=0$, called the smoothness parameter.
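A small sketch of these surrogates and of the rescaled $K_{h}$ (SciPy supplies $\Phi$ and $\phi$):

```python
import numpy as np
from scipy.stats import norm

def K_sigmoid(u):
    """Sigmoid surrogate K(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(u, dtype=float)))

def K_phi_mix(u):
    """Surrogate K(u) = Phi(u) + u * phi(u)."""
    u = np.asarray(u, dtype=float)
    return norm.cdf(u) + u * norm.pdf(u)

def K_h(u, h, K=K_sigmoid):
    """Smoothed indicator K_h(u) = K(u / h); tends to 1(u >= 0) as h -> 0."""
    return K(np.asarray(u, dtype=float) / h)
```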

For any $\tau>0$, let $\boldsymbol{\zeta}_{\tau,h}^{*}$ be the minimizer defined as

\[
\boldsymbol{\zeta}_{\tau,h}^{*}=\mathop{\rm argmin}_{\boldsymbol{\zeta}\in\Theta_{\zeta}}\boldsymbol{\mathrm{P}}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta})\big), \tag{8}
\]

which approximates the minimizer

\[
\boldsymbol{\zeta}_{\tau}^{*}=\mathop{\rm argmin}_{\boldsymbol{\zeta}\in\Theta_{\zeta}}\boldsymbol{\mathrm{P}}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}\geq 0)\big). \tag{9}
\]

Following Sun et al., (2020), $\boldsymbol{\zeta}_{\tau}^{*}$ is called the Huber coefficient, which usually differs from the true parameter $\boldsymbol{\zeta}^{*}$. Measured by $\|\boldsymbol{\zeta}_{\tau}^{*}-\boldsymbol{\zeta}^{*}\|$, the Huber error is caused by the robustification for heavy-tailed errors, while the distance $\|\boldsymbol{\zeta}_{\tau,h}^{*}-\boldsymbol{\zeta}^{*}\|$ is a consequence of both robustification and smoothing. Theorem 2 reveals that $\|\boldsymbol{\zeta}_{\tau,h}^{*}-\boldsymbol{\zeta}^{*}\|$ is controlled by both $\tau$ and $h$, with $h$ playing the role of a bandwidth as in nonparametric estimation.

Minimizing the empirical loss in (8) produces the robust estimator of interest, i.e.,

\[
\hat{\boldsymbol{\zeta}}_{\tau,h}=\mathop{\rm argmin}_{\boldsymbol{\zeta}\in\Theta_{\zeta}}\mathbb{P}_{n}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta})\big). \tag{10}
\]

From (8), (9), and (10), the total estimation error $\|\hat{\boldsymbol{\zeta}}_{\tau,h}-\boldsymbol{\zeta}^{*}\|$ can be decomposed into three parts, i.e.,

\[
\underbrace{\|\hat{\boldsymbol{\zeta}}_{\tau,h}-\boldsymbol{\zeta}^{*}\|}_{\text{total error}}\leq\underbrace{\|\hat{\boldsymbol{\zeta}}_{\tau,h}-\boldsymbol{\zeta}_{\tau,h}^{*}\|}_{\text{estimation error}}+\underbrace{\|\boldsymbol{\zeta}_{\tau,h}^{*}-\boldsymbol{\zeta}_{\tau}^{*}\|}_{\text{smooth error}}+\underbrace{\|\boldsymbol{\zeta}_{\tau}^{*}-\boldsymbol{\zeta}^{*}\|}_{\text{Huber error}}. \tag{11}
\]

It is natural to use an alternating strategy to obtain the estimate, denoted by $\hat{\boldsymbol{\zeta}}=(\hat{\boldsymbol{\alpha}}^{\rm T},\hat{\boldsymbol{\beta}}^{\rm T},\hat{\boldsymbol{\eta}}^{\rm T})^{\rm T}$. Specifically, the parameters $(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ and $\boldsymbol{\eta}$ can be estimated iteratively as follows. For given $\boldsymbol{\eta}^{(k)}$, $\boldsymbol{\alpha}^{(k+1)}$ and $\boldsymbol{\beta}^{(k+1)}$ are obtained by minimizing

\[
\mathbb{P}_{n}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}^{(k)})\big),
\]

and for given $\boldsymbol{\alpha}^{(k+1)}$ and $\boldsymbol{\beta}^{(k+1)}$, $\boldsymbol{\eta}^{(k+1)}$ is estimated by minimizing

\[
\mathbb{P}_{n}L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}^{(k+1)}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}^{(k+1)}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta})\big).
\]

Iterating these two minimizers leads to the desired robust estimator. The above alternating strategy is summarized in Algorithm A in Appendix A of the Supplementary Material.
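A minimal sketch of this alternating scheme, reusing `huber_loss` and `K_sigmoid` from the sketches above (the zero initial values and the choice of optimizers are illustrative; Algorithm A in the Supplementary Material is the authors' full procedure):

```python
import numpy as np
from scipy.optimize import minimize

def fit_alternating(y, X, Z, U, tau, h, K=K_sigmoid, n_iter=20):
    """Alternately minimize the smoothed empirical Huber loss in (10)
    over (alpha, beta) and over eta."""
    p, q, r = X.shape[1], Z.shape[1], U.shape[1]
    theta = np.zeros(p + q)               # (alpha, beta)
    eta = np.zeros(r - 1)                 # eta = gamma_{-1} / gamma_1

    def smoothed_loss(th, et):
        w = K((U[:, 0] + U[:, 1:] @ et) / h)      # K_h(U1 + U2' eta)
        res = y - X @ th[:p] - (Z @ th[p:]) * w
        return huber_loss(res, tau).mean()

    for _ in range(n_iter):
        theta = minimize(lambda th: smoothed_loss(th, eta), theta, method="BFGS").x
        eta = minimize(lambda et: smoothed_loss(theta, et), eta, method="Nelder-Mead").x
    return theta[:p], theta[p:], eta
```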

2.2 Nonasymptotic properties

This section begins with the assumptions needed to establish the nonasymptotic properties. Let ${\bar{\boldsymbol{V}}}$ be ${\boldsymbol{V}}$ with $U_{1}$ removed, i.e., ${\bar{\boldsymbol{V}}}=(y,{\boldsymbol{X}},{\boldsymbol{Z}},{\boldsymbol{U}}_{2})$, and let $\bar{\omega}(\boldsymbol{\eta})=U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}$ and $\bar{\omega}=\bar{\omega}(\boldsymbol{\eta}^{*})$.

  1. (A1)

    The conditional random vectors ${\mathrm{E}}({\boldsymbol{X}}|{\boldsymbol{U}}_{2})\in\mathbb{R}^{p}$ and ${\mathrm{E}}({\boldsymbol{Z}}|{\boldsymbol{U}}_{2})\in\mathbb{R}^{q}$ given ${\boldsymbol{U}}_{2}$ are sub-Gaussian, and ${\boldsymbol{U}}_{2}$ is sub-Gaussian. There is a universal constant $K_{1}>0$ satisfying $\|{\mathrm{E}}({\boldsymbol{X}}|{\boldsymbol{U}}_{2})\|_{\psi_{2}}\leq K_{1}$, $\|{\mathrm{E}}({\boldsymbol{Z}}|{\boldsymbol{U}}_{2})\|_{\psi_{2}}\leq K_{1}$, and $\|{\boldsymbol{U}}_{2}\|_{\psi_{2}}\leq K_{1}$. For any ${\boldsymbol{u}}_{2}\in\mathbb{R}^{r-1}$, the matrices ${\mathrm{E}}({\boldsymbol{X}}{\boldsymbol{X}}^{\rm T}|{\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2})$, ${\mathrm{E}}({\boldsymbol{Z}}{\boldsymbol{Z}}^{\rm T}|{\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2})$, and ${\mathrm{E}}({\boldsymbol{U}}_{2}{\boldsymbol{U}}_{2}^{\rm T})$ are uniformly positive definite, and there is a universal constant $K_{0}>0$ satisfying $\lambda_{\min}({\mathrm{E}}({\boldsymbol{X}}{\boldsymbol{X}}^{\rm T}|{\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2}))\geq K_{0}$, $\lambda_{\min}({\mathrm{E}}({\boldsymbol{Z}}{\boldsymbol{Z}}^{\rm T}|{\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2}))\geq K_{0}$, and $\lambda_{\min}({\mathrm{E}}({\boldsymbol{U}}_{2}{\boldsymbol{U}}_{2}^{\rm T}))\geq K_{0}$.

  2. (A2)

    The error variable $\epsilon$ is independent of $({\boldsymbol{X}}^{\rm T},{\boldsymbol{Z}}^{\rm T},{\boldsymbol{U}}^{\rm T})^{\rm T}$ and satisfies ${\mathrm{E}}(\epsilon)=0$ and ${\mathrm{E}}(|\epsilon|^{2+\delta})=M_{\delta}<\infty$ with $\delta\geq 0$. Denote $M_{0}={\mathrm{E}}(|\epsilon|^{2})$.

  3. (A3)

    $0<{\mathrm{E}}[{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)]<1$ for any $\boldsymbol{\gamma}\in\Theta_{\gamma}$, and there is a constant $\delta_{u_{2}}>0$ satisfying $\sup_{\boldsymbol{\eta}\in\mathbb{S}^{r-1}}{\mathrm{E}}[{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}\,{\boldsymbol{1}}({\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}\geq 0)]\geq\delta_{u_{2}}$.

  4. (A4)

    For almost every ${\boldsymbol{u}}_{2}$, the density of $U_{1}$ conditional on ${\boldsymbol{U}}_{2}={\boldsymbol{u}}_{2}$ is everywhere positive. The conditional density $f_{\varpi|\bar{{\boldsymbol{v}}}}(\varpi)$ of $\bar{\omega}$ given ${\bar{\boldsymbol{V}}}$ has a continuous derivative, and there is a constant $\kappa_{f}>0$ such that $f_{\varpi|\bar{{\boldsymbol{v}}}}(\varpi)$ and $|f_{\varpi|\bar{{\boldsymbol{v}}}}^{\prime}(\varpi)|$ are uniformly bounded from above by $\kappa_{f}$ over $(\varpi,\bar{{\boldsymbol{v}}})$. In addition, $f_{\varpi|\bar{{\boldsymbol{v}}}}(0)$ is uniformly bounded from below by $\delta_{f_{0}}$ over $\bar{{\boldsymbol{v}}}$, and $F_{\varpi|\bar{{\boldsymbol{v}}}}(0)$ is uniformly bounded from above by $\kappa_{F}$ over $\bar{{\boldsymbol{v}}}$, where $\delta_{f_{0}}>0$ and $0<\kappa_{F}<1$ are constants.

  5. (A5)

    The smooth function $K(\cdot)$ is twice differentiable and satisfies $K(-t)=1-K(t)$, and $K^{\prime}(\cdot)$ is symmetric around zero. Moreover, there is a universal constant $\kappa_{k}>0$ satisfying $\max\left\{\sup_{t\in\mathbb{R}}|K^{\prime}(t)|,\sup_{t\in\mathbb{R}}|K^{\prime\prime}(t)|,\int|K^{\prime}(t)|^{j}dt,\int|K^{\prime\prime}(t)|^{j}dt,\int|t||K^{\prime}(t)|^{j}dt\right\}\leq\kappa_{k}$ for $j=1,2$.

Remark 1.

Assumptions (A1)–(A5) are mild conditions for deriving the nonasymptotic bounds in Theorems 1–4 below. Assumption (A1) is the moment condition for the covariates. Assumption (A2) is imposed to control the moments of the error and to yield the adaptive nonasymptotic upper bounds; see Fan et al., 2017b; Wang et al., (2021); Han et al., (2022). Assumption (A3) is mild and easily verified in practice, and it is usually imposed in change-plane analysis; see Kang et al., (2017); Liu et al., (2024). Assumption (A4) is required to establish nonasymptotic properties in dealing with the indicator function; see Horowitz, (1993); Zhang et al., (2022). By Lemmas C1–C3 in Appendix C of the Supplementary Material, Assumption (A5) holds for commonly used smoothing functions such as (i) the cumulative distribution function of the standard normal distribution $K(u)=\Phi(u)$, (ii) the sigmoid function $K(u)=\{1+\exp(-u)\}^{-1}$, and (iii) the function $K(u)=\Phi(u)+u\phi(u)$; see Horowitz, (1993); Zhang et al., (2022).

Theorem 1.

Let $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$. If Assumptions (A1)–(A4) hold, then for some $\delta\geq 0$, the minimizer $\boldsymbol{\zeta}_{\tau}^{*}=((\boldsymbol{\theta}_{\tau}^{*})^{\rm T},(\boldsymbol{\eta}_{\tau}^{*})^{\rm T})^{\rm T}$ given in (9) satisfies

\[
\|\boldsymbol{\theta}_{\tau}^{*}-\boldsymbol{\theta}^{*}\|\leq 5K_{1}\frac{2M_{\delta}+C\|\boldsymbol{\theta}^{*}\|^{2+\delta}K_{1}^{2+3\delta/2}}{K_{0}(1-\kappa_{F})(1+\delta)\tau^{1+\delta}}
\]

and

\[
\|\boldsymbol{\eta}_{\tau}^{*}-\boldsymbol{\eta}^{*}\|\leq \frac{160K_{1}^{2}}{K_{0}\delta_{f_{0}}\delta_{u_{2}}\|\boldsymbol{\beta}^{*}\|^{2}}\left\{\frac{2M_{\delta}+C\|\boldsymbol{\theta}^{*}\|^{2+\delta}K_{1}^{2+3\delta/2}}{K_{0}(1-\kappa_{F})(1+\delta)\tau^{1+\delta}}\right\}^{2},
\]

where $C>0$ is a constant.

Theorem 1 states that the Huber error is of order $\tau^{-(1+\delta)}$ for $\|\boldsymbol{\theta}_{\tau}^{*}-\boldsymbol{\theta}^{*}\|$ and of order $\tau^{-(2+2\delta)}$ for $\|\boldsymbol{\eta}_{\tau}^{*}-\boldsymbol{\eta}^{*}\|$. As $\tau$ tends to infinity, the Huber loss becomes the ordinary quadratic loss, so the Huber error vanishes as expected. The next theorem studies the smooth error and the Huber error together.

Theorem 2.

If Assumptions (A1)–(A5) hold, then for some $\delta\geq 0$ and $h=o(1)$, the minimizer $\boldsymbol{\zeta}_{\tau,h}^{*}=((\boldsymbol{\theta}_{\tau,h}^{*})^{\rm T},(\boldsymbol{\eta}_{\tau,h}^{*})^{\rm T})^{\rm T}$ given in (8) satisfies

\[
\|\boldsymbol{\theta}_{\tau,h}^{*}-\boldsymbol{\theta}^{*}\|\leq 16K_{1}\left\{\frac{2M_{\delta}+C\|\boldsymbol{\theta}^{*}\|^{2+\delta}K_{1}^{2+3\delta/2}}{K_{0}(1-\kappa_{F})(1+\delta)\tau^{1+\delta}}+\frac{\kappa_{f}\kappa_{k}\|\boldsymbol{\beta}^{*}\|}{K_{0}(1-\kappa_{F})}h\right\}
\]

and

\[
\|\boldsymbol{\eta}_{\tau,h}^{*}-\boldsymbol{\eta}^{*}\|\leq\frac{64^{2}K_{1}^{2}\kappa_{f}\kappa_{k}}{\delta_{f_{0}}\delta_{u_{2}}\min\{\|\boldsymbol{\beta}^{*}\|,\|\boldsymbol{\beta}^{*}\|^{2}\}}\left\{\frac{2M_{\delta}+C\|\boldsymbol{\theta}^{*}\|^{2+\delta}K_{1}^{2+3\delta/2}}{K_{0}(1-\kappa_{F})(1+\delta)\tau^{1+\delta}}+\frac{\kappa_{f}\kappa_{k}\|\boldsymbol{\beta}^{*}\|}{K_{0}(1-\kappa_{F})}h\right\}^{2},
\]

where $C>0$ is a constant.

Theorem 2 reveals that $\|\boldsymbol{\theta}_{\tau,h}^{*}-\boldsymbol{\theta}^{*}\|$ and $\|\boldsymbol{\eta}_{\tau,h}^{*}-\boldsymbol{\eta}^{*}\|$ involve both the Huber error and the smooth error. The smoothness parameter $h$ can be of order $\tau^{-(1+\delta)}$, and because $K_{h}(t)$ approximates ${\boldsymbol{1}}(t\geq 0)$ as $h$ tends to zero, the upper bounds in Theorem 2 are of the same order as those in Theorem 1 when $h\rightarrow 0$. Thus, the deviations $\|\boldsymbol{\theta}_{\tau,h}^{*}-\boldsymbol{\theta}^{*}\|$ and $\|\boldsymbol{\eta}_{\tau,h}^{*}-\boldsymbol{\eta}^{*}\|$ are the sacrifices made in pursuit of robustification and smoothness. The next theorem provides the exponential-type deviations for the baseline parameter, the grouping difference parameter, and the grouping parameter.

Theorem 3.

If Assumptions (A1)–(A5) hold, then for some $\delta\geq 0$ and any $t>0$, when $h^{2}n/(p+q+r)\rightarrow\infty$ and $h=o(1)$, the estimator $\hat{\boldsymbol{\zeta}}_{\tau,h}=(\hat{\boldsymbol{\theta}}_{\tau,h}^{\rm T},\hat{\boldsymbol{\eta}}_{\tau,h}^{\rm T})^{\rm T}$ given in (10) satisfies, with probability at least $1-21\exp(-t)$,

\[
\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|\leq\frac{32}{K_{0}(1-\kappa_{F})}\,a(n,\tau) \tag{12}
\]

and

\[
\|\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*}\|\leq\frac{32}{K_{0}\|\boldsymbol{\beta}^{*}\|\sqrt{\delta_{f_{0}}\delta_{k}K_{0}(1-\kappa_{F})}}\,h^{1/2}a(n,\tau), \tag{13}
\]

where $\nu_{0}$ is a constant depending only on the constants $\{\kappa_{F},\kappa_{f},\kappa_{k},K_{1},\|\boldsymbol{\xi}^{*}\|\}$, and

\[
a(n,\tau)=\sqrt{3\nu_{0}(p+q+r+2t)/n}+\frac{2^{2+\delta}M_{\delta}\{(3+\delta)K_{1}\|\boldsymbol{\beta}^{*}\|\}^{2+\delta}}{\tau^{1+\delta}}.
\]

With an appropriate choice of $\tau$ when $\delta=0$, such as $\tau=O((n/t)^{1/2})$ with $t=\log(n)$, the nonasymptotic property of the sub-Gaussian estimator $\hat{\boldsymbol{\theta}}_{\tau,h}$ in Theorem 3 demonstrates that the deviation $\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|$ adapts to the sample size, dimension, robustification parameter $\tau$, and moments in pursuit of the optimal tradeoff between bias and robustness, and the deviation $\|\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*}\|$ adapts to the extra smoothness parameter $h$ to achieve smoothness. Adaptation to the robustification parameter $\tau$ arises from pursuing robustness for linear regression with heavy-tailed errors. The deviations $\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|$ and $\|\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*}\|$ coincide with those of the smoothed OLS estimator when $\tau\rightarrow\infty$, because $a(n,\tau)=\sqrt{3\nu_{0}(p+q+r+2t)/n}$ as $\tau\rightarrow\infty$. The next theorem derives the nonasymptotic Bahadur representation of the robust estimators given in (10).

Theorem 4.

Let ${\boldsymbol{W}}(\boldsymbol{\eta})=({\boldsymbol{X}}^{\rm T},{\boldsymbol{Z}}^{\rm T}{\boldsymbol{1}}(\bar{\omega}(\boldsymbol{\eta})\geq 0))^{\rm T}$ and ${\widetilde{\boldsymbol{W}}}_{h}(\boldsymbol{\eta})=({\boldsymbol{X}}^{\rm T},{\boldsymbol{Z}}^{\rm T}K_{h}(\bar{\omega}(\boldsymbol{\eta})))^{\rm T}$, where $\bar{\omega}(\boldsymbol{\eta})=U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}$. If the assumptions in Theorem 3 are satisfied, then with probability at least $1-32\exp(-t)$ we have

\[
\begin{split}
\bigg\|\Sigma_{W}^{1/2}(\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*})-&\Sigma_{W}^{-1/2}\mathbb{P}_{n}\left\{{\widetilde{\boldsymbol{W}}}_{h}(\boldsymbol{\eta}^{*})\psi_{\tau}(y-{\widetilde{\boldsymbol{W}}}_{h}(\boldsymbol{\eta}^{*})^{\rm T}\boldsymbol{\theta}^{*})\right\}\bigg\|\\
\leq&\left\{\nu_{1}\sqrt{(p+q+1+2t)/n}+\nu_{2}h^{1/2}\right\}a(n,\tau)
\end{split} \tag{14}
\]

and

\[
\begin{split}
\bigg\|\Sigma_{U_{2}}^{1/2}(\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*})-&h\Sigma_{U_{2}}^{-1/2}\mathbb{P}_{n}\left\{{\boldsymbol{U}}_{2}{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}^{*}K_{h}(U_{1}+{\boldsymbol{U}}_{2}^{\rm T}\boldsymbol{\eta}^{*})\psi_{\tau}(y-{\widetilde{\boldsymbol{W}}}_{h}(\boldsymbol{\eta}^{*})^{\rm T}\boldsymbol{\theta}^{*})\right\}\bigg\|\\
\leq&\left\{\nu_{3}\sqrt{(r+2t)/(nh)}+\nu_{4}\left(h+\tau^{-(1+\delta)}\right)\right\}h^{1/2}a(n,\tau),
\end{split} \tag{15}
\]

where $a(n,\tau)$ is as defined in Theorem 3; $\nu_{1}>0$, $\nu_{2}>0$, $\nu_{3}>0$, and $\nu_{4}>0$ are constants depending only on the constants $\{\kappa_{F},\kappa_{f},\kappa_{k},K_{0},K_{1},\|\boldsymbol{\xi}^{*}\|\}$; and

\[
\Sigma_{W}=\boldsymbol{\mathrm{P}}\left\{{\boldsymbol{W}}(\boldsymbol{\eta}^{*}){\boldsymbol{W}}(\boldsymbol{\eta}^{*})^{\rm T}\right\},\quad \Sigma_{U_{2}}=K^{\prime}(0)\boldsymbol{\mathrm{P}}\left\{f_{\bar{\omega}|{\bar{\boldsymbol{v}}}}(0)({\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}^{*})^{2}{\boldsymbol{U}}_{2}{\boldsymbol{U}}_{2}^{\rm T}\right\}.
\]

Theorem 4 shows that the remainder of the Bahadur representation of $\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}$ achieves the rate $h^{1/2}a(n,\tau)$, which is the same as that for $\hat{\boldsymbol{\eta}}_{\tau,h}-\boldsymbol{\eta}^{*}$. Because of the rate restriction $h^{2}n/(p+q+r)\rightarrow\infty$ with $h=o(1)$, the remainder of the Bahadur representation in inequality (15) does not exhibit the sub-exponential behavior considered by Sun et al., (2020); Chen and Zhou, (2020). The reason is that model (1) involves a change plane approximated by a smooth function. To the best of the authors’ knowledge, this is the first time that this type of nonasymptotic Bahadur representation has been reported in the literature, especially for the robust estimator $\hat{\boldsymbol{\eta}}_{\tau,h}$ of the grouping parameter, with previous studies reporting only polynomial-type deviation bounds; see Liu et al., (2024); Zhang et al., (2022). The classical asymptotic results follow conveniently from the Bahadur representation, and Theorems 3 and 4 show that the robustification parameter $\tau$ and the smoothness parameter $h$ play the same role as the bandwidth in constructing classical nonparametric estimators.

2.3 Implementation

The theoretical properties in Section 2.2 guarantee that the robust estimation performs well with appropriate choices of the robustification parameter $\tau$ and the smoothness parameter $h$. Because the robustification parameter $\tau$ is treated as a tuning parameter that balances bias and robustness, it is natural to select an appropriate $\tau$ in practice by cross-validation (CV). However, as noted by Chen and Zhou, (2020) and Wang et al., (2021), because $M_{\delta}={\mathrm{E}}(|\epsilon|^{2+\delta})$ is typically unknown in practice, its empirical OLS estimator $\hat{M}_{\delta}=(n-p-q)^{-1}\sum_{i=1}^{n}(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\hat{\boldsymbol{\alpha}}_{\tau,h}-{\boldsymbol{Z}}_{i}^{\rm T}\hat{\boldsymbol{\beta}}_{\tau,h}{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\hat{\boldsymbol{\gamma}}_{\tau,h}\geq 0))^{2+\delta}$ performs poorly when the errors are heavy-tailed. Instead, there are two good alternatives (Chen and Zhou, 2020): (i) an adaptive technique based on Lepski's method (Lepskii, 1992) and (ii) a Huber-type method that solves a so-called censored equation (Hahn et al., 1990); see Chen and Zhou, (2020); Wang et al., (2021) for details. For the smoothness parameter $h$, CV is always a natural choice. Alternatively, in light of the theoretical conditions on $h$, the rule of thumb $h_{n}=c_{h}\hat{\sigma}_{u}\log(n)/\sqrt{n}$ suggested by Seo and Linton, (2007), Li et al., (2021), and Zhang et al., (2022) is recommended to reduce computation, where $c_{h}$ is a constant and $\hat{\sigma}_{u}=\sqrt{(n-r)^{-1}\sum_{i=1}^{n}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\hat{\boldsymbol{\eta}}^{ols})^{2}}$ is the estimated standard deviation of ${\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}$, with $(\hat{\boldsymbol{\alpha}}^{ols},\hat{\boldsymbol{\beta}}^{ols},\hat{\boldsymbol{\eta}}^{ols})$ being the estimator of $(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})$ obtained using the ordinary quadratic loss.
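As an illustration, the rule of thumb takes only a few lines (the default $c_{h}=1$ below is an illustrative assumption, not a value prescribed by the paper):

```python
import numpy as np

def rule_of_thumb_h(U, eta_ols, c_h=1.0):
    """Rule-of-thumb smoothness parameter h_n = c_h * sigma_u * log(n) / sqrt(n),
    where eta_ols is a pilot estimate from the ordinary quadratic loss."""
    n, r = U.shape
    w = U[:, 0] + U[:, 1:] @ eta_ols
    sigma_u = np.sqrt(np.sum(w ** 2) / (n - r))
    return c_h * sigma_u * np.log(n) / np.sqrt(n)
```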

Attention now turns to the implementation of subgroup-classifier learning, with the parameters $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ and $\boldsymbol{\eta}$ estimated iteratively as follows. Let $\ell_{\tau}(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})$ be the loss function for model (1), i.e.,

\[
\ell_{\tau}(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})=\sum_{i=1}^{n}L_{\tau}\left(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta}\geq 0)\right),
\]

and let the smoothed loss function be

\[
\tilde{\ell}_{\tau,h}(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})=\sum_{i=1}^{n}L_{\tau}\left(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}_{i}^{\rm T}\boldsymbol{\beta}K_{h}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta})\right). \tag{16}
\]

For given 𝜼(k)\boldsymbol{\eta}^{(k)}, one obtains 𝜶(k+1)\boldsymbol{\alpha}^{(k+1)} and 𝜷(k+1)\boldsymbol{\beta}^{(k+1)} by minimizing the smoothed loss function

\[
(\boldsymbol{\alpha}^{(k+1)},\boldsymbol{\beta}^{(k+1)})=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in\Theta_{\alpha},\boldsymbol{\beta}\in\Theta_{\beta}}\tilde{\ell}_{\tau,h}(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta}^{(k)}),
\]

and for given $\boldsymbol{\alpha}^{(k+1)}$ and $\boldsymbol{\beta}^{(k+1)}$, one estimates $\boldsymbol{\eta}^{(k+1)}$ by

\[
\boldsymbol{\eta}^{(k+1)}=\mathop{\rm argmin}_{\boldsymbol{\eta}\in\Theta_{\eta}}\tilde{\ell}_{\tau,h}(\boldsymbol{\alpha}^{(k+1)},\boldsymbol{\beta}^{(k+1)},\boldsymbol{\eta}).
\]

Iterating these two minimizers leads to the desired robust estimators. These steps are summarized, together with the multiplier bootstrap calibration, in Algorithm 1 in Appendix A of the Supplementary Material, which provides the strategy for estimating the confidence intervals of the estimators $\hat{\boldsymbol{\alpha}}$, $\hat{\boldsymbol{\beta}}$, and $\hat{\boldsymbol{\eta}}$.

Note that herein, $\boldsymbol{\alpha}^{(k+1)}$ and $\boldsymbol{\beta}^{(k+1)}$ for given $\boldsymbol{\eta}^{(k)}$ are obtained by the robust data-adaptive method proposed by Wang et al., (2021). Specifically, $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ is estimated and $\tau$ is calibrated simultaneously by solving the following system of equations:

\[
\left\{\begin{array}{l}
\sum_{i=1}^{n}\psi_{\tau}(y_{i}-{\boldsymbol{W}}_{i}^{\rm T}\boldsymbol{\theta}){\boldsymbol{W}}_{i}=0,\\
(\tau^{2}n)^{-1}\sum_{i=1}^{n}\min\{(y_{i}-{\boldsymbol{W}}_{i}^{\rm T}\boldsymbol{\theta})^{2},\tau^{2}\}-n^{-1}(d+z)=0,
\end{array}\right.
\]

where $d=p+q-1$, ${\boldsymbol{W}}_{i}=({\boldsymbol{X}}_{i}^{\rm T},{\boldsymbol{Z}}_{i}^{\rm T}{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta}^{(k)}\geq 0))^{\rm T}$, and $z=\log(n)$ as suggested by Wang et al., (2021). The initial values $\boldsymbol{\theta}^{(0)}=\boldsymbol{\theta}^{(ols)}$ and $\tau^{(0)}=\hat{\sigma}_{\epsilon}\sqrt{n/(d+z)}$ are set using the ordinary quadratic loss, where $\hat{\sigma}_{\epsilon}^{2}=(n-p-q-r)^{-1}\sum_{i=1}^{n}(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\hat{\boldsymbol{\alpha}}^{ols}-{\boldsymbol{Z}}_{i}^{\rm T}\hat{\boldsymbol{\beta}}^{ols}{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\hat{\boldsymbol{\eta}}^{ols}\geq 0))^{2}$, with $(\hat{\boldsymbol{\alpha}}^{ols},\hat{\boldsymbol{\beta}}^{ols},\hat{\boldsymbol{\eta}}^{ols})$ being the estimator of $(\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\eta})$ obtained using the ordinary quadratic loss.
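A fixed-point sketch of this joint calibration, with a simple iteratively reweighted least squares step standing in for the solver of Wang et al., (2021):

```python
import numpy as np

def calibrate_theta_tau(y, W, n_iter=50, tol=1e-8):
    """Alternately solve the Huber score equation for theta (via IRLS)
    and the censored second-moment equation for tau."""
    n, d1 = W.shape
    d, z = d1 - 1, np.log(n)                       # d = p + q - 1, z = log(n)
    theta = np.linalg.lstsq(W, y, rcond=None)[0]   # OLS initial value
    tau = np.std(y - W @ theta) * np.sqrt(n / (d + z))
    for _ in range(n_iter):
        res = y - W @ theta
        # IRLS weights: psi_tau(r) = w * r with w = min(1, tau / |r|)
        sw = np.sqrt(np.where(np.abs(res) <= tau, 1.0, tau / np.abs(res)))
        theta_new = np.linalg.lstsq(W * sw[:, None], y * sw, rcond=None)[0]
        # censored equation: sum min(r^2, tau^2) = tau^2 (d + z)
        res = y - W @ theta_new
        tau_new = np.sqrt(np.sum(np.minimum(res ** 2, tau ** 2)) / (d + z))
        done = np.linalg.norm(theta_new - theta) + abs(tau_new - tau) < tol
        theta, tau = theta_new, tau_new
        if done:
            break
    return theta, tau
```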

3 Robust subgroup testing

Before learning the subgroup classifier, it is of interest to test for the existence of subgroups, which guards against the identifiability problem of $\boldsymbol{\eta}$. This section considers the test problem (2). Recall the loss function in (9), i.e.,

\[
L_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\big), \tag{17}
\]

the derivative of which with respect to $\boldsymbol{\beta}$ under the alternative hypothesis is

\[
\varphi({\boldsymbol{V}},\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma})={\boldsymbol{Z}}\,{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\,\psi_{\tau}\big(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}\,{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\big),
\]

and with respect to $\boldsymbol{\alpha}$ under the null hypothesis is

\[
\varphi_{0}({\boldsymbol{V}},\boldsymbol{\alpha})={\boldsymbol{X}}\psi_{\tau}(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}),
\]

where $\psi_{\tau}(u)$ is the first derivative of the Huber loss (5), defined as $\psi_{\tau}(u)=\hbox{sgn}(u)\min\{|u|,\tau\}$.

3.1 Robust estimation under null hypothesis

Under the null hypothesis, model (1) reduces to the ordinary linear regression model with heavy-tailed errors, i.e.,

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+\epsilon_{i}. \tag{18}
\]

Parametric estimation in model (18) is well studied in the literature; see Huber, (1964, 1973); Fan et al., 2017b; Sun et al., (2020); Wang et al., (2021); Han et al., (2022); Chen and Zhou, (2020); Zhou et al., (2018), among others. Let $\hat{\boldsymbol{\alpha}}_{\tau}$ be the estimate of $\boldsymbol{\alpha}_{\tau}$ under the null hypothesis, i.e.,

\[
\hat{\boldsymbol{\alpha}}_{\tau}=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in\Theta_{\alpha}}\mathbb{P}_{n}L_{\tau}(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}). \tag{19}
\]

Before establishing the asymptotic properties of the robust estimator under the null hypothesis and of the test statistic, the following assumptions are required.

  1. (B1)

    The random vector ${\boldsymbol{X}}$ is sub-Gaussian, $J=\{\boldsymbol{\mathrm{P}}[{\boldsymbol{X}}{\boldsymbol{X}}^{\rm T}]\}^{-1}\in\mathbb{R}^{p\times p}$ is a finite and positive definite deterministic matrix, and there is a universal constant $K_{1}>0$ satisfying $\|{\boldsymbol{X}}\|_{\psi_{2}}\leq K_{1}$ and $\|{\boldsymbol{Z}}\|_{\psi_{2}}\leq K_{1}$.

  2. (B2)

    The error variable $\epsilon$ is independent of $({\boldsymbol{X}}^{\rm T},{\boldsymbol{Z}}^{\rm T},{\boldsymbol{U}}^{\rm T})^{\rm T}$ and satisfies $\boldsymbol{\mathrm{P}}(\epsilon)=0$ and $\boldsymbol{\mathrm{P}}(|\epsilon|^{2+\delta})=M_{\delta}<\infty$ with $\delta\geq 0$.

  3. (B3)

    $0<\boldsymbol{\mathrm{P}}[{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)]<1$ for any $\boldsymbol{\gamma}\in\Theta_{\gamma}\subseteq\mathbb{R}^{r}$.

Remark 2.

Assumption (B1) is the moment condition of covariates for establishing the nonasymptotic properties under the null hypothesis and deriving the asymptotic distributions. Assumption (B2) is the same as Assumption (A2), and Assumption (B3) is the same as Assumption (A3).

Lemma 1.

(Chen and Zhou, (2020)) If Assumptions (B1)–(B3) hold, then for any $t>0$ and $v\geq v_{2+\delta}^{1/(2+\delta)}$, the estimator $\hat{\boldsymbol{\alpha}}_{\tau}$ given in (19) with $\tau=v(\frac{n}{p+t})^{1/(2+\delta)}$ satisfies

\[
\boldsymbol{\mathrm{P}}\left\{\left\|J^{-1/2}(\hat{\boldsymbol{\alpha}}_{\tau}-\boldsymbol{\alpha}^{*})\right\|\geq c_{1}v\sqrt{\frac{p+t}{n}}\right\}\leq 2e^{-t}
\quad\mbox{and}\quad
\boldsymbol{\mathrm{P}}\left\{\left\|J^{-1/2}(\hat{\boldsymbol{\alpha}}_{\tau}-\boldsymbol{\alpha}^{*})-J^{-1/2}\mathbb{P}_{n}{\boldsymbol{X}}\psi_{\tau}(\epsilon)\right\|\geq c_{2}v\sqrt{\frac{p+t}{n}}\right\}\leq 2e^{-t} \tag{20}
\]

as long as $n\geq c_{3}(p+t)$, where $c_{1}$–$c_{3}$ are constants depending only on $K_{1}$.

3.2 Robust test statistic

As discussed by Liu et al., (2024), for any known $\boldsymbol{\gamma}\in\Theta_{\gamma}$, it is natural to consider an SST statistic for testing $\boldsymbol{\beta}={\boldsymbol{0}}$, i.e.,

\[
\tilde{T}_{n}(\boldsymbol{\gamma})=n^{-1}\|\mathbb{P}_{n}\varphi({\boldsymbol{V}},\hat{\boldsymbol{\alpha}}_{\tau},{\boldsymbol{0}},\boldsymbol{\gamma})\|^{2}_{{\tilde{V}}(\boldsymbol{\gamma})^{-1}}, \tag{21}
\]

where $\hat{\boldsymbol{\alpha}}_{\tau}$ is given in (19) and ${\tilde{V}}(\boldsymbol{\gamma})=\mathbb{P}_{n}\{\varphi({\boldsymbol{V}},\hat{\boldsymbol{\alpha}}_{\tau},{\boldsymbol{0}},\boldsymbol{\gamma})-\hat{G}(\boldsymbol{\gamma})\hat{J}\varphi_{0}({\boldsymbol{V}},\hat{\boldsymbol{\alpha}}_{\tau})\}^{\otimes 2}$. Here, ${\hat{G}}(\boldsymbol{\gamma})$ and ${\hat{J}}$ are consistent estimators of $G(\boldsymbol{\gamma})$ and $J$, respectively, where $J$ is as defined in Assumption (B1) and

\[
G(\boldsymbol{\gamma})=\boldsymbol{\mathrm{P}}\{{\boldsymbol{Z}}{\boldsymbol{X}}^{\rm T}{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0){\boldsymbol{1}}(|y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}^{*}|\leq\tau)\}\in\mathbb{R}^{q\times p}.
\]
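For a fixed $\boldsymbol{\gamma}$, the statistic can be computed by plug-in, as in the following sketch; the sample versions of $J$ and $G(\boldsymbol{\gamma})$ and the normalization of the averaged score to the $\chi^{2}_{q}$ scale of Lemma 2 are our reading of (21), not the authors' exact implementation.

```python
import numpy as np

def sst_statistic(y, X, Z, U, gamma, alpha_hat, tau):
    """Squared score test statistic for a fixed gamma (plug-in sketch)."""
    n = len(y)
    ind = (U @ gamma >= 0).astype(float)
    res = y - X @ alpha_hat
    psi = np.sign(res) * np.minimum(np.abs(res), tau)    # psi_tau(y - X'alpha_hat)
    phi = Z * (ind * psi)[:, None]                       # phi(V_i, alpha_hat, 0, gamma)
    phi0 = X * psi[:, None]                              # phi_0(V_i, alpha_hat)
    J_hat = np.linalg.inv(X.T @ X / n)                   # J = {P[XX']}^{-1}
    G_hat = (Z * (ind * (np.abs(res) <= tau))[:, None]).T @ X / n
    proj = phi - phi0 @ (G_hat @ J_hat).T                # centered score terms
    V_hat = proj.T @ proj / n                            # V_tilde(gamma)
    score = phi.mean(axis=0)
    return n * score @ np.linalg.solve(V_hat, score)     # ~ chi^2_q under H_0
```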
Lemma 2.

If Assumptions (B1)–(B3) hold, then for any fixed $\boldsymbol{\gamma}\in\Theta_{\gamma}$, $\tilde{T}_{n}(\boldsymbol{\gamma})$ converges in distribution to a $\chi^{2}$ distribution with $q$ degrees of freedom under $H_{0}$ as $n\rightarrow\infty$.

Although the unknown parameter $\boldsymbol{\gamma}$ prevents $\tilde{T}_{n}(\boldsymbol{\gamma})$ from being used directly in practice, Lemma 2 reveals that the asymptotic distribution of $\tilde{T}_{n}(\boldsymbol{\gamma})$ is free of the nuisance parameter $\boldsymbol{\gamma}$. Thus, the supremum and the weighted average of $\tilde{T}_{n}(\boldsymbol{\gamma})$ over $\boldsymbol{\gamma}$ should guarantee correct type-I errors, which motivates constructing the robust test statistic as a weighted average of $\tilde{T}_{n}(\boldsymbol{\gamma})$ over the parametric space $\Theta_{\gamma}$.

Fan et al., 2017a studied the supremum of the SST statistic $\tilde{T}_{n}(\boldsymbol{\gamma})$ over the grouping parameter $\boldsymbol{\gamma}$ for a semiparametric model, i.e.,

\[
\tilde{T}_{n}=\sup_{\boldsymbol{\gamma}\in\Theta_{\gamma}}\left\{n^{-1}\|\mathbb{P}_{n}\varphi({\boldsymbol{V}},\hat{\boldsymbol{\alpha}}_{\tau},{\boldsymbol{0}},\boldsymbol{\gamma})\|^{2}_{{\tilde{V}}(\boldsymbol{\gamma})^{-1}}\right\}. \tag{22}
\]

The test statistic $\tilde{T}_{n}$ has been investigated widely in the literature; see Andrews and Ploberger, (1994, 1995); Davies, (1977); Song et al., (2009); Shen and Qu, (2020); Liu et al., (2024). It is easy to extend the SST to model (1) with heavy-tailed errors according to the Bahadur representation in Theorem 4.

When the dimension of the parametric space $\Theta_{\gamma}$ is large, searching for the supremum over $\Theta_{\gamma}$ may cause $\tilde{T}_{n}$ to lose power in practice and is computationally time-consuming. To avoid these drawbacks, proposed in this section is a robust test procedure of the WAST type first introduced by Liu et al., (2024).

The proposed robust WAST (RWAST) statistic is

\[
T_{n}=\frac{1}{n(n-1)}\sum_{i\neq j}\omega_{ij}{\boldsymbol{Z}}_{i}^{\rm T}{\boldsymbol{Z}}_{j}\psi_{\tau}(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\hat{\boldsymbol{\alpha}}_{\tau})\psi_{\tau}(y_{j}-{\boldsymbol{X}}_{j}^{\rm T}\hat{\boldsymbol{\alpha}}_{\tau}),
\tag{23}
\]

where

\[
\omega_{ij}=\frac{1}{4}+\frac{1}{2\pi}\arctan\left(\frac{\varrho_{ij}}{\sqrt{1-\varrho_{ij}^{2}}}\right)\quad\mbox{if}~i\neq j,
\tag{24}
\]

and $\varrho_{ij}={\boldsymbol{U}}_{i}^{\rm T}{\boldsymbol{U}}_{j}(\|{\boldsymbol{U}}_{i}\|\|{\boldsymbol{U}}_{j}\|)^{-1}$. As noted by Liu et al., (2024), there is a Bayesian explanation for the weight $\omega_{ij}$. In fact, Lemma D1 in the Supplementary Material shows that

\[
\omega_{ij}=\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}\geq 0){\boldsymbol{1}}({\boldsymbol{U}}_{j}^{\rm T}\boldsymbol{\gamma}\geq 0)w(\boldsymbol{\gamma})d\boldsymbol{\gamma},
\]

where $w(\boldsymbol{\gamma})$ is the standard multivariate Gaussian density; any other weight satisfying $w(\boldsymbol{\gamma})\geq 0$ for all $\boldsymbol{\gamma}\in\Theta_{\gamma}$ and $\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}w(\boldsymbol{\gamma})d\boldsymbol{\gamma}=1$ could be chosen instead. In the Bayesian motivation, the grouping parameter $\boldsymbol{\gamma}$ has a prior with density $w(\boldsymbol{\gamma})$. Because the goal herein is to test for the existence of subgroups rather than to estimate the grouping parameter, no requirement is placed on the posterior distribution.

The choice of the weight affects the computation of the test statistic because of the numerical integration over $\mathbb{R}^{q}$. Taking the weight as the standard multivariate Gaussian density offers good performance in practice, as illustrated in the simulation studies in Section 4 and the case studies in Section 5. To assess sensitivity to the choice of weight in robust regression, numerical studies compared the closed-form $\omega_{ij}$ in (24) with the approximated $\omega_{ij}$ in (E.3) of Appendix E.3 in the Supplementary Material. Compared with the approximated $\omega_{ij}$, the test statistic with $\omega_{ij}$ in (24) has uniformly higher power and takes only 10% of the computation time when $N=10\,000$. These results strongly recommend the closed-form RWAST in (24), which can be computed directly as in the sketch below.
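Because (24) is available in closed form, $T_{n}$ reduces to simple matrix arithmetic. The following Python sketch evaluates (23) with the weights (24), using the identity $\arctan(\varrho/\sqrt{1-\varrho^{2}})=\arcsin(\varrho)$ for $|\varrho|\leq 1$; the function name and interface are illustrative conventions, not the authors' code.

```python
import numpy as np

def rwast(y, X, Z, U, alpha_hat, tau):
    """RWAST statistic T_n of (23) with the closed-form weights (24)."""
    n = len(y)
    s = np.clip(y - X @ alpha_hat, -tau, tau)          # psi_tau(y_i - X_i'alpha_hat)
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)  # normalize the rows of U
    rho = np.clip(Un @ Un.T, -1.0, 1.0)                # varrho_ij, guarding rounding error
    W = 0.25 + np.arcsin(rho) / (2.0 * np.pi)          # weights (24)
    A = W * (Z @ Z.T) * np.outer(s, s)                 # summands of (23)
    np.fill_diagonal(A, 0.0)                           # restrict the sum to i != j
    return A.sum() / (n * (n - 1))
```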

To establish the asymptotic distribution of RWAST, additional notation is introduced below. Denote the kernel of a U-statistic under the null hypothesis by

\[
\begin{split}
h({\boldsymbol{V}}_{i},{\boldsymbol{V}}_{j})=&\,\omega_{ij}{\boldsymbol{Z}}_{i}^{\rm T}{\boldsymbol{Z}}_{j}\psi_{\tau}(y_{i}-{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}^{*})\psi_{\tau}(y_{j}-{\boldsymbol{X}}_{j}^{\rm T}\boldsymbol{\alpha}^{*})\\
&+\varphi_{0}({\boldsymbol{V}}_{i},\boldsymbol{\alpha}^{*})^{\rm T}K_{j}+K_{i}^{\rm T}\varphi_{0}({\boldsymbol{V}}_{j},\boldsymbol{\alpha}^{*})+\varphi_{0}({\boldsymbol{V}}_{i},\boldsymbol{\alpha}^{*})^{\rm T}H\varphi_{0}({\boldsymbol{V}}_{j},\boldsymbol{\alpha}^{*}),
\end{split}
\tag{25}
\]

where $\varphi_{0}({\boldsymbol{V}},\boldsymbol{\alpha})={\boldsymbol{X}}\psi_{\tau}(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha})$, and

\[
H=\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}J^{\rm T}G(\boldsymbol{\gamma})^{\rm T}G(\boldsymbol{\gamma})J\,w(\boldsymbol{\gamma})d\boldsymbol{\gamma}
\quad\mbox{and}\quad
K_{i}=\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}J^{\rm T}G(\boldsymbol{\gamma})^{\rm T}\varphi({\boldsymbol{V}}_{i},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma})w(\boldsymbol{\gamma})d\boldsymbol{\gamma}
\]

with $\varphi({\boldsymbol{V}},\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma})={\boldsymbol{Z}}{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\psi_{\tau}(y-{\boldsymbol{X}}^{\rm T}\boldsymbol{\alpha}-{\boldsymbol{Z}}^{\rm T}\boldsymbol{\beta}{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0))$.

Theorem 5.

If Assumptions (B1)–(B3) hold, then under the null hypothesis, we have

\[
nT_{n}-\mu_{0}\stackrel{\cal{L}}{\longrightarrow}\nu,
\]

where $\stackrel{\cal{L}}{\longrightarrow}$ denotes convergence in distribution, $\mu_{0}=n\{\boldsymbol{\mathrm{P}}\psi_{\tau}(\epsilon)\}^{2}\int_{\boldsymbol{\gamma}\in\Theta_{\gamma}}\{\boldsymbol{\mathrm{P}}{\boldsymbol{Z}}{\boldsymbol{1}}({\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}\geq 0)\}^{2}w(\boldsymbol{\gamma})d\boldsymbol{\gamma}+\boldsymbol{\mathrm{P}}[\varphi_{0}({\boldsymbol{V}},\boldsymbol{\alpha}^{*})^{\rm T}H\varphi_{0}({\boldsymbol{V}},\boldsymbol{\alpha}^{*})]+2\boldsymbol{\mathrm{P}}[\varphi_{0}({\boldsymbol{V}}_{1},\boldsymbol{\alpha}^{*})^{\rm T}K_{1}]$, $\nu$ is a random variable of the form $\nu=\sum_{j=1}^{\infty}\lambda_{j}(\chi^{2}_{1j}-1)$, and $\chi^{2}_{11},\chi^{2}_{12},\cdots$ are independent $\chi^{2}_{1}$ variables, i.e., $\nu$ has the characteristic function

\[
\boldsymbol{\mathrm{P}}\left[e^{it\nu}\right]=\prod_{j=1}^{\infty}(1-2it\lambda_{j})^{-\frac{1}{2}}e^{-it\lambda_{j}}.
\]

Here, $i=\sqrt{-1}$ is the imaginary unit, and $\{\lambda_{j}\}$ are the eigenvalues of the kernel $h({\boldsymbol{v}}_{1},{\boldsymbol{v}}_{2})$ under $f({\boldsymbol{v}},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})$, i.e., they are the solutions of $\lambda_{j}g_{j}({\boldsymbol{v}}_{2})=\int_{0}^{\infty}h({\boldsymbol{v}}_{1},{\boldsymbol{v}}_{2})g_{j}({\boldsymbol{v}}_{1})f({\boldsymbol{v}}_{1},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})d{\boldsymbol{v}}_{1}$ for nonzero $g_{j}$, where $f({\boldsymbol{v}},\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma})$ is the density of ${\boldsymbol{V}}$.
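When a finite set of leading eigenvalues $\{\lambda_{j}\}$ is available, e.g., from an eigendecomposition of the empirical kernel matrix, the null distribution of $nT_{n}-\mu_{0}$ can be approximated by Monte Carlo simulation of a truncated series, as in the following sketch. The truncation is an assumption: its accuracy depends on the discarded tail of the eigenvalues, and the eigenvalues shown in the usage line are hypothetical.

```python
import numpy as np

def simulate_nu(lambdas, n_draws=100_000, seed=0):
    """Draw from the truncated series nu = sum_j lambda_j * (chi2_1j - 1)."""
    rng = np.random.default_rng(seed)
    chi2 = rng.chisquare(df=1, size=(n_draws, len(lambdas)))
    return (chi2 - 1.0) @ np.asarray(lambdas)

# Example: a 95% critical value from hypothetical leading eigenvalues.
draws = simulate_nu([0.9, 0.5, 0.2, 0.1])
critical_value = np.quantile(draws, 0.95)
```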

Investigated next is the power of the proposed test statistic under two types of alternative hypotheses under which subgroups exist. Considered first is the global alternative $H_{1g}$: $\boldsymbol{\beta}=\boldsymbol{\xi}$, where $\boldsymbol{\xi}\in\Theta_{\boldsymbol{\beta}}\backslash\{{\boldsymbol{0}}\}$ is fixed. Theorem 6 provides the asymptotic distribution of the test statistic $T_{n}$ under the global alternative.

Theorem 6.

If Assumptions (B1)–(B3) hold, then under the global alternative $H_{1g}$, we have

\[
\sqrt{n}(T_{n}-\mu_{1})\stackrel{\cal{L}}{\longrightarrow}\mathcal{N}(0,\sigma^{2}_{\xi}),
\]

where $\mu_{1}=\boldsymbol{\mathrm{P}}[h({\boldsymbol{V}}_{1},{\boldsymbol{V}}_{2})]$ and $\sigma^{2}_{\xi}=4\,\hbox{Var}(\boldsymbol{\mathrm{P}}[h({\boldsymbol{V}}_{1},{\boldsymbol{V}}_{2})|{\boldsymbol{V}}_{1}])$.

To derive the asymptotic distribution under the local alternative hypothesis $H_{1l}:\boldsymbol{\beta}=n^{-1/2}\boldsymbol{\xi}$, where $\boldsymbol{\xi}\in\Theta_{\beta}$ is a fixed vector, an additional assumption is required.

  1. (B4)

There is a positive function $b({\boldsymbol{v}},\boldsymbol{\xi})$ of ${\boldsymbol{v}}$ depending on $\boldsymbol{\alpha}^{*}$ and $\boldsymbol{\gamma}^{*}$ such that

\[
\left|\boldsymbol{\xi}^{\rm T}\frac{\partial f({\boldsymbol{v}};\boldsymbol{\alpha}^{*},r_{n}\boldsymbol{\xi},\boldsymbol{\gamma}^{*})/\partial\boldsymbol{\beta}}{f({\boldsymbol{v}};\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})}\right|\leq b({\boldsymbol{v}},\boldsymbol{\xi}),
\]

and $\boldsymbol{\mathrm{P}}[b({\boldsymbol{V}},\boldsymbol{\xi})^{2}]$ and $\boldsymbol{\mathrm{P}}[\phi_{k}({\boldsymbol{V}})^{2}b({\boldsymbol{V}},\boldsymbol{\xi})]$ for all $k=1,\cdots$ are bounded by $C_{\boldsymbol{\xi}}$, where $\boldsymbol{\xi}\in\Theta_{\beta}$, $r_{n}=o(1)$, $C_{\boldsymbol{\xi}}>0$ is a constant depending on $\boldsymbol{\xi}$, $\phi_{k}(\cdot)$ is as defined in Theorem 5, and ${\boldsymbol{V}}$ is generated from the null distribution with density $f({\boldsymbol{v}};\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})$.

Theorem 7.

If Assumptions (B1)–(B4) hold, then under the local alternative hypothesis $H_{1l}$, i.e., $\boldsymbol{\beta}=n^{-1/2}\boldsymbol{\xi}$ with a fixed vector $\boldsymbol{\xi}\in\Theta_{\beta}$, we have

\[
nT_{n}-\mu_{0}\stackrel{\cal{L}}{\longrightarrow}\nu,
\]

where $\mu_{0}$ is as defined in Theorem 5, $\nu$ is a random variable of the form $\nu=\sum_{j=1}^{\infty}\lambda_{j}(\chi^{2}_{1j}(\mu_{aj})-1)$, and $\chi^{2}_{11}(\mu_{a1}),\chi^{2}_{12}(\mu_{a2}),\cdots$ are independent noncentral $\chi^{2}_{1}$ variables, i.e., $\nu$ has the characteristic function

\[
\boldsymbol{\mathrm{P}}\left[e^{it\nu}\right]=\prod_{j=1}^{\infty}(1-2it\lambda_{j})^{-\frac{1}{2}}\exp\left(-it\lambda_{j}+\frac{it\lambda_{j}\mu_{aj}}{1-2it\lambda_{j}}\right).
\]

Here, $\{\lambda_{j}\}$ are the eigenvalues of the kernel $h({\boldsymbol{v}}_{1},{\boldsymbol{v}}_{2})$ defined in (25) under $f({\boldsymbol{v}},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})$, i.e., they are the solutions of $\lambda_{j}g_{j}({\boldsymbol{v}}_{2})=\int_{0}^{\infty}h({\boldsymbol{v}}_{1},{\boldsymbol{v}}_{2})g_{j}({\boldsymbol{v}}_{1})f({\boldsymbol{v}}_{1},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})d{\boldsymbol{v}}_{1}$ for nonzero $g_{j}$, and each noncentrality parameter of $\chi^{2}_{1j}(\mu_{aj})$ is

\[
\mu_{aj}=\boldsymbol{\mathrm{P}}\left[\phi_{j}({\boldsymbol{V}}_{0})\,\boldsymbol{\xi}^{\rm T}\partial\log f({\boldsymbol{V}}_{0},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})/\partial\boldsymbol{\beta}\right],\quad j=1,2,\cdots,
\]

where $\{\phi_{j}({\boldsymbol{v}})\}$ denotes the orthonormal eigenfunctions corresponding to the eigenvalues $\{\lambda_{j}\}$, and ${\boldsymbol{V}}_{0}$ is generated from the null distribution $f({\boldsymbol{v}},\boldsymbol{\alpha}^{*},{\boldsymbol{0}},\boldsymbol{\gamma}^{*})$.

Denote by $F_{\nu}$ the cumulative distribution function of $\nu$. It follows from Theorem 7 that the power function of $nT_{n}-\mu_{0}$ can be approximated theoretically by $F_{\nu}$, and the proof of Theorem 7 shows that $0<\boldsymbol{\mathrm{P}}h({\boldsymbol{V}}_{1},{\boldsymbol{V}}_{2})=\sum_{j=1}^{\infty}\lambda_{j}\mu_{aj}+o(1)$ under the local alternative hypothesis. The additional mean $\mu_{1}$ under $H_{1g}$ (or $\sum_{j=1}^{\infty}\lambda_{j}\mu_{aj}$ under $H_{1l}$) can be viewed as a measure of the difference between $H_{0}$ and $H_{1g}$ (or $H_{1l}$). However, $F_{\nu}$ is difficult to use directly in practice because it has no standard closed form. In Appendix B of the Supplementary Material, a novel bootstrap method is therefore recommended for calculating the critical value or p-value, whose asymptotic consistency is established in Theorem B1 in the Supplementary Material.
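As one concrete possibility, the sketch below implements the classical multiplier bootstrap for degenerate U-statistics, which approximates the law of $nT_{n}-\mu_{0}$ by randomly reweighting the kernel matrix. This illustrates the general recipe only; the specific bootstrap of Appendix B may differ in its details, and the interface here is our own convention.

```python
import numpy as np

def multiplier_bootstrap_pvalue(A, t_obs, B=5000, seed=0):
    """Multiplier bootstrap for a degenerate U-statistic.

    A     : n x n matrix of kernel evaluations h(V_i, V_j) with zero
            diagonal, e.g., built from the summands of (23) under H0.
    t_obs : observed centered statistic, e.g., n * T_n - mu0_hat.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    E = rng.standard_normal((B, n))              # i.i.d. N(0, 1) multipliers
    # T*_b = (1/n) sum_{i != j} e_i e_j h(V_i, V_j); the diagonal of A is zero.
    t_star = np.einsum('bi,ij,bj->b', E, A, E) / n
    return np.mean(t_star >= t_obs)              # bootstrap p-value
```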

4 Simulation studies

Consider the change-plane model (1)

\[
y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\beta}{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}\geq 0)+\epsilon_{i},
\]

where $X_{1i}=1$ and $U_{1i}=1$, and $(X_{2i},\cdots,X_{pi})^{\rm T}$ and $(U_{2i},\cdots,U_{ri})^{\rm T}$ are generated independently from the multivariate normal distributions $N({\boldsymbol{0}}_{p-1},\sqrt{2}{\boldsymbol{I}}_{p-1})$ and $N({\boldsymbol{0}}_{r-1},\sqrt{2}{\boldsymbol{I}}_{r-1})$, respectively. The error $\epsilon_{i}$ is generated from the following eight distributions: (i) $N(0,\sqrt{2})$; (ii) $t_{2}$; (iii) the Pareto distribution $Par(2,1)$ with shape parameter 2 and scale parameter 1; and (iv) the Weibull distribution $Weib(0.75,0.75)$ with shape parameter 0.75 and scale parameter 0.75. Owing to space limitations, the other four error distributions are deferred to Appendix E of the Supplementary Material: (v) Gaussian mixture; (vi) mixture of $t_{2}$ and Weibull $Weib(0.75,0.75)$; (vii) mixture of Pareto $Par(2,1)$ and $N(0,\sqrt{2})$; and (viii) mixture of lognormal $\exp(N(0,1))$ and $N(0,\sqrt{2})$. Under $H_{1}$, $(\gamma_{2},\cdots,\gamma_{r})^{\rm T}=(1,2,\cdots,2)^{\rm T}$ is set, and $\gamma_{1}$ is chosen as the negative of the 35th percentile of $U_{2}\gamma_{2}+\cdots+U_{r}\gamma_{r}$, so that ${\boldsymbol{U}}^{\rm T}\boldsymbol{\gamma}$ divides the population into two groups containing 65% and 35% of the observations, respectively. To save space, only the simulation results for robust subgroup-classifier learning are presented here; the performance of the robust subgroup test is reported in Appendix E.2 of the Supplementary Material. A data-generation sketch for this design is given below.
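For concreteness, a minimal data-generation sketch for this design follows. Two parameterization assumptions are flagged in the comments: $N(0,\sqrt{2})$ is read as having variance $\sqrt{2}$, and the Pareto draw uses NumPy's Lomax-based parameterization shifted to the standard $Par(2,1)$ support.

```python
import numpy as np

def generate_data(n, p, r, alpha, beta, gamma, err="t2", seed=None):
    """One sample from model (1) under the simulation design (Z = X here,
    so beta has length p)."""
    rng = np.random.default_rng(seed)
    sd = 2 ** 0.25                          # N(0, sqrt(2)) read as variance sqrt(2)
    X = np.column_stack([np.ones(n), rng.normal(0.0, sd, (n, p - 1))])
    U = np.column_stack([np.ones(n), rng.normal(0.0, sd, (n, r - 1))])
    if err == "gauss":
        eps = rng.normal(0.0, sd, n)
    elif err == "t2":
        eps = rng.standard_t(2, n)
    elif err == "pareto":                   # Par(2, 1): shape 2, scale 1
        eps = rng.pareto(2.0, n) + 1.0
    elif err == "weibull":                  # Weib(0.75, 0.75): shape and scale 0.75
        eps = 0.75 * rng.weibull(0.75, n)
    else:
        raise ValueError(f"unknown error distribution: {err}")
    y = X @ alpha + (X @ beta) * (U @ gamma >= 0) + eps
    return y, X, U
```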

The settings used herein are $\boldsymbol{\alpha}=(5,0.5,\cdots,0.5)^{\rm T}\in\mathbb{R}^{p}$ and $\boldsymbol{\beta}=(0.5,\cdots,0.5)^{\rm T}\in\mathbb{R}^{q}$, with the sigmoid function $K(u)=\{1+\exp(-u)\}^{-1}$ chosen as the smooth function. Finite-sample studies were also performed for other smooth functions $K(u)$ (such as $K(u)=\Phi(u)$ and $K(u)=\Phi(u)+u\phi(u)$), but the results were similar and so are omitted here. For comparison, three strategies are considered: (i) the proposed adaptive procedure (AHu); (ii) the classic Huber method (Hub), with the robustness parameter $\tau$ selected as $\tau_{0}\,\mbox{median}\{|{\boldsymbol{y}}-\mbox{median}({\boldsymbol{y}})|\}/\Phi^{-1}(0.75)$, where $\tau_{0}=1.345$ and ${\boldsymbol{y}}=(y_{1},\cdots,y_{n})^{\rm T}$, as suggested in Wang et al., (2021); and (iii) the estimation method based on the ordinary quadratic loss considered in Li et al., (2021) (OLS). Here, the subscript $n$ in AHu$_{n}$, Hub$_{n}$, and OLS$_{n}$ stands for the sample size, with $n=200,400,600$. The $\tau$ selection rule is sketched below.
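A small sketch of this tuning rule follows (using SciPy for the normal quantile). The absolute value in the median-absolute-deviation step is an assumption, matching the standard MAD convention.

```python
import numpy as np
from scipy.stats import norm

def huber_tau(y, tau0=1.345):
    """tau = tau0 * MAD with MAD = median|y - median(y)| / Phi^{-1}(0.75)."""
    mad = np.median(np.abs(y - np.median(y))) / norm.ppf(0.75)
    return tau0 * mad
```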

Figure 3 shows boxplots of the $L_{2}$-norm of the estimation errors of the parameter $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ under the different error distributions with $(p,q,r)=(3,3,3)$, where the error is measured by $L_{2}=\|\hat{\boldsymbol{\theta}}_{\tau,h}-\boldsymbol{\theta}^{*}\|^{2}$ and $\hat{\boldsymbol{\theta}}_{\tau,h}$ is the robust estimator of the true parameter $\boldsymbol{\theta}^{*}$. Figure 4 shows boxplots of the accuracy (ACC) of subgroup identification under the same settings. Here, ACC is defined as

\[
\mbox{ACC}=1-n^{-1}\sum_{i=1}^{n}\left|{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\hat{\boldsymbol{\eta}}_{\tau,h}\geq 0)-{\boldsymbol{1}}(U_{1i}+{\boldsymbol{U}}_{2i}^{\rm T}\boldsymbol{\eta}^{*}\geq 0)\right|,
\]

with the robust estimator $\hat{\boldsymbol{\eta}}_{\tau,h}$ of the true parameter $\boldsymbol{\eta}^{*}$; a sketch of this metric follows.
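ACC simply compares the subgroup labels induced by the estimated and true change planes, as in this minimal sketch (the interface is illustrative):

```python
import numpy as np

def subgroup_accuracy(U1, U2, eta_hat, eta_star):
    """ACC: the proportion of observations placed on the same side of the
    estimated and true change planes."""
    g_hat = (U1 + U2 @ eta_hat >= 0)
    g_true = (U1 + U2 @ eta_star >= 0)
    return 1.0 - np.mean(g_hat != g_true)
```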

Figures 3 and 4 show that the $L_{2}$-norms of the estimation errors decrease and the accuracy of subgroup identification increases as the sample size $n$ grows, which verifies the established theoretical results. Compared with the ordinary quadratic loss, the proposed estimation procedure achieves a uniformly lower median $L_{2}$-norm of the estimation errors and higher accuracy of subgroup identification for all heavy-tailed distributions except the $t_{2}$ distribution. The three methods are comparable for symmetric distributions, such as the Gaussian and $t_{2}$ distributions. To save space, the corresponding boxplots for $(p,q,r)=(3,3,11),(6,6,3)$, and $(6,6,11)$ are shown in Figures E2–E4 and E6–E8 in Appendix E.1 of the Supplementary Material.

Figure 3: $L_{2}$-norm of estimation errors of the parameter $\boldsymbol{\theta}=(\boldsymbol{\alpha}^{\rm T},\boldsymbol{\beta}^{\rm T})^{\rm T}$ with four different error distributions. Here, $(p,q,r)=(3,3,3)$.

Figure 4: Accuracy of subgroup identification with four different error distributions. Here, $(p,q,r)=(3,3,3)$.

5 Case study

The proposed procedure is applied to skin cutaneous melanoma (SKCM) cancer data downloaded from The Cancer Genome Atlas (https://tcga-data.nci.nih.gov/). SKCM is one of the most aggressive types of cancer, and its incidence, mortality, and disease burden are increasing annually (Siegel et al., 2021; Xiong et al., 2019). CREG1, TMEM201, and CCL8 are believed to be skin-cancer susceptibility genes (Hu et al., 2020), and the goal is to identify subgroups in Breslow's thickness based on mutations of these susceptibility genes.

Consideration is given to three environmental factors: (i) gender, (ii) age at diagnosis, and (iii) Clark level at diagnosis (CLAD), all of which have been studied extensively in the literature. Studies such as that by Dickson and Gershenwald, (2011) have found that these three environmental factors have positive effects. After removing missing values, 253 subjects remain, and the SKCM data are modeled by the change-plane model (1):

\[
Y_{i}={\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\alpha}+{\boldsymbol{X}}_{i}^{\rm T}\boldsymbol{\beta}{\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\boldsymbol{\gamma}\geq 0)+\epsilon_{i},\quad i=1,\cdots,253,
\]

where ${\boldsymbol{X}}_{i}=(1,\mbox{AGE}_{i},\mbox{GENDER}_{i},\mbox{CLAD}_{i})^{\rm T}$ and ${\boldsymbol{U}}_{i}=(1,\mbox{CREG1}_{i},\mbox{TMEM201}_{i},\mbox{CCL8}_{i})^{\rm T}$. The three important genes CREG1, TMEM201, and CCL8 are highly correlated with cutaneous melanoma (Hu et al., 2020).

Based on $B=5000$ bootstrap samples and the Huber loss, the p-value of the proposed RWAST is $0.002$, which provides strong evidence for rejecting the null hypothesis. In contrast, based on the ordinary quadratic loss, the p-values of WAST and SST are 0.6454 and 0.0286, respectively, where $B=5000$ and $K=2000$. Therefore, WAST based on the ordinary quadratic loss provides no evidence against the null hypothesis, and the evidence from SST is considerably weaker than that from the proposed RWAST.

The parameter estimates are listed in Table 1. The indicator function ${\boldsymbol{1}}({\boldsymbol{U}}_{i}^{\rm T}\hat{\boldsymbol{\gamma}}\geq 0)$ partitions the population into two subgroups of 121 and 132 subjects based on the Huber loss, and of 91 and 162 subjects based on the ordinary quadratic loss. Therefore, there appears to be a subgroup with a higher chance of skin cancer based on mutations of the three genes CREG1, TMEM201, and CCL8.

Table 1: Estimates of the parameters when the null hypothesis is rejected. The estimates $\hat{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}$ correspond to the baseline covariates, and $\hat{\boldsymbol{\gamma}}$ to the grouping covariates.

Loss    Parameter                     Intercept   AGE      GENDER   CLAD
OLS     $\hat{\boldsymbol{\alpha}}$   −0.037      −0.038   0.002    0.050
        $\hat{\boldsymbol{\beta}}$    0.022       0.024    0.008    −0.007
Huber   $\hat{\boldsymbol{\alpha}}$   3.525       0.029    0.079    0.064
        $\hat{\boldsymbol{\beta}}$    0.012       0.014    −0.026   −0.045

Loss    Parameter                     Intercept   CREG1    TMEM201   CCL8
OLS     $\hat{\boldsymbol{\gamma}}$   1.000       0.179    −0.656    4.970
Huber   $\hat{\boldsymbol{\gamma}}$   1.000       0.384    −2.093    4.860

6 Conclusion

Considered herein were subgroup classification and subgroup tests for change-plane models with heavy-tailed errors, which help in (i) narrowing down populations for modeling and (ii) providing recommendations for optimal individualized treatments in practice. A novel subgroup classifier was proposed by smoothing the indicator function and minimizing a smoothed Huber loss. Nonasymptotic properties were derived, and a nonasymptotic Bahadur representation was provided, in which the estimators of the parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ achieve sub-Gaussian tails.

The novel test statistic RWAST was introduced to test whether subgroups of individuals exist. Compared with WAST Liu et al., (2024) and SST Andrews and Ploberger, (1994, 1995); Song et al., (2009); Fan et al., 2017a based on ordinary change-plane regression with non-heavy-tailed errors, the proposed test statistic improves power significantly because it is robust, owing to the Huber loss, and avoids the drawbacks of taking the supremum over the parameter space $\Theta_{\gamma}$ when its dimension is large. Asymptotic distributions were derived under the null and alternative hypotheses. As studied by Liu et al., (2024) and Huang et al., (2020), it is straightforward to extend the proposed robust estimation procedure and RWAST to change-plane regressions with multiple change planes and heavy-tailed errors.

In the age of big data, it is of interest to consider high-dimensional change-plane regression models and to develop high-dimensional robust estimation procedures and test statistics. As noted by Liu et al., (2024), the proposed RWAST can be applied to change-plane regression with a high-dimensional grouping parameter $\boldsymbol{\gamma}$. However, providing estimation procedures for change-plane regression with high-dimensional $\boldsymbol{\gamma}$ remains an open problem. A possible strategy is to penalize the loss function with $\|\boldsymbol{\theta}\|_{1}$ under a sparsity assumption.

Supplementary Material

Appendix A includes Algorithm 1 for the implementation in Section 2.3. Appendix B provides the computation of the critical value in Section 3. Appendix C provides the proofs of Theorems 1–4 and the related lemmas. Appendix D provides the proofs of Theorems 5–7. Appendix E provides additional simulation studies illustrating the performance of the proposed estimation and test procedures.

Acknowledgments

This work was supported in part by the NSFC (12271329, 72331005).

References

  • Andrews and Ploberger, (1994) Andrews, D. W. K. and Ploberger, W. (1994). Optimal tests when a nuisance parameter is present only under the alternative. Econometrica, 62(6):1383–1414.
  • Andrews and Ploberger, (1995) Andrews, D. W. K. and Ploberger, W. (1995). Admissibility of the likelihood ratio test when a nuisance parameter is present only under the alternative. The Annals of Statistics, 23(5):1609–1629.
  • Chen and Zhou, (2020) Chen, X. and Zhou, W.-X. (2020). Robust inference via multiplier bootstrap. The Annals of Statistics, 48(3):1665–1691.
  • Davies, (1977) Davies, R. B. (1977). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika, 64(2):247–254.
  • Devroye et al., (2016) Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725.
  • Dickson and Gershenwald, (2011) Dickson, P. V. and Gershenwald, J. E. (2011). Staging and prognosis of cutaneous melanoma. Surgical oncology clinics of North America, 20(1):1–17.
  • Fan et al., (2017a) Fan, A., Song, R., and Lu, W. (2017a). Change-plane analysis for subgroup detection and sample size calculation. Journal of the American Statistical Association, 112:769–778.
  • Fan et al., (2017b) Fan, J., Li, Q., and Wang, Y. (2017b). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):247–265.
  • Foster et al., (2011) Foster, J. C., Taylor, J., and Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880.
  • Hahn et al., (1990) Hahn, M. G., Kuelbs, J., and Weiner, D. C. (1990). The asymptotic joint distribution of self-normalized censored sums and sums of squares. The Annals of Probability, 18(3):1284–1341.
  • Han et al., (2022) Han, D., Huang, J., Lin, Y., and Shen, G. (2022). Robust post-selection inference of high-dimensional mean regression with heavy-tailed asymmetric or heteroskedastic errors. Journal of Econometrics, 230(2):416–431.
  • Horowitz, (1993) Horowitz, J. L. (1993). Optimal rates of convergence of parameter estimators in the binary response model with weak distributional assumptions. Econometric Theory, 9(1):1–18.
  • Hu et al., (2020) Hu, B., Wei, Q., Zhou, C., Ju, M., Wang, L., Chen, L., Li, Z., Wei, M., He, M., and Zhao, L. (2020). Analysis of immune subtypes based on immunogenomic profiling identifies prognostic signature for cutaneous melanoma. International Immunopharmacology, 89:107162.
  • Huang et al., (2020) Huang, Y., Cho, J., and Fong, Y. (2020). Threshold-based subgroup testing in logistic regression models in two-phase sampling designs. Applied Statistics, 70(2).
  • Huber, (1964) Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101.
  • Huber, (1973) Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5):799–821.
  • Kang et al., (2024) Kang, C., Cho, H., Song, R., Banerjee, M., Laber, E. B., and Kosorok, M. R. (2024). Inference for change-plane regression. arXiv.org/abs/2206.06140.
  • Kang et al., (2017) Kang, S., Lu, W., and Song, R. (2017). Subgroup detection and sample size calculation with proportional hazards regression for survival data. Statistics in Medicine, 36:4646–4659.
  • Lee et al., (2011) Lee, S., Seo, M. H., and Shin, Y. (2011). Testing for threshold effects in regression models. Journal of the American Statistical Association, 106(493):220–231.
  • Lepskii, (1992) Lepskii, O. V. (1992). Asymptotically minimax adaptive estimation. I: Upper bounds. Optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):682–697.
  • Li et al., (2021) Li, J., Li, Y., Jin, B., and Kosorok, M. R. (2021). Multithreshold change plane model: Estimation theory and applications in subgroup identification. Statistics in Medicine, 40(15):3440–3459.
  • Liu et al., (2024) Liu, X., Huang, J., Zhou, Y., Zhang, F., and Ren, P. (2024). Efficient subgroup testing in change-plane models. http://arxiv.org/abs/2408.00602.
  • Mukherjee et al., (2020) Mukherjee, D., Banerjee, M., Mukherjee, D., and Ritov, Y. (2020). Asymptotic normality of a linear threshold estimator in fixed dimension with near-optimal rate. arXiv preprint arXiv:2001.06955.
  • Mukherjee et al., (2022) Mukherjee, D., Banerjee, M., and Ritov, Y. (2022). On robust learning in the canonical change point problem under heavy tailed errors in finite and growing dimensions. Electronic Journal of Statistics, 16(1):1153–1252.
  • Seo and Linton, (2007) Seo, M. H. and Linton, O. (2007). A smoothed least squares estimator for threshold regression models. Journal of Econometrics, 141(2):704–735.
  • Shen and Qu, (2020) Shen, J. and Qu, A. (2020). Subgroup analysis based on structured mixed-effects models for longitudinal data. Journal of Biopharmaceutical Statistics, 30(4):607–622.
  • Siegel et al., (2021) Siegel, R. L., Miller, K. D., Fuchs, H. E., and Jemal, A. (2021). Cancer statistics, 2021. CA: A Cancer Journal for Clinicians, 71(1):7–33.
  • Song et al., (2009) Song, R., Kosorok, M. R., and Fine, J. P. (2009). On asymptotically optimal tests under loss of identifiability in semiparametric models. The Annals of Statistics, 37:2409–2444.
  • Sun et al., (2020) Sun, Q., Zhou, W.-X., and Fan, J. (2020). Adaptive huber regression. Journal of the American Statistical Association, 115(529):254–265. PMID: 33139964.
  • Wald, (1943) Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3):426–482.
  • Wang et al., (2021) Wang, L., Zheng, C., Zhou, W., and Zhou, W.-X. (2021). A new principle for tuning-free huber regression. Statistica Sinica, 31:2153–2177.
  • Wei and Kosorok, (2018) Wei, S. and Kosorok, M. R. (2018). The change-plane Cox model. Biometrika, 105(4):891–903.
  • Xiong et al., (2019) Xiong, J., Bing, Z., and Guo, S. (2019). Observed survival interval: A supplement to tcga pan-cancer clinical data resource. Cancers, 11(3).
  • Zhang et al., (2022) Zhang, Y., Wang, J. W., and Zhu, Z. (2022). Single-index thresholding in quantile regression. Journal of the American Statistical Association, 117(540):2222–2237.
  • Zhou et al., (2018) Zhou, W.-X., Bose, K., Fan, J., and Liu, H. (2018). A new perspective on robust M-estimation: Finite sample theory and applications to dependence-adjusted multiple testing. The Annals of Statistics, 46(5):1904–1931.