Robust High-dimensional Tuning-Free Multiple Testing†
†Supported by NSF grants DMS-2210833, DMS-2053832, DMS-2052926 and ONR grant N00014-22-1-2340.
Abstract
A stylized feature of high-dimensional data is that many variables have heavy tails, so robust statistical inference is critical for valid large-scale statistical inference. Yet, existing developments such as Winsorization, Huberization and the median of means require bounded second moments and involve variable-dependent tuning parameters, which hamper their fidelity in applications to large-scale problems. To remove these constraints, this paper revisits the celebrated Hodges-Lehmann (HL) estimator for estimating location parameters in both the one- and two-sample problems, from a non-asymptotic perspective. Our study develops a Berry-Esseen inequality and Cramér-type moderate deviations for the HL estimator based on a newly developed non-asymptotic Bahadur representation, and builds data-driven confidence intervals via a weighted bootstrap approach. These results allow us to extend the HL estimator to large-scale studies and to propose tuning-free and moment-free high-dimensional inference procedures for testing the global null and for large-scale multiple testing with false discovery proportion control. We show that the resulting tuning-free and moment-free methods control the false discovery proportion at a prescribed level. Simulation studies lend further support to the developed theory.
1 Introduction
Large-scale, high-dimensional data with rich structures have been widely collected in almost all scientific disciplines and the humanities, thanks to the advancement of modern technologies. Massive developments have been made in statistics over the past two decades on extracting valuable information from these high-dimensional data; see Bühlmann and Van De Geer (2011); Hastie et al. (2009, 2015); Wainwright (2019); Fan et al. (2020a); Chen et al. (2021) for detailed accounts and the references therein.
Despite its convenience for theoretical analysis, the sub-Gaussian tail condition is not realistic in many applications that involve high-dimensional variables. For instance, it is well known that heavy-tailed distributions are a stylized feature of financial returns and macroeconomic variables (Cont, 2001; Stock and Watson, 2002; Fan and Yao, 2017; Fan et al., 2021). Therefore, tools designed for sub-Gaussian data can lead to erroneous scientific conclusions. Asking thousands of gene expressions to all have sub-Gaussian tails is a mathematical dream, not a reality that data scientists face. For example, comparing gene expression profiles between various cell sub-populations, especially after treatments and therapies, is an essential statistical task (Nagalakshmi et al., 2008; Shendure and Ji, 2008; Wang et al., 2009; Li et al., 2012; Li and Tibshirani, 2013; Gupta et al., 2014; Finotello and Di Camillo, 2015). However, it is unrealistic to hope that thousands of gene expressions all have sub-Gaussian distributions: outliers in non-sub-Gaussian distributions can have a significant impact on nonrobust procedures and lead to many false positives and negatives (Gupta et al., 2014; Wang et al., 2015). The same situation arises in inference with functional magnetic resonance imaging (fMRI) data, since the data do not conform to the assumed Gaussian distribution (Eklund et al., 2016). These practical challenges demand the development of efficient and reliable robust inference methods.
Recently, robust statistical methods have gained popularity as a means of handling outliers and heavy-tailed noise. Many prior works have taken significant strides toward effective statistical estimation under heavy-tailed distributions. For instance, to deal with heavy-tailed noise contamination, Huber regression was proposed (Huber, 1973), and subsequent publications along these lines include Yohai and Maronna (1979); Mammen (1989); He and Shao (1996, 2000), where the asymptotic properties of the Huber estimator have been thoroughly investigated. From a non-asymptotic perspective, the Huber-type estimator was recently revisited by Sun et al. (2020), in which the authors propose an adaptive Huber regression method and establish its non-asymptotic deviation bounds while only requiring a finite $(1+\delta)$-th moment of the noise for some $\delta > 0$. Moreover, using a similar idea for making the corresponding M-estimator insensitive to extreme values, Catoni (2012) developed a novel approach that minimizes a robust empirical loss. It is demonstrated that the estimator has exponential concentration around the true mean and enjoys the same statistical rate as the sample average of sub-Gaussian distributions when the population only has a bounded second moment. Brownlees et al. (2015) further investigates empirical risk minimization based on the robust estimator proposed in Catoni (2012). Additionally, the so-called median-of-means strategy, which can be traced back to Nemirovsky and Yudin (1983), is another successful method for handling heavy-tailed distributions. By only requiring a bounded second moment, it achieves sub-Gaussian-type concentration around the population mean. Minsker (2015) and Hsu and Sabato (2016) further generalize this idea to multivariate cases. Moreover, there also exists a series of works that address heavy-tailed noise via quantile-based robust estimation; see Arcones (1995); Koenker and Hallock (2001); Belloni and Chernozhukov (2011); Fan et al. (2014); Zheng et al. (2015) for more details. Furthermore, Fan et al. (2021); Yang et al. (2017); Fan et al. (2022c) recently proposed, under heavy-tailed contamination, a novel principle of simply truncating or shrinking the response variables appropriately to achieve sub-Gaussian rates, requiring only a bounded second moment of the measurements. The aforementioned methodologies can also be applied to a wide range of problems, such as matrix sensing, matrix completion, robust PCA, factor analysis, and neural networks; we refer interested readers to Minsker (2015); Hsu and Sabato (2016); Fan et al. (2017); Loh (2017); Minsker (2018); Goldstein et al. (2018); Wang et al. (2020); Fan et al. (2022b); Wang and Fan (2022); Fan et al. (2022a) for more details.
While many effective solutions have been developed to address heavy-tailedness, these solutions still have potential shortcomings. Specifically, the developments above require bounded second moments and are primarily based on data shrinkage, Huber-type losses, or the median of means (Huber, 1973; Nemirovsky and Yudin, 1983; Catoni, 2012; Fan et al., 2021). More critically, Huberization, Winsorization and sample splitting introduce additional tuning parameters, and these tuning parameters should be variable-dependent, which makes large-scale applications difficult and damages the fidelity of empirical results. Although quantile estimators (Koenker and Hallock, 2001) such as the median and the Hodges-Lehmann (HL) estimator (Hodges and Lehmann, 1963) can eliminate the restrictions on moment conditions and tuning-parameter selection, the empirical median is often less efficient and requires stronger distributional assumptions. In addition, the existing literature on the HL estimator focuses mainly on low-dimensional asymptotic analysis and cannot be applied to large-scale inference.
In this paper, we revisit the celebrated HL estimator (Hodges and Lehmann, 1963) and conduct non-asymptotic and large-scale theoretical studies for both one-sample and two-sample problems. For one-sample location estimation in the univariate case, we let $X_1, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with
$$X_i = \theta + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1.1}$$
where $\theta$ represents the location parameter of interest and $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. random variables drawn from some unknown distribution. In this scenario, the HL estimator of $\theta$ is defined by
$$\widehat{\theta} = \operatorname{median}\Big\{ \frac{X_i + X_j}{2} : 1 \le i \le j \le n \Big\}. \tag{1.2}$$
By assuming the pseudomedian of $\varepsilon_i$ to be zero, we derive a non-asymptotic Bahadur representation of $\widehat{\theta}$. To the best of our knowledge, this is the first study of its kind on the non-asymptotic expansion of the HL estimator. From there, we also establish the Berry-Esseen bound and moderate deviations for $\widehat{\theta}$ over the widest possible range. Furthermore, as there are multiple unknowns in the asymptotic distribution of $\widehat{\theta}$, including the density function of the noise and the unknown location parameter $\theta$, we then propose a weighted bootstrap approach to construct confidence intervals for $\theta$ from the data. These results and methods are essential for large-scale inference.
In addition to the one-sample problem, two-sample location-shift problems arise frequently in many scientific studies, including choosing genes that are expressed differently in normal and injured spinal cords, determining treatment effects between treated and control groups, finding change points, etc. To this end, we let $Y_1, \ldots, Y_m$ be another independent sample of i.i.d. random variables satisfying
$$Y_j = \theta + \Delta + e_j, \qquad j = 1, \ldots, m. \tag{1.3}$$
The primary goal is to conduct statistical inference for the location shift $\Delta$. In the sequel, following Hodges and Lehmann (1963), the two-sample HL estimator for $\Delta$ is given by
$$\widehat{\Delta} = \operatorname{median}\big\{ Y_j - X_i : 1 \le i \le n,\ 1 \le j \le m \big\}. \tag{1.4}$$
Instead of assuming the noises are generated from the same distribution (Hodges and Lehmann, 1963), we only require that $e_j - \varepsilon_i$ has median zero, which is more general and allows the random noises to have different distributions. In a similar vein, we establish the non-asymptotic expansion of $\widehat{\Delta}$, investigate its asymptotic distribution, and construct confidence intervals via bootstrap techniques. Again, the techniques and results developed can be applied to large-scale multiple testing problems.
There is a rich literature on large-scale multiple testing problems for location parameters (Benjamini and Hochberg, 1995; Storey, 2002, 2003; Genovese and Wasserman, 2004; Ferreira and Zwinderman, 2006; Chi, 2007; Blanchard and Roquain, 2009). However, most of these works assume the noise distributions are sub-Gaussian. Moving away from sub-Gaussian assumptions, Fan et al. (2019) propose estimating the mean vector by minimizing a Huber-type loss and then performing false discovery proportion (FDP) control. However, leveraging Huber-type estimators necessitates moment conditions and introduces tuning parameters, which makes them hard to apply to large-scale inference, as the tuning parameters should ideally be variable-dependent. Additionally, while the HL estimator enjoys tuning-free and moment-free qualities in the univariate setting, its behavior in high dimensions is largely unknown. To this end, prompted by the lack of tuning-free large-scale multiple testing procedures for heavy-tailed distributions, this paper further extends the HL estimator to high-dimensional regimes.
Specifically, we consider the $p$-dimensional analogues of models (1.1) and (1.3), in which the unknown location parameters and the random noises are $p$-dimensional vectors. For both one- and two-sample problems, we propose a carefully constructed Gaussian multiplier bootstrap to test the global null hypotheses
$$H_0: \theta_k = 0 \ \text{ for all } 1 \le k \le p \tag{1.5}$$
(and analogously for the shift parameters $\Delta_k$ in the two-sample case), by extending the HL estimator to high-dimensional regimes. When the null hypothesis above is rejected, we then perform multiple testing, allowing weakly dependent measurements, and efficiently control the FDP. Compared with the existing literature (Liu and Shao, 2014; Fan et al., 2019), our procedures involve no tuning parameters or moment conditions for testing the global null and for large-scale multiple testing. These theoretical findings are further supported by exhaustive numerical studies.
The main contributions of the paper can be summarized as follows:
• The existing studies on the HL estimator mainly focus on its asymptotic behavior, which is too weak for high-dimensional applications. In practice, however, it is crucial to understand the HL estimator's performance in finite samples, especially in high-dimensional and large-scale experiments. For this purpose, we first derive the non-asymptotic expansions of the HL estimators for both one-sample and two-sample problems.
• With the non-asymptotic expansions of the HL estimators for both one- and two-sample problems, we derive their Berry-Esseen-type bounds and Cramér-type moderate deviations over the widest possible range. To deal with unknown components in the limiting distributions, we further develop a weighted bootstrap to build data-driven confidence intervals. In addition, we also furnish a non-asymptotic analysis of the bootstrap estimator.
• Existing work on large-scale testing with heavy-tailed errors typically involves additional tuning parameters and moment conditions. To address these issues, we generalize the HL estimator to large-scale studies and propose tuning-free and moment-free high-dimensional testing procedures. Additionally, we develop bootstrap methods for computing critical values in large-scale applications. We show that the resulting false discovery proportion is well controlled.
1.1 Roadmap
In §2, we first set up the model and introduce the basic settings. We then derive the non-asymptotic expansions of the HL estimator for both the one-sample and two-sample problems, together with its Berry-Esseen bounds and moderate deviations. In addition, as the asymptotic distribution of the estimator involves unknown quantities, in §3 we develop a multiplier bootstrap to construct valid data-driven confidence intervals. Moreover, §4 is devoted to extending the HL estimator to large-scale multiple testing problems. §5 contains comprehensive numerical studies that verify the theoretical results. Finally, we conclude the paper with some discussions in §6. All proofs are deferred to the appendix.
1.2 Notation
For any integer $n$, we use $[n]$ to denote the set $\{1, \ldots, n\}$. For any function $f$, we denote $\|f\|_\infty = \sup_x |f(x)|$. Throughout this paper, we use $C, C_0, C_1, \ldots$ to denote universal positive constants whose values may vary at different places. We use $\mathbb{1}\{\cdot\}$ to denote the indicator function. For any set $\mathcal{A}$, we use $|\mathcal{A}|$ to denote its cardinality. For two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ or $a_n = O(b_n)$ if there exists a positive constant $C$ such that $a_n \le C b_n$, and we write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. In addition, we define the pseudomedian of a distribution $F$ to be the median of the distribution of $(Z_1 + Z_2)/2$, where $Z_1$ and $Z_2$ are independent, each with the same distribution $F$. Moreover, for any distribution $F$ and constant $a$, we let $aF$ represent the distribution of the random variable $aZ$, where $Z$ is a random variable drawn from $F$.
2 Estimation and Inference
This section is devoted to studying the non-asymptotic expansions of the Hodges-Lehmann estimator and conducting statistical estimation and inference for population location shift parameters. For both one-sample and two-sample problems, the theoretical properties, which are needed for large-scale inferences, are presented in the following sections.
2.1 One-sample Problem
Let $X_i = \theta + \varepsilon_i$, $i \in [n]$, be i.i.d. real-valued random variables, where $\theta$ is the unknown location parameter of interest and $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. random variables drawn from some unknown distribution. Throughout this section, it is assumed that $\varepsilon_1$ has a pseudomedian (Høyland, 1965) of zero. As a consequence, it holds that $\operatorname{median}\{(\varepsilon_1 + \varepsilon_2)/2\} = 0$. The HL estimator (Hodges and Lehmann, 1963) of $\theta$ is given by the median of all Walsh averages of the observations $X_1, \ldots, X_n$, namely,
$$\widehat{\theta} = \operatorname{median}\Big\{ \frac{X_i + X_j}{2} : 1 \le i \le j \le n \Big\}. \tag{2.1}$$
Equivalently, if we define the U-process $U_n(t) = \binom{n+1}{2}^{-1} \sum_{1 \le i \le j \le n} \mathbb{1}\{(X_i + X_j)/2 \le t\}$, the HL estimator in (2.1) can also be expressed as the sample median of the process, namely,
$$\widehat{\theta} = \inf\big\{ t \in \mathbb{R} : U_n(t) \ge 1/2 \big\}. \tag{2.2}$$
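For concreteness, the following is a minimal sketch of the one-sample HL estimator as the median of all Walsh averages; the Python/NumPy implementation, the function name `hl_one_sample`, and the heavy-tailed test data are our own illustration, not code from the paper.

```python
import numpy as np

def hl_one_sample(x):
    """One-sample Hodges-Lehmann estimate: the median of all Walsh
    averages (x_i + x_j)/2 over pairs 1 <= i <= j <= n."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x))      # all pairs with i <= j (diagonal included)
    return np.median((x[i] + x[j]) / 2.0)

# Example: Cauchy (t_1) noise has no finite mean, yet the HL estimate
# still recovers the location parameter theta = 1 reasonably well.
rng = np.random.default_rng(0)
sample = 1.0 + rng.standard_t(df=1, size=200)
print(hl_one_sample(sample))
```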
Let $F$ denote the cumulative distribution function of $(\varepsilon_1 + \varepsilon_2)/2$ and $f$ be its density function. We then present the non-asymptotic Bahadur representation of $\widehat{\theta}$ in the following theorem.
Theorem 2.1.
Assume that there exist positive constants and such that . Then for any , we have
(2.3) |
Furthermore, assume that and there exist positive constants and such that . Then for any such that , we have
(2.4) |
where are positive constants depending only on and .
We note that existing works mainly study the asymptotic distribution of quantiles of U-statistics rather than non-asymptotic properties (Arcones, 1996). Asymptotic theory, however, is frequently less effective for theoretical studies in high-dimensional statistics (Wainwright, 2019). To fill this gap, Theorem 2.1 presents both a non-asymptotic deviation bound and a linear approximation of the HL estimator $\widehat{\theta}$. It is worth mentioning that the HL estimator has sub-Gaussian tails without any moment constraints imposed on the noise, whereas Huber-type or winsorized estimators require the existence of the second moment. Moreover, in contrast to Huber regression (Zhou et al., 2018; Sun et al., 2020) or truncation (Fan et al., 2021), which both require additional tuning parameters, HL-type estimation is tuning-free and thus more scalable.
Moreover, when the distribution of $\varepsilon_1$ is symmetric around zero, $\theta$ reduces to the median of the distribution of $X_1$. In this scenario, the sample median serves as a plausible alternative robust estimator for $\theta$. Under similar regularity conditions on the density function, the classical Bahadur representation for the sample median reveals that
(2.5) |
for any positive constant. Compared with (2.4), the linear approximation of the HL estimator is much more accurate than that of the quantile estimator (Arcones, 1996).
2.1.1 Asymptotic Distribution
In addition to estimation, statistical inference is also essential in real-world applications. To this end, with the developed non-asymptotic expansion at hand, we next present the asymptotic distribution of the HL estimator in this section.
Let $\Phi$ denote the cumulative distribution function of a standard normal random variable. The following theorem establishes a Berry-Esseen bound for $\widehat{\theta}$.
Theorem 2.2.
Theorem 2.2 establishes the asymptotic normality of $\widehat{\theta}$. When the distribution of $\varepsilon_1$ is symmetric around zero, the asymptotic variance above simplifies accordingly. Consequently, in view of (2.5) and (2.6), the asymptotic relative efficiency (ARE) between the HL estimator and the sample median can be computed explicitly (Hodges and Lehmann, 1963). A concrete example is given in Table 1, where we summarize the ARE between the HL estimator and the sample median.
Table 1: ARE between the HL estimator and the sample median.
In particular, in the regimes reported in Table 1, the HL estimator has a strictly smaller asymptotic variance than the sample median. The above example illustrates the effectiveness of the HL estimator over the quantile-based method.
Based on the non-asymptotic linear expansion in (2.4), we further derive a Cramér-type moderate deviation result that quantifies the relative error of the normal approximation for $\widehat{\theta}$ in the following theorem, which has important applications in large-scale inference (Fan et al., 2007; Liu and Shao, 2014; Xia et al., 2018; Zhou et al., 2018).
Theorem 2.3.
Let be a sequence of positive numbers satisfying . Then, under the conditions of Theorem 2.1, we have
(2.7) |
uniformly for , where is a positive constant independent of and . In particular, when , we have
uniformly for .
It is worth mentioning that this choice yields the widest possible range for the relative error in (2.7) to vanish, which is also optimal for Cramér-type moderate deviation results (Petrov, 1975; Fan et al., 2007; Liu and Shao, 2014; Zhou et al., 2018; Fan et al., 2019; Chen and Zhou, 2020; Fang et al., 2020). Next, we proceed to estimate the location shift parameter between two distributions via the HL estimator.
2.2 Two-sample Problem
Two-sample location-shift estimation and inference arise in a variety of applications, such as testing for gene differences, quantifying treatment effects, and detecting change points. Accordingly, this section examines the two-sample estimation and inference of the population location shift parameter.
Let $Y_j = \theta + \Delta + e_j$, $j \in [m]$, be another sample of i.i.d. real-valued random variables independent of $\{X_i\}_{i \in [n]}$, and we aim at constructing a confidence interval for the location shift $\Delta$. Throughout this section, it is assumed that
$$\operatorname{median}(e_1 - \varepsilon_1) = 0. \tag{2.8}$$
The existing literature on HL estimators mainly deals with the case where the noises of the two samples are identically distributed (Hodges and Lehmann, 1963; Lehmann, 1963; Bauer, 1972; Rosenkranz, 2010). In contrast, it should be noted that the assumption imposed in (2.8) only concerns the median of the difference of the noises, which is much more general than requiring identical distributions. In the sequel, following Hodges and Lehmann (1963), the two-sample HL estimator for $\Delta$ is given by
$$\widehat{\Delta} = \operatorname{median}\big\{ Y_j - X_i : 1 \le i \le n,\ 1 \le j \le m \big\}. \tag{2.9}$$
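Analogously, below is a minimal sketch of the two-sample HL estimator (2.9) as the median of all pairwise differences; again an illustrative Python/NumPy implementation with our own function name.

```python
import numpy as np

def hl_two_sample(x, y):
    """Two-sample Hodges-Lehmann estimate of the location shift:
    the median of all pairwise differences y_j - x_i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diffs = y[:, None] - x[None, :]     # (m, n) matrix of pairwise differences
    return np.median(diffs)
```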
Before proceeding, we present the following assumption on the relative sample sizes of the involved random samples.
Assumption 2.1.
There exists a positive constant $K \ge 1$ such that $K^{-1} \le n/m \le K$.
Assumption 2.1 is a natural condition which ensures that the sample sizes are comparable. Such a requirement is commonly imposed for two-sample estimation and inference (Bai and Saranadasa, 1996; Chen and Qin, 2010; Li and Chen, 2012; Chang et al., 2017; Zhang et al., 2020). In what follows, we write for simplicity. The sub-Gaussian-type deviation inequality and the non-asymptotic Bahadur representation of the two-sample HL estimator are established in the subsequent theorem.
Theorem 2.4.
Assume that there exist positive constants and such that . Then for any , we have
Furthermore, assume that and there exist positive constants and such that . Then, under Assumption 2.1, for any , we have
where stands for the cumulative distribution function of and are positive constants depending only on and .
Theorem 2.4 presents the non-asymptotic approximation of the HL estimator $\widehat{\Delta}$, where the linear approximator also enjoys sub-Gaussian tails without imposing any constraints on the moments of the noises. Equipped with this, we establish the Berry-Esseen bound and Cramér-type moderate deviations of $\widehat{\Delta}$ in the following theorem. Before proceeding, we define the asymptotic variance of $\widehat{\Delta}$ to be
Theorem 2.5.
Under the conditions of Theorem 2.4, we have
where is a positive constant independent of . Moreover, let be a sequence of positive constants satisfying . Then, we further achieve
uniformly for , where is a positive constant independent of and . In particular, when , we have
uniformly for .
One observes that the asymptotic distributions of $\widehat{\theta}$ and $\widehat{\Delta}$ involve many unknown quantities, such as the density functions and the population parameters $\theta$ and $\Delta$. In the following section, we utilize the bootstrap method to construct confidence intervals for the parameters of interest.
3 Bootstrap Calibration
In this section, we propose a weighted bootstrap method to construct confidence intervals for $\theta$ and $\Delta$, rather than directly estimating the unknown terms involved in the asymptotic variances by brute force. The reason is that the direct estimation approach necessitates additional tuning-parameter selection and imposes moment conditions. Additionally, the bootstrap calibration performs admirably in finite samples, particularly when the sample size is modest. Therefore, in the sections that follow, we outline the bootstrap procedures for both the one- and two-sample problems.
3.1 Bootstrap for One-sample Problem
Recall the one-sample HL estimator $\widehat{\theta}$ given in (2.1). Throughout this paper, we focus on the weighted bootstrap procedure, in which the bootstrap estimate of $\theta$ is defined by minimizing a randomly perturbed objective function. More specifically, let $w_1, \ldots, w_n$ be i.i.d. non-negative random variables with mean one and variance one. Then the weighted bootstrap estimate of $\theta$ is given by
A natural candidate for the bootstrap weights above is the 2·Bernoulli(0.5) distribution (the multiplication by 2 guarantees the normalization conditions above). In this case, the bootstrap estimator has the following simple closed-form expression:
(3.1) |
which is the same as the sub-sampled HL estimator computed from the observations receiving weight 2, and we concentrate on this type of bootstrap calibration procedure in what follows.
Let $M$ denote the total number of Walsh averages in (3.1). In the subsequent theorem, we establish the non-asymptotic Bahadur representation of the bootstrap estimator and the approximate distribution of the bootstrap samples.
Theorem 3.1.
Under the conditions of Theorem 2.1, for any such that , with probability at least , we have
(3.2) | ||||
(3.3) |
where stands for the conditional probability and – are positive constants depending only on and .
The non-asymptotic linear expansion in (3.3) enables us to derive the asymptotic normality of the bootstrap estimator . Combined with the Berry-Esseen bound in Theorem 2.2, we further establish a non-asymptotic upper bound on the Kolmogorov distance between the distribution functions of and . More specifically, with probability at least , we have
(3.4) |
where $C_1$ and $C_2$ are positive constants independent of $n$. Consequently, we are equipped to construct confidence intervals for $\theta$ in a data-driven way. For any significance level $\alpha \in (0, 1)$, we take the $(\alpha/2)$th and $(1-\alpha/2)$th quantiles of the bootstrap distribution. Then the confidence interval for $\theta$ is given by these bootstrap quantiles.
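As an illustration of the construction above, the sketch below draws 2·Bernoulli(0.5) weights, computes the resulting sub-sampled HL estimates as in (3.1), and forms a bootstrap confidence interval from the centered bootstrap quantiles. It reuses the illustrative `hl_one_sample` helper from §2.1; the quantile form of the interval and the guard against degenerate sub-samples are our own choices.

```python
import numpy as np

def hl_bootstrap_ci(x, alpha=0.05, n_boot=300, rng=None):
    """Weighted-bootstrap confidence interval for theta.  Each draw keeps
    the observations receiving weight 2, so the bootstrap estimate is the
    sub-sampled HL estimator of (3.1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    theta_hat = hl_one_sample(x)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        keep = rng.integers(0, 2, size=len(x)).astype(bool)  # w_i/2 ~ Bernoulli(0.5)
        if keep.sum() < 2:               # guard against an empty sub-sample
            keep[:2] = True
        boot[b] = hl_one_sample(x[keep])
    lo, hi = np.quantile(boot - theta_hat, [alpha / 2, 1 - alpha / 2])
    return theta_hat - hi, theta_hat - lo
```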
3.2 Bootstrap for Two-sample Problem
This section is devoted to constructing confidence intervals for $\Delta$ in the two-sample problem. Let the bootstrap weights be i.i.d. 2·Bernoulli(0.5) random variables independent of the data. Following (3.1), the bootstrap estimator for $\Delta$ is defined by
(3.5) |
It is worth noting that this bootstrap estimator is equivalent to the sub-sampled HL estimator based on the two sub-datasets of observations receiving weight 2. With these necessary tools at hand, the non-asymptotic Bahadur representation of the bootstrap estimator and the approximate distribution of the bootstrap samples are developed in the following theorem.
Theorem 3.2.
We then obtain the Berry-Esseen bound and build confidence intervals for $\Delta$ based on the non-asymptotic expansion of the two-sample bootstrap estimator, using arguments similar to those in §3.1.
4 Large-scale Multiple Testing
With the advancement of technology, large-scale, high-dimensional data have been extensively collected over the past two decades in a variety of fields such as medicine, biology, genetics, earth science, and finance. In large-scale regimes, heavy-tailed noises are inevitable, and it is crucial to develop robust statistical inference procedures. Yet, existing research that infers location shifts via Huber-type estimates calls for variable-dependent tuning parameters and moment restrictions (Fan et al., 2019; Sun et al., 2020). This type of technique is hard to apply efficiently and faithfully to large-scale inference due to the choice of tuning values, and the moment constraints exclude a large number of heavy-tailed distributions. To remedy these issues, this section focuses on extending HL estimation to high dimensions and developing tuning-free and moment-free high-dimensional multiple testing procedures.
4.1 Large-Scale Testing for One-sample Problem
In this section, we investigate high-dimensional multiple testing using the HL estimator for one-sample data. Let $\bm{X}_i = \bm{\theta} + \bm{\varepsilon}_i$, $i \in [n]$, be i.i.d. $p$-dimensional random vectors, where $\bm{\theta} = (\theta_1, \ldots, \theta_p)^\top$ is a $p$-dimensional vector of unknown parameters and $\bm{\varepsilon}_1, \ldots, \bm{\varepsilon}_n$ are i.i.d. random vectors. With the building blocks presented in the previous sections, we first proceed to construct simultaneous confidence intervals for $\theta_1, \ldots, \theta_p$ using Gaussian approximation and bootstrap calibration.
4.1.1 Gaussian Approximation
The primary goal of this section is to construct simultaneous confidence intervals for $\theta_1, \ldots, \theta_p$. To this end, we develop a Gaussian approximation for the maximum deviation of the marginal HL estimators, following the intuition of recently developed high-dimensional distributional theory (Chernozhukov et al., 2017; Chernozhuokov et al., 2022). More specifically, let $\bm{Z}$ be a $p$-dimensional centered Gaussian random vector with
(4.1) |
where and stands for its derivative, and . We have the high dimensional Gaussian approximation in the following theorem.
Theorem 4.1.
Assume that there exist positive constants and such that
Then, we have
(4.2) |
We consider testing the global null hypotheses
$$H_0: \theta_k = 0 \quad \text{for all } 1 \le k \le p. \tag{4.3}$$
Based on the marginal HL estimators $\widehat{\theta}_1, \ldots, \widehat{\theta}_p$, one shall reject the null hypothesis in (4.3) when the maximum absolute marginal estimate exceeds a certain threshold that depends on the distribution of the maximum of the Gaussian analogue. However, in light of (4.1), this distribution depends on the unknown distribution functions of the noises. Therefore, to approximate it, we again propose a bootstrap procedure. Specifically, in the one-sample regime, for each $k \in [p]$, define the bootstrap estimate of $\theta_k$ as in §3.1. It is worth mentioning that these bootstrap estimators can be computed efficiently in practice. For $\alpha \in (0,1)$, let $c^\ast_{1-\alpha}$ denote the $(1-\alpha)$th quantile of the bootstrap statistic. With the help of the bootstrap, we can efficiently estimate the quantiles of the approximating distribution; the corresponding results are presented in the following theorem.
Theorem 4.2.
Under the conditions of Theorem 4.1, we have
Theorem 4.2 reveals that the proposed bootstrap procedure efficiently estimates the quantiles of the approximating distribution. This allows for the direct construction of simultaneous data-driven confidence intervals for $\theta_1, \ldots, \theta_p$. In addition, when the null hypothesis in (4.3) is rejected, it is essential to conduct multiple testing to identify significant individuals and control the false discovery proportion (FDP). We address this problem in the following section.
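Putting the pieces together, below is a sketch of the global test of (4.3): compute the marginal HL estimates, form a max-type statistic, and calibrate its critical value by the weighted bootstrap. The max-absolute-deviation form of the bootstrap statistic is our reading of the construction above, and `hl_one_sample` is the illustrative helper from §2.1.

```python
import numpy as np

def global_test_one_sample(X, alpha=0.05, n_boot=300, rng=None):
    """Max-type global test of H0: theta_k = 0 for all k, with the critical
    value estimated from bootstrap deviations.  X is an (n, p) data matrix."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    theta_hat = np.array([hl_one_sample(X[:, k]) for k in range(p)])
    T = np.max(np.abs(theta_hat))                        # test statistic under H0
    T_boot = np.empty(n_boot)
    for b in range(n_boot):
        keep = rng.integers(0, 2, size=n).astype(bool)
        if keep.sum() < 2:
            keep[:2] = True
        theta_b = np.array([hl_one_sample(X[keep, k]) for k in range(p)])
        T_boot[b] = np.max(np.abs(theta_b - theta_hat))  # bootstrap deviation
    cval = np.quantile(T_boot, 1 - alpha)
    return T > cval, T, cval
```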
4.1.2 Multiple Testing
The goal of this section is to conduct multiple testing to identify statistically significant individuals with controlled false discovery proportions. Specifically, we consider simultaneously testing the hypotheses
Let $\mathcal{H}_0$ denote the set of true null hypotheses, with cardinality $p_0 = |\mathcal{H}_0|$. For each $k \in [p]$, let $p_k$ denote the p-value for testing the individual hypothesis $H_{0k}$. For any prescribed threshold $t > 0$, we shall reject the null hypothesis $H_{0k}$ whenever $p_k \le t$. Then the false discovery proportion is defined by
(4.4) |
where $V(t)$ denotes the number of false discoveries and $R(t)$ is the number of total discoveries. Note that the denominator $R(t)$ is observable but $V(t)$ is not. When $p_0$ tends to infinity, $V(t)$ concentrates around $p_0 t$, which can be used to give an upper bound on the FDP. In many applications, the proportion of true nulls is close to one, and Storey (2002) gives an estimator for this proportion and incorporates it into the FDP estimator.
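For concreteness, Storey's (2002) estimator of the proportion of true nulls takes the standard form $\widehat{\pi}_0(\lambda) = \#\{k : p_k > \lambda\}/\{(1-\lambda)p\}$. A one-line sketch follows; the conventional choice $\lambda = 0.5$ is ours, not the paper's.

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey (2002) estimator of the proportion of true nulls:
    pi0_hat = #{k : p_k > lam} / ((1 - lam) * p), capped at 1."""
    pvals = np.asarray(pvals, dtype=float)
    return min(1.0, np.mean(pvals > lam) / (1.0 - lam))
```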
It is worth noting that the p-values are computed by constructing inferential test statistics with pivotal limiting distributions, based on normal calibration (Fan et al., 2007). As illustrated above, the asymptotic variance of the HL estimator depends on unknown components, so a pivotal test statistic for each hypothesis is not directly available, and the traditional quantile-based approach is not scalable in the ultra-high-dimensional scenario. To remedy this issue, we leverage the bootstrap to proceed with the analysis. Specifically, let the bootstrap weights be i.i.d. non-negative random variables generated in the same way as those in §3.1. Similar to (3.1), the bootstrap estimate of each $\theta_k$ is defined by
Consequently, our p-values are derived from the bootstrap calibration, and we let $p_{(1)} \le \cdots \le p_{(p)}$ denote the ordered p-values. In order to choose the threshold $t$ properly to control the FDP, we adopt the distribution-free procedure proposed by Benjamini and Hochberg (1995). Specifically, for any significance level $\alpha \in (0,1)$, the data-dependent threshold is $t = p_{(\widehat{k})}$, where $\widehat{k}$ is given by
Recall that an estimate of the FDP is obtained by replacing the unobservable $V(t)$ with $p_0 t$, following the discussion after (4.4). Then a natural choice of threshold is the largest $t$ whose estimated FDP does not exceed the level $\alpha$. This provides a simple explanation for the choice of $\widehat{k}$.
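A sketch of the resulting Benjamini-Hochberg step-up rule, returning the data-dependent p-value cutoff described above (variable names are ours):

```python
import numpy as np

def bh_threshold(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: return t = p_(k) with
    k = max{ j : p_(j) <= alpha * j / p }, or 0 if no such j exists."""
    p_sorted = np.sort(np.asarray(pvals, dtype=float))
    p = len(p_sorted)
    below = np.nonzero(p_sorted <= alpha * np.arange(1, p + 1) / p)[0]
    return 0.0 if below.size == 0 else p_sorted[below[-1]]

# Reject H_{0k} whenever p_k <= bh_threshold(pvals, alpha).
```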
In what follows, we assume that $p_0 \asymp p$. For each pair of measurements, we define a correlation measure and impose the assumption below, which quantifies the dependence between measurements.
Assumption 4.1.
There exists a positive constant such that and
for some and .
Assumption 4.1 requires that, for every variable, the number of other variables whose correlations with it exceed a certain threshold does not grow too fast. It is worth noting that this is a commonly imposed condition in large-scale multiple testing problems (Liu and Shao, 2014). Based on this assumption, we summarize the theoretical results in the following Theorem 4.3.
Theorem 4.3.
Theorem 4.3 develops theoretical guarantees for the consistency of the FDP control procedure. To further support the derived results, we make the following remark.
Remark 4.1.
Condition (4.5) is nearly optimal for controlling the false discovery proportion. More specifically, as shown in Proposition 2.1 of Liu and Shao (2014), if the number of alternative hypotheses is fixed instead of tending to infinity, the B-H approach fails to control the FDP at any level, even if the true p-values are known. ∎
4.2 Large-Scale Two-Sample Tests
In this section, we study large-scale two-sample testing. Specifically, let $\bm{Y}_j$, $j \in [m]$, be another sample of i.i.d. $p$-dimensional random vectors independent of $\{\bm{X}_i\}_{i \in [n]}$. Let $\bm{\Delta} = (\Delta_1, \ldots, \Delta_p)^\top$ be the location shift parameter. Following (2.9), the marginal HL estimator for each $\Delta_k$ is given by the median of the pairwise differences in the $k$th coordinate. A detailed global test for whether $\Delta_k = 0$ for all $k \in [p]$ is given in §E.4.
We next conduct simultaneous tests on the hypotheses
$$H_{0k}: \Delta_k = 0 \quad \text{versus} \quad H_{1k}: \Delta_k \neq 0, \qquad k \in [p],$$
where $\mathcal{H}_0$ denotes the set of true null hypotheses and $p_0$ is its cardinality. Throughout this section, we assume that $p_0 \asymp p$. For each $k \in [p]$, let $p_k$ denote the p-value for testing whether $\Delta_k = 0$. For any prescribed threshold $t > 0$, we shall reject the null hypothesis whenever $p_k \le t$. The primary goal is to control the following false discovery proportion,
$$\mathrm{FDP}(t) = \frac{V(t)}{\max\{R(t), 1\}},$$
by selecting a proper $t$, where $V(t)$ and $R(t)$ denote the numbers of false discoveries and total discoveries, respectively.
To approximate the unknown asymptotic distributions involved, we let the bootstrap weights be i.i.d. non-negative random variables generated in the same way as those in §3.2. Following (3.5), the bootstrap estimate for each $\Delta_k$ is defined coordinate-wise. Then, for each $k \in [p]$, the p-value $p_k$ is computed from the bootstrap calibration, and $p_{(1)} \le \cdots \le p_{(p)}$ are the ordered p-values. Following Benjamini and Hochberg (1995), the data-dependent threshold is $t = p_{(\widehat{k})}$, where $\widehat{k}$ is chosen by the same step-up rule as in §4.1.2.
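A sketch of the two-sample bootstrap draws: each draw keeps the X- and Y-observations receiving weight 2 and recomputes the two-sample HL estimate, mirroring (3.5); `hl_two_sample` is the illustrative helper from §2.2, and the p-values above are then formed from such draws.

```python
import numpy as np

def hl_two_sample_bootstrap(x, y, n_boot=300, rng=None):
    """Bootstrap draws of the two-sample HL estimator via 2*Bernoulli(0.5)
    weights: each draw is a sub-sampled HL estimate as in (3.5)."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        kx = rng.integers(0, 2, size=len(x)).astype(bool)
        ky = rng.integers(0, 2, size=len(y)).astype(bool)
        if not kx.any():                 # keep at least one observation per sample
            kx[0] = True
        if not ky.any():
            ky[0] = True
        boot[b] = hl_two_sample(x[kx], y[ky])
    return boot
```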
With these necessary tools at hand, we present the required assumption and the main theorem below. Specifically, Assumption 4.2 provides the formal condition that quantifies the level of dependence, while Theorem 4.4 presents the theoretical guarantees for two-sample multiple testing.
Assumption 4.2.
There exist positive constants such that , where , with and for each . Moreover, for some and , we have
5 Numerical Studies
In this section, we use simulation experiments to verify the theoretical findings of the paper. Specifically, we validate the results on Gaussian approximation and FDP control via the experiments in §5.1 and §5.2, respectively.
5.1 Numerical Studies for Global Tests
We generate the variables following the settings in §4.1 and §4.2, respectively, where the involved random variables follow the distributions described below. Cases 1-3 are devoted primarily to one-sample tests, whereas Cases 4-6 are for two-sample tests. Moreover, where indicated, the noise distribution is generated as the difference between two independent random variables drawn from the stated distribution.
• Case 1: Scaled $t$ distribution.
• Case 2: Mixture of a Pareto distribution with a given shape parameter and the standard Gaussian distribution.
• Case 3: Mixture of Gaussian distributions.
• Case 4: Gaussian distributions with different covariances, where the covariance matrix is the same as that in Case 3.
• Case 5: Mixture of Gaussian distributions, with the covariance matrix the same as that in Case 3, and a mixture of a Pareto distribution with shape parameter 2 and the standard Gaussian distribution.
• Case 6: Scaled $t$ distribution for the two-sample setting.
As a first step, we validate the Gaussian approximation results for the HL estimator by conducting the tests in (4.3) and (E.3) at level 0.05, respectively. We set the first 50 entries of $\bm{\theta}$ (or $\bm{\Delta}$) in §4.1 and §4.2 to a common signal strength, and the other entries to 0, where the signal strength increases from 0 to 0.25. When the signal strength is 0, the null hypothesis holds; otherwise, the alternative holds. The size or power of the tests in (4.3) and (E.3) is computed by averaging the outcomes of 500 replications of the methods in §4.1.1 and §E.4, respectively. In addition, in every replication, we run the weighted bootstrap 300 times to compute the critical value of the test.
In addition, under the same experimental settings, we also compare the performance of the HL estimator with the sample mean estimator (the sample mean in the one-sample case and the difference of sample means in the two-sample test). In this scenario, the critical values for the tests (4.3) and (E.3) via the sample mean are computed from 300 bootstrap samples drawn from a joint Gaussian distribution whose covariance is the sample covariance matrix (of one sample in the one-sample test and of both samples in the two-sample test). The results are summarized in Table 2.
Estimator | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6
(Cases 1–3: one-sample; Cases 4–6: two-sample. Within each estimator, rows correspond to signal strengths increasing from 0 to 0.25.)
HL | 0.054 | 0.048 | 0.052 | 0.040 | 0.046 | 0.045 | |
0.880 | 0.062 | 0.060 | 0.080 | 0.084 | 0.282 | ||
1.000 | 0.480 | 0.464 | 0.164 | 0.140 | 1.000 | ||
1.000 | 0.982 | 0.942 | 0.322 | 0.320 | 1.000 | ||
1.000 | 1.000 | 1.000 | 0.728 | 0.688 | 1.000 | ||
1.000 | 1.000 | 1.000 | 0.944 | 1.000 | 1.000 | ||
Mean | 0.004 | 0.000 | 0.044 | 0.052 | 0.000 | 0.000 | |
0.042 | 0.000 | 0.046 | 0.122 | 0.000 | 0.000 | ||
0.540 | 0.000 | 0.478 | 0.200 | 0.000 | 0.166 | ||
0.818 | 0.002 | 0.940 | 0.384 | 0.002 | 0.718 | ||
1.000 | 0.006 | 1.000 | 0.742 | 0.008 | 0.858 | ||
1.000 | 0.010 | 1.000 | 0.968 | 0.020 | 0.900 |
We conclude from Table 2 that, for the HL estimator, the sizes of the tests are roughly equal to the nominal level 0.05 in all scenarios when the null hypothesis is true. This validates the Gaussian approximation results. On the other hand, when the alternative holds, the power of the tests via the HL estimator rises quickly to 1 as the signal strength increases, demonstrating the effectiveness of the HL test statistics. In contrast, for the sample mean estimator, one observes that when the null holds, the size of the test is far below the nominal level in most scenarios, whereas, when the alternative holds, the power is much lower than that of the HL estimator. This further confirms the efficiency and robustness of the HL estimator.
5.2 Numerical Studies for FDP Control
We validate the theoretical findings for FDP control under multiple regimes and also compare the performance of the HL estimator with the Student's t-statistics given in Liu and Shao (2014). Specifically, we maintain most of the settings in §5.1. The only difference is that we let the first entries of $\bm{\theta}$ and $\bm{\Delta}$ be a fixed nonzero signal and the remaining ones be 0. The target false discovery proportion is varied uniformly from 0.05 to 0.25, and the empirical FDP is computed via the procedures in §4.1.2 and §4.2, respectively. Meanwhile, for both statistics, besides the FDP, the true positive proportion (TPP) of the test is also computed. The numerical outcomes are summarized in Tables 3 and 4 (corresponding to two different experimental settings, respectively).
Estimator | Case | α = 0.05 | α = 0.10 | α = 0.15 | α = 0.20 | α = 0.25 (entries: empirical FDP (TPP))
HL | Case 1 | 0.068 (1.000) | 0.110 (1.000) | 0.152 (1.000) | 0.212 (1.000) | 0.256 (1.000) |
Case 2 | 0.062 (1.000) | 0.116 (1.000) | 0.154 (1.000) | 0.198 (1.000) | 0.254 (1.000) | |
Case 3 | 0.052 (1.000) | 0.102 (1.000) | 0.150 (1.000) | 0.188 (1.000) | 0.248 (1.000) | |
Case 4 | 0.058 (1.000) | 0.122 (1.000) | 0.170 (1.000) | 0.238 (1.000) | 0.273 (1.000) | |
Case 5 | 0.064 (1.000) | 0.103 (1.000) | 0.142 (1.000) | 0.187 (1.000) | 0.234 (1.000) | |
Case 6 | 0.062 (1.000) | 0.092 (1.000) | 0.151 (1.000) | 0.193 (1.000) | 0.232 (1.000) | |
Student’s | Case 1 | 0.071 (1.000) | 0.109 (1.000) | 0.168 (1.000) | 0.228 (1.000) | 0.276 (1.000) |
Case 2 | 0.033 (0.976) | 0.075 (0.976) | 0.118 (0.990) | 0.161 (0.996) | 0.217 (1.000) | |
Case 3 | 0.042 (1.000) | 0.087 (1.000) | 0.131 (1.000) | 0.206 (1.000) | 0.255 (1.000) | |
Case 4 | 0.063 (1.000) | 0.127 (1.000) | 0.181 (1.000) | 0.246 (1.000) | 0.299 (1.000) | |
Case 5 | 0.026 (0.962) | 0.064 (0.962) | 0.110 (0.988) | 0.159 (0.994) | 0.195 (1.000) | |
Case 6 | 0.048 (1.000) | 0.106 (1.000) | 0.168 (1.000) | 0.196 (1.000) | 0.272 (1.000) |
Estimator | Case | α = 0.05 | α = 0.10 | α = 0.15 | α = 0.20 | α = 0.25 (entries: empirical FDP (TPP))
HL | Case 1 | 0.048 (1.000) | 0.112 (1.000) | 0.154 (1.000) | 0.202 (1.000) | 0.256 (1.000) |
Case 2 | 0.074 (1.000) | 0.133 (1.000) | 0.178 (1.000) | 0.222 (1.000) | 0.260 (1.000) | |
Case 3 | 0.054 (1.000) | 0.104 (1.000) | 0.153 (1.000) | 0.210 (1.000) | 0.254 (1.000) | |
Case 4 | 0.006 (0.984) | 0.048 (0.998) | 0.108 (1.000) | 0.176 (1.000) | 0.232 (1.000) | |
Case 5 | 0.000 (0.992) | 0.024 (0.992) | 0.092 (0.992) | 0.166 (0.992) | 0.200 (0.992) | |
Case 6 | 0.055 (1.000) | 0.095 (1.000) | 0.138 (1.000) | 0.184 (1.000) | 0.254 (1.000) | |
Student’s | Case 1 | 0.044 (1.000) | 0.098 (1.000) | 0.166 (1.000) | 0.204 (1.000) | 0.263 (1.000) |
Case 2 | 0.011 (0.910) | 0.056 (0.910) | 0.106 (0.930) | 0.161 (0.968) | 0.205 (0.984) | |
Case 3 | 0.052 (1.000) | 0.106 (1.000) | 0.150 (1.000) | 0.198 (1.000) | 0.259 (1.000) | |
Case 4 | 0.004 (1.000) | 0.046 (1.000) | 0.110 (1.000) | 0.158 (1.000) | 0.210 (1.000) | |
Case 5 | 0.000 (0.870) | 0.000 (0.952) | 0.056 (0.984) | 0.100 (0.990) | 0.170 (1.000) | |
Case 6 | 0.046 (1.000) | 0.106 (1.000) | 0.140 (1.000) | 0.180 (1.000) | 0.262 (1.000) |
Compared with the student’s t-statistics, the empirical FDPs of HL estimator are closer to the theoretical thresholds in most cases and outcomes based on HL estimator have larger true positive proportion (TPP). This validates the theory of FDP control and confirms the benefit and effectiveness of using HL estimator when heavy-tailed error exists. In addition, we also further compare the performance of HL estimator with the student’s t-statistics in both one-sample and two-sample tests when the noises follow distribution, where the HL estimator also performs much better. Interested readers are referred to §A.1 for more details.
6 Conclusion
In large-scale data analysis, conventional methods are ineffective, since outliers and variables with heavy tails can easily corrupt the data. To resolve heavy-tailed contamination, existing techniques, such as the Huberized mean, truncation, and the median of means, always require additional tuning parameters and moment restrictions. Consequently, they cannot effectively be scaled to high-dimensional applications with fidelity. Using the well-known Hodges-Lehmann estimator, the constraints on moments and tuning parameters can be removed. However, its non-asymptotic and large-scale properties have never been investigated. This paper fills this important gap by contributing a finite-sample analysis of the HL estimator, generalizing it to large-scale studies, and proposing tuning-free and moment-free high-dimensional testing methods.
There are various potential future directions that merit further study. First, we permit mild dependence among measurements while controlling the FDP in both the one- and two-sample regimes. However, in reality, high-dimensional data can exhibit strong dependence, such as data collected in the fields of economics, finance, genomics, and meteorology. To solve such strong-dependence problems, factor-adjusted multiple testing via Huber-type estimation has been proposed (Fan et al., 2019). However, using a Huber-type loss results in extra tuning parameters and moment constraints. Accordingly, factor adjustments could be incorporated into the large-scale HL estimation and testing procedures in order to develop tuning-free procedures adapted to strong-dependence scenarios. Second, in terms of computation, we need to compute the median of $O(n^2)$ pairwise averages. In the one-sample estimation regime, however, one may use the sub-sampling idea and compute the median of non-overlapping pair averages:
with a random permutation on $[n]$. By taking 5 or more permutations and averaging the resulting estimates, the averaged estimator numerically approximates the Hodges-Lehmann estimator well while only requiring $O(n)$ pair averages (Fan et al., 2020b). Therefore, it is also interesting to derive a non-asymptotic analysis for this estimator based on sub-sampling.
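A sketch of this sub-sampled variant under the scheme just described: for each of a few random permutations, take the median of the disjoint pair averages, then average the resulting estimates (illustrative Python, with our own function name).

```python
import numpy as np

def hl_subsampled(x, n_perm=5, rng=None):
    """Approximate one-sample HL estimate from non-overlapping pair
    averages: median of (x_{pi(2i-1)} + x_{pi(2i)})/2 per permutation pi,
    averaged over a few permutations (cf. Fan et al., 2020b)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x) - len(x) % 2            # drop one point if n is odd
    estimates = []
    for _ in range(n_perm):
        perm = rng.permutation(len(x))[:n]
        pairs = (x[perm[0::2]] + x[perm[1::2]]) / 2.0   # n/2 disjoint averages
        estimates.append(np.median(pairs))
    return float(np.mean(estimates))
```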
References
- Arcones (1995) Arcones, M. A. (1995). The asymptotic accuracy of the bootstrap of u-quantiles. Ann. Statist. 1802–1822.
- Arcones (1996) Arcones, M. A. (1996). The bahadur-kiefer representation for u-quantiles. Ann. Statist., 24 1400–1422.
- Bai and Saranadasa (1996) Bai, Z. and Saranadasa, H. (1996). Effect of high dimension: by an example of a two sample problem. Statist. Sinica, 6 311–329.
- Bauer (1972) Bauer, D. F. (1972). Constructing confidence sets using rank statistics. J. Amer. Statist. Assoc., 67 687–690.
- Belloni and Chernozhukov (2011) Belloni, A. and Chernozhukov, V. (2011). $\ell_1$-penalized quantile regression in high-dimensional sparse models. Ann. Statist., 39 82–130.
- Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B, 57 289–300.
- Bentkus and Dzindzalieta (2015) Bentkus, V. K. and Dzindzalieta, D. (2015). A tight gaussian bound for weighted sums of rademacher random variables. Bernoulli, 21 1231–1237.
- Blanchard and Roquain (2009) Blanchard, G. and Roquain, É. (2009). Adaptive false discovery rate control under independence and dependence. Journal of Machine Learning Research, 10.
- Brownlees et al. (2015) Brownlees, C., Joly, E. and Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43 2507–2536.
- Bühlmann and Van De Geer (2011) Bühlmann, P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
- Catoni (2012) Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’IHP Probabilités et statistiques, vol. 48.
- Chang et al. (2017) Chang, J., Zheng, C., Zhou, W.-X. and Zhou, W. (2017). Simulation-based hypothesis testing of high dimensional means under covariance heterogeneity. Biometrics, 73 1300–1310.
- Chen and Qin (2010) Chen, S. X. and Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist., 38 808–835.
- Chen and Zhou (2020) Chen, X. and Zhou, W.-X. (2020). Robust inference via multiplier bootstrap. Ann. Statist., 48 1665–1691.
- Chen et al. (2021) Chen, Y., Chi, Y., Fan, J., Ma, C. et al. (2021). Spectral methods for data science: A statistical perspective. Foundations and Trends® in Machine Learning, 14 566–806.
- Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. Ann. Probab., 45 2309–2352.
- Chernozhuokov et al. (2022) Chernozhuokov, V., Chetverikov, D., Kato, K. and Koike, Y. (2022). Improved central limit theorem and bootstrap approximations in high dimensions. Ann. Statist., 50 2562–2586.
- Chi (2007) Chi, Z. (2007). On the performance of fdr control: constraints and a partial solution. The Annals of Statistics, 35 1409–1431.
- Cont (2001) Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative finance, 1 223.
- Eklund et al. (2016) Eklund, A., Nichols, T. E. and Knutsson, H. (2016). Cluster failure: Why fmri inferences for spatial extent have inflated false-positive rates. Proceedings of the national academy of sciences, 113 7900–7905.
- Fan et al. (2014) Fan, J., Fan, Y. and Barut, E. (2014). Adaptive robust variable selection. Ann. Statist., 42 324–351.
- Fan et al. (2022a) Fan, J., Gu, Y. and Zhou, W.-X. (2022a). How do noise tails impact on deep relu networks? arXiv preprint arXiv:2203.10418.
- Fan et al. (2007) Fan, J., Hall, P. and Yao, Q. (2007). To how many simultaneous hypothesis tests can normal, Student's $t$ or bootstrap calibration be applied? J. Amer. Statist. Assoc., 102 1282–1288.
- Fan et al. (2019) Fan, J., Ke, Y., Sun, Q. and Zhou, W.-X. (2019). FarmTest: factor-adjusted robust multiple testing with approximate false discovery control. J. Amer. Statist. Assoc., 114 1880–1893.
- Fan et al. (2017) Fan, J., Li, Q. and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. J. R. Stat. Soc. Ser. B. Stat. Methodol., 79 247–265.
- Fan et al. (2020a) Fan, J., Li, R., Zhang, C.-H. and Zou, H. (2020a). Statistical foundations of data science. Chapman and Hall/CRC.
- Fan et al. (2022b) Fan, J., Lou, Z. and Yu, M. (2022b). Are latent factor regression and sparse regression adequate? arXiv preprint arXiv:2203.01219.
- Fan et al. (2020b) Fan, J., Ma, C. and Wang, K. (2020b). Comment on “a tuning-free robust and efficient approach to high-dimensional regression”. Journal of the American Statistical Association, 115 1720–1725.
- Fan et al. (2021) Fan, J., Wang, W. and Zhu, Z. (2021). A shrinkage principle for heavy-tailed data: high-dimensional robust low-rank matrix recovery. Ann. Statist., 49 1239–1266.
- Fan et al. (2022c) Fan, J., Yang, Z. and Yu, M. (2022c). Understanding implicit regularization in over-parameterized single index model. Journal of the American Statistical Association 1–14.
- Fan and Yao (2017) Fan, J. and Yao, Q. (2017). The elements of financial econometrics. Cambridge University Press.
- Fang et al. (2020) Fang, X., Luo, L. and Shao, Q.-M. (2020). A refined Cramér-type moderate deviation for sums of local statistics. Bernoulli, 26 2319–2352.
- Ferreira and Zwinderman (2006) Ferreira, J. A. and Zwinderman, A. H. (2006). On the Benjamini-Hochberg method. Ann. Statist., 34 1827–1849.
- Finotello and Di Camillo (2015) Finotello, F. and Di Camillo, B. (2015). Measuring differential gene expression with rna-seq: challenges and strategies for data analysis. Briefings in functional genomics, 14 130–142.
- Genovese and Wasserman (2004) Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery control. The annals of statistics, 32 1035–1061.
- Giné et al. (2000) Giné, E., Latała, R. and Zinn, J. (2000). Exponential and moment inequalities for u-statistics. In High Dimensional Probability II. Springer, 13–38.
- Goldstein et al. (2018) Goldstein, L., Minsker, S. and Wei, X. (2018). Structured signal recovery from non-linear and heavy-tailed measurements. IEEE Transactions on Information Theory, 64 5513–5530.
- Gupta et al. (2014) Gupta, S., Ellis, S. E., Ashar, F. N., Moes, A., Bader, J. S., Zhan, J., West, A. B. and Arking, D. E. (2014). Transcriptome analysis reveals dysregulation of innate immune response genes and neuronal activity-dependent genes in autism. Nature communications, 5 1–8.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J. H. and Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer.
- Hastie et al. (2015) Hastie, T., Tibshirani, R. and Wainwright, M. (2015). Statistical learning with sparsity. Monographs on statistics and applied probability, 143 143.
- He and Shao (1996) He, X. and Shao, Q.-M. (1996). A general bahadur representation of m-estimators and its application to linear regression with nonstochastic designs. The Annals of Statistics, 24 2608–2630.
- He and Shao (2000) He, X. and Shao, Q.-M. (2000). On parameters of increasing dimensions. Journal of multivariate analysis, 73 120–135.
- Hodges and Lehmann (1963) Hodges, J. L., Jr. and Lehmann, E. L. (1963). Estimates of location based on rank tests. Ann. Math. Statist., 34 598–611.
- Høyland (1965) Høyland, A. (1965). Robustness of the Hodges-Lehmann estimates for shift. Ann. Math. Statist., 36 174–197.
- Hsu and Sabato (2016) Hsu, D. and Sabato, S. (2016). Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17 543–582.
- Huber (1973) Huber, P. J. (1973). Robust regression: asymptotics, conjectures and monte carlo. The annals of statistics 799–821.
- Koenker and Hallock (2001) Koenker, R. and Hallock, K. F. (2001). Quantile regression. Journal of economic perspectives, 15 143–156.
- Lehmann (1963) Lehmann, E. L. (1963). Nonparametric confidence intervals for a shift parameter. Ann. Math. Statist., 34 1507–1512.
- Li and Chen (2012) Li, J. and Chen, S. X. (2012). Two sample tests for high-dimensional covariance matrices. Ann. Statist., 40 908–940.
- Li and Tibshirani (2013) Li, J. and Tibshirani, R. (2013). Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat. Methods Med. Res., 22 519–536.
- Li et al. (2012) Li, J., Witten, D. M., Johnstone, I. M. and Tibshirani, R. (2012). Normalization, testing, and false discovery rate estimation for rna-sequencing data. Biostatistics, 13 523–538.
- Liu and Shao (2014) Liu, W. and Shao, Q.-M. (2014). Phase transition and regularized bootstrap in large-scale $t$-tests with false discovery rate control. Ann. Statist., 42 2003–2025.
- Loh (2017) Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust $M$-estimators. The Annals of Statistics, 45 866–896.
- Mammen (1989) Mammen, E. (1989). Asymptotics with increasing dimension for robust regression with applications to the bootstrap. The Annals of Statistics 382–400.
- Minsker (2015) Minsker, S. (2015). Geometric median and robust estimation in banach spaces. Bernoulli, 21 2308–2335.
- Minsker (2018) Minsker, S. (2018). Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics, 46 2871–2903.
- Nagalakshmi et al. (2008) Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M. and Snyder, M. (2008). The transcriptional landscape of the yeast genome defined by rna sequencing. Science, 320 1344–1349.
- Nemirovsky and Yudin (1983) Nemirovsky, A. S. and Yudin, D. B. a. (1983). Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics, John Wiley & Sons, Inc., New York.
- Petrov (1975) Petrov, V. V. (1975). Sums of independent random variables. Ergebnisse der Mathematik und ihrer Grenzgebiete, Band 82, Springer-Verlag, New York-Heidelberg. Translated from the Russian by A. A. Brown.
- Rosenkranz (2010) Rosenkranz, G. K. (2010). A note on the hodges–lehmann estimator. Pharmaceutical statistics, 9 162–167.
- Shendure and Ji (2008) Shendure, J. and Ji, H. (2008). Next-generation dna sequencing. Nature biotechnology, 26 1135–1145.
- Stock and Watson (2002) Stock, J. H. and Watson, M. W. (2002). Macroeconomic forecasting using diffusion indexes. Journal of Business & Economic Statistics, 20 147–162.
- Storey (2002) Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol., 64 479–498.
- Storey (2003) Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the $q$-value. Ann. Statist., 31 2013–2035.
- Storey et al. (2004) Storey, J. D., Taylor, J. E. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. R. Stat. Soc. Ser. B Stat. Methodol., 66 187–205.
- Sun et al. (2020) Sun, Q., Zhou, W.-X. and Fan, J. (2020). Adaptive huber regression. Journal of the American Statistical Association, 115 254–265.
- Wainwright (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, vol. 48. Cambridge University Press.
- Wang and Fan (2022) Wang, B. and Fan, J. (2022). Robust matrix completion with heavy-tailed noise. arXiv preprint arXiv:2206.04276.
- Wang et al. (2020) Wang, L., Peng, B., Bradic, J., Li, R. and Wu, Y. (2020). A tuning-free robust and efficient approach to high-dimensional regression. Journal of the American Statistical Association, 115 1700–1714.
- Wang et al. (2015) Wang, L., Peng, B. and Li, R. (2015). A high-dimensional nonparametric multivariate test for mean vector. Journal of the American Statistical Association, 110 1658–1669.
- Wang et al. (2009) Wang, Z., Gerstein, M. and Snyder, M. (2009). Rna-seq: a revolutionary tool for transcriptomics. Nature reviews genetics, 10 57–63.
- Xia et al. (2018) Xia, Y., Cai, T. T. and Li, H. (2018). Joint testing and false discovery rate control in high-dimensional multivariate regression. Biometrika, 105 249–269.
- Yang et al. (2017) Yang, Z., Balasubramanian, K. and Liu, H. (2017). High-dimensional non-gaussian single index models via thresholded score function estimation. In International conference on machine learning. PMLR.
- Yohai and Maronna (1979) Yohai, V. J. and Maronna, R. A. (1979). Asymptotic behavior of m-estimators for the linear model. The Annals of Statistics 258–268.
- Zhang et al. (2020) Zhang, J.-T., Guo, J., Zhou, B. and Cheng, M.-Y. (2020). A simple two-sample test in high dimensions based on $L^2$-norm. J. Amer. Statist. Assoc., 115 1011–1027.
- Zheng et al. (2015) Zheng, Q., Peng, L. and He, X. (2015). Globally adaptive quantile regression with ultra-high dimensional data. Annals of statistics, 43 2225.
- Zhou et al. (2018) Zhou, W.-X., Bose, K., Fan, J. and Liu, H. (2018). A new perspective on robust $M$-estimation: finite sample theory and applications to dependence-adjusted multiple testing. Ann. Statist., 46 1904–1931.
Appendix A Appendix
In the following sections, we present additional simulation results as well as the proofs of all the theoretical results in the main paper.
A.1 Additional Simulation
In this section, we conduct additional simulations for FDP control using the HL estimator and the Student's t-statistics, respectively. We keep all settings as in §5.2, except that the noises are generated from the following Case 7 and Case 8, where we replace the $t$ distributions used in Case 1 and Case 6 with heavier-tailed ones. The numerical outcomes are summarized in Tables 5 and 6.
• Case 7: Scaled $t$ distribution (the heavier-tailed analogue of Case 1).
• Case 8: Scaled $t$ distribution for the two-sample setting (the heavier-tailed analogue of Case 6).
Estimator | Case | α = 0.05 | α = 0.10 | α = 0.15 | α = 0.20 | α = 0.25 (entries: empirical FDP (TPP))
HL | Case 7 | 0.036 (1.000) | 0.082 (1.000) | 0.132 (1.000) | 0.180 (1.000) | 0.222 (1.000) |
Case 8 | 0.069 (1.000) | 0.105 (1.000) | 0.168 (1.000) | 0.225 (1.000) | 0.272 (1.000) | |
Student’s | Case 7 | 0.000 (0.282) | 0.000 (0.284) | 0.000 (0.320) | 0.000 (0.320) | 0.000 (0.320) |
Case 8 | 0.000 (0.274) | 0.000 (0.274) | 0.000 (0.274) | 0.000 (0.322) | 0.000 (0.322) |
Estimator | Case | α = 0.05 | α = 0.10 | α = 0.15 | α = 0.20 | α = 0.25 (entries: empirical FDP (TPP))
HL | Case 7 | 0.036 (1.000) | 0.082 (1.000) | 0.122 (1.000) | 0.179 (1.000) | 0.224 (1.000)
Case 8 | 0.058 (1.000) | 0.105 (1.000) | 0.157 (1.000) | 0.214 (1.000) | 0.252 (1.000) | |
Student’s | Case 7 | 0.000 (0.282) | 0.000 (0.282) | 0.000 (0.282) | 0.000 (0.284) | 0.000 (0.284) |
Case 8 | 0.000 (0.134) | 0.000 (0.134) | 0.000 (0.134) | 0.000 (0.134) | 0.000 (0.134) |
Appendix B Lemmas
Lemma B.1.
Let , where are independent random variables such that for each . Then, for any , we have
Lemma B.2.
Let be i.i.d. random variables and , where is a symmetric function with for some . Then, for any , we have
Let be another sample of i.i.d. random variables independent of and let , where is bounded such that for some . Then, for any , we have
Appendix C Proof of Theoretical Results in §2
C.1 Proof of Theorem 2.1
For simplicity of notation, we write for . Define and
With this notation, we have , where
Lemma C.1.
For any , there exists a universal positive constant such that
(C.1) |
Proof of Lemma C.1.
Lemma C.2.
For any , with probability at least , we have
(C.2) |
where .
Proof of Lemma C.2.
For any and , we have
Similarly, we have
Consequently, we obtain
For each , by (C.1), with probability at least ,
where is a universal constant.
For each , by Lemma B.1, with probability at least , we have
Hence with the same probability, we have
Similarly, it follows that
Putting all these pieces together, we obtain (C.2). ∎
C.2 Proof of Theorem 2.2
C.3 Proof of Theorem 2.3
Proof of Theorem 2.3.
Recall the definitions of and in (C.3). Since , it follows that
where is any fixed constant and is a positive constant depending only on . Denote . By (2.4), we have
By Mill’s inequality, for any ,
Consequently, it follows that
Similarly, we have
Putting all these pieces together, we obtain (2.7). ∎
C.4 Proof of Theorem 2.4
C.5 Proof of Theorem 2.5
Appendix D Proof of Results in §3
Define
Lemma D.1.
Proof of Lemma D.1.
For each , denote
Recall that . Hence
Since and , it follows from Lemma B.1 that
where is a positive constant depending only on the density function . Note that is a sequence of independent centered random variables conditional on for each . Hence, by Hoeffding’s inequality, we have . Therefore
Consequently, by the Cauchy-Schwarz inequality, we have and
By Lemma B.2, we have and . Putting all these pieces together, we obtain (D.1). ∎
D.1 Proof of Theorem 3.1
Define the bootstrap U-process
Define and for . We first introduce some notation. Let . For each and , denote
Lemma D.2.
Let . Under , for any , we have
(D.2) | ||||
(D.3) |
Proof of Lemma D.2.
For any and , we have
Similarly, we have
Consequently, we have
Hence it suffices to upper bound . For each , we have
Recall that , where are i.i.d. random variables such that and for each . Hence, it follows from Lemma B.2 that . We first upper bound . Observe that is a degenerate second order U-statistic with
Hence, by Theorem 3.3 in Giné et al. (2000),
We now upper bound . For simplicity of notation, denote and . Define
(D.4) |
Hence, under , by Theorem 1.1 in Bentkus and Dzindzalieta (2015), it follows that
Putting all these pieces together, we obtain (D.2). ∎
Lemma D.3.
Let be defined in (D.4). Assume that and . Then we have
(D.5) |
Proof of Lemma D.3.
By the triangle inequality,
For each , conditional on , are independent centered random variables with
Then, by the Bernstein inequality, with probability at least , we have
We now bound . Observe that are independent random variables with
Hence, it follows from Lemma B.1 that
Recall that . Hence for each . Therefore, by Lemma C.2, with probability at least , we have
Putting all these pieces together, we obtain (D.5). ∎
Proof of Theorem 3.1.
By Theorem 2.1, for any such that , we have
Combined with Lemma C.2, for any such that , with probability at least , we have
Similar to the proof of Theorem 2.1, for any such that , we have
Then it follows from Lemma B.1 that
By Lemma D.2 and Lemma D.3, with probability at least , we have
where is a positive constant depending only on and . Therefore, combined with the fact that , we obtain
D.2 Proof of Theorem 3.2
Appendix E Proof of Results in §4
E.1 Proof of Theorem 4.1
E.2 Proof of Theorem 4.2
Proof of Theorem 4.2.
For each and , denote
Let , where for each . By the triangle inequality,
By Lemma B.1, it follows that . Recall that
Hence, it follows from Lemma B.2 that . Consequently, for any , by Theorem 1.1 in Bentkus and Dzindzalieta (2015), with probability at least , we have
Combined with Lemma, with probability at least , we have
∎
E.3 Proof of Theorem 4.3
Proof of Theorem 4.3.
Following the proof of Theorem 2.3, it follows from Theorem 3.1 that with probability at least , we have
(E.1) |
uniformly for and . By Lemma 1 in Storey et al. (2004), it follows that
By the definition of , we have
(E.2) |
Observe that . Hence, it follows from (E.1) that . Combining this with (E.2) yields
For some , define
Under , it is straightforward to verify that
By Theorem 2.3, it follows that . Then, following the proof of Theorem 3.3 in Zhou et al. (2018), for any sequence , it is straightforward to derive that
Putting all these pieces together, we obtain (4.6). ∎
E.4 Large-scale Two-sample Simultaneous Testing
We consider the following high dimensional two-sample global test,
(E.3) |
Given the marginal HL estimators of $\Delta_1, \ldots, \Delta_p$, we shall reject the null hypothesis in (E.3) whenever the maximum deviation statistic exceeds some threshold. To this end, we develop a Gaussian approximation for this statistic. More specifically, let $\bm{Z}$ be a $p$-dimensional centered Gaussian random vector with
In the following theorem, we establish a non-asymptotic upper bound for the Kolmogorov distance between the distribution functions of and its Gaussian analogue .
Theorem E.1.
Let for each . Assume that there exist positive constants and such that
Then, under Assumption 2.1, we have
Proof of Theorem E.1.
Motivated by Theorem E.1, an asymptotic $\alpha$-level test for (E.3) rejects whenever the test statistic exceeds the $(1-\alpha)$th quantile of the bootstrap statistic, namely,
The validity of the proposed test is justified via the following theorem.
Theorem E.2.
Under the conditions of Theorem E.1, we have