
High dimensional asymptotics of likelihood ratio tests in the Gaussian sequence model under convex constraints

Qiyang Han Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA. [email protected] Bodhisattva Sen Department of Statistics, Columbia University, New York, NY 10027, USA. [email protected]  and  Yandi Shen Department of Statistics, University of Washington, Seattle, WA 98105, USA. [email protected]
Abstract.

In the Gaussian sequence model $Y=\mu+\xi$, we study the likelihood ratio test (LRT) for testing $H_{0}:\mu=\mu_{0}$ versus $H_{1}:\mu\in K$, where $\mu_{0}\in K$, and $K$ is a closed convex set in $\mathbb{R}^{n}$. In particular, we show that under the null hypothesis, normal approximation holds for the log-likelihood ratio statistic for a general pair $(\mu_{0},K)$, in the high dimensional regime where the estimation error of the associated least squares estimator diverges in an appropriate sense. The normal approximation further leads to a precise characterization of the power behavior of the LRT in the high dimensional regime. These characterizations show that the power behavior of the LRT is in general non-uniform with respect to the Euclidean metric, and illustrate the conservative nature of existing minimax optimality and sub-optimality results for the LRT. A variety of examples, including testing in the orthant/circular cone, isotonic regression, Lasso, and testing parametric assumptions versus shape-constrained alternatives, are worked out to demonstrate the versatility of the developed theory.

Key words and phrases:
Central limit theorem, isotonic regression, lasso, normal approximation, power analysis, projection onto a closed convex set, second-order Poincaré inequalities, shape constraint
2000 Mathematics Subject Classification:
62G08, 60F05, 62G10, 62E17
The research of Q. Han is partially supported by DMS-1916221. The research of B. Sen is partially supported by DMS-2015376.

1. Introduction

1.1. The likelihood ratio test

Consider the Gaussian sequence model

$$Y=\mu+\xi, \qquad (1.1)$$

where $\mu\in\mathbb{R}^{n}$ is unknown and $\xi=(\xi_{1},\ldots,\xi_{n})$ is an $n$-dimensional standard Gaussian vector. In a variety of applications, prior knowledge on the mean vector $\mu$ can be naturally translated into the constraint $\mu\in K$, where $K$ is a closed convex set in $\mathbb{R}^{n}$. Two such important examples that will be considered in this paper are: (i) the Lasso in its constrained form [Tib96], where $K$ is an $\ell_{1}$-norm ball, and (ii) isotonic regression [CGS15], where $K$ is the cone consisting of monotone sequences. We also refer the readers to [Bar02, JN02, BHL05, Cha14, GS18] and the many references therein for a diverse list of further concrete examples of $K$. In this paper, we will be interested in the following 'goodness-of-fit' testing problem:

$$H_{0}:\mu=\mu_{0}\qquad\textrm{versus}\qquad H_{1}:\mu\in K, \qquad (1.2)$$

where $\mu_{0}\in K\subset\mathbb{R}^{n}$, and $K$ is an arbitrary closed and convex subset of $\mathbb{R}^{n}$. Throughout the manuscript, the asymptotics will take place as $n\to\infty$, and the explicit dependence of $\mu,\mu_{0},K$ and related quantities on the dimension $n$ will be suppressed for ease of notation.

Given an observation $Y$ generated from model (1.1), arguably the most natural and generic test for (1.2) is the likelihood ratio test (LRT). Under the Gaussian model (1.1), the log-likelihood ratio statistic (LRS) for (1.2) takes the form

$$T(Y) \equiv \lVert Y-\mu_{0}\rVert^{2}-\lVert Y-\widehat{\mu}_{K}\rVert^{2} = \lVert\mu+\xi-\mu_{0}\rVert^{2}-\lVert\mu+\xi-\Pi_{K}(\mu+\xi)\rVert^{2}\geq 0. \qquad (1.3)$$

Here $\widehat{\mu}_{K}\equiv\Pi_{K}(Y)\equiv\arg\min_{\nu\in K}\lVert Y-\nu\rVert^{2}$ is the metric projection of $Y$ onto the constraint set $K$ with respect to the canonical Euclidean $\ell_{2}$ norm $\lVert\cdot\rVert$ on $\mathbb{R}^{n}$. As $K$ is both closed and convex, $\Pi_{K}$ is well-defined, and the resulting $\widehat{\mu}_{K}$ is both the least squares estimator (LSE) and the maximum likelihood estimator of the mean vector $\mu$ under the Gaussian model (1.1). The risk behavior of $\widehat{\mu}_{K}$ is completely characterized in the recent work [Cha14].

The LRT for (1.2) and generalizations thereof have gained extensive attention in the literature; see e.g. [Che54, Bar59a, Bar59b, Bar61a, Bar61b, Kud63, BBBB72, KC75, RW78, WR84, Sha85, RLN86, RWD88, Sha88, MS91, MRS92a, MRS92b, DT01, Mey03, SM17, WWG19] for an incomplete list. In our setting, an immediate way to use the LRS $T(Y)$ in (1.3) to form a test is to simulate the critical values of $T(Y)$ under $H_{0}$. More precisely, for any significance level $\alpha\in(0,1)$, we may determine through simulations an acceptance region $\mathcal{I}_{\alpha}\subset\mathbb{R}$ such that the LRS satisfies $\mathbb{P}\big(T(Y)\in\mathcal{I}_{\alpha}\big)=1-\alpha$ under $H_{0}$, and then formulate the LRT accordingly. In some special cases, including the classical setting where $K$ is a subspace, the null distribution of $T(Y)$ is even known in closed form, so the simulation step can be skipped.
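This simulation step is easy to sketch numerically. The toy example below (not part of the formal development; the choices $\mu_{0}=0$, $K$ the nonnegative orthant, and all numeric settings are purely illustrative) estimates the $(1-\alpha)$-quantile of $T(Y)$ under $H_{0}$, using that projection onto the orthant is coordinatewise clipping at zero:

```python
import numpy as np

def proj_orthant(y):
    # metric projection onto the nonnegative orthant: coordinatewise clipping at zero
    return np.maximum(y, 0.0)

def lrs(y, mu0, proj):
    # log-likelihood ratio statistic T(y) = ||y - mu0||^2 - ||y - Pi_K(y)||^2
    return np.sum((y - mu0) ** 2) - np.sum((y - proj(y)) ** 2)

rng = np.random.default_rng(0)
n, alpha, n_sim = 100, 0.05, 20000
mu0 = np.zeros(n)

# simulate the null distribution of T(Y) and read off the critical value
null_draws = np.array([lrs(mu0 + rng.standard_normal(n), mu0, proj_orthant)
                       for _ in range(n_sim)])
crit = np.quantile(null_draws, 1 - alpha)

# the resulting one-sided LRT rejects when T(Y) > crit; by construction its
# simulated size is close to alpha
size = np.mean(null_draws > crit)
```

For this cone and $\mu_{0}=0$ one has $T(\xi)=\lVert\Pi_{K}(\xi)\rVert^{2}$, whose null mean is $n/2$, so the simulated critical value sits a couple of standard deviations above $n/2=50$.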

Clearly, the almost effortless LRT described above already gives exact type I error control at the prescribed level for a generic pair $(\mu_{0},K)$. The equally important question of its power behavior, however, is more complicated and requires a much deeper investigation. In the classical setting of parametric models and certain semiparametric models, the power behavior of the LRT can be precisely determined, at least asymptotically, for contiguous alternatives in the corresponding parameter spaces, cf. [vdV98, vdV02]. An important and basic ingredient for the success of the power analysis in these settings is the existence of a limiting distribution of the LRS under $H_{0}$ that can be 'perturbed' in a large number of directions of alternatives.

Unfortunately, the distribution of the LRS $T(Y)$ in (1.3) under the null, in both finite-sample and asymptotic regimes, is understood in only very few cases. One such case is, as mentioned above, the classical setting where $K$ is a subspace of dimension $\dim(K)$. Then the null distribution of $T(Y)$ is a chi-squared distribution with $\dim(K)$ degrees of freedom. Another case is when $\mu_{0}=0$ and $K$ is a closed convex cone. In this case, the null distribution of $T(Y)$ is the chi-bar squared distribution, cf. [Bar61a, Kud63, BBBB72, KC75, Sha85, RWD88], which can be expressed as a finite mixture of chi-squared distributions. Apart from these special cases, next to nothing is known about the distribution of the LRS $T(Y)$ for a general pair $(\mu_{0},K)$ under the null $H_{0}$, owing in large part to the fact that the null distribution of $T(Y)$ depends heavily on the exact location of $\mu_{0}$ with respect to $K$ and is thus intractable in general. Consequently, the lack of such a general description of the limiting distribution of $T(Y)$ causes a fundamental difficulty in obtaining precise characterizations of the power behavior of the LRT for a general pair $(\mu_{0},K)$. On the other hand, such generality is of great interest, as it allows us to consider several significant examples, for instance testing general signals in isotonic regression and with the constrained Lasso. See Section 4 for more details.

1.2. Normal approximation and power characterization

The unifying theme of this paper starts from the simple observation that in the classical setting where $K$ is a subspace, as long as $\dim(K)$ diverges as $n\rightarrow\infty$, the distribution of $T(Y)$ has a progressively Gaussian shape under proper normalization. Such a normal approximation in 'high dimensions' also holds for the more complicated chi-bar squared distribution; see [Dyk91, GNP17] for different sets of conditions. One may therefore hope that normal approximation of $T(Y)$ under the null holds in a far more general context than just these cases. More importantly, such a distributional approximation would ideally form the basis for a power analysis of the LRT.

1.2.1. Normal approximation

The first main result of this paper (see Theorem 3.1) shows that, although the exact distribution of $T(Y)$ under $H_{0}$ is highly problem-specific and depends crucially on the pair $(\mu_{0},K)$ as described above, Gaussian approximation of $T(Y)$ indeed holds in a fairly general context after proper normalization. More concretely, we show that under $H_{0}$,

$$\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\approx\mathcal{N}(0,1)\quad\hbox{in total variation} \qquad (1.4)$$

holds in the high dimensional regime where the estimation error $\mathbb{E}_{\mu_{0}}\lVert\widehat{\mu}_{K}-\mu_{0}\rVert^{2}$ diverges in an appropriate sense; see Theorem 3.1 and the discussion afterwards for an explanation. Here and below, $\mathcal{N}(0,1)$ denotes the standard normal distribution, and we reserve the notation

$$m_{\mu}\equiv\mathbb{E}_{\mu}(T(Y))\qquad\text{and}\qquad\sigma^{2}_{\mu}\equiv\operatorname{Var}_{\mu}(T(Y)) \qquad (1.5)$$

for the mean and variance of the LRS $T(Y)$ under (1.1) with mean $\mu$, so that $m_{\mu_{0}}$ and $\sigma^{2}_{\mu_{0}}$ in (1.4) are the corresponding quantities under $H_{0}$. In a similar spirit, we use the subscript $\mu$ in $\mathbb{P}_{\mu}$ and other probabilistic notation to indicate that the evaluation is under (1.1) with mean $\mu$.

When the normal approximation (1.4) holds, an asymptotically equivalent formulation of the previously mentioned finite-sample LRT is the following LRT, whose acceptance region is determined by normal quantiles: for any $\alpha\in(0,1)$, let

$$\Psi(Y)\equiv\Psi(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\equiv\bm{1}\bigg(\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\in\mathcal{A}_{\alpha}^{c}\bigg), \qquad (1.6)$$

where $\mathcal{A}_{\alpha}$ is a possibly unbounded interval in $\mathbb{R}$ such that $\mathbb{P}(\mathcal{N}(0,1)\in\mathcal{A}_{\alpha})=1-\alpha$. Common choices of $\mathcal{A}_{\alpha}$ include: (i) $(-\infty,z_{\alpha}]$ for the one-sided LRT, and (ii) $[-z_{\alpha/2},z_{\alpha/2}]$ for the two-sided LRT, where $z_{\alpha}$, for any $\alpha\in(0,1)$, is the normal quantile defined by $\mathbb{P}(\mathcal{N}(0,1)\geq z_{\alpha})=\alpha$. Although $m_{\mu_{0}},\sigma_{\mu_{0}}$ do not admit general explicit formulae (some notable exceptions can be found in e.g. [MT14, Table 6.1] or [GNP17, Table 1]), their numeric values can be approximated by simulation. In what follows, we will focus on the LRT given by (1.6), and in particular its power behavior, when the normal approximation in (1.4) holds.
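The normal-quantile calibration in (1.6) can be sketched as follows (again for the illustrative toy case $\mu_{0}=0$ with $K$ the nonnegative orthant; the settings are not from the paper). The null mean and standard deviation are approximated by simulation, and the one-sided test rejects when the standardized statistic exceeds $z_{\alpha}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, n_sim = 400, 0.05, 20000
z_alpha = 1.6449  # upper-alpha standard normal quantile for alpha = 0.05

def lrs(y):
    # T(y) for mu0 = 0 and K the nonnegative orthant:
    # ||y||^2 - ||y - max(y,0)||^2 = ||max(y,0)||^2
    return np.sum(np.maximum(y, 0.0) ** 2)

# approximate m_{mu0} and sigma_{mu0} by simulation under H0
null_draws = np.array([lrs(rng.standard_normal(n)) for _ in range(n_sim)])
m0, s0 = null_draws.mean(), null_draws.std()

def Psi(y):
    # one-sided LRT (1.6) with acceptance region A_alpha = (-inf, z_alpha]
    return float((lrs(y) - m0) / s0 > z_alpha)

# sanity check: under H0 the rejection rate of Psi should be close to alpha
rej = np.mean((null_draws - m0) / s0 > z_alpha)
```

With $n=400$ the standardized null draws are already close to $\mathcal{N}(0,1)$, so the simulated rejection rate lands near the nominal $5\%$ level.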

1.2.2. Power characterization

Using the normal approximation (1.4), our second main result (see Theorem 3.2) shows that under mild regularity conditions,

$$\mathbb{E}_{\mu}\Psi(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\approx\mathbb{P}\bigg[\mathcal{N}\bigg(\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}},1\bigg)\in\mathcal{A}_{\alpha}^{c}\bigg]. \qquad (1.7)$$

This power formula implies that for a wide class of alternatives, the LRS $T(Y)$ still has an asymptotically Gaussian shape under the alternative, but with a mean shift parameter $(m_{\mu}-m_{\mu_{0}})/\sigma_{\mu_{0}}$. In particular, (1.7) implies that

$$\mathcal{L}\bigg(\bigg\{\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\bigg\}\bigg)\subset\Delta_{\mathcal{A}_{\alpha}}^{-1}(\beta)\;\;\Leftrightarrow\;\;\mathbb{E}_{\mu}\Psi(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\to\beta\in[0,1]. \qquad (1.8)$$

Here $\overline{\mathbb{R}}\equiv\mathbb{R}\cup\{\pm\infty\}$; for a sequence $\{w_{n}\}\subset\mathbb{R}$, $\mathcal{L}(\{w_{n}\})$ denotes the set of all limit points of $\{w_{n}\}$ in $\overline{\mathbb{R}}$; and the power function $\Delta_{\mathcal{A}_{\alpha}}:\overline{\mathbb{R}}\to[0,1]$ is defined in (3.2) below. For instance, when $\mathcal{A}_{\alpha}=(-\infty,z_{\alpha}]$ is the acceptance region for the one-sided LRT, $\Delta_{(-\infty,z_{\alpha}]}(w)=\Phi(-z_{\alpha}+w)$. In general, $\Delta_{\mathcal{A}_{\alpha}}(0)=\alpha$ and $\Delta_{\mathcal{A}_{\alpha}}(w)=1$ only if $w\in\{\pm\infty\}$. Hence the LRT is power consistent under $\mu$, i.e., $\mathbb{E}_{\mu}\Psi(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\rightarrow 1$, if and only if

$$\mathcal{L}\bigg(\bigg\{\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\bigg\}\bigg)\subset\Delta_{\mathcal{A}_{\alpha}}^{-1}(1)\subset\{\pm\infty\}. \qquad (1.9)$$

The asymptotically exact power characterization (1.8) for the LRT is rather rare beyond the classical parametric and certain semiparametric settings under contiguous alternatives (cf. [vdV98, vdV02]). The setting in (1.8) can therefore be viewed as a general nonparametric analogue of contiguous alternatives for the LRT in the Gaussian sequence model (1.1).

A notable implication of (1.9) is that for any alternative $\mu\in K$, the power characterization of the LRT depends on the quantity $m_{\mu}-m_{\mu_{0}}$, which cannot in general be equivalently reduced to the usual lower bound condition on $\lVert\mu-\mu_{0}\rVert$. This indicates the non-uniform power behavior of the LRT with respect to the Euclidean norm $\lVert\cdot\rVert$. As the LRT (with an optimal calibration) is known to be minimax optimal in terms of uniform separation under $\lVert\cdot\rVert$ in several examples (cf. [WWG19]), the non-uniform characterization (1.9) hints that the minimax optimality criteria can be too conservative and non-informative for evaluating the power behavior of the LRT.

Another implication of (1.9) is that in certain cases the one-sided LRT (i.e., $\mathcal{A}_{\alpha}=(-\infty,z_{\alpha}]$) has asymptotically vanishing power, whereas the two-sided LRT (i.e., $\mathcal{A}_{\alpha}=[-z_{\alpha/2},z_{\alpha/2}]$) is power consistent. This phenomenon occurs when the limit point $-\infty$ in (1.9) is achieved for certain alternatives $\mu\in K$ in the high dimensional limit. See Remark 3.5 ahead for a detailed discussion.
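The power formula (1.7) can be checked numerically. In the illustrative orthant toy case used above (with $\mu_{0}=0$ and a dense, low-amplitude alternative, both hypothetical choices not from the paper), the empirical power of the one-sided LRT should track $\Phi(-z_{\alpha}+(m_{\mu}-m_{\mu_{0}})/\sigma_{\mu_{0}})$:

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(2)
n, n_sim, z_alpha = 400, 20000, 1.6449

def lrs(y):
    # LRS for mu0 = 0 and K the nonnegative orthant
    return np.sum(np.maximum(y, 0.0) ** 2)

# null mean and sd of T(Y), approximated by simulation
null_draws = np.array([lrs(rng.standard_normal(n)) for _ in range(n_sim)])
m0, s0 = null_draws.mean(), null_draws.std()

# a hypothetical dense, low-amplitude alternative in K
mu = np.full(n, 0.1)
alt_draws = np.array([lrs(mu + rng.standard_normal(n)) for _ in range(n_sim)])

# empirical power of the one-sided LRT versus the prediction of (1.7)
power_mc = np.mean((alt_draws - m0) / s0 > z_alpha)
power_formula = Phi(-z_alpha + (alt_draws.mean() - m0) / s0)
```

In this regime the mean shift $(m_{\mu}-m_{\mu_{0}})/\sigma_{\mu_{0}}$ stays bounded, so both quantities sit strictly between $\alpha$ and $1$, and they agree up to Monte Carlo and approximation error.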

1.3. Testing subspace versus closed convex cone

A particularly important special setting of (1.2) is the case of testing $H_{0}:\mu=0$ versus $H_{1}:\mu\in K$, where $K$ is a closed convex cone in $\mathbb{R}^{n}$. We perform a detailed case study of the following slightly more general testing problem:

$$H_{0}:\mu\in K_{0}\qquad\textrm{versus}\qquad H_{1}:\mu\in K, \qquad (1.10)$$

where $K_{0}\subset K\subset\mathbb{R}^{n}$ is a subspace and $K$ is a closed convex cone. The primary motivation to study (1.10) arises from the problem of testing a global polynomial structure versus its shape-constrained generalization; concrete examples include constancy versus monotonicity, linearity versus convexity, etc.; see Section 4.5 for details. The LRS for (1.10) takes the slightly modified form

$$T(Y) \equiv T_{K_{0}}(Y)\equiv\lVert Y-\widehat{\mu}_{K_{0}}\rVert^{2}-\lVert Y-\widehat{\mu}_{K}\rVert^{2} = \lVert\mu+\xi-\Pi_{K_{0}}(\mu+\xi)\rVert^{2}-\lVert\mu+\xi-\Pi_{K}(\mu+\xi)\rVert^{2}. \qquad (1.11)$$

The dependence of the LRS $T(Y)$ on $K_{0}$ will usually be suppressed in the notation when no confusion can arise.

Specializing our first main result to this testing problem, we show in Theorem 3.8 that normal approximation of $T(Y)$ under $H_{0}$ holds essentially under the minimal growth condition $\delta_{K}-\dim(K_{0})\to\infty$, where $\delta_{K}$ is the statistical dimension of $K$ (formally defined in Definition 2.2). Similar to (1.8), the normal approximation makes possible the following precise characterization of the power behavior of the LRT under the prescribed growth condition (see Theorem 3.9):

$$\mathcal{L}\bigg(\bigg\{\frac{\mathbb{E}\lVert\Pi_{K}\big(\mu-\Pi_{K_{0}}(\mu)+\xi\big)\rVert^{2}-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}}{\sigma_{0}}\bigg\}\bigg)\subset\Delta_{\mathcal{A}_{\alpha}}^{-1}(\beta)\cap[0,+\infty] \;\;\Leftrightarrow\;\; \mathbb{E}_{\mu}\Psi(Y;m_{0},\sigma_{0})\to\beta\in[0,1]. \qquad (1.12)$$

As $\sigma_{0}^{2}=\operatorname{Var}(T(\xi))\asymp\delta_{K}-\dim(K_{0})$ (cf. Lemma 2.4) for the modified LRS $T(Y)$ in (1.11), the LRT is power consistent under $\mu$ if and only if

$$\frac{\mathbb{E}\lVert\Pi_{K}\big(\mu-\Pi_{K_{0}}(\mu)+\xi\big)\rVert^{2}-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}}{\big(\delta_{K}-\dim(K_{0})\big)^{1/2}}\to+\infty. \qquad (1.13)$$

Formula (1.13) shows that power consistency of the LRT is determined completely by (a complicated expression involving) the 'distance' of the alternative $\mu\in K$ to its projection onto $K_{0}$ in the problem (1.10). Compared to the uniform $\lVert\cdot\rVert$-separation rate derived in the recent work [WWG19] (cf. (3.19) below), (1.12)-(1.13) provide asymptotically precise power characterizations of the LRT for a sequence of point alternatives. This difference is indeed crucial, as (1.13), similar to (1.9), cannot be equivalently inverted into a lower bound on $\lVert\mu-\Pi_{K_{0}}(\mu)\rVert$ alone. This illustrates that the non-uniform power behavior of the LRT is not an aberration in certain artificial testing problems, but rather a fundamental property of the LRT in the high dimensional regime that already appears in the special yet important setting of testing a subspace versus a cone.

1.4. Examples

As an illustration of the scope of our theoretical results, we validate the normal approximation of the LRT and exemplify its power behavior in two classes of problems:

(1) Testing in the orthant/circular cone, isotonic regression, and the Lasso;

(2) Testing parametric assumptions versus shape-constrained alternatives, e.g., constancy versus monotonicity, linearity versus convexity, and generalizations thereof.

1.4.1. Non-uniform power of the LRT

Some of the above problems give clear examples of the aforementioned non-uniform power behavior of the LRT: in the problem of testing $\mu=0$ versus the orthant or (product) circular cone, the LRT is indeed powerful against most alternatives in the region where the uniform separation in $\lVert\cdot\rVert$ is not informative. More concretely:

• In the case of the orthant cone, the LRT is known to be minimax optimal (cf. [WWG19]) in terms of a uniform $\lVert\cdot\rVert$-separation of order $n^{1/4}$. Our results show that the LRT is actually powerful for 'most' alternatives $\mu$ with $\lVert\mu\rVert=\mathcal{O}(n^{1/4})$, including some with $\lVert\cdot\rVert$-separation of order $n^{\delta}$ for any $\delta>0$. This showcases the conservative nature of the minimax optimality criteria. See Section 4.1 for details.

• In the case of the (product) circular cone, the LRT is known to be minimax sub-optimal (cf. [WWG19]), with a $\lVert\cdot\rVert$-separation of order $n^{1/4}$ while the minimax separation rate is of constant order. Our results show that this minimax sub-optimality is witnessed only by a few unfortunate alternatives, and that the LRT is powerful within a large cylindrical set that includes many points of constant $\lVert\cdot\rVert$-separation order. This also identifies the minimax framework as too pessimistic for the sub-optimality results of the LRT; see Section 4.2 for details.

1.5. Related literature

The results in this paper are related to the vast literature on nonparametric testing in the Gaussian sequence model, or more general Gaussian models, under a minimax framework. We refer the readers to the monographs [IS03, GN16] for a comprehensive treatment of this topic, and to [Bar02, JN02, DJ04, BHL05, ITV10, ACCP11, Ver12, CD13, Car15, CCT17, CCC+19, CV19, CCC+20, MS20] and references therein for some recent papers on a variety of testing problems. Many results in these references establish minimax separation rates under a pre-specified metric, with the Euclidean metric $\lVert\cdot\rVert$ being a popular choice in the Gaussian sequence model. In particular, for the testing problem (1.10), this minimax approach with the $\lVert\cdot\rVert$ metric is adopted in the recent work [WWG19], which derived minimax lower bounds for the separation rates and the uniform separation rate of the LRT. These results show that the LRT is minimax rate-optimal in a number of examples, while being sub-optimal in some others.

Our results are of a rather different flavor and give a precise distributional description of the LRT. Such a description is made possible by the central limit theorems for the LRS under the null proved in Theorems 3.1 and 3.8. It also allows us to take two significant further steps beyond the work [WWG19]:

(1) For the testing problem (1.10), we provide an asymptotically exact power formula for the LRT in (1.12) for each and every alternative, as opposed to lower bounds for the uniform separation rates of the LRT in $\lVert\cdot\rVert$ as in [WWG19]. As a result, the main results for the separation rates of the LRT in [WWG19] follow as a corollary of our main results (see Corollary 3.11 for a formal statement).

(2) Our theory applies to the general testing problem (1.2), which allows for a general pair $(\mu_{0},K)$ without a cone structure. This level of generality goes much beyond the scope of [WWG19] and covers several significant examples, including testing in isotonic regression and Lasso problems.

The precise power characterization we derive in (1.12) has interesting implications when compared to the minimax results derived in [WWG19]. In particular, as discussed in Section 1.4.1, (i) it is possible for the LRT to substantially beat the minimax separation rates in the $\lVert\cdot\rVert$ metric at individual alternatives, and (ii) the sub-optimality of the LRT in [WWG19] is actually witnessed only at alternatives along some 'bad' directions. In this sense, our results not only give a precise understanding of the power behavior of the canonical LRT in this testing problem, but also highlight some intrinsic limitations of the popular minimax framework under the Euclidean metric in the Gaussian sequence model, both in terms of its optimality and sub-optimality criteria.

From a technical perspective, our proof technique differs significantly from the one adopted in [WWG19]. Indeed, the proofs of the central limit theorems and the precise power formulae in this paper are inspired by the second-order Poincaré inequality due to [Cha09] and related normal approximation results in [GNP17]. These technical developments are of independent interest and have broader applicability; see for instance [HJS21] for further developments in the context of testing high dimensional covariance matrices.

1.6. Organization

The rest of the paper is organized as follows. Section 2 reviews some basic facts about metric projection and conic geometry. Section 3 studies normal approximation for the LRS $T(Y)$ and power characterizations of the LRT, both in the general setting (1.2) and in the more structured setting (1.10). Applications of the abstract theory to the examples mentioned above are detailed in Section 4. Proofs are collected in Sections 5 and 6 and the appendix.

1.7. Notation

For any positive integer $n$, let $[1:n]$ denote the set $\{1,\ldots,n\}$. For $a,b\in\mathbb{R}$, $a\vee b\equiv\max\{a,b\}$ and $a\wedge b\equiv\min\{a,b\}$. For $a\in\mathbb{R}$, let $a_{\pm}\equiv(\pm a)\vee 0$. For $x\in\mathbb{R}^{n}$, let $\lVert x\rVert_{p}$ denote its $p$-norm ($0\leq p\leq\infty$), and let $B_{p}(r;x)\equiv\{z\in\mathbb{R}^{n}:\lVert z-x\rVert_{p}\leq r\}$. We simply write $\lVert x\rVert\equiv\lVert x\rVert_{2}$, $B(r;x)\equiv B_{2}(r;x)$, and $B(r)\equiv B(r;0)$ for notational convenience. By $\bm{1}_{n}$ we denote the vector of all ones in $\mathbb{R}^{n}$. For a matrix $M\in\mathbb{R}^{n\times n}$, let $\lVert M\rVert$ and $\lVert M\rVert_{F}$ denote the spectral and Frobenius norms of $M$, respectively.

For a multi-index $\bm{k}=(k_{1},\ldots,k_{n})\in\mathbb{Z}_{\geq 0}^{n}$, let $\lvert\bm{k}\rvert\equiv\sum_{i=1}^{n}k_{i}$. For $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ and $\bm{k}\in\mathbb{Z}_{\geq 0}^{n}$, let $\partial_{\bm{k}}f(z)\equiv\frac{\partial^{\lvert\bm{k}\rvert}f(z)}{\partial z_{1}^{k_{1}}\cdots\partial z_{n}^{k_{n}}}$ for $z\in\mathbb{R}^{n}$ whenever definable. A vector-valued map $f:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ is said to have sub-exponential growth at $\infty$ if $\lim_{\lVert x\rVert\to\infty}\lVert f(x)e^{-\lVert x\rVert}\rVert=0$. For $f=(f_{1},\ldots,f_{n}):\mathbb{R}^{n}\to\mathbb{R}^{n}$, let $J_{f}(z)\equiv(\partial f_{i}(z)/\partial z_{j})_{i,j=1}^{n}$ denote the Jacobian of $f$, and let

$$\operatorname{div}f(z)\equiv\sum_{i=1}^{n}\frac{\partial}{\partial z_{i}}f_{i}(z)=\operatorname{tr}(J_{f}(z))$$

for znz\in\mathbb{R}^{n} whenever definable.

We use $C_{x}$ to denote a generic constant that depends only on $x$, whose numeric value may change from line to line unless otherwise specified. $a\lesssim_{x}b$ and $a\gtrsim_{x}b$ mean $a\leq C_{x}b$ and $a\geq C_{x}b$ respectively, and $a\asymp_{x}b$ means $a\lesssim_{x}b$ and $a\gtrsim_{x}b$ ($a\lesssim b$ means $a\leq Cb$ for some absolute constant $C$). For two nonnegative sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}\ll b_{n}$ (respectively $a_{n}\gg b_{n}$) if $\lim_{n\rightarrow\infty}(a_{n}/b_{n})=0$ (respectively $\lim_{n\rightarrow\infty}(a_{n}/b_{n})=\infty$). We follow the convention that $0/0=0$. $\mathcal{O}_{\mathbf{P}}$ and $\mathfrak{o}_{\mathbf{P}}$ denote the usual big and small O notation in probability.

We reserve the notation $\xi=(\xi_{1},\ldots,\xi_{n})$ for an $n$-dimensional standard normal random vector, and $\varphi,\Phi$ for the density and the cumulative distribution function of a standard normal random variable. For any $\alpha\in(0,1)$, let $z_{\alpha}$ be the normal quantile defined by $\mathbb{P}(\mathcal{N}(0,1)\geq z_{\alpha})=\alpha$. For two random variables $X,Y$ on $\mathbb{R}$, we use $d_{\mathrm{TV}}(X,Y)$ and $d_{\mathrm{Kol}}(X,Y)$ to denote their total variation distance and Kolmogorov distance, defined respectively as

$$d_{\mathrm{TV}}(X,Y)\equiv\sup_{B\in\mathcal{B}(\mathbb{R})}\big\lvert\mathbb{P}(X\in B)-\mathbb{P}(Y\in B)\big\rvert,\qquad d_{\mathrm{Kol}}(X,Y)\equiv\sup_{t\in\mathbb{R}}\big\lvert\mathbb{P}(X\leq t)-\mathbb{P}(Y\leq t)\big\rvert.$$

Here $\mathcal{B}(\mathbb{R})$ denotes the collection of all Borel measurable sets in $\mathbb{R}$.

2. Preliminaries: metric projection and conic geometry

In this section, we review some basic facts about metric projection and conic geometry. For any $x\in\mathbb{R}^{n}$, the metric projection of $x$ onto a closed convex set $K\subset\mathbb{R}^{n}$ is defined by

$$\Pi_{K}(x)\equiv\operatorname*{arg\,min}_{y\in K}\lVert x-y\rVert^{2}.$$

It is a standard fact that the map $\Pi_{K}$ is well-defined, $1$-Lipschitz, and hence absolutely continuous. The Jacobian $J_{\Pi_{K}}$ is therefore almost everywhere (a.e.) well-defined.
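The $1$-Lipschitz property is easy to witness numerically. The sketch below (illustrative, not from this paper) takes $K$ to be an $\ell_{1}$-ball, the constraint set of the constrained Lasso, computes $\Pi_{K}$ by the standard sort-and-threshold scheme, and checks feasibility and non-expansiveness on random inputs:

```python
import numpy as np

def proj_l1_ball(v, z=1.0):
    # Euclidean projection onto K = {x : ||x||_1 <= z} via the standard
    # sort-and-threshold scheme (soft-thresholding at a data-driven level)
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - z)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(7)
x, y = rng.standard_normal(50), rng.standard_normal(50)
px, py = proj_l1_ball(x, z=2.0), proj_l1_ball(y, z=2.0)

# Pi_K lands in K (here on its boundary, since ||x||_1 > z) and is 1-Lipschitz
on_boundary = abs(np.abs(px).sum() - 2.0) < 1e-8
lipschitz = np.linalg.norm(px - py) <= np.linalg.norm(x - y)
```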

Let $G:\mathbb{R}^{n}\rightarrow\mathbb{R}$ be defined by

$$G(y)\equiv\lVert y-\Pi_{K}(y)\rVert^{2}.$$

We summarize some useful properties of $G$ and $J_{\Pi_{K}}$ in the following lemma.

Lemma 2.1.

The following statements hold.

(1) $G$ is absolutely continuous and its gradient $\nabla G(y)=2(y-\Pi_{K}(y))$ has sub-exponential growth at $\infty$.

(2) For a.e. $y\in\mathbb{R}^{n}$, $\lVert J_{\Pi_{K}}(y)\rVert\vee\lVert I-J_{\Pi_{K}}(y)\rVert\leq 1$ and $J_{\Pi_{K}}(y)^{\top}\Pi_{K}(y)=J_{\Pi_{K}}(y)^{\top}y$.

Proof.

(1) follows from [GNP17, Lemma 2.2] and the proof of [GNP17, Lemma A.2]. The first claim of (2) is proved in [GNP17, Lemma 2.1]. For the second claim of (2), note that by the chain rule $\nabla G(y)=2(I-J_{\Pi_{K}}(y))^{\top}(y-\Pi_{K}(y))$. By (1), $\nabla G(y)=2(y-\Pi_{K}(y))$, so $J_{\Pi_{K}}(y)^{\top}(y-\Pi_{K}(y))=0$, proving the claim. ∎

Recall that a closed and convex cone $K\subset\mathbb{R}^{n}$ is polyhedral if it is a finite intersection of closed half-spaces, and that a face of $K$ is a set of the form $K\cap H$, where $H$ is a supporting hyperplane of $K$ in $\mathbb{R}^{n}$. Let $\text{lin}(F)$ denote the linear span of $F$. The dimension of a face $F$ is $\dim F\equiv\dim(\text{lin}(F))$, and the relative interior of $F$ is the interior of $F$ in $\text{lin}(F)$.

The complexity of a closed convex cone KK can be described by its statistical dimension defined as follows.

Definition 2.2.

The statistical dimension $\delta_{K}$ of a closed convex cone $K$ is defined as $\delta_{K}\equiv\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}$.
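Two closed-form instances (both standard, used here only as illustrations) are $\delta_{K}=d$ when $K$ is a $d$-dimensional subspace, and $\delta_{K}=n/2$ when $K$ is the nonnegative orthant. A quick Monte Carlo sketch of $\delta_{K}\equiv\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sim = 50, 40000
xi = rng.standard_normal((n_sim, n))

# K = nonnegative orthant: Pi_K(xi) = max(xi, 0), and delta_K = n/2
delta_orthant = np.mean(np.sum(np.maximum(xi, 0.0) ** 2, axis=1))

# K = d-dimensional coordinate subspace: Pi_K keeps the first d coordinates,
# and delta_K = d
d = 10
delta_subspace = np.mean(np.sum(xi[:, :d] ** 2, axis=1))
```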

The statistical dimension $\delta_{K}$ admits several equivalent definitions; see e.g. [ALMT14, Proposition 3.1]. In particular, $\delta_{K}=\mathbb{E}\big(\sup_{\nu\in K\cap B(1)}\left\langle\nu,\xi\right\rangle\big)^{2}$. For any polyhedral cone $K\subset\mathbb{R}^{n}$ and $j\in\{0,\ldots,n\}$, the $j$-th intrinsic volume of $K$ is defined as

$$v_{j}(K)\equiv\mathbb{P}\big(\Pi_{K}(\xi)\in\hbox{relative interior of a $j$-dimensional face of $K$}\big). \qquad (2.1)$$

More generally, the intrinsic volumes $\{v_{j}(K)\}_{j=0}^{n}$ of a closed convex cone $K\subset\mathbb{R}^{n}$ are defined as the limit of (2.1) along polyhedral approximations; see [MT14, Section 7.3]. These quantities are well-defined and have been investigated in considerable depth; see e.g. [ALMT14, MT14, GNP17].
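For the nonnegative orthant (an illustrative choice, not special to the paper), the face whose relative interior contains $\Pi_{K}(\xi)$ is indexed by the positive coordinates of $\xi$, so definition (2.1) reduces to $v_{j}(K)=\binom{n}{j}2^{-n}$, the Binomial$(n,1/2)$ probabilities. A minimal Monte Carlo check:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)
n, n_sim = 8, 200000

# For the orthant, Pi_K(xi) lies in the relative interior of the face indexed by
# the positive coordinates of xi, so V_K = #{i : xi_i > 0} ~ Binomial(n, 1/2)
V = (rng.standard_normal((n_sim, n)) > 0).sum(axis=1)
v_mc = np.array([(V == j).mean() for j in range(n + 1)])
v_exact = np.array([comb(n, j) / 2.0 ** n for j in range(n + 1)])
max_err = np.max(np.abs(v_mc - v_exact))
```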

Definition 2.3.

For any closed convex cone $K\subset\mathbb{R}^{n}$, let $V_{K}$ be a random variable taking values in $\{0,\ldots,n\}$ such that $\mathbb{P}(V_{K}=j)=v_{j}(K)$.

We summarize some useful properties of $\delta_{K}$ and $V_{K}$ in the following lemma. An elementary and self-contained proof is given in Appendix A.1.

Lemma 2.4.

Let $K$ be a closed convex cone. Then

(1) $\delta_{K}=\mathbb{E}V_{K}$;

(2) $\operatorname{Var}(\lVert\Pi_{K}(\xi)\rVert^{2})=\operatorname{Var}(V_{K})+2\delta_{K}$;

(3) $2\delta_{K}\leq\operatorname{Var}(\lVert\Pi_{K}(\xi)\rVert^{2})\leq 2\delta_{K}+2\lVert\mathbb{E}\Pi_{K}(\xi)\rVert^{2}\leq 4\delta_{K}$.

For any closed convex cone $K\subset\mathbb{R}^{n}$, its polar cone is defined as

$$K^{*}\equiv\left\{v\in\mathbb{R}^{n}:\left\langle v,u\right\rangle\leq 0\ \text{ for all }u\in K\right\}. \qquad (2.2)$$

With $\Pi_{K^{*}}$ denoting the metric projection onto $K^{*}$, Moreau's theorem [Roc97, Theorem 31.5] states that for any $v\in\mathbb{R}^{n}$,

$$v=\Pi_{K}(v)+\Pi_{K^{*}}(v)\qquad\hbox{with}\qquad\left\langle\Pi_{K}(v),\Pi_{K^{*}}(v)\right\rangle=0.$$
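Moreau's decomposition is transparent for the nonnegative orthant (an illustrative case): the polar cone is the nonpositive orthant, and the two projections are the coordinatewise positive and negative parts of $v$, which are orthogonal and sum to $v$:

```python
import numpy as np

rng = np.random.default_rng(6)
v = rng.standard_normal(1000)

# K = nonnegative orthant, whose polar K* is the nonpositive orthant; the two
# projections are the coordinatewise positive and negative parts of v
pK, pK_star = np.maximum(v, 0.0), np.minimum(v, 0.0)

decomposition_ok = np.allclose(v, pK + pK_star)  # v = Pi_K(v) + Pi_{K*}(v)
orthogonal = abs(float(pK @ pK_star)) < 1e-12    # <Pi_K(v), Pi_{K*}(v)> = 0
```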

3. Theory

3.1. Normal approximation for T(Y)T(Y) and power characterizations

We start by presenting the normal approximation result for T(Y)T(Y) in (1.1) under the null hypothesis (1.2); see Section 5.1 for a proof. This will serve as the basis for the size and power analysis of the LRT (1.6) in the testing problem (1.2).

Theorem 3.1.

Let KnK\subset\mathbb{R}^{n} be a closed convex set and μ0K\mu_{0}\in K. Then under H0H_{0},

dTV(T(Y)mμ0σμ0,𝒩(0,1))8𝔼μ0μ^Kμ022𝔼μ0μ^Kμ02+𝔼μ0Jμ^KF2.\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}},\mathcal{N}(0,1)\bigg{)}\leq\frac{8\sqrt{\mathbb{E}_{\mu_{0}}\lVert\widehat{\mu}_{K}-\mu_{0}\rVert^{2}}}{2\lVert\mathbb{E}_{\mu_{0}}\widehat{\mu}_{K}-\mu_{0}\rVert^{2}+\lVert\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K}}\rVert_{F}^{2}}. (3.1)

Here Jμ^KJμ^K(ξ)JΠK(μ0+ξ)J_{\widehat{\mu}_{K}}\equiv J_{\widehat{\mu}_{K}}(\xi)\equiv J_{\Pi_{K}}(\mu_{0}+\xi), and mμ0,σμ0m_{\mu_{0}},\sigma_{\mu_{0}} are as defined in (1.5).

The bound (3.1) is obtained by a generalization of [GNP17, Theorem 2.1] using the second-order Poincaré inequality [Cha09], together with a lower bound for σμ02\sigma_{\mu_{0}}^{2} using Fourier analysis in the Gaussian space [NP12, Section 1.5]. The Fourier expansion can be performed up to the second order thanks to the absolute continuity of the first-order partial derivatives of T(Y)T(Y) (cf. Lemma 2.1).

We now comment on the structure of (3.1). The first term 𝔼μ0μ^Kμ02\lVert\mathbb{E}_{\mu_{0}}\widehat{\mu}_{K}-\mu_{0}\rVert^{2} in the denominator is the squared bias of the projection estimator μ^K\widehat{\mu}_{K}, while the second term 𝔼μ0Jμ^KF2\lVert\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K}}\rVert_{F}^{2}, which depends on the magnitudes of the first-order partial derivatives of μ^K\widehat{\mu}_{K}, can be roughly understood as the ‘variance’ of μ^K\widehat{\mu}_{K}. Consequently, one may expect that the denominator is of the order 𝔼μ0μ^Kμ02\mathbb{E}_{\mu_{0}}\lVert\widehat{\mu}_{K}-\mu_{0}\rVert^{2}, so the overall bound scales as 𝒪(1/𝔼μ0μ^Kμ02)\mathcal{O}\big{(}1/\sqrt{\mathbb{E}_{\mu_{0}}\lVert\widehat{\mu}_{K}-\mu_{0}\rVert^{2}}\big{)}. As will be clear in Section 4, this is indeed the case in all the examples worked out, and the major step in applying (3.1) to concrete problems typically depends on obtaining sharp lower bounds for the ‘variance’ term 𝔼μ0Jμ^KF2\lVert\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K}}\rVert_{F}^{2}, which may require non-trivial problem-specific techniques.

Using Theorem 3.1, we may characterize the size and power behavior of the LRT. For a possibly unbounded interval II\subset\mathbb{R}, let ΔI:¯[0,1]\Delta_{I}:\overline{\mathbb{R}}\to[0,1] be defined as follows: For ww\in\mathbb{R},

ΔI(w)1(𝒩(0,1)Iw)=(𝒩(0,1)Icw),\displaystyle\Delta_{I}(w)\equiv 1-\mathbb{P}\big{(}\mathcal{N}(0,1)\in I-w\big{)}=\mathbb{P}\big{(}\mathcal{N}(0,1)\in I^{c}-w\big{)}, (3.2)

and ΔI(±)limw±ΔI(w)\Delta_{I}(\pm\infty)\equiv\lim_{w\to\pm\infty}\Delta_{I}(w), which is clearly well-defined. ΔI\Delta_{I} is either monotonic or unimodal, so ΔI1(β)\Delta_{I}^{-1}(\beta) contains at most two elements for any β[0,1]\beta\in[0,1]. Two primary examples of II are 𝒜αos(,zα]\mathcal{A}^{\textrm{os}}_{\alpha}\equiv(-\infty,z_{\alpha}] and 𝒜αts[zα/2,zα/2]\mathcal{A}^{\textrm{ts}}_{\alpha}\equiv[-z_{\alpha/2},z_{\alpha/2}] — the acceptance regions for the one- and two-sided LRTs respectively, where we have

Δ𝒜αos(w)=Φ(zα+w),Δ𝒜αts(w)=Φ(zα/2+w)+Φ(zα/2w).\displaystyle\Delta_{\mathcal{A}^{\textrm{os}}_{\alpha}}(w)=\Phi(-z_{\alpha}+w),\;\;\Delta_{\mathcal{A}^{\textrm{ts}}_{\alpha}}(w)=\Phi(-z_{\alpha/2}+w)+\Phi(-z_{\alpha/2}-w). (3.3)
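The two functions in (3.3) are elementary to compute; the following sketch (Python standard library, illustrative values) confirms the facts used below, namely that both equal α\alpha at w=0w=0, that the one-sided version is monotone, and that the two-sided version is minimized at w=0w=0:

```python
from statistics import NormalDist

nd = NormalDist()
z = lambda a: nd.inv_cdf(1 - a)       # upper-a standard normal quantile

def Delta_os(w, alpha):               # acceptance region (-inf, z_alpha]
    return nd.cdf(-z(alpha) + w)

def Delta_ts(w, alpha):               # acceptance region [-z_{alpha/2}, z_{alpha/2}]
    return nd.cdf(-z(alpha / 2) + w) + nd.cdf(-z(alpha / 2) - w)

alpha = 0.05
assert abs(Delta_os(0.0, alpha) - alpha) < 1e-9     # size alpha at w = 0
assert abs(Delta_ts(0.0, alpha) - alpha) < 1e-9
assert Delta_os(3.0, alpha) > Delta_os(1.0, alpha) > alpha   # monotone
assert Delta_ts(-2.0, alpha) > alpha < Delta_ts(2.0, alpha)  # dip at w = 0
```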

It is clear that Δ𝒜αos(0)=Δ𝒜αts(0)=α\Delta_{\mathcal{A}^{\textrm{os}}_{\alpha}}(0)=\Delta_{\mathcal{A}^{\textrm{ts}}_{\alpha}}(0)=\alpha, Δ𝒜αos1(1)={+}\Delta^{-1}_{\mathcal{A}^{\textrm{os}}_{\alpha}}(1)=\{+\infty\}, and Δ𝒜αts1(1)={±}\Delta^{-1}_{\mathcal{A}^{\textrm{ts}}_{\alpha}}(1)=\{\pm\infty\}. Recall the definitions of mμm_{\mu} and σμ2\sigma^{2}_{\mu} for general μK\mu\in K in (1.5). The following result (see Section 5.2 for a proof) characterizes the power behavior of the LRT.

Theorem 3.2.

Consider testing (1.2) using the LRT as in (1.6). There exists some constant C𝒜α>0C_{\mathcal{A}_{\alpha}}>0 such that

|𝔼μΨ(Y;mμ0,σμ0)Δ𝒜α(mμmμ0σμ0)|\displaystyle\bigg{\lvert}\mathbb{E}_{\mu}\Psi(Y;m_{\mu_{0}},\sigma_{\mu_{0}})-\Delta_{\mathcal{A}_{\alpha}}\bigg{(}\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\bigg{)}\bigg{\rvert}
2errμ0+C𝒜α(1μμ0|mμmμ0|σμ0).\displaystyle\qquad\leq 2\cdot\mathrm{err}_{\mu_{0}}+C_{\mathcal{A}_{\alpha}}\cdot\mathscr{L}\bigg{(}1\bigwedge\frac{\lVert\mu-\mu_{0}\rVert}{\lvert m_{\mu}-m_{\mu_{0}}\rvert\vee\sigma_{\mu_{0}}}\bigg{)}. (3.4)

Here

errμ0dKol(T(μ0+ξ)mμ0σμ0,𝒩(0,1))right hand side of (3.1),\displaystyle\mathrm{err}_{\mu_{0}}\equiv d_{\mathrm{Kol}}\bigg{(}\frac{T(\mu_{0}+\xi)-m_{\mu_{0}}}{\sigma_{\mu_{0}}},\mathcal{N}(0,1)\bigg{)}\leq\hbox{right hand side of (\ref{ineq:lrt_clt_global_testing})},

and (x)x1log(1/x)\mathscr{L}(x)\equiv x\sqrt{1\vee\log(1/x)} for x>0x>0 and (0)0\mathscr{L}(0)\equiv 0. Consequently:

  1. (1)

    The LRT in (1.6) has size

    |𝔼μ0Ψ(Y;mμ0,σμ0)α|2errμ0.\displaystyle\big{\lvert}\mathbb{E}_{\mu_{0}}\Psi(Y;m_{\mu_{0}},\sigma_{\mu_{0}})-\alpha\big{\rvert}\leq 2\cdot\mathrm{err}_{\mu_{0}}.
  2. (2)

    Suppose the normal approximation of T(Y)T(Y) holds under H0H_{0}, i.e., errμ00\mathrm{err}_{\mu_{0}}\to 0. Then, for any μK\mu\in K such that

    μμ0|mμmμ0|σμ0,\displaystyle\lVert\mu-\mu_{0}\rVert\ll\lvert m_{\mu}-m_{\mu_{0}}\rvert\vee\sigma_{\mu_{0}}, (3.5)

    we have

    ({mμmμ0σμ0})Δ𝒜α1(β)𝔼μΨ(Y;mμ0,σμ0)β[0,1].\displaystyle\mathcal{L}\bigg{(}\bigg{\{}\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\bigg{\}}\bigg{)}\subset\Delta_{\mathcal{A}_{\alpha}}^{-1}(\beta)\;\Leftrightarrow\;\mathbb{E}_{\mu}\Psi(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\to\beta\in[0,1]. (3.6)

    Hence under (3.5), the LRT is power consistent under μ\mu, i.e., 𝔼μΨ(Y;mμ0,σμ0)1\mathbb{E}_{\mu}\Psi(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\to 1, if and only if

    ({mμmμ0σμ0})Δ𝒜α1(1){±}.\displaystyle\mathcal{L}\bigg{(}\bigg{\{}\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\bigg{\}}\bigg{)}\subset\Delta_{\mathcal{A}_{\alpha}}^{-1}(1)\subset\{\pm\infty\}. (3.7)
Remark 3.3.

The validity of the normal approximation in Theorem 3.2-(2) is imposed to express the exact power behavior (3.6) with the normal quantile. More generally, as long as the normalized LRS (T(Y)mμ0)/σμ0\big{(}T(Y)-m_{\mu_{0}}\big{)}/\sigma_{\mu_{0}} has a distributional limit under H0H_{0}, (3.6) can be obtained accordingly with the corresponding quantiles.

Remark 3.4.

We now comment on conditions (3.5) and (3.7) in detail.

  1. (1)

    Condition (3.5) centers around the key deviation quantity

    ΔTμ,μ0(ξ)T(μ+ξ)T(μ0+ξ),\displaystyle\Delta T_{\mu,\mu_{0}}(\xi)\equiv T(\mu+\xi)-T(\mu_{0}+\xi), (3.8)

    which can be shown to satisfy

    𝔼(ΔTμ,μ0)=mμmμ0,Var(ΔTμ,μ0)μμ02.\displaystyle\mathbb{E}(\Delta T_{\mu,\mu_{0}})=m_{\mu}-m_{\mu_{0}},\qquad\operatorname{Var}(\Delta T_{\mu,\mu_{0}})\leq\|\mu-\mu_{0}\|^{2}.

    Moreover, it can be shown that ΔTμ,μ0\Delta T_{\mu,\mu_{0}} concentrates around its mean mμmμ0m_{\mu}-m_{\mu_{0}} with sub-Gaussian tails (see Proposition 5.3). This concentration result allows us to connect the normal approximation under the null in Theorem 3.1 to the power behavior of the LRT under the alternative.

  2. (2)

    The condition (3.5) cannot be removed in general for the validity of the power characterization (3.6). In fact, in the small separation regime μμ0σμ0\lVert\mu-\mu_{0}\rVert\ll\sigma_{\mu_{0}}, (3.5) is automatically fulfilled; in the large separation regime where μμ0σμ0\lVert\mu-\mu_{0}\rVert\gg\sigma_{\mu_{0}}, (3.5) can typically be verified by establishing a quadratic lower bound |mμmμ0|μμ02\lvert m_{\mu}-m_{\mu_{0}}\rvert\gtrsim\lVert\mu-\mu_{0}\rVert^{2}. In this sense (3.5) excludes possibly ill-behaved alternatives that violate the prescribed quadratic lower bound in the critical regime μμ0σμ0\lVert\mu-\mu_{0}\rVert\asymp\sigma_{\mu_{0}}. Such ill-behaved alternatives do exist; see e.g., Example 4.4 ahead for more details.

  3. (3)

    To verify (3.7), some problem-specific understanding of mμm_{\mu} and σμ0\sigma_{\mu_{0}} is needed. As 𝔼μξ,μ^K=𝔼μdivμ^K\mathbb{E}_{\mu}\left\langle\xi,\widehat{\mu}_{K}\right\rangle=\mathbb{E}_{\mu}\operatorname{div}\widehat{\mu}_{K} by Stein’s identity, we have

    mμ=μμ02+2𝔼μdivμ^K𝔼μμ^Kμ2,\displaystyle m_{\mu}=\|\mu-\mu_{0}\|^{2}+2\mathbb{E}_{\mu}\operatorname{div}\widehat{\mu}_{K}-\mathbb{E}_{\mu}\lVert\widehat{\mu}_{K}-\mu\rVert^{2}, (3.9)

    hence the numerator of (3.7) requires sharp estimates of the expected ‘degrees of freedom’ 𝔼μdivμ^K\mathbb{E}_{\mu}\operatorname{div}\widehat{\mu}_{K} (cf. [MW00]), and the estimation error 𝔼μμ^Kμ2\mathbb{E}_{\mu}\lVert\widehat{\mu}_{K}-\mu\rVert^{2}. A (near) matching upper and lower bound for σμ0\sigma_{\mu_{0}} will also be required to obtain necessary and sufficient characterizations. We mention that (3.7) cannot in general be equivalently inverted into a lower bound on μμ0\lVert\mu-\mu_{0}\rVert only; see the remarks after Theorem 3.9 for a more detailed discussion.

Remark 3.5.

The LRT defined in (1.6) depends on the choice of the acceptance region 𝒜α\mathcal{A}_{\alpha}. Two obvious choices are:

  1. (1)

    (One-sided LRT). Let 𝒜α𝒜αos=(,zα]\mathcal{A}_{\alpha}\equiv\mathcal{A}^{\textrm{os}}_{\alpha}=(-\infty,z_{\alpha}]. This leads to the following one-sided LRT:

    Ψos(Y)Ψos(Y;mμ0,σμ0)𝟏(T(Y)mμ0σμ0>zα).\displaystyle\Psi_{\textrm{os}}(Y)\equiv\Psi_{\textrm{os}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\equiv\bm{1}\bigg{(}\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}>z_{\alpha}\bigg{)}. (3.10)
  2. (2)

    (Two-sided LRT). Let 𝒜α𝒜αts=[zα/2,zα/2]\mathcal{A}_{\alpha}\equiv\mathcal{A}^{\textrm{ts}}_{\alpha}=[-z_{\alpha/2},z_{\alpha/2}]. This leads to the following two-sided LRT:

    Ψts(Y)Ψts(Y;mμ0,σμ0)𝟏(|T(Y)mμ0σμ0|>zα/2).\displaystyle\Psi_{\textrm{ts}}(Y)\equiv\Psi_{\textrm{ts}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\equiv\bm{1}\bigg{(}\bigg{\lvert}\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\bigg{\rvert}>z_{\alpha/2}\bigg{)}. (3.11)

In the classical case where KK is a subspace of fixed dimension, the one-sided LRT is power consistent (under μK\mu\in K) if and only if the two-sided LRT is power consistent, so one can simply use the standard one-sided LRT. The situation can be rather different for certain high dimensional instances of KK. Under the setting of Theorem 3.2-(2), as Δ𝒜αos1(1)={+}\Delta_{\mathcal{A}^{\textrm{os}}_{\alpha}}^{-1}(1)=\{+\infty\} while Δ𝒜αts1(1)={±}\Delta_{\mathcal{A}^{\textrm{ts}}_{\alpha}}^{-1}(1)=\{\pm\infty\}, power consistency under μ\mu for the one-sided LRT implies that for the two-sided LRT, but the converse fails when the -\infty limit in (3.7) is achieved. See Example 4.5 ahead for a concrete example. However, in the special case where μ0=0\mu_{0}=0 and KK is a closed convex cone, (mμmμ0)/σμ0(m_{\mu}-m_{\mu_{0}})/\sigma_{\mu_{0}} can only diverge to ++\infty under a mild growth condition on KK, so in this case power consistency is equivalent for one- and two-sided LRTs. Also see Remark 3.10-(1).

As a simple toy example of Theorem 3.2, we consider the testing problem (1.2) in the linear regression case, where KKX{Xθ:θp}K\equiv K_{X}\equiv\{X\theta:\theta\in\mathbb{R}^{p}\} for some fixed design matrix Xn×pX\in\mathbb{R}^{n\times p}, with pnp\leq n. We will be interested in the high dimensional regime rank(X)\operatorname{rank}(X)\to\infty where the normal approximation for the LRT holds under the null.

Proposition 3.6.

Consider testing (1.2) with K=KXK=K_{X}. Suppose that rank(X)\operatorname{rank}(X)\to\infty. Let Ψ{Ψos,Ψts}\Psi\in\{\Psi_{\mathrm{os}},\Psi_{\mathrm{ts}}\}.

  1. (1)

    If μ0KX\mu_{0}\in K_{X}, then

    dTV(T(Y)mμ0σμ0,𝒩(0,1))\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}},\mathcal{N}(0,1)\bigg{)} 8rank(X).\displaystyle\leq\frac{8}{\sqrt{\operatorname{rank}(X)}}. (3.12)

    Consequently the LRT is asymptotically size α\alpha with 𝔼μ0Ψ(Y)=α+𝒪(1/rank(X))\mathbb{E}_{\mu_{0}}\Psi(Y)=\alpha+\mathcal{O}(1/\sqrt{\operatorname{rank}(X)}).

  2. (2)

    For any μKX\mu\in K_{X}, mμmμ0=μμ02m_{\mu}-m_{\mu_{0}}=\|\mu-\mu_{0}\|^{2}, and the LRT is power consistent under μ\mu, i.e., 𝔼μΨ(Y)1\mathbb{E}_{\mu}\Psi(Y)\to 1, if and only if μμ0(rank(X))1/4\lVert\mu-\mu_{0}\rVert\gg(\operatorname{rank}(X))^{1/4}.

Proof.

(1). Note that μ^KX=ΠKX(Y)=X(XX)XYPY\widehat{\mu}_{K_{X}}=\Pi_{K_{X}}(Y)=X(X^{\top}X)^{-}X^{\top}Y\equiv PY, where AA^{-} denotes the pseudo-inverse for AA. Then 𝔼μ0μ^KX=Pμ0=μ0\mathbb{E}_{\mu_{0}}\widehat{\mu}_{K_{X}}=P\mu_{0}=\mu_{0}, Jμ^KX=PJ_{\widehat{\mu}_{K_{X}}}=P and

𝔼μ0μ^KXμ02=𝔼μ0Jμ^KXF2\displaystyle\mathbb{E}_{\mu_{0}}\lVert\widehat{\mu}_{K_{X}}-\mu_{0}\rVert^{2}=\lVert\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K_{X}}}\rVert_{F}^{2} =tr(PP)=dim(KX)=rank(X).\displaystyle=\operatorname{tr}(PP^{\top})=\dim(K_{X})=\operatorname{rank}(X).

The claim (1) now follows from Theorem 3.1.

(2). By (3.9), for any μKX\mu\in K_{X},

mμ\displaystyle m_{\mu} =μμ02+𝔼[2ξ,PξPξ2]\displaystyle=\|\mu-\mu_{0}\|^{2}+\mathbb{E}\big{[}2\left\langle\xi,P\xi\right\rangle-\lVert P\xi\rVert^{2}\big{]}
=μμ02+𝔼[Pξ2]=μμ02+rank(X),\displaystyle=\|\mu-\mu_{0}\|^{2}+\mathbb{E}[\lVert P\xi\rVert^{2}]=\|\mu-\mu_{0}\|^{2}+\operatorname{rank}(X),

and with μ0KX\mu_{0}\in K_{X},

σμ02=Var(Pξ2)=rank(X).\displaystyle\sigma_{\mu_{0}}^{2}=\operatorname{Var}\big{(}\lVert P\xi\rVert^{2}\big{)}=\operatorname{rank}(X).

As μμ0μμ02rank(X)\lVert\mu-\mu_{0}\rVert\ll\lVert\mu-\mu_{0}\rVert^{2}\vee\sqrt{\operatorname{rank}(X)} always holds, the claim follows from Theorem 3.2-(2). ∎
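The chi-squared behavior underlying the proof above is easy to verify by simulation. The sketch below (illustrative dimensions, not from the paper) checks that T(Y)=Pξ2χrank(X)2T(Y)=\lVert P\xi\rVert^{2}\sim\chi^{2}_{\operatorname{rank}(X)} under the null, so that its mean is rank(X)\operatorname{rank}(X) and its variance is 2rank(X)2\operatorname{rank}(X):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 50, 12, 100_000

X = rng.standard_normal((n, p))
P = X @ np.linalg.pinv(X)            # projection matrix onto K_X = col(X)
r = np.linalg.matrix_rank(X)         # = p almost surely here

mu0 = X @ rng.standard_normal(p)     # an arbitrary null mean in K_X
xi = rng.standard_normal((reps, n))
Y = mu0 + xi
# T(Y) = ||Y - mu0||^2 - ||Y - Pi_{K_X}(Y)||^2 = ||P xi||^2 ~ chi^2_{rank(X)}
T = (xi**2).sum(axis=1) - ((Y - Y @ P.T) ** 2).sum(axis=1)

assert abs(float(T.mean()) - r) < 0.15        # m_{mu_0} = rank(X)
assert abs(float(T.var()) - 2 * r) < 1.0      # sigma_{mu_0}^2 = 2 rank(X)
```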

More examples on testing in orthant/circular cone, isotonic regression and Lasso are worked out in Section 4.

3.2. Subspace versus closed convex cone

In this subsection, we study in detail the testing problem (1.10) as an important special case of (1.2). The additional subspace and cone structure will allow us to give more explicit characterizations of the size and the power of the LRT; note that here the LRS T(Y)T(Y) takes the modified form (1.3). We start with the following simple observation.

Lemma 3.7.

Let KK be a closed convex set in n\mathbb{R}^{n}. Then for μ\mu such that KμKK-\mu\subset K, we have

ΠK(μ+ξ)=μ+ΠK(ξ),ξn.\displaystyle\Pi_{K}(\mu+\xi)=\mu+\Pi_{K}(\xi),\quad\forall\xi\in\mathbb{R}^{n}.

Consequently,

μ+ξΠK(μ+ξ)2=ξΠK(ξ)2.\displaystyle\lVert\mu+\xi-\Pi_{K}(\mu+\xi)\rVert^{2}=\lVert\xi-\Pi_{K}(\xi)\rVert^{2}.
Proof.

By the definition of projection, we want to verify

μ+ξ(μ+ΠK(ξ)),ν(μ+ΠK(ξ))0,νK.\displaystyle\left\langle\mu+\xi-(\mu+\Pi_{K}(\xi)),\nu-(\mu+\Pi_{K}(\xi))\right\rangle\leq 0,\quad\forall\nu\;\in K.

This amounts to verifying that

ξΠK(ξ),(νμ)ΠK(ξ)0,νK.\displaystyle\left\langle\xi-\Pi_{K}(\xi),(\nu-\mu)-\Pi_{K}(\xi)\right\rangle\leq 0,\quad\forall\nu\in K.

As νμK\nu-\mu\in K by the condition KμKK-\mu\subset K, the above inequality holds by the projection property for ΠK(ξ)\Pi_{K}(\xi). ∎
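A minimal numerical check of Lemma 3.7, using the half-space K={vn:v10}K=\{v\in\mathbb{R}^{n}:v_{1}\geq 0\}, for which KμKK-\mu\subset K whenever μ1=0\mu_{1}=0 (the cone and the dimension are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5

def proj_halfspace(v):
    """Metric projection onto K = {v in R^n : v_1 >= 0}."""
    w = np.array(v, dtype=float)
    w[0] = max(w[0], 0.0)
    return w

# Any mu with mu_1 = 0 satisfies K - mu = K for this half-space.
mu = np.concatenate([[0.0], rng.standard_normal(n - 1)])

for _ in range(1000):
    xi = rng.standard_normal(n)
    lhs = proj_halfspace(mu + xi)
    assert np.allclose(lhs, mu + proj_halfspace(xi))   # Pi_K(mu+xi) = mu + Pi_K(xi)
    assert np.isclose(((mu + xi - lhs) ** 2).sum(),
                      ((xi - proj_halfspace(xi)) ** 2).sum())
```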

Recall the definition of the statistical dimension δK\delta_{K} in Definition 2.2. The above lemma provides us with simplifications of mμm_{\mu} and σμ2\sigma^{2}_{\mu} as defined in (1.5): under the setting of (1.10), for any μK0\mu\in K_{0},

mμm0=δKδK0,σμ2σ02=Var(ΠK(ξ)2ΠK0(ξ)2).\displaystyle m_{\mu}\equiv m_{0}=\delta_{K}-\delta_{K_{0}},\quad\;\;\sigma_{\mu}^{2}\equiv\sigma_{0}^{2}=\operatorname{Var}\big{(}\lVert\Pi_{K}(\xi)\rVert^{2}-\lVert\Pi_{K_{0}}(\xi)\rVert^{2}\big{)}. (3.13)

Moreover, as K0K_{0} is a subspace, we have δK0=dim(K0)\delta_{K_{0}}=\dim(K_{0}). The following result (proved in Section 5.3) derives the normal approximation of T(Y)T(Y) with an explicit error bound.

Theorem 3.8.

Suppose K0KnK_{0}\subset K\subset\mathbb{R}^{n} are such that K0K_{0} is a subspace and KK is a closed convex cone. Then for μK0\mu\in K_{0},

dTV(T(Y)m0σ0,𝒩(0,1))\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T(Y)-m_{0}}{\sigma_{0}},\mathcal{N}(0,1)\bigg{)} 8δKδK0.\displaystyle\leq\frac{8}{\sqrt{\delta_{K}-\delta_{K_{0}}}}.

It is easy to see from the above bound that under the growth condition δKδK0\delta_{K}-\delta_{K_{0}}\to\infty, normal approximation of T(Y)T(Y) holds under the null. This growth condition cannot be improved in general: for a subspace KK, T(Y)T(Y) follows a chi-squared distribution with δKδK0\delta_{K}-\delta_{K_{0}} degrees of freedom under the null, so normal approximation holds if and only if δKδK0\delta_{K}-\delta_{K_{0}}\to\infty. The above theorem extends [GNP17, Theorem 2.1], in which the case K0={0}K_{0}=\{0\} is treated. Compared to classical results on the chi-bar squared distribution [Dyk91, Corollary 2.2], the growth condition here does not require exact knowledge of the mixing weights, and can be easily checked using Gaussian process techniques; see Section 4.5 for examples.

Using Theorem 3.8, we can prove sharp size and power behavior of the LRT; see Theorem 3.9 below (proved in Section 5.4). For p1p\geq 1, let

ΓK,p(ν)𝔼ΠK(ν+ξ)p𝔼ΠK(ξ)p,νn.\displaystyle\Gamma_{K,p}(\nu)\equiv\mathbb{E}\lVert\Pi_{K}\big{(}\nu+\xi\big{)}\rVert^{p}-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{p},\quad\nu\in\mathbb{R}^{n}.

We simply shorthand ΓK,1\Gamma_{K,1} as ΓK\Gamma_{K} for notational convenience. Recall the definition of VKV_{K} in Definition 2.3 and that of the polar cone KK^{*} in (2.2).

Theorem 3.9.

Consider testing (1.10) using the LRT Ψ(Y;m0,σ0)\Psi(Y;m_{0},\sigma_{0}) with the modified LRS T(Y)T(Y) in (1.3). There exist constants C𝒜α,C𝒜α>0C_{\mathcal{A}_{\alpha}},C_{\mathcal{A}_{\alpha}}^{\prime}>0 such that

|𝔼μΨ(Y;m0,σ0)Δ𝒜α(ΓK,2(μΠK0(μ))σ0)|\displaystyle\bigg{\lvert}\mathbb{E}_{\mu}\Psi(Y;m_{0},\sigma_{0})-\Delta_{\mathcal{A}_{\alpha}}\bigg{(}\frac{\Gamma_{K,2}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}}{\sigma_{0}}\bigg{)}\bigg{\rvert}
2err0+C𝒜α(1μΠK0(μ)|ΓK,2(μΠK0(μ))|σ0)\displaystyle\quad\qquad\leq 2\cdot\mathrm{err}_{0}+C_{\mathcal{A}_{\alpha}}\cdot\mathscr{L}\bigg{(}1\bigwedge\frac{\lVert\mu-\Pi_{K_{0}}(\mu)\rVert}{\big{\lvert}\Gamma_{K,2}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}\big{\rvert}\vee\sigma_{0}}\bigg{)} (3.14)
C𝒜α((δKδK0)1/4).\displaystyle\quad\qquad\leq C_{\mathcal{A}_{\alpha}}^{\prime}\cdot\mathscr{L}\Big{(}\big{(}\delta_{K}-\delta_{K_{0}}\big{)}^{-1/4}\Big{)}. (3.15)

Here err0,()\mathrm{err}_{0},\mathscr{L}(\cdot) are defined in Theorem 3.2. Consequently:

  1. (1)

    For μK0\mu\in K_{0}, the LRT has size 𝔼0Ψ(Y;m0,σ0)\mathbb{E}_{0}\Psi(Y;m_{0},\sigma_{0}), where

    |𝔼0Ψ(Y;m0,σ0)α|16δKδK0.\displaystyle\big{\lvert}\mathbb{E}_{0}\Psi(Y;m_{0},\sigma_{0})-\alpha\big{\rvert}\leq\frac{16}{\sqrt{\delta_{K}-\delta_{K_{0}}}}.
  2. (2)

    Suppose further δKδK0\delta_{K}-\delta_{K_{0}}\to\infty. Then for μK\mu\in K,

    ({ΓK,2(μΠK0(μ))σ0})Δ𝒜α1(β)[0,+]\displaystyle\mathcal{L}\bigg{(}\bigg{\{}\frac{\Gamma_{K,2}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}}{\sigma_{0}}\bigg{\}}\bigg{)}\subset\Delta_{\mathcal{A}_{\alpha}}^{-1}(\beta)\cap[0,+\infty]
    \displaystyle\Leftrightarrow\quad ({2ΓK(μΠK0(μ))2+r(K,K0)1δK0/δK})Δ𝒜α1(β)[0,+]\displaystyle\mathcal{L}\bigg{(}\bigg{\{}\frac{2\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}}{\sqrt{2+r(K,K_{0})}\sqrt{1-\delta_{K_{0}}/\delta_{K}}}\bigg{\}}\bigg{)}\subset\Delta_{\mathcal{A}_{\alpha}}^{-1}(\beta)\cap[0,+\infty]
    \displaystyle\Leftrightarrow\quad 𝔼μΨ(Y;m0,σ0)β[0,1],\displaystyle\mathbb{E}_{\mu}\Psi(Y;m_{0},\sigma_{0})\to\beta\in[0,1], (3.16)

    where r(K,K0)Var(VKK0)/δKK0[0,2]r(K,K_{0})\equiv\operatorname{Var}(V_{K\cap K_{0}^{*}})/\delta_{K\cap K_{0}^{*}}\in[0,2]. Hence the LRT is power consistent under μ\mu, i.e., 𝔼μΨ(Y;m0,σ0)1\mathbb{E}_{\mu}\Psi(Y;m_{0},\sigma_{0})\to 1, if and only if

    ΓK,2(μΠK0(μ))(δKδK0)1/2+ΓK(μΠK0(μ))1δK0/δK+.\displaystyle\frac{\Gamma_{K,2}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}}{\big{(}\delta_{K}-\delta_{K_{0}}\big{)}^{1/2}}\to+\infty\quad\Leftrightarrow\quad\frac{\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}}{\sqrt{1-\delta_{K_{0}}/\delta_{K}}}\to+\infty. (3.17)
Remark 3.10.
  1. (1)

    By the proof of [WWG19, Lemma E.1], ΓK,2(ν)ν20\Gamma_{K,2}(\nu)\geq\lVert\nu\rVert^{2}\geq 0 for all νK\nu\in K, so all the limit points in (2) are nonnegative. This leads to the equivalence of the power consistency property for the one-sided LRT (3.10) and the two-sided LRT (3.11).

  2. (2)

    With the help of Lemma 3.7 and (3.13), which holds for any μK0\mu\in K_{0}, some calculations yield that

    mμm0=ΓK,2(μΠK0(μ))μΠK0(μ)2.\displaystyle m_{\mu}-m_{0}=\Gamma_{K,2}(\mu-\Pi_{K_{0}}(\mu))\geq\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}. (3.18)

    Therefore, the counterpart of the generic condition (3.5) under (1.10)

    μΠK0(μ)|mμm0|σ0\displaystyle\lVert\mu-\Pi_{K_{0}}(\mu)\rVert\ll\lvert m_{\mu}-m_{0}\rvert\vee\sigma_{0}

    is automatically satisfied due to the global quadratic lower bound (3.18). In particular, (3.15) vanishes under the growth condition δKδK0\delta_{K}-\delta_{K_{0}}\to\infty.

The power behavior of the LRT is characterized using ΓK,2\Gamma_{K,2} and ΓK\Gamma_{K} in Theorem 3.9. The function ΓK,2\Gamma_{K,2} is usually more amenable to explicit calculations in concrete examples, while the formulation using ΓK\Gamma_{K} allows us to recover the separation rate in \lVert\cdot\rVert for the LRT derived in [WWG19] in the setting (1.10). We formally state this result below; see Section 5.5 for a proof.

Corollary 3.11.

For Ψ{Ψos,Ψts}\Psi\in\{\Psi_{\mathrm{os}},\Psi_{\mathrm{ts}}\}, (3.17) is satisfied for any μK\mu\in K such that

μΠK0(μ)δK1/4(δK1/20infηKB(1)η,𝔼ΠK(ξ)).\displaystyle\lVert\mu-\Pi_{K_{0}}(\mu)\rVert\gg\delta_{K}^{1/4}\bigwedge\bigg{(}\frac{\delta_{K}^{1/2}}{0\bigvee\inf_{\eta\in K\cap B(1)}\left\langle\eta,\mathbb{E}\Pi_{K}(\xi)\right\rangle}\bigg{)}. (3.19)

Below we give a detailed comparison of (3.17) and its sufficient condition (3.19) due to [WWG19]:

  • (Optimality) By [WWG19], condition (3.19) cannot be further improved in the worst case in the sense that for every fixed pair (K0,K)(K_{0},K), there exists some μK\mu\in K violating (3.19) that invalidates (3.17). Furthermore, the same work also shows that the uniform \|\cdot\|-separation rate in (3.19) is minimax optimal in many cone testing problems.

  • (Non-uniform power) On the other hand, it is important to mention that (3.17) is not equivalent to (3.19). In fact, as we will see in the example of testing 0 versus the orthant cone K+K_{+} and the product circular cone K×,αK_{\times,\alpha} (to be detailed in Corollary 4.2 and Theorem 4.6), the worst case condition (3.19) in terms of a separation in \lVert\cdot\rVert is too conservative: condition (3.17) allows natural configurations of μ{K+,K×,α}\mu\in\{K_{+},K_{\times,\alpha}\} whose separation rate in \lVert\cdot\rVert can be nδn^{\delta} for any δ(0,1/4)\delta\in(0,1/4), while (3.19) necessarily requires a separation rate in \lVert\cdot\rVert of order at least n1/4n^{1/4}. Therefore, although (3.19) gives the best possible inversion of (3.17) in terms of uniform separation in \lVert\cdot\rVert, condition (3.17) can be much weaker than (3.19), and characterizes the non-uniform power behavior of the LRT.

To give a better sense of the results in Theorem 3.9, we consider a toy example where KK is also a subspace.

Proposition 3.12.

Let Ψ{Ψos,Ψts}\Psi\in\{\Psi_{\mathrm{os}},\Psi_{\mathrm{ts}}\}. Suppose δKδK0\delta_{K}-\delta_{K_{0}}\to\infty.

  1. (1)

    If μK0\mu\in K_{0}, the LRT is asymptotically size α\alpha with 𝔼μΨ(Y;m0,σ0)=α+𝒪((δKδK0)1/2)\mathbb{E}_{\mu}\Psi(Y;m_{0},\sigma_{0})=\alpha+\mathcal{O}\big{(}\big{(}\delta_{K}-\delta_{K_{0}}\big{)}^{-1/2}\big{)}.

  2. (2)

    For μK\mu\in K, the LRT is power consistent under μ\mu, i.e., 𝔼μΨ(Y;m0,σ0)1\mathbb{E}_{\mu}\Psi(Y;m_{0},\sigma_{0})\to 1, if and only if μΠK0(μ)(δKδK0)1/4\lVert\mu-\Pi_{K_{0}}(\mu)\rVert\gg\big{(}\delta_{K}-\delta_{K_{0}}\big{)}^{1/4}.

Proof.

(1) is a direct consequence of Theorem 3.9-(1). (2) follows from Theorem 3.9-(2) upon noting that

ΓK,2(μΠK0(μ))=𝔼ΠK(μΠK0(μ)+ξ)2𝔼ΠK(ξ)2=μΠK0(μ)2,\displaystyle\Gamma_{K,2}(\mu-\Pi_{K_{0}}(\mu))=\mathbb{E}\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert^{2}-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}=\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2},
σ02=Var(ΠKK0(ξ)2)=2δKK0=2(δKδK0).\displaystyle\sigma_{0}^{2}=\operatorname{Var}\big{(}\lVert\Pi_{K\cap K_{0}^{\ast}}(\xi)\rVert^{2}\big{)}=2\delta_{K\cap K_{0}^{\ast}}=2(\delta_{K}-\delta_{K_{0}}).

The second line of the above display uses Lemma 2.4-(3). ∎

More examples on testing parametric assumptions versus shape-constrained alternatives will be detailed in Section 4.

4. Examples

This section is organized as follows. Sections 4.1-4.4 study the generic testing problem (1.2) in the context of the orthant cone, the circular cone, isotonic regression, and the Lasso, respectively. Section 4.5 specializes the subspace versus cone testing problem (1.10) to the setting of testing parametric assumptions versus shape-constrained alternatives. For simplicity of presentation, we will focus on the two-sided LRT (3.11), and simply call it the LRT unless otherwise specified.

4.1. Testing in orthant cone

Consider the orthant cone

K+{ν=(ν1,,νn)n:νi0,i[1:n]}.K_{+}\equiv\left\{\nu=(\nu_{1},\ldots,\nu_{n})\in\mathbb{R}^{n}:\nu_{i}\geq 0,i\in[1:n]\right\}.

We are interested in the testing problem (1.2) with K=K+K=K_{+}. Testing in the orthant cone has previously been studied by [Kud63, RLN86, WWG19]. The following result (see Section 6.1 for a proof) gives the limiting distribution of the LRS and characterizes the power behavior of the LRT in this example.

Theorem 4.1.
  1. (1)

    There exists a universal constant C>0C>0 such that for μ0K+\mu_{0}\in K_{+},

    dTV(T(Y)mμ0σμ0,𝒩(0,1))\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}},\mathcal{N}(0,1)\bigg{)} Cn.\displaystyle\leq\frac{C}{\sqrt{n}}.

    Consequently the LRT is asymptotically size α\alpha with 𝔼μ0Ψts(Y;mμ0,σμ0)=α+𝒪(n1/2)\mathbb{E}_{\mu_{0}}\Psi_{\mathrm{ts}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})=\alpha+\mathcal{O}(n^{-1/2}).

  2. (2)

    For any μK+\mu\in K_{+}, the LRT is power consistent under μ\mu, i.e., 𝔼μΨts(Y;mμ0,σμ0)1\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\to 1, if and only if

    |i=1n{S¯+(μi)S¯+((μ0)i)}+μμ02|n1/2.\displaystyle\bigg{\lvert}\sum_{i=1}^{n}\big{\{}\bar{S}_{+}(\mu_{i})-\bar{S}_{+}((\mu_{0})_{i})\big{\}}+\lVert\mu-\mu_{0}\rVert^{2}\bigg{\rvert}\gg n^{1/2}.

    Here, S¯+\bar{S}_{+} is an increasing, concave, and bounded function on [0,)[0,\infty) with S¯+(0)=0\bar{S}_{+}(0)=0, defined as

    S¯+(x)Φ(x)+xφ(x)x2(1Φ(x))12,x0.\displaystyle\bar{S}_{+}(x)\equiv\Phi(x)+x\varphi(x)-x^{2}(1-\Phi(x))-\frac{1}{2},\quad x\geq 0. (4.1)
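The claimed properties of S¯+\bar{S}_{+} in (4.1) are easy to confirm numerically; a short sketch using only the Python standard library (the grid on [0,5][0,5] is an illustrative choice):

```python
from math import exp, pi, sqrt
from statistics import NormalDist

Phi = NormalDist().cdf
phi = lambda x: exp(-x * x / 2) / sqrt(2 * pi)

def S_bar(x):
    """The function S_bar_+ from (4.1)."""
    return Phi(x) + x * phi(x) - x**2 * (1 - Phi(x)) - 0.5

xs = [0.05 * k for k in range(101)]                  # grid on [0, 5]
vals = [S_bar(x) for x in xs]
assert vals[0] == 0.0                                # S_bar_+(0) = 0
assert all(b > a for a, b in zip(vals, vals[1:]))    # increasing
diffs = [b - a for a, b in zip(vals, vals[1:])]
assert all(d2 < d1 for d1, d2 in zip(diffs, diffs[1:]))   # concave
assert all(v < 0.5 for v in vals)                    # bounded (limit is 1/2)
```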

Let us further investigate the special case μ0=0\mu_{0}=0 to illustrate the non-uniform power behavior of the LRT mentioned after Theorem 3.9. In other words, we consider testing μ=0\mu=0 versus the orthant cone K+K_{+}. Let

S+(x)S¯+(x)+x2=Φ(x)+xφ(x)+x2Φ(x)12,x0.\displaystyle S_{+}(x)\equiv\bar{S}_{+}(x)+x^{2}=\Phi(x)+x\varphi(x)+x^{2}\Phi(x)-\frac{1}{2},\quad x\geq 0.

As S+(x)=2[φ(x)+xΦ(x)]S_{+}^{\prime}(x)=2\big{[}\varphi(x)+x\Phi(x)\big{]}, S+(0)=2φ(0)>0S_{+}^{\prime}(0)=2\varphi(0)>0, and S+′′(x)=2Φ(x)0S_{+}^{\prime\prime}(x)=2\Phi(x)\geq 0, S+S_{+} is a strictly increasing and convex function on [0,)[0,\infty) with S+(0)=0S_{+}(0)=0. Furthermore, it can be verified via direct calculation that uniformly over x0x\geq 0, S+(x)xx2S_{+}(x)\asymp x\vee x^{2}. Theorem 4.1 immediately yields the following corollary.

Corollary 4.2.
  1. (1)

    For μ=0\mu=0, the LRT is asymptotically size α\alpha with 𝔼0Ψts(Y;m0,σ0)=α+𝒪(n1/2)\mathbb{E}_{0}\Psi_{\mathrm{ts}}(Y;{m}_{0},{\sigma}_{0})=\alpha+\mathcal{O}(n^{-1/2}).

  2. (2)

    For μK+\mu\in K_{+}, the LRT is power consistent under μ\mu, i.e., 𝔼μΨts(Y;m0,σ0)1\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;{m}_{0},{\sigma}_{0})\to 1, if and only if μ1μ2n1/2\lVert\mu\rVert_{1}\vee\lVert\mu\rVert^{2}\gg n^{1/2}.

The results in [WWG19, Section 3.1.5], or equivalently condition (3.19), show that the type II error of an optimally calibrated LRT vanishes uniformly for μK+\mu\in K_{+} such that μn1/4\lVert\mu\rVert\gg n^{1/4}. Our results above indicate that, for the orthant cone K+K_{+}, the regime where the LRT has asymptotic power 1 is actually characterized by the condition μ1μ2n1/2\|\mu\|_{1}\vee\|\mu\|^{2}\gg n^{1/2}, and is hence non-uniform with respect to \lVert\cdot\rVert. We give two concrete examples below.

Example 4.3.

Let q(0,1/2)q\in(0,1/2) and τ1,τ2>0\tau_{1},\tau_{2}>0 be two fixed positive constants. Consider the following alternatives: (1) μ=(τ1nq)𝟏nK+\mu=(\tau_{1}n^{-q})\bm{1}_{n}\in K_{+}, and (2) μ=(τ2iq)i=1nK+\mu=(\tau_{2}i^{-q})_{i=1}^{n}\in K_{+}. In both cases, μ1n1q\lVert\mu\rVert_{1}\asymp n^{1-q} and μ2n12q\lVert\mu\rVert^{2}\asymp n^{1-2q}. The above corollary then yields that the LRT is power consistent under μ\mu if and only if q(0,1/2)q\in(0,1/2), while the characterization of [WWG19] guarantees power consistency of the LRT only for q(0,1/4)q\in(0,1/4). In particular, as q1/2q\rightarrow 1/2, the LRT is power consistent for certain alternatives μ\mu with μnδ\lVert\mu\rVert\asymp n^{\delta} for arbitrarily small δ>0\delta>0. See Section 4.1.1 ahead for some simulation evidence.

One may further wonder whether the above examples only highlight ‘exceptional’ alternatives in the regime where the uniform separation in \lVert\cdot\rVert fails to be informative; that is, with Mn{μK+:μ2Cn1/2}M_{n}\equiv\{\mu\in K_{+}:\lVert\mu\rVert^{2}\leq Cn^{1/2}\} for some large enough absolute constant C>0C>0, whether such examples constitute only a small fraction of MnM_{n}. To this end, let An{μMn:μ1μ2Cn1/2}A_{n}\equiv\{\mu\in M_{n}:\lVert\mu\rVert_{1}\vee\lVert\mu\rVert^{2}\geq Cn^{1/2}\} be the region in MnM_{n} on which the LRT is indeed powerful. A standard volumetric calculation shows that vol(An)/vol(Mn)1\operatorname{vol}(A_{n})/\operatorname{vol}(M_{n})\to 1. In other words, as nn\to\infty, the LRT is powerful for ‘most’ alternatives in the region where the uniform separation in \lVert\cdot\rVert is not informative. Hence the non-uniform characterization in Corollary 4.2-(2) is essential for determining whether the LRT is powerful for a given alternative μK+\mu\in K_{+} in the regime μ=𝒪(n1/4)\lVert\mu\rVert=\mathcal{O}(n^{1/4}).
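The volumetric claim can also be checked by direct sampling. The sketch below (illustrative choices C=1C=1, n=200n=200, not from the paper) draws μ\mu uniformly from MnM_{n} and estimates the fraction landing in AnA_{n}:

```python
import numpy as np

rng = np.random.default_rng(5)
C, n, reps = 1.0, 200, 5000

# Draw mu uniformly from M_n = {mu in K_+ : ||mu||^2 <= C n^{1/2}}: a uniform
# direction on the sphere intersected with the orthant is |g|/||g|| for
# Gaussian g, and a uniform point in a ball of radius r has radius r*U^{1/n}.
g = np.abs(rng.standard_normal((reps, n)))
dirs = g / np.linalg.norm(g, axis=1, keepdims=True)
r = (C * np.sqrt(n)) ** 0.5
radii = r * rng.random(reps) ** (1.0 / n)
mu = radii[:, None] * dirs

thresh = C * np.sqrt(n)
in_A = (np.linalg.norm(mu, ord=1, axis=1) >= thresh) | \
       (np.linalg.norm(mu, axis=1) ** 2 >= thresh)
frac = float(in_A.mean())
assert frac > 0.95    # 'most' of M_n already lies in the powerful region A_n
```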

As the separation rate n1/4n^{1/4} in \lVert\cdot\rVert is minimax optimal for testing 0 versus K+K_{+} (cf. [WWG19, Proposition 1]), the discussion above also illustrates the conservative nature of the minimax formulation in this testing problem.

4.1.1. An illustrative simulation study

Figure 1. The power curves for the alternatives μ=(2nq)i=1n\mu=(2n^{-q})_{i=1}^{n} (in the left panel) and μ=(iq)i=1n\mu=(i^{-q})_{i=1}^{n} (in the right panel) as qq varies, for sample sizes n{2,[1:20]}n\in\{2^{\ell},\ell\in[1:20]\}. The plots illustrate that the LRT has power in the range q(0,1/2)q\in(0,1/2) in both the examples.
Figure 2. Fixed q=0.3q=0.3 and n=20000n=20000. The alternatives are μ=(τ1n0.3)\mu=(\tau_{1}n^{-0.3}) in the left panel and μ=(τ2i0.3)\mu=(\tau_{2}i^{-0.3}) in the right panel with τ1,τ2{0.01,0.02,,1}\tau_{1},\tau_{2}\in\{0.01,0.02,\ldots,1\}. The red line denotes the power curve of the LRT, i.e., {𝔼μΨts(Y;m0,σ0):τ1}\{\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0}):\tau_{1}\}, while the blue line denotes the theoretical power curve via normal approximation, i.e., {Δ𝒜α(ΓK,2(μ)/σ0):τ2}\{\Delta_{\mathcal{A}_{\alpha}}\big{(}\Gamma_{K,2}(\mu)/\sigma_{0}\big{)}:\tau_{2}\}.

Below we present simulation results under the two settings considered in Example 4.3. The significance level is taken as α=0.05\alpha=0.05. The power of the LRT in both simulations below is estimated by averaging over 20002000 replications.

  • In Figure 1, we take τ1=2,τ2=1\tau_{1}=2,\tau_{2}=1 and examine the sharpness of the power characterization q(0,1/2)q\in(0,1/2) predicted by Corollary 4.2-(2). Clearly, Figure 1 shows that q(0,1/2)q\in(0,1/2) is the correct range where the LRT is powerful in both the settings of Example 4.3, rather than q(0,1/4)q\in(0,1/4) as predicted by [WWG19].

  • In Figure 2, we fix $q=0.3$ and $n=20000$, and examine the validity of the normal power expansion (3.14) in Theorem 3.9 along the alternatives considered in Example 4.3 with $\tau_{1},\tau_{2}\in\{0.01,0.02,\ldots,1\}$. Formally, we consider two power curves: (i) the power of the LRT, i.e., $\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0})$, and (ii) the theoretical power given by the normal approximation, i.e., $\Delta_{\mathcal{A}_{\alpha}}\big(\Gamma_{K,2}(\mu)/\sigma_{0}\big)$, for alternatives of the form $\mu=(\tau_{1}n^{-0.3})_{i=1}^{n}$ and $\mu=(\tau_{2}i^{-0.3})_{i=1}^{n}$ with the prescribed $\tau_{1},\tau_{2}$'s. Figure 2 clearly shows that the two power curves are very close to each other.
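The Monte Carlo scheme behind these figures can be sketched in a few lines. The snippet below is a minimal reimplementation of the Figure 1 experiment, assuming the setting is the orthant cone $K_{+}=[0,\infty)^{n}$ with $\mu_{0}=0$, where $\Pi_{K_{+}}(Y)=(Y)_{+}$ componentwise, so that $T(Y)=\sum_{i}(Y_{i})_{+}^{2}$ with null mean $m_{0}=n/2$ and null variance $\sigma_{0}^{2}=(5/4)n$; these closed forms are our own derivation for this special case, not quoted from the paper:

```python
import numpy as np
from statistics import NormalDist

def lrt_power(mu, n_rep=2000, alpha=0.05, seed=0):
    """Monte Carlo power of the two-sided LRT for H0: mu = 0 vs H1: mu in K_+ = [0,inf)^n."""
    rng = np.random.default_rng(seed)
    n = mu.size
    m0, s0 = n / 2.0, np.sqrt(1.25 * n)      # null mean/sd of T(Y) = sum_i (Y_i)_+^2
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal cutoff
    Y = mu + rng.standard_normal((n_rep, n))
    T = np.sum(np.clip(Y, 0.0, None) ** 2, axis=1)
    return float(np.mean(np.abs(T - m0) / s0 > z))

n, q = 1000, 0.3
power_alt = lrt_power(np.full(n, 2 * n ** (-q)))  # alternative mu = (2 n^{-q})_{i=1}^n
size_null = lrt_power(np.zeros(n))                # null rejection rate, approx alpha
```

For $q=0.3$ the estimated power is close to one while the null rejection rate stays near $\alpha=0.05$, in line with the left panel of Figure 1.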

4.1.2. Counter-examples

Let μ0=𝟏nK+\mu_{0}=\bm{1}_{n}\in K_{+}, and μ=c𝟏n\mu=c\bm{1}_{n} for some fixed c>0c>0 to be determined. As long as c1c\neq 1, we have μμ02=n(c1)2n\|\mu-\mu_{0}\|^{2}=n(c-1)^{2}\asymp n. We also have σμ2=nVar[(c+ξ11)2(c+ξ1)2]nρ2(c)n\sigma_{\mu}^{2}=n\cdot\operatorname{Var}\big{[}(c+\xi_{1}-1)^{2}-(c+\xi_{1})_{-}^{2}\big{]}\equiv n\rho^{2}(c)\asymp n, and

mμmμ0=μμ02+i=1n(S¯+(c)S¯+(1))=n{(c1)2+S¯+(c)S¯+(1)},\displaystyle m_{\mu}-m_{\mu_{0}}=\|\mu-\mu_{0}\|^{2}+\sum_{i=1}^{n}\big{(}\bar{S}_{+}(c)-\bar{S}_{+}(1)\big{)}=n\big{\{}(c-1)^{2}+\bar{S}_{+}(c)-\bar{S}_{+}(1)\big{\}},

where S¯+\bar{S}_{+} is defined in (4.1). Let F(c)(c1)2+S¯+(c)S¯+(1)F(c)\equiv(c-1)^{2}+\bar{S}_{+}(c)-\bar{S}_{+}(1). Then F(1)=0F(1)=0, F(0)=0.5753F(0)=0.5753..., and F(1)=S¯+(1)=0.1666>0F^{\prime}(1)=\bar{S}_{+}^{\prime}(1)=0.1666...>0.
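The displayed numerical values are easy to reproduce. Assuming (consistently with the values above) that $\bar{S}_{+}(c)=-\mathbb{E}(c+\xi_{1})_{-}^{2}$ up to an additive constant, and using the closed form $\mathbb{E}(c+\xi_{1})_{-}^{2}=(1+c^{2})\Phi(-c)-c\varphi(c)$, the function $F$ and its zero in $(0,1)$ (used in Example 4.4 below) can be evaluated directly; the snippet is our own illustration, since the definition (4.1) of $\bar{S}_{+}$ is not reproduced in this excerpt:

```python
import math

def Phi(x):   # standard normal CDF
    return 0.5 * math.erfc(-x / math.sqrt(2))

def phi(x):   # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def g(c):     # g(c) = E[(c + xi)_-^2] = (1 + c^2) Phi(-c) - c phi(c)
    return (1 + c * c) * Phi(-c) - c * phi(c)

def F(c):     # F(c) = (c - 1)^2 + S_+(c) - S_+(1), with S_+(c) = -g(c) + const
    return (c - 1) ** 2 - g(c) + g(1)

F0 = F(0.0)                                # = 0.5753...
Fp1 = (F(1 + 1e-6) - F(1 - 1e-6)) / 2e-6   # central difference: F'(1) = 0.1666...

lo, hi = 0.5, 0.99                         # F(0.5) > 0 > F(0.99)
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if F(mid) > 0 else (lo, mid)
c0 = (lo + hi) / 2                         # zero of F in (0, 1), approx 0.80
```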

We first present a choice of cc that leads to an example showing the necessity of (3.5) for the power characterization (3.6).

Example 4.4.

By the previous discussion (as $F(0)>0$ while $F(1)=0$ and $F^{\prime}(1)>0$ force $F<0$ just below $1$), $F$ must admit a zero in the open interval $(0,1)$, which we denote by $c_{0}$. With $c=c_{0}$, we then have $m_{\mu}=m_{\mu_{0}}$. Moreover, since $\sigma_{\mu}^{2}=n\rho^{2}(c_{0})\neq n\rho^{2}(1)=\sigma_{\mu_{0}}^{2}$, Theorem 4.1-(1) yields

T(μ+ξ)mμ0σμ0=T(μ+ξ)mμσμσμσμ0d𝒩(0,ρ2(c0)ρ2(1))d𝒩(0,1).\displaystyle\frac{T(\mu+\xi)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}=\frac{T(\mu+\xi)-m_{\mu}}{\sigma_{\mu}}\cdot\frac{\sigma_{\mu}}{\sigma_{\mu_{0}}}\rightarrow_{d}\mathcal{N}\bigg{(}0,\frac{\rho^{2}(c_{0})}{\rho^{2}(1)}\bigg{)}\neq_{d}\mathcal{N}(0,1).

This means (3.6) fails.

Next we present a choice of cc that leads to an example showing the necessity of considering two-sided LRT.

Example 4.5.

By the previous discussion, $F(c)<0$ for $c\in(0,1)$ near $1$. Pick any $c_{1}\in(0,1)$ with $F(c_{1})<0$ and set $\mu=c_{1}\bm{1}_{n}$. As $\sigma_{\mu_{0}}\asymp n^{1/2}$ and $m_{\mu}-m_{\mu_{0}}=nF(c_{1})<0$, we have $(m_{\mu}-m_{\mu_{0}})/\sigma_{\mu_{0}}\asymp-n^{1/2}$, so by Theorem 4.1-(1),

T(μ+ξ)mμ0σμ0=T(μ+ξ)mμσμσμσμ0+mμmμ0σμ0\displaystyle\frac{T(\mu+\xi)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}=\frac{T(\mu+\xi)-m_{\mu}}{\sigma_{\mu}}\cdot\frac{\sigma_{\mu}}{\sigma_{\mu_{0}}}+\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\to-\infty

in probability. This means that the two-sided LRT in (3.11) is powerful under μ=c1𝟏n\mu=c_{1}\bm{1}_{n}, i.e., 𝔼μΨts(Y;mμ0,σμ0)1\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\to 1, but the one-sided LRT in (3.10) is not powerful under μ\mu, i.e., 𝔼μΨos(Y;mμ0,σμ0)0\mathbb{E}_{\mu}\Psi_{\textrm{os}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\to 0.

4.2. Testing in circular cone

For any α(0,π/2)\alpha\in(0,\pi/2), let the α\alpha-circular cone be defined by

Kα{νn1:ν1νcos(α)},K_{\alpha}\equiv\left\{\nu\in\mathbb{R}^{n-1}:\nu_{1}\geq\|\nu\|\cos(\alpha)\right\},

and let $K_{\times,\alpha}\equiv K_{\alpha}\times\mathbb{R}\subset\mathbb{R}^{n}$. Consider the testing problem (1.2) with $\mu_{0}=0$ and $K\in\{K_{\alpha},K_{\times,\alpha}\}$. The circular cone has recently been used for modeling purposes in [Bes06, GGF08]. The following result (see Section 6.2 for a proof) gives the limiting distribution of the LRS and characterizes the power behavior of the LRT in this example.
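For concreteness, the Euclidean projection onto $K_{\alpha}$ admits a well-known closed form: project to the cone's boundary ray in the plane spanned by the axis $e_{1}$ and the radial part of $y$, with the interior and apex (polar cone) cases handled separately. The implementation below is a standard sketch, not taken from the paper:

```python
import numpy as np

def proj_circular(y, alpha):
    """Euclidean projection onto K_alpha = {v : v_1 >= ||v|| cos(alpha)}."""
    r = np.linalg.norm(y[1:])
    norm_y = np.linalg.norm(y)
    if y[0] >= np.cos(alpha) * norm_y:       # y already in the cone
        return y.copy()
    if -y[0] >= np.sin(alpha) * norm_y:      # y in the polar cone: project to the apex
        return np.zeros_like(y)
    # otherwise project onto the boundary ray through (cos a, sin a * y_perp / r)
    t = y[0] * np.cos(alpha) + r * np.sin(alpha)
    out = np.empty_like(y)
    out[0] = t * np.cos(alpha)
    out[1:] = (t * np.sin(alpha) / r) * y[1:]
    return out
```

One can check numerically that the output lies in $K_{\alpha}$, is orthogonal to the residual $y-\Pi_{K_{\alpha}}(y)$, and that the map is idempotent, the defining properties of a cone projection.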

Theorem 4.6.
  (1)

    Let K{Kα,K×,α}K\in\{K_{\alpha},K_{\times,\alpha}\}. There exists some universal constant C>0C>0 such that,

    dTV(T(Y)m0σ0,𝒩(0,1))\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T(Y)-m_{0}}{\sigma_{0}},\mathcal{N}(0,1)\bigg{)} Cn.\displaystyle\leq\frac{C}{\sqrt{n}}.

    Consequently the LRT is asymptotically size α\alpha with 𝔼μ0Ψts(Y;m0,σ0)=α+𝒪(n1/2)\mathbb{E}_{\mu_{0}}\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0})=\alpha+\mathcal{O}(n^{-1/2}).

  (2)
    (a)

      For any μKα\mu\in K_{\alpha}, the LRT is power consistent under μ\mu, i.e., 𝔼μΨts(Y;m0,σ0)1\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0})\to 1, if and only if μ1\lVert\mu\rVert\gg 1.

    (b)

      For any μK×,α\mu\in K_{\times,\alpha}, the LRT is power consistent under μ\mu, i.e., 𝔼μΨts(Y;m0,σ0)1\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0})\to 1, if and only if μ11\|\mu^{1}\|\gg 1 or |μ2|n1/4\lvert\mu^{2}\rvert\gg n^{1/4}.

    Here for any μn\mu\in\mathbb{R}^{n}, μ=(μ1,μ2)n1×\mu=(\mu^{1},\mu^{2})\in\mathbb{R}^{n-1}\times\mathbb{R} with μ1n1\mu^{1}\in\mathbb{R}^{n-1} denoting the first n1n-1 components of μ\mu and μ2\mu^{2}\in\mathbb{R} denoting the last.

Regarding the two cones {Kα,K×,α}\{K_{\alpha},K_{\times,\alpha}\}, [WWG19] showed the following:

  • For KαK_{\alpha}, an optimally calibrated LRT is powerful for μKα\mu\in K_{\alpha} such that μ1\lVert\mu\rVert\gg 1. The minimax \|\cdot\|-separation rate is of the same constant order, so the LRT is minimax optimal.

  • For K×,αK_{\times,\alpha}, an optimally calibrated LRT is powerful for μK×,α\mu\in K_{\times,\alpha} such that μn1/4\lVert\mu\rVert\gg n^{1/4}, while the minimax \|\cdot\|-separation rate is of constant order, so the LRT is strictly minimax sub-optimal.

Theorem 4.6-(2) is rather interesting compared to the above results of [WWG19]:

  • For $K_{\alpha}$, Theorem 4.6-(2)(a) shows that the power behavior of the LRT is uniform with respect to $\lVert\cdot\rVert$: for any $\mu\in K_{\alpha}$ with $\|\mu\|=\mathcal{O}(1)$, the LRT is necessarily not powerful.

  • For $K_{\times,\alpha}$, Theorem 4.6-(2)(b) shows that the only bad alternatives driving the uniform separation rate $n^{1/4}$ in $\lVert\cdot\rVert$ are those $\mu=(\mu^{1},\mu^{2})\in K_{\times,\alpha}$ lying in the narrow cylinder $\lVert\mu^{1}\rVert=\mathcal{O}(1)$, $\lvert\mu^{2}\rvert=\mathcal{O}(n^{1/4})$; the LRT is powerful for points of the form $(\mu^{1},0)$ as soon as $\|\mu^{1}\|\gg 1$. This is in line with Theorem 4.6-(2)(a), and provides another example in which the LRT exhibits non-uniform power behavior with respect to $\lVert\cdot\rVert$.

Similar to the LRT in the orthant cone, one may easily see that the conservative uniform separation rate (i.e., μn1/4\lVert\mu\rVert\gg n^{1/4}) in \lVert\cdot\rVert for K×,αK_{\times,\alpha} fails to detect ‘most’ alternatives where the LRT is powerful, as nn\to\infty. In this sense, the minimax sub-optimality of LRT for testing 0 versus K×,αK_{\times,\alpha} is also conservative as the LRT behaves badly for only a few alternatives with large separation rate in \lVert\cdot\rVert.

The phenomenon observed above for the product circular cone extends easily as follows. For a positive integer $m$ and generic closed convex cones $K_{i}\subset\mathbb{R}^{n_{i}}$, $i=1,\ldots,m$, let $K_{\times}\equiv\times_{i=1}^{m}K_{i}\subset\mathbb{R}^{\sum_{i=1}^{m}n_{i}}$ be the associated product cone. Then the LRT for testing $0$ versus $K_{\times}$ is power consistent under $\mu=(\mu^{i})_{i=1}^{m}\in\times_{i=1}^{m}K_{i}=K_{\times}$ if and only if

i=1mΓKi,2(μi)(i=1mδKi)1/2.\displaystyle\frac{\sum_{i=1}^{m}\Gamma_{K_{i},2}\big{(}\mu^{i}\big{)}}{\big{(}\sum_{i=1}^{m}\delta_{K_{i}}\big{)}^{1/2}}\to\infty.

The proof is largely similar to that of Theorem 4.6-(2)(b), so we omit the details.

4.3. Testing in isotonic regression

Let the monotone cone be defined by

KK,0={ν=(ν1,,νn)n:ν1νn}.K_{\uparrow}\equiv K_{\uparrow,0}=\{\nu=(\nu_{1},\ldots,\nu_{n})\in\mathbb{R}^{n}:\nu_{1}\leq\ldots\leq\nu_{n}\}.

We consider the testing problem (1.2) with K=KK=K_{\uparrow} using the two-sided LRT (as in (3.11)). The following result (see Section 6.3 for a proof) gives the limiting distribution of the LRS and characterizes the power behavior of the LRT in this example.
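Computationally, the isotonic LSE $\widehat{\mu}_{K_{\uparrow}}=\Pi_{K_{\uparrow}}(Y)$ is obtained in linear time by the pool-adjacent-violators algorithm (PAVA), which makes the LRS $T(Y)=\lVert Y-\mu_{0}\rVert^{2}-\lVert Y-\widehat{\mu}_{K_{\uparrow}}\rVert^{2}$ trivial to evaluate; a minimal sketch of the standard algorithm (not specific to this paper):

```python
def pava(y):
    """Isotonic (nondecreasing) least squares fit via pool-adjacent-violators."""
    level, weight = [], []                 # current block means and block sizes
    for v in y:
        level.append(float(v)); weight.append(1)
        while len(level) > 1 and level[-2] > level[-1]:   # violator: pool the blocks
            s = level[-1] * weight[-1] + level[-2] * weight[-2]
            w = weight[-1] + weight[-2]
            level.pop(); weight.pop()
            level[-1], weight[-1] = s / w, w
    fit = []
    for m, w in zip(level, weight):
        fit.extend([m] * w)
    return fit

def lrs(y, mu0):
    """T(Y) = ||Y - mu0||^2 - ||Y - Pi_K(Y)||^2; nonnegative whenever mu0 is monotone."""
    fit = pava(y)
    return (sum((a - b) ** 2 for a, b in zip(y, mu0))
            - sum((a - b) ** 2 for a, b in zip(y, fit)))
```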

Theorem 4.7.
  (1)

    Suppose μ0K\mu_{0}\in K_{\uparrow}, and for a universal constant L>1L>1,

    1Lmin1in1n((μ0)i+1(μ0)i)max1in1n((μ0)i+1(μ0)i)L.\displaystyle\frac{1}{L}\leq\min_{1\leq i\leq n-1}n\big{(}(\mu_{0})_{i+1}-(\mu_{0})_{i}\big{)}\leq\max_{1\leq i\leq n-1}n\big{(}(\mu_{0})_{i+1}-(\mu_{0})_{i}\big{)}\leq L. (4.2)

    Then

    dTV(T(Y)mμ0σμ0,𝒩(0,1))Cn1/6.\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}},\mathcal{N}(0,1)\bigg{)}\leq\frac{C}{n^{1/6}}.

    Here C>0C>0 is a constant depending on LL only. Consequently the LRT is asymptotically size α\alpha with 𝔼μ0Ψts(Y;mμ0,σμ0)=α+𝒪(n1/6)\mathbb{E}_{\mu_{0}}\Psi_{\mathrm{ts}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})=\alpha+\mathcal{O}(n^{-1/6}).

  (2)

    Let $\mu_{0}=\big(f_{0}(i/n)\big)_{i=1}^{n}$ and $\mu=\big(f(i/n)\big)_{i=1}^{n}$, where $f,f_{0}:[0,1]\to\mathbb{R}$ are $C^{2}$ monotone functions related by $f_{0}=f+\rho_{n}\delta$ for some $C^{1}$ function $\delta:[0,1]\to\mathbb{R}$ with $\int\delta^{2}=1$ and $\delta^{\prime}$ bounded away from $0$ and $\infty$. Suppose the first derivatives $f^{\prime},f_{0}^{\prime}$ are bounded away from $0$ and $\infty$, and the second derivatives $f^{\prime\prime},f_{0}^{\prime\prime}$ are bounded. Then the LRT is power consistent under $\mu$, i.e., $\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\to 1$, if and only if $\rho_{n}\gg n^{-5/12}$, if and only if $\lVert\mu-\mu_{0}\rVert\gg n^{1/12}$.

A few remarks are in order.

  • (Normal approximation) The normal approximation in Theorem 4.7-(1) settles the problem of the limiting distribution of the LRS for the LRT used in the simulations of [DT01, Section 4], where the LRT is compared to a goodness-of-fit test based on the central limit theorem for the $\ell_{1}$ estimation error of the isotonic LSE (cf. [Gro85, GHL99, Dur07]). We note that condition (4.2) on the sequence $\mu_{0}$ is equivalent, at the function level, to the first derivative being bounded away from $0$ and $\infty$. This condition is commonly adopted in global CLTs for $\ell_{p}$ type losses of isotonic LSEs, cf. [Dur07]; in fact, [Dur07] requires a condition stronger than (4.2) to guarantee a CLT for the $\ell_{p}$ estimation error of the isotonic LSE.

  • (Rate of normal approximation) We conjecture that the error rate 𝒪(n1/6)\mathcal{O}(n^{-1/6}) in the above normal approximation is optimal based on the following heuristics. Writing μ^\widehat{\mu} as a shorthand for μ^K\widehat{\mu}_{K_{\uparrow}}, the LRS T(Y)T(Y) can be written, under H0H_{0}, as

    T(Y)\displaystyle T(Y) =2ξ,μ^μ0μ^μ02\displaystyle=2\left\langle\xi,\widehat{\mu}-\mu_{0}\right\rangle-\lVert\widehat{\mu}-\mu_{0}\rVert^{2}
    =i=1n(2ξi(μ^i(μ0)i)(μ^i(μ0)i)2).\displaystyle=\sum_{i=1}^{n}\Big{(}2\xi_{i}(\widehat{\mu}_{i}-(\mu_{0})_{i})-(\widehat{\mu}_{i}-(\mu_{0})_{i})^{2}\Big{)}.

    Under the regularity condition (4.2), the isotonic LSE $\widehat{\mu}$ is localized in the sense that each $\widehat{\mu}_{i}$ roughly depends on $\mu_{0}$ and $\xi$ only via indices in a local neighborhood of $i$ containing $\mathcal{O}(n^{2/3})$ points. So one may naturally view $T(Y)$ as roughly a sum of $\mathcal{O}(n^{1/3})$ ‘independent’ blocks, each with variance of constant order. This naturally leads to the $\mathcal{O}(1/\sqrt{n^{1/3}})=\mathcal{O}(n^{-1/6})$ rate in the Berry-Esseen bound of Theorem 4.7-(1). Our Theorem 4.7-(1) formalizes this intuition, but the proof proceeds along completely different lines.

  • (Local power analysis) The ‘local alternative’ setting in Theorem 4.7-(2) follows that of [DT01]. In particular, the separation rate in Theorem 4.7-(2) is reminiscent of [DT01, Theorem 3.1]. [DT01] obtained, under similar configurations and regularity conditions, a separation rate for a goodness-of-fit test based on the CLT for the 1\ell_{1} estimation error of the isotonic LSE of order ρnn5/12n1/2δn1/2\rho_{n}\gg n^{-5/12}\vee n^{-1/2}\delta_{n}^{-1/2}, where δn\delta_{n} is the length of the support of the function δ\delta. Our results here show that the LRT has a sharp separation rate ρnn5/12\rho_{n}\gg n^{-5/12} under the prescribed configuration, which is no worse than the one derived in [DT01] based on 1\ell_{1} estimation error.

In the isotonic regression example above, the main challenge in deriving the normal approximation for T(Y)T(Y) is to lower bound the quantity 𝔼μ0Jμ^KF2\lVert\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K_{\uparrow}}}\rVert_{F}^{2} in (3.1). We detail this intermediate result in the following proposition, which may be of independent interest (see Section 6.3 for a proof).

Proposition 4.8.

Under the setting of Theorem 4.7-(1), there exists a small enough constant κ>0\kappa>0, depending on LL only, such that

(𝔼μ0Jμ^K)ijκn2/3\displaystyle\big{(}\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K_{\uparrow}}}\big{)}_{ij}\geq\kappa n^{-2/3}

for all $(i,j)$ with $\lvert i-j\rvert\leq\kappa n^{2/3}$ and $0.1n\leq i,j\leq 0.9n$, for $n$ large enough.

The above proposition is proved by exploiting the min-max representation of the isotonic LSE, a property not shared by general shape-constrained LSEs. We conjecture that results analogous to Theorem 4.7 hold for the general $k$-monotone cone $K_{\uparrow,k}$, to be formally defined in Section 4.5, but an analogue of Proposition 4.8 is not yet available for general $K_{\uparrow,k}$.
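For readers unfamiliar with it, the min-max representation states that $(\widehat{\mu}_{K_{\uparrow}})_{i}=\max_{j\leq i}\min_{k\geq i}\frac{1}{k-j+1}\sum_{l=j}^{k}Y_{l}$. A direct (cubic-time, purely illustrative) evaluation of this formula reproduces the pooled-means fit:

```python
import numpy as np

def isotonic_minmax(y):
    """Isotonic LSE via the min-max representation
    mu_i = max_{j<=i} min_{k>=i} mean(y[j..k]); O(n^3), for illustration only."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    c = np.concatenate(([0.0], np.cumsum(y)))            # prefix sums
    mean = lambda j, k: (c[k + 1] - c[j]) / (k + 1 - j)  # average of y[j..k]
    return np.array([
        max(min(mean(j, k) for k in range(i, n)) for j in range(i + 1))
        for i in range(n)
    ])
```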

4.4. Testing in Lasso

Consider the linear regression model

Y=μ+ξXθ+ξ,Y=\mu+\xi\equiv X\theta+\xi,

where $X\in\mathbb{R}^{n\times p}$ is a fixed design matrix with $p\leq n$ and full column rank. Let $\Sigma\equiv X^{\top}X/n$ be the Gram matrix, let $\widehat{\theta}^{0}\equiv(X^{\top}X)^{-1}X^{\top}Y$ be the ordinary LSE, and let $\widehat{\theta}\equiv\widehat{\theta}(\lambda)$ be the constrained Lasso solution defined as

θ^(λ)argminθp12YXθ2s.t. θ1λ,\displaystyle\widehat{\theta}(\lambda)\equiv\operatorname*{arg\,min}_{\theta\in\mathbb{R}^{p}}\frac{1}{2}\|Y-X\theta\|^{2}\qquad\text{s.t. }\|\theta\|_{1}\leq\lambda, (4.3)

and μ^μ^(λ)Xθ^(λ)\widehat{\mu}\equiv\widehat{\mu}(\lambda)\equiv X\widehat{\theta}(\lambda). The setting here fits into our general framework by letting

KKX,λ{μ=Xθ:θ1λ}K\equiv K_{X,\lambda}\equiv\{\mu=X\theta:\|\theta\|_{1}\leq\lambda\}

and $\widehat{\mu}_{K}\equiv\widehat{\mu}$. Note that we do not impose sparsity on $\theta$ here. We will be interested in the testing problem (1.2), i.e., $H_{0}:\mu=\mu_{0}$ versus $H_{1}:\mu\in K_{X,\lambda}$, where $\mu_{0}=X\theta_{0}\in K_{X,\lambda}$ with $\lVert\theta_{0}\rVert_{1}\leq\lambda$. Such a goodness-of-fit test and the related problem of constructing confidence sets for the Lasso estimator have previously been studied in [VV10, CL11, NvdG13, SB18]. In the following, we use the two-sided LRT $\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0})$ (as in (3.11)) to test (1.2) and study its power characterization (see Section 6.4 for a proof).
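Before stating the result, note that the constrained Lasso fit in (4.3) can be computed, e.g., by projected gradient descent using the standard sort-based projection onto the $\ell_{1}$-ball; the following sketch is our own illustration (not an algorithm from the paper) and recovers $\widehat{\mu}(\lambda)=X\widehat{\theta}(\lambda)$:

```python
import numpy as np

def proj_l1_ball(v, lam):
    """Euclidean projection of v onto {x : ||x||_1 <= lam} (sort-based)."""
    if np.abs(v).sum() <= lam:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    cs = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, len(u) + 1) > cs - lam)[0][-1]
    theta = (cs[idx] - lam) / (idx + 1.0)           # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def constrained_lasso(X, Y, lam, n_iter=2000):
    """Projected gradient descent for min 0.5 ||Y - X theta||^2 s.t. ||theta||_1 <= lam."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1/L with L = ||X^T X||_op
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta = proj_l1_ball(theta - step * (X.T @ (X @ theta - Y)), lam)
    return theta
```

When $X$ is the identity, the constrained Lasso is exactly the $\ell_{1}$-ball projection of $Y$, which gives a convenient correctness check.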

Theorem 4.9.

Suppose pp\rightarrow\infty. For μKX,λ\mu\in K_{X,\lambda}, let

𝔭λ,μμ(θ^01λ).\mathfrak{p}_{\lambda,\mu}\equiv\mathbb{P}_{\mu}\big{(}\lVert\widehat{\theta}^{0}\rVert_{1}~{}\geq~{}\lambda\big{)}.
  (1)

    There exists a universal constant C>0C>0 such that, for μ0KX,λ\mu_{0}\in K_{X,\lambda},

    dTV(T(Y)mμ0σμ0,𝒩(0,1))\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}},\mathcal{N}(0,1)\bigg{)} Cp+n𝔭λ,μ01/2(pC(n𝔭λ,μ0)2)+.\displaystyle\leq\frac{C\sqrt{p+n\mathfrak{p}_{\lambda,\mu_{0}}^{1/2}}}{\big{(}p-C(n\mathfrak{p}_{\lambda,\mu_{0}})^{2}\big{)}_{+}}.

    Consequently the LRT is asymptotically size α\alpha with 𝔼μ0Ψts(Y;mμ0,σμ0)=α+𝒪(p1/2)\mathbb{E}_{\mu_{0}}\Psi_{\mathrm{ts}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})=\alpha+\mathcal{O}(p^{-1/2}), provided that n𝔭λ,μ01/2=𝔬(1)n\mathfrak{p}_{\lambda,\mu_{0}}^{1/2}=\mathfrak{o}(1).

  (2)

    Suppose n(𝔭λ,μ1/2𝔭λ,μ01/2)=𝔬(1)n\cdot(\mathfrak{p}_{\lambda,\mu}^{1/2}\vee\mathfrak{p}_{\lambda,\mu_{0}}^{1/2})=\mathfrak{o}(1). For any μKX,λ\mu\in K_{X,\lambda}, the LRT is power consistent, i.e., 𝔼μΨts(Y;mμ0,σμ0)1\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{\mu_{0}},\sigma_{\mu_{0}})\to 1, if and only if μμ0p1/4\lVert\mu-\mu_{0}\rVert\gg p^{1/4}.

The proof of Theorem 4.9 relies on the following proposition, which may be of independent interest (see Section 6.4 for a proof).

Proposition 4.10.

The following hold:

  (1)

    𝔼μ0Jμ^KX,λF2p/24(n𝔭λ,μ0)2\lVert\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K_{X,\lambda}}}\rVert_{F}^{2}\geq p/2-4\big{(}n\mathfrak{p}_{\lambda,\mu_{0}}\big{)}^{2}.

  (2)

    For any μKX,λ\mu\in K_{X,\lambda}, |𝔼μdivμ^KX,λp|2p𝔭λ,μ\big{\lvert}\mathbb{E}_{\mu}\operatorname{div}\widehat{\mu}_{K_{X,\lambda}}-p\big{\rvert}\leq 2p\cdot\mathfrak{p}_{\lambda,\mu}.

  (3)

    For any μKX,λ\mu\in K_{X,\lambda}, |𝔼μμ^KX,λμ2p|Cn𝔭λ,μ1/2\big{\lvert}\mathbb{E}_{\mu}\lVert\widehat{\mu}_{K_{X,\lambda}}-\mu\rVert^{2}-p\big{\rvert}\leq Cn\mathfrak{p}_{\lambda,\mu}^{1/2}.

Here C>0C>0 is an absolute constant.

The proof of the above proposition makes essential use of an explicit representation of the Jacobian Jμ^KX,λJ_{\widehat{\mu}_{K_{X,\lambda}}} derived in [Kat09], which complements its analogues for Lasso in the penalized form derived in [ZHT07, TT12].

Remark 4.11.

A few remarks are in order.

  (1)

    (Choice of λ\lambda) To apply Theorem 4.9, we need to control the probability term 𝔭λ,μ\mathfrak{p}_{\lambda,\mu} for a generic μ=XθKX,λ\mu=X\theta\in K_{X,\lambda}. This can be done via the following exponential inequality (see Lemma 6.3): for any t1t\geq 1,

    μ(θ^01θ1+tpnλmin(Σ))et2/C.\displaystyle\mathbb{P}_{\mu}\bigg{(}\|\widehat{\theta}^{0}\|_{1}\geq\lVert\theta\rVert_{1}+t\sqrt{\frac{p}{n\lambda_{\min}(\Sigma)}}\bigg{)}\leq e^{-t^{2}/C}.

    Here C>0C>0 is a universal constant and λmin(Σ)\lambda_{\min}(\Sigma) is the smallest eigenvalue of Σ\Sigma. Therefore, for any choice of the tuning parameter λ\lambda satisfying

    λθ01+rn, with rnCplogn/(nλmin(Σ))\displaystyle\lambda\geq\lVert\theta_{0}\rVert_{1}+r_{n},\textrm{ with }r_{n}\equiv C\sqrt{p\log n/\big{(}n\lambda_{\min}(\Sigma)\big{)}} (4.4)

    for a large enough constant $C>0$, we have $n\cdot(\mathfrak{p}_{\lambda,\mu}^{1/2}\vee\mathfrak{p}_{\lambda,\mu_{0}}^{1/2})=\mathfrak{o}(1)$ uniformly in $\mu\in K_{X,\lambda-r_{n}}$. Hence Theorem 4.9 yields that the LRT is asymptotically size $\alpha$, and is power consistent for all such prescribed $\mu$'s if and only if $\lVert\mu-\mu_{0}\rVert\gg p^{1/4}$. To get some intuition for this result, note that for tuning parameters $\lambda$ satisfying (4.4), the proof of Proposition 4.10 shows that the Jacobian $J_{\widehat{\mu}_{K_{X,\lambda}}}$ of $\widehat{\mu}_{K_{X,\lambda}}$ (cf. Equation (6.14)) is, with high probability, close to that of the unconstrained least squares fit $X\widehat{\theta}^{0}$. From this perspective, the separation rate $p^{1/4}$ is quite natural under (4.4) in view of Proposition 3.6 with $\operatorname{rank}(X)=p$ in the current setting.

  (2)

    (Lasso in penalized form) Theorem 4.9 is applicable for Lasso in its constrained form as defined in (4.3). The penalized form of Lasso

    θ^pen(τ)argminθp[12YXθ2+τθ1],\displaystyle\widehat{\theta}_{\mathrm{pen}}(\tau)\equiv\operatorname*{arg\,min}_{\theta\in\mathbb{R}^{p}}\bigg{[}\frac{1}{2}\lVert Y-X\theta\rVert^{2}+\tau\lVert\theta\rVert_{1}\bigg{]}, (4.5)

    however, does not fit into our general testing framework (1.2), so there is no natural associated ‘likelihood ratio test’. An interesting problem is to study the behavior of the statistic $T(Y)$ defined in (1.1) with $\widehat{\mu}_{K}$ replaced by the penalized Lasso estimator $\widehat{\mu}_{\mathrm{pen}}(\tau)\equiv X\widehat{\theta}_{\mathrm{pen}}(\tau)$. The major hurdle is, as in Proposition 4.10-(1), to establish a lower bound for the Frobenius norm of the expected Jacobian $\mathbb{E}J_{\widehat{\mu}_{\mathrm{pen}}(\tau)}=\mathbb{E}X_{\widehat{S}(\tau)}(X_{\widehat{S}(\tau)}^{\top}X_{\widehat{S}(\tau)})^{-1}X_{\widehat{S}(\tau)}^{\top}$ (see e.g. [BZ21, Proposition 3.10]), where $\widehat{S}(\tau)$ is the (random) support of $\widehat{\theta}_{\mathrm{pen}}(\tau)$. Although the penalized form (4.5) is known to be ‘equivalent’ to the constrained form (4.3), in that for each given $\tau>0$ there exists some data-dependent $\lambda=\lambda(\tau,X,Y)>0$ such that $\widehat{\theta}(\lambda)=\widehat{\theta}_{\mathrm{pen}}(\tau)$, the correspondence between $\tau$ and $\lambda$ is random, so the techniques used to prove Proposition 4.10 do not translate into a lower bound for $\lVert\mathbb{E}J_{\widehat{\mu}_{\mathrm{pen}}(\tau)}\rVert_{F}^{2}$. We leave this for future study.

4.5. Testing parametric assumptions versus shape-constrained alternatives

Fix $k\in\mathbb{Z}_{\geq 0}$ and $n\geq k+2$, and consider the testing problem

H0:μK0,kversusH1:μK,k.\displaystyle H_{0}:\mu\in K_{0,k}\quad\textrm{versus}\quad H_{1}:\mu\in K_{\uparrow,k}. (4.6)

Here K,k{μn:k+1μ0}K_{\uparrow,k}\equiv\{\mu\in\mathbb{R}^{n}:\nabla^{k+1}\mu\geq 0\} and K0,k{μn:k+1μ=0}K_{0,k}\equiv\{\mu\in\mathbb{R}^{n}:\nabla^{k+1}\mu=0\}, with :nn1\nabla:\mathbb{R}^{n}\to\mathbb{R}^{n-1} denoting the difference operator defined by (μi)i=1n(μi+1μi)i=1n1\nabla(\mu_{i})_{i=1}^{n}\equiv(\mu_{i+1}-\mu_{i})_{i=1}^{n-1}, and k+1:nnk1\nabla^{k+1}\equiv\nabla\circ\cdots\circ\nabla:\mathbb{R}^{n}\to\mathbb{R}^{n-k-1} with k+1k+1 compositions. It can be readily verified that K0,kK_{0,k} is a subspace of dimension k+1k+1, K,kK_{\uparrow,k} is a closed and convex cone, and K0,kK,knK_{0,k}\subset K_{\uparrow,k}\subset\mathbb{R}^{n}. Hence (4.6) is a special case of the general testing problem (1.10).
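In code, $\nabla^{k+1}$ is just the $(k+1)$-fold iterated difference, so membership in the two sets is a one-liner (np.diff with n=k+1 computes $\nabla^{k+1}$); e.g. $K_{\uparrow,0}$ consists of nondecreasing vectors and $K_{0,1}$ of vectors that are linear in the index:

```python
import numpy as np

def in_alt_cone(mu, k, tol=1e-9):
    """mu in K_{up,k} = {mu : grad^{k+1} mu >= 0} (componentwise)."""
    return bool(np.all(np.diff(mu, n=k + 1) >= -tol))

def in_null_space(mu, k, tol=1e-9):
    """mu in K_{0,k} = {mu : grad^{k+1} mu = 0}: mu_i is a polynomial of degree <= k in i."""
    return bool(np.all(np.abs(np.diff(mu, n=k + 1)) <= tol))
```

Note that every $\mu\in K_{0,k}$ also passes the cone check, reflecting $K_{0,k}\subset K_{\uparrow,k}$.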

Testing a parametric model against a nonparametric alternative has previously been studied in [CKWY88, ES90, AB93, HM93, Stu97, FH01, GL05, CS10, NVK10, SM17]. Among these alternatives, the shape-constrained ones in (4.6) are sometimes preferred, since the corresponding model fits usually do not involve the choice of tuning parameters. In particular:

  (1)

    When k=0k=0, (4.6) becomes:

    H0:μ is ‘constant’,versusH1:μ is ‘monotone’.\displaystyle H_{0}:\mu\hbox{ is `constant'},\qquad\textrm{versus}\qquad H_{1}:\mu\hbox{ is `monotone'}.
  (2)

    When k=1k=1, (4.6) becomes:

    H0:μ is ‘linear’,versusH1:μ is ‘convex’.\displaystyle H_{0}:\mu\hbox{ is `linear'},\qquad\textrm{versus}\qquad H_{1}:\mu\hbox{ is `convex'}.

The above two settings have previously been considered in [Bar59a, Bar59b, RWD88, SM17].

Theorem 4.12.

Fix k0k\in\mathbb{Z}_{\geq 0}. Consider testing (4.6) using the two-sided LRT Ψts(Y;m0,σ0)\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0}), as in (3.11).

  (1)

    There exists a constant C>0C>0, depending on kk only, such that for μK0,k\mu\in K_{0,k},

    dTV(T(Y)m0σ0,𝒩(0,1))\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T(Y)-m_{0}}{\sigma_{0}},\mathcal{N}(0,1)\bigg{)} C𝟏k=0log(en)+𝟏k1loglog(16n).\displaystyle\leq\frac{C}{\bm{1}_{k=0}\sqrt{\log(en)}+\bm{1}_{k\geq 1}\sqrt{\log\log(16n)}}.

    Consequently for μK0,k\mu\in K_{0,k}, the LRT is asymptotically size α\alpha with 𝔼μΨts(Y;m0,σ0)=α+𝒪(𝟏k=0(log(en))1/2+𝟏k1(loglog(16n))1/2)\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0})=\alpha+\mathcal{O}\big{(}\bm{1}_{k=0}(\log(en))^{-1/2}+\bm{1}_{k\geq 1}(\log\log(16n))^{-1/2}\big{)}.

  (2)

    For μK,k\mu\in K_{\uparrow,k} with μΠK0,k(μ)log1/4(en)\lVert\mu-\Pi_{K_{0,k}}(\mu)\rVert\gg\log^{1/4}(en), the LRT is power consistent under μ\mu, i.e., 𝔼μΨts(Y;m0,σ0)1\mathbb{E}_{\mu}\Psi_{\mathrm{ts}}(Y;m_{0},\sigma_{0})\to 1.

The key step in the proof of Theorem 4.12 (proved in Section 6.5) is to obtain the correct order of the statistical dimension δK,k\delta_{K_{\uparrow,k}}. The discrepancy between k=0k=0 and k1k\geq 1 in claim (1) is due to the fact that while a universal upper bound of the order log(en)\log(en) can be proved for any fixed k0k\geq 0, only a lower bound of the order loglog(16n)\log\log(16n) can be proved for k1k\geq 1. We conjecture that the correct order of δK,k\delta_{K_{\uparrow,k}} should be log(en)\log(en) for all fixed k0k\geq 0.

The above theorem can be easily extended to the multi-dimensional analogue of (4.6) in the context of, e.g., testing constancy versus coordinate-wise monotonicity, linearity versus multi-dimensional convexity, by using results of [HWCS19, KGGS20]; we omit the details here.

5. Proofs of results in Section 3

5.1. Proof of Theorem 3.1

We need the following proposition, which can be proved using techniques similar to [GNP17, Theorem 2.1]. We provide the details of its proof in Appendix A.2 for the convenience of the reader.

Proposition 5.1.

Suppose K0,KK_{0},K are two non-empty closed convex sets in n\mathbb{R}^{n}. Let

TK0,K(y)yΠK0(y)2yΠK(y)2.\displaystyle T_{K_{0},K}(y)\equiv\lVert y-\Pi_{K_{0}}(y)\rVert^{2}-\lVert y-\Pi_{K}(y)\rVert^{2}.

Then for any μn\mu\in\mathbb{R}^{n}, under the model (1.1),

dTV(TK0,K(Y)𝔼μTK0,K(Y)Varμ(TK0,K(Y)),𝒩(0,1))16𝔼μμ^Kμ^K02Varμ(TK0,K(Y)).\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{T_{K_{0},K}(Y)-\mathbb{E}_{\mu}T_{K_{0},K}(Y)}{\sqrt{\operatorname{Var}_{\mu}(T_{K_{0},K}(Y))}},\mathcal{N}(0,1)\bigg{)}\leq\frac{16\sqrt{\mathbb{E}_{\mu}\lVert\widehat{\mu}_{K}-\widehat{\mu}_{K_{0}}\rVert^{2}}}{\operatorname{Var}_{\mu}(T_{K_{0},K}(Y))}.

The next lemma provides a lower bound for the variance of $F(\xi)$, for functions $F:\mathbb{R}^{n}\to\mathbb{R}$ that are absolutely continuous together with their first-order partial derivatives. The proof is based on Fourier analysis in the Gaussian space, in the spirit of [NP12, Proposition 1.5.1].

Lemma 5.2.

Let F:nF:\mathbb{R}^{n}\to\mathbb{R} be such that {𝐤F:|𝐤|1}\{\partial_{\bm{k}}F:\lvert\bm{k}\rvert\leq 1\} are absolutely continuous and {𝐤F:|𝐤|2}\{\partial_{\bm{k}}F:\lvert\bm{k}\rvert\leq 2\} have sub-exponential growth at \infty. Then

\operatorname{Var}(F(\xi))\geq\sum_{i}\big(\mathbb{E}\partial_{i}F(\xi)\big)^{2}+\frac{1}{2}\sum_{i\neq j}\big(\mathbb{E}\partial_{ij}F(\xi)\big)^{2}+\frac{1}{2}\sum_{i}\big(\mathbb{E}\partial_{ii}F(\xi)\big)^{2}.
Proof.

We only need to verify the claimed inequality when $\mathbb{E}F(\xi)=0$. Let $H_{k}(x)=(-1)^{k}e^{x^{2}/2}\frac{\mathrm{d}^{k}}{\mathrm{d}x^{k}}e^{-x^{2}/2}$ be the Hermite polynomial of order $k$. For a multi-index $\bm{k}=(k_{1},\ldots,k_{n})$ and $y\in\mathbb{R}^{n}$, let $H_{\bm{k}}(y)\equiv\prod_{i=1}^{n}H_{k_{i}}(y_{i})$. Then $\{H_{\bm{k}}:\bm{k}\in\mathbb{Z}_{\geq 0}^{n}\}$ is a complete orthogonal basis of $L_{2}(\gamma_{n})$, where $\gamma_{n}$ is the standard Gaussian measure on $\mathbb{R}^{n}$. On the other hand, the absolute continuity and growth conditions on $F$ ensure the validity of the following Gaussian integration by parts: for all multi-indices $\bm{k}$ with $\lvert\bm{k}\rvert\leq 2$,

𝔼[F(ξ)H𝒌(ξ)]=𝔼𝒌F(ξ).\displaystyle\mathbb{E}\big{[}F(\xi)H_{\bm{k}}(\xi)\big{]}=\mathbb{E}\partial_{\bm{k}}F(\xi).

As 𝔼|H𝒌(ξ)|2=𝒌!\mathbb{E}\lvert H_{\bm{k}}(\xi)\rvert^{2}=\bm{k}!, it follows by Plancherel’s theorem that

Var(F(ξ))=𝔼F2(ξ)𝒌:|𝒌|2(𝔼F(ξ)H𝒌(ξ))2𝔼|H𝒌(ξ)|2,\displaystyle\operatorname{Var}(F(\xi))=\mathbb{E}F^{2}(\xi)\geq\sum_{\bm{k}:\lvert\bm{k}\rvert\leq 2}\frac{\big{(}\mathbb{E}F(\xi)H_{\bm{k}}(\xi)\big{)}^{2}}{\mathbb{E}\lvert H_{\bm{k}}(\xi)\rvert^{2}},

which equals the right hand side of the claimed inequality. ∎
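As a one-dimensional sanity check of the lemma, take $n=1$ and $F(x)=(x)_{+}^{2}$, for which all quantities are available in closed form: $\operatorname{Var}(F(\xi))=3/2-1/4=5/4$, $\mathbb{E}F^{\prime}(\xi)=2\mathbb{E}\xi_{+}=\sqrt{2/\pi}$ and $\mathbb{E}F^{\prime\prime}(\xi)=2\mathbb{P}(\xi>0)=1$, so the bound reads $5/4\geq 2/\pi+1/2\approx 1.137$ (our own worked example, not from the paper):

```python
import math

# F(x) = max(x, 0)^2, xi ~ N(0, 1):
#   E F(xi)   = E xi_+^2 = 1/2,   E F(xi)^2 = E xi_+^4 = 3/2
#   E F'(xi)  = 2 E xi_+ = sqrt(2/pi)
#   E F''(xi) = 2 P(xi > 0) = 1
var_F = 3 / 2 - (1 / 2) ** 2          # = 5/4
lower = 2 / math.pi + 0.5 * 1.0 ** 2  # (E F')^2 + (1/2)(E F'')^2
```

The bound is not tight here; the gap is carried by the higher-order Hermite coefficients discarded in the proof.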

Proof of Theorem 3.1.

Let

F(ξ)T(μ0+ξ)=μ0+ξμ02μ0+ξΠK(μ0+ξ)2.\displaystyle F(\xi)\equiv T(\mu_{0}+\xi)=\lVert\mu_{0}+\xi-\mu_{0}\rVert^{2}-\lVert\mu_{0}+\xi-\Pi_{K}(\mu_{0}+\xi)\rVert^{2}.

By Lemma 2.1-(1),

F(ξ)=T(μ0+ξ)=2(ΠK(μ0+ξ)μ0).\displaystyle\nabla F(\xi)=\nabla T(\mu_{0}+\xi)=2\big{(}\Pi_{K}(\mu_{0}+\xi)-\mu_{0}\big{)}.

Hence

ijF(ξ)=ijT(μ0+ξ)=2(JΠK(μ0+ξ))ji.\displaystyle\partial_{ij}F(\xi)=\partial_{ij}T(\mu_{0}+\xi)=2(J_{\Pi_{K}}(\mu_{0}+\xi))_{ji}.

We verify that $F$ satisfies the conditions of Lemma 5.2. By the above closed-form expressions for $F$ and $\nabla F$, the absolute continuity of $\{\partial_{\bm{k}}F:\lvert\bm{k}\rvert\leq 1\}$ holds by noting that $\nabla F$ is $2$-Lipschitz. On the other hand, as

|F(ξ)|\displaystyle\lvert F(\xi)\rvert =|ξ2μ0+ξΠK(μ0+ξ)2|C(ξ2μ02),\displaystyle=\big{\lvert}\lVert\xi\rVert^{2}-\lVert\mu_{0}+\xi-\Pi_{K}(\mu_{0}+\xi)\rVert^{2}\big{\lvert}\leq C\cdot\big{(}\lVert\xi\rVert^{2}\vee\lVert\mu_{0}\rVert^{2}\big{)},
F(ξ)\displaystyle\lVert\nabla F(\xi)\rVert C(μ0ξ),2F(ξ)=2JΠK(μ0+ξ)2,\displaystyle\leq C\cdot\big{(}\lVert\mu_{0}\rVert\vee\lVert\xi\rVert\big{)},\qquad\lVert\nabla^{2}F(\xi)\rVert=2\lVert J_{\Pi_{K}}(\mu_{0}+\xi)^{\top}\rVert\leq 2,

it follows that {𝒌F:|𝒌|2}\{\partial_{\bm{k}}F:\lvert\bm{k}\rvert\leq 2\} have sub-exponential growth at \infty. Now we may apply Lemma 5.2 to see that

σμ02=Var(T(Y))\displaystyle\sigma_{\mu_{0}}^{2}=\operatorname{Var}(T(Y)) i(𝔼iF(ξ))2+12i,j(𝔼ijF(ξ))2\displaystyle\geq\sum_{i}\big{(}\mathbb{E}\partial_{i}F(\xi)\big{)}^{2}+\frac{1}{2}\sum_{i,j}\big{(}\mathbb{E}\partial_{ij}F(\xi)\big{)}^{2}
=4𝔼ΠK(μ0+ξ)μ02+2i,j(𝔼JΠK(μ0+ξ))ij2,\displaystyle=4\lVert\mathbb{E}\Pi_{K}(\mu_{0}+\xi)-\mu_{0}\rVert^{2}+2\sum_{i,j}\big{(}\mathbb{E}J_{\Pi_{K}}(\mu_{0}+\xi)\big{)}_{ij}^{2},

as desired. The claim of the theorem now follows from Proposition 5.1. ∎

5.2. Proof of Theorem 3.2

A simple but important observation in the proof of Theorem 3.2 is the following.

Proposition 5.3.

Let

Z(μ,μ0)\displaystyle Z(\mu,\mu_{0}) ΔTμ,μ0(ξ)𝔼(ΔTμ,μ0)\displaystyle\equiv\Delta T_{\mu,\mu_{0}}(\xi)-\mathbb{E}(\Delta T_{\mu,\mu_{0}})
=T(μ+ξ)T(μ0+ξ)(mμmμ0).\displaystyle=T(\mu+\xi)-T(\mu_{0}+\xi)-(m_{\mu}-m_{\mu_{0}}).

Then for any t0t\geq 0,

(Z(μ,μ0)>t)(Z(μ,μ0)<t)exp(t28μμ02).\displaystyle\mathbb{P}\big{(}Z(\mu,\mu_{0})>t\big{)}\vee\mathbb{P}\big{(}Z(\mu,\mu_{0})<-t\big{)}\leq\exp\bigg{(}-\frac{t^{2}}{8\lVert\mu-\mu_{0}\rVert^{2}}\bigg{)}.
Proof.

As

T(y)=(yμ02yΠK(y)2)=2(ΠK(y)μ0)\displaystyle\nabla T(y)=\nabla\big{(}\lVert y-\mu_{0}\rVert^{2}-\lVert y-\Pi_{K}(y)\rVert^{2}\big{)}=2(\Pi_{K}(y)-\mu_{0})

by Lemma 2.1-(1), it follows that

ξZ(μ,μ0)\displaystyle\lVert\nabla_{\xi}Z(\mu,\mu_{0})\rVert =2ΠK(μ+ξ)ΠK(μ0+ξ)2μμ0.\displaystyle=2\lVert\Pi_{K}(\mu+\xi)-\Pi_{K}(\mu_{0}+\xi)\rVert\leq 2\lVert\mu-\mu_{0}\rVert.

The claim now follows from the Gaussian concentration inequality for Lipschitz functions, cf. [Bou03, Theorem 5.6]. ∎
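The contraction step in the display above, $\lVert\Pi_{K}(a)-\Pi_{K}(b)\rVert\leq\lVert a-b\rVert$ for any closed convex $K$, is easy to check numerically; e.g. for the orthant $K=[0,\infty)^{n}$, where $\Pi_{K}(y)=(y)_{+}$ (a quick randomized illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
proj = lambda y: np.clip(y, 0.0, None)    # projection onto the orthant [0, inf)^n
max_ratio = 0.0
for _ in range(2000):
    a, b = rng.standard_normal(10), rng.standard_normal(10)
    max_ratio = max(max_ratio,
                    np.linalg.norm(proj(a) - proj(b)) / np.linalg.norm(a - b))
```

Combined with Gaussian concentration, the resulting $2\lVert\mu-\mu_{0}\rVert$ Lipschitz bound on $Z(\mu,\mu_{0})$ gives exactly the sub-Gaussian tail in Proposition 5.3.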

Lemma 5.4.

For any $t\in\mathbb{R}$, there exists some $C_{t}>0$ such that for all $u\in\mathbb{R}$ and $\eta\in[-1/2,1/2]$,

|(𝒩(u,1)t)(𝒩((1+η)u,1)t)|Ct|η|.\displaystyle\big{\lvert}\mathbb{P}\big{(}\mathcal{N}(u,1)\leq t\big{)}-\mathbb{P}\big{(}\mathcal{N}((1+\eta)u,1)\leq t\big{)}\big{\rvert}\leq C_{t}\cdot|\eta|.

Furthermore, suptMCt<\sup_{t\in M}C_{t}<\infty for any compact subset MM of \mathbb{R}.

Proof.

We assume $\eta\geq 0$ without loss of generality. Note that with $\varphi$ denoting the density of the standard normal distribution,

(𝒩(0,1)t(1+η)u)\displaystyle\mathbb{P}\big{(}\mathcal{N}(0,1)\leq t-(1+\eta)u\big{)} (𝒩(0,1)tu)+ηsupv[(tu)±η|u|]φ(v)|u|\displaystyle\leq\mathbb{P}\big{(}\mathcal{N}(0,1)\leq t-u\big{)}+\eta\cdot\sup_{v\in[(t-u)\pm\eta\lvert u\rvert]}\varphi(v)\lvert u\rvert
(𝒩(0,1)tu)+ηCt,\displaystyle\leq\mathbb{P}\big{(}\mathcal{N}(0,1)\leq t-u\big{)}+\eta\cdot C_{t},

where Ctsupusupv[(tu)±(|u|/2)]φ(v)|u|<C_{t}\equiv\sup_{u}\sup_{v\in[(t-u)\pm(\lvert u\rvert/2)]}\varphi(v)\lvert u\rvert<\infty depends on tt only. ∎
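Lemma 5.4 can be checked numerically from the definition of CtC_{t} (the value of tt and the grid sizes below are arbitrary illustration choices; a small additive slack absorbs the finite-grid approximation of the supremum):

```python
import numpy as np
from math import erf, sqrt, pi, exp

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))    # standard normal c.d.f.
phi = lambda x: exp(-x * x / 2.0) / sqrt(2.0 * pi)  # standard normal density

def C_t(t, us):
    # C_t = sup_u sup_{v in [(t-u) +/- |u|/2]} phi(v)|u|, on a finite grid
    best = 0.0
    for u in us:
        vs = np.linspace(t - u - abs(u) / 2, t - u + abs(u) / 2, 201)
        best = max(best, abs(u) * max(phi(v) for v in vs))
    return best

t = 0.7
Ct = C_t(t, np.linspace(-10.0, 10.0, 401))
ok = True
for u in np.linspace(-5.0, 5.0, 101):
    for eta in np.linspace(-0.5, 0.5, 21):
        # |P(N(u,1) <= t) - P(N((1+eta)u,1) <= t)|
        lhs = abs(Phi(t - u) - Phi(t - (1.0 + eta) * u))
        ok = ok and (lhs <= Ct * abs(eta) + 1e-3)  # small grid slack
print(ok)
```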

Proof of Theorem 3.2.

First note that under the model (1.1), the normalized LRS T(Y)T(Y) satisfies the decomposition

T(Y)mμ0σμ0\displaystyle\frac{T(Y)-m_{\mu_{0}}}{\sigma_{\mu_{0}}} =T(μ+ξ)T(μ0+ξ)σμ0+T(μ0+ξ)mμ0σμ0.\displaystyle=\frac{T(\mu+\xi)-T(\mu_{0}+\xi)}{\sigma_{\mu_{0}}}+\frac{T(\mu_{0}+\xi)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}. (5.1)

Using Proposition 5.3, on an event EuE_{u} with (Eu)12eu2\mathbb{P}(E_{u})\geq 1-2e^{-u^{2}}, we have |Z(μ,μ0)|3uμμ0\lvert Z(\mu,\mu_{0})\rvert\leq 3u\lVert\mu-\mu_{0}\rVert with Z(μ,μ0)Z(\mu,\mu_{0}) defined therein. Then for any tt\in\mathbb{R},

(T(μ+ξ)mμ0σμ0t)\displaystyle\mathbb{P}\bigg{(}\frac{T(\mu+\xi)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\leq t\bigg{)}
=(mμmμ0+Z(μ,μ0)σμ0+T(μ0+ξ)mμ0σμ0t)\displaystyle=\mathbb{P}\bigg{(}\frac{m_{\mu}-m_{\mu_{0}}+Z(\mu,\mu_{0})}{\sigma_{\mu_{0}}}+\frac{T(\mu_{0}+\xi)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\leq t\bigg{)}
(mμmμ03uμμ0σμ0+𝒩(0,1)t)+2eu2+errμ0\displaystyle\leq\mathbb{P}\bigg{(}\frac{m_{\mu}-m_{\mu_{0}}-3u\lVert\mu-\mu_{0}\rVert}{\sigma_{\mu_{0}}}+\mathcal{N}(0,1)\leq t\bigg{)}+2e^{-u^{2}}+\textrm{err}_{\mu_{0}} (5.2)
=(mμmμ0σμ0(1+η(u))+𝒩(0,1)t)+2eu2+errμ0,\displaystyle=\mathbb{P}\bigg{(}\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\big{(}1+\eta(u)\big{)}+\mathcal{N}(0,1)\leq t\bigg{)}+2e^{-u^{2}}+\textrm{err}_{\mu_{0}},

where

η(u)3uμμ0mμmμ0.\displaystyle\eta(u)\equiv-3u\cdot\frac{\lVert\mu-\mu_{0}\rVert}{m_{\mu}-m_{\mu_{0}}}.

By choosing u|mμmμ0|/(6μμ0)u\leq\lvert m_{\mu}-m_{\mu_{0}}\rvert/\big{(}6\lVert\mu-\mu_{0}\rVert\big{)}, we have |η(u)|1/2\lvert\eta(u)\rvert\leq 1/2, so we may apply Lemma 5.4 to see that,

Δ\displaystyle\Delta^{\ast} (T(μ+ξ)mμ0σμ0t)(mμmμ0σμ0+𝒩(0,1)t)\displaystyle\equiv\mathbb{P}\bigg{(}\frac{T(\mu+\xi)-m_{\mu_{0}}}{\sigma_{\mu_{0}}}\leq t\bigg{)}-\mathbb{P}\bigg{(}\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}+\mathcal{N}(0,1)\leq t\bigg{)}
2eu2+Ctuμμ0|mμmμ0|+errμ0.\displaystyle\leq 2e^{-u^{2}}+C_{t}u\cdot\frac{\lVert\mu-\mu_{0}\rVert}{\lvert m_{\mu}-m_{\mu_{0}}\rvert}+\textrm{err}_{\mu_{0}}.

Optimizing u|mμmμ0|/(6μμ0)u\leq\lvert m_{\mu}-m_{\mu_{0}}\rvert/\big{(}6\lVert\mu-\mu_{0}\rVert), the first two terms in the error bound above can be bounded, up to an absolute constant, by

(1Ct)(1μμ0|mμmμ0|).\displaystyle(1\vee C_{t})\cdot\mathscr{L}\bigg{(}1\bigwedge\frac{\lVert\mu-\mu_{0}\rVert}{\lvert m_{\mu}-m_{\mu_{0}}\rvert}\bigg{)}.

Next we will obtain a similar upper bound for Δ\Delta^{\ast}, but replacing |mμmμ0|\lvert m_{\mu}-m_{\mu_{0}}\rvert in the above display by σμ0\sigma_{\mu_{0}}. To see this, (5.2) along with

(mμmμ03uμμ0σμ0+𝒩(0,1)t)\displaystyle\mathbb{P}\bigg{(}\frac{m_{\mu}-m_{\mu_{0}}-3u\lVert\mu-\mu_{0}\rVert}{\sigma_{\mu_{0}}}+\mathcal{N}(0,1)\leq t\bigg{)}
(mμmμ0σμ0+𝒩(0,1)t)+φ3uμμ0σμ0\displaystyle\quad\leq\mathbb{P}\bigg{(}\frac{m_{\mu}-m_{\mu_{0}}}{\sigma_{\mu_{0}}}+\mathcal{N}(0,1)\leq t\bigg{)}+\|\varphi\|_{\infty}\cdot\frac{3u\lVert\mu-\mu_{0}\rVert}{\sigma_{\mu_{0}}}

yields that

Δ\displaystyle\Delta^{\ast} infu>0{2eu2+3uμμ0σμ0}+errμ0C(1μμ0σμ0)+errμ0.\displaystyle\leq\inf_{u>0}\bigg{\{}2e^{-u^{2}}+3u\cdot\frac{\lVert\mu-\mu_{0}\rVert}{\sigma_{\mu_{0}}}\bigg{\}}+\textrm{err}_{\mu_{0}}\leq C\cdot\mathscr{L}\bigg{(}1\bigwedge\frac{\lVert\mu-\mu_{0}\rVert}{\sigma_{\mu_{0}}}\bigg{)}+\textrm{err}_{\mu_{0}}.

Similar lower bounds can be derived. Applying the above arguments to the (at most 2) end point(s) of 𝒜α\mathcal{A}_{\alpha} proves the inequality (3.2). Now (1) is a direct consequence of (3.2), while (2) follows by further noting ΔAα(wn)β\Delta_{A_{\alpha}}(w_{n})\to\beta if and only if all limit points of the sequence {wn}\{w_{n}\} are contained in ΔAα1(β)\Delta_{A_{\alpha}}^{-1}(\beta). ∎

5.3. Proof of Theorem 3.8

By Lemma 3.7, we only need to consider μ=0\mu=0. Note that: (i) ξΠK(ξ)2=ξ2ΠK(ξ)2\lVert\xi-\Pi_{K^{\prime}}(\xi)\rVert^{2}=\lVert\xi\rVert^{2}-\lVert\Pi_{K^{\prime}}(\xi)\rVert^{2} for K{K0,K}K^{\prime}\in\{K_{0},K\}, (ii) (K0,K)(K_{0},K) is a non-oblique pair of closed convex cones in that ΠK0=ΠK0ΠK\Pi_{K_{0}}=\Pi_{K_{0}}\circ\Pi_{K}, so ΠK(ξ)=ΠK0(ξ)+ΠKK0(ξ)\Pi_{K}(\xi)=\Pi_{K_{0}}(\xi)+\Pi_{K\cap K_{0}^{\ast}}(\xi) with ΠK0(ξ),ΠKK0(ξ)=0\left\langle\Pi_{K_{0}}(\xi),\Pi_{K\cap K_{0}^{\ast}}(\xi)\right\rangle=0 (cf. [WWG19, Equation (25)]), and hence ΠK(ξ)2=ΠK0(ξ)2+ΠKK0(ξ)2\lVert\Pi_{K}(\xi)\rVert^{2}=\lVert\Pi_{K_{0}}(\xi)\rVert^{2}+\lVert\Pi_{K\cap K_{0}^{\ast}}(\xi)\rVert^{2}. Thus,

𝔼ΠK(ξ)ΠK0(ξ)2\displaystyle\mathbb{E}\lVert\Pi_{K}(\xi)-\Pi_{K_{0}}(\xi)\rVert^{2} =𝔼ΠKK0(ξ)2\displaystyle=\mathbb{E}\lVert\Pi_{K\cap K_{0}^{\ast}}(\xi)\rVert^{2}
=𝔼[ΠK(ξ)2ΠK0(ξ)2]=δKδK0,\displaystyle=\mathbb{E}\big{[}\lVert\Pi_{K}(\xi)\rVert^{2}-\lVert\Pi_{K_{0}}(\xi)\rVert^{2}\big{]}=\delta_{K}-\delta_{K_{0}},

and

σ02\displaystyle\sigma_{0}^{2} =Var(ΠK0(ξ)2ΠK(ξ)2)\displaystyle=\operatorname{Var}\big{(}\lVert\Pi_{K_{0}}(\xi)\rVert^{2}-\lVert\Pi_{K}(\xi)\rVert^{2}\big{)}
=Var(ΠKK0(ξ)2)()2δKK0=2(δKδK0).\displaystyle=\operatorname{Var}\big{(}\lVert\Pi_{K\cap K_{0}^{\ast}}(\xi)\rVert^{2}\big{)}\stackrel{{\scriptstyle(\ast)}}{{\geq}}2\delta_{K\cap K_{0}^{\ast}}=2\big{(}\delta_{K}-\delta_{K_{0}}\big{)}.

Here the inequality ()(\ast) follows by Lemma 2.4-(2). The claim now follows from Proposition 5.1. ∎
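For a concrete non-oblique pair (an illustrative choice, not the general setting of the theorem): take KK the monotone cone and K0KK_{0}\subset K the subspace of constant vectors. The isotonic projection preserves the sample mean, so ΠK0=ΠK0ΠK\Pi_{K_{0}}=\Pi_{K_{0}}\circ\Pi_{K}, and the Pythagorean identity used above can be checked per sample with a standard pool-adjacent-violators sketch:

```python
import numpy as np

def pava(y):
    # Pool-adjacent-violators: Euclidean projection of y onto the
    # monotone (nondecreasing) cone.
    out = []
    for v in y:
        out.append([v, 1])  # [block mean, block size]
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, n2 = out.pop()
            m1, n1 = out.pop()
            out.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return np.concatenate([np.full(sz, m) for m, sz in out])

rng = np.random.default_rng(1)
n = 30
ok = True
for _ in range(200):
    xi = rng.normal(size=n)
    pk = pava(xi)                 # Pi_K(xi), K = monotone cone
    pk0 = np.full(n, xi.mean())   # Pi_{K0}(xi), K0 = constants
    # non-obliqueness: the isotonic fit preserves the mean, so
    # Pi_{K0}(Pi_K(xi)) = Pi_{K0}(xi)
    ok = ok and abs(pk.mean() - xi.mean()) < 1e-10
    # Pythagoras: ||Pi_K||^2 = ||Pi_{K0}||^2 + ||Pi_K - Pi_{K0}||^2
    lhs = np.sum(pk ** 2)
    rhs = np.sum(pk0 ** 2) + np.sum((pk - pk0) ** 2)
    ok = ok and abs(lhs - rhs) < 1e-8
print(ok)
```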

5.4. Proof of Theorem 3.9

First note that we have the decomposition

T(μ+ξ)m0σ0\displaystyle\frac{T(\mu+\xi)-m_{0}}{\sigma_{0}} =T(μ+ξ)T(ξ)σ0+T(ξ)m0σ0.\displaystyle=\frac{T(\mu+\xi)-T(\xi)}{\sigma_{0}}+\frac{T(\xi)-m_{0}}{\sigma_{0}}. (5.3)

As

mμ\displaystyle m_{\mu} =𝔼[μ+ξΠK0(μ+ξ)2μ+ξΠK(μ+ξ)2]\displaystyle=\mathbb{E}\big{[}\lVert\mu+\xi-\Pi_{K_{0}}(\mu+\xi)\rVert^{2}-\lVert\mu+\xi-\Pi_{K}(\mu+\xi)\rVert^{2}\big{]}
=𝔼[ΠK0(μ+ξ)μ22ξ,ΠK0(μ+ξ)+ξ2\displaystyle=\mathbb{E}\bigg{[}\lVert\Pi_{K_{0}}(\mu+\xi)-\mu\rVert^{2}-2\left\langle\xi,\Pi_{K_{0}}(\mu+\xi)\right\rangle+\lVert\xi\rVert^{2}
(ΠK(μ+ξ)μ22ξ,ΠK(μ+ξ)+ξ2)]\displaystyle\qquad\qquad-\bigg{(}\lVert\Pi_{K}(\mu+\xi)-\mu\rVert^{2}-2\left\langle\xi,\Pi_{K}(\mu+\xi)\right\rangle+\lVert\xi\rVert^{2}\bigg{)}\bigg{]}
={μΠK0(μ)2+2𝔼ξ,ΠK(μ+ξ)𝔼ΠK(μ+ξ)μ2}δK0,\displaystyle=\bigg{\{}\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}+2\mathbb{E}\left\langle\xi,\Pi_{K}(\mu+\xi)\right\rangle-\mathbb{E}\lVert\Pi_{K}(\mu+\xi)-\mu\rVert^{2}\bigg{\}}-\delta_{K_{0}},

we have (as δK=𝔼ΠK(ξ)2=𝔼ξ,ΠK(ξ)\delta_{K}=\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}=\mathbb{E}\left\langle\xi,\Pi_{K}(\xi)\right\rangle)

mμm0\displaystyle m_{\mu}-m_{0} =𝔼[2ξ,ΠK(μ+ξ)ΠK(μ+ξ)μ2ξ,ΠK(ξ)]+μΠK0(μ)2\displaystyle=\mathbb{E}\bigg{[}2\left\langle\xi,\Pi_{K}(\mu+\xi)\right\rangle-\lVert\Pi_{K}(\mu+\xi)-\mu\rVert^{2}-\left\langle\xi,\Pi_{K}(\xi)\right\rangle\bigg{]}+\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}
=𝔼[2ξ,ΠK(μΠK0(μ)+ξ)\displaystyle=\mathbb{E}\bigg{[}2\left\langle\xi,\Pi_{K}(\mu-\Pi_{K_{0}}(\mu)+\xi)\right\rangle
ΠK(μΠK0(μ)+ξ)(μΠK0(μ))2ξ,ΠK(ξ)]\displaystyle\qquad\qquad-\lVert\Pi_{K}(\mu-\Pi_{K_{0}}(\mu)+\xi)-\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}\rVert^{2}-\left\langle\xi,\Pi_{K}(\xi)\right\rangle\bigg{]}
+μΠK0(μ)2 (by Lemma 3.7)\displaystyle\qquad\qquad+\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}\qquad\qquad\qquad\qquad\hbox{ (by Lemma \ref{lem:invariance_lrt})}
=𝔼[2μΠK0(μ)+ξ,ΠK(μΠK0(μ)+ξ)\displaystyle=\mathbb{E}\bigg{[}2\left\langle\mu-\Pi_{K_{0}}(\mu)+\xi,\Pi_{K}(\mu-\Pi_{K_{0}}(\mu)+\xi)\right\rangle
ΠK(μΠK0(μ)+ξ)2μΠK0(μ)2ξ,ΠK(ξ)]\displaystyle\qquad\qquad-\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert^{2}-\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}-\left\langle\xi,\Pi_{K}(\xi)\right\rangle\bigg{]}
+μΠK0(μ)2\displaystyle\qquad\qquad+\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}
=𝔼ΠK(μΠK0(μ)+ξ)2𝔼ΠK(ξ)2=ΓK,2(μΠK0(μ)).\displaystyle=\mathbb{E}\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert^{2}-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}=\Gamma_{K,2}(\mu-\Pi_{K_{0}}(\mu)).

Here in the last line of the above display we used that

𝔼μΠK0(μ)+ξ,ΠK(μΠK0(μ)+ξ)\displaystyle\mathbb{E}\left\langle\mu-\Pi_{K_{0}}(\mu)+\xi,\Pi_{K}(\mu-\Pi_{K_{0}}(\mu)+\xi)\right\rangle =𝔼ΠK(μΠK0(μ)+ξ)2,\displaystyle=\mathbb{E}\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert^{2},
𝔼ξ,ΠK(ξ)\displaystyle\mathbb{E}\left\langle\xi,\Pi_{K}(\xi)\right\rangle =𝔼ΠK(ξ)2.\displaystyle=\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}.

Let

Z0(μ)T(μ+ξ)T(ξ)(mμm0).\displaystyle Z_{0}(\mu)\equiv T(\mu+\xi)-T(\xi)-(m_{\mu}-m_{0}).

As T(y)=(yΠK0(y)2yΠK(y)2)=2(ΠK(y)ΠK0(y))\nabla T(y)=\nabla\big{(}\lVert y-\Pi_{K_{0}}(y)\rVert^{2}-\lVert y-\Pi_{K}(y)\rVert^{2}\big{)}=2(\Pi_{K}(y)-\Pi_{K_{0}}(y)) by Lemma 2.1-(1),

ξZ0(μ)\displaystyle\nabla_{\xi}Z_{0}(\mu) =2(ΠK(μ+ξ)ΠK(ξ))2(ΠK0(μ+ξ)ΠK0(ξ))\displaystyle=2\big{(}\Pi_{K}(\mu+\xi)-\Pi_{K}(\xi)\big{)}-2\big{(}\Pi_{K_{0}}(\mu+\xi)-\Pi_{K_{0}}(\xi)\big{)}
=2(ΠK(μΠK0(μ)+ξ)ΠK(ξ))\displaystyle=2\big{(}\Pi_{K}(\mu-\Pi_{K_{0}}(\mu)+\xi)-\Pi_{K}(\xi)\big{)}
2(ΠK0(μΠK0(μ)+ξ)ΠK0(ξ)),(by Lemma 3.7)\displaystyle\qquad-2\big{(}\Pi_{K_{0}}(\mu-\Pi_{K_{0}}(\mu)+\xi)-\Pi_{K_{0}}(\xi)\big{)},\qquad\hbox{(by Lemma \ref{lem:invariance_lrt})}

and hence

ξZ0(μ)4μΠK0(μ).\displaystyle\lVert\nabla_{\xi}Z_{0}(\mu)\rVert\leq 4\lVert\mu-\Pi_{K_{0}}(\mu)\rVert.

Now using the Gaussian concentration inequality for Lipschitz functions, cf. [BLM13, Theorem 5.6], it holds for any t>0t>0 that

(Z0(μ)>t)(Z0(μ)<t)exp(t232μΠK0(μ)2).\displaystyle\mathbb{P}\big{(}Z_{0}(\mu)>t\big{)}\vee\mathbb{P}\big{(}Z_{0}(\mu)<-t\big{)}\leq\exp\bigg{(}-\frac{t^{2}}{32\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}}\bigg{)}.

From here we may conclude (3.14) by using similar arguments as in the proof of Theorem 3.2. Furthermore, by the proof of [WWG19, Lemma E.1], ΓK,2(ν)ν20\Gamma_{K,2}(\nu)\geq\lVert\nu\rVert^{2}\geq 0 for any νK\nu\in K, so

μΠK0(μ)|ΓK,2(μΠK0(μ))|σ0()supνKνΓK,2(ν)σ0supνK1ν(σ0/ν)()1σ01/2.\displaystyle\frac{\lVert\mu-\Pi_{K_{0}}(\mu)\rVert}{\big{\lvert}\Gamma_{K,2}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}\big{\rvert}\vee\sigma_{0}}\stackrel{{\scriptstyle(\ast)}}{{\leq}}\sup_{\nu\in K}\frac{\lVert\nu\rVert}{\Gamma_{K,2}(\nu)\vee\sigma_{0}}\leq\sup_{\nu\in K}\frac{1}{\lVert\nu\rVert\vee(\sigma_{0}/\lVert\nu\rVert)}\stackrel{{\scriptstyle(\ast\ast)}}{{\leq}}\frac{1}{\sigma_{0}^{1/2}}.

The inequality ()(\ast) follows as μΠK0(μ)K\mu-\Pi_{K_{0}}(\mu)\in K for μK\mu\in K, and ()(\ast\ast) follows as infνK{ν(σ0/ν)}inft0{t(σ0/t)}=σ01/2\inf_{\nu\in K}\big{\{}\lVert\nu\rVert\vee(\sigma_{0}/\lVert\nu\rVert)\big{\}}\geq\inf_{t\geq 0}\big{\{}t\vee(\sigma_{0}/t)\}=\sigma_{0}^{1/2}. As σ02δKδK0\sigma_{0}^{2}\asymp\delta_{K}-\delta_{K_{0}}, the second inequality (3.15) follows by the bound err08(δKδK0)1/2\textrm{err}_{0}\leq 8\big{(}\delta_{K}-\delta_{K_{0}}\big{)}^{-1/2} via Theorem 3.8.

Note that (1) is a direct consequence of (3.14) (as err0\mathrm{err}_{0} can be bounded above by Theorem 3.8 and (0)=0\mathscr{L}(0)=0) so we prove (2) below. To see the claimed power characterization, note that

𝔼ΠK(μΠK0(μ)+ξ)2𝔼ΠK(ξ)2σ0\displaystyle\frac{\mathbb{E}\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert^{2}-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}}{\sigma_{0}}
=(𝔼ΠK(μΠK0(μ)+ξ))2(𝔼ΠK(ξ))2σ0+𝒪(σ01)\displaystyle=\frac{\big{(}\mathbb{E}\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert\big{)}^{2}-\big{(}\mathbb{E}\lVert\Pi_{K}(\xi)\rVert\big{)}^{2}}{\sigma_{0}}+\mathcal{O}(\sigma_{0}^{-1})
=(𝔼ΠK(μΠK0(μ)+ξ)𝔼ΠK(ξ))\displaystyle=\big{(}\mathbb{E}\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert\big{)}
×[2𝔼ΠK(ξ)σ0+𝔼ΠK(μΠK0(μ)+ξ)𝔼ΠK(ξ)σ0]+𝒪(σ01)\displaystyle\qquad\qquad\times\bigg{[}\frac{2\mathbb{E}\lVert\Pi_{K}(\xi)\rVert}{\sigma_{0}}+\frac{\mathbb{E}\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert}{\sigma_{0}}\bigg{]}+\mathcal{O}(\sigma_{0}^{-1})
=ΓK(μΠK0(μ))[2δK+𝒪(1)2(δKδK0)+Var(VKK0)\displaystyle=\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}\bigg{[}2\sqrt{\frac{\delta_{K}+\mathcal{O}(1)}{2\big{(}\delta_{K}-\delta_{K_{0}}\big{)}+\operatorname{Var}\big{(}V_{K\cap K_{0}^{*}}\big{)}}}
+ΓK(μΠK0(μ))σ0]+𝒪(σ01)\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+\frac{\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}}{\sigma_{0}}\bigg{]}+\mathcal{O}(\sigma_{0}^{-1})
=ΓK(μΠK0(μ))[21+𝒪(δK1)(2+Var(VKK0)/δKK0)(1δK0/δK)\displaystyle=\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}\bigg{[}2\sqrt{\frac{1+\mathcal{O}(\delta_{K}^{-1})}{\big{(}2+\operatorname{Var}\big{(}V_{K\cap K_{0}^{*}}\big{)}/\delta_{K\cap K_{0}^{\ast}}\big{)}\cdot\big{(}1-\delta_{K_{0}}/\delta_{K}\big{)}}}
+ΓK(μΠK0(μ))σ0]+𝒪(σ01).\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+\frac{\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}}{\sigma_{0}}\bigg{]}+\mathcal{O}(\sigma_{0}^{-1}).

Under the growth condition σ0\sigma_{0}\to\infty, direct calculation now entails that

𝔼ΠK(μΠK0(μ)+ξ)2𝔼ΠK(ξ)2σ0w[0,+]\displaystyle\frac{\mathbb{E}\lVert\Pi_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)+\xi\big{)}\rVert^{2}-\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}}{\sigma_{0}}\to w^{\ast}\in[0,+\infty]
\displaystyle\Leftrightarrow\quad 2ΓK(μΠK0(μ))2+Var(VKK0)/δKK01δK0/δKw[0,+].\displaystyle\frac{2\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}}{\sqrt{2+\operatorname{Var}(V_{K\cap K_{0}^{*}})/\delta_{K\cap K_{0}^{*}}}\sqrt{1-\delta_{K_{0}}/\delta_{K}}}\to w^{\ast}\in[0,+\infty].

The proof is now complete. ∎

5.5. Proof of Corollary 3.11

We will prove a slightly stronger (than (3.17)) claim that condition (3.19) implies

ΓK(μΠK0(μ)).\displaystyle\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)}\to\infty. (5.4)

Suppose μΠK0(μ)\lVert\mu-\Pi_{K_{0}}(\mu)\rVert is at least LnL_{n} times the right hand side of (3.19) for some slowly growing sequence LnL_{n}\uparrow\infty. Then either (i) μΠK0(μ)LnδK1/4\lVert\mu-\Pi_{K_{0}}(\mu)\rVert\geq L_{n}\delta_{K}^{1/4}, or (ii) μΠK0(μ)<LnδK1/4\lVert\mu-\Pi_{K_{0}}(\mu)\rVert<L_{n}\delta_{K}^{1/4} and μΠK0(μ),𝔼ΠK(ξ)LnδK1/2\left\langle\mu-\Pi_{K_{0}}(\mu),\mathbb{E}\Pi_{K}(\xi)\right\rangle\geq L_{n}\delta_{K}^{1/2}. In both cases, we have μΠK0(μ)\|\mu-\Pi_{K_{0}}(\mu)\|\rightarrow\infty, since there exists some universal constant c0>0c_{0}>0 such that the right hand side of (3.19) is bounded below by c0c_{0}. In case (i), using [WWG19, (74a)],

ΓK(μΠK0(μ))\displaystyle\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)} μΠK0(μ)22μΠK0(μ)+8𝔼ΠK(ξ)2/e\displaystyle\geq\frac{\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}}{2\lVert\mu-\Pi_{K_{0}}(\mu)\rVert+8\mathbb{E}\lVert\Pi_{K}(\xi)\rVert}-2/\sqrt{e}
(1/16)μΠK0(μ)(μΠK0(μ)2/δK1/2)2/e\displaystyle\geq(1/16)\lVert\mu-\Pi_{K_{0}}(\mu)\rVert\bigwedge\Big{(}\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}/\delta_{K}^{1/2}\Big{)}-2/\sqrt{e}
(1/16)μΠK0(μ)Ln22/e\displaystyle\geq(1/16)\lVert\mu-\Pi_{K_{0}}(\mu)\rVert\bigwedge L_{n}^{2}-2/\sqrt{e}\rightarrow\infty (5.5)

as nn\rightarrow\infty, so (5.4) is verified. In case (ii), we may assume without loss of generality that μΠK0(μ)Ln1/4δK1/4\|\mu-\Pi_{K_{0}}(\mu)\|\leq L_{n}^{1/4}\delta_{K}^{1/4} because otherwise we can follow the same arguments as in the previous case. Then using [WWG19, (74b)] with

αα(μΠK0(μ))\displaystyle\alpha\equiv\alpha\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)} =1eμΠK0(μ),𝔼ΠK(ξ)2/8μΠK0(μ)2\displaystyle=1-e^{-\left\langle\mu-\Pi_{K_{0}}(\mu),\mathbb{E}\Pi_{K}(\xi)\right\rangle^{2}/8\lVert\mu-\Pi_{K_{0}}(\mu)\rVert^{2}}
1eδK1/2/81,\displaystyle\geq 1-e^{-\delta_{K}^{1/2}/8}\to 1,

we have

ΓK(μΠK0(μ))\displaystyle\Gamma_{K}\big{(}\mu-\Pi_{K_{0}}(\mu)\big{)} αμΠK0(μ),𝔼ΠKξμΠK0(μ)2αμΠK0(μ)+2𝔼ΠK(ξ)22e\displaystyle\geq\alpha\cdot\frac{\left\langle\mu-\Pi_{K_{0}}(\mu),\mathbb{E}\Pi_{K}\xi\right\rangle-\|\mu-\Pi_{K_{0}}(\mu)\|^{2}}{\alpha\|\mu-\Pi_{K_{0}}(\mu)\|+2\mathbb{E}\|\Pi_{K}(\xi)\|_{2}}-\frac{2}{\sqrt{e}}
(LnLn1/2)δK1/2Ln1/4δK1/4+δK1/2𝒪(1)\displaystyle\gtrsim\frac{(L_{n}-L_{n}^{1/2})\delta_{K}^{1/2}}{L_{n}^{1/4}\delta_{K}^{1/4}+\delta_{K}^{1/2}}-\mathcal{O}(1)\rightarrow\infty

as nn\rightarrow\infty, so (5.4) is verified. The proof is complete. ∎

6. Proofs of results in Section 4

6.1. Proof of Theorem 4.1

Lemma 6.1.

Let ξ1\xi_{1} be a standard normal random variable. Then for x>0x>0,

𝔼[ξ1𝟏ξ1x]=φ(x),𝔼[ξ12𝟏ξ1x]=xφ(x)+xφ(y)dy.\displaystyle\mathbb{E}[\xi_{1}\bm{1}_{\xi_{1}\geq x}]=\varphi(x),\quad\mathbb{E}[\xi_{1}^{2}\bm{1}_{\xi_{1}\geq x}]=x\varphi(x)+\int_{x}^{\infty}\varphi(y)\,\mathrm{d}{y}.
Proof.

The first equality follows as 𝔼[ξ1𝟏ξ1x]=xyφ(y)dy=φ(x)\mathbb{E}[\xi_{1}\bm{1}_{\xi_{1}\geq x}]=\int_{x}^{\infty}y\varphi(y)\,\mathrm{d}{y}=\varphi(x). The second equality follows as 𝔼[ξ12𝟏ξ1x]=xy2φ(y)dy=xyφ(y)dy=xφ(x)+xφ(y)dy\mathbb{E}[\xi_{1}^{2}\bm{1}_{\xi_{1}\geq x}]=\int_{x}^{\infty}y^{2}\varphi(y)\,\mathrm{d}{y}=-\int_{x}^{\infty}y\varphi^{\prime}(y)\,\mathrm{d}{y}=x\varphi(x)+\int_{x}^{\infty}\varphi(y)\,\mathrm{d}{y}. ∎
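Both identities admit a quick Monte Carlo sanity check, using that the tail integral of the density equals 1 − Φ(x) (the sample size below is an arbitrary choice):

```python
import numpy as np
from math import erf, sqrt, pi, exp

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))    # standard normal c.d.f.
phi = lambda x: exp(-x * x / 2.0) / sqrt(2.0 * pi)  # standard normal density

rng = np.random.default_rng(2)
xi = rng.normal(size=1_000_000)

for x in [0.3, 1.0, 2.0]:
    m1 = (xi * (xi >= x)).mean()       # Monte Carlo E[xi 1_{xi >= x}]
    m2 = (xi ** 2 * (xi >= x)).mean()  # Monte Carlo E[xi^2 1_{xi >= x}]
    # Lemma 6.1: E[xi 1_{xi>=x}] = phi(x),
    #            E[xi^2 1_{xi>=x}] = x phi(x) + (1 - Phi(x))
    assert abs(m1 - phi(x)) < 1e-2
    assert abs(m2 - (x * phi(x) + 1.0 - Phi(x))) < 1e-2
print("ok")
```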

Proof of Theorem 4.1.

Note that μ^K+=((μi+ξi)+)\widehat{\mu}_{K_{+}}=\big{(}(\mu_{i}+\xi_{i})_{+}\big{)}. Hence for μ0K+\mu_{0}\in K_{+},

𝔼μ0μ^K+μ02\displaystyle\mathbb{E}_{\mu_{0}}\lVert\widehat{\mu}_{K_{+}}-\mu_{0}\rVert^{2} =i=1n𝔼[((μ0)i+ξi)+(μ0)i]2\displaystyle=\sum_{i=1}^{n}\mathbb{E}\Big{[}\big{(}(\mu_{0})_{i}+\xi_{i}\big{)}_{+}-(\mu_{0})_{i}\Big{]}^{2}
=i=1n[𝔼ξi2𝟏ξi(μ0)i+(μ0)i2(ξi<(μ0)i)].\displaystyle=\sum_{i=1}^{n}\Big{[}\mathbb{E}\xi_{i}^{2}\bm{1}_{\xi_{i}\geq-(\mu_{0})_{i}}+(\mu_{0})_{i}^{2}\mathbb{P}\big{(}\xi_{i}<-(\mu_{0})_{i}\big{)}\Big{]}.

As (μ0)i0(\mu_{0})_{i}\geq 0 for 1in1\leq i\leq n, and supx>0x2(ξ<x)<\sup_{x>0}x^{2}\mathbb{P}(\xi<-x)<\infty, it follows that

𝔼μ0μ^K+μ02n.\displaystyle\mathbb{E}_{\mu_{0}}\lVert\widehat{\mu}_{K_{+}}-\mu_{0}\rVert^{2}\asymp n.

On the other hand, as under the null Jμ^K+=(𝟏i=j𝟏ξi(μ0)i)ijJ_{\widehat{\mu}_{K_{+}}}=\big{(}\bm{1}_{i=j}\bm{1}_{\xi_{i}\geq-(\mu_{0})_{i}}\big{)}_{ij},

𝔼μ0Jμ^K+F2=i,j(𝔼μ0Jμ^K+)ij2=i=1n((ξi(μ0)i))2n.\displaystyle\lVert\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K_{+}}}\rVert_{F}^{2}=\sum_{i,j}\big{(}\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}_{K_{+}}}\big{)}_{ij}^{2}=\sum_{i=1}^{n}\big{(}\mathbb{P}(\xi_{i}\geq-(\mu_{0})_{i})\big{)}^{2}\asymp n. (6.1)

The claim (1) now follows from Theorem 3.1.

For (2), let for x0x\geq 0

Q(x)\displaystyle Q(x) 𝔼ξ12𝟏ξ1x+2x𝔼ξ1𝟏ξ1xx2(ξ1<x)\displaystyle\equiv\mathbb{E}\xi_{1}^{2}\bm{1}_{\xi_{1}\geq-x}+2x\mathbb{E}\xi_{1}\bm{1}_{\xi_{1}\geq-x}-x^{2}\mathbb{P}\big{(}\xi_{1}<-x)
=1𝔼ξ12𝟏ξ1x+2x𝔼ξ1𝟏ξ1xx2(ξ1x)\displaystyle=1-\mathbb{E}\xi_{1}^{2}\bm{1}_{\xi_{1}\geq x}+2x\mathbb{E}\xi_{1}\bm{1}_{\xi_{1}\geq x}-x^{2}\mathbb{P}\big{(}\xi_{1}\geq x\big{)}
=xφ(y)dy+xφ(x)x2(1Φ(x)).\displaystyle=\int_{-\infty}^{x}\varphi(y)\,\mathrm{d}{y}+x\varphi(x)-x^{2}\big{(}1-\Phi(x)\big{)}.

The last equality follows from Lemma 6.1. Hence for all x0x\geq 0,

Q(x)\displaystyle Q^{\prime}(x) =2φ(x)+xφ(x)[2x(1Φ(x))x2φ(x)]\displaystyle=2\varphi(x)+x\varphi^{\prime}(x)-\big{[}2x(1-\Phi(x))-x^{2}\varphi(x)\big{]}
=2φ(x)2x(1Φ(x)),\displaystyle=2\varphi(x)-2x\big{(}1-\Phi(x)\big{)},
Q′′(x)\displaystyle Q^{\prime\prime}(x) =2[φ(x)1+Φ(x)+xφ(x)]=2(1+Φ(x))<0.\displaystyle=2\Big{[}\varphi^{\prime}(x)-1+\Phi(x)+x\varphi(x)\Big{]}=2\big{(}-1+\Phi(x)\big{)}<0.

This means that QQ^{\prime} is nonnegative and decreasing, with Q(0)=2φ(0)=2/2πQ^{\prime}(0)=2\varphi(0)=2/\sqrt{2\pi} and Q()=0Q^{\prime}(\infty)=0; hence QQ is strictly increasing, concave and bounded on [0,)[0,\infty) with Q(0)=1/2Q(0)=1/2.
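These properties of Q can be verified numerically from the closed form Q(x) = Φ(x) + xφ(x) − x²(1 − Φ(x)) derived above (the grid is an arbitrary choice):

```python
import numpy as np
from math import erf, sqrt, pi, exp

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))    # standard normal c.d.f.
phi = lambda x: exp(-x * x / 2.0) / sqrt(2.0 * pi)  # standard normal density

def Q(x):
    # closed form derived above: Q(x) = Phi(x) + x phi(x) - x^2 (1 - Phi(x))
    return Phi(x) + x * phi(x) - x * x * (1.0 - Phi(x))

xs = np.linspace(0.0, 5.0, 501)
qs = np.array([Q(x) for x in xs])
dq = np.diff(qs)

assert abs(Q(0.0) - 0.5) < 1e-12  # Q(0) = 1/2
assert np.all(dq > 0)             # Q strictly increasing on the grid
assert np.all(np.diff(dq) < 0)    # Q concave: first differences decrease
assert np.all(qs < 1.0)           # Q bounded (Q tends to 1 at infinity)
print("ok")
```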

Now note that for any μK+\mu\in K_{+},

mμμμ02\displaystyle m_{\mu}-\lVert\mu-\mu_{0}\rVert^{2}
=𝔼[2ξ,ΠK+(μ+ξ)μΠK+(μ+ξ)μ2]\displaystyle=\mathbb{E}\big{[}2\left\langle\xi,\Pi_{K_{+}}(\mu+\xi)-\mu\right\rangle-\lVert\Pi_{K_{+}}(\mu+\xi)-\mu\rVert^{2}\big{]}
=i=1n[2𝔼ξi2𝟏ξiμi+2μi𝔼ξi𝟏ξiμi(𝔼ξi2𝟏ξiμi+μi2(ξi<μi)))]\displaystyle=\sum_{i=1}^{n}\bigg{[}2\mathbb{E}\xi_{i}^{2}\bm{1}_{\xi_{i}\geq-\mu_{i}}+2\mu_{i}\mathbb{E}\xi_{i}\bm{1}_{\xi_{i}\geq-\mu_{i}}-\bigg{(}\mathbb{E}\xi_{i}^{2}\bm{1}_{\xi_{i}\geq-\mu_{i}}+\mu_{i}^{2}\mathbb{P}\big{(}\xi_{i}<-\mu_{i})\big{)}\bigg{)}\bigg{]}
=i=1n[𝔼ξi2𝟏ξiμi+2μi𝔼ξi𝟏ξiμiμi2(ξi<μi))]=i=1nQ(μi).\displaystyle=\sum_{i=1}^{n}\bigg{[}\mathbb{E}\xi_{i}^{2}\bm{1}_{\xi_{i}\geq-\mu_{i}}+2\mu_{i}\mathbb{E}\xi_{i}\bm{1}_{\xi_{i}\geq-\mu_{i}}-\mu_{i}^{2}\mathbb{P}\big{(}\xi_{i}<-\mu_{i})\big{)}\bigg{]}=\sum_{i=1}^{n}Q(\mu_{i}).

Using the lower bound (6.1) for σμ02\sigma_{\mu_{0}}^{2}, and an easy matching upper bound (by e.g. triangle inequality), we have σμ02n\sigma_{\mu_{0}}^{2}\asymp n. The condition (3.5) reduces to

μμ0|i=1n{S¯+(μi)S¯+((μ0)i)}+μμ02|n1/2.\displaystyle\lVert\mu-\mu_{0}\rVert\ll\bigg{\lvert}\sum_{i=1}^{n}\big{\{}\bar{S}_{+}(\mu_{i})-\bar{S}_{+}((\mu_{0})_{i})\big{\}}+\lVert\mu-\mu_{0}\rVert^{2}\bigg{\rvert}\bigvee n^{1/2}. (6.2)

(6.2) clearly holds for μμ0n1/2\lVert\mu-\mu_{0}\rVert\ll n^{1/2}. For μμ0n1/2\lVert\mu-\mu_{0}\rVert\gg n^{1/2}, as

|i=1n{S¯+(μi)S¯+((μ0)i)}|(2/2π)μμ01nμμ0,\displaystyle\bigg{\lvert}\sum_{i=1}^{n}\big{\{}\bar{S}_{+}(\mu_{i})-\bar{S}_{+}((\mu_{0})_{i})\big{\}}\bigg{\rvert}\leq(2/\sqrt{2\pi})\lVert\mu-\mu_{0}\rVert_{1}\lesssim\sqrt{n}\lVert\mu-\mu_{0}\rVert,

the right hand side of (6.2) is bounded from below by

(μμ02Cnμμ0)+n1/2μμ02μμ0,\displaystyle\big{(}\lVert\mu-\mu_{0}\rVert^{2}-C\sqrt{n}\lVert\mu-\mu_{0}\rVert\big{)}_{+}\vee n^{1/2}\asymp\lVert\mu-\mu_{0}\rVert^{2}\gg\lVert\mu-\mu_{0}\rVert,

so (6.2) holds. Hence in these two regimes, the claim follows from Theorem 3.2-(2). For μμ0n1/2\lVert\mu-\mu_{0}\rVert\asymp n^{1/2}, by the decomposition (5.1), the LRT is powerful if and only if |mμmμ0|/σμ0\lvert m_{\mu}-m_{\mu_{0}}\rvert/\sigma_{\mu_{0}}\to\infty, i.e., |i=1n{S¯+(μi)S¯+((μ0)i)}+μμ02|n1/2\big{\lvert}\sum_{i=1}^{n}\big{\{}\bar{S}_{+}(\mu_{i})-\bar{S}_{+}((\mu_{0})_{i})\big{\}}+\lVert\mu-\mu_{0}\rVert^{2}\big{\rvert}\gg n^{1/2}. The proof is now complete. ∎

6.2. Proof of Theorem 4.6

We write K×,αK_{\times,\alpha} for K×K_{\times} in the proof for notational convenience.

(1). This claim follows from the fact that σ0δK×1/2δKα1/2n1/2\sigma_{0}\asymp\delta_{K_{\times}}^{1/2}\asymp\delta_{K_{\alpha}}^{1/2}\asymp n^{1/2} (cf. [MT14, Section 6.3]) and Theorem 3.9-(1).

(2)(a). We only need to prove that the LRT is not powerful for μKα\mu\in K_{\alpha} such that μ=𝒪(1)\lVert\mu\rVert=\mathcal{O}(1). Using the decomposition (5.3), it suffices to show T(μ+ξ)T(ξ)=𝒪𝐏(n1/2)T(\mu+\xi)-T(\xi)=\mathcal{O}_{\mathbf{P}}(n^{1/2}). This follows as

T(μ+ξ)T(ξ)\displaystyle T(\mu+\xi)-T(\xi)
=μ+ξ2μ+ξΠKα(μ+ξ)2(ξ2ξΠKα(ξ)2)\displaystyle=\|\mu+\xi\|^{2}-\|\mu+\xi-\Pi_{K_{\alpha}}(\mu+\xi)\|^{2}-\big{(}\lVert\xi\rVert^{2}-\lVert\xi-\Pi_{K_{\alpha}}(\xi)\rVert^{2}\big{)}
=μ2+2μ,ξΠKα(ξ)ΠKα(μ+ξ)+μ2\displaystyle=\lVert\mu\rVert^{2}+2\left\langle\mu,\xi\right\rangle-\lVert\Pi_{K_{\alpha}}(\xi)-\Pi_{K_{\alpha}}(\mu+\xi)+\mu\rVert^{2}
2ξΠKα(ξ),ΠKα(ξ)ΠKα(μ+ξ)+μ\displaystyle\qquad-2\left\langle\xi-\Pi_{K_{\alpha}}(\xi),\Pi_{K_{\alpha}}(\xi)-\Pi_{K_{\alpha}}(\mu+\xi)+\mu\right\rangle
=𝒪𝐏(μ2+μ+ΠKα(ξ)ΠKα(μ+ξ)2+μ2\displaystyle=\mathcal{O}_{\mathbf{P}}\bigg{(}\lVert\mu\rVert^{2}+\lVert\mu\rVert+\lVert\Pi_{K_{\alpha}}(\xi)-\Pi_{K_{\alpha}}(\mu+\xi)\rVert^{2}+\lVert\mu\rVert^{2}
+ξΠKα(ξ)[ΠKα(ξ)ΠKα(μ+ξ)μ])\displaystyle\qquad+\lVert\xi-\Pi_{K_{\alpha}}(\xi)\rVert\big{[}\lVert\Pi_{K_{\alpha}}(\xi)-\Pi_{K_{\alpha}}(\mu+\xi)\rVert\vee\lVert\mu\rVert\big{]}\bigg{)}
=𝒪𝐏(n1/2).\displaystyle=\mathcal{O}_{\mathbf{P}}(n^{1/2}).

(2)(b). By (2)(a) and Theorem 3.9-(2), we have μ11\lVert\mu^{1}\rVert\gg 1 if and only if

𝔼ΠKα(μ1+ξ1)2𝔼ΠKα(ξ1)2n1/2.\displaystyle\frac{\mathbb{E}\|\Pi_{K_{\alpha}}(\mu^{1}+\xi^{1})\|^{2}-\mathbb{E}\|\Pi_{K_{\alpha}}(\xi^{1})\|^{2}}{n^{1/2}}\rightarrow\infty.

Now apply Theorem 3.9-(2) again, this time to K×K_{\times}, and conclude by noting that

𝔼ΠK×(μ+ξ)2𝔼ΠK×(ξ)2\displaystyle\mathbb{E}\|\Pi_{K_{\times}}(\mu+\xi)\|^{2}-\mathbb{E}\|\Pi_{K_{\times}}(\xi)\|^{2}
=𝔼ΠKα(μ1+ξ1)2𝔼ΠKα(ξ1)2+(μ2)2.\displaystyle=\mathbb{E}\|\Pi_{K_{\alpha}}(\mu^{1}+\xi^{1})\|^{2}-\mathbb{E}\|\Pi_{K_{\alpha}}(\xi^{1})\|^{2}+(\mu^{2})^{2}.

This completes the proof. ∎

6.3. Proof of Theorem 4.7

We first prove Proposition 4.8. The following lemma will be used. We present its proof at the end of this subsection.

Lemma 6.2.

Fix 0.1ni0.9n0.1n\leq i\leq 0.9n. Let uiu^{\ast}\leq i and h10h_{1}^{\ast}\geq 0 be defined through the following max-min formula for the isotonic LSE:

μ^i\displaystyle\widehat{\mu}_{i} =maxuiminviY¯|[u,v]minviY¯|[u,v]minh20Y¯|[ih1n2/3,i+h2n2/3].\displaystyle=\max_{u\leq i}\min_{v\geq i}\bar{Y}|_{[u,v]}\equiv\min_{v\geq i}\bar{Y}|_{[u^{\ast},v]}\equiv\min_{h_{2}\geq 0}\bar{Y}|_{[i-h_{1}^{*}n^{2/3},i+h_{2}n^{2/3}]}. (6.3)

Then there exists some C=C(L)>0C=C(L)>0 such that for any t>0t>0

(|μ^iμi|>n1/3t)(h1>t)Cexp(t2/C).\displaystyle\mathbb{P}\big{(}|\widehat{\mu}_{i}-\mu_{i}|>n^{-1/3}t\big{)}\vee\mathbb{P}(h_{1}^{\ast}>t)\leq C\exp(-t^{2}/C).
Proof of Proposition 4.8.

We write in the proof μ^=μ^K\widehat{\mu}=\widehat{\mu}_{K_{\uparrow}} and μ=μ0\mu=\mu_{0} for simplicity of notation. Note that (Jμ^)ij=𝟏μ^i=μ^j(1/|{k:μ^k=μ^i}|)(J_{\widehat{\mu}})_{ij}=\bm{1}_{\widehat{\mu}_{i}=\widehat{\mu}_{j}}(1/\lvert\{k:\widehat{\mu}_{k}=\widehat{\mu}_{i}\}\rvert). By the Cauchy-Schwarz inequality,

(μ^i=μ^j)\displaystyle\mathbb{P}(\widehat{\mu}_{i}=\widehat{\mu}_{j}) =𝔼[𝟏μ^i=μ^j1|{k:μ^k=μ^i}|1/2|{k:μ^k=μ^i}|1/2]\displaystyle=\mathbb{E}\bigg{[}\bm{1}_{\widehat{\mu}_{i}=\widehat{\mu}_{j}}\cdot\frac{1}{\lvert\{k:\widehat{\mu}_{k}=\widehat{\mu}_{i}\}\rvert^{1/2}}\cdot\lvert\{k:\widehat{\mu}_{k}=\widehat{\mu}_{i}\}\rvert^{1/2}\bigg{]}
(𝔼Jμ^)ij𝔼|{k:μ^k=μ^i}|.\displaystyle\leq\sqrt{(\mathbb{E}J_{\widehat{\mu}})_{ij}}\cdot\sqrt{\mathbb{E}\lvert\{k:\widehat{\mu}_{k}=\widehat{\mu}_{i}\}\rvert}.

This implies that

(𝔼Jμ^)ij2(μ^i=μ^j)𝔼|{k:μ^k=μ^i}|.\displaystyle(\mathbb{E}J_{\widehat{\mu}})_{ij}\geq\frac{\mathbb{P}^{2}(\widehat{\mu}_{i}=\widehat{\mu}_{j})}{\mathbb{E}\lvert\{k:\widehat{\mu}_{k}=\widehat{\mu}_{i}\}\rvert}. (6.4)

We will bound the denominator from above and the numerator from below in the above display separately in the regime {(i,j):|ij|κn2/3,0.1ni,j0.9n}\{(i,j):\lvert i-j\rvert\leq\kappa n^{2/3},0.1n\leq i,j\leq 0.9n\}, where κ=κ(L)>0\kappa=\kappa(L)>0 is a constant to be specified below.

Fix 0.1ni0.9n0.1n\leq i\leq 0.9n. First we provide an upper bound for the denominator in (6.4). By Lemma 6.2 and using the notation defined therein, there exists some large c=c(L,ε)>1c=c(L,\varepsilon)>1 such that on an event E0E_{0} with probability 1ε1-\varepsilon,

|μ^iμi|\displaystyle\lvert\widehat{\mu}_{i}-\mu_{i}\rvert cn1/3,\displaystyle\leq cn^{-1/3}, (6.5)

and

(E1{h1c})Cec2/C,\displaystyle\mathbb{P}\big{(}E_{1}\equiv\big{\{}h_{1}^{\ast}\geq c\big{\}}\big{)}\leq Ce^{-c^{2}/C},

where C=C(L)>0C=C(L)>0 is a constant depending on LL only. Hence integrating the tail leads to the following: for some constant C=C(L)>0C^{\prime}=C^{\prime}(L)>0,

𝔼|{ki:μ^k=μ^i}|\displaystyle\mathbb{E}\lvert\{k\leq i:\widehat{\mu}_{k}=\widehat{\mu}_{i}\}\rvert Cn2/3.\displaystyle\leq C^{\prime}n^{2/3}.

Similarly we can handle the case kik\geq i, so we arrive at

𝔼|{k:μ^k=μ^i}|\displaystyle\mathbb{E}\lvert\{k:\widehat{\mu}_{k}=\widehat{\mu}_{i}\}\rvert C′′n2/3\displaystyle\leq C^{\prime\prime}n^{2/3} (6.6)

for some constant C′′=C′′(L)>0C^{\prime\prime}=C^{\prime\prime}(L)>0.

Next we provide a lower bound for the numerator of (6.4). On the event E2={1(ic100n2/3)ui}E_{2}=\{1\vee(i-c^{-100}n^{2/3})\leq u^{\ast}\leq i\} (there is nothing special about the constant 100100—a large enough value suffices), we have

n1/3(μ^iμi)=minvin1/3(μ¯|[u,v]μi+ξ¯|[u,v])\displaystyle n^{1/3}\big{(}\widehat{\mu}_{i}-\mu_{i}\big{)}=\min_{v\geq i}n^{1/3}\big{(}\bar{\mu}|_{[u^{\ast},v]}-\mu_{i}+\bar{\xi}|_{[u^{\ast},v]}\big{)}
minivn(i+c10n2/3)n1/3(μ¯|[u,v]μi+ξ¯|[u,v])\displaystyle\leq\min_{i\leq v\leq n\wedge(i+c^{-10}n^{2/3})}n^{1/3}\big{(}\bar{\mu}|_{[u^{\ast},v]}-\mu_{i}+\bar{\xi}|_{[u^{\ast},v]}\big{)}
minivn(i+c10n2/3)n1/3ξ¯|[u,v]+maxivn(i+c10n2/3)n1/3(μ¯|[u,v]μi)\displaystyle\leq\min_{i\leq v\leq n\wedge(i+c^{-10}n^{2/3})}n^{1/3}\bar{\xi}|_{[u^{\ast},v]}+\max_{i\leq v\leq n\wedge(i+c^{-10}n^{2/3})}n^{1/3}\big{(}\bar{\mu}|_{[u^{\ast},v]}-\mu_{i}\big{)}
min0h2c10W(h1)+W(h2)+Rnh1+h2+|𝒪(n2/3)|+𝒪(c),\displaystyle\leq\min_{0\leq h_{2}\leq c^{-10}}\frac{W(-h_{1}^{\ast})+W(h_{2})+R_{n}}{h_{1}^{\ast}+h_{2}+\lvert\mathcal{O}(n^{-2/3})\rvert}+\mathcal{O}(c),

where Rn=𝒪a.s.(logn/n1/3)R_{n}=\mathcal{O}_{a.s.}(\log n/n^{1/3}) comes from the Komlós-Major-Tusnády strong embedding, and WW denotes a standard two-sided Brownian motion starting from 0. The bound 𝒪(c)\mathcal{O}(c) for the bias term follows as

maxivn(i+c10n2/3)n1/3(μ¯|[u,v]μi)maxivn(i+cn2/3)n1/3(μ¯|[i,v]μi)\displaystyle\max_{i\leq v\leq n\wedge(i+c^{-10}n^{2/3})}n^{1/3}\big{(}\bar{\mu}|_{[u^{\ast},v]}-\mu_{i}\big{)}\leq\max_{i\leq v\leq n\wedge(i+cn^{2/3})}n^{1/3}\big{(}\bar{\mu}|_{[i,v]}-\mu_{i}\big{)}
maxn2/3h2cn1/3[i,i+h2n2/3](μμi)h2n2/3+1\displaystyle\leq\max_{n^{-2/3}\leq h_{2}\leq c}\frac{n^{1/3}\sum_{\ell\in[i,i+h_{2}n^{2/3}]\cap\mathbb{Z}}(\mu_{\ell}-\mu_{i})}{\left\lfloor h_{2}n^{2/3}\right\rfloor+1}
maxn2/3h2cn1/3(L/n)[0,h2n2/3]h2n2/3+1=𝒪(c).\displaystyle\leq\max_{n^{-2/3}\leq h_{2}\leq c}\frac{n^{1/3}(L/n)\sum_{\ell\in[0,h_{2}n^{2/3}]\cap\mathbb{Z}}\ell}{\left\lfloor h_{2}n^{2/3}\right\rfloor+1}=\mathcal{O}(c). (6.7)

Now on the event E2E_{2},

W(h1)sup0h1c100W(h1)=dc50sup0t1W(t)c50Z.\displaystyle W(-h_{1}^{\ast})\leq\sup_{0\leq h_{1}\leq c^{-100}}W(-h_{1})\stackrel{{\scriptstyle d}}{{=}}c^{-50}\sup_{0\leq t\leq 1}W(t)\equiv c^{-50}Z.

By reflection principle for Brownian motion, for any u>0u>0,

({W(h1)>c50u}E2)(Z>u)=2(W(1)>u)2eu2/2.\displaystyle\mathbb{P}\big{(}\{W(-h_{1}^{\ast})>c^{-50}u\}\cap E_{2}\big{)}\leq\mathbb{P}\big{(}Z>u)=2\mathbb{P}\big{(}W(1)>u\big{)}\leq 2e^{-u^{2}/2}.
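The reflection-principle identity P(sup over [0,1] of W > u) = 2P(W(1) > u) invoked here can be checked by simulating a Gaussian random walk (the discretization and sample sizes are arbitrary choices; the discrete maximum slightly undershoots the continuous supremum, hence the loose tolerance):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
n_steps, n_paths, u = 1000, 10_000, 1.0

# Random-walk approximation of W on [0,1]; compare the simulated tail of
# sup_{[0,1]} W with the reflection-principle value 2 P(W(1) > u).
inc = rng.normal(scale=sqrt(1.0 / n_steps), size=(n_paths, n_steps))
sup_W = np.cumsum(inc, axis=1).max(axis=1)
p_sup = (sup_W > u).mean()
p_ref = 2.0 * (1.0 - 0.5 * (1.0 + erf(u / sqrt(2.0))))  # 2 P(N(0,1) > u)
print(abs(p_sup - p_ref) < 0.03)
```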

Let h2h_{2}^{\circ} be such that

W(h2)inf0h2c10W(h2)=dc5inf0t1W(t)=c5Z.\displaystyle W(h_{2}^{\circ})\equiv\inf_{0\leq h_{2}\leq c^{-10}}W(h_{2})\stackrel{{\scriptstyle d}}{{=}}c^{-5}\inf_{0\leq t\leq 1}W(t)=-c^{-5}Z.

So for u>0u>0,

(W(h2)<c5u)=(Z>u)=1(|𝒩(0,1)|u)12u.\displaystyle\mathbb{P}(W(h_{2}^{\circ})<-c^{-5}u)=\mathbb{P}(Z>u)=1-\mathbb{P}\big{(}\lvert\mathcal{N}(0,1)\rvert\leq u\big{)}\geq 1-2u.

Hence on the event E2E_{2} intersected with an event with probability at least 14ε1-4\varepsilon,

min0h2c10W(h1)+W(h2)+Rnh1+h2+|𝒪(n2/3)|W(h1)+W(h2)+Rnh1+h2+|𝒪(n2/3)|\displaystyle\min_{0\leq h_{2}\leq c^{-10}}\frac{W(-h_{1}^{\ast})+W(h_{2})+R_{n}}{h_{1}^{\ast}+h_{2}+\lvert\mathcal{O}(n^{-2/3})\rvert}\leq\frac{W(-h_{1}^{\ast})+W(h_{2}^{\circ})+R_{n}}{h_{1}^{\ast}+h_{2}^{\circ}+\lvert\mathcal{O}(n^{-2/3})\rvert}
c502log(1/ε)c5ε+Rnh1+h2+|𝒪(n2/3)|Cεc5.\displaystyle\leq\frac{c^{-50}\sqrt{2\log(1/\varepsilon)}-c^{-5}\varepsilon+R_{n}}{h_{1}^{\ast}+h_{2}^{\circ}+\lvert\mathcal{O}(n^{-2/3})\rvert}\leq-C_{\varepsilon}\cdot c^{5}.

where the last inequality follows by first choosing c=c(ε)c=c(\varepsilon) large enough and then nn large enough, and by using h1+h2c100+c102c10h_{1}^{\ast}+h_{2}^{\circ}\leq c^{-100}+c^{-10}\leq 2c^{-10} on E2E_{2}. Combining the above estimates, we see that

n1/3(μ^iμi)Cεc5\displaystyle n^{1/3}\big{(}\widehat{\mu}_{i}-\mu_{i}\big{)}\leq-C_{\varepsilon}^{\prime}\cdot c^{5}

on the event E2E_{2} intersected with an event with probability at least 14ε1-4\varepsilon, when cc and nn are chosen large enough, depending on L,εL,\varepsilon. In view of (6.5), this event can occur with probability at most ε\varepsilon for cc large, so we have proved that (E2)5ε\mathbb{P}(E_{2})\leq 5\varepsilon for large enough c=c(L,ε)>1c=c(L,\varepsilon)>1 and n=n(L,ε)n=n(L,\varepsilon)\in\mathbb{N}. This means that (μ^i=μ^j)15ε\mathbb{P}(\widehat{\mu}_{i}=\widehat{\mu}_{j})\geq 1-5\varepsilon for 1(ic100n2/3)ji1\vee(i-c^{-100}n^{2/3})\leq j\leq i for large enough c=c(L,ε)>1c=c(L,\varepsilon)>1 and n=n(L,ε)n=n(L,\varepsilon)\in\mathbb{N}. Similarly one can handle the regime ij(i+c100n2/3)ni\leq j\leq(i+c^{-100}n^{2/3})\wedge n. In summary, we have proved that there exists some κ=κ(L)>0\kappa=\kappa(L)>0 such that

(μ^i=μ^j)1/2\displaystyle\mathbb{P}(\widehat{\mu}_{i}=\widehat{\mu}_{j})\geq 1/2 (6.8)

holds for {(i,j):|ij|κn2/3,0.1ni,j0.9n}\{(i,j):\lvert i-j\rvert\leq\kappa n^{2/3},0.1n\leq i,j\leq 0.9n\} for nn large enough. The claim of the proposition now follows by plugging (6.6) and (6.8) into (6.4). ∎

Now we are in a position to prove Theorem 4.7.

Proof of Theorem 4.7-(1).

We write in the proof μ^=μ^K\widehat{\mu}=\widehat{\mu}_{K} and μ=μ0\mu=\mu_{0} for simplicity of notation. μ,𝔼μ\mathbb{P}_{\mu},\mathbb{E}_{\mu} are abbreviated as ,𝔼\mathbb{P},\mathbb{E} when no confusion can arise. For κ>0\kappa>0, let II(κ){i:0.1n+(1)κn2/3i(0.1n+κn2/3)0.9n}I_{\ell}\equiv I_{\ell}(\kappa)\equiv\{i:0.1n+(\ell-1)\cdot\kappa n^{2/3}\leq i\leq\big{(}0.1n+\ell\cdot\kappa n^{2/3}\big{)}\wedge 0.9n\} and let 0\ell_{0} be the largest integer for which I0[0.1n,0.9n]I_{\ell_{0}}\subset[0.1n,0.9n]. Clearly |I|κn2/3\lvert I_{\ell}\rvert\asymp\kappa n^{2/3} for all 101\leq\ell\leq\ell_{0} and 0n1/3/κ\ell_{0}\asymp n^{1/3}/\kappa. Using the κ\kappa specified in Proposition 4.8, we have

𝔼Jμ^F2=i,j(𝔼Jμ^)ij2\displaystyle\lVert\mathbb{E}J_{\widehat{\mu}}\rVert_{F}^{2}=\sum_{i,j}\big{(}\mathbb{E}J_{\widehat{\mu}}\big{)}_{ij}^{2} =10(i,j)I×I(𝔼(Jμ^)ij)2\displaystyle\geq\sum_{\ell=1}^{\ell_{0}}\sum_{(i,j)\in I_{\ell}\times I_{\ell}}(\mathbb{E}(J_{\widehat{\mu}})_{ij}\big{)}^{2}
=10(i,j)I×In4/3n1/3.\displaystyle\gtrsim\sum_{\ell=1}^{\ell_{0}}\sum_{(i,j)\in I_{\ell}\times I_{\ell}}n^{-4/3}\asymp n^{1/3}. (6.9)

On the other hand, by e.g., [Zha02], under the condition of Theorem 4.7, we have

𝔼μ^μ2n1/3.\displaystyle\mathbb{E}\lVert\widehat{\mu}-\mu\rVert^{2}\lesssim n^{1/3}.

The claim now follows by applying Theorem 3.1 (by ignoring the bias term in the denominator) with the above two displays. ∎
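The n^{1/3} risk bound invoked from [Zha02] can also be examined numerically. The sketch below is our own illustration, not part of the argument; the mean f(t)=t and the pool-adjacent-violators routine are assumptions made for the demonstration. The n^{1/3} scaling predicts a risk ratio near (1600/200)^{1/3}=2 between the two sample sizes:

```python
import numpy as np

def pava(y):
    # Pool-adjacent-violators: isotonic least squares fit of y.
    vals, wts = [], []
    for v in y:
        vals.append(float(v)); wts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            wts[-2] = w
            vals.pop(); wts.pop()
    return np.repeat(vals, wts)

rng = np.random.default_rng(1)

def iso_risk(n, reps=200):
    # Monte Carlo estimate of E||mu_hat - mu||^2 for the mean mu_i = i/n.
    mu = np.arange(1, n + 1) / n
    return np.mean([np.sum((pava(mu + rng.normal(size=n)) - mu) ** 2)
                    for _ in range(reps)])

r_small, r_large = iso_risk(200), iso_risk(1600)
ratio = r_large / r_small   # n^{1/3} scaling suggests a value near 2
```

Lower-order (e.g. boundary) terms perturb the ratio somewhat at these sample sizes, but the growth is clearly far slower than linear in n.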

Proof of Theorem 4.7-(2).

Following the notation used in [MW00], let \widetilde{W} be the greatest convex minorant of t\mapsto W(t)+t^{2}/2, t\in\mathbb{R}, and set a=-\mathbb{E}[\widetilde{W}(0)]>0, b=\mathbb{E}[\widetilde{W}^{\prime}(0)^{2}]>0. Using the same techniques as in [MW00, Theorem 2, Corollary 4], but carrying the Taylor expansion to second order, it can be shown that for all C^{2} monotone functions f:[0,1]\to\mathbb{R} whose first derivative f^{\prime} is bounded away from 0 and \infty and whose second derivative f^{\prime\prime} is bounded from above,

𝔼μfdivμ^K\displaystyle\mathbb{E}_{\mu_{f}}\operatorname{div}\widehat{\mu}_{K_{\uparrow}} =(a+b)n1/301(f(t))2/3dt+𝒪(1),\displaystyle=(a+b)\cdot n^{1/3}\int_{0}^{1}(f^{\prime}(t))^{2/3}\,\mathrm{d}{t}+\mathcal{O}(1),
𝔼μfμ^Kμf2\displaystyle\mathbb{E}_{\mu_{f}}\lVert\widehat{\mu}_{K_{\uparrow}}-\mu_{f}\rVert^{2} =bn1/301(f(t))2/3dt+𝒪(1).\displaystyle=b\cdot n^{1/3}\int_{0}^{1}(f^{\prime}(t))^{2/3}\,\mathrm{d}{t}+\mathcal{O}(1). (6.10)

Here μf=(f(i/n))i=1n\mu_{f}=(f(i/n))_{i=1}^{n} for a generic f:[0,1]f:[0,1]\to\mathbb{R}, and the 𝒪(1)\mathcal{O}(1) term in the above display depends only on the upper and lower bounds for ff^{\prime} and the upper bound of f′′f^{\prime\prime}. Hence for the prescribed ff,

mμfμfμf02\displaystyle m_{\mu_{f}}-\lVert\mu_{f}-\mu_{f_{0}}\rVert^{2} =2𝔼μfdivμ^K𝔼μfμ^Kμf2\displaystyle=2\mathbb{E}_{\mu_{f}}\operatorname{div}\widehat{\mu}_{K_{\uparrow}}-\mathbb{E}_{\mu_{f}}\lVert\widehat{\mu}_{K_{\uparrow}}-\mu_{f}\rVert^{2}
=(2a+b)n1/301(f(t))2/3dt+𝒪(1).\displaystyle=(2a+b)\cdot n^{1/3}\int_{0}^{1}(f^{\prime}(t))^{2/3}\,\mathrm{d}{t}+\mathcal{O}(1).

On the other hand, for the prescribed ff, (6.3) provides a lower bound for σμf2\sigma_{\mu_{f}}^{2}, while the Gaussian-Poincaré inequality yields a matching upper bound:

n1/3σμf24𝔼μfμ^Kμf2n1/3.\displaystyle n^{1/3}\lesssim\sigma^{2}_{\mu_{f}}\leq 4\mathbb{E}_{\mu_{f}}\lVert\widehat{\mu}_{K_{\uparrow}}-\mu_{f}\rVert^{2}\lesssim n^{1/3}.

Now with δ[1]δ\lVert\delta\rVert_{[1]}\equiv\int\delta^{\prime}, condition (3.5) reduces to

μfμf0|mμfmμf0|σμf\displaystyle\lVert\mu_{f}-\mu_{f_{0}}\rVert\ll\big{\lvert}m_{\mu_{f}}-m_{\mu_{f_{0}}}\big{\rvert}\vee\sigma_{\mu_{f}}
\displaystyle\Leftrightarrow\quad n(ff0)2+𝒪(1)\displaystyle\sqrt{n\int(f-f_{0})^{2}+\mathcal{O}(1)}
|(2a+b)n1/3{(f)2/3(f0)2/3}+𝒪(1)+n(ff0)2|n1/6\displaystyle\quad\ll\bigg{\lvert}(2a+b)n^{1/3}\int\Big{\{}(f^{\prime})^{2/3}-(f^{\prime}_{0})^{2/3}\Big{\}}+\mathcal{O}(1)+n\int(f-f_{0})^{2}\bigg{\rvert}\bigvee n^{1/6}
\displaystyle\Leftrightarrow\quad nρn2+𝒪(1)||𝒪(1)|n1/3ρnδ[1]+𝒪(1)+nρn2|n1/6,\displaystyle\sqrt{n\rho_{n}^{2}+\mathcal{O}(1)}\ll\bigg{\lvert}-\lvert\mathcal{O}(1)\rvert n^{1/3}\rho_{n}\lVert\delta\rVert_{[1]}+\mathcal{O}(1)+n\rho_{n}^{2}\bigg{\rvert}\bigvee n^{1/6}, (6.11)

where in the last equivalence we used that

{(f)2/3(f0)2/3}\displaystyle\int\Big{\{}(f^{\prime})^{2/3}-(f^{\prime}_{0})^{2/3}\Big{\}} ={(f)2/3(f+ρnδ)2/3}=𝒪(ρn)δ.\displaystyle=\int\Big{\{}(f^{\prime})^{2/3}-(f^{\prime}+\rho_{n}\delta^{\prime})^{2/3}\Big{\}}=-\mathcal{O}(\rho_{n})\int\delta^{\prime}.

By Theorem 3.2-(2), under (6.3) the LRT is power consistent if and only if

||𝒪(1)|n1/3ρnδ[1]+nρn2|n1/6.\displaystyle\frac{\big{\lvert}-\lvert\mathcal{O}(1)\rvert n^{1/3}\rho_{n}\lVert\delta\rVert_{[1]}+n\rho_{n}^{2}\big{\rvert}}{n^{1/6}}\to\infty. (6.12)

We have two cases:

  1. (1)

    If nρn2n1/3ρn|δ[1]|ρnn2/3|δ[1]|n\rho_{n}^{2}\gg n^{1/3}\rho_{n}\lvert\lVert\delta\rVert_{[1]}\rvert\Leftrightarrow\rho_{n}\gg n^{-2/3}\lvert\lVert\delta\rVert_{[1]}\rvert, then (6.12) requires ρnn5/12\rho_{n}\gg n^{-5/12}.

  2. (2)

    If nρn2n1/3ρn|δ[1]|ρnn2/3|δ[1]|n\rho_{n}^{2}\ll n^{1/3}\rho_{n}\lvert\lVert\delta\rVert_{[1]}\rvert\Leftrightarrow\rho_{n}\ll n^{-2/3}\lvert\lVert\delta\rVert_{[1]}\rvert, then (6.12) requires ρnn1/6/|δ[1]|\rho_{n}\gg n^{-1/6}/\lvert\lVert\delta\rVert_{[1]}\rvert. This is not feasible as |δ[1]|=|δ|=𝒪(1)\lvert\lVert\delta\rVert_{[1]}\rvert=\lvert\int\delta^{\prime}\rvert=\mathcal{O}(1).

To summarize, (6.12) is equivalent to requiring ρnn5/12\rho_{n}\gg n^{-5/12}. In this regime (6.3) also holds. The proof is complete. ∎

Proof of Lemma 6.2.

By the monotonicity of μ\mu, we have

μ^iμi\displaystyle\widehat{\mu}_{i}-\mu_{i} =minh20Y¯|[ih1n2/3,i+h2n2/3]μiY¯|[ih1n2/3,i+n2/3]μi\displaystyle=\min_{h_{2}\geq 0}\bar{Y}|_{[i-h_{1}^{*}n^{2/3},i+h_{2}n^{2/3}]}-\mu_{i}\leq\bar{Y}|_{[i-h_{1}^{*}n^{2/3},i+n^{2/3}]}-\mu_{i}
=(μ¯|[ih1n2/3,i+n2/3]μi)+ξ¯|[ih1n2/3,i+n2/3]\displaystyle=\big{(}\bar{\mu}|_{[i-h_{1}^{*}n^{2/3},i+n^{2/3}]}-\mu_{i}\big{)}+\bar{\xi}|_{[i-h_{1}^{*}n^{2/3},i+n^{2/3}]}
(μi+n2/3μi)+maxh10|ξ¯|[ih1n2/3,i+n2/3]|\displaystyle\leq\big{(}\mu_{\left\lceil i+n^{2/3}\right\rceil}-\mu_{i}\big{)}+\max_{h_{1}\geq 0}|\bar{\xi}|_{[i-h_{1}n^{2/3},i+n^{2/3}]}|
C1n1/3+maxh10|ξ¯|[ih1n2/3,i+n2/3]|,\displaystyle\leq C_{1}n^{-1/3}+\max_{h_{1}\geq 0}|\bar{\xi}|_{[i-h_{1}n^{2/3},i+n^{2/3}]}|,

where the last inequality follows by (4.2). Note for any t>0t>0, a standard blocking argument (cf. [HZ19, Lemma 2]) yields

(maxh10|ξ¯|[ih1n2/3,i+n2/3]|>tn1/3)Cet2/C.\displaystyle\mathbb{P}\bigg{(}\max_{h_{1}\geq 0}|\bar{\xi}|_{[i-h_{1}n^{2/3},i+n^{2/3}]}|>tn^{-1/3}\bigg{)}\leq Ce^{-t^{2}/C}. (6.13)

This establishes the one-sided estimate for \mathbb{P}(\widehat{\mu}_{i}-\mu_{i}>n^{-1/3}t); the other side is similar.

Now consider (h1>t)\mathbb{P}(h_{1}^{\ast}>t). On the event {h1>t}\{h_{1}^{*}>t\}, we have

μ^iμi\displaystyle\widehat{\mu}_{i}-\mu_{i} =minh20Y¯|[ih1n2/3,i+h2n2/3]\displaystyle=\min_{h_{2}\geq 0}\bar{Y}|_{[i-h_{1}^{*}n^{2/3},i+h_{2}n^{2/3}]}
(μ¯|[ih1n2/3,i+n2/3]μi)+ξ¯|[ih1n2/3,i+n2/3]\displaystyle\leq\big{(}\bar{\mu}|_{[i-h_{1}^{*}n^{2/3},i+n^{2/3}]}-\mu_{i}\big{)}+\bar{\xi}|_{[i-h_{1}^{*}n^{2/3},i+n^{2/3}]}
(μ¯|[itn2/3,i+n2/3]μi)+maxh10|ξ¯|[ih1n2/3,i+n2/3]|\displaystyle\leq\big{(}\bar{\mu}|_{[i-tn^{2/3},i+n^{2/3}]}-\mu_{i}\big{)}+\max_{h_{1}\geq 0}\big{\lvert}\bar{\xi}|_{[i-h_{1}n^{2/3},i+n^{2/3}]}\big{\rvert}
C2tn1/3+C3n1/3+maxh10|ξ¯|[ih1n2/3,i+n2/3]|,\displaystyle\leq-C_{2}\cdot tn^{-1/3}+C_{3}\cdot n^{-1/3}+\max_{h_{1}\geq 0}\big{\lvert}\bar{\xi}|_{[i-h_{1}n^{2/3},i+n^{2/3}]}\big{\rvert},

where the last inequality follows from calculations similar to (6.3), but now using both the upper and lower bound parts of (4.2). Choosing t2C3/C2t\geq 2C_{3}/C_{2}, and replacing tt in (6.13) by C2t/4C_{2}t/4, we see that

(h1>t)(μ^iμi(C2/4)tn1/3)+C4et2/C4.\displaystyle\mathbb{P}\big{(}h_{1}^{\ast}>t\big{)}\leq\mathbb{P}\big{(}\widehat{\mu}_{i}-\mu_{i}\leq-(C_{2}/4)tn^{-1/3}\big{)}+C_{4}e^{-t^{2}/C_{4}}.

The claim follows by adjusting constants. ∎

6.4. Proof of Theorem 4.9

We first prove Proposition 4.10. The following lemma will be useful to control the term 𝔭λ,μ0\mathfrak{p}_{\lambda,\mu_{0}} therein. We present its proof at the end of this subsection.

Lemma 6.3.

For any θp\theta\in\mathbb{R}^{p} with θ1λ\|\theta\|_{1}\leq\lambda, let μXθKX,λ\mu\equiv X\theta\in K_{X,\lambda}. There exists some universal constant C>0C>0 such that for t1t\geq 1,

μ(θ^01θ1+tpnλmin(Σ))et2/C.\displaystyle\mathbb{P}_{\mu}\bigg{(}\|\widehat{\theta}^{0}\|_{1}\geq\lVert\theta\rVert_{1}+t\sqrt{\frac{p}{n\lambda_{\min}(\Sigma)}}\bigg{)}\leq e^{-t^{2}/C}.
Proof of Proposition 4.10.

(1). We will derive an explicit formula for Jμ^J_{\widehat{\mu}} using the results of [Kat09]. First note by the chain rule that

Jμ^(ξ)=μ^ξ=μ^θ^θ^θ^0θ^0ξ=Xθ^θ^0(XX)1X.\displaystyle J_{\widehat{\mu}}(\xi)=\frac{\partial\widehat{\mu}}{\partial\xi}=\frac{\partial\widehat{\mu}}{\partial\widehat{\theta}}\frac{\partial\widehat{\theta}}{\partial\widehat{\theta}^{0}}\frac{\partial\widehat{\theta}^{0}}{\partial\xi}=X\frac{\partial\widehat{\theta}}{\partial\widehat{\theta}^{0}}(X^{\top}X)^{-1}X^{\top}.

Let K~K~λ{θp:θ1λ}\widetilde{K}\equiv\widetilde{K}_{\lambda}\equiv\{\theta\in\mathbb{R}^{p}:\|\theta\|_{1}\leq\lambda\}. For each m{1,,p}m\in\{1,\ldots,p\}, suppose there are NmN_{m} faces of K~λ\widetilde{K}_{\lambda} of dimension mm, denoted as {Fm,}=1Nm\{F_{m,\ell}\}_{\ell=1}^{N_{m}}. Then we can partition K~λ\partial\widetilde{K}_{\lambda} as {Fm,}m,\{F_{m,\ell}\}_{m,\ell}. Let {E0,{Em,}m,}\big{\{}E_{0},\{E_{m,\ell}\}_{m,\ell}\big{\}} be a partition of p\mathbb{R}^{p} defined as E0K~λE_{0}\equiv\widetilde{K}_{\lambda}, Em,{yp:ΠK~λ(y)Fm,}E_{m,\ell}\equiv\{y\in\mathbb{R}^{p}:\Pi_{\widetilde{K}_{\lambda}}(y)\in F_{m,\ell}\}. Let E0,Em,E_{0}^{\circ},E_{m,\ell}^{\circ} be the interiors of E0,Em,E_{0},E_{m,\ell}, respectively. Since K~λ\widetilde{K}_{\lambda} is a polyhedron, it follows by [Kat09, Equation (3.6) and Remark 3.3] that when θ^0Em,\widehat{\theta}^{0}\in E^{\circ}_{m,\ell},

θ^θ^0=Bm,(Bm,XXBm,)1Bm,XX,\displaystyle\frac{\partial\widehat{\theta}}{\partial\widehat{\theta}^{0}}=B_{m,\ell}(B_{m,\ell}^{\top}X^{\top}XB_{m,\ell})^{-1}B_{m,\ell}^{\top}X^{\top}X,

where Bm,=[b1,,,bpm,]p×(pm)B_{m,\ell}=[b_{1,\ell},\ldots,b_{p-m,\ell}]\in\mathbb{R}^{p\times(p-m)} whose columns are linearly independent and span the tangent space at any point in Fm,F_{m,\ell}. Hence on the event {θ^0Em,}\{\widehat{\theta}^{0}\in E^{\circ}_{m,\ell}\},

Jμ^(ξ)=XBm,(Bm,XXBm,)1(Bm,X),\displaystyle J_{\widehat{\mu}}(\xi)=XB_{m,\ell}(B_{m,\ell}^{\top}X^{\top}XB_{m,\ell})^{-1}(B_{m,\ell}^{\top}X^{\top}),

which is a projection matrix onto the column space of XBm,XB_{m,\ell}. In other words, a.e. on n\mathbb{R}^{n},

Jμ^(ξ)=𝟏θ^0E0Z0+m=1p=1Nm𝟏θ^0Em,Zm,,\displaystyle J_{\widehat{\mu}}(\xi)=\bm{1}_{\widehat{\theta}^{0}\in E_{0}^{\circ}}\cdot Z_{0}+\sum_{m=1}^{p}\sum_{\ell=1}^{N_{m}}\bm{1}_{\widehat{\theta}^{0}\in E^{\circ}_{m,\ell}}\cdot Z_{m,\ell}, (6.14)

where

Z0X(XX)1X,Zm,XBm,(Bm,XXBm,)1(Bm,X).\displaystyle Z_{0}\equiv X(X^{\top}X)^{-1}X^{\top},\quad Z_{m,\ell}\equiv XB_{m,\ell}(B_{m,\ell}^{\top}X^{\top}XB_{m,\ell})^{-1}(B_{m,\ell}^{\top}X^{\top}).

Hence,

𝔼μ0Jμ^(ξ)=Z0μ0(θ^0E0)+m=1p=1Nmμ0(θ^0Em,)Zm,,\displaystyle\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}}(\xi)=Z_{0}\cdot\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\in E^{\circ}_{0}\big{)}+\sum_{m=1}^{p}\sum_{\ell=1}^{N_{m}}\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\in E^{\circ}_{m,\ell}\big{)}\cdot Z_{m,\ell},

and therefore

𝔼μ0Jμ^F2\displaystyle\lVert\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}}\rVert_{F}^{2} =i,j=1n(𝔼μ0Jμ^)ij2\displaystyle=\sum_{i,j=1}^{n}\big{(}\mathbb{E}_{\mu_{0}}J_{\widehat{\mu}}\big{)}_{ij}^{2}
=i,j=1n[(Z0)ij(θ^0E0)+m,(θ^0Em,)(Zm,)ij]2\displaystyle=\sum_{i,j=1}^{n}\bigg{[}(Z_{0})_{ij}\mathbb{P}\big{(}\widehat{\theta}^{0}\in E^{\circ}_{0}\big{)}+\sum_{m,\ell}\mathbb{P}\big{(}\widehat{\theta}^{0}\in E^{\circ}_{m,\ell}\big{)}(Z_{m,\ell})_{ij}\bigg{]}^{2}
i,j=1n[(Z0)ijμ0(θ^0E0)μ0(θ^0E0)maxm,|(Zm,)ij|]+2\displaystyle\geq\sum_{i,j=1}^{n}\bigg{[}(Z_{0})_{ij}\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\in E^{\circ}_{0}\big{)}-\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\notin E^{\circ}_{0}\big{)}\max_{m,\ell}\lvert(Z_{m,\ell})_{ij}\rvert\bigg{]}_{+}^{2}
()i,j=1n[(Z0)ijμ0(θ^0E0)μ0(θ^0E0)]+2\displaystyle\stackrel{{\scriptstyle(*)}}{{\geq}}\sum_{i,j=1}^{n}\Big{[}(Z_{0})_{ij}\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\in E^{\circ}_{0}\big{)}-\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\notin E^{\circ}_{0}\big{)}\Big{]}_{+}^{2}
()i,j=1n[(Z0)ij2μ0(θ^0E0)]+2\displaystyle\stackrel{{\scriptstyle(*)}}{{\geq}}\sum_{i,j=1}^{n}\Big{[}(Z_{0})_{ij}-2\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\notin E^{\circ}_{0}\big{)}\Big{]}_{+}^{2}
()i,j=1n(Z0)ij2/24n2μ0(θ^0E0)2\displaystyle\stackrel{{\scriptstyle(**)}}{{\geq}}\sum_{i,j=1}^{n}(Z_{0})_{ij}^{2}/2-4n^{2}\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\notin E^{\circ}_{0}\big{)}^{2}
=p/24n2μ0(θ^0E0)2.\displaystyle=p/2-4n^{2}\mathbb{P}_{\mu_{0}}\big{(}\widehat{\theta}^{0}\notin E^{\circ}_{0}\big{)}^{2}.

Here we have used the following:

  • In ()(\ast), we apply the estimate

    ±Zij=±eiZejsupu,v:u=v=1uZv=Z1,Z{Z0,Zm,}.\displaystyle\pm Z_{ij}=\pm e_{i}^{\top}Ze_{j}\leq\sup_{u,v:\lVert u\rVert=\lVert v\rVert=1}u^{\top}Zv=\lVert Z\rVert\leq 1,\quad Z\in\{Z_{0},Z_{m,\ell}\}.

    This means maxi,j|Zij|1\max_{i,j}\lvert Z_{ij}\rvert\leq 1 for Z{Z0,Zm,}Z\in\{Z_{0},Z_{m,\ell}\}.

  • In ()(\ast\ast) we use the estimate (ab)+2a2/2b2(a-b)_{+}^{2}\geq a^{2}/2-b^{2}.

  • In the last equality we use i,j(Z0)ij2=tr(Z0Z0)=p\sum_{i,j}(Z_{0})_{ij}^{2}=\operatorname{tr}(Z_{0}Z_{0}^{\top})=p.

Thus, claim (1) follows.
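The facts about Z_{0} used in the chain of inequalities above — that the hat matrix is a projection with operator norm 1, entries bounded by 1 in absolute value, and squared Frobenius norm tr(Z_{0}Z_{0}^{\top})=p — are easy to confirm numerically. A minimal sketch (illustration only; the dimensions n=30, p=6 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 6
X = rng.normal(size=(n, p))                 # full column rank almost surely
Z0 = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix: projection onto col(X)

fro2 = np.sum(Z0 ** 2)                      # sum_{ij} (Z0)_{ij}^2 = tr(Z0 Z0^T)
spec = np.linalg.norm(Z0, 2)                # operator norm of a projection
```

Idempotence Z_{0}^{2}=Z_{0} forces fro2 = tr(Z_{0}) = p and spec = 1, which in particular gives the entrywise bound max_{i,j}|(Z_{0})_{ij}| ≤ 1 used in step (\ast).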

(2). Note that

𝔼μdivμ^\displaystyle\mathbb{E}_{\mu}\operatorname{div}\widehat{\mu} =𝔼μtr(Jμ^)\displaystyle=\mathbb{E}_{\mu}\operatorname{tr}(J_{\widehat{\mu}})
=μ(θ^0E0)tr(Z0)+m=1p=1Nmμ(θ^0Em,)tr(Zm,)\displaystyle=\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\in E_{0}^{\circ}\big{)}\operatorname{tr}(Z_{0})+\sum_{m=1}^{p}\sum_{\ell=1}^{N_{m}}\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\in E^{\circ}_{m,\ell}\big{)}\operatorname{tr}(Z_{m,\ell})
=ppμ(θ^0E0)+m=1p=1Nmμ(θ^0Em,)(pm).\displaystyle=p-p\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\notin E_{0}^{\circ}\big{)}+\sum_{m=1}^{p}\sum_{\ell=1}^{N_{m}}\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\in E^{\circ}_{m,\ell}\big{)}(p-m).

Hence

|𝔼μdivμ^p|\displaystyle\big{\lvert}\mathbb{E}_{\mu}\operatorname{div}\widehat{\mu}-p\big{\rvert} 2pμ(θ^0E0),\displaystyle\leq 2p\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\notin E_{0}^{\circ}\big{)},

proving claim (2).

(3). When θ^0E0\widehat{\theta}^{0}\in E_{0}^{\circ}, θ^=θ^0\widehat{\theta}=\widehat{\theta}^{0} as θ^\widehat{\theta} is the projection of θ^0\widehat{\theta}^{0} onto K~\widetilde{K} with respect to X(XX)1/2\|\cdot\|_{X}\equiv\big{(}\cdot^{\top}X^{\top}X\cdot\big{)}^{1/2}, cf. [Kat09, Equation (1.6)]. This means

𝔼μμ^μ2\displaystyle\mathbb{E}_{\mu}\lVert\widehat{\mu}-\mu\rVert^{2} =𝔼μ(μ^μ2𝟏θ^0E0)+𝔼μ(μ^μ2𝟏θ^0E0)\displaystyle=\mathbb{E}_{\mu}\big{(}\lVert\widehat{\mu}-\mu\rVert^{2}\bm{1}_{\widehat{\theta}^{0}\in E_{0}^{\circ}}\big{)}+\mathbb{E}_{\mu}\big{(}\lVert\widehat{\mu}-\mu\rVert^{2}\bm{1}_{\widehat{\theta}^{0}\notin E_{0}^{\circ}}\big{)}
=𝔼μ(Xθ^0Xθ2𝟏θ^0E0)+𝔼μ(μ^μ2𝟏θ^0E0)\displaystyle=\mathbb{E}_{\mu}\big{(}\lVert X\widehat{\theta}^{0}-X\theta\rVert^{2}\bm{1}_{\widehat{\theta}^{0}\in E_{0}^{\circ}}\big{)}+\mathbb{E}_{\mu}\big{(}\lVert\widehat{\mu}-\mu\rVert^{2}\bm{1}_{\widehat{\theta}^{0}\notin E_{0}^{\circ}}\big{)}
=𝔼μXθ^0Xθ2+Rn,μ=p+Rn,μ,\displaystyle=\mathbb{E}_{\mu}\lVert X\widehat{\theta}^{0}-X\theta\rVert^{2}+R_{n,\mu}=p+R_{n,\mu},

where

Rn,μ𝔼μ(μ^μ2𝟏θ^0E0)𝔼μ(Z0ξ2𝟏θ^0E0).\displaystyle R_{n,\mu}\equiv\mathbb{E}_{\mu}\big{(}\lVert\widehat{\mu}-\mu\rVert^{2}\bm{1}_{\widehat{\theta}^{0}\notin E_{0}^{\circ}}\big{)}-\mathbb{E}_{\mu}\big{(}\lVert Z_{0}\xi\rVert^{2}\bm{1}_{\widehat{\theta}^{0}\notin E_{0}^{\circ}}\big{)}.

Since \lVert\theta\rVert_{1}\leq\lambda, the definition of the projection gives \lVert Y-\widehat{\mu}\rVert^{2}\leq\lVert Y-\mu\rVert^{2}=\lVert\xi\rVert^{2}, so that \lVert\widehat{\mu}-\mu\rVert^{2}\leq 2\lVert Y-\widehat{\mu}\rVert^{2}+2\lVert\xi\rVert^{2}\leq 4\lVert\xi\rVert^{2}. Hence, by the Cauchy–Schwarz inequality,

𝔼μ(μ^μ2𝟏θ^0E0)\displaystyle\mathbb{E}_{\mu}\big{(}\lVert\widehat{\mu}-\mu\rVert^{2}\bm{1}_{\widehat{\theta}^{0}\notin E_{0}^{\circ}}\big{)} 4𝔼ξ4μ(θ^0E0)Cnμ(θ^0E0).\displaystyle\leq 4\sqrt{\mathbb{E}\lVert\xi\rVert^{4}}\sqrt{\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\notin E_{0}^{\circ}\big{)}}\leq Cn\sqrt{\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\notin E_{0}^{\circ}\big{)}}.

On the other hand, a similar estimate yields

𝔼μ(Z0ξ2𝟏θ^0E0)Cpμ(θ^0E0)Cnμ(θ^0E0).\displaystyle\mathbb{E}_{\mu}\big{(}\lVert Z_{0}\xi\rVert^{2}\bm{1}_{\widehat{\theta}^{0}\notin E_{0}^{\circ}}\big{)}\leq Cp\sqrt{\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\notin E_{0}^{\circ}\big{)}}\leq Cn\sqrt{\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\notin E_{0}^{\circ}\big{)}}.

Hence

|𝔼μμ^μ2p|Cnμ(θ^0E0).\displaystyle\big{\lvert}\mathbb{E}_{\mu}\lVert\widehat{\mu}-\mu\rVert^{2}-p\big{\rvert}\leq Cn\sqrt{\mathbb{P}_{\mu}\big{(}\widehat{\theta}^{0}\notin E_{0}^{\circ}\big{)}}.

This completes the proof of claim (3). ∎

Proof of Theorem 4.9.

The first claim follows from Proposition 4.10-(1)(3) and Theorem 3.1 (by ignoring the bias term in the denominator). For the second claim, by Proposition 4.10-(2)(3),

mμ(μμ02+p)=2𝔼μdivμ^KX,λ𝔼μμ^KX,λμ2p=𝒪(n𝔭λ,μ1/2).\displaystyle m_{\mu}-\big{(}\lVert\mu-\mu_{0}\rVert^{2}+p\big{)}=2\mathbb{E}_{\mu}\operatorname{div}\widehat{\mu}_{K_{X,\lambda}}-\mathbb{E}_{\mu}\lVert\widehat{\mu}_{K_{X,\lambda}}-\mu\rVert^{2}-p=\mathcal{O}(n\cdot\mathfrak{p}_{\lambda,\mu}^{1/2}).

This entails that m_{\mu}-m_{\mu_{0}}=\|\mu-\mu_{0}\|^{2}+n\cdot\mathcal{O}(\mathfrak{p}_{\lambda,\mu}^{1/2}\vee\mathfrak{p}_{\lambda,\mu_{0}}^{1/2}). Furthermore, using the Gaussian–Poincaré inequality along with Proposition 4.10-(1)(3), we have

σμ024𝔼μ0μ^μ024p+C(n𝔭λ,μ01/2)=𝒪(p),\displaystyle\sigma_{\mu_{0}}^{2}\leq 4\mathbb{E}_{\mu_{0}}\|\widehat{\mu}-\mu_{0}\|^{2}\leq 4p+C(n\mathfrak{p}_{\lambda,\mu_{0}}^{1/2})=\mathcal{O}(p),

where the last inequality follows from the condition n\mathfrak{p}_{\lambda,\mu_{0}}^{1/2}=\mathfrak{o}(1). This, along with the lower bound for \sigma_{\mu_{0}}^{2} derived in Proposition 4.10-(1), yields \sigma_{\mu_{0}}^{2}\asymp p. Therefore, under the condition n\cdot(\mathfrak{p}_{\lambda,\mu}^{1/2}\vee\mathfrak{p}_{\lambda,\mu_{0}}^{1/2})=\mathfrak{o}(1), condition (3.5) is satisfied automatically, and (3.7) is equivalent to

|n𝒪(𝔭λ,μ1/2𝔭λ,μ01/2)+μμ02p1/2|μμ0p1/4.\displaystyle\bigg{\lvert}\frac{n\cdot\mathcal{O}\big{(}\mathfrak{p}_{\lambda,\mu}^{1/2}\vee\mathfrak{p}_{\lambda,\mu_{0}}^{1/2}\big{)}+\lVert\mu-\mu_{0}\rVert^{2}}{p^{1/2}}\bigg{\rvert}\to\infty\Leftrightarrow\lVert\mu-\mu_{0}\rVert\gg p^{1/4}.

The proof is complete. ∎

Proof of Lemma 6.3.

Recall that Σ=XX/n\Sigma=X^{\top}X/n. Note that

θ^0=(XX)1XY=θ+(XX)1Xξ=dθ+(XX)1/2Z\displaystyle\widehat{\theta}^{0}=(X^{\top}X)^{-1}X^{\top}Y=\theta+(X^{\top}X)^{-1}X^{\top}\xi\stackrel{{\scriptstyle d}}{{=}}\theta+(X^{\top}X)^{-1/2}Z

with Z𝒩(0,Ip)Z\sim\mathcal{N}(0,I_{p}). For any bpb\in\mathbb{R}^{p}, let fb:pf_{b}:\mathbb{R}^{p}\rightarrow\mathbb{R} be defined as fb(y)fb(y;X)(XX)1/2y,bf_{b}(y)\equiv f_{b}(y;X)\equiv\left\langle(X^{\top}X)^{-1/2}y,b\right\rangle. Then θ^01=supb:b1[fb(Z)+bθ]\|\widehat{\theta}^{0}\|_{1}=\sup_{b:\|b\|_{\infty}\leq 1}\big{[}f_{b}(Z)+b^{\top}\theta\big{]}. Hence by Gaussian concentration (cf. [BLM13, Theorem 5.8]), for any t>0t>0,

(θ^01𝔼θ^01>t)exp(t2/2σ2),\displaystyle\mathbb{P}\Big{(}\|\widehat{\theta}^{0}\|_{1}-\mathbb{E}\|\widehat{\theta}^{0}\|_{1}>t\Big{)}\leq\exp\big{(}-t^{2}/2\sigma^{2}\big{)}, (6.15)

where σ2=supb:b1Var(fb(Z))\sigma^{2}=\sup_{b:\|b\|_{\infty}\leq 1}\operatorname{Var}\big{(}f_{b}(Z)\big{)}. Next we bound 𝔼θ^01\mathbb{E}\|\widehat{\theta}^{0}\|_{1} and σ2\sigma^{2}. For σ2\sigma^{2}, note that

σ2\displaystyle\sigma^{2} =supb:b1𝔼(XX)1/2Z,b2=n1supb:b1bΣ1b\displaystyle=\sup_{b:\|b\|_{\infty}\leq 1}\mathbb{E}\left\langle(X^{\top}X)^{-1/2}Z,b\right\rangle^{2}=n^{-1}\sup_{b:\|b\|_{\infty}\leq 1}b^{\top}\Sigma^{-1}b
(p/n)supb:b21bΣ1b=p/(nλmin(Σ)).\displaystyle\leq(p/n)\cdot\sup_{b:\|b\|_{2}\leq 1}b^{\top}\Sigma^{-1}b=p\big{/}\big{(}n\lambda_{\min}(\Sigma)\big{)}.

For the mean term, we have 𝔼θ^01θ1+𝔼(XX)1/2ξ1=θ1+𝔼supb:b1fb(Z)\mathbb{E}\|\widehat{\theta}^{0}\|_{1}\leq\|\theta\|_{1}+\mathbb{E}\|(X^{\top}X)^{-1/2}\xi\|_{1}=\|\theta\|_{1}+\mathbb{E}\sup_{b:\|b\|_{\infty}\leq 1}f_{b}(Z). Note that the natural metric dd induced by the Gaussian process (fb(Z):bp)\big{(}f_{b}(Z):b\in\mathbb{R}^{p}\big{)} takes the form

d2(b1,b2)\displaystyle d^{2}(b_{1},b_{2}) 𝔼(fb1(Z)fb2(Z))2\displaystyle\equiv\mathbb{E}\big{(}f_{b_{1}}(Z)-f_{b_{2}}(Z)\big{)}^{2}
=(b1b2)(nΣ)1(b1b2)n1λmin1(Σ)b1b22,\displaystyle=(b_{1}-b_{2})^{\top}(n\Sigma)^{-1}(b_{1}-b_{2})\leq n^{-1}\lambda_{\min}^{-1}(\Sigma)\|b_{1}-b_{2}\|^{2},

and a simple volume estimate yields that

𝒩(ε,{b:b1},d)[(nλmin(Σ))1/2/ε]p.\displaystyle\mathcal{N}(\varepsilon,\{b:\|b\|_{\infty}\leq 1\},d)\lesssim\big{[}(n\lambda_{\min}(\Sigma))^{-1/2}/\varepsilon\big{]}^{p}.

Hence by Dudley’s entropy integral (cf. [GN16, Theorem 2.3.6]),

𝔼(XX)1/2ξ1\displaystyle\mathbb{E}\|(X^{\top}X)^{-1/2}\xi\|_{1} 0log(1𝒩(ε,{b:b1},d))dε\displaystyle\lesssim\int_{0}^{\infty}\sqrt{\log\big{(}1\vee\mathcal{N}(\varepsilon,\{b:\|b\|_{\infty}\leq 1\},d)\big{)}}\,\mathrm{d}\varepsilon
p/(nλmin(Σ)).\displaystyle\lesssim\sqrt{p\big{/}\big{(}n\lambda_{\min}(\Sigma)\big{)}}.

The claim now follows from (6.15). ∎

6.5. Proof of Theorem 4.12

By definition of K0,kK_{0,k}, we have δK0,k=dim(K0,k)=k+1\delta_{K_{0,k}}=\dim(K_{0,k})=k+1. We will now show that

Lk1(𝟏k1loglog(16n)+𝟏k=0log(en))δK,kLklog(en),\displaystyle L_{k}^{-1}\big{(}\bm{1}_{k\geq 1}\log\log(16n)+\bm{1}_{k=0}\log(en)\big{)}\leq\delta_{K_{\uparrow,k}}\leq L_{k}\log(en), (6.16)

where Lk>0L_{k}>0 only depends on kk.

We first prove the upper bound in (6.16) by induction. The base case k=0 follows by [ALMT14, Equation (D.12)]. Suppose the claim holds for some k\in\mathbb{Z}_{\geq 0}. For k+1, note that K_{\uparrow,k+1}=\cup_{\ell=1}^{n}K_{\uparrow,k+1;\ell}, where K_{\uparrow,k+1;\ell} contains all \nu\in K_{\uparrow,k+1} such that -\nu|_{[1:\ell]} is k-monotone and \nu|_{(\ell:n]} is k-monotone. Hence for any \ell\in[1:n], it follows by [ALMT14, Proposition 3.1] that

δK,k+1;Lk(log(e)+log(e(n)))2Lklog(en),\displaystyle\delta_{K_{\uparrow,k+1;\ell}}\leq L_{k}\big{(}\log(e\ell)+\log(e(n-\ell))\big{)}\leq 2L_{k}\log(en),

where the second inequality follows by induction. On the other hand, let ZksupνK,kB(1)ν,ξ=ΠK,k(ξ)Z_{k}\equiv\sup_{\nu\in K_{\uparrow,k}\cap B(1)}\left\langle\nu,\xi\right\rangle=\|\Pi_{K_{\uparrow,k}}(\xi)\|, then Gaussian concentration (cf. [BLM13, Theorem 5.8]) entails that for any t>0t>0,

(Zk𝔼Zk+t)exp(t2/2).\displaystyle\mathbb{P}(Z_{k}\geq\mathbb{E}Z_{k}+t)\leq\exp(-t^{2}/2).

Hence, using the induction hypothesis \mathbb{E}Z_{k}\leq(\mathbb{E}Z_{k}^{2})^{1/2}\leq L_{k}^{1/2}\sqrt{\log(en)} and the union bound, it holds with probability at least 1-\exp(-t) that

Zk+1supνK,k+1B(1)ν,ξ\displaystyle Z_{k+1}\equiv\sup_{\nu\in K_{\uparrow,k+1}\cap B(1)}\left\langle\nu,\xi\right\rangle max1nδK,k+1;1/2+2(t+log(en))\displaystyle\leq\max_{1\leq\ell\leq n}\delta^{1/2}_{K_{\uparrow,k+1;\ell}}+\sqrt{2(t+\log(en))}
(2Lklog(en))1/2+2(t+log(en)).\displaystyle\leq\big{(}2L_{k}\log(en)\big{)}^{1/2}+\sqrt{2(t+\log(en))}.

Now the bound for δK,k+1=𝔼Zk+12\delta_{K_{\uparrow,k+1}}=\mathbb{E}Z_{k+1}^{2} follows by integrating the tail.

Next we prove the lower bound in (6.16) for k1k\geq 1. By Sudakov’s minorization (cf. [GN16, Theorem 2.4.12]), we have

δK,k+11/2𝔼Zk+1\displaystyle\delta_{K_{\uparrow,k+1}}^{1/2}\geq\mathbb{E}Z_{k+1} supε>0εlog𝒩(ε,K,k+1B(1),)\displaystyle\gtrsim\sup_{\varepsilon>0}\varepsilon\sqrt{\log\mathcal{N}(\varepsilon,K_{\uparrow,k+1}\cap B(1),\|\cdot\|)}
supε>0εlog𝒟(2ε,K,k+1B(1),),\displaystyle\geq\sup_{\varepsilon>0}\varepsilon\sqrt{\log\mathcal{D}(2\varepsilon,K_{\uparrow,k+1}\cap B(1),\|\cdot\|)},

where \mathcal{D}(\varepsilon,T,d) denotes the maximal \varepsilon-packing number of the set T with respect to the metric d. By taking \varepsilon small enough, the construction in [SHH20, Theorem 3.4] yields a (2\varepsilon)-packing set whose cardinality is of order \log(en). This completes the proof of the lower bound.

Now the claim (1) follows from Theorem 3.9-(1) and the lower bound in (6.16). (2) follows from the upper bound in (6.16) and Theorem 3.9-(2). ∎
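The statistical dimension \delta_{K}=\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2} appearing in (6.16) can be estimated by Monte Carlo. The sketch below is an illustration only and treats the base case k=0, where the projection onto the monotone cone is computed by pool-adjacent-violators; it checks that \delta_{K_{\uparrow,0}} is of order \log(en):

```python
import numpy as np

def pava(y):
    # Projection of y onto the monotone (nondecreasing) cone, i.e. the case k = 0.
    vals, wts = [], []
    for v in y:
        vals.append(float(v)); wts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            wts[-2] = w
            vals.pop(); wts.pop()
    return np.repeat(vals, wts)

rng = np.random.default_rng(3)
n, reps = 100, 2000
delta_hat = np.mean([np.sum(pava(rng.normal(size=n)) ** 2)
                     for _ in range(reps)])   # estimates delta_K = E||Pi_K(xi)||^2
```

For n=100 the estimate lands close to \log n, consistent with the \log(en) scaling in (6.16); only the order of growth matters for the argument above.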

Appendix A Additional proofs

A.1. Proof of Lemma 2.4

We provide the proof for (1)-(2) assuming KK is a polyhedral cone. The claim for a general convex cone KK follows from polyhedral approximation [MT14, Section 7.3].

(1) As VK=ddivΠK(ξ)=tr(JΠK(ξ))V_{K}\stackrel{{\scriptstyle d}}{{=}}\operatorname{div}\Pi_{K}(\xi)=\operatorname{tr}\big{(}J_{\Pi_{K}}(\xi)\big{)}, we have 𝔼VK=𝔼divΠK(ξ)=𝔼ξ,ΠK(ξ)=𝔼ΠK(ξ)2=δK\mathbb{E}V_{K}=\mathbb{E}\operatorname{div}\Pi_{K}(\xi)=\mathbb{E}\left\langle\xi,\Pi_{K}(\xi)\right\rangle=\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}=\delta_{K}.

(2) The claim is proved in [MT14, Proposition 4.4] using the ‘Master Steiner formula’, cf. [MT14, Theorem 3.1], which restates the classical chi-bar-squared distribution of \lVert\Pi_{K}(\xi)\rVert^{2}. Below we provide a simple alternative proof of this claim using Gaussian integration-by-parts only.

By expanding VK=dξ,ΠK(ξ)(ξ,ΠK(ξ)divΠK(ξ))V_{K}\stackrel{{\scriptstyle d}}{{=}}\left\langle\xi,\Pi_{K}(\xi)\right\rangle-\big{(}\left\langle\xi,\Pi_{K}(\xi)\right\rangle-\operatorname{div}\Pi_{K}(\xi)\big{)} and noting that 𝔼ξ,ΠK(ξ)=𝔼divΠK(ξ)\mathbb{E}\left\langle\xi,\Pi_{K}(\xi)\right\rangle=\mathbb{E}\operatorname{div}\Pi_{K}(\xi),

Var(VK)\displaystyle\operatorname{Var}(V_{K}) =Var(ξ,ΠK(ξ)divΠK(ξ))+Var(ξ,ΠK(ξ))\displaystyle=\operatorname{Var}\big{(}\left\langle\xi,\Pi_{K}(\xi)\right\rangle-\operatorname{div}\Pi_{K}(\xi)\big{)}+\operatorname{Var}\big{(}\left\langle\xi,\Pi_{K}(\xi)\right\rangle\big{)}
2𝔼[(ξ,ΠK(ξ)divΠK(ξ))ξ,ΠK(ξ)]\displaystyle\qquad-2\mathbb{E}\big{[}\big{(}\left\langle\xi,\Pi_{K}(\xi)\right\rangle-\operatorname{div}\Pi_{K}(\xi)\big{)}\left\langle\xi,\Pi_{K}(\xi)\right\rangle\big{]}
=𝔼trJΠK2(ξ)+𝔼ΠK(ξ)2+Var(ξ,ΠK(ξ))\displaystyle=\mathbb{E}\operatorname{tr}J_{\Pi_{K}}^{2}(\xi)+\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}+\operatorname{Var}\big{(}\left\langle\xi,\Pi_{K}(\xi)\right\rangle\big{)}
2𝔼[ΠK(ξ),ξ,ΠK(ξ)].\displaystyle\qquad-2\mathbb{E}\big{[}\left\langle\Pi_{K}(\xi),\nabla\left\langle\xi,\Pi_{K}(\xi)\right\rangle\right\rangle\big{]}.

The last equality follows from Gaussian integration-by-parts: (i) \operatorname{Var}(\left\langle\xi,f(\xi)\right\rangle-\operatorname{div}f(\xi))=\mathbb{E}\operatorname{tr}J_{f}^{2}(\xi)+\mathbb{E}\lVert f(\xi)\rVert^{2} (see e.g., [Ste81, Theorem 3], or [BZ21, Theorem 2.1]), and (ii) \mathbb{E}[(\left\langle\xi,f(\xi)\right\rangle-\operatorname{div}f(\xi))g(\xi)]=\mathbb{E}[\left\langle f(\xi),\nabla g(\xi)\right\rangle] [BZ21, Equation (2.4)]. Note that (i) \nabla\left\langle\xi,\Pi_{K}(\xi)\right\rangle=\nabla\lVert\Pi_{K}(\xi)\rVert^{2}=\nabla\lVert\xi-\Pi_{K^{\ast}}(\xi)\rVert^{2}=2(\xi-\Pi_{K^{\ast}}(\xi))=2\Pi_{K}(\xi), using the fact that K is a cone together with Lemma 2.1-(1), and (ii) \mathbb{E}\operatorname{tr}J_{\Pi_{K}}^{2}(\xi)=\mathbb{E}\operatorname{tr}J_{\Pi_{K}}(\xi)=\mathbb{E}\operatorname{div}\Pi_{K}(\xi), using the fact that when K is polyhedral, J_{\Pi_{K}} is a.e. a projection matrix (cf. [Kat09, Remark 3.3]). Finally, we conclude by using that \mathbb{E}\operatorname{div}\Pi_{K}(\xi)=\mathbb{E}\left\langle\xi,\Pi_{K}(\xi)\right\rangle=\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}.

(3) The right inequality follows by an application of the improved Gaussian-Poincaré inequality stated in [GNP17, Theorem A.2] as follows: By Lemma 2.1-(1) again, ΠK(ξ)2=ξΠK(ξ)2=2(ξΠK(ξ))=2ΠK(ξ)\nabla\lVert\Pi_{K}(\xi)\rVert^{2}=\nabla\lVert\xi-\Pi_{K^{\ast}}(\xi)\rVert^{2}=2(\xi-\Pi_{K^{\ast}}(\xi))=2\Pi_{K}(\xi), so [GNP17, Theorem A.2] yields that

Var(ΠK(ξ)2)\displaystyle\operatorname{Var}(\lVert\Pi_{K}(\xi)\rVert^{2}) 12𝔼ΠK(ξ)22+12𝔼ΠK(ξ)22\displaystyle\leq\frac{1}{2}\mathbb{E}\big{\lVert}\nabla\lVert\Pi_{K}(\xi)\rVert^{2}\big{\rVert}^{2}+\frac{1}{2}\big{\lVert}\mathbb{E}\nabla\lVert\Pi_{K}(\xi)\rVert^{2}\big{\rVert}^{2}
=2𝔼ΠK(ξ)2+2𝔼ΠK(ξ)2.\displaystyle=2\mathbb{E}\lVert\Pi_{K}(\xi)\rVert^{2}+2\lVert\mathbb{E}\Pi_{K}(\xi)\rVert^{2}.

The left inequality is an immediate consequence of (2). ∎
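Parts (1) and (3) of the lemma can be sanity-checked on a cone with a closed-form projection. For the nonnegative orthant K=\mathbb{R}^{n}_{+} one has \Pi_{K}(\xi)=\xi_{+} coordinatewise, so \delta_{K}=n/2, \operatorname{Var}(\lVert\Pi_{K}(\xi)\rVert^{2})=n(3/2-1/4)=5n/4, and the right-hand side of (3) evaluates to 2\delta_{K}+2\lVert\mathbb{E}\Pi_{K}(\xi)\rVert^{2}=n(1+1/\pi). A Monte Carlo sketch (illustration only; the choice of the orthant is ours):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 10, 200_000
xi = rng.normal(size=(N, n))
proj = np.maximum(xi, 0.0)                  # Pi_K(xi) for the orthant K = R^n_+
V = np.sum(proj ** 2, axis=1)               # ||Pi_K(xi)||^2, distributed as V_K

mean_hat = V.mean()                         # delta_K = n/2 = 5
var_hat = V.var()                           # exact value: 5n/4 = 12.5
bound = 2 * mean_hat + 2 * np.sum(proj.mean(axis=0) ** 2)   # RHS of part (3)
```

With n=10 the three quantities concentrate near 5, 12.5, and 10(1+1/\pi)\approx 13.18 respectively, so on the orthant the Poincaré-type upper bound in (3) holds with little slack.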

A.2. Proof of Proposition 5.1

Recall the following second-order Poincaré inequality due to [Cha09].

Lemma A.1 (Second-order Poincaré inequality).

Let \xi be an n-dimensional standard normal random vector. Let F:\mathbb{R}^{n}\to\mathbb{R} be absolutely continuous such that F and its derivatives have sub-exponential growth at \infty. Let \xi^{\prime} be an independent copy of \xi. Define T:\mathbb{R}^{n}\to\mathbb{R} by

T(y)0112tF(y),𝔼ξF(ty+1tξ)dt.\displaystyle T(y)\equiv\int_{0}^{1}\frac{1}{2\sqrt{t}}\left\langle\nabla F(y),\mathbb{E}_{\xi^{\prime}}\nabla F(\sqrt{t}y+\sqrt{1-t}\xi^{\prime})\right\rangle\,\mathrm{d}{t}.

Then with WF(ξ)W\equiv F(\xi),

dTV(W𝔼WVar(W),𝒩(0,1))2Var(T(ξ))Var(W).\displaystyle d_{\mathrm{TV}}\bigg{(}\frac{W-\mathbb{E}W}{\sqrt{\mathrm{Var}(W)}},\mathcal{N}(0,1)\bigg{)}\leq\frac{2\sqrt{\mathrm{Var}(T(\xi))}}{\mathrm{Var}(W)}.
Proof.

Let W(F(ξ)𝔼F(ξ))/Var(F(ξ))W^{\prime}\equiv\big{(}F(\xi)-\mathbb{E}F(\xi)\big{)}/\sqrt{\mathrm{Var}(F(\xi))}, and TT/Var(F(ξ))T^{\prime}\equiv T/\mathrm{Var}(F(\xi)). Then [Cha09, Lemma 5.3] says that

dTV(W,𝒩(0,1))2Var(T(ξ))=2Var(T(ξ))/Var(F(ξ)).\displaystyle d_{\mathrm{TV}}(W^{\prime},\mathcal{N}(0,1))\leq 2\sqrt{\mathrm{Var}(T^{\prime}(\xi))}=2\sqrt{\mathrm{Var}(T(\xi))}/\mathrm{Var}(F(\xi)).

The claim follows since the total variation metric is invariant under translation and scaling. ∎

Proof of Proposition 5.1.

For any fixed μn\mu\in\mathbb{R}^{n}, let F(ξ)Fμ(ξ)μ+ξΠK0(μ+ξ)2μ+ξΠK(μ+ξ)2F(\xi)\equiv F_{\mu}(\xi)\equiv\lVert\mu+\xi-\Pi_{K_{0}}(\mu+\xi)\rVert^{2}-\lVert\mu+\xi-\Pi_{K}(\mu+\xi)\rVert^{2}. By Lemma 2.1-(1), we have

F(ξ)\displaystyle\nabla F(\xi) =2(μ+ξΠK0(μ+ξ))2(μ+ξΠK(μ+ξ))\displaystyle=2(\mu+\xi-\Pi_{K_{0}}(\mu+\xi))-2(\mu+\xi-\Pi_{K}(\mu+\xi))
=2(ΠK(μ+ξ)ΠK0(μ+ξ)).\displaystyle=2\big{(}\Pi_{K}(\mu+\xi)-\Pi_{K_{0}}(\mu+\xi)\big{)}.

To use the second-order Poincaré inequality, let ξ\xi^{\prime} be an independent copy of ξ\xi and ξttξ+1tξ\xi_{t}\equiv\sqrt{t}\xi+\sqrt{1-t}\xi^{\prime}, and let

T(ξ)\displaystyle T(\xi) 0112tF(ξ),𝔼ξF(ξt)dt\displaystyle\equiv\int_{0}^{1}\frac{1}{2\sqrt{t}}\left\langle\nabla F(\xi),\mathbb{E}_{\xi^{\prime}}\nabla F(\xi_{t})\right\rangle\,\mathrm{d}{t}
=4𝔼ξ0112tΠK(μ+ξ)ΠK0(μ+ξ),ΠK(μ+ξt)ΠK0(μ+ξt)dt.\displaystyle=4\mathbb{E}_{\xi^{\prime}}\int_{0}^{1}\frac{1}{2\sqrt{t}}\left\langle\Pi_{K}(\mu+\xi)-\Pi_{K_{0}}(\mu+\xi),\Pi_{K}(\mu+\xi_{t})-\Pi_{K_{0}}(\mu+\xi_{t})\right\rangle\,\mathrm{d}{t}.

Hence

T(ξ)\displaystyle\nabla T(\xi) =4𝔼ξ12t[(JΠKJΠK0)(μ+ξ)(ΠK(μ+ξt)ΠK0(μ+ξt))\displaystyle=4\mathbb{E}_{\xi^{\prime}}\int\frac{1}{2\sqrt{t}}\bigg{[}(J_{\Pi_{K}}-J_{\Pi_{K_{0}}})(\mu+\xi)^{\top}\big{(}\Pi_{K}(\mu+\xi_{t})-\Pi_{K_{0}}(\mu+\xi_{t})\big{)}
+t(JΠKJΠK0)(μ+ξt)(ΠK(μ+ξ)ΠK0(μ+ξ))]dt.\displaystyle\qquad\qquad+\sqrt{t}(J_{\Pi_{K}}-J_{\Pi_{K_{0}}})(\mu+\xi_{t})^{\top}\big{(}\Pi_{K}(\mu+\xi)-\Pi_{K_{0}}(\mu+\xi)\big{)}\bigg{]}\,\mathrm{d}{t}.

The terms involved in the integral defining T are all absolutely continuous, so we may again apply the Gaussian–Poincaré inequality:

Var(T(ξ))𝔼T(ξ)2\displaystyle\operatorname{Var}(T(\xi))\leq\mathbb{E}\lVert\nabla T(\xi)\rVert^{2}
160112t𝔼(JΠKJΠK0)(μ+ξ)(ΠK(μ+ξt)ΠK0(μ+ξt))\displaystyle\leq 16\int_{0}^{1}\frac{1}{2\sqrt{t}}\mathbb{E}\bigg{\lVert}(J_{\Pi_{K}}-J_{\Pi_{K_{0}}})(\mu+\xi)^{\top}\big{(}\Pi_{K}(\mu+\xi_{t})-\Pi_{K_{0}}(\mu+\xi_{t})\big{)}
+t(JΠKJΠK0)(μ+ξt)(ΠK(μ+ξ)ΠK0(μ+ξ))2dt\displaystyle\qquad\qquad+\sqrt{t}(J_{\Pi_{K}}-J_{\Pi_{K_{0}}})(\mu+\xi_{t})^{\top}\big{(}\Pi_{K}(\mu+\xi)-\Pi_{K_{0}}(\mu+\xi)\big{)}\bigg{\lVert}^{2}\,\mathrm{d}{t}
(by Jensen’s inequality applied to the measure dt/2t\mathrm{d}{t}/2\sqrt{t})
16×80112t(𝔼ΠK(μ+ξt)ΠK0(μ+ξt)2\displaystyle\leq 16\times 8\int_{0}^{1}\frac{1}{2\sqrt{t}}\bigg{(}\mathbb{E}\lVert\Pi_{K}(\mu+\xi_{t})-\Pi_{K_{0}}(\mu+\xi_{t})\rVert^{2}
+𝔼ΠK(μ+ξ)ΠK0(μ+ξ)2)dt.\displaystyle\qquad\qquad\qquad\qquad\qquad+\mathbb{E}\lVert\Pi_{K}(\mu+\xi)-\Pi_{K_{0}}(\mu+\xi)\rVert^{2}\bigg{)}\,\mathrm{d}{t}.

Here in the last inequality we used the elementary bound \lVert a+b\rVert^{2}\leq 2(\lVert a\rVert^{2}+\lVert b\rVert^{2}) together with t1\sqrt{t}\leq 1 and JΠKJΠK0JΠK+JΠK02\lVert J_{\Pi_{K}}-J_{\Pi_{K_{0}}}\rVert\leq\lVert J_{\Pi_{K}}\rVert+\lVert J_{\Pi_{K_{0}}}\rVert\leq 2, the latter holding since projections onto closed convex sets are 11-Lipschitz. Now using that ξt\xi_{t} has the same distribution as ξ\xi for each t[0,1]t\in[0,1], we arrive at

Var(T(ξ))162𝔼μ^Kμ^K02.\displaystyle\operatorname{Var}(T(\xi))\leq 16^{2}\mathbb{E}\lVert\widehat{\mu}_{K}-\widehat{\mu}_{K_{0}}\rVert^{2}.
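In detail, the constant 16^{2} arises as follows: since \xi_{t} and \xi have the same law, both expectations in the penultimate display equal \mathbb{E}\lVert\widehat{\mu}_{K}-\widehat{\mu}_{K_{0}}\rVert^{2}, and \int_{0}^{1}\mathrm{d}t/(2\sqrt{t})=1, so

```latex
\operatorname{Var}(T(\xi))
  \leq 16\times 8\int_{0}^{1}\frac{1}{2\sqrt{t}}
     \cdot 2\,\mathbb{E}\lVert\widehat{\mu}_{K}-\widehat{\mu}_{K_{0}}\rVert^{2}\,\mathrm{d}t
  = 256\,\mathbb{E}\lVert\widehat{\mu}_{K}-\widehat{\mu}_{K_{0}}\rVert^{2}
  = 16^{2}\,\mathbb{E}\lVert\widehat{\mu}_{K}-\widehat{\mu}_{K_{0}}\rVert^{2}.
```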

The claim now follows from the second-order Poincaré inequality in Lemma A.1. ∎
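As a standalone numerical sanity check (not part of the proof), the Gaussian Poincaré inequality \operatorname{Var}(f(\xi))\leq\mathbb{E}\lVert\nabla f(\xi)\rVert^{2} used above can be verified by simulation in a concrete case: take K to be the nonnegative orthant, so that \Pi_{K} clips coordinates at zero and f(\xi)=\lVert\Pi_{K}(\xi)\rVert^{2} has \nabla f(\xi)=2\Pi_{K}(\xi) almost everywhere. The snippet below is an illustrative sketch; the dimension and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000

# Standard Gaussian vectors xi in R^n.
xi = rng.standard_normal((reps, n))

# For K the nonnegative orthant, Pi_K clips coordinates at zero,
# so f(xi) = ||Pi_K(xi)||^2 and grad f(xi) = 2 * Pi_K(xi) a.e.
proj = np.maximum(xi, 0.0)
f = (proj ** 2).sum(axis=1)
grad_sq = (4 * proj ** 2).sum(axis=1)

var_f = f.var()           # Monte Carlo estimate of Var(f(xi))
bound = grad_sq.mean()    # Monte Carlo estimate of E||grad f||^2

# Gaussian Poincare: the variance is dominated by the gradient bound.
assert var_f <= bound
```

From half-Gaussian moments, the exact values here are \operatorname{Var}(f(\xi))=5n/4 and \mathbb{E}\lVert\nabla f(\xi)\rVert^{2}=2n, so the inequality holds with room to spare.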

Acknowledgments

The authors would like to thank two referees and an Associate Editor for their helpful comments and suggestions that significantly improved the exposition of the paper.

References

  • [AB93] Adelchi Azzalini and Adrian Bowman, On the use of nonparametric regression for checking linear relationships, J. Roy. Statist. Soc. Ser. B 55 (1993), no. 2, 549–557.
  • [ACCP11] Ery Arias-Castro, Emmanuel J. Candès, and Yaniv Plan, Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism, Ann. Statist. 39 (2011), no. 5, 2533–2556.
  • [ALMT14] Dennis Amelunxen, Martin Lotz, Michael B. McCoy, and Joel A. Tropp, Living on the edge: phase transitions in convex programs with random data, Inf. Inference 3 (2014), no. 3, 224–294.
  • [Bar59a] D. J. Bartholomew, A test of homogeneity for ordered alternatives, Biometrika 46 (1959), no. 1-2, 36–48.
  • [Bar59b] by same author, A test of homogeneity for ordered alternatives. II, Biometrika 46 (1959), 328–335.
  • [Bar61a] by same author, Ordered tests in the analysis of variance, Biometrika 48 (1961), 325–332.
  • [Bar61b] by same author, A test of homogeneity of means under restricted alternatives, J. Roy. Statist. Soc. Ser. B 23 (1961), 239–281.
  • [Bar02] Yannick Baraud, Non-asymptotic minimax rates of testing in signal detection, Bernoulli 8 (2002), no. 5, 577–606.
  • [BBBB72] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, and H. D. Brunk, Statistical inference under order restrictions. The theory and application of isotonic regression, John Wiley & Sons, London-New York-Sydney, 1972, Wiley Series in Probability and Mathematical Statistics.
  • [Bes06] Olivier Besson, Adaptive detection of a signal whose signature belongs to a cone, Fourth IEEE Workshop on Sensor Array and Multichannel Processing, IEEE, 2006, pp. 409–413.
  • [BHL05] Yannick Baraud, Sylvie Huet, and Béatrice Laurent, Testing convex hypotheses on the mean of a Gaussian vector. Application to testing qualitative hypotheses on a regression function, Ann. Statist. 33 (2005), no. 1, 214–257.
  • [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford University Press, Oxford, 2013.
  • [Bou03] Olivier Bousquet, Concentration inequalities for sub-additive functions using the entropy method, Stochastic inequalities and applications, Progr. Probab., vol. 56, Birkhäuser, Basel, 2003, pp. 213–247.
  • [BZ21] Pierre C. Bellec and Cun-Hui Zhang, Second-order Stein: SURE for SURE and other applications in high-dimensional inference, Ann. Statist. (to appear). Available at arXiv:1811.04121 (2021).
  • [Car15] Alexandra Carpentier, Testing the regularity of a smooth signal, Bernoulli 21 (2015), no. 1, 465–488.
  • [CCC+19] Alexandra Carpentier, Olivier Collier, Laëtitia Comminges, Alexandre B. Tsybakov, and Yu Wang, Minimax rate of testing in sparse linear regression, Automation and Remote Control 80 (2019), no. 10, 1817–1834.
  • [CCC+20] Alexandra Carpentier, Olivier Collier, Laëtitia Comminges, Alexandre B. Tsybakov, and Yuhao Wang, Estimation of the \ell_{2}-norm and testing in sparse linear regression with unknown variance, arXiv preprint arXiv:2010.13679 (2020).
  • [CCT17] Olivier Collier, Laëtitia Comminges, and Alexandre B. Tsybakov, Minimax estimation of linear and quadratic functionals on sparsity classes, Ann. Statist. 45 (2017), no. 3, 923–958.
  • [CD13] Laëtitia Comminges and Arnak S. Dalalyan, Minimax testing of a composite null hypothesis defined via a quadratic functional in the model of regression, Electron. J. Stat. 7 (2013), 146–190.
  • [CGS15] Sabyasachi Chatterjee, Adityanand Guntuboyina, and Bodhisattva Sen, On risk bounds in isotonic and other shape restricted regression problems, Ann. Statist. 43 (2015), no. 4, 1774–1800.
  • [Cha09] Sourav Chatterjee, Fluctuations of eigenvalues and second order Poincaré inequalities, Probab. Theory Related Fields 143 (2009), no. 1-2, 1–40.
  • [Cha14] by same author, A new perspective on least squares under convex constraint, Ann. Statist. 42 (2014), no. 6, 2340–2381.
  • [Che54] Herman Chernoff, On the distribution of the likelihood ratio, Ann. Math. Statistics 25 (1954), 573–578.
  • [CKWY88] Dennis Cox, Eunmee Koh, Grace Wahba, and Brian S. Yandell, Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models, Ann. Statist. 16 (1988), no. 1, 113–119.
  • [CL11] A. Chatterjee and S. N. Lahiri, Bootstrapping lasso estimators, J. Amer. Statist. Assoc. 106 (2011), no. 494, 608–625.
  • [CS10] Ronald Christensen and Siu Kei Sun, Alternative goodness-of-fit tests for linear models, J. Amer. Statist. Assoc. 105 (2010), no. 489, 291–301.
  • [CV19] Alexandra Carpentier and Nicolas Verzelen, Optimal sparsity testing in linear regression model, arXiv preprint arXiv:1901.08802 (2019).
  • [DJ04] David Donoho and Jiashun Jin, Higher criticism for detecting sparse heterogeneous mixtures, Ann. Statist. 32 (2004), no. 3, 962–994.
  • [DT01] Cécile Durot and Anne-Sophie Tocquet, Goodness of fit test for isotonic regression, ESAIM Probab. Statist. 5 (2001), 119–140.
  • [Dur07] Cécile Durot, On the 𝕃p\mathbb{L}_{p}-error of monotonicity constrained estimators, Ann. Statist. 35 (2007), no. 3, 1080–1104.
  • [Dyk91] Richard Dykstra, Asymptotic normality for chi-bar-square distributions, Canad. J. Statist. 19 (1991), no. 3, 297–306.
  • [ES90] R. L. Eubank and C. H. Spiegelman, Testing the goodness of fit of a linear model via nonparametric regression techniques, J. Amer. Statist. Assoc. 85 (1990), no. 410, 387–392.
  • [FH01] Jianqing Fan and Li-Shan Huang, Goodness-of-fit tests for parametric regression models, J. Amer. Statist. Assoc. 96 (2001), no. 454, 640–652.
  • [GGF08] Maria Greco, Fulvio Gini, and Alfonso Farina, Radar detection and classification of jamming signals belonging to a cone class, IEEE Transactions on Signal Processing 56 (2008), no. 5, 1984–1993.
  • [GHL99] Piet Groeneboom, Gerard Hooghiemstra, and Hendrik P. Lopuhaä, Asymptotic normality of the L1L_{1} error of the Grenander estimator, Ann. Statist. 27 (1999), no. 4, 1316–1347.
  • [GL05] Emmanuel Guerre and Pascal Lavergne, Data-driven rate-optimal specification testing in regression models, Ann. Statist. 33 (2005), no. 2, 840–870.
  • [GN16] Evarist Giné and Richard Nickl, Mathematical foundations of infinite-dimensional statistical models, Cambridge Series in Statistical and Probabilistic Mathematics, [40], Cambridge University Press, New York, 2016.
  • [GNP17] Larry Goldstein, Ivan Nourdin, and Giovanni Peccati, Gaussian phase transitions and conic intrinsic volumes: Steining the Steiner formula, Ann. Appl. Probab. 27 (2017), no. 1, 1–47.
  • [Gro85] Piet Groeneboom, Estimating a monotone density, Proceedings of the Berkeley conference in honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983), Wadsworth Statist./Probab. Ser., Wadsworth, Belmont, CA, 1985, pp. 539–555.
  • [GS18] Adityanand Guntuboyina and Bodhisattva Sen, Nonparametric shape-restricted regression, Statist. Sci. 33 (2018), no. 4, 568–594.
  • [HJS21] Qiyang Han, Tiefeng Jiang, and Yandi Shen, A general method for power analysis in testing high dimensional covariance matrices, arXiv preprint arXiv:2101.11086 (2021).
  • [HM93] W. Härdle and E. Mammen, Comparing nonparametric versus parametric regression fits, Ann. Statist. 21 (1993), no. 4, 1926–1947.
  • [HWCS19] Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, and Richard J. Samworth, Isotonic regression in general dimensions, Ann. Statist. 47 (2019), no. 5, 2440–2471.
  • [HZ19] Qiyang Han and Cun-Hui Zhang, Limit distribution theory for block estimators in multiple isotonic regression, Ann. Statist. (to appear). Available at arXiv:1905.12825 (2019+).
  • [IS03] Yu. I. Ingster and I. A. Suslina, Nonparametric goodness-of-fit testing under Gaussian models, Lecture Notes in Statistics, vol. 169, Springer-Verlag, New York, 2003.
  • [ITV10] Yuri I. Ingster, Alexandre B. Tsybakov, and Nicolas Verzelen, Detection boundary in sparse regression, Electron. J. Stat. 4 (2010), 1476–1526.
  • [JN02] Anatoli Juditsky and Arkadi Nemirovski, On nonparametric tests of positivity/monotonicity/convexity, Ann. Statist. 30 (2002), no. 2, 498–527.
  • [Kat09] Kengo Kato, On the degrees of freedom in shrinkage estimation, J. Multivariate Anal. 100 (2009), no. 7, 1338–1352.
  • [KC75] Akio Kudô and Jae Rong Choi, A generalized multivariate analogue of the one sided test, Mem. Fac. Sci. Kyushu Univ. Ser. A 29 (1975), no. 2, 303–328.
  • [KGGS20] Gil Kur, Fuchang Gao, Adityanand Guntuboyina, and Bodhisattva Sen, Convex regression in multidimensions: Suboptimality of least squares estimators, arXiv preprint arXiv:2006.02044 (2020).
  • [Kud63] Akio Kudô, A multivariate analogue of the one-sided test, Biometrika 50 (1963), 403–418.
  • [Mey03] Mary C. Meyer, A test for linear versus convex regression function using shape-restricted regression, Biometrika 90 (2003), no. 1, 223–232.
  • [MRS92a] J. A. Menéndez, C. Rueda, and B. Salvador, Dominance of likelihood ratio tests under cone constraints, Ann. Statist. 20 (1992), no. 4, 2087–2099.
  • [MRS92b] by same author, Testing nonoblique hypotheses, Comm. Statist. Theory Methods 21 (1992), no. 2, 471–484.
  • [MS91] J. A. Menéndez and B. Salvador, Anomalies of the likelihood ratio tests for testing restricted hypotheses, Ann. Statist. 19 (1991), no. 2, 889–898.
  • [MS20] Rajarshi Mukherjee and Subhabrata Sen, On minimax exponents of sparse testing, arXiv preprint arXiv:2003.00570 (2020).
  • [MT14] Michael B. McCoy and Joel A. Tropp, From Steiner formulas for cones to concentration of intrinsic volumes, Discrete Comput. Geom. 51 (2014), no. 4, 926–963.
  • [MW00] Mary Meyer and Michael Woodroofe, On the degrees of freedom in shape-restricted regression, Ann. Statist. 28 (2000), no. 4, 1083–1104.
  • [NP12] Ivan Nourdin and Giovanni Peccati, Normal approximations with Malliavin calculus, Cambridge Tracts in Mathematics, vol. 192, Cambridge University Press, Cambridge, 2012, From Stein’s method to universality.
  • [NvdG13] Richard Nickl and Sara van de Geer, Confidence sets in sparse regression, Ann. Statist. 41 (2013), no. 6, 2852–2876.
  • [NVK10] Natalie Neumeyer and Ingrid Van Keilegom, Estimating the error distribution in nonparametric multiple regression with applications to model testing, J. Multivariate Anal. 101 (2010), no. 5, 1067–1078.
  • [RLN86] Richard F. Raubertas, Chu-In Charles Lee, and Erik V. Nordheim, Hypothesis tests for normal means constrained by linear inequalities, Comm. Statist. A—Theory Methods 15 (1986), no. 9, 2809–2833.
  • [Roc97] R. Tyrrell Rockafellar, Convex Analysis, Princeton Landmarks in Mathematics, Princeton University Press, Princeton, NJ, 1997, Reprint of the 1970 original, Princeton Paperbacks.
  • [RW78] Tim Robertson and Edward J. Wegman, Likelihood ratio tests for order restrictions in exponential families, Ann. Statist. 6 (1978), no. 3, 485–505.
  • [RWD88] Tim Robertson, F. T. Wright, and R. L. Dykstra, Order restricted statistical inference, Wiley Series in Probability and Mathematical Statistics: Probability and Mathematical Statistics, John Wiley & Sons, Ltd., Chichester, 1988.
  • [SB18] Rajen D. Shah and Peter Bühlmann, Goodness-of-fit tests for high dimensional linear models, J. R. Stat. Soc. Ser. B. Stat. Methodol. 80 (2018), no. 1, 113–135.
  • [Sha85] Alexander Shapiro, Asymptotic distribution of test statistics in the analysis of moment structures under inequality constraints, Biometrika 72 (1985), no. 1, 133–144.
  • [Sha88] A. Shapiro, Towards a unified theory of inequality constrained testing in multivariate analysis, Internat. Statist. Rev. 56 (1988), no. 1, 49–62.
  • [SHH20] Yandi Shen, Qiyang Han, and Fang Han, On a phase transition in general order spline regression, arXiv preprint arXiv:2004.10922 (2020).
  • [SM17] Bodhisattva Sen and Mary Meyer, Testing against a linear regression model using ideas from shape-restricted estimation, J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 (2017), no. 2, 423–448.
  • [Ste81] Charles M. Stein, Estimation of the mean of a multivariate normal distribution, Ann. Statist. 9 (1981), no. 6, 1135–1151.
  • [Stu97] Winfried Stute, Nonparametric model checks for regression, Ann. Statist. 25 (1997), no. 2, 613–641.
  • [Tib96] Robert Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B 58 (1996), no. 1, 267–288.
  • [TT12] Ryan J. Tibshirani and Jonathan Taylor, Degrees of freedom in lasso problems, Ann. Statist. 40 (2012), no. 2, 1198–1232.
  • [vdV98] Aad van der Vaart, Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 3, Cambridge University Press, Cambridge, 1998.
  • [vdV02] by same author, Semiparametric statistics, Lectures on probability theory and statistics (Saint-Flour, 1999), Lecture Notes in Math., vol. 1781, Springer, Berlin, 2002, pp. 331–457.
  • [Ver12] Nicolas Verzelen, Minimax risks for sparse regressions: ultra-high dimensional phenomenons, Electron. J. Stat. 6 (2012), 38–90.
  • [VV10] Nicolas Verzelen and Fanny Villers, Goodness-of-fit tests for high-dimensional Gaussian linear models, Ann. Statist. 38 (2010), no. 2, 704–752.
  • [WR84] Giles Warrack and Tim Robertson, A likelihood ratio test regarding two nested but oblique order-restricted hypotheses, J. Amer. Statist. Assoc. 79 (1984), no. 388, 881–886.
  • [WWG19] Yuting Wei, Martin J. Wainwright, and Adityanand Guntuboyina, The geometry of hypothesis testing over convex cones: generalized likelihood ratio tests and minimax radii, Ann. Statist. 47 (2019), no. 2, 994–1024.
  • [Zha02] Cun-Hui Zhang, Risk bounds in isotonic regression, Ann. Statist. 30 (2002), no. 2, 528–555.
  • [ZHT07] Hui Zou, Trevor Hastie, and Robert Tibshirani, On the “degrees of freedom” of the lasso, Ann. Statist. 35 (2007), no. 5, 2173–2192.