
Adaptive lasso and Dantzig selector for spatial point processes intensity estimation

Achmad Choiruddin (Department of Statistics, Institut Teknologi Sepuluh Nopember (ITS), Indonesia), Jean-François Coeurjolly and Frédérique Letué (Laboratoire Jean Kuntzmann, Université Grenoble Alpes, France)
Abstract

Lasso and Dantzig selector are standard procedures able to perform variable selection and estimation simultaneously. This paper is concerned with extending these procedures to spatial point process intensity estimation. We propose adaptive versions of these procedures, develop efficient computational methodologies and derive asymptotic results for a large class of spatial point processes under an original setting where the number of parameters, i.e. the number of spatial covariates considered, increases with the expected number of data points. Both procedures are compared theoretically, in a simulation study, and in a real data example.

Keywords: estimating equations, high-dimensional statistics, linear programming, regularization methods, spatial point pattern.

1 Introduction

Spatial point processes are stochastic processes which model random locations of points in space, such as random locations of trees in a forest, locations of disease cases, earthquake occurrences and crime events [e.g. 2, 8, 13, 23]. To understand the arrangement of points, the intensity function is the standard summary function [2, 13]. When one seeks to describe the probability of observing a point at location $u\in\mathbb{R}^{d}$ in terms of covariates, the most popular model for the intensity function $\rho$ is

\rho(u;\bm{\beta})=\exp\{\bm{\beta}^{\top}\mathbf{z}(u)\},\quad u\in D\subset\mathbb{R}^{d}, \qquad (1)

where, for $p\geq 1$, $\mathbf{z}(u)=\{z_{1}(u),\ldots,z_{p}(u)\}^{\top}$ represents spatial covariates measured at location $u$ and $\bm{\beta}=\{\beta_{1},\ldots,\beta_{p}\}^{\top}$ is a real $p$-dimensional parameter.

The score of the Poisson likelihood, i.e. the likelihood when the underlying process is assumed to be Poisson, remains an unbiased estimating equation for $\bm{\beta}$ even if the point pattern does not arise from a Poisson process [25]. Such a method is well-studied in the literature and has been extended in several ways to gain efficiency [e.g. 18, 19] when the number of covariates is moderate. Standard results cover the consistency and asymptotic normality of the maximum Poisson likelihood estimator under the increasing domain asymptotic framework (see e.g. [18] and references therein).

When a large number of covariates is available, variable selection is unavoidable. Performing estimation and selection for spatial point process intensity models has received a lot of attention. Recent developments center on regularization methods [e.g. 10, 14, 24, 27] such as the lasso. In particular, [10] consider several composite likelihoods penalized by a large class of convex and non-convex penalty functions and obtain asymptotic results under the increasing domain asymptotic framework.

The Dantzig selector is an alternative to regularization techniques. It was initially proposed for linear models by [7] and subsequently extended to more complex models [e.g. 1, 15, 21]. In particular, [15] generalizes this approach to general estimating equations. One of the main advantages of the Dantzig selector is its implementation, which, for linear models, results in a linear program. Since then, the Dantzig selector and lasso procedures have been compared in different contexts [e.g. 3, 22].

In this paper, we compare the lasso and Dantzig selector when applied to intensity estimation for spatial point processes. We compare these procedures in the complex asymptotic framework where the number of informative covariates, say $s_{n}$, and the number of non-informative covariates, say $p_{n}-s_{n}$, may increase with the mean number of points. Our asymptotic results are developed under a setting which embraces both the increasing domain and infill asymptotic frameworks often considered in the literature (see also Section 2). Such a setting is almost never considered in the spatial point process literature (see again e.g. [18]).

It is well-known that the Poisson likelihood can be approximated by a quasi-Poisson regression model, see e.g. [2]. For the adaptive lasso procedure, our theoretical contributions can also be seen as an extension of works such as [16], which provides asymptotic results for estimators from regularized generalized linear models. However, in our spatial framework, the standard sample size must be replaced by a mean number of points. Furthermore, observations are no longer independent: our results are valid for a large class of dependent spatial point processes. Note also that [16] assumes $s_{n}=s$.

The theoretical contributions of the present paper are two-fold. First, for the adaptive lasso procedure, our contribution extends the work by [10], which considers only an increasing domain asymptotic and assumes $s_{n}=s$ and $p_{n}=p$. Second, for the Dantzig selector, our contributions are to extend the standard methodology to spatial point processes, propose an adaptive version and derive theoretical results. This raises different computational and theoretical issues. As revealed by our main result, Theorem 4.1, the adaptive lasso and Dantzig selector procedures share several similarities but also some slight differences. We first prove that both procedures satisfy an oracle property, i.e. they correctly select the nonzero coefficients with probability converging to one, and second that the estimators of the nonzero coefficients are asymptotically normal. However, the conditions under which the results are valid for the Dantzig selector are slightly more restrictive.

Our conducted simulation study and application to environmental data also demonstrate that both procedures behave similarly.

2 Background and framework

Let $\mathbf{X}$ be a spatial point process on $\mathbb{R}^{d}$, $d\geq 1$. We view $\mathbf{X}$ as a locally finite random subset of $\mathbb{R}^{d}$. Let $D\subset\mathbb{R}^{d}$ be a compact set of Lebesgue measure $|D|$ which will play the role of the observation domain. A realization of $\mathbf{X}$ in $D$ is thus a set $\mathbf{x}=\{x_{1},\ldots,x_{m}\}$, where $x_{i}\in D$ and $m$ is the observed number of points in $D$. Suppose $\mathbf{X}$ has intensity function $\rho$ and second-order product density $\rho^{(2)}$. The Campbell theorem states that, for any function $k:\mathbb{R}^{d}\to[0,\infty)$ or $k:\mathbb{R}^{d}\times\mathbb{R}^{d}\to[0,\infty)$,

\mathbb{E}\sum_{u\in\mathbf{X}}k(u)=\int_{\mathbb{R}^{d}}k(u)\rho(u)\,\mathrm{d}u,\qquad \mathbb{E}\sum_{u,v\in\mathbf{X}}^{\neq}k(u,v)=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}k(u,v)\rho^{(2)}(u,v)\,\mathrm{d}u\,\mathrm{d}v. \qquad (2)

Based on the first two intensity functions, the pair correlation function $g$ is defined by

g(u,v)=\frac{\rho^{(2)}(u,v)}{\rho(u)\rho(v)},\quad u,v\in D,

when both $\rho$ and $\rho^{(2)}$ exist, with the convention $0/0=0$. The pair correlation function measures the departure of the model from the Poisson point process, for which $g=1$. For further background material on spatial point processes, see for example [2, 23].

For our asymptotic considerations, we assume that a sequence $(\mathbf{X}_{n})_{n\geq 1}$ is observed within a sequence of bounded domains $(D_{n})_{n\geq 1}$. We denote by $\rho_{n}$ and $g_{n}$ the intensity and pair correlation function of $\mathbf{X}_{n}$. With an abuse of notation, we denote by $\mathbb{E}$ and $\mathrm{Var}$ the expectation and variance under $\mathbf{X}_{n}$. We assume that the intensity $\rho_{n}$ takes the form $\rho_{n}(u)=\exp\{\bm{\beta}_{0}^{\top}\mathbf{z}(u)\}$, $u\in D_{n}$. We thus let $\bm{\beta}_{0}$ denote the true parameter vector and assume it can be decomposed as $\bm{\beta}_{0}=\{\beta_{01},\ldots,\beta_{0s_{n}},\beta_{0(s_{n}+1)},\ldots,\beta_{0p_{n}}\}^{\top}=(\bm{\beta}^{\top}_{01},\bm{\beta}^{\top}_{02})^{\top}=(\bm{\beta}_{01}^{\top},\mathbf{0}^{\top})^{\top}$. Therefore, $\bm{\beta}_{01}\in\mathbb{R}^{s_{n}}$, $\bm{\beta}_{02}=\mathbf{0}\in\mathbb{R}^{p_{n}-s_{n}}$ and $\bm{\beta}_{0}\in\mathbb{R}^{p_{n}}$, where $s_{n}$ is the number of non-zero coefficients, $p_{n}-s_{n}$ the number of zero coefficients and $p_{n}$ the total number of parameters. We underline that it is unknown to us which coefficients are non-zero and which are zero. Thus, we consider a sparse intensity model where, in particular, $s_{n}$ and $p_{n}$ may diverge to infinity as $n$ grows.

For any $\bm{\beta}\in\mathbb{R}^{p_{n}}$ or for the spatial covariates $\mathbf{z}(u)$, we use a similar notation, i.e. $\bm{\beta}=(\bm{\beta}_{1}^{\top},\bm{\beta}_{2}^{\top})^{\top}$ and $\mathbf{z}(u)=\{\mathbf{z}_{1}(u)^{\top},\mathbf{z}_{2}(u)^{\top}\}^{\top}$, $u\in D_{n}$. We let $\mu_{n}=\mathbb{E}\{N(D_{n})\}$, the expected number of points in $D_{n}$. By the Campbell theorem, we have

\mu_{n}=\int_{D_{n}}\rho_{n}(u;\bm{\beta}_{0})\,\mathrm{d}u=\int_{D_{n}}\exp\{\bm{\beta}_{0}^{\top}\mathbf{z}(u)\}\,\mathrm{d}u=\int_{D_{n}}\exp\{\bm{\beta}_{01}^{\top}\mathbf{z}_{1}(u)\}\,\mathrm{d}u.

Note that $\mu_{n}$ is a function of $D_{n}$, $\bm{\beta}_{01}$, $\mathbf{z}_{1}(u)$ and $s_{n}$. In this paper, we assume that $\mu_{n}\to\infty$ as $n\to\infty$. This kind of assumption is very general and embraces the well-known frameworks of increasing domain asymptotics and infill asymptotics. In the increasing domain context, $D_{n}\to\mathbb{R}^{d}$ and usually $\bm{\beta}_{01}$ depends on $n$ only through $s_{n}$. In the infill context, $D_{n}=D$ is assumed to be a bounded domain of $\mathbb{R}^{d}$ and usually $z_{1}(u)=1$ and $\bm{\beta}_{01}=\theta_{n}\to\infty$ as $n\to\infty$. In some sense, the parameter $\mu_{n}$ plays the role of the sample size in standard inference.

To reduce notation in the following, unless it is ambiguous, we do not index $\mathbf{X}$, $\rho$, $g$, $\bm{\beta}_{0}$, $\bm{\beta}$ and $\mathbf{z}(u)$ with $n$.

3 Methodologies

3.1 Standard methodology

If $\mathbf{X}$ is a Poisson point process, then, on $D_{n}$, $\mathbf{X}$ admits a density with respect to the unit rate Poisson point process [23]. This yields the log-likelihood function for $\bm{\beta}$, which, for the intensity model (1), is proportional to

\ell_{n}(\bm{\beta})=\sum_{u\in\mathbf{X}\cap D_{n}}\bm{\beta}^{\top}\mathbf{z}(u)-\int_{D_{n}}\rho(u;\bm{\beta})\,\mathrm{d}u. \qquad (3)

The gradient of (3) is

\mathbf{U}_{n}(\bm{\beta})=\frac{\mathrm{d}}{\mathrm{d}\bm{\beta}}\ell_{n}(\bm{\beta})=\sum_{u\in\mathbf{X}\cap D_{n}}\mathbf{z}(u)-\int_{D_{n}}\mathbf{z}(u)\rho(u;\bm{\beta})\,\mathrm{d}u. \qquad (4)

If $\mathbf{X}$ is not a Poisson point process, the Campbell theorem shows that (4) remains an unbiased estimating equation. Hence, the maximum of (3) still makes sense for non-Poisson models. Such an estimator, which can be viewed as a composite likelihood estimator, has received a lot of attention in the literature, and asymptotic properties are well-established when $p_{n}=p$ and $p$ is moderate [e.g. 18, 19, 25].

We end this section with the definition of the two following $p_{n}\times p_{n}$ matrices:

\mathbf{A}_{n}(\bm{\beta})=\int_{D_{n}}\mathbf{z}(u)\mathbf{z}(u)^{\top}\rho(u;\bm{\beta})\,\mathrm{d}u \qquad (5)
\mathbf{B}_{n}(\bm{\beta})=\mathbf{A}_{n}(\bm{\beta})+\int_{D_{n}}\int_{D_{n}}\mathbf{z}(u)\mathbf{z}(v)^{\top}\{g(u,v)-1\}\rho(u;\bm{\beta})\rho(v;\bm{\beta})\,\mathrm{d}u\,\mathrm{d}v. \qquad (6)

The matrix $\mathbf{A}_{n}(\bm{\beta})$ corresponds to the sensitivity matrix defined by $\mathbf{A}_{n}(\bm{\beta})=-\mathbb{E}\{\mathrm{d}\mathbf{U}_{n}(\bm{\beta})/\mathrm{d}\bm{\beta}^{\top}\}$, while $\mathbf{B}_{n}(\bm{\beta})$ corresponds to the variance of the estimating equation, i.e. $\mathbf{B}_{n}(\bm{\beta})=\mathrm{Var}\{\mathbf{U}_{n}(\bm{\beta})\}$. In passing, we point out that $\mathbf{A}_{n}(\bm{\beta})=-\mathrm{d}\mathbf{U}_{n}(\bm{\beta})/\mathrm{d}\bm{\beta}^{\top}$. Let $\mathbf{M}_{n}$ be some $p_{n}\times p_{n}$ matrix, e.g. $\mathbf{A}_{n}(\bm{\beta})$ or $\mathbf{B}_{n}(\bm{\beta})$. Such a matrix is decomposed as

\mathbf{M}_{n}=\begin{bmatrix}\mathbf{M}_{n,1}\\ \mathbf{M}_{n,2}\end{bmatrix}=\begin{bmatrix}\mathbf{M}_{n,11}&\mathbf{M}_{n,12}\\ \mathbf{M}_{n,21}&\mathbf{M}_{n,22}\end{bmatrix}, \qquad (7)

where $\mathbf{M}_{n,1}$ (resp. $\mathbf{M}_{n,2}$) is the first $s_{n}\times p_{n}$ (resp. the following $(p_{n}-s_{n})\times p_{n}$) block of $\mathbf{M}_{n}$, and $\mathbf{M}_{n,11}$ (resp. $\mathbf{M}_{n,12}$, $\mathbf{M}_{n,21}$, and $\mathbf{M}_{n,22}$) is the $s_{n}\times s_{n}$ top-left corner (resp. the $s_{n}\times(p_{n}-s_{n})$ top-right corner, the $(p_{n}-s_{n})\times s_{n}$ bottom-left corner, and the $(p_{n}-s_{n})\times(p_{n}-s_{n})$ bottom-right corner) of $\mathbf{M}_{n}$. In what follows, for a square symmetric matrix $\mathbf{M}_{n}$, $\nu_{\min}(\mathbf{M}_{n})$ and $\nu_{\max}(\mathbf{M}_{n})$ denote respectively the smallest and largest eigenvalues of $\mathbf{M}_{n}$. Finally, $\|\mathbf{y}\|$ denotes the Euclidean norm of a vector $\mathbf{y}$, while $\|\mathbf{M}_{n}\|=\sup_{\|\mathbf{y}\|\neq 0}\|\mathbf{M}_{n}\mathbf{y}\|/\|\mathbf{y}\|$ denotes the spectral norm. We recall that the spectral norm is subordinate and that, for a symmetric positive definite matrix, $\|\mathbf{M}_{n}\|=\nu_{\max}(\mathbf{M}_{n})$.

3.2 Adaptive lasso (AL)

When the number of parameters is large, regularization methods allow one to perform both estimation and variable selection simultaneously. When $p_{n}=p$, [10] consider several regularization procedures which consist in adding a convex or non-convex penalty term to (3). The proposed methods are unchanged even when the number of covariates diverges. In particular, the adaptive lasso consists in maximizing

Q_{n}(\bm{\beta})=\frac{1}{\mu_{n}}\ell_{n}(\bm{\beta})-\sum_{j=1}^{p_{n}}\lambda_{n,j}|\beta_{j}|, \qquad (8)

where the real numbers $\lambda_{n,j}$ are non-negative tuning parameters. We therefore define the adaptive lasso estimator as

\hat{\bm{\beta}}_{\mathrm{AL}}=\arg\max_{\bm{\beta}\in\mathbb{R}^{p_{n}}}Q_{n}(\bm{\beta}). \qquad (9)

When $\lambda_{n,j}=0$ for $j=1,\dots,p_{n}$, the method reduces to the maximum composite likelihood estimator, and when $\lambda_{n,j}=\lambda_{n}$, to the standard lasso estimator. If $\beta_{1}$ acts as an intercept, meaning that $z_{1}(u)=1$ for all $u\in D_{n}$, it is often desirable to leave this parameter unpenalized. This can be done by setting $\lambda_{n,1}=0$ in the second term of (8). Finally, the choice of $\mu_{n}$ as a normalization factor in (8) follows the implementation of the adaptive lasso procedure for generalized linear models in standard software (e.g. the R package glmnet [17]).

3.3 Adaptive (linearized) Dantzig selector (ALDS)

When applied to a likelihood [7, 21], the Dantzig selector estimate is obtained by minimizing $\|\bm{\beta}\|_{1}$ subject to the infinity norm of the score function being bounded by some threshold parameter $\lambda$. In the spatial point process setting, we propose an adaptive version of the Dantzig selector estimate as the solution of the problem

\min\|\bm{\Lambda}_{n}\bm{\beta}\|_{1}\ \text{ subject to }\ |(\mathbf{U}_{n}(\bm{\beta}))_{j}|\leq\lambda_{n,j}\quad\text{for }j=1,\dots,p_{n}, \qquad (10)

where $\bm{\Lambda}_{n}=\mathrm{diag}(\lambda_{n,1},\ldots,\lambda_{n,p_{n}})$ and $\mathbf{U}_{n}(\bm{\beta})$ is the estimating equation given by (4). It is worth pointing out that setting $\lambda_{n,j}=0$ for $j=1,\dots,p_{n}$ reduces the criterion (10) to $\mathbf{U}_{n}(\bm{\beta})=0$, which leads to the maximum composite likelihood estimator. Similarly to the adaptive lasso procedure, the intercept can be left unpenalized by setting $\lambda_{n,1}=0$. However, in the following, we assume for the ALDS procedure that $\lambda_{n,j}>0$ for convenience, in order to rewrite (10) in the following matrix form:

\min\|\bm{\Lambda}_{n}\bm{\beta}\|_{1}\ \text{ subject to }\ \mu_{n}^{-1}\big\|\bm{\Lambda}_{n}^{-1}\mathbf{U}_{n}(\bm{\beta})\big\|_{\infty}\leq 1. \qquad (11)

We claim that the whole methodology and the proofs could be redone without involving the notation $\bm{\Lambda}_{n}^{-1}$, so that Theorem 4.1 remains valid if, for example, one does not regularize the intercept term.

Due to the nonlinearity of the constraint vector, standard linear programming can no longer be used to solve (11). This results in a non-convex optimization problem: in particular, the feasible set $\{\bm{\beta}:\|\bm{\Lambda}_{n}^{-1}\mathbf{U}_{n}(\bm{\beta})\|_{\infty}\leq 1\}$ is non-convex, which makes the method difficult to implement and to analyze from a theoretical point of view. In the context of generalized linear models, [21] consider the iteratively reweighted least squares method and define an iterative procedure where, at each step of the algorithm, the constraint vector corresponds to a linearization of the updated pseudo-score. Such a procedure is not straightforward to extend from (11) and remains complex to analyze theoretically. As an alternative, we follow [15, Chapter 3] and propose to linearize the constraint vector by expanding $\mathbf{U}_{n}(\bm{\beta})$ around $\tilde{\bm{\beta}}$, an initial estimate of $\bm{\beta}_{0}$, using a first-order Taylor approximation; i.e. we substitute $\mathbf{U}_{n}(\bm{\beta})$ by $\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})$. Such a linearization now enables the use of standard linear programming. We term the solution to the following optimization problem the adaptive linearized Dantzig selector (ALDS) estimate and denote it by $\hat{\bm{\beta}}_{\mathrm{ALDS}}$:

\min\|\bm{\Lambda}_{n}\bm{\beta}\|_{1}\ \text{ subject to }\ \mu_{n}^{-1}\big\|\bm{\Lambda}_{n}^{-1}\big\{\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})\big\}\big\|_{\infty}\leq 1. \qquad (12)

Properties of $\hat{\bm{\beta}}_{\mathrm{ALDS}}$ depend on properties of $\tilde{\bm{\beta}}$, which are made precise in the next section.

4 Asymptotic results

Our main result relies upon the following conditions:

  1. ($\mathcal{C}$.1) The intensity function has the log-linear specification given by (1), where $\bm{\beta}\in\mathbb{R}^{p_{n}}$.

  2. ($\mathcal{C}$.2) $(\mu_{n})_{n\geq 1}$ is an increasing sequence of real numbers such that $\mu_{n}\to\infty$ as $n\to\infty$.

  3. ($\mathcal{C}$.3) The covariates $\mathbf{z}$ satisfy

     \sup_{n\geq 1}\,\sup_{i=1,\dots,p_{n}}\,\sup_{u\in\mathbb{R}^{d}}|z_{i}(u)|<\infty \qquad\text{ and }\qquad \inf_{n\geq 1}\,\inf_{\bm{\phi}\in\mathbb{R}^{p_{n}},\|\bm{\phi}\|=1}\,\inf_{u\in D_{n}}\{\bm{\phi}^{\top}\mathbf{z}(u)\}^{2}>0.

  4. ($\mathcal{C}$.4) The intensity and pair correlation satisfy

     \int_{D_{n}}\int_{D_{n}}\rho(u;\bm{\beta}_{0})\rho(v;\bm{\beta}_{0})|g(u,v)-1|\,\mathrm{d}u\,\mathrm{d}v=O(\mu_{n}).

  5. ($\mathcal{C}$.5) The matrix $\mathbf{B}_{n,11}(\bm{\beta}_{0})$ satisfies

     \liminf_{n}\inf_{\bm{\phi}\in\mathbb{R}^{s_{n}},\|\bm{\phi}\|=1}\bm{\phi}^{\top}\big\{\mu_{n}^{-1}\mathbf{B}_{n,11}(\bm{\beta}_{0})\big\}\bm{\phi}>0.

  6. ($\mathcal{C}$.6) For any $\bm{\phi}\in\mathbb{R}^{s_{n}}\setminus\{0\}$, the following convergence holds in distribution as $n\to\infty$:

     \sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{U}_{n,1}(\bm{\beta}_{0})\stackrel{d}{\to}N(0,1),

     where $\sigma^{2}_{\bm{\phi}}=\bm{\phi}^{\top}\mathbf{B}_{n,11}(\bm{\beta}_{0})\bm{\phi}$.

  7. ($\mathcal{C}$.7) The initial estimate $\tilde{\bm{\beta}}$ satisfies $\|\tilde{\bm{\beta}}-\bm{\beta}_{0}\|=O_{\mathrm{P}}(\sqrt{p_{n}/\mu_{n}})$ and is such that $\|\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\|=O_{\mathrm{P}}(\mu_{n}^{-1})$.

  8. ($\mathcal{C}$.8) As $n\to\infty$, we assume that $s_{n}$, $p_{n}$ and $\mu_{n}$ are such that

     \max\left(\frac{p_{n}^{4}}{\mu_{n}},\frac{s_{n}^{2}p_{n}^{3}}{\mu_{n}}\right)\to 0 \ \text{ for the AL estimate},\qquad \frac{s_{n}^{3}p_{n}^{4}}{\mu_{n}}\to 0 \ \text{ for the ALDS estimate}.

  9. ($\mathcal{C}$.9) Let $a_{n}=\max_{j=1,\ldots,s_{n}}\lambda_{n,j}$ and $b_{n}=\min_{j=s_{n}+1,\ldots,p_{n}}\lambda_{n,j}$. We assume that these sequences are such that, as $n\to\infty$,

     a_{n}\sqrt{s_{n}\mu_{n}}\to 0,\ \ b_{n}\sqrt{\frac{\mu_{n}}{p_{n}^{2}}}\to\infty \ \text{ for the AL estimate};\qquad a_{n}\sqrt{s_{n}^{3}\mu_{n}}\to 0,\ \ b_{n}\sqrt{\frac{\mu_{n}}{p_{n}^{3}}}\to\infty \ \text{ for the ALDS estimate}.

Condition ($\mathcal{C}$.1) specifies the form of intensity models considered in this paper. In particular, note that we do not assume that $\bm{\beta}$ is an element of a bounded domain of $\mathbb{R}^{p_{n}}$. Condition ($\mathcal{C}$.2) specifies our asymptotic framework, where we assume to observe on average more and more points in $D_{n}$. As already mentioned, this may cover increasing domain type or infill type asymptotics. To our knowledge, only [11] consider a similar asymptotic framework, in order to construct information criteria for spatial point process intensity estimation. The context is however very different here, as we consider a large number of covariates and we study methodologies (adaptive lasso or Dantzig selector) which are able to produce a sparse estimate. Condition ($\mathcal{C}$.3) is quite standard and not too restrictive. Note that conditions ($\mathcal{C}$.1)-($\mathcal{C}$.3) allow us to prove in Lemma A.2 that, in a neighborhood of $\bm{\beta}_{0}$, $\int\rho(u;\bm{\beta})\,\mathrm{d}u=O(\mu_{n})$, a useful result widely used in our proofs. The last part of Condition ($\mathcal{C}$.3) asserts that at any location the covariates are linearly independent. Condition ($\mathcal{C}$.3) also implies first that $\liminf_{n}\inf_{\bm{\phi},\|\bm{\phi}\|=1}\bm{\phi}^{\top}\{\mu_{n}^{-1}\mathbf{A}_{n,11}(\bm{\beta}_{0})\}\bm{\phi}>0$ and second that $\liminf_{n}\inf_{\bm{\phi},\|\bm{\phi}\|=1}\bm{\phi}^{\top}\{\mu_{n}^{-1}\mathbf{A}_{n}(\bm{\beta}_{0})\}\bm{\phi}>0$. Condition ($\mathcal{C}$.5) is a similar assumption for the submatrix $\mathbf{B}_{n,11}(\bm{\beta}_{0})$, which corresponds to $\mathrm{Var}\{\mathbf{U}_{n,1}(\bm{\beta}_{0})\}$. Condition ($\mathcal{C}$.4) is also natural. Combined with Condition ($\mathcal{C}$.3), it implies that $\mathbf{B}_{n}(\bm{\beta}_{0})=O(\mu_{n}p_{n})$. When $p_{n}=p$ (and therefore $s_{n}=s$) and in the increasing domain framework, such an assumption is satisfied by a large class of spatial point processes such as determinantal point processes, log-Gaussian Cox processes and Neyman-Scott point processes [see 10]. When $p_{n}=p$ and in the infill asymptotic framework, these assumptions are also valid for many spatial point processes, as discussed by [11]. Condition ($\mathcal{C}$.6) is required to derive the asymptotic normality of $\hat{\bm{\beta}}_{1}$. Under specific frameworks, such a result has already been obtained for a large class of spatial point processes: by [4, 26] under the increasing domain framework with $p_{n}=p$ and by [9] when $p_{n}\to\infty$; by [11] in the infill/increasing domain asymptotic frameworks with $p_{n}=p$.

Condition ($\mathcal{C}$.7) is specific to the ALDS estimate, which requires a preliminary estimate of $\bm{\beta}$. This condition is not unrealistic, as a simple choice for $\tilde{\bm{\beta}}$ is the maximum of the composite likelihood function (3); see the remark after Theorem 4.1. Of course, we do not require that $\tilde{\bm{\beta}}$ be a sparse estimate.

Condition ($\mathcal{C}$.8) reflects the restriction on the number of covariates that can be considered in this study. For the AL estimate, this assumption is very similar to the one required by [16] when $\mu_{n}$ is replaced by $n$ and when the number of non-zero coefficients $s_{n}$ is constant.

Condition ($\mathcal{C}$.9) contains the main ingredients to derive sparsity, consistency and asymptotic normality. We first note that if $\lambda_{n,j}=\lambda_{n}$, then $a_{n}=b_{n}=\lambda_{n}$, whereby the two conditions on $a_{n}$ and $b_{n}$ cannot be satisfied simultaneously, even if $p_{n}=p$. This justifies the introduction of an adaptive version of the Dantzig selector and motivates the use of the adaptive lasso. The condition $a_{n}\sqrt{s_{n}\mu_{n}}\to 0$ for the adaptive lasso is similar to the one imposed by [16], with $\mu_{n}$ replaced by $n$ and $s_{n}=s$ in their context. However, we require a slightly stronger condition on $b_{n}$ than the one in [16]. In our setting, their assumption would read $b_{n}\sqrt{\mu_{n}/p_{n}}\to\infty$, but we would then have to assume that $\nu_{\max}\big(\mathbf{A}_{n}(\bm{\beta}_{0})\big)=O(\mu_{n})$. Such a condition is not straightforwardly satisfied in our setting since, for instance, conditions ($\mathcal{C}$.2)-($\mathcal{C}$.4) only imply that $\nu_{\max}\big(\mathbf{A}_{n}(\bm{\beta}_{0})\big)=O(p_{n}\mu_{n})$.

As already mentioned, we do not assume that $\tilde{\bm{\beta}}$ satisfies any sparsity property. We believe this is the main reason why conditions ($\mathcal{C}$.8)-($\mathcal{C}$.9) contain slightly stronger assumptions for the ALDS estimate than for the AL estimate. We now present our main result, whose proof is provided in Appendices B-C.

Theorem 4.1.

Let $\hat{\bm{\beta}}$ denote either $\hat{\bm{\beta}}_{\mathrm{AL}}$ or $\hat{\bm{\beta}}_{\mathrm{ALDS}}$. Assume that conditions ($\mathcal{C}$.1)-($\mathcal{C}$.9) hold. Then the following properties hold:

  1. (i) $\hat{\bm{\beta}}$ exists. Moreover, $\hat{\bm{\beta}}_{\mathrm{AL}}$ satisfies $\|\hat{\bm{\beta}}_{\mathrm{AL}}-\bm{\beta}_{0}\|=O_{\mathrm{P}}\big(\sqrt{p_{n}/\mu_{n}}\big)$.

  2. (ii) Sparsity: $\mathrm{P}(\hat{\bm{\beta}}_{2}=0)\to 1$ as $n\to\infty$.

  3. (iii) Asymptotic normality: for any $\bm{\phi}\in\mathbb{R}^{s_{n}}\setminus\{0\}$ such that $\|\bm{\phi}\|<\infty$,

     \sigma_{\bm{\phi}}^{-1}\,\bm{\phi}^{\top}\mathbf{A}_{n,11}(\bm{\beta}_{0})(\hat{\bm{\beta}}_{1}-\bm{\beta}_{01})\xrightarrow{d}\mathcal{N}(0,1)

in distribution, where $\sigma^{2}_{\bm{\phi}}=\bm{\phi}^{\top}\mathbf{B}_{n,11}(\bm{\beta}_{0})\bm{\phi}$.

To derive the consistency of $\hat{\bm{\beta}}_{\mathrm{AL}}$, a careful look at the proof in Appendix B shows that the condition $a_{n}\sqrt{\mu_{n}s_{n}/p_{n}}\to 0$ would be sufficient. The convergence rate, i.e. $O_{\mathrm{P}}(\sqrt{p_{n}/\mu_{n}})$, is $\sqrt{p_{n}}$ times the convergence rate of the estimator obtained when $p_{n}$ is constant [see 10, Theorem 1]. It also corresponds to the rate of convergence obtained by [16] for generalized linear models when $p_{n}\to\infty$ and $\mu_{n}$ corresponds to the standard sample size. It is worth pointing out that a possibly diverging number of non-zero coefficients $s_{n}$ does not affect the rate of convergence; it does, however, impose a more restrictive condition on $a_{n}$. Still regarding Theorem 4.1 (i), its proof shows that this result remains valid when $\lambda_{n,j}=0$ for $j=1,\dots,p_{n}$. In other words, the maximum composite likelihood estimator is consistent with the same rate of convergence. Hence, a simple choice for the initial estimate $\tilde{\bm{\beta}}$ defining the ALDS estimate is the maximum of the Poisson likelihood given by (3).

Theorem 4.1 (iii) is the result one would obtain if $p_{n}-s_{n}=0$. Therefore, the efficiency of $\hat{\bm{\beta}}_{\mathrm{AL},1}$ and $\hat{\bm{\beta}}_{\mathrm{ALDS},1}$ is the same as that of the estimator of $\bm{\beta}_{01}$ obtained by maximizing (3) for the submodel where $\bm{\beta}_{02}=\mathbf{0}$ is known. In other words, when $n$ is sufficiently large, both estimators are as efficient as the oracle one.

We end this section with the following remark. Although the asymptotic properties of $\hat{\bm{\beta}}_{\mathrm{AL}}$ and $\hat{\bm{\beta}}_{\mathrm{ALDS}}$, and the conditions under which they are valid, are (almost) identical, the proofs are completely different and rely upon different tools. For $\hat{\bm{\beta}}_{\mathrm{AL}}$, our contribution is to extend the proof by [10], where only the increasing domain framework was considered, i.e. $\mu_{n}=O(|D_{n}|)$ and $D_{n}\to\mathbb{R}^{d}$ as $n\to\infty$, with $s_{n}=s$ and $p_{n}=p$. The results for $\hat{\bm{\beta}}_{\mathrm{ALDS}}$ are the first ones available for spatial point processes. To handle this estimator, we first have to study existence and optimal solutions for the primal and dual problems.

5 Computational considerations

To lessen notation, we drop the index $n$ from quantities such as $\mathbf{U}_{n}$, $\lambda_{n,j}$, $\ell_{n}$ and $\mu_{n}$. From a practical point of view, we set $\mu=N(D)$, the number of data points in $D$.

5.1 Berman-Turner approach

Before discussing how the AL and ALDS estimates are obtained, we first recall the Berman-Turner approximation [2] used to derive the Poisson likelihood estimate (3). The Berman-Turner approximation consists in discretizing the integral term in (3) as

\int_{D}\rho(u;\bm{\beta})\,\mathrm{d}u\approx\sum_{i=1}^{M}w(u_{i})\rho(u_{i};\bm{\beta}),

where $u_{i}$, $i=1,\ldots,M$, are points in $D$ consisting of the $m$ data points and $M-m$ dummy points, and where the quadrature weights $w(u_{i})>0$ are positive real numbers such that $\sum_{i}w(u_{i})=|D|$. Using this integral discretization, (3) is approximated by

\ell(\bm{\beta})\approx\tilde{\ell}(\bm{\beta})=\sum_{i=1}^{M}w_{i}\{y_{i}\log\rho_{i}(\bm{\beta})-\rho_{i}(\bm{\beta})\}, \qquad (13)

where $w_{i}=w(u_{i})$, $y_{i}=w_{i}^{-1}\mathbf{1}(u_{i}\in\mathbf{X}\cap D)$ and $\rho_{i}(\bm{\beta})=\rho(u_{i};\bm{\beta})$. Equation (13) is formally equivalent to the weighted likelihood function of independent Poisson variables $y_{i}$ with weights $w_{i}$. The approximations of (4) and (5) follow along similar lines, which respectively result in

\mathbf{U}(\bm{\beta})\approx\tilde{\mathbf{U}}(\bm{\beta})=\sum_{i=1}^{M}w_{i}\mathbf{z}_{i}\{y_{i}-\rho_{i}(\bm{\beta})\},\qquad \mathbf{A}(\bm{\beta})\approx\tilde{\mathbf{A}}(\bm{\beta})=\sum_{i=1}^{M}w_{i}\mathbf{z}_{i}\mathbf{z}_{i}^{\top}\rho_{i}(\bm{\beta}), \qquad (14)

where $\mathbf{z}_{i}=\mathbf{z}(u_{i})$. Thus, standard statistical software for generalized linear models can be used to obtain the estimates. This approach is implemented in the spatstat R package through the ppm function with option method="mpl" [2].
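To make the discretization (13)-(14) concrete, here is a minimal Python sketch (not the spatstat code path): it builds dummy points on a regular grid, assigns quadrature weights and evaluates the approximations. The covariate callback covfun, the grid resolution and the equal-weight scheme are illustrative assumptions; spatstat instead uses tile-based counting weights.

```python
import numpy as np

def berman_turner(points, covfun, window, n_grid=32):
    """Berman-Turner quadrature scheme underlying (13)-(14).

    points : (m, 2) array of observed locations in the window
    covfun : callable mapping a (k, 2) array of locations to a (k, p) matrix
    window : (xmin, xmax, ymin, ymax)
    """
    xmin, xmax, ymin, ymax = window
    gx, gy = np.meshgrid(np.linspace(xmin, xmax, n_grid),
                         np.linspace(ymin, ymax, n_grid))
    dummy = np.column_stack([gx.ravel(), gy.ravel()])
    quad = np.vstack([points, dummy])            # quadrature points u_1, ..., u_M
    area = (xmax - xmin) * (ymax - ymin)
    w = np.full(len(quad), area / len(quad))     # crude equal weights, sum_i w_i = |D|
    y = np.zeros(len(quad))
    y[:len(points)] = 1.0 / w[:len(points)]      # y_i = w_i^{-1} 1(u_i is a data point)
    return w, y, covfun(quad)

def poisson_loglik(beta, w, y, Z):
    """Quadrature approximation (13), using log rho_i = z_i' beta."""
    eta = Z @ beta
    return np.sum(w * (y * eta - np.exp(eta)))

def score_and_sensitivity(beta, w, y, Z):
    """Approximations (14) of U(beta) and A(beta)."""
    rho = np.exp(Z @ beta)
    U = Z.T @ (w * (y - rho))
    A = (Z * (w * rho)[:, None]).T @ Z
    return U, A
```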

5.2 Adaptive lasso (AL)

First, given a current estimate $\check{\bm{\beta}}$, (13) is approximated using a second-order Taylor expansion, in order to apply iteratively reweighted least squares:

\tilde{\ell}(\bm{\beta})\approx\ell_{Q}(\bm{\beta})=-\frac{1}{2}\sum_{i=1}^{M}\psi_{i}(y_{i}^{*}-\bm{\beta}^{\top}\mathbf{z}_{i})^{2}+C(\check{\bm{\beta}}), \qquad (15)

where $C(\check{\bm{\beta}})$ is a constant and $y_{i}^{*}$ and $\psi_{i}$ are the working responses and weights, $y_{i}^{*}=\mathbf{z}_{i}^{\top}\check{\bm{\beta}}+\{y_{i}-\exp(\check{\bm{\beta}}^{\top}\mathbf{z}_{i})\}/\exp(\check{\bm{\beta}}^{\top}\mathbf{z}_{i})$ and $\psi_{i}=w_{i}\exp(\check{\bm{\beta}}^{\top}\mathbf{z}_{i})$. Second, a penalized weighted least squares problem is obtained by adding the penalty term. We therefore solve

\min_{\bm{\beta}\in\mathbb{R}^{p}}\Omega(\bm{\beta})=\min_{\bm{\beta}\in\mathbb{R}^{p}}\left\{-\frac{1}{N(D)}\ell_{Q}(\bm{\beta})+\sum_{j=1}^{p}\lambda_{j}|\beta_{j}|\right\} \qquad (16)

using the coordinate descent algorithm [17]. The method consists in partially minimizing (16) with respect to $\beta_{j}$ given $\check{\beta}_{l}$ for $l\neq j$, $l,j=1,\ldots,p$, that is,

\min_{\beta_{j}}\Omega(\check{\beta}_{1},\ldots,\check{\beta}_{j-1},\beta_{j},\check{\beta}_{j+1},\ldots,\check{\beta}_{p}).

With a few modifications, (16) can be solved using the glmnet R package [17]. More details about this implementation can be found in [10, Appendix C].
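For illustration, a minimal self-contained sketch of this scheme (an IRLS outer loop plus cyclic coordinate descent with soft-thresholding, not the actual glmnet code) might look as follows; the iteration caps and tolerance are arbitrary assumptions. Setting lam[0] = 0 leaves an intercept unpenalized, as discussed after (9).

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def adaptive_lasso_irls(w, y, Z, lam, n_irls=25, n_cd=200, tol=1e-8):
    """IRLS approximation (15) with cyclic coordinate descent on (16).

    w, y, Z : quadrature weights, responses and covariates from (13)
    lam     : vector of penalties lambda_j
    """
    M, p = Z.shape
    n_pts = np.sum(w * y)                 # N(D), the observed number of points
    beta = np.zeros(p)
    for _ in range(n_irls):
        eta = Z @ beta
        mu = np.exp(eta)
        psi = w * mu                      # working weights psi_i
        ystar = eta + (y - mu) / mu       # working responses y_i^*
        r = ystar - eta                   # residuals at the current beta
        for _ in range(n_cd):
            beta_old = beta.copy()
            for j in range(p):
                r += Z[:, j] * beta[j]    # partial residual excluding coordinate j
                num = np.sum(psi * Z[:, j] * r) / n_pts
                den = np.sum(psi * Z[:, j] ** 2) / n_pts
                beta[j] = soft_threshold(num, lam[j]) / den
                r -= Z[:, j] * beta[j]    # restore coordinate j
            if np.max(np.abs(beta - beta_old)) < tol:
                break
    return beta
```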

5.3 Adaptive (linearized) Dantzig selector (ALDS)

Given $\tilde{\bm{\beta}}$, (12) is a linear problem and simple to implement. The main task is to compute the vectors $\mathbf{U}(\tilde{\bm{\beta}})$ and $\mathbf{A}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})$. Typically, $\tilde{\bm{\beta}}$ is chosen as the maximum composite likelihood estimate. Then, $\mathbf{U}(\tilde{\bm{\beta}})$ and $\mathbf{A}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})$ are approximated by (14). This results in solving

\min\sum_{j=1}^{p}\lambda_{j}|\beta_{j}|\ \text{ subject to }\ N(D)^{-1}\Big|\tilde{\mathbf{U}}_{j}(\tilde{\bm{\beta}})+\big\{\tilde{\mathbf{A}}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})\big\}_{j}\Big|\leq\lambda_{j},\quad\text{for }j=1,\dots,p,

where $\tilde{\mathbf{U}}_{j}(\tilde{\bm{\beta}})$ and $\{\tilde{\mathbf{A}}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})\}_{j}$ are the $j$-th components of the vectors $\tilde{\mathbf{U}}(\tilde{\bm{\beta}})$ and $\tilde{\mathbf{A}}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})$.
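Since the constraint is linear in $\bm{\beta}$, this problem can be passed to any linear programming solver after the standard split $\bm{\beta}=\bm{\beta}^{+}-\bm{\beta}^{-}$ with $\bm{\beta}^{+},\bm{\beta}^{-}\geq 0$. A sketch using scipy.optimize.linprog (an assumption of this illustration, not necessarily the solver used in the paper) is given below.

```python
import numpy as np
from scipy.optimize import linprog

def alds_lp(U_tilde, A_tilde, beta_tilde, lam, n_pts):
    """Linearized Dantzig selector (12) as a linear program.

    With beta = b_plus - b_minus (both nonnegative), the objective
    sum_j lam_j |beta_j| becomes lam'(b_plus + b_minus), and the constraint
    |U(beta~) + A(beta~)(beta~ - beta)|_j <= n_pts * lam_j splits into two
    one-sided linear inequalities.
    """
    p = len(beta_tilde)
    c_vec = U_tilde + A_tilde @ beta_tilde       # constant part of the constraint
    obj = np.concatenate([lam, lam])
    #  A beta - c <= n_pts*lam   and   c - A beta <= n_pts*lam
    A_ub = np.block([[A_tilde, -A_tilde],
                     [-A_tilde, A_tilde]])
    b_ub = np.concatenate([n_pts * lam + c_vec,
                           n_pts * lam - c_vec])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    b = res.x
    return b[:p] - b[p:]
```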

5.4 Tuning parameter selection

Both AL and ALDS rely on proper regularization parameters $\lambda_{j}$: too small a $\lambda_{j}$ induces unnecessary bias, while too large a $\lambda_{j}$ induces large variance, so the selection of the $\lambda_{j}$ is an important task. To tune the $\lambda_{j}$, we follow [10, 28] and define $\lambda_{j}=\lambda|\tilde{\beta}_{j}|^{-\nu}$, where $\lambda\geq 0$, $\nu>0$ and $\tilde{\bm{\beta}}$ is the maximum composite likelihood estimate. The weights $|\tilde{\beta}_{j}|^{-\nu}$ serve as prior knowledge to identify the non-zero coefficients since, for a constant $\lambda$, a large (resp. small) $\tilde{\beta}_{j}$ forces $\lambda_{j}$ close to zero (resp. infinity) [28].

Our theoretical results do not cover this stochastic way of setting the regularization parameters, but we believe this choice is pertinent in our context. Here is the intuition: let $\lambda_{n,j}=\lambda_{n}/|\tilde{\beta}_{j}|$. Using the $\sqrt{\mu_{n}/p_{n}}$-consistency of $\tilde{\beta}_{j}$, $\lambda_{n,j}=O_{\mathrm{P}}(\lambda_{n})$ for non-zero coefficients, while, for zero coefficients, we may conjecture that $\lambda_{n,j}=O_{\mathrm{P}}(\lambda_{n}\sqrt{\mu_{n}/p_{n}})$. Forgetting the $O_{\mathrm{P}}$, we may conjecture that $a_{n}\asymp\lambda_{n}$ and $b_{n}\asymp\lambda_{n}\sqrt{\mu_{n}/p_{n}}$. Hence, considering the AL procedure for instance, this would mean that we require $\lambda_{n}$ to satisfy $\lambda_{n}\sqrt{s_{n}\mu_{n}}\to 0$ and $\lambda_{n}\mu_{n}/p_{n}^{3/2}\to\infty$, which constitutes a non-empty condition. For instance, assuming ($\mathcal{C}$.8), the sequence $\lambda_{n}=(s_{n}\mu_{n})^{-\eta}$ satisfies ($\mathcal{C}$.9) as soon as $1/2<\eta<5/9$, since, as $n\to\infty$,

\lambda_{n}\sqrt{s_{n}\mu_{n}}=(s_{n}\mu_{n})^{1/2-\eta}\to 0\qquad\text{ and }\qquad\frac{\lambda_{n}\mu_{n}}{p_{n}^{3/2}}=\left(\frac{\mu_{n}}{s_{n}^{2}p_{n}^{3}}\right)^{\eta/2}\left(\frac{\mu_{n}}{p_{n}^{4}}\right)^{1-3\eta/2}p_{n}^{5/2-9\eta/2}\to\infty.

A rigorous treatment of the stochastic choice $\lambda_{n}/|\tilde{\beta}_{j}|$ for the regularization parameters is left for future research.

The remaining task is to specify $\lambda$. Following the literature, mainly [11], we propose to select $\lambda$ as the minimizer of the Bayesian information criterion for spatial point processes, BIC($\lambda$),

\mathrm{BIC}(\lambda)=-2\ell\{\hat{\bm{\beta}}(\lambda)\}+p_{*}\log N(D),

where $\ell\{\hat{\bm{\beta}}(\lambda)\}$ is the maximized composite likelihood, $p_{*}$ is the number of non-zero elements of $\hat{\bm{\beta}}(\lambda)$ and $N(D)$ is the number of observed data points.
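A grid search over $\lambda$ then ties the pieces together. In the sketch below, the grid, the default exponent $\nu=1$ and the two callbacks loglik_fn and fit_fn are illustrative assumptions.

```python
import numpy as np

def select_lambda(loglik_fn, fit_fn, beta_tilde, n_pts, nu=1.0,
                  lam_grid=np.geomspace(1e-4, 1.0, 30)):
    """Pick the global lambda minimizing BIC(lambda).

    loglik_fn : beta -> composite log-likelihood l(beta), e.g. via (13)
    fit_fn    : penalty vector (lambda_1, ..., lambda_p) -> fitted beta
    """
    weights = np.abs(beta_tilde) ** (-nu)     # adaptive weights |beta_j~|^{-nu}
    best = (np.inf, None, None)               # (BIC, lambda, beta_hat)
    for lam in lam_grid:
        beta_hat = fit_fn(lam * weights)
        p_star = np.count_nonzero(beta_hat)   # number of selected coefficients
        bic = -2.0 * loglik_fn(beta_hat) + p_star * np.log(n_pts)
        if bic < best[0]:
            best = (bic, lam, beta_hat)
    return best
```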

6 Numerical results

In Sections 6.1-6.2, we compare the AL and ALDS for intensity modeling of simulated and real data. The real data example comes from an environmental study where 1146 locations of Acalypha diversifolia trees, shown in Figure 1, are surveyed in a 50-hectare region ($D=1000\,\mathrm{m}\times 500\,\mathrm{m}$) of the tropical forest of Barro Colorado Island (BCI) in central Panama [e.g. 20]. A main question is how this tree species profits from environmental habitats [12, 26] that could be related to the 15 environmental covariates depicted in Figure 2 and their 79 interactions. With a total of 94 covariates, we perform variable selection using the AL and ALDS to determine which covariates should be included in the model. We center and scale the 94 covariates so that the important covariates can be sorted according to the magnitudes of $\hat{\bm{\beta}}$. A subset of these covariates is also used to construct a realistic setting for the simulation study.

Figure 1: Plot of the 1146 Acalypha diversifolia tree locations observed in the tropical forest of Barro Colorado Island.

Figure 2: Maps of covariates used in the simulation study and in the application. From left to right: elevation, slope, Aluminium (row 1); Boron, Calcium, Copper (row 2); Iron, Potassium, Magnesium (row 3); Manganese, Phosphorus, Zinc (row 4); Nitrogen, Nitrogen mineralisation, pH (row 5).

6.1 Simulation study

The simulated point patterns are generated from Poisson and Thomas cluster processes with intensity (1). To generate point patterns from a Thomas process with intensity (1) [e.g. 10], we first generate a parent point pattern from a stationary Poisson point process $\mathbf{C}$ with intensity $\kappa=4\times 10^{-4}$. Given $\mathbf{C}$, offspring point patterns are generated from inhomogeneous Poisson point processes $\mathbf{X}_{c}$, $c\in\mathbf{C}$, with intensity

\rho_{\mathrm{child}}(u)=\exp\{\bm{\beta}^{\top}\mathbf{z}(u)\}k(u-c;\gamma)/\kappa,

where $k(u-c;\gamma)=(2\pi\gamma^{2})^{-1}\exp\{-\|u-c\|^{2}/(2\gamma^{2})\}$. The point process $\mathbf{X}=\cup_{c\in\mathbf{C}}\mathbf{X}_{c}$ is indeed an inhomogeneous Thomas point process with intensity (1). We set $\gamma=5$ and $\gamma=15$. The smaller the $\gamma$, the more clustered the point pattern, leading to moderate clustering for $\gamma=15$ and high clustering for $\gamma=5$.

The covariates $\mathbf{z}(u)$ used for the simulation experiment are from the BCI data. In addition to the 15 environmental factors depicted in Figure 2, we add interactions between pairs of covariates until we obtain the desired number of covariates $p=21$, $41$ or $81$. For each $p$, we consider three mean numbers of points $\mu$, which increase as the observation domain expands. More precisely, $\mu_{1}=150$ (resp. $\mu_{2}=600$, $\mu_{3}=2400$) points are generated on average in $D_{1}=[0,250]\times[0,125]$ (resp. $D_{2}=[0,500]\times[0,250]$, $D_{3}=[0,1000]\times[0,500]$). When $D_{1}$ or $D_{2}$ is considered, we simply rescale the covariates to fit the observation domain. We fix $\beta_{2}=1$ and $\beta_{3}=-1$ while the rest are set to zero, so $s=2$. The parameter $\beta_{1}$ acts as an intercept and is tuned to control the average number of points.
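This simulation design can be reproduced by a thinning construction: simulate a homogeneous Thomas process with the dominating intensity $\rho_{\max}=\sup_{u}\exp\{\bm{\beta}^{\top}\mathbf{z}(u)\}$ and retain each offspring point $u$ with probability $\exp\{\bm{\beta}^{\top}\mathbf{z}(u)\}/\rho_{\max}$. The sketch below follows this route; the covariate callback covfun, the grid used to bound $\rho_{\max}$ and the seed are assumptions, and parents falling outside the window are ignored (a small edge effect).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_thomas(beta, covfun, window, kappa=4e-4, gamma=15.0):
    """Inhomogeneous Thomas process with intensity (1) via independent thinning."""
    xmin, xmax, ymin, ymax = window
    area = (xmax - xmin) * (ymax - ymin)
    # Parents: stationary Poisson process with intensity kappa.
    n_par = rng.poisson(kappa * area)
    parents = np.column_stack([rng.uniform(xmin, xmax, n_par),
                               rng.uniform(ymin, ymax, n_par)])
    # Upper bound rho_max of exp(beta' z(u)), evaluated on a crude grid.
    gx, gy = np.meshgrid(np.linspace(xmin, xmax, 200),
                         np.linspace(ymin, ymax, 200))
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    rho_max = np.exp(covfun(grid) @ beta).max()
    pts = []
    for c in parents:
        # Dominating mean cluster size rho_max / kappa, Gaussian dispersion gamma.
        n_off = rng.poisson(rho_max / kappa)
        off = c + gamma * rng.standard_normal((n_off, 2))
        inside = ((off[:, 0] >= xmin) & (off[:, 0] <= xmax)
                  & (off[:, 1] >= ymin) & (off[:, 1] <= ymax))
        off = off[inside]
        # Thin with probability exp(beta' z(u)) / rho_max.
        keep = rng.uniform(size=len(off)) < np.exp(covfun(off) @ beta) / rho_max
        pts.append(off[keep])
    return np.vstack(pts) if pts else np.empty((0, 2))
```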

Table 1: True positive rate (TPR), false positive rate (FPR) in percentage, RMSE and average time in seconds obtained for AL and ALDS estimates based on 500 simulations from inhomogeneous Poisson point processes observed on different observation domains.
TPR FPR RMSE Time
AL ALDS AL ALDS AL ALDS AL ALDS
$D_{1}$ ($\mu_{1}=150$)
p=20   57   57  23  23  2.4  2.4  0.3  0.3
p=40    7    7  15  15  2.9  2.9  2.0  2.0
p=80    0    0   8   8  2.8  2.8  4.0  4.0
$D_{2}$ ($\mu_{2}=600$)
p=20  100  100   3   4  0.3  0.3  0.3  0.3
p=40   97   96   4   5  0.5  0.5  0.8  0.6
p=80   86   86   8   8  0.9  0.9  3.0  3.0
$D_{3}$ ($\mu_{3}=2400$)
p=20  100  100   0   0  0.1  0.1  0.4  0.3
p=40  100  100   0   0  0.1  0.1  0.9  0.9
p=80  100  100   0   0  0.1  0.1  3.0  3.0
Table 2: True positive rate (TPR), false positive rate (FPR) in percentage, RMSE and average time in seconds obtained for AL and ALDS estimates based on 500 simulations from inhomogeneous Thomas point processes with $\kappa=4\times 10^{-4}$ and $\gamma=15$ (moderate clustering) observed on different observation domains.
TPR FPR RMSE Time
AL ALDS AL ALDS AL ALDS AL ALDS
$D_{1}$ ($\mu_{1}=150$)
p=20   59   59  44  44   8.2   8.2  0.3  0.3
p=40   20   20  30  30  11.0  11.0  2.0  2.0
p=80    0    0  15  15   8.3   8.3  4.0  4.0
$D_{2}$ ($\mu_{2}=600$)
p=20   91   89  53  47   2.6   2.2  0.3  0.3
p=40   88   86  48  43   4.9   3.8  1.0  0.6
p=80   80   80  35  35   7.7   7.7  5.0  5.0
$D_{3}$ ($\mu_{3}=2400$)
p=20  100  100  59  43   1.0   0.8  0.9  1.0
p=40  100  100  56  39   1.7   1.1  2.0  3.0
p=80  100  100  52  52   3.2   3.2  8.0  8.0
Table 3: True positive rate (TPR), false positive rate (FPR) in percentage, RMSE and average time in seconds obtained for AL and ALDS estimates based on 500 simulations from inhomogeneous Thomas point processes with $\kappa=4\times 10^{-4}$ and $\gamma=5$ (high clustering) observed on different observation domains.
TPR FPR RMSE Time
AL ALDS AL ALDS AL ALDS AL ALDS
$D_{1}$ ($\mu_{1}=150$)
p=20   64   64  72  72   39.0   39.0   0.5   0.5
p=40   29   29  72  72  130.0  130.0   6.0   6.0
p=80    0    0  36  36   69.0   69.0   6.0   6.0
$D_{2}$ ($\mu_{2}=600$)
p=20   91   90  66  61    4.1    3.6   0.4   0.3
p=40   84   80  70  64   13.0   10.0   2.0   0.7
p=80   80   80  74  74   53.0   53.0  10.0  10.0
$D_{3}$ ($\mu_{3}=2400$)
p=20  100  100  66  51    1.4    1.1   1.0   1.0
p=40  100  100  69  51    2.6    1.9   3.0   4.0
p=80  100  100  70  70    7.1    7.1  10.0  10.0

For each model and setting, we generate 500 independent point patterns and estimate the parameters for each using the AL and ALDS procedures. The performances of the AL and ALDS estimates are compared in terms of the true positive rate (TPR), false positive rate (FPR), and root mean squared error (RMSE). We also report the computing time. The TPR (resp. FPR) is the expected fraction of informative (resp. non-informative) covariates included in the selected model, so we expect a high TPR and a low FPR. The RMSE of an estimate $\hat{\bm{\beta}}$ is defined by

\mathrm{RMSE}=\left\{\sum_{j=2}^{p}\hat{\mathbb{E}}(\hat{\beta}_{j}-\beta_{j})^{2}\right\}^{1/2},

where $\hat{\mathbb{E}}$ denotes the empirical mean.
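For completeness, the reported criteria can be computed from the replicated fits as follows; the stacking convention (one row per simulated pattern, intercept in the first column) is an assumption of this sketch.

```python
import numpy as np

def selection_metrics(beta_hats, beta_true):
    """TPR, FPR (in percent) and RMSE over replicated estimates.

    beta_hats : (n_rep, p) array of estimates, one row per simulation
    beta_true : (p,) true coefficient vector, intercept in position 0
    """
    active = beta_true != 0
    active[0] = False                 # the intercept is never counted
    noise = beta_true == 0
    noise[0] = False
    selected = beta_hats != 0
    tpr = 100 * selected[:, active].mean()
    fpr = 100 * selected[:, noise].mean()
    # RMSE over the non-intercept coordinates, as in the displayed formula
    rmse = np.sqrt(((beta_hats[:, 1:] - beta_true[1:]) ** 2).mean(axis=0).sum())
    return tpr, fpr, rmse
```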

Tables 1-3 report results for the Poisson model and for the Thomas model with moderate and high clustering, respectively. When the point patterns come from Poisson processes, AL and ALDS perform very similarly. In particular, both methods do not work well in a small spatial domain with large $p$; the performances improve significantly as $D$ expands, even for large $p$. When the point patterns exhibit clustering (Tables 2-3), AL and ALDS in general tend to overfit the intensity model by selecting too many covariates (indicated by a higher FPR), which yields a higher RMSE. ALDS sometimes performs slightly better in terms of RMSE. Results deteriorate in the high clustering situation but remain very satisfactory: for the three models considered, the TPR increases (resp. the FPR and RMSE decrease) as $|D|$ grows for given $p$, while for given $D$ (especially $D=D_{2},D_{3}$), the results remain quite stable as $p$ increases. In terms of computing time, no major difference is observed.

6.2 Application to the forestry dataset

We model the intensity function of the Acalypha diversifolia point pattern using (1), depending on the 94 environmental covariates described previously. The overall $\bm{\beta}$ estimates from the AL and ALDS procedures are presented in Table 5; Table 4 reports only the top 12 important covariates.

Among the 94 environmental variables, the AL and ALDS respectively select 32 and 33 important covariates (most of them coincide). We sort the magnitudes of $\hat{\bm{\beta}}$ to identify the 12 most informative covariates. It turns out that these 12 covariates are the same for both procedures (see Table 4). For the remaining selected covariates, the rankings differ slightly but the magnitudes are very similar (see Table 5).

Table 4: Twelve most important covariates selected by AL and ALDS for modeling the intensity of Acalypha diversifolia point pattern
Covariates AL ALDS
Ca:N.min -0.89 -0.89
K:N.min 0.58 0.53
Al:Mg -0.51 -0.47
pH 0.48 0.46
B -0.46 -0.43
Ca 0.45 0.42
Al:Fe 0.38 0.39
Fe:K -0.31 -0.33
B:P -0.30 -0.29
P:Nz 0.27 0.26
Fe 0.26 0.25
Mn -0.24 -0.24
Number of selected covariates 32 33

7 Discussion

In this paper, we develop the adaptive lasso and Dantzig selector for spatial point process intensity estimation and provide asymptotic results under an original setting where the numbers of non-zero and zero coefficients diverge with the mean number of points. We demonstrate that both methods share identical asymptotic properties and perform similarly on simulated and real data. This study supplements previous ones [see e.g. 3] where similar conclusions were reached for linear models and generalized linear models.

[10] considered extensions of lasso-type methods involving general convex and non-convex penalties. In particular, composite likelihoods penalized by the SCAD or MC+ penalty showed interesting properties. To integrate such an idea for the Dantzig selector, we could consider the optimization problem

\min\sum_{j=1}^{p}p_{\lambda_{j}}(\beta_{j})\ \text{ subject to }\ \mu^{-1}\Big|\mathbf{U}_{j}(\tilde{\bm{\beta}})+\big[\mathbf{A}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})\big]_{j}\Big|\leq p^{\prime}_{\lambda_{j}}(\beta_{j}),\quad j=1,\ldots,p,

where $p^{\prime}_{\lambda}(\theta)$ is the derivative with respect to $\theta$ of a general penalty function $p_{\lambda}$. However, such an extension would make linear programming unusable and the theoretical developments more complex. We leave this direction for further study.

Another direction for further study is to derive results for the selection of the regularization parameters. As mentioned earlier, a challenging and definitely interesting perspective would be to establish the validity of Theorem 4.1 when the regularization parameters are defined in a stochastic way, such as $\lambda_{n,j}=\lambda_{n}/|\tilde{\beta}_{j}|$.

On a similar topic, [11] studied information criteria such as AIC, BIC and their composite versions under a similar asymptotic framework for selecting the intensity model of a spatial point process. These criteria could be extended to tuning parameter selection in the context of regularization methods for spatial point processes.

Appendix A Additional notation and auxiliary Lemmas

Lemmas A.1-A.2 are used in the proof of Theorem 4.1 in both cases $\hat{\bm{\beta}}=\hat{\bm{\beta}}_{\mathrm{AL}}$ and $\hat{\bm{\beta}}=\hat{\bm{\beta}}_{\mathrm{ALDS}}$. Throughout the proofs, the notation $\mathbf{X}_{n}=O_{\mathrm{P}}(x_{n})$ or $\mathbf{X}_{n}=o_{\mathrm{P}}(x_{n})$ for a random vector $\mathbf{X}_{n}$ and a sequence of real numbers $x_{n}$ means that $\|\mathbf{X}_{n}\|=O_{\mathrm{P}}(x_{n})$ and $\|\mathbf{X}_{n}\|=o_{\mathrm{P}}(x_{n})$, respectively. In the same way, for a vector $\mathbf{V}_{n}$ or a square matrix $\mathbf{M}_{n}$, the notations $\mathbf{V}_{n}=O(x_{n})$ and $\mathbf{M}_{n}=O(x_{n})$ mean that $\|\mathbf{V}_{n}\|=O(x_{n})$ and $\|\mathbf{M}_{n}\|=O(x_{n})$.

Lemma A.1.

Under conditions ($\mathcal{C}$.2), ($\mathcal{C}$.3) and ($\mathcal{C}$.4), the following results hold as $n\to\infty$:

\max\left\{\|\mathbf{U}_{n}(\bm{\beta}_{0})\|,\|\mathbf{U}_{n,2}(\bm{\beta}_{0})\|\right\}=O_{\mathrm{P}}\left(\sqrt{p_{n}\mu_{n}}\right)\quad\text{ and }\quad\mathbf{U}_{n,1}(\bm{\beta}_{0})=O_{\mathrm{P}}\left(\sqrt{s_{n}\mu_{n}}\right). \qquad (17)
Proof.

Using the Campbell theorem (2), the score vector $\mathbf{U}_{n}(\bm{\beta}_{0})$ is unbiased and has variance $\mathrm{Var}\,\mathbf{U}_{n}(\bm{\beta}_{0})=\mathbf{B}_{n}(\bm{\beta}_{0})$. By condition ($\mathcal{C}$.3), for any $u\in D_{n}$, $\mathbf{z}(u)\mathbf{z}(u)^{\top}=O(p_{n})$. Hence, $\mathbf{A}_{n}(\bm{\beta}_{0})=O(p_{n}\mu_{n})$. By the definition (6) and conditions ($\mathcal{C}$.2) and ($\mathcal{C}$.4), we deduce that $\mathbf{B}_{n}(\bm{\beta}_{0})=O(p_{n}\mu_{n})$, whence $\mathrm{Var}\{\mathbf{U}_{n}(\bm{\beta}_{0})\}=O(p_{n}\mu_{n})$. In the same way, $\mathrm{Var}\{\mathbf{U}_{n,2}(\bm{\beta}_{0})\}=O\{(p_{n}-s_{n})\mu_{n}\}=O(p_{n}\mu_{n})$ and $\mathrm{Var}\{\mathbf{U}_{n,1}(\bm{\beta}_{0})\}=O(s_{n}\mu_{n})$. The result is proved since, for any sequence of centered real-valued random variables $Y_{n}$ with finite variance $\mathrm{Var}(Y_{n})$, $Y_{n}=O_{\mathrm{P}}\{\sqrt{\mathrm{Var}(Y_{n})}\}$. ∎

The next lemma states that, in the vicinity of $\bm{\beta}_{0}$, $\rho(u;\bm{\beta})$ and $\rho(u;\bm{\beta}_{0})$ have the same behaviour.

Lemma A.2.

(i) Let $(\zeta_{n})_{n\geq 1}$ be any sequence such that $\zeta_{n}=o(1/\sqrt{p_{n}})$ and let $\kappa$ be any non-negative real number. Then, under conditions ($\mathcal{C}$.1)-($\mathcal{C}$.3), we have

\sup_{\|\bm{\beta}-\bm{\beta}_{0}\|\leq\kappa\zeta_{n}}\int_{D_{n}}\rho(u;\bm{\beta})\,\mathrm{d}u=O(\mu_{n}).

(ii) Similarly, for any random vector $\bm{\beta}$ such that $\|\bm{\beta}-\bm{\beta}_{0}\|=o_{\mathrm{P}}(1/\sqrt{p_{n}})$,

\int_{D_{n}}\rho(u;\bm{\beta})\,\mathrm{d}u=O_{\mathrm{P}}(\mu_{n}).

(iii) In addition, under condition ($\mathcal{C}$.8), (i)-(ii) are valid for the sequence defined by $\zeta_{n}=\sqrt{p_{n}/\mu_{n}}$.

Proof.

(i)-(ii) We only focus on (i), as (ii) follows along similar lines. For any $u\in D_{n}$, there exists, by condition ($\mathcal{C}$.3), a constant $\kappa<\infty$ (independent of $u$, $\bm{\beta}$ and $\bm{\beta}_{0}$) such that

-\kappa\sqrt{p_{n}}\|\bm{\beta}-\bm{\beta}_{0}\|\leq(\bm{\beta}-\bm{\beta}_{0})^{\top}\mathbf{z}(u)\leq\kappa\sqrt{p_{n}}\|\bm{\beta}-\bm{\beta}_{0}\|.

Since $\int\rho(u;\bm{\beta})\,\mathrm{d}u=\int\exp\{(\bm{\beta}-\bm{\beta}_{0})^{\top}\mathbf{z}(u)\}\rho(u;\bm{\beta}_{0})\,\mathrm{d}u$, we deduce that

\mu_{n}\exp(-\kappa\sqrt{p_{n}}\|\bm{\beta}-\bm{\beta}_{0}\|)\leq\int_{D_{n}}\rho(u;\bm{\beta})\,\mathrm{d}u\leq\mu_{n}\exp(\kappa\sqrt{p_{n}}\|\bm{\beta}-\bm{\beta}_{0}\|),

which yields the result by the definition of $\zeta_{n}$.
(iii) Condition ($\mathcal{C}$.8) implies in particular that $\sqrt{p_{n}^{2}/\mu_{n}}\to 0$ as $n\to\infty$. ∎

Appendix B Proof of Theorem 4.1 when $\hat{\bm{\beta}}=\hat{\bm{\beta}}_{\mathrm{AL}}$

B.1 Existence of a root-$(\mu_{n}/p_{n})$ consistent local maximizer

The first result presented hereafter shows that there exists a local maximizer of $Q_{n}(\bm{\beta})$ which is a consistent estimator of $\bm{\beta}_{0}$.

Proposition B.1.

Assume that conditions ($\mathcal{C}$.2)-($\mathcal{C}$.4) hold. If, in addition, $p_{n}^{4}/\mu_{n}\to 0$ and $a_{n}\sqrt{s_{n}\mu_{n}/p_{n}}\to 0$ as $n\to\infty$, then there exists a local maximizer $\hat{\bm{\beta}}_{\mathrm{AL}}$ of $Q_{n}(\bm{\beta})$ such that

\|\hat{\bm{\beta}}_{\mathrm{AL}}-\bm{\beta}_{0}\|=O_{\mathrm{P}}\big(\sqrt{p_{n}/\mu_{n}}\big).

Note that the conditions on $a_{n}$, $s_{n}$, $\mu_{n}$ and $p_{n}$ are actually implied by conditions ($\mathcal{C}$.8) and ($\mathcal{C}$.9). In the proof of this result and the following ones, the notation $\kappa$ stands for a generic constant which may vary from line to line. In particular, this constant is independent of $n$, $\bm{\beta}_{0}$ and $\mathbf{k}$.

Proof.

Let $\mathbf{k}\in\mathbb{R}^{p_{n}}$. We remind the reader that the estimate of $\bm{\beta}_{0}$ is defined as the maximizer of the function $Q_{n}$, given by (8), over $\mathbb{R}^{p_{n}}$. To prove Proposition B.1, we aim at proving that for any given $\epsilon>0$, there exists a sufficiently large $K>0$ such that, for $n$ sufficiently large,

\mathrm{P}\bigg\{\sup_{\|\mathbf{k}\|=K}\Delta_{n}(\mathbf{k})>0\bigg\}\leq\epsilon,\quad\text{ where }\ \Delta_{n}(\mathbf{k})=Q_{n}(\bm{\beta}_{0}+\sqrt{p_{n}/\mu_{n}}\,\mathbf{k})-Q_{n}(\bm{\beta}_{0}). \qquad (18)

Equation (18) implies that, with probability at least $1-\epsilon$, there exists a local maximum in the ball $\{\bm{\beta}_{0}+\sqrt{p_{n}/\mu_{n}}\,\mathbf{k}:\|\mathbf{k}\|\leq K\}$, and therefore a local maximizer $\hat{\bm{\beta}}$ such that $\|\hat{\bm{\beta}}-\bm{\beta}_{0}\|=O_{\mathrm{P}}(\sqrt{p_{n}/\mu_{n}})$. We decompose $\Delta_{n}(\mathbf{k})$ as $\Delta_{n}(\mathbf{k})=T_{1}+T_{2}$, where

T_{1}=\mu_{n}^{-1}\left\{\ell_{n}(\bm{\beta}_{0}+\sqrt{p_{n}/\mu_{n}}\,\mathbf{k})-\ell_{n}(\bm{\beta}_{0})\right\},\qquad T_{2}=\sum_{j=1}^{p_{n}}\lambda_{n,j}\left(|\beta_{0j}|-\big|\beta_{0j}+\sqrt{p_{n}/\mu_{n}}\,k_{j}\big|\right).

Since $\rho(u;\cdot)$ is infinitely continuously differentiable and $\ell_{n}^{(2)}(\bm{\beta})=-\mathbf{A}_{n}(\bm{\beta})$, a second-order Taylor expansion gives, for some $t\in(0,1)$,

\mu_{n}T_{1}=\sqrt{p_{n}/\mu_{n}}\,\mathbf{k}^{\top}\ell_{n}^{(1)}(\bm{\beta}_{0})+T_{11}+T_{12},

where

T11=\displaystyle T_{11}= 12pnμn𝐤𝐀n(𝜷0)𝐤\displaystyle-\frac{1}{2}\frac{p_{n}}{\mu_{n}}\mathbf{k}^{\top}\mathbf{A}_{n}(\bm{\beta}_{0})\mathbf{k}
T12=\displaystyle T_{12}= +12pnμn𝐤{𝐀n(𝜷0)𝐀n(𝜷0+tpn/μn𝐤)}𝐤.\displaystyle+\frac{1}{2}\frac{p_{n}}{\mu_{n}}\mathbf{k}^{\top}\left\{\mathbf{A}_{n}(\bm{\beta}_{0})-\mathbf{A}_{n}(\bm{\beta}_{0}+t\sqrt{p_{n}/\mu_{n}}\mathbf{k})\right\}\mathbf{k}.

By condition ($\mathcal{C}$.3),

T_{11}=-\frac{1}{2}p_{n}\frac{\mathbf{k}^{\top}\{\mu_{n}^{-1}\mathbf{A}_{n}(\bm{\beta}_{0})\}\mathbf{k}}{\|\mathbf{k}\|^{2}}\,\|\mathbf{k}\|^{2}\leq-\frac{\alpha}{2}p_{n}\|\mathbf{k}\|^{2}

where $\alpha=\liminf_{n\geq 1}\inf_{\bm{\phi},\|\bm{\phi}\|=1}\bm{\phi}^{\top}\{\mu_{n}^{-1}\mathbf{A}_{n}(\bm{\beta}_{0})\}\bm{\phi}>0$. Now, for some $\tilde{\bm{\beta}}$ on the line segment between $\bm{\beta}_{0}$ and $\bm{\beta}_{0}+t\sqrt{p_{n}/\mu_{n}}\,\mathbf{k}$,

T_{12}=\frac{1}{2}\,\frac{p_{n}}{\mu_{n}}\mathbf{k}^{\top}\left\{\int_{D_{n}}\mathbf{z}(u)\mathbf{z}(u)^{\top}t\sqrt{\frac{p_{n}}{\mu_{n}}}\,\mathbf{k}^{\top}\mathbf{z}(u)\rho(u;\tilde{\bm{\beta}})\mathrm{d}u\right\}\mathbf{k}.

By conditions ($\mathcal{C}$.2)-($\mathcal{C}$.3) and Lemma A.2,

T_{12}=O\left(\|\mathbf{k}\|^{3}\frac{p_{n}}{\mu_{n}}p_{n}\sqrt{\frac{p_{n}}{\mu_{n}}}\sqrt{p_{n}}\mu_{n}\right)=O\left(p_{n}\sqrt{\frac{p_{n}^{4}}{\mu_{n}}}\right)=o(p_{n}).

Hence, for $n$ sufficiently large,

\mu_{n}T_{1}\leq\sqrt{\frac{p_{n}}{\mu_{n}}}\,\mathbf{k}^{\top}\ell_{n}^{(1)}(\bm{\beta}_{0})-\frac{\alpha}{4}p_{n}\|\mathbf{k}\|^{2}.

Regarding the term $T_{2}$, the summands with $j>s_{n}$ are nonpositive since $\beta_{0j}=0$; hence, using the triangle inequality and $\lambda_{n,j}\leq a_{n}$ for $j=1,\ldots,s_{n}$,

T_{2}\leq\sum_{j=1}^{s_{n}}\lambda_{n,j}\left\{|\beta_{0j}|-\left|\beta_{0j}+\sqrt{\frac{p_{n}}{\mu_{n}}}k_{j}\right|\right\}\leq a_{n}\sqrt{\frac{p_{n}}{\mu_{n}}}\sum_{j=1}^{s_{n}}|k_{j}|\leq a_{n}\sqrt{\frac{s_{n}p_{n}}{\mu_{n}}}\|\mathbf{k}\|.

Combining the bounds on $T_{1}$ and $T_{2}$, we deduce that for $n$ large enough

\Delta_{n}(\mathbf{k})\leq\frac{1}{\mu_{n}}\sqrt{\frac{p_{n}}{\mu_{n}}}\,\mathbf{k}^{\top}\ell_{n}^{(1)}(\bm{\beta}_{0})-\frac{\alpha}{4}\frac{p_{n}}{\mu_{n}}\|\mathbf{k}\|^{2}+a_{n}\sqrt{\frac{s_{n}p_{n}}{\mu_{n}}}\|\mathbf{k}\|.

By the assumptions of Proposition B.1, $a_{n}\sqrt{s_{n}p_{n}/\mu_{n}}=a_{n}\sqrt{s_{n}\mu_{n}/p_{n}}\,p_{n}/\mu_{n}=o(p_{n}/\mu_{n})$, whereby we deduce that for $n$ sufficiently large

\Delta_{n}(\mathbf{k})\leq\frac{1}{\mu_{n}}\sqrt{\frac{p_{n}}{\mu_{n}}}\,\mathbf{k}^{\top}\ell_{n}^{(1)}(\bm{\beta}_{0})-\frac{\alpha}{8}\frac{p_{n}}{\mu_{n}}\|\mathbf{k}\|^{2}.

Now, for $n$ sufficiently large,

\mathrm{P}\bigg\{\sup_{\|\mathbf{k}\|=K}\Delta_{n}(\mathbf{k})>0\bigg\}\leq\mathrm{P}\bigg\{\|\ell_{n}^{(1)}(\bm{\beta}_{0})\|\geq\frac{\alpha}{8}Kp_{n}\sqrt{\frac{\mu_{n}}{p_{n}}}\bigg\}=\mathrm{P}\bigg\{\|\ell_{n}^{(1)}(\bm{\beta}_{0})\|\geq\frac{\alpha}{8}K\sqrt{p_{n}\mu_{n}}\bigg\}<\epsilon

for any given $\epsilon>0$, since $\ell_{n}^{(1)}(\bm{\beta}_{0})=\mathbf{U}_{n}(\bm{\beta}_{0})=O_{\mathrm{P}}(\sqrt{p_{n}\mu_{n}})$ by Lemma A.1. ∎

B.2 Sparsity property for $\hat{\bm{\beta}}=\hat{\bm{\beta}}_{\mathrm{AL}}$

The sparsity property for $\hat{\bm{\beta}}_{\mathrm{AL}}$ follows directly from Proposition B.1 and the following Lemma B.2.

Lemma B.2.

Assume that conditions ($\mathcal{C}$.2)-($\mathcal{C}$.4) and ($\mathcal{C}$.8)-($\mathcal{C}$.9) hold. Then, with probability tending to $1$, for any $\bm{\beta}_{1}\in\mathbb{R}^{s_{n}}$ satisfying $\|\bm{\beta}_{1}-\bm{\beta}_{01}\|=O_{\mathrm{P}}(\sqrt{p_{n}/\mu_{n}})$ and any constant $K_{1}>0$,

Q_{n}\Big\{(\bm{\beta}_{1}^{\top},\mathbf{0}^{\top})^{\top}\Big\}=\max_{\|\bm{\beta}_{2}\|\leq K_{1}\sqrt{p_{n}/\mu_{n}}}Q_{n}\Big\{(\bm{\beta}_{1}^{\top},\bm{\beta}_{2}^{\top})^{\top}\Big\}.
Proof.

Let $\varepsilon_{n}=K_{1}\sqrt{p_{n}/\mu_{n}}$. It is sufficient to show that, with probability tending to $1$ as $n\to\infty$, for any $\bm{\beta}_{1}$ satisfying $\|\bm{\beta}_{1}-\bm{\beta}_{01}\|=O_{\mathrm{P}}(\sqrt{p_{n}/\mu_{n}})$ we have, for any $j=s_{n}+1,\ldots,p_{n}$,

\frac{\partial Q_{n}(\bm{\beta})}{\partial\beta_{j}}<0\quad\mbox{for }0<\beta_{j}<\varepsilon_{n},\mbox{ and} (19)
\frac{\partial Q_{n}(\bm{\beta})}{\partial\beta_{j}}>0\quad\mbox{for }-\varepsilon_{n}<\beta_{j}<0. (20)

From (3),

\frac{\partial\ell_{n}(\bm{\beta})}{\partial\beta_{j}}=\frac{\partial\ell_{n}(\bm{\beta}_{0})}{\partial\beta_{j}}+R_{n},

where $R_{n}=-\int_{D_{n}}z_{j}(u)\big\{\rho(u;\bm{\beta})-\rho(u;\bm{\beta}_{0})\big\}\mathrm{d}u$. Let $u\in\mathbb{R}^{d}$. By a Taylor expansion, there exists $t\in(0,1)$ such that

\rho(u;\bm{\beta})=\rho(u;\bm{\beta}_{0})+(\bm{\beta}-\bm{\beta}_{0})^{\top}\mathbf{z}(u)\rho\{u;\bm{\beta}_{0}+t(\bm{\beta}-\bm{\beta}_{0})\}.

By conditions ($\mathcal{C}$.1)-($\mathcal{C}$.3) and Lemma A.2, we have for $n$ sufficiently large

|R_{n}|\leq\kappa\|\bm{\beta}-\bm{\beta}_{0}\|\sqrt{p_{n}}\int_{D_{n}}\rho(u;\bm{\beta}_{0})\mathrm{d}u=O_{\mathrm{P}}\left(\sqrt{\frac{p_{n}}{\mu_{n}}}\sqrt{p_{n}}\mu_{n}\right)=O_{\mathrm{P}}\left(p_{n}\sqrt{\mu_{n}}\right).

Following the proof of Lemma A.1, we can derive $\mathrm{Var}(\partial\ell_{n}(\bm{\beta}_{0})/\partial\beta_{j})=\mathrm{Var}[\{\mathbf{U}_{n}(\bm{\beta}_{0})\}_{j}]=O(\mu_{n})$, whereby we deduce that

\frac{\partial\ell_{n}(\bm{\beta})}{\partial\beta_{j}}=O_{\mathrm{P}}(p_{n}\sqrt{\mu_{n}}). (21)

Now, we want to prove (19). Let $0<\beta_{j}<\varepsilon_{n}$ and recall that the sequence $b_{n}$ is given by ($\mathcal{C}$.9). Then, for $n$ sufficiently large,

\mathrm{P}\left\{\frac{\partial Q_{n}(\bm{\beta})}{\partial\beta_{j}}<0\right\}=\mathrm{P}\left\{\frac{\partial\ell_{n}(\bm{\beta})}{\partial\beta_{j}}-\mu_{n}\lambda_{n,j}\operatorname{sign}(\beta_{j})<0\right\}
=\mathrm{P}\left\{\frac{\partial\ell_{n}(\bm{\beta})}{\partial\beta_{j}}<\mu_{n}\lambda_{n,j}\right\}
\geq\mathrm{P}\left\{\frac{\partial\ell_{n}(\bm{\beta})}{\partial\beta_{j}}<\mu_{n}b_{n}\right\}
=\mathrm{P}\left\{\frac{\partial\ell_{n}(\bm{\beta})}{\partial\beta_{j}}<p_{n}\sqrt{\mu_{n}}\;\sqrt{\frac{\mu_{n}}{p_{n}^{2}}}\,b_{n}\right\}.

The assertion (19) is therefore deduced from (21) and from the assumption that $b_{n}\sqrt{\mu_{n}/p_{n}^{2}}\to\infty$ as $n\to\infty$. We proceed similarly to prove (20). ∎
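The domination argument behind (19)-(20) can be illustrated numerically. In the sketch below (toy sequences assumed for illustration only), the score scale $p_{n}\sqrt{\mu_{n}}$ from (21) is eventually dominated by the penalty scale $\mu_{n}b_{n}$ whenever $b_{n}\sqrt{\mu_{n}/p_{n}^{2}}\to\infty$.

```python
# Numeric illustration (toy sequences, ours) of the domination argument:
# the score is O_P(p_n sqrt(mu_n)) by (21), while the penalty term scales
# as mu_n * lambda_{n,j} >= mu_n * b_n for j > s_n.
import numpy as np

mu = np.logspace(2, 10, 9)      # mu_n -> infinity
p = mu ** (1 / 8)               # so that p_n^4 / mu_n -> 0
b = mu ** (-1 / 4)              # so that b_n * sqrt(mu_n / p_n^2) -> infinity
score = p * np.sqrt(mu)         # order of the score derivative, cf. (21)
penalty = mu * b                # order of the penalty derivative
print(np.round(score / penalty, 4))          # decreases to 0: the penalty wins
print(np.round(b * np.sqrt(mu / p**2), 2))   # diverges, matching the assumption
```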

B.3 Asymptotic normality for $\hat{\bm{\beta}}=\hat{\bm{\beta}}_{\mathrm{AL}}$

Proof.

As shown in Proposition B.1, there exists a root-$(\mu_{n}/p_{n})$ consistent local maximizer $\hat{\bm{\beta}}_{\mathrm{AL}}$ of $Q_{n}(\bm{\beta})$. Moreover, it can be shown that there exists an estimator $\hat{\bm{\beta}}_{\mathrm{AL},1}$ in Proposition B.1 which is a root-$(\mu_{n}/p_{n})$ consistent local maximizer of $Q_{n}\{(\bm{\beta}_{1}^{\top},\mathbf{0}^{\top})^{\top}\}$, regarded as a function of $\bm{\beta}_{1}$, and which satisfies

\frac{\partial Q_{n}(\hat{\bm{\beta}}_{\mathrm{AL}})}{\partial\beta_{j}}=0\quad\mbox{for }j=1,\ldots,s_{n}\mbox{ and }\hat{\bm{\beta}}_{\mathrm{AL}}=(\hat{\bm{\beta}}_{\mathrm{AL},1}^{\top},\mathbf{0}^{\top})^{\top}.

There exist $t\in(0,1)$ and $\bm{\check{\beta}}=\hat{\bm{\beta}}_{\mathrm{AL}}+t(\bm{\beta}_{0}-\hat{\bm{\beta}}_{\mathrm{AL}})$ such that, for $j=1,\ldots,s_{n}$,

0=\frac{\partial\ell_{n}(\hat{\bm{\beta}}_{\mathrm{AL}})}{\partial\beta_{j}}-\mu_{n}\lambda_{n,j}\operatorname{sign}(\hat{\beta}_{\mathrm{AL},j})
=\frac{\partial\ell_{n}(\bm{\beta}_{0})}{\partial\beta_{j}}+\sum_{l=1}^{s_{n}}\frac{\partial^{2}\ell_{n}(\bm{\check{\beta}})}{\partial\beta_{j}\partial\beta_{l}}(\hat{\beta}_{\mathrm{AL},l}-\beta_{0l})-\mu_{n}\lambda_{n,j}\operatorname{sign}(\hat{\beta}_{\mathrm{AL},j})
=\frac{\partial\ell_{n}(\bm{\beta}_{0})}{\partial\beta_{j}}+\sum_{l=1}^{s_{n}}\frac{\partial^{2}\ell_{n}(\bm{\beta}_{0})}{\partial\beta_{j}\partial\beta_{l}}(\hat{\beta}_{\mathrm{AL},l}-\beta_{0l})+\sum_{l=1}^{s_{n}}\Psi_{n,jl}(\hat{\beta}_{\mathrm{AL},l}-\beta_{0l})-\mu_{n}\lambda_{n,j}\operatorname{sign}(\hat{\beta}_{\mathrm{AL},j}) (22)

where

\Psi_{n,jl}=\frac{\partial^{2}\ell_{n}(\bm{\check{\beta}})}{\partial\beta_{j}\partial\beta_{l}}-\frac{\partial^{2}\ell_{n}(\bm{\beta}_{0})}{\partial\beta_{j}\partial\beta_{l}}.

Let $\mathbf{U}_{n,1}(\bm{\beta}_{0})$ (resp. $\ell^{(2)}_{n,1}(\bm{\beta}_{0})$) be the first $s_{n}$ components (resp. the $s_{n}\times s_{n}$ top-left corner) of $\mathbf{U}_{n}(\bm{\beta}_{0})$ (resp. $\ell^{(2)}_{n}(\bm{\beta}_{0})$). Let also $\bm{\Psi}_{n}$ be the $s_{n}\times s_{n}$ matrix with entries $\Psi_{n,jl}$, $j,l=1,\ldots,s_{n}$. Finally, define the vector

\mathbf{p}^{\prime}_{n}=\{\lambda_{n,1}\operatorname{sign}(\hat{\beta}_{\mathrm{AL},1}),\ldots,\lambda_{n,s_{n}}\operatorname{sign}(\hat{\beta}_{\mathrm{AL},s_{n}})\}^{\top}.

This notation allows us to rewrite (22) as

\mathbf{U}_{n,1}(\bm{\beta}_{0})-\mathbf{A}_{n,11}(\bm{\beta}_{0})(\hat{\bm{\beta}}_{\mathrm{AL},1}-\bm{\beta}_{01})+\bm{\Psi}_{n}(\hat{\bm{\beta}}_{\mathrm{AL},1}-\bm{\beta}_{01})-\mu_{n}\mathbf{p}^{\prime}_{n}=0. (23)

Let $\bm{\phi}\in\mathbb{R}^{s_{n}}\setminus\{0\}$ and $\sigma^{2}_{\bm{\phi}}=\bm{\phi}^{\top}\mathbf{B}_{n,11}(\bm{\beta}_{0})\bm{\phi}$; then

\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{U}_{n,1}(\bm{\beta}_{0})-\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{A}_{n,11}(\bm{\beta}_{0})(\hat{\bm{\beta}}_{\mathrm{AL},1}-\bm{\beta}_{01})+\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\bm{\Psi}_{n}(\hat{\bm{\beta}}_{\mathrm{AL},1}-\bm{\beta}_{01})-\mu_{n}\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{p}^{\prime}_{n}=0.

Now, by condition ($\mathcal{C}$.5), $\sigma_{\bm{\phi}}^{-1}=O(\mu_{n}^{-1/2})$ and, by the definition of $a_{n}$, $\|\mathbf{p}^{\prime}_{n}\|=O(a_{n}\sqrt{s_{n}})$. By conditions ($\mathcal{C}$.1)-($\mathcal{C}$.3), there exists some $\tilde{\bm{\beta}}$ on the line segment between $\bm{\beta}_{0}$ and $\check{\bm{\beta}}$ such that

\bm{\Psi}_{n}=\int_{D_{n}}\mathbf{z}_{1}(u)\mathbf{z}_{1}(u)^{\top}(\check{\bm{\beta}}-\bm{\beta}_{0})^{\top}\mathbf{z}(u)\rho(u;\tilde{\bm{\beta}})\mathrm{d}u,

whereby we deduce from conditions ($\mathcal{C}$.1)-($\mathcal{C}$.3) and Lemma A.2 that

\|\bm{\Psi}_{n}\|=O_{\mathrm{P}}\left(s_{n}\sqrt{\frac{p_{n}}{\mu_{n}}}\sqrt{p_{n}}\mu_{n}\right)=O_{\mathrm{P}}(s_{n}p_{n}\sqrt{\mu_{n}}).

The last two results and conditions ($\mathcal{C}$.8)-($\mathcal{C}$.9) yield

\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\bm{\Psi}_{n}(\hat{\bm{\beta}}_{\mathrm{AL},1}-\bm{\beta}_{01})=O_{\mathrm{P}}\left(\frac{1}{\sqrt{\mu_{n}}}s_{n}p_{n}\sqrt{\mu_{n}}\sqrt{\frac{p_{n}}{\mu_{n}}}\right)=O_{\mathrm{P}}\left(\sqrt{\frac{s_{n}^{2}p_{n}^{3}}{\mu_{n}}}\right)=o_{\mathrm{P}}(1)
\mu_{n}\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{p}^{\prime}_{n}=O\left(\mu_{n}\frac{1}{\sqrt{\mu_{n}}}a_{n}\sqrt{s_{n}}\right)=O(a_{n}\sqrt{s_{n}\mu_{n}})=o(1).

These results finally lead to

\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{A}_{n,11}(\bm{\beta}_{0})(\hat{\bm{\beta}}_{\mathrm{AL},1}-\bm{\beta}_{01})=\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{U}_{n,1}(\bm{\beta}_{0})+o_{\mathrm{P}}(1),

and the result follows from Slutsky's lemma and condition ($\mathcal{C}$.6). ∎
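For completeness, since condition ($\mathcal{C}$.6) is invoked here as the central limit theorem for the normalized score $\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{U}_{n,1}(\bm{\beta}_{0})$, the conclusion can be written explicitly as

\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{A}_{n,11}(\bm{\beta}_{0})(\hat{\bm{\beta}}_{\mathrm{AL},1}-\bm{\beta}_{01})\xrightarrow{d}\mathcal{N}(0,1);

the same limit is obtained for $\hat{\bm{\beta}}_{\mathrm{ALDS},1}$ in Appendix C.4.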

Appendix C Proof of Theorem 4.1 when $\hat{\bm{\beta}}=\hat{\bm{\beta}}_{\mathrm{ALDS}}$

C.1 Existence and optimal solutions for the primal and dual problems

For $\bm{\beta}\in\mathbb{R}^{p_{n}}$, we let $\bm{\Delta}_{n}(\bm{\beta})=\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta})$.

Lemma C.1.

There exists a solution to the problem (12).

Proof.

Following [6], we state that (12) is equivalent to

\min_{\bm{\beta},\bm{u}}\sum_{j}u_{j}\quad\mbox{subject to}\quad\begin{cases}\bm{\Lambda}_{n}\bm{\beta}\leq\bm{u}\\ -\bm{\Lambda}_{n}\bm{\beta}\leq\bm{u}\\ \mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\bm{\beta})-\bm{1}_{p_{n}}\leq\mathbf{0}\\ -\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\bm{\beta})-\bm{1}_{p_{n}}\leq\mathbf{0}\end{cases} (24)

where $\bm{u}\in\mathbb{R}^{p_{n}}$ is an additional parameter vector to be optimized and $\tilde{\bm{\beta}}$ is the initial estimator. Note that (24) is a linear program with $4p_{n}$ linear inequality constraints. To prove the existence of ALDS estimates, we need to derive the dual problem of (24) and prove that strong duality holds. To derive the dual problem, we first construct the Lagrangian associated with (24), following the main arguments of [5, Section 5.2]:

L(\bm{\beta};\bm{u};\bm{\alpha})=\sum_{j}u_{j}+\bm{\alpha}_{1}^{\top}(\bm{\Lambda}_{n}\bm{\beta}-\bm{u})+\bm{\alpha}_{2}^{\top}(-\bm{\Lambda}_{n}\bm{\beta}-\bm{u})+\bm{\alpha}_{3}^{\top}\Big[\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\bm{\beta})-\bm{1}_{p_{n}}\Big]+\bm{\alpha}_{4}^{\top}\Big[-\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\bm{\beta})-\bm{1}_{p_{n}}\Big]
=(\bm{1}_{p_{n}}-\bm{\alpha}_{1}-\bm{\alpha}_{2})^{\top}\bm{u}+\Big\{(\bm{\alpha}_{1}-\bm{\alpha}_{2})^{\top}\bm{\Lambda}_{n}-\mu_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\Big\}\bm{\beta}+(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\Big\{\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\Big(\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}\Big)\Big\}-(\bm{\alpha}_{3}+\bm{\alpha}_{4})^{\top}\bm{1}_{p_{n}},

where $\bm{\alpha}=(\bm{\alpha}_{1}^{\top},\bm{\alpha}_{2}^{\top},\bm{\alpha}_{3}^{\top},\bm{\alpha}_{4}^{\top})^{\top}\in\mathbb{R}^{4p_{n}}$ is the dual vector (which can be viewed as a Lagrange multiplier).

The dual function $h$ is defined by

h(\bm{\alpha})=\inf_{\bm{\beta},\bm{u}}L(\bm{\beta};\bm{u};\bm{\alpha}) (25)
=\begin{cases}(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\Big\{\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\Big(\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}\Big)\Big\}-(\bm{\alpha}_{3}+\bm{\alpha}_{4})^{\top}\bm{1}_{p_{n}},&\text{if }\bm{1}_{p_{n}}-\bm{\alpha}_{1}-\bm{\alpha}_{2}=\mathbf{0}\text{ and }(\bm{\alpha}_{1}-\bm{\alpha}_{2})^{\top}\bm{\Lambda}_{n}-\mu_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})=\mathbf{0},\\ -\infty&\text{otherwise}.\end{cases}

For any $\bm{\alpha}=(\bm{\alpha}_{1}^{\top},\bm{\alpha}_{2}^{\top},\bm{\alpha}_{3}^{\top},\bm{\alpha}_{4}^{\top})^{\top}\in(\mathbb{R}^{+})^{4p_{n}}$, $h(\bm{\alpha})$ is a lower bound for the optimal value of (24) (see [5, p. 216]). Finding the best lower bound amounts to solving the dual problem: $\max_{\bm{\alpha}\geq\mathbf{0}}h(\bm{\alpha})$.

Recall that problem (24) is a linear program with linear inequality constraints, so that strong duality holds if the dual problem is feasible [see 5, p. 227], that is, if there exists some $\bm{\alpha}=(\bm{\alpha}_{1}^{\top},\bm{\alpha}_{2}^{\top},\bm{\alpha}_{3}^{\top},\bm{\alpha}_{4}^{\top})^{\top}\in(\mathbb{R}^{+})^{4p_{n}}$ such that

\bm{1}_{p_{n}}-\bm{\alpha}_{1}-\bm{\alpha}_{2}=\mathbf{0}
(\bm{\alpha}_{1}-\bm{\alpha}_{2})^{\top}\bm{\Lambda}_{n}-\mu_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})=\mathbf{0}.

Moreover, we remark that the system

\bm{\alpha}_{1}\geq\mathbf{0},\;\bm{\alpha}_{2}\geq\mathbf{0},\;\bm{\alpha}_{3}\geq\mathbf{0},\;\bm{\alpha}_{4}\geq\mathbf{0}
\bm{1}_{p_{n}}-\bm{\alpha}_{1}-\bm{\alpha}_{2}=\mathbf{0}
(\bm{\alpha}_{1}-\bm{\alpha}_{2})^{\top}\bm{\Lambda}_{n}-\mu_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})=\mathbf{0}

is equivalent to

\bm{\alpha}_{1}\geq\mathbf{0},\;\bm{\alpha}_{2}=\bm{1}_{p_{n}}-\bm{\alpha}_{1}\geq\mathbf{0},\;\bm{\alpha}_{3}\geq\mathbf{0},\;\bm{\alpha}_{4}\geq\mathbf{0}
(2\bm{\alpha}_{1}-\bm{1}_{p_{n}})^{\top}\bm{\Lambda}_{n}-\mu_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})=\mathbf{0}

which is also equivalent to

\bm{\alpha}_{1}=\frac{1}{2}\Big\{\bm{1}_{p_{n}}+\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})\Big\}\geq\mathbf{0}
\bm{\alpha}_{2}=\bm{1}_{p_{n}}-\bm{\alpha}_{1}=\frac{1}{2}\Big\{\bm{1}_{p_{n}}-\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})\Big\}\geq\mathbf{0}
\bm{\alpha}_{3}\geq\mathbf{0},\;\bm{\alpha}_{4}\geq\mathbf{0}.

This reduces to the condition that there exists $(\bm{\alpha}_{3}^{\top},\bm{\alpha}_{4}^{\top})^{\top}\in(\mathbb{R}^{+})^{2p_{n}}$ such that

\mu_{n}^{-1}\|\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})\|_{\infty}\leq 1. (26)

Therefore, the dual problem associated with (24) is

\max_{\bm{\alpha}_{3},\bm{\alpha}_{4}\geq 0}(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\Big[\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\Big\{\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}\Big\}\Big]-(\bm{\alpha}_{3}+\bm{\alpha}_{4})^{\top}\bm{1}_{p_{n}}
\text{subject to }\mu_{n}^{-1}\|\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})\|_{\infty}\leq 1. (27)

Condition (26) can always be met provided the matrix $\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}$ is non zero. Indeed, let $\mathbf{y}\in\mathbb{R}^{p_{n}}$ be such that $\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}\mathbf{y}\neq\mathbf{0}$. Now define

\alpha_{3j}=\frac{y_{j}}{\mu_{n}^{-1}\|\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}\mathbf{y}\|_{\infty}}\mathbf{1}(y_{j}>0),
\alpha_{4j}=\frac{-y_{j}}{\mu_{n}^{-1}\|\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}\mathbf{y}\|_{\infty}}\mathbf{1}(y_{j}<0).

Clearly, $(\bm{\alpha}_{3}^{\top},\bm{\alpha}_{4}^{\top})^{\top}\in(\mathbb{R}^{+})^{2p_{n}}$ and (26) is verified. This ends the proof. ∎
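Once $\bm{\Delta}_{n}(\bm{\beta})$ is written out, (24) is a standard linear program in $(\bm{\beta},\bm{u})$ and can be passed to any off-the-shelf LP solver. Below is a minimal sketch (our illustration with toy inputs, not the authors' implementation) using scipy.optimize.linprog; the quantities beta_tilde, A_tilde $=\mathbf{A}_{n}(\tilde{\bm{\beta}})$, U_tilde $=\mathbf{U}_{n}(\tilde{\bm{\beta}})$, the weights lam and mu are all assumed toy values.

```python
# Sketch: the ALDS primal linear program (24) with a generic LP solver.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
p, mu = 6, 200.0
beta_tilde = rng.normal(scale=0.5, size=p)     # pilot (initial) estimator
G = rng.normal(size=(p, p))
A_tilde = mu * (G @ G.T / p + np.eye(p))       # positive definite surrogate for A_n
U_tilde = rng.normal(scale=np.sqrt(mu), size=p)
lam = np.abs(rng.normal(size=p)) + 0.1         # adaptive weights lambda_{n,j}
Lam, Lam_inv = np.diag(lam), np.diag(1 / lam)

# Decision vector x = (beta, u); objective sum_j u_j.
c = np.concatenate([np.zeros(p), np.ones(p)])
I, Z = np.eye(p), np.zeros((p, p))
M = Lam_inv @ A_tilde / mu                     # mu^{-1} Lambda^{-1} A_n(beta~)
r = Lam_inv @ (U_tilde + A_tilde @ beta_tilde) / mu
# The four constraint blocks of (24), written as A_ub x <= b_ub:
#   Lam beta - u <= 0,  -Lam beta - u <= 0,  -M beta <= 1 - r,  M beta <= 1 + r.
A_ub = np.block([[Lam, -I], [-Lam, -I], [-M, Z], [M, Z]])
b_ub = np.concatenate([np.zeros(p), np.zeros(p), 1 - r, 1 + r])
bounds = [(None, None)] * p + [(0, None)] * p  # beta free, u >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_hat = res.x[:p]                           # the ALDS estimate of beta
print(res.status, np.round(beta_hat, 3))
```

In practice $\tilde{\bm{\beta}}$, $\mathbf{U}_{n}(\tilde{\bm{\beta}})$ and $\mathbf{A}_{n}(\tilde{\bm{\beta}})$ would come from the unpenalized fit, and any LP backend could be substituted for HiGHS.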

Note that the dual problem (27) can be unequivocally reparameterized in terms of $\bm{\gamma}=\bm{\alpha}_{3}-\bm{\alpha}_{4}$ as

\max_{\bm{\gamma}\in\mathbb{R}^{p_{n}}}\bm{\gamma}^{\top}\Big[\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\Big\{\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}\Big\}\Big]-\|\bm{\gamma}\|_{1}
\text{subject to }\mu_{n}^{-1}\|\bm{\gamma}^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}\|_{\infty}\leq 1 (28)

by the complementary slackness conditions. We now derive optimality conditions ensuring that the Karush-Kuhn-Tucker (KKT) conditions hold, and thus obtain optimal primal and dual solutions.

Lemma C.2.

Consider the primal and dual problems defined by (12) and (28). Suppose that the matrix $\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}$ is non zero and that $\hat{\bm{\beta}}$ and $\hat{\bm{\gamma}}$ satisfy

\mu_{n}^{-1}\Big\|\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\hat{\bm{\beta}})\Big\|_{\infty}\leq 1 (29)
\mu_{n}^{-1}\Big\|\hat{\bm{\gamma}}^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}\Big\|_{\infty}\leq 1 (30)
\mu_{n}^{-1}\hat{\bm{\gamma}}^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\hat{\bm{\beta}}=\|\bm{\Lambda}_{n}\hat{\bm{\beta}}\|_{1} (31)
\mu_{n}^{-1}\hat{\bm{\gamma}}^{\top}\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\hat{\bm{\beta}})=\|\hat{\bm{\gamma}}\|_{1}. (32)

Then the Karush-Kuhn-Tucker (KKT) conditions for (12) are fulfilled and $\hat{\bm{\beta}}$ and $\hat{\bm{\gamma}}$ are the optimal primal and dual solutions.

Proof.

We start by writing the Karush-Kuhn-Tucker (KKT) conditions for the problem (12):

\bm{\Lambda}_{n}\bm{\beta}\leq\bm{u} (33)
-\bm{\Lambda}_{n}\bm{\beta}\leq\bm{u} (34)
\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\bm{\beta})-\bm{1}_{p_{n}}\leq\mathbf{0} (35)
-\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\bm{\beta})-\bm{1}_{p_{n}}\leq\mathbf{0} (36)
\bm{\alpha}_{1}\geq\mathbf{0},\;\bm{\alpha}_{2}\geq\mathbf{0} (37)
\bm{\alpha}_{3}\geq\mathbf{0},\;\bm{\alpha}_{4}\geq\mathbf{0} (38)
\forall i,\;\alpha_{1i}\{(\bm{\Lambda}_{n}\bm{\beta})_{i}-u_{i}\}=0 (39)
\forall i,\;\alpha_{2i}\{-(\bm{\Lambda}_{n}\bm{\beta})_{i}-u_{i}\}=0 (40)
\forall i,\;\alpha_{3i}[\mu_{n}^{-1}\{\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\bm{\beta})\}_{i}-1]=0 (41)
\forall i,\;\alpha_{4i}[-\mu_{n}^{-1}\{\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\bm{\beta})\}_{i}-1]=0 (42)
\bm{1}_{p_{n}}-\bm{\alpha}_{1}-\bm{\alpha}_{2}=\mathbf{0} (43)
(\bm{\alpha}_{1}-\bm{\alpha}_{2})^{\top}\bm{\Lambda}_{n}-\mu_{n}^{-1}(\bm{\alpha}_{3}-\bm{\alpha}_{4})^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})=\mathbf{0}. (44)

Let $\hat{\bm{\beta}}$ and $\hat{\bm{\gamma}}$ satisfy (29)-(32). Conditions (35) and (36) are obviously satisfied under (29). If one defines $\hat{\bm{\alpha}}_{3}$ and $\hat{\bm{\alpha}}_{4}$ by $\hat{\alpha}_{3i}=\max(0,\hat{\gamma}_{i})$ and $\hat{\alpha}_{4i}=\max(0,-\hat{\gamma}_{i})$, then $\hat{\bm{\gamma}}=\hat{\bm{\alpha}}_{3}-\hat{\bm{\alpha}}_{4}$ and (38) is satisfied.

Now, we define

\hat{\bm{\alpha}}_{1}=\frac{1}{2}\Big\{\bm{1}_{p_{n}}+\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}(\hat{\bm{\alpha}}_{3}-\hat{\bm{\alpha}}_{4})\Big\}
\hat{\bm{\alpha}}_{2}=\frac{1}{2}\Big\{\bm{1}_{p_{n}}-\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}(\hat{\bm{\alpha}}_{3}-\hat{\bm{\alpha}}_{4})\Big\}

which ensures (43) and (44). In addition, (30) implies that (37) also holds.

From the definition of $\hat{\bm{\alpha}}_{3}$ and $\hat{\bm{\alpha}}_{4}$, we rewrite (32) as

\sum_{i}\left(\hat{\alpha}_{3i}[\mu_{n}^{-1}\{\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\hat{\bm{\beta}})\}_{i}-1]+\hat{\alpha}_{4i}[-\mu_{n}^{-1}\{\bm{\Lambda}_{n}^{-1}\bm{\Delta}_{n}(\hat{\bm{\beta}})\}_{i}-1]\right)=0.

From (35)-(36) and (38), each term in the above sum is the sum of two nonpositive terms, whereby we deduce that (41)-(42) necessarily hold. With a similar argument, using (31), we also deduce that (39)-(40) hold after setting $u_{i}=|(\bm{\Lambda}_{n}\hat{\bm{\beta}})_{i}|$. This latter choice implies that (33)-(34) are also satisfied. ∎
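Conditions (29)-(32) can also be checked numerically on the output of the LP sketch given after Lemma C.1. The check below reuses the objects of that sketch and assumes (a property of the solver, not of the paper) that scipy's HiGHS backend reports the dual variables of the inequality constraints in res.ineqlin.marginals, nonpositive under its sign convention.

```python
# Continuation of the LP sketch after Lemma C.1: numeric check of (29)-(32).
alpha = -res.ineqlin.marginals          # assumed: (alpha1, alpha2, alpha3, alpha4) >= 0
alpha3, alpha4 = alpha[2*p:3*p], alpha[3*p:]
gamma_hat = alpha3 - alpha4             # dual vector of the reparameterized problem (28)
Delta_hat = U_tilde + A_tilde @ (beta_tilde - beta_hat)       # Delta_n(beta_hat)
print(np.max(np.abs(Lam_inv @ Delta_hat)) / mu <= 1 + 1e-8)   # (29)
print(np.max(np.abs(gamma_hat @ Lam_inv @ A_tilde @ Lam_inv)) / mu <= 1 + 1e-8)  # (30)
print(np.isclose(gamma_hat @ Lam_inv @ A_tilde @ beta_hat / mu,
                 np.abs(Lam @ beta_hat).sum()))               # (31)
print(np.isclose(gamma_hat @ Lam_inv @ Delta_hat / mu,
                 np.abs(gamma_hat).sum()))                    # (32)
```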

C.2 A few auxiliary statements

Before turning to the proof of Theorem 4.1 for the ALDS estimator, we present a few auxiliary results that will be used.

Lemma C.3.

Assume conditions ($\mathcal{C}$.1)-($\mathcal{C}$.7) hold.
(i)

\|\bm{\Lambda}_{n,11}\|=a_{n},\qquad\|\bm{\Lambda}_{n,22}^{-1}\|=\frac{1}{b_{n}}. (45)

(ii) For any $t\in[0,1]$ and $\check{\bm{\beta}}=\bm{\beta}_{0}+t(\tilde{\bm{\beta}}-\bm{\beta}_{0})$, we have

\mathbf{A}_{n}(\check{\bm{\beta}})=O_{\mathrm{P}}(p_{n}\mu_{n})
\mathbf{A}_{n,1}(\check{\bm{\beta}})=O_{\mathrm{P}}(\sqrt{p_{n}s_{n}}\,\mu_{n})
\mathbf{A}_{n,2}(\check{\bm{\beta}})=O_{\mathrm{P}}(p_{n}\mu_{n})
\mathbf{A}_{n,11}(\check{\bm{\beta}})=O_{\mathrm{P}}(s_{n}\mu_{n})
\mathbf{A}_{n,21}(\check{\bm{\beta}})=O_{\mathrm{P}}(\sqrt{p_{n}s_{n}}\,\mu_{n})
\mathbf{A}_{n}(\check{\bm{\beta}})-\mathbf{A}_{n}(\tilde{\bm{\beta}})=O_{\mathrm{P}}(p_{n}^{2}\sqrt{\mu_{n}})
\mathbf{A}_{n,1}(\check{\bm{\beta}})-\mathbf{A}_{n,1}(\tilde{\bm{\beta}})=O_{\mathrm{P}}(\sqrt{s_{n}p_{n}^{3}\mu_{n}})
\mathbf{A}_{n,11}(\check{\bm{\beta}})-\mathbf{A}_{n,11}(\tilde{\bm{\beta}})=O_{\mathrm{P}}(\sqrt{s_{n}^{2}p_{n}^{2}\mu_{n}}).

(iii)

\max\left\{\|\mathbf{U}_{n}(\tilde{\bm{\beta}})\|,\|\mathbf{U}_{n,2}(\tilde{\bm{\beta}})\|\right\}=O_{\mathrm{P}}(\sqrt{p_{n}^{3}\mu_{n}}). (46)
Proof.

(i) follows from the conditions on $a_{n}$ and $b_{n}$.
(ii) follows from conditions ($\mathcal{C}$.1)-($\mathcal{C}$.3), ($\mathcal{C}$.7) and Lemma A.2. We only prove the assertions for the matrices $\mathbf{A}_{n}(\tilde{\bm{\beta}})$ and $\mathbf{A}_{n}(\check{\bm{\beta}})-\mathbf{A}_{n}(\tilde{\bm{\beta}})$; the other cases follow along similar lines. First,

\|\mathbf{A}_{n}(\tilde{\bm{\beta}})\|\leq\int_{D_{n}}\|\mathbf{z}(u)\|^{2}\rho(u;\tilde{\bm{\beta}})\mathrm{d}u=O_{\mathrm{P}}(p_{n}\mu_{n}).

Second, by a Taylor expansion, there exists $\bm{\beta}^{\prime}$ on the segment between $\tilde{\bm{\beta}}$ and $\check{\bm{\beta}}$ such that $\rho(u;\check{\bm{\beta}})-\rho(u;\tilde{\bm{\beta}})=(\check{\bm{\beta}}-\tilde{\bm{\beta}})^{\top}\mathbf{z}(u)\rho(u;\bm{\beta}^{\prime})$, whereby we deduce that

\|\mathbf{A}_{n}(\check{\bm{\beta}})-\mathbf{A}_{n}(\tilde{\bm{\beta}})\|\leq\int_{D_{n}}\|\mathbf{z}(u)\|^{3}\|\check{\bm{\beta}}-\tilde{\bm{\beta}}\|\rho(u;\bm{\beta}^{\prime})\mathrm{d}u=O_{\mathrm{P}}\left(\mu_{n}p_{n}^{3/2}\|\tilde{\bm{\beta}}-\bm{\beta}_{0}\|\right)=O_{\mathrm{P}}\left(p_{n}^{2}\sqrt{\mu_{n}}\right).

(iii) We only have to prove the assertion for $\|\mathbf{U}_{n}(\tilde{\bm{\beta}})\|$. By a Taylor expansion, there exists $\check{\bm{\beta}}$ such that $\mathbf{U}_{n}(\tilde{\bm{\beta}})=\mathbf{U}_{n}(\bm{\beta}_{0})-\mathbf{A}_{n}(\check{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta}_{0})$. Using Lemma A.1, (ii) and condition ($\mathcal{C}$.7), we obtain

\|\mathbf{U}_{n}(\tilde{\bm{\beta}})\|=O_{\mathrm{P}}\left(\sqrt{p_{n}\mu_{n}}+p_{n}\mu_{n}\sqrt{p_{n}/\mu_{n}}\right)=O_{\mathrm{P}}\left(\sqrt{p_{n}^{3}\mu_{n}}\right). ∎

C.3 Sparsity property for $\hat{\bm{\beta}}=\hat{\bm{\beta}}_{\mathrm{ALDS}}$

The sparsity property of $\hat{\bm{\beta}}_{\mathrm{ALDS}}$ follows from the following lemma.

Lemma C.4.

Let $\hat{\bm{\beta}}_{\mathrm{ALDS}}$ and $\hat{\bm{\gamma}}$ satisfy the following conditions:

\hat{\bm{\beta}}_{\mathrm{ALDS},1}=\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\Big\{\mathbf{U}_{n,1}(\tilde{\bm{\beta}})+\mathbf{A}_{n,1}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}-\mu_{n}\bm{\Lambda}_{n,11}\operatorname{sign}(\hat{\bm{\gamma}}_{1})\Big\} (47)
\hat{\bm{\beta}}_{\mathrm{ALDS},2}=\mathbf{0} (48)
\hat{\bm{\gamma}}_{1}=\mu_{n}\bm{\Lambda}_{n,11}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\bm{\Lambda}_{n,11}\operatorname{sign}(\hat{\bm{\beta}}_{\mathrm{ALDS},1}) (49)
\hat{\bm{\gamma}}_{2}=\mathbf{0}. (50)

Then, under conditions ($\mathcal{C}$.1)-($\mathcal{C}$.9), the following two statements hold.
(i)

\hat{\bm{\beta}}_{\mathrm{ALDS},1}-\bm{\beta}_{01}=\mathbf{A}_{n,11}(\bm{\beta}_{0})^{-1}\mathbf{U}_{n,1}(\bm{\beta}_{0})+o_{\mathrm{P}}\left(\frac{1}{s_{n}\sqrt{\mu_{n}}}\right)=O_{\mathrm{P}}\left(\sqrt{\frac{s_{n}}{\mu_{n}}}\right).

(ii) With probability tending to $1$, $\hat{\bm{\beta}}_{\mathrm{ALDS}}$ and $\hat{\bm{\gamma}}$ given by (47)-(50) satisfy conditions (29)-(32) and are thus the optimal primal and dual solutions (whence the notation $\hat{\bm{\beta}}_{\mathrm{ALDS}}$).

It is worth mentioning that the rate $o_{\mathrm{P}}(1/(s_{n}\sqrt{\mu_{n}}))$ in Lemma C.4 (i) is required to derive the central limit theorem proved in Appendix C.4. This required rate of convergence imposes a stronger restriction on the sequence $a_{n}$.

Proof.

(i) By a Taylor expansion, there exists $\check{\bm{\beta}}$ on the line segment between $\tilde{\bm{\beta}}$ and $\bm{\beta}_{0}$ such that $\mathbf{U}_{n,1}(\tilde{\bm{\beta}})=\mathbf{U}_{n,1}(\bm{\beta}_{0})-\mathbf{A}_{n,1}(\check{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta}_{0})$, which leads, by noticing that $\bm{\beta}_{02}=\mathbf{0}$, to

\hat{\bm{\beta}}_{\mathrm{ALDS},1}-\bm{\beta}_{01}=\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\bigg[\left\{\mathbf{A}_{n,1}(\tilde{\bm{\beta}})-\mathbf{A}_{n,1}(\check{\bm{\beta}})\right\}\left\{\tilde{\bm{\beta}}-\bm{\beta}_{0}\right\}+\mathbf{U}_{n,1}(\bm{\beta}_{0})-\mu_{n}\bm{\Lambda}_{n,11}\operatorname{sign}(\hat{\bm{\gamma}}_{1})\bigg].

Condition ($\mathcal{C}$.3) ensures that $\|\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\|=O_{\mathrm{P}}(\mu_{n}^{-1})$. Write $\hat{\bm{\beta}}_{\mathrm{ALDS},1}-\bm{\beta}_{01}=\mathbf{A}_{n,11}(\bm{\beta}_{0})^{-1}\mathbf{U}_{n,1}(\bm{\beta}_{0})+T_{1}+T_{2}+T_{3}$ where

T_{1}=\left\{\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}-\mathbf{A}_{n,11}(\bm{\beta}_{0})^{-1}\right\}\mathbf{U}_{n,1}(\bm{\beta}_{0})
T_{2}=\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\left\{\mathbf{A}_{n,1}(\tilde{\bm{\beta}})-\mathbf{A}_{n,1}(\check{\bm{\beta}})\right\}\left\{\tilde{\bm{\beta}}-\bm{\beta}_{0}\right\}
T_{3}=-\mu_{n}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\bm{\Lambda}_{n,11}\operatorname{sign}(\hat{\bm{\gamma}}_{1}).

Regarding the term $T_{1}$, we have

T_{1}=\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\left\{\mathbf{A}_{n,11}(\bm{\beta}_{0})-\mathbf{A}_{n,11}(\tilde{\bm{\beta}})\right\}\mathbf{A}_{n,11}(\bm{\beta}_{0})^{-1}\mathbf{U}_{n,1}(\bm{\beta}_{0}).

Condition ($\mathcal{C}$.3) ensures that $\max(\|\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\|,\|\mathbf{A}_{n,11}(\bm{\beta}_{0})^{-1}\|)=O_{\mathrm{P}}(\mu_{n}^{-1})$. Using this, Lemma C.3 and Lemma A.1, we obtain

T_{1}=O_{\mathrm{P}}\left(\frac{1}{\mu_{n}}\sqrt{s_{n}^{2}p_{n}^{2}\mu_{n}}\frac{1}{\mu_{n}}\sqrt{s_{n}\mu_{n}}\right)=O_{\mathrm{P}}\left(\frac{\sqrt{s_{n}^{3}p_{n}^{2}}}{\mu_{n}}\right).

With similar arguments, we have

T_{2}=O_{\mathrm{P}}\left(\frac{1}{\mu_{n}}\sqrt{s_{n}p_{n}^{3}\mu_{n}}\sqrt{p_{n}/\mu_{n}}\right)=O_{\mathrm{P}}\left(\frac{\sqrt{s_{n}p_{n}^{4}}}{\mu_{n}}\right).

Condition ($\mathcal{C}$.8) ensures that

T_{1}+T_{2}=O_{\mathrm{P}}\left(\frac{\sqrt{s_{n}p_{n}^{4}}}{\mu_{n}}\right)=o_{\mathrm{P}}\left(\frac{1}{s_{n}\sqrt{\mu_{n}}}\right).

Now, regarding the last term,

T_{3}=O_{\mathrm{P}}\left(\mu_{n}\frac{1}{\mu_{n}}a_{n}\sqrt{s_{n}}\right)=O_{\mathrm{P}}(a_{n}\sqrt{s_{n}}),

and condition ($\mathcal{C}$.9) is sufficient to establish that $T_{3}=o_{\mathrm{P}}(1/(s_{n}\sqrt{\mu_{n}}))$, which proves (i) using again condition ($\mathcal{C}$.3) and Lemma A.1.

(ii) We have to show that, with probability tending to $1$, $\hat{\bm{\beta}}_{\mathrm{ALDS}}$ and $\hat{\bm{\gamma}}$ given by (47)-(50) satisfy conditions (29)-(32). By (47)-(50),

\mu_{n}^{-1}\hat{\bm{\gamma}}^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\hat{\bm{\beta}}_{\mathrm{ALDS}}=\mu_{n}^{-1}\hat{\bm{\gamma}}_{1}^{\top}\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})\hat{\bm{\beta}}_{\mathrm{ALDS},1}
=\operatorname{sign}(\hat{\bm{\beta}}_{\mathrm{ALDS},1})^{\top}\bm{\Lambda}_{n,11}\big\{\mathbf{A}_{n,11}(\tilde{\bm{\beta}})\big\}^{-1}\bm{\Lambda}_{n,11}\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})\hat{\bm{\beta}}_{\mathrm{ALDS},1}
=\operatorname{sign}(\hat{\bm{\beta}}_{\mathrm{ALDS},1})^{\top}\bm{\Lambda}_{n,11}\hat{\bm{\beta}}_{\mathrm{ALDS},1}
=\|\bm{\Lambda}_{n,11}\hat{\bm{\beta}}_{\mathrm{ALDS},1}\|_{1}=\|\bm{\Lambda}_{n}\hat{\bm{\beta}}_{\mathrm{ALDS}}\|_{1},

so (31) is satisfied. Now, we want to show that (32) holds. We have

\mu_{n}^{-1}\hat{\bm{\gamma}}^{\top}\bm{\Lambda}_{n}^{-1}\big\{\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\hat{\bm{\beta}}_{\mathrm{ALDS}})\big\}=\mathbf{I}+\mathbf{II},

where

\mathbf{I}=\mu_{n}^{-1}\hat{\bm{\gamma}}^{\top}\bm{\Lambda}_{n}^{-1}\mathbf{U}_{n}(\tilde{\bm{\beta}})=\mu_{n}^{-1}\hat{\bm{\gamma}}_{1}^{\top}\bm{\Lambda}_{n,11}^{-1}\mathbf{U}_{n,1}(\tilde{\bm{\beta}}),
\mathbf{II}=\mu_{n}^{-1}\hat{\bm{\gamma}}_{1}^{\top}\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,1}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}-\mu_{n}^{-1}\hat{\bm{\gamma}}_{1}^{\top}\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})\hat{\bm{\beta}}_{\mathrm{ALDS},1}
=\mu_{n}^{-1}\hat{\bm{\gamma}}_{1}^{\top}\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,1}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}-\mu_{n}^{-1}\hat{\bm{\gamma}}_{1}^{\top}\bm{\Lambda}_{n,11}^{-1}\{\mathbf{U}_{n,1}(\tilde{\bm{\beta}})+\mathbf{A}_{n,1}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}-\mu_{n}\bm{\Lambda}_{n,11}\operatorname{sign}(\hat{\bm{\gamma}}_{1})\}
=\hat{\bm{\gamma}}_{1}^{\top}\operatorname{sign}(\hat{\bm{\gamma}}_{1})-\mu_{n}^{-1}\hat{\bm{\gamma}}_{1}^{\top}\bm{\Lambda}_{n,11}^{-1}\mathbf{U}_{n,1}(\tilde{\bm{\beta}}),

from (47)-(50). Summing $\mathbf{I}$ and $\mathbf{II}$, we deduce that (32) holds.

To prove that (30) holds, we use (50) and decompose the vector $\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}\hat{\bm{\gamma}}$ as

\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\mathbf{A}_{n}(\tilde{\bm{\beta}})\bm{\Lambda}_{n}^{-1}\hat{\bm{\gamma}}=\mu_{n}^{-1}\begin{bmatrix}\mathbf{I}^{\prime}\\ \mathbf{II}^{\prime}\end{bmatrix}=\mu_{n}^{-1}\begin{bmatrix}\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})\bm{\Lambda}_{n,11}^{-1}\hat{\bm{\gamma}}_{1}\\ \bm{\Lambda}_{n,22}^{-1}\mathbf{A}_{n,21}(\tilde{\bm{\beta}})\bm{\Lambda}_{n,11}^{-1}\hat{\bm{\gamma}}_{1}\end{bmatrix}.

By (49),

\mu_{n}^{-1}\|\mathbf{I}^{\prime}\|_{\infty}=\mu_{n}^{-1}\|\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})\bm{\Lambda}_{n,11}^{-1}\hat{\bm{\gamma}}_{1}\|_{\infty}=\|\operatorname{sign}(\hat{\bm{\beta}}_{\mathrm{ALDS},1})\|_{\infty}=1.

Regarding $\mathbf{II}^{\prime}$, by (49), the conditions on $a_{n}$ and $b_{n}$, conditions ($\mathcal{C}$.1)-($\mathcal{C}$.3), ($\mathcal{C}$.7) and Lemma C.3 (i)-(ii), we have

\mu_{n}^{-1}\mathbf{II}^{\prime}=\mu_{n}^{-1}\bm{\Lambda}_{n,22}^{-1}\mathbf{A}_{n,21}(\tilde{\bm{\beta}})\bm{\Lambda}_{n,11}^{-1}\hat{\bm{\gamma}}_{1}
=\bm{\Lambda}_{n,22}^{-1}\mathbf{A}_{n,21}(\tilde{\bm{\beta}})\bm{\Lambda}_{n,11}^{-1}\bm{\Lambda}_{n,11}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\bm{\Lambda}_{n,11}\operatorname{sign}(\hat{\bm{\beta}}_{\mathrm{ALDS},1})
=\bm{\Lambda}_{n,22}^{-1}\mathbf{A}_{n,21}(\tilde{\bm{\beta}})\mathbf{A}_{n,11}(\tilde{\bm{\beta}})^{-1}\bm{\Lambda}_{n,11}\operatorname{sign}(\hat{\bm{\beta}}_{\mathrm{ALDS},1})
=O_{\mathrm{P}}\left(\frac{1}{b_{n}}\sqrt{p_{n}s_{n}}\mu_{n}\frac{1}{\mu_{n}}a_{n}\sqrt{s_{n}}\right)=O_{\mathrm{P}}\left(\frac{a_{n}\sqrt{s_{n}^{2}p_{n}}}{b_{n}}\right)=O_{\mathrm{P}}\left(a_{n}\sqrt{s_{n}^{3}\mu_{n}}\frac{1}{b_{n}}\sqrt{\frac{p_{n}^{3}}{\mu_{n}}}\,\frac{1}{p_{n}\sqrt{s_{n}}}\right).

Hence, $\mu_{n}^{-1}\|\mathbf{II}^{\prime}\|_{\infty}=o_{\mathrm{P}}(1)$ by conditions ($\mathcal{C}$.8)-($\mathcal{C}$.9), and (30) is satisfied with probability tending to $1$. We finally focus on (29). Note that

\mu_{n}^{-1}\bm{\Lambda}_{n}^{-1}\{\mathbf{U}_{n}(\tilde{\bm{\beta}})+\mathbf{A}_{n}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\hat{\bm{\beta}}_{\mathrm{ALDS}})\}=\mu_{n}^{-1}\begin{bmatrix}\tilde{\mathbf{I}}\\ \tilde{\mathbf{II}}\end{bmatrix}=\mu_{n}^{-1}\begin{bmatrix}\bm{\Lambda}_{n,11}^{-1}\{\mathbf{U}_{n,1}(\tilde{\bm{\beta}})+\mathbf{A}_{n,1}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\hat{\bm{\beta}}_{\mathrm{ALDS}})\}\\ \bm{\Lambda}_{n,22}^{-1}\{\mathbf{U}_{n,2}(\tilde{\bm{\beta}})+\mathbf{A}_{n,2}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\hat{\bm{\beta}}_{\mathrm{ALDS}})\}\end{bmatrix}.

Regarding $\tilde{\mathbf{I}}$, from (47)-(48),

\mu_{n}^{-1}\|\tilde{\mathbf{I}}\|_{\infty}=\mu_{n}^{-1}\|\bm{\Lambda}_{n,11}^{-1}\mathbf{U}_{n,1}(\tilde{\bm{\beta}})+\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,1}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}-\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,11}(\tilde{\bm{\beta}})\hat{\bm{\beta}}_{\mathrm{ALDS},1}\|_{\infty}
=\mu_{n}^{-1}\|\bm{\Lambda}_{n,11}^{-1}\mathbf{U}_{n,1}(\tilde{\bm{\beta}})+\bm{\Lambda}_{n,11}^{-1}\mathbf{A}_{n,1}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}-\bm{\Lambda}_{n,11}^{-1}\{\mathbf{U}_{n,1}(\tilde{\bm{\beta}})+\mathbf{A}_{n,1}(\tilde{\bm{\beta}})\tilde{\bm{\beta}}-\mu_{n}\bm{\Lambda}_{n,11}\operatorname{sign}(\hat{\bm{\gamma}}_{1})\}\|_{\infty}
=\|\operatorname{sign}(\hat{\bm{\gamma}}_{1})\|_{\infty}=1.

Now, consider $\tilde{\mathbf{II}}$. By the sparsity of $\hat{\bm{\beta}}_{\mathrm{ALDS}}$ and $\bm{\beta}_{0}$, we can write

\mu_{n}^{-1}\tilde{\mathbf{II}}=\mu_{n}^{-1}\bm{\Lambda}_{n,22}^{-1}\left\{\mathbf{U}_{n,2}(\tilde{\bm{\beta}})+\mathbf{A}_{n,2}(\tilde{\bm{\beta}})(\tilde{\bm{\beta}}-\bm{\beta}_{0})+\mathbf{A}_{n,21}(\tilde{\bm{\beta}})(\bm{\beta}_{01}-\hat{\bm{\beta}}_{\mathrm{ALDS},1})\right\}.

We combine Lemma C.3 (i)-(iii) and Lemma C.4 (i) to derive

\mu_{n}^{-1}\tilde{\mathbf{II}}=O_{\mathrm{P}}\left\{\frac{1}{\mu_{n}}\frac{1}{b_{n}}\left(\sqrt{p_{n}\mu_{n}}+p_{n}\mu_{n}\sqrt{\frac{p_{n}}{\mu_{n}}}+\sqrt{p_{n}s_{n}}\mu_{n}\sqrt{\frac{s_{n}}{\mu_{n}}}\right)\right\}=O_{\mathrm{P}}\left(\frac{1}{b_{n}}\sqrt{\frac{p_{n}^{3}}{\mu_{n}}}\right)=o_{\mathrm{P}}(1)

by condition ($\mathcal{C}$.9). Hence, $\mu_{n}^{-1}\|\tilde{\mathbf{II}}\|_{\infty}=o_{\mathrm{P}}(1)$ and (29) is satisfied with probability tending to $1$. ∎

C.4 Asymptotic normality for $\hat{\bm{\beta}}=\hat{\bm{\beta}}_{\mathrm{ALDS}}$

Proof.

By Lemma C.3, $\mathbf{A}_{n,11}(\bm{\beta}_{0})=O_{\mathrm{P}}(s_{n}\mu_{n})$. This and Lemma C.4 show that

\mathbf{A}_{n,11}(\bm{\beta}_{0})\left(\hat{\bm{\beta}}_{\mathrm{ALDS},1}-\bm{\beta}_{01}\right)=\mathbf{U}_{n,1}(\bm{\beta}_{0})+o_{\mathrm{P}}(\sqrt{\mu_{n}}).

Let $\bm{\phi}\in\mathbb{R}^{s_{n}}\setminus\{0\}$ with $\|\bm{\phi}\|<\infty$ and let $\sigma^{2}_{\bm{\phi}}=\bm{\phi}^{\top}\mathbf{B}_{n,11}(\bm{\beta}_{0})\bm{\phi}$. Then

\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{A}_{n,11}(\bm{\beta}_{0})\left(\hat{\bm{\beta}}_{\mathrm{ALDS},1}-\bm{\beta}_{01}\right)=\sigma_{\bm{\phi}}^{-1}\bm{\phi}^{\top}\mathbf{U}_{n,1}(\bm{\beta}_{0})+\sigma_{\bm{\phi}}^{-1}\,o_{\mathrm{P}}(\sqrt{\mu_{n}}).

By condition ($\mathcal{C}$.5), $\sigma_{\bm{\phi}}^{-1}=O(1/\sqrt{\mu_{n}})$, so the remainder term is $o_{\mathrm{P}}(1)$. The result is therefore deduced from condition ($\mathcal{C}$.6) and Slutsky's theorem. ∎

Appendix D Resulting $\bm{\beta}$ estimates for the BCI dataset

Table 5: Values of $\hat{\bm{\beta}}_{\mathrm{AL}}$ and $\hat{\bm{\beta}}_{\mathrm{ALDS}}$ for the real data example
Int elev grad Al B Ca Cu Fe K Mg Mn P Zn N
AL -6.252 0 0 0 -0.461 0.448 0 0.260 0 0 -0.244 0 0 0
ALDS -6.245 0 0 0 -0.429 0.419 0 0.245 0 0 -0.235 0 0 0
N.min pH AlB AlCa AlCu AlFe AlK AlMg AlMn AlP AlZn AlN AlN.min AlpH
AL 0.076 0.477 -0.162 0 0 0.379 0 -0.514 0 -0.022 0 0.103 0 -0.033
ALDS 0.077 0.464 -0.221 0 0 0.394 0 -0.471 0 -0.008 0 0.103 0 0
BCa BCu BFe BK BMg BMn BP BZn BN BN.min BpH CaCu CaFe CaK
AL 0 0 0.152 0 0 0 -0.299 0 -0.071 0 0 0 0.144 0
ALDS 0 -0.093 0.183 0 0 0 -0.286 -0.080 -0.027 0.053 0 0 0.142 0
CaMg CaMn CaP CaZn CaN CaN.min CapH CuFe CuK CuMg CuMn CuP CuZn CuN
AL -0.155 0 0.104 0 0 -0.888 0 0 -0.091 0 0.134 0.148 0 0
ALDS -0.125 0 0.095 0 0 -0.890 0.042 0 -0.013 0 0.130 0.148 0 0
CuN.min CupH FeK FeMg FeMn FeP FeZn FeN FeN.min FepH KMg KMn KP KZn
AL 0 0 -0.311 0 0 0 0 0 0 0 0 0 0 -0.051
ALDS 0 0 -0.331 0 0 0 0 0 0 0 0 0 0 0
KN KN.min KpH MgMn MgP MgZn MgN MgN.min MgpH MnP MnZn MnN MnN.min MnpH
AL 0.198 0.580 0.023 0 0 -0.011 0 0 0 0 0 -0.050 0.107 0
ALDS 0.161 0.530 0 -0.003 0 0 0 0 0 0 0 -0.047 0.100 0
PZn PN PN.min PpH ZnN ZnN.min ZnpH NN.min NpH N.minpH elevgrad
AL 0 0.269 0 0 0 0 0 0 0 0.054 0
ALDS 0 0.258 0 0 0 0 0 0 0 0.054 0

Acknowledgements

We thank the editor, associate editor, and two reviewers for their constructive comments. The research of J.-F. Coeurjolly is supported by the Natural Sciences and Engineering Research Council of Canada. J.-F. Coeurjolly would like to thank Université du Québec à Montréal for the excellent research conditions provided over the last years. The research of A. Choiruddin is supported by the Direktorat Riset, Teknologi, dan Pengabdian Kepada Masyarakat, Direktorat Jenderal Pendidikan Tinggi, Riset, dan Teknologi, Kementerian Pendidikan, Kebudayaan, Riset, dan Teknologi Republik Indonesia. The BCI soils data sets were collected and analyzed by J. Dalling, R. John, K. Harms, R. Stallard and J. Yavitt with support from NSF DEB 021104, 021115, 0212284, 0212818 and OISE 0314581, the STRI Soils Initiative and CTFS, and assistance from P. Segre and J. Trani.

References

  • [1] Antoniadis, A., Fryzlewicz, P. and Letué, F. (2010). The Dantzig selector in Cox's proportional hazards model. Scandinavian Journal of Statistics 37, 531–552.
  • [2] Baddeley, A., Rubak, E. and Turner, R. (2015). Spatial Point Patterns: Methodology and Applications with R. CRC Press.
  • [3] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics 37, 1705–1732.
  • [4] Biscio, C. A. N. and Waagepetersen, R. P. (2019). A general central limit theorem and a subsampling variance estimator for α-mixing point processes. Scandinavian Journal of Statistics 46, 1168–1190.
  • [5] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  • [6] Candes, E. and Romberg, J. (2005). ℓ1-Magic: recovery of sparse signals via convex programming. http://www.acm.caltech.edu/l1magic/.
  • [7] Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics 35, 2313–2351.
  • [8] Choiruddin, A., Aisah, Trisnisa, F. and Iriawan, N. (2021). Quantifying the effect of geological factors on distribution of earthquake occurrences by inhomogeneous Cox processes. Pure and Applied Geophysics 178, 1579–1592.
  • [9] Choiruddin, A., Coeurjolly, J.-F. and Letué, F. (2017). Spatial point processes intensity estimation with a diverging number of covariates. arXiv preprint arXiv:1712.09562.
  • [10] Choiruddin, A., Coeurjolly, J.-F. and Letué, F. (2018). Convex and non-convex regularization methods for spatial point processes intensity estimation. Electronic Journal of Statistics 12, 1210–1255.
  • [11] Choiruddin, A., Coeurjolly, J.-F. and Waagepetersen, R. P. (2021). Information criteria for inhomogeneous spatial point processes. Australian and New Zealand Journal of Statistics 63, 119–143.
  • [12] Choiruddin, A., Cuevas-Pacheco, F., Coeurjolly, J.-F. and Waagepetersen, R. P. (2020). Regularized estimation for highly multivariate log Gaussian Cox processes. Statistics and Computing 30, 649–662.
  • [13] Coeurjolly, J.-F. and Lavancier, F. (2019). Understanding spatial point patterns through intensity and conditional intensities. In Stochastic Geometry, pages 45–85. Springer.
  • [14] Daniel, J., Horrocks, J. and Umphrey, G. J. (2018). Penalized composite likelihoods for inhomogeneous Gibbs point process models. Computational Statistics & Data Analysis 124, 104–116.
  • [15] Dicker, L. (2010). Regularized regression methods for variable selection and estimation. PhD thesis, Harvard University.
  • [16] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics 32, 928–961.
  • [17] Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22.
  • [18] Guan, Y., Jalilian, A. and Waagepetersen, R. P. (2015). Quasi-likelihood for spatial point processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77, 677–697.
  • [19] Guan, Y. and Shen, Y. (2010). A weighted estimating equation approach for inhomogeneous spatial point processes. Biometrika 97, 867–880.
  • [20] Hubbell, S. P., Condit, R. and Foster, R. B. (2005). Barro Colorado forest census plot data. http://ctfs.si.edu/webatlas/datasets/bci/.
  • [21] James, G. M. and Radchenko, P. (2009). A generalized Dantzig selector with shrinkage tuning. Biometrika 96, 323–337.
  • [22] James, G. M., Radchenko, P. and Lv, J. (2009). DASSO: connections between the Dantzig selector and lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 127–142.
  • [23] Møller, J. and Waagepetersen, R. P. (2004). Statistical Inference and Simulation for Spatial Point Processes. CRC Press.
  • [24] Rakshit, S., McSwiggan, G., Nair, G. and Baddeley, A. (2021). Variable selection using penalised likelihoods for point patterns on a linear network. Australian & New Zealand Journal of Statistics 63, 417–454.
  • [25] Waagepetersen, R. P. (2007). An estimating function approach to inference for inhomogeneous Neyman–Scott processes. Biometrics 63, 252–258.
  • [26] Waagepetersen, R. P. and Guan, Y. (2009). Two-step estimation for inhomogeneous spatial point processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 685–702.
  • [27] Yue, Y. R. and Loh, J. M. (2015). Variable selection for inhomogeneous spatial point process models. Canadian Journal of Statistics 43, 288–305.
  • [28] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
  • [28] {barticle}[author] \bauthor\bsnmZou, \bfnmHui\binitsH. (\byear2006). \btitleThe adaptive lasso and its oracle properties. \bjournalJournal of the American Statistical Association \bvolume101 \bpages1418–1429. \endbibitem