
Information criteria for inhomogeneous spatial point processes

Achmad Choiruddin Department of Statistics, Institut Teknologi Sepuluh Nopember (ITS), Indonesia Jean-François Coeurjolly Department of Mathematics, Université du Québec à Montréal (UQAM), Canada Rasmus Waagepetersen Department of Mathematical Sciences, Aalborg University, Denmark
Abstract

The theoretical foundation for a number of model selection criteria is established in the context of inhomogeneous point processes and under various asymptotic settings: infill, increasing domain, and combinations of these. For inhomogeneous Poisson processes we consider Akaike’s information criterion and the Bayesian information criterion, and in particular we identify the point process analogue of ‘sample size’ needed for the Bayesian information criterion. Considering general inhomogeneous point processes we derive new composite likelihood and composite Bayesian information criteria for selecting a regression model for the intensity function. The proposed model selection criteria are evaluated using simulations of Poisson processes and cluster point processes.

Keywords: Akaike’s information criterion, Bayesian information criterion, composite information criterion, composite likelihood, inhomogeneous point process, intensity function, model selection.

1 Introduction

Fitting a regression model to the intensity function of a point process is one of the most fundamental tasks in statistical analysis of point pattern data, see e.g. Møller and Waagepetersen (2017) or Coeurjolly and Lavancier (2019) for a recent review of this problem. If the data in question can be viewed as a realization of a Poisson process, regression parameters are usually estimated by maximum likelihood. If the point process is not Poisson, the likelihood function is often computationally intractable. In such cases the Poisson likelihood function can still be used as a composite likelihood function for estimating regression parameters. This is, for example, the approach underlying the procedure kppm in the popular spatstat R package (Baddeley et al., 2015).

Considering regression models, model selection is often a pertinent task. In case of a Poisson process, the Akaike information criterion (AIC) (Akaike, 1973) seems an obvious approach (and is implemented in the function logLik.ppm of spatstat) since in this case the likelihood function is available. If the assumption of a Poisson process is not tenable, a generalization of the AIC to a composite likelihood information criterion (CIC) (Varin and Vidoni, 2005) is relevant. Yet another alternative is the Bayesian information criterion (BIC) (Schwarz, 1978). The classical framework for the information criteria mentioned typically involves a sample of independent observations where the sample size plays a crucial role for the BIC. However, in the case of a unique realization of a point process, it is not obvious how to define sample size. In some sense, it is one, but this choice is obviously not useful for asymptotic justifications of information criteria. Instead, sample size must be linked to properties of the observation window or the point process intensity function. Several proposals for defining ‘sample size’ have been considered in the literature. Choiruddin et al. (2018) use the size of the observation window while Thurman et al. (2015) consider the number of observed points. Jeffrey et al. (2018) use the sum of the number of data points and the number of dummy points used in a numerical approximation of the likelihood (Berman and Turner, 1992).

In this paper we first establish the theoretical foundation for AIC and CIC in the context of intensity function model selection for a point process. This includes asymptotic results for estimates of the ‘least false parameter value’, see e.g. Claeskens and Hjort (2008, Section 2.2). Next we derive the BIC in case of a Poisson process and we thus identify the meaning of sample size in this context. We also consider the generalization of BIC to a composite likelihood BIC (CBIC) (Gao and Song, 2010) using a concept of effective degrees of freedom derived for the CIC. Our asymptotic developments are established under an original setting which embraces both infill asymptotics (the number of points in a fixed domain increases) and increasing domain asymptotics (the volume of the observation window tends to infinity), the two frameworks commonly considered in the literature.

The rest of the article is organized as follows. The problem of selecting a model for the intensity function is specified in Section 2. In Section 3 we discuss asymptotic results for intensity function regression parameter estimators under a ‘double’ asymptotic framework. We derive the AIC and CIC for spatial point processes in Section 4 and develop the BIC and CBIC in Section 5. The different model selection criteria are compared in a simulation study in Section 6. Section 7 gives some concluding remarks. Proofs are given in the Appendices A-E.

2 Intensity model selection

A spatial point process $\mathbf{X}$ defined on $\mathbb{R}^{d}$ is a locally finite random subset of $\mathbb{R}^{d}$. If for bounded $B\subset\mathbb{R}^{d}$ we denote by $N(B)$ the cardinality of $\mathbf{X}\cap B$, locally finite means that $N(B)$ is a finite integer almost surely. For a bounded domain $W\subset\mathbb{R}^{d}$, $|W|$ denotes the volume of $W$. The intensity function $\lambda$ and the pair correlation function $g$ of $\mathbf{X}$ are defined (if they exist) by the equations

\mathbb{E}\,N(A)=\int_{A}\lambda(u)\,\mathrm{d}u

\mathbb{E}\left\{N(A)N(B)\right\}=\int_{A\cap B}\lambda(u)\,\mathrm{d}u+\int_{A}\int_{B}\lambda(u)\lambda(v)g(u,v)\,\mathrm{d}u\,\mathrm{d}v

for any bounded $A,B\subset\mathbb{R}^{d}$.

If the counts $N(A)$ are Poisson distributed, $\mathbf{X}$ is said to be a Poisson process. In this case the counts $N(B_{1}),\dots,N(B_{m})$ are independent whenever the subsets $B_{1},\ldots,B_{m}$ are disjoint, and the pair correlation function is identically equal to one. For our asymptotic considerations, we assume that a sequence of spatial point processes $\mathbf{X}_{n}$ is observed within a sequence of bounded windows $W_{n}\subset\mathbb{R}^{d}$, $n=1,2,\ldots$. We denote by $\lambda_{n}$ and $g_{n}$ the intensity and pair correlation function of $\mathbf{X}_{n}$. With an abuse of notation, we denote for any $n\geq 1$ expectation and variance under the sampling distribution of $\mathbf{X}_{n}$ by $\mathbb{E}$ and $\mathbb{V}\mathrm{ar}$.

For modelling the intensity function we assume that $p\geq 1$ covariates $z_{1},\ldots,z_{p}$ are available, where for each $i=1,\ldots,p$, $z_{i}$ is a locally integrable function on $\mathbb{R}^{d}$. Let $I_{l}$, $l=1,\ldots,2^{p}$, denote the subsets of $\{1,\dots,p\}$ and let $p_{l}=|I_{l}|+1$ where $|I_{l}|$ is the cardinality of $I_{l}$. We consider models for $\lambda_{n}$ specified in terms of the $2^{p}$ subsets of the covariates. For each $l=1,\ldots,2^{p}$ and $n\geq 1$ we define the log-linear model

\rho_{l,n}\{u;\mathbf{z}_{l}(u);\boldsymbol{\beta}_{l}\}=\theta_{n}\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}  (1)

where $\boldsymbol{\beta}_{l}=\left\{\beta_{0},(\beta_{j})_{j\in I_{l}}\right\}\in\mathbb{R}^{p_{l}}$ and $\mathbf{z}_{l}(u)=\big[1,\{z_{j}(u)\}_{j\in I_{l}}\big]$. The quantity $\theta_{n}$ should not be regarded as a parameter to be estimated. For $n\geq 1$, $\theta_{n}$ could e.g. represent a timespan over which $\mathbf{X}_{n}$ is observed. In the following, with an abuse of notation, we just write $\rho_{n}(u;\boldsymbol{\beta}_{l})$ for $\rho_{l,n}\{u;\mathbf{z}_{l}(u);\boldsymbol{\beta}_{l}\}$ and similarly for related quantities.

The problem we consider is to select among the $2^{p}$ intensity models $\mathcal{M}_{l}$, $l=1,\dots,2^{p}$, given by

\mathcal{M}_{l}=\{\rho_{n}(\cdot;\boldsymbol{\beta}_{l})\mid\boldsymbol{\beta}_{l}=\left\{\beta_{0},(\beta_{j})_{j\in I_{l}}\right\}\in\mathbb{R}^{p_{l}}\},\quad l=1,\dots,2^{p}.

The distinction between $\beta_{0}$ and the other parameters is necessary because our objective is to select among $2^{p}$ different models which all contain an intercept term. Note that the true intensity function $\lambda_{n}$ does not necessarily correspond to any of the suggested models $\mathcal{M}_{l}$, $l=1,\ldots,2^{p}$.
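As a concrete illustration of this model search, the following base R sketch (with hypothetical covariate names z1, ..., z6, so $p=6$) enumerates the $2^{p}$ candidate models, each containing an intercept, as model formulas.

```r
# A minimal sketch (base R): list all 2^p covariate subsets and turn each into a
# model formula with an intercept. Covariate names z1, ..., z6 are hypothetical.
covariate_names <- paste0("z", 1:6)
subsets <- c(list(character(0)),
             unlist(lapply(1:6, function(k)
               combn(covariate_names, k, simplify = FALSE)),
               recursive = FALSE))
formulas <- lapply(subsets, function(I)
  if (length(I) == 0) ~ 1 else reformulate(I))   # ~ 1 is the intercept-only model
length(formulas)                                 # 2^6 = 64 candidate models
```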

3 Estimation of the intensity function

In this section we discuss estimation of the intensity function using a Poisson likelihood function and associated asymptotic results.

3.1 The Poisson likelihood function

The density of a Poisson point process with intensity $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$ and observed in $W_{n}$ is given by (see e.g. Møller and Waagepetersen (2004))

p_{n}(\mathbf{x};\boldsymbol{\beta}_{l})=\left\{\prod_{u\in\mathbf{x}}\rho_{n}(u;\boldsymbol{\beta}_{l})\right\}\exp\left\{|W_{n}|-\int_{W_{n}}\rho_{n}(u;\boldsymbol{\beta}_{l})\,\mathrm{d}u\right\}  (2)

for locally finite point configurations $\mathbf{x}\subset W_{n}$. We emphasize that in the following we assume neither that $\mathbf{X}_{n}$ introduced in the previous section is a Poisson process nor that $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$ coincides with the intensity function $\lambda_{n}$ of $\mathbf{X}_{n}$.

Combining (1) and (2), up to a constant, the log of (2) evaluated at $\mathbf{X}_{n}$ becomes

\ell_{n}(\boldsymbol{\beta}_{l})=\sum_{u\in\mathbf{X}_{n}}\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)-\theta_{n}\int_{W_{n}}\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}\,\mathrm{d}u.  (3)

Let $e_{n}(\boldsymbol{\beta}_{l})$ be the corresponding estimating function given by

e_{n}(\boldsymbol{\beta}_{l})=\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}}\ell_{n}(\boldsymbol{\beta}_{l})=\sum_{u\in\mathbf{X}_{n}}\mathbf{z}_{l}(u)-\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}\,\mathrm{d}u.  (4)

For any $\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}$, the sensitivity (or Fisher information) matrix is

\mathbf{S}_{n}(\boldsymbol{\beta}_{l})=-\mathbb{E}\left\{\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}^{\top}}e_{n}(\boldsymbol{\beta}_{l})\right\}=\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}\,\mathrm{d}u.  (5)

We assume that $\mathbf{S}_{n}(\boldsymbol{\beta}_{l})$ is positive definite for all $\boldsymbol{\beta}_{l}$ (see also condition C5 in the next section). We can then define the estimator of $\boldsymbol{\beta}_{l}$ as

\hat{\boldsymbol{\beta}}_{l,n}=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}}\,p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}}\,\ell_{n}(\boldsymbol{\beta}_{l}).  (6)

If $\mathbf{X}_{n}$ is indeed a Poisson process with intensity function $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$, then $\hat{\boldsymbol{\beta}}_{l,n}$ is the maximum likelihood estimator and the sensitivity (5) equals the observed information matrix $-\mathrm{d}e_{n}(\boldsymbol{\beta}_{l})/\mathrm{d}\boldsymbol{\beta}_{l}^{\top}$.

If $\mathbf{X}_{n}$ is not Poisson, $\hat{\boldsymbol{\beta}}_{l,n}$ may be viewed as a composite likelihood estimator (Schoenberg, 2005; Waagepetersen, 2007). In the situation where the intensity function $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$ coincides with the true intensity function $\lambda_{n}$, asymptotic properties of maximum likelihood or composite likelihood estimators obtained as maximizers of Poisson likelihood functions have been established in various settings by Rathbun and Cressie (1994), Waagepetersen (2007), Guan and Loh (2007) and Waagepetersen and Guan (2009). In the next section we investigate the more intriguing situation where the intensity model is misspecified.
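To fix ideas, the following base R sketch (self-contained, with hypothetical inputs) evaluates the objective (3) by replacing the integral with a Riemann sum over a regular grid and maximizes it numerically; it mimics what (6) computes, without relying on any particular package.

```r
# A minimal sketch (base R): Poisson (composite) log-likelihood (3) with the integral
# approximated on a regular grid over a rectangular window, and its maximizer (6).
# X is a 2-column matrix of point coordinates; zfun(x, y) returns the covariate
# matrix z_l (one row per location, first column equal to 1); theta plays the role
# of theta_n. All names are hypothetical.
poisson_loglik <- function(beta, X, zfun, window, theta = 1, ngrid = 100) {
  gx <- seq(window$xrange[1], window$xrange[2], length.out = ngrid)
  gy <- seq(window$yrange[1], window$yrange[2], length.out = ngrid)
  grid <- expand.grid(x = gx, y = gy)
  cellarea <- diff(window$xrange) * diff(window$yrange) / nrow(grid)
  sum(zfun(X[, 1], X[, 2]) %*% beta) -
    theta * sum(exp(zfun(grid$x, grid$y) %*% beta)) * cellarea
}
# optim() minimizes, hence the sign flip; p_l is the number of columns of z_l.
# beta_hat <- optim(rep(0, p_l), function(b) -poisson_loglik(b, X, zfun, window))$par
```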

3.2 Framework and asymptotic results for misspecified intensity functions

To handle the situation where $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$ does not coincide with the intensity function of $\mathbf{X}_{n}$, we follow Varin and Vidoni (2005) and define a (composite) Kullback-Leibler divergence between the model $\mathcal{M}_{l}$ with parameter $\boldsymbol{\beta}_{l}$ and the true sampling distribution. That is,

\mathrm{KL}_{n}(\boldsymbol{\beta}_{l})=\mathbb{E}\left\{\ell_{n}-\ell_{n}(\boldsymbol{\beta}_{l})\right\}  (7)

where $\ell_{n}$ is the Poisson log-likelihood obtained with the true intensity $\lambda_{n}$. For a window $W_{n}$ and model $\mathcal{M}_{l}$ we let

\boldsymbol{\beta}_{l,n}^{*}=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}}\mathbb{E}\,\ell_{n}(\boldsymbol{\beta}_{l})

denote the ‘least wrong parameter value’ under model $\mathcal{M}_{l}$, provided the maximum exists. It is easy to see by explicit evaluation of the right hand side that

-\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}^{\top}}\left\{\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}}\mathbb{E}\,\ell_{n}(\boldsymbol{\beta}_{l})\right\}=\mathbf{S}_{n}(\boldsymbol{\beta}_{l})

and so condition C5 stated below implies that $\boldsymbol{\beta}_{l,n}^{*}$ is well-defined as a unique maximum when $n$ is large enough. Also it is easy to see that

\mathbb{E}\,e_{n}(\boldsymbol{\beta}_{l,n}^{*})=0,  (8)

which means that $\hat{\boldsymbol{\beta}}_{l,n}$ given by (6) is a candidate to estimate $\boldsymbol{\beta}_{l,n}^{*}$.
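For completeness, (8) can be verified directly: by Campbell's theorem applied to (4),

\mathbb{E}\,e_{n}(\boldsymbol{\beta}_{l})=\int_{W_{n}}\mathbf{z}_{l}(u)\lambda_{n}(u)\,\mathrm{d}u-\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}\,\mathrm{d}u=\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}}\mathbb{E}\,\ell_{n}(\boldsymbol{\beta}_{l}),

which vanishes at $\boldsymbol{\beta}_{l}=\boldsymbol{\beta}_{l,n}^{*}$ because $\boldsymbol{\beta}_{l,n}^{*}$ maximizes $\mathbb{E}\,\ell_{n}$.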

The remainder of this section is devoted to asymptotic results for $\hat{\boldsymbol{\beta}}_{l,n}$ within the above framework of a misspecified intensity function. We thereby extend the results in the references mentioned in Section 3.1. In contrast to these references which used either increasing domain or infill asymptotics, we moreover consider a ‘double asymptotic’ framework as formalized by condition C3 presented below.

Two matrices are crucial for the asymptotic results. The sensitivity matrix $\mathbf{S}_{n}(\boldsymbol{\beta}_{l})$ is given by (5) regardless of whether the model is misspecified or not. The variance-covariance matrix $\boldsymbol{\Sigma}_{l,n}$ of (4) is, using the Campbell theorem, given by

\boldsymbol{\Sigma}_{l,n}=\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\lambda_{n}(u)\,\mathrm{d}u+\int_{W_{n}}\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(v)^{\top}\lambda_{n}(u)\lambda_{n}(v)\left\{g_{n}(u,v)-1\right\}\,\mathrm{d}u\,\mathrm{d}v.  (9)

Observe that $\boldsymbol{\Sigma}_{l,n}$ does not depend on $\boldsymbol{\beta}_{l}$ (whence its notation). Our results will be based on the following assumptions, where for a square matrix $\mathbf{M}$, $\nu_{\min}(\mathbf{M})$ (resp. $\nu_{\max}(\mathbf{M})$) stands for the smallest (resp. largest) eigenvalue. We use $a_{n}\asymp b_{n}$ to denote that $a_{n}=\mathcal{O}(b_{n})$ and $b_{n}=\mathcal{O}(a_{n})$.

C1. As $n\to\infty$, $\sup_{n\geq 1}\|\boldsymbol{\beta}_{l,n}^{*}\|=\mathcal{O}(1)$.

C2. $\mathbf{z}_{l}:\mathbb{R}^{d}\to\mathbb{R}^{p_{l}}$ is continuous and $0<\inf_{u\in\mathbb{R}^{d}}\|\mathbf{z}_{l}(u)\|<\sup_{u\in\mathbb{R}^{d}}\|\mathbf{z}_{l}(u)\|<\infty$.

C3. The sequence $(\tau_{n}:=\theta_{n}|W_{n}|)_{n\geq 1}$ is increasing, with $\liminf_{n\to\infty}\theta_{n}>0$ and $\lim_{n\to\infty}\tau_{n}=\infty$. The sets $W_{n}$ are convex and compact.

C4. As $n\to\infty$, $\tau_{n}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\to 0$ almost surely.

C5. For any $\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}$ and $n\geq 1$, $\mathbf{S}_{n}(\boldsymbol{\beta}_{l})$ is positive definite. In addition, for any $n\geq 1$, there exist a set $B_{n}\subseteq W_{n}$ with $|B_{n}|\asymp|W_{n}|$ and a $c>0$ such that $\inf_{n}\inf_{\boldsymbol{\phi}\in\mathbb{R}^{p_{l}},\|\boldsymbol{\phi}\|=1}\inf_{u\in B_{n}}|\boldsymbol{\phi}^{\top}\mathbf{z}_{l}(u)|\geq c$. Finally, we assume that $\liminf_{n\to\infty}\nu_{\min}\left(\tau_{n}^{-1}\boldsymbol{\Sigma}_{l,n}\right)>0$.

C6. As $n\to\infty$, $\|\boldsymbol{\Sigma}_{l,n}\|=\mathcal{O}(\tau_{n})$.

C7. As $n\to\infty$, $\boldsymbol{\Sigma}_{l,n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\to N(0,\mathbf{I}_{p_{l}})$ in distribution.

We can then state the following asymptotic result which is verified in Appendix A.

Theorem 1.


(i) Assume conditions C1-C4 hold. Then there exists $v=v_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l,n}^{*})$ such that, almost surely,

\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}=\frac{\mathbf{z}_{l}(v)}{\|\mathbf{z}_{l}(v)\|^{2}}\,\log\left[1+\tau_{n}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\frac{\mathbf{z}_{l}(v)}{\|\mathbf{z}_{l}(v)\|^{2}}\exp\{-{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(v)\}\right]  (10)

and $\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\to 0$ almost surely as $n\to\infty$.
(ii) Assume conditions C1-C3 and C5-C6 hold. Then $\hat{\boldsymbol{\beta}}_{l,n}$ is a root-$\tau_{n}$ consistent estimator of $\boldsymbol{\beta}_{l,n}^{*}$, i.e.

\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}=\mathcal{O}_{\mathrm{P}}(\tau_{n}^{-1/2}).  (11)

(iii) If in addition C7 holds, then as $n\to\infty$,

\boldsymbol{\Sigma}_{l,n}^{-1/2}\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\left(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\right)\to N(0,\mathbf{I}_{p_{l}})  (12)

in distribution.

We stress that Theorem 1 (ii)-(iii) does not require the strong consistency of $\tau_{n}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})$ to 0, i.e. condition C4. We conclude with some remarks regarding the assumptions C1-C7. Condition C1 ensures that the sequence of ‘least wrong parameter values’ does not diverge with $n$.

Condition C3 is different from existing conditions as it embraces both of the standard asymptotic frameworks considered in the literature:

  • infill asymptotics: $W_{n}=W$ with $W$ a bounded set of $\mathbb{R}^{d}$, and $\theta_{n}\to\infty$ as $n\to\infty$.

  • increasing domain asymptotics: $\theta_{n}=\theta>0$ and $(W_{n})_{n\geq 1}$ is a sequence of bounded domains of $\mathbb{R}^{d}$ such that $|W_{n}|\to\infty$.

It is also valid if both asymptotics are considered at the same time. The assumed convexity in C3 enables the use of the mean value theorem in the proofs of our theoretical results.

In condition C2, assuming an upper bound for $\|\mathbf{z}_{l}(u)\|$ is quite standard. The upper bound further implies that assuming a lower bound is not really restrictive since for each covariate $z_{j}$ we can always find some $k$ so that $|z_{j}(u)+k|\neq 0$ for any $u\in\mathbb{R}^{d}$. Replacing $z_{j}$ by $z_{j}+k$ while changing the intercept from $\beta_{0}$ to $\beta_{0}-\beta_{j}k$ leaves the model unchanged. The continuity assumption is used to prove the strong consistency of $\hat{\boldsymbol{\beta}}_{l,n}$.

Condition C4 is also used to ensure the strong consistency of $\hat{\boldsymbol{\beta}}_{l,n}$. This condition can be seen as a law of large numbers. In Example 1, we present a class of models where such an assumption is valid under the generalized asymptotic condition C3. We can observe that if there exists $p>0$ such that $\mathbb{E}\{\|\tau_{n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\|^{p}\}=\mathcal{O}(1)$ and $\sum_{n}\tau_{n}^{-p/2}<\infty$, then C4 ensues from an application of the Borel-Cantelli lemma. At least in the increasing domain framework, such a moment assumption is quite standard when deriving a central limit theorem.

Condition C5 is very similar to the assumption required by Rathbun and Cressie (1994) in the Poisson case or by Waagepetersen and Guan (2009) for more general point processes, within the increasing domain asymptotic framework. Note in particular that C5 combined with C1-C2 ensures that $\liminf_{n\to\infty}\nu_{\min}\{\tau_{n}^{-1}\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\}>0$.

In the Poisson case, $g_{n}=1$ and so C6 is obviously satisfied if $\sup_{u\in W_{n}}\lambda_{n}(u)=\mathcal{O}(\theta_{n})$. For more general point processes, assume further that $g_{n}=g$ does not depend on $n$ and is invariant under translations. Then, thanks to C2, the assumption

\int_{\mathbb{R}^{d}}\{g(o,w)-1\}\,\mathrm{d}w<\infty

will imply C6. This assumption is quite standard and satisfied by a large class of models, see e.g. Waagepetersen and Guan (2009).

Continuing within the increasing domain framework, C7 was established by Rathbun and Cressie (1994) in the Poisson case, by Guan and Loh (2007) and Waagepetersen and Guan (2009) for $\alpha$-mixing point processes, and by Lavancier et al. (2020) for determinantal point processes. The infill asymptotic framework was used to establish C7 in the case of Poisson cluster processes in Waagepetersen (2007). However, the ‘double’ asymptotic framework has never been considered. Below, we provide an example of a model which satisfies C4, C6 and C7 under the new general asymptotic setting C3.

Example 1.

Let $\mathbf{C}_{n}$ be a homogeneous Poisson point process on $\mathbb{R}^{d}$ with intensity $\theta_{n}$. Given $\mathbf{C}_{n}$, let $\mathbf{X}_{n,c}$, $c\in\mathbf{C}_{n}$, be independent inhomogeneous Poisson point processes on $W_{n}$ with intensity $\alpha k(u-c)\rho(u)$, where $\alpha>0$, $k$ is a symmetric density on $\mathbb{R}^{d}$, and $\rho$ is a non-negative bounded function. Then $\mathbf{X}_{n}=\cup_{c\in\mathbf{C}_{n}}\mathbf{X}_{n,c}$ is an inhomogeneous Poisson cluster point process (with inhomogeneous offspring). It can be shown that $\lambda_{n}(u)=\theta_{n}\alpha\rho(u)$ and $g_{n}(u,v)=1+(k*k)(v-u)/\theta_{n}$, where $*$ denotes convolution. Assuming that $\sup_{u}\rho(u)=\mathcal{O}(1)$, there exists $K\geq 0$ such that

\|\boldsymbol{\Sigma}_{l,n}\|\leq K\left\{\tau_{n}+\theta_{n}^{2}\int_{W_{n}}\int_{W_{n}}|g_{n}(u,v)-1|\,\mathrm{d}u\,\mathrm{d}v\right\}
\leq K\left\{\tau_{n}+\theta_{n}\int_{W_{n}}\int_{W_{n}}|(k*k)(v-u)|\,\mathrm{d}u\,\mathrm{d}v\right\}
\leq K\left\{\tau_{n}+\tau_{n}\int_{\mathbb{R}^{d}}|(k*k)(w)|\,\mathrm{d}w\right\}=\mathcal{O}(\tau_{n}).

Thus C6 is satisfied. In Appendix B we show that this model also satisfies C4 and C7 within the ‘double’ asymptotic framework. Along the same lines one can show that C7 holds for the inhomogeneous Poisson point process with intensity function $\lambda_{n}(u)=\theta_{n}\rho(u)$.
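As an illustration of this construction, a simulation sketch in base R is given below (all names are hypothetical; the kernel $k$ is taken to be an isotropic Gaussian density with standard deviation sigma, and edge effects are handled only approximately by expanding the parent window).

```r
# A minimal sketch (base R): simulate the inhomogeneous Poisson cluster process of
# Example 1 on a rectangle W by thinning. Parents are homogeneous Poisson with
# intensity theta; offspring around a parent c form an inhomogeneous Poisson process
# with intensity alpha * k(u - c) * rho(u), where rho is bounded by rho_max.
simulate_cluster <- function(theta, alpha, sigma, rho, rho_max, W, expand = 4 * sigma) {
  xr <- W$xrange + c(-expand, expand); yr <- W$yrange + c(-expand, expand)
  np <- rpois(1, theta * diff(xr) * diff(yr))       # number of parents
  px <- runif(np, xr[1], xr[2]); py <- runif(np, yr[1], yr[2])
  xs <- ys <- numeric(0)
  for (i in seq_len(np)) {
    m  <- rpois(1, alpha * rho_max)                 # dominating offspring count
    ox <- rnorm(m, px[i], sigma); oy <- rnorm(m, py[i], sigma)
    keep <- runif(m) < rho(ox, oy) / rho_max        # thin down to alpha * k * rho
    inW  <- ox >= W$xrange[1] & ox <= W$xrange[2] &
            oy >= W$yrange[1] & oy <= W$yrange[2]
    xs <- c(xs, ox[keep & inW]); ys <- c(ys, oy[keep & inW])
  }
  cbind(x = xs, y = ys)
}
```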

4 Akaike and composite information criteria

One criterion for model selection would be to choose the model that minimizes the Kullback-Leibler divergence, i.e. the model for which (7) evaluated at $\boldsymbol{\beta}_{l,n}^{*}$ is smallest. Of course this criterion is not useful in practice since the true model is unknown. Following Varin and Vidoni (2005) we instead choose the model that minimizes an estimate of the expected value of (7) evaluated at $\hat{\boldsymbol{\beta}}_{l,n}$, or equivalently, that minimizes an estimate of $2\mathbb{E}\,C_{n}(\hat{\boldsymbol{\beta}}_{l,n})$ with $C_{n}(\boldsymbol{\beta}_{l})=-\mathbb{E}\,\ell_{n}(\boldsymbol{\beta}_{l})$. We follow Varin and Vidoni (2005, Lemmas 1-2) to derive the following result in our context.

Proposition 1.

Assume conditions C1-C3 and C5-C6 hold. Also assume that there exists ε>0\varepsilon>0 such that

\sup_{n}\mathbb{E}\left\{\left\|e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\|^{1+\varepsilon}\right\}<\infty  (13)

where $\mathbf{M}_{n}=\mathbf{S}_{n}(\tilde{\boldsymbol{\beta}}_{l,n})^{-1}-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}$ and $\tilde{\boldsymbol{\beta}}_{l,n}$ is on the line segment between $\hat{\boldsymbol{\beta}}_{l,n}$ and $\boldsymbol{\beta}_{l,n}^{*}$. Then,

\mathbb{E}\left\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}=\mathbb{E}\left\{-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}+\mathrm{trace}\left\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\boldsymbol{\Sigma}_{l,n}\right\}+o(1).

In other words, $-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+2\,\mathrm{trace}\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\boldsymbol{\Sigma}_{l,n}\}$ is an asymptotically unbiased estimator of $2\mathbb{E}\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}$.

Remark 1.

The technical condition (13) implies that the sequence $\{\tilde{e}_{n}:=e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\}_{n\geq 1}$ is a uniformly integrable sequence of random variables. This ensures that convergence in probability of $\tilde{e}_{n}$ to 0 implies $\mathbb{E}\,\tilde{e}_{n}=o(1)$.

To estimate the effective degrees of freedom $p^{*}_{l}=\mathrm{trace}\left\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\boldsymbol{\Sigma}_{l,n}\right\}$ we first simply estimate $\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})$ by $\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})$. The estimation of $\boldsymbol{\Sigma}_{l,n}$ is more difficult. Following Claeskens and Hjort (2008, p. 31), note that if $\rho_{n}(\cdot;\boldsymbol{\beta}_{l,n}^{*})$ coincides with the true intensity function of $\mathbf{X}_{n}$, then $\boldsymbol{\Sigma}_{l,n}$ coincides with $\boldsymbol{\Sigma}_{n}(\boldsymbol{\beta}^{*}_{l,n})=\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})+T_{2}$ where

T_{2}:=\int_{W_{n}^{2}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(v)^{\top}\rho_{n}(u;\boldsymbol{\beta}_{l,n}^{*})\rho_{n}(v;\boldsymbol{\beta}_{l,n}^{*})\left\{g_{n}(u,v)-1\right\}\,\mathrm{d}u\,\mathrm{d}v.

In this case we get

\mathrm{trace}\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\boldsymbol{\Sigma}_{n}(\boldsymbol{\beta}_{l,n}^{*})\}=\mathrm{trace}\{\mathbf{I}_{p_{l}}+\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}T_{2}\}=p_{l}+\mathrm{trace}\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}T_{2}\}.

We propose to use $p_{l,\text{approx}}^{*}=p_{l}+\mathrm{trace}\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}T_{2}\}$ as an approximation of $p_{l}^{*}$. In practice we replace $\boldsymbol{\beta}_{l,n}^{*}$ by $\hat{\boldsymbol{\beta}}_{l,n}$ and $g_{n}$ by an estimate obtained by fitting a valid parametric model for $g_{n}$, and thus obtain $\hat{p}_{l,\text{approx}}^{*}$. Our composite model selection criterion then becomes

\mathrm{CIC}_{l}=-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+2\hat{p}_{l,\text{approx}}^{*}.  (14)

For a Poisson process, $g_{n}=1$, in which case $\hat{p}_{l,\text{approx}}^{*}=p_{l}$ and (14) reduces to the popular Akaike information criterion. For a clustered point process, $g_{n}>1$, meaning that $\hat{p}_{l,\text{approx}}^{*}>p_{l}$. Thus we penalize model complexity more heavily in the case of a clustered point process. This makes sense since random clustering of points (i.e. clustering not due to covariates) may erroneously be picked up by covariates that actually had no effect in the data generating mechanism. Hence there is a greater risk of selecting an overly complex model for the intensity function for a clustered point process than for a Poisson process.
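To indicate how $\hat{p}_{l,\text{approx}}^{*}$ can be computed, the sketch below (base R, hypothetical inputs) evaluates $p_{l}+\mathrm{trace}\{\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})^{-1}T_{2}\}$ by Riemann sums on a coarse grid, plugging in a fitted intensity and a fitted translation-invariant pair correlation function; in the simulation study of Section 6 the same quantity is obtained via spatstat's vcov.kppm.

```r
# A minimal sketch (base R): effective degrees of freedom p*_l,approx on a grid.
# zfun(x, y): covariate matrix (first column 1); rho_hat(x, y): fitted intensity;
# gfun(r): fitted pair correlation, e.g. for a Thomas process
# gfun <- function(r) 1 + exp(-r^2 / (4 * scale^2)) / (4 * pi * kappa * scale^2).
effective_df <- function(zfun, rho_hat, gfun, W, ngrid = 30) {
  gx <- seq(W$xrange[1], W$xrange[2], length.out = ngrid)
  gy <- seq(W$yrange[1], W$yrange[2], length.out = ngrid)
  grid <- expand.grid(x = gx, y = gy)
  a  <- diff(W$xrange) * diff(W$yrange) / nrow(grid)  # cell area
  Z  <- zfun(grid$x, grid$y)                          # N x p_l covariate matrix
  r  <- rho_hat(grid$x, grid$y)                       # N fitted intensity values
  S  <- t(Z) %*% (Z * r) * a                          # sensitivity matrix (5)
  D  <- as.matrix(dist(grid))                         # pairwise distances
  C2 <- (gfun(D) - 1) * outer(r, r)                   # rho(u) rho(v) {g - 1} weights
  T2 <- t(Z) %*% C2 %*% Z * a^2                       # the double integral T_2
  ncol(Z) + sum(diag(solve(S, T2)))                   # p_l + trace(S^{-1} T_2)
}
```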

5 Bayesian information criterion

The motivation of the Bayesian information criterion (BIC) is quite different from the derivation of the AIC and CIC criteria considered in the previous section. A main difference is that there is initially no reference to a Kullback-Leibler distance or asymptotics related to a ‘least false parameter value’. Instead, the true model is considered to be one of the models $\mathcal{M}_{l}$ and the idea is to choose the model that has maximum posterior probability within the specified Bayesian framework. However, asymptotic concepts again play a role in deriving asymptotic expansions of the posterior probabilities. Section 5.1 covers the Poisson process case. Section 5.2 proposes a composite likelihood BIC for the case where data are not generated from a Poisson process. Note that in this case it is still assumed that the true intensity function corresponds to one of the intensity functions for the models $\mathcal{M}_{l}$.

5.1 BIC in the Poisson process case

The BIC criterion (see e.g. Schwarz (1978); Lebarbier and Mary-Huard (2006)) defines the best model $\mathcal{M}_{\mathrm{BIC}}$ as

\mathcal{M}_{\mathrm{BIC}}=\operatorname*{arg\,max}_{\mathcal{M}_{l},\,l=1,\ldots,2^{p}}\mathrm{P}_{n}\left(\mathcal{M}_{l}\mid\mathbf{X}_{n}\right).  (15)

From Bayes formula,

\mathrm{P}_{n}(\mathcal{M}_{l}\mid\mathbf{X}_{n})=\frac{p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})\,\mathrm{P}(\mathcal{M}_{l})}{p_{n}(\mathbf{X}_{n})}.

Letting $p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})$ denote the prior density of $\boldsymbol{\beta}_{l}$ given $\mathcal{M}_{l}$,

p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})=\int_{\mathbb{R}^{p_{l}}}p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})\,p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\,\mathrm{d}\boldsymbol{\beta}_{l}

since $p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})$ is the conditional density of $\mathbf{X}_{n}$ given $\mathcal{M}_{l}$ and $\boldsymbol{\beta}_{l}$. We assume that the prior distribution over models is non-informative, so that the BIC criterion defines the best model as

\mathcal{M}_{\mathrm{BIC}}=\operatorname*{arg\,max}_{\mathcal{M}_{l},\,l=1,\dots,2^{p}}p_{n}\left(\mathbf{X}_{n}\mid\mathcal{M}_{l}\right).  (16)

In principle, one could evaluate $p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})$ using numerical quadrature and then determine $\mathcal{M}_{\mathrm{BIC}}$. However, this is computationally costly, and the need to elicit a specific prior $p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})$ for each model may also be a nuisance. Our next result therefore proposes an asymptotic expansion of $\log p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})$. The methodology is standard (basically a Laplace approximation) and well known in the literature, see Tierney and Kadane (1986) or e.g. Lebarbier and Mary-Huard (2006) and the references therein. However, due to our spatial framework and the double asymptotic point of view considered in this paper, the standard results do not apply straightforwardly. Due to the use of asymptotic results we again need to rely on the notions of a true model and the ‘least false parameter value’ $\boldsymbol{\beta}_{l,n}^{*}$.

We impose the following conditions on the prior for $\boldsymbol{\beta}_{l}$ and the mean $\mu_{n}=\mathbb{E}\,N(W_{n})$.

C8. The prior density $p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})$ of $\boldsymbol{\beta}_{l}$ given $\mathcal{M}_{l}$ is continuously differentiable on $\mathbb{R}^{p_{l}}$.

C9. $\mu_{n}\asymp\tau_{n}$.

Note that $\mu_{n}$ is the expected number of points of $\mathbf{X}_{n}$ in $W_{n}$ under the true intensity $\lambda_{n}$ discussed in Section 2. Condition C9 seems reasonable since it would hold if $\lambda_{n}$ coincided with any of the specified parametric intensity models $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$. We also need to slightly strengthen assumption C1 and replace it by

C1′. As $n\to\infty$, there exists $\boldsymbol{\beta}_{l}^{*}\in\mathbb{R}^{p_{l}}$ such that $\lim_{n\to\infty}\boldsymbol{\beta}_{l,n}^{*}=\boldsymbol{\beta}_{l}^{*}$.

The following result is verified in Appendix D using Łapiński (2019, Theorem 2), which is a rigorous statement of a multivariate Laplace approximation.

Proposition 2.

Under the conditions C1′, C2-C5 and C8-C9, we have, almost surely with respect to the distribution of $\{\mathbf{X}_{n}\}_{n\geq 1}$, as $n\to\infty$,

\log p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})=\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})-\frac{p_{l}}{2}\log(\mu_{n})+\frac{p_{l}}{2}\log(2\pi)-\frac{1}{2}\log\det\{\mu_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}+\log p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})+\mathcal{O}(\mu_{n}^{-1/2}).  (17)

The criterion (16) is defined entirely within the specified Bayesian framework; hence there is no reference to a true model and no need for asymptotic results. However, this changes when we derive the expansion (17) for (16). The expansion is around $\hat{\boldsymbol{\beta}}_{l,n}$ and, for technical reasons, convergence of $\hat{\boldsymbol{\beta}}_{l,n}$ is needed when applying the Laplace approximation. Therefore we need the assumptions that ensure strong consistency of $\hat{\boldsymbol{\beta}}_{l,n}$.

Since $\mu_{n}\asymp\tau_{n}$ and from the strong consistency of $\hat{\boldsymbol{\beta}}_{l,n}$, we have that $\log\det\{\mu_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}=\mathcal{O}_{\mathrm{P}}(1)$, while C8 ensures that $\log p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})$ is $\mathcal{O}_{\mathrm{P}}(1)$. So, if we neglect terms which are $\mathcal{O}_{\mathrm{P}}(1)$ in (17), we follow the standard heuristic and suggest to define a first version of the BIC criterion as

-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+p_{l}\log(\mu_{n})  (18)

where we remind the reader that $p_{l}$ is the length of $\boldsymbol{\beta}_{l}$.

In practice $\mu_{n}$ is not known. However, since $\mathbb{V}\mathrm{ar}\,N(W_{n})=\mu_{n}$ it follows that $N(W_{n})/\mu_{n}-1=\mathcal{O}_{\mathrm{P}}(\mu_{n}^{-1/2})=o_{\mathrm{P}}(1)$. This justifies defining the $\mathrm{BIC}$ criterion in the following natural way:

\mathrm{BIC}_{l}=-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+p_{l}\log\left\{N(W_{n})\right\}.  (19)
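In terms of spatstat objects, (19) can be computed directly from a fitted Poisson model; a minimal sketch (with hypothetical objects X and covariate images) is

```r
# A minimal sketch (spatstat): BIC with 'sample size' equal to the observed number
# of points N(W), for a Poisson intensity model fitted with ppm(), e.g.
# fit <- ppm(X ~ z1 + z2, data = covariate_images)   # hypothetical fit
bic_N <- function(fit, X)
  -2 * as.numeric(logLik(fit)) + length(coef(fit)) * log(npoints(X))
```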

5.2 Composite likelihood BIC

Suppose $\mathbf{X}_{n}$ has an intensity function of the form (1) but is not a Poisson process. Then, as mentioned in Section 3.1, (3) may be viewed as a log composite likelihood for estimating $\boldsymbol{\beta}_{l}$. In this case, following Gao and Song (2010), (16) may be viewed as a composite likelihood BIC. Again we obtain (19) from (16) by Laplace approximation. However, Gao and Song (2010) suggest replacing $p_{l}$ by the ‘effective degrees of freedom’ $p_{l}^{*}$ considered in Section 4. Thus our proposed composite likelihood BIC is

\mathrm{CBIC}_{l}=-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+p_{l,\text{approx}}^{*}\log\big\{N(W_{n})\big\}

which becomes equal to the ordinary BIC in (19) for a Poisson process. From a practical point of view, we simply estimate pl,approxp_{l,\text{approx}}^{*} as in (14).
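Correspondingly, a sketch of the CIC and CBIC computations, given the maximized value of (3) and an estimate of $p_{l,\text{approx}}^{*}$ (e.g. from the effective_df() sketch in Section 4; all names hypothetical), is

```r
# A minimal sketch: composite information criteria from a composite log-likelihood
# value ll, an estimated effective degrees of freedom p_star, and the observed
# number of points npts.
cic  <- function(ll, p_star)       -2 * ll + 2 * p_star
cbic <- function(ll, p_star, npts) -2 * ll + p_star * log(npts)
```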

6 Simulation study

To evaluate the proposed model selection criteria we conduct two simulation studies with $p=6$ spatial covariates, of which four have zero effect. The first study in Section 6.1 considers AIC and BIC in the Poisson point process case. The second study considers the clustered Thomas point process, where we employ the CIC and CBIC criteria to select the best model and compare with results obtained using AIC and BIC (wrongly assuming that the simulated data come from a Poisson point process).

The covariates are obtained from the BCI dataset (Hubbell and Foster, 1983; Condit et al., 1996; Condit, 1998), which in addition to the locations of around 300 species of trees observed in $W=[0,1000]\times[0,500]$ (m$^2$) contains a number of spatial covariates. In particular, we center and scale the two topographical covariates (elevation and slope of elevation) and four soil nutrients (aluminium, boron, calcium, and copper). The six covariates are depicted in Figure 1.

Figure 1: Maps of covariates used in the simulation study. From left to right: First row: elevation ($z_{1}$), slope ($z_{2}$) and aluminium ($z_{3}$); Second row: boron ($z_{4}$), calcium ($z_{5}$) and copper ($z_{6}$).

Suppressing the dependence on $n$ in the notation, we model the intensity function as

\rho(u;\boldsymbol{\beta})=\omega\exp\{\beta_{1}z_{1}(u)+\cdots+\beta_{6}z_{6}(u)\},  (20)

where $z_{1},\ldots,z_{6}$ are the centered and scaled covariates shown in Figure 1 and $\beta_{1},\ldots,\beta_{6}$ are the regression coefficients. We consider different settings for $\omega$ and $W$. Note that $\omega$ plays the role of $\theta\exp(\beta_{0})$ and will be adjusted to obtain desired expected numbers of points $\mu$ in $W$. When the simulation involves an observation window $W$ different from $[0,1000]\times[0,500]$, the covariates are simply rescaled to fit $W$.
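For reference, a minimal spatstat sketch of this setup is given below; it assumes the six covariates are available as compatible pixel images z1, ..., z6 on W, and the names and the calibration target are hypothetical.

```r
# A minimal sketch (spatstat): build the intensity (20) from covariate images and
# simulate an inhomogeneous Poisson pattern with a target expected count mu.
library(spatstat)
mu   <- 200                           # target expected number of points in W
beta <- c(0.5, -0.25, 0, 0, 0, 0)
lin  <- eval.im(beta[1] * z1 + beta[2] * z2 + beta[3] * z3 +
                beta[4] * z4 + beta[5] * z5 + beta[6] * z6)
lam0  <- eval.im(exp(lin))
omega <- mu / integral(lam0)          # calibrate omega so that E N(W) = mu
X <- rpoispp(eval.im(omega * lam0))   # inhomogeneous Poisson simulation
```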

The model selection criteria are compared in terms of the true positive rate (TPR), false positive rate (FPR), expected Kullback-Leibler divergence (MKL), and mean integrated squared error of the intensity function (MISE). The TPR (resp. FPR) is the expected fraction of informative (resp. non-informative) covariates included in the selected model. For a point process observed on $W$ with intensity $\rho(\cdot;\boldsymbol{\beta}^{*})$, where $\boldsymbol{\beta}^{*}$ stands for the true parameter estimated by $\hat{\boldsymbol{\beta}}$, MKL and MISE are estimated by averaging the following KL and ISE across simulations:

\mathrm{KL}=\int_{W}\left[\rho(u;\boldsymbol{\beta}^{*})\left\{\log\rho(u;\boldsymbol{\beta}^{*})-1\right\}-\rho(u;\hat{\boldsymbol{\beta}})\left\{\log\rho(u;\hat{\boldsymbol{\beta}})-1\right\}\right]\mathrm{d}u
\mathrm{ISE}=\int_{W}\left\{\rho(u;\boldsymbol{\beta}^{*})-\rho(u;\hat{\boldsymbol{\beta}})\right\}^{2}\mathrm{d}u.
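Given pixel images of the true and fitted intensity functions (e.g. rho_fit <- predict(fit) in spatstat; the image names are hypothetical and the two images are assumed to share a common raster), a minimal sketch of the KL and ISE computations is

```r
# A minimal sketch (spatstat): grid-based evaluation of KL and ISE from two
# compatible pixel images rho_true and rho_fit of the true and fitted intensities.
KL  <- integral(eval.im(rho_true * (log(rho_true) - 1) -
                        rho_fit  * (log(rho_fit)  - 1)))
ISE <- integral(eval.im((rho_true - rho_fit)^2))
```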

6.1 Poisson point process model

We consider two different scenarios to illustrate both types of asymptotics.

  • Scenario 1 (infill asymptotics). $W=[0,1]\times[0,0.5]$. We adjust $\omega=\theta\exp(\beta_{0})$ such that $\mu$ equals either $50$ or $200$.

  • Scenario 2 (increasing domain asymptotics). $W=[0,500]\times[0,250]$ or $[0,1000]\times[0,500]$, and $\omega$ is chosen so that $\mu=200$ or $\mu=800$.

For each scenario and each choice of $W$ and $\omega$, 500 simulations are generated from inhomogeneous Poisson point processes with intensity function given by (20) using the function rpoispp from the spatstat R package. We set $\beta_{1}=0.5$ and $\beta_{2}=-0.25$ to represent moderate effects of elevation and slope, and set $\beta_{3}=\dots=\beta_{6}=0$. For each simulation, parameters are estimated by maximizing the Berman-Turner approximation (see e.g. Baddeley and Turner, 2000) of the Poisson log-likelihood (3), using a number $m$ of quadrature dummy points to approximate the integral in (3). The estimation is done using the ppm function. The model is then selected according to the AIC and BIC-type criteria. We consider different variants of the BIC criterion, namely

\mathrm{BIC}_{l}(\pi)=-2\ell(\hat{\boldsymbol{\beta}}_{l})+p_{l,\text{approx}}^{*}\log(\pi),\quad l=1,\dots,64,

where $\pi$ represents a penalty. Note that (omitting the dependence on $l$) $\mathrm{AIC}=\mathrm{BIC}\{\exp(2)\}$ and that $\mathrm{BIC}(N)$, with $N=N(W)$, corresponds to the criterion used by Thurman et al. (2015) and is also the criterion suggested by the present paper. We also consider $\mathrm{BIC}(|W|)$ used by Choiruddin et al. (2018) and $\mathrm{BIC}(N+m)$ considered by Jeffrey et al. (2018).

                                          AIC   BIC(N)   BIC(N+400mu)   BIC(|W|)
W = [0,1] x [0,0.5],       TPR             70       59             49        100
mu = 50                    FPR             16        5              1        100
                           MISE           6.5      6.0            6.0        7.9
                           MKL            3.0      2.9            2.9        3.6
W = [0,1] x [0,0.5],       TPR             94       83             63        100
mu = 200                   FPR             17        2              0        100
                           MISE           2.6      2.3            3.3        3.2
                           MKL            2.9      2.8            4.2        3.5
W = [0,500] x [0,250],     TPR             95       83             63         62
mu = 200                   FPR             16        2              0          0
                           MISE           1.0      0.9            1.3        1.3
                           MKL            2.8      2.7            4.1        4.2
W = [0,1000] x [0,500],    TPR            100       99             99         98
mu = 800                   FPR             18        1              0          0
                           MISE          10.4      6.5            6.7        7.0
                           MKL            2.9      1.8            1.9        2.0
Table 1: True positive (TPR) and false positive (FPR) rates in percent, MISE, and MKL, estimated from 500 simulations of inhomogeneous Poisson point processes on different observation domains. Model selection is based on AIC or BIC criteria of the form $\mathrm{BIC}_{l}=-2\ell(\hat{\boldsymbol{\beta}}_{l})+p_{l}\log\pi$ ($l=1,\dots,64$) for the penalty terms $\pi=N$, $N+400\mu$, $|W|$. For convenience, the four rows with MISE are multiplied by 0.001, 0.01, 100, and 1000, respectively.

For both scenarios, we perform estimation with $m=4\mu$ (the rule of thumb suggested by spatstat) and model selection with the criteria $\mathrm{AIC}$, $\mathrm{BIC}(N)$, $\mathrm{BIC}(N+4\mu)$ and $\mathrm{BIC}(|W|)$. Similar results are obtained with $\mathrm{BIC}(N)$ and $\mathrm{BIC}(N+4\mu)$, and we omit the results for the latter. We also perform estimation and selection with $m=400\mu$ and only report results for $\mathrm{BIC}(N+400\mu)$, since the results with the criteria $\mathrm{AIC}$, $\mathrm{BIC}(N)$ and $\mathrm{BIC}(|W|)$ are similar to those obtained when the estimation is performed with $m=4\mu$.

When $|W|$ is small, especially when $\log|W|<0$, the criterion $\mathrm{BIC}(|W|)$ obviously fails as it selects the most complex model regardless of the value of $\mu$, and the FPR thereby becomes 100%. In addition, as indicated in the second and third blocks of Table 1, where the point patterns have the same average number of points, it is worth noticing that $\mathrm{BIC}(|W|)$ selects quite different models. In particular this criterion has an undesirable strong dependence on the choice of length unit.

The criterion $\mathrm{BIC}(N+m)$ with a large $m$ also fails since the TPR with this criterion and $m=400\mu$ is very small compared to the other criteria, especially when the expected number of points is small or moderate. The AIC criterion achieves a high TPR in all situations but fails since it suffers from a high FPR, even in Scenario 2 with $\mu=800$.

In all cases, $\mathrm{BIC}(N)$ provides the best trade-off between TPR and FPR, and the results improve when $\omega$ or $|W|$ is increased. The smallest values of MKL and MISE are moreover always obtained with $\mathrm{BIC}(N)$. In case of a Poisson process, we therefore recommend $\mathrm{BIC}(N)$ (simply denoted BIC in the following).

6.2 Thomas point process model

To generate a simulation from a Thomas point process with intensity (20), we first generate a parent point pattern from a stationary Poisson point process $\mathbf{C}$ with intensity $\kappa>0$. Given $\mathbf{C}$, clusters $\mathbf{X}_{c}$, $c\in\mathbf{C}$, are generated from inhomogeneous Poisson point processes with intensity functions

\rho_{c}(u;\boldsymbol{\beta})=\omega\exp\{\beta_{1}z_{1}(u)+\cdots+\beta_{6}z_{6}(u)\}\,k(u-c;\gamma)/\kappa,

where $k(u-c;\gamma)=(2\pi\gamma^{2})^{-1}\exp\{-\|u-c\|^{2}/(2\gamma^{2})\}$ is the density of $\mathcal{N}(0,\gamma^{2}\mathbf{I}_{2})$. Finally, $\mathbf{X}=\cup_{c\in\mathbf{C}}\mathbf{X}_{c}$ is an inhomogeneous Thomas point process with intensity (20). The regression parameters are set as follows: $\beta_{1}=2$, $\beta_{2}=-1$, $\beta_{3}=\dots=\beta_{6}=0$. We consider $\kappa=4\times 10^{-4}$ and two scale parameters, $\gamma=5$ and $\gamma=15$. A lower value of $\gamma$ tends to produce more clustered patterns. We consider the observation domains $W=[0,500]\times[0,250]$ and $W=[0,1000]\times[0,500]$, with $\omega$ adjusted to give expected numbers of points $\mu=400$ and $\mu=1600$ for the two windows. The chosen value of $\kappa$ implies on average 50 parent points in $W=[0,500]\times[0,250]$ and 200 parent points in $W=[0,1000]\times[0,500]$.
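A minimal spatstat sketch of this simulation is given below (hypothetical objects omega, z1, z2 and window W; the spatially varying offspring mean is passed to rThomas as a pixel image, so that the resulting intensity is approximately $\kappa$ times that image, up to edge effects; argument names may differ slightly between spatstat versions).

```r
# A minimal sketch (spatstat): inhomogeneous Thomas process with intensity (20).
# When mu is a pixel image, rThomas() thins the offspring so that the intensity of
# the resulting process is kappa * mu(u); hence mu = rho / kappa gives intensity rho.
kappa <- 4e-4; gamma <- 5
rho   <- eval.im(omega * exp(2 * z1 - z2))   # intensity (20) with beta1 = 2, beta2 = -1
X <- rThomas(kappa = kappa, scale = gamma, mu = eval.im(rho / kappa), win = W)
```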

                                        AIC     BIC     CIC    CBIC
W = [0,500] x [0,250],   TPR             98      96      88      81
mu = 400, gamma = 5      FPR             81      65      24      17
                         MISE           4.5     4.4     2.7     2.5
                         MKL            9.7     9.6     7.5     9.1
                         Mean(p*)         -       -   103.6    76.6
                         SD(p*)           -       -   122.7    70.1
W = [0,1000] x [0,500],  TPR            100     100      97      94
mu = 1600, gamma = 5     FPR             81      65      20      11
                         MISE           3.9     3.8     2.6     2.4
                         MKL           10.3    10.2     7.9     8.2
                         Mean(p*)         -       -   101.1    81.3
                         SD(p*)           -       -   104.6    70.2
W = [0,500] x [0,250],   TPR            100      99      95      93
mu = 400, gamma = 15     FPR             70      49      43      32
                         MISE           2.1     2.0     1.8     1.8
                         MKL            5.2     5.0     4.9     5.4
                         Mean(p*)         -       -    48.9    40.9
                         SD(p*)           -       -    77.3    61.3
W = [0,1000] x [0,500],  TPR            100     100      96      94
mu = 1600, gamma = 15    FPR             78      57      36      24
                         MISE           2.8     2.8     2.3     2.5
                         MKL            7.7     7.5     7.3     9.3
                         Mean(p*)         -       -    96.1    78.1
                         SD(p*)           -       -   172.1   124.6
Table 2: True positive (TPR) and false positive (FPR) rates in percent, MISE, MKL (divided by 10), and means (Mean(p*)) and standard deviations (SD(p*)) of the estimates of $p_{l,\text{approx}}^{*}$, based on 500 simulations from inhomogeneous Thomas point processes with $\kappa=4\times 10^{-4}$ and scale parameter $\gamma=5$ or $15$, observed on different observation domains. Parameters are adjusted to have on average $\mu=400$ points in the small window and $1600$ points in the larger one. Model selection is based on AIC, BIC, CIC, and CBIC.

The R function rThomas is used to simulate the point patterns and the function kppm to estimate the regression parameters. We use $m=16\mu$ dummy points for the different integral approximations. Models are then selected using the four criteria AIC, BIC, CIC and CBIC. When using AIC or BIC we implicitly (and wrongly) assume that the simulated patterns come from a Poisson point process and thus set $p_{l,\text{approx}}^{*}=p_{l}$. For the composite likelihood type criteria CIC and CBIC, we compute $\hat{p}_{l,\text{approx}}^{*}$ using the R function vcov.kppm. The parameters $\kappa$ and $\gamma$ are estimated by minimum contrast estimation with the tuning parameter $r_{\max}$ (Waagepetersen and Guan, 2009) set to 20 when $\gamma=5$ and to 50 when $\gamma=15$.
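A minimal sketch of the fitting step is given below (spatstat; the covariate list covs and the pattern X are hypothetical, as.ppm() is assumed to extract the first-step Poisson fit, and the exact arguments controlling the minimum-contrast step, such as rmax, may differ between spatstat versions).

```r
# A minimal sketch (spatstat): fit the Thomas model, then assemble the criteria.
fit <- kppm(X ~ z1 + z2 + z3 + z4 + z5 + z6, clusters = "Thomas", data = covs)
ll  <- as.numeric(logLik(as.ppm(fit)))  # Poisson composite log-likelihood at beta-hat
p   <- length(coef(fit))
aic <- -2 * ll + 2 * p                  # AIC / BIC treat the pattern as Poisson
bic <- -2 * ll + p * log(npoints(X))
# CIC and CBIC additionally replace p by an estimate of p*_l,approx, obtained in the
# paper via vcov.kppm (or, e.g., via the effective_df() sketch of Section 4).
```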

Results are reported in Table 2. The AIC and BIC criteria ignore the second-order structure of the simulated point patterns and we observe that overall AIC and BIC produce high TPR but also very high FPR. The CIC and CBIC give much more reasonable trade-offs between TPR and FPR and also (with one exception) give smaller MKL and MISE than AIC and BIC.

Focusing on CIC and CBIC, we first notice that the means of the estimates of $p_{l,\text{approx}}^{*}$ are high. This is because the considered clustered point processes are far from Poisson models. We also notice that even when the number of points is quite large, the estimation of $p_{l,\text{approx}}^{*}$ is inaccurate, with standard deviations of the estimates of the same order as the means. Nevertheless, the resulting CIC and CBIC give reasonable results. Comparing CIC and CBIC, CIC in general has a higher FPR than CBIC but on the other hand always gives the smallest MKL. The TPR and MISE are quite similar for CIC and CBIC. Hence in the case of the clustered point processes considered here, CIC and CBIC clearly outperform AIC and BIC, but there is not a clear winner between CIC and CBIC.

7 Discussion

In this paper we establish a theoretical foundation for various model selection criteria for inhomogeneous point processes under various asymptotic settings. In the case of a Poisson process, a main contribution is to identify, in relation to BIC, the correct interpretation of ‘sample size’: based on our theoretical derivation it is the expected number of points, which in practice is estimated by the observed number of points. This interpretation is supported by our simulation study, which also supports the common understanding that BIC may be preferable to AIC, as the latter tends to pick too complex models.

More generally, for selecting a regression model for the intensity function of a general point process, we develop composite model selection criteria, CIC and CBIC, that clearly outperform AIC and BIC in the simulation study for a clustered point process. One issue regarding CIC and CBIC is the estimation of the bias correction for the estimate of the composite Kullback-Leibler divergence, which depends on the unknown true intensity function and pair correlation function. Here, inspired by the approach underlying AIC, for a given model we simply plug in the fitted intensity function and pair correlation function for the model in question. This is computationally convenient, and we leave it as an open problem to develop more precise estimates.

A further interesting topic for future research would be to study the theoretical foundation of criteria for selecting models for the conditional intensity of a Gibbs process fitted by pseudo-likelihood. Another interesting problem is the selection of the penalization parameter when the intensity function is estimated using regularization methods like the lasso (see e.g. Thurman et al., 2015; Choiruddin et al., 2018). This is not covered by our theoretical results, which rely on asymptotic results for unbiased estimating functions or on Bayesian considerations.
Acknowledgments
The research of J.-F. Coeurjolly is supported by the Natural Sciences and Engineering Research Council. Rasmus Waagepetersen is supported by The Danish Council for Independent Research | Natural Sciences, grant DFF - 7014-00074 "Statistics for point processes in space and beyond", and by the Centre for Stochastic Geometry and Advanced Bioimaging, funded by grant 8721 from the Villum Foundation.

Appendix A Proof of Theorem 1

Proof.

(i) Using a zero order Taylor expansion of $e_{n}(\hat{\boldsymbol{\beta}}_{l,n})$ (which equals zero by definition) around $\boldsymbol{\beta}_{l,n}^{*}$, expressed with an integral remainder term, we have

e_{n}(\boldsymbol{\beta}_{l,n}^{*})=\left(\int_{0}^{1}\int_{W_{n}}\theta_{n}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\exp\left[\left\{\boldsymbol{\beta}_{l,n}^{*}+t(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})\right\}^{\top}\mathbf{z}_{l}(u)\right]\mathrm{d}u\,\mathrm{d}t\right)\left(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\right).

By rearranging the right-hand side of the latter equation and using Fubini’s theorem, we have

e_{n}(\boldsymbol{\beta}_{l,n}^{*})=\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\exp\left\{{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(u)\right\}\left[\int_{0}^{1}\left\{(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{z}_{l}(u)\right\}\exp\left\{t(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{z}_{l}(u)\right\}\mathrm{d}t\right]\mathrm{d}u
=\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\exp\left\{{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(u)\right\}\left[\exp\left\{(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{z}_{l}(u)\right\}-1\right]\mathrm{d}u.

Now, using C2-C3, we can apply a mean value theorem for multiple integrals: there exists $v=v_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l,n}^{*})\in W_{n}$ such that

e_{n}(\boldsymbol{\beta}_{l,n}^{*})=\tau_{n}\mathbf{z}_{l}(v)\exp\left\{{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(v)\right\}\left[\exp\left\{(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{z}_{l}(v)\right\}-1\right].  (21)

Since $\|\mathbf{z}_{l}(v)\|>0$ almost surely by condition C2, we deduce (10) from (21) with a little algebra. Let

A_{n}=\tau_{n}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\frac{\mathbf{z}_{l}(v)}{\|\mathbf{z}_{l}(v)\|^{2}}\exp\{-{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(v)\}.

By conditions C1-C4, it is clear that $A_{n}\to 0$ almost surely as $n\to\infty$. Using the continuity of $t\mapsto\log(1+t)$ and again condition C2, we deduce that $\log(1+A_{n})\,\mathbf{z}_{l}(v)/\|\mathbf{z}_{l}(v)\|^{2}$ tends to 0 almost surely.

(ii) The proof consists in applying a modified version of Waagepetersen and Guan (2009, Theorem 2) to the sequence of estimating functions $e_{n}(\boldsymbol{\beta}_{l})$. The modification is given in Appendix E and is needed to handle a sequence of ‘least false’ parameter values $\boldsymbol{\theta}_{n}^{*}$ instead of a unique ‘true’ value $\boldsymbol{\theta}^{*}$. Thus, we need to verify the assumptions G1-G4 in Appendix E. We leave it to the reader to check that conditions C3 and C5 imply conditions G1 and G2 with $\mathbf{V}_{n}=\sqrt{\tau_{n}}\mathbf{I}_{p}$ and $\mathbf{J}_{n}(\boldsymbol{\theta}^{*}_{n})=\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})$.

To prove assumption G3, we have to show that, as $n\to\infty$,

\sup_{\sqrt{\tau_{n}}\|\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{l,n}^{*}\|\leq d}\frac{\|\mathbf{S}_{n}(\boldsymbol{\beta}_{l})-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\|_{M}}{\tau_{n}}\to 0,

where $\|\mathbf{M}\|_{M}=\max_{ij}|M_{ij}|$. Consider a $\boldsymbol{\beta}_{l}$ such that $\sqrt{\tau_{n}}\|\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{l,n}^{*}\|\leq d$. Under conditions C1-C3, there exist $K_{1},K_{2},K<\infty$ such that

\|\mathbf{S}_{n}(\boldsymbol{\beta}_{l})-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\|_{M}=\left\|\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\left[\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}-\exp\{{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(u)\}\right]\mathrm{d}u\right\|_{M}
\leq K_{1}\,\tau_{n}\|\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{l,n}^{*}\|\exp\left\{K_{2}\left(\|\boldsymbol{\beta}_{l,n}^{*}\|+d/\sqrt{\tau_{n}}\right)\right\}
\leq K\tau_{n}\|\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{l,n}^{*}\|=\mathcal{O}(\sqrt{\tau_{n}})=\mathcal{O}_{\mathrm{P}}(\sqrt{\tau_{n}}).  (22)

To verify condition G4, it is sufficient to note that condition C6 implies that the variance of $\tau_{n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})$ is bounded. Letting $\hat{\boldsymbol{\beta}}_{l,n}$ denote the (unique for $n$ large enough) solution of $e_{n}(\boldsymbol{\beta}_{l})=0$ or, equivalently,

𝜷^l,n=argmax𝜷lpln(𝜷l)\hat{\boldsymbol{\beta}}_{l,n}=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}}\ell_{n}(\boldsymbol{\beta}_{l})

we can then conclude that 𝜷^l,n\hat{\boldsymbol{\beta}}_{l,n} is a root-τn{\tau_{n}} consistent estimator of 𝜷l,n\boldsymbol{\beta}_{l,n}^{*}, that is, τn(𝜷^l,n𝜷l,n)\sqrt{\tau_{n}}(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}) is bounded in probability, which proves (11).

(iii) We use a Taylor expansion around \boldsymbol{\beta}_{l,n}^{*}: there exist t\in(0,1) and \tilde{\boldsymbol{\beta}}_{l,n}=\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l,n}^{*}-\hat{\boldsymbol{\beta}}_{l,n}) such that

e_{n}(\boldsymbol{\beta}_{l,n}^{*})=e_{n}(\boldsymbol{\beta}_{l,n}^{*})-e_{n}(\hat{\boldsymbol{\beta}}_{l,n})=\mathbf{S}_{n}(\tilde{\boldsymbol{\beta}}_{l,n})(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})
=\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})+\mathbf{S}_{n}^{\prime}\mathbf{B}_{n}

where \mathbf{B}_{n}=\sqrt{\tau_{n}}(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}) and where \mathbf{S}_{n}^{\prime}=\tau_{n}^{-1/2}\left\{\mathbf{S}_{n}(\tilde{\boldsymbol{\beta}}_{l,n})-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}. We now show that \mathbf{S}_{n}^{\prime}\mathbf{B}_{n}=\mathcal{O}_{\mathrm{P}}(1). Let d and K be given as above and let K_{3}\geq d^{2}K. Using (22) we obtain

\mathrm{P}\left(\|\mathbf{S}_{n}^{\prime}\mathbf{B}_{n}\|_{M}\geq K_{3}\right)\leq\mathrm{P}\left(\|\mathbf{S}_{n}^{\prime}\mathbf{B}_{n}\|_{M}\geq K_{3},\|\mathbf{B}_{n}\|<d\right)+\mathrm{P}\left(\|\mathbf{B}_{n}\|\geq d\right)
\leq\mathrm{P}\left(dK\|\mathbf{B}_{n}\|\geq K_{3}\right)+\mathrm{P}\left(\|\mathbf{B}_{n}\|\geq d\right)\leq 2\,\mathrm{P}\left(\|\mathbf{B}_{n}\|\geq d\right).

Now from (ii), for any \varepsilon>0, by choosing d large enough, \mathrm{P}(\|\mathbf{B}_{n}\|\geq d)\leq\varepsilon/2 for n sufficiently large, whereby \mathrm{P}\left(\|\mathbf{S}_{n}^{\prime}\mathbf{B}_{n}\|_{M}\geq K_{3}\right)\leq\varepsilon for n sufficiently large.

Finally, since by condition C5, \|\mathbf{\Sigma}_{l,n}^{-1/2}\|=\mathcal{O}(\tau_{n}^{-1/2}), we conclude that

\boldsymbol{\Sigma}_{l,n}^{-1/2}\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\left(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\right)=\boldsymbol{\Sigma}_{l,n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})+\boldsymbol{\Sigma}_{l,n}^{-1/2}\mathcal{O}_{\mathrm{P}}(1)
=\boldsymbol{\Sigma}_{l,n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})+o_{\mathrm{P}}(1)

which yields the result using condition C7 and Slutsky’s lemma. ∎
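To complement the proof, the consistency and the root-\tau_{n} rate established in parts (i)-(ii) can be checked by simulation. The following R sketch is a minimal illustration only, under assumptions that are not part of the paper: a log-linear intensity \rho(u;\boldsymbol{\beta})=\exp(\beta_{0}+\beta_{1}z(u)) with the bounded covariate z(u)=u_{1}/n on W_{n}=[0,n]^{2}, simulation by independent thinning of a dominating homogeneous Poisson process, and a grid approximation of the integral in \ell_{n}. The estimate obtained by maximizing \ell_{n} then approaches the true value as the window grows.

# Minimal sketch (illustrative only): simulate an inhomogeneous Poisson process
# with intensity rho(u; beta) = exp(b0 + b1 * z(u)) on W_n = [0, n]^2 by thinning,
# and maximise the Poisson likelihood l_n; beta.hat approaches beta as n grows.
set.seed(1)
beta.true <- c(log(5), 1)                    # arbitrary illustrative values
z <- function(x, n) x / n                    # bounded covariate, in line with C2

simulate.ipp <- function(n, beta) {
  lam.max <- exp(beta[1] + abs(beta[2]))     # dominating constant intensity
  N <- rpois(1, lam.max * n^2)
  x <- runif(N, 0, n); y <- runif(N, 0, n)
  keep <- runif(N) < exp(beta[1] + beta[2] * z(x, n)) / lam.max   # thinning
  cbind(x = x[keep], y = y[keep])
}

loglik <- function(beta, pts, n, m = 100) {  # l_n(beta), integral on an m x m grid
  g <- expand.grid(x = (1:m - 0.5) * n / m, y = (1:m - 0.5) * n / m)
  sum(beta[1] + beta[2] * z(pts[, "x"], n)) -
    sum(exp(beta[1] + beta[2] * z(g$x, n))) * (n / m)^2
}

for (n in c(10, 20, 40)) {
  pts <- simulate.ipp(n, beta.true)
  fit <- optim(c(0, 0), function(b) -loglik(b, pts, n))
  cat("n =", n, " beta.hat =", round(fit$par, 3), "\n")
}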

Appendix B Conditions C4 and C7 for the inhomogeneous Poisson cluster point process

In this section, we show that the inhomogeneous Poisson cluster point process presented at the end of Section 3.2 satisfies C4 and C7. Recall that \lambda_{n}(u)=\theta_{n}\alpha\rho(u) where \sup_{u}\rho(u)={\mathcal{O}}(1).

Proof.

For k=1,\dots,\lfloor\tau_{n}\rfloor, let \mathbf{C}_{n,k} be independent inhomogeneous Poisson point processes with intensity \theta_{n}/\lfloor\tau_{n}\rfloor. By the superposition property of Poisson point processes, \mathbf{C}_{n} has the same distribution as \cup_{k}\mathbf{C}_{n,k}. Define \mathbf{A}_{n}=\sum_{u\in\mathbf{X}_{n}}\mathbf{z}_{l}(u). Then,

\mathbf{A}_{n}=\sum_{k=1}^{\lfloor\tau_{n}\rfloor}\mathbf{Z}_{n,k}\qquad\text{ where }\qquad\mathbf{Z}_{n,k}=\sum_{c\in\mathbf{C}_{n,k}}\sum_{u\in\mathbf{X}_{c}}\mathbf{z}_{l}(u).

Applying the Slivnyak-Mecke theorem twice (see e.g. Theorem 3.1 in Møller and Waagepetersen, 2004),

{\mathbb{E}}\mathbf{Z}_{n,k}=\lfloor\tau_{n}\rfloor^{-1}\int_{W_{n}}\mathbf{z}_{l}(u)\lambda_{n}(u)\mathrm{d}u
{\mathbb{V}\mathrm{ar}}\mathbf{Z}_{n,k}=\lfloor\tau_{n}\rfloor^{-1}\bigg\{\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\lambda_{n}(u)\mathrm{d}u
\qquad+\int_{W_{n}}\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(v)^{\top}\lambda_{n}(u)\lambda_{n}(v)\left(g_{n}(u,v)-1\right)\mathrm{d}u\mathrm{d}v\bigg\}

where g_{n}(u,v)=1+\theta_{n}^{-1}(k*k)(v-u). Hence,

{\mathbb{E}}\mathbf{A}_{n}=\int_{W_{n}}\mathbf{z}_{l}(u)\lambda_{n}(u)\mathrm{d}u\qquad\text{and}\qquad{\mathbb{V}\mathrm{ar}}\mathbf{A}_{n}=\boldsymbol{\Sigma}_{l,n}.

Moreover, by (8),

\int_{W_{n}}\mathbf{z}_{l}(u)\lambda_{n}(u)\mathrm{d}u=\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\rho_{n}(u;\boldsymbol{\beta}_{l,n}^{*})\mathrm{d}u

so that e_{n}(\boldsymbol{\beta}_{l,n}^{*})=\mathbf{A}_{n}-{\mathbb{E}}\mathbf{A}_{n}=\sum_{k=1}^{\lfloor\tau_{n}\rfloor}\mathring{\mathbf{Z}}_{n,k} with \mathring{\mathbf{Z}}_{n,k}=\mathbf{Z}_{n,k}-{\mathbb{E}}\mathbf{Z}_{n,k}. Since C6 is satisfied, as shown in Example 1, the elements of {\mathbb{V}\mathrm{ar}}(\mathring{\mathbf{Z}}_{n,k}) are {\mathcal{O}}(1), whereby \sum_{k\geq 1}{\mathbb{V}\mathrm{ar}}(\mathring{\mathbf{Z}}_{n,k})/k^{2}<\infty (elementwise). Therefore, we can apply Kolmogorov's strong law of large numbers to establish that {\lfloor\tau_{n}\rfloor}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*}) tends almost surely to 0. Using the same conditions and C5, we can also apply the Lindeberg-Feller theorem and obtain that, as n\to\infty, \boldsymbol{\Sigma}_{l,n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\to N(0,\mathbf{I}_{p_{l}}) in distribution. ∎
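The decomposition of \mathbf{A}_{n} into the \lfloor\tau_{n}\rfloor independent terms \mathbf{Z}_{n,k} can be illustrated on a toy example. The R sketch below (a Poisson process on the unit square with arbitrary intensity \theta and covariate z, purely illustrative and not the cluster process of the paper) checks that \sum_{u\in\mathbf{X}}z(u) has the same mean and variance whether \mathbf{X} is simulated directly with intensity \theta or as a superposition of m independent processes with intensity \theta/m, both matching the values \theta\int z(u)\mathrm{d}u and \theta\int z(u)^{2}\mathrm{d}u given by the Campbell formula.

# Toy check (illustrative only): on [0,1]^2, the statistic sum_{u in X} z(u) for a
# Poisson process with intensity theta has the same distribution whether X is
# simulated directly or as a superposition of m independent processes of intensity theta/m.
set.seed(2)
theta <- 200; m <- 20; nsim <- 2000
z <- function(x) 1 + x                                  # arbitrary bounded covariate

stat.direct <- replicate(nsim, {
  N <- rpois(1, theta); sum(z(runif(N)))
})
stat.super <- replicate(nsim, {
  sum(replicate(m, { N <- rpois(1, theta / m); sum(z(runif(N))) }))
})

rbind(direct = c(mean = mean(stat.direct), var = var(stat.direct)),
      super  = c(mean = mean(stat.super),  var = var(stat.super)))
# both rows should be close to (theta * 3/2, theta * 7/3) by the Campbell formula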

Appendix C Proof of Proposition 1

For a multi-index \alpha\in\mathbb{N}^{p} with cardinality |\alpha|=\alpha_{1}+\dots+\alpha_{p} and a |\alpha| times differentiable function f:\mathbb{R}^{p}\to\mathbb{R}, we define, for u\in\mathbb{R}^{p},

\partial^{\alpha}f(u)=\frac{\partial^{|\alpha|}f(u)}{\partial u_{1}^{\alpha_{1}}\dots\partial u_{p}^{\alpha_{p}}},

and we also use the notation u^{\alpha}=u_{1}^{\alpha_{1}}\dots u_{p}^{\alpha_{p}}.
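For instance, with p=2 and \alpha=(1,1), we have |\alpha|=2, \alpha!=\alpha_{1}!\,\alpha_{2}!=1, \partial^{\alpha}f(u)=\partial^{2}f(u)/\partial u_{1}\partial u_{2}, and u^{\alpha}=u_{1}u_{2}.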

Proof.

We follow the sketch of the proof of Varin and Vidoni (2005, Lemmas 1-2). Let \mathbf{Z}_{n} be an independent copy of \mathbf{X}_{n}, and let \ell_{n}(\boldsymbol{\beta}_{l};\mathbf{Z}_{n}) and e_{n}(\boldsymbol{\beta}_{l};\mathbf{Z}_{n}) be the composite likelihood and its corresponding estimating function evaluated at \mathbf{Z}_{n}. Recall the notation \ell_{n}(\boldsymbol{\beta}_{l})=\ell_{n}(\boldsymbol{\beta}_{l};\mathbf{X}_{n}), C_{n}(\boldsymbol{\beta}_{l})=-{\mathbb{E}}\ell_{n}(\boldsymbol{\beta}_{l}), e_{n}(\boldsymbol{\beta}_{l})=e_{n}(\boldsymbol{\beta}_{l};\mathbf{X}_{n}) and \hat{\boldsymbol{\beta}}_{l,n}=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}}\ell_{n}(\boldsymbol{\beta}_{l}). We have

{\mathbb{E}}C_{n}(\hat{\boldsymbol{\beta}}_{l,n})=-{\mathbb{E}}\left[{\mathbb{E}}\left\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})\,\big|\,\mathbf{X}_{n}\right\}\right].

Using a first-order Taylor expansion of \ell_{n}(\cdot;\mathbf{Z}_{n}) around \boldsymbol{\beta}_{l,n}^{*} with integral remainder term, we have

\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})=\ell_{n}(\boldsymbol{\beta}_{l,n}^{*};\mathbf{Z}_{n})+e_{n}(\boldsymbol{\beta}_{l,n}^{*};\mathbf{Z}_{n})^{\top}\mathbf{\Delta}_{n}+\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})

where \mathbf{\Delta}_{n}=\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*} and where the integral remainder term can be expressed as

\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})=2\sum_{\alpha\in\mathbb{N}^{p_{l}},|\alpha|=2}\bigg[\frac{(\boldsymbol{\beta}_{l,n}^{*}-\hat{\boldsymbol{\beta}}_{l,n})^{\alpha}}{\alpha!}\int_{0}^{1}(1-t)\partial^{\alpha}\ell_{n}\left\{\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l,n}^{*}-\hat{\boldsymbol{\beta}}_{l,n});\mathbf{Z}_{n}\right\}\mathrm{d}t\bigg].

It is worth noting that for any \boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}} and any \alpha\in\mathbb{N}^{p_{l}} such that |\alpha|=2, \partial^{\alpha}\ell_{n}(\boldsymbol{\beta}_{l};\mathbf{Z}_{n}) is a deterministic function of \boldsymbol{\beta}_{l}, since the second-order derivatives of \ell_{n} only involve the (deterministic) integral term. Therefore \mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n}):=\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n}) does not depend on \mathbf{Z}_{n}. Using this and the unbiasedness of e_{n}(\boldsymbol{\beta}_{l,n}^{*};\cdot) we have

{\mathbb{E}}\left\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})\,\big|\,\mathbf{X}_{n}\right\}={\mathbb{E}}\left\{\ell_{n}(\boldsymbol{\beta}_{l,n}^{*};\mathbf{Z}_{n})\right\}+\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n}).

Hence,

{\mathbb{E}}\left\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}=-{\mathbb{E}}\left\{\ell_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}-{\mathbb{E}}\left\{\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n})\right\}. (23)

Now, using a first-order Taylor expansion of -\ell_{n}(\cdot) around \boldsymbol{\beta}_{l,n}^{*}, we have

-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})=-\ell_{n}(\boldsymbol{\beta}_{l,n}^{*})-e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{\Delta}_{n}-\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n}). (24)

Combining (23) and the expectation of (24) we obtain

{\mathbb{E}}\left\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}-{\mathbb{E}}\left\{-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}={\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{\Delta}_{n}\right\}. (25)

Since by definition of \hat{\boldsymbol{\beta}}_{l,n}, \mathbf{\Delta}_{n}=\mathbf{S}_{n}^{-1}(\tilde{\boldsymbol{\beta}}_{l,n})e_{n}(\boldsymbol{\beta}_{l,n}^{*}) where \tilde{\boldsymbol{\beta}}_{l,n}=\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l,n}^{*}-\hat{\boldsymbol{\beta}}_{l,n}) for some t\in(0,1), we continue with

{\mathbb{E}}\left\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}-{\mathbb{E}}\left\{-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}={\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}+{\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}
=\mathrm{trace}\left\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\mathbf{\Sigma}_{l,n}\right\}+{\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\} (26)

where \mathbf{M}_{n}=\mathbf{S}_{n}(\tilde{\boldsymbol{\beta}}_{l,n})^{-1}-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}. Since e_{n}(\boldsymbol{\beta}_{l,n}^{*})/\sqrt{\tau_{n}} is bounded in probability by C6, and since \tau_{n}\mathbf{M}_{n} converges to zero in probability because \mathbf{\Delta}_{n} converges to zero in probability, we obtain that e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*}) converges to zero in probability. This combined with the uniform integrability (13) implies {\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}=o(1). ∎
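The first term in (26) relies on the generic identity {\mathbb{E}}\{e^{\top}\mathbf{M}e\}=\mathrm{trace}(\mathbf{M}\boldsymbol{\Sigma}) for a mean-zero random vector e with variance \boldsymbol{\Sigma} and a fixed matrix \mathbf{M}. A quick R check of this identity, using an arbitrary multivariate normal vector as a stand-in for e_{n}(\boldsymbol{\beta}_{l,n}^{*}) (the matrices below are illustrative and not quantities from the paper):

# Monte Carlo check (illustrative only) of E{ t(e) %*% solve(S) %*% e } = trace(solve(S) %*% Sigma)
# for a mean-zero vector e with variance Sigma; this is the step giving the trace term in (26).
set.seed(3)
p <- 3
Sigma <- crossprod(matrix(rnorm(p^2), p)) + diag(p)     # an arbitrary covariance matrix
S     <- crossprod(matrix(rnorm(p^2), p)) + diag(p)     # an arbitrary invertible 'sensitivity'
e     <- matrix(rnorm(1e5 * p), ncol = p) %*% chol(Sigma)   # rows: draws of e ~ N(0, Sigma)

quad <- rowSums((e %*% solve(S)) * e)                   # t(e) %*% solve(S) %*% e for each draw
c(monte.carlo = mean(quad), exact = sum(diag(solve(S) %*% Sigma)))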

Appendix D Proof of Proposition 2

Proof.

Recall the notation \tau_{n}=\theta_{n}|W_{n}|. We have

p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})=\int_{\mathbb{R}^{p_{l}}}p_{n}(\mathbf{X}_{n},\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\mathrm{d}\boldsymbol{\beta}_{l}
=\int_{\mathbb{R}^{p_{l}}}p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\mathrm{d}\boldsymbol{\beta}_{l}
=\int_{\mathbb{R}^{p_{l}}}p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\exp\left\{\tau_{n}\,\ell_{n}(\boldsymbol{\beta}_{l})/\tau_{n}\right\}\mathrm{d}\boldsymbol{\beta}_{l}. (27)

We are now in a position to apply Łapiński (2019, Theorem 2), which gives rigorous conditions under which a multivariate Laplace approximation holds. First, condition C8 ensures that p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l}) is regular enough. Second, condition C5 ensures that \tau_{n}\ell_{n}(\boldsymbol{\beta}_{l}) has a nonsingular Hessian matrix for any \boldsymbol{\beta}_{l}. Third, for \alpha\in\mathbb{N}^{p_{l}} with |\alpha|\leq 3, we have for any \boldsymbol{\beta}_{l} (since z_{0l}^{\alpha_{0}}={1}^{\alpha_{0}}=1)

\partial^{\alpha}\{\tau_{n}^{-1}\log p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})\}=-\frac{1}{|W_{n}|}\int_{W_{n}}\prod_{j\in I_{l}}z_{j}^{\alpha_{j}}(u)\rho(u;\boldsymbol{\beta}_{l})\mathrm{d}u.

Therefore, under condition C2, \partial^{\alpha}\{\tau_{n}^{-1}\log p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})\} is uniformly bounded for \boldsymbol{\beta}_{l} in any compact subset of \mathbb{R}^{p_{l}}.

The fact that \int p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\mathrm{d}\boldsymbol{\beta}_{l}=1<\infty is sufficient to ensure Łapiński (2019, condition (5)). Finally, using the strong consistency of \hat{\boldsymbol{\beta}}_{l,n}, that is, \hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\to 0 almost surely, and condition C1, the last condition to check is formulated as follows: we need to verify that there exists n_{0}\in\mathbb{N} such that for n\geq n_{0}, \tau_{n}^{-1}\ell_{n}(\boldsymbol{\beta}_{l}) has a unique maximum in a closed ball B(\boldsymbol{\beta}_{l}^{*},\varepsilon) for some \varepsilon>0, where \boldsymbol{\beta}_{l}^{*} is given by condition C1, and such that

\Delta=\inf_{n\geq n_{0},\,\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}\setminus B(\boldsymbol{\beta}_{l}^{*},\varepsilon)}\;\tau_{n}^{-1}\left\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})-\ell_{n}(\boldsymbol{\beta}_{l})\right\}>0. (28)

By condition C5, for any n\geq 1, \ell_{n} (and thus \tau_{n}^{-1}\ell_{n}) indeed has a unique maximum. Given \varepsilon>0, there exists n_{0}\in\mathbb{N} such that for n\geq n_{0}, almost surely \hat{\boldsymbol{\beta}}_{l,n}\in B(\boldsymbol{\beta}_{l}^{*},\varepsilon/2). Consider \Delta with such n_{0} and \varepsilon. Note that for large n and any \boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}\setminus B(\boldsymbol{\beta}_{l}^{*},\varepsilon), \|\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l}\|\geq\varepsilon/2 almost surely.

Using a first-order Taylor expansion around \hat{\boldsymbol{\beta}}_{l,n} and using the fact that e_{n}(\hat{\boldsymbol{\beta}}_{l,n})=0, we have, for any \boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}},

\tau_{n}^{-1}\left\{\ell_{n}(\boldsymbol{\beta}_{l})-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}=R_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l})

where the remainder term R_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l}) is

\frac{2}{\tau_{n}}\sum_{\alpha\in\mathbb{N}^{p_{l}},|\alpha|=2}\bigg[\frac{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\alpha}}{\alpha!}\int_{0}^{1}(1-t)\partial^{\alpha}\ell_{n}\left\{\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n});\mathbf{X}_{n}\right\}\mathrm{d}t\bigg]
=\frac{-2}{|W_{n}|}\sum_{\alpha\in\mathbb{N}^{p_{l}},|\alpha|=2}\bigg[\frac{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\alpha}}{\alpha!}\int_{0}^{1}(1-t)\int_{W_{n}}\mathbf{z}^{\alpha}(u)\rho(u;\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n}))\mathrm{d}u\,\mathrm{d}t\bigg]
=\frac{-2}{|W_{n}|}\int_{W_{n}}\sum_{\alpha\in\mathbb{N}^{p_{l}},|\alpha|=2}\frac{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\alpha}}{\alpha!}\mathbf{z}^{\alpha}(u)\exp\{\hat{\boldsymbol{\beta}}_{l,n}^{\top}\mathbf{z}_{l}(u)\}\,\frac{\exp\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\}-(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)-1}{\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\}^{2}}\,\mathrm{d}u
=\frac{-1}{|W_{n}|}\int_{W_{n}}\exp\{\hat{\boldsymbol{\beta}}_{l,n}^{\top}\mathbf{z}_{l}(u)\}\,f\left\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\right\}\mathrm{d}u

where f(t)=\exp(t)-1-t. Consider the set B_{n} given by C5. Since for large n, \|\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l}\|\geq\varepsilon/2 and since f is nonnegative with a unique minimum at zero, we have by condition C5 that

\inf_{u\in B_{n}}\exp\{\hat{\boldsymbol{\beta}}_{l,n}^{\top}\mathbf{z}_{l}(u)\}>\exp(-\|\hat{\boldsymbol{\beta}}_{l,n}\|c)\text{ and }\inf_{u\in B_{n}}f\left\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\right\}\geq\delta>0

for some \delta>0. Therefore,

R_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l})\leq\frac{-1}{|W_{n}|}\int_{B_{n}}\exp\{\hat{\boldsymbol{\beta}}_{l,n}^{\top}\mathbf{z}_{l}(u)\}\,f\left\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\right\}\mathrm{d}u
\leq-\delta\,\frac{|B_{n}|}{|W_{n}|}\,\exp(-\|\hat{\boldsymbol{\beta}}_{l,n}\|c).

Again, by definition of B_{n}, the strong consistency of \hat{\boldsymbol{\beta}}_{l,n} and condition C2, we conclude that for large n there exists \delta^{\prime}>0 such that R_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l})\leq-\delta^{\prime}<0, whereby we deduce that \Delta>0.

The previous statements allow us to conclude by Łapiński (2019, Theorem 2) that

p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})=\frac{p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})}{\mathrm{det}\{\tau_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}^{1/2}}\exp\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}\left(\frac{2\pi}{\tau_{n}}\right)^{p_{l}/2}+\exp\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}\left(\frac{2\pi}{\tau_{n}}\right)^{p_{l}/2}{\mathcal{O}}(\tau_{n}^{-1/2}).

This gives

\log p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})-\log\frac{p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})}{\mathrm{det}\{\tau_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}^{1/2}}-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})-\frac{p_{l}}{2}\log\frac{2\pi}{\tau_{n}}=\log\left[1+\frac{\mathrm{det}\{\tau_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}^{1/2}}{p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})}{\mathcal{O}}(\tau_{n}^{-1/2})\right].

Proposition 2 now follows since

\log\left[1+\frac{\mathrm{det}\{\tau_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}^{1/2}}{p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})}{\mathcal{O}}(\tau_{n}^{-1/2})\right]=\log\{1+{\mathcal{O}}(\tau_{n}^{-1/2})\}={\mathcal{O}}(\tau_{n}^{-1/2})

and using condition C9. ∎
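The Laplace approximation invoked above can be illustrated in one dimension. The toy R sketch below, with an arbitrary smooth exponent g maximized at \hat{b} and a standard normal 'prior' p (none of which are the quantities of the paper), compares \int p(b)\exp\{\tau g(b)\}\mathrm{d}b with the leading Laplace term p(\hat{b})\exp\{\tau g(\hat{b})\}\sqrt{2\pi/(\tau|g^{\prime\prime}(\hat{b})|)}; the relative error shrinks as \tau grows.

# Toy 1-D check (illustrative only) of the Laplace approximation
#   integral p(b) exp{tau g(b)} db  ~  p(b.hat) exp{tau g(b.hat)} sqrt(2 pi / (tau |g''(b.hat)|))
# with g(b) = -(b - 1)^2 / 2 (so b.hat = 1, |g''(b.hat)| = 1) and a N(0,1) 'prior' p.
g <- function(b) -(b - 1)^2 / 2
b.hat <- 1; g2 <- 1

for (tau in c(10, 100, 1000)) {
  b <- seq(-4, 6, by = 1e-4)                                   # fine grid around b.hat
  exact   <- sum(dnorm(b) * exp(tau * g(b))) * 1e-4            # Riemann approximation
  laplace <- dnorm(b.hat) * exp(tau * g(b.hat)) * sqrt(2 * pi / (tau * g2))
  cat("tau =", tau, " relative error of Laplace approximation =",
      signif(laplace / exact - 1, 3), "\n")
}

Taking logarithms of the leading term in the display above, the factor (2\pi/\tau_{n})^{p_{l}/2} contributes -(p_{l}/2)\log\tau_{n} up to a bounded term, which is why \tau_{n}=\theta_{n}|W_{n}| plays the role of the sample size in the resulting BIC-type criterion.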

Appendix E Modified version of Theorem 2 in Waagepetersen and Guan (2009)

Consider a sequence of estimating functions u_{n}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}, n\geq 1, whose distribution is determined by some underlying probability measure \mathrm{P} generating the data at hand. For a matrix \mathbf{A}=[a_{ij}], \|\mathbf{A}\|_{M}=\max_{ij}|a_{ij}|, and we let \mathbf{J}_{n}(\boldsymbol{\theta})=-\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\theta}}u_{n}(\boldsymbol{\theta}), assuming that u_{n} is differentiable.

Theorem 2.

Assume that there exists a sequence of invertible symmetric matrices \mathbf{V}_{n} and a sequence of parameter values \boldsymbol{\theta}_{n}^{*}\in\mathbb{R}^{p} such that

  1. G1: \|{\mathbf{V}}_{n}^{-1}\|\rightarrow 0.

  2. G2: There exists an l>0 so that {\mathrm{P}}(l_{n}<l) tends to zero, where

     l_{n}=\inf_{\|{\boldsymbol{\phi}}\|=1}{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}{\mathbf{J}}_{n}({\boldsymbol{\theta}}_{n}^{*}){\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}}.

  3. G3: For any d>0,

     \sup_{\|{\mathbf{V}}_{n}({\boldsymbol{\theta}}-{\boldsymbol{\theta}}^{*}_{n})\|\leq d}\|{\mathbf{V}}_{n}^{-1}\{{\mathbf{J}}_{n}({\boldsymbol{\theta}})-{\mathbf{J}}_{n}({\boldsymbol{\theta}}^{*}_{n})\}{\mathbf{V}}_{n}^{-1}\|_{M}=\gamma_{nd}\rightarrow 0

     in probability under \mathrm{P}.

  4. G4: The sequence {\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}) is bounded in probability (i.e. for each \epsilon>0 there exists a d so that \mathrm{P}(\|{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})\|>d)\leq\epsilon for n sufficiently large).

Then for each \epsilon>0, there exists a d>0 such that

{\mathrm{P}}\left\{\exists\tilde{\boldsymbol{\theta}}_{n}:u_{n}(\tilde{\boldsymbol{\theta}}_{n})=0\text{ and }\|{\mathbf{V}}_{n}(\tilde{\boldsymbol{\theta}}_{n}-{\boldsymbol{\theta}}^{*}_{n})\|<d\right\}>1-\epsilon (29)

whenever n is sufficiently large.

Proof.

The event

\{\exists\tilde{\boldsymbol{\theta}}_{n}:u_{n}(\tilde{\boldsymbol{\theta}}_{n})=0\text{ and }\|{\mathbf{V}}_{n}(\tilde{\boldsymbol{\theta}}_{n}-{\boldsymbol{\theta}}^{*}_{n})\|<d\}

occurs if {\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})<0 for all {\boldsymbol{\phi}} with \|{\boldsymbol{\phi}}\|=d, since this implies u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})=0 for some \|{\boldsymbol{\phi}}\|<d (Lemma 2 in Aitchison and Silvey, 1958). Hence we need to show that there is a d such that

\mathrm{P}\left\{\sup_{\|{\boldsymbol{\phi}}\|=d}{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})\geq 0\right\}\leq\epsilon

for sufficiently large n. To this end we write

{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})={\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})-{\boldsymbol{\phi}}^{\top}\int_{0}^{1}{\mathbf{V}}_{n}^{-1}{\mathbf{J}}_{n}({\boldsymbol{\theta}}_{n}(t)){\mathbf{V}}_{n}^{-1}\mathrm{d}t\,{\boldsymbol{\phi}}

where {\boldsymbol{\theta}}_{n}(t)={\boldsymbol{\theta}}^{*}_{n}+t{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}}. Then

{\mathrm{P}}\left\{\sup_{\|{\boldsymbol{\phi}}\|=d}{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})\geq 0\right\}\leq{\mathrm{P}}\left\{\sup_{\|{\boldsymbol{\phi}}\|=d}{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})\geq\inf_{\|{\boldsymbol{\phi}}\|=d}{\boldsymbol{\phi}}^{\top}\int_{0}^{1}{\mathbf{V}}_{n}^{-1}{\mathbf{J}}_{n}({\boldsymbol{\theta}}_{n}(t)){\mathbf{V}}_{n}^{-1}\mathrm{d}t\,{\boldsymbol{\phi}}\right\}
\leq{\mathrm{P}}\left[\|{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})\|\geq d\inf_{\|{\boldsymbol{\phi}}\|=1}\left\{{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}{\mathbf{J}}_{n}({\boldsymbol{\theta}}^{*}_{n}){\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}}\right\}-dp\gamma_{nd}\right]
\leq{\mathrm{P}}\left(\|{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})\|\geq dl_{n}/2\right)+{\mathrm{P}}\left(p\gamma_{nd}>l_{n}/2\right).

The first term can be made arbitrarily small by picking a sufficiently large d and letting n tend to infinity. The second term converges to zero as n tends to infinity. ∎
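A toy instance of the theorem may help fix ideas: take u_{n}(\theta)=\sum_{i=1}^{n}(X_{i}-\theta) with X_{i} independent N(\theta_{n}^{*},1), so that \mathbf{J}_{n}=n, and take \mathbf{V}_{n}=\sqrt{n}. Conditions G1-G4 then hold with l_{n}=1 and \gamma_{nd}=0, and (29) asserts a root \tilde{\theta}_{n} (here the sample mean) within \mathcal{O}_{\mathrm{P}}(\mathbf{V}_{n}^{-1}) of the drifting value \theta_{n}^{*}. The R sketch below (with an arbitrary drifting sequence, purely illustrative) confirms that \sqrt{n}(\tilde{\theta}_{n}-\theta_{n}^{*}) remains stable as n grows.

# Toy instance (illustrative only): u_n(theta) = sum_i (X_i - theta), X_i ~ N(theta.star.n, 1),
# J_n = n and V_n = sqrt(n).  The root of u_n is mean(x), and V_n * (root - theta.star.n)
# should stay bounded in probability even though theta.star.n drifts with n, cf. (29).
set.seed(4)
for (n in c(100, 1000, 10000)) {
  theta.star.n <- 1 + 1 / sqrt(n)                       # an arbitrary drifting target value
  dev <- replicate(2000, {
    x <- rnorm(n, mean = theta.star.n)
    sqrt(n) * (mean(x) - theta.star.n)
  })
  cat("n =", n, " sd of V_n * (root - theta.star.n) =", round(sd(dev), 3), "\n")
}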

References

  • Aitchison and Silvey [1958] J. Aitchison and S. D. Silvey. Maximum-likelihood estimation of parameters subject to restraints. The Annals of Mathematical Statistics, 29:813–825, 1958.
  • Akaike [1973] H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle, pages 199–213. Springer New York, New York, NY, 1973.
  • Baddeley and Turner [2000] A. Baddeley and R. Turner. Practical maximum pseudolikelihood for spatial point patterns. Australian & New Zealand Journal of Statistics, 42(3):283–322, 2000.
  • Baddeley et al. [2015] A. Baddeley, E. Rubak, and R. Turner. Spatial Point Patterns: Methodology and Applications with R. CRC Press, 2015.
  • Berman and Turner [1992] M. Berman and R. Turner. Approximating point process likelihoods with GLIM. Applied Statistics, 41:31–38, 1992.
  • Choiruddin et al. [2018] A. Choiruddin, J.-F. Coeurjolly, and F. Letué. Convex and non-convex regularization methods for spatial point processes intensity estimation. Electronic Journal of Statistics, 12:1210–1255, 2018.
  • Claeskens and Hjort [2008] G. Claeskens and N.L. Hjort. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2008.
  • Coeurjolly and Lavancier [2019] J.-F. Coeurjolly and F. Lavancier. Understanding spatial point patterns through intensity and conditional intensities. In Stochastic Geometry, number 2237 in Lecture Notes in Mathematics, chapter 2. Springer Verlag, 2019.
  • Condit [1998] R. Condit. Tropical Forest Census Plots. Springer-Verlag and R. G. Landes Company, Berlin, Germany and Georgetown, Texas, 1998.
  • Condit et al. [1996] R. Condit, S.P. Hubbell, and R.B. Foster. Changes in tree species abundance in a neotropical forest: impact of climate change. Journal of Tropical Ecology, 12(2):231–256, 1996.
  • Gao and Song [2010] X. Gao and P. X.-K. Song. Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of the American Statistical Association, 105:1531–1540, 2010.
  • Guan and Loh [2007] Y. Guan and J. M. Loh. A thinned block bootstrap procedure for modeling inhomogeneous spatial point patterns. Journal of the American Statistical Association, 102:1377–1386, 2007.
  • Hubbell and Foster [1983] S. P. Hubbell and R. B. Foster. Diversity of canopy trees in a neotropical forest and implications for conservation. In S. L. Sutton, T. C. Whitmore, and A. C. Chadwick, editors, Tropical Rain Forest: Ecology and Management, pages 25–41. Blackwell Scientific Publications, Oxford, 1983.
  • Jeffrey et al. [2018] D. Jeffrey, J. Horrocks, and G. J. Umphrey. Penalized composite likelihoods for inhomogeneous Gibbs point process models. Computational Statistics & Data Analysis, 124:104–116, 2018.
  • Łapiński [2019] T. M. Łapiński. Multivariate Laplace’s approximation with estimated error and application to limit theorems. Journal of Approximation Theory, 248:105305, 2019.
  • Lavancier et al. [2020] F. Lavancier, A. Poinas, and R. Waagepetersen. Adaptive estimating function inference for non-stationary determinantal point processes. Scandinavian Journal of Statistics, 2020. Appeared online.
  • Lebarbier and Mary-Huard [2006] E. Lebarbier and T. Mary-Huard. Une introduction au critère BIC: fondements théoriques et interprétation. Journal de la Société Française de Statistique, 147(1):39–58, 2006.
  • Møller and Waagepetersen [2004] J. Møller and R. P. Waagepetersen. Statistical Inference and Simulation for Spatial Point Processes. Chapman and Hall/CRC, Boca Raton, 2004.
  • Møller and Waagepetersen [2017] J. Møller and R. P. Waagepetersen. Some recent developments in statistics for spatial point processes. Annual Review of Statistics and Its Application, 4:317–342, 2017.
  • Rathbun and Cressie [1994] S. L. Rathbun and N. Cressie. Asymptotic properties of estimators for the parameters of spatial inhomogeneous Poisson point processes. Advances in Applied Probability, 26:122–154, 1994.
  • Schoenberg [2005] F.P. Schoenberg. Consistent parametric estimation of the intensity of a spatial-temporal point process. Journal of Statistical Planning and Inference, 128:79–93, 2005.
  • Schwarz [1978] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
  • Thurman et al. [2015] A. L. Thurman, R. Fu, Y. Guan, and J. Zhu. Regularized estimating equations for model selection of clustered spatial point processes. Statistica Sinica, pages 173–188, 2015.
  • Tierney and Kadane [1986] L. Tierney and J.B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81:82–86, 1986.
  • Varin and Vidoni [2005] C. Varin and P. Vidoni. A note on composite likelihood inference and model selection. Biometrika, 92(3):519–528, 2005.
  • Waagepetersen [2007] R. Waagepetersen. An estimating function approach to inference for inhomogeneous Neyman-Scott processes. Biometrics, 63:252–258, 2007.
  • Waagepetersen and Guan [2009] R. Waagepetersen and Y. Guan. Two-step estimation for inhomogeneous spatial point processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71:685–702, 2009.