
Information criteria for inhomogeneous spatial point processes

Achmad Choiruddin Department of Statistics, Institut Teknologi Sepuluh Nopember (ITS), Indonesia Jean-François Coeurjolly Department of Mathematics, Université du Québec à Montréal (UQAM), Canada Rasmus Waagepetersen Department of Mathematical Sciences, Aalborg University, Denmark
Abstract

The theoretical foundation for a number of model selection criteria is established in the context of inhomogeneous point processes and under various asymptotic settings: infill, increasing domain, and combinations of these. For inhomogeneous Poisson processes we consider Akaike’s information criterion and the Bayesian information criterion, and in particular we identify the point process analogue of ‘sample size’ needed for the Bayesian information criterion. Considering general inhomogeneous point processes we derive new composite likelihood and composite Bayesian information criteria for selecting a regression model for the intensity function. The proposed model selection criteria are evaluated using simulations of Poisson processes and cluster point processes.

Keywords: Akaike’s information criterion, Bayesian information criterion, composite information criterion, composite likelihood, inhomogeneous point process, intensity function, model selection.

1 Introduction

Fitting a regression model to the intensity function of a point process is one of the most fundamental tasks in statistical analysis of point pattern data, see e.g. Møller and Waagepetersen (2017) or Coeurjolly and Lavancier (2019) for a recent review of this problem. If the data in question can be viewed as a realization of a Poisson process, regression parameters are usually estimated by maximum likelihood. If the point process is not Poisson, the likelihood function is often computationally intractable. In such cases the Poisson likelihood function can still be used as a composite likelihood function for estimating regression parameters. This is, for example, the approach underlying the procedure kppm in the popular spatstat R package (Baddeley et al., 2015).

Considering regression models, model selection is often a pertinent task. In case of a Poisson process, the Akaike information criterion (AIC) (Akaike, 1973) seems an obvious approach (and is implemented in the function logLik.ppm of spatstat) since in this case the likelihood function is available. If the assumption of a Poisson process is not tenable, a generalization of the AIC to a composite likelihood information criterion (CIC) (Varin and Vidoni, 2005) is relevant. Yet another alternative is the Bayesian information criterion (BIC) (Schwarz, 1978). The classical framework for the information criteria mentioned typically involves a sample of independent observations where the sample size plays a crucial role for the BIC. However, in the case of a unique realization of a point process, it is not obvious how to define sample size. In some sense, it is one, but this choice is obviously not useful for asymptotic justifications of information criteria. Instead, sample size must be linked to properties of the observation window or the point process intensity function. Several proposals for defining ‘sample size’ have been considered in the literature. Choiruddin et al. (2018) use the size of the observation window while Thurman et al. (2015) consider the number of observed points. Jeffrey et al. (2018) use the sum of the number of data points and the number of dummy points used in a numerical approximation of the likelihood (Berman and Turner, 1992).

In this paper we first establish the theoretical foundation for AIC and CIC in the context of intensity function model selection for a point process. This includes asymptotic results for estimates of the ‘least false parameter value’, see e.g. Claeskens and Hjort (2008, Section 2.2). Next we derive the BIC in case of a Poisson process and we thus identify the meaning of sample size in this context. We also consider the generalization of BIC to a composite likelihood BIC (CBIC) (Gao and Song, 2010) using a concept of effective degrees of freedom derived for the CIC. Our asymptotic developments are established under an original setting which embraces both infill asymptotics (the number of points in a fixed domain increases) and increasing domain asymptotics (the volume of the observation window tends to infinity), the two frameworks commonly considered in the literature.

The rest of the article is organized as follows. The problem of selecting a model for the intensity function is specified in Section 2. In Section 3 we discuss asymptotic results for intensity function regression parameter estimators under a ‘double’ asymptotic framework. We derive the AIC and CIC for spatial point processes in Section 4 and develop the BIC and CBIC in Section 5. The different model selection criteria are compared in a simulation study in Section 6. Section 7 gives some concluding remarks. Proofs are given in the Appendices A-E.

2 Intensity model selection

A spatial point process $\mathbf{X}$ defined on $\mathbb{R}^{d}$ is a locally finite random subset of $\mathbb{R}^{d}$. If for bounded $B\subset\mathbb{R}^{d}$ we denote by $N(B)$ the cardinality of $\mathbf{X}\cap B$, locally finite means that $N(B)$ is a finite integer almost surely. For a bounded domain $W\subset\mathbb{R}^{d}$, $|W|$ denotes the volume of $W$. The intensity function $\lambda$ and the pair correlation function $g$ of $\mathbf{X}$ are defined (if they exist) by the equations

\mathbb{E}\,N(A)=\int_{A}\lambda(u)\,\mathrm{d}u

\mathbb{E}\left\{N(A)N(B)\right\}=\int_{A\cap B}\lambda(u)\,\mathrm{d}u+\int_{A}\int_{B}\lambda(u)\lambda(v)g(u,v)\,\mathrm{d}u\,\mathrm{d}v

for any bounded $A,B\subset\mathbb{R}^{d}$.

If the counts $N(A)$ are Poisson distributed, $\mathbf{X}$ is said to be a Poisson process. In this case the counts $N(B_{1}),\dots,N(B_{m})$ are independent whenever the subsets $B_{1},\ldots,B_{m}$ are disjoint, and the pair correlation function is identically equal to one. For our asymptotic considerations, we assume that a sequence of spatial point processes $\mathbf{X}_{n}$ is observed within a sequence of bounded windows $W_{n}\subset\mathbb{R}^{d}$, $n=1,2,\ldots$. We denote by $\lambda_{n}$ and $g_{n}$ the intensity and pair correlation function of $\mathbf{X}_{n}$. With an abuse of notation, we denote for any $n\geq 1$ expectation and variance under the sampling distribution of $\mathbf{X}_{n}$ by $\mathbb{E}$ and $\mathbb{V}\mathrm{ar}$.

For modelling the intensity function we assume that $p\geq 1$ covariates $z_{1},\ldots,z_{p}$ are available, where for each $i=1,\ldots,p$, $z_{i}$ is a locally integrable function on $\mathbb{R}^{d}$. Let $I_{l}$, $l=1,\ldots,2^{p}$, denote the subsets of $\{1,\dots,p\}$ and let $p_{l}=|I_{l}|+1$ where $|I_{l}|$ is the cardinality of $I_{l}$. We consider models for $\lambda_{n}$ specified in terms of the $2^{p}$ subsets of the covariates. For each $l=1,\ldots,2^{p}$ and $n\geq 1$ we define the log-linear model

\rho_{l,n}\{u;\mathbf{z}_{l}(u);\boldsymbol{\beta}_{l}\}=\theta_{n}\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}  (1)

where $\boldsymbol{\beta}_{l}=\left\{\beta_{0},(\beta_{j})_{j\in I_{l}}\right\}\in\mathbb{R}^{p_{l}}$ and $\mathbf{z}_{l}(u)=\big[1,\{z_{j}(u)\}_{j\in I_{l}}\big]$. The quantity $\theta_{n}$ should not be regarded as a parameter to be estimated. For $n\geq 1$, $\theta_{n}$ could e.g. represent a timespan over which $\mathbf{X}_{n}$ is observed. In the following, with an abuse of notation, we just write $\rho_{n}(u;\boldsymbol{\beta}_{l})$ for $\rho_{l,n}\{u;\mathbf{z}_{l}(u);\boldsymbol{\beta}_{l}\}$ and similarly for related quantities.

The problem we consider is to select among the $2^{p}$ intensity models $\mathcal{M}_{l}$, $l=1,\dots,2^{p}$, given by

\mathcal{M}_{l}=\{\rho_{n}(\cdot;\boldsymbol{\beta}_{l})\mid\boldsymbol{\beta}_{l}=\left\{\beta_{0},(\beta_{j})_{j\in I_{l}}\right\}\in\mathbb{R}^{p_{l}}\},\quad l=1,\dots,2^{p}.

The distinction between $\beta_{0}$ and the other parameters is necessary because our objective is to select among $2^{p}$ different models which all contain an intercept term. Note that the true intensity function $\lambda_{n}$ does not necessarily correspond to any of the suggested models $\mathcal{M}_{l}$, $l=1,\ldots,2^{p}$.
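As a concrete illustration of this model search, the following base R sketch (with hypothetical covariate names z1, ..., z6, so $p=6$) enumerates the $2^{p}$ candidate models, each containing an intercept, as model formulas.

```r
# A minimal sketch (base R): list all 2^p covariate subsets and turn each into a
# model formula with an intercept. Covariate names z1, ..., z6 are hypothetical.
covariate_names <- paste0("z", 1:6)
subsets <- c(list(character(0)),
             unlist(lapply(1:6, function(k)
               combn(covariate_names, k, simplify = FALSE)),
               recursive = FALSE))
formulas <- lapply(subsets, function(I)
  if (length(I) == 0) ~ 1 else reformulate(I))   # ~ 1 is the intercept-only model
length(formulas)                                 # 2^6 = 64 candidate models
```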

3 Estimation of the intensity function

In this section we discuss estimation of the intensity function using a Poisson likelihood function and associated asymptotic results.

3.1 The Poisson likelihood function

The density of a Poisson point process with intensity $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$ and observed in $W_{n}$ is given by (see e.g. Møller and Waagepetersen (2004))

p_{n}(\mathbf{x};\boldsymbol{\beta}_{l})=\left\{\prod_{u\in\mathbf{x}}\rho_{n}(u;\boldsymbol{\beta}_{l})\right\}\exp\left\{|W_{n}|-\int_{W_{n}}\rho_{n}(u;\boldsymbol{\beta}_{l})\,\mathrm{d}u\right\}  (2)

for locally finite point configurations $\mathbf{x}\subset W_{n}$. We emphasize that in the following we assume neither that $\mathbf{X}_{n}$ introduced in the previous section is a Poisson process nor that $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$ coincides with the intensity function $\lambda_{n}$ of $\mathbf{X}_{n}$.

Combining (1) and (2), up to a constant, the log of (2) evaluated at $\mathbf{X}_{n}$ becomes

\ell_{n}(\boldsymbol{\beta}_{l})=\sum_{u\in\mathbf{X}_{n}}\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)-\theta_{n}\int_{W_{n}}\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}\,\mathrm{d}u.  (3)

Let $e_{n}(\boldsymbol{\beta}_{l})$ be the corresponding estimating function given by

e_{n}(\boldsymbol{\beta}_{l})=\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}}\ell_{n}(\boldsymbol{\beta}_{l})=\sum_{u\in\mathbf{X}_{n}}\mathbf{z}_{l}(u)-\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}\,\mathrm{d}u.  (4)

For any $\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}$, the sensitivity (or Fisher information) matrix is

\mathbf{S}_{n}(\boldsymbol{\beta}_{l})=-\mathbb{E}\left\{\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}^{\top}}e_{n}(\boldsymbol{\beta}_{l})\right\}=\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}\,\mathrm{d}u.  (5)

We assume that $\mathbf{S}_{n}(\boldsymbol{\beta}_{l})$ is positive definite for all $\boldsymbol{\beta}_{l}$ (see also condition C5 in the next section). We can then define the estimator of $\boldsymbol{\beta}_{l}$ as

\hat{\boldsymbol{\beta}}_{l,n}=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}}\,p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}}\,\ell_{n}(\boldsymbol{\beta}_{l}).  (6)

If $\mathbf{X}_{n}$ is indeed a Poisson process with intensity function $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$, then $\hat{\boldsymbol{\beta}}_{l,n}$ is the maximum likelihood estimator and the sensitivity (5) equals the observed information matrix $-\mathrm{d}e_{n}(\boldsymbol{\beta}_{l})/\mathrm{d}\boldsymbol{\beta}_{l}^{\top}$.

If $\mathbf{X}_{n}$ is not Poisson, $\hat{\boldsymbol{\beta}}_{l,n}$ may be viewed as a composite likelihood estimator (Schoenberg, 2005; Waagepetersen, 2007). In the situation where the intensity function $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$ coincides with the true intensity function $\lambda_{n}$, asymptotic properties of maximum likelihood or composite likelihood estimators obtained as maximizers of Poisson likelihood functions have been established in various settings by Rathbun and Cressie (1994), Waagepetersen (2007), Guan and Loh (2007) and Waagepetersen and Guan (2009). In the next section we investigate the more intriguing situation where the intensity model is misspecified.
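To fix ideas, the following base R sketch (self-contained, with hypothetical inputs) evaluates the objective (3) by replacing the integral with a Riemann sum over a regular grid and maximizes it numerically; it mimics what (6) computes, without relying on any particular package.

```r
# A minimal sketch (base R): Poisson (composite) log-likelihood (3) with the integral
# approximated on a regular grid over a rectangular window, and its maximizer (6).
# X is a 2-column matrix of point coordinates; zfun(x, y) returns the covariate
# matrix z_l (one row per location, first column equal to 1); theta plays the role
# of theta_n. All names are hypothetical.
poisson_loglik <- function(beta, X, zfun, window, theta = 1, ngrid = 100) {
  gx <- seq(window$xrange[1], window$xrange[2], length.out = ngrid)
  gy <- seq(window$yrange[1], window$yrange[2], length.out = ngrid)
  grid <- expand.grid(x = gx, y = gy)
  cellarea <- diff(window$xrange) * diff(window$yrange) / nrow(grid)
  sum(zfun(X[, 1], X[, 2]) %*% beta) -
    theta * sum(exp(zfun(grid$x, grid$y) %*% beta)) * cellarea
}
# optim() minimizes, hence the sign flip; p_l is the number of columns of z_l.
# beta_hat <- optim(rep(0, p_l), function(b) -poisson_loglik(b, X, zfun, window))$par
```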

3.2 Framework and asymptotic results for misspecified intensity functions

To handle the situation where $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$ does not coincide with the intensity function of $\mathbf{X}_{n}$, we follow Varin and Vidoni (2005) and define a (composite) Kullback-Leibler divergence between the model $\mathcal{M}_{l}$ with parameter $\boldsymbol{\beta}_{l}$ and the true sampling distribution. That is,

\mathrm{KL}_{n}(\boldsymbol{\beta}_{l})=\mathbb{E}\left\{\ell_{n}-\ell_{n}(\boldsymbol{\beta}_{l})\right\}  (7)

where $\ell_{n}$ is the Poisson log-likelihood obtained with the true intensity $\lambda_{n}$. For a window $W_{n}$ and model $\mathcal{M}_{l}$ we let

\boldsymbol{\beta}_{l,n}^{*}=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}}\mathbb{E}\,\ell_{n}(\boldsymbol{\beta}_{l})

denote the ‘least wrong parameter value’ under model $\mathcal{M}_{l}$, provided the maximum exists. It is easy to see by explicit evaluation of the right hand side that

-\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}^{\top}}\left\{\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}}\mathbb{E}\,\ell_{n}(\boldsymbol{\beta}_{l})\right\}=\mathbf{S}_{n}(\boldsymbol{\beta}_{l})

and so condition C5 stated below implies that $\boldsymbol{\beta}_{l,n}^{*}$ is well-defined as a unique maximum when $n$ is large enough. Also it is easy to see that

\mathbb{E}\,e_{n}(\boldsymbol{\beta}_{l,n}^{*})=0,  (8)

which means that $\hat{\boldsymbol{\beta}}_{l,n}$ given by (6) is a candidate to estimate $\boldsymbol{\beta}_{l,n}^{*}$.
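For completeness, (8) can be verified directly: by Campbell's theorem applied to (4),

\mathbb{E}\,e_{n}(\boldsymbol{\beta}_{l})=\int_{W_{n}}\mathbf{z}_{l}(u)\lambda_{n}(u)\,\mathrm{d}u-\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}\,\mathrm{d}u=\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\beta}_{l}}\mathbb{E}\,\ell_{n}(\boldsymbol{\beta}_{l}),

which vanishes at $\boldsymbol{\beta}_{l}=\boldsymbol{\beta}_{l,n}^{*}$ because $\boldsymbol{\beta}_{l,n}^{*}$ maximizes $\mathbb{E}\,\ell_{n}$.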

The remainder of this section is devoted to asymptotic results for $\hat{\boldsymbol{\beta}}_{l,n}$ within the above framework of a misspecified intensity function. We thereby extend the results in the references mentioned in Section 3.1. In contrast to these references which used either increasing domain or infill asymptotics, we moreover consider a ‘double asymptotic’ framework as formalized by condition C3 presented below.

Two matrices are crucial for the asymptotic results. The sensitivity matrix $\mathbf{S}_{n}(\boldsymbol{\beta}_{l})$ is given by (5) regardless of whether the model is misspecified or not. The variance-covariance matrix $\boldsymbol{\Sigma}_{l,n}$ of (4) is, using the Campbell theorem, given by

\boldsymbol{\Sigma}_{l,n}=\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\lambda_{n}(u)\,\mathrm{d}u+\int_{W_{n}}\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(v)^{\top}\lambda_{n}(u)\lambda_{n}(v)\left\{g_{n}(u,v)-1\right\}\,\mathrm{d}u\,\mathrm{d}v.  (9)

Observe that $\boldsymbol{\Sigma}_{l,n}$ does not depend on $\boldsymbol{\beta}_{l}$ (whence its notation). Our results will be based on the following assumptions, where for a square matrix $\mathbf{M}$, $\nu_{\min}(\mathbf{M})$ (resp. $\nu_{\max}(\mathbf{M})$) stands for the smallest (resp. largest) eigenvalue. We use $a_{n}\asymp b_{n}$ to denote that $a_{n}=\mathcal{O}(b_{n})$ and $b_{n}=\mathcal{O}(a_{n})$.

C1. As $n\to\infty$, $\sup_{n\geq 1}\|\boldsymbol{\beta}_{l,n}^{*}\|=\mathcal{O}(1)$.

C2. $\mathbf{z}_{l}:\mathbb{R}^{d}\to\mathbb{R}^{p_{l}}$ is continuous and $0<\inf_{u\in\mathbb{R}^{d}}\|\mathbf{z}_{l}(u)\|<\sup_{u\in\mathbb{R}^{d}}\|\mathbf{z}_{l}(u)\|<\infty$.

C3. The sequence $(\tau_{n}:=\theta_{n}|W_{n}|)_{n\geq 1}$ is increasing, with $\liminf_{n\to\infty}\theta_{n}>0$ and $\lim_{n\to\infty}\tau_{n}=\infty$. The sets $W_{n}$ are convex and compact.

C4. As $n\to\infty$, $\tau_{n}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\to 0$ almost surely.

C5. For any $\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}$ and $n\geq 1$, $\mathbf{S}_{n}(\boldsymbol{\beta}_{l})$ is positive definite. In addition, for any $n\geq 1$, there exist a set $B_{n}\subseteq W_{n}$ with $|B_{n}|\asymp|W_{n}|$ and a $c>0$ such that $\inf_{n}\inf_{\boldsymbol{\phi}\in\mathbb{R}^{p_{l}},\|\boldsymbol{\phi}\|=1}\inf_{u\in B_{n}}|\boldsymbol{\phi}^{\top}\mathbf{z}_{l}(u)|\geq c$. Finally, we assume that $\liminf_{n\to\infty}\nu_{\min}\left(\tau_{n}^{-1}\boldsymbol{\Sigma}_{l,n}\right)>0$.

C6. As $n\to\infty$, $\|\boldsymbol{\Sigma}_{l,n}\|=\mathcal{O}(\tau_{n})$.

C7. As $n\to\infty$, $\boldsymbol{\Sigma}_{l,n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\to N(0,\mathbf{I}_{p_{l}})$ in distribution.

We can then state the following asymptotic result which is verified in Appendix A.

Theorem 1.


(i) Assume conditions C1-C4 hold. Then there exists $v=v_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l,n}^{*})$ such that, almost surely,

\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}=\frac{\mathbf{z}_{l}(v)}{\|\mathbf{z}_{l}(v)\|^{2}}\,\log\left[1+\tau_{n}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\frac{\mathbf{z}_{l}(v)}{\|\mathbf{z}_{l}(v)\|^{2}}\exp\{-{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(v)\}\right]  (10)

and $\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\to 0$ almost surely as $n\to\infty$.
(ii) Assume conditions C1-C3 and C5-C6 hold. Then $\hat{\boldsymbol{\beta}}_{l,n}$ is a root-$\tau_{n}$ consistent estimator of $\boldsymbol{\beta}_{l,n}^{*}$, i.e.

\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}=\mathcal{O}_{\mathrm{P}}(\tau_{n}^{-1/2}).  (11)

(iii) If in addition C7 holds, then as $n\to\infty$,

\boldsymbol{\Sigma}_{l,n}^{-1/2}\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\left(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\right)\to N(0,\mathbf{I}_{p_{l}})  (12)

in distribution.

We stress that Theorem 1 (ii)-(iii) does not require the strong consistency of $\tau_{n}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})$ to 0, i.e. condition C4. We conclude with some remarks regarding the assumptions C1-C7. Condition C1 ensures that the sequence of ‘least wrong parameter values’ does not diverge with $n$.

Condition C3 is different from existing conditions as it embraces both of the standard asymptotic frameworks considered in the literature:

  • infill asymptotics: $W_{n}=W$ with $W$ a bounded set of $\mathbb{R}^{d}$, and $\theta_{n}\to\infty$ as $n\to\infty$.

  • increasing domain asymptotics: $\theta_{n}=\theta>0$ and $(W_{n})_{n\geq 1}$ is a sequence of bounded domains of $\mathbb{R}^{d}$ such that $|W_{n}|\to\infty$.

It is also valid if both asymptotics are considered at the same time. The assumed convexity in C3 enables the use of the mean value theorem in the proofs of our theoretical results.

In condition C2, assuming an upper bound for $\|\mathbf{z}_{l}(u)\|$ is quite standard. The upper bound further implies that assuming a lower bound is not really restrictive since for each covariate $z_{j}$ we can always find some $k$ so that $|z_{j}(u)+k|\neq 0$ for any $u\in\mathbb{R}^{d}$. Replacing $z_{j}$ by $z_{j}+k$ while changing the intercept from $\beta_{0}$ to $\beta_{0}-\beta_{j}k$ leaves the model unchanged. The continuity assumption is used to prove the strong consistency of $\hat{\boldsymbol{\beta}}_{l,n}$.

Condition C4 is also used to ensure the strong consistency of $\hat{\boldsymbol{\beta}}_{l,n}$. This condition can be seen as a law of large numbers. In Example 1, we present a class of models where such an assumption is valid under the generalized asymptotic condition C3. We can observe that if there exists $p>0$ such that $\mathbb{E}\{\|\tau_{n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\|^{p}\}=\mathcal{O}(1)$ and $\sum_{n}\tau_{n}^{-p/2}<\infty$, then C4 ensues from an application of the Borel-Cantelli lemma. At least in the increasing domain framework, such a moment assumption is quite standard when deriving a central limit theorem.

Condition C5 is very similar to the assumption required by Rathbun and Cressie (1994) in the Poisson case or by Waagepetersen and Guan (2009) for more general point processes, within the increasing domain asymptotic framework. Note in particular that C5 combined with C1-C2 ensures that $\liminf_{n\to\infty}\nu_{\min}\{\tau_{n}^{-1}\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\}>0$.

In the Poisson case, $g_{n}=1$ and so C6 is obviously satisfied if $\sup_{u\in W_{n}}\lambda_{n}(u)=\mathcal{O}(\theta_{n})$. For more general point processes, assume further that $g_{n}=g$ does not depend on $n$ and is invariant under translations. Then, thanks to C2, the assumption

\int_{\mathbb{R}^{d}}\{g(o,w)-1\}\,\mathrm{d}w<\infty

will imply C6. This assumption is quite standard and satisfied by a large class of models, see e.g. Waagepetersen and Guan (2009).

Continuing within the increasing domain framework, C7 was established by Rathbun and Cressie (1994) in the Poisson case, by Guan and Loh (2007) and Waagepetersen and Guan (2009) for $\alpha$-mixing point processes, and by Lavancier et al. (2020) for determinantal point processes. The infill asymptotic framework was used to establish C7 in the case of Poisson cluster processes in Waagepetersen (2007). However, the ‘double’ asymptotic framework has never been considered. Below, we provide an example of a model which satisfies C4, C6 and C7 under the new general asymptotic setting C3.

Example 1.

Let $\mathbf{C}_{n}$ be a homogeneous Poisson point process on $\mathbb{R}^{d}$ with intensity $\theta_{n}$. Given $\mathbf{C}_{n}$, let $\mathbf{X}_{n,c}$, $c\in\mathbf{C}_{n}$, be independent inhomogeneous Poisson point processes on $W_{n}$ with intensity $\alpha k(u-c)\rho(u)$, where $\alpha>0$, $k$ is a symmetric density on $\mathbb{R}^{d}$, and $\rho$ is a non-negative bounded function. Then $\mathbf{X}_{n}=\cup_{c\in\mathbf{C}_{n}}\mathbf{X}_{n,c}$ is an inhomogeneous Poisson cluster point process (with inhomogeneous offspring). It can be shown that $\lambda_{n}(u)=\theta_{n}\alpha\rho(u)$ and $g_{n}(u,v)=1+(k*k)(v-u)/\theta_{n}$, where $*$ denotes convolution. Assuming that $\sup_{u}\rho(u)=\mathcal{O}(1)$, there exists $K\geq 0$ such that

\|\boldsymbol{\Sigma}_{l,n}\|\leq K\left\{\tau_{n}+\theta_{n}^{2}\int_{W_{n}}\int_{W_{n}}|g_{n}(u,v)-1|\,\mathrm{d}u\,\mathrm{d}v\right\}
\leq K\left\{\tau_{n}+\theta_{n}\int_{W_{n}}\int_{W_{n}}|(k*k)(v-u)|\,\mathrm{d}u\,\mathrm{d}v\right\}
\leq K\left\{\tau_{n}+\tau_{n}\int_{\mathbb{R}^{d}}|(k*k)(w)|\,\mathrm{d}w\right\}=\mathcal{O}(\tau_{n}).

Thus C6 is satisfied. In Appendix B we show that this model also satisfies C4 and C7 within the ‘double’ asymptotic framework. Along the same lines one can show that C7 holds for the inhomogeneous Poisson point process with intensity function $\lambda_{n}(u)=\theta_{n}\rho(u)$.
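As an illustration of this construction, a simulation sketch in base R is given below (all names are hypothetical; the kernel $k$ is taken to be an isotropic Gaussian density with standard deviation sigma, and edge effects are handled only approximately by expanding the parent window).

```r
# A minimal sketch (base R): simulate the inhomogeneous Poisson cluster process of
# Example 1 on a rectangle W by thinning. Parents are homogeneous Poisson with
# intensity theta; offspring around a parent c form an inhomogeneous Poisson process
# with intensity alpha * k(u - c) * rho(u), where rho is bounded by rho_max.
simulate_cluster <- function(theta, alpha, sigma, rho, rho_max, W, expand = 4 * sigma) {
  xr <- W$xrange + c(-expand, expand); yr <- W$yrange + c(-expand, expand)
  np <- rpois(1, theta * diff(xr) * diff(yr))       # number of parents
  px <- runif(np, xr[1], xr[2]); py <- runif(np, yr[1], yr[2])
  xs <- ys <- numeric(0)
  for (i in seq_len(np)) {
    m  <- rpois(1, alpha * rho_max)                 # dominating offspring count
    ox <- rnorm(m, px[i], sigma); oy <- rnorm(m, py[i], sigma)
    keep <- runif(m) < rho(ox, oy) / rho_max        # thin down to alpha * k * rho
    inW  <- ox >= W$xrange[1] & ox <= W$xrange[2] &
            oy >= W$yrange[1] & oy <= W$yrange[2]
    xs <- c(xs, ox[keep & inW]); ys <- c(ys, oy[keep & inW])
  }
  cbind(x = xs, y = ys)
}
```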

4 Akaike and composite information criteria

One criterion for model selection would be to choose the model that minimizes the Kullback-Leibler divergence, i.e. the model for which (7) evaluated at $\boldsymbol{\beta}_{l,n}^{*}$ is smallest. Of course this criterion is not useful in practice since the true model is unknown. Following Varin and Vidoni (2005) we instead choose the model that minimizes an estimate of the expected value of (7) evaluated at $\hat{\boldsymbol{\beta}}_{l,n}$, or equivalently, that minimizes an estimate of $2\mathbb{E}\,C_{n}(\hat{\boldsymbol{\beta}}_{l,n})$ with $C_{n}(\boldsymbol{\beta}_{l})=-\mathbb{E}\,\ell_{n}(\boldsymbol{\beta}_{l})$. We follow Varin and Vidoni (2005, Lemmas 1-2) to derive the following result in our context.

Proposition 1.

Assume conditions C1-C3 and C5-C6 hold. Also assume that there exists ε>0\varepsilon>0 such that

\sup_{n}\mathbb{E}\left\{\left\|e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\|^{1+\varepsilon}\right\}<\infty  (13)

where $\mathbf{M}_{n}=\mathbf{S}_{n}(\tilde{\boldsymbol{\beta}}_{l,n})^{-1}-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}$ and $\tilde{\boldsymbol{\beta}}_{l,n}$ is on the line segment between $\hat{\boldsymbol{\beta}}_{l,n}$ and $\boldsymbol{\beta}_{l,n}^{*}$. Then,

\mathbb{E}\left\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}=\mathbb{E}\left\{-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}+\mathrm{trace}\left\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\boldsymbol{\Sigma}_{l,n}\right\}+o(1).

In other words, $-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+2\,\mathrm{trace}\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\boldsymbol{\Sigma}_{l,n}\}$ is an asymptotically unbiased estimator of $2\mathbb{E}\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}$.

Remark 1.

The technical condition (13) implies that the sequence $\{\tilde{e}_{n}:=e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\}_{n\geq 1}$ is a uniformly integrable sequence of random variables. This ensures that convergence in probability of $\tilde{e}_{n}$ to 0 implies $\mathbb{E}\,\tilde{e}_{n}=o(1)$.

To estimate the effective degrees of freedom $p^{*}_{l}=\mathrm{trace}\left\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\boldsymbol{\Sigma}_{l,n}\right\}$ we first simply estimate $\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})$ by $\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})$. The estimation of $\boldsymbol{\Sigma}_{l,n}$ is more difficult. Following Claeskens and Hjort (2008, p. 31), note that if $\rho_{n}(\cdot;\boldsymbol{\beta}_{l,n}^{*})$ coincides with the true intensity function of $\mathbf{X}_{n}$, then $\boldsymbol{\Sigma}_{l,n}$ coincides with $\boldsymbol{\Sigma}_{n}(\boldsymbol{\beta}^{*}_{l,n})=\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})+T_{2}$ where

T_{2}:=\int_{W_{n}^{2}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(v)^{\top}\rho_{n}(u;\boldsymbol{\beta}_{l,n}^{*})\rho_{n}(v;\boldsymbol{\beta}_{l,n}^{*})\left\{g_{n}(u,v)-1\right\}\,\mathrm{d}u\,\mathrm{d}v.

In this case we get

\mathrm{trace}\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\boldsymbol{\Sigma}_{n}(\boldsymbol{\beta}_{l,n}^{*})\}=\mathrm{trace}\{\mathbf{I}_{p_{l}}+\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}T_{2}\}=p_{l}+\mathrm{trace}\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}T_{2}\}.

We propose to use $p_{l,\text{approx}}^{*}=p_{l}+\mathrm{trace}\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}T_{2}\}$ as an approximation of $p_{l}^{*}$. In practice we replace $\boldsymbol{\beta}_{l,n}^{*}$ by $\hat{\boldsymbol{\beta}}_{l,n}$ and $g_{n}$ by an estimate obtained by fitting a valid parametric model for $g_{n}$, and thus obtain $\hat{p}_{l,\text{approx}}^{*}$. Our composite model selection criterion then becomes

\mathrm{CIC}_{l}=-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+2\hat{p}_{l,\text{approx}}^{*}.  (14)

For a Poisson process, $g_{n}=1$, in which case $\hat{p}_{l,\text{approx}}^{*}=p_{l}$ and (14) reduces to the popular Akaike information criterion. For a clustered point process, $g_{n}>1$, meaning that $\hat{p}_{l,\text{approx}}^{*}>p_{l}$. Thus we penalize model complexity more heavily in the case of a clustered point process. This makes sense since random clustering of points (i.e. clustering not due to covariates) may erroneously be picked up by covariates that actually had no effect in the data generating mechanism. Hence there is a greater risk of selecting an overly complex model for the intensity function for a clustered point process than for a Poisson process.
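To indicate how $\hat{p}_{l,\text{approx}}^{*}$ can be computed, the sketch below (base R, hypothetical inputs) evaluates $p_{l}+\mathrm{trace}\{\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})^{-1}T_{2}\}$ by Riemann sums on a coarse grid, plugging in a fitted intensity and a fitted translation-invariant pair correlation function; in the simulation study of Section 6 the same quantity is obtained via spatstat's vcov.kppm.

```r
# A minimal sketch (base R): effective degrees of freedom p*_l,approx on a grid.
# zfun(x, y): covariate matrix (first column 1); rho_hat(x, y): fitted intensity;
# gfun(r): fitted pair correlation, e.g. for a Thomas process
# gfun <- function(r) 1 + exp(-r^2 / (4 * scale^2)) / (4 * pi * kappa * scale^2).
effective_df <- function(zfun, rho_hat, gfun, W, ngrid = 30) {
  gx <- seq(W$xrange[1], W$xrange[2], length.out = ngrid)
  gy <- seq(W$yrange[1], W$yrange[2], length.out = ngrid)
  grid <- expand.grid(x = gx, y = gy)
  a  <- diff(W$xrange) * diff(W$yrange) / nrow(grid)  # cell area
  Z  <- zfun(grid$x, grid$y)                          # N x p_l covariate matrix
  r  <- rho_hat(grid$x, grid$y)                       # N fitted intensity values
  S  <- t(Z) %*% (Z * r) * a                          # sensitivity matrix (5)
  D  <- as.matrix(dist(grid))                         # pairwise distances
  C2 <- (gfun(D) - 1) * outer(r, r)                   # rho(u) rho(v) {g - 1} weights
  T2 <- t(Z) %*% C2 %*% Z * a^2                       # the double integral T_2
  ncol(Z) + sum(diag(solve(S, T2)))                   # p_l + trace(S^{-1} T_2)
}
```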

5 Bayesian information criterion

The motivation of the Bayesian information criterion (BIC) is quite different from the derivation of the AIC and CIC criteria considered in the previous section. A main difference is that there is initially no reference to a Kullback-Leibler distance or asymptotics related to a ‘least false parameter value’. Instead, the true model is considered to be one of the models $\mathcal{M}_{l}$ and the idea is to choose the model that has maximum posterior probability within the specified Bayesian framework. However, asymptotic concepts again play a role in deriving asymptotic expansions of the posterior probabilities. Section 5.1 covers the Poisson process case. Section 5.2 proposes a composite likelihood BIC for the case where data are not generated from a Poisson process. Note that in this case it is still assumed that the true intensity function corresponds to one of the intensity functions for the models $\mathcal{M}_{l}$.

5.1 BIC in the Poisson process case

The BIC criterion (see e.g. Schwarz (1978); Lebarbier and Mary-Huard (2006)) defines the best model $\mathcal{M}_{\mathrm{BIC}}$ as

\mathcal{M}_{\mathrm{BIC}}=\operatorname*{arg\,max}_{\mathcal{M}_{l},\,l=1,\ldots,2^{p}}\mathrm{P}_{n}\left(\mathcal{M}_{l}\mid\mathbf{X}_{n}\right).  (15)

From Bayes formula,

\mathrm{P}_{n}(\mathcal{M}_{l}\mid\mathbf{X}_{n})=\frac{p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})\,\mathrm{P}(\mathcal{M}_{l})}{p_{n}(\mathbf{X}_{n})}.

Letting $p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})$ denote the prior density of $\boldsymbol{\beta}_{l}$ given $\mathcal{M}_{l}$,

p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})=\int_{\mathbb{R}^{p_{l}}}p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})\,p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\,\mathrm{d}\boldsymbol{\beta}_{l}

since $p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})$ is the conditional density of $\mathbf{X}_{n}$ given $\mathcal{M}_{l}$ and $\boldsymbol{\beta}_{l}$. We assume that the prior distribution over models is non-informative, so that the BIC criterion defines the best model as

\mathcal{M}_{\mathrm{BIC}}=\operatorname*{arg\,max}_{\mathcal{M}_{l},\,l=1,\dots,2^{p}}p_{n}\left(\mathbf{X}_{n}\mid\mathcal{M}_{l}\right).  (16)

In principle, one could evaluate $p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})$ using numerical quadrature and then determine $\mathcal{M}_{\mathrm{BIC}}$. However, this is computationally costly, and the need to elicit a specific prior $p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})$ for each model may also be a nuisance. Our next result therefore proposes an asymptotic expansion of $\log p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})$. The methodology is standard (basically a Laplace approximation) and well known in the literature, see Tierney and Kadane (1986) or e.g. Lebarbier and Mary-Huard (2006) and the references therein. However, due to our spatial framework and the double asymptotic point of view considered in this paper, the standard results do not apply straightforwardly. Due to the use of asymptotic results we again need to rely on the notions of a true model and the ‘least false parameter value’ $\boldsymbol{\beta}_{l,n}^{*}$.

We impose the following conditions on the prior for $\boldsymbol{\beta}_{l}$ and the mean $\mu_{n}=\mathbb{E}\,N(W_{n})$.

C8. The prior density $p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})$ of $\boldsymbol{\beta}_{l}$ given $\mathcal{M}_{l}$ is continuously differentiable on $\mathbb{R}^{p_{l}}$.

C9. $\mu_{n}\asymp\tau_{n}$.

Note that $\mu_{n}$ is the expected number of points of $\mathbf{X}_{n}$ in $W_{n}$ under the true intensity $\lambda_{n}$ discussed in Section 2. Condition C9 seems reasonable since it would hold if $\lambda_{n}$ coincided with any of the specified parametric intensity models $\rho_{n}(\cdot;\boldsymbol{\beta}_{l})$. We also need to slightly strengthen assumption C1 and replace it by

C1′. As $n\to\infty$, there exists $\boldsymbol{\beta}_{l}^{*}\in\mathbb{R}^{p_{l}}$ such that $\lim_{n\to\infty}\boldsymbol{\beta}_{l,n}^{*}=\boldsymbol{\beta}_{l}^{*}$.

The following result is verified in Appendix D using Łapiński (2019, Theorem 2), which is a rigorous statement of a multivariate Laplace approximation.

Proposition 2.

Under the conditions C1′, C2-C5 and C8-C9, we have, almost surely with respect to the distribution of $\{\mathbf{X}_{n}\}_{n\geq 1}$, as $n\to\infty$,

\log p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})=\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})-\frac{p_{l}}{2}\log(\mu_{n})+\frac{p_{l}}{2}\log(2\pi)-\frac{1}{2}\log\det\{\mu_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}+\log p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})+\mathcal{O}(\mu_{n}^{-1/2}).  (17)

The criterion (16) is defined entirely within the specified Bayesian framework; hence there is no reference to a true model and no need for asymptotic results. However, this changes when we derive the expansion (17) for (16). The expansion is around $\hat{\boldsymbol{\beta}}_{l,n}$ and, for technical reasons, convergence of $\hat{\boldsymbol{\beta}}_{l,n}$ is needed when applying the Laplace approximation. Therefore we need the assumptions that ensure strong consistency of $\hat{\boldsymbol{\beta}}_{l,n}$.

Since $\mu_{n}\asymp\tau_{n}$ and from the strong consistency of $\hat{\boldsymbol{\beta}}_{l,n}$, we have that $\log\det\{\mu_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}=\mathcal{O}_{\mathrm{P}}(1)$, while C8 ensures that $\log p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})$ is $\mathcal{O}_{\mathrm{P}}(1)$. So, if we neglect terms which are $\mathcal{O}_{\mathrm{P}}(1)$ in (17), we follow the standard heuristic and suggest to define a first version of the BIC criterion as

-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+p_{l}\log(\mu_{n})  (18)

where we remind the reader that $p_{l}$ is the length of $\boldsymbol{\beta}_{l}$.

In practice $\mu_{n}$ is not known. However, since $\mathbb{V}\mathrm{ar}\,N(W_{n})=\mu_{n}$ it follows that $N(W_{n})/\mu_{n}-1=\mathcal{O}_{\mathrm{P}}(\mu_{n}^{-1/2})=o_{\mathrm{P}}(1)$. This justifies defining the $\mathrm{BIC}$ criterion in the following natural way:

\mathrm{BIC}_{l}=-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+p_{l}\log\left\{N(W_{n})\right\}.  (19)
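In terms of spatstat objects, (19) can be computed directly from a fitted Poisson model; a minimal sketch (with hypothetical objects X and covariate images) is

```r
# A minimal sketch (spatstat): BIC with 'sample size' equal to the observed number
# of points N(W), for a Poisson intensity model fitted with ppm(), e.g.
# fit <- ppm(X ~ z1 + z2, data = covariate_images)   # hypothetical fit
bic_N <- function(fit, X)
  -2 * as.numeric(logLik(fit)) + length(coef(fit)) * log(npoints(X))
```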

5.2 Composite likelihood BIC

Suppose $\mathbf{X}_{n}$ has an intensity function of the form (1) but is not a Poisson process. Then, as mentioned in Section 3.1, (3) may be viewed as a log composite likelihood for estimating $\boldsymbol{\beta}_{l}$. In this case, following Gao and Song (2010), (16) may be viewed as a composite likelihood BIC. Again we obtain (19) from (16) by Laplace approximation. However, Gao and Song (2010) suggest replacing $p_{l}$ by the ‘effective degrees of freedom’ $p_{l}^{*}$ considered in Section 4. Thus our proposed composite likelihood BIC is

\mathrm{CBIC}_{l}=-2\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})+p_{l,\text{approx}}^{*}\log\big\{N(W_{n})\big\}

which becomes equal to the ordinary BIC in (19) for a Poisson process. From a practical point of view, we simply estimate pl,approxp_{l,\text{approx}}^{*} as in (14).
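Correspondingly, a sketch of the CIC and CBIC computations, given the maximized value of (3) and an estimate of $p_{l,\text{approx}}^{*}$ (e.g. from the effective_df() sketch in Section 4; all names hypothetical), is

```r
# A minimal sketch: composite information criteria from a composite log-likelihood
# value ll, an estimated effective degrees of freedom p_star, and the observed
# number of points npts.
cic  <- function(ll, p_star)       -2 * ll + 2 * p_star
cbic <- function(ll, p_star, npts) -2 * ll + p_star * log(npts)
```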

6 Simulation study

To evaluate the proposed model selection criteria we conduct two simulation studies with $p=6$ spatial covariates, of which four have zero effect. The first study in Section 6.1 considers AIC and BIC in the Poisson point process case. The second study considers the clustered Thomas point process, where we employ the CIC and CBIC criteria to select the best model and compare with results obtained using AIC and BIC (wrongly assuming that the simulated data come from a Poisson point process).

The covariates are obtained from the BCI dataset (Hubbell and Foster, 1983; Condit et al., 1996; Condit, 1998), which in addition to the locations of around 300 species of trees observed in $W=[0,1000]\times[0,500]$ (m$^2$) contains a number of spatial covariates. In particular, we center and scale the two topographical covariates (elevation and slope of elevation) and four soil nutrients (aluminium, boron, calcium, and copper). The six covariates are depicted in Figure 1.

Figure 1: Maps of covariates used in the simulation study. From left to right: First row: elevation ($z_{1}$), slope ($z_{2}$) and aluminium ($z_{3}$); Second row: boron ($z_{4}$), calcium ($z_{5}$) and copper ($z_{6}$).

Suppressing the dependence on $n$ in the notation, we model the intensity function as

\rho(u;\boldsymbol{\beta})=\omega\exp\{\beta_{1}z_{1}(u)+\cdots+\beta_{6}z_{6}(u)\},  (20)

where $z_{1},\ldots,z_{6}$ are the centered and scaled covariates shown in Figure 1 and $\beta_{1},\ldots,\beta_{6}$ are the regression coefficients. We consider different settings for $\omega$ and $W$. Note that $\omega$ plays the role of $\theta\exp(\beta_{0})$ and will be adjusted to obtain desired expected numbers of points $\mu$ in $W$. When the simulation involves an observation window $W$ different from $[0,1000]\times[0,500]$, the covariates are simply rescaled to fit $W$.
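For reference, a minimal spatstat sketch of this setup is given below; it assumes the six covariates are available as compatible pixel images z1, ..., z6 on W, and the names and the calibration target are hypothetical.

```r
# A minimal sketch (spatstat): build the intensity (20) from covariate images and
# simulate an inhomogeneous Poisson pattern with a target expected count mu.
library(spatstat)
mu   <- 200                           # target expected number of points in W
beta <- c(0.5, -0.25, 0, 0, 0, 0)
lin  <- eval.im(beta[1] * z1 + beta[2] * z2 + beta[3] * z3 +
                beta[4] * z4 + beta[5] * z5 + beta[6] * z6)
lam0  <- eval.im(exp(lin))
omega <- mu / integral(lam0)          # calibrate omega so that E N(W) = mu
X <- rpoispp(eval.im(omega * lam0))   # inhomogeneous Poisson simulation
```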

The model selection criteria are compared in terms of the true positive rate (TPR), false positive rate (FPR), expected Kullback-Leibler divergence (MKL), and mean integrated squared error of the intensity function (MISE). The TPR (resp. FPR) is the expected fraction of informative (resp. non-informative) covariates included in the selected model. For a point process observed on $W$ with intensity $\rho(\cdot;\boldsymbol{\beta}^{*})$, where $\boldsymbol{\beta}^{*}$ stands for the true parameter estimated by $\hat{\boldsymbol{\beta}}$, MKL and MISE are estimated by averaging the following KL and ISE across simulations:

\mathrm{KL}=\int_{W}\left[\rho(u;\boldsymbol{\beta}^{*})\left\{\log\rho(u;\boldsymbol{\beta}^{*})-1\right\}-\rho(u;\hat{\boldsymbol{\beta}})\left\{\log\rho(u;\hat{\boldsymbol{\beta}})-1\right\}\right]\mathrm{d}u
\mathrm{ISE}=\int_{W}\left\{\rho(u;\boldsymbol{\beta}^{*})-\rho(u;\hat{\boldsymbol{\beta}})\right\}^{2}\mathrm{d}u.
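Given pixel images of the true and fitted intensity functions (e.g. rho_fit <- predict(fit) in spatstat; the image names are hypothetical and the two images are assumed to share a common raster), a minimal sketch of the KL and ISE computations is

```r
# A minimal sketch (spatstat): grid-based evaluation of KL and ISE from two
# compatible pixel images rho_true and rho_fit of the true and fitted intensities.
KL  <- integral(eval.im(rho_true * (log(rho_true) - 1) -
                        rho_fit  * (log(rho_fit)  - 1)))
ISE <- integral(eval.im((rho_true - rho_fit)^2))
```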

6.1 Poisson point process model

We consider two different scenarios to illustrate both types of asymptotics.

  • Scenario 1 (infill asymptotics). $W=[0,1]\times[0,0.5]$. We adjust $\omega=\theta\exp(\beta_{0})$ such that $\mu$ equals either $50$ or $200$.

  • Scenario 2 (increasing domain asymptotics). $W=[0,500]\times[0,250]$ or $[0,1000]\times[0,500]$, and $\omega$ is chosen so that $\mu=200$ or $\mu=800$.

For each scenario and each choice of $W$ and $\omega$, 500 simulations are generated from inhomogeneous Poisson point processes with intensity function given by (20) using the function rpoispp from the spatstat R package. We set $\beta_{1}=0.5$ and $\beta_{2}=-0.25$ to represent moderate effects of elevation and slope, and set $\beta_{3}=\dots=\beta_{6}=0$. For each simulation, parameters are estimated by maximizing the Berman-Turner approximation (see e.g. Baddeley and Turner, 2000) of the Poisson log-likelihood (3), using a number $m$ of quadrature dummy points to approximate the integral in (3). The estimation is done using the ppm function. The model is then selected according to the AIC and BIC-type criteria. We consider different variants of the BIC criterion, namely

\mathrm{BIC}_{l}(\pi)=-2\ell(\hat{\boldsymbol{\beta}}_{l})+p_{l,\text{approx}}^{*}\log(\pi),\quad l=1,\dots,64,

where $\pi$ represents a penalty. Note that (omitting the dependence on $l$) $\mathrm{AIC}=\mathrm{BIC}\{\exp(2)\}$ and that $\mathrm{BIC}(N)$, with $N=N(W)$, corresponds to the criterion used by Thurman et al. (2015) and is also the criterion suggested by the present paper. We also consider $\mathrm{BIC}(|W|)$ used by Choiruddin et al. (2018) and $\mathrm{BIC}(N+m)$ considered by Jeffrey et al. (2018).

                                          AIC   BIC(N)   BIC(N+400mu)   BIC(|W|)
W = [0,1] x [0,0.5],       TPR             70       59             49        100
mu = 50                    FPR             16        5              1        100
                           MISE           6.5      6.0            6.0        7.9
                           MKL            3.0      2.9            2.9        3.6
W = [0,1] x [0,0.5],       TPR             94       83             63        100
mu = 200                   FPR             17        2              0        100
                           MISE           2.6      2.3            3.3        3.2
                           MKL            2.9      2.8            4.2        3.5
W = [0,500] x [0,250],     TPR             95       83             63         62
mu = 200                   FPR             16        2              0          0
                           MISE           1.0      0.9            1.3        1.3
                           MKL            2.8      2.7            4.1        4.2
W = [0,1000] x [0,500],    TPR            100       99             99         98
mu = 800                   FPR             18        1              0          0
                           MISE          10.4      6.5            6.7        7.0
                           MKL            2.9      1.8            1.9        2.0
Table 1: True positive (TPR) and false positive (FPR) rates in percent, MISE, and MKL, estimated from 500 simulations of inhomogeneous Poisson point processes on different observation domains. Model selection is based on AIC or BIC criteria of the form $\mathrm{BIC}_{l}=-2\ell(\hat{\boldsymbol{\beta}}_{l})+p_{l}\log\pi$ ($l=1,\dots,64$) for the penalty terms $\pi=N$, $N+400\mu$, $|W|$. For convenience, the four rows with MISE are multiplied by 0.001, 0.01, 100, and 1000, respectively.

For both scenarios, we perform estimation with $m=4\mu$ (the rule of thumb suggested by spatstat) and model selection with the criteria $\mathrm{AIC}$, $\mathrm{BIC}(N)$, $\mathrm{BIC}(N+4\mu)$ and $\mathrm{BIC}(|W|)$. Similar results are obtained with $\mathrm{BIC}(N)$ and $\mathrm{BIC}(N+4\mu)$, and we omit the results for the latter. We also perform estimation and selection with $m=400\mu$ and only report results for $\mathrm{BIC}(N+400\mu)$, since the results with the criteria $\mathrm{AIC}$, $\mathrm{BIC}(N)$ and $\mathrm{BIC}(|W|)$ are similar to those obtained when the estimation is performed with $m=4\mu$.

When $|W|$ is small, especially when $\log|W|<0$, the criterion $\mathrm{BIC}(|W|)$ obviously fails as it selects the most complex model regardless of the value of $\mu$, and the FPR thereby becomes 100%. In addition, as indicated in the second and third blocks of Table 1, where the point patterns have the same average number of points, it is worth noticing that $\mathrm{BIC}(|W|)$ selects quite different models. In particular this criterion has an undesirable strong dependence on the choice of length unit.

The criterion $\mathrm{BIC}(N+m)$ with a large $m$ also fails since the TPR with this criterion and $m=400\mu$ is very small compared to the other criteria, especially when the expected number of points is small or moderate. The AIC criterion achieves a high TPR in all situations but fails since it suffers from a high FPR, even in Scenario 2 with $\mu=800$.

In all cases, $\mathrm{BIC}(N)$ provides the best trade-off between TPR and FPR, and the results improve when $\omega$ or $|W|$ is increased. The smallest values of MKL and MISE are moreover always obtained with $\mathrm{BIC}(N)$. In case of a Poisson process, we therefore recommend $\mathrm{BIC}(N)$ (simply denoted BIC in the following).

6.2 Thomas point process model

To generate a simulation from a Thomas point process with intensity (20), we first generate a parent point pattern from a stationary Poisson point process $\mathbf{C}$ with intensity $\kappa>0$. Given $\mathbf{C}$, clusters $\mathbf{X}_{c}$, $c\in\mathbf{C}$, are generated from inhomogeneous Poisson point processes with intensity functions

\rho_{c}(u;\boldsymbol{\beta})=\omega\exp\{\beta_{1}z_{1}(u)+\cdots+\beta_{6}z_{6}(u)\}\,k(u-c;\gamma)/\kappa,

where $k(u-c;\gamma)=(2\pi\gamma^{2})^{-1}\exp\{-\|u-c\|^{2}/(2\gamma^{2})\}$ is the density of $\mathcal{N}(0,\gamma^{2}\mathbf{I}_{2})$. Finally, $\mathbf{X}=\cup_{c\in\mathbf{C}}\mathbf{X}_{c}$ is an inhomogeneous Thomas point process with intensity (20). The regression parameters are set as follows: $\beta_{1}=2$, $\beta_{2}=-1$, $\beta_{3}=\dots=\beta_{6}=0$. We consider $\kappa=4\times 10^{-4}$ and two scale parameters, $\gamma=5$ and $\gamma=15$. A lower value of $\gamma$ tends to produce more clustered patterns. We consider the observation domains $W=[0,500]\times[0,250]$ and $W=[0,1000]\times[0,500]$, with $\omega$ adjusted to give expected numbers of points $\mu=400$ and $\mu=1600$ for the two windows. The chosen value of $\kappa$ implies on average 50 parent points in $W=[0,500]\times[0,250]$ and 200 parent points in $W=[0,1000]\times[0,500]$.
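A minimal spatstat sketch of this simulation is given below (hypothetical objects omega, z1, z2 and window W; the spatially varying offspring mean is passed to rThomas as a pixel image, so that the resulting intensity is approximately $\kappa$ times that image, up to edge effects; argument names may differ slightly between spatstat versions).

```r
# A minimal sketch (spatstat): inhomogeneous Thomas process with intensity (20).
# When mu is a pixel image, rThomas() thins the offspring so that the intensity of
# the resulting process is kappa * mu(u); hence mu = rho / kappa gives intensity rho.
kappa <- 4e-4; gamma <- 5
rho   <- eval.im(omega * exp(2 * z1 - z2))   # intensity (20) with beta1 = 2, beta2 = -1
X <- rThomas(kappa = kappa, scale = gamma, mu = eval.im(rho / kappa), win = W)
```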

                                        AIC     BIC     CIC    CBIC
W = [0,500] x [0,250],   TPR             98      96      88      81
mu = 400, gamma = 5      FPR             81      65      24      17
                         MISE           4.5     4.4     2.7     2.5
                         MKL            9.7     9.6     7.5     9.1
                         Mean(p*)         -       -   103.6    76.6
                         SD(p*)           -       -   122.7    70.1
W = [0,1000] x [0,500],  TPR            100     100      97      94
mu = 1600, gamma = 5     FPR             81      65      20      11
                         MISE           3.9     3.8     2.6     2.4
                         MKL           10.3    10.2     7.9     8.2
                         Mean(p*)         -       -   101.1    81.3
                         SD(p*)           -       -   104.6    70.2
W = [0,500] x [0,250],   TPR            100      99      95      93
mu = 400, gamma = 15     FPR             70      49      43      32
                         MISE           2.1     2.0     1.8     1.8
                         MKL            5.2     5.0     4.9     5.4
                         Mean(p*)         -       -    48.9    40.9
                         SD(p*)           -       -    77.3    61.3
W = [0,1000] x [0,500],  TPR            100     100      96      94
mu = 1600, gamma = 15    FPR             78      57      36      24
                         MISE           2.8     2.8     2.3     2.5
                         MKL            7.7     7.5     7.3     9.3
                         Mean(p*)         -       -    96.1    78.1
                         SD(p*)           -       -   172.1   124.6
Table 2: True positive (TPR) and false positive (FPR) rates in percent, MISE, MKL (divided by 10), and means (Mean(p*)) and standard deviations (SD(p*)) of the estimates of $p_{l,\text{approx}}^{*}$, based on 500 simulations from inhomogeneous Thomas point processes with $\kappa=4\times 10^{-4}$ and scale parameter $\gamma=5$ or $15$, observed on different observation domains. Parameters are adjusted to have on average $\mu=400$ points in the small window and $1600$ points in the larger one. Model selection is based on AIC, BIC, CIC, and CBIC.

The R function rThomas is used to simulate the point patterns and the function kppm to estimate the regression parameters. We use $m=16\mu$ dummy points for the different integral approximations. Models are then selected using the four criteria AIC, BIC, CIC and CBIC. When using AIC or BIC we implicitly (and wrongly) assume that the simulated patterns come from a Poisson point process and thus set $p_{l,\text{approx}}^{*}=p_{l}$. For the composite likelihood type criteria CIC and CBIC, we compute $\hat{p}_{l,\text{approx}}^{*}$ using the R function vcov.kppm. The parameters $\kappa$ and $\gamma$ are estimated by minimum contrast estimation with the tuning parameter $r_{\max}$ (Waagepetersen and Guan, 2009) set to 20 when $\gamma=5$ and to 50 when $\gamma=15$.
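A minimal sketch of the fitting step is given below (spatstat; the covariate list covs and the pattern X are hypothetical, as.ppm() is assumed to extract the first-step Poisson fit, and the exact arguments controlling the minimum-contrast step, such as rmax, may differ between spatstat versions).

```r
# A minimal sketch (spatstat): fit the Thomas model, then assemble the criteria.
fit <- kppm(X ~ z1 + z2 + z3 + z4 + z5 + z6, clusters = "Thomas", data = covs)
ll  <- as.numeric(logLik(as.ppm(fit)))  # Poisson composite log-likelihood at beta-hat
p   <- length(coef(fit))
aic <- -2 * ll + 2 * p                  # AIC / BIC treat the pattern as Poisson
bic <- -2 * ll + p * log(npoints(X))
# CIC and CBIC additionally replace p by an estimate of p*_l,approx, obtained in the
# paper via vcov.kppm (or, e.g., via the effective_df() sketch of Section 4).
```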

Results are reported in Table 2. The AIC and BIC criteria ignore the second-order structure of the simulated point patterns and we observe that overall AIC and BIC produce high TPR but also very high FPR. The CIC and CBIC give much more reasonable trade-offs between TPR and FPR and also (with one exception) give smaller MKL and MISE than AIC and BIC.

Focusing on CIC and CBIC, we first notice that the means of the estimates of $p_{l,\text{approx}}^{*}$ are high. This is because the considered clustered point processes are far from Poisson models. We also notice that even when the number of points is quite large, the estimation of $p_{l,\text{approx}}^{*}$ is inaccurate, with standard deviations of the estimates of the same order as the means. Nevertheless, the resulting CIC and CBIC give reasonable results. Comparing CIC and CBIC, CIC in general has a higher FPR than CBIC but on the other hand always gives the smallest MKL. The TPR and MISE are quite similar for CIC and CBIC. Hence in the case of the clustered point processes considered here, CIC and CBIC clearly outperform AIC and BIC, but there is not a clear winner between CIC and CBIC.

7 Discussion

In this paper we establish a theoretical foundation for various model selection criteria for inhomogeneous point processes under various asymptotic settings. In the case of a Poisson process, a main contribution is to identify, in relation to BIC, the correct interpretation of ‘sample size’: based on our theoretical derivation it is the expected number of points, which in practice is estimated by the observed number of points. This interpretation is supported by our simulation study, which also supports the common understanding that BIC may be preferable to AIC, as the latter tends to pick too complex models.

More generally, for selecting a regression model for the intensity function of a general point process, we develop composite model selection criteria, CIC and CBIC, that clearly outperform AIC and BIC in the simulation study for a clustered point process. One issue regarding CIC and CBIC is the estimation of the bias correction for the estimate of the composite Kullback-Leibler divergence, which depends on the unknown true intensity function and pair correlation function. Here, inspired by the approach underlying AIC, for a given model we simply plug in the fitted intensity function and pair correlation function for the model in question. This is computationally convenient, and we leave it as an open problem to develop more precise estimates.

A further interesting topic for future research would be to study the theoretical foundation of criteria for selecting models for the conditional intensity of a Gibbs process fitted by pseudo-likelihood. Another interesting problem is the selection of the penalization parameter when the intensity function is estimated using regularization methods like the lasso (see e.g. Thurman et al., 2015; Choiruddin et al., 2018). This is not covered by our theoretical results, which rely on asymptotic results for unbiased estimating functions or on Bayesian considerations.
Acknowledgments
The research of J.-F. Coeurjolly is supported by the Natural Sciences and Engineering Research Council. Rasmus Waagepetersen is supported by The Danish Council for Independent Research | Natural Sciences, grant DFF - 7014-00074 "Statistics for point processes in space and beyond", and by the Centre for Stochastic Geometry and Advanced Bioimaging, funded by grant 8721 from the Villum Foundation.

Appendix A Proof of Theorem 1

Proof.

(i) Using a zero order Taylor expansion of $e_{n}(\hat{\boldsymbol{\beta}}_{l,n})$ (which equals zero by definition) around $\boldsymbol{\beta}_{l,n}^{*}$, expressed with an integral remainder term, we have

e_{n}(\boldsymbol{\beta}_{l,n}^{*})=\left(\int_{0}^{1}\int_{W_{n}}\theta_{n}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\exp\left[\left\{\boldsymbol{\beta}_{l,n}^{*}+t(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})\right\}^{\top}\mathbf{z}_{l}(u)\right]\mathrm{d}u\,\mathrm{d}t\right)\left(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\right).

By rearranging the right-hand side of the latter equation and using Fubini’s theorem, we have

e_{n}(\boldsymbol{\beta}_{l,n}^{*})=\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\exp\left\{{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(u)\right\}\left[\int_{0}^{1}\left\{(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{z}_{l}(u)\right\}\exp\left\{t(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{z}_{l}(u)\right\}\mathrm{d}t\right]\mathrm{d}u
=\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\exp\left\{{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(u)\right\}\left[\exp\left\{(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{z}_{l}(u)\right\}-1\right]\mathrm{d}u.

Now, using C2-C3, we can apply a mean value theorem for multiple integrals: there exists $v=v_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l,n}^{*})\in W_{n}$ such that

e_{n}(\boldsymbol{\beta}_{l,n}^{*})=\tau_{n}\mathbf{z}_{l}(v)\exp\left\{{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(v)\right\}\left[\exp\left\{(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{z}_{l}(v)\right\}-1\right].  (21)

Since $\|\mathbf{z}_{l}(v)\|>0$ almost surely by condition C2, we deduce (10) from (21) with a little algebra. Let

A_{n}=\tau_{n}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\frac{\mathbf{z}_{l}(v)}{\|\mathbf{z}_{l}(v)\|^{2}}\exp\{-{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(v)\}.

By conditions C1-C4, it is clear that $A_{n}\to 0$ almost surely as $n\to\infty$. Using the continuity of $t\mapsto\log(1+t)$ and again condition C2, we deduce that $\log(1+A_{n})\,\mathbf{z}_{l}(v)/\|\mathbf{z}_{l}(v)\|^{2}$ tends to 0 almost surely.

(ii) The proof consists in applying a modified version of Waagepetersen and Guan (2009, Theorem 2) to the sequence of estimating functions $e_{n}(\boldsymbol{\beta}_{l})$. The modification is given in Appendix E and is needed to handle a sequence of ‘least false’ parameter values $\boldsymbol{\theta}_{n}^{*}$ instead of a unique ‘true’ value $\boldsymbol{\theta}^{*}$. Thus, we need to verify the assumptions G1-G4 in Appendix E. We leave it to the reader to check that conditions C3 and C5 imply conditions G1 and G2 with $\mathbf{V}_{n}=\sqrt{\tau_{n}}\mathbf{I}_{p}$ and $\mathbf{J}_{n}(\boldsymbol{\theta}^{*}_{n})=\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})$.

To prove assumption G3, we have to show that, as $n\to\infty$,

\sup_{\sqrt{\tau_{n}}\|\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{l,n}^{*}\|\leq d}\frac{\|\mathbf{S}_{n}(\boldsymbol{\beta}_{l})-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\|_{M}}{\tau_{n}}\to 0,

where $\|\mathbf{M}\|_{M}=\max_{ij}|M_{ij}|$. Consider a $\boldsymbol{\beta}_{l}$ such that $\sqrt{\tau_{n}}\|\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{l,n}^{*}\|\leq d$. Under conditions C1-C3, there exist $K_{1},K_{2},K<\infty$ such that

\|\mathbf{S}_{n}(\boldsymbol{\beta}_{l})-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\|_{M}=\left\|\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\left[\exp\{\boldsymbol{\beta}_{l}^{\top}\mathbf{z}_{l}(u)\}-\exp\{{\boldsymbol{\beta}_{l,n}^{*}}^{\top}\mathbf{z}_{l}(u)\}\right]\mathrm{d}u\right\|_{M}
\leq K_{1}\,\tau_{n}\|\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{l,n}^{*}\|\exp\left\{K_{2}\left(\|\boldsymbol{\beta}_{l,n}^{*}\|+d/\sqrt{\tau_{n}}\right)\right\}
\leq K\tau_{n}\|\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{l,n}^{*}\|=\mathcal{O}(\sqrt{\tau_{n}})=\mathcal{O}_{\mathrm{P}}(\sqrt{\tau_{n}}).  (22)

To verify condition G4, it is sufficient to note that condition C6 implies that the variance of $\tau_{n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})$ is bounded. Letting $\hat{\boldsymbol{\beta}}_{l,n}$ denote the (unique for $n$ large enough) solution of $e_{n}(\boldsymbol{\beta}_{l})=0$ or, equivalently,

𝜷^l,n=argmax𝜷lpln(𝜷l)\hat{\boldsymbol{\beta}}_{l,n}=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}}\ell_{n}(\boldsymbol{\beta}_{l})

we can then conclude that 𝜷^l,n\hat{\boldsymbol{\beta}}_{l,n} is a root-τn{\tau_{n}} consistent estimator of 𝜷l,n\boldsymbol{\beta}_{l,n}^{*}, that is, τn(𝜷^l,n𝜷l,n)\sqrt{\tau_{n}}(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}) is bounded in probability, which proves (11).

(iii) We use a Taylor expansion around \boldsymbol{\beta}_{l,n}^{*}: there exist t\in(0,1) and \tilde{\boldsymbol{\beta}}_{l,n}=\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l,n}^{*}-\hat{\boldsymbol{\beta}}_{l,n}) such that

e_{n}(\boldsymbol{\beta}_{l,n}^{*})=e_{n}(\boldsymbol{\beta}_{l,n}^{*})-e_{n}(\hat{\boldsymbol{\beta}}_{l,n})=\mathbf{S}_{n}(\tilde{\boldsymbol{\beta}}_{l,n})(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})
=\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*})+\mathbf{S}_{n}^{\prime}\mathbf{B}_{n}

where \mathbf{B}_{n}=\sqrt{\tau_{n}}(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}) and where \mathbf{S}_{n}^{\prime}=\tau_{n}^{-1/2}\left\{\mathbf{S}_{n}(\tilde{\boldsymbol{\beta}}_{l,n})-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}. We now show that \mathbf{S}_{n}^{\prime}\mathbf{B}_{n}=\mathcal{O}_{\mathrm{P}}(1). Let d and K be given as above and let K_{3}\geq d^{2}K. Using (22) we obtain

\mathrm{P}\left(\|\mathbf{S}_{n}^{\prime}\mathbf{B}_{n}\|_{M}\geq K_{3}\right)\leq\mathrm{P}\left(\|\mathbf{S}_{n}^{\prime}\mathbf{B}_{n}\|_{M}\geq K_{3},\|\mathbf{B}_{n}\|<d\right)+\mathrm{P}\left(\|\mathbf{B}_{n}\|\geq d\right)
\leq\mathrm{P}\left(dK\|\mathbf{B}_{n}\|\geq K_{3}\right)+\mathrm{P}\left(\|\mathbf{B}_{n}\|\geq d\right)\leq 2\,\mathrm{P}\left(\|\mathbf{B}_{n}\|\geq d\right).

Now from (ii), for any \varepsilon>0, by choosing d large enough, \mathrm{P}(\|\mathbf{B}_{n}\|\geq d)\leq\varepsilon/2 for n sufficiently large, whereby \mathrm{P}\left(\|\mathbf{S}_{n}^{\prime}\mathbf{B}_{n}\|_{M}\geq K_{3}\right)\leq\varepsilon for n sufficiently large.

Finally, since by condition C5, \|\mathbf{\Sigma}_{l,n}^{-1/2}\|=\mathcal{O}(\tau_{n}^{-1/2}), we conclude that

\boldsymbol{\Sigma}_{l,n}^{-1/2}\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})\left(\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\right)=\boldsymbol{\Sigma}_{l,n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})+\boldsymbol{\Sigma}_{l,n}^{-1/2}\mathcal{O}_{\mathrm{P}}(1)
=\boldsymbol{\Sigma}_{l,n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})+o_{\mathrm{P}}(1)

which yields the result using condition C7 and Slutsky’s lemma. ∎
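To complement the proof, the consistency and the root-\tau_{n} rate established in parts (i)-(ii) can be checked by simulation. The following R sketch is a minimal illustration only, under assumptions that are not part of the paper: a log-linear intensity \rho(u;\boldsymbol{\beta})=\exp(\beta_{0}+\beta_{1}z(u)) with the bounded covariate z(u)=u_{1}/n on W_{n}=[0,n]^{2}, simulation by independent thinning of a dominating homogeneous Poisson process, and a grid approximation of the integral in \ell_{n}. The estimate obtained by maximizing \ell_{n} then approaches the true value as the window grows.

# Minimal sketch (illustrative only): simulate an inhomogeneous Poisson process
# with intensity rho(u; beta) = exp(b0 + b1 * z(u)) on W_n = [0, n]^2 by thinning,
# and maximise the Poisson likelihood l_n; beta.hat approaches beta as n grows.
set.seed(1)
beta.true <- c(log(5), 1)                    # arbitrary illustrative values
z <- function(x, n) x / n                    # bounded covariate, in line with C2

simulate.ipp <- function(n, beta) {
  lam.max <- exp(beta[1] + abs(beta[2]))     # dominating constant intensity
  N <- rpois(1, lam.max * n^2)
  x <- runif(N, 0, n); y <- runif(N, 0, n)
  keep <- runif(N) < exp(beta[1] + beta[2] * z(x, n)) / lam.max   # thinning
  cbind(x = x[keep], y = y[keep])
}

loglik <- function(beta, pts, n, m = 100) {  # l_n(beta), integral on an m x m grid
  g <- expand.grid(x = (1:m - 0.5) * n / m, y = (1:m - 0.5) * n / m)
  sum(beta[1] + beta[2] * z(pts[, "x"], n)) -
    sum(exp(beta[1] + beta[2] * z(g$x, n))) * (n / m)^2
}

for (n in c(10, 20, 40)) {
  pts <- simulate.ipp(n, beta.true)
  fit <- optim(c(0, 0), function(b) -loglik(b, pts, n))
  cat("n =", n, " beta.hat =", round(fit$par, 3), "\n")
}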

Appendix B Conditions C4 and C7 for the inhomogeneous Poisson cluster point process

In this section, we show that the inhomogeneous Poisson cluster point process presented at the end of Section 3.2 satisfies C4 and C7. Recall that \lambda_{n}(u)=\theta_{n}\alpha\rho(u) where \sup_{u}\rho(u)={\mathcal{O}}(1).

Proof.

For k=1,\dots,\lfloor\tau_{n}\rfloor, let \mathbf{C}_{n,k} be independent inhomogeneous Poisson point processes with intensity \theta_{n}/\lfloor\tau_{n}\rfloor. By the superposition property of Poisson point processes, \mathbf{C}_{n} has the same distribution as \cup_{k}\mathbf{C}_{n,k}. Define \mathbf{A}_{n}=\sum_{u\in\mathbf{X}_{n}}\mathbf{z}_{l}(u). Then,

\mathbf{A}_{n}=\sum_{k=1}^{\lfloor\tau_{n}\rfloor}\mathbf{Z}_{n,k}\qquad\text{ where }\qquad\mathbf{Z}_{n,k}=\sum_{c\in\mathbf{C}_{n,k}}\sum_{u\in\mathbf{X}_{c}}\mathbf{z}_{l}(u).

Applying the Slivnyak-Mecke theorem twice (see e.g. Theorem 3.1 in Møller and Waagepetersen, 2004),

{\mathbb{E}}\mathbf{Z}_{n,k}=\lfloor\tau_{n}\rfloor^{-1}\int_{W_{n}}\mathbf{z}_{l}(u)\lambda_{n}(u)\mathrm{d}u
{\mathbb{V}\mathrm{ar}}\mathbf{Z}_{n,k}=\lfloor\tau_{n}\rfloor^{-1}\bigg\{\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(u)^{\top}\lambda_{n}(u)\mathrm{d}u
\qquad+\int_{W_{n}}\int_{W_{n}}\mathbf{z}_{l}(u)\mathbf{z}_{l}(v)^{\top}\lambda_{n}(u)\lambda_{n}(v)\left(g_{n}(u,v)-1\right)\mathrm{d}u\mathrm{d}v\bigg\}

where g_{n}(u,v)=1+\theta_{n}^{-1}(k*k)(v-u). Hence,

{\mathbb{E}}\mathbf{A}_{n}=\int_{W_{n}}\mathbf{z}_{l}(u)\lambda_{n}(u)\mathrm{d}u\qquad\text{and}\qquad{\mathbb{V}\mathrm{ar}}\mathbf{A}_{n}=\boldsymbol{\Sigma}_{l,n}.

Moreover, by (8),

\int_{W_{n}}\mathbf{z}_{l}(u)\lambda_{n}(u)\mathrm{d}u=\theta_{n}\int_{W_{n}}\mathbf{z}_{l}(u)\rho_{n}(u;\boldsymbol{\beta}_{l,n}^{*})\mathrm{d}u

so that e_{n}(\boldsymbol{\beta}_{l,n}^{*})=\mathbf{A}_{n}-{\mathbb{E}}\mathbf{A}_{n}=\sum_{k=1}^{\lfloor\tau_{n}\rfloor}\mathring{\mathbf{Z}}_{n,k} with \mathring{\mathbf{Z}}_{n,k}=\mathbf{Z}_{n,k}-{\mathbb{E}}\mathbf{Z}_{n,k}. Since C6 is satisfied, as shown in Example 1, the elements of {\mathbb{V}\mathrm{ar}}(\mathring{\mathbf{Z}}_{n,k}) are {\mathcal{O}}(1), whereby \sum_{k\geq 1}{\mathbb{V}\mathrm{ar}}(\mathring{\mathbf{Z}}_{n,k})/k^{2}<\infty (elementwise). Therefore, we can apply Kolmogorov's strong law of large numbers to establish that {\lfloor\tau_{n}\rfloor}^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*}) tends almost surely to 0. Using the same conditions and C5, we can also apply the Lindeberg-Feller theorem and obtain that, as n\to\infty, \boldsymbol{\Sigma}_{l,n}^{-1/2}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\to N(0,\mathbf{I}_{p_{l}}) in distribution. ∎
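The decomposition of \mathbf{A}_{n} into the \lfloor\tau_{n}\rfloor independent terms \mathbf{Z}_{n,k} can be illustrated on a toy example. The R sketch below (a Poisson process on the unit square with arbitrary intensity \theta and covariate z, purely illustrative and not the cluster process of the paper) checks that \sum_{u\in\mathbf{X}}z(u) has the same mean and variance whether \mathbf{X} is simulated directly with intensity \theta or as a superposition of m independent processes with intensity \theta/m, both matching the values \theta\int z(u)\mathrm{d}u and \theta\int z(u)^{2}\mathrm{d}u given by the Campbell formula.

# Toy check (illustrative only): on [0,1]^2, the statistic sum_{u in X} z(u) for a
# Poisson process with intensity theta has the same distribution whether X is
# simulated directly or as a superposition of m independent processes of intensity theta/m.
set.seed(2)
theta <- 200; m <- 20; nsim <- 2000
z <- function(x) 1 + x                                  # arbitrary bounded covariate

stat.direct <- replicate(nsim, {
  N <- rpois(1, theta); sum(z(runif(N)))
})
stat.super <- replicate(nsim, {
  sum(replicate(m, { N <- rpois(1, theta / m); sum(z(runif(N))) }))
})

rbind(direct = c(mean = mean(stat.direct), var = var(stat.direct)),
      super  = c(mean = mean(stat.super),  var = var(stat.super)))
# both rows should be close to (theta * 3/2, theta * 7/3) by the Campbell formula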

Appendix C Proof of Proposition 1

For a multi-index \alpha\in\mathbb{N}^{p} with cardinality |\alpha|=\alpha_{1}+\dots+\alpha_{p} and a |\alpha| times differentiable function f:\mathbb{R}^{p}\to\mathbb{R}, we define, for u\in\mathbb{R}^{p},

\partial^{\alpha}f(u)=\frac{\partial^{|\alpha|}f(u)}{\partial u_{1}^{\alpha_{1}}\dots\partial u_{p}^{\alpha_{p}}},

and we also use the notation u^{\alpha}=u_{1}^{\alpha_{1}}\dots u_{p}^{\alpha_{p}}.
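For instance, with p=2 and \alpha=(1,1), we have |\alpha|=2, \alpha!=\alpha_{1}!\,\alpha_{2}!=1, \partial^{\alpha}f(u)=\partial^{2}f(u)/\partial u_{1}\partial u_{2}, and u^{\alpha}=u_{1}u_{2}.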

Proof.

We follow the sketch of the proof of Varin and Vidoni (2005, Lemmas 1-2). Let \mathbf{Z}_{n} be an independent copy of \mathbf{X}_{n}, and let \ell_{n}(\boldsymbol{\beta}_{l};\mathbf{Z}_{n}) and e_{n}(\boldsymbol{\beta}_{l};\mathbf{Z}_{n}) be the composite likelihood and its corresponding estimating function evaluated at \mathbf{Z}_{n}. Recall the notation \ell_{n}(\boldsymbol{\beta}_{l})=\ell_{n}(\boldsymbol{\beta}_{l};\mathbf{X}_{n}), C_{n}(\boldsymbol{\beta}_{l})=-{\mathbb{E}}\ell_{n}(\boldsymbol{\beta}_{l}), e_{n}(\boldsymbol{\beta}_{l})=e_{n}(\boldsymbol{\beta}_{l};\mathbf{X}_{n}) and \hat{\boldsymbol{\beta}}_{l,n}=\operatorname*{arg\,max}_{\boldsymbol{\beta}_{l}}\ell_{n}(\boldsymbol{\beta}_{l}). We have

{\mathbb{E}}C_{n}(\hat{\boldsymbol{\beta}}_{l,n})=-{\mathbb{E}}\left[{\mathbb{E}}\left\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})\,\big|\,\mathbf{X}_{n}\right\}\right].

Using a first-order Taylor expansion of \ell_{n}(\cdot;\mathbf{Z}_{n}) around \boldsymbol{\beta}_{l,n}^{*} with integral remainder term, we have

\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})=\ell_{n}(\boldsymbol{\beta}_{l,n}^{*};\mathbf{Z}_{n})+e_{n}(\boldsymbol{\beta}_{l,n}^{*};\mathbf{Z}_{n})^{\top}\mathbf{\Delta}_{n}+\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})

where \mathbf{\Delta}_{n}=\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*} and where the integral remainder term can be expressed as

\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})=2\sum_{\alpha\in\mathbb{N}^{p_{l}},|\alpha|=2}\bigg[\frac{(\boldsymbol{\beta}_{l,n}^{*}-\hat{\boldsymbol{\beta}}_{l,n})^{\alpha}}{\alpha!}\int_{0}^{1}(1-t)\partial^{\alpha}\ell_{n}\left\{\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l,n}^{*}-\hat{\boldsymbol{\beta}}_{l,n});\mathbf{Z}_{n}\right\}\mathrm{d}t\bigg].

It is worth noting that for any \boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}} and any \alpha\in\mathbb{N}^{p_{l}} such that |\alpha|=2, \partial^{\alpha}\ell_{n}(\boldsymbol{\beta}_{l};\mathbf{Z}_{n}) is a deterministic function of \boldsymbol{\beta}_{l}, since the second-order derivatives of \ell_{n} only involve the (deterministic) integral term. Therefore \mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n}):=\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n}) does not depend on \mathbf{Z}_{n}. Using this and the unbiasedness of e_{n}(\boldsymbol{\beta}_{l,n}^{*};\cdot) we have

{\mathbb{E}}\left\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n};\mathbf{Z}_{n})\,\big|\,\mathbf{X}_{n}\right\}={\mathbb{E}}\left\{\ell_{n}(\boldsymbol{\beta}_{l,n}^{*};\mathbf{Z}_{n})\right\}+\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n}).

Hence,

{\mathbb{E}}\left\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}=-{\mathbb{E}}\left\{\ell_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}-{\mathbb{E}}\left\{\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n})\right\}. (23)

Now, using a first-order Taylor expansion of -\ell_{n}(\cdot) around \boldsymbol{\beta}_{l,n}^{*}, we have

-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})=-\ell_{n}(\boldsymbol{\beta}_{l,n}^{*})-e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{\Delta}_{n}-\mathbf{R}_{n}(\boldsymbol{\beta}_{l,n}^{*},\hat{\boldsymbol{\beta}}_{l,n}). (24)

Combining (23) and the expectation of (24) we obtain

{\mathbb{E}}\left\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}-{\mathbb{E}}\left\{-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}={\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{\Delta}_{n}\right\}. (25)

Since by definition of \hat{\boldsymbol{\beta}}_{l,n}, \mathbf{\Delta}_{n}=\mathbf{S}_{n}^{-1}(\tilde{\boldsymbol{\beta}}_{l,n})e_{n}(\boldsymbol{\beta}_{l,n}^{*}) where \tilde{\boldsymbol{\beta}}_{l,n}=\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l,n}^{*}-\hat{\boldsymbol{\beta}}_{l,n}) for some t\in(0,1), we continue with

{\mathbb{E}}\left\{C_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}-{\mathbb{E}}\left\{-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}={\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}+{\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}
=\mathrm{trace}\left\{\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}\mathbf{\Sigma}_{l,n}\right\}+{\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\} (26)

where \mathbf{M}_{n}=\mathbf{S}_{n}(\tilde{\boldsymbol{\beta}}_{l,n})^{-1}-\mathbf{S}_{n}(\boldsymbol{\beta}_{l,n}^{*})^{-1}. Since e_{n}(\boldsymbol{\beta}_{l,n}^{*})/\sqrt{\tau_{n}} is bounded in probability by C6, and since \tau_{n}\mathbf{M}_{n} converges to zero in probability because \mathbf{\Delta}_{n} converges to zero in probability, we obtain that e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*}) converges to zero in probability. This combined with the uniform integrability (13) implies {\mathbb{E}}\left\{e_{n}(\boldsymbol{\beta}_{l,n}^{*})^{\top}\mathbf{M}_{n}e_{n}(\boldsymbol{\beta}_{l,n}^{*})\right\}=o(1). ∎
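The first term in (26) relies on the generic identity {\mathbb{E}}\{e^{\top}\mathbf{M}e\}=\mathrm{trace}(\mathbf{M}\boldsymbol{\Sigma}) for a mean-zero random vector e with variance \boldsymbol{\Sigma} and a fixed matrix \mathbf{M}. A quick R check of this identity, using an arbitrary multivariate normal vector as a stand-in for e_{n}(\boldsymbol{\beta}_{l,n}^{*}) (the matrices below are illustrative and not quantities from the paper):

# Monte Carlo check (illustrative only) of E{ t(e) %*% solve(S) %*% e } = trace(solve(S) %*% Sigma)
# for a mean-zero vector e with variance Sigma; this is the step giving the trace term in (26).
set.seed(3)
p <- 3
Sigma <- crossprod(matrix(rnorm(p^2), p)) + diag(p)     # an arbitrary covariance matrix
S     <- crossprod(matrix(rnorm(p^2), p)) + diag(p)     # an arbitrary invertible 'sensitivity'
e     <- matrix(rnorm(1e5 * p), ncol = p) %*% chol(Sigma)   # rows: draws of e ~ N(0, Sigma)

quad <- rowSums((e %*% solve(S)) * e)                   # t(e) %*% solve(S) %*% e for each draw
c(monte.carlo = mean(quad), exact = sum(diag(solve(S) %*% Sigma)))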

Appendix D Proof of Proposition 2

Proof.

Recall the notation \tau_{n}=\theta_{n}|W_{n}|. We have

p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})=\int_{\mathbb{R}^{p_{l}}}p_{n}(\mathbf{X}_{n},\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\mathrm{d}\boldsymbol{\beta}_{l}
=\int_{\mathbb{R}^{p_{l}}}p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\mathrm{d}\boldsymbol{\beta}_{l}
=\int_{\mathbb{R}^{p_{l}}}p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\exp\left\{\tau_{n}\,\ell_{n}(\boldsymbol{\beta}_{l})/\tau_{n}\right\}\mathrm{d}\boldsymbol{\beta}_{l}. (27)

We are now in a position to apply Łapiński (2019, Theorem 2), which gives rigorous conditions under which a multivariate Laplace approximation holds. First, condition C8 ensures that p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l}) is regular enough. Second, condition C5 ensures that \tau_{n}\ell_{n}(\boldsymbol{\beta}_{l}) has a nonsingular Hessian matrix for any \boldsymbol{\beta}_{l}. Third, for \alpha\in\mathbb{N}^{p_{l}} with |\alpha|\leq 3, we have for any \boldsymbol{\beta}_{l} (since z_{0l}^{\alpha_{0}}={1}^{\alpha_{0}}=1)

\partial^{\alpha}\{\tau_{n}^{-1}\log p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})\}=-\frac{1}{|W_{n}|}\int_{W_{n}}\prod_{j\in I_{l}}z_{j}^{\alpha_{j}}(u)\rho(u;\boldsymbol{\beta}_{l})\mathrm{d}u.

Therefore, under condition C2, \partial^{\alpha}\{\tau_{n}^{-1}\log p_{n}(\mathbf{X}_{n};\boldsymbol{\beta}_{l})\} is uniformly bounded for \boldsymbol{\beta}_{l} in any compact subset of \mathbb{R}^{p_{l}}.

The fact that \int p(\boldsymbol{\beta}_{l}\mid\mathcal{M}_{l})\mathrm{d}\boldsymbol{\beta}_{l}=1<\infty is sufficient to ensure Łapiński (2019, condition (5)). Finally, using the strong consistency of \hat{\boldsymbol{\beta}}_{l,n}, that is, \hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l,n}^{*}\to 0 almost surely, and condition C1, the last condition to check is formulated as follows: we need to verify that there exists n_{0}\in\mathbb{N} such that for n\geq n_{0}, \tau_{n}^{-1}\ell_{n}(\boldsymbol{\beta}_{l}) has a unique maximum in a closed ball B(\boldsymbol{\beta}_{l}^{*},\varepsilon) for some \varepsilon>0, where \boldsymbol{\beta}_{l}^{*} is given by condition C1, and such that

\Delta=\inf_{n\geq n_{0},\,\boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}\setminus B(\boldsymbol{\beta}_{l}^{*},\varepsilon)}\;\tau_{n}^{-1}\left\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})-\ell_{n}(\boldsymbol{\beta}_{l})\right\}>0. (28)

By condition C5, for any n\geq 1, \ell_{n} (and thus \tau_{n}^{-1}\ell_{n}) indeed has a unique maximum. Given \varepsilon>0, there exists n_{0}\in\mathbb{N} such that for n\geq n_{0}, almost surely \hat{\boldsymbol{\beta}}_{l,n}\in B(\boldsymbol{\beta}_{l}^{*},\varepsilon/2). Consider \Delta with such n_{0} and \varepsilon. Note that for large n and any \boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}}\setminus B(\boldsymbol{\beta}_{l}^{*},\varepsilon), \|\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l}\|\geq\varepsilon/2 almost surely.

Using a first-order Taylor expansion around \hat{\boldsymbol{\beta}}_{l,n} and using the fact that e_{n}(\hat{\boldsymbol{\beta}}_{l,n})=0, we have, for any \boldsymbol{\beta}_{l}\in\mathbb{R}^{p_{l}},

\tau_{n}^{-1}\left\{\ell_{n}(\boldsymbol{\beta}_{l})-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\right\}=R_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l})

where the remainder term R_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l}) is

\frac{2}{\tau_{n}}\sum_{\alpha\in\mathbb{N}^{p_{l}},|\alpha|=2}\bigg[\frac{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\alpha}}{\alpha!}\int_{0}^{1}(1-t)\partial^{\alpha}\ell_{n}\left\{\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n});\mathbf{X}_{n}\right\}\mathrm{d}t\bigg]
=\frac{-2}{|W_{n}|}\sum_{\alpha\in\mathbb{N}^{p_{l}},|\alpha|=2}\bigg[\frac{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\alpha}}{\alpha!}\int_{0}^{1}(1-t)\int_{W_{n}}\mathbf{z}^{\alpha}(u)\rho(u;\hat{\boldsymbol{\beta}}_{l,n}+t(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n}))\mathrm{d}u\,\mathrm{d}t\bigg]
=\frac{-2}{|W_{n}|}\int_{W_{n}}\sum_{\alpha\in\mathbb{N}^{p_{l}},|\alpha|=2}\frac{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\alpha}}{\alpha!}\mathbf{z}^{\alpha}(u)\exp\{\hat{\boldsymbol{\beta}}_{l,n}^{\top}\mathbf{z}_{l}(u)\}\,\frac{\exp\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\}-(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)-1}{\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\}^{2}}\,\mathrm{d}u
=\frac{-1}{|W_{n}|}\int_{W_{n}}\exp\{\hat{\boldsymbol{\beta}}_{l,n}^{\top}\mathbf{z}_{l}(u)\}\,f\left\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\right\}\mathrm{d}u

where f(t)=\exp(t)-1-t. Consider the set B_{n} given by C5. Since for large n, \|\hat{\boldsymbol{\beta}}_{l,n}-\boldsymbol{\beta}_{l}\|\geq\varepsilon/2 and since f is nonnegative with a unique minimum at zero, we have by condition C5 that

\inf_{u\in B_{n}}\exp\{\hat{\boldsymbol{\beta}}_{l,n}^{\top}\mathbf{z}_{l}(u)\}>\exp(-\|\hat{\boldsymbol{\beta}}_{l,n}\|c)\text{ and }\inf_{u\in B_{n}}f\left\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\right\}\geq\delta>0

for some \delta>0. Therefore,

R_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l})\leq\frac{-1}{|W_{n}|}\int_{B_{n}}\exp\{\hat{\boldsymbol{\beta}}_{l,n}^{\top}\mathbf{z}_{l}(u)\}\,f\left\{(\boldsymbol{\beta}_{l}-\hat{\boldsymbol{\beta}}_{l,n})^{\top}\mathbf{z}_{l}(u)\right\}\mathrm{d}u
\leq-\delta\,\frac{|B_{n}|}{|W_{n}|}\,\exp(-\|\hat{\boldsymbol{\beta}}_{l,n}\|c).

Again, by definition of B_{n}, the strong consistency of \hat{\boldsymbol{\beta}}_{l,n} and condition C2, we conclude that for large n there exists \delta^{\prime}>0 such that R_{n}(\hat{\boldsymbol{\beta}}_{l,n},\boldsymbol{\beta}_{l})\leq-\delta^{\prime}<0, whereby we deduce that \Delta>0.

The previous statements allow us to conclude by Łapiński (2019, Theorem 2) that

p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})=\frac{p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})}{\mathrm{det}\{\tau_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}^{1/2}}\exp\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}\left(\frac{2\pi}{\tau_{n}}\right)^{p_{l}/2}+\exp\{\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}\left(\frac{2\pi}{\tau_{n}}\right)^{p_{l}/2}{\mathcal{O}}(\tau_{n}^{-1/2}).

This gives

\log p_{n}(\mathbf{X}_{n}\mid\mathcal{M}_{l})-\log\frac{p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})}{\mathrm{det}\{\tau_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}^{1/2}}-\ell_{n}(\hat{\boldsymbol{\beta}}_{l,n})-\frac{p_{l}}{2}\log\frac{2\pi}{\tau_{n}}=\log\left[1+\frac{\mathrm{det}\{\tau_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}^{1/2}}{p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})}{\mathcal{O}}(\tau_{n}^{-1/2})\right].

Proposition 2 now follows since

\log\left[1+\frac{\mathrm{det}\{\tau_{n}^{-1}\mathbf{S}_{n}(\hat{\boldsymbol{\beta}}_{l,n})\}^{1/2}}{p(\hat{\boldsymbol{\beta}}_{l,n}\mid\mathcal{M}_{l})}{\mathcal{O}}(\tau_{n}^{-1/2})\right]=\log\{1+{\mathcal{O}}(\tau_{n}^{-1/2})\}={\mathcal{O}}(\tau_{n}^{-1/2})

and using condition C9. ∎
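The Laplace approximation invoked above can be illustrated in one dimension. The toy R sketch below, with an arbitrary smooth exponent g maximized at \hat{b} and a standard normal 'prior' p (none of which are the quantities of the paper), compares \int p(b)\exp\{\tau g(b)\}\mathrm{d}b with the leading Laplace term p(\hat{b})\exp\{\tau g(\hat{b})\}\sqrt{2\pi/(\tau|g^{\prime\prime}(\hat{b})|)}; the relative error shrinks as \tau grows.

# Toy 1-D check (illustrative only) of the Laplace approximation
#   integral p(b) exp{tau g(b)} db  ~  p(b.hat) exp{tau g(b.hat)} sqrt(2 pi / (tau |g''(b.hat)|))
# with g(b) = -(b - 1)^2 / 2 (so b.hat = 1, |g''(b.hat)| = 1) and a N(0,1) 'prior' p.
g <- function(b) -(b - 1)^2 / 2
b.hat <- 1; g2 <- 1

for (tau in c(10, 100, 1000)) {
  b <- seq(-4, 6, by = 1e-4)                                   # fine grid around b.hat
  exact   <- sum(dnorm(b) * exp(tau * g(b))) * 1e-4            # Riemann approximation
  laplace <- dnorm(b.hat) * exp(tau * g(b.hat)) * sqrt(2 * pi / (tau * g2))
  cat("tau =", tau, " relative error of Laplace approximation =",
      signif(laplace / exact - 1, 3), "\n")
}

Taking logarithms of the leading term in the display above, the factor (2\pi/\tau_{n})^{p_{l}/2} contributes -(p_{l}/2)\log\tau_{n} up to a bounded term, which is why \tau_{n}=\theta_{n}|W_{n}| plays the role of the sample size in the resulting BIC-type criterion.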

Appendix E Modified version of Theorem 2 in Waagepetersen and Guan (2009)

Consider a sequence of estimating functions u_{n}:\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}, n\geq 1, whose distribution is determined by some underlying probability measure \mathrm{P} generating the data at hand. For a matrix \mathbf{A}=[a_{ij}], \|\mathbf{A}\|_{M}=\max_{ij}|a_{ij}|, and we let \mathbf{J}_{n}(\boldsymbol{\theta})=-\frac{\mathrm{d}}{\mathrm{d}\boldsymbol{\theta}}u_{n}(\boldsymbol{\theta}), assuming that u_{n} is differentiable.

Theorem 2.

Assume that there exists a sequence of invertible symmetric matrices \mathbf{V}_{n} and a sequence of parameter values \boldsymbol{\theta}_{n}^{*}\in\mathbb{R}^{p} such that

  1. G1: \|{\mathbf{V}}_{n}^{-1}\|\rightarrow 0.

  2. G2: There exists an l>0 so that {\mathrm{P}}(l_{n}<l) tends to zero, where

     l_{n}=\inf_{\|{\boldsymbol{\phi}}\|=1}{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}{\mathbf{J}}_{n}({\boldsymbol{\theta}}_{n}^{*}){\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}}.

  3. G3: For any d>0,

     \sup_{\|{\mathbf{V}}_{n}({\boldsymbol{\theta}}-{\boldsymbol{\theta}}^{*}_{n})\|\leq d}\|{\mathbf{V}}_{n}^{-1}\{{\mathbf{J}}_{n}({\boldsymbol{\theta}})-{\mathbf{J}}_{n}({\boldsymbol{\theta}}^{*}_{n})\}{\mathbf{V}}_{n}^{-1}\|_{M}=\gamma_{nd}\rightarrow 0

     in probability under \mathrm{P}.

  4. G4: The sequence {\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}) is bounded in probability (i.e. for each \epsilon>0 there exists a d so that \mathrm{P}(\|{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})\|>d)\leq\epsilon for n sufficiently large).

Then for each \epsilon>0, there exists a d>0 such that

{\mathrm{P}}\left\{\exists\tilde{\boldsymbol{\theta}}_{n}:u_{n}(\tilde{\boldsymbol{\theta}}_{n})=0\text{ and }\|{\mathbf{V}}_{n}(\tilde{\boldsymbol{\theta}}_{n}-{\boldsymbol{\theta}}^{*}_{n})\|<d\right\}>1-\epsilon (29)

whenever n is sufficiently large.

Proof.

The event

\{\exists\tilde{\boldsymbol{\theta}}_{n}:u_{n}(\tilde{\boldsymbol{\theta}}_{n})=0\text{ and }\|{\mathbf{V}}_{n}(\tilde{\boldsymbol{\theta}}_{n}-{\boldsymbol{\theta}}^{*}_{n})\|<d\}

occurs if {\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})<0 for all {\boldsymbol{\phi}} with \|{\boldsymbol{\phi}}\|=d, since this implies u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})=0 for some \|{\boldsymbol{\phi}}\|<d (Lemma 2 in Aitchison and Silvey, 1958). Hence we need to show that there is a d such that

\mathrm{P}\left\{\sup_{\|{\boldsymbol{\phi}}\|=d}{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})\geq 0\right\}\leq\epsilon

for sufficiently large n. To this end we write

{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})={\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})-{\boldsymbol{\phi}}^{\top}\int_{0}^{1}{\mathbf{V}}_{n}^{-1}{\mathbf{J}}_{n}({\boldsymbol{\theta}}_{n}(t)){\mathbf{V}}_{n}^{-1}\mathrm{d}t\,{\boldsymbol{\phi}}

where {\boldsymbol{\theta}}_{n}(t)={\boldsymbol{\theta}}^{*}_{n}+t{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}}. Then

{\mathrm{P}}\left\{\sup_{\|{\boldsymbol{\phi}}\|=d}{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n}+{\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}})\geq 0\right\}\leq{\mathrm{P}}\left\{\sup_{\|{\boldsymbol{\phi}}\|=d}{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})\geq\inf_{\|{\boldsymbol{\phi}}\|=d}{\boldsymbol{\phi}}^{\top}\int_{0}^{1}{\mathbf{V}}_{n}^{-1}{\mathbf{J}}_{n}({\boldsymbol{\theta}}_{n}(t)){\mathbf{V}}_{n}^{-1}\mathrm{d}t\,{\boldsymbol{\phi}}\right\}
\leq{\mathrm{P}}\left[\|{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})\|\geq d\inf_{\|{\boldsymbol{\phi}}\|=1}\left\{{\boldsymbol{\phi}}^{\top}{\mathbf{V}}_{n}^{-1}{\mathbf{J}}_{n}({\boldsymbol{\theta}}^{*}_{n}){\mathbf{V}}_{n}^{-1}{\boldsymbol{\phi}}\right\}-dp\gamma_{nd}\right]
\leq{\mathrm{P}}\left(\|{\mathbf{V}}_{n}^{-1}u_{n}({\boldsymbol{\theta}}^{*}_{n})\|\geq dl_{n}/2\right)+{\mathrm{P}}\left(p\gamma_{nd}>l_{n}/2\right).

The first term can be made arbitrarily small by picking a sufficiently large d and letting n tend to infinity. The second term converges to zero as n tends to infinity. ∎
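A toy instance of the theorem may help fix ideas: take u_{n}(\theta)=\sum_{i=1}^{n}(X_{i}-\theta) with X_{i} independent N(\theta_{n}^{*},1), so that \mathbf{J}_{n}=n, and take \mathbf{V}_{n}=\sqrt{n}. Conditions G1-G4 then hold with l_{n}=1 and \gamma_{nd}=0, and (29) asserts a root \tilde{\theta}_{n} (here the sample mean) within \mathcal{O}_{\mathrm{P}}(\mathbf{V}_{n}^{-1}) of the drifting value \theta_{n}^{*}. The R sketch below (with an arbitrary drifting sequence, purely illustrative) confirms that \sqrt{n}(\tilde{\theta}_{n}-\theta_{n}^{*}) remains stable as n grows.

# Toy instance (illustrative only): u_n(theta) = sum_i (X_i - theta), X_i ~ N(theta.star.n, 1),
# J_n = n and V_n = sqrt(n).  The root of u_n is mean(x), and V_n * (root - theta.star.n)
# should stay bounded in probability even though theta.star.n drifts with n, cf. (29).
set.seed(4)
for (n in c(100, 1000, 10000)) {
  theta.star.n <- 1 + 1 / sqrt(n)                       # an arbitrary drifting target value
  dev <- replicate(2000, {
    x <- rnorm(n, mean = theta.star.n)
    sqrt(n) * (mean(x) - theta.star.n)
  })
  cat("n =", n, " sd of V_n * (root - theta.star.n) =", round(sd(dev), 3), "\n")
}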

References

  • Aitchison and Silvey [1958] J. Aitchison and S. D. Silvey. Maximum-likelihood estimation of parameters subject to restraints. The Annals of Mathematical Statistics, 29:813–825, 1958.
  • Akaike [1973] H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle, pages 199–213. Springer New York, New York, NY, 1973.
  • Baddeley and Turner [2000] A. Baddeley and R. Turner. Practical maximum pseudolikelihood for spatial point patterns. Australian & New Zealand Journal of Statistics, 42(3):283–322, 2000.
  • Baddeley et al. [2015] A. Baddeley, E. Rubak, and R. Turner. Spatial Point Patterns: Methodology and Applications with R. CRC Press, 2015.
  • Berman and Turner [1992] M. Berman and R. Turner. Approximating point process likelihoods with GLIM. Applied Statistics, 41:31–38, 1992.
  • Choiruddin et al. [2018] A. Choiruddin, J.-F. Coeurjolly, and F. Letué. Convex and non-convex regularization methods for spatial point processes intensity estimation. Electronic Journal of Statistics, 12:1210–1255, 2018.
  • Claeskens and Hjort [2008] G. Claeskens and N.L. Hjort. Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2008.
  • Coeurjolly and Lavancier [2019] J.-F. Coeurjolly and F. Lavancier. Understanding spatial point patterns through intensity and conditional intensities. In Stochastic Geometry, number 2237 in Lecture Notes in Mathematics, chapter 2. Springer Verlag, 2019.
  • Condit [1998] R. Condit. Tropical Forest Census Plots. Springer-Verlag and R. G. Landes Company, Berlin, Germany and Georgetown, Texas, 1998.
  • Condit et al. [1996] R. Condit, S.P. Hubbell, and R.B. Foster. Changes in tree species abundance in a neotropical forest: impact of climate change. Journal of Tropical Ecology, 12(2):231–256, 1996.
  • Gao and Song [2010] X. Gao and P. X.-K. Song. Composite likelihood Bayesian information criteria for model selection in high-dimensional data. Journal of the American Statistical Association, 105:1531–1540, 2010.
  • Guan and Loh [2007] Y. Guan and J. M. Loh. A thinned block bootstrap procedure for modeling inhomogeneous spatial point patterns. Journal of the American Statistical Association, 102:1377–1386, 2007.
  • Hubbell and Foster [1983] S. P. Hubbell and R. B. Foster. Diversity of canopy trees in a neotropical forest and implications for conservation. In S. L. Sutton, T. C. Whitmore, and A. C. Chadwick, editors, Tropical Rain Forest: Ecology and Management, pages 25–41. Blackwell Scientific Publications, Oxford, 1983.
  • Jeffrey et al. [2018] D. Jeffrey, J. Horrocks, and G. J. Umphrey. Penalized composite likelihoods for inhomogeneous Gibbs point process models. Computational Statistics & Data Analysis, 124:104–116, 2018.
  • Łapiński [2019] T. M. Łapiński. Multivariate Laplace’s approximation with estimated error and application to limit theorems. Journal of Approximation Theory, 248:105305, 2019.
  • Lavancier et al. [2020] F. Lavancier, A. Poinas, and R. Waagepetersen. Adaptive estimating function inference for non-stationary determinantal point processes. Scandinavian Journal of Statistics, 2020. Appeared online.
  • Lebarbier and Mary-Huard [2006] E. Lebarbier and T. Mary-Huard. Une introduction au critère BIC: fondements théoriques et interprétation. Journal de la Société Française de Statistique, 147(1):39–58, 2006.
  • Møller and Waagepetersen [2004] J. Møller and R. P. Waagepetersen. Statistical Inference and Simulation for Spatial Point Processes. Chapman and Hall/CRC, Boca Raton, 2004.
  • Møller and Waagepetersen [2017] J. Møller and R. P. Waagepetersen. Some recent developments in statistics for spatial point processes. Annual Review of Statistics and Its Application, 4:317–342, 2017.
  • Rathbun and Cressie [1994] S. L. Rathbun and N. Cressie. Asymptotic properties of estimators for the parameters of spatial inhomogeneous Poisson point processes. Advances in Applied Probability, 26:122–154, 1994.
  • Schoenberg [2005] F.P. Schoenberg. Consistent parametric estimation of the intensity of a spatial-temporal point process. Journal of Statistical Planning and Inference, 128:79–93, 2005.
  • Schwarz [1978] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
  • Thurman et al. [2015] A. L. Thurman, R. Fu, Y. Guan, and J. Zhu. Regularized estimating equations for model selection of clustered spatial point processes. Statistica Sinica, pages 173–188, 2015.
  • Tierney and Kadane [1986] L. Tierney and J.B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81:82–86, 1986.
  • Varin and Vidoni [2005] C. Varin and P. Vidoni. A note on composite likelihood inference and model selection. Biometrika, 92(3):519–528, 2005.
  • Waagepetersen [2007] R. Waagepetersen. An estimating function approach to inference for inhomogeneous Neyman-Scott processes. Biometrics, 63:252–258, 2007.
  • Waagepetersen and Guan [2009] R. Waagepetersen and Y. Guan. Two-step estimation for inhomogeneous spatial point processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71:685–702, 2009.