This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Robust minimum divergence estimation
in a spatial Poisson point process

Yusuke Saigusa1, Shinto Eguchi2 and Osamu Komori3
1Department of Biostatistics, School of medicine, Yokohama City University, 3-9 Fukuura, Kanazawa, Yokohama, Kanagawa 236-0004, Japan
2The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
3Department of Computer and Information Science, Seikei University, 3-3-1 Kichijojikitamachi, Musashino, Tokyo 180-8633, Japan

Abstract

Species distribution modeling (SDM) plays a crucial role in investigating habitat suitability and addressing various ecological issues. While likelihood analysis is commonly used to draw ecological conclusions, it has been observed that its statistical performance is not robust when faced with slight deviations due to misspecification in SDM. We propose a new robust estimation method based on a novel divergence for the Poisson point process model. The proposed method is characterized by weighting the log-likelihood equation to mitigate the impact of heterogeneous observations in the presence-only data, which can result from model misspecification. We demonstrate that the proposed method improves the predictive performance of the maximum likelihood estimation in our simulation studies and in the analysis of vascular plant data in Japan.

Keywords:

model misspecification, Poisson point process model, presence-only, robustness, sampling bias, species distribution modeling.

1 Introduction

Species distribution modeling (SDM) is critical to evaluate biodiversity, conservation and management planning, impacts on human activities, and so on (Fithian and Hastie, 2013; Schmeller et al., 2018). The presence-absence (PA) data for a species of interest require dedicated surveys per spatial unit; thus, such data are difficult and expensive to obtain. In some cases, the only available data are from museum or herbarium records of locations where the species were observed. In recent years, the development of geographic information systems has enabled ecologists to obtain knowledge of the environment of the study area without conducting field surveys. As a result, presence-only (PO) data have become readily available and research based on the PO data has become more active. However, the PO data often consist of observational surveys rather than the designed surveys, and thus require special attention in the data analysis.

Several statistical and machine learning approaches have been developed to estimate the relative probability or intensity that reflects habitat suitability or abundance of the species of interest. Some methods are based on the joint likelihood of the presence and environmental variables, or on the partial likelihood (Lancaster and Imbens, 1996; Lele and Keim, 2006; Lele, 2009). The application of the spatial Poisson point process (PPP) model to the PO data has been proposed (Warton and Shepherd, 2010; Chakraborty et al., 2011); this model is closely related to the maximum entropy (Maxent) model familiar to ecologists (Fithian and Hastie, 2013; Renner and Warton, 2013). These models can be considered essentially the same method in the sense that they are based on equivalent likelihoods and estimate the equivalent relative probability (or intensity) of presence (Wang and Stone, 2019).

The present paper focuses on the spatial PPP model. In the data analysis of the PO data, it is necessary to be aware of bias due to some reasons. Various methods have been discussed to deal with the sampling bias because the PO data usually suffer from the sampling bias due to heterogeneity in sampling efforts. These methods involve identifying and estimating the effect of bias on presence and then thinning or eliminating it (Dudík et al., 2005; Fithian et al., 2015; Komori et al., 2020). On the other hand, heterogeneous observations of presence occur due to several reasons, such as incorrect or missing geo-coordinates, taxonomic misidentification, and taxonomic shifts in the PO data (Thessen and Patterson, 2011; Wiser, 2016; Serra-Diaz et al., 2017). In fact, Serra-Diaz et al. (2017) noted that a median of 10%\% (up to 30%\%) of presence locations per species across 60,065 tree species were either in highly urbanized areas or outside typical habitat areas, which can have a significant effect in assessing habitat suitability. The heterogeneous observation can have a negative impact on the predictive performance of SDM, e.g., area under the operating characteristic curve (AUC), sensitivity, specificity, and true skill statistic (Liu et al., 2018). Filtering the dataset improves the performance of SDM; however, the contaminated data cannot be automatically corrected because the specialized knowledge and the time-consuming manual checking are required (Belbin et al., 2013; Mesibov, 2013; Velásquez-Tibatá et al., 2019). For large datasets, manually checking the species information introduces exorbitant costs.

Maximum likelihood estimation is usually employed for the PPP model, although it is known that the maximum likelihood estimation is not robust against heterogeneous observations. Despite the PO data are frequently contaminated by heterogeneous observations, to the best of our knowledge, statistical methods for dealing with this issue are rarely discussed, e.g., an M-estimation (Assunção and Guttorp, 1999), and a residual analysis for spatial point process (Baddeley et al., 2005). Robust methods for parameter estimation have been developed on the basis of information divergences to which the Kullback-Leibler (KL) divergence extends (see, e.g., Basu et al., 1998; Fujisawa and Eguchi, 2008; Eguchi and Komori, 2022). The present paper proposes a robust parameter estimation based on the Bregman divergence of intensity functions for the thinned PPP models. The proposed method is characterized by weighting the log-likelihood equation by an influence of heterogeneous observation.

The rest of this paper is organized as follows. Section 2 introduces the spatial PPP model. Section 3 provides a robust parameter estimation method for the thinned PPP model and discusses the robustness to the heterogeneous observation. Section 4 conducts some simulation studies to evaluate the performance of the proposed method. Section 5 presents the analysis of the vascular plant data in Japan. Finally, Section 6 discusses the findings.

2 Spatial Poisson point process model

Consider a PPP that occurs in the two-dimensional Euclidean space 2\mathbb{R}^{2}. For a study area 𝒜2\mathscr{A}\subset\mathbb{R}^{2}, let λ(s)\lambda(s) denote an intensity function for any site s𝒜s\in\mathscr{A} and let {s1,,sm}\{s_{1},\dots,s_{m}\} denote mm presence locations in 𝒜\mathscr{A}. Assume that (i) the total number mm is a sample from a Poisson distribution with intensity 𝒜λ(s)𝑑s\int_{\mathscr{A}}\lambda(s)ds and that (ii) the presence locations (si)(s_{i}) are independent and identically distributed samples of a random variable with the probability function λ(s)/𝒜λ(s)𝑑s\lambda(s)/\int_{\mathscr{A}}\lambda(s)ds for s𝒜s\in\mathscr{A}. The intensity function can be modeled in a complicated form. For simplicity, this paper assumes a log-linear model for the intensity function; that is, logλ𝜷(s)=𝜷𝒙(s)\log\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)=\mbox{\boldmath$\beta$}^{\top}\mbox{\boldmath$x$}(s), where 𝜷=(β0,β1,,βp)\mbox{\boldmath$\beta$}=(\beta_{0},\beta_{1},\dots,\beta_{p})^{\top} is a coefficient parameter vector, and 𝒙(s)=(1,x1(s),,xp(s))\mbox{\boldmath$x$}(s)=(1,x_{1}(s),\dots,x_{p}(s))^{\top} is an environmental variable vector at a location ss. To deal with the bias due to the heterogeneity of sampling effort, the thinned PPP was developed (Dudík et al., 2005; Fithian et al., 2015). Consider the modeling of a PPP by the thinning PPP with the intensity λ𝜽(s)=λ𝜷(s)b𝜶(s)\lambda_{\mbox{\tiny\boldmath$\theta$}}(s)=\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)b_{\mbox{\tiny\boldmath$\alpha$}}(s) using a detection probability b𝜶(s)b_{\mbox{\tiny\boldmath$\alpha$}}(s) for a site s𝒜s\in\mathscr{A}, where 𝜶=(α1,,αq)\mbox{\boldmath$\alpha$}=(\alpha_{1},\dots,\alpha_{q})^{\top} is a parameter vector and 𝜽=(𝜷,𝜶)\mbox{\boldmath$\theta$}=(\mbox{\boldmath$\beta$}^{\top},\mbox{\boldmath$\alpha$}^{\top})^{\top}. This paper assumes a logistic-linear model for the detection probability; that is, logitb𝜶(s)=𝜶𝒛(s)\mathop{\rm logit}b_{\mbox{\tiny\boldmath$\alpha$}}(s)=\mbox{\boldmath$\alpha$}^{\top}\mbox{\boldmath$z$}(s), where 𝒛(s)=(z1(s),,zq(s))\mbox{\boldmath$z$}(s)=(z_{1}(s),\dots,z_{q}(s))^{\top} is a vector of sampling-bias variable, e.g., distance from a road or an urbanized area.

The log-likelihood function of the thinned PPP model is defined by

l(𝜽)=i=1mlog{λ𝜽(si)}𝒜λ𝜽(s)𝑑s.\displaystyle l(\mbox{\boldmath$\theta$})=\sum_{i=1}^{m}\log\left\{\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i})\right\}-\int_{\mathscr{A}}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s)ds. (1)

Noting that the integral over a study area in equation (1) cannot be exactly calculated, a numerical approximation method has been proposed to estimate the integral using quadrature weights (Berman and Turner, 1992). Without loss of generality, we set up a location vector {s1,,sr}\{s_{1},\dots,s_{r}\} when the study area 𝒜\mathscr{A} is split into nn grid cells, where r=n+mm(n)r=n+m-m^{(n)}, m(n)m^{(n)} is the number of grid cells that contain at least one presence location, and {sm+1,,sr}\{s_{m+1},\dots,s_{r}\} are the centers of the grid cells that contain no presence location. The approximated log-likelihood function is then given by

l(𝜽)=i=1r[dilog{λ𝜽(si)}wiλ𝜽(si)],\displaystyle l(\mbox{\boldmath$\theta$})=\sum_{i=1}^{r}\left[d_{i}\log\left\{\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i})\right\}-w_{i}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i})\right], (2)

where di=I(i{1,,m})d_{i}=I(i\in\{1,\dots,m\}), I()I(\cdot) is the indicator function, and wiw_{i} is a quadrature weight for a location sis_{i}. We regard the area of a grid cell divided by the number of locations {s1,,sr}\{s_{1},\dots,s_{r}\} contained in the cell as the quadrature weight wiw_{i} in the same manner as Renner and Warton (2013). That is, wi=|𝒜|/(nmi+)w_{i}=|\mathscr{A}|/(nm_{i}^{+}), where |𝒜||\mathscr{A}| is the area of 𝒜\mathscr{A}, mi+=max{1,mi}m_{i}^{+}=\max\{1,m_{i}\}, and mim_{i} is the number of presence locations at a grid cell sis_{i}. The likelihood equations can be obtained by

𝜷l(𝜽)\displaystyle\frac{\partial}{\partial\mbox{\boldmath$\beta$}}l(\mbox{\boldmath$\theta$}) =i=1rε𝜷(𝜽,si)=i=1r{diwiλ𝜽(si)}𝒙(si)=𝟎1+p,\displaystyle=\sum_{i=1}^{r}\varepsilon_{\mbox{\tiny\boldmath$\beta$}}(\mbox{\boldmath$\theta$},s_{i})=\sum_{i=1}^{r}\{d_{i}-w_{i}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i})\}\mbox{\boldmath$x$}(s_{i})=\mbox{\boldmath$0$}_{1+p}, (3)
𝜶l(𝜽)\displaystyle\frac{\partial}{\partial\mbox{\boldmath$\alpha$}}l(\mbox{\boldmath$\theta$}) =i=1rε𝜶(𝜽,si)=i=1r11+exp(𝜶𝒛(si)){diwiλ𝜽(si)}𝒛(si)=𝟎q,\displaystyle=\sum_{i=1}^{r}\varepsilon_{\mbox{\tiny\boldmath$\alpha$}}(\mbox{\boldmath$\theta$},s_{i})=\sum_{i=1}^{r}\frac{1}{1+\exp\left(\mbox{\boldmath$\alpha$}^{\top}\mbox{\boldmath$z$}(s_{i})\right)}\{d_{i}-w_{i}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i})\}\mbox{\boldmath$z$}(s_{i})=\mbox{\boldmath$0$}_{q}, (4)

where 𝟎k\mbox{\boldmath$0$}_{k} indicates a kk-dimensional zero vector. Solving the likelihood equations, the maximum likelihood estimator (MLE) of parameter 𝜽\theta is obtained. The intensity function is estimated by plugging the MLEs into the intensity model, λ𝜷(s)=exp{𝜷𝒙(s)}\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)=\exp\{\mbox{\boldmath$\beta$}^{\top}\mbox{\boldmath$x$}(s)\}, and setting 𝜶=𝟎\mbox{\boldmath$\alpha$}=\mbox{\boldmath$0$} that means the sampling-bias effect is removed.

If the species distribution model is correctly specified, then the maximum likelihood method would yield accurate conclusions, even in the presence of sampling bias. However, we must be cautious in situations where model misspecification occurs in practical ecological studies. A non-negligible degree of misspecification can render the MLE unreliable and lead to incorrect inference. For instance, unobservable feature variables might cause the underlying intensity function to deviate slightly from parametric intensity functions, such as interactions with other species. The following section will address this issue and propose a new estimation method.

3 Robust parameter estimation

We are concerned with various possibilities for the misspecification of the thinned PPP model with the parametric intensity function λ𝜽(s)\lambda_{\mbox{\tiny\boldmath$\theta$}}(s) as introduced in section 2. Our main objective is to propose a robust estimation method for the parameter 𝜽\theta. To achieve this, we employ a weighted likelihood equation approach. We focus on the values of parametric intensities {λ𝜽(si):i=1,..,r}\{\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}):i=1,..,r\}, which should represent the species abundance. Suppose that occurrences are partially generated by a heterogeneous point process different from the assumed model with the parameter 𝜽\theta. In this case, the values of parametric intensities corresponding to heterogeneous processes have a small magnitude. In light of this, we propose a weighted likelihood equation for 𝜽=(𝜷,𝜶)\mbox{\boldmath$\theta$}=(\mbox{\boldmath$\beta$}^{\top},\mbox{\boldmath$\alpha$}^{\top})^{\top} as follows:

i=1rF(τλ𝜽(si))ε𝜷(𝜽,si)\displaystyle\sum_{i=1}^{r}F(\tau\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}))\varepsilon_{\mbox{\tiny\boldmath$\beta$}}(\mbox{\boldmath$\theta$},s_{i}) =𝟎1+p,\displaystyle=\mbox{\boldmath$0$}_{1+p}, (5)
i=1rF(τλ𝜽(si))ε𝜶(𝜽,si)\displaystyle\sum_{i=1}^{r}F(\tau\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}))\varepsilon_{\mbox{\tiny\boldmath$\alpha$}}(\mbox{\boldmath$\theta$},s_{i}) =𝟎q,\displaystyle=\mbox{\boldmath$0$}_{q}, (6)

where τ>0\tau>0 is a constant tuning parameter and F()F(\cdot) is a cumulative distribution function for the value of the intensity function. We refer to the estimator of 𝜷\beta of interest, say 𝜷^τ\hat{\mbox{\boldmath$\beta$}}_{\tau}, based on equations (5) and (6) as the minimum intensity divergence estimator (MIDE). Here we adopt the Pareto type II distribution

F(x)=1(1+νx)1ν\displaystyle F(x)=1-\big{(}1+{\nu}x\big{)}^{-\frac{1}{\nu}} (7)

for x>0x>0, where ν>0\nu>0 is a shape parameter. See Figure 1. This distribution has a long tail, which achieves stable estimation by preventing extremely large weights. For practical purposes, we will fix ν=1\nu=1. The weight function F(τλ(si,θ))F(\tau\lambda(s_{i},\theta)) expresses the magnitude of the intensity function at point sis_{i}, and calibrates the model correctness by suppressing the influence of heterogeneous observations in the weighted likelihood function. We derive the estimating equations (5) and (6) in terms of minimization of a specific example of Bregman divergence from the true intensity function to the parametric intensity function λ𝜽(s)\lambda_{\mbox{\tiny\boldmath$\theta$}}(s). A detailed derivation of equations (5) and (6) are provided in Appendix A. The property of the divergence automatically leads to the consistency of the proposed estimator for 𝜽\theta. The consistency and asymptotic normality of the MIDE can be derived (see Appendix B for the details). When the tuning parameter τ\tau goes to \infty, then the MIDE reduces to the MLE from equations (3) and (4) because all the weights of the estimating equations are equal to 1. Therefore, the MLE has no chance of suppressing the influence of heterogeneous observations.

Refer to caption
Figure 1: Plots of the Parato type II distribution function with ν=0.5,1.0,2.0,5.0\nu=0.5,1.0,2.0,5.0.

To simultaneously carry out the shrinkage estimation and the variable selection, we consider the penalized loss function with an L1L_{1} penalty (Tibshirani, 1996) as follows.

lΞϕ(𝜷,𝜶)=lΞ(𝜷,𝜶)+k=0pϕk|βk|,\displaystyle l^{\phi}_{\Xi}(\mbox{\boldmath$\beta$},\mbox{\boldmath$\alpha$})=l_{\Xi}(\mbox{\boldmath$\beta$},\mbox{\boldmath$\alpha$})+\sum_{k=0}^{p}\phi_{k}|\beta_{k}|, (8)

where lΞl_{\Xi} is the loss function minimized by solving the estimating equations (5) and (6) and is given by equation (13) in Appendix A, ϕ0=0\phi_{0}=0, and ϕk=ϕ\phi_{k}=\phi (k0k\neq 0, ϕ0\phi\geq 0) is a constant tuning parameter. The loss function has no penalties for the intercept parameter β0\beta_{0} of the intensity function and the coefficient paramter 𝜶\alpha for the detection probability model. The gradient ascent method (Goeman, 2010) can be used to compute the L1L_{1} penalized estimates. The detailed computation algorithm is provided in Appendix C.

The root trimmed mean squared prediction error (RTMSPE) is employed to select the appropriate values of the tuning parameters τ\tau and ϕ\phi by removing heterogeneous observations. The RTMSPE is given by

RTMSPEδ=1hδi=1hδe[i]2(0<δ<1),\displaystyle{\rm RTMSPE}_{\delta}=\sqrt{\frac{1}{h_{\delta}}\sum_{i=1}^{h_{\delta}}e^{2}_{[i]}}\qquad(0<\delta<1), (9)

where hδ=(n+1)δh_{\delta}=\lfloor(n+1)\delta\rfloor and e[1]2e[n]2e^{2}_{[1]}\leq\cdots\leq e^{2}_{[n]} are the order statistics of [m[1]λ𝜽^(s[1])]2,,[m[n]λ𝜽^(s[n])]2[m_{[1]}-\lambda_{\hat{\mbox{\tiny\boldmath$\theta$}}}(s_{[1]})]^{2},\dots,[m_{[n]}-\lambda_{\hat{\mbox{\tiny\boldmath$\theta$}}}(s_{[n]})]^{2} for grid cells (s[1],,s[n]s_{[1]},\dots,s_{[n]}) that do not overlap each other and the number of observations (m[1],,m[n]m_{[1]},\dots,m_{[n]}) in the cells with a estimate 𝜽^\hat{\mbox{\boldmath$\theta$}}.

4 Simulation

We evaluated the robustness of the MIDE compared with that of the MLE. Two types of situations were considered: one where the data were generated from the target PPP distribution with no contamination, and another where the data were contaminated by heterogeneous distribution.

Consider target and contamination PPPs on a study area 𝒮\mathscr{S} divided into 2,000 grid cells. The intensity function of the target PPP is λ𝜷(s)=exp(𝜷𝒙(s))\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)=\exp(\mbox{\boldmath$\beta$}^{\top}\mbox{\boldmath$x$}(s)), 𝜷=(β0,β1,β2,β3,β4)\mbox{\boldmath$\beta$}=(\beta_{0},\beta_{1},\beta_{2},\beta_{3},\beta_{4})^{\top}, and that of the contamination PPP is λ𝜸(s)=exp(𝜸𝒙(s))\lambda_{\mbox{\tiny\boldmath$\gamma$}}(s)=\exp(\mbox{\boldmath$\gamma$}^{\top}\mbox{\boldmath$x$}(s)), 𝜸=(γ0,γ1,γ2,γ3,γ4)\mbox{\boldmath$\gamma$}=(\gamma_{0},\gamma_{1},\gamma_{2},\gamma_{3},\gamma_{4})^{\top}. We note that λ𝜸(s)\lambda_{\mbox{\tiny\boldmath$\gamma$}}(s) is in the log-linear model, but the parameter 𝜸\gamma is specified a totally different value of 𝜷\beta. The detection probability is b(s)=expit(𝜶𝒛(s))b(s)=\mathop{\rm expit}(\mbox{\boldmath$\alpha$}^{\top}\mbox{\boldmath$z$}(s)), 𝜶=(1,1)\mbox{\boldmath$\alpha$}=(1,-1)^{\top}. In the no contamination case, the simulated data were sampled from the thinned target distribution with the intensity λ𝜷(s)b(s)\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)b(s). In the contamination case, the data were sampled from a thinned superposed PPP for the target and contamination distributions with the intensity (λ𝜷(s)+λ𝜸(s))b(s)(\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)+\lambda_{\mbox{\tiny\boldmath$\gamma$}}(s))b(s). The expected contamination rate is Σsλ𝜸(s)/(Σsλ𝜷(s)+Σsλ𝜸(s))\Sigma_{s}\lambda_{\mbox{\tiny\boldmath$\gamma$}}(s)/(\Sigma_{s}\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)+\Sigma_{s}\lambda_{\mbox{\tiny\boldmath$\gamma$}}(s)). The true values of parameters 𝜷\beta and 𝜸\gamma were set such that the expected contamination rate was sufficiently small. Moreover, the intensity of the contamination distribution was relatively large in some areas where the intensity of the target distribution was relatively small. Therefore, we set γi=βi\gamma_{i}=-\beta_{i}, i0i\neq 0. The environmental variable and the bias variable were generated from the standard normal distribution. The number of presence locations mm was generated from a Poisson distribution with the mean which is the total intensity summed over the study area. The presence locations were then generated from the multinomial distribution with the probability which is the intensity divided by the total intensity.

We investigated the performances of MLE and MIDE via three simulation scenarios: a no-contamination case, and light- and heavy-contamination cases with expected contamination rates of about 10%\% and 20%\%, respectively. The 200 datasets were simulated in each case. The tuning parameter τ\tau was selected by a grid search among the candidate values {0.1,1,5,10,20,}\{0.1,1,5,10,20,\infty\} to minimize the RTMSPE0.9 in simulations. The details of the simulation settings and results for the slope parameters of interest are provided in Figures 2a-c. In addition, the selected percentages of values of tuning parameter τ\tau are given in Table 1. In the no-contamination case, both the MLE and MIDE correctly estimated the true values of the coefficient parameters. For the MIDE, 21.5%\% of simulations selected τ\tau as \infty (the MLE). In both the light- and heavy-contamination cases, the MIDEs of the coefficient parameters were close to the true values of the target distribution, whereas the MLEs were not. For the results of MIDE, the variations of estimates were slightly larger in the heavy-contamination case than in the light-contamination case; however, the estimates were given around the true values in both cases. The value of τ\tau was selected as not \infty in 96.5-100%\% of the simulations, and the MLE was selected as 0-3.5%\% in the light- and heavy-contamination cases.

Refer to caption

(a) No-contamination case

Figure 2: Boxplots of the parameter estimates of the MLE and MIDE for (a) 𝜷=(2,1,1,1,1)\mbox{\boldmath$\beta$}=(-2,1,1,-1,-1)^{\top} in the no-contamination case, (b) 𝜷=(2,1,1,1,1)\mbox{\boldmath$\beta$}=(-2,1,1,-1,-1)^{\top} and 𝜸=(4.2,1,1,1,1)\mbox{\boldmath$\gamma$}=(-4.2,-1,-1,1,1)^{\top} in the light-contamination case, (c) 𝜷=(2,1,1,1,1)\mbox{\boldmath$\beta$}=(-2,1,1,-1,-1)^{\top} and 𝜸=(3.4,1,1,1,1)\mbox{\boldmath$\gamma$}=(-3.4,-1,-1,1,1)^{\top} in the heavy-contamination case. Red points indicate the true values. The tuning parameters τ\tau and ϕ\phi were selected on the basis of the RTMSPE0.9.
Refer to caption

(b) Light-contamination case

Refer to caption

(c) Heavy-contamination case

Figure 2: (continued)
Table 1: Percentages of simulations when the tuning parameter τ\tau of the MIDE was selected to be each value in the simulations with no-, light- and heavy-contamination cases for qq of the RTMSPE.
Value of τ\tau of the MIDE
0.1 1 5 10 20 \infty (MLE)
(a) No-contamination case
0.063 0.038 0.025 0.253 0.405 0.215
(b) Light-contamination case
0.050 0.110 0.100 0.245 0.460 0.035
(c) Heavy-contamination case
0.030 0.134 0.164 0.373 0.299 0.000

5 Data analysis

The PO data and the independent PA data of vascular plants in Japan have been compiled in Kubota et al. (2015). These two datasets for 40 species are available in the R package ‘qPPP’ (Komori et al., 2020). The study area was split into 10km ×\times 10km grid cells (n=4,684n=4,684) on the basis of a regular mesh in the PO data. Duplicate presence locations in each cell were removed to reduce spatial clumping and to avoid inflating of the model accuracy (Veloz, 2009). The study area was divided into seven subregions: Hokkaido, Tohoku, Kanto, Chubu, Kinki, Chugoku-Shikoku, and Kyushu regions. Subregions with a small number of observations or large discrepancy between the observations of PO and PA data were excluded from the analysis. Therefore, the thinned PPP model was applied to the 27 combinations of species and subregions where m50m\geq 50 and the proportion of presence locations of intersection of the PO and PA data to those of the union was greater than 0.5. We then calculated the MLE and MIDE with the tuning parameter τ\tau selected among {0.1,1,5,10,20,}\{0.1,1,5,10,20,\infty\} to minimize the RTMSPE0.9. The 37 environmental variables included in the qPPP package were used as shown later. The sampling-bias variable was the number of presence in the target group (Dudík et al., 2005; Phillips et al., 2009). All the environmental and bias variables were standardized to have mean 0 and variance 1. The predictive performances of the MLE and MIDE were evaluated on the basis of the AUC calculated from the independent PA data.

Fig 3 displays the AUCs calculated from the PA data based on the MLE and MIDE. This figure shows that the AUCs for the MIDE were similar to or improved over those for the MLE in most combinations of species and subregions.

Refer to caption
Figure 3: AUCs calculated from the PA data on the basis of the MLE and MIDE for combinations of species and subregions.
Refer to caption

(a) PO data

Refer to caption

(b) PA data

Refer to caption

(c) MLE

Refer to caption

(d) MIDE

Figure 4: The presence locations of CarpinusCarpinus LaxifloraLaxiflora in the Chugoku-Shikoku region (a) in the PO data and (b) in the PA data. The red tiles indicate the presence locations. The light blue tiles in (a) indicate the pseudo-absence locations, and the white tiles in (b) indicate the absence locations. The standardized predicted intensity functions of the (c) MLE and (d) MIDE, where higher values indicate better habitat suitability.
Refer to caption
Figure 5: The MLE and MIDE of the regression parameter 𝜷\beta for the environmental and bias variables for CarpinusCarpinus LaxifloraLaxiflora in the Chugoku-Shikoku region.
Refer to caption
Figure 6: Boxplots of the weight of the estimating function of the MIDE for two groups. Group A is the presence locations in the PO data that have the presence locations in the PA data within the area of a disc with a radius of 5km centered at the location, whereas Group B is the other presence locations in the PO data.

The PO and PA data of CarpinusCarpinus LaxifloraLaxiflora in the Chugoku-Shikoku region are illustrated in Figures 4a-b. The number of presence locations is 121, and the number of pseudo-absence locations is 254 in the PO data. Most of the presence locations in the PA data are included among the presence locations in the PO data. Conversely, if the PA data are considered as the true distribution, some presence locations included only in the PO data may be suspicious. The predictive performance was improved; the AUC based on the MLE was 0.6260.626, and that based on the MIDE with the selected value τ=1\tau=1 was 0.7070.707. The standardized predicted intensity functions for the PA data of the MLE and MIDE are displayed in Figures 4c-d. Some characteristic differences between the predicted intensity functions based on the MLE and MIDE were observed. In particular, results based on MIDE showed that the northern region has a relatively high habitat suitability, whereas results based on MLE showed no such trend. The estimated regression coefficients for the environmental variables were consistent in sign between the MLE and MIDE, although differences were observed in the magnitude of the values (Figure 5). The estimated coefficient on the bias variable was positive and may correctly reflect the amount of sampling effort.

To explain the improvement in the AUC, we compared the weights of the estimating function of the MIDE between two groups of the presence locations in the PO data. One group had presence locations in the PA data within the area of a disc of radius of 5km centered at the location and the other group had no presence location in the vicinity. That is, the former locations were relatively reliable, whereas the latter locations might be associated with suspicious data. The weights for the latter locations were relatively smaller than those for the former locations (Figure 6). Therefore, the MIDE might reduce the effect of suspect observational information by adding weights to the estimation function, resulting in better prediction results.

6 Discussion

We proposed a new robust estimation method for the thinned PPP model against heterogeneous observations in which a species is observed even though the probability (or intensity) of its presence is low. The weight of the estimating function of the proposed estimator plays an important role in the robust and stable estimation.

We attempted to consider estimators based on information divergences other than the proposed divergence, but they were unstable. The loss functions can be considered based on the β\beta- and γ\gamma-divergences (Basu et al., 1998; Fujisawa and Eguchi, 2008). Then the estimating functions have weights that increases exponentially with the magnitude of the model intensity. In our simulation studies, the results based on the estimating functions differed dramatically and the computation algorithm did not converge when the value of the tuning parameter which is an exponent was slightly changed. At that time, the weights of the power of the intensity function were extremely large for a small subset of the data, which might have led to the unstable estimation. On the other hand, the proposed estimator uses a weight that is rescaled by a distribution function FF and drops the weight into the range from 0 to 1, which might have achieved the stable estimation.

The proposed method can be used in combination with some other useful statistical methods. The methods to reduce sampling bias by filtering the dataset (e.g., the target-group background method (Phillips et al., 2009), and the spatial thinning (Veloz, 2009) employed in Section 5) can be used simultaneously. If there is spatial dependence (or autocorrelation) in species distribution (Dormann, 2007), the proposed estimator can be used to train a spatial dependence model such as the area-interaction process (Baddeley and van Lieshout, 1995). Further study is needed for robust estimation by the proposed method of mixed-effects models such as the Cox process (Møller et al., 1998).

Author Contributions

SE conceived the ideas and designed methodology; YS and OK conducted the simulation and analysed the data; YS wrote the original draft; All authors contributed critically to the drafts and gave final approval for publication.

Supplemental Material

An R code implementing the proposed method is available at GitHub
(https://github.com/saigusay/MIDE).

References

11999Assunção and GuttorpAssunção and Guttorp (1999)a99 Assunção R, Guttorp P. (1999). Robustness for inhomogeneous Poisson point processes. Annals of the Institute of Statistical Mathematics 51(4), 657-678. 21995Baddeley and van LieshoutBaddeley and van Lieshout (1995)b95 Baddeley AJ, van Lieshout MNM. (1995). Area-interaction point processes. Annals of the Institute of Statistical Mathematics 47(4), 601–619. 32005Baddeley et al.Baddeley et al. (2005)b05 Baddeley AJ, Turner R, Møller J, Hazelton M (2005). Residual analysis for spatial point processes. Journal of the Royal Statistics Society, Series B 67, 617-666. 41998Basu et al.Basu et al. (1998)b98 Basu A, Harris IR, Hjort NL, Jones MC. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85(3), 549-559. 52013Belbin et al.Belbin et al. (2013)b13 Belbin L, Daly J, Hirsch T, Hobern D, La Salle J. (2013). A specialist’s audit of aggregated occurrence records: An ‘aggregator’s’ perspective. ZooKeys 305, 67-76. 61992Berman and TurnerBerman and Turner (1992)b92 Berman M, Turner T. (1992). Approximating point process likelihoods with GLIM. Journal of the Royal Statistics Society, Series C 41(1), 31-38. 72011Chakraborty et al.Chakraborty et al. (2011)c11 Chakraborty A, Gelfand AE, Wilson AM, Latimer AM, Silander JA. (2011). Point pattern modelling for degraded presence-only data over large regions. Journal of the Royal Statistical Society, Series C 60(5), 757-776. 82007DormannDormann (2007)d07 Dormann CF. (2007). Effects of incorporating spatial autocorrelation into the analysis of species distribution data. Global Ecology and Biogeography 16(2), 129–138. 92005Dudík et al.Dudík et al. (2005)d05 Dudík M, Schapire RE, Phillips SJ. (2005). Correcting sample selection bias in maximum entropy density estimation. Advances in Neural Information Processing Systems 17, 323–330. 102022Eguchi and KomoriEguchi and Komori (2022)e22 Eguchi S, Komori O. (2022). Minimum Divergence Methods in Statistical Machine Learning: From an Information Geometric Viewpoint. Springer Japan KK, part of Springer Nature, Tokyo. 112015Fithian et al.Fithian et al. (2015)f15 Fithian W, Elith J, Hastie T, Keith DA. (2015). Bias correction in species distribution models: pooling survey and collection data for multiple species. Methods in Ecology Evolution 6, 424–438. 122013Fithian and HastieFithian and Hastie (2013)f13 Fithian W, Hastie T. (2013) Finite-sample equivalence in statistical models for presence-only data. The Annals of Applied Statistics 7(4), 1917-1939. 132008Fujisawa and EguchiFujisawa and Eguchi (2008)f08 Fujisawa H, Eguchi S. (2008). Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis 99, 2053-2081. 142010GoemanGoeman (2010)g10 Goeman JJ. (2010). L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal 52(1), 70-84. 152015Komori et al.Komori et al. (2015)k15 Komori O, Eguchi S, Copas JB. (2015). Generalized t-statistic for two-group classification. Biometrics 71(2), 404-416. 162020Komori et al.Komori et al. (2020)k20 Komori O, Eguchi S, Saigusa Y, Kusumoto B, Kubota Y. (2020). Sampling bias correction in species distribution models by quasi-linear Poisson point process. Ecological Informatics 55, 101015. 172015Kubota et al.Kubota et al. (2015)k15b Kubota Y, Shiono T, Kusumoto B. (2015). Role of climate and geohistorical factors in driving plant richness patterns and endemicity on the east Asian continental islands. Ecography 38(6), 639-648. 181996Lancaster and ImbensLancaster and Imbens (1996)l96 Lancaster T, Imbens GW. (1996). Case-control studies with contaminated controls. Journal of Econometrics 70, 145-160. 192009LeleLele (2009)l09 Lele SR. (2009). A new method for estimation of resource selection probability function. The Journal of Wildlife Management 73(1), 122-127. 202006Lele and KeimLele and Keim (2006)l06b Lele SR, Keim JT. (2006). Weighted distributions and estimation of resource selection probability functions. Ecology 87(12), 3021-3028. 212018Liu et al.Liu et al. (2018)l18 Liu C, White M, Newell G. (2018). Detecting outliers in species distribution data. Journal of Biogeography 45, 164-176. 222008Meier et al.Meier et al. (2008)m08 Meier L, van de Geer S, Buhlmann P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B 70(1), 53-71. 232013MesibovMesibov (2013)m13 Mesibov R. (2013). A specialist’s audit of aggregated occurrence records. ZooKeys 293, 1-18. 241998Møller et al.Møller et al. (1998)m98 Møller J, Syversveen AR, Waagepetersen RP. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics 25(3), 451–482. 251978OgataOgata (1978)o78 Ogata, Y. (1978). The asymptotic behaviour of maximum likelihood estimators for stationary point processes. Annals of the Institute of Statistical Mathematics, 30(1), 243-261. 262009Phillips et al.Phillips et al. (2009)p09 Phillips SJ, Dudík M, Elith J, Graham CH, Lehmann A, Leathwick J, Ferrier S. (2009). Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications 19(1), 181–197. 271994Rathbun and CressieRathbun and Cressie (1994)r94 Rathbun SL, Cressie N. (1994). Asymptotic properties of estimators for the parameters of spatial inhomogeneous Poisson point processes. Advances in Applied Probability, 26(1), 122-154. 282013Renner and WartonRenner and Warton (2013)r13 Renner IW, Warton DI. (2013). Equivalence of MAXENT and Poisson point process models for species distribution modeling in ecology. Biometrics 69(1), 274–281. 292018Schmeller et al.Schmeller et al. (2018)s18 Schmeller DS, Weatherdon LV, Loyau A, Bondeau A, Brotons L, Brummitt N, Geijzendorffer IR, Haase P, Kuemmerlen M, Martin CS, Mihoub J-B, Rocchini D, Saarenmaa H, Stoll S and Regan EC. (2018). A suite of essential biodiversity variables for detecting critical biodiversity change. Biological Reviews 93, 55–71. 302017Serra-Diaz et al.Serra-Diaz et al. (2017)s17 Serra-Diaz JM, Enquist BJ, Maitner B, Merow C, Svenning J-C. (2007). Big data of tree species distributions: how big and how good? Forest Ecosystems 4, 30. 312010StreitStreit (2010)s10 Streit RL. (2010). Poisson Point Processes: Imaging, Tracking, and Sensing. Springer, New York. 322011Thessen and PattersonThessen and Patterson (2011)t11 Thessen A, Patterson D. (2011). Data issues in the life sciences. ZooKeys 150, 15–51. 331996TibshiraniTibshirani (1996)t96 Tibshirani R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58(1), 267-288. 342019Velásquez-Tibatá et al.Velásquez-Tibatá et al. (2019)v19 Velásquez-Tibatá J, Olaya-Rodríguez MH, López-Lozano D, Gutiérrez C, González I, Londoño-Murcia MC. (2019). BioModelos: A collaborative online system to map species distributions. PLoS one 14, e0214522. 352009VelozVeloz (2009)v09 Veloz SD. (2009). Spatially autocorrelated sampling falsely inflates measures of accuracy for presence-only niche models. Journal of Biogeography 36, 2290–2299. 362019Wang and StoneWang and Stone (2019)w19 Wang Y, Stone L. (2019). Understanding the connections between species distribution models for presence-background data. Theoretical Ecology 12, 73-88. 372010Warton and ShepherdWarton and Shepherd (2010)w10 Warton DI, Shepherd LC. (2010). Poisson point process models solve the pseudo-absence problem for presence-only data in ecology. The Annals of Applied Statistics 4, 1383–1402. 382016WiserWiser (2016)w16 Wiser SK. (2016). Achievements and challenges in the integration, reuse and synthesis of vegetation plot data. Journal of Vegetation Science 27, 868–879.

Appendix Appendix A Derivation of estimating equation

Let Ξ(t)\Xi(t) be a strictly convex function of tt defined on (0,0,\infty). Then, the Bregman divergence between intensity functions λ1(s)\lambda_{1}(s) and λ2(s)\lambda_{2}(s) is induced by the generator function Ξ\Xi as

DΞ(λ1,λ2)=𝒜[Ξ(λ1(s))Ξ(λ2(s))ξ(λ2(s)){λ1(s)λ2(s)}]𝑑s,\displaystyle D_{\Xi}(\lambda_{1},\lambda_{2})=\int_{\mathscr{A}}\left[\Xi(\lambda_{1}(s))-\Xi(\lambda_{2}(s))-\xi(\lambda_{2}(s))\{\lambda_{1}(s)-\lambda_{2}(s)\}\right]ds, (10)

where ξ\xi is the derivative of Ξ\Xi. Note that DΞ(λ1,λ2)0D_{\Xi}(\lambda_{1},\lambda_{2})\geq 0 because the integrand of (10) is non-negative due to the convexity of Ξ\Xi. The equality holds if and only if λ1=λ2\lambda_{1}=\lambda_{2}. If Ξ(t)=tlogtt\Xi(t)=t\log t-t, then

DΞ(λ1,λ2)=[λ1(s){log(λ1(s))log(λ2(s))}λ1(s)+λ2(s)]𝑑s,D_{\Xi}(\lambda_{1},\lambda_{2})=\int[\lambda_{1}(s)\{\log(\lambda_{1}(s))-\log(\lambda_{2}(s))\}-\lambda_{1}(s)+\lambda_{2}(s)]ds,

which is the extended KL divergence defined on the space of intensity functions.

Let λ0(s)\lambda_{0}(s) be the true intensity function. We consider the minimum divergence estimation with the divergence DΞ(λ0,λ𝜽)D_{\Xi}(\lambda_{0},\lambda_{\mbox{\tiny\boldmath$\theta$}}) with respect to 𝜽\theta. We note that

DΞ(λ0,λ𝜽)=𝒜[Ξ(λ𝜽(s))+ξ(λ𝜽(s)){λ0(s)λ𝜽(s)}]𝑑s+C1,\displaystyle D_{\Xi}(\lambda_{0},\lambda_{\mbox{\tiny\boldmath$\theta$}})=-\int_{\mathcal{A}}\left[\Xi(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s))+\xi(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s))\{\lambda_{0}(s)-\lambda_{\mbox{\tiny\boldmath$\theta$}}(s)\}\right]ds+C_{1}, (11)

where C1C_{1} is a constant that depends only on λ0\lambda_{0}. Hence we define the loss function as

lΞ(𝜽)=i=1mξ(λ𝜽(si))+𝒜{λ𝜽(s)ξ(λ𝜽(s))Ξ(λ𝜽(s))}𝑑s,\displaystyle l_{\Xi}(\mbox{\boldmath$\theta$})=-\sum_{i=1}^{m}\xi(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}))+\int_{\mathscr{A}}\left\{\lambda_{\mbox{\tiny\boldmath$\theta$}}(s)\xi(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s))-\Xi(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s))\right\}ds, (12)

which has a numerical approximation as

lΞ(𝜽)=i=1r[diξ(λ𝜽(si))wiλ𝜽(si)ξ(λ𝜽(si))+wiΞ(λ𝜽(si))]\displaystyle l_{\Xi}(\mbox{\boldmath$\theta$})=-\sum_{i=1}^{r}\left[d_{i}\xi(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}))-w_{i}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i})\xi(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}))+w_{i}\Xi(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}))\right] (13)

using the presence indicators did_{i}’s and the quadrature weights wiw_{i}’s as in Section 2. We note that the expectation under the true intensity function λ0(s)\lambda_{0}(s) is equal to DΞ(λ0,λ𝜽)D_{\Xi}(\lambda_{0},\lambda_{\mbox{\tiny\boldmath$\theta$}}) except for the constant term C1C_{1}.

Let FF be a cumulative distribution function defined on (0,)(0,\infty). Then, apply the general discussion for the divergence DΞD_{\Xi} to a specific generator function Ξ\Xi that is defied by

Ξ(t)=0t0sF(τu)u𝑑u𝑑s\displaystyle\Xi(t)=\int_{0}^{t}\int_{0}^{s}\frac{F(\tau u)}{u}duds (14)

Accordingly, Ξ\Xi is a convex function because (d2/dt2)Ξ(t)=F(τt)/t(d^{2}/dt^{2})\Xi(t)=F(\tau t)/t is nonnegative. For the model intensity λ𝜽=λ𝜷b𝜶\lambda_{\mbox{\tiny\boldmath$\theta$}}=\lambda_{\mbox{\tiny\boldmath$\beta$}}b_{\mbox{\tiny\boldmath$\alpha$}}, the estimating functions in (5) and (6) are given as the derivatives of the loss function lΞ(𝜽)l_{\Xi}(\mbox{\boldmath$\theta$}) with respect to 𝜷\beta and 𝜶\alpha, that is

𝜷lΞ(𝜽)\displaystyle\frac{\partial}{\partial\mbox{\boldmath$\beta$}}l_{\Xi}(\mbox{\boldmath$\theta$}) =i=1rF(τλ𝜽(si)){diwiλ𝜽(si)}𝒙(si),\displaystyle=\sum_{i=1}^{r}F(\tau\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}))\{d_{i}-w_{i}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i})\}\mbox{\boldmath$x$}(s_{i}), (15)
𝜶lΞ(𝜽)\displaystyle\frac{\partial}{\partial\mbox{\boldmath$\alpha$}}l_{\Xi}(\mbox{\boldmath$\theta$}) =i=1rF(τλ𝜽(si))1+exp(𝜶𝒛(si)){diwiλ𝜽(si)}𝒛(si).\displaystyle=\sum_{i=1}^{r}\frac{F(\tau\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i}))}{1+\exp\left(\mbox{\boldmath$\alpha$}^{\top}\mbox{\boldmath$z$}(s_{i})\right)}\{d_{i}-w_{i}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s_{i})\}\mbox{\boldmath$z$}(s_{i}). (16)

Appendix Appendix B Consistency and asymptotic normality of the MIDE

We follow the asymptotic results described in Ogata (1978); Rathbun and Cressie (1994); Assunção and Guttorp (1999). The estimating functions based on loss function (13) is unbiased because

Eλ𝜽[𝜷lΞ(𝜽)]=\displaystyle E_{\lambda_{\mbox{\tiny\boldmath$\theta$}}}\left[\frac{\partial}{\partial\mbox{\boldmath$\beta$}}l_{\Xi}(\mbox{\boldmath$\theta$})\right]= 𝒜{ξ(λ𝜽(s))b𝜶(s)𝜷λ𝜷(s)}λ𝜽(s)𝑑s\displaystyle-\int_{\mathscr{A}}\left\{\xi^{\prime}(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s))b_{\mbox{\tiny\boldmath$\alpha$}}(s)\frac{\partial}{\partial\mbox{\boldmath$\beta$}}\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)\right\}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s)ds (17)
+𝒜λ𝜽(s)ξ(λ𝜽(s))b𝜶(s)𝜷λ𝜷(s)𝑑s=𝟎1+p,\displaystyle+\int_{\mathscr{A}}\lambda_{\mbox{\tiny\boldmath$\theta$}}(s)\xi^{\prime}(\lambda_{\mbox{\tiny\boldmath$\theta$}}(s))b_{\mbox{\tiny\boldmath$\alpha$}}(s)\frac{\partial}{\partial\mbox{\boldmath$\beta$}}\lambda_{\mbox{\tiny\boldmath$\beta$}}(s)ds=\mbox{\boldmath$0$}_{1+p},

where ξ\xi^{\prime} is the derivative of ξ\xi, and the expectation is taken with respect to PPP with the model intensity function. Define 𝑨t={t𝒂,𝒂𝑨}\mbox{\boldmath$A$}_{t}=\{t\mbox{\boldmath$a$},\mbox{\boldmath$a$}\in\mbox{\boldmath$A$}\} where 𝑨2\mbox{\boldmath$A$}\subset\mathbb{R}^{2} be a compact set with positive Lebesgue measure and assume that 𝑨t2\mbox{\boldmath$A$}_{t}\uparrow\mathbb{R}^{2} as tt\to\infty. Consider the MIDE based on a PPP on 𝑨t\mbox{\boldmath$A$}_{t}. Let 𝜷0\mbox{\boldmath$\beta$}_{0} denote the solution of Eλ0[(/𝜷)lΞ(𝜽)]=𝟎1+pE_{\lambda_{0}}\left[(\partial/\partial\mbox{\boldmath$\beta$})l_{\Xi}(\mbox{\boldmath$\theta$})\right]=\mbox{\boldmath$0$}_{1+p}, where the expectation is taken with respect to PPP with the true intensity function. Under some regularity conditions for the proof of Theorems 1 and 2 in Assunção and Guttorp (1999), β^τ𝑝β0\hat{\beta}_{\tau}\xrightarrow{p}\beta_{0} and

Λθ(𝜷^τ𝜷0)𝑑N(𝟎,𝚺τ(𝜷0))\displaystyle\sqrt{\Lambda_{\theta}}\left(\hat{\mbox{\boldmath$\beta$}}_{\tau}-\mbox{\boldmath$\beta$}_{0}\right)\xrightarrow{d}N\left(\mbox{\boldmath$0$},\mbox{\boldmath$\Sigma$}_{\tau}(\mbox{\boldmath$\beta$}_{0})\right) (18)

as tt\to\infty, where Λθ=𝑨tλθ(s)𝑑s\Lambda_{\theta}=\int_{\mbox{\tiny\boldmath$A$}_{t}}\lambda_{\theta}(s)ds and 𝚺τ(𝜷0)=𝑱τ(𝜷0)1𝑰τ(𝜷0)𝑱τ(𝜷0)1\mbox{\boldmath$\Sigma$}_{\tau}(\mbox{\boldmath$\beta$}_{0})=\mbox{\boldmath$J$}_{\tau}(\mbox{\boldmath$\beta$}_{0})^{-1}\mbox{\boldmath$I$}_{\tau}(\mbox{\boldmath$\beta$}_{0})\mbox{\boldmath$J$}_{\tau}(\mbox{\boldmath$\beta$}_{0})^{-1},

𝑱τ(𝜷)\displaystyle\mbox{\boldmath$J$}_{\tau}(\mbox{\boldmath$\beta$}) =1ΛθEλ0[F(τλθ(s))𝒙(s)𝒙(s)],\displaystyle=\frac{1}{\Lambda_{\theta}}E_{\lambda_{0}}\left[F\left(\tau\lambda_{\theta}(s)\right)\mbox{\boldmath$x$}(s)\mbox{\boldmath$x$}^{\top}(s)\right], (19)
𝑰τ(𝜷)\displaystyle\mbox{\boldmath$I$}_{\tau}(\mbox{\boldmath$\beta$}) =1ΛθEλ0[F(τλθ(s))2𝒙(s)𝒙(s)].\displaystyle=\frac{1}{\Lambda_{\theta}}E_{\lambda_{0}}\left[F\left(\tau\lambda_{\theta}(s)\right)^{2}\mbox{\boldmath$x$}(s)\mbox{\boldmath$x$}^{\top}(s)\right]. (20)

Appendix Appendix C Gradient ascent algorithm

The gradient ascent method is employed to minimize the penalized loss. We follow the approach described in Meier et al. (2008); Komori et al. (2015). Consider the gradient 𝒈(𝜷|𝜶)=(g0(𝜷|𝜶),g1(𝜷|𝜶),,gp(𝜷|𝜶))\mbox{\boldmath$g$}(\mbox{\boldmath$\beta$}|\mbox{\boldmath$\alpha$})=(g_{0}(\mbox{\boldmath$\beta$}|\mbox{\boldmath$\alpha$}),g_{1}(\mbox{\boldmath$\beta$}|\mbox{\boldmath$\alpha$}),\dots,g_{p}(\mbox{\boldmath$\beta$}|\mbox{\boldmath$\alpha$}))^{\top} defined by

gk(𝜷|𝜶)={βklΞ(𝜽)ϕksign(βk)ifβk0βklΞ(𝜽)ϕksign(βklΞ(𝜽))ifβk=0and|βklΞ(𝜽)|>ϕk0otherwiseg_{k}(\mbox{\boldmath$\beta$}|\mbox{\boldmath$\alpha$})=\begin{cases}\displaystyle\frac{\partial}{\partial\beta_{k}}l_{\Xi}(\mbox{\boldmath$\theta$})-\phi_{k}\mathop{\rm sign}(\beta_{k})&{\rm if}\ \beta_{k}\neq 0\\ \displaystyle\frac{\partial}{\partial\beta_{k}}l_{\Xi}(\mbox{\boldmath$\theta$})-\phi_{k}\mathop{\rm sign}\left(\frac{\partial}{\partial\beta_{k}}l_{\Xi}(\mbox{\boldmath$\theta$})\right)&\displaystyle{\rm if}\ \beta_{k}=0\ {\rm and}\ \left|\frac{\partial}{\partial\beta_{k}}l_{\Xi}(\mbox{\boldmath$\theta$})\right|>\phi_{k}\\ 0&{\rm otherwise}\end{cases} (21)

for k=0,1,,pk=0,1,\dots,p, where ϕ0=0\phi_{0}=0 and ϕk=ϕ\phi_{k}=\phi for k0k\neq 0 and ‘sign’ indicates the sign function. To find an optimal gradient within a subdomain on which the gradient is continuous, consider the range defined by

ρedge(𝜷)=mink=1,,p[βkgk(𝜷|𝜶)|sign(βk)=sign(gk(βk))0].\displaystyle\rho_{\rm edge}(\mbox{\boldmath$\beta$})=\min_{k=1,\dots,p}\left[-\frac{\beta_{k}}{g_{k}(\mbox{\boldmath$\beta$}|\mbox{\boldmath$\alpha$})}\middle|\mathop{\rm sign}(\beta_{k})=-\mathop{\rm sign}(g_{k}(\beta_{k}))\neq 0\right]. (22)

The unpenalized parameter 𝜶\alpha is estimated by combining the Newton-Raphson method. Given values of tuning parameters τ\tau and ϕ\phi, an iterative estimation procedure is described as follows:

  1. 1.

    Initialize 𝜷(1)\mbox{\boldmath$\beta$}^{(1)} and 𝜶(1)\mbox{\boldmath$\alpha$}^{(1)}.

  2. 2.

    For steps t=2,3,t=2,3,\dots, iteratively calculate

    𝜷(t)\displaystyle\mbox{\boldmath$\beta$}^{(t)} =𝜷(t1)+ρopt𝒈(𝜷(t1)|𝜶(t1)),\displaystyle=\mbox{\boldmath$\beta$}^{(t-1)}+\rho_{\rm opt}\mbox{\boldmath$g$}\left(\mbox{\boldmath$\beta$}^{(t-1)}\middle|\mbox{\boldmath$\alpha$}^{(t-1)}\right), (23)
    𝜶(t)\displaystyle\mbox{\boldmath$\alpha$}^{(t)} =𝜶(t1)𝑯𝜶1(𝜶(t1)|𝜷(t))𝒈𝜶(𝜶(t1)|𝜷(t)),\displaystyle=\mbox{\boldmath$\alpha$}^{(t-1)}-{\mbox{\boldmath$H$}}_{\mbox{\tiny\boldmath$\alpha$}}^{-1}\left(\mbox{\boldmath$\alpha$}^{(t-1)}\middle|\mbox{\boldmath$\beta$}^{(t)}\right){\mbox{\boldmath$g$}}_{\mbox{\tiny\boldmath$\alpha$}}\left(\mbox{\boldmath$\alpha$}^{(t-1)}\middle|\mbox{\boldmath$\beta$}^{(t)}\right), (24)

    where 𝒈𝜶\mbox{\boldmath$g$}_{\mbox{\tiny\boldmath$\alpha$}} and 𝑯𝜶\mbox{\boldmath$H$}_{\mbox{\tiny\boldmath$\alpha$}} are the gradient and Hessian of lΞϕl_{\Xi}^{\phi} with respect to 𝜶\alpha, respectively, and

    ρopt=argmin0ρρedge(𝜷(t1))lΞϕ{𝜷(t1)+ρ𝒈(𝜷(t1)|𝜶(t1))}\displaystyle\rho_{\rm opt}=\mathop{\rm argmin}_{0\leq\rho\leq\rho_{\rm edge}(\mbox{\tiny\boldmath$\beta$}^{(t-1)})}l_{\Xi}^{\phi}\left\{\mbox{\boldmath$\beta$}^{(t-1)}+\rho\mbox{\boldmath$g$}\left(\mbox{\boldmath$\beta$}^{(t-1)}\middle|\mbox{\boldmath$\alpha$}^{(t-1)}\right)\right\} (25)

    until convergence with respect to the penalized loss.

This procedure is repeated when using different values of ϕ\phi, ϕ(nϕ)>>ϕ(1)=0\phi_{(n_{\phi})}>\cdots>\phi_{(1)}=0, where ϕ(nϕ)=max1kp|lΞ(𝜷(1))/βk|\phi_{(n_{\phi})}=\max_{1\leq k\leq p}|\partial l_{\Xi}(\mbox{\boldmath$\beta$}^{(1)})/\partial\beta_{k}| and nϕn_{\phi} is appropriately set according to pp. The starting value 𝜷(1)\mbox{\boldmath$\beta$}^{(1)} when using ϕ(u)\phi_{(u)} is set at the resultant estimate when using ϕ(u1)\phi_{(u-1)} for u=2,,nϕu=2,\dots,n_{\phi}.