Species Abundance Distribution and Species Accumulation Curve: A General Framework and Results
Abstract
We build a general framework which establishes a one-to-one correspondence between species abundance distribution (SAD) and species accumulation curve (SAC). The appearance rates of the species and the appearance times of individuals of each species are modeled as Poisson processes. The number of species can be finite or infinite. Hill numbers are extended to the framework. We introduce a linear derivative ratio family of models, , of which the ratio of the first and the second derivatives of the expected SAC is a linear function. A D1/D2 plot is proposed to detect this linear pattern in the data. By extrapolation of the curve in the D1/D2 plot, a species richness estimator that extends Chao1 estimator is introduced. The SAD of is the Engen’s extended negative binomial distribution, and the SAC encompasses several popular parametric forms including the power law. Family is extended in two ways: which allows species with zero detection probability, and where the derivative ratio is a rational function. Real data are analyzed to demonstrate the proposed methods. We also consider the scenario where we record only a few leading appearance times of each species. We show how maximum likelihood inference can be performed when only the empirical SAC is observed, and elucidate its advantages over the traditional curve-fitting method.
keywords:
[class=MSC]keywords:
and
1 Introduction
Estimating the diversity of classes in a population is a problem encountered in many fields. We may be interested in the diversity of words a person know from his/her writings [18], the illegal immigrants from the apprehension records [3], the distinct attributes in a database [24, 16], or the distinct responses to a crowdsourcing query [48]. Among different applications, species abundance is the one that receives most attention. For this reason, it is chosen as the theme of this paper with the understanding that the proposed framework and methods are applicable in other applications as well.
Understanding the species abundance in an ecological community has long been an important task for ecologists. Such knowledge is paramount in conservation planning and biodiversity management [35]. An exhaustive species inventory is too labor and resource intensive to be practical. Information about species abundance can thus be acquired mainly through a survey.
Let where is the number of species in a community that are represented exactly times in a survey. We do not observe the whole , but which is the zero-truncated . In other words, we do not know how many species are not seen in the survey. We call the vector , the frequency of frequencies (FoF) [21].
A plethora of species abundance models have been proposed for . Comprehensive review of the field can be found in Bunge and Fitzpatrick [6] and Matthews and Whittaker [34]. A typical assumption in purely statistical models is
(1.1) |
where is the total number of recorded species, and is a probability vector, such that is the probability for a randomly selected recorded species to be observed times in the survey for . We call this vector , the species abundance distribution (SAD). The great significance of (1.1) relies on three assumptions: (A1) the species names are noninformative; (A2) the observed data for different recorded species are independent and identically distributed (iid), and (A3) for each recorded species, its observed frequency contains all useful information.
Species accumulation curve (SAC) is another popular tool in the analysis of species abundance data. The survey is viewed as a data-collection process in which more and more sampling effort is devoted. The individual-based SAC is the number of recorded species expressed as a function of the amount of sampling effort.
Despite the different emphases of SAD and SAC, the two approaches have an overlapped target: Estimating , the total number of species in the community. For SAD, it means estimating , the number of unseen species as . For SAC, is the total number of seen species when the sampling effort is unlimited.
As SAD depends largely on the sampling effort, it is necessary to include sampling effort explicitly in the model in order to make comparison of different SADs possible. This addition establishes a link between SAD and SAC. Sampling effort can be of continuous type, such as the area of land or the volume of water sampled, or the duration of the survey. Discrete type sampling effort can be the sample size. To emphasize the sampling effort considered, the SAC is called species-time curve, species-area curve, or species-sample-size curve when the sampling effort is time, area, or sample size respectively. Species-area curve has been studied extensively in the literature. Review of it can be found in Tjørve [46, 47], Dengler [15] and Williams et al. [50]. In this paper, we use time as the measure of sampling effort. We consider the vector to be a function of the time , denoted as . Notations , , , and are likewise denoted as , , , and respectively. The (empirical) SAC is . Throughout the paper, we assume without loss of generality that the survey starts at time and ends at time . Because of the great similarity of the species-time relationship and species-area relationship [39, 32], time and area are treated as if they are interchangeable in this paper. When SAC is a species-area curve, we refer to the Type I species-area curve [43], where the areas sampled are nested (smaller areas are included in larger areas), resulting in a curve that is always nondecreasing.
The study of the relation between the (empirical) SAD and the (empirical) SAC started early since the introduction of rarefaction curve ([1, 42]) which describes how we interpolate SAC using the observed FoF. Let be the observed for , , and . The rarefaction curve for species-time relationship is
(1.2) |
The rationale of (1.2) is that is the probability for a species with frequency in time interval to be observed before time if the appearance times of it in are iid distributed. Equivalent formula for species-area curve appears earlier in Arrhenius [1]. Good and Toulmin [22] proposed an extrapolation formula for species-sample-size curve which when expressed as species-time curve is
(1.3) |
Equation (1.3) is (1.2) with a change of domain. Define the empirical SAD, . Equation (1.3) becomes
(1.4) |
where is an estimator of , the probability generating function of the SAD at time . Equation (1.4) delineates a simple and yet convincing formula to find SAC from SAD and vice versa. The aim of this paper is to propose and study a statistical framework under which model in (1.1) and the relation in (1.4) hold when , and are replaced by , and respectively. Our framework assumes (A1), (A2) and that the appearance times of each species follow a Poisson process which is sufficient for (A3) and the validity of (1.2).
Finite is commonplace in SAD approach although no consensus has been reached. Empirically, log-series distribution, an SAD that assumes infinite , is one of the most successful models. Many SACs do not have asymptote. It is common that the observed number of rare species is large, and shows no sign to decrease. In Bayesian nonparametric approach in genomic diversity study, Poisson-Dirichlet process which assumes infinite is used (see for example Lijoi et al. [31]). In reality, the existence of transient species (species which are observed erratically and infrequently [38]), and the error in the species identification process (a well-known example is the missequencing in pyrosequencing of DNA (see for example Dickie [17])) are continual sources of rare species making the species number larger than expected. In our framework, is random and can be finite or infinite with probability one. We use species richness to refer to when is random, and when is deterministic.
A special feature of our framework is that species can have zero detection probability. We call such species, zero-rate species, and all other species, positive-rate species. Zero-rate species can either be seen only once or unseen in a survey. If there are finite number of zero-rate species, the probability of observing any of them is zero. Therefore, if zero-rate species is observed in a survey, almost surely there are infinite number of them. We interpret observed individuals of positive-rate species as outcomes of a discrete distribution, where individuals of the same species can appear any nonnegative number of times in a survey. On the other hand, individuals of zero-rate species are outcomes of a continuous distribution, and no two such individuals belong to the same species. In our framework, the distribution is allowed to be a mixture of the two. Suppose we want to estimate the population of a town through recording each person we meet on street. Then tourists from distant countries can be viewed as “zero-rate species”. In the first example in Section 9, it is suspected that the sequencing error in pyrosequencing of DNA may be a source for zero-rate species.
Poisson distribution is a main component in our framework. Though Poisson distribution is common in existing SAD models, time scarcely plays a role. In mixed Poisson model (see for example Fisher et al. [20], Bulmer [5]), the observed frequencies of the species are independent and each follows a Poisson distribution with its own rate, say for species . The value is a fixed unknown finite value and are iid sample from a mixing distribution. Another related model was proposed in Zhou et al. [51]. The paper focuses mainly on finite . Neither time nor zero-rate species are included in the model.
The outline of this paper is as follows. We propose in Section 2 a new model for the sampling process, called the mixed Poisson partition process (MPPP). We emphasize on the parametric approach where additional assumptions are made on top of the framework so that the process depends only on a few parameters. Once the parameters are estimated, we can make inference on different characteristics of the population, say the SAD at any fixed time, the Hill numbers, or the expected future data. We study in Section 3. In Section 4, we consider the expected species accumulation curve (ESAC). We prove the one-to-one correspondence among (i) an MPPP, (ii) an ESAC which is a Bernstein function that passes through the origin, and (iii) the expected number of recorded species and the SAD at time such that the first derivative of its probability generating function is absolutely monotone in . In Section 5, we introduce , a parametric family of ESAC. The SAD of is the Engen’s extended negative binomial distribution. This family has an attractive property that the ratio of the first and the second derivatives of the ESAC is a linear function of . We extend to which allows zero-rate species. In Section 6, a D1/D2 plot for and some diagnostic plots for specified distributions are proposed. Extrapolation of the curve in D1/D2 plot is considered in Section 7. Estimator of species richness is suggested basing on a modified first-order extrapolation. In Section 8, is generalized so that the derivative ratio is a rational function instead of a linear function of . Four real data are analyzed in Section 9 to demonstrate the applications of the proposed models and the suggested plots. In Section 10, we propose and study a design where only a few leading appearance times of each species are recorded. In Section 11, we consider the maximum likelihood approach on the empirical SAC. We give a discussion in Section 12.
2 Mixed Poisson Partition Process
Poisson process is the backbone of the framework. A Poisson process is a point process characterized by an intensity measure over the -dimensional space (we usually have in this paper). The intensity measure which we denote as delineates how many points are present on average in different parts of . More precisely, the number of points in a set follows the distribution. Furthermore, for any finite collection of disjoint subsets , are mutually independent, where is the number of points in . If , we call the intensity function. A simple example is the homogeneous Poisson process where for a constant . The parameter is called the rate of the process.
If is finite (i.e., ), simulation of a Poisson process can be performed in two steps: (i) simulate the total number of points , and (ii) simulate iid from the probability measure . Nevertheless, if is infinite, then the number of points is infinite, and it is impossible to simulate all the points in the process. In this case, we can only simulate a selected finite subset of points of the process using the thinning property of Poisson processes [29]. For each point in the Poisson process, let be the probability for the point to be selected, and be the probability for it to be discarded. Then the selected points form a Poisson process with intensity measure , where . We can simulate the selected points when the measure is finite using the aforementioned method. For example, if we are interested only in the points that lie in a bounded region of a homogeneous Poisson process with rate , then where is the indicator function. The number of selected points follows distribution, and the selected points are iid uniformly distributed in .
We model the observations in a species abundance survey by a stochastic process where rates of the species follows a Poisson process.
Definition: (Mixed Poisson partition process) A mixed Poisson partition process (MPPP), , is characterized by a nonzero species intensity measure , which is a measure over , the set of all nonnegative real numbers, satisfying
(2.1) |
The definition of an MPPP consists of three steps:
-
1.
(Generation of positive rates of species) Given , define to be a measure over ( is the set of positive real numbers) by , (i.e., for ). Let (a finite or countably infinite sequence) be a realization of a Poisson process with intensity measure .
-
2.
(Generation of individuals of positive-rate species) For each in Step 1, we generate a realization (independently across ) of a Poisson process with rate . The realization represents the arrival times of a species with rate .
-
3.
(Generation of individuals of zero-rate species) We generate a realization of a Poisson process with rate , independent of . This represents the times of appearance for all the zero-rate species.
Finally, we take . For any , all points in are arrival times of the same species, whereas each point in is from a different zero-rate species (we use a slight abuse of notation to treat each point in as a point process with only one point).
Measures and may be finite or infinite. Let . As the expected number of individuals seen in time interval is equal to (see (3.4) in Section 3), we can interpret as the expected total rate. When is finite (i.e., ), conditional on the event that there is an individual observed exactly at time , the distribution of the rate of the species of that individual is (see Proposition A in Appendix A for a proof). Therefore, we can regard as the intensity measure of of the observed individuals. If is finite, the expected number of positive-rate species in the community is finite. From Step 3 in the definition, if , there are infinite number of zero-rate species and they arrive at a constant rate. With probability one, is finite if and only if and . In such case, follows a distribution. Measure specifies the distribution of the rates of positive-rate species. More precisely, with is the average number of species with rate in . Measure specifies the distribution of the rates of the species of the individuals including those of the zero-rate species. More precisely, is the average number of individuals (per unit time) belonging to species whose rates lie in . Condition (2.1) is essential because it is equivalent to the finiteness of the ESAC (i.e., for any finite nonnegative ). A proof of it is given in Appendix B. Figure 1 illustrates the generation of the MPPP.
When is infinite, we cannot simulate all ’s in Step 1 in practice. We can use the following method to simulate a realization of the process in interval . The probability for a species with rate to be recorded in is . Applying the thinning property, the intensity of the recorded positive-rate species is which is always finite. To include also the recorded zero-rate species, the intensity is (we take when ). After generating according to a Poisson process with this intensity measure, we simulate for each , a realization (independently across ) of a Poisson process with rate , conditional on the event that has at least one point in (if , then contains one uniformly distributed point in ). An alternative equivalent definition that unifies the generation of individuals of zero-rate species and positive-rate species is given in Appendix C.

A special feature of the framework is that is random and can be infinite with probability one. This change necessitates modification of biodiversity measures, among which Hill numbers are popular. When is deterministic, Hill number of order [25] for and is where is the relative abundance of species in an assemblage (the number of individuals of species divided by the total population). When , is defined as . Hill numbers are interpreted as the effective number of species. The order determines the sensitivity of the measure to species relative abundances. When , all species regardless its abundance are treated equally. Larger imposes more weight to abundant species. Hill numbers encompass three important diversity measures as its special cases. They are the species richness (), the exponential of Shannon entropy (), and the inverse of Simpson index (). A notable property of Hill numbers is that they obey the replication principle: The diversity of an assemblage formed by pooling equally abundant and equally large assemblages with no species in common is times the diversity of a single assemblage.
Under our framework, each species corresponds to a Poisson process, and the relative abundance of a species is its rate divided by the expected total rate, . Reasonable modifications to the definition of Hill numbers are to replace with , where is the rate of a species, and to replace summation over species with integration with respect to the measure so that the integral of is one. It works well when is finite, but fails when is infinite. To fix the problem, we need a meaningful surrogate rate which fulfills two requirements: (i) the expected total rate is finite, and (ii) it approaches the true rate as a limit. The first appearance time of a species with rate follows the exponential distribution with density function , which can be regarded as the instantaneous rate of its first appearance at time . This rate approaches when decreases to zero. The expected total instantaneous rate of first appearances over all species at time is ( from (3.1)) which is always finite for positive . Replacing with and taking , we define, the Hill numbers, for our framework as
When and , . When is finite,
Diversity is non-increasing with respect to . Unlike the classical Hill numbers, can be less than one. For instance, from (3.3), which can be any positive value. It can be proved that for any positive constant , , an analogue of the replication principle for Hill numbers.
3 Frequency of Frequencies
A sufficient statistic for a realization of an MPPP in time interval is . Consider a time interval . For each in Step 2 of the definition, it contributes one to one of the counts , where follows a Poisson distribution, . By the splitting property of Poisson processes [29], their total contribution to follows distribution independent across . The zero-rate species only increase by a random variate. Therefore, for ,
(we use the convention that ). Equations (3.2) and (3.3) follows from for . The correctness of (3.1) when follows from (3.2), (3.3) and the fact .
(3.1) | |||||
(3.2) | |||||
(3.3) |
Again for (3.2), we set when . When , is infinite. Since all elements in are independent and follow Poisson distribution, variable is Poisson distributed, and so do and when their expected values are finite.
Write which is the number of individuals observed before time .
(3.4) |
From (3.1), for ,
By the thinning property of Poisson processes,
is the intensity measure of the rate for the species represented times in . We can interpret as the expected total rate for all species represented times in . As pointed out in Section 2 and from (3.4), is the expected total rate for all species. Conditional on the event that we will observe an individual at a given future time , the probability that this individual belongs to a species represented times in is equal to
(formal proof is given in Appendix D) which corresponds to the renowned Good-Turing frequency estimator in Good [21]. It is worth pointing out that the above equation holds for individual observed at a given future time rather than the next observed individual.
From the well-known relation between independent Poisson random variables and multinomial distribution, model (1.1) is valid under the framework with
(3.5) |
Formal proof is given in Appendix E. If is finite, for any fixed . The joint probability mass function of is
In terms of the expected FoF, the log-likelihood function is
(3.6) |
In terms of and , it is
If the unknown vector and the quantity are unrelated, the above log-likelihood function implies that the maximum likelihood estimator (MLE) of is the conditional maximum likelihood estimator (conditional on the observed ) for the multinomial distribution in (1.1). The MLE of is .
4 Expected Species Accumulation Curve
Denote the expected (empirical) SAC (ESAC) as . Condition (2.1) guarantees that is finite for any finite . For a real-valued function , let stand for the -order derivative of function . Clearly . From (3.2),
(4.1) |
Note that . From (3.1) and (4.1),
(4.2) |
Analogous expression for (4.2) appears in Béguinot [2] as an approximate formula under the multinomial model for fixed total number of observed individuals for the species-sample-size curve with the derivative operator replaced by the difference operator.
Before studying the link between ESAC and SAD, two mathematical terms are needed. A function is a Bernstein function if it is a nonnegative real-valued function on such that for all positive integer [44]. An infinitely differentiable function on an interval is called absolutely monotone in if for and . The relation between absolutely monotone function and probability generating function is well known (see for example Strook [45]).
Proposition 1:
- a.
-
A function is the ESAC of an MPPP if and only if is a Bernstein function that passes through the origin.
- b.
-
A function is the probability generating function of for a fixed positive of an MPPP if and only if , and is absolutely monotone in .
- c.
-
An MPPP is uniquely determined by its , or for the probability generating function of for a fixed positive .
- d.
-
For any MPPP, and , we have
(4.3)
It worths pointing out that Boneh et al. [4] proved that for a different setting ( independent Poisson processes with different rates) and called it “infinite order alternating copositivity”.
From (3.6), the log-likelihood function can be re-expressed as a function of .
(4.4) |
It can be shown that the Taylor expansion of at converges to when :
(4.5) |
It signifies that Good-Toulmin estimator in (1.3) performs satisfactorily in short-term extrapolation when . We deduce from (4.5) the following unbiased estimator for different order of derivative of
(4.6) |
We use (4.6) only for 111We include here for completeness. Remember that can be infinite. because outside this interval, may not have the correct sign .
Estimator in (4.6) is useful. For example, a concave downward curve when we plot for 222It corresponds to the diagnostic check for the log-series distribution in Table 1 in Section 5. is an indication that because if we believe that there is a linear function with positive and such that for all , then
5 Linear First Derivative Ratio Family
MPPP is nonparametric in nature. It is defined by an intensity measure, an ESAC or an SAD. When parametric approach is preferred, we restrict our interest to a family of distributions in MPPP, say by putting constraints on the SAD. A way to portray an SAD is to delineate its probability ratio, for . It is equivalent to examine , which we call the th derivative ratio. From (4.1) and the Cauchy-Schwarz inequality, for and , . It deduces that th derivative ratio is always a nonnegative nondecreasing function of . Among all derivative ratios, the first derivative ratio is most important because which relates to the logarithmic derivative of , the expected total rate for unseen species at time . The following proposition gives a sufficient condition for it.
Proposition 2: A sufficient condition for a function on to be the first derivative ratio for an MPPP is that (i) is a Bernstein function, and (ii) or .
We prove Proposition 2 in Appendix G. Hereafter we use to denote the first derivative ratio.
The simplest nontrivial Bernstein function is the positive linear function on . It suggests the following fundamental family of distributions in MPPP.
Definition: (Linear First Derivative Ratio Family) An MPPP is said to belong to the linear first derivative ratio family (denoted as ) if its first derivative ratio takes a linear form for and such that if .
Family encompasses some common SADs and ESACs. We list the characteristics of all models in in Table 1. When , must be larger than 1 (see condition (ii) in Proposition 2), otherwise from the second row of Table 1, for finite . In Table 1, equality of transformed for some models is presented. Such equalities can be used to produce diagnostic check for specified SAD in when is replaced by its estimator in (4.6).
[b]
SAD (Zero-truncated Poisson distribution)1 : | |
() | |
ESAC (Negative exponential law): | |
Diagnostic check: | |
SAD2 : | |
ESAC: | |
SAD (Geometric distribution)3 : | |
() | ESAC (Hyperbola law)4 : |
Diagnostic check: | |
SAD (Log-series distribution): | |
() | ESAC (Kobayashi’s logarithm law)5 : |
Diagnostic check: | |
SAD 6 : When , | |
( | When , , for . |
or ) | ESAC (Power law): When , |
When , | |
Diagnostic check: |
-
1
It is the simplest MPPP with all species having the same rate .
-
2
When (), it is the zero-truncated negative binomial distribution.
-
3
It is a special case of the zero-truncated negative binomial distribution.
-
4
Also known as Michaelis-Menten equation and Monod model.
- 5
-
6
This distribution appears in Zhou et al. [51].
has three parameters , and . Parameters and determine the SAD, and is a scale parameter. As pointed out at the end of Section 3, the MLE of and is equivalent to the conditional MLE (conditional on the observed ) of the multinomial model in (1.1). The MLE of is chosen to make equal to .
From Table 1, we can find and using the relations , and
It can be shown that for , , and
(5.1) |
where is the Dirac measure. The intensity measure takes the form as a gamma distribution with extended shape parameter for nonnegative . Therefore, is the Engen’s extended negative binomial distribution [19] with support . The parameter determines the shape of the gamma distribution.
The Hill number of order for is
where is the digamma function. can be found either as or . Species richness, if and only if . In this case, is increasing in for any fixed . When , is unimodal with respect to .
An ESAC is called following a power law, if for . From Table 1, the SAD at time for a power law has the form
where ( is in Table 1), and for and is the rising factorial (this distribution appears in Zhou et al. [51]). This SAD distribution does not depend on . We call this discrete distribution, the power species abundance distribution (PSAD). If follows a PSAD with parameter , then follows the generalized hypergeometric distribution, distribution [28, 27].
Proposition 3: Under the MPPP, power law is the only ESAC which has SAD independent on . Furthermore, if an SAD under MPPP has a proper limiting distribution when approaches infinity, then the limiting distribution must be a PSAD.
The proof of Proposition 3 is given in Appendix H.
Three distributions in are exceptional. They stand for three boundary situations. The first one is the zero-truncated Poisson distribution (). It models the extreme case when all species have equal abundance. In the approach that assumes finite , it is a common reference distribution, say in interpreting and deriving nonparametric estimator of .
The second one is the log-series distribution (). It is where jumps from finite value to infinite value in . Therefore, if we are interested only in finite models, it is a boundary case. Because of this property, it is used in this paper as a reference distribution in graphical check for the finiteness of (refer to the end of Section 4, and the checking of slope 1 in the D1/D2 plot introduced later).
The last one is the power law (). It is the only possible limiting distribution of SAD in MPPP as approaches . All SADs in with converge to it when increases without bound. It is also the only distribution in that has for any .
6 Diagnostic Plots
An advantage of is that it has a simple diagnostic plot: Draw as a function of for and defined in (4.6). If the curve in the plot is almost linear, is an appropriate model. Approximate intercept and slope of the curve can be used as initial estimate of and in finding the MLE. We call the plot, D1/D2 plot, and the curve for in the plot, the D1/D2 curve.
Similarly, to investigate how well fits a data, we can plot the function for where and are defined in (4.6). We call the plot, D2/D3 plot, and the curve in it, the D2/D3 curve.
As are independent and Poisson distributed, by the delta method, we can approximate by
where
(6.1) |
and
We use as an approximate 95% pointwise confidence band for in the D1/D2 plot. Similar confidence band can be constructed in D2/D3 plot (see Appendix J).
The diagnostic checks in Table 1 suggest graphs to examine the fitness of zero-truncated Poisson distribution, geometric distribution, log-series distribution, and power law to the FoF. Graphical check for the leading three distributions exist in the statistical literature. See for example Hoaglin and Tukey [26]. Our plots are new additions from a totally new point of view. The plot focuses on , the estimated expected total rate of unseen species at time . The graphical check detects discrepancy between the estimated function and its expected pattern for when a specified distribution is assumed. Since the plotted -value in the diagnostic plot has the form , an approximate pointwise confidence band for the curve can be obtained using the approximation for defined in (6.1).
Currently a standard diagnostic plot for power law is the log-log plot which plots against for . As , log-log plot detects whether is a constant function. As suggested by the diagnostic check of power law in Table 1, we can plot against for . We call it log(D)-log plot. Log(D)-log plot checks whether is a constant function because . Log(D)-log plot is more sensitive to discrepancies with the power law because changes very slowly with respect to for many SADs. It is well-known in species-area relationship studies that the curve in log-log plot is approximately linear for various dissimilar SADs [39, 40, 36, 33].
In Figure 2, we consider four distributions for which the diagnostic graphs are designed. The parameter are (1.5, 0) (zero-truncated Poisson distribution), (1, 0.5) (geometric distribution), (0.5, 1) (log-series distribution) and (0, 1.5) (power law). Parameter is chosen so that . We draw the ESAC in panel (a) and the SAD at in panel (b). We can discriminate the power law (the black curve in the panel (a)) from other distributions as its ( is the expected total rate). Other than this, we learn little about the parameters and from visual inspection of the plots in panels (a) and (b). We need special plots to discriminate different parameter values. In D1/D2 plot in panel (c), the four distributions correspond to four straight lines with different slope . The graphical checks in panels (d)-(g) are designed for each special distribution so that when FoF follows that distribution, the curve in the plot is a straight line, which is drawn as a heavy line in Figure 2. Therefore, how straight the curve is can be used to assess how well the distribution fits the data.

7 Species Richness via Extrapolation
Assuming the whole curve to be linear may be too ambitious. If species richness is our main concern, our aim is to estimate . It is convenient to concentrate only on the rare species. We classify species into two groups: rare species and abundant species. Assume that (B1) all species represented 0, 1, 2, and 3 times in time interval are rare, and (B2) rare species follows distribution. 333The categorization of species into rare species and abundant species in (B1) and (B2) is vague and data-dependent. A clear-cut threshold at 3 for the abundant species is artificial. It leads to an intuitive but debatable conclusion: All abundant species are represented at least four times in . Results deduced from (B1) and (B2) can only be approximations. Observations and are data from this model truncated at both ends. We do not know how many rare species are represented four times or more in because the observed for may include abundant species.
It can be shown that for ,
(7.1) |
when and is independent on . When , we have
The information about the rare species from and is barely enough to make estimable. When , a plug-in estimator of is
Estimator needs modification to take into account two inequality constraints: and . Under , the constraint is equivalent to , the slope of satisfying . Therefore, our final estimator of when is
If , () is finite, and (7.1) remains true when . We have
(7.2) |
Using estimator and (7.2), we have the following estimator of .
(7.3) |
From (B1), all unseen species are rare. Estimator depends only on rare species data , and . When , reduces to Chao1 estimator [7] of , which is . If and , a reasonable estimate of is infinity. When , our estimate of is when , and is 0 when . From (B1), all abundant species are observed before time . Their contribution to is included in . Our estimator of is .
Chao1 estimator [7] is a lower bound estimator of , and works well when the rare species have equal abundance, which corresponds to the model with . Our estimator is an extension of Chao1 estimator in the sense that we assume the rare species follow model, and estimate the parameter using the FoF of rare species. It reduces to Chao1 estimator when .
A better way to understand the estimator is to relate it to an extrapolation of because we can assess its suitability in the D1/D2 plot. Abundant species usually appear early. Species that show up late are likely rare. The rear part of depends mainly on rare species. Extrapolation is a technique to extend our knowledge about rare species to the unseen species. (B1) and (B2) imply that is approximately linear when .444We do not require to be exactly linear when because given the rear part of , we can find for and and then perform interpolation using Equation (4.7) to find for all . If is linear after , it can be shown that is linear in . Consider two simple linear extrapolation methods in D1/D2 plot. The zeroth-order extrapolation uses for . Another extrapolation uses
(7.4) |
As , (7.4) is the first-order extrapolation of when . We call (7.4) a modified first-order extrapolation. It is always true that
(7.5) |
If we estimate by its MLE and by its plug-in estimator , then we can estimate for all using (7.5) once an extrapolation rule for is chosen. Species richness is just the limiting value of this . If zeroth-order extrapolation is used, the estimator of is Chao1 estimator. If modified first-order extrapolation rule is used, the estimator of is . As is nondecreasing, zeroth-order extrapolation is a lower bound of all reasonable extrapolations. It explains why Chao1 estimator estimates a lower bound of .
All ’s that fulfill the sufficient condition in Proposition 2 are asymptotically linear because its derivative is nonnegative and non-increasing and thus must have a finite limit. Furthermore, for concave , the modified first-order extrapolation is probably the first-order extrapolation. Therefore, under the sufficient condition in Proposition 2, modified first-order extrapolation is an adequate extrapolation curve when is large, and should give a plausible estimate.
Equation (7.5) is useful. If for a when , then
Similarly, if for a and when , then
where the last expression has finite limit when increases to infinity. These two facts can be used to detect the finiteness of in D1/D2 plot. If we can judge from the in the D1/D2 plot that there are positive values and such that whenever , then is likely infinite. On the other hand, if for a , it is reasonable to believe that is finite. To assist judging whether the rear part of has slope larger than or smaller than 1, it is helpful to add grid lines with slope 1 in the D1/D2 plot as demonstrated in Figure 3.
We can assess the adequacy of Chao1 estimator through investigating how well , and fit a truncated Poisson distribution by testing
against
If the null hypothesis is not rejected, Chao1 estimator gives reasonable estimate of , otherwise it only estimates a lower bound of . An unbiased estimator of is which is our test statistic. An unbiased estimator of is
The approximate p-value of the test is , where stands for standard normal distribution. We reject at significance level when the p-value is less than .
8 Rational First Derivative Ratio Family
When the curve in the D1/D2 plot is not linear, but concave, it is natural to model as a rational function. Quotient of two linear functions is too restrictive because it is in general asymptotically flat. It not only implies that is usually finite, but also that the rare species are homogeneously abundant. Therefore, we consider a ratio of a quadratic polynomial and a linear function. This form of is asymptotically linear with flexible slope. After simple manipulation, we obtain the following expression for .
Definition: (Rational First Derivative Ratio Family) A first derivative ratio, belongs to the rational first derivative ratio family (denoted as ) if
(8.1) |
where and are positive parameters, and and are nonnegative parameters. For uniqueness we assume . If , we require . This additional restriction is to ensure that is finite for all finite .
Function in (8.1) fulfills the sufficient condition in Proposition 2. When , model becomes model. has five parameters, , , , , and .
To calculate the log-likelihood function using (4.4), we need to compute and for all such that . The following equality which can be proved by mathematical induction for is helpful,
To use it, choose a value for . Given which is a scale parameter, and the values of the parameters , , and , we can find . Then apply the above recurrence relation for to compute all necessary in (4.4). The ESAC is
Thus if and only if . It can be proved that . We give the expression for in Appendix L.
9 Examples
Four real FoF data are presented in Table 2. Nonparametric analysis of the data can be found in Böhning and Schön [3], Lijoi et al. [31], Wang [49], Chee and Wang [12], Norris and Pollock [37], and Chiu and Chao [13]. In this section, we fit the data using the parametric models in Sections 5 and 8, demonstrate the use of various diagnostic plots, and compare different diversity estimates. Without loss of generality, we set . The significance level of the tests is fixed to 5%.
(i) Swine feces data | |||||||||||||
( under ) | |||||||||||||
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |||
8025 | 605 | 129 | 41 | 16 | 8 | 4 | 2 | 1 | 1 | 1 | |||
(ii) Accident data | |||||||||||||
( under ) | |||||||||||||
1 | 2 | 3 | 4 | 5 | 6 | 7 | |||||||
1317 | 239 | 42 | 14 | 4 | 4 | 1 | |||||||
(iii) Tomato flowers data | |||||||||||||
( under ) | |||||||||||||
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
1434 | 253 | 71 | 33 | 11 | 6 | 2 | 3 | 1 | 2 | 2 | 1 | 1 | |
14 | 16 | 23 | 27 | ||||||||||
1 | 2 | 1 | 1 | ||||||||||
(iv) Bird abundance data ( under ) | |||||||||||||
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 13 | 14 | |
11 | 12 | 10 | 6 | 2 | 5 | 1 | 3 | 2 | 4 | 1 | 1 | 1 | |
15 | 16 | 18 | 25 | 29 | 30 | 32 | 39 | 44 | 53 | 54 | |||
2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
The first data is the swine feces data which appeared and was analyzed in Chiu and Chao [13]. It is for the pooled contig spectra from seven non-medicated swine feces. The large relative to other frequencies is viewed as a signal for sequencing errors. Chiu and Chao [13] proposed a nonparametric estimate of the singleton count basing on the other counts, and the difference of this estimate and the observed singleton count is interpreted as outcome of missequencing. An implicit assumption of this approach is that sequencing errors inflate only the singleton count, and all other frequency counts are unaffected. It is equivalent to claim that there are sequencing errors which create solely zero-rate species.
To investigate whether zero-rate species really exist, we draw the D1/D2 and D2/D3 plots with 95% pointwise confidence bands in panels (a) and (b) respectively in Figure 3. The approximate linear curve in both plots indicates that both and are reasonable models. The heavy dashed lines in panels (a) and (b) are the lines fitted by MLE under . The fitted line is close to the curve in D1/D2 plot, but is not so in the D2/D3 plot. As the dashed line lies inside the confidence bands in panel (b), the disagreement between the singleton count and other counts is not strong enough to reject that they come from the fitted . The Pearson’s chi-square test statistic after grouping all cells with expected frequency less than 5 is 4.325 with 4 degrees of freedom. The estimated under this is 8027.6 which is marginally larger than . As , the MLE of under should be zero. There is no significant evidence for the existence of zero-rate species. The heavy green curve in panel (a) is the fitted curve under . The heavy green curve is very close to the D1/D2 curve. performs better than in terms of Akaike information criterion (AIC) (the AIC for is -136060.6, and that for is -136061). 555The standard likelihood ratio test is not valid as the tested parameter value is on the boundary of the parameter space.
In panel (a) blue grid lines with slope 1 are drawn. From Section 7, the slope of the rear part of larger (or smaller) than 1 is an indication that is infinite (or finite). Apparently the slope of has slope always larger than 1. We are confident that is infinite. The modified first-order extrapolation line for (the purple line) and the zeroth-order extrapolation line for Chao1 estimator (the orange line) are drawn. As , . The orange extrapolation line does not look good. Chao1 estimator does not perform satisfactorily.

The second data come from 9461 accident insurance policies issued by an insurance company. It was used in Böhning and Schön [3], Wang [49], and Chee and Wang [12]. The species corresponds to the policies, and the frequency count to the number of claims during a particular year. The in the D1/D2 plot for the accident data in panel (c) is concave but not far from linear. The heavy dashed line is the fitted line under . The p-value for the Pearson chi-square test for is 0.0979. The MLE of is 1.1146 which is larger than 1. Therefore, the MLE of is infinite. Compared with the blue grid lines with slope 1, the average slope of the D1/D2 curve is close to one showing uncertainty about the finiteness of . The heavy green curve in panel (c) is the fitted curve under . It is preferred to because it has smaller AIC (the AIC for and are -18691.95 and -18692.15 respectively). The MLE of under is 6354. It is less than the true value , and is comparable to the nonparametric estimates presented in Chee and Wang [12] which ranges from 4016 to 7374. The purple line is the modified first-order extrapolation used by . As the slope of the purple line is less than one, is finite. For this data, and which is closer to the true value. The orange extrapolation line for Chao1 estimator is also drawn. The difference of the two extrapolation lines is not small.
The third data come from a CDNA library of the expressed sequence tags of tomato flowers. It was studied in Böhning and Schön [3] and Lijoi et al. [31]. The D1/D2 plot in panel (d) shows that model does not fit the data well (the p-value of the Pearson chi-square test is 0.0059). The curve looks like a Bernstein function. An model is fitted and the fitted curve is shown in the plot by a heavy green curve. The model fits the data well (Pearson chi-square statistic after grouping all cells with expected frequency less than 5 is 1.470 with 2 degrees of freedom). As , the estimated is infinite. This estimate of agrees with the finding when we compare the slope of the rear part of the curve with one under the assistance of the blue grid lines. The modified first-order and zeroth-order extrapolation lines are drawn in purple and orange respectively. The difference of the slopes of the lines is not small.
The fourth data is the bird abundance data for the Wisconsin route of the North American Breeding Bird Survey for 1995. Totally 645 birds from 72 species are recorded. The data was studied in Norris and Pollock [37] where a mixture of five Poisson models was fitted, and the estimated is 76. Comparing with the blue grid lines, the slope of the D1/D2 curve is less than one. Species richness should be finite. The in the D1/D2 plot does not look like a straight line. However, the model is not rejected as the p-value of the Pearson’s chi-square test is 0.116. The estimated is 124.50. We also fit a model to the data. The fitted line for this model is shown in the figure as a heavy green curve. The MLE of under is 80.7. is preferred to as its AIC is -25.63 which is less than that for which is -24.53. does not fit the rear part of well because of an abrupt change of slope around . After , looks quite linear. It suggests that may be a good estimate. For this data, , which is close to the Chao1 estimate 77.04.

We investigate the performance of various special graphical checks in Figure 4. If the curve in the plot is close to a straight line, we assume that the corresponding distribution fits the data. The red dashed curves are the 95% approximate pointwise confidence band. The five columns correspond to five graphical checks: namely the plot for zero-truncated Poisson distribution (check ), for geometric distribution (check ), for log-series distribution (check ), for power law (check ; it is the log(D)-log plot), and the log-log plot for power law. The first four rows of plots are for the four real FoF data. The last row is for data simulated from the specified distribution which the diagnostic plot is designed to check. The curves in the last row are close to a straight line as expected. The log-log plot in the last column does not perform satisfactorily. The first three graphs shows almost perfect line. Comparatively, the log(D)-log plot correctly declare discrepancy of power law to the four real data. The graphical check for log-series distribution correctly detects that accident data follow closely to a log-series distribution (the MLE of under is 1.1146). The curves for swine feces data and the tomato flowers data are concave. It implies that is infinite as pointed out at the end of Section 4. The plot for geometric distribution and zero-truncated Poisson plot look reasonable except for the accident data.
Consider estimation of Hill numbers. In Table 3, we compare the model-based estimator developed in the paper, and Chao-Jost estimator proposed in Chao and Jost [10]. For our model-based estimator, confidence intervals are constructed using parametric bootstrap method. Details are given in Appendix K. The difference between our estimators and Chao-Jost estimator is mild when . In other situations, the difference can be large. Sometimes the two confidence intervals do not overlap. The width of the 95% confidence interval for our parametric estimator is larger than that of the corresponding interval for Chao-Jost estimator. The interval estimate for Chao-Jost estimator is obtained basing on the assumption that is finite and all unseen species have equal abundance. For the accident data, the true is contained in the model-based 95% confidence interval, but not in that for the Chao-Jost estimator.
[b]
Data | Model/Method | |||
---|---|---|---|---|
Swine | 194649 | 27698 | ||
feces | ||||
data | Chao-Jost 1 | 62051 | 53835 | 27801 |
Accident | 6354 | 5203 | 3557 | |
data | ||||
(True | Chao-Jost | 5248 | 4788 | 3606 |
) | ||||
Tomato | 5941 | 1311 | ||
flowers | ||||
data | Chao-Jost | 5887 | 4079 | 1450 |
Bird | 124.51 | 43.84 | 28.43 | |
abundance | ||||
data | Chao-Jost | 77.03 | 41.94 | 27.57 |
Consider estimation of species richness. In Table 4, we compare , Chao1 estimator, iChao1 estimator in Chiu et al. [14], and the abundance-based coverage estimator (ACE) in Chao and Lee [9] and Chao et al. [11]. We construct the confidence intervals for basing on using parametric bootstrap method. Details are given in Appendix K. Estimator usually gives larger estimate. For the bird abundance data, all estimators give similar point estimate. For the accident data, gives the best estimate of the true . For the remaining two data, the differences are huge. Our estimate is infinite, while other estimates are finite as assumed.
The difference between and the other three methods can well be explained by the test on whether the observed , and fit a truncated Poisson distribution. The p-value of the test at the end of Section 7 for the bird abundance data is 0.374. For this data, all estimates are close to each others. The p-value for accident data is 0.040. The difference is significant, but not huge. The p-value of the other two data are smaller than implying that the deviation is huge. The 95% confidence interval for swine feces data is , a much stronger signal when compared to the confidence interval in Table 3. For , only the 95% confidence interval for bird abundance data has finite width. It has the lower endpoint close to the lower endpoint of other confidence intervals, but it has much larger upper endpoint.
Estimator gives wider confidence interval than that for Chao1 estimator because it uses a modified first-order extrapolation that requires the estimation of slope, whereas zeroth-order extrapolation does not have this requirement. Nevertheless, first-order extrapolation looks more rational when compared to zeroth-order extrapolation which can only estimate a lower bound.
[b]
Data | Estimator | Estimate | 95%Lower | 95%Upper |
---|---|---|---|---|
Swine | ||||
feces | Chao1 1 | 62051.3 | 57409.9 | 67136.2 |
data | iChao1 2 | 67615.0 | 62889.6 | 72753.5 |
ACE 3 | 68827.7 | 62928.4 | 75370.3 | |
Accident | 8249.2 | 5087.4 | ||
data | Chao1 | 5247.8 | 4682.7 | 5917.3 |
(True D | iChao1 | 5966.7 | 5396.9 | 6622.6 |
= 9461) | ACE | 5683.8 | 5031.1 | 6461.5 |
Tomato | 18108.3 | |||
flowers | Chao1 | 5887.4 | 5274.4 | 6609.2 |
data | iChao1 | 6512.3 | 5951.4 | 7149.5 |
ACE | 6645.5 | 7779.2 | 11859.5 | |
Bird | 77.9 | 72.5 | 297.0 | |
abundance | Chao1 | 77.0 | 73.3 | 92.1 |
data | iChao1 | 77.5 | 74.6 | 83.3 |
ACE | 78.7 | 74.3 | 91.7 |
-
1
Chao1 estimate and the confidence interval are computed using R package SpadeR [8].
- 2
- 3
10 -appearance Design
Usually we collect all available information within the survey period. Under our framework, all useful information is in FoF. If labor saving is our concern, we may neglect some minor information, say halt recording a species as soon as its observed frequency reaches a fixed positive integer in the study period . In this case, the observation period for each species varies. The period is short for abundant species, and long for rare species. We call this design, the -appearance design. This design places more emphasis on rare species than abundant species which is in line with the common understanding that information about the rare species is critical when our interest is in . In bird survey, species can be identified by distant sightings or short bursts of song. Stop recording abundant species early helps the researcher concentrating more on the rare species. When , we obtain . When , we record only the first appearance-time of each seen species. It is exactly the information available in the empirical SAC.
We call the appearance time of the th individual of a species in a survey, the -appearance time of that species. Our observations are where are the observed -appearance times. The log-likelihood function given a realization is proved in Appendix M to be
(10.1) |
A numerical advantage of (10.1) is that it does not require a general expression of for all . A small simulation experiment in Appendix N shows that the loss in information of -appearance design when compared to the standard design is marked when , and minor when .
By the displacement theorem of Poisson process [29], the -appearance times form a Poisson process with intensity function
(10.2) |
From (4.2), another expression for is . Equation (10.2) gives another interpretation of . For example, is the intensity function of the 1-appearance times (thus is the density function of the 1-appearance times of all seen species in ), and is the intensity function of the 2-appearance times.
11 Inference on Empirical Species Accumulation Curve
Suppose we only observe the empirical SAC, for , or equivalently, the 1-appearance times of all seen species in , say (i.e., the case in Section 10). From (10.1), the log-likelihood function is . If has a free scale parameter, MLE of is . The MLE of for power law has a simple closed form , where .
Given a parametric form of , a traditional approach is to fit it to the empirical SAC by linear or non-linear least-squares method. Two differences between the MLE approach and the curve-fitting method are noteworthy. First, the MLE of is equal to whenever has a free scale parameter, and it is not the case in the curve-fitting approach. Second, the MLE method fits the density function to the 1-appearance times, while the curve-fitting method fits a function to the empirical SAC. The curve-fitting methods do not take the interdependence among the points in the empirical SAC into consideration. Since such interdependence is present in species-time curves and Type I species-area curves [43], the curve-fitting approach is theoretically flawed. Although improvements can be made in curve-fitting approach through transformation and/or adding weights, maximum likelihood approach is preferred for its proven effectiveness under the assumption that the model is correct.
In certain situations, only the values of the empirical SAC at a finite set of times are available. For example, only the cumulative number of species observed after day 1, day 2, and so on are recorded. Suppose the observed is for with . The log-likelihood function is
In the simulation study in Appendix O, MLE has smaller root mean squared relative error in extrapolation when compared to the curve-fitting method.
The distribution function of the 1-appearance times in time interval is . This fact holds in general for any nondecreasing with , including sigmoid function such as the cumulative Weibull function. If we assume that the 1-appearance times are independent, we can perform maximum likelihood inference conditional on when a parametric form of is given. Full maximum likelihood calculation is possible when further assumption on the distribution of is made. As the empirical SAC is proportional to the empirical distribution function of the 1-appearance times, statistical tools for empirical distribution function can be used. For example, we can apply the Dvoretzky-Kiefer-Wolfowitz inequality to construct confidence bands for the distribution function .
12 Discussion
A general framework, MPPP, is proposed for species abundance data. This framework has the following novel features: (1) it delineates the relation between two analytic tools – SAD and SAC, and an MPPP can be characterized by either one of them; (2) it characterizes the class of possible ESACs to be the class of Bernstein functions, giving solid theoretical support to the empirical observation that SACs are nondecreasing concave functions; (3) it allows the existence of zero-rate species; (4) it contains the model, a parametric model where the first derivative ratio of the ESAC is linear, which admits several graphical checks, and two generalizations, namely which includes zero-rate species, and model where the first derivative ratio is a rational function; (5) it admits a new species richness estimator ; (6) it admits a natural generalization of Hill numbers as measures of species diversity, which are well-defined even when the number of species is random and infinite; and (7) it allows inference to be performed when only the first appearance times of a species is recorded, which we refer as -appearance design.
Compared to the conventional curve-fitting approach to SAC, our framework has a more solid theoretical underpinning, as an MPPP is uniquely characterized by an ESAC that is a Bernstein function. In the study of species-area curve, power law is popular. Nevertheless, under our model, power law is an extreme case. Its and for any positive are infinite, and its PSAD has a heavy right tail. Even though curve fitting can give reasonable estimates, the MPPP framework provides a parametric generative model in which maximum likelihood estimates can be obtained. Matthews and Whittaker [34] suggests using maximum likelihood methods instead of least-squares approaches in SAD fitting. We would like to extend their advice to ESAC fitting.
One possible future direction is to apply MPPP in a nonparametric setting, which lessens the need for model selection. Another future direction is to understand the tradeoff that arises in the -appearance design. A small can reduce the sampling effort, at the expense of less accurate estimates of parameters. Investigating this tradeoff in a theoretic and empirical setting can provide guidance to the design.
Acknowledgments
The authors are grateful to the associate editor and two anonymous referees for helpful suggestions that improved the presentation of this article.
Appendix A Conditional Distribution of the Rate of an Individual at a Fixed Time
Informally, if we condition on the event that there is one individual at exact time , then the rate of its species follows the distribution if . The formal statement below involves a limit argument.
Proposition A: Fix , and assume . Conditional on the event that there is at least one individual in the time interval for , then the probability that there is exactly one individual in tends to , and the rate of its species converges in distribution to the distribution as .
Proof. Let be the event that there is at least one individual in the time interval , and be the event that there is exactly one individual in the time interval for . The probability that a species with rate is observed during is . By the thinning property, the rates of the positive-rate species observed during form a Poisson process with intensity measure . Considering also the zero-rate species, the rates of the species observed during form a Poisson process with intensity measure (where when ). Event is that this Poisson process has at least one point. As when ,
Similarly, the rates of the species observed exactly one time during form a Poisson process with intensity measure
Event is that this Poisson process has exactly one point. Thus
which is equal to when because is finite. Therefore,
As probability cannot be larger than 1, it implies that as .
Given that exactly one individual appears in , let be the rate of the species that the individual belongs and be its distribution function. For any positive , let be the event that is less than or equal to . Then
It follows that
Therefore, converges in distribution to the distribution as .
For the case , as , the rate of the species observed in diverges to infinity.
Proposition B: Fix , and assume . Conditional on the event that there is at least one individual observed in the time interval for , then the minimum rate among observed species converges in probability to as .
Proof. Let be the event that there is at least one individual in the time interval . As in the case in Proposition A, for any fixed , we have
as since due to by the definition of MPPP. On the other hand, as when ,
Since , for any ,
Appendix B Proof of the equivalence between Condition (2.1) and the finiteness of ESAC
Appendix C Equivalent definition of the MPPP
In the definition of the MPPP in the paper, the individuals of the zero-rate species and the individuals of the positive-rate species are generated separately. Here we present an alternative definition where all individuals are generated in a unified manner.
Definition: (Equivalent definition of MPPP) An MPPP is characterized by a nonzero species intensity measure , which is a measure over satisfying . Define to be a measure over by
for any measurable set . Generate (a finite or countably infinite sequence) according to a Poisson process with intensity measure . For each simulated , we generate a realization (independently across ) of a Poisson process with rate , conditioned on the event that the first point is at time . (That is, it contains the point together with a Poisson process starting at time . If , then contains only one point .) Each stores the appearance times of one species with rate . Finally, we take .
This definition models the species that will eventually be observed in a study. The first appearance time for each of such species (i.e., the first point of each ) is explicitly included in the definition of .
Appendix D Proof of the Probability that an Individual Observed in a Given Future Time Belongs to a Species Represented times in
Let be the rate of the species observed in a given future time. Suppose . From Proposition A in Appendix A, the probability measure of is . Thus
The last expression holds also when (i.e., the above probability is zero) because from Proposition B in Appendix A, the rate of the future individual approaches infinity and thus that species should have been seen infinite number of times in .
Appendix E Proof of Equation (3.5)
From the thinning property of Poisson processes, the probability measure for given that the species is observed in time interval is
Therefore,
Appendix F Proof of Proposition 1
For any MPPP, from (4.1), is a Bernstein function. Clearly . It proves the necessity part of (a). The sufficiency part of (a) follows from the fact that every Bernstein function with has a unique Lévy-Khintchine representation
where , and is a measure over such that . Define an MPPP with and . The condition becomes Condition (2.1). From (3.2), the ESAC of this MPPP is equal to . It proves the sufficiency part of (a). It also proves that uniquely determines an MPPP in (c) because Lévy-Khintchine representation is unique.
To prove (d), we only need to show that in (4.3) is the probability generating function of . Clearly . From (4.2), for , . It completes the proof of (d).
Given and for a fixed , from (4.3), . The ESAC is uniquely determined and so is the MPPP. It completes the proof for the remaining part of (c).
To prove the necessity of (b), let be the probability generating function of . From (4.3),
Thus and . Furthermore,
It follows that when , . Because has sign , for . It completes the proof of the necessity part. For the sufficiency part, suppose , is absolutely monotone in . Let and be two given positive values. Define . It can be shown that is a Bernstein function such that . From (a), there is an MPPP with ESAC equal to . We have . From (4.3), is the probability generating function of of this MPPP. It completes the proof of (b).
Appendix G Proof of Proposition 2
Let satisfy conditions (i) and (ii) in Proposition 2. Therefore, there exist and such that for all , we have . Let . We show that is a in the Proposition.
When ,
When ,
Therefore, is finite for any . As , and , we have . Since , from Proposition 1(a), it is sufficient if we can prove that
(G.1) |
Clearly (G.1) holds when . Suppose (G.1) holds when for a . As , after taking derivative times on both sides, we have
For , . Furthermore, . Thus . It implies that completing the proof of (G.1) by induction.
Appendix H Proof of Proposition 3
If an SAD is time invariant, then its probability generating function, does not depend on . From (4.3), there is a positive function such that for . Take logarithm on both sides, take derivative with respect to , and then set . We have
The solution of the above differential equation is . As is a Bernstein function, . Hence power law is the only law with independent on .
Consider the second part of Proposition 3. Suppose an SAD in MPPP converges to a proper distribution with probability generating function . Then the of this SAD converges to for any . Let and be any two values in . From (4.3), . Write . Then . Similar to the proof above, we have when . It implies that . As , . Since , we have . It completes the proof as the probability generating function of power law is .
Appendix I Proof of
For any positive integer , condition implies
Therefore, for . Let where and . Then which is a linear function of because . Clearly but not in . Therefore, . It can be shown that every element in has the form for a . It means that is a mixture of zero-rate species and .
Consider for . Let (i.e., ). As th derivative ratio is always a nonnegative nondecreasing function of , both and are nonnegative. For simplicity, we only consider the case when and (cases when and can be studied through letting and respectively). Then
(I.1) |
(I.2) |
and
(I.3) | |||||
If and , from (I.2), when is large. It is impossible because they should have different sign. If and , from (I.1), is a zero function. It is impossible as is undefined. If , from (I.2) and (I.3), when is large. It is possible only when . It implies that . From (I.1) and (I.2), . Thus . It follows that for all .
Appendix J Pointwise Confidence Band for D2/D3 plot
Similar to the confidence band for D1/D2 plot, an approximate 95% pointwise confidence band for D2/D3 plot is , where
with
and
Appendix K Parametric Bootstrap Confidence Interval for Hill Numbers
We have , an estimator of which is or a Hill number of order . We want to construct a confidence interval for . Let be the total number of bootstrap samples such that is an integer. In this paper, we use and . As can be infinite, we avoid using methods that require arithmetic on , such as the basic method. The procedure to construct a parametric bootstrap confidence interval is as follows:
For to do {
Generate a bootstrap sample from a parametric model.
Compute , which we denote as basing on the bootstrap sample. }
Sort in ascending order. Denote the sorted as . Set and . A confidence interval for is
where is an integer such that and the above confidence interval is smallest.
Infinity is a special value of which can have positive probability mass. We use the smallest confidence interval, and attempt to construct confidence interval with finite upper endpoint if possible.
When is an estimator of a Hill number under parametric model or , the bootstrap sample is a simulated FoF under the selected model with the parameter equal to the MLE. This simulation consists of two steps: (i) simulate from distribution, and (ii) generate a random sample of size from the fitted SAD and they form the FoF.
When in (7.3), the bootstrap sample is . We simulate from distribution for independently across . After a confidence interval for is constructed, add to it to find a confidence interval for . If at least one of or is zero, modification is recommended to avoid having any () to be fixed to 0. When there is such that , we move the ending time a little bit backward to , say where is the total number of individuals observed in time interval , and find using (4.8) as shown below:
Clearly for . We use in place of in bootstrapping. We estimate () using the relation in (4.5). Add this estimate to the confidence interval for to construct a confidence interval for .
Appendix L Expressions of for
If or is zero, reduces to . Assume that and are positive.
Assume further that , which implies that is finite. When and ,
where . When ,
where is the Beta function and is the digamma function.
Appendix M Proof of the log-likelihood function for -appearance design
Let be the length of the observation period for species , and be the observed frequency of species in its observation period. For species , we observe . Either or . For a species with rate , the probability function of is
(M.1) |
Since the time of the th individual follows distribution,
(M.2) |
By (M.1) and (M.2), the joint probability density function of the observations , say given is proportional to
Therefore, the log-likelihood function is
Appendix N Simulation experiment on the loss of information of the -appearance design
Let us reconsider the bird abundance data for the Wisconsin route of the North American Breeding Bird Survey for 1995 in Section 9. We choose where is the density function of distribution. The parameter is the expected total number of species. From Section 3, the MLE of is identical to the conditional maximum likelihood estimate of the corresponding Poisson-lognormal model [5]. The fitted lognormal mixing distribution is distribution with and . Let where is a Poisson-lognormal random variable with parameters and . We use the function “dpoilog” in R-package “poilog” [23] to compute this probability. The estimated is .
Without loss of generality, set . We use this data to investigate the information loss of the -appearance design. The -appearance data are simulated from the data using the following procedure:
Simulation procedure: Suppose species has observed frequency in time If , our data for this species is , the frequency of it in time If , we simulate the -appearance time of the species, from distribution, which is the distribution of the order statistic of samples from the distribution.
It can be shown that . The log-likelihood function is
As the MLE of is , MLE of and , say and respectively can be found through maximizing the following function.
The MLE of is . We consider . For each -value, we simulate 100 independent sets of -appearance data. For each simulated data, , and are estimated. The sample mean and sample standard deviation of the estimates are presented in Table 5. Graphical display is given in Figure 5.
The mean of the estimate is close to that basing on . The standard deviation of the estimator decreases as increases. From the simulation results, the standard deviation of the estimators when is considerably worse than those when . Value performs well for this data. The total number of species with frequency less than 4 is 33, around 46% of the seen species.
1 | 2 | 3 | 4 | 5 | 6 | ||
---|---|---|---|---|---|---|---|
mean of | 1.10 | 1.28 | 1.23 | 1.21 | 1.21 | 1.22 | 1.23 |
sd of | 0.56 | 0.09 | 0.07 | 0.04 | 0.04 | 0.03 | 0* |
mean of | 1.39 | 1.24 | 1.29 | 1.29 | 1.31 | 1.30 | 1.30 |
sd of | 0.48 | 0.15 | 0.13 | 0.08 | 0.09 | 0.07 | 0* |
mean of | 90.4 | 83.8 | 85.0 | 85.5 | 85.7 | 85.3 | 85.2 |
sd of | 18.7 | 2.52 | 2.32 | 1.38 | 1.56 | 1.29 | 0* |
* When , we always observe the full data, and the sample
standard deviation (sd) of the estimator across simulation is zero.

Appendix O Simulation comparison of MLE method and curve-fitting method in extrapolation when finite number of points in the empirical SAC are available
Without loss of generality, we set . Our data is . We consider three distributions: power law, log-series distribution and geometric distribution. The curve-fitting methods for the distributions are described below:
-
(a) Power law: . For curve fitting, we regress on .
-
(b) Log-series distribution: . For curve fitting, we regress on . It is the standard approximation method which assumes that .
-
(c) Geometric distribution (hyperbola law): . There are various curve-fitting methods for hyperbola law (see for example Raaijmakers [41]). In this simulation, we regresses on .
The parameter is the expected number of recorded species at time . In the experiment, can take value 200 and 1000. The distribution parameter can take 6 values. For power law, the value of can be 1.25, 1.5, 2, 3, 4 and 5. For log-series distribution, the value of can be 0.01, 0.02, 0.03, 0.05, 0.1, and 0.2. For geometric distribution, the value of can be 0.05, 0.1, 0.2, 0.4, 0.6 and 0.8. For each combination of parameters, we simulate 5000 data. The MLE and the curve-fitting estimate of and are found for each simulated data. We evaluate the performance of an estimator by the root mean squared relative error (RMSRE) (for estimator for , ). The results of the simulation are presented in Tables 6, 7 and 8. The RMSRE for MLE is smaller than that for the curve-fitting method in this simulation study.
() | |||||||
---|---|---|---|---|---|---|---|
(2,200) | RMSRE(Curve-fitting) | 0.075 | 0.078 | 0.083 | 0.091 | 0.096 | 0.100 |
RMSRE(MLE) | 0.073 | 0.074 | 0.077 | 0.080 | 0.082 | 0.083 | |
(2,1000) | RMSRE(Curve-fitting) | 0.034 | 0.035 | 0.037 | 0.040 | 0.042 | 0.044 |
RMSRE(MLE) | 0.033 | 0.034 | 0.035 | 0.036 | 0.037 | 0.037 | |
(4,200) | RMSRE(Curve-fitting) | 0.081 | 0.089 | 0.103 | 0.120 | 0.132 | 0.140 |
RMSRE(MLE) | 0.078 | 0.084 | 0.093 | 0.103 | 0.109 | 0.112 | |
(4,1000) | RMSRE(Curve-fitting) | 0.036 | 0.040 | 0.045 | 0.052 | 0.057 | 0.060 |
RMSRE(MLE) | 0.035 | 0.038 | 0.042 | 0.046 | 0.049 | 0.050 |
() | |||||||
---|---|---|---|---|---|---|---|
(2,200) | RMSRE(Curve-fitting) | 0.072 | 0.073 | 0.073 | 0.077 | 0.090 | 0.122 |
RMSRE(MLE) | 0.072 | 0.072 | 0.072 | 0.072 | 0.073 | 0.076 | |
(2,1000) | RMSRE(Curve-fitting) | 0.033 | 0.034 | 0.036 | 0.043 | 0.065 | 0.106 |
RMSRE(MLE) | 0.032 | 0.032 | 0.032 | 0.033 | 0.033 | 0.034 | |
(4,200) | RMSRE(Curve-fitting) | 0.074 | 0.075 | 0.077 | 0.084 | 0.110 | 0.166 |
RMSRE(MLE) | 0.073 | 0.074 | 0.074 | 0.076 | 0.079 | 0.086 | |
(4,1000) | RMSRE(Curve-fitting) | 0.034 | 0.037 | 0.042 | 0.055 | 0.091 | 0.155 |
RMSRE(MLE) | 0.033 | 0.033 | 0.034 | 0.034 | 0.035 | 0.038 |
() | |||||||
---|---|---|---|---|---|---|---|
(2,200) | RMSRE(Curve-fitting) | 0.322 | 0.458 | 0.586 | 0.687 | 0.731 | 0.756 |
RMSRE(MLE) | 0.085 | 0.106 | 0.136 | 0.165 | 0.174 | 0.174 | |
(2,1000) | RMSRE(Curve-fitting) | 0.316 | 0.454 | 0.583 | 0.684 | 0.728 | 0.753 |
RMSRE(MLE) | 0.054 | 0.079 | 0.111 | 0.134 | 0.136 | 0.132 | |
(4,200) | RMSRE(Curve-fitting) | 0.320 | 0.462 | 0.601 | 0.716 | 0.768 | 0.798 |
RMSRE(MLE) | 0.102 | 0.145 | 0.218 | 0.311 | 0.366 | 0.397 | |
(4,1000) | RMSRE(Curve-fitting) | 0.315 | 0.458 | 0.598 | 0.713 | 0.766 | 0.796 |
RMSRE(MLE) | 0.075 | 0.122 | 0.188 | 0.254 | 0.279 | 0.287 |
References
- Arrhenius [1921] Arrhenius, O. (1921). Species and area. Journal of Ecology 9 95–99.
- Béguinot [2016] Béguinot, J. (2016). Extrapolation of the species accumulation curve associated to “chao” estimator of the number of unrecorded species: a mathematically consistent derivation. Annual Research & Review in Biology 11 1–19.
- Böhning and Schön [2005] Böhning, D. and Schön, D. (2005). Nonparametric maximum likelihood estimation of population size based on the counting distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54 721–737.
- Boneh et al. [1998] Boneh, S., Boneh, A., and Caron, R. J. (1998). Estimating the prediction function and the number of unseen species in sampling with replacement. Journal of the American Statistical Association 93 372–379.
- Bulmer [1974] Bulmer, M. G. (1974). On fitting the Poisson lognormal distribution to species-abundance data. Biometrics 30 101–110.
- Bunge and Fitzpatrick [1993] Bunge, J. and Fitzpatrick, M. (1993). Estimating the number of species: a review. Journal of American Statistical Association 88 364–373.
- Chao [1984] Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11 265–270.
- Chao, Ma, Hsieh and Chiu [2016] Chao, A. and Hsieh, T. C. (2016). User’s guide for online program SpadeR (species-richness prediction and diversity estimation in R). DOI:10.13140/RG.2.2.20744.62722
- Chao and Lee [1992] Chao, A. and Lee, S. M. (1992). Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87 210–217.
- Chao and Jost [2015] Chao, A. and Jost, L. (2015). Estimating diversity and entropy profiles via discovery rates of new species. Methods in Ecology and Evolution 6 873–882.
- Chao et al. [1993] Chao, A., Ma, M. C., and Yang, M. C. K. (1993). Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika 43 783–791.
- Chee and Wang [2016] Chee, C.-S. and Wang, Y. (2016). Nonparametric estimation of species richness using discrete k-monotone distributions. Computational Statistics and Data Analysis 93 107–118.
- Chiu and Chao [2016] Chiu, C.-H. and Chao, A. (2016). Estimating and comparing microbial diversity in the presence of sequencing errors. PeerJ 4 1634.
- Chiu et al. [2014] Chiu, C. H., Wang, Y. T., Walther, B. A. and Chao, A. (2014). An improved nonparametric lower bound of species richness via a modified Good-Turing frequency formula. Biometrics 70 671–682.
- Dengler [2009] Dengler, J. (2009). Which function describes the species–area relationship best? A review and empirical evaluation. Journal of Biogeography 36 728–744.
- Deolalikar and Laffitte [2016] Deolalikar, V. and Laffitte, H. (2016). Extensive large-scale study of error surfaces in sampling-based distinct value estimators for databases. IEEE International Conference on Big Data pp. 1579–1586.
- Dickie [2010] Dickie, I. A. (2010). Insidious effects of sequencing errors on perceived diversity in molecular surveys. The New Phytologist 188 916–918.
- Efron and Thisted [1976] Efron, B. and Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika 63 435–447.
- Engen [1974] Engen, S. (1974). On species frequency models. Biometrika 61 263–270.
- Fisher et al. [1943] Fisher, R. A., Corbet, A. S., and Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology 12 42–58.
- Good [1953] Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40 237–264.
- Good and Toulmin [1956] Good, I. J. and Toulmin, G. (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 43 45–63.
- Grøtan and Engen [2015] Grøtan, V. and Engen, S. (2015). Package ‘poilog’. Biometrics 30 651–660.
- Haas et al. [1995] Haas, P. J., Naughton, J. F., Seshadri, S., and Stokes, L. (1995). Sampling-based estimation of the number of distinct values of a attribute. VLDB 95 311–322.
- Hill [1973] Hill, M. O. (1973). Diversity and evenness: a unifying notation and its consequences. Ecology 54 427–432.
- Hoaglin and Tukey [1985] Hoaglin, D. C. and Tukey, J. W. (1985). Checking the shape of discrete distributions. In Exploring Data Tables, Trends, and Shapes, Eds. D. C.Hoaglin, F. Mosteller & J. W. Tukey, pp.345–416, New York: John Wiley and Sons.
- Kemp [1968] Kemp, A. W. (1968). A wide class of discrete distributions and the associated differential equations. Sankhya, Series A 30 401–410.
- Kemp and Kemp [1956] Kemp, C. D. and Kemp, A. W. (1956). Generalized hypergeometric distributions. Journal of the Royal Statistical Society, Series B 18 202–211.
- Kingman [1993] Kingman, J. F. C. (1993). Poisson Processes. Oxford: Oxford University Press.
- Kobayashi [1975] Kobayashi, S. (1975). The species-area relation ii. a second model for continuous sampling. Researches on Population Ecology 16 265–280.
- Lijoi et al. [2007] Lijoi, A., Mena, R. H., and Prünster, I. (2007). Bayesian nonparametric estimation of the probability of discovering new species. Biometrika 94 769–786.
- Magurran [2007] Magurran, A. E. (2007). Species abundance distributions over time. Ecology Letters 10 347–354.
- Martin and Goldenfield [2006] Martin, H. G. and Goldenfield, N. (2006). On the origin and robustness of power-law species-area relationships in ecology. PNAS 103 10310–10315.
- Matthews and Whittaker [2014] Matthews, T. J. and Whittaker, R. J. (2014). Fitting and comparing competing models of the species abundance distribution: assessment and prospect. Frontiers of Biogeography 6 67–82.
- Matthews and Whittaker [2015] Matthews, T. J. and Whittaker, R. J. (2015). On the species abundance distribution in applied ecology and biodiversity management. Journal of Applied Ecology 52 443–454.
- May [1975] May, R. M. (1975). Patterns of species abundance and diversity. In Ecology and Evolution of Communities, Eds. M. L. Cody & J. M. Diamond, pp.81–120. Cambridge: Belknap Press.
- Norris and Pollock [1998] Norris, J. L. and Pollock, K. H. (1998). Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environmental and Ecological Statistics 5 391–402.
- Novotny and Basset [2000] Novotny, V. and Basset, Y. (2000). Rare species in communities of tropical insect herbivores: pondering the mystery of singletons. Oikos 89 564–572.
- Preston [1960] Preston, F. W. (1960). Time and space and the variation of species. Ecology 41 612–627.
- Preston [1962] Preston, F. W. (1962). The canonical distribution of commonness and rarity. Ecology 43 185–215.
- Raaijmakers [1987] Raaijmakers, J. G. W. (1987). Statistical analysis of the Michaelis-Menten equation. Biometrics 43 793–803.
- Sanders [1968] Sanders, H. L. (1968). Marine benthic diversity: a comparative study. American Naturalist 102 243–282.
- Scheiner [2003] Scheiner, S. M. (2003). Six types of species-area curves. Global Ecology & Biogeography 12 441–447.
- Schilling et al. [2012] Schilling, R. L., Song, R. and Vondracek, Z. (2012). Bernstein Functions: Theory and Applications. de Gruyter.
- Strook [2010] Stroock, D. W. (2012). Probability Theory: an Analytic View. Cambridge university press.
- Tjørve [2003] Tjørve, E. (2003). Shapes and functions of species-area curves: a review of possible models. Journal of Biogeography 30 827–835.
- Tjørve [2009] Tjørve, E. (2009). Shapes and functions of species-area curves (ii): a review of new models and parameterizations. Journal of Biogeography 36 1435–1445.
- Trushkowsky et al. [2012] Trushkowsky, B., Kraska, T., Franklin, M. J., and Sarkar, P. (2012). Getting it all from the crowd. arXiv preprint arXiv:1202.2335.
- Wang [2010] Wang, J.-P. (2010). Estimating species richness by a Poisson-compound gamma model. Biometrika 97 727–740.
- Williams et al. [2009] Williams, M. R., Lamont, B. B., and Henstridge, J. D. (2009). Species-area functions revisited. Journal of Biogeography 36 1994–2004.
- Zhou et al. [2017] Zhou, M., Favaro, S., and Walker, S. G. (2017). Frequency of frequencies distributions and size-dependent exchangeable random partitions. Journal of the American Statistical Association 112 1623–1635.