
Transition in the ancestral reproduction rate and its implications for the site frequency spectrum

Yubo Shuai
University of California San Diego
Abstract

Consider a supercritical birth and death process where the children acquire mutations. We study the mutation rates along the ancestral lineages in a sample of size $n$ from the population at time $T$. The mutation rate is time-inhomogeneous and has a natural probabilistic interpretation. We use these results to obtain asymptotic results for the site frequency spectrum associated with the sample.

AMS 2020 subject classifications. Primary 60J80; Secondary 60J90, 92D15, 92D25.
Key words and phrases. Birth and death process, ancestral lineage, site frequency spectrum.

1 Introduction

Consider a growing population with mutations introduced at the times of birth events. While it is a common assumption in population genetics that mutations are introduced in a time-homogeneous manner [9], the mutations observed in the sample have a time-inhomogeneous behavior. This comes from the fact that birth events occurring at different times have different chances of being observed in the sample. In this paper, we analyze the mutations observed in the sample and apply this result to the site frequency spectrum.

1.1 Ancestral reproduction rate

Cheek and Johnston [5] considered a continuous-time Galton-Watson process where each individual reproduces at rate $r$. In a reproduction event, an individual is replaced by $k\geq 0$ individuals with probability $p_k$. Denote by $N_t$ the population size at time $t$ and by $F_t(s)=\mathbb{E}[s^{N_t}]$, $s\in[0,1]$, its generating function. Conditional on survival up to time $T$, i.e., $N_T>0$, an individual is sampled uniformly at random from the population at time $T$. They studied the rate of reproduction events along the ancestral lineage of the sampled individual. The rate is biased towards large reproduction events and is inhomogeneous along the ancestral lineage. Specifically, they show the following theorem.

Theorem 1.1 (Theorem 2.2 of [5]).

There exists a random variable $S$ with density $F'_T(s)/(1-F_T(0))$ on $[0,1]$ such that, conditional on $S=s$, independently for each $l$, size-$l$ reproduction events occur along the uniform ancestral lineage according to a time-inhomogeneous Poisson point process with intensity function

$$r_l(s,t)=rlp_lF_{T-t}(s)^{l-1}.$$

Igelbrink and Ischebeck [13] generalized this result to the case when the reproduction rates are age-dependent.

In general, one can also take a sample of size $n$ from the population. When we trace back the ancestral lineages in the sample, they coalesce when individuals find their common ancestors, giving us a coalescent tree. A central question in population genetics is to understand this coalescent tree and the distribution of the mutations along the tree.

In this paper, we consider sampling $n$ individuals uniformly at random from a supercritical birth and death process with birth rate $\lambda$ and death rate $\mu$, where $\lambda>\mu$. We write $r=\lambda-\mu>0$ for the net growth rate. When there is a birth event, we assume the number of mutations that the child acquires has a Poisson distribution with mean $\nu$. The process starts with one individual and runs for $T$ units of time. As usual, we write $N_t$ for the population size at time $t$ and $F_t(s)=\mathbb{E}[s^{N_t}]$, $s\in[0,1]$, for its generating function. Conditional on $N_T\geq n$, we take a uniform sample of size $n$.

We give the construction of the coalescent tree and the mutations in the following proposition, with the explanation given in Section 3. It is easier to construct the tree backward in time. That is, we trace back the lineages of the sampled individuals and track the time it takes for individuals to find their common ancestors. If there are $n$ individuals in the sample, then there are $n-1$ coalescent times $H_{i,n,T}$ for $1\leq i\leq n-1$. We should emphasize that the construction of the coalescent tree without mutations was obtained by Harris, Johnston, and Roberts [12] and Lambert [17] using two different approaches. The approach described in steps 1 and 2 of the construction below comes from Lambert [17].

Proposition 1.1.

One can generate the coalescent tree with the mutations as follows.

  1. Take the sampling probability $Y_{n,T}$ from the density

     $$f_{Y_{n,T}}(y)=\frac{n\delta_Ty^{n-1}}{(y+\delta_T-y\delta_T)^{n+1}},\qquad y\in(0,1),\qquad(1)$$

     where $\delta_t=r/(\lambda e^{rt}-\mu)$.

  2. Conditional on $Y_{n,T}=y$, the 0th branch length $H_{0,n,T}$ is $T$, and independently for $i=1,2,\dots,n-1$, the $i$th branch length $H_{i,n,T}$ has density

     $$f_{H_{i,n,T}|Y_{n,T}=y}(t)=\frac{y\lambda+(r-y\lambda)e^{-rT}}{y\lambda(1-e^{-rT})}\cdot\frac{y\lambda r^2e^{-rt}}{(y\lambda+(r-y\lambda)e^{-rt})^2}\mathbf{1}_{\{0<t<T\}}.\qquad(2)$$

     We draw vertical lines of lengths $H_{i,n,T}$ for $i=0,1,2,\dots,n-1$. Then, we draw horizontal lines to the left, stopping when we hit a vertical line. See Figure 1 for an illustration.

     Figure 1: A coalescent tree with sample size $n=8$ and 7 coalescent times.
  3. Mutational events occur at the coalescent times $H_{1,n,T},\dots,H_{n-1,n,T}$ and along the branches. Conditional on $Y_{n,T}=y$, mutational events along the branches occur independently along the $i$th branch, for $i=0,1,\dots,n-1$, at the times of an inhomogeneous Poisson process. At $t$ time units before sampling, the rate is $\lambda F_t(1-y)$, where

     $$F_t(1-y)=1-\delta_te^{rt}+\frac{\delta_t^2e^{rt}(1-y)}{\delta_t+y-y\delta_t},\qquad\delta_t=\frac{r}{\lambda e^{rt}-\mu}.$$

     At each mutational event, the number of mutations has a Poisson distribution with mean $\nu$.

We are interested in the case when the sample size is much smaller than the population size, so the sampling probability $Y_{n,T}$ is close to 0. To interpret the time-inhomogeneity, note that $F_t(1-y)$ is the probability, conditional on $Y_{n,T}=y$, that an individual born $t$ time units before the sampling time has no descendants in the sample. As $t\rightarrow 0$, we have $F_t(1-y)\rightarrow 1-y\approx 1$, which is the probability that the newborn is not sampled. As $t\rightarrow\infty$, we have $F_t(1-y)\rightarrow\mu/\lambda<1$, which is the probability that the subtree generated by the newborn goes extinct. Therefore, we see more mutations along the branches near the bottom of the tree.
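Both densities in Proposition 1.1 can be inverted in closed form, which gives a convenient way to simulate the construction. The sketch below (function and parameter names are our own) draws $Y_{n,T}$ and a branch length by inverse-transform sampling; it uses the fact that the distribution function of density (1) is $(y/(y+\delta_T-y\delta_T))^n$, and that density (2) has an elementary antiderivative when $y\lambda\neq r$.

```python
import math

def delta(t, lam, mu):
    """delta_t = r / (lam * e^{rt} - mu), as in step 1 of Proposition 1.1."""
    r = lam - mu
    return r / (lam * math.exp(r * t) - mu)

def sample_Y(n, T, lam, mu, u):
    """Invert the CDF of density (1), F(y) = (y / (y + d - y*d))^n, at u in (0,1)."""
    d = delta(T, lam, mu)
    v = u ** (1.0 / n)
    return v * d / (1.0 - v * (1.0 - d))

def branch_cdf(t, y, T, lam, mu):
    """CDF of density (2), integrated in closed form; assumes y*lam != r."""
    r = lam - mu
    g = lambda s: y * lam + (r - y * lam) * math.exp(-r * s)
    C = g(T) / (y * lam * (1.0 - math.exp(-r * T)))  # first factor in (2)
    return C * y * lam * r / (r - y * lam) * (1.0 / g(t) - 1.0 / r)

def sample_H(y, T, lam, mu, u):
    """Invert the CDF of density (2) at u in (0,1); assumes y*lam != r."""
    r = lam - mu
    gT = y * lam + (r - y * lam) * math.exp(-r * T)
    C = gT / (y * lam * (1.0 - math.exp(-r * T)))
    inv_g = 1.0 / r + u * (r - y * lam) / (C * y * lam * r)  # 1/g at the quantile
    return -math.log((1.0 / inv_g - y * lam) / (r - y * lam)) / r
```

Repeating `sample_H` independently $n-1$ times, given one draw of `sample_Y`, reproduces steps 1 and 2 of the construction.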

To compare our result with that of Cheek and Johnston [5], observe that the density function of $Y_{n,T}$ with $n=1$ is

$$f_{Y_{1,T}}(y)=\frac{\delta_T}{(y+\delta_T-y\delta_T)^2}=\frac{r(\lambda e^{rT}-\mu)}{(y\lambda e^{rT}-y\lambda+r)^2}.$$

Using the formula for $F_t(\cdot)$, one can compute the density function of the random variable $S$ in [5] to be

$$\frac{F'_T(s)}{1-F_T(0)}=\frac{r(\lambda e^{rT}-\mu)}{\left((1-s)\lambda e^{rT}-(1-s)\lambda+r\right)^2}.$$

Therefore, the random variable $S$ agrees with $1-Y_{1,T}$. To interpret $r_l(s,t)$ in [5] correctly, note that the total rate of birth and death events is $\lambda+\mu$, and the probability of a size-two reproduction event, i.e., a birth event, is $p_2=\lambda/(\lambda+\mu)$. Therefore, we have

$$r_2(s,t)=(\lambda+\mu)\cdot 2\cdot\frac{\lambda}{\lambda+\mu}F_{T-t}(s)=2\lambda F_{T-t}(s).$$

The extra constant 2 comes from the fact that only the children get mutations in our model, and the discrepancy between $T-t$ and $t$ arises because time runs forward in [5] and backward in Proposition 1.1.

1.2 The site frequency spectrum

A commonly used summary statistic for mutational data is the site frequency spectrum. The site frequency spectrum for $n$ individuals at time $T$ consists of $(M^1_{n,T},M^2_{n,T},\dots,M^{n-1}_{n,T})$, where $M^i_{n,T}$ is the number of mutations inherited by $i$ individuals. Durrett [9] considered a supercritical birth and death process where mutations occur along the branches at a constant rate $\nu$. He considered sampling the entire population and obtained that, as the sampling time $T\rightarrow\infty$, the number of mutations affecting at least a fraction $x$ of the population is approximately $\nu/(\lambda x)$. Similar results also appear in [2, 21]. Durrett [9] also considered taking a sample of size $n$ and showed that as the sampling time $T\rightarrow\infty$, for $2\leq k\leq n-1$,

$$\lim_{T\rightarrow\infty}\mathbb{E}[M^k_{n,T}]=\frac{n\nu}{k(k-1)}.\qquad(3)$$

Gunnarsson, Leder, and Foo [10] considered models where the mutations occur at the times of reproduction events or at a constant rate along the branches. In both models, they computed the exact site frequency spectrum when the sample consists of the entire population or of the individuals with an infinite line of descent, at both a fixed time and the random time when the population size reaches a fixed level. Lambert [16] and Lambert and Stadler [18] developed a framework to study the site frequency spectrum when each individual is sampled independently with probability $p$, conditioned to have $n$ individuals. Following this approach, Delaporte, Achaz, and Lambert [6] computed $\mathbb{E}[M^k_{n,T}]$ for the critical birth and death process, both for fixed $T$ and for an unknown $T$ with an improper prior. Dinh et al. [8] computed $\mathbb{E}[M^k_{n,T}]$ for the supercritical birth and death process. As noted by Ignatieva, Hein, and Jenkins [14], if we take the sampling probability $p\rightarrow 0$ in [8], then one recovers the $1/k(k-1)$ shape of the site frequency spectrum as in formula (3).

Beyond the expected value of $M^k_{n,T}$, there are other works on its asymptotic behavior. Gunnarsson, Leder, and Zhang [11] considered a supercritical Galton-Watson process and sampled the entire population. They proved a strong law of large numbers for the site frequency spectrum at deterministic times and a weak law of large numbers at random times when the population size reaches a deterministic level. Schweinsberg and Shuai [20] considered a sample of size $n$ from supercritical or critical birth and death processes. As the sample size and the population size go to infinity appropriately, they established asymptotic normality for the length of the branches that support $k$ leaves. These results can then be translated into asymptotic normality of the site frequency spectrum, assuming that mutations occur at a constant rate along the branches.

In this paper, since we assume mutations occur at the times of reproduction events, the site frequency spectrum is characterized by the number of reproduction events. Indeed, let $R^k_{n,T}$ (resp. $R^{\geq 2}_{n,T}$) be the number of reproduction events in which the child has $k$ (resp. at least 2) descendants in the sample. Then, conditional on $(R^1_{n,T},R^2_{n,T},\dots,R^{n-1}_{n,T})$, $M^i_{n,T}$ has a Poisson distribution with mean $\nu R^i_{n,T}$, and the $M^i_{n,T}$ are independent of one another. Therefore, we will focus on $R^k_{n,T}$ and $R^{\geq 2}_{n,T}$. We show the following theorem.

Theorem 1.2.

Let $\{T_n\}_{n=1}^{\infty}$ be a sequence such that

$$\lim_{n\rightarrow\infty}ne^{-rT_n}=0.\qquad(4)$$

Then for any fixed $k\geq 2$,

$$\frac{R^k_{n,T_n}}{n}\stackrel{P}{\longrightarrow}\frac{\lambda}{rk(k-1)},$$

where "$\stackrel{P}{\longrightarrow}$" denotes convergence in probability as $n\rightarrow\infty$.

Corollary 1.1.

Under the same condition as Theorem 1.2, we have

$$\frac{M^k_{n,T_n}}{n}\stackrel{P}{\longrightarrow}\frac{\lambda\nu}{rk(k-1)}.$$

Under a stronger assumption on the sample size and the population size, we can establish a central limit theorem for $R^{\geq 2}_{n,T_n}$. The corresponding $M^{\geq 2}_{n,T_n}=\sum_{i=2}^{n-1}M^i_{n,T_n}$ represents the number of mutations shared by at least two individuals in the sample. The quantity $M^{\geq 2}_{n,T_n}$ has been applied to estimate the growth rate of a tumor in [15]. We believe an analogous result should hold for $R^k_{n,T_n}$, but the covariance computation is more involved.

Theorem 1.3.

Let $\{T_n\}_{n=1}^{\infty}$ be a sequence such that

$$\lim_{n\rightarrow\infty}n^{3/2}(\log n)e^{-rT_n}=0.\qquad(5)$$

Then

$$\frac{1}{\sqrt{n}}\left(R^{\geq 2}_{n,T_n}-\frac{n\lambda}{r}\right)\Rightarrow N\left(0,\frac{\lambda^2}{r^2}\right),$$

where "$\Rightarrow$" denotes convergence in distribution as $n\rightarrow\infty$.

Corollary 1.2.

Under the same condition as Theorem 1.3, we have

$$\frac{1}{\sqrt{n}}\left(M^{\geq 2}_{n,T_n}-\frac{n\lambda\nu}{r}\right)\Rightarrow N\left(0,\frac{\lambda^2\nu^2}{r^2}+\frac{\lambda\nu}{r}\right).$$
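Corollary 1.2 can be read off from Theorem 1.3 via the thinning relation stated earlier: conditional on $R^{\geq 2}_{n,T_n}$, the count $M^{\geq 2}_{n,T_n}$ has a Poisson distribution with mean $\nu R^{\geq 2}_{n,T_n}$. The following variance computation is a sketch of the standard argument, not the paper's proof. By the law of total variance,

$$\mathrm{Var}\big(M^{\geq 2}_{n,T_n}\big)=\mathbb{E}\big[\nu R^{\geq 2}_{n,T_n}\big]+\mathrm{Var}\big(\nu R^{\geq 2}_{n,T_n}\big)\approx\frac{n\lambda\nu}{r}+n\,\frac{\lambda^2\nu^2}{r^2},$$

so after centering at $n\lambda\nu/r$ and dividing by $\sqrt{n}$, the limiting variance is $\lambda^2\nu^2/r^2+\lambda\nu/r$, as stated.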
Remark 1.1.

Results similar to Theorem 1.2 and Corollary 1.2 have been established when the mutations occur at a constant rate $\widetilde{\nu}$ along the branches. We take $\widetilde{\nu}=\lambda\nu$ so that the expected number of mutations per unit of time is the same under both models. Schweinsberg and Shuai [20] showed the following result, which is similar to Theorem 1.2.

Theorem 1.4 (Theorem 1.3 of [20]).

Let $k\geq 2$ be a fixed integer. Assume that

$$\lim_{n\rightarrow\infty}ne^{-r_nT_n}=0.$$

Write $L^k_{n,T_n}$ for the total length of the branches that support $k$ leaves in the coalescent tree of the sample. Then as $n\rightarrow\infty$,

$$\frac{r_n}{n}L^k_{n,T_n}\stackrel{P}{\longrightarrow}\frac{1}{k(k-1)}.$$

The hypothesis is similar to (4), and the $1/(k(k-1))$ site frequency spectrum is consistent with (3). Conditional on $L^k_{n,T_n}$, the number of mutations inherited by $k$ individuals has a Poisson distribution with mean $\widetilde{\nu}L^k_{n,T_n}\approx n\widetilde{\nu}/(r_nk(k-1))=n\lambda\nu/(r_nk(k-1))$, consistent with Theorem 1.2.

Johnson et al. [15] showed a result similar to Corollary 1.2. The quantity $M^{in}_n$ therein corresponds to $M^{\geq 2}_{n,T_n}$. In the special case where the rates $\lambda$, $\mu$, and $\widetilde{\nu}$ are constants, Corollary 2 of [15] reduces to the following theorem.

Theorem 1.5 (Corollary 2 of [15]).

Assume that (5) holds. Then

$$\frac{1}{\sqrt{n}}\bigg(M^{\geq 2}_{n,T_n}-\frac{n\widetilde{\nu}}{r}\bigg)\Rightarrow N\left(0,\frac{\widetilde{\nu}^2}{r^2}+\frac{\widetilde{\nu}}{r}\right).$$

The hypothesis is the same as (5), and the result is the same when we plug in $\widetilde{\nu}=\lambda\nu$. Therefore, our results establish that the assumption that mutations occur at a constant rate along the branches is not essential to the results in [15, 20]: similar results hold when mutations occur only at the birth times, which may be a more natural assumption for some applications.

Another closely related summary statistic for mutational data is the allele frequency spectrum. For the allele frequency spectrum, the individuals are partitioned into families, where each family consists of individuals carrying the same mutations. The allele frequency spectrum then consists of $(A^1_{n,T},A^2_{n,T},\dots,A^n_{n,T})$, where $A^i_{n,T}$ is the number of families of size $i$. Richard [19] considered the allele frequency spectrum of the population in a birth and death process with age-dependent death rates and mutations at birth times. He computed the expected allele frequency spectrum and proved a law of large numbers for the allele frequency spectrum. Champagnat and Lambert considered the case where the death rate is age-dependent and the mutations occur at a constant rate along the branches. They computed the expected allele frequency spectrum when the sample consists of the entire population in [3], and studied the size of the largest family and the age of the oldest family in [4].

2 The contour process of the genealogical tree

Figure 2 is the planar representation of a genealogical tree. The ends of the vertical lines are the birth and death times of individuals, and the horizontal lines represent reproduction events. The individuals are ordered from left to right in the following way. An individual $x$ is to the left of $y$ if either $x$ is ancestral to $y$, or, when we trace back the lineages of $x$ and $y$ to their common ancestor, the ancestral lineage of $y$ belongs to the child. For example, in Figure 2, individual 6 is to the left of 7 because 6 is an ancestor of 7, and 7 is to the left of 8 because, as we trace back the ancestral lineages of 7 (the red line) and 8 (the blue line), the ancestral lineage of 8 belongs to the child.

Figure 2: The planar representation of a genealogical tree.

The contour process of a genealogical tree is a càdlàg process which encodes the information of the tree. This process starts at the depth of the leftmost individual, keeps track of the distance from the root in the depth-first search of the genealogical tree, and is indexed by the total length of the searched branches. See Figure 3 for an illustration. This process has slope $-1$ as we trace back the lineage in the genealogical tree unless we encounter a new birth event, in which case the contour process makes a jump of size equal to the child's lifespan.

Figure 3: The contour process of the genealogical tree in Figure 2. For $i=1,2,\dots,8$, $t_i$ is the total length of the branches searched before we reach the $i$th individual, and the size of the jump at $t_i$ is the lifespan of the $i$th individual.

One can also study the tree truncated at time $T$ and its contour process. We denote the contour processes for the original and truncated trees by $X$ and $X^{(T)}$, respectively. Lambert [16] showed that $X$ (resp. $X^{(T)}$) has the same distribution as the process $Y$ (resp. $Y^{(T)}$) defined as follows, stopped when it first hits 0. Let $\{\xi_i\}_{i=0}^{\infty}$ be independent exponentially distributed random variables with mean $1/\mu$ representing the lifespans of individuals, and let $N$ be a Poisson process with rate $\lambda$ counting the number of reproduction events found in the depth-first search. We define Lévy processes $Y$ and $Y^{(T)}$ by

$$Y_t=\xi_0-t+\sum_{i=1}^{N(t)}\xi_i,\qquad Y^{(T)}_t=\min\{\xi_0,T\}-t+\sum_{i=1}^{N(t)}\min\{T-Y^{(T)}_{t-},\xi_i\}.$$

That is, the process $Y$ starts from the lifespan of the 0th individual, has drift $-1$ as we trace back the lineage, and makes jumps when we encounter a new birth event. The process $Y^{(T)}$ has the same dynamics, except that the jumps are truncated so that $Y^{(T)}$ does not go above $T$.
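These dynamics are straightforward to simulate, which gives a useful sanity check on the description above. The sketch below (function and variable names are ours) generates the values of $Y^{(T)}$ just after each jump of the depth-first search, stopped when the process first hits 0; since the tree truncated at $T$ is finite almost surely, the loop terminates.

```python
import random

def contour_jump_chain(T, lam, mu, rng):
    """Values of the truncated contour process Y^(T) just after each jump,
    run until the process first hits 0 (end of the depth-first search)."""
    y = min(rng.expovariate(mu), T)  # min{xi_0, T}: truncated lifespan of individual 0
    chain = [y]
    while True:
        y -= rng.expovariate(lam)    # drift -1 until the next birth event (rate lam)
        if y <= 0:
            return chain             # the search is complete
        y += min(T - y, rng.expovariate(mu))  # truncated jump: Y^(T) stays below T
        chain.append(y)

rng = random.Random(2024)
chain = contour_jump_chain(5.0, 2.0, 1.0, rng)
```

The length of `chain` is the number of individuals in the truncated tree found by the search.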

3 Mutations in a sample from a population

In most applications, we only have access to the genetic information in a sample of the population at time $T$. Therefore, we are interested in the coalescent tree that describes the genealogy of the sampled individuals. See Figure 4 for an illustration. The mutations in the sample can be classified into two groups depending on whether they are associated with branching events in the coalescent tree. Those associated with such branching points are marked red in Figure 4. The others are marked blue.

Figure 4: The coalescent tree (in red) where individuals 2, 5, and 7 are sampled. This tree is obtained by tracing back the lineages of the sampled individuals from time $T$.

Two sampling schemes have been studied extensively in the literature: sampling each individual independently with probability $y$ and taking a uniform sample of size $n$. We refer to them as Bernoulli($y$) sampling and uniform($n$) sampling, respectively. These two sampling schemes were connected by Lambert [17]. We will discuss the mutations under Bernoulli($y$) sampling in Section 3.1 and under uniform($n$) sampling in Section 3.2.

In the rest of the paper, we will consider a variant of the model where the parent acquires the blue mutations. By the Markov property, whenever there is a birth event, the subtree generated by the child and its descendants has the same distribution as the subtree generated by the parent and the descendants it has after the original birth event. Therefore, the distribution of the coalescent tree and the mutations is the same whether we assign the mutations to the parent or to the child. For an illustration, one can compare Figures 4 and 5.

Figure 5: The same coalescent tree as in Figure 4, except that the parent mutates when there is a birth event that is not at one of the branchpoints in the coalescent tree.

3.1 Bernoulli($y$) sampling

In this section, we describe the coalescent tree, together with the red and blue mutations, in the Bernoulli sampling scheme. Lambert and Stadler [18] obtained the construction of the subtree. We outline the construction here for the readers' convenience. Recall that the contour process $X^{(T)}$ makes jumps at rate $\lambda$. That is, birth events occur at rate $\lambda$ as we trace back the lineage. Consider a birth event occurring $t$ time units before the sampling time. Recalling the definition of $N_t$ from the introduction, the probability that this birth event does not give rise to a new leaf in the coalescent tree is

$$q(y,t)=\mathbb{P}(N_t=0)+\sum_{k=1}^{\infty}\mathbb{P}(N_t=k)(1-y)^k=\mathbb{E}\left[(1-y)^{N_t}\right]=F_t(1-y).\qquad(6)$$

Therefore, red birth events occur at rate $\lambda(1-q(y,t))$ and blue birth events occur at rate $\lambda q(y,t)$, independently of each other. For a coupling argument in Section 4.2, red and blue birth events are represented using a Poisson point process. To be more precise, for any positive continuous functions $L\leq U$ on $[a,b]$, we let $\mathcal{A}(L,U,[a,b])=\{(t,x)\in\mathbb{R}^2:a\leq t\leq b,\,L(t)\leq x\leq U(t)\}$ be the area between $L$ and $U$ on $[a,b]$. In the special case where $L\equiv 0$ on $[a,b]$, we abbreviate the notation as $\mathcal{A}(U,[a,b])$. We also let $\{\Pi_i\}_{i\geq 0}$ be independent Poisson point processes on $\mathbb{R}^2$ with unit intensity. Then, for the Bernoulli($y$) sampling, one can obtain the coalescent tree and the mutations in the following way:

  1. We get an empty sample with probability $\mathbb{E}[(1-y)^{N_T}]=F_T(1-y)$. If the sample is non-empty, then we have an initial branch $H_{0,y,T}$ of length $T$.

  2. Let $\{\pi_{i,y}\}_{i\geq 0}$ be independent Poisson processes with rate $\lambda(1-q(y,t))$ on $[0,T]$, defined by $\pi_{i,y}(t)=\Pi_i(\mathcal{A}(\lambda q(y,\cdot),\lambda,[0,t]))$ for $0\leq t\leq T$.

  3. Having constructed $H_{0,y,T},\dots,H_{i-1,y,T}$, if $\pi_{i,y}(T)>0$, then the $i$th branch length is the time of the first jump in $\pi_{i,y}$, i.e., $H_{i,y,T}=\inf\{t\geq 0:\pi_{i,y}(t)>0\}$. Otherwise, we have found all the individuals in the sample. Note that $H_{i,y,T}$ is also the time of the red reproduction event on the $i$th individual.

  4. Blue birth events occur along the branches at rate $\lambda q(y,t)$. That is, the number of blue birth events along the $i$th branch up to time $t$ is $\Pi_i(\mathcal{A}(\lambda q(y,\cdot),[0,t]))$.

Remark 3.1.

This construction implies that, conditional on a non-empty sample, the sample size has a geometric distribution. In particular, by taking $y=1$, the population size $N_T$ satisfies $\mathbb{P}(N_T\geq k+1\,|\,N_T\geq k)=\mathbb{P}(\pi_{i,1}(T)>0)$ for $k\geq 1$.

We compute $q(y,t)=F_t(1-y)$ here and refer the reader to equation (7) of [17] for the formula for the density of $H_{i,y,T}$. We define

$$\delta_t=\mathbb{P}(\pi_{i,1}(t)=0)=\exp\left(-\int_0^t\lambda\mathbb{P}(N_s>0)\,ds\right).\qquad(7)$$

By Remark 3.1, conditional on $N_t>0$, $N_t$ has a geometric distribution with

$$\mathbb{P}(N_t\geq k+1\,|\,N_t\geq k)=\mathbb{P}(\pi_{i,1}(t)>0)=1-\delta_t,\qquad\text{for }k\geq 1.$$

Therefore, we have

$$e^{rt}=\mathbb{E}[N_t]=\mathbb{P}(N_t>0)/\delta_t.\qquad(8)$$

Combining (7) and (8), we get

$$\delta_t=\exp\left(-\int_0^t\lambda e^{rs}\delta_s\,ds\right).$$

Solving this equation for $\delta_t$ with the initial condition $\delta_0=1$, we get the following formula for $\delta_t$:

$$\delta_t=\frac{r}{\lambda e^{rt}-\mu},$$

which agrees with the formula for $\delta_t$ in (1). This formula is also found in equation (4) of [17]. To compute $q(y,t)$, by (8) and Remark 3.1 again, we have

\begin{align*}
q(y,t)&=F_t(1-y)\\
&=\mathbb{P}(N_t=0)+\mathbb{P}(N_t\geq 1)\,\mathbb{E}\left[(1-y)^{N_t}\,\big|\,N_t\geq 1\right]\\
&=1-\delta_te^{rt}+\delta_te^{rt}\cdot\frac{\delta_t(1-y)}{\delta_t+y-y\delta_t}.\qquad(9)
\end{align*}

3.2 Uniform($n$) sampling

Lambert [17] showed that the uniform($n$) sampling can be realized by taking $y$ from the density (1) and then performing the Bernoulli($y$) sampling conditioned to have size $n$. Therefore, for the uniform($n$) sampling, one can obtain the coalescent tree and the mutations in the following way:

  1. Take the sampling probability $Y_{n,T}$ from the density (1).

  2. Conditional on $Y_{n,T}=y$, for $i=1,2,\dots,n-1$, let $\pi_{i,y}$ be the Poisson process defined in Section 3.1, conditioned on $\pi_{i,y}(T)>0$.

  3. For $i=1,2,\dots,n-1$, the $i$th branch length $H_{i,n,T}$ is the time of the first jump in $\pi_{i,y}$, conditional on $\pi_{i,y}(T)>0$. That is,

     $$\mathbb{P}(H_{i,n,T}>t\,|\,Y_{n,T}=y)=\mathbb{P}(\pi_{i,y}(t)=0\,|\,\pi_{i,y}(T)>0).$$

     The density of $H_{i,n,T}$ is given by (2).

  4. Blue birth events occur along the branches at rate $\lambda q(y,t)$.

Figure 6: A coalescent tree with sample size $n=9$. The blue mutation on the segment $AB$ supports three individuals because $\max\{H_{6,n,T},H_{7,n,T}\}\leq\min\{H_{5,n,T},H_{8,n,T}\}$; the red mutation associated with $H_{3,n,T}$ supports two individuals because $H_{4,n,T}<H_{3,n,T}<H_{5,n,T}$.
Remark 3.2.

For $i=0,1,\dots,n-1$, we define $R^k_{i,n,T}$ (resp. $R^{\geq 2}_{i,n,T}$) to be the number of reproduction events along the $i$th branch that support $k$ (resp. at least two) leaves in the sample. Then for $k\geq 2$, we have

$$R^k_{0,n,T}=\Pi_0\left(\mathcal{A}\left(\lambda q(Y_{n,T},\cdot),\left[\max_{1\leq j\leq k-1}H_{j,n,T},\,H_{k,n,T}\right]\right)\right),$$

and for $1\leq i\leq n-k-1$,

$$R^k_{i,n,T}=\Pi_i\left(\mathcal{A}\left(\lambda q(Y_{n,T},\cdot),\left[\max_{i+1\leq j\leq i+k-1}H_{j,n,T},\,H_{i,n,T}\wedge H_{i+k,n,T}\right]\right)\right)+\mathbbm{1}_{\{\max_{i+1\leq j\leq i+k-1}H_{j,n,T}\leq H_{i,n,T}\leq H_{i+k,n,T}\}},$$

and

$$R^k_{n-k,n,T}=\Pi_{n-k}\left(\mathcal{A}\left(\lambda q(Y_{n,T},\cdot),\left[\max_{n-k+1\leq j\leq n-1}H_{j,n,T},\,H_{n-k,n,T}\right]\right)\right)+\mathbbm{1}_{\{\max_{n-k+1\leq j\leq n-1}H_{j,n,T}\leq H_{n-k,n,T}\}}.$$

The first part in the definition of $R^k_{i,n,T}$ or $R^k_{n-k,n,T}$ counts the number of blue reproduction events with $k$ descendants in the sample on the $i$th branch. The second part counts the number of red reproduction events that have $k$ descendants in the sample. See Figure 6 for an illustration. Note that there is no red mutation on the 0th branch. The total number of reproduction events with $k$ descendants in the sample is

$$R^k_{n,T}=\sum_{i=0}^{n-k}R^k_{i,n,T}.$$

Likewise, the number of reproduction events with at least two descendants in the sample on the 0th branch is

$$R^{\geq 2}_{0,n,T}=\Pi_0(\mathcal{A}(\lambda q(Y_{n,T},\cdot),[H_{1,n,T},T])),$$

and for $1\leq i\leq n-2$,

$$R^{\geq 2}_{i,n,T}=\Pi_i(\mathcal{A}(\lambda q(Y_{n,T},\cdot),[H_{i+1,n,T},H_{i,n,T}]))+\mathbbm{1}_{\{H_{i+1,n,T}\leq H_{i,n,T}\}}.$$

The total number of reproduction events with at least two descendants in the sample is

$$R^{\geq 2}_{n,T}=\sum_{i=0}^{n-2}R^{\geq 2}_{i,n,T}.$$
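Remark 3.2 translates directly into a counting procedure: given the branch lengths, blue events on branch $i$ form an inhomogeneous Poisson process with rate $\lambda q(y,t)$ on $[H_{i+1,n,T},H_{i,n,T}]$, realized below by thinning a rate-$\lambda$ process, and each coalescence with $H_{i+1,n,T}\leq H_{i,n,T}$ contributes one red event. A sketch with our own function names and toy inputs:

```python
import math, random

def q(y, t, lam, mu):
    """q(y,t) = F_t(1-y), using the closed form (9)."""
    r = lam - mu
    d = r / (lam * math.exp(r * t) - mu)
    return 1.0 - d * math.exp(r * t) + d * math.exp(r * t) * d * (1.0 - y) / (d + y - y * d)

def count_R_ge2(H, y, lam, mu, rng):
    """R^{>=2}_{n,T} as in Remark 3.2, given branch lengths H[0..n-1] with H[0] = T.
    Blue events on branch i: rate-lam Poisson points on [H[i+1], H[i]], each kept
    with probability q(y, t) (thinning).  Red events: 1{H[i+1] <= H[i]} for i >= 1."""
    total = 0
    for i in range(len(H) - 1):
        lo, hi = H[i + 1], H[i]
        if hi > lo:
            t = lo + rng.expovariate(lam)
            while t <= hi:
                if rng.random() < q(y, t, lam, mu):
                    total += 1      # a blue reproduction event
                t += rng.expovariate(lam)
        if i >= 1 and H[i + 1] <= H[i]:
            total += 1              # the red reproduction event at coalescence time H[i]
    return total

rng = random.Random(7)
H = [5.0, 3.0, 2.0, 1.0]            # toy branch lengths, with H[0] = T = 5
R = count_R_ge2(H, 0.3, 2.0, 1.0, rng)
```

With these toy branch lengths, both indicators fire, so $R\geq 2$ deterministically.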

4 Asymptotics for large nn and TT

4.1 Approximation for the sampling probability and branch length

Approximations for the sampling probability $Y_{n,T}$ and the branch lengths $\{H_{i,n,T}\}_{i=1}^{n-1}$ when $n$ and $T$ are large are obtained in [15].

  1. Choose $W$ from the exponential density

     $$f_W(w)=e^{-w},\qquad w\in(0,\infty).$$
  2. Choose $U_i$ for $i=1,\dots,n-1$ independently from the logistic distribution, with density

     $$f_{U_i}(u)=\frac{e^u}{(1+e^u)^2},\qquad u\in(-\infty,\infty).$$
  3. Let $Y=n\delta_T/W$.

  4. Let $H_i=T-(\log n+\log(1/W)+U_i)/r$ for $i=1,\dots,n-1$.

Note that the $H_i$'s (and hence the quantities defined below) depend on $n$, although this dependence is not explicit in the notation. Using this approximation, we can therefore define an approximation for the number of reproduction events with at least two descendants in the sample on the $i$th branch as

$$R^{\geq 2}_0=\Pi_0(\mathcal{A}(\lambda q(n\delta_T/W,\cdot),[H_1,T])),$$

and

$$R^{\geq 2}_i=\Pi_i(\mathcal{A}(\lambda q(n\delta_T/W,\cdot),[H_{i+1},H_i]))+\mathbbm{1}_{\{H_{i+1}\leq H_i\}},\qquad\text{for }1\leq i\leq n-2.$$

An approximation for the total number of reproduction events with at least two descendants in the sample, excluding the 0th branch, is

$$R^{\geq 2}=\sum_{i=1}^{n-2}R^{\geq 2}_i.$$

Similarly, for $k\geq 2$, we define

$$R^k_0=\Pi_0\left(\mathcal{A}\left(\lambda q(n\delta_T/W,\cdot),\left[\max_{1\leq j\leq k-1}H_j,\,H_k\right]\right)\right),$$

and for $1\leq i\leq n-k-1$,

$$R^k_i=\Pi_i\left(\mathcal{A}\left(\lambda q(n\delta_T/W,\cdot),\left[\max_{i+1\leq j\leq i+k-1}H_j,\,H_i\wedge H_{i+k}\right]\right)\right)+\mathbbm{1}_{\{\max_{i+1\leq j\leq i+k-1}H_j\leq H_i\leq H_{i+k}\}},$$

and

\displaystyle R^{k}_{n-k}=\Pi_{n-k}\left(\mathcal{A}\left(\lambda q(n\delta_{T}/W,\cdot),\left[\max_{n-k+1\leq j\leq n-1}H_{j},H_{n-k}\right]\right)\right)+\mathbbm{1}_{\{\max_{n-k+1\leq j\leq n-1}H_{j}\leq H_{n-k}\}}.

An approximation for the total number of reproduction events inherited by k individuals, excluding the 0th and the (n-k)th branches, is

R^{k}=\sum_{i=1}^{n-k-1}R^{k}_{i}.
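Conditionally on the branch lengths, each count above is a Poisson variable plus a red indicator. Assuming, as in the construction the paper uses (not reproduced here), that \Pi_{i}(\mathcal{A}(f,[a,b])) is Poisson with mean \int_{a}^{b}f(t)\,dt when a\leq b and zero otherwise, R^{\geq 2}_{i} can be simulated as in the following sketch; the helper names are ours.

```python
import numpy as np

def poisson_area_count(f, a, b, rng, num=513):
    """Poisson count with mean the area under f over [a, b]; zero if the interval is empty."""
    if a >= b:
        return 0
    t = np.linspace(a, b, num)
    y = f(t)
    mean = float(np.sum((y[:-1] + y[1:]) * np.diff(t)) / 2.0)  # trapezoid rule
    return int(rng.poisson(mean))

def r_geq2_counts(H, rate, rng):
    """Approximate R_i^{>=2} for i = 1, ..., n-2, where H[i-1] plays the role of H_i."""
    counts = []
    for i in range(len(H) - 1):
        blue = poisson_area_count(rate, H[i + 1], H[i], rng)  # points under the rate curve
        red = 1 if H[i + 1] <= H[i] else 0                    # the red (branch-point) event
        counts.append(blue + red)
    return counts
```

Here `rate` stands in for the function t\mapsto\lambda q(n\delta_{T}/W,t); passing a constant function recovers the upper bound used later in Section 5.3.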

4.2 Error bounds

In the remainder of the paper, the sampling time T=T_{n} depends on n. The errors in the approximation resulting from taking the limit as n\rightarrow\infty are bounded by Lemmas 4 and 5 in the supplementary material of [15]. We combine these lemmas into Lemma 4.1. The event A_{n} in Lemma 4.1 is the intersection of the events \{Q_{n,T_{n}}=Q_{n,\infty}\leq h(n):=n^{1/2}e^{rT_{n}/2}\} and \{\widetilde{Q}_{n,\infty}=1/W\} in [15]. The event B_{n} is the intersection of \{Q_{n,T_{n}}=Q_{n,\infty}\leq h(n):=n^{1/2}(\log n)^{-1/3}e^{r_{n}T_{n}/3}\} and \{\widetilde{Q}_{n,\infty}=1/W\} in [15]. The choice of h(n) comes from the proof of Lemma 6 of [15].

Lemma 4.1.

Assume condition (4) holds. The random variables Y_{n,T_{n}}, W, H_{i,n,T_{n}} and H_{i} can be coupled so that the following hold.

  1.

    There is a sequence of events A_{n} on which Y_{n,T_{n}}=n\delta_{T_{n}}/W and \mathbb{P}(A_{n})\rightarrow 1 as n\rightarrow\infty.

  2.

    As n\rightarrow\infty,

    \frac{r}{n}\sum_{i=1}^{n-1}\mathbb{E}[|H_{i,n,T_{n}}-H_{i}|\mathbbm{1}_{A_{n}}]\rightarrow 0.

Assume condition (5) holds. The random variables Y_{n,T_{n}}, W, H_{i,n,T_{n}} and H_{i} can be coupled so that the following hold.

  1.

    There is a sequence of events B_{n} on which Y_{n,T_{n}}=n\delta_{T_{n}}/W and \mathbb{P}(B_{n})\rightarrow 1 as n\rightarrow\infty.

  2.

    As n\rightarrow\infty,

    \frac{r}{\sqrt{n}}\sum_{i=1}^{n-1}\mathbb{E}[|H_{i,n,T_{n}}-H_{i}|\mathbbm{1}_{B_{n}}]\rightarrow 0.

These error bounds allow us to bound the error in the approximations for R^{\geq 2}_{n,T_{n}} and R^{k}_{n,T_{n}}, with the aid of another technical lemma.

Lemma 4.2.

Let X_{1},X_{2},\dots,X_{n} and Y_{1},Y_{2},\dots,Y_{n} be two independent sequences of i.i.d. random variables such that \mathbb{P}(X_{i}=X_{j})=0 and \mathbb{P}(Y_{i}=Y_{j})=0 for all i\neq j. Then there exist random variables \widetilde{Y}_{1},\widetilde{Y}_{2},\dots,\widetilde{Y}_{n} such that the following hold.

  1.

    The random vectors (\widetilde{Y}_{1},\widetilde{Y}_{2},\dots,\widetilde{Y}_{n}) and (Y_{1},Y_{2},\dots,Y_{n}) are equal in distribution.

  2.

    For 1\leq i,j\leq n, we have X_{i}\leq X_{j} if and only if \widetilde{Y}_{i}\leq\widetilde{Y}_{j}.

  3.

    \sum_{i=1}^{n}|X_{i}-\widetilde{Y}_{i}|\leq\sum_{i=1}^{n}|X_{i}-Y_{i}|, and so \sum_{i=1}^{n}\mathbb{E}[|X_{i}-\widetilde{Y}_{i}|]\leq\sum_{i=1}^{n}\mathbb{E}[|X_{i}-Y_{i}|].

Proof.

Let Y_{(1)}\leq Y_{(2)}\leq\dots\leq Y_{(n)} be the order statistics of Y_{1},Y_{2},\dots,Y_{n}. For 1\leq i\leq n, we define \widetilde{Y}_{i} to be Y_{(j)} if X_{i} is the jth smallest among (X_{1},X_{2},\dots,X_{n}). The second claim is then immediate from the definition. Since the random variables X_{1},X_{2},\dots,X_{n} are i.i.d., (\widetilde{Y}_{1},\widetilde{Y}_{2},\dots,\widetilde{Y}_{n}) is a uniformly random permutation of the i.i.d. random variables Y_{1},Y_{2},\dots,Y_{n}, which implies the first claim. The last claim follows from the second claim together with the rearrangement inequality: matching both sequences in increasing order minimizes the sum of the absolute differences. ∎
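The coupling in the proof is a pure rank-matching construction and is easy to state in code. The following NumPy sketch (function name ours) builds \widetilde{Y} from X and Y.

```python
import numpy as np

def rank_coupling(X, Y):
    """Return Ytilde: the values of Y rearranged to have the same ordering as X,
    i.e. Ytilde_i is the j-th smallest Y whenever X_i is the j-th smallest X."""
    X = np.asarray(X)
    ranks = np.argsort(np.argsort(X))   # 0-based rank of each X_i
    return np.sort(np.asarray(Y))[ranks]
```

The third claim then holds pathwise: for each realization, matching both samples in increasing order can only decrease \sum_{i}|X_{i}-Y_{i}|.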

Corollary 4.1.

Recall the definition of R^{k}_{i,n,T_{n}} and R^{\geq 2}_{i,n,T_{n}} from Remark 3.2 and the definition of R^{k}_{i} and R^{\geq 2}_{i} from Section 4.1. Assume condition (4) holds. Under a coupling satisfying the conclusions of Lemma 4.1, we have

\frac{1}{n}\left(\sum_{i=0}^{n-k}R^{k}_{i,n,T_{n}}-\sum_{i=1}^{n-k-1}R^{k}_{i}\right)\stackrel{{\scriptstyle P}}{{\rightarrow}}0.

Assume condition (5) holds. Under a coupling satisfying the conclusions of Lemma 4.1, we have

\frac{1}{\sqrt{n}}\left(\sum_{i=0}^{n-2}R^{\geq 2}_{i,n,T_{n}}-\sum_{i=1}^{n-2}R^{\geq 2}_{i}\right)\stackrel{{\scriptstyle P}}{{\rightarrow}}0.
Proof.

In view of Lemma 4.2, we may assume that conditional on Y_{n,T_{n}}=y=n\delta_{T_{n}}/W, the random variables H_{i} and H_{i,n,T_{n}} are coupled so that H_{i}\leq H_{j} if and only if H_{i,n,T_{n}}\leq H_{j,n,T_{n}}. Under this coupling the red reproduction events agree, so the only error in the approximation comes from the blue reproduction events. We write A\triangle B=(A\setminus B)\cup(B\setminus A) and \mathbf{Leb} for the Lebesgue measure on \mathbb{R}^{2}. Conditional on the event A_{y}=\{Y_{n,T_{n}}=y=n\delta_{T_{n}}/W\}, we have

\begin{split}\mathbb{E}\left[\left|R^{\geq 2}_{0,n,T_{n}}\right|\Big{|}A_{y}\right]&\leq\mathbb{E}\left[\left|R^{\geq 2}_{0,n,T_{n}}-R^{\geq 2}_{0}\right|+R^{\geq 2}_{0}\Big{|}A_{y}\right]\\ &\leq\mathbb{E}\left[\left|R^{\geq 2}_{0,n,T_{n}}-R^{\geq 2}_{0}\right|\Big{|}A_{y}\right]+\lambda\mathbb{E}[T-H_{1}|A_{y}]\\ &=\mathbb{E}\left[\left|R^{\geq 2}_{0,n,T_{n}}-R^{\geq 2}_{0}\right|\Big{|}A_{y}\right]+\frac{\lambda}{r}\mathbb{E}[\log n+\log(1/W)+U_{1}|A_{y}],\\ \end{split}

and

\displaystyle\mathbb{E}\left[\left|R^{\geq 2}_{0,n,T_{n}}-R^{\geq 2}_{0}\right|\Big{|}A_{y}\right]=\mathbb{E}[|\Pi_{0}(\mathcal{A}(\lambda q(y,\cdot),[H_{1,n,T_{n}},T]))-\Pi_{0}(\mathcal{A}(\lambda q(y,\cdot),[H_{1},T]))||A_{y}]
\displaystyle\leq\mathbb{E}[\mathbf{Leb}(\mathcal{A}(\lambda q(y,\cdot),[H_{1,n,T_{n}},T])\triangle\mathcal{A}(\lambda q(y,\cdot),[H_{1},T]))|A_{y}]
\displaystyle\leq\lambda\mathbb{E}[|H_{1,n,T_{n}}-H_{1}||A_{y}].

For 1in21\leq i\leq n-2, we have

\displaystyle\mathbb{E}\left[\left|R^{\geq 2}_{i,n,T_{n}}-R^{\geq 2}_{i}\right|\Big{|}A_{y}\right]=\mathbb{E}[|\Pi_{i}(\mathcal{A}(\lambda q(y,\cdot),[H_{i+1,n,T_{n}},H_{i,n,T_{n}}]))-\Pi_{i}(\mathcal{A}(\lambda q(y,\cdot),[H_{i+1},H_{i}]))||A_{y}]
\displaystyle\leq\mathbb{E}[\mathbf{Leb}(\mathcal{A}(\lambda q(y,\cdot),[H_{i+1,n,T_{n}},H_{i,n,T_{n}}])\triangle\mathcal{A}(\lambda q(y,\cdot),[H_{i+1},H_{i}]))|A_{y}]
\displaystyle\leq\lambda\mathbb{E}[|H_{i+1,n,T_{n}}-H_{i+1}|+|H_{i,n,T_{n}}-H_{i}||A_{y}].

The second claim then follows from integrating over the event BnB_{n} in Lemma 4.1. The first claim can be proved in an analogous way using the inequality

\displaystyle\mathbb{E}\left[\left|R^{k}_{i,n,T_{n}}-R^{k}_{i}\right|\Big{|}A_{y}\right]\leq\lambda\sum_{j=0}^{k}\mathbb{E}[|H_{i+j,n,T_{n}}-H_{i+j}||A_{y}],

and the observation that R^{k}_{n-k,n,T_{n}} has the same distribution as R^{k}_{0,n,T_{n}}, and that R^{k}_{0,n,T_{n}}\leq R^{\geq 2}_{0,n,T_{n}}. ∎

5 Proof of Theorems 1.2 and 1.3

In this section, we begin with Propositions 5.1 and 5.2, which are the counterparts of Theorems 1.2 and 1.3 for the approximations R^{\geq 2} and R^{k}. We prove Propositions 5.1 and 5.2 in Sections 5.2 and 5.3; Theorems 1.2 and 1.3 then follow from Corollary 4.1. We emphasize that R^{\geq 2} and R^{k} depend on n and T_{n}, although this is not made explicit in the notation.

Proposition 5.1.

Under condition (4), we have the following convergence in probability as n\rightarrow\infty:

\frac{R^{k}}{n}\stackrel{{\scriptstyle P}}{{\longrightarrow}}\frac{\lambda}{rk(k-1)}.
Proposition 5.2.

Under condition (5), we have the following convergence in distribution as n\rightarrow\infty:

\frac{1}{\sqrt{n}}\left(R^{\geq 2}-\frac{n\lambda}{r}\right)\Rightarrow N\left(0,\frac{\lambda^{2}}{r^{2}}\right).

To prove Proposition 5.1 or 5.2, it suffices to show that the convergence in probability or in distribution holds conditional on W=w, with a limit that does not depend on w. Consider Proposition 5.2, for example: we will show that the conditional distribution of R^{\geq 2} given W=w is asymptotically normal with the mean and variance given in Proposition 5.2, regardless of the value of w. Proposition 5.2 then follows from applying the dominated convergence theorem to the characteristic function:

\displaystyle\lim_{n\rightarrow\infty}\mathbb{E}\left[\exp\left(\frac{it}{\sqrt{n}}\left(R^{\geq 2}-\frac{n\lambda}{r}\right)\right)\right]=\lim_{n\rightarrow\infty}\mathbb{E}\left[\mathbb{E}\left[\exp\left(\frac{it}{\sqrt{n}}\left(R^{\geq 2}-\frac{n\lambda}{r}\right)\right)\Big{|}W\right]\right]
\displaystyle\qquad\qquad=\mathbb{E}\left[\lim_{n\rightarrow\infty}\mathbb{E}\left[\exp\left(\frac{it}{\sqrt{n}}\left(R^{\geq 2}-\frac{n\lambda}{r}\right)\right)\Big{|}W\right]\right]=\exp\left(-\frac{t^{2}\lambda^{2}}{2r^{2}}\right).

5.1 Moment computations

In this section, for positive functions f,g,h, we write

f=g(1+O(h))\qquad\text{ if }|f-g|\leq Cgh\text{ for some universal constant }C.

We will also use the following fact. For integers 0\leq m\leq n-2,

\int_{-\infty}^{\infty}\frac{e^{(m+1)s}}{(1+e^{s})^{n}}\ ds=\int_{0}^{\infty}\frac{x^{m}}{(1+x)^{n}}\ dx=\sum_{k=0}^{m}(-1)^{m-k}\frac{1}{n-k-1}{m\choose k}. (10)
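Both sides of (10) admit a closed form: the middle integral is the Beta function B(m+1,n-m-1)=m!(n-m-2)!/(n-1)!. The identity is easy to sanity-check numerically; the helper names below are ours.

```python
from math import comb, factorial

def rhs_10(m, n):
    """Alternating sum on the right-hand side of (10)."""
    return sum((-1) ** (m - k) * comb(m, k) / (n - k - 1) for k in range(m + 1))

def beta_form(m, n):
    """Closed form of the integral in (10): B(m+1, n-m-1) = m!(n-m-2)!/(n-1)!."""
    return factorial(m) * factorial(n - m - 2) / factorial(n - 1)
```

For instance, m=1 and n=3 give 1/((n-1)(n-2))=1/2 on both sides.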

Another observation is that condition (5) implies that

\lim_{n\rightarrow\infty}n^{3/2}T_{n}e^{-rT_{n}}=0. (11)

Indeed, if (11) fails, then we may choose a subsequence \{n_{k}\} such that

\lim_{k\rightarrow\infty}n_{k}^{3/2}T_{n_{k}}e^{-rT_{n_{k}}}=c>0.

Combined with (5), we can deduce that \lim_{k\rightarrow\infty}(\log n_{k}/T_{n_{k}})=0. However, this implies

\lim_{k\rightarrow\infty}n_{k}^{3/2}T_{n_{k}}e^{-rT_{n_{k}}}=0,

which is a contradiction.

Lemma 5.1.

We have the following moment estimates for the numbers of reproduction events, excluding those on the 0th branch. Assume condition (4) holds. Then for each fixed w\in(0,\infty), we have

\lim_{n\rightarrow\infty}n^{-1}\textup{Var}\left(\sum_{i=1}^{n-2}R^{\geq 2}_{i}\Big{|}W=w\right)=\frac{\lambda^{2}}{r^{2}}, (12)
\lim_{n\rightarrow\infty}n^{-1}\mathbb{E}\left[\sum_{i=1}^{n-k-1}R^{k}_{i}\Big{|}W=w\right]=\frac{\lambda}{rk(k-1)}. (13)

Assume condition (5) holds. Then for each fixed w\in(0,\infty), we have

\lim_{n\rightarrow\infty}\frac{1}{\sqrt{n}}\left|\mathbb{E}\left[\sum_{i=1}^{n-2}R^{\geq 2}_{i}\Big{|}W=w\right]-\frac{n\lambda}{r}\right|=0. (14)
Proof.

To prove (14), for 1\leq i\leq n-2, we define

R^{\geq 2,red}_{i}=\mathbbm{1}_{\{H_{i+1}\leq H_{i}\}},\qquad R^{\geq 2,blue}_{i}=\Pi_{i}(\mathcal{A}(\lambda q(n\delta_{T_{n}}/W,\cdot),[H_{i+1},H_{i}])),

to be the approximations for the numbers of red and blue reproduction events on the ith branch. Then we have

\mathbb{E}\left[R^{\geq 2,red}_{i}\Big{|}W=w\right]=\frac{1}{2}. (15)

We now compute \mathbb{E}[R^{\geq 2,blue}_{i}|W=w]. Using the change of variable t=T_{n}-(s+\log n+\log(1/w))/r, we have

\displaystyle\mathbb{E}\left[R^{\geq 2,blue}_{i}\Big{|}W=w\right]
\displaystyle\qquad=\mathbb{E}[\Pi_{i}(\mathcal{A}(\lambda q(n\delta_{T_{n}}/w,\cdot),[H_{i+1},H_{i}]))|W=w]
\displaystyle\qquad=\int_{0}^{T_{n}}\mathbb{P}(H_{i+1}\leq t<H_{i}|W=w)\lambda q(n\delta_{T_{n}}/w,t)\ dt
\displaystyle\qquad=\frac{1}{r}\int_{-\log n-\log(1/w)}^{rT_{n}-\log n-\log(1/w)}\mathbb{P}(U_{i+1}\geq s\geq U_{i})\lambda q(n\delta_{T_{n}}/w,T_{n}-(s+\log n+\log(1/w))/r)\ ds.

We consider the limit of \lambda q(n\delta_{T_{n}}/w,T_{n}-(s+\log n+\log(1/w))/r) as n\rightarrow\infty. Recall the formula for q(y,t) from (3.1). Writing t(s)=T_{n}-(s+\log n+\log(1/w))/r and noticing that

\delta_{t}=\frac{r}{\lambda}e^{-rt}\left(1+O\left(e^{-rt}\right)\right),

we have,

\displaystyle\lambda q(n\delta_{T_{n}}/w,t(s))
\displaystyle\qquad=\lambda\left(1-\delta_{t(s)}e^{rt(s)}+\frac{\delta_{t(s)}(1-n\delta_{T_{n}}/w)}{\delta_{t(s)}+n\delta_{T_{n}}/w-\delta_{t(s)}n\delta_{T_{n}}/w}\cdot\delta_{t(s)}e^{rt(s)}\right)
\displaystyle\qquad=\lambda\left(1-\frac{r}{\lambda}\left(1+O\left(e^{-rt(s)}\right)\right)+\frac{\delta_{t(s)}}{\delta_{t(s)}+n\delta_{T_{n}}/w}\cdot(1+O(n\delta_{T_{n}}/w))\cdot\frac{r}{\lambda}\right)
\displaystyle\qquad=\lambda\Big{(}1-\frac{r}{\lambda}\left(1+O\left(e^{-rt(s)}\right)\right)
\displaystyle\qquad\qquad+\frac{(rn/\lambda w)e^{s-rT_{n}}}{(rn/\lambda w)e^{s-rT_{n}}+(rn/\lambda w)e^{-rT_{n}}}\cdot(1+O(e^{-rt(s)}+n\delta_{T_{n}}/w))\cdot\frac{r}{\lambda}\Big{)}
\displaystyle\qquad=\mu+\frac{r}{1+e^{-s}}+O(re^{-rt(s)}+rn\delta_{T_{n}}/w).

Therefore, using \mathbb{P}(U_{i+1}\geq s\geq U_{i})\leq\min\{e^{s},e^{-s}\}, there is a positive constant C such that

\displaystyle\left|\mathbb{E}\left[R^{\geq 2,blue}_{i}\Big{|}W=w\right]-\frac{1}{r}\int_{-\infty}^{\infty}\mathbb{P}(U_{i+1}\geq s\geq U_{i})\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds\right|
\displaystyle\qquad\leq\frac{\lambda}{r}\int_{-\infty}^{-\log n-\log(1/w)}\mathbb{P}(U_{i+1}\geq s\geq U_{i})\ ds+\frac{\lambda}{r}\int_{rT_{n}-\log n-\log(1/w)}^{\infty}\mathbb{P}(U_{i+1}\geq s\geq U_{i})\ ds
\displaystyle\qquad\qquad+\frac{1}{r}\int_{-\log n-\log(1/w)}^{rT_{n}-\log n-\log(1/w)}\mathbb{P}(U_{i+1}\geq s\geq U_{i})\left|\lambda q(n\delta_{T_{n}}/w,t(s))-\mu-\frac{r}{1+e^{-s}}\right|\ ds
\displaystyle\qquad\leq\frac{\lambda}{r}\int_{-\infty}^{-\log n-\log(1/w)}e^{s}\ ds+\frac{\lambda}{r}\int_{rT_{n}-\log n-\log(1/w)}^{\infty}e^{-s}\ ds
\displaystyle\qquad\qquad+\frac{1}{r}\int_{-\log n-\log(1/w)}^{0}e^{s}\cdot Cr(e^{-rt(s)}+n\delta_{T_{n}}/w)\ ds
\displaystyle\qquad\qquad+\frac{1}{r}\int_{0}^{rT_{n}-\log n-\log(1/w)}e^{-s}\cdot Cr(e^{-rt(s)}+n\delta_{T_{n}}/w)\ ds
\displaystyle\qquad\leq\frac{\lambda}{r}\cdot\frac{w}{n}+\frac{\lambda}{r}\cdot\frac{n}{we^{rT_{n}}}+CrT_{n}\left(\frac{ne^{-rT_{n}}}{w}+n\delta_{T_{n}}/w\right).

By (10), we have

\displaystyle\frac{1}{r}\int_{-\infty}^{\infty}\mathbb{P}(U_{i}\geq s\geq U_{i+1})\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds=\frac{1}{r}\int_{-\infty}^{\infty}\frac{e^{s}}{(1+e^{s})^{2}}\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds=\frac{\mu}{r}+\frac{1}{2}.

Therefore, by (11), we have

\lim_{n\rightarrow\infty}\sqrt{n}\left|\mathbb{E}\left[R^{\geq 2,blue}_{i}\Big{|}W=w\right]-\frac{\mu}{r}-\frac{1}{2}\right|=0. (16)

Combining (15) and (16), we have

\lim_{n\rightarrow\infty}\sqrt{n}\left|\mathbb{E}\left[R^{\geq 2}_{i}\Big{|}W=w\right]-\frac{\lambda}{r}\right|=0.

Then (14) follows from summing over i.
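The value \mu/r+1/2 behind (16) can be confirmed by numerical quadrature of the limiting integrand; the following sketch uses a wide truncation of the real line (truncation bound, grid size, and function name are our choices).

```python
import numpy as np

def limiting_mean(mu, r, s_max=40.0, num=200001):
    """(1/r) * integral of f_U(s) * (mu + r/(1+e^{-s})) ds, trapezoid rule."""
    s = np.linspace(-s_max, s_max, num)
    # logistic density e^s/(1+e^s)^2, written in the overflow-safe symmetric form
    f_U = np.exp(-np.abs(s)) / (1.0 + np.exp(-np.abs(s))) ** 2
    integrand = f_U * (mu + r / (1.0 + np.exp(-s)))
    h = s[1] - s[0]
    return (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])) * h / r
```

The output agrees with \mu/r+1/2 up to quadrature error, matching the display above.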

Equation (13) can be proved analogously. However, since we are only concerned with the leading term, applying the dominated convergence theorem makes the argument easier. We use the change of variable t=T_{n}-(s+\log n+\log(1/w))/r, and then

\displaystyle\mathbb{E}\left[R^{k}_{i}\Big{|}W=w\right]
\displaystyle\qquad=\mathbb{E}\left[\Pi_{i}\left(\mathcal{A}\left(\lambda q(n\delta_{T_{n}}/w,\cdot),\left[\max_{i+1\leq j\leq i+k-1}H_{j},H_{i}\wedge H_{i+k}\right]\right)\right)\Big{|}W=w\right]
\displaystyle\qquad\qquad+\mathbb{P}\left(\max_{i+1\leq j\leq i+k-1}H_{j}\leq H_{i+k}\leq H_{i}\Big{|}W=w\right)
\displaystyle\qquad=\int_{0}^{T_{n}}\mathbb{P}\left(\max_{i+1\leq j\leq i+k-1}H_{j}\leq t\leq H_{i}\wedge H_{i+k}\Big{|}W=w\right)\lambda q(n\delta_{T_{n}}/w,t)\ dt+\frac{(k-1)!}{(k+1)!}
\displaystyle\qquad=\frac{1}{r}\int_{-\log n-\log(1/w)}^{rT_{n}-\log n-\log(1/w)}\mathbb{P}\left(\min_{i+1\leq j\leq i+k-1}U_{j}\geq s\geq U_{i}\vee U_{i+k}\right)
\displaystyle\qquad\qquad\lambda q(n\delta_{T_{n}}/w,T_{n}-(s+\log n+\log(1/w))/r)\ ds+\frac{1}{k(k+1)}.

As n\rightarrow\infty, we have -\log n-\log(1/w)\rightarrow-\infty, rT_{n}-\log n-\log(1/w)\rightarrow\infty, and \lambda q(n\delta_{T_{n}}/w,t(s))\rightarrow\mu+r/(1+e^{-s}). Note also that

\displaystyle\mathbb{P}\left(\min_{i+1\leq j\leq i+k-1}U_{j}\geq s\geq U_{i}\vee U_{i+k}\right)\lambda q(n\delta_{T_{n}}/w,t(s))\leq\lambda\mathbb{P}\left(\min_{i+1\leq j\leq i+k-1}U_{j}\geq s\geq U_{i}\vee U_{i+k}\right),

which is integrable over \mathbb{R}. Therefore, by the dominated convergence theorem and equation (10), we have

\displaystyle\lim_{n\rightarrow\infty}\mathbb{E}\left[R^{k}_{i}\Big{|}W=w\right]
\displaystyle\qquad=\frac{1}{r}\int_{-\infty}^{\infty}\mathbb{P}\left(\min_{i+1\leq j\leq i+k-1}U_{j}\geq s\geq U_{i}\vee U_{i+k}\right)\cdot\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds+\frac{1}{k(k+1)}
\displaystyle\qquad=\frac{1}{r}\int_{-\infty}^{\infty}\frac{e^{2s}}{(1+e^{s})^{k+1}}\cdot\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds+\frac{1}{k(k+1)}
\displaystyle\qquad=\frac{\mu}{r}\int_{-\infty}^{\infty}\frac{e^{2s}}{(1+e^{s})^{k+1}}\ ds+\int_{-\infty}^{\infty}\frac{e^{3s}}{(1+e^{s})^{k+2}}\ ds+\frac{1}{k(k+1)}
\displaystyle\qquad=\frac{\lambda-r}{r}\left(-\frac{1}{k}+\frac{1}{k-1}\right)+\left(\frac{1}{k+1}-\frac{2}{k}+\frac{1}{k-1}\right)+\frac{1}{k(k+1)}
\displaystyle\qquad=\frac{\lambda}{rk(k-1)}. (17)

To prove equation (12), note that R^{\geq 2}_{i} and R^{\geq 2}_{j} are conditionally independent given W=w if |i-j|\geq 2. It follows that

\lim_{n\rightarrow\infty}n^{-1}\textup{Var}\left(\sum_{i=1}^{n-2}R^{\geq 2}_{i}\Big{|}W=w\right)=\lim_{n\rightarrow\infty}\left(\textup{Var}\left(R^{\geq 2}_{i}\Big{|}W=w\right)+2\textup{Cov}\left(R^{\geq 2}_{i},R^{\geq 2}_{i+1}\Big{|}W=w\right)\right). (18)

To compute \textup{Var}(R^{\geq 2}_{i}|W=w), since R^{\geq 2,red}_{i} is \{0,1\}-valued, we have

\mathbb{E}\left[\left(R^{\geq 2,red}_{i}\right)^{2}\Big{|}W=w\right]=\mathbb{E}\left[R^{\geq 2,red}_{i}\Big{|}W=w\right]=\frac{1}{2}. (19)

By definition, R^{\geq 2,blue}_{i} is nonzero only if the \{0,1\}-valued random variable R^{\geq 2,red}_{i} is nonzero, so R^{\geq 2,red}_{i}R^{\geq 2,blue}_{i}=R^{\geq 2,blue}_{i}. It then follows from equation (16) that

\lim_{n\rightarrow\infty}\mathbb{E}\left[R^{\geq 2,red}_{i}R^{\geq 2,blue}_{i}\Big{|}W=w\right]=\lim_{n\rightarrow\infty}\mathbb{E}\left[R^{\geq 2,blue}_{i}\Big{|}W=w\right]=\frac{\mu}{r}+\frac{1}{2}. (20)

Recall that \textbf{Leb} denotes the Lebesgue measure on \mathbb{R}^{2} and t(s)=T_{n}-(s+\log n+\log(1/w))/r. A computation similar to the one for \mathbb{E}[R^{k}_{i}|W=w] gives

\displaystyle\mathbb{E}\left[\left(R^{\geq 2,blue}_{i}\right)^{2}\Big{|}W=w\right]
\displaystyle\qquad=\mathbb{E}\left[\Pi_{i}\left(\mathcal{A}(\lambda q(n\delta_{T_{n}}/w,\cdot),\left[H_{i+1},H_{i}\right])\right)^{2}\Big{|}W=w\right]
\displaystyle\qquad=\mathbb{E}\left[\textbf{Leb}\left(\mathcal{A}(\lambda q(n\delta_{T_{n}}/w,\cdot),\left[H_{i+1},H_{i}\right])\right)^{2}+\textbf{Leb}\left(\mathcal{A}(\lambda q(n\delta_{T_{n}}/w,\cdot),\left[H_{i+1},H_{i}\right])\right)\Big{|}W=w\right]
\displaystyle\qquad=\int_{0}^{T_{n}}\int_{0}^{T_{n}}\mathbb{P}\left(H_{i+1}\leq t_{1}\leq H_{i},H_{i+1}\leq t_{2}\leq H_{i}\Big{|}W=w\right)\lambda q(n\delta_{T_{n}}/w,t_{1})\lambda q(n\delta_{T_{n}}/w,t_{2})\ dt_{1}\ dt_{2}
\displaystyle\qquad\qquad+\mathbb{E}\left[R_{i}^{\geq 2,blue}|W=w\right]
\displaystyle\qquad=\frac{1}{r^{2}}\int_{-\log n-\log(1/w)}^{rT_{n}-\log n-\log(1/w)}\int_{-\log n-\log(1/w)}^{rT_{n}-\log n-\log(1/w)}\mathbb{P}\left(U_{i+1}\geq s_{1}\geq U_{i},U_{i+1}\geq s_{2}\geq U_{i}\right)
\displaystyle\qquad\qquad\lambda q(n\delta_{T_{n}}/w,t(s_{1}))\lambda q(n\delta_{T_{n}}/w,t(s_{2}))\ ds_{1}\ ds_{2}+\mathbb{E}\left[R_{i}^{\geq 2,blue}|W=w\right].

We apply the dominated convergence theorem as nn\rightarrow\infty, use equation (16), and evaluate the last integral using a computer to get

\displaystyle\lim_{n\rightarrow\infty}\mathbb{E}\left[\left(R^{\geq 2,blue}_{i}\right)^{2}\Big{|}W=w\right]
\displaystyle\qquad=\frac{1}{r^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\mathbb{P}(U_{i}\geq s_{1}\geq U_{i+1},U_{i}\geq s_{2}\geq U_{i+1})
\displaystyle\qquad\qquad\qquad\qquad\left(\mu+\frac{r}{1+e^{-s_{1}}}\right)\left(\mu+\frac{r}{1+e^{-s_{2}}}\right)\ ds_{1}\ ds_{2}+\frac{\mu}{r}+\frac{1}{2}
\displaystyle\qquad=\frac{2}{r^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{s_{1}}\frac{1}{1+e^{s_{1}}}\frac{e^{s_{2}}}{1+e^{s_{2}}}\left(\mu+\frac{r}{1+e^{-s_{1}}}\right)\left(\mu+\frac{r}{1+e^{-s_{2}}}\right)\ ds_{2}\ ds_{1}+\frac{\mu}{r}+\frac{1}{2}
\displaystyle\qquad=\frac{2}{r^{2}}\left(\frac{\pi^{2}}{6}\mu^{2}+\frac{\pi^{2}}{6}\mu r+\frac{1}{2}r^{2}\right)+\frac{\mu}{r}+\frac{1}{2}. (21)

Combining (15), (16), (19), (20) and (21), we have

\displaystyle\lim_{n\rightarrow\infty}\textup{Var}\left(R^{\geq 2}_{i}|W=w\right)=\frac{1}{2}+2\left(\frac{\mu}{r}+\frac{1}{2}\right)+\frac{2}{r^{2}}\left(\frac{\pi^{2}}{6}\mu^{2}+\frac{\pi^{2}}{6}\mu r+\frac{1}{2}r^{2}\right)+\frac{\mu}{r}+\frac{1}{2}-\left(\frac{\lambda}{r}\right)^{2}
\displaystyle=\left(\frac{\pi^{2}}{3}-1\right)\frac{\mu^{2}}{r^{2}}+\left(\frac{\pi^{2}}{3}+1\right)\frac{\mu}{r}+2. (22)

The computation for \textup{Cov}(R^{\geq 2}_{i},R^{\geq 2}_{i+1}|W=w) is analogous. We have

\mathbb{E}\left[R^{\geq 2,red}_{i}R^{\geq 2,red}_{i+1}\Big{|}W=w\right]=\mathbb{P}(H_{i+2}\leq H_{i+1}\leq H_{i}|W=w)=\frac{1}{6}, (23)
\displaystyle\lim_{n\rightarrow\infty}\mathbb{E}\left[R^{\geq 2,red}_{i}R^{\geq 2,blue}_{i+1}\Big{|}W=w\right]=\frac{1}{r}\int_{-\infty}^{\infty}\mathbb{P}(U_{i}\leq U_{i+1}\leq s\leq U_{i+2})\cdot\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds
\displaystyle=\frac{1}{r}\int_{-\infty}^{\infty}\frac{e^{2s}}{2(1+e^{s})^{3}}\cdot\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds
\displaystyle=\frac{\mu}{4r}+\frac{1}{6}, (24)
\displaystyle\lim_{n\rightarrow\infty}\mathbb{E}\left[R^{\geq 2,blue}_{i}R^{\geq 2,red}_{i+1}\Big{|}W=w\right]=\frac{1}{r}\int_{-\infty}^{\infty}\mathbb{P}(U_{i}\leq s\leq U_{i+1}\leq U_{i+2})\cdot\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds
\displaystyle=\frac{1}{r}\int_{-\infty}^{\infty}\frac{e^{s}}{2(1+e^{s})^{3}}\cdot\left(\mu+\frac{r}{1+e^{-s}}\right)\ ds
\displaystyle=\frac{\mu}{4r}+\frac{1}{12}, (25)

and evaluating the last integral using a computer,

\displaystyle\lim_{n\rightarrow\infty}\mathbb{E}\left[R^{\geq 2,blue}_{i}R^{\geq 2,blue}_{i+1}\Big{|}W=w\right]
\displaystyle=\frac{1}{r^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\mathbb{P}(U_{i}\leq s_{1}\leq U_{i+1}\leq s_{2}\leq U_{i+2})\cdot\left(\mu+\frac{r}{1+e^{-s_{1}}}\right)\cdot\left(\mu+\frac{r}{1+e^{-s_{2}}}\right)\ ds_{1}\ ds_{2}
\displaystyle=\frac{1}{r^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{s_{1}}\frac{1}{1+e^{s_{1}}}\left(\frac{e^{s_{1}}}{1+e^{s_{1}}}-\frac{e^{s_{2}}}{1+e^{s_{2}}}\right)\frac{e^{s_{2}}}{1+e^{s_{2}}}\cdot\left(\mu+\frac{r}{1+e^{-s_{1}}}\right)\cdot\left(\mu+\frac{r}{1+e^{-s_{2}}}\right)\ ds_{2}\ ds_{1}
\displaystyle=\frac{1}{r^{2}}\left(\left(2-\frac{\pi^{2}}{6}\right)\mu^{2}+\left(2-\frac{\pi^{2}}{6}\right)\mu r+\frac{1}{12}r^{2}\right). (26)

Combining (15), (16), (23), (24), (25) and (26), we get

\lim_{n\rightarrow\infty}\textup{Cov}(R^{\geq 2}_{i},R^{\geq 2}_{i+1}|W=w)=\left(1-\frac{\pi^{2}}{6}\right)\frac{\mu^{2}}{r^{2}}+\left(\frac{1}{2}-\frac{\pi^{2}}{6}\right)\frac{\mu}{r}-\frac{1}{2}. (27)

Combining (18), (22) and (27), we have

\lim_{n\rightarrow\infty}n^{-1}\textup{Var}\left(\sum_{i=1}^{n-2}R^{\geq 2}_{i}\Big{|}W=w\right)=\frac{\mu^{2}}{r^{2}}+\frac{2\mu}{r}+1=\frac{\lambda^{2}}{r^{2}}. ∎
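The way (22) and (27) combine into \lambda^{2}/r^{2} is a clean algebraic identity in a=\mu/r: the \pi^{2} terms cancel and the sum \textup{Var}+2\,\textup{Cov} equals (a+1)^{2}, using \mu/r+1=\lambda/r. A quick check (function names ours):

```python
import math

def var_limit(a):
    """Right-hand side of (22) with a = mu / r."""
    return (math.pi ** 2 / 3 - 1) * a ** 2 + (math.pi ** 2 / 3 + 1) * a + 2

def cov_limit(a):
    """Right-hand side of (27) with a = mu / r."""
    return (1 - math.pi ** 2 / 6) * a ** 2 + (0.5 - math.pi ** 2 / 6) * a - 0.5
```

For every a, var_limit(a) + 2 * cov_limit(a) equals (a + 1)**2, i.e. \lambda^{2}/r^{2}.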

5.2 Proof of Theorem 1.2

By Corollary 4.1, it suffices to prove Proposition 5.1. Conditional on W=w, the random sequences (R^{k}_{i})_{i\leq j} and (R^{k}_{i})_{i\geq j+k+1} are independent for all j. We can therefore apply the law of large numbers to (R^{k}_{i+j(k+1)})_{j\geq 1} for i=1,2,\dots,k,k+1. We have

\frac{\sum_{j=1}^{\lfloor(n-i)/(k+1)\rfloor}R^{k}_{i+j(k+1)}}{n}\stackrel{{\scriptstyle P}}{{\rightarrow}}\frac{\lambda}{rk(k-1)(k+1)}.

Then Proposition 5.1 follows from (13) and recombining the results for i=1,2,\dots,k,k+1.

5.3 Proof of Theorem 1.3 and Corollary 1.2

Proof of Theorem 1.3.

By Corollary 4.1, it suffices to prove Proposition 5.2. Recall from the discussion after Proposition 5.2 that it suffices to prove the central limit theorem conditional on W. Given W=w, the random sequences (R^{\geq 2}_{i})_{i\leq k} and (R^{\geq 2}_{i})_{i\geq k+2} are independent for all k. We apply the central limit theorem for m-dependent sequences to this sequence. By Theorem 3 of [7], it suffices to check the following conditions:

\liminf_{n\rightarrow\infty}\frac{1}{n}\textup{Var}\left(\sum_{i=1}^{n-2}R^{\geq 2}_{i}\Big{|}W=w\right)>0, (28)
\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n-2}\mathbb{E}\left[\left|R^{\geq 2}_{i}-\mathbb{E}\left[R^{\geq 2}_{i}\right]\right|^{2}\mathbbm{1}_{\left\{\left|R^{\geq 2}_{i}-\mathbb{E}\left[R^{\geq 2}_{i}\right]\right|>\epsilon\sqrt{n}\right\}}\Big{|}W=w\right]=0,\quad\forall\epsilon>0. (29)

If (28) and (29) hold, then Proposition 5.2 follows from (12) and (14). Equation (28) follows from Lemma 5.1. For equation (29), it is sufficient to prove the following bound; see p. 362 of [1]:

\sup_{n}\mathbb{E}\left[\left(R_{i}^{\geq 2}\right)^{p}\Big{|}W=w\right]<\infty,\qquad\text{ for all }p\in(2,\infty).

Recall that

\displaystyle R^{\geq 2}_{i}=R^{\geq 2,blue}_{i}+R^{\geq 2,red}_{i}
\displaystyle=\Pi_{i}(\mathcal{A}(\lambda q(n\delta_{T}/W,\cdot),[H_{i+1},H_{i}]))+\mathbbm{1}_{\{H_{i+1}\leq H_{i}\}}
\displaystyle\leq\Pi_{i}(\mathcal{A}(\lambda,[H_{i+1},H_{i}]))+1
\displaystyle=\Pi_{i}(\mathcal{A}(\lambda,[T-(\log n+\log(1/W)+U_{i+1})/r,T-(\log n+\log(1/W)+U_{i})/r]))+1.

Conditional on W=w, the right-hand side is distributed as \Pi_{i}(\mathcal{A}(\lambda,[-U_{i+1}/r,-U_{i}/r]))+1, a random variable that does not depend on n and has finite pth moment for all p\in(2,\infty). ∎

Sketch of proof of Corollary 1.2.

The proof of this corollary is similar to the proof of Proposition 11 in the supplementary information of [15]. Following the notation therein, we decompose the fluctuations of M^{\geq 2}_{n,T_{n}} into the fluctuations of the number of reproduction events R^{\geq 2}_{n,T_{n}} and the fluctuations of M^{\geq 2}_{n,T_{n}} given R^{\geq 2}_{n,T_{n}}. Let

A_{n}=\frac{M^{\geq 2}_{n,T_{n}}-\nu R^{\geq 2}_{n,T_{n}}}{\sqrt{n\lambda\nu/r}},\qquad C_{n}=\frac{1}{\sqrt{n}}\left(R_{n,T_{n}}^{\geq 2}-\frac{n\lambda}{r}\right),

so that

M^{\geq 2}_{n,T_{n}}=\frac{n\lambda\nu}{r}+\sqrt{\frac{n\lambda\nu}{r}}A_{n}+\sqrt{n}\nu C_{n}.

Then (A_{n},C_{n})\Rightarrow(Z_{1},Z_{2}), where Z_{1} and Z_{2} are independent normal random variables with mean 0 and variances 1 and \lambda^{2}/r^{2}, respectively. The \sqrt{n\lambda\nu/r}A_{n} term contributes the \lambda\nu/r part of the variance in Corollary 1.2, and the \sqrt{n}\nu C_{n} term contributes the remaining part. ∎

Acknowledgments


The author thanks Professor Jason Schweinsberg for bringing up this research topic, carefully reading the draft, and providing helpful advice in the development of this paper.

References

  • [1] P. Billingsley. Probability and Measure. Wiley Series in Probability and Statistics. Wiley, 1995.
  • [2] Ivana Bozic, Jeffrey M Gerold, and Martin A Nowak. Quantifying clonal and subclonal passenger mutations in cancer evolution. PLoS computational biology, 12(2):e1004731, 2016.
  • [3] Nicolas Champagnat and Amaury Lambert. Splitting trees with neutral Poissonian mutations I: Small families. Stochastic Processes and their Applications, 122:1003–1033, 2012.
  • [4] Nicolas Champagnat and Amaury Lambert. Splitting trees with neutral Poissonian mutations II: Largest and oldest families. Stochastic Processes and their Applications, 123:1368–1414, 2013.
  • [5] David Cheek and Samuel G. G. Johnston. Ancestral reproductive bias in branching processes. Journal of Mathematical Biology, 86(5):70, 2023.
  • [6] Cécile Delaporte, Guillaume Achaz, and Amaury Lambert. Mutational pattern of a sample from a critical branching population. Journal of Mathematical Biology, 73:627–664, 2016.
  • [7] P. H. Diananda. The central limit theorem for $m$-dependent variables. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 51, pages 92–95. Cambridge University Press, 1955.
  • [8] Khanh N. Dinh, Roman Jaksik, Marek Kimmel, Amaury Lambert, and Simon Tavaré. Statistical inference for the evolutionary history of cancer genomes. Statistical Science, 35(1):129–144, 2020.
  • [9] Rick Durrett. Population genetics of neutral mutations in exponentially growing cancer cell populations. The Annals of Applied Probability, 23(1):230–250, 2013.
  • [10] Einar Bjarki Gunnarsson, Kevin Leder, and Jasmine Foo. Exact site frequency spectra of neutrally evolving tumors: A transition between power laws reveals a signature of cell viability. Theoretical Population Biology, 142:67–90, 2021.
  • [11] Einar Bjarki Gunnarsson, Kevin Leder, and Xuanming Zhang. Limit theorems for the site frequency spectrum of neutral mutations in an exponentially growing population. arXiv preprint arXiv:2307.03346, 2023.
  • [12] Simon C. Harris, Samuel G.G. Johnston, and Matthew I. Roberts. The coalescent structure of continuous-time Galton-Watson trees. The Annals of Applied Probability, 30:1368–1414, 2020.
  • [13] Jan Lukas Igelbrink and Jasper Ischebeck. Ancestral reproductive bias in continuous time branching trees under various sampling schemes. arXiv preprint arXiv:2309.05998, 2023.
  • [14] Anastasia Ignatieva, Jotun Hein, and Paul A. Jenkins. A characterisation of the reconstructed birth–death process through time rescaling. Theoretical Population Biology, 134:61–76, 2020.
  • [15] Brian Johnson, Yubo Shuai, Jason Schweinsberg, and Kit Curtius. cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory. Bioinformatics, 39(9):btad561, 2023.
  • [16] Amaury Lambert. The contour of splitting trees is a Lévy process. The Annals of Probability, 38(1):348–395, 2010.
  • [17] Amaury Lambert. The coalescent of a sample from a binary branching process. Theoretical Population Biology, 122:30–35, 2018.
  • [18] Amaury Lambert and Tanja Stadler. Birth–death models and coalescent point processes: The shape and probability of reconstructed phylogenies. Theoretical Population Biology, 90:113–128, 2013.
  • [19] Mathieu Richard. Splitting trees with neutral mutations at birth. Stochastic Processes and their Applications, 124(10):3206–3230, 2014.
  • [20] Jason Schweinsberg and Yubo Shuai. Asymptotics for the site frequency spectrum associated with the genealogy of a birth and death process. arXiv preprint arXiv:2304.13851, 2023.
  • [21] Marc J. Williams, Benjamin Werner, Chris P. Barnes, Trevor A. Graham, and Andrea Sottoriva. Identification of neutral tumor evolution across cancer types. Nature Genetics, 48(3):238–244, 2016.