Transition in the ancestral reproduction rate and its implications for the site frequency spectrum
Abstract
Consider a supercritical birth and death process where the children acquire mutations. We study the mutation rates along the ancestral lineages in a sample of size from the population at time . The mutation rate is time-inhomogenous and has a natural probabilistic interpretation. We use these results to obtain asymptotic results for the site frequency spectrum associated with the sample.
1 Introduction
Consider a growing population with mutations introduced at the times of birth events. While it is a common assumption in population genetics that mutations are introduced in a time-homogeneous manner [9], the mutations observed in the sample have a time-inhomogenous behavior. This comes from the fact that birth events occurring at different times have different chances of being observed in the sample. In this paper, we analyze the mutations observed in the sample and apply this result to the site frequency spectrum.
1.1 Ancestral reproduction rate
Cheek and Johnston [5] considered a continuous-time Galton-Watson process where each individual reproduces at rate . In a reproduction event, an individual is replaced by individuals with probability . Denote by the population size at time and for its generating function. Conditional on survival up to time , i.e., , an individual is sampled uniformly at random from the population at time . They studied the rate of reproduction events along the ancestral lineage of the sampled individual. The rate is biased towards large reproduction events and is inhomogeneous along the ancestral lineage. Specifically, they show the following theorem.
Theorem 1.1 (Theorem 2.2 of [5]).
There exists a random variable with density on [0,1] such that conditional on , independently for each , size reproduction events occur along the uniform ancestral lineage according to a time inhomogeneous Poisson point process with intensity function
Igelbrink and Ischebeck [13] generalized this result to the case when the reproduction rates are age-dependent.
In general, one can also take a sample of size from the population. When we trace back the ancestral lineages in the sample, they will coalesce when individuals find their common ancestors, giving us a coalescent tree. A central question in population genetics is to understand this coalescent tree and the distribution of the mutations along the tree.
In this paper, we consider sampling individuals uniformly at random from a supercritical birth and death process with birth rate and death rate , where . We write for the net growth rate. When there is a birth event, we assume the number of mutations that the child acquires has a Poisson distribution with mean . The process starts with one individual and runs for units of time. As usual, we write for the population size at time and for for its generating function. Conditional on , we take a uniform sample of size .
We give the construction of the coalescent tree and the mutations in the following proposition, with the explanation given in Section 3. It is easier to construct the tree backward in time. That is, we trace back the lineages of the sampled individuals and track the time for individuals to find their common ancestor. If there are individuals in the sample, then there are coalescent times for . We should emphasize that the construction of the coalescent tree without mutations is obtained by Harris, Johnston, and Roberts [12] and Lambert [17] using two different approaches. The approach described in steps 1 and 2 of construction below comes from Lambert [17].
Proposition 1.1.
One can generate the coalescent tree with the mutations as follows.
-
1.
Take the sampling probability from the density
(1) where .
-
2.
Conditional on , the th branch length is , and independently for , the th branch length has density
(2) We draw vertical lines of lengths for . Then, we draw horizontal lines to the left, stopping when we hit a vertical line. See Figure 1 for an illustration.
Figure 1: A coalescent tree with sample size and 7 coalescent times. -
3.
Mutational events occur at the coalescent times and along the branches. Conditional on , mutational events along the branches occur independently along the th branch at times of an inhomogeneous Poisson process for . At time time units before sampling, the rate is , where
At each mutational event, the number of mutations has a Poisson distribution with mean .
We are interested in the case when the sample size is much smaller than the population size, so the sampling probability is close to 0. To interpret the time-inhomogeneity, is the probability conditional on , that an individual born time units before the sampling will have no descendants alive in the sample. As , we have , which is the probability that the newborn is not sampled. As , we have , which is the probability that the subtree generated by the newborn goes extinct. Therefore, we see more mutations along the branches near the bottom part of the tree.
To compare our result with that of Cheek and Johnston [5], observe that the density function of with is
Using the formula for , one can compute the density function of the random variable in [5] to be
Therefore, the random variable agrees with . To interpret in [5] correctly, note that the total rate of birth and death events is and the probability of a size two reproduction event, i.e., a birth event, is . Therefore, we have
The extra constant 2 comes from the fact that only the children get mutations in our model, and the discrepancy between and is because time goes forward in [5] and backward in Proposition 1.1.
1.2 The site frequency spectrum
A commonly used summary statistic for mutational data is the site frequency spectrum. The site frequency spectrum for individuals at time consists of , where is the number of mutations inherited by individuals. Durrett [9] considered a supercritical birth and death process where mutations occur at a constant rate along the branches with rate . He considered sampling the entire population and obtained that as the sampling time , the number of mutations affecting at least a fraction of the population is approximately . Similar results also appear in [2, 21]. Durrett [9] also considered taking a sample of size and showed that as the sampling time , for
(3) |
Gunnarsson, Leder, and Foo [10] considered models where the mutations occur at the times of reproduction events or at a constant rate along the branches. In both models, they computed the exact site frequency spectrum when the sample consists of the entire population or the individuals with an infinite line of descendants, at both a fixed time and a random time when the population size reaches a fixed level. Lambert [16] and Lambert and Stadler [18] developed a framework to study the site frequency spectrum when each individual is sampled independently with probability , conditioned to have individuals. Following this approach, Delaporte, Achaz, and Lambert [6] computed for critical birth and death process, for both fixed and for an unknown with some improper prior. Dinh et al. [8] computed for supercritical birth and death process. As noted by Ignatieva, Hein, and Jenkins [14], if we take the sampling probability in [8], then one can recover the shape in the site frequency spectrum as in formula (3).
Beyond the expected value of , there are other works on its asymptotic behavior. Gunnarsson, Leder, and Zhang [11] considered a supercritical Galton-Watson process and sampled the entire population. They proved a strong law of large numbers for the site frequency spectrum at deterministic times and a weak law of large numbers at random times when the population size reaches a deterministic level. Schweinsberg and Shuai [20] considered a sample of size from supercritical or critical birth and death processes. As the sample size and the population size go to infinity appropriately, they established the asymptotic normality for the length of the branches that support leaves. These results can then be translated into the asymptotic normality of the site frequency spectrum, assuming that mutations occur at a constant rate along the branches.
In this paper, since we assume mutations occur at the times of reproduction events, the site frequency spectrum is characterized by the number of reproduction events. Indeed, let (resp. ) be the number of reproduction events in which the child has (resp. at least 2) descendants in the sample. Then conditional on , has a Poisson distribution with mean , and the are independent of one another. Therefore, we will focus on and . We show the following theorem.
Theorem 1.2.
Let be a sequence such that
(4) |
Then for any fixed
where “” denotes convergence in probability as .
Corollary 1.1.
Under the same condition as Theorem 1.2, we have
Under a stronger assumption on the sample size and the population size, we can establish a central limit theorem for . The corresponding represents the number of mutations shared by at least two individuals in the sample. The quantity has been applied to estimate the growth rate of a tumor in [15]. We believe an analogous result should hold for , but the covariance computation is more involved.
Theorem 1.3.
Let be a sequence such that
(5) |
Then
where denotes convergence in distribution as .
Corollary 1.2.
Under the same condition as Theorem 1.3, we have
Remark 1.1.
Results similar to Theorem 1.2 and Corollary 1.2 have been established when the mutations occur at a constant rate along the branches. We take so that the expected number of mutations per unit of time is the same under both models. Schweinsberg and Shuai [20] showed the following result, which is similar to Theorem 1.2.
Theorem 1.4 (Theorem 1.3 of [20]).
Let be a fixed integer. Assume that
Write for the total length of the branches that support leaves in the coalescent tree of the sample. Then as ,
The hypothesis is similar to (4), and the site frequency spectrum is consistent with (3). Conditional on , the number of mutations inherited by individuals has a Poisson distribution with mean , consistent with Theorem 1.2.
Johnson et al. showed a result similar to Corollary 1.2. The quantity therein corresponds to . In the special case where the rates and are constants, Corollary 2 of [15] reduces to the following theorem.
The hypothesis is the same as (5), and the result is the same when we plug in . Therefore, our results establish that the assumption that mutations occur at a constant rate along the branches is not essential to the results in [15, 20], and similar results hold when mutations occur only at the birth times, which may be a more natural assumption for some applications.
Another closely related summary statistic for the mutational data is the allele frequency spectrum. For the allele frequency spectrum, the individuals are partitioned into families where each family consists of individuals carrying the same mutations. Then the allele frequency spectrum consists of where is the number of families of size . Richard [19] considered the allele frequency spectrum of the population in a birth and death process with age-dependent death rates and mutations at birth times. He computed the expected allele frequency spectrum and proved a law of large numbers for the allele frequency spectrum. Champagnat and Lambert considered the case where the death rate is age-dependent and the mutations occur at a constant rate along the branches. They computed the expected allele frequency spectrum when the sample consists of the entire population in [3] and studied the size of the largest family and the age of the oldest family in [4].
2 The contour process of the genealogical tree
Figure 2 is the planar representation of a genealogical tree. The ends of the vertical lines are the birth and death times for individuals, and the horizontal lines represent reproduction events. The individuals are ordered from left to right in the following way. An individual is on the left of if either is ancestral to , or when we trace back the lineages of and to their common ancestor, the ancestral lineage of belongs to the child. For example, in Figure 2, individual is on the left of because is an ancestor of , and is on the left of because the as we trace back the ancestral lineage of (the red line) and (the blue line), the ancestral lineage of belongs to the child.
The contour process of a genealogical tree is a càdlàg process which encodes the information of the tree. This process starts at the depth of the leftmost individual, keeps track of the distance from the root in the depth-first search of the genealogical tree, and is indexed by the total length of the searched branches. See Figure 3 for an illustration. This process has slope as we trace back the lineage in the genealogical tree unless we encounter a new birth event, in which case the contour process makes a jump in the size of the child’s lifespan.
One can also study the tree truncated at time and its contour process. We denote the contour processes for the original and truncated trees by and , respectively. Lambert [16] showed that (resp. ) has the same distribution as (resp. ) stopped at 0 defined as follows. Let be independent exponentially distributed random variables with mean representing the lifespan of individuals, and let be a Poisson process with rate counting the number of reproduction events found in the depth-first search. We define Lévy processes and by
That is, the process starts from the lifespan of the th individual, has drift as we trace back the lineage, and makes jumps when we encounter a new birth event. The process has the same dynamics except that the jumps are truncated so that does not go above .
3 Mutations in a sample from a population
In most applications, we only have access to the genetic information in a sample of the population at time . Therefore, we are interested in the coalescent tree that describes the genealogy of the sampled individuals. See Figure 4 for an illustration. The mutations in the sample can be classified into two groups depending on whether they are associated with branching events in the coalescent tree. Those associated with such branching points are marked red in Figure 4. The others are marked blue.
Two sampling schemes have been studied extensively in the literature: sampling each individual independently with probability and taking a uniform sample of size . We refer to them as Bernoulli sampling and uniform sampling, respectively. These two sampling schemes were connected by Lambert [17]. We will discuss the mutations in the Bernoulli sampling in Section 3.1 and the uniform sampling in Section 3.2.
In the rest of the paper, we will consider a variant of the model where the parent acquires the blue mutations. By the Markov property, whenever there is a birth event, the subtree generated by the child and its descendants has the same distribution as the subtree generated by the parent and the descendants that it has after the original birth event. Therefore, the distribution of the coalescent tree and the mutations are the same whether we assign the mutations to the parent or the child. For an illustration, one can compare Figures 4 and 5.
3.1 Bernoulli sampling
In this section, we describe the coalescent tree, together with the red and blue mutations in the Bernoulli sampling scheme. Lambert and Stadler [18] obtained the construction of the subtree. We outline the construction here for readers’ convenience. Recall that the contour process makes jumps at rate . That is, birth events occur at rate as we trace back the lineage. Consider a birth event occurring time units before the sampling time. Recalling the definition of from the introduction, the probability that this birth event does not give rise to a new leaf in the coalescent tree is
(6) |
Therefore, red birth events occur with rate , and blue birth events occur with rate , independently of each other. For a coupling argument in Section 4.2, red and blue birth events are represented using a Poisson point process. To be more precise, for any positive continuous functions on , we let be the area between and on . In the special case where on , we abbreviate the notation as . We also let be independent Poisson point process on with unit intensity. Then, for the Bernoulli sampling, one can obtain the coalescent tree and the mutations in the following way:
-
1.
We get an empty sample with probability . If the sample is non-empty, then we have an initial branch of length .
-
2.
Let be independent Poisson processes with rate on defined by for .
-
3.
Having constructed , if , then the th branch length is the time of the first jump in , i.e. . Otherwise, we have found all the individuals in the sample. Note that is also the time for the red reproduction event on the th individual.
-
4.
Blue birth events occur along the branches at rate . That is, the number of mutations along the th branch up to time is .
Remark 3.1.
This construction implies that conditional on a non-empty sample, the sample size has a geometric distribution. In particular, by taking , the population size satisfies for .
We compute here and refer the reader to equation (7) of [17] for the formula for the density of . We define
(7) |
By Remark 3.1, conditional on , has a geometric distribution with
Therefore, we have
(8) |
Solving this equation for and using the initial condition , we get the following formula for .
which agrees with the formula of in (1). This formula is also found in equation (4) of [17]. To compute , by (8) and Remark 3.1 again, we have
(9) |
3.2 Uniform sampling
Lambert [17] showed that the uniform sampling can be realized by taking from the density (1) and then performing the Bernoulli() sampling conditioned to have size . Therefore, for the uniform sampling, one can obtain the coalescent tree and the mutations in the following way:
-
1.
Take the sampling probability from the density (1).
-
2.
Conditional on , for , let be the Poisson process be defined in Section 3.1, conditional on .
-
3.
For , the th branch length is the time of the first jump in , conditional on . That is,
The density for is given by (2).
-
4.
Blue birth events occur along the branches at rate .
Remark 3.2.
For , we define (resp. ) to be the number of reproduction events along the th branch that support (resp. at least two) leaves in the sample. Then for , we have
and for ,
and
The first part in the definition of or counts the number of blue reproduction events with descendants in the sample on the th branch. The second part counts the number of red reproduction events that have descendants in the sample. See Figure 6 for an illustration. Note that there is no red mutation on the th branch. The total number of reproduction events with descendants in the sample is
Likewise, the number of reproduction events with at least two descendants in the sample on the th branch is
and for ,
The total number of reproduction events with at least two descendants in the sample is
4 Asymptotics for large and
4.1 Approximation for the sampling probability and branch length
Approximations for the sampling probability and the branch lengths when and are large are obtained in [15].
-
1.
Choose from the exponential density
-
2.
Choose for independently from the logistic distribution, with density
-
3.
Let .
-
4.
Let for .
Note that the ’s (hence the quantities defined below) depend on , although this dependence is not explicit in the notation. Using this approximation, we can, therefore, define an approximation for the number of reproduction events with at least two descendants in the sample on the th branch as
and
An approximation for the total number of reproduction events with at least two descendants in the sample, excluding the th branch, is
Similarly, for , we define
and for
and
An approximation for the total number of reproduction events inherited by individuals, excluding the th and the branches, is
4.2 Error bounds
In the rest part of the paper, the sampling time depends on . The errors in the approximation resulting from taking the limit as are bounded by Lemmas 4 and 5 in the supplementary material of [15]. We combine these lemmas into Lemma 4.1. The event in Lemma 4.1 is the intersection of the events and in [15]. The event is the intersection of and in [15]. The choice of comes from the proof of Lemma 6 of [15].
Lemma 4.1.
Assume condition (4) holds. The random variables , , and can be coupled so that the following hold.
-
1.
There is a sequence of events on which and as .
-
2.
As ,
Assume condition (5) holds. The random variables and can be coupled so that the following hold.
-
1.
There is a sequence of events on which and as .
-
2.
As ,
These error bounds allow us to bound the error in the approximations for and , with the aid of another technical lemma.
Lemma 4.2.
Let and be two independent sequences of i.i.d random variables such that and for all . Then there exist random variables such that the following hold.
-
1.
The random vectors and are equal in distribution.
-
2.
For , .
-
3.
, and so .
Proof.
Let be the ordering of the random variables . For , we define to be if is the th smallest among . Then, the second claim is straightforward from the definition. Since the random variables are i.i.d, is therefore a random permutation of the i.i.d random variables , which implies the first claim. The last claim follows from the second claim. ∎
Corollary 4.1.
Proof.
In view of Lemma 4.2, we may assume that conditional on , the random variables and are coupled so that if and only if . Therefore, the only error in the approximation comes from the blue mutations. We write and for the Lebesgue measure on . Conditional on the event , we have
and
For , we have
The second claim then follows from integrating over the event in Lemma 4.1. The first claim can be proved in an analogous way using the inequality
and the observation that has the same distribution as , and that . ∎
5 Proof of Theorems 1.2 and 1.3
In this section, we start with Propositions 5.1 and 5.2, which are the counterparts of Theorems 1.2 and 1.3 using the approximations and . We prove Propositions 5.1 and 5.2 in Sections 5.2 and 5.3. Theorems 1.2 and 1.3 then follow from Corollary 4.1. We should emphasize that and depend on and , although not made explicit in the notation.
Proposition 5.1.
Under condition (4), we have the following convergence in probability as
Proposition 5.2.
Under condition (5), we have the following convergence in distribution as ,
To prove Proposition 5.1 or 5.2, it suffices to show that the convergence in probability or distribution holds conditional on , regardless of the value of . Consider Proposition 5.2 for example; we will show that the conditional distribution of given is asymptotically normal with the same mean and variance given by Proposition 5.2, regardless of the value of . Proposition 5.2 then follows from applying the dominated convergence theorem to the characteristic function:
5.1 Moment computations
In this section, for positive functions , we write
We will also use the following fact. For positive integers ,
(10) |
Another observation is that condition (5) implies that
(11) |
Indeed, if equation (11) fails, then we may choose a sequence such that
Combined with (5), we can deduce that . However, this implies
which is a contradiction.
Lemma 5.1.
Proof.
To prove (14), for , we define
to be the approximations for the numbers of red and blue reproduction events on the th branch. Then we have
(15) |
We now compute . Using the change of variable , we have
We consider the limit of as . Recall the formula for from (3.1). Writing and noticing that
we have,
Therefore, using , there is a positive constant such that
By (10), we have
Therefore, under condition (11), we have
(16) |
Combining (15) and (16), we have
Then (14) follows from summing over .
Equation (13) can be proved analogously. However, since we are only concerned about the leading term, applying the dominated convergence theorem makes the argument easier. We use the change of variable , and then
As , we have , , and . Note also that
which is integrable over . Therefore, by the dominated convergence theorem and equation (10), we have
(17) |
To prove equation (12), note that and are conditionally independent given if . It follows that
(18) |
To compute , since is -valued, we have
(19) |
By the definition, is nonzero only if the -valued random variable is nonzero. Then, it follows from equation (16) that
(20) |
Recall that Leb denotes the Lebesgue measure on and . A computation similar to the computation for gives the following
We apply the dominated convergence theorem as , use equation (16), and evaluate the last integral using a computer to get
(21) |
Combining (14), (19), (20) and (5.1), we have
(22) |
The computation for is analogous. We have
(23) |
(24) |
(25) |
and evaluating the last integral using a computer,
(26) |
Combining (15), (16), (23), (5.1), (5.1) and (5.1), we get
(27) |
Combining (18), (5.1) and (27), we have
∎
5.2 Proof of Theorem 1.2
5.3 Proof of Theorem 1.3 and Corollary 1.2
Proof of Theorem 1.3.
By Corollary 4.1, it suffices to prove Proposition 5.2. Recall from the discussion after Proposition 5.2 that it suffices to prove the central limit theorem conditional on . Given , the random sequences and are independent for all . We apply the central limit theorem for -dependent sequences to this sequence. By Theorem 3 of [7], it suffices to check the following conditions:
(28) |
(29) |
If (28) and (29) hold, then Proposition 5.2 follows from (12) and (14). Equation (28) follows from Lemma 5.1. For equation (29), it is sufficient to prove the following bound; see p.362 of [1];
Recall that
Conditional on , the right hand side is distributed as , a random variable that does not depend on and has finite th momemnt for all . ∎
Sketch of proof of Corollary 1.2.
The proof of this corollary is similar to the proof of Proposition 11 in the supplementary information in [15]. Following the notation therein, we decompose the fluctuation in into the fluctuation in the number of reproduction events and the fluctuation of given . Let
so that
Then where are independent normal random variables with mean and variances and . The term contributes the term of the variance in Corollary 1.2 and the term contributes the other part of the variance. ∎
Acknowledgments
The author thanks Professor Jason Schweinsberg for bringing up this research topic, carefully reading the draft, and providing helpful advice in the development of this paper.
References
- [1] P. Billingsley. Probability and Measure. Wiley Series in Probability and Statistics. Wiley, 1995.
- [2] Ivana Bozic, Jeffrey M Gerold, and Martin A Nowak. Quantifying clonal and subclonal passenger mutations in cancer evolution. PLoS computational biology, 12(2):e1004731, 2016.
- [3] Nicolas Champagnat and Amaury Lambert. Splitting trees with neutral Poissonian mutations I: Small families. Stochastic Processes and their Applications, 122:1003–1033, 2012.
- [4] Nicolas Champagnat and Amaury Lambert. Splitting trees with neutral Poissonian mutations II: Largest and oldest families. Stochastic Processes and their Applications, 123:1368–1414, 2013.
- [5] David Cheek and Samuel GG Johnston. Ancestral reproductive bias in branching processes. Journal of Mathematical Biology, 86(5):70, 2023.
- [6] Cécile Delaporte, Guillaume Achaz, and Amaury Lambert. Mutational pattern of a sample from a critical branching population. Journal of Mathematical Biology, 73:627–664, 2016.
- [7] PH Diananda. The central limit theorem for -dependent variables. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 51, pages 92–95. Cambridge University Press, 1955.
- [8] Khanh N. Dinh, Roman Jaksik, Marek Kimmel, Amaury Lambert, and Simon Tavaré. Statistical Inference for the Evolutionary History of Cancer Genomes. Statistical Science, 35(1):129 – 144, 2020.
- [9] Rick Durrett. Population genetics of neutral mutations in exponentially growing cancer cell populations. The Annals of Applied Probability, 23:230, 2013.
- [10] Einar Bjarki Gunnarsson, Kevin Leder, and Jasmine Foo. Exact site frequency spectra of neutrally evolving tumors: A transition between power laws reveals a signature of cell viability. Theoretical Population Biology, 142:67–90, 2021.
- [11] Einar Bjarki Gunnarsson, Kevin Leder, and Xuanming Zhang. Limit theorems for the site frequency spectrum of neutral mutations in an exponentially growing population. arXiv preprint arXiv:2307.03346, 2023.
- [12] Simon C. Harris, Samuel G.G. Johnston, and Matthew I. Roberts. The coalescent structure of continuous-time Galton-Watson trees. The Annals of Applied Probability, 30:1368–1414, 2020.
- [13] Jan Lukas Igelbrink and Jasper Ischebeck. Ancestral reproductive bias in continuous time branching trees under various sampling schemes. arXiv preprint arXiv:2309.05998, 2023.
- [14] Anastasia Ignatieva, Jotun Hein, and Paul A. Jenkins. A characterisation of the reconstructed birth–death process through time rescaling. Theoretical Population Biology, 134:61–76, 2020.
- [15] Brian Johnson, Yubo Shuai, Jason Schweinsberg, and Kit Curtius. cloneRate: fast estimation of single-cell clonal dynamics using coalescent theory. Bioinformatics, 39(9):btad561, 2023.
- [16] Amaury Lambert. The contour of splitting trees is a Lévy process. The Annals of Probability, 38(1):348–395, 2010.
- [17] Amaury Lambert. The coalescent of a sample from a binary branching process. Theoretical Population Biology, 122:30–35, 2018.
- [18] Amaury Lambert and Tanja Stadler. Birth–death models and coalescent point processes: The shape and probability of reconstructed phylogenies. Theoretical Population Biology, 90:113–128, 2013.
- [19] Mathieu Richard. Splitting trees with neutral mutations at birth. Stochastic Processes and their Applications, 124(10):3206–3230, 2014.
- [20] Jason Schweinsberg and Yubo Shuai. Asymptotics for the site frequency spectrum associated with the genealogy of a birth and death process. arXiv preprint arXiv:2304.13851, 2023.
- [21] Marc J Williams, Benjamin Werner, Chris P Barnes, Trevor A Graham, and Andrea Sottoriva. Identification of neutral tumor evolution across cancer types. Nature genetics, 48(3):238–244, 2016.