Impossibility of consistent distance estimation from sequence lengths under the TKF91 model
Abstract
We consider the problem of distance estimation under the TKF91 model of sequence evolution by insertions, deletions and substitutions on a phylogeny. In an asymptotic regime where the expected sequence lengths tend to infinity, we show that no consistent distance estimation is possible from sequence lengths alone. More formally, we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to one.
1 Introduction
Phylogeny estimation consists in the inference of an evolutionary tree from extant species data, commonly molecular sequences (e.g. DNA, amino acid). A large body of theoretical work exists on the statistical properties of standard reconstruction methods [Ste16, War17]. Typically in such analyses, one assumes that sequences have evolved on a fixed rooted tree, from a common ancestor sequence to the leaf sequences, according to some Markovian stochastic process. Often these processes model site substitutions exclusively, with the underlying assumption being that the data has been properly aligned in a pre-processing step. In contrast, relatively little theoretical work has focused on models of insertions and deletions (indels) together with substitutions, in spite of the fact that such models have been around for some time [TKF91, TKF92]. See e.g. [Tha06, DR13, ARS15, FR20].
One extra piece of information available under indel models is the length of the sequence, which itself evolves according to a Markov process on the tree. The notable work of Thatte [Tha06] shows that leaf sequence lengths alone are in fact enough to reconstruct phylogenies, through a distance-based approach. More specifically, it is shown in [Tha06, (27)] that under the TKF91 model [TKF91] the expectation of the sequence length at a leaf conditioned on the sequence length at another leaf separated from by an amount of time is
(1) |
where is the expected length at stationarity, where are the rates of insertion and deletion respectively. (Full details on the TKF91 model are provided in Section 2.) Hence we see from (1) that the full distribution of sequence lengths suffices to recover and all ’s.
The tree topology can then be recovered using standard results about the metric properties of phylogenies [Ste16]. That is, the tree is identifiable from the sequence lengths under the TKF91 model in the sense that two distinct tree topologies necessarily produce distinct joint distributions of sequence lengths at the leaves.
It is also suggested in [Tha06]—without a full rigorous proof—that the scheme above could be used to reconstruct phylogenies from a single sample of sequence lengths at the leaves in the limit where . The latter asymptotics ensure that the expected sequence length at stationarity diverges, and serves as a proxy for the amount of data growing to infinity. However, in this short note, we show that no consistent distance estimator exists in this limit. Formally we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to as . Hence, while the tree is identifiable from the distribution of the sequence lengths at the leaves, one sample of the sequence lengths alone cannot be used in a distance-based approach of the type described above to reconstruct the tree consistently as . On the technical side our proof follows by noting that, under the TKF91 model, the sequence length is (morally) a sum of independent random variables with finite variances, to which we apply a central limit theorem. One complication is to obtain a limit theorem that is uniform in the parameter . We expect that our techniques will be useful to analyze other bioinformatics methods under indel processes, for instance methods based on -mer statistics (see e.g. [YZ08, Hau13]). Further intuition on our results is provided in Section 3.
2 Basic definitions
In this section, we recall the TKF91 sequence evolution model [TKF91]. To simplify the presentation, we restrict ourselves to a two-state version of the model, as we will only require the underlying sequence-length process.
Definition 1 (TKF91 model: two-state version).
Consider the following Markov process on the space of binary digit sequences together with an immortal link “”, that is,
where the notation above indicates that all sequences begin with the immortal link. Positions of a sequence are called sites. Let and with be given parameters. The continuous-time dynamics are as follows: If the current state is the sequence , then the following events occur independently:
-
•
Substitution: Each site except for the immortal link is substituted independently at rate . When a substitution occurs, the corresponding digit is replaced by and with probabilities and , respectively.
-
•
Deletion: Each site except for the immortal link is removed independently at rate .
-
•
Insertion: Each site gives birth to a new digit independently at rate . When a birth occurs, the new site is added immediately to the right of its parent site. The newborn site has digit and with probabilities and , respectively.
This indel process is time-reversible with respect to the measure given by
for each where , and . We assume further that . In that case, is the stationary distribution of .
We will be concerned with the underlying sequence-length process.
Definition 2 (Sequence length).
The length of a sequence is defined as the number of sites except for the immortal link and is denoted by .
Under , the sequence-length process is stationary and is geometrically distributed. Specifically the stationary distribution of the length process is
(2) |
We are interested in this process on a rooted tree . Denote the index set by . The root vertex is assigned a state , drawn from stationary distribution on . This state is then evolved down the tree according to the following recursive process. Moving away from the root, along each edge , conditionally on the state , we run the indel process for a time . Denote by the resulting state at . Then the full process is denoted by . In particular, the set of leaf states is .
Setting. Throughout this paper, we let be the probability measure when the root state is . If the root state is chosen according to a distribution , then we denote the probability measure by . We also denote by the conditional probability measure for the event that the root state has length .
For our purposes, it will suffice to focus on the space of star trees with two leaves that have the same finite distance from the root and are labeled as . This distance is the height of the tree. The indel process on a tree reduces to a pair of indel processes that are independent upon conditioning on the root state . We always assume the root state is chosen according to the equilibrium distribution . So the distribution of is
3 Main result
Our main theorem is an impossibility result: the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to as . Following [Tha06], we consider the asymptotic regime where , which implies that the expected sequence length at stationarity tends to . Recall that the total variation distance between two probability measures and on a countable measure space is
Theorem 1 (Impossibility of distance estimation from sequence lengths).
Let and be two trees in with heights respectively. For , we consider a TKF91 process on tree and let be the pair of sequence lengths at the leaves . Let
be the distribution of under . Then for any fixed deletion rate ,
(3) |
Recall that the total variation distance can be written as
So (3) implies that there is no test that can distinguish between and with probability going to as .
Proof idea. We first give a heuristic argument that underlies our formal proof. Without loss of generality, assume that the deletion rate is . The stationary length at the root is geometric with mean and standard deviation both of order . So we can think of the root length as roughly with significant probability. Ignoring the small effect of the immortal link and conditioning on , the lengths at the leaves are sums of independent random variables, specifically the progenies of the mortal links of the root. The mean and variance of these variables can be computed explicitly from continuous-time Markov chain theory (see (10) below; see also [Tha06, (27), (31)]). As , the difference in expectation between heights and turns out to be
(4) |
while the variance is of order
(5) |
The key observation is that the variance (5) is than the square of the expectation difference (4). Hence, by the central limit theorem, one can expect significant overlap between the length distributions under and , making them hard to distinguish even as . We formalize this argument next.
We observe that (3) is equivalent to
(6) |
Indeed the total variation distance between two probability measures and on a countable space can also be written as
The rest of the proof is to establish (6). It involves a series of steps:
-
1.
We first reduce the problem to a sum of independent random variables by conditioning on the root sequence length and ignoring the immortal link. In particular, we use the fact that there is a fairly uniform probability that is in an interval of size around . And we remove the effect of the immortal link by conditioning on its having no descendant, an event of positive probability.
-
2.
The central limit theorem (CLT) implies that there is a significant overlap between the two sums. More precisely, we need a local CLT (see e.g. [Dur10]) to derive the sort of pointwise lower bound needed in (6). However the bound we require must be uniform in and we did not find in the literature a result of quite this form. Instead, we use an argument based on the Berry-Esséen theorem (again see e.g. [Dur10]). We first establish overlap over constant size intervals for the sum of the first mortal links, and then we use the final mortal link to match the probabilities on common point values under heights and .
-
3.
Finally we bound the sum in (6).
4 Proof
In this section we give the details of the proof of Theorem 1. We follow the steps described in the previous section.
Step 1. Reducing the problem to a sum of independent random variables. We first show that in (6) can be replaced by where is of the order of the expected sequence length under . That is, we condition on the length of the ancestral sequence. After that we further ignore the progenies of the immortal link so that each leave sequence consists of i.i.d. progenies of the sites in the ancestral sequence. These two simplifications are achieved in (7) and (8) below respectively.
Let be the event that the immortal link of the root sequence produces no mortal link in either leaf sequences. Let be the probability conditioned on that event, and be a lower bound on the probability of uniform in . Under , the two components of are conditionally independent and each is a sum of i.i.d. random variables corresponding to the progenies of mortal links. Hence (7) is at least
(8) |
where we let be the transition probability of the length process without the immortal link.
The sum in (8) leads us to study the overlap between the probability distributions for and . The central limit theorem is what we need. However, because of our need for a bound that is uniform in , we shall apply the Berry-Esséen theorem. Specifically, we apply the latter bound to the progenies of the first mortal links of the root sequence. The idea is to show that summands in (8) have value , for each of and separately, and then use the last mortal link to “match” all these values between and .
Step 2a. Establishing a uniform bound for . Note that is the distribution of , where are i.i.d. random variables having the distribution of the progeny length of a single mortal link at time .
Let the mean and the variance of be
(9) |
As is expected, the distribution is approximately Gaussian with mean and variance . We quantify this statement in the bound (11) below, after proving some moment bounds.
Lemma 1.
Let and be the mean and the standard deviation of defined in (9) and consider the absolute third moment . For any ,
(10) |
Furthermore,
Proof.
Moreover, from [TKF91, (8)–(10)], the probability that a normal link as descendants including itself is
where . It can be seen from L’Hospital’s rule that is continuous as a function of around and that as . From this explicit formula for the probability mass function of , which we note is a geometric sequence, it follows that all moments of are bounded from above uniformly in .
To show that the variance is bounded from below uniformly in , we note (again using L’Hospital’s rule) that is continuous in around , strictly positive and tends to as . Hence the variance is bounded from below, uniformly in ∎
Let be the cumulative distribution function (CDF) of the probability distribution . That is,
Lemma 2 (Uniform bound for ).
For each , there exists a constant such that
(11) |
for all , where is the CDF of the standard normal distribution.
Proof.
Step 2b. Controlling the overlap of and in (8). To quantify the overlap between and , we first compare their expectations. From the formula of in (10), we have
and so, for , the means of for and are close in the sense that
(12) |
for some not depending on .
Now consider the interval with length roughly the standard deviation and centered at around one of the means, . Then consider an equi-partition of this interval into roughly many pieces of constant length. Precisely, we write and for simplicity. Then for an arbitrary constant ,
where the sub-intervals
have constant width for , and
Lemma 3 below says that there exists a constants and large enough (depending on and but not on ) such that each of these intervals contains mass at least under both probability distributions and . See Figure 1. Write for simplicity.
Lemma 3.
There exist positive constants such that, with and ,
for all , and .
Proof.
The Berry-Esséen theorem (11) implies that
for all and , where
Then is roughly an equi-partition of the interval into sub-intervals of length . Furthermore,
Pick large enough (call it ), we obtain from the first display in this proof that
for some constant that does not depend on . By the same argument and using (12), we have
for some constant that does not depend on , even though is constructed using . The proof is complete by taking . ∎
Step 2c. Matching and by the last mortal link. Lemma 3 establishes overlap of and over constant size intervals. The next lemma uses the final mortal link to establish overlap of and over specific values.
Lemma 4.
There exists a positive constant such that
for all , and .
Proof.
By Lemma 3, contains at least one integer, say , with mass at least under the probability measure . This is because there are integers in . Similarly, there exists with mass at least under . Hence
Let be an arbitrary integer in . The progeny of the -th mortal link has positive probability, say , over integers in , uniformly over . It follows that
and similar for . The proof is complete. ∎
Acknowledgments
SR was supported by NSF grants DMS-1614242, CCF-1740707 (TRIPODS), DMS-1902892, and DMS-1916378, as well as a Simons Fellowship and a Vilas Associates Award. BL was supported by DMS-1614242, CCF-1740707 (TRIPODS), DMS-1902892 (to SR). WTF was supported by NSF grant DMS-1614242 (to SR) and DMS-1855417.
References
- [ARS15] Elizabeth S. Allman, John A. Rhodes, and Seth Sullivant. Statistically consistent k-mer methods for phylogenetic tree reconstruction. Journal of computational biology : a journal of computational molecular cell biology, 24 2:153–171, 2015.
- [DR13] Constantinos Daskalakis and Sebastien Roch. Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann. Appl. Probab., 23(2):693–721, 2013.
- [Dur10] Rick Durrett. Probability: theory and examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, fourth edition, 2010.
- [FR20] Wai-Tong Louis Fan and Sebastien Roch. Statistically consistent and computationally efficient inference of ancestral dna sequences in the tkf91 model under dense taxon sampling. Bulletin of Mathematical Biology, 82(2):21, 2020.
- [Hau13] Bernhard Haubold. Alignment-free phylogenetics and population genetics. Briefings in Bioinformatics, 15(3):407–418, 11 2013.
- [Ste16] Mike Steel. Phylogeny—discrete and random processes in evolution, volume 89 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2016.
- [Tha06] Bhalchandra D. Thatte. Invertibility of the tkf model of sequence evolution. Mathematical Biosciences, 200(1):58 – 75, 2006.
- [TKF91] Jeffrey L. Thorne, Hirohisa Kishino, and Joseph Felsenstein. An evolutionary model for maximum likelihood alignment of dna sequences. Journal of Molecular Evolution, 33(2):114–124, Aug 1991.
- [TKF92] Jeffrey L Thorne, Hirohisa Kishino, and Joseph Felsenstein. Inching toward reality: an improved likelihood model of sequence evolution. Journal of molecular evolution, 34(1):3–16, 1992.
- [War17] Tandy Warnow. Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. Cambridge University Press, USA, 1st edition, 2017.
- [YZ08] Kuan Yang and Liqing Zhang. Performance comparison between k -tuple distance and four model-based distances in phylogenetic tree reconstruction . Nucleic Acids Research, 36(5):e33–e33, 02 2008.