This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Impossibility of consistent distance estimation from sequence lengths under the TKF91 model

Wai-Tong Louis Fan111 Department of Mathematics, Indiana University, Bloomington.    Brandon Legried222 Department of Mathematics, University of Wisconsin–Madison.    Sebastien Roch333 Department of Mathematics, University of Wisconsin–Madison.
Abstract

We consider the problem of distance estimation under the TKF91 model of sequence evolution by insertions, deletions and substitutions on a phylogeny. In an asymptotic regime where the expected sequence lengths tend to infinity, we show that no consistent distance estimation is possible from sequence lengths alone. More formally, we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to one.

1 Introduction

Phylogeny estimation consists in the inference of an evolutionary tree from extant species data, commonly molecular sequences (e.g. DNA, amino acid). A large body of theoretical work exists on the statistical properties of standard reconstruction methods [Ste16, War17]. Typically in such analyses, one assumes that sequences have evolved on a fixed rooted tree, from a common ancestor sequence to the leaf sequences, according to some Markovian stochastic process. Often these processes model site substitutions exclusively, with the underlying assumption being that the data has been properly aligned in a pre-processing step. In contrast, relatively little theoretical work has focused on models of insertions and deletions (indels) together with substitutions, in spite of the fact that such models have been around for some time [TKF91, TKF92]. See e.g. [Tha06, DR13, ARS15, FR20].

One extra piece of information available under indel models is the length of the sequence, which itself evolves according to a Markov process on the tree. The notable work of Thatte [Tha06] shows that leaf sequence lengths alone are in fact enough to reconstruct phylogenies, through a distance-based approach. More specifically, it is shown in [Tha06, (27)] that under the TKF91 model [TKF91] the expectation of the sequence length NvN_{v} at a leaf vv conditioned on the sequence length NuN_{u} at another leaf uu separated from vv by an amount of time tuvt_{uv} is

𝒩v(t)=L¯+(NuL¯)eμtuv(1λ/μ)\displaystyle\mathcal{N}_{v}(t)=\bar{L}+\left(N_{u}-\bar{L}\right)e^{-\mu t_{uv}(1-\lambda/\mu)} (1)

where L¯=λ/μ1λ/μ\bar{L}=\frac{\lambda/\mu}{1-\lambda/\mu} is the expected length at stationarity, where λ<μ\lambda<\mu are the rates of insertion and deletion respectively. (Full details on the TKF91 model are provided in Section 2.) Hence we see from (1) that the full distribution of sequence lengths suffices to recover λ/μ\lambda/\mu and all μtuv\mu t_{uv}’s.

The tree topology can then be recovered using standard results about the metric properties of phylogenies [Ste16]. That is, the tree is identifiable from the sequence lengths under the TKF91 model in the sense that two distinct tree topologies T1T2T_{1}\neq T_{2} necessarily produce distinct joint distributions of sequence lengths at the leaves.

It is also suggested in [Tha06]—without a full rigorous proof—that the scheme above could be used to reconstruct phylogenies from a single sample of sequence lengths at the leaves in the limit where λμ\lambda\nearrow\mu. The latter asymptotics ensure that the expected sequence length at stationarity L¯\bar{L} diverges, and serves as a proxy for the amount of data growing to infinity. However, in this short note, we show that no consistent distance estimator exists in this limit. Formally we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to 11 as λμ\lambda\nearrow\mu. Hence, while the tree is identifiable from the distribution of the sequence lengths at the leaves, one sample of the sequence lengths alone cannot be used in a distance-based approach of the type described above to reconstruct the tree consistently as λμ\lambda\nearrow\mu. On the technical side our proof follows by noting that, under the TKF91 model, the sequence length is (morally) a sum of independent random variables with finite variances, to which we apply a central limit theorem. One complication is to obtain a limit theorem that is uniform in the parameter λ/μ\lambda/\mu. We expect that our techniques will be useful to analyze other bioinformatics methods under indel processes, for instance methods based on kk-mer statistics (see e.g. [YZ08, Hau13]). Further intuition on our results is provided in Section 3.


Organization. The rest of the paper is organized as follows. The TKF91 model is reviewed in Section 2. Our main result, together with a proof sketch, is stated in Section 3. Details of the proof are provided in Section 4.

2 Basic definitions

In this section, we recall the TKF91 sequence evolution model [TKF91]. To simplify the presentation, we restrict ourselves to a two-state version of the model, as we will only require the underlying sequence-length process.

Definition 1 (TKF91 model: two-state version).

Consider the following Markov process ={t}t0\mathcal{I}=\{\mathcal{I}_{t}\}_{t\geq 0} on the space 𝒮\mathcal{S} of binary digit sequences together with an immortal link “\bullet, that is,

𝒮:=``"M1{0,1}M,\mathcal{S}:=``\bullet"\otimes\bigcup_{M\geq 1}\{0,1\}^{M},

where the notation above indicates that all sequences begin with the immortal link. Positions of a sequence are called sites. Let (ν,λ,μ)(0,)3(\nu,\lambda,\mu)\in(0,\infty)^{3} and (π0,π1)[0,1]2(\pi_{0},\pi_{1})\in[0,1]^{2} with π0+π1=1\pi_{0}+\pi_{1}=1 be given parameters. The continuous-time dynamics are as follows: If the current state is the sequence x𝒮\vec{x}\in\mathcal{S}, then the following events occur independently:

  • Substitution: Each site except for the immortal link is substituted independently at rate ν>0\nu>0. When a substitution occurs, the corresponding digit is replaced by 0 and 11 with probabilities π0\pi_{0} and π1\pi_{1}, respectively.

  • Deletion: Each site except for the immortal link is removed independently at rate μ\mu.

  • Insertion: Each site gives birth to a new digit independently at rate λ\lambda. When a birth occurs, the new site is added immediately to the right of its parent site. The newborn site has digit 0 and 11 with probabilities π0\pi_{0} and π1\pi_{1}, respectively.

This indel process is time-reversible with respect to the measure Π\Pi given by

Π(x)=(1λμ)(λμ)Mi=1Mπxi\Pi(\vec{x})=\left(1-\frac{\lambda}{\mu}\right)\left(\frac{\lambda}{\mu}\right)^{M}\prod_{i=1}^{M}\pi_{x_{i}}

for each x=(x1,x2,,xM){0,1}M\vec{x}=(x_{1},x_{2},\cdots,x_{M})\in\{0,1\}^{M} where M1M\geq 1, and Π(``")=(1λμ)\Pi(``\bullet")=\left(1-\frac{\lambda}{\mu}\right). We assume further that λ<μ\lambda<\mu. In that case, Π\Pi is the stationary distribution of \mathcal{I}.

We will be concerned with the underlying sequence-length process.

Definition 2 (Sequence length).

The length of a sequence x=(,x1,,xM)\vec{x}=(\bullet,x_{1},...,x_{M}) is defined as the number of sites except for the immortal link and is denoted by |x|=M|\vec{x}|=M.

Under Π\Pi, the sequence-length process |||\mathcal{I}| is stationary and is geometrically distributed. Specifically the stationary distribution of the length process |||\mathcal{I}| is

γM(λ):=(1λμ)(λμ)M,M+.\gamma^{(\lambda)}_{M}:=\left(1-\frac{\lambda}{\mu}\right)\left(\frac{\lambda}{\mu}\right)^{M},\quad M\in\mathbb{Z}_{+}. (2)

We are interested in this process on a rooted tree TT. Denote the index set by ΓT\Gamma_{T}. The root vertex ρ\rho is assigned a state ρ𝒮\mathcal{I}_{\rho}\in\mathcal{S}, drawn from stationary distribution on 𝒮\mathcal{S}. This state is then evolved down the tree according to the following recursive process. Moving away from the root, along each edge e=(u,v)Ee=(u,v)\in E, conditionally on the state u\mathcal{I}_{u}, we run the indel process for a time (u,v)\ell_{(u,v)}. Denote by t\mathcal{I}_{t} the resulting state at tet\in e. Then the full process is denoted by {t}tΓT\{\mathcal{I}_{t}\}_{t\in\Gamma_{T}}. In particular, the set of leaf states is T={v:vT}\mathcal{I}_{\partial T}=\{\mathcal{I}_{v}:v\in\partial T\}.


Setting. Throughout this paper, we let x\mathbb{P}_{\vec{x}} be the probability measure when the root state is x\vec{x}. If the root state is chosen according to a distribution ν\nu, then we denote the probability measure by ν\mathbb{P}_{\nu}. We also denote by M\mathbb{P}_{M} the conditional probability measure for the event that the root state has length MM.

For our purposes, it will suffice to focus on the space 𝒯2\mathcal{T}_{2} of star trees with two leaves that have the same finite distance h(0,)h\in(0,\infty) from the root and are labeled as {1,2}\{1,2\}. This distance hh is the height of the tree. The indel process on a tree T𝒯2T\in\mathcal{T}_{2} reduces to a pair of indel processes (t1,t2)t0(\mathcal{I}_{t}^{1},\mathcal{I}_{t}^{2})_{t\geq 0} that are independent upon conditioning on the root state ρ=01=02\mathcal{I}_{\rho}=\mathcal{I}_{0}^{1}=\mathcal{I}_{0}^{2}. We always assume the root state is chosen according to the equilibrium distribution Π\Pi. So the distribution of (01,02)𝒮×𝒮(\mathcal{I}_{0}^{1},\mathcal{I}_{0}^{2})\in\mathcal{S}\times\mathcal{S} is

ν^0(x,y)={Π(x) if x=y,0otherwise.\widehat{\nu}_{0}(\vec{x},\vec{y})=\begin{cases}\Pi(\vec{x})&\text{ if }\vec{x}=\vec{y},\\ 0&\textnormal{otherwise.}\end{cases}

3 Main result

Our main theorem is an impossibility result: the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to 11 as λμ\lambda\nearrow\mu. Following [Tha06], we consider the asymptotic regime where λμ\lambda\nearrow\mu, which implies that the expected sequence length at stationarity tends to ++\infty. Recall that the total variation distance between two probability measures τ1\tau_{1} and τ2\tau_{2} on a countable measure space EE is

τ1τ2TV=12σE|τ1(σ)τ2(σ)|.\displaystyle\|\tau_{1}-\tau_{2}\|_{TV}=\frac{1}{2}\sum_{\sigma\in E}\left|\tau_{1}(\sigma)-\tau_{2}(\sigma)\right|.
Theorem 1 (Impossibility of distance estimation from sequence lengths).

Let T1T^{1} and T2T^{2} be two trees in 𝒯2\mathcal{T}_{2} with heights h1>h2>0h_{1}>h_{2}>0 respectively. For i{1,2}i\in\{1,2\}, we consider a TKF91 process on tree TiT^{i} and let N(i)=(N1(i),N2(i))+2\vec{N}^{(i)}=(N_{1}^{(i)},N_{2}^{(i)})\in\mathbb{Z}_{+}^{2} be the pair of sequence lengths at the leaves Ti\partial T^{i}. Let

i=Π(N(i))\mathcal{L}_{i}=\mathbb{P}_{\Pi}(\vec{N}^{(i)}\in\cdot)

be the distribution of N(i)\vec{N}^{(i)} under Π\mathbb{P}_{\Pi}. Then for any fixed deletion rate μ(0,)\mu\in(0,\infty),

lim supλμ12TV<1.\limsup_{\lambda\nearrow\mu}\|\mathcal{L}_{1}-\mathcal{L}_{2}\|_{TV}<1. (3)

Recall that the total variation distance can be written as

τ1τ2TV=supAE|τ1(A)τ2(A)|.\displaystyle\|\tau_{1}-\tau_{2}\|_{TV}=\sup_{A\subseteq E}\left|\tau_{1}(A)-\tau_{2}(A)\right|.

So (3) implies that there is no test that can distinguish between 1\mathcal{L}_{1} and 2\mathcal{L}_{2} with probability going to 11 as λμ\lambda\nearrow\mu.


Proof idea. We first give a heuristic argument that underlies our formal proof. Without loss of generality, assume that the deletion rate is μ=1\mu=1. The stationary length MM at the root is geometric with mean and standard deviation both of order 1/(1λ)1/(1-\lambda). So we can think of the root length as roughly MC/(1λ)M\approx C/(1-\lambda) with significant probability. Ignoring the small effect of the immortal link and conditioning on MM, the lengths at the leaves are sums of independent random variables, specifically the progenies of the MM mortal links of the root. The mean and variance of these variables can be computed explicitly from continuous-time Markov chain theory (see (10) below; see also [Tha06, (27), (31)]). As λ1\lambda\nearrow 1, the difference in expectation between heights h1h_{1} and h2h_{2} turns out to be

Me(1λ)h1Me(1λ)h2C1λ[(1λ)h1+(1λ)h2]C(h2h1),\displaystyle Me^{-(1-\lambda)h_{1}}-Me^{-(1-\lambda)h_{2}}\approx\frac{C}{1-\lambda}[-(1-\lambda)h_{1}+(1-\lambda)h_{2}]\approx C(h_{2}-h_{1}), (4)

while the variance is of order

Me(1λ)hi(1e(1λ)hi)1λC1λ(1λ)hi1λChi1λ.\displaystyle M\frac{e^{-(1-\lambda)h_{i}}(1-e^{-(1-\lambda)h_{i}})}{1-\lambda}\approx\frac{C}{1-\lambda}\frac{(1-\lambda)h_{i}}{1-\lambda}\approx\frac{Ch_{i}}{1-\lambda}. (5)

The key observation is that the variance (5) is \gg than the square of the expectation difference (4). Hence, by the central limit theorem, one can expect significant overlap between the length distributions under h1h_{1} and h2h_{2}, making them hard to distinguish even as λ1\lambda\nearrow 1. We formalize this argument next.

We observe that (3) is equivalent to

lim infλ1y+2Π(N(1)=y)Π(N(2)=y)>0.\liminf_{\lambda\nearrow 1}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\mathbb{P}_{\Pi}(\vec{N}^{(1)}=\vec{y})\wedge\mathbb{P}_{\Pi}(\vec{N}^{(2)}=\vec{y})>0. (6)

Indeed the total variation distance between two probability measures τ1\tau_{1} and τ2\tau_{2} on a countable space EE can also be written as

τ1τ2TV=1σEτ1(σ)τ2(σ).\displaystyle\|\tau_{1}-\tau_{2}\|_{TV}=1-\sum_{\sigma\in E}\tau_{1}(\sigma)\land\tau_{2}(\sigma).

The rest of the proof is to establish (6). It involves a series of steps:

  1. 1.

    We first reduce the problem to a sum of independent random variables by conditioning on the root sequence length and ignoring the immortal link. In particular, we use the fact that there is a fairly uniform probability that MM is in an interval of size 1/(1λ)1/(1-\lambda) around 11. And we remove the effect of the immortal link by conditioning on its having no descendant, an event of positive probability.

  2. 2.

    The central limit theorem (CLT) implies that there is a significant overlap between the two sums. More precisely, we need a local CLT (see e.g. [Dur10]) to derive the sort of pointwise lower bound needed in (6). However the bound we require must be uniform in λ\lambda and we did not find in the literature a result of quite this form. Instead, we use an argument based on the Berry-Esséen theorem (again see e.g. [Dur10]). We first establish overlap over Ω(M)\Omega(\sqrt{M}) constant size intervals for the sum of the first M1M-1 mortal links, and then we use the final mortal link to match the probabilities on common point values under heights h1h_{1} and h2h_{2}.

  3. 3.

    Finally we bound the sum in (6).

4 Proof

In this section we give the details of the proof of Theorem 1. We follow the steps described in the previous section.


Step 1. Reducing the problem to a sum of independent random variables. We first show that Π\mathbb{P}_{\Pi} in (6) can be replaced by M{\mathbb{P}}_{M} where MM is of the order of the expected sequence length 1/(1λ)1/(1-\lambda) under Π\Pi. That is, we condition on the length of the ancestral sequence. After that we further ignore the progenies of the immortal link so that each leave sequence consists of i.i.d. progenies of the MM sites in the ancestral sequence. These two simplifications are achieved in (7) and (8) below respectively.

Precisely, for any λ(0,1)\lambda_{*}\in(0,1) and 0<c1<1<c2<+0<c_{1}<1<c_{2}<+\infty, using (2)

lim infλ1y+2Π(N(1)=y)Π(N(2)=y)\displaystyle\liminf_{\lambda\nearrow 1}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\mathbb{P}_{\Pi}(\vec{N}^{(1)}=\vec{y})\wedge\mathbb{P}_{\Pi}(\vec{N}^{(2)}=\vec{y})
infλ(λ,1)y+2Π(N(1)=y)Π(N(2)=y)\displaystyle\geq\inf_{\lambda\in(\lambda_{*},1)}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\mathbb{P}_{\Pi}(\vec{N}^{(1)}=\vec{y})\wedge\mathbb{P}_{\Pi}(\vec{N}^{(2)}=\vec{y})
=infλ(λ,1)y+2[M+γM(λ)M(N(1)=y)][M+γM(λ)M(N(2)=y)]\displaystyle=\inf_{\lambda\in(\lambda_{*},1)}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\left[\sum_{M\in\mathbb{Z}_{+}}\gamma^{(\lambda)}_{M}\,\mathbb{P}_{M}(\vec{N}^{(1)}=\vec{y})\right]\wedge\left[\sum_{M\in\mathbb{Z}_{+}}\gamma^{(\lambda)}_{M}\,\mathbb{P}_{M}(\vec{N}^{(2)}=\vec{y})\right]
=infλ(λ,1)y+2M+(1λ)λM[M(N(1)=y)M(N(2)=y)]\displaystyle=\inf_{\lambda\in(\lambda_{*},1)}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\sum_{M\in\mathbb{Z}_{+}}(1-\lambda)\lambda^{M}\left[\mathbb{P}_{M}(\vec{N}^{(1)}=\vec{y})\wedge\mathbb{P}_{M}(\vec{N}^{(2)}=\vec{y})\right]
infλ(λ,1)M[c11λ,c21λ](1λ)λMy+2M(N(1)=y)M(N(2)=y)\displaystyle\geq\inf_{\lambda\in(\lambda_{*},1)}\sum_{M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right]}(1-\lambda)\lambda^{M}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\mathbb{P}_{M}(\vec{N}^{(1)}=\vec{y})\wedge\mathbb{P}_{M}(\vec{N}^{(2)}=\vec{y})
c3(c2c1)infλ(λ,1)infM[c11λ,c21λ]y+2M(N(1)=y)M(N(2)=y),\displaystyle\geq c_{3}(c_{2}-c_{1})\inf_{\lambda\in(\lambda_{*},1)}\inf_{M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right]}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\mathbb{P}_{M}(\vec{N}^{(1)}=\vec{y})\wedge\mathbb{P}_{M}(\vec{N}^{(2)}=\vec{y}), (7)

where c3c_{3} is a lower bound on λM\lambda^{M} for M[c11λ,c21λ]M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right] and λ(λ,1)\lambda\in(\lambda_{*},1).

Let 𝒵0\mathcal{Z}_{0} be the event that the immortal link of the root sequence produces no mortal link in either leaf sequences. Let M,\mathbb{P}_{M,\bullet} be the probability conditioned on that event, and c4c_{4} be a lower bound on the probability of 𝒵0\mathcal{Z}_{0} uniform in λ(λ,1)\lambda\in(\lambda_{*},1). Under M,\mathbb{P}_{M,\bullet}, the two components of N(1)\vec{N}^{(1)} are conditionally independent and each is a sum of MM i.i.d. random variables corresponding to the progenies of mortal links. Hence (7) is at least

c4c3(c2c1)infλ(λ,1)infM[c11λ,c21λ]y+2M,(N(1)=y)M,(N(2)=y)\displaystyle c_{4}c_{3}(c_{2}-c_{1})\inf_{\lambda\in(\lambda_{*},1)}\inf_{M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right]}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\mathbb{P}_{M,\bullet}(\vec{N}^{(1)}=\vec{y})\wedge\mathbb{P}_{M,\bullet}(\vec{N}^{(2)}=\vec{y})
c4c3(c2c1)infλ(λ,1)infM[c11λ,c21λ]y+2[pM,y1(λ)(h1)pM,y2(λ)(h1)][pM,y1(λ)(h2)pM,y2(λ)(h2)].\displaystyle\geq c_{4}c_{3}(c_{2}-c_{1})\inf_{\lambda\in(\lambda_{*},1)}\inf_{M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right]}\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\left[p_{M,y_{1}}^{(\lambda)}(h_{1})\,p_{M,y_{2}}^{(\lambda)}(h_{1})\right]\wedge\left[p_{M,y_{1}}^{(\lambda)}(h_{2})\,p_{M,y_{2}}^{(\lambda)}(h_{2})\right]. (8)

where we let py1,y2(λ)(t)=y1,(|t|=y2)p_{y_{1},y_{2}}^{(\lambda)}(t)=\mathbb{P}_{y_{1},\bullet}(|\mathcal{I}_{t}|=y_{2}) be the transition probability of the length process without the immortal link.

The sum in (8) leads us to study the overlap between the probability distributions pM,(λ)(t):={pM,j(λ)(t)}j+p_{M,\,\cdot}^{(\lambda)}(t):=\{p_{M,j}^{(\lambda)}(t)\}_{j\in\mathbb{Z}_{+}} for t=h1,h2t=h_{1},h_{2} and M[c11λ,c21λ]M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right]. The central limit theorem is what we need. However, because of our need for a bound that is uniform in λ\lambda, we shall apply the Berry-Esséen theorem. Specifically, we apply the latter bound to the progenies of the first M1M-1 mortal links of the root sequence. The idea is to show that Ω(M)\Omega(\sqrt{M}) summands in (8) have value Ω(1/M)\Omega(1/\sqrt{M}), for each of h1h_{1} and h2h_{2} separately, and then use the last mortal link to “match” all these values between h1h_{1} and h2h_{2}.


Step 2a. Establishing a uniform bound for pM1,(λ)(t)p_{M-1,\cdot}^{(\lambda)}(t). Note that pM,(λ)(t)p_{M,\cdot}^{(\lambda)}(t) is the distribution of SM(t):=i=1MLtiS_{M}(t):=\sum_{i=1}^{M}L_{t}^{i}, where {Lti}i1\{L^{i}_{t}\}_{i\geq 1} are i.i.d. random variables having the distribution of the progeny length of a single mortal link at time t>0t>0.

Let the mean and the variance of LtiL^{i}_{t} be

β:=β(λ,t):=𝔼[Lti]andσ2:=σ2(λ,t):=𝔼|Ltiβ|2.\beta:=\beta(\lambda,t):={\mathbb{E}}[L^{i}_{t}]\quad\text{and}\quad\sigma^{2}:=\sigma^{2}(\lambda,t):={\mathbb{E}}|L^{i}_{t}-\beta|^{2}. (9)

As is expected, the distribution pM,(λ)(t)p_{M,\,\cdot}^{(\lambda)}(t) is approximately Gaussian with mean βM\beta M and variance σ2M\sigma^{2}M. We quantify this statement in the bound (11) below, after proving some moment bounds.

Lemma 1.

Let β(λ,t)\beta(\lambda,t) and σ(λ,t)\sigma(\lambda,t) be the mean and the standard deviation of LtiL^{i}_{t} defined in (9) and consider the absolute third moment ρ(λ,t):=E|Ltiβ|3\rho(\lambda,t):=E|L^{i}_{t}-\beta|^{3}. For any t(0,)t\in(0,\infty),

β(λ,t)=e(1λ)tandσ2(λ,t)=1+λ1λe(1λ)t(1e(1λ)t).\beta(\lambda,t)=e^{-(1-\lambda)t}\quad\text{and}\quad\sigma^{2}(\lambda,t)=\frac{1+\lambda}{1-\lambda}e^{-(1-\lambda)t}(1-e^{-(1-\lambda)t}). (10)

Furthermore,

0<infλ[λ,1]σ(λ,t)<supλ[λ,1]σ(λ,t)<andsupλ[λ,1]ρ(λ,t)<.0<\inf_{\lambda\in[\lambda_{*},1]}\sigma(\lambda,t)<\sup_{\lambda\in[\lambda_{*},1]}\sigma(\lambda,t)<\infty\quad\text{and}\quad\sup_{\lambda\in[\lambda_{*},1]}\rho(\lambda,t)<\infty.
Proof.

For (10), see e.g. [DR13, (3), (4)].

Moreover, from [TKF91, (8)–(10)], the probability that a normal link as nn descendants including itself is

(Lti=n)={(1η(λ,t))(1λη(t))[λη(λ,t)]n1for n1η(λ,t)for n=0,{\mathbb{P}}(L^{i}_{t}=n)=\begin{cases}(1-\eta(\lambda,t))(1-\lambda\eta(t))[\lambda\eta(\lambda,t)]^{n-1}\qquad&\text{for }n\geq 1\\ \eta(\lambda,t)\qquad&\text{for }n=0\end{cases},

where η(λ,t)=1e(1λ)t1λe(1λ)t\eta(\lambda,t)=\frac{1-e^{-(1-\lambda)t}}{1-\lambda e^{-(1-\lambda)t}}. It can be seen from L’Hospital’s rule that η(λ,t)\eta(\lambda,t) is continuous as a function of λ\lambda around 11 and that η(λ,t)=t1+t+O(|1λ|)\eta(\lambda,t)=\frac{t}{1+t}+O(|1-\lambda|) as λ1\lambda\to 1. From this explicit formula for the probability mass function of LtiL^{i}_{t}, which we note is a geometric sequence, it follows that all moments of LtiL^{i}_{t} are bounded from above uniformly in λ[λ,1]\lambda\in[\lambda_{*},1].

To show that the variance is bounded from below uniformly in λ[λ,1]\lambda\in[\lambda_{*},1], we note (again using L’Hospital’s rule) that σ2(λ,t)\sigma^{2}(\lambda,t) is continuous in λ\lambda around 11, strictly positive and tends to 2t2t as λ1\lambda\to 1. Hence the variance is bounded from below, uniformly in λ[λ,1]\lambda\in[\lambda_{*},1]

Let FM(λ)(t)F_{M}^{(\lambda)}(t) be the cumulative distribution function (CDF) of the probability distribution pM,(λ)(t)p_{M,\cdot}^{(\lambda)}(t). That is,

FM(λ)(t)(x)=j:jxpM,j(λ)(t)=(SM(t)x).F_{M}^{(\lambda)}(t)(x)=\sum_{j:j\leq x}p_{M,j}^{(\lambda)}(t)={\mathbb{P}}(S_{M}(t)\leq x).
Lemma 2 (Uniform bound for pM1,(λ)(t)p_{M-1,\cdot}^{(\lambda)}(t)).

For each t>0t>0, there exists a constant C>0C>0 such that

supλ[λ,1]supx|FM(λ)(t)(Mβ(λ,t)+xσ(λ,t)M)𝒩(x)|CM,\sup_{\lambda\in[\lambda_{*},1]}\sup_{x\in\mathbb{R}}\Big{|}F_{M}^{(\lambda)}(t)\Big{(}M\beta(\lambda,t)\,+\,x\,\sigma(\lambda,t)\,\sqrt{M}\Big{)}-\mathcal{N}(x)\Big{|}\leq\frac{C}{\sqrt{M}}, (11)

for all MZ+M\in Z_{+}, where 𝒩\mathcal{N} is the CDF of the standard normal distribution.

Proof.

Since β(λ,t),σ2(λ,t),ρ(λ,t)(0,)\beta(\lambda,t),\sigma^{2}(\lambda,t),\rho(\lambda,t)\in(0,\infty), the Berry-Esséen theorem (as stated e.g. in [Dur10]) applies and asserts that

supx|(SM1(M1)β(λ,t)σ(λ,t)M1x)𝒩(x)|3ρ(λ,t)σ3(λ,t)M1\sup_{x\in\mathbb{R}}\left|{\mathbb{P}}\left(\frac{S_{M-1}-(M-1)\beta(\lambda,t)}{\sigma(\lambda,t)\sqrt{M-1}}\leq x\right)\,-\,\mathcal{N}(x)\right|\leq\frac{3\rho(\lambda,t)}{\sigma^{3}(\lambda,t)\sqrt{M-1}}

for all λ[0,1]\lambda\in[0,1]. By Lemma 1, for each t>0t>0, the right hand side is bounded from above uniformly for λ[λ,1)\lambda\in[\lambda_{*},1). ∎


Step 2b. Controlling the overlap of pM1,(λ)(h1)p_{M-1,\cdot}^{(\lambda)}(h_{1}) and pM1,(λ)(h2)p_{M-1,\cdot}^{(\lambda)}(h_{2}) in (8). To quantify the overlap between pM1,(λ)(h1)p_{M-1,\cdot}^{(\lambda)}(h_{1}) and pM1,(λ)(h2)p_{M-1,\cdot}^{(\lambda)}(h_{2}), we first compare their expectations. From the formula of β\beta in (10), we have

β(λ,h1)β(λ,h2)(1λ)(h1h2)\beta(\lambda,h_{1})-\beta(\lambda,h_{2})\leq(1-\lambda)(h_{1}-h_{2})

and so, for M[c11λ,c21λ]M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right], the means of SM1S_{M-1} for h1h_{1} and h2h_{2} are close in the sense that

β(λ,h1)(M1)β(λ,h2)(M1)c6\beta(\lambda,h_{1})(M-1)-\beta(\lambda,h_{2})(M-1)\leq c_{6} (12)

for some c6>0c_{6}>0 not depending on λ\lambda.

Now consider the interval with length roughly the standard deviation and centered at around one of the means, β(λ,h1)(M1)\beta(\lambda,h_{1})(M-1). Then consider an equi-partition of this interval into roughly M1\sqrt{M-1} many pieces of constant length. Precisely, we write β1:=β(λ,h1)\beta_{1}:=\beta(\lambda,h_{1}) and σ1:=σ(λ,h1)\sigma_{1}:=\sigma(\lambda,h_{1}) for simplicity. Then for an arbitrary constant K>0K>0,

(β1(M1)σ1M1,β1(M1)+σ1M1)=rΛKM𝒥rM(c)\Big{(}\beta_{1}(M-1)-\sigma_{1}\sqrt{M-1},\;\beta_{1}(M-1)+\sigma_{1}\sqrt{M-1}\Big{)}=\bigcup_{r\in\Lambda^{M}_{K}}\mathcal{J}^{M}_{r}(c)

where the sub-intervals

𝒥rM(K):=(β1(M1)+rσ1K,β1(M1)+(r+1)σ1K)\mathcal{J}^{M}_{r}(K):=\big{(}\beta_{1}(M-1)+r\sigma_{1}K,\;\beta_{1}(M-1)+(r+1)\sigma_{1}K\big{)}

have constant width σ1K\sigma_{1}K for rΛKMr\in\Lambda^{M}_{K}, and

ΛKM:={σ1M1,σ1(M1K),0,σ1K,,σ1(M1K)}.\Lambda^{M}_{K}:=\left\{-\sigma_{1}\sqrt{M-1},\,-\sigma_{1}(\sqrt{M-1}-K)\,\ldots,0,\,\sigma_{1}K,\,\ldots,\sigma_{1}(\sqrt{M-1}-K)\right\}.
+\mathbb{Z}_{+}+\mathbb{Z}_{+}M1M-1
Figure 1: The solid straight line j=β1Mj=\beta_{1}M has slope β1\beta_{1} and the dotted line j=β2Mj=\beta_{2}M has slope β2\beta_{2} where βi=β(λ,hi)\beta_{i}=\beta(\lambda,h_{i}) for i=1,2i=1,2. The vertical line has length 2σ1M12\sigma_{1}\sqrt{M-1} where σ1=σ(λ,h1)\sigma_{1}=\sigma(\lambda,h_{1}) and represents the union of sub-intervals rΛKM𝒥rM(c)\bigcup_{r\in\Lambda^{M}_{K}}\mathcal{J}^{M}_{r}(c). Lemma 3 says that for each M[c11λ,c21λ]M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right], both probability measures pM1,(λ)(h1)p_{M-1,\,\cdot}^{(\lambda)}(h_{1}) and pM1,(λ)(h2)p_{M-1,\,\cdot}^{(\lambda)}(h_{2}) have mass at least c8/M1c_{8}/\sqrt{M-1} on 𝒥rM(c)\mathcal{J}^{M}_{r}(c), uniformly for all rΛKMr\in\Lambda^{M}_{K} and λ[λ,1)\lambda\in[\lambda_{*},1).

Lemma 3 below says that there exists a constants c=c7c=c_{7} and KK large enough (depending on c5c_{5} and c6c_{6} but not on λ\lambda) such that each of these intervals contains mass at least c8M1\frac{c_{8}}{\sqrt{M-1}} under both probability distributions pM1,(λ)(h1)p_{M-1,\,\cdot}^{(\lambda)}(h_{1}) and pM1,(λ)(h2)p_{M-1,\,\cdot}^{(\lambda)}(h_{2}). See Figure 1. Write pM1,A(λ)(t)=jApM1,j(λ)(t)p_{M-1,\,A}^{(\lambda)}(t)=\sum_{j\in A}p_{M-1,\,j}^{(\lambda)}(t) for simplicity.

Lemma 3.

There exist positive constants c7,c8c_{7},c_{8} such that, with 𝒥r=𝒥rM(c7)\mathcal{J}_{r}=\mathcal{J}^{M}_{r}(c_{7}) and ΛM=Λc7M\Lambda^{M}=\Lambda^{M}_{c_{7}},

pM1,𝒥r(λ)(h1)pM1,𝒥r(λ)(h2)c8M1p_{M-1,\,\mathcal{J}_{r}}^{(\lambda)}(h_{1})\wedge p_{M-1,\,\mathcal{J}_{r}}^{(\lambda)}(h_{2})\geq\frac{c_{8}}{\sqrt{M-1}}

for all rΛMr\in\Lambda^{M}, M[c11λ,c21λ]M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right] and λ[λ,1)\lambda\in[\lambda_{*},1).

Proof.

The Berry-Esséen theorem (11) implies that

suprΛM|pM1,𝒥r(λ)(h1)𝒥r~12πex22𝑑x|6ρσ3M1\sup_{r\in\Lambda^{M}}\left|p_{M-1,\,\mathcal{J}_{r}}^{(\lambda)}(h_{1})-\int_{\widetilde{\mathcal{J}_{r}}}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}\,dx\right|\leq\frac{6\rho}{\sigma^{3}\sqrt{M-1}}

for all λ[λ,1]\lambda\in[\lambda_{*},1] and M2M\geq 2, where

𝒥r~:=(rKM1,(r+1)KM1).\widetilde{\mathcal{J}_{r}}:=\left(\frac{rK}{\sqrt{M-1}},\,\frac{(r+1)K}{\sqrt{M-1}}\right).

Then {𝒥r~}rΛM\{\widetilde{\mathcal{J}_{r}}\}_{r\in\Lambda^{M}} is roughly an equi-partition of the interval (1,1)(-1,1) into 2M1K\frac{2\sqrt{M-1}}{K} sub-intervals of length KM1\frac{K}{\sqrt{M-1}}. Furthermore,

𝒥r~12πex22𝑑xKM1e1/22π.\int_{\widetilde{\mathcal{J}_{r}}}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^{2}}{2}}\,dx\geq\frac{K}{\sqrt{M-1}}\,\frac{e^{-1/2}}{\sqrt{2\pi}}.

Pick KK large enough (call it c7c_{7}), we obtain from the first display in this proof that

infrΛM,λ[λ,1)pM1,𝒥r(λ)(h1)cM1\inf_{r\in\Lambda^{M},\,\lambda\in[\lambda_{*},1)}p_{M-1,\,\mathcal{J}_{r}}^{(\lambda)}(h_{1})\geq\frac{c}{\sqrt{M-1}}

for some constant c>0c>0 that does not depend on MM. By the same argument and using (12), we have

infrΛM,λ[λ,1)pM1,𝒥r(λ)(h2)cM1\inf_{r\in\Lambda^{M},\,\lambda\in[\lambda_{*},1)}p_{M-1,\,\mathcal{J}_{r}}^{(\lambda)}(h_{2})\geq\frac{c^{\prime}}{\sqrt{M-1}}

for some constant c>0c^{\prime}>0 that does not depend on MM, even though 𝒥r\mathcal{J}_{r} is constructed using h1h_{1}. The proof is complete by taking c8=min{c,c}c_{8}=\min\{c,c^{\prime}\}. ∎


Step 2c. Matching pM,(λ)(h1)p_{M,\cdot}^{(\lambda)}(h_{1}) and pM,(λ)(h2)p_{M,\cdot}^{(\lambda)}(h_{2}) by the last mortal link. Lemma 3 establishes overlap of pM1,(λ)(h1)p_{M-1,\cdot}^{(\lambda)}(h_{1}) and pM1,(λ)(h2)p_{M-1,\cdot}^{(\lambda)}(h_{2}) over constant size intervals. The next lemma uses the final mortal link to establish overlap of pM,(λ)(h1)p_{M,\cdot}^{(\lambda)}(h_{1}) and pM,(λ)(h2)p_{M,\cdot}^{(\lambda)}(h_{2}) over specific values.

Lemma 4.

There exists a positive constant c9c_{9} such that

infjr+1𝒥r+1+pM,jr+1(λ)(h1)pM,jr+1(λ)(h2)>c8c9c7M1.\inf_{j_{r+1}^{\ast}\in\mathcal{J}_{r+1}\cap\mathbb{Z}_{+}}\,p_{M,j_{r+1}^{\ast}}^{(\lambda)}(h_{1})\wedge p_{M,j_{r+1}^{\ast}}^{(\lambda)}(h_{2})>\frac{c_{8}c_{9}}{c_{7}\sqrt{M-1}}.

for all rΛMr\in\Lambda^{M}, M[c11λ,c21λ]M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right] and λ[λ,1)\lambda\in[\lambda_{*},1).

Proof.

By Lemma 3, 𝒥r\mathcal{J}_{r} contains at least one integer, say jr(1)j_{r}^{(1)}, with mass at least c8c7M1\frac{c_{8}}{c_{7}\sqrt{M-1}} under the probability measure pM1,(λ)(h1)p_{M-1,\,\cdot}^{(\lambda)}(h_{1}). This is because there are c7c_{7} integers in 𝒥r\mathcal{J}_{r}. Similarly, there exists jr(2)j_{r}^{(2)} with mass at least c8c7M1\frac{c_{8}}{c_{7}\sqrt{M-1}} under mFM1(λ)(h2)m_{F_{M-1}^{(\lambda)}(h_{2})}. Hence

pM1,jr(1)(λ)(h1)pM1,jr(2)(λ)(h2)c8c7M1.p_{M-1,\,j_{r}^{(1)}}^{(\lambda)}(h_{1})\wedge p_{M-1,\,j_{r}^{(2)}}^{(\lambda)}(h_{2})\geq\frac{c_{8}}{c_{7}\sqrt{M-1}}.

Let jr+1j_{r+1}^{*} be an arbitrary integer in 𝒥r+1\mathcal{J}_{r+1}. The progeny of the MM-th mortal link has positive probability, say c9c_{9}, over integers in [0,2c7][0,2c_{7}], uniformly over λ[0,1]\lambda\in[0,1]. It follows that

pM,jr+1(λ)(h1)=k=0jr+1pM1,k(λ)(h1)p1,jr+1k(λ)(h1)>pM1,jr(1)(λ)(h1)p1,jr+1jr(1)(λ)(h1)c8c9c7M1p_{M,j_{r+1}^{*}}^{(\lambda)}(h_{1})=\sum_{k=0}^{j_{r+1}^{*}}p_{M-1,k}^{(\lambda)}(h_{1})\,p_{1,j_{r+1}^{*}-k}^{(\lambda)}(h_{1})>p_{M-1,j_{r}^{(1)}}^{(\lambda)}(h_{1})\,p_{1,j_{r+1}^{*}-j_{r}^{(1)}}^{(\lambda)}(h_{1})\geq\frac{c_{8}c_{9}}{c_{7}\sqrt{M-1}}

and similar for h2h_{2}. The proof is complete. ∎

Step 3. Putting everything together. Lemma 4 implies the sum in (8) is at least a positive constant, uniformly in M[c11λ,c21λ]M\in\left[\frac{c_{1}}{1-\lambda},\frac{c_{2}}{1-\lambda}\right] and λ(λ,1)\lambda\in(\lambda_{*},1), because that sum is

y+2[pM,y1(λ)(h1)pM,y2(λ)(h1)][pM,y1(λ)(h2)pM,y2(λ)(h2)]\displaystyle\,\sum_{\vec{y}\in\mathbb{Z}_{+}^{2}}\left[p_{M,y_{1}}^{(\lambda)}(h_{1})\,p_{M,y_{2}}^{(\lambda)}(h_{1})\right]\wedge\left[p_{M,y_{1}}^{(\lambda)}(h_{2})\,p_{M,y_{2}}^{(\lambda)}(h_{2})\right]
\displaystyle\geq\, y1rΛM𝒥r+1+,y2rΛM𝒥r+1+[pM,y1(λ)(h1)pM,y1(λ)(h2)][pM,y2(λ)(h1)pM,y2(λ)(h2)]\displaystyle\sum_{y_{1}\in\cup_{r\in\Lambda^{M}}\mathcal{J}_{r+1}\cap\mathbb{Z}_{+},\;y_{2}\in\cup_{r\in\Lambda^{M}}\mathcal{J}_{r+1}\cap\mathbb{Z}_{+}}\left[p_{M,y_{1}}^{(\lambda)}(h_{1})\wedge p_{M,y_{1}}^{(\lambda)}(h_{2})\right]\,\cdot\,\left[p_{M,y_{2}}^{(\lambda)}(h_{1})\wedge p_{M,y_{2}}^{(\lambda)}(h_{2})\right]
\displaystyle\geq\, (c8c9c7M1)2|{y1rΛM𝒥r+1+,y2rΛM𝒥r+1+}|\displaystyle\left(\frac{c_{8}c_{9}}{c_{7}\sqrt{M-1}}\right)^{2}\,\Big{|}\{y_{1}\in\cup_{r\in\Lambda^{M}}\mathcal{J}_{r+1}\cap\mathbb{Z}_{+},\;y_{2}\in\cup_{r\in\Lambda^{M}}\mathcal{J}_{r+1}\cap\mathbb{Z}_{+}\}\Big{|}
\displaystyle\geq\, (c8c9c7)2.\displaystyle\left(\frac{c_{8}c_{9}}{c_{7}}\right)^{2}.

The proof of (6) and hence that of Theorem 1 are complete.

Acknowledgments

SR was supported by NSF grants DMS-1614242, CCF-1740707 (TRIPODS), DMS-1902892, and DMS-1916378, as well as a Simons Fellowship and a Vilas Associates Award. BL was supported by DMS-1614242, CCF-1740707 (TRIPODS), DMS-1902892 (to SR). WTF was supported by NSF grant DMS-1614242 (to SR) and DMS-1855417.

References

  • [ARS15] Elizabeth S. Allman, John A. Rhodes, and Seth Sullivant. Statistically consistent k-mer methods for phylogenetic tree reconstruction. Journal of computational biology : a journal of computational molecular cell biology, 24 2:153–171, 2015.
  • [DR13] Constantinos Daskalakis and Sebastien Roch. Alignment-free phylogenetic reconstruction: sample complexity via a branching process analysis. Ann. Appl. Probab., 23(2):693–721, 2013.
  • [Dur10] Rick Durrett. Probability: theory and examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, fourth edition, 2010.
  • [FR20] Wai-Tong Louis Fan and Sebastien Roch. Statistically consistent and computationally efficient inference of ancestral dna sequences in the tkf91 model under dense taxon sampling. Bulletin of Mathematical Biology, 82(2):21, 2020.
  • [Hau13] Bernhard Haubold. Alignment-free phylogenetics and population genetics. Briefings in Bioinformatics, 15(3):407–418, 11 2013.
  • [Ste16] Mike Steel. Phylogeny—discrete and random processes in evolution, volume 89 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2016.
  • [Tha06] Bhalchandra D. Thatte. Invertibility of the tkf model of sequence evolution. Mathematical Biosciences, 200(1):58 – 75, 2006.
  • [TKF91] Jeffrey L. Thorne, Hirohisa Kishino, and Joseph Felsenstein. An evolutionary model for maximum likelihood alignment of dna sequences. Journal of Molecular Evolution, 33(2):114–124, Aug 1991.
  • [TKF92] Jeffrey L Thorne, Hirohisa Kishino, and Joseph Felsenstein. Inching toward reality: an improved likelihood model of sequence evolution. Journal of molecular evolution, 34(1):3–16, 1992.
  • [War17] Tandy Warnow. Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. Cambridge University Press, USA, 1st edition, 2017.
  • [YZ08] Kuan Yang and Liqing Zhang. Performance comparison between k -tuple distance and four model-based distances in phylogenetic tree reconstruction . Nucleic Acids Research, 36(5):e33–e33, 02 2008.