Expected number of induced subtrees shared by two independent copies of the terminal tree in a critical branching process
Abstract.
Consider a rooted tree of minimum degree at least, with leaf-set . A rooted tree with leaf-set is induced by in if is the lowest common ancestor subtree for , with all its degree-2 vertices suppressed. A “maximum agreement subtree” (MAST) for a pair of two trees and is a tree with a largest leaf-set such that is induced by both in and . Bryant et al. [7] and Bernstein et al. [6] proved, among other results, that for and being two independent copies of a random binary (uniform or Yule-Harding distributed) tree , the likely magnitude order of is . In this paper we prove this bound for a wide class of random rooted trees: is a terminal tree of a branching process with an offspring distribution of mean , conditioned on “total number of leaves is ”.
Key words and phrases:
Random, binary tree, asymptotics2010 Mathematics Subject Classification:
05C30, 05C80, 05C05, 34E05, 60C051. Introduction, results
Consider a rooted binary tree , with leaves (pendant vertices) labelled by elements from . We visualize this tree with the root on top and the leaves at bottom. Given , let denote the lowest common ancestor of leaves in , (.) Introduce the subtree of formed by the paths from to leaves in . Ignoring (suppressing) degree-2 vertices of this subtree (except the root itself), we obtain a rooted binary tree with leaf-set . This binary tree is called “a tree induced by in ”.
Finden and Gordon [11] and Gordon [13] introduced a notion of a “maximum agreement subtree” (MAST) for a pair of such trees and : it is a tree with a largest leaf-set such that is induced by both in and . In a pioneering paper [7], Bryant, McKenzie and Steel addressed the problem of a likely order of when and are two independent copies of a random binary tree . To quote from [7], such a problem is “relevant when comparing evolutionary trees for the same set of species that have been constructed from two quite different types of data”.
It was proved in [7] that with probability as . The proof was based on a remarkable property of the uniformly random rooted binary tree, and of few other tree models, known as “sampling consistency”, see Aldous [1]. As observed by Aldous [3], [4], sampling consistency makes this model conceptually close to a uniformly random permutation of . Combinatorially, it means that a rooted binary tree with with a leaf-set , , is induced by in exactly rooted binary trees with leaf-set , regardless of choice of . Probabilistically, the rooted binary tree induced by in is distributed uniformly on the set of all such trees. Mike Steel [23] pointed out that the sampling consistency of the rooted binary tree follows directly from a recursive process for generating all the rooted trees in which induces .
Bernstein, Ho, Long, Steel, St. John, and Sullivant [6] established a qualitatively similar upper bound for the likely size of a common induced subtree in a harder case of Yule-Harding tree, again relying on sampling consistency of this tree model. Recently Misra and Sullivant [19] were able to prove the estimate for the case when two independent binary trees with labelled leaves are obtained by selecting independently, and uniformly at random, two leaf-labelings of the same unlabelled tree. Using the classic results on the length of the longest increasing subsequence in the uniformly random permutation, the authors of [6] established a first power-law lower bound for the likely size of the common induced subtree in the case of the uniform rooted binary tree, and a lower bound , , for the Yule-Harding model. Very recently, Aldous [3] proved that a maximum agreement rooted subtree for two independent, uniform, unrooted trees is likely to have leaves, at least. It was mentioned in [3] that an upper bound could be obtained by “the first moment method (calculating the expected number of large common subtrees)”.
In this paper we show that the total number of unrooted trees with leaf-set , which contains a rooted subtree induced by , , is . It follows that a rooted binary tree induced by in the uniformly random unrooted tree on is again distributed uniformly on the set of all rooted trees. Using the asymptotic estimate from [7], we have: a maximum agreement rooted subtree for two independent copies of the uniformly random unrooted tree is likely to have at most leaves.
The proof of this result suggested that a bound might, just might, be obtained for a broad class of random rooted trees that includes the rooted binary tree, by using a probabilistic counterpart of the two-phase counting procedure. Consider a Markov branching process with a given offspring distribution . If and (critical case), then the process is almost surely extinct.
Let be the random terminal tree, and let be conditioned on the event “ has leaves”, which we label, uniformly at random, by elements of . The uniform binary rooted tree is a special case corresponding to . In general, we assume that , , and that has convergence radius , all the conditions being met by the binary case. We will show that for all , meaning that is well-defined for all .
Finally, an out-degree of a vertex in may now exceed . So we add to the definition of a tree , induced by in , a condition: the out-degree of every vertex from in is the same as its out-degree in .
Under the conditions above, we prove that a rooted binary tree with leaf-set , and edge-set , is induced by in with probability
Here (, respectively) is the probability generating function of the total number of leaves in the terminal tree (conditioned on the event “the progenitor has at least two children”, respectively.). We will check that for this formula reduces to . Note that in general, because of the factor , and , the probability of being induced by in depends not only on , but also on the whole out-degree sequence of .
We use the above identity to prove the following claim. Let
Then, for , with probability , the largest number of leaves in an induced subtree shared by two independent copies of the conditioned terminal tree is at most .
For a wide ranging exposition of combinatorial/probabilistic problems and methods in theory of phylogeny, we refer the reader to a book [22] by Steel.
2. Uniform binary trees
Consider a rooted binary tree with leaf-set . For a given , there exists a subtree with leaf-set , which is rooted at the lowest vertex common to all paths leading away from toward the root of . The vertex set of this lowest common ancestor (LCA) tree is the set of all vertices from the paths in question. Ignoring degree-2 vertices of this subtree (except the root itself), we obtain a rooted binary tree . This LCA subtree has a name “a tree induced by in ”, see [3].
Let be two independent copies of the uniformly random rooted binary tree with leaf-set . Let denote the random total number of leaf-sets of cardinality that induce the same rooted subtree in and . Bernstein et al. [6] proved that
(2.1) |
The proof was based on sampling consistency of the random tree , so that , the number of rooted trees on in which induces a given rooted tree on , is , thus dependent only on the leaf-set size.
Following [3] (see Introduction), we consider the case when a binary tree with leaf-set is unrooted. Let now be two independent copies of the uniformly random (unrooted) binary tree with leaf-set . Let denote the random total number of leaf-sets of cardinality that induce the same rooted subtree in and .
Lemma 2.1.
Let . Then
(2.2) |
Equivalently , the number of unrooted trees on , in which induces a given rooted tree with leaf set , is .
Note. So the expectation is the same as for the rooted trees on .
Proof.
For an unrooted tree with leaves, the notion of a rooted subtree induced by a leaf-set with makes sense only for , and this subtree uniquely exists for any such . Indeed a vertex adjacent to any fixed leaf is joined by a unique path to each leaf in . Any other vertex common to some paths emanating from leaves must be common to all the paths from to . It follows that there exists a unique vertex which is the LCA of the leaves. The paths from to form the subtree induced by , and is connected by an external path to , if .
Let us evaluate . Consider a generic rooted tree with leaves. For to be induced by its leaves in with leaves, it has to be obtained by ignoring degree- (non-root) vertices in the LCA subtree for leaf-set .
The outside (third) neighbors of the ignored vertices are the roots of subtrees with some leaves from the remaining leaves, selected in ways. The roots of possible trees, attached to internal points chosen from some of edges of , can be easily ordered. Introduce , the total number of ordered forests of rooted trees with leaves altogether. By Lemma 4 of Carter et al. [9] (for the count of unordered trees), we have
(2.3) |
It was indicated in [6] that (2.3) follows from
(2.4) |
(Semple and Steel [21]). For the reader’s convenience here is a sketch proof of (2.4) and (2.3). We have
for the last two steps we used Equations (2.5.10), (2.5.16) in Wilf [24].
Introduce , the total number of the ordered forests of binary trees with roots attached to internal points of ’s edges, with leaves altogether. (These leaves have to be chosen from , so . ) Since the total number of integer compositions (ordered partitions) of with positive parts is
(2.3) implies
(2.5) |
Now, is the total number of ways to expand the host subtree into a full binary subtree rooted at the lowest common ancestor of the leaves. To evaluate this sum, first denote , and write
Therefore
We conclude that
(2.6) |
Recall that leaves were chosen from . If , then attaching the single remaining leaf to the root we get a binary tree with leaf-set . If , we view the expanded subtree as a single leaf, and form an unrooted binary tree with leaves, in ways. Therefore depends on only, and with , it is given by
in the first line , and for the second line we used
Consequently
(2.7) |
∎
3. Branching Process Framework
Consider a branching process initiated by a single progenitor. This process is visualized as a growing rooted tree. The root is the progenitor, connected by edges to each of the vertices, that represent the root’s “children”, i.e. the root’s immediate offspring. Each of the children becomes the root of the corresponding (sub)tree, so that the ordered children of all these roots are the grandchildren of the progenitor. We obviously get a recursively defined process; it delivers a nested sequence of trees, which is either infinite, or terminates at a moment when none of the members of the last generation have children.
The classic Galton-Watson branching process is the case when the number of each member’s children (a) is independent of those numbers for all members from the preceding and current generations and (b) has the same distribution , . It is well-known that if and , then the process terminates with probability , Harris [15]. Let denote the terminal tree. Given a finite rooted tree, , we have
where is the out-degree of vertex . , the total population size by the extinction time, has the probability generating function (p.g.f) , , that satisfies
(3.1) |
Indeed, introducing the p.g.f. of the total cardinality of the first generations, we have
So letting , we obtain (3.1). In the same vein, consider the pair , where is the total number of leaves (zero out-degree vertices) of the terminal tree. Then denoting , , we have
(3.2) |
So, with , we get
(3.3) |
Importantly, this identity allows us to deal directly with the leaf set at the extinction moment: is the probability that has leaves. In particular, . More generally, for all . meaning that for all . Indeed, for , we have
so the claim follows by easy induction on .
If , then the branching process is a nested sequence of binary trees. The equation (3.3) yields
So the terminal tree has leaves with probability . On this event, call it , the total number of vertices is , and each of rooted binary trees with vertices is a value of the terminal tree of the same probability . Conditionally on the event , we label, uniformly at random, the leaves of by elements of and use notation for the resulting uniformly random, rooted binary tree.
This is a promising sign that we can extend what we did for the uniformly random binary trees, i.e. for , for a more general offspring distribution .
We continue to assume that . The notion of an induced subtree needs to be expanded, since an out-degree of a vertex now may exceed . Let be a tree with a leaf-set , such that every non-leaf vertex of has out-degree , at least. We say that induces in a tree provided that: (a) the LCA subtree for in is if we ignore vertices of total degree in this LCA subtree; (b) the out-degree of every other vertex in the LCA of in is the same as its out-degree in .
Let be a tree with the leaf-set , , . Let be the event: (i) some elements from are chosen as the leaves of all the complementary subtrees rooted at degree -2 vertices sprinkled on the edges of , forming a composed tree with leaves; (ii) the tree with leaves is obtained by using the remaining leaves and an extra leaf which is the root of the tree composed in step (i). The event is partitioned into disjoint events corresponding to all choices to select elements of in question.
Let be the total number of edges in . For each of these choices, on the event we must have some trees with ordered roots on some of edges, and with their respective, nonempty, leaf-set labels forming an ordered set partition of the set of leaves. The root of each of these trees has one child down the host “edge” of , and all the remaining children outside edges of . Since , the number of other children of a root is with conditional probability . So the number of leaves of the subtrees rooted at the children is with probability , where
(3.4) |
Therefore, conditionally on “ induces ”, a given set of elements of is the leaf-set of these complementary trees with probability
Explanation: is a generic total number of trees rooted at ordered internal points of some edges of ; is a generic number of leaves of a -th tree; the product of two binomial coefficients in the top line is the number of ways to pick edges of and to select an ordered, -long, composition of ; the sum is the probability that the trees have leaves in total.
With these complementary trees attached, we obtain a tree with leaves. So for to hold, we view the root of this tree (i.e. the root of ) as a leaf and complete determination of a tree with leaves by constructing an auxiliary tree with plus remaining leaves. Therefore using the definition of , we have
Here , where is the out-degree sequence of non-leaf vertices of .
Using , , and summing the last equation for , we obtain
(3.5) |
For a partial check of (3.5), let us return to . Here
Therefore
so, by (3.5), we have
(3.6) |
Since , we conclude that
for every binary tree with leaf-set , . The LHS is the probability that induces in the uniformly random binary tree .
To summarize, we proved
Lemma 3.1.
Consider the branching process with the immediate offspring distribution , such that , , and . With probability , the process eventually stops, so that a finite terminal tree is a.s. well-defined. (1) Setting , we have
where . (2) On the event , we define as the tree with leaves labelled uniformly at random by elements from . Given a rooted tree with leaf-set , set := “ holds and induces in ”. Then , where
(3.7) | ||||
and are the set of non-leaves and the number of edges of .
Note. For , turned out to be dependent only on the number of leaves of . The formula (3.7) clearly shows that, in general, this probability depends on shape of . Fortunately, this dependence is confined to a single factor , since the rest depends on two scalars, and . Importantly, these quantities are of the same order of magnitude. Indeed, if then , i.e.
so that
(3.8) |
3.1. Asymptotics
From now on we assume that the series has convergence radius .
Lemma 3.2.
Suppose that . Let , i.e. is the variance of the immediate offspring, since . Then
Proof.
According to Lemma 3.1, we need to determine an asymptotic behavior of the coefficient in the power series , where is given implicitly by the functional equation .
In 1974 Bender [5] sketched a proof of the following general claim.
Theorem 3.3.
Assume that the power series with nonnegative coefficients satisfies . Suppose that there exist and such that: (i) for some and , is analytic for and ; (ii) ; (iii) and ; (iv) if , , and , then and . Then .
The remainder term aside, that’s exactly what we claim in Lemma 3.2 for our . The proof in [5] relied on an appealing conjecture that, under the conditions (i)-(iv), is the radius of convergence for the power series for , and is the only singularity for on the circle . However, ten years later Canfield [8] found an example of meeting the four conditions in which and the radius of convergence for are not the same. Later Meir and Moon found some conditions sufficient for validity of the conjecture. Our equation is a special case of considered in [17]. For the conditions from [17] to work in our case, it would have been necessary to have for complex with , a strong condition difficult to check. (An interesting discussion of these issues can be found in an encyclopedic book by Flajolet and Sedgewick [12] and an authoritative survey by Odlyzko [20].)
Let be the convergence radius for the powers series representing ; so that , since . By implicit differentiation, we have
since . Therefore . Turn to complex . For , we have , where is analytic as a function of and subject to . Observe that is possible only if , since for we have . If then if and only if , and . Notice also that
(3.9) |
Now, is a singular point of if and only if and , i.e. if and only if for some , which is equivalent to
Combination of this condition with (3.9) yields . Therefore the set of all singular points of on the circle is the set of all such that . Notice that
So none of is an accumulation point of roots of outside the circle , i.e. is the full root set of inside the circle , for some small .
Consequently, if , then is the only singular point of on the circle . Define the argument by the condition . By the analytic implicit function theorem applied to , for every , , a small being fixed, there exists an analytic function defined on –an open disc centered at , of a radius small enough to make for all – such that for and for with . Together, these local analytic continuations determine an analytic continuation of to a function determined, and bounded, for with , .
Since is the singular point of , there is no analytic continuation of for and small. So instead we delete an interval and continue analytically into the remaining part of a disc centered at . Here is how. We have . By a “preparation” theorem due to Weierstrass, (Ebeling [10], Krantz [16]), already used by Bender [5] for the same purpose in the general setting, there exist two open discs and such that for and we have
where are analytic on , , and is analytic, non-vanishing, on . (The degree of the polynomial is exactly the order of the first non-vanishing derivative of with respect to at .) So for and , is equivalent to
where is analytic at . Plugging the power series into equation , and expanding in powers of , we can compute the coefficients . In particular,
For real and , we have . So to use as an extension we need to choose , where .
The continuations and together determine an analytic continuation of into a function which is analytic and bounded on a disc minus a cut , being chosen sufficiently small, such that
It follows that
For the second line we integrated along the limit contour : it consists of the directed circular arc , and a detour part formed by two opposite-directed line segments, one from to and another from to . By the formula 3.191(2) from Gradshteyn and Ryzik [14], we have
Therefore
Recalling that , we complete the proof of the Lemma. ∎
Theorem 3.4.
Let , where and . Then, for and , we have . Consequently, with probability , the largest number of leaves in a common induced subtree is at most .
Proof.
By Lemma 3.1,
(3.10) | ||||
Start with the factor. Observe that
The power series for both and around , which start with , have non-negative coefficients only, and so does the power series for at . Therefore the power series for around has only non-negative coefficients. So we obtain a Chernoff-type bound: for all ,
(3.11) |
As , we have
and . So the RHS is essentially of order , and attains its maximum
at , which is , since and , see (3.8). In addition, the binomial factor is at most . So, denoting ,
( for the benchmark case . ) Hence, using the top line equation in (3.10), we obtain
Since , we conclude that
(3.12) |
Recall that , where is the out-degree sequence of non-leaf vertices of a generic with leaves labelled by the elements of . The RHS in (3.12) does not depend on how the labels are assigned to the leaves. Therefore we have the following upper bound for :
the last sum is over all (finite) rooted trees with unlabelled leaves. Define the probability distribution : , , for . Then we have
Observe that . Therefore the process with the offspring distribution is almost surely extinct, implying that
A close look shows that, in fact,
So using , , we obtain then
if , . ∎
References
- [1] D. J. Aldous, Probability distributions on cladograms, Random Discrete Structures, The IMA Volumes in Mathematics and its Applications, 76, D. Aldous and R. Pemantle, Eds., Springer (1996) 1–18.
- [2] D. J. Aldous and L. Popovic, A critical branching process model for biodiversity, Adv. Appl. Prob., 37 (2005) 1094–1115.
- [3] D. J. Aldous, On the largest common subtree of random leaf-labelled binary trees, https://arxiv.org/pdf/2006.10545.pdf
- [4] D. J. Aldous, Open problems: Largest common substructures in probabilistic combinatorics, (2003) https://www.stat.berkeley.edu/-aldous/Research/OP.
- [5] E. A. Bender, Asymptotic methods in enumeration, SIAM Review, 16 (1974) 485–515.
- [6] D. I. Bernstein, L. S. T. Ho, C. Long, M. Steel, K. St. John, A. S. Sullivant, Bounds on the expected size of the maximum agreement subtree, SIAM J. Discrete Math., 29 (2015) 2065–2074.
- [7] D. Bryant, A. McKenzie, and M. Steel, The size of a maximum agreement subtree for random binary trees, in BioConsensus, DIMACS Ser. Discrete Math. Theoret. Comput. Sci.61, AMS (2003) 55–65.
- [8] E. R. Canfield, Remarks on an asymptotic method in combinatorics, J. Comb. Theory, A, 37 (1984) 348–352.
- [9] M. Carter, M. Hendy, D. Penny, L. A. Székely, and N. C. Wormald, On the distribution of lengths of evolutionary trees, SIAM J. Discrete Math., 3 (1990) 38–47.
- [10] W. Ebeling, Functions of Several Complex Variables and Their Singularities, American Mathematical Society (2007).
- [11] C. R. Finden and A. D. Gordon, Obtaining common pruned trees, J. Classification, 2 (1985) 255–276.
- [12] P. Flajolet and R. Sedgewick, Analytic Combinatorics, Cambridge University Press (2009).
- [13] A. Gordon, On the assessment and comparison of classifications, in (R. Tomassine, Ed.), Analyse de Données et Informatique, Le Chesnay, INRIA, France (1980) 149–160.
- [14] L. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products, 6th Edition, Academic Press (2000).
- [15] T. E. Harris, The Theory of Branching Processes, Springer–Verlag (1963).
- [16] S. G. Krantz and H. R. Parks, The Implicit Function Theorem: History, Theory, and Applications, Birkhäuser (2013).
- [17] A. Meir and J. W. Moon, On an asymptotic method in enumeration, J. Comb. Theory, A, 51 (1989) 77–89.
- [18] A. Meir and J. W. Moon, Erratum: “On an asymptotic method in enumeration”, J. Comb. Theory, A, 52 (1989) 163.
- [19] P. Misra and S. Sullivant, Bounds on the expected size of the maximum agreement subtree for a given tree shape, SIAM J. Discrete Math., 33 (2019) 2316–2325.
- [20] A. M. Odlyzko, Asymptotic Enumeration Methods, in Handbook of Combinatorics, II, The MIT Press (1995) 1063–1229.
- [21] C. Semple and M. Steel, Phylogenetics, Oxford University Press (2003).
- [22] M. Steel, Phylogeny: Discrete and Random Processes in Evolution, Society for Industrial and Applied Mathematics (2016).
- [23] M. Steel, Private communication.
- [24] H. S. Wilf, Generatingfunctionology, 2nd Edition, Academic Press (1994).