colorlinks = true, allcolors = blue
Efficient Reconstruction of Stochastic Pedigrees
Abstract
We introduce a new algorithm called Rec-Gen for reconstructing the genealogy or pedigree of an extant population purely from its genetic data. We justify our approach by giving a mathematical proof of the effectiveness of Rec-Gen when applied to pedigrees from an idealized generative model that replicates some of the features of real-world pedigrees. Our algorithm is iterative and provides an accurate reconstruction of a large fraction of the pedigree while having relatively low sample complexity, measured in terms of the length of the genetic sequences of the population. We propose our approach as a prototype for further investigation of the pedigree reconstruction problem toward the goal of applications to real-world examples. As such, our results have some conceptual bearing on the increasingly important issue of genomic privacy.
1 Introduction
1.1 Motivation
The decreased costs of sequencing technologies have enabled large-scale, data-driven analyses of genomes [Ins19]. Recent science and news articles feature stories only possible due to this plethora of data, such as the recent identification and capture of a high-profile criminal [KM18] predicated on DNA evidence. In this effort, an individual’s genetic information was compared to a large, curated database called GEDMatch consisting of over one million individual genomes. In comparison, there exist databases which are of several orders of magnitude larger in size such as MyHeritage (3.7 million [MyH]), 23andMe (10 million [23a]), and Ancestry (15 million [Anc]).
This raises the question: how much kinship information can be learned from DNA? Current databases already contain a considerable amount of this information. Indeed, it is estimated that a given US individual of European ancestry, on average, has a third cousin or closer who is already in the MyHeritage database [ESPC18]. However, such databases are still far from complete. This calls into question the ability to detect missing kinships based on individuals already present in the database.
This discussion also highlights the issue of genomic privacy. Indeed, it becomes much easier to identify and locate individuals by combining the genetic and genealogical information with outside information (addresses, e-mails, family photos, etc.). This potential, having already been demonstrated by the resolution of the aforementioned criminal case, was brought to attention by [ESPC18]. From this point of view, the ability to reconstruct genealogies from collected genetic data is of concern for individuals whose information is revealed, even if one has never been sequenced. Since our work establishes a positive result in a pessimistic scenario where we start with no ground truth information, we believe that our work brings to attention this critical issue via a theoretical framework.
1.2 Our contributions
Without any prior knowledge about the ground truth, can we learn everyone’s genealogy using their genetic information? In this paper, we study the inference problem of recovering ancestral kinship relationships of a population of extant (present-day) individuals, using only their genetic data. Our goal is to use this extant genetic data to recover the pedigree of the extant population, under an idealized model. A pedigree is a graph whose nodes (individuals) have edges that encode parent-sibling relationships. The topology and reconstruction of pedigrees are well-studied in bioinformatics from both a theoretical and empirical perspective, and in general the study of pedigrees poses formidable computational and statistical challenges.
In this paper, we introduce a novel recursive algorithm Rec-Gen for pedigree reconstruction. To demonstrate the effectiveness of our approach, we give a mathematical proof that for an idealized generative model on pedigrees, our algorithm is able to approximately recover the true, unknown pedigree only using the genetic data of the extant population. In terms of sample complexity, which for our purposes refers to the common gene sequence length of an extant individual, our algorithm greatly outperforms the naive reconstruction method (estimate pairwise distances between the extant individuals, then construct the pedigree that produces these distances). We propose our approach in this work as a prototype for the future study of more general pedigrees, including those involving real-life genetic data, from both a theoretical and empirical perspective. For further discussion on our model of pedigree generation, as well as its features and limitations, see Section 1.4 and Section 1.6.
1.3 Related works
A common method in theoretical evolutionary biology is to model lineages and inheritance via a family of directed acyclic graphs. One line of work is that of phylogenetics (refer to [SS03] for an overview) which uses trees to model the occurrence of large-scale speciation events in evolutionary biology. Another line of work is coalescent theory, which focuses on variable-height inheritance trees between genes as its main statistic to infer large-scale population sizes, as in e.g. [KKM+19]. In contrast, pedigrees capture small-scale individual genealogies that encode familial relationships. Specifically, most pedigree models are for human genealogies, where we designate exactly two parents to each individual. By construction, such graphs are no longer trees and warrant different strategies for inference.
[SH06] posed the formal definition of pedigrees using graph-theoretic language. In that work, the authors gave combinatorial arguments proving that one can reconstruct complete pedigrees, assuming the correct ancestral history is provided as an input for each extant individual. Our definition of pedigrees is essentially the same as the one outlined by these authors, though we make the simplification that we do not identify the vertex set bipartition (corresponding to the biological sex of the individuals).
To tie in more closely with real-world applications, one must consider the challenge of estimating these histories from data. Along these lines, [TS08] studied stochastic processes that one can associate with the pedigree, in such a way that one can prove negative results (information-theoretic impossibility) or positive results (an algorithm) for the reconstruction of the pedigree from extant data. The stochastic process used to show their positive result was based on a very specific family of Markov chains which allows for inference but is quite different from our model.
For the problem of performing pedigree reconstruction on real data, there is a wealth of literature [Tho00, KLKH11, HWH+13, Tho13, HWPE14, STH14, Hui17, Wan19]. Such studies apply heuristics that take into account various complications and phenomena observed in human genomes, such as varying levels of correlations between different sites and the presence of mutations that are not inherited from parents.
One line of work particularly relevant to this paper is [HWH+13, HWPE14] in which the authors also tackle the problem of pedigree reconstruction from real extant genetic data. Assuming answers to queries of the form, “how much DNA did and simultaneously inherit from their ancestors?”, they design a statistical test that distinguishes between siblings, half-siblings and cousins. Their method leverages this information with a maximal-clique finding algorithm to iteratively reconstruct the parents, layer-by-layer. There is no proof of correctness provided, but they provide benchmarks on real and simulated data to provide experimental justification. Our contributions have a slightly different flavor: using a similar iterative strategy but with a different statistical test (the novel part of our algorithm) and for a more optimistic set of assumptions, one can actually provably reconstruct the pedigree correctly in a sample-efficient way, in an asymptotic sense.
The authors of [HWPE14] specifically emphasize their method’s ability to reconstruct half-siblings. Technically speaking, this is not allowed in our model and therefore it may appear to the reader that there is something too restrictive or suboptimal about our analysis. One major difference between our model and the aforementioned work is that we model haploid individuals (one copy of DNA), while in reality humans are diploids (two copies of DNA). Furthermore, in our proof, we guarantee reconstruction of monogamous couples of haploid individuals – in other words, up to permutation of the two individuals within each couple. It can be observed that given a monogamous pedigree with a haploid model, one can construct a natural, non-monogamous pedigree with a diploid model such that the total variation of the extant data of the two pedigrees is zero. Therefore, we think that our results should also hold for a diploid model with minor modifications and have correctness guarantees to match the empirical results of the aforementioned work [HWPE14], for example by interpreting Fig. 1(a) as a pair of diploid half-siblings.
Our work is also closely related to the problem of phylogenetic reconstruction [ESSW99, Mos04, MR05, DMR06]. In this setting, symbols are passed from the root of a phylogenetic tree to descendants via a Markov process such as in the Cavender–Farris–Neyman model, a basic model for mutations. Similar to our inference problem in this work, in phylogenetic reconstruction, one is tasked with reconstructing the tree given only the symbols at the leaves. The main result of [ESSW99] characterizes the sample complexity—the minimal string length of the data at the leaves such that reconstruction is possible—as logarithmic in the depth of the tree, a phenomenon that our results suggest also holds for the pedigree reconstruction problem. The work [MR05] provides theoretical guarantees for the problem of learning the phylogenetic generative model (i.e., the topology of the tree as well as the transition matrices), which includes hidden Markov models as a special case, from the extant data under a spectral assumption on the transition matrices (see also later work of [HKZ12]). Most closely related to our approach in this paper is the work [Mos04], which shows how to recursively reconstruct phylogenies using techniques from the theory of broadcast processes on trees (see also [DMR06]). This approach provides inspiration for our main algorithm Rec-Gen, which uses similar techniques to recursively reconstruct pedigrees. We direct the reader to [EKPS00] and [Mos01] for studies of broadcast processes on trees with binary and large alphabet respectively, and [MMP18] for a generalization to directed acyclic graphs.
1.4 Model description and results
We now give an informal, detailed description of our framework for pedigree reconstruction, with a more detailed treatment of the generative model in Section 3. Our generative model on pedigrees consists of two parts: a parametric model for generating the network structure on the set of ancestors and extant individuals, and an inheritance procedure for transmitting genetic data from the founders, the oldest individuals in the pedigree, to the extant population.
To generate the pedigree network structure, we begin with a large founding population of size . The founders randomly mate monogamously, and each couple gives birth to a random number of children, so that the average number of offspring per couple is a constant555More precisely, each couple has a random number of children distributed as a Poisson random variable with expectation . . This procedure of random monogamous mating continues for subsequent generations, eventually yielding the extant nodes and a pedigree formed by the individuals in generations , with nodes at each level .
Next we describe how genetic data transmits from the founding population to the extant. Every individual in the pedigree has a gene sequence consisting of symbols placed in distinct blocks. Each individual in the founding population is initialized with independent uniformly random draws from a very large alphabet . Now we state how parents pass down genes to their children. In a given block, a child inherits, with equal probability, either its mother’s or its father’s symbol in the corresponding block. This procedure repeats for all couples in a given generation and then continues over subsequent generations so that genetic data is iteratively transferred through the pedigree, eventually giving rise to the gene sequences of the extant individuals.
Our main result is summarized in the following theorem. See Theorem 6.1 for a formal statement.
Theorem 1.1 (Main result, informal).
Let and denote sufficiently large absolute constants independent of , the size of the founding population. Let denote a sufficiently small absolute constant independent of . Assume that the alphabet size is very large with respect to .
Then given extant genetic data produced from the generative model with alphabet , growth rate , gene sequence length , and number of generations as described above, the algorithm Rec-Gen recovers of the true pedigree in every generation, with high probability. Moreover, this algorithm runs in polynomial time in the size of the pedigree and the number of blocks per extant individual.
Let denote the true, unknown pedigree. Our formal version of Theorem 1.1 (see Theorem 6.1) implies that with high probability Rec-Gen outputs a reconstructed pedigree whose size is at least in each generation , such that every node can be identified with exactly one node , and this identification preserves relationships in the sense that is a child of in if and only if is a child of in . In graph-theoretic terminology, our reconstruction is a (very large) induced subgraph of the truth .
We note that the stipulation that we recover 90% of the nodes at each level is actually a simplification; in fact, we can make the fraction of reconstructed nodes in each generation arbitrarily large by taking to be large enough. We refer the reader to Theorem 6.1 for details.
1.5 The Rec-Gen algorithm
The algorithm Rec-Gen consists of a recursive procedure that uses only the genetic information from the extant population to construct a good approximation for the true pedigree of depth that generated the observations. In the first phase of recursion, the algorithm reconstructs the parents of the extant nodes, which we label as the generation. In the phase, the algorithm adds a generation to the partially reconstructed version of the true pedigree given by the output of the previous phase. The algorithm terminates after phases of recursion, producing a pedigree with generations that well-approximates the true, unknown pedigree .
We next give a simplified version of our recursive procedure that serves to illustrate the main ideas. See Section 6 for a detailed description of Rec-Gen. Suppose that we have constructed a pedigree of depth , and recall that refers to the length of the gene sequence of an individual. Also recall that a couple refers to a pair of mated individuals.
Note that the first step of our recursive procedure equips each couple with an empirical gene sequence of length B where each block can contain two distinct symbols. This empirical gene sequence is constructed based on extant data and should be thought of as determining which symbols belong to at least one of the individuals from the couple in a given block. Also, we say that three gene sequences overlap in a block if all three sequences have some symbol in common in that block.
Perform the following steps to output a pedigree of depth .
-
(1)
Collect-Symbols For each couple in generation of , use the extant genetic data to recover symbols that belong to as follows.
-
–
Recover a symbol in block of if has three extant descendants descended from distinct children of that all share symbol in block .
-
–
Repeat this procedure to recover at most one other symbol for in block .
-
–
-
(2)
Test-Siblinghood For every triple of couples in generation , determine to be (mutually) ‘siblings’ if and only if at least of their recovered symbols mutually overlap.
-
(3)
Assign-Parents For every maximal collection of couples in generation such that every triple in consists of mutual siblings, construct a pair of parents in generation that have as children precisely one individual from each couple in .666We perform this step in such a way that every child is assigned at most parents.
After iterations of the above recursive procedure, we output a pedigree that gives a good approximation to the underlying pedigree that generated the extant genetic data as described in Theorem 1.1. We remark that working with triples as above greatly simplifies our analysis, as discussed in Section 2.3.
1.6 Model discussion and future directions
Our generative model imposes various constraints on the typical pedigrees that we consider. We discuss these modeling assumptions here and also consider the problem of investigating more general models that could more accurately capture properties of real-world data.
First, we consider the assumption that the size of the alphabet is very large with respect to the size of the founding population. Since a “block” represents the unit of inheritance from a parent777Using biology terminology, each block can be considered as an idealized abstraction of a collection of single-nucleotide polymorphisms (sites of variation) with high linkage disequilibrium (empirical measure of correlation) that are passed from parent to child., this implies that with very high probability all of the founders have distinct symbols in their gene sequences, and no two founders share a common symbol.888Mathematically, this can be thought of as an improper prior on a countably infinite alphabet . Our large alphabet assumption is equivalent to the assertion that the founders are unrelated.
Second, the stochastic process describing inheritance in our model has the following biological interpretation. A standard concept in population genetics refers to long-running sequence matches as being identical by descent (IBD) if they arose due to inheritance from a common ancestor [Tho13]. In contrast, the term identity by state refers to the event that two identical tracts in the genome arose by coincidence – via mutations – in two unrelated individuals. Our inheritance model contains the assertion that each block corresponds to true IBD sequences: if two individuals have the same symbol, we can always identify a common ancestor that gave rise to these symbols.
Third, we recall the hypothesis that every couple has on average children, where is a sufficiently large absolute constant independent of the size of the founding population. This ensures that, roughly speaking, every new generation is a factor larger than the previous one. Assuming roughly uniform growth of generations, it is necessary that — otherwise the population would die out and there would be no extant nodes after generations. More subtly, it is necessary that — otherwise, via standard results from the theory of branching processes (see, e.g. [KA15]) a founding node has a very low probability of passing on its symbols to the extant. In this situation, even detection of such an ancestor from extant genetic data alone is information-theoretically impossible. On the other hand, our assumption that is a large constant essentially amplifies the signal sent from a founder to the extant, and this simplifies our mathematical analysis.
Our first open question considers relaxing the previously discussed assumptions.
Question 1.
What theoretical guarantees can be established for pedigree reconstruction in the context of our generative model when is very close to ? What about when the size of the alphabet is finite? Can we analyze more generic models of inheritance where blocks are not inherited i.i.d. from parents?
A more subtle consequence of our generative model is inbreeding, a term we use to refer to the following phenomena: (1) the presence of multiple lowest common ancestors for a pair of extant nodes, and (2) the presence of mated couples such that the two individuals in the couple have a lowest common ancestor (LCA) (see Definition 3.5 for the formal definition of an LCA). The degree of inbreeding qualitatively refers to the frequency of such structures in the pedigree. Moreover, inbreeding as in (2) is mathematically equivalent to having cycles in the pedigree. In general, a higher degree of inbreeding makes the pedigree reconstruction problem more difficult and in some cases information-theoretically impossible (see Section 2.1 for detailed examples). Our choice of model allows for some degree of inbreeding, and our algorithm and analysis are carefully tailored to circumvent this obstacle.
Other assumptions inherent in our model include that the pedigree is graded, i.e., couples are formed from individuals in the same generation, and monogamous: a given individual only mates with one other individual. Furthermore, mutations — errors in the transmission of genetic data from parents to offspring — are a central component in biological applications that our current model does not incorporate.
Question 2.
What theoretical guarantees can be established for reconstruction of pedigrees in generative models with some combination of (i) a higher degree of inbreeding, (ii) mutations, (iii) non-monogamous mating, and (iv) inter-generational mating?
2 Inference challenges and techniques
In this section, we detail some of the challenges posed by the reconstruction of pedigrees constructed from our generative model as well as our techniques and analysis for handling them. To develop some intuition for our strategy, we first illustrate some of the properties of pedigrees using concrete examples.
2.1 Examples: complications from inbreeding
Recall that two individuals that share the same set of parents are siblings. If two individuals share a common subset of grandparents (but not parents), we refer to them as cousins.
First consider the pedigrees displayed in Fig. 1(a). An important statistic for determining relationships is the correlation between symbols of nodes at the same level. Consider the event that the left extant shares the same symbol as the right extant. Note that these two extant nodes are cousins sharing a single set of grandparents. The grandparents are the founders in this example, so we assign to each of them a unique symbol (). The occurrence of implies that or via the left extant receiving a symbol from its right parent; this occurs with probability . Conditioned on this occurring, the right extant block is the same as with probability , so the overall probability that both receive the same symbol is .
Compare this to the example shown in Fig. 1(b), where the two extant are cousins in two ways (siblings marrying siblings). Note that whichever symbol (out of ) that is, the right grandchild receives the same independently with probability . This is an example of a type of inbreeding where two extant nodes have more than one LCA.
The examples in Fig. 2 demonstrate how the correlation between extant nodes is boosted due to the presence of inbreeding. Note that in the generic case where extant siblings have an ancestral pedigree that is a tree, these individuals have a fraction overlap in their blocks. For comparison, let us compute the probability of coincidence for the two extant nodes in Fig. 2(a). The probability that , for example, is
Since and inherit symbols independently from their grandparents, the overall probability is
which is precisely the probability that two generic siblings inherit the same symbol.
The situation in Fig. 2(b) is even more pronounced. The two parents share the same symbol (either or ) with probability and have different symbols with probability . This means that the coincidence probability is now : their correlation between overlaps is much stronger than that of siblings in the generic case.
From the example in Fig. 2(a), we conclude that the statistical model of extant data parametrized by pedigrees is unidentifiable. Stated another way, it is information-theoretically impossible to distinguish between siblings and inbred cousins using only extant data. Thus, in order for any algorithm to succeed in reconstructing a large fraction of the pedigree using only extant data, it is necessary to bound the amount of inbreeding in the ensemble of pedigrees of interest. We accomplish this using a careful analysis of our generative model.
2.2 Informal analysis of Rec-Gen
In this section, we present a high-level analysis of the Rec-Gen algorithm. Theorem 1.1 states that Rec-Gen yields an accurate reconstruction on of nodes for typical pedigrees from our generative model999We note again that the is for simplicity of exposition, and in reality we can recover an arbitrarily large fraction of nodes. This is made precise in Theorem 6.1.. Note that a formal statement of this theorem, our main result, is given by Theorem 6.1, and a complete proof is contained in the upcoming sections.
Suppose we construct a pedigree on generations that, for simplicity of the discussion, exactly matches the true, unknown pedigree up to generation . We show that Collect-Symbols, Test-Siblings, and Assign-Parents applied to provide an accurate reconstruction of of the nodes at generation . In the remainder of this section we give a high-level argument that the output satisfies the following conditions:
-
(i)
every individual in can be identified with a unique individual in at generation ,
-
(ii)
at most of the nodes in generation of are not identified with an individual in , and
-
(iii)
if is a child of in then is a child of in .
Recall that for the purposes of reconstruction, we only have access to the genetic data of the extant.
In this discussion, we refer to three couples as (mutual) siblings if there exist individuals and such that and are mutually siblings. A clique refers to a collection of couples such that every triple from consists of mutual siblings.
The next two facts are essential to the argument.
Together, (A) and (B) imply that for of the couples in generation , our algorithm gets all of the siblings relationships between these couples correct. To see why, we can use a similar calculation as in the first example of Section 2.1 to conclude that the average overlap between the symbols of three individuals that are mutually siblings is . By concentration of binomial random variables about their means, it follows that with high probability, all triples of individuals that are mutually siblings in have at least mutual overlap between their symbols. A simple union bound combined with (A) and (B) implies that for most triples of individuals in generation that are mutually siblings, the recovered symbols from Collect-Symbols in those individuals’ corresponding couples have overlap at least . Hence, Test-Siblinghood infers correct siblinghood relationships for a majority of triples.
Moreover, our siblings test on the recovered symbols does not have any false-positives:
-
(C)
Test-Siblinghood never misclassifies non-siblings as siblings (Lemma 6.5).
The next and last key fact argues that our naive assignment of parents to individuals in cliques as in Assign Parents is in fact the correct assignment in a typical pedigree. This property holds with very high probability over our generative model.
-
(D)
Let denote a clique at generation in the true pedigree. Then there exists a couple , which we refer to as the parents of , in generation of that has exactly one child in every couple of , and no other couple has more than child in (Lemma 4.8).
Together, (A), (B), (C), and (D) imply that our reconstruction criteria (i), (ii), and (iii) from the beginning of this section hold, as we now justify. Recall that we already showed (A) and (B) imply that we classify a large fraction of the couples at generation correctly as siblings. Moreover, part (C) and the transitivity of siblinghood in imply that cliques in our reconstruction really correspond to cliques in the truth. By part (D) such cliques have unique parents. Thus, for (i), we identify newly constructed couples with the unique parents of the clique formed by the children of , further pairing the two individuals in with those in arbitrarily. With this identification, (iii) follows immediately. To show part (ii), later in the paper we give a sufficient condition for a couple at generation to have of its symbols collected by Collect-Symbols as in (B) (see Lemma 5.9). Then we show that of individuals in generation have children in such couples (see Proposition 5.8), which proves part (ii). Essentially, this sufficient condition amounts to saying that a couple at generation has no inbreeding (cycles) above or below it (i.e. among its ancestors or descendants, respectively) and that the pedigree of descendants of contains a -ary tree (see Definition 5.4).
2.3 Motivation for using triples
It is tempting to employ a seemingly simpler recursive scheme than the one described in Section 1.5 that operates on pairs instead of triples. As an example, consider an alternative recursive procedure such that:
-
1.
Collect-Symbols only uses pairs of extant descendants to recover symbols of a couple
-
2.
Test-Siblinghood considers only pairs of couples at generation and detects them to be siblings if their strings overlap by at least , and
-
3.
Assign-Parents assigns parents to individuals in maximal collections such that every pair of couples is (tested as) siblings.
Unfortunately, this simpler approach encounters two major technical complications.
First, working with a pairwise siblings test introduces a problem for the step of assigning parents. Define a pairwise clique to be a collection of couples so that every pair of couples passes the pairwise siblings test. With high probability, it turns out in every generation there exist a constant number of pairwise cliques that are not explained in the naive way of assigning to this clique parents that have precisely one child per couple. In particular, in the true pedigree it is possible to have three couples that mutually pass the pairwise siblings test, yet there are three distinct parent couples each having precisely two children among these three couples. See Fig. 3 for an illustration. This type of structure, though rare, occurs a constant number of times in each generation, and thus introduces inherent errors in our reconstruction that accumulate at every step of iteration.
A second problem caused by working with pairs arises in the step of collecting symbols. The pairwise version of our algorithm assigns a symbol to a couple if that symbol occurs in two extant descendants that are descended from distinct children of that couple. In our generative model, it turns out that with high probability there are a logarithmic number of pairs of extant nodes that have at least two LCA’s. For such pairs, the pairwise algorithm does not accurately assign symbols to their reconstructed ancestors. Similar to the previous issue, these errors snowball and make the analysis for proving Theorem 1.1 very difficult.
On the other hand, working with an algorithm using triples as described in Section 1.5 makes for a much cleaner analysis and nicer reconstruction guarantee. This innovation circumvents the technical complications of the pairwise version because every clique (recall that this is a collection of couples where every triple consists of mutual siblings) can be explained in a naive way (Lemma 4.8), and in our generative model every triple of extant individuals descended from distinct children of a given ancestor have that ancestor as their unique LCA with very high probability (Lemma 4.10).
2.4 Outline of technical arguments
The remainder of the paper, which provides a formal proof of Theorem 1.1, is divided into four parts.
-
•
Section 3 provides preliminary definitions and a formal definition of our generative model.
-
•
Section 4 proves important properties about the typical network structure of pedigrees from our generative model.
-
•
Definition 5.2 proves important properties about the block statistics of the extant nodes in a typical pedigree from our generative model.
-
•
Section 6 gives a precise description of Rec-Gen and provides a formal statement and proof of Theorem 1.1.
Specifically, in Section 4 we rigorously quantify the degree of inbreeding in typical pedigrees from our model by counting the number of collisions (see Definition 4.1 and Lemma 4.4). This has several useful consequences, including that every clique has a unique parent (fact (D) from Section 2.2, also see Lemma 4.8) and that the extant individuals used in Collect-Symbols have a unique LCA (see Lemma 4.10). In particular, the latter is key to showing fact (A) from Section 2.2.
In Definition 5.2, we provide a definition (see Definition 5.4) that essentially characterizes the individuals in that are reconstructible via Rec-Gen. We show that couples involving such individuals, referred to as awesome couples, transmit many of their symbols to the extant, with high probability (see Lemma 5.9). In particular, awesome couples have at least of their symbols recovered by Collect-Symbols (fact (B) from Section 2.2). We also prove an important result for our siblings test: triples of individuals that are not mutually siblings have mutually overlap at most (see Lemma 5.2). This combined with fact (A) from Section 2.2 essentially shows that Test-Siblinghood never classifies non-siblings as siblings (fact (C) from Section 2.2, see also Lemma 6.5).
Our final section, Section 6 ties everything together, following fairly closely the high-level argument presented in Section 2.2 to prove the formal version of Theorem 1.1.
3 Preliminaries
3.1 Key definitions and terms
Definition 3.1.
A pedigree is a directed acyclic graph (DAG) with vertices and edges where every vertex has indegree at most . The collection of vertices of indegree zero are referred to as the founders, and the collection of vertices of outdegree zero are referred to as the extant.
Definition 3.2.
If the indegree of each vertex in the underlying DAG is either or , then is called a complete pedigree.
In this work, we focus on a special family of complete pedigrees that are both graded and monogamous.
Definition 3.3.
is said to be graded if the vertices can be partitioned into such that are the founders, are the extant, and all directed paths from to can be written as a sequence of edges where and for each . The founders’ index is the depth of the pedigree.
is said to be monogamous if for every vertex of outdegree , there exists a unique vertex such that . The unordered pair is referred to as a couple.
We assume that every non-extant individual in the pedigree is in a couple, and so the number of vertices at each non-extant level is even. This assumption is effectively without loss of generality—if an individual is not in a couple, then it has no descendants, and so we cannot recover information about this individual or even its existence.
An example of a complete, graded, monogamous pedigree is shown in Fig. 1(a). In our model, symbols are passed down from parents to children in a completely symmetric way. Thus, given the data of the children, it is impossible to distinguish the owner of each symbol from amongst the two parents. The goal of this paper is to show how one can provably infer the structure of a complete pedigree from extant genetic data via the reconstruction of the ancestral symbols, modulo block phasing (determining which symbol belongs to which parent for each block). Therefore, we introduce the following version of a pedigree which condenses this information.
Definition 3.4.
A coupled pedigree induced by a complete, monogamous pedigree is defined as follows:
-
•
is obtained by merging couples into a single node (extant individuals remain singletons), introducing edge multiplicity.
-
•
is the result of halving the number of resulting copies of each edge after merging couples.
In particular, a coupled pedigree is also a pedigree. Examples are drawn in Fig. 4 in relation to Fig. 1, where the complete pedigree 1(a) induces a coupled pedigree 4(a) and 1(b) induces 4(b).
The only information that is lost after transforming a complete, monogamous pedigree into a coupled pedigree is the block phasing. Indeed, observe that given the coupled structure , one can easily obtain the individual structure up to block phasing as follows: (1) add the extant individuals in to , (2) for every non-extant node add individuals to , and (3) given parents and of in , add the four edges to . In addition, if is graded, retains a graded structure so that are the extant nodes and are depth-graded couple nodes. In particular, the graph structure of an individuals pedigree uniquely determines the graph structure of its associated coupled pedigree and vice versa.
Given the previous discussion, since our goal is to recover the graph structure of an underlying true pedigree given gene sequences of a large number of extant individuals, it suffices to reconstruct the associated coupled pedigree .
Furthermore, since the graph underlying a pedigree is a DAG, given a subset of the pedigree, it is natural to consider the notion of “ancestors” (nodes from which there is a directed path to ) and “descendants” (nodes to which there is a directed path from ). Also for simplicity, we stipulate that every node is both a descendant and an ancestor of itself, i.e., and . Since the indegree of each node can be more than one, it is possible for two nodes to have more than one “lowest common ancestor”. We define this now.
Definition 3.5 (Lowest Common Ancestors).
Let denote a set of nodes in a pedigree . The set of lowest common ancestors of , denoted , consists of all nodes such that is an ancestor of every node in , and moreover, no descendant of is an ancestor of every node in .
During our analysis, we often restrict our attention to the information that the pedigree contains about the ancestors or descendants of a particular collection of nodes. In particular, we want to exploit (sub)structures that are not too intertwined. The following definitions make these ideas precise:
Definition 3.6 (Subpedigrees).
Let denote a subset of nodes of pedigree . The subgraph of induced by is itself a pedigree, which we call the subpedigree of induced by .
Definition 3.7 (Ancestral pedigrees).
Let denote a subset of vertices at level of a graded pedigree . The subpedigree induced by is the (level ) ancestral subpedigree of induced by .
Definition 3.8 (Descendant pedigrees).
Let denote a subset of vertices at level of a graded pedigree . The subpedigree induced by is the (level ) descendant subpedigree of induced by .
Definition 3.9 (Tree pedigrees).
A pedigree that has no undirected cycles (when the directions of the edges in are ignored) is called a tree pedigree.
Note that coupled pedigrees can have edges of multiplicity two, though only in the case where two siblings form a coupled node, which a rare structure in our generative model. In coupled pedigrees, we consider a double edge to be an undirected cycle of length two. Hence, a tree pedigree consists entirely of simple or multiplicity 1 edges.
As we demonstrate (e.g. Lemma 5.2), coupled tree pedigrees exhibit a type of correlation decay between blocks that enable us to perform inference on the structure. In contrast, non-tree coupled pedigrees correspond to pedigrees with inbreeding, which can arise in nature and appear in our probabilistic model as well. Section 2.1 illustrates examples of such structures. These types of structures introduce challenges for performing inference under our generative model.
3.2 Siblings in a pedigree
Note that siblinghood is a transitive relationship: if are siblings and are siblings, then so are . As alluded to in Section 2.3, it is important to look at these relationships in triplets. We now detail how one can encode this information as a 3-uniform hypergraph.
Definition 3.10.
A 3-uniform hypergraph is a pair of vertices and a multiset of edges, so that each edge is an unordered triple of vertices in .
Definition 3.11.
Let be a coupled pedigree of depth (each non-extant node is a set of a pair of individuals). The siblinghood hypergraph of at level is the 3-uniform hypergraph that describes the three-way sibling relationships of its level- members. For every triple , the edge multiplicity is
The siblinghood hypergraph is defined similarly, by considering each extant individual as a degenerate (cardinality 1) couple and applying the above definition (Each hyperedge appears zero or once, never twice).
Recall that a clique in a 3-uniform hypergraph is a collection of vertices such that all possible triplets form an edge. The next statement is an observation that follows from the definition of and the transitivity of siblinghood.
Proposition 3.1.
If are level- couples that respectively contain individuals which are siblings, then form a clique in .
3.3 Probability Tools
We denote a Poisson distribution with mean as . We use some basic tools from probability theory in our proof. The first is referred to in literature as Poisson thinning, see e.g. [Lal16].
Proposition 3.2 (Poisson Thinning).
Let , and let be iid Ber() random variables that are independent of . Then is -distributed.
Second, we recall that sums of Poisson random variables are themselves Poissons:
Proposition 3.3.
Fix and let be iid random variables. Then is -distributed.
Third, we will invoke the following standard variants of the Chernoff–Hoeffding bounds for sums of Bernoulli random variables:
Theorem 3.4 (Bernoulli tail probability (Chernoff–Hoeffding Bounds)).
Let be a sum of independent Bernoulli random variables. Let . Then
Lastly and in the same spirit as the Chernoff–Hoeffding bound, we will use the fact that Poisson distributions also have sub-exponential tails.
Proposition 3.5 (Poisson tail probability).
Let . Then for any , we have
For a proof, refer to Chapter 2 of [Pol15].
4 Structure of Poisson Pedigrees
4.1 Model Description
We now describe our simple model for generating a population and its genetic data. The model is best viewed in two stages. In the first stage, we generate the population as well as the pedigree topology on these individuals, and in the second stage, we generate the genetic data given this pedigree structure. Note that the random individual pedigree constructed below is graded, monogamous, and complete.
Part I: Pedigree topology
-
1.
To generate , start with founding individuals in and make an arbitrary maximum matching of these individuals to create a set of mated couples. For each couple, generate an independent number of children, where is a fixed parameter throughout the entire pedigree. These newly generated individuals form the nodes in .
-
2.
Repeat the above process to generate the individuals in .
Once we have the population and pedigree structure as above, we generate the genetic data in the following manner.
Part II: Inheritance procedure
-
1.
Each individual in has a length- string (’s gene sequence). The string’s indices are referred to as blocks.
-
2.
For each founding individual in and for each block , each is drawn i.i.d. uniformly from an alphabet . For our model, is an infinite-sized alphabet: we simply require that each block of each founder has a unique symbol.
-
3.
Every other individual in the population has exactly two parents and . Conditioned on and , independently over , the th block of copies with probability and with probability .
Remark 1.
We adopt the following conventions in the remainder of the paper.
-
1.
We let denote the coupled pedigree induced (see Definition 3.4) by the randomly generated individual pedigree constructed in Part I above.
-
2.
We use the term coupled node, or simply node when the context is clear, to refer to a vertex of . We use the term individual to refer to an element of contained in a coupled node of . Unless otherwise explicitly noted, parent-child relationships are taken according to the structure of the coupled pedigree . That is, given we use the phrase, “ is a child of ,” to mean that the couple contains an individual who is an offspring of the mated couple . Finally, we say that coupled nodes are siblings if and contain individuals who are siblings in .
-
3.
denotes the probability measure over the randomly generated pedigree as well as the random inheritance procedure.
To given an example of our terminology, there are two individuals in a non-extant coupled node. Each individual is a vertex of , and together they form a coupled node, which is a vertex of . Note that as an artifact of our definitions, extant individuals are both coupled nodes and individuals in . Moreover extant nodes have exactly one parent in given by the coupled node containing the individuals comprising that extant individuals biological parents, as determined by our generative model.
To further emphasize the previous remark, recall that by the discussion in Section 3.1, there is a unique correspondence between coupled pedigrees and individual pedigrees. Hence, it suffices to give a (partial) reconstruction of to (partially) reconstruct the original individual pedigree . Thus the content of our main result Theorem 6.1 and the remainder of this paper primarily work with the coupled pedigree .
Parameters:
For convenience, we collect the various parameters of interest here.
Parameter | Description | Value |
---|---|---|
Size of founding population | ||
Number of blocks for each individual | ||
Expected # of children per couple | ||
Number of generations in population | , | |
Size of block alphabet |
We set for a sufficiently large constant. The expected number of children per couple, , will be set to a sufficiently large constant that is at least 3. Finally, the number of generations will be set to , where is sufficiently small with respect to .
4.2 Concentration bounds and upper bounds on inbreeding
In this section we quantify the degree of inbreeding in . To do so, we first describe an alternative description of our generative model. An equivalent procedure for constructing the coupled pedigree structure is to (1) sample the generation sizes according to Poisson random variables with appropriate parameters, (2) pair up individuals in each generation at random into coupled nodes, and (3) have coupled nodes choose two parent coupled nodes at random from the previous generation. This is described formally below.
Lemma 4.1.
The (coupled) pedigree described in Section 4.1 can be equivalently viewed as follows:
-
1.
Let be the size of the founding population. For from to : Let be the number of individuals in couples, and sample .
-
2.
For each level , match the individuals at level randomly, leaving out a single individual if was odd.
-
3.
For each level , sample a vector from a Multinomial distribution with parameters
For any , the set of coordinates are interpreted as children of the couple at level (and are therefore siblings at level ).
-
4.
Convert the resulting pedigree on individuals from steps 1–3 to a coupled pedigree .
Proof.
The number of vertices at each level in the statement of Lemma 4.1 is the same as the model in Section 4.1. This follows by induction. The number of founding vertices is the same in both models. In the model in Section 4.1, the number of individuals at level is distributed as , where the are iid and is the number of individuals at level that are matched. The value of this sum is distributed as (due to Proposition 3.3), the same as in the statement Lemma 4.1.
The random matching in Step 2 of Lemma 4.1 is the same as the matching in Section 4.1.
The final step in the process above assigns individuals in to parents in by sampling a vector of length with entries in from a multinomial distribution and assigning individuals to parents based on these labels. Indeed, if we look at the number of children of a fixed couple (say, the couple in ), this is distributed as , where . By Poisson thinning (Proposition 3.2), this distribution is simply , which is exactly the distribution of the number of children of the couple in Section 4.1. ∎
Next we use tail bounds on Poisson random variables to show that the sizes of each level are well-concentrated with high probability, assuming a sufficiently large size of the initial population. Recall that denotes the number of individuals in generation .
Lemma 4.2 (Concentration of generations).
Fix such that , and suppose that the founding population size is at least . Then, for some constant , with probability at least we have that, for all
(1) |
Remark 2.
An immediate corollary of this result is that
(2) |
for each with high probability.
Proof of Lemma 4.2.
Our goal is to upper bound the right-hand-side of
and so it suffices to show
Consider fixing the number of individuals at level to be an arbitrary number satisfying Eq. 2. We know that the number of individuals at level is distributed as . By applying the Poisson tail bound Proposition 3.5, we see that
(3) | |||
(4) |
We now claim that implies that , which follows from the facts that and that (Eq. 2). Namely, assume that . Then , since . Now assume instead that . Then
where in the last line we use the fact that .
Remark 3 (Dependence on ).
The strategy from this point onwards is to condition on the event from Eq. 1. Since this event fails with probability that is exponentially small in , we lose only an additive probability.
As mentioned in Section 2.1, two nodes may have significantly higher amounts of symbol overlap caused by inbreeding in their ancestral pedigree than would be expected given their distance in the pedigree. This can cause us to reconstruct an incorrect pedigree if we attempt to explain the symbol overlap without accounting for inbreeding; for instance, we may see two nodes and think they are siblings, when in reality they are cousins with inbreeding in their family tree (see Section 2.1 for a detailed example). To formally connect different patterns of inbreeding with the amount of spurious symbol overlap they cause, we introduce the notion of collisions in an ancestral pedigree. Roughly speaking, triples of coupled nodes with relatively few collisions in their ancestral pedigree do not have many spurious overlaps, which we prove in Definition 5.2. We first define collisions and then bound the number that occur under our probabilistic assumptions in Lemma 4.4. We also give an alternative characterization of collisions in Lemma 4.3 that is useful later.
Definition 4.1 (Collisions).
Let denote a coupled pedigree. Fix a subset of nodes , where . If , we say that this collection has collisions at level if the set of parents of in has size . If , we say that it has collisions at level if the set of parents in has size . Write
Extend the notion of collisions to ancestral subgraphs as follows. If we have nodes , the number of collisions between the ancestral subpedigrees for is equal to
where denotes the set of ancestors levels above .
Lemma 4.3 (Ancestral collisions, alternate characterization).
Let denote a set of nodes that are all at the same level. Consider the subpedigree . Let denote the number of nodes in that have outdegree in the subpedigree . Then
Proof.
Let denote a set of nodes at level . Let denote the set of parents of that have outdegree in the subpedigree . Let denote the number of collisions that has at level . Then we claim that
(5) |
This is true by induction on the cardinality of , as we now demonstrate. We prove this assuming that is a set of non-extant coupled nodes; the case for extant nodes is extremely similar. The base case follows because the unique node either has two distinct parents, in which case there are no collisions and each has outdegree , or has a single parent, in which case the number of collisions is and the parent has outdegree . In both cases Eq. 5 holds.
For the inductive step, suppose that Eq. 5 is valid for all with . Now consider with . Choose an arbitrary and consider . Observe that by Definition 4.1 and induction:
Therefore, if has parents contained in , then
because each parent of contained in increases the degree of some node in by .
Applying this argument over all levels to the sets , we see by Definition 4.1 and summing over all levels that Lemma 4.3 holds for coupled nodes. ∎
In our model and in light of Lemma 4.1, a collision between sets and intuitively corresponds to a node in “choosing” a parent couple that was already chosen by another node in . This observation lets us bound the number of collisions between the ancestors of 3 nodes with high probability.
Lemma 4.4 (Exponential tail of collisions).
Fix three nodes in the same level , and let be a positive integer. Then
(6) |
Proof.
We show that the probability on the left-hand-side of Eq. 6 can be upper bounded by the probability that a binomial random variable with sufficiently small mean is at least , from which the result follows.
We assume that each level has at least individuals. This is a high probability event by Lemma 4.2 (which actually describes a much stronger situation). Since we just want an upper bound, we condition such an event and this assumption is made without loss of generality.
Let . Note that , regardless of how many collisions have happened underneath it. The distribution of is equal to a sum of at most Bernoulli random variables, two for each node in , which are indicator random variables that a parent coupled node selected by some node in is the same as a parent coupled node previously selected by (Lemma 4.1). Furthermore, each of these indicator random variables is 1 with probability at most , even conditioned on the previously set random variables—indeed, there are only parents selected in total, so there are only this many nodes that can be selected from to cause a collision, and there are at least coupled nodes at level . Therefore, the random variable is stochastically dominated by . Let . Then we get that
(7) |
where . By bounding the binomial tail and noting that we take , (Eq. 7) can be bounded by
∎
In particular, by union bounding over all triples of nodes in the coupled pedigree , we get the following corollary. Note that there are most nodes in the pedigree when we condition on the high-probability event from Lemma 4.2.
Corollary 4.5.
Since we take the ratio to be sufficiently small (Section 4.1), the probability of the above event is negligible. Hence, we can assume without loss of generality for the rest of the document that the number of collisions in the ancestral trees of any three nodes is at most 3.
Additionally, by applying Lemma 4.4 to a single node (repeated three times) and applying linearity of expectation, we can bound the probability that there are many coupled nodes with collisions in their ancestral pedigrees using Markov’s inequality. We state this as a corollary.
Corollary 4.6.
For any ,
as long as is sufficiently large.
Definition 4.2 (-Richness).
Fix a pedigree , and let be an integer. All extant nodes in are -rich. For all , a level -node is -rich if it has at least children that are -rich.
Lemma 4.7 (Most nodes are -rich).
Fix a constant , and let as in Lemma 4.2. As long as and are sufficiently large, there exists a constant such that with probability , at least fraction of level- coupled nodes in are -rich for all .
Proof of Lemma 4.7.
Let the term “-poor node” refer to coupled nodes that are not -rich. Let denote the number of coupled nodes at level in . Our goal is to prove an upper bound on the event that there are at least -poor nodes at level , conditioned on the event that there are at least -rich nodes at level .
Let denote the event that there are at least -rich nodes at level . Let denote the event for all , which occurs with probability by Lemma 4.2. We also condition on the sizes of , abbreviating this conditioning as .
Let be an arbitrary subset of nodes at level of size , and consider the event where only consists of -poor nodes. This implies that the number of -rich children of is at most . Let be iid Bernoulli RVs, which represent indicators for the event where the th -rich child chooses at least one of its parents to be in . Note that .
(Chernoff–Hoeffding Bound) |
Observe that there are many choices for . To apply a union bound, it suffices for to be large enough so that looks linear in . In that case, we obtain a bound of the form
Therefore, we may write
for an appropriate constant depending only on and .
∎
Lemma 4.8 (Cliques have unique parents).
Let denote the siblinghood hypergraph at level . Let be as in Lemma 4.2. For a constant , with probability at least , for all hypercliques with at least one hyperedge, there is a unique node at level that is a parent of every node in . We refer this node as the parent of .
Proof.
By Proposition 3.1, a hyperclique corresponds to a set of coupled nodes that contain a set of mutual siblings, where each couple has at least one of the siblings in it. This establishes that there is a coupled node at level that is at least one parent of every node in . In the case where is a hyperclique of extant nodes, we are done: every node in is an individual and has exactly one parent coupled node.
If is at a higher level, note that there can be at most two parents for , as defined above. The reason is that any individual has exactly one parent couple, and since there are only two individuals in a couple, there cannot be three parent couples each with one child in each couple in .
Next we show that if there are two coupled nodes, both of which are parents of , then there must be many collisions among the ancestors of , and therefore we can rule this out as a low-probability event. Since has at least one hyperedge, we know that . This means that any arbitrary set of three nodes from must have at least collisions by Definition 4.1—but Corollary 4.5 shows that with probability , this does not occur anywhere in the pedigree. ∎
Lemma 4.9 (Disjointness of maximal cliques).
Let denote the siblinghood hypergraph at level . For , each extant node is contained in a unique maximal clique, and moreover, the maximal cliques in are vertex disjoint (and thus, also edge-disjoint). For , each node is contained in at most two maximal cliques. Moreover, with probability , the maximal cliques in are edge-disjoint.
Proof.
Note that maximal cliques in the siblinghood hypergraph correspond to maximal sets of siblings. The claim for extant nodes is relatively trivial - extants are individuals, and so the maximal sets of siblings partition the set of extant nodes.
For , since each individual in a coupled node has one pair of parents, a coupled node can have at most two parents. Thus it can be part of at most two sets of siblings. Hence, it is part of at most two maximal cliques.
Finally, we need to establish that the maximal cliques in are edge-disjoint. To do this, it suffices to show that the intersection between any two maximal cliques is less than 3, so there can be no hyper-edge. Indeed, if three nodes that are simultaneously in two maximal cliques, these three nodes would themselves form a clique with two different parents in level , which occurs with probability at most by Lemma 4.8. ∎
4.3 The joint LCA and its uniqueness
The next two lemmas are crucial in Section 6 to show that we can accurately collect symbols for accurately reconstructed coupled nodes. Here we define the joint lowest common ancestor, which is a special type of LCA for a triple of coupled nodes.
Definition 4.3.
Let denote coupled nodes in . We say that have a joint LCA if it holds that and there exist distinct children of so that for all , is an ancestor of .
Lemma 4.10 (Joint LCA is unique).
Suppose that each triple of coupled nodes in has at most 3 collisions. Further suppose that have a joint LCA . Then is the unique LCA of .
Proof.
For the sake of contradiction, suppose that have another LCA . By the definition of LCA, is neither an ancestor nor a descendant of .
If is a joint LCA of , then both and have outdegree in , which by Lemma 4.3 implies that has at least collisions.
If is not a joint LCA, then has outdegree in . Moreover, there exists a unique lowest node that is an ancestor of precisely two nodes in . In particular, has outdegree at least in . Observe that the nodes are all distinct. Hence by Lemma 4.3, the number of collisions is at least .
In either case, has at least collisions, which is a contradiction. ∎
Lemma 4.11 (Inheritance paths go through LCA).
Suppose that each triple of coupled nodes in has at most 3 collisions. Further suppose that have an LCA . Let denote a strict ancestor of . Then for some , all paths from to in pass through .
Proof.
To draw a contradiction, suppose that for all that has a path to that does not go through . Suppose further, without loss of generality, that is the lowest node in that is an ancestor of and has this property.
Let denote a spanning tree on (red edges in Fig. 5). Also select a spanning tree on the union of all paths from to that do not go through (blue edges in Fig. 5). Observe that has outdegree at least in . Since also has a path to , then has outdegree at least in . Moreover, has collisions. Since is not contained in , we conclude by Lemma 4.3 that has at least collisions. The first terms accounts for the collisions in , and the second applies Lemma 4.3 to . This is a contradiction. ∎
Note that by Corollary 4.5, Lemmas 4.10 and 4.11 hold for all triples with high probability.
5 Lemmas that enable reconstruction
In this section, we prove bounds on “overlap statistics” previously explored in Section 2. Since we now have switched to talking about coupled pedigrees, we re-define its notion now.
Definition 5.1 (Diploid blocks).
Let induce the coupled pedigree . Given (haploid) gene sequences , we associate with each non-extant couple node a diploid sequence defined in terms of each block as a multiset . Each extant node’s block is thought of as a singleton set.
Definition 5.2 (Diploid overlap).
Three diploid sequences overlap in block if
The term fraction of mutual overlaps between coupled nodes in refers to the statistic
5.1 Distinguishing siblings from non-siblings: Coincidence probability bounds
In this section, we establish the following high-probability separation condition for triples of coupled nodes at the same level:
-
•
if are mutually siblings, they overlap in at least fraction of blocks.
-
•
if are not mutually siblings, they overlap in at most fraction of blocks.
In order to reconstruct the pedigree, we perform inference on the underlying pedigree structure from the symbols at the extant level. The key step of our reconstruction algorithm is to infer which triples of nodes are mutually siblings based on the overlap between their reconstructed symbols. The conditions stated above justify using the number of overlapping symbols in triples as a statistic for determining siblinghood. The first fact (Lemma 5.1) is easy to prove. In contrast, the second fact (Lemma 5.2) is rather non-trivial; we prove it using casework.
Lemma 5.1 (Symbol overlap in siblings).
With probability , the fraction of mutual overlap in symbols between any triple of coupled nodes , that are mutually siblings is at least for any arbitrarily small .
Proof.
It suffices to consider the overlap of the individuals in , respectively, that are siblings, i.e., have a common parent in . We claim that the expected fraction of overlap for is at least 1/4. Indeed, any individual symbol at the parent (couple) node survives to all three children with probability , and there are symbols at the parent (one per block per member of the couple). The Chernoff–Hoeffding bound gives that for any fixed triple of siblings, the probability that it has less than mutual overlap is at most . To be explicit, let denote the indicator of an overlap between in block .
In the second line we use that are i.i.d., in the third line we use that the expectation is at least , and to finish we apply Chernoff–Hoeffding. A union bound over all triples of siblings yields the result. ∎
Lemma 5.2 (Symbol overlap in non-siblings).
Fix . With probability , every triple of coupled nodes , , and that are at the same level but are not mutual siblings share overlap in less than fraction of their symbols.
5.1.1 Proof of Lemma 5.2
Remark 4.
In this proof, we condition on the high probability event from Corollary 4.5 that all triples of coupled nodes have at most collisions in their ancestral subpedigree .
It is clear that if are completely unrelated, then their mutual overlap is zero, since we assume an infinite alphabet. If have a common ancestor, then typically their ancestral pedigree has two collisions, and all triples have at most three collisions in their ancestral pedigree by our conditioning in Remark 4. We refer to triples with three collisions as being inbred and think of the extra collision as the site of inbreeding, a notion that we later formalize in this section.
Recall the definition of tree subpedigree (Definition 3.9), which we refer to simply as a tree in what follows. Also recall that an edge of multiplicity 2 in a pedigree is considered to be an undirected cycle of length 2. Thus, a tree subpedigree consists only of simple (multiplicity 1) edges. Our strategy for proving Lemma 5.2 follows the recipe below for casework.
-
1.
have exactly two LCAs, and the ancestral pedigree of is a tree.
-
2.
have exactly one LCA, and the LCA has a cycle above it.
-
3.
have exactly one LCA, and the ancestral pedigree of is a tree.
-
4.
have exactly one LCA, and the ancestral pedigree of contains a cycle that is not completely above the LCA.
We now assert that the above cases cover all possibilities; this is proven in the next two claims.
Claim 1.
For , , and to have a single LCA, their ancestors must have at least 2 collisions.
Proof.
All three nodes need a common ancestor, which means there are at least 2 collisions are present in . ∎
Claim 2.
The nodes , , and have at most two LCAs, with two LCAs only if has three collisions. Furthermore, if there are two LCAs, then is a tree pedigree.
Proof.
By the previous claim, creating a single LCA for three nodes requires 2 collisions in . By definition, one LCA cannot be an ancestor of another LCA. This means there must be at least one more collision in to create the second LCA, bringing the total number of collisions required in to three. This immediately yields the final part of the claim by Remark 4.
To establish that there are at most two LCAs, suppose we add a third LCA. Then by the same argument, this LCA cannot be an ancestor of either of the two other LCAs, and so there must be another collision to explain it. This leads to four collisions among the ancestors, which we have ruled out. ∎
We now upper bound the expected overlap between , and by doing the above casework on the structure of their ancestral pedigrees. We simply upper bound the expected overlap, relying on the independence of inheritance in the different blocks so that we can apply a Chernoff–Hoeffding bound.
Lemma 5.3 (Case 1: exactly two LCAs).
Suppose that , , and have exactly two LCAs. Then the expected fraction of mutual overlap is at most .
Proof.
Fig. 6 illustrates the topology of interest. First we note that neither of the LCAs can have repeated symbols, since their ancestral pedigrees contain no collisions. Consider the ancestral pedigree from , , and up to any one particular LCA, noting that this pedigree is a tree by 2. Any configuration containing , , and their ancestors leading up to that LCA has at least 5 edges, since are not mutual siblings. Therefore, the probability that a single symbol propagates from that LCA to all of , , and is , which yields an expected 1/16 fraction of overlap since there are symbols at the LCA (since it is a coupled node). Since there are two such LCAs, the expectation is at most . ∎
In the remaining cases, we assume there is exactly one LCA. Note that any common symbols across , , and must be present in this LCA—if , , and inherit a symbol that is not present in this LCA, then by tracing their paths of inheritance for the symbol we can find another LCA. However, this does not guarantee that all common symbols in , , and can be traced back to inheritance from the LCA— if there is inbreeding, some nodes in can potentially inherit a symbol via an ancestor of the LCA through a path does not go through the LCA, while the rest inherit it from the LCA.
Lemma 5.4 (Case 2: one LCA with cycle above).
Suppose that , , and have exactly one LCA . Furthermore, this LCA has at least one collision in its ancestral pedigree. Then the fraction of mutual overlap is at most in expectation.
Proof.
We know that , , and , must have at least two distinct parents between them that are connected to (else would be their parent). This means there are at least two edges in the graph between and the parents of , , and , and at least three edges between , , and and their respective parents.
Since we know there are at most three collisions among the ancestors of , , and , there can be only one collision in the ancestral pedigree of , and the presence of this collision means there are no other collisions in . Therefore, each of the parent couples of , , and have an individual that is unrelated to , and so there are no repeated symbols within any of the parent couples. So even if the parents were to get overlap in the blocks due to inheritance from , it holds that , , and inherit at most fraction of these blocks on expectation.
Finally, all common symbols between , , and must have been inherited from — if a common symbol was instead inherited by some from some ancestor of , this would create a fourth collision in . ∎
Lemma 5.5 (Case 3: one LCA and is a tree).
Suppose , , and have exactly one LCA and is a tree. Then the fraction of mutual overlap is at most in expectation.
Proof.
The lack of any cycles in means that all inheritance of common symbols comes from the lone LCA . Any such union of paths from to , and forms a directed tree with at least five edges; see Fig. 7. In addition, has two distinct symbols in every block. Therefore, for any particular symbol the probability that all three of inherit it is , which yields an expected fraction of at most overlapping blocks.
∎
The final case is the most complicated one to analyze.
Lemma 5.6 (Case 4: one LCA with cycle not completely above).
Suppose , , and have exactly one LCA and contains a cycle that does not lie completely above . Then the fraction of mutual overlap is at most in expectation.
As an aid in proving Lemma 5.6, it is helpful to first identify the “most recent” inbred node. We make this notion precise now.
Definition 5.3 (Witness).
We call a node a witness to inbreeding or simply a witness if is the lowest node in that is part of an undirected cycle.
Lemma 5.7 (Unique witness).
Under the conditions of Lemma 5.6, there exists a unique witness in . Moreover, this witness lies strictly below the LCA .
Proof.
We know that is not a tree, so there exists a cycle in . We show that there can only be one cycle. Suppose that there exist two cycles in . Then we claim that .
Consider a spanning tree of . Then has two collisions. Moreover, contains a single cycle, so we conclude that there exists a node in whose outdegree is increased by one upon adding the edges from to (Otherwise, would still be a tree). Therefore, by Lemma 4.3, has three collisions. By similar reasoning and using that , we conclude that has collisions. Since , we conclude that has at least collisions. But under our conditioning, no subpedigree has or more collisions. It follows that in Lemma 5.6 there is exactly one cycle in , and thus, exactly one witness.
To prove the final statement, note that if the witness is located above in , then the cycle lies completely above . ∎
Proof of Lemma 5.6.
Consider and the subpedigree consisting of the ancestors of . Recall that is the unique LCA of . By Lemma 5.7, there is a unique witness , which is the lowest node in the unique cycle occurring in .
Subcase 1: .
Without further loss of generality, suppose that the witness lies along the path from to . Then it follows that there is a unique path from to in . Otherwise, there would exist two cycles in , which is a contradiction as this would lead to collisions in . Similarly, there is a unique path from to in . Moreover, is a tree. It follows that the subpedigree of the ancestors of and is a tree. Observe that is at least two levels above , and by the topology of this subcase, there are at least 4 edges in the tree subpedigree from to and . This implies that the expected overlap between and is at most . Thus the expected overlap between is at most the expected overlap between and , which is bounded by .
Subcase 2: Without loss of generality, .
Let . Either is on the branch that leads to and , or it is on the branch that leads to . First, suppose that is on the branch that leads to and . Then we may further assume is on the path from to . For if, say, is on the path from to , then is a tree, in which case we can argue as in Subcase 1 that the mutual expected overlap between is at most .
Therefore, it suffices to consider the cases is on the path from to or is on the path from to (Fig. 8). In the first case, the descendants of form a tree with at least two edges. Moreover, there is a unique node at the same level as in , and this individual is located on the path from to . Let denote the (distinct) symbols of in a given block. By these facts, symmetry, and conditional independence of inheritance,
The second line includes a factor of to account for either or being passed down to . The terms in the third line are ordered to correspond to the events in the two lines above. In particular, we have by conditional independence of inheritance that
because there are at most paths from to , and each has probability at most of passing down . The bound
holds similarly.
Now suppose that is on the path from to . Then
Above, we used the fact that tree pedigree from to has at least edges. We also used the fact
which holds because there are at most two paths to from , each path has probability at least of not passing down , and so by conditional independence of inheritance, the probability that both paths do not pass down is at least . ∎
Finally, to finish the proof of Lemma 5.2 using Lemmas 5.3, 5.4, 5.5, and 5.6, note that in all four cases the expected overlap between coupled nodes is at most . Thus, the probability that mutually share more than fraction of symbols in all cases is at most by Chernoff–Hoeffding, similar to the analysis of Lemma 5.1. Union bounding over all possible triples gives an upper bound of the chance that there is some triple with at least overlap. By also ruling out the bad event in Corollary 4.5 (which occurs with probability ), we obtain the desired upper bound.
5.2 Which ancestors are reconstructible?
In this section, we characterize nodes that are of importance in our analysis: couples whose history lacks inbreeding (e.g. graph structure is reconstructible using blocks) and have ample extant information (e.g. blocks are recoverable). We present this in two parts respectively in Definition 5.4 and Definition 5.5.
Definition 5.4 (Awesome Node).
Call a node in the pedigree awesome if:
-
1.
It is -rich.
-
2.
It is not an ancestor of any extant node that has a collision within its own ancestral pedigree (including itself).
Definition 5.5 (-goodness).
Let be a specific block. Say that a coupled node in a pedigree is -good if has at least two sets of three extant descendants , , and , , in such that:
-
1.
is a joint LCA of and is a joint LCA of .
-
2.
, , and all have the same symbol in block , and , , and all have the same symbol in block .
-
3.
.
We furthermore define every extant node to be -good, for all .
We now deliver the main message of this section: most nodes have these properties, given the assumptions of our model (Proposition 5.8 and Lemma 5.9). Therefore, this characterization enables a natural reconstruction algorithm (Section 6).
Proposition 5.8 (Many awesome nodes).
Let (as in Definition 5.4) be a constant, let be a sufficiently large constant with respect to , and let be sufficiently large with respect to both and . With probability at least , in every layer of the pedigree at least fraction of the nodes are awesome.
Proof.
Since and are sufficiently large with respect to , we can apply Lemma 4.7 with and . This tells us that at least fraction of nodes in each layer are -rich with probability , where the constant depends only on .
Applying Corollary 4.6 with , there are at most nodes at the extant level with collisions in their ancestral pedigree, with probability . This means there are at most ancestors of these nodes. It follows that the number of nodes that are -rich but not awesome is at most . This is at most , provided is sufficiently large with respect to and and we take to be small with respect to .
The first probability is exponentially small in , while the second probability is exponentially small in . Therefore, the probability of both events occurring simultaneously can be lower bounded by , by taking the constant hidden in the to be slightly smaller than what is found in the previous paragraph. ∎
Lemma 5.9 (Awesome implies -good).
Let (as in Definition 5.4) be a sufficiently large constant. With probability over the symbol inheritance process, every awesome coupled node in is -good for at least fraction of blocks .
The figure “” is an arbitrary choice for simplification. It can be replaced by anything arbitrarily close to , which changes the constant factor of found in the lemma above. To prove Lemma 5.9, first we need a structural claim about awesome nodes:
Claim 3.
For any awesome coupled node, the subpedigree formed by it and its awesome descendants contains an induced -ary tree that goes down to the extant level.
Proof of 3.
First, we show that this subpedigree has no undirected cycles within it, which establishes the tree structure. Then, we argue that each node has children within this subpedigree.
Suppose that there an undirected cycle within this subpedigree. We show that this implies the presence of a collision within the subpedigree, contradicting the awesomeness of all nodes in the subpedigree. Note that there must be a node within this subpedigree with a cycle in its ancestral pedigree - for instance, take the node at the lowest level within the cycle. Applying Lemma 4.3 to this awesome node, we see it has a collision among its ancestors, which contradicts condition 2) of Definition 5.4.
Now we establish that each node has at least children in the subpedigree. An awesome coupled node has at least children that are -rich, since it is -rich itself. Furthermore, none of these children have descendants with collisions in their ancestral pedigree, so they are all awesome, which finishes the proof. ∎
Proof of Lemma 5.9.
Every awesome coupled node in has exactly distinct symbols in each block. Indeed, assume for contradiction that there is an awesome coupled node with a block in which it only has one distinct symbol. Due to the infinite alphabet assumption, we know that we can trace any symbol in a block back to a unique founder. Hence, there must be a collision in the ancestral pedigree of , which is a contradiction with condition 2) of (Definition 5.4).
Now we can proceed with showing that every awesome coupled node is -good for fraction of blocks . Fix an awesome node and a block .
We use condition (1) of awesomeness to show that, with probability tending to 1 as , there exist two sets of three extant nodes that both have as a joint LCA, where the first set has a symbol in block , and the second set has a symbol .
Towards this end, let us follow the inheritance of among an induced -ary tree of awesome descendants, as guaranteed by 3. The inheritance follows a broadcast process with copy probability on this -ary tree. The probability that the symbol makes it to at least three distinct children of , and this symbol in turn survives to the extant nodes can be expressed as
(8) |
where refers to the survival probability of percolation on the -ary tree with copy probability . The first term refers to the probability that the symbol is inherited by at least 3 of the awesome children of . Additionally, these three extant nodes have as an LCA, as they have paths of inheritance from that do not all intersect at any other node.
Naturally, Eq. 8 also gives the probability that is similarly inherited. Furthermore, from standard results about Galton-Watson processes (see e.g. [KA15]), we know that as , . Hence, we conclude that Eq. 8 tends to 1 as . Thus it follows from the union bound the probability that there exist two sets of three extant nodes that both have as a lowest common ancestor, the first set has in block , and the second set has , also tends to 1 as .
Hence, given a specific block , the probability that an awesome coupled node is -good is at least . The high probability of this occurring for all blocks follows from a standard Chernoff–Hoeffding bound. ∎
6 Reconstructing the Pedigree
On the following page, we provide pseudocode for \namerefalgo:reconstruct_all which is the proposed reconstruction procedure, with details of the inner procedures following it (\namerefalgo:collect-symbols, \namerefalgo:test-siblings, and \namerefalgo:construct-parents). Note that for the first iteration of \namerefalgo:reconstruct_all, we do not need to collect symbols as the extant genetic data is given to us. Thus we simply test siblinghood at iteration by using the true gene sequences.
The goal of the rest of this section is to prove the correctness of \namerefalgo:reconstruct_all. We now formally state our guarantee:
Theorem 6.1 (Main theorem, formal).
Let be the depth- coupled pedigree output by the algorithm \namerefalgo:reconstruct_all, applied to the gene sequences in . With probability tending towards as , is an induced subpedigree of such that for all levels , where as . The probability is over the randomness of the coupled pedigree and the inheritance procedure with parameters set as in Section 4.1.
We define where, for a given value of , is defined to be the largest value of such that Proposition 5.8 holds. Observe that as because Proposition 5.8 holds for arbitrarily large values . Therefore, as .
We make use of the following high-probability events, provided is a large enough constant so that satisfies the hypothesis of Lemma 5.9, is sufficiently large with respect to , the total number of generations is , where , and the gene sequence length is .
Proposition 6.2 (Key Reductions).
With probability tending towards as , the pedigree satisfies:
-
1.
For each level , each clique of has a single parent (Lemma 4.8).
-
2.
For each level , the maximal cliques of are edge-disjoint, in such a way that each is contained in at most two maximal cliques (Lemma 4.9).
-
3.
Each triple of nodes, has at most 3 collisions (Corollary 4.5), implying
-
(a)
their joint LCA is unique (Lemma 4.10), and
-
(b)
all inheritance paths for some node go through the unique LCA (Lemma 4.11).
-
(a)
- 4.
-
5.
For each level , at least fraction of nodes in are awesome (Proposition 5.8).
-
6.
If is awesome, then it is -good for of blocks (Lemma 5.9).
The “probability tending towards 1” portion of Theorem 6.1 can be quantified via a union bound on the probability of failure of any of the events in Proposition 6.2, while the “” guarantee comes from the fact that we recover of the awesome nodes in conjunction with Condition 5. With this as a simplification, we proceed with the proof of Theorem 6.1.
The upcoming lemma (Lemma 6.3) proves the correctness of the very first iteration (depth 1 from depth 0), and therefore serves as the base case. The inductive step (Lemma 6.4) is presented immediately afterwards. For the remainder of this section, we write to denote the depth- reconstructed pedigree after the th iteration of \namerefalgo:reconstruct_all, ( is the depth- pedigree of all the extant nodes). In contrast, let denote the subpedigree of (the ground truth) induced by graded levels up to .
Lemma 6.3.
Let denote the estimated -regular siblinghood hypergraph for the extant nodes (line 7 of \namerefalgo:test-siblings). Consider the pedigree created by \namerefalgo:construct-parents applied to . Then there exists an injective homomorphism so that the induced subgraph on is isomorphic to . Moreover, contains , where is the set of awesome nodes at levels in .
Proof.
Let denote the true siblinghood hypergraph on extant nodes with at least two siblings. By Condition 4, we have that . Since both graphs have the same set of vertices, we simply write .
This gives a natural, explicit characterization of . For an extant node , define so that it is the identity map on the extant. Given couple , define to be the parent couple of the children of . The condition implies that at least one such choice for exists, and moreover by Condition 1, is the unique parent.
is injective: Let with . At the extant level, the maximal cliques in are vertex disjoint by Condition 2. Hence, the children of and the children of have empty intersection. Moreover in , vertex-disjoint maximal cliques have distinct parents. Therefore, , as desired.
respects edges: We already know that . Now suppose that is an edge in for and . Since is in the image of , it follows that has at least children that passed the siblings test in our algorithm. If is one of , we’re done, so suppose not. By Condition 5.1, the extant triples , , and all have at least overlap. Therefore, form a clique in , and line 5 of \namerefalgo:construct-parents states that is a parent of all four, so is an edge in .
The image of contains the awesome nodes in : This part is trivially true for the extant nodes, so consider only the awesome nodes . By definition, any awesome node is -rich. Since , the children of form a maximal clique of size at least in . Therefore, \namerefalgo:construct-parents creates a parent for these children in , which gives the pre-image of . ∎
Lemma 6.4.
Let and suppose that we are given . Assume that there exists an injective homomorphism which satisfies
-
1.
,
-
2.
induces a subgraph isomorphic to , and
-
3.
contains the awesome nodes in sets .
Let be the level- extension of , via lines 4 through 7 of \namerefalgo:reconstruct_all. Then there exists a level- extension of the map with the same properties.
We prove this in two stages. The first part (Lemma 6.5) asserts that we reconstruct the sibling relationships correctly, while the latter (Lemma 6.6) assures that the cliques of this estimated siblinghood hypergraph are actually the faithful, “largest possible” groupings of siblings.
Lemma 6.5.
Definition 6.1.
For an awesome node , its awesome subtree is the subgraph of that is the union of all paths from to extant nodes that consist entirely of awesome nodes.
Claim 4.
Suppose that there is a reconstruction map satisfying the hypotheses in Lemma 6.4. Then for any awesome node , its awesome subtree satisfies .
Proof of 4.
Note that Line 5 of \namerefalgo:construct-parents ensures that every node in is -rich. Since is an injective homomorphism, it follows that every node in is also -rich in . Furthermore, being awesome implies that all of its descendants are awesome in , since none of its descendants can have collisions in its ancestral pedigree (Definition 5.4). By the definition of the awesome subtree (Definition 6.1), it holds that .
For the other direction (), let be an extant node so that there is a path from to consisting only of awesome nodes. By condition 3 of Lemma 6.4, all of the nodes along this path are in the image of . ∎
Claim 5.
Let be as in Lemma 6.4, and let for some . Suppose that in block the symbols and are recovered for by applying Algorithm 1 to . Then it holds that also has symbols in block .
Proof of 5.
For , suppose that nodes have the symbol in block and are used by \namerefalgo:collect-symbols to recover in block of . Recall that are all descended from distinct children of . Let induce subpedigree in .
By the hypotheses of Lemma 6.4, and so must be a common ancestor of in . By line 4 of \namerefalgo:collect-symbols and because , – and therefore – is their joint LCA. With respect to , Conditions 3a and 3b tell us the much stronger condition that is their only LCA, and that all paths in from any common ancestor of to (without loss of generality) must pass through . Therefore, if all inherit symbols in block , the symbol must have passed through block of via the infinite symbols assumption. ∎
Claim 6.
Let be as in Lemma 6.4, and let for some . Suppose that is awesome in . If is -good and has symbols in block , then \namerefalgo:collect-symbols recovers the symbols and for in block .
Proof of 6.
By 5, we only need to show that at least two symbols in block are reconstructed by \namerefalgo:collect-symbols applied to . Note that -goodness implies .
By -goodness of , as in the proof of Lemma 5.9, there is a witnessing triple for each of the contained in the extant of the awesome subtree . By 4, also contains these witnesses. Since extant nodes are the exact same in compared to by hypothesis 1 of Lemma 6.4, \namerefalgo:collect-symbols applied to recovers in block . ∎
Proof of Lemma 6.5.
By assumption, is injective. To first see that is a hypergraph homomorphism, let be distinct nodes satisfying line 2 of \namerefalgo:test-siblings, and let , and denote their counterparts in .
Suppose that are not mutually siblings. By Condition 4, have at most mutually overlapping blocks. By 5, for all , the symbols reconstructed for in block using \namerefalgo:collect-symbols are a subset of the symbols in block of . Therefore, have mutually overlapping symbols in at most blocks. Since , \namerefalgo:test-siblings does not place a hyperedge between in .
To conclude that the induced subgraph is isomorphic to , it remains to show that if are mutual siblings in , then is a hyperedge in . Note that of the blocks of were recovered by \namerefalgo:collect-symbols by the definition of , and by 5, the symbols of in block are a subset of the symbols of , respectively, in block . By Condition 4, the mutual overlap between the siblings is at least . Thus, by a union bound on the occurrence of 1%-fraction of unrecovered blocks, the mutual overlap between is at least . Therefore, \namerefalgo:test-siblings constructs a hyperedge on , as desired. It follows that the induced subgraph is isomorphic to .
Lemma 6.6.
Let denote the maximal (hyper)cliques in the subgraph of induced by , and let denote the (hyper)cliques probed by \namerefalgo:construct-parents applied to . Given , define to be the set given by the image of under . Then is a bijection between and .
Proof.
By Lemma 6.5, the subgraph induced by is isomorphic to . Hence, it suffices to show that the cliques probed by \namerefalgo:construct-parents applied to are precisely the maximal cliques of . Recall that by Condition 2, the maximal cliques in are edge-disjoint, and every node of is involved in at most cliques.
It is helpful to imagine the cliques as being listed out in the same order that they are probed by \namerefalgo:construct-parents, indexed by timesteps . Let , and let denote the result of removing the edges of the clique from .
We argue that for all , the graph is a union of edge-disjoint maximal cliques, and any two maximal cliques intersect in at most a single vertex. The base case is true by Condition 2. This holds for because the above property is preserved when all of the edges are removed from a single maximal clique in . Moreover, for all , the maximal cliques in are the same as those of but with a single maximal clique in removed. Hence, it also follows by induction that for all , the maximal clique in is also a maximal clique in .
Since \namerefalgo:construct-parents terminates at the first time when has no hyperedges, we conclude that are all of the maximal cliques in , as desired. ∎
Proof of Lemma 6.4.
We first extend the definition of to level . For , we define as follows. Let denote the children of . By Lemmas 6.5 and 6.6, is a clique in . Define to be the parent of the children of the clique in . The map is well-defined at level because of Condition 1. It remains to show that is an isomorphism onto its image, and moreover that its image contains all of the awesome nodes at level .
The map is injective: We know this is true for , so it suffices to consider injectivity of when restricted to the nodes at level in . Let with . Let (resp., ) denote the maximal clique in that consists of the children of (resp., ). By Lemma 6.6, and are distinct maximal cliques in the induced subgraph , and therefore, are contained in distinct maximal cliques in . Distinct maximal cliques in have distinct parents, so by the definition of , we conclude that , as desired.
The map is edge-preserving: Suppose that is an edge in with and . Consider the maximal clique containing in . By Lemma 6.6, is a maximal clique in the induced subgraph , and by construction of , the parent of is . Therefore, the edge is in the pedigree .
Suppose now that the edge is in the pedigree . Consider the maximal clique containing . By Lemma 6.6, is a maximal clique in . By Lemma 6.6 and the construction in \namerefalgo:construct-parents, we conclude that the parent of in is mapped to under . By injectivity of , this parent is precisely . Therefore, is an edge in .
The image of contains the awesome nodes in : It suffices to prove the statement for the awesome nodes at level , which we denote by . Suppose that is an awesome node at level of . By awesomeness, has at least awesome children. Let denote the clique in given by the awesome children of . By Lemmas 6.5 and 6.6, satisfies because all of the awesome children up to level are in the image of , by the inductive hypotheses. By Lemma 6.6, the maximal clique containing in satisfies that are all children of . By the definition of \namerefalgo:construct-parents and at level , we conclude that a parent is constructed for and , as desired. ∎
Acknowledgments We thank Vishesh Jain for many helpful discussions.
References
- [23a] 23andMe. About us. \urlhttps://mediacenter.23andme.com/company/about-us/.
- [Anc] Ancestry.com. Ancestry continues to lead the industry with world’s largest consumer dna network. \urlhttps://www.ancestry.com/corporate/newsroom/press-releases/ancestry%C2%AE-surpasses-15-million-members-its-dna-network-powering-unparalleled.
- [DMR06] Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Optimal phylogenetic reconstruction. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 159–168, 2006.
- [EKPS00] William Evans, Claire Kenyon, Yuval Peres, and Leonard J. Schulman. Broadcasting on trees and the ising model. Ann. Appl. Probab., 10(2):410–433, 05 2000.
- [ESPC18] Yaniv Erlich, Tal Shor, Itsik Pe’er, and Shai Carmi. Identity inference of genomic data using long-range familial searches. Science, 362(6415):690–694, 2018.
- [ESSW99] Péter L Erdős, Michael A Steel, László A Székely, and Tandy J Warnow. A few logs suffice to build (almost) all trees (i). Random Structures & Algorithms, 14(2):153–184, 1999.
- [HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
- [Hui17] Jisca Huisman. Pedigree reconstruction from snp data: parentage assignment, sibship clustering and beyond. Molecular ecology resources, 17(5):1009–1024, 2017.
- [HWH+13] Dan He, Zhanyong Wang, Buhm Han, Laxmi Parida, and Eleazar Eskin. Iped: inheritance path-based pedigree reconstruction algorithm using genotype data. Journal of Computational Biology, 20(10):780–791, 2013.
- [HWPE14] Dan He, Zhanyong Wang, Laxmi Parida, and Eleazar Eskin. Iped2: Inheritance path based pedigree reconstruction algorithm for complicated pedigrees. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB ’14, page 202–210, New York, NY, USA, 2014. Association for Computing Machinery.
- [Ins19] National Human Genome Research Institute. The cost of sequencing a human genome. \urlhttps://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost, 2019.
- [KA15] Marek Kimmel and David Axelrod. Branching Processes in Biology. Springer Publishing Company, Incorporated, 2nd edition, 2015.
- [KKM+19] Younhun Kim, Frederic Koehler, Ankur Moitra, Elchanan Mossel, and Govind Ramnarayan. How many subpopulations is too many? exponential lower bounds for inferring population histories. Journal of Computational Biology, 2019.
- [KLKH11] Bonnie Kirkpatrick, Shuai Cheng Li, Richard M Karp, and Eran Halperin. Pedigree reconstruction using identity by descent. Journal of Computational Biology, 18(11):1481–1493, 2011.
- [KM18] Gina Kolata and Heather Murphy. The golden state killer is tracked through a thicket of dna, and experts shudder. The New York Times, 4 2018.
- [Lal16] Steven Lalley. Poisson processes. Statistics 312: Stochastic Processes, 2016.
- [MMP18] Anuran Makur, Elchanan Mossel, and Yury Polyanskiy. Broadcasting on bounded degree dags. arXiv preprint arXiv:1803.07527, 2018.
- [Mos01] Elchanan Mossel. Reconstruction on trees: beating the second eigenvalue. Annals of Applied Probability, pages 285–300, 2001.
- [Mos04] Elchanan Mossel. Phase transitions in phylogeny. Transactions of the American Mathematical Society, 356(6):2379–2404, 2004.
- [MR05] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375, 2005.
- [MyH] MyHeritage. Myheritage end-of-year infographic. \urlhttps://blog.myheritage.com/2019/12/wrapping-up-a-fantastic-2019.
- [Pol15] D Pollard. Mini empirical. Manuscript. \urlhttp://www.stat.yale.edu/ pollard/Books/Mini/, 2015.
- [SH06] Mike Steel and Jotun Hein. Reconstructing pedigrees: a combinatorial perspective. Journal of theoretical biology, 240(3):360–367, 2006.
- [SS03] Charles Semple and Mike Steel. Phylogenetics, volume 24. Oxford University Press on Demand, 2003.
- [STH14] Doron Shem-Tov and Eran Halperin. Historical pedigree reconstruction from extant populations using partitioning of relatives (prepare). PLoS computational biology, 10(6), 2014.
- [Tho00] Elizabeth A. Thompson. Statistical inference from genetic data on pedigrees. NSF-CBMS Regional Conference Series in Probability and Statistics, 6:i–169, 2000.
- [Tho13] Elizabeth A Thompson. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics, 194(2):301–326, 2013.
- [TS08] Bhalchandra D Thatte and Mike Steel. Reconstructing pedigrees: a stochastic perspective. Journal of theoretical biology, 251(3):440–449, 2008.
- [Wan19] Jinliang Wang. Pedigree reconstruction from poor quality genotype data. Heredity, 122(6):719–728, 2019.