This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\hypersetup

colorlinks = true, allcolors = blue

Efficient Reconstruction of Stochastic Pedigrees

Younhun Kim111Massachusetts Institute of Technology. Department of Mathematics. Email: \url[email protected].    Elchanan Mossel222Massachusetts Institute of Technology. Department of Mathematics and IDSS. Email: \url[email protected]. Partially supported by awards ONR N00014-16-1-2227, NSF CCF1665252 and DMS-1737944.    Govind Ramnarayan333Massachusetts Institute of Technology. CSAIL. Email: \url[email protected]. Partially supported by awards NSF CCF1665252 and DMS-1737944.    Paxton Turner444Massachusetts Institute of Technology. Department of Mathematics. Email: \url[email protected].
Abstract

We introduce a new algorithm called Rec-Gen for reconstructing the genealogy or pedigree of an extant population purely from its genetic data. We justify our approach by giving a mathematical proof of the effectiveness of Rec-Gen when applied to pedigrees from an idealized generative model that replicates some of the features of real-world pedigrees. Our algorithm is iterative and provides an accurate reconstruction of a large fraction of the pedigree while having relatively low sample complexity, measured in terms of the length of the genetic sequences of the population. We propose our approach as a prototype for further investigation of the pedigree reconstruction problem toward the goal of applications to real-world examples. As such, our results have some conceptual bearing on the increasingly important issue of genomic privacy.

1 Introduction

1.1 Motivation

The decreased costs of sequencing technologies have enabled large-scale, data-driven analyses of genomes [Ins19]. Recent science and news articles feature stories only possible due to this plethora of data, such as the recent identification and capture of a high-profile criminal [KM18] predicated on DNA evidence. In this effort, an individual’s genetic information was compared to a large, curated database called GEDMatch consisting of over one million individual genomes. In comparison, there exist databases which are of several orders of magnitude larger in size such as MyHeritage (\sim3.7 million [MyH]), 23andMe (\sim10 million [23a]), and Ancestry (\sim15 million [Anc]).

This raises the question: how much kinship information can be learned from DNA? Current databases already contain a considerable amount of this information. Indeed, it is estimated that a given US individual of European ancestry, on average, has a third cousin or closer who is already in the MyHeritage database [ESPC18]. However, such databases are still far from complete. This calls into question the ability to detect missing kinships based on individuals already present in the database.

This discussion also highlights the issue of genomic privacy. Indeed, it becomes much easier to identify and locate individuals by combining the genetic and genealogical information with outside information (addresses, e-mails, family photos, etc.). This potential, having already been demonstrated by the resolution of the aforementioned criminal case, was brought to attention by [ESPC18]. From this point of view, the ability to reconstruct genealogies from collected genetic data is of concern for individuals whose information is revealed, even if one has never been sequenced. Since our work establishes a positive result in a pessimistic scenario where we start with no ground truth information, we believe that our work brings to attention this critical issue via a theoretical framework.

1.2 Our contributions

Without any prior knowledge about the ground truth, can we learn everyone’s genealogy using their genetic information? In this paper, we study the inference problem of recovering ancestral kinship relationships of a population of extant (present-day) individuals, using only their genetic data. Our goal is to use this extant genetic data to recover the pedigree of the extant population, under an idealized model. A pedigree is a graph whose nodes (individuals) have edges that encode parent-sibling relationships. The topology and reconstruction of pedigrees are well-studied in bioinformatics from both a theoretical and empirical perspective, and in general the study of pedigrees poses formidable computational and statistical challenges.

In this paper, we introduce a novel recursive algorithm Rec-Gen for pedigree reconstruction. To demonstrate the effectiveness of our approach, we give a mathematical proof that for an idealized generative model on pedigrees, our algorithm is able to approximately recover the true, unknown pedigree only using the genetic data of the extant population. In terms of sample complexity, which for our purposes refers to the common gene sequence length of an extant individual, our algorithm greatly outperforms the naive reconstruction method (estimate pairwise distances between the extant individuals, then construct the pedigree that produces these distances). We propose our approach in this work as a prototype for the future study of more general pedigrees, including those involving real-life genetic data, from both a theoretical and empirical perspective. For further discussion on our model of pedigree generation, as well as its features and limitations, see Section 1.4 and Section 1.6.

1.3 Related works

A common method in theoretical evolutionary biology is to model lineages and inheritance via a family of directed acyclic graphs. One line of work is that of phylogenetics (refer to [SS03] for an overview) which uses trees to model the occurrence of large-scale speciation events in evolutionary biology. Another line of work is coalescent theory, which focuses on variable-height inheritance trees between genes as its main statistic to infer large-scale population sizes, as in e.g. [KKM+19]. In contrast, pedigrees capture small-scale individual genealogies that encode familial relationships. Specifically, most pedigree models are for human genealogies, where we designate exactly two parents to each individual. By construction, such graphs are no longer trees and warrant different strategies for inference.

[SH06] posed the formal definition of pedigrees using graph-theoretic language. In that work, the authors gave combinatorial arguments proving that one can reconstruct complete pedigrees, assuming the correct ancestral history is provided as an input for each extant individual. Our definition of pedigrees is essentially the same as the one outlined by these authors, though we make the simplification that we do not identify the vertex set bipartition (corresponding to the biological sex of the individuals).

To tie in more closely with real-world applications, one must consider the challenge of estimating these histories from data. Along these lines, [TS08] studied stochastic processes that one can associate with the pedigree, in such a way that one can prove negative results (information-theoretic impossibility) or positive results (an algorithm) for the reconstruction of the pedigree from extant data. The stochastic process used to show their positive result was based on a very specific family of Markov chains which allows for inference but is quite different from our model.

For the problem of performing pedigree reconstruction on real data, there is a wealth of literature [Tho00, KLKH11, HWH+13, Tho13, HWPE14, STH14, Hui17, Wan19]. Such studies apply heuristics that take into account various complications and phenomena observed in human genomes, such as varying levels of correlations between different sites and the presence of mutations that are not inherited from parents.

One line of work particularly relevant to this paper is [HWH+13, HWPE14] in which the authors also tackle the problem of pedigree reconstruction from real extant genetic data. Assuming answers to queries of the form, “how much DNA did ii and jj simultaneously inherit from their ancestors?”, they design a statistical test that distinguishes between siblings, half-siblings and cousins. Their method leverages this information with a maximal-clique finding algorithm to iteratively reconstruct the parents, layer-by-layer. There is no proof of correctness provided, but they provide benchmarks on real and simulated data to provide experimental justification. Our contributions have a slightly different flavor: using a similar iterative strategy but with a different statistical test (the novel part of our algorithm) and for a more optimistic set of assumptions, one can actually provably reconstruct the pedigree correctly in a sample-efficient way, in an asymptotic sense.

The authors of [HWPE14] specifically emphasize their method’s ability to reconstruct half-siblings. Technically speaking, this is not allowed in our model and therefore it may appear to the reader that there is something too restrictive or suboptimal about our analysis. One major difference between our model and the aforementioned work is that we model haploid individuals (one copy of DNA), while in reality humans are diploids (two copies of DNA). Furthermore, in our proof, we guarantee reconstruction of monogamous couples of haploid individuals – in other words, up to permutation of the two individuals within each couple. It can be observed that given a monogamous pedigree with a haploid model, one can construct a natural, non-monogamous pedigree with a diploid model such that the total variation of the extant data of the two pedigrees is zero. Therefore, we think that our results should also hold for a diploid model with minor modifications and have correctness guarantees to match the empirical results of the aforementioned work [HWPE14], for example by interpreting Fig. 1(a) as a pair of diploid half-siblings.

Our work is also closely related to the problem of phylogenetic reconstruction [ESSW99, Mos04, MR05, DMR06]. In this setting, symbols are passed from the root of a phylogenetic tree to descendants via a Markov process such as in the Cavender–Farris–Neyman model, a basic model for mutations. Similar to our inference problem in this work, in phylogenetic reconstruction, one is tasked with reconstructing the tree given only the symbols at the leaves. The main result of [ESSW99] characterizes the sample complexity—the minimal string length of the data at the leaves such that reconstruction is possible—as logarithmic in the depth of the tree, a phenomenon that our results suggest also holds for the pedigree reconstruction problem. The work [MR05] provides theoretical guarantees for the problem of learning the phylogenetic generative model (i.e., the topology of the tree as well as the transition matrices), which includes hidden Markov models as a special case, from the extant data under a spectral assumption on the transition matrices (see also later work of [HKZ12]). Most closely related to our approach in this paper is the work [Mos04], which shows how to recursively reconstruct phylogenies using techniques from the theory of broadcast processes on trees (see also [DMR06]). This approach provides inspiration for our main algorithm Rec-Gen, which uses similar techniques to recursively reconstruct pedigrees. We direct the reader to [EKPS00] and [Mos01] for studies of broadcast processes on trees with binary and large alphabet respectively, and [MMP18] for a generalization to directed acyclic graphs.

1.4 Model description and results

We now give an informal, detailed description of our framework for pedigree reconstruction, with a more detailed treatment of the generative model in Section 3. Our generative model on pedigrees consists of two parts: a parametric model for generating the network structure on the set of ancestors and extant individuals, and an inheritance procedure for transmitting genetic data from the founders, the oldest individuals in the pedigree, to the extant population.

To generate the pedigree network structure, we begin with a large founding population of size NTN_{T}. The founders randomly mate monogamously, and each couple gives birth to a random number of children, so that the average number of offspring per couple is a constant555More precisely, each couple has a random number of children distributed as a Poisson random variable with expectation α\alpha. α\alpha. This procedure of random monogamous mating continues for TT subsequent generations, eventually yielding the extant nodes and a pedigree 𝒫\mathcal{P} formed by the individuals in generations 0,1,,T0,1,\ldots,T, with NiN_{i} nodes at each level ii.

Next we describe how genetic data transmits from the founding population to the extant. Every individual in the pedigree has a gene sequence consisting of BB symbols placed in BB distinct blocks. Each individual in the founding population is initialized with independent uniformly random draws from a very large alphabet Σ\Sigma. Now we state how parents pass down genes to their children. In a given block, a child inherits, with equal probability, either its mother’s or its father’s symbol in the corresponding block. This procedure repeats for all couples in a given generation and then continues over subsequent generations so that genetic data is iteratively transferred through the pedigree, eventually giving rise to the gene sequences of the extant individuals.

Our main result is summarized in the following theorem. See Theorem 6.1 for a formal statement.

Theorem 1.1 (Main result, informal).

Let α\alpha and β\beta denote sufficiently large absolute constants independent of NTN_{T}, the size of the founding population. Let ε\varepsilon denote a sufficiently small absolute constant independent of NTN_{T}. Assume that the alphabet size |Σ||\Sigma| is very large with respect to NTN_{T}.

Then given extant genetic data produced from the generative model with alphabet Σ\Sigma, growth rate α\alpha, gene sequence length B=βlogNTB=\beta\log N_{T}, and number of generations T=εlogNTT=\varepsilon\log N_{T} as described above, the algorithm Rec-Gen recovers 90%90\% of the true pedigree in every generation, with high probability. Moreover, this algorithm runs in polynomial time in the size of the pedigree and the number of blocks per extant individual.

Let 𝒫\mathcal{P} denote the true, unknown pedigree. Our formal version of Theorem 1.1 (see Theorem 6.1) implies that with high probability Rec-Gen outputs a reconstructed pedigree 𝒫^\hat{\mathcal{P}} whose size is at least 0.9Ni0.9N_{i} in each generation i{0,,T}i\in\{0,\ldots,T\}, such that every node u^𝒫^\hat{u}\in\hat{\mathcal{P}} can be identified with exactly one node u𝒫u\in\mathcal{P}, and this identification preserves relationships in the sense that u^\hat{u} is a child of v^\hat{v} in 𝒫^\hat{\mathcal{P}} if and only if uu is a child of vv in 𝒫\mathcal{P}. In graph-theoretic terminology, our reconstruction 𝒫^\hat{\mathcal{P}} is a (very large) induced subgraph of the truth 𝒫\mathcal{P}.

We note that the stipulation that we recover 90% of the nodes at each level is actually a simplification; in fact, we can make the fraction of reconstructed nodes in each generation arbitrarily large by taking α\alpha to be large enough. We refer the reader to Theorem 6.1 for details.

1.5 The Rec-Gen algorithm

The algorithm Rec-Gen consists of a recursive procedure that uses only the genetic information from the extant population to construct a good approximation for the true pedigree 𝒫\mathcal{P} of depth TT that generated the observations. In the first phase of recursion, the algorithm reconstructs the parents of the extant nodes, which we label as the 1st1^{\mathrm{st}} generation. In the ttht^{\mathrm{th}} phase, the algorithm adds a ttht^{\mathrm{th}} generation to the partially reconstructed version of the true pedigree given by the output of the previous phase. The algorithm terminates after TT phases of recursion, producing a pedigree 𝒫^\hat{\mathcal{P}} with TT generations that well-approximates the true, unknown pedigree 𝒫\mathcal{P}.

We next give a simplified version of our recursive procedure that serves to illustrate the main ideas. See Section 6 for a detailed description of Rec-Gen. Suppose that we have constructed a pedigree 𝒫^t\hat{\mathcal{P}}_{t} of depth tt, and recall that BB refers to the length of the gene sequence of an individual. Also recall that a couple refers to a pair of mated individuals.

Note that the first step of our recursive procedure equips each couple with an empirical gene sequence of length B where each block can contain two distinct symbols. This empirical gene sequence is constructed based on extant data and should be thought of as determining which symbols belong to at least one of the individuals from the couple in a given block. Also, we say that three gene sequences σ,σ,σ′′\sigma,\sigma^{\prime},\sigma^{\prime\prime} overlap in a block if all three sequences have some symbol in common in that block.

Perform the following steps to output a pedigree 𝒫^t+1\hat{\mathcal{P}}_{t+1} of depth t+1t+1.

  • (1)

    Collect-Symbols For each couple cc in generation tt of 𝒫^t\hat{\mathcal{P}}_{t}, use the extant genetic data to recover symbols that belong to cc as follows.

    • Recover a symbol σ\sigma in block b[B]b\in[B] of cc if cc has three extant descendants descended from distinct children of cc that all share symbol σ\sigma in block bb.

    • Repeat this procedure to recover at most one other symbol σσ\sigma^{\prime}\neq\sigma for cc in block bb.

  • (2)

    Test-Siblinghood For every triple of couples c,c,c′′𝒫^tc,c^{\prime},c^{\prime\prime}\in\hat{\mathcal{P}}_{t} in generation tt, determine c,c,c′′c,c^{\prime},c^{\prime\prime} to be (mutually) ‘siblings’ if and only if at least 0.21B0.21B of their recovered symbols mutually overlap.

  • (3)

    Assign-Parents For every maximal collection 𝒞={c1,c2,,ck}\mathcal{C}=\{c_{1},c_{2},\ldots,c_{k}\} of couples in generation tt such that every triple in 𝒞\mathcal{C} consists of mutual siblings, construct a pair of parents in generation t+1t+1 that have as children precisely one individual from each couple in 𝒞\mathcal{C}.666We perform this step in such a way that every child is assigned at most 22 parents.

After TT iterations of the above recursive procedure, we output a pedigree 𝒫^T\hat{\mathcal{P}}_{T} that gives a good approximation to the underlying pedigree that generated the extant genetic data as described in Theorem 1.1. We remark that working with triples as above greatly simplifies our analysis, as discussed in Section 2.3.

1.6 Model discussion and future directions

Our generative model imposes various constraints on the typical pedigrees that we consider. We discuss these modeling assumptions here and also consider the problem of investigating more general models that could more accurately capture properties of real-world data.

First, we consider the assumption that the size of the alphabet Σ\Sigma is very large with respect to the size NTN_{T} of the founding population. Since a “block” represents the unit of inheritance from a parent777Using biology terminology, each block can be considered as an idealized abstraction of a collection of single-nucleotide polymorphisms (sites of variation) with high linkage disequilibrium (empirical measure of correlation) that are passed from parent to child., this implies that with very high probability all of the founders have distinct symbols in their gene sequences, and no two founders share a common symbol.888Mathematically, this can be thought of as an improper prior on a countably infinite alphabet Σ\Sigma. Our large alphabet assumption is equivalent to the assertion that the founders are unrelated.

Second, the stochastic process describing inheritance in our model has the following biological interpretation. A standard concept in population genetics refers to long-running sequence matches as being identical by descent (IBD) if they arose due to inheritance from a common ancestor [Tho13]. In contrast, the term identity by state refers to the event that two identical tracts in the genome arose by coincidence – via mutations – in two unrelated individuals. Our inheritance model contains the assertion that each block corresponds to true IBD sequences: if two individuals have the same symbol, we can always identify a common ancestor that gave rise to these symbols.

Third, we recall the hypothesis that every couple has on average α\alpha children, where α\alpha is a sufficiently large absolute constant independent of the size NTN_{T} of the founding population. This ensures that, roughly speaking, every new generation is a factor α/2\alpha/2 larger than the previous one. Assuming roughly uniform growth of generations, it is necessary that α>0\alpha>0\,\,— otherwise the population would die out and there would be no extant nodes after TT generations. More subtly, it is necessary that α2\alpha\geq 2   — otherwise, via standard results from the theory of branching processes (see, e.g. [KA15]) a founding node has a very low probability of passing on its symbols to the extant. In this situation, even detection of such an ancestor from extant genetic data alone is information-theoretically impossible. On the other hand, our assumption that α\alpha is a large constant essentially amplifies the signal sent from a founder to the extant, and this simplifies our mathematical analysis.

Our first open question considers relaxing the previously discussed assumptions.

Question 1.

What theoretical guarantees can be established for pedigree reconstruction in the context of our generative model when α\alpha is very close to 22? What about when the size of the alphabet Σ\Sigma is finite? Can we analyze more generic models of inheritance where blocks are not inherited i.i.d. from parents?

A more subtle consequence of our generative model is inbreeding, a term we use to refer to the following phenomena: (1) the presence of multiple lowest common ancestors for a pair of extant nodes, and (2) the presence of mated couples such that the two individuals in the couple have a lowest common ancestor (LCA) (see Definition 3.5 for the formal definition of an LCA). The degree of inbreeding qualitatively refers to the frequency of such structures in the pedigree. Moreover, inbreeding as in (2) is mathematically equivalent to having cycles in the pedigree. In general, a higher degree of inbreeding makes the pedigree reconstruction problem more difficult and in some cases information-theoretically impossible (see Section 2.1 for detailed examples). Our choice of model allows for some degree of inbreeding, and our algorithm and analysis are carefully tailored to circumvent this obstacle.

Other assumptions inherent in our model include that the pedigree is graded, i.e., couples are formed from individuals in the same generation, and monogamous: a given individual only mates with one other individual. Furthermore, mutations — errors in the transmission of genetic data from parents to offspring — are a central component in biological applications that our current model does not incorporate.

Question 2.

What theoretical guarantees can be established for reconstruction of pedigrees in generative models with some combination of (i) a higher degree of inbreeding, (ii) mutations, (iii) non-monogamous mating, and (iv) inter-generational mating?

2 Inference challenges and techniques

In this section, we detail some of the challenges posed by the reconstruction of pedigrees constructed from our generative model as well as our techniques and analysis for handling them. To develop some intuition for our strategy, we first illustrate some of the properties of pedigrees using concrete examples.

2.1 Examples: complications from inbreeding

Recall that two individuals u,vu,v that share the same set of parents are siblings. If two individuals share a common subset of grandparents (but not parents), we refer to them as cousins.

k\boxed{k}\boxed{\ell}g\boxed{g}h\boxed{h}i\boxed{i}j\boxed{j}a\boxed{a}b\boxed{b}c\boxed{c}d\boxed{d}e\boxed{e}f\boxed{f}
(a) three sets of grandparents (cousins, one way)
k\boxed{k}\boxed{\ell}g\boxed{g}h\boxed{h}i\boxed{i}j\boxed{j}a\boxed{a}b\boxed{b}c\boxed{c}d\boxed{d}
(b) two sets of grandparents (cousins, two ways)
Figure 1: Simple examples of depth-3 complete pedigrees with a single block. The letters inside the boxes represents the block data. 1(a): The overlap probability is (k=)=18\mathbb{P}(k=\ell)=\tfrac{1}{8}. 1(b): An altered version of 1(a) with only two sets of grandparents, which yields (k=)=14\mathbb{P}(k=\ell)=\frac{1}{4}.

First consider the pedigrees displayed in Fig. 1(a). An important statistic for determining relationships is the correlation between symbols of nodes at the same level. Consider the event EE that the left extant shares the same symbol as the right extant. Note that these two extant nodes are cousins sharing a single set of grandparents. The grandparents are the founders in this example, so we assign to each of them a unique symbol (abcdefa\neq b\neq c\neq d\neq e\neq f). The occurrence of EE implies that k=ck=c or k=dk=d via the left extant receiving a symbol from its right parent; this occurs with probability 12\tfrac{1}{2}. Conditioned on this occurring, the right extant block \ell is the same as kk with probability 14\frac{1}{4}, so the overall probability that both receive the same symbol is 18\frac{1}{8}.

Compare this to the example shown in Fig. 1(b), where the two extant are cousins in two ways (siblings marrying siblings). Note that whichever symbol (out of a,b,c,da,b,c,d) that kk is, the right grandchild receives the same independently with probability 14\tfrac{1}{4}. This is an example of a type of inbreeding where two extant nodes have more than one LCA.

k\boxed{k}\boxed{\ell}e\boxed{e}f\boxed{f}g\boxed{g}h\boxed{h}a\boxed{a}b\boxed{b}
(a) four siblings begetting cousins.
k\boxed{k}\boxed{\ell}e\boxed{e}f\boxed{f}a\boxed{a}b\boxed{b}
(b) two siblings begetting siblings.
Figure 2: Two examples of complete pedigrees with inbreeding. The extants in 2(a) are cousins, yet they have a coincidence of 12\tfrac{1}{2} as if they were generic siblings from unrelated parents. In comparison, 2(b) yields 34\tfrac{3}{4} which exceeds the coincidence of siblings.

The examples in Fig. 2 demonstrate how the correlation between extant nodes is boosted due to the presence of inbreeding. Note that in the generic case where extant siblings have an ancestral pedigree that is a tree, these individuals have a 12\tfrac{1}{2} fraction overlap in their blocks. For comparison, let us compute the probability of coincidence for the two extant nodes in Fig. 2(a). The probability that k=ak=a, for example, is

(k=a)=(e=f=a)+12({e,f}={a,b})=14+(12)2=12.\mathbb{P}(k=a)=\mathbb{P}(e=f=a)+\frac{1}{2}\mathbb{P}\left(\{e,f\}=\{a,b\}\right)=\frac{1}{4}+\left(\frac{1}{2}\right)^{2}=\frac{1}{2}.

Since kk and \ell inherit symbols independently from their grandparents, the overall probability is

(k=)=(k==a)+(k==b)=(12)2+(12)2=12,\mathbb{P}(k=\ell)=\mathbb{P}(k=\ell=a)+\mathbb{P}(k=\ell=b)=\left(\frac{1}{2}\right)^{2}+\left(\frac{1}{2}\right)^{2}=\frac{1}{2},

which is precisely the probability that two generic siblings inherit the same symbol.

The situation in Fig. 2(b) is even more pronounced. The two parents share the same symbol (either aa or bb) with probability 12\tfrac{1}{2} and have different symbols with probability 12\tfrac{1}{2}. This means that the coincidence probability is now 12+12×12=34\tfrac{1}{2}+\tfrac{1}{2}\times\tfrac{1}{2}=\tfrac{3}{4}: their correlation between overlaps is much stronger than that of siblings in the generic case.

From the example in Fig. 2(a), we conclude that the statistical model of extant data parametrized by pedigrees is unidentifiable. Stated another way, it is information-theoretically impossible to distinguish between siblings and inbred cousins using only extant data. Thus, in order for any algorithm to succeed in reconstructing a large fraction of the pedigree using only extant data, it is necessary to bound the amount of inbreeding in the ensemble of pedigrees of interest. We accomplish this using a careful analysis of our generative model.

2.2 Informal analysis of Rec-Gen

In this section, we present a high-level analysis of the Rec-Gen algorithm. Theorem 1.1 states that Rec-Gen yields an accurate reconstruction on 90%90\% of nodes for typical pedigrees from our generative model999We note again that the 90%90\% is for simplicity of exposition, and in reality we can recover an arbitrarily large fraction of nodes. This is made precise in Theorem 6.1.. Note that a formal statement of this theorem, our main result, is given by Theorem 6.1, and a complete proof is contained in the upcoming sections.

Suppose we construct a pedigree 𝒫^t\hat{\mathcal{P}}_{t} on tt generations that, for simplicity of the discussion, exactly matches the true, unknown pedigree 𝒫\mathcal{P} up to generation tt. We show that Collect-Symbols, Test-Siblings, and Assign-Parents applied to 𝒫^t\hat{\mathcal{P}}_{t} provide an accurate reconstruction of 90%90\% of the nodes at generation t+1t+1. In the remainder of this section we give a high-level argument that the output 𝒫^t+1\hat{\mathcal{P}}_{t+1} satisfies the following conditions:

  • (i)

    every individual u^\hat{u} in 𝒫^t+1\hat{\mathcal{P}}_{t+1} can be identified with a unique individual uu in 𝒫\mathcal{P} at generation t+1t+1,

  • (ii)

    at most 10%10\% of the nodes in generation t+1t+1 of 𝒫\mathcal{P} are not identified with an individual in 𝒫^t+1\hat{\mathcal{P}}_{t+1}, and

  • (iii)

    if vv is a child of u^\hat{u} in 𝒫^t+1,\hat{\mathcal{P}}_{t+1}, then vv is a child of uu in 𝒫\mathcal{P}.

Recall that for the purposes of reconstruction, we only have access to the genetic data of the extant.

In this discussion, we refer to three couples c,c,c′′𝒫c,c^{\prime},c^{\prime\prime}\in\mathcal{P} as (mutual) siblings if there exist individuals uc,uc,u\in c,u^{\prime}\in c^{\prime}, and u′′c′′u^{\prime\prime}\in c^{\prime\prime} such that u,u,u,u^{\prime}, and u′′u^{\prime\prime} are mutually siblings. A clique refers to a collection of couples 𝒞={c1,,ck}\mathcal{C}=\{c_{1},\ldots,c_{k}\} such that every triple from 𝒞\mathcal{C} consists of mutual siblings.

The next two facts are essential to the argument.

  1. (A)

    If Collect-Symbols recovers symbol σ\sigma in block bb for a couple cc in generation tt, then cc also has the symbol σ\sigma in block bb in 𝒫\mathcal{P} (5).

  2. (B)

    Collect-Symbols recovers at least 99%99\% of the symbols for at least 99%99\% of the couples in generation tt (Lemma 5.9).

Together, (A) and (B) imply that for 99%99\% of the couples in generation tt, our algorithm gets all of the siblings relationships between these couples correct. To see why, we can use a similar calculation as in the first example of Section 2.1 to conclude that the average overlap between the symbols of three individuals that are mutually siblings is 25%25\%. By concentration of binomial random variables about their means, it follows that with high probability, all triples of individuals that are mutually siblings in 𝒫\mathcal{P} have at least 24.9%24.9\% mutual overlap between their symbols. A simple union bound combined with (A) and (B) implies that for most triples of individuals in generation tt that are mutually siblings, the recovered symbols from Collect-Symbols in those individuals’ corresponding couples have overlap at least 21%21\%. Hence, Test-Siblinghood infers correct siblinghood relationships for a majority of triples.

Moreover, our siblings test on the recovered symbols does not have any false-positives:

  • (C)

    Test-Siblinghood never misclassifies non-siblings as siblings (Lemma 6.5).

The next and last key fact argues that our naive assignment of parents to individuals in cliques as in Assign Parents is in fact the correct assignment in a typical pedigree. This property holds with very high probability over our generative model.

  • (D)

    Let 𝒞𝒫\mathcal{C}\subset\mathcal{P} denote a clique at generation tt in the true pedigree. Then there exists a couple c~\tilde{c}, which we refer to as the parents of 𝒞\mathcal{C}, in generation t+1t+1 of 𝒫\mathcal{P} that has exactly one child in every couple of 𝒞\mathcal{C}, and no other couple has more than 11 child in 𝒞\mathcal{C} (Lemma 4.8).

Together, (A), (B), (C), and (D) imply that our reconstruction criteria (i), (ii), and (iii) from the beginning of this section hold, as we now justify. Recall that we already showed (A) and (B) imply that we classify a large fraction of the couples at generation tt correctly as siblings. Moreover, part (C) and the transitivity of siblinghood in 𝒫\mathcal{P} imply that cliques in our reconstruction really correspond to cliques in the truth. By part (D) such cliques have unique parents. Thus, for (i), we identify newly constructed couples u^𝒫t+1\hat{u}\in\mathcal{P}_{t+1} with the unique parents u𝒫u\in\mathcal{P} of the clique formed by the children of u^\hat{u}, further pairing the two individuals in uu with those in u^\hat{u} arbitrarily. With this identification, (iii) follows immediately. To show part (ii), later in the paper we give a sufficient condition for a couple at generation tt to have 99%99\% of its symbols collected by Collect-Symbols as in (B) (see Lemma 5.9). Then we show that 90%90\% of individuals in generation t+1t+1 have children in such couples (see Proposition 5.8), which proves part (ii). Essentially, this sufficient condition amounts to saying that a couple cc at generation tt has no inbreeding (cycles) above or below it (i.e. among its ancestors or descendants, respectively) and that the pedigree of descendants of cc contains a α/4\alpha/4-ary tree (see Definition 5.4).

2.3 Motivation for using triples

It is tempting to employ a seemingly simpler recursive scheme than the one described in Section 1.5 that operates on pairs instead of triples. As an example, consider an alternative recursive procedure such that:

  1. 1.

    Collect-Symbols only uses pairs of extant descendants to recover symbols of a couple c,c,

  2. 2.

    Test-Siblinghood considers only pairs of couples at generation tt and detects them to be siblings if their strings overlap by at least 49%49\%, and

  3. 3.

    Assign-Parents assigns parents to individuals in maximal collections 𝒞\mathcal{C} such that every pair of couples is (tested as) siblings.

Unfortunately, this simpler approach encounters two major technical complications.

First, working with a pairwise siblings test introduces a problem for the step of assigning parents. Define a pairwise clique to be a collection of couples so that every pair of couples passes the pairwise siblings test. With high probability, it turns out in every generation there exist a constant number of pairwise cliques that are not explained in the naive way of assigning to this clique parents that have precisely one child per couple. In particular, in the true pedigree 𝒫\mathcal{P} it is possible to have three couples that mutually pass the pairwise siblings test, yet there are three distinct parent couples each having precisely two children among these three couples. See Fig. 3 for an illustration. This type of structure, though rare, occurs a constant number of times in each generation, and thus introduces inherent errors in our reconstruction that accumulate at every step of iteration.

Figure 3: An undesirable subpedigree, where three child couples have mutual siblingship, but they do not mutually share a parent couple.

A second problem caused by working with pairs arises in the step of collecting symbols. The pairwise version of our algorithm assigns a symbol to a couple if that symbol occurs in two extant descendants that are descended from distinct children of that couple. In our generative model, it turns out that with high probability there are a logarithmic number of pairs of extant nodes that have at least two LCA’s. For such pairs, the pairwise algorithm does not accurately assign symbols to their reconstructed ancestors. Similar to the previous issue, these errors snowball and make the analysis for proving Theorem 1.1 very difficult.

On the other hand, working with an algorithm using triples as described in Section 1.5 makes for a much cleaner analysis and nicer reconstruction guarantee. This innovation circumvents the technical complications of the pairwise version because every clique (recall that this is a collection of couples where every triple consists of mutual siblings) can be explained in a naive way (Lemma 4.8), and in our generative model every triple of extant individuals descended from distinct children of a given ancestor have that ancestor as their unique LCA with very high probability (Lemma 4.10).

2.4 Outline of technical arguments

The remainder of the paper, which provides a formal proof of Theorem 1.1, is divided into four parts.

  • Section 3 provides preliminary definitions and a formal definition of our generative model.

  • Section 4 proves important properties about the typical network structure of pedigrees from our generative model.

  • Definition 5.2 proves important properties about the block statistics of the extant nodes in a typical pedigree from our generative model.

  • Section 6 gives a precise description of Rec-Gen and provides a formal statement and proof of Theorem 1.1.

Specifically, in Section 4 we rigorously quantify the degree of inbreeding in typical pedigrees from our model by counting the number of collisions (see Definition 4.1 and Lemma 4.4). This has several useful consequences, including that every clique has a unique parent (fact (D) from Section 2.2, also see Lemma 4.8) and that the extant individuals used in Collect-Symbols have a unique LCA (see Lemma 4.10). In particular, the latter is key to showing fact (A) from Section 2.2.

In Definition 5.2, we provide a definition (see Definition 5.4) that essentially characterizes the individuals in 𝒫\mathcal{P} that are reconstructible via Rec-Gen. We show that couples involving such individuals, referred to as awesome couples, transmit many of their symbols to the extant, with high probability (see Lemma 5.9). In particular, awesome couples have at least 99%99\% of their symbols recovered by Collect-Symbols (fact (B) from Section 2.2). We also prove an important result for our siblings test: triples of individuals that are not mutually siblings have mutually overlap at most 19%19\% (see Lemma 5.2). This combined with fact (A) from Section 2.2 essentially shows that Test-Siblinghood never classifies non-siblings as siblings (fact (C) from Section 2.2, see also Lemma 6.5).

Our final section, Section 6 ties everything together, following fairly closely the high-level argument presented in Section 2.2 to prove the formal version of Theorem 1.1.

3 Preliminaries

3.1 Key definitions and terms

Definition 3.1.

A pedigree 𝒫=(V,E)\mathcal{P}=(V,E) is a directed acyclic graph (DAG) with vertices VV and edges EE where every vertex has indegree at most 22. The collection of vertices of indegree zero are referred to as the founders, and the collection of vertices of outdegree zero are referred to as the extant.

Definition 3.2.

If the indegree of each vertex in the underlying DAG is either 22 or 0, then 𝒫\mathcal{P} is called a complete pedigree.

In this work, we focus on a special family of complete pedigrees that are both graded and monogamous.

Definition 3.3.

𝒫\mathcal{P} is said to be graded if the vertices V(𝒫)V(\mathcal{P}) can be partitioned into i=0TVi(𝒫)\bigcup_{i=0}^{T}V_{i}(\mathcal{P}) such that VT(𝒫)V_{T}(\mathcal{P}) are the founders, V0(𝒫)V_{0}(\mathcal{P}) are the extant, and all directed paths eT,,e1e_{T},\ldots,e_{1} from VT(𝒫)V_{T}(\mathcal{P}) to V0(𝒫)V_{0}(\mathcal{P}) can be written as a sequence of edges et=(vtvt1)e_{t}=(v_{t}\to v_{t-1}) where vtVt(𝒫)v_{t}\in V_{t}(\mathcal{P}) and vt1Vt1(𝒫)v_{t-1}\in V_{t-1}(\mathcal{P}) for each tt. The founders’ index TT is the depth of the pedigree.

𝒫\mathcal{P} is said to be monogamous if for every vertex uu of outdegree >0>0, there exists a unique vertex uu^{\prime} such that (uv)E(uv)E(u\to v)\in E\iff(u^{\prime}\to v)\in E. The unordered pair {u,u}\{u,u^{\prime}\} is referred to as a couple.

We assume that every non-extant individual in the pedigree is in a couple, and so the number of vertices at each non-extant level is even. This assumption is effectively without loss of generality—if an individual is not in a couple, then it has no descendants, and so we cannot recover information about this individual or even its existence.

An example of a complete, graded, monogamous pedigree is shown in Fig. 1(a). In our model, symbols are passed down from parents to children in a completely symmetric way. Thus, given the data of the children, it is impossible to distinguish the owner of each symbol from amongst the two parents. The goal of this paper is to show how one can provably infer the structure of a complete pedigree from extant genetic data via the reconstruction of the ancestral symbols, modulo block phasing (determining which symbol belongs to which parent for each block). Therefore, we introduce the following version of a pedigree which condenses this information.

Definition 3.4.

A coupled pedigree 𝒬=(V𝒬,E𝒬)\mathcal{Q}=(V_{\mathcal{Q}},E_{\mathcal{Q}}) induced by a complete, monogamous pedigree 𝒫=(V𝒫,E𝒫)\mathcal{P}=(V_{\mathcal{P}},E_{\mathcal{P}}) is defined as follows:

  • V𝒬(V𝒫2)V_{\mathcal{Q}}\subset\binom{V_{\mathcal{P}}}{2} is obtained by merging couples c={u,u}V𝒫c=\{u,u^{\prime}\}\subset V_{\mathcal{P}} into a single node (extant individuals remain singletons), introducing edge multiplicity.

  • E𝒬E_{\mathcal{Q}} is the result of halving the number of resulting copies of each edge after merging couples.

k\boxed{k}\boxed{\ell}{g,h}\boxed{\{g,h\}}{i,j}\boxed{\{i,j\}}{a,b}\boxed{\{a,b\}}{c,d}\boxed{\{c,d\}}{e,f}\boxed{\{e,f\}}
(a) coupled version of 1(a)
k\boxed{k}\boxed{\ell}{g,h}\boxed{\{g,h\}}{i,j}\boxed{\{i,j\}}{a,b}\boxed{\{a,b\}}{c,d}\boxed{\{c,d\}}
(b) coupled version of 1(b)
Figure 4: 1(a) induces coupled pedigree 4(a), while 1(b) induces 4(b).

In particular, a coupled pedigree is also a pedigree. Examples are drawn in Fig. 4 in relation to Fig. 1, where the complete pedigree 1(a) induces a coupled pedigree 4(a) and 1(b) induces 4(b).

The only information that is lost after transforming a complete, monogamous pedigree into a coupled pedigree is the block phasing. Indeed, observe that given the coupled structure 𝒬=(V𝒬,E𝒬)\mathcal{Q}=(V_{\mathcal{Q}},E_{\mathcal{Q}}), one can easily obtain the individual structure 𝒫=(V𝒫,E𝒫)\mathcal{P}=(V_{\mathcal{P}},E_{\mathcal{P}}) up to block phasing as follows: (1) add the extant individuals in V0V𝒬V_{0}\subset V_{\mathcal{Q}} to V𝒫V_{\mathcal{P}}, (2) for every non-extant node cV𝒬c\in V_{\mathcal{Q}} add individuals uc,ucu_{c},u_{c}^{\prime} to V𝒫V_{\mathcal{P}}, and (3) given parents c1c_{1} and c2c_{2} of cc in 𝒬\mathcal{Q}, add the four edges uc1uc,uc1uc,uc2uc,uc2ucu_{c_{1}}\to u_{c},u_{c_{1}}^{\prime}\to u_{c},u_{c_{2}}\to u_{c}^{\prime},u_{c_{2}}^{\prime}\to u_{c}^{\prime} to E𝒫E_{\mathcal{P}}. In addition, if 𝒫\mathcal{P} is graded, 𝒬\mathcal{Q} retains a graded structure V𝒬=V0(𝒬)VT(𝒬)V_{\mathcal{Q}}=V_{0}(\mathcal{Q})\cup\cdots\cup V_{T}(\mathcal{Q}) so that V0(𝒬)V_{0}(\mathcal{Q}) are the extant nodes and V1(𝒬),,VT(𝒬)V_{1}(\mathcal{Q}),\ldots,V_{T}(\mathcal{Q}) are depth-graded couple nodes. In particular, the graph structure of an individuals pedigree 𝒫\mathcal{P} uniquely determines the graph structure of its associated coupled pedigree 𝒬\mathcal{Q} and vice versa.

Given the previous discussion, since our goal is to recover the graph structure of an underlying true pedigree 𝒫\mathcal{P} given gene sequences of a large number of extant individuals, it suffices to reconstruct the associated coupled pedigree 𝒬\mathcal{Q}.

Furthermore, since the graph underlying a pedigree is a DAG, given a subset SS of the pedigree, it is natural to consider the notion of “ancestors” (nodes 𝖺𝗇𝖼(S)\mathsf{anc}(S) from which there is a directed path to SS) and “descendants” (nodes 𝖽𝖾𝗌𝖼(S)\mathsf{desc}(S) to which there is a directed path from SS). Also for simplicity, we stipulate that every node vv is both a descendant and an ancestor of itself, i.e., v𝖺𝗇𝖼(v)v\in\mathsf{anc}(v) and v𝖽𝖾𝗌𝖼(v)v\in\mathsf{desc}(v). Since the indegree of each node can be more than one, it is possible for two nodes to have more than one “lowest common ancestor”. We define this now.

Definition 3.5 (Lowest Common Ancestors).

Let SS denote a set of nodes in a pedigree 𝒫\mathcal{P}. The set of lowest common ancestors of SS, denoted 𝖫𝖢𝖠(S)\mathsf{LCA}(S), consists of all nodes u𝒫u\in\mathcal{P} such that uu is an ancestor of every node in SS, and moreover, no descendant of uu is an ancestor of every node in SS.

During our analysis, we often restrict our attention to the information that the pedigree contains about the ancestors or descendants of a particular collection of nodes. In particular, we want to exploit (sub)structures that are not too intertwined. The following definitions make these ideas precise:

Definition 3.6 (Subpedigrees).

Let WV𝒫W\subset V_{\mathcal{P}} denote a subset of nodes of pedigree 𝒫\mathcal{P}. The subgraph of (V𝒫,E𝒫)(V_{\mathcal{P}},E_{\mathcal{P}}) induced by WW is itself a pedigree, which we call the subpedigree of 𝒫\mathcal{P} induced by WW.

Definition 3.7 (Ancestral pedigrees).

Let WkVk(𝒫)W_{k}\subset V_{k}(\mathcal{P}) denote a subset of vertices at level kk of a graded pedigree 𝒫\mathcal{P}. The subpedigree induced by Wk𝖺𝗇𝖼(Wk)W_{k}\cup\mathsf{anc}(W_{k}) is the (level kk) ancestral subpedigree of 𝒫\mathcal{P} induced by WkW_{k}.

Definition 3.8 (Descendant pedigrees).

Let WkVk(𝒫)W_{k}\subset V_{k}(\mathcal{P}) denote a subset of vertices at level kk of a graded pedigree 𝒫\mathcal{P}. The subpedigree induced by Wk𝖽𝖾𝗌𝖼(Wk)W_{k}\cup\mathsf{desc}(W_{k}) is the (level kk) descendant subpedigree of 𝒫\mathcal{P} induced by WkW_{k}.

Definition 3.9 (Tree pedigrees).

A pedigree 𝒫\mathcal{P} that has no undirected cycles (when the directions of the edges in E𝒫E_{\mathcal{P}} are ignored) is called a tree pedigree.

Note that coupled pedigrees can have edges of multiplicity two, though only in the case where two siblings form a coupled node, which a rare structure in our generative model. In coupled pedigrees, we consider a double edge to be an undirected cycle of length two. Hence, a tree pedigree consists entirely of simple or multiplicity 1 edges.

As we demonstrate (e.g. Lemma 5.2), coupled tree pedigrees exhibit a type of correlation decay between blocks that enable us to perform inference on the structure. In contrast, non-tree coupled pedigrees correspond to pedigrees with inbreeding, which can arise in nature and appear in our probabilistic model as well. Section 2.1 illustrates examples of such structures. These types of structures introduce challenges for performing inference under our generative model.

3.2 Siblings in a pedigree

Note that siblinghood is a transitive relationship: if u,vu,v are siblings and v,wv,w are siblings, then so are u,wu,w. As alluded to in Section 2.3, it is important to look at these relationships in triplets. We now detail how one can encode this information as a 3-uniform hypergraph.

Definition 3.10.

A 3-uniform hypergraph is a pair (V,E)(V,E) of vertices and a multiset of edges, so that each edge is an unordered triple {u,v,w}\{u,v,w\} of vertices in VV.

Definition 3.11.

Let 𝒫\mathcal{P} be a coupled pedigree of depth TT (each non-extant node is a set of a pair of individuals). The siblinghood hypergraph GkG_{k} of 𝒫\mathcal{P} at level k>0k>0 is the 3-uniform hypergraph that describes the three-way sibling relationships of its level-kk members. For every triple e={c1,c2,c3}e=\{c_{1},c_{2},c_{3}\}, the edge multiplicity n(e;Gk)n(e;G_{k}) is

n(e;Gk)={0if (u1,u2,u3)c1×c2×c3 such that u1,u2,u3 are siblings1if  unique (u1,u2,u3)c1×c2×c3 such that u1,u2,u3 are siblings2elsen(e;G_{k})=\begin{cases}0&\text{if }\not\exists\;(u_{1},u_{2},u_{3})\in c_{1}\times c_{2}\times c_{3}\text{ such that $u_{1},u_{2},u_{3}$ are siblings}\\ 1&\textit{if }\exists\text{ unique }(u_{1},u_{2},u_{3})\in c_{1}\times c_{2}\times c_{3}\text{ such that $u_{1},u_{2},u_{3}$ are siblings}\\ 2&\text{else}\end{cases}

The siblinghood hypergraph G0G_{0} is defined similarly, by considering each extant individual uu as a degenerate (cardinality 1) couple cu={u}c_{u}=\{u\} and applying the above definition (Each hyperedge appears zero or once, never twice).

Recall that a clique in a 3-uniform hypergraph is a collection of vertices such that all possible triplets form an edge. The next statement is an observation that follows from the definition of GkG_{k} and the transitivity of siblinghood.

Proposition 3.1.

If c1,,cmc_{1},\ldots,c_{m} are level-kk couples that respectively contain individuals u1,,umu_{1},\ldots,u_{m} which are siblings, then c1,,cmc_{1},\ldots,c_{m} form a clique in GkG_{k}.

3.3 Probability Tools

We denote a Poisson distribution with mean λ\lambda as 𝖯𝗈𝗂𝗌(λ)\mathsf{Pois}(\lambda). We use some basic tools from probability theory in our proof. The first is referred to in literature as Poisson thinning, see e.g. [Lal16].

Proposition 3.2 (Poisson Thinning).

Let N𝖯𝗈𝗂𝗌(λ)N\sim\mathsf{Pois}(\lambda), and let X1,X2,X_{1},X_{2},\ldots be iid Ber(pp) random variables that are independent of NN. Then X=i=1NXiX=\sum_{i=1}^{N}X_{i} is 𝖯𝗈𝗂𝗌(λp)\mathsf{Pois}(\lambda p)-distributed.

Second, we recall that sums of Poisson random variables are themselves Poissons:

Proposition 3.3.

Fix N>0N>0 and let X1,X2,,XNX_{1},X_{2},\ldots,X_{N} be iid 𝖯𝗈𝗂𝗌(λ)\mathsf{Pois}(\lambda) random variables. Then X=i=1NXiX=\sum_{i=1}^{N}X_{i} is 𝖯𝗈𝗂𝗌(λN)\mathsf{Pois}(\lambda N)-distributed.

Third, we will invoke the following standard variants of the Chernoff–Hoeffding bounds for sums of Bernoulli random variables:

Theorem 3.4 (Bernoulli tail probability (Chernoff–Hoeffding Bounds)).

Let X=i=1NXiX=\sum_{i=1}^{N}X_{i} be a sum of independent Bernoulli(pi)(p_{i}) random variables. Let μ=𝔼[X]\mu=\mathbb{E}[X]. Then

(X>(1+δ)μ)\displaystyle\mathbb{P}(X>(1+\delta)\mu) exp(δ22+δμ)\displaystyle\leq\exp\left(-\frac{\delta^{2}}{2+\delta}\mu\right)
(X<(1δ)μ)\displaystyle\mathbb{P}(X<(1-\delta)\mu) exp(δ22μ)\displaystyle\leq\exp\left(-\frac{\delta^{2}}{2}\mu\right)
(|Xμ|>γN)\displaystyle\mathbb{P}(|X-\mu|>\gamma N) 2exp(2Nγ2).\displaystyle\leq 2\exp(-2N\gamma^{2}).

Lastly and in the same spirit as the Chernoff–Hoeffding bound, we will use the fact that Poisson distributions also have sub-exponential tails.

Proposition 3.5 (Poisson tail probability).

Let X𝖯𝗈𝗂𝗌(λ)X\sim\mathsf{Pois}(\lambda). Then for any x>0x>0, we have

(|Xλ|x)2exp(x22(λ+x))\mathbb{P}(|X-\lambda|\geq x)\leq 2\exp\left(-\frac{x^{2}}{2(\lambda+x)}\right)

For a proof, refer to Chapter 2 of [Pol15].

4 Structure of Poisson Pedigrees

4.1 Model Description

We now describe our simple model for generating a population and its genetic data. The model is best viewed in two stages. In the first stage, we generate the population as well as the pedigree topology 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}} on these individuals, and in the second stage, we generate the genetic data given this pedigree structure. Note that the random individual pedigree 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}} constructed below is graded, monogamous, and complete.

Part I: Pedigree topology

  1. 1.

    To generate 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}}, start with NT=NN_{T}=N founding individuals in VTV_{T} and make an arbitrary maximum matching of these individuals to create a set of mated couples. For each couple, generate an independent Pois(α)\text{Pois}(\alpha) number of children, where α>0\alpha>0 is a fixed parameter throughout the entire pedigree. These newly generated individuals form the nodes in VT1V_{T-1}.

  2. 2.

    Repeat the above process to generate the individuals in VT2,,V0V_{T-2},\dots,V_{0}.

Once we have the population and pedigree structure as above, we generate the genetic data in the following manner.

Part II: Inheritance procedure

  1. 1.

    Each individual uu in 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}} has a length-BB string σu\sigma_{u} (uu’s gene sequence). The string’s indices are referred to as blocks.

  2. 2.

    For each founding individual uu in VTV_{T} and for each block b[B]b\in[B], each σu(b)\sigma_{u}(b) is drawn i.i.d. uniformly from an alphabet Σ\Sigma. For our model, Σ\Sigma is an infinite-sized alphabet: we simply require that each block of each founder has a unique symbol.

  3. 3.

    Every other individual vv in the population has exactly two parents ff and mm. Conditioned on σf\sigma_{f} and σm\sigma_{m}, independently over [B][B], the iith block of vv copies σf(i)\sigma_{f}(i) with probability 0.50.5 and σm(i)\sigma_{m}(i) with probability 0.50.5.

Remark 1.

We adopt the following conventions in the remainder of the paper.

  1. 1.

    We let 𝒫\mathcal{P} denote the coupled pedigree induced (see Definition 3.4) by the randomly generated individual pedigree 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}} constructed in Part I above.

  2. 2.

    We use the term coupled node, or simply node when the context is clear, to refer to a vertex of 𝒫\mathcal{P}. We use the term individual to refer to an element of 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}} contained in a coupled node of 𝒫\mathcal{P}. Unless otherwise explicitly noted, parent-child relationships are taken according to the structure of the coupled pedigree 𝒫\mathcal{P}. That is, given u,v𝒫u,v\in\mathcal{P} we use the phrase, “uu is a child of vv,” to mean that the couple uu contains an individual who is an offspring of the mated couple vv. Finally, we say that coupled nodes u,v𝒫u,v\in\mathcal{P} are siblings if uu and vv contain individuals who are siblings in 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}}.

  3. 3.

    \mathbb{P} denotes the probability measure over the randomly generated pedigree 𝒫\mathcal{P} as well as the random inheritance procedure.

To given an example of our terminology, there are two individuals in a non-extant coupled node. Each individual is a vertex of 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}}, and together they form a coupled node, which is a vertex of 𝒫\mathcal{P}. Note that as an artifact of our definitions, extant individuals are both coupled nodes and individuals in 𝒫\mathcal{P}. Moreover extant nodes have exactly one parent in 𝒫\mathcal{P} given by the coupled node containing the individuals comprising that extant individuals biological parents, as determined by our generative model.

To further emphasize the previous remark, recall that by the discussion in Section 3.1, there is a unique correspondence between coupled pedigrees and individual pedigrees. Hence, it suffices to give a (partial) reconstruction 𝒫^\hat{\mathcal{P}} of 𝒫\mathcal{P} to (partially) reconstruct the original individual pedigree 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}}. Thus the content of our main result Theorem 6.1 and the remainder of this paper primarily work with the coupled pedigree 𝒫\mathcal{P}.

Parameters:

For convenience, we collect the various parameters of interest here.

Parameter Description Value
NN Size of founding population
BB Number of blocks for each individual Θ(log(N))\Theta(\log(N))
α\alpha Expected # of children per couple Θ(1)\Theta(1)
TT Number of generations in population εlog(N)\varepsilon\log(N), ε=O(1/log(α))\varepsilon=O(1/\log(\alpha))
|Σ||\Sigma| Size of block alphabet \infty

We set B=O(log(N))B=O(\log(N)) for a sufficiently large constant. The expected number of children per couple, α\alpha, will be set to a sufficiently large constant that is at least 3. Finally, the number of generations TT will be set to εlog(N)\varepsilon\log(N), where ε>0\varepsilon>0 is sufficiently small with respect to 1/log(α)1/\log(\alpha).

4.2 Concentration bounds and upper bounds on inbreeding

In this section we quantify the degree of inbreeding in 𝒫\mathcal{P}. To do so, we first describe an alternative description of our generative model. An equivalent procedure for constructing the coupled pedigree structure 𝒫\mathcal{P} is to (1) sample the generation sizes according to Poisson random variables with appropriate parameters, (2) pair up individuals in each generation at random into coupled nodes, and (3) have coupled nodes choose two parent coupled nodes at random from the previous generation. This is described formally below.

Lemma 4.1.

The (coupled) pedigree 𝒫\mathcal{P} described in Section 4.1 can be equivalently viewed as follows:

  1. 1.

    Let NT:=NN_{T}:=N be the size of the founding population. For ii from TT to 11: Let Ni=defNi/22N_{i}^{\prime}\stackrel{{\scriptstyle def}}{{=}}\lfloor N_{i}/2\rfloor\cdot 2 be the number of individuals in couples, and sample Ni1𝖯𝗈𝗂𝗌(αNi/2)N_{i-1}\sim\mathsf{Pois}(\alpha N_{i}^{\prime}/2).

  2. 2.

    For each level ii, match the individuals at level ii randomly, leaving out a single individual if NiN_{i} was odd.

  3. 3.

    For each level ii, sample a vector v[Ni/2]Ni1\vec{v}\in[N_{i}^{\prime}/2]^{N_{i-1}} from a Multinomial distribution with parameters

    (Ni1,(2/Ni,,2/Ni)).(N_{i-1},(2/N_{i}^{\prime},\ldots,2/N_{i}^{\prime})).

    For any k[Ni/2]k\in[N_{i}^{\prime}/2], the set of coordinates {j:vj=k}\{j:v_{j}=k\} are interpreted as children of the kthk^{th} couple at level ii (and are therefore siblings at level i1i-1).

  4. 4.

    Convert the resulting pedigree on individuals from steps 1–3 to a coupled pedigree 𝒫\mathcal{P}.

Proof.

The number of vertices at each level in the statement of Lemma 4.1 is the same as the model in Section 4.1. This follows by induction. The number of founding vertices NN is the same in both models. In the model in Section 4.1, the number of individuals at level i1i-1 is distributed as j=1Ni/2Xj\sum_{j=1}^{N_{i}^{\prime}/2}X_{j}, where the XjX_{j} are iid 𝖯𝗈𝗂𝗌(α)\mathsf{Pois}(\alpha) and NiN_{i}^{\prime} is the number of individuals at level ii that are matched. The value of this sum is distributed as 𝖯𝗈𝗂𝗌(αNi/2)\mathsf{Pois}(\alpha N_{i}^{\prime}/2) (due to Proposition 3.3), the same as in the statement Lemma 4.1.

The random matching in Step 2 of Lemma 4.1 is the same as the matching in Section 4.1.

The final step in the process above assigns individuals in Vi1V_{i-1} to parents in ViV_{i} by sampling a vector v\vec{v} of length Ni1N_{i-1} with entries in [Ni/2][N_{i}^{\prime}/2] from a multinomial distribution and assigning individuals to parents based on these labels. Indeed, if we look at the number of children of a fixed couple (say, the jthj^{th} couple in ViV_{i}), this is distributed as 𝖡𝗂𝗇(X,2/Ni)\mathsf{Bin}(X,2/N_{i}^{\prime}), where X𝖯𝗈𝗂𝗌(αNi/2)X\sim\mathsf{Pois}(\alpha N_{i}^{\prime}/2). By Poisson thinning (Proposition 3.2), this distribution is simply 𝖯𝗈𝗂𝗌(α)\mathsf{Pois}(\alpha), which is exactly the distribution of the number of children of the jthj^{th} couple in Section 4.1. ∎

Next we use tail bounds on Poisson random variables to show that the sizes of each level are well-concentrated with high probability, assuming a sufficiently large size of the initial population. Recall that NiN_{i} denotes the number of individuals in generation ii.

Lemma 4.2 (Concentration of generations).

Fix δ\delta such that 0<δ<α/210<\delta<\alpha/2-1, and suppose that the founding population size NN is at least α/δ+1\alpha/\delta+1. Then, for some constant C1=C1(δ)C_{1}=C_{1}(\delta), with probability at least 1Texp(C1αN)1-T\exp(-C_{1}\alpha N) we have that, for all i{0,,T1}i\in\{0,\ldots,T-1\}

(α/2δ)Ni+1Ni(α/2+δ)Ni+1.(\alpha/2-\delta)N_{i+1}\leq N_{i}\leq(\alpha/2+\delta)\cdot N_{i+1}. (1)
Remark 2.

An immediate corollary of this result is that

(α/2δ)iNNTi(α/2+δ)iN(\alpha/2-\delta)^{i}\cdot N\leq N_{T-i}\leq(\alpha/2+\delta)^{i}\cdot N (2)

for each iTi\leq T with high probability.

Proof of Lemma 4.2.

Our goal is to upper bound the right-hand-side of

[some Nj fails Eq. 1]i=0T1[Nifails Eq. 1|Ni+1satisfies Eq. 2]\mathbb{P}[\text{some $N_{j}$ fails\leavevmode\nobreak\ \lx@cref{creftype~refnum}{eq:concentration}}]\leq\sum_{i=0}^{T-1}\mathbb{P}[N_{i}\,\,\text{fails\leavevmode\nobreak\ \lx@cref{creftype~refnum}{eq:concentration}}\,|\,N_{i+1}\,\,\text{satisfies\leavevmode\nobreak\ \lx@cref{creftype~refnum}{eq:concentration-1}}]

and so it suffices to show

[Nifails Eq. 1|Ni+1satisfies Eq. 2]2exp(Θ(α2(N1)/(α+δ))).\mathbb{P}[N_{i}\,\,\text{fails\leavevmode\nobreak\ \lx@cref{creftype~refnum}{eq:concentration}}\,|\,N_{i+1}\,\,\text{satisfies\leavevmode\nobreak\ \lx@cref{creftype~refnum}{eq:concentration-1}}]\leq 2\exp(-\Theta(\alpha^{2}(N-1)/(\alpha+\delta))).

Consider fixing the number of individuals at level i+1i+1 to be an arbitrary number Ni+1N_{i+1} satisfying Eq. 2. We know that the number of individuals at level ii is distributed as Ni𝖯𝗈𝗂𝗌(αNi+1/2)N_{i}\sim\mathsf{Pois}(\alpha N_{i+1}^{\prime}/2). By applying the Poisson tail bound Proposition 3.5, we see that

[|NiαNi+1/2|>(δ/2)Ni+1Ni+1satisfies Eq. 2]\displaystyle\mathbb{P}\left[|N_{i}-\alpha N_{i+1}^{\prime}/2|>(\delta/2)N_{i+1}^{\prime}\mid N_{i+1}\,\,\text{satisfies\leavevmode\nobreak\ \lx@cref{creftype~refnum}{eq:concentration-1}}\right] (3)
<2exp((αNi+1/2)22(α/2+δ/2)Ni+1)\displaystyle<2\exp\left(\frac{-(\alpha N_{i+1}^{\prime}/2)^{2}}{2(\alpha/2+\delta/2)N_{i+1}^{\prime}}\right)
<2exp(αNi+14(1+δ))\displaystyle<2\exp\left(-\alpha\frac{-N_{i+1}^{\prime}}{4(1+\delta)}\right) (4)

We now claim that |NiαNi+1/2|>δNi+1|N_{i}-\alpha N_{i+1}/2|>\delta N_{i+1} implies that |NiαNi+1/2|>(δ/2)Ni+1|N_{i}-\alpha N_{i+1}^{\prime}/2|>(\delta/2)N_{i+1}^{\prime}, which follows from the facts that |Ni+1Ni+1|1|N_{i+1}-N_{i+1}^{\prime}|\leq 1 and that Ni+1NN_{i+1}\geq N (Eq. 2). Namely, assume that Ni>(α/2+δ)Ni+1N_{i}>(\alpha/2+\delta)N_{i+1}. Then Ni>(α/2+δ/2)Ni+1N_{i}>(\alpha/2+\delta/2)N_{i+1}^{\prime}, since Ni+1Ni+1N_{i+1}\geq N_{i+1}^{\prime}. Now assume instead that Ni<(α/2δ)Ni+1N_{i}<(\alpha/2-\delta)N_{i+1}. Then

Ni\displaystyle N_{i} <(α/2δ)Ni+1\displaystyle<(\alpha/2-\delta)N_{i+1}
(α/2δ)(Ni+1+1)\displaystyle\leq(\alpha/2-\delta)(N_{i+1}^{\prime}+1)
(α/2δ/2)(Ni+1)\displaystyle\leq(\alpha/2-\delta/2)(N_{i+1}^{\prime})

where in the last line we use the fact that (δ/2)Ni+1(δ/2)(N1)α/2(\delta/2)N_{i+1}^{\prime}\geq(\delta/2)(N-1)\geq\alpha/2.

Hence, we get that

[|NiαNi+1/2|>δNi+1Ni+1satisfies Eq. 2]2exp(α[N14(1+δ)])\mathbb{P}[|N_{i}-\alpha N_{i+1}/2|>\delta N_{i+1}\mid N_{i+1}\,\,\text{satisfies\leavevmode\nobreak\ \lx@cref{creftype~refnum}{eq:concentration-1}}]\leq 2\exp\left(-\alpha\left[\frac{N-1}{4(1+\delta)}\right]\right)

where we use the fact that Ni+1NN_{i+1}\geq N since Ni+1N_{i+1} satisfies Eq. 2. ∎

Remark 3 (Dependence on δ\delta).

The strategy from this point onwards is to condition on the event from Eq. 1. Since this event fails with probability that is exponentially small in NN, we lose only an additive exp(cδαN)\exp(-c_{\delta}\alpha N) probability.

As mentioned in Section 2.1, two nodes may have significantly higher amounts of symbol overlap caused by inbreeding in their ancestral pedigree than would be expected given their distance in the pedigree. This can cause us to reconstruct an incorrect pedigree if we attempt to explain the symbol overlap without accounting for inbreeding; for instance, we may see two nodes and think they are siblings, when in reality they are cousins with inbreeding in their family tree (see Section 2.1 for a detailed example). To formally connect different patterns of inbreeding with the amount of spurious symbol overlap they cause, we introduce the notion of collisions in an ancestral pedigree. Roughly speaking, triples of coupled nodes with relatively few collisions in their ancestral pedigree do not have many spurious overlaps, which we prove in Definition 5.2. We first define collisions and then bound the number that occur under our probabilistic assumptions in Lemma 4.4. We also give an alternative characterization of collisions in Lemma 4.3 that is useful later.

Definition 4.1 (Collisions).

Let 𝒫\mathcal{P} denote a coupled pedigree. Fix a subset of nodes AVk(𝒫)A\subset V_{k}(\mathcal{P}), where kTk\neq T. If k>0k>0, we say that this collection has zz collisions at level k+1k+1 if the set of parents of AA in 𝒫\mathcal{P} has size 2|A|z2|A|-z. If k=0k=0, we say that it has zz collisions at level 11 if the set of parents in 𝒫\mathcal{P} has size |A|z|A|-z. Write

𝖼𝗈𝗅𝗅k+1(A):=(# collisions at level k+1 in A)\mathsf{coll}_{k+1}(A):=(\text{\# collisions at level }k+1\text{ in }A)

Extend the notion of collisions to ancestral subgraphs as follows. If we have nodes u1,,uJVk(𝒫)u_{1},\ldots,u_{J}\in V_{k}(\mathcal{P}), the number of collisions between the ancestral subpedigrees 𝖺𝗇𝖼(uj)\mathsf{anc}(u_{j}) for j=1,,Jj=1,\ldots,J is equal to

𝖼𝗈𝗅𝗅(u1,,uJ):=i=0Tk1𝖼𝗈𝗅𝗅i+1(𝖺𝗇𝖼i(u1)𝖺𝗇𝖼i(uJ))\mathsf{coll}(u_{1},\ldots,u_{J}):=\sum_{i=0}^{T-k-1}\mathsf{coll}_{i+1}(\mathsf{anc}_{i}(u_{1})\cup\cdots\cup\mathsf{anc}_{i}(u_{J}))

where 𝖺𝗇𝖼i(uj)\mathsf{anc}_{i}(u_{j}) denotes the set of ancestors ii levels above uju_{j}.

Lemma 4.3 (Ancestral collisions, alternate characterization).

Let u1,,uJu_{1},\ldots,u_{J} denote a set of nodes that are all at the same level. Consider the subpedigree 𝒯=anc(u1,,uJ)\mathcal{T}=\text{anc}(u_{1},\ldots,u_{J}). Let kjk_{j} denote the number of nodes in 𝒯\mathcal{T} that have outdegree jj in the subpedigree 𝒯\mathcal{T}. Then

𝖼𝗈𝗅𝗅(u1,,uJ)=j2(j1)kj.\mathsf{coll}(u_{1},\ldots,u_{J})=\sum_{j\geq 2}(j-1)k_{j}.
Proof.

Let SS denote a set of nodes at level ii. Let kij(S)k_{ij}(S) denote the set of parents of SS that have outdegree jj in the subpedigree 𝖺𝗇𝖼(S)\mathsf{anc}(S). Let 𝖼𝗈𝗅𝗅i+1(S)\mathsf{coll}_{i+1}(S) denote the number of collisions that SS has at level i+1i+1. Then we claim that

𝖼𝗈𝗅𝗅i+1(S)=j(j1)kij(S).\mathsf{coll}_{i+1}(S)=\sum_{j}(j-1)k_{ij}(S). (5)

This is true by induction on the cardinality of SS, as we now demonstrate. We prove this assuming that SS is a set of non-extant coupled nodes; the case for extant nodes is extremely similar. The base case |S|=1|S|=1 follows because the unique node uSu\in S either has two distinct parents, in which case there are no collisions and each has outdegree 11, or uu has a single parent, in which case the number of collisions is 11 and the parent has outdegree 22. In both cases Eq. 5 holds.

For the inductive step, suppose that Eq. 5 is valid for all SS with |S|s|S|\leq s. Now consider SS with |S|=s+1|S|=s+1. Choose an arbitrary uSu\in S and consider S=S\{u}S^{\prime}=S\backslash\{u\}. Observe that by Definition 4.1 and induction:

𝖼𝗈𝗅𝗅i+1(S)\displaystyle\mathsf{coll}_{i+1}(S) =2|S||par(S)|\displaystyle=2|S|-|par(S)|
=2|S||par(S)|+2|{u}||par(u)\par(S)|\displaystyle=2|S^{\prime}|-|par(S^{\prime})|+2|\{u\}|-|par(u)\backslash par(S^{\prime})|
=𝖼𝗈𝗅𝗅(S)+2|par(u)\par(S)|\displaystyle=\mathsf{coll}(S^{\prime})+2-|par(u)\backslash par(S^{\prime})|
=j(j1)kij(S)+2|par(u)\par(S)|.\displaystyle=\sum_{j}(j-1)k_{ij}(S^{\prime})+2-|par(u)\backslash par(S^{\prime})|.

Therefore, if uu has {0,1,2}\ell\in\{0,1,2\} parents contained in par(S)par(S^{\prime}), then

𝖼𝗈𝗅𝗅i+1(S)=+j(j1)kij(S)=j(j1)kij(S),\mathsf{coll}_{i+1}(S)=\ell+\sum_{j}(j-1)k_{ij}(S^{\prime})=\sum_{j}(j-1)k_{ij}(S),

because each parent of uu contained in par(S)par(S^{\prime}) increases the degree of some node in SS^{\prime} by 11.

Applying this argument over all levels ii to the sets =1J𝖺𝗇𝖼i(u)\cup_{\ell=1}^{J}\mathsf{anc}_{i}(u_{\ell}), we see by Definition 4.1 and summing over all levels ii that Lemma 4.3 holds for coupled nodes. ∎

In our model and in light of Lemma 4.1, a collision between sets AA and BB intuitively corresponds to a node in BB “choosing” a parent couple that was already chosen by another node in ABA\cup B. This observation lets us bound the number of collisions between the ancestors of 3 nodes with high probability.

Lemma 4.4 (Exponential tail of collisions).

Fix three nodes u,v,w𝒫u,v,w\in\mathcal{P} in the same level kk, and let cc be a positive integer. Then

[𝖼𝗈𝗅𝗅(u,v,w)c]=O(72c22cTNc)\mathbb{P}[\mathsf{coll}(u,v,w)\geq c]=O\left(\frac{72^{c}\cdot 2^{2cT}}{N^{c}}\right) (6)
Proof.

We show that the probability on the left-hand-side of Eq. 6 can be upper bounded by the probability that a binomial random variable with sufficiently small mean is at least cc, from which the result follows.

We assume that each level has at least NN individuals. This is a high probability event by Lemma 4.2 (which actually describes a much stronger situation). Since we just want an upper bound, we condition such an event and this assumption is made without loss of generality.

Let Si:=𝖺𝗇𝖼i(u)𝖺𝗇𝖼i(v)𝖺𝗇𝖼i(w)S_{i}:=\mathsf{anc}_{i}(u)\cup\mathsf{anc}_{i}(v)\cup\mathsf{anc}_{i}(w). Note that |Si|32i|S_{i}|\leq 3\cdot 2^{i}, regardless of how many collisions have happened underneath it. The distribution of 𝖼𝗈𝗅𝗅(𝖺𝗇𝖼i(u),𝖺𝗇𝖼i(v),𝖺𝗇𝖼i(w))\mathsf{coll}(\mathsf{anc}_{i}(u),\mathsf{anc}_{i}(v),\mathsf{anc}_{i}(w)) is equal to a sum of at most 32i+13\cdot 2^{i+1} Bernoulli random variables, two for each node in SiS_{i}, which are indicator random variables that a parent coupled node selected by some node in uSiu\in S_{i} is the same as a parent coupled node previously selected by vSiv\in S_{i} (Lemma 4.1). Furthermore, each of these indicator random variables is 1 with probability at most 32T+2/N3\cdot 2^{T+2}/N, even conditioned on the previously set random variables—indeed, there are only 32i+132T3\cdot 2^{i+1}\leq 3\cdot 2^{T} parents selected in total, so there are only this many nodes that can be selected from to cause a collision, and there are at least N/2N/4\lfloor N/2\rfloor\geq N/4 coupled nodes at level i+1i+1. Therefore, the random variable 𝖼𝗈𝗅𝗅(Si)\mathsf{coll}(S_{i}) is stochastically dominated by 𝖡𝗂𝗇(32i+1,32T+2/N)\mathsf{Bin}(3\cdot 2^{i+1},3\cdot 2^{T+2}/N). Let Xi𝖡𝗂𝗇(32i+1,32T+2/N)X_{i}\sim\mathsf{Bin}(3\cdot 2^{i+1},3\cdot 2^{T+2}/N). Then we get that

[𝖼𝗈𝗅𝗅(u,v,w)c]\displaystyle\mathbb{P}[\mathsf{coll}(u,v,w)\geq c] =[i𝖼𝗈𝗅𝗅k+i(Si)c]\displaystyle=\mathbb{P}[\sum_{i}\mathsf{coll}_{k+i}(S_{i})\geq c]
[i=kT1Xic]\displaystyle\leq\mathbb{P}[\sum_{i=k}^{T-1}X_{i}\geq c]
[Xc]\displaystyle\leq\mathbb{P}[X\geq c] (7)

where X𝖡𝗂𝗇(32T+1,32T+2/N)X\sim\mathsf{Bin}(3\cdot 2^{T+1},3\cdot 2^{T+2}/N). By bounding the binomial tail and noting that we take N>14422TN>144\cdot 2^{2T}, (Eq. 7) can be bounded by

[Xc]\displaystyle\mathbb{P}[X\geq c] i=c32T+1(32T+1i)(32T+2N)i\displaystyle\leq\sum_{i=c}^{3\cdot 2^{T+1}}\binom{3\cdot 2^{T+1}}{i}\left(\frac{3\cdot 2^{T+2}}{N}\right)^{i}
i=c32T+1(32T+1)i(32T+2N)i\displaystyle\leq\sum_{i=c}^{3\cdot 2^{T+1}}(3\cdot 2^{T+1})^{i}\left(\frac{3\cdot 2^{T+2}}{N}\right)^{i}
272c22cTNc\displaystyle\leq 2\cdot 72^{c}\cdot\frac{2^{2cT}}{N^{c}}

In particular, by union bounding over all triples of nodes in the coupled pedigree 𝒫\mathcal{P}, we get the following corollary. Note that there are most (α/2+δ)TN(\alpha/2+\delta)^{T}\cdot N nodes in the pedigree when we condition on the high-probability event from Lemma 4.2.

Corollary 4.5.
[u,v,w:𝖼𝗈𝗅𝗅(u,v,w)4]=O((α/2+δ)3T28TN)\mathbb{P}[\exists u,v,w:\mathsf{coll}(u,v,w)\geq 4]=O\left(\frac{(\alpha/2+\delta)^{3T}2^{8T}}{N}\right)

Since we take the ratio T/log(N)T/\log(N) to be sufficiently small (Section 4.1), the probability of the above event is negligible. Hence, we can assume without loss of generality for the rest of the document that the number of collisions in the ancestral trees of any three nodes is at most 3.

Additionally, by applying Lemma 4.4 to a single node (repeated three times) and applying linearity of expectation, we can bound the probability that there are many coupled nodes uu with collisions in their ancestral pedigrees 𝖺𝗇𝖼(u)\mathsf{anc}(u) using Markov’s inequality. We state this as a corollary.

Corollary 4.6.

For any C>0C>0,

[|{u:𝖼𝗈𝗅𝗅(u)1}|C(2α+4δ)T]72/C\mathbb{P}\left[\Big{|}\{u:\mathsf{coll}(u)\geq 1\}\Big{|}\geq C(2\alpha+4\delta)^{T}\right]\leq 72/C

as long as NN is sufficiently large.

Definition 4.2 (dd-Richness).

Fix a pedigree 𝒫\mathcal{P}, and let d3d\geq 3 be an integer. All extant nodes in 𝒫\mathcal{P} are dd-rich. For all k>0k>0, a level kk-node is dd-rich if it has at least dd children that are dd-rich.

Lemma 4.7 (Most nodes are dd-rich).

Fix a constant 0<τ<10<\tau<1, and let δ>0\delta>0 as in Lemma 4.2. As long as NN and α\alpha are sufficiently large, there exists a constant C2=C2(τ,δ)C_{2}=C_{2}(\tau,\delta) such that with probability 1Texp(C2αN)1-T\exp(-C_{2}\alpha N), at least (1τ)(1-\tau) fraction of level-kk coupled nodes in 𝒫\mathcal{P} are dd-rich for all kk.

Proof of Lemma 4.7.

Let the term “dd-poor node” refer to coupled nodes that are not dd-rich. Let MkM_{k} denote the number of coupled nodes at level kk in 𝒫\mathcal{P}. Our goal is to prove an upper bound on the event that there are at least τMk+1\tau M_{k+1} dd-poor nodes at level k+1k+1, conditioned on the event that there are at least (1τ)Mk(1-\tau)M_{k} dd-rich nodes at level kk.

Let RkR_{k} denote the event that there are at least (1τ)Mk(1-\tau)M_{k} dd-rich nodes at level kk. Let EE denote the event (α/2δ)Mk+1Mk(α/2+δ)Mk+1(\alpha/2-\delta)M_{k+1}\leq M_{k}\leq(\alpha/2+\delta)M_{k+1} for all kk, which occurs with probability 1exp(C1αN)1-\exp(-C_{1}\alpha N) by Lemma 4.2. We also condition on the sizes of M0,,MTM_{0},\ldots,M_{T}, abbreviating this conditioning as M0:TM_{0:T}.

Let SS be an arbitrary subset of nodes at level k+1k+1 of size τMk+1+1\tau M_{k+1}+1, and consider the event where SS only consists of dd-poor nodes. This implies that the number of dd-rich children of SS is at most (d1)(τMk+1+1)(d-1)(\tau M_{k+1}+1). Let XiX_{i} be iid Bernoulli RVs, which represent indicators for the event where the iith dd-rich child chooses at least one of its parents to be in SS. Note that (Xi=1)=(1(1|S|Mk+1)2)>|S|Mk+1\mathbb{P}(X_{i}=1)=\left(1-\left(1-\frac{|S|}{M_{k+1}}\right)^{2}\right)>\frac{|S|}{M_{k+1}}.

(S\displaystyle\mathbb{P}(S only has d-poor nodesRk,E,M0:T)\displaystyle\text{ only has $d$-poor nodes}\mid R_{k},E,M_{0:T})
[i=1(1τ)MkXi(d1)|S||M0:T]\displaystyle\leq\mathbb{P}\left[\sum_{i=1}^{(1-\tau)M_{k}}X_{i}\leq(d-1)|S|\;\Bigg{|}\;M_{0:T}\right]
exp[(1τ)Mk|S|2Mk+1(1(d1)Mk+1(1τ)Mk)2]\displaystyle\leq\exp\left[-\frac{(1-\tau)M_{k}|S|}{2M_{k+1}}\left(1-\frac{(d-1)M_{k+1}}{(1-\tau)M_{k}}\right)^{2}\right] (Chernoff–Hoeffding Bound)

Observe that there are (Mk+1|S|)(eτ)τMk+1+1\binom{M_{k+1}}{|S|}\leq\left(\frac{e}{\tau}\right)^{\tau M_{k+1}+1} many choices for SS. To apply a union bound, it suffices for α\alpha to be large enough so that (1τ)MkMk+1(1(d1)Mk+1(1τ)Mk)2(1τ)α(1d1(1τ)α)2\frac{(1-\tau)M_{k}}{M_{k+1}}\left(1-\frac{(d-1)M_{k+1}}{(1-\tau)M_{k}}\right)^{2}\approx(1-\tau)\alpha(1-\frac{d-1}{(1-\tau)\alpha})^{2} looks linear in α\alpha. In that case, we obtain a bound of the form

(at least\displaystyle\mathbb{P}(\text{at least } τMk+1 d-poor nodes at level k+1Rk,E,M0:T)\displaystyle\tau M_{k+1}\text{ $d$-poor nodes at level $k+1$}\mid R_{k},E,M_{0:T})
exp(CMk+1α).\displaystyle\leq\exp\left(-CM_{k+1}\alpha\right).

Therefore, we may write

(at least\displaystyle\mathbb{P}(\text{at least } (1τ) fraction of d-rich at all levels)\displaystyle(1-\tau)\text{ fraction of $d$-rich at all levels})
(1eC1αN)k=1T(1exp(CMk+1α))\displaystyle\geq(1-e^{-C_{1}\alpha N})\prod_{k=1}^{T}\left(1-\exp(-CM_{k+1}\alpha)\right)
1exp(C1αN)k=0T1exp(CN(α/2δ)kα)\displaystyle\geq 1-\exp(-C_{1}\alpha N)-\sum_{k=0}^{T-1}\exp(-CN(\alpha/2-\delta)^{k}\alpha)
1Texp(C2αN)\displaystyle\geq 1-T\exp(-C_{2}\alpha N)

for an appropriate constant C2C_{2} depending only on τ\tau and δ\delta.

Lemma 4.8 (Cliques have unique parents).

Let GkG_{k} denote the siblinghood hypergraph at level kk. Let δ>0\delta>0 be as in Lemma 4.2. For a constant C3=C3(δ)C_{3}=C_{3}(\delta), with probability at least 11NeC3Tlogα1-\frac{1}{N}e^{C_{3}T\log\alpha}, for all hypercliques 𝒞Gk\mathcal{C}\subset G_{k} with at least one hyperedge, there is a unique node at level k+1k+1 that is a parent of every node in 𝒞\mathcal{C}. We refer this node as the parent of 𝒞\mathcal{C}.

Proof.

By Proposition 3.1, a hyperclique corresponds to a set of coupled nodes that contain a set of mutual siblings, where each couple has at least one of the siblings in it. This establishes that there is a coupled node at level k+1k+1 that is at least one parent of every node in 𝒞\mathcal{C}. In the case where 𝒞\mathcal{C} is a hyperclique of extant nodes, we are done: every node in 𝒞\mathcal{C} is an individual and has exactly one parent coupled node.

If 𝒞\mathcal{C} is at a higher level, note that there can be at most two parents for 𝒞\mathcal{C}, as defined above. The reason is that any individual has exactly one parent couple, and since there are only two individuals in a couple, there cannot be three parent couples each with one child in each couple in 𝒞\mathcal{C}.

Next we show that if there are two coupled nodes, both of which are parents of 𝒞\mathcal{C}, then there must be many collisions among the ancestors of 𝒞\mathcal{C}, and therefore we can rule this out as a low-probability event. Since 𝒞\mathcal{C} has at least one hyperedge, we know that |𝒞|3|\mathcal{C}|\geq 3. This means that any arbitrary set of three nodes from 𝒞\mathcal{C} must have at least 62=46-2=4 collisions by Definition 4.1—but Corollary 4.5 shows that with probability O((α/2+δ)3T28TN)O\left(\frac{(\alpha/2+\delta)^{3T}2^{8T}}{N}\right), this does not occur anywhere in the pedigree. ∎

Lemma 4.9 (Disjointness of maximal cliques).

Let GkG_{k} denote the siblinghood hypergraph at level kk. For k=0k=0, each extant node is contained in a unique maximal clique, and moreover, the maximal cliques in G0G_{0} are vertex disjoint (and thus, also edge-disjoint). For k>0k>0, each node is contained in at most two maximal cliques. Moreover, with probability 11NeC3Tlogα1-\frac{1}{N}e^{C_{3}T\log\alpha}, the maximal cliques in GkG_{k} are edge-disjoint.

Proof.

Note that maximal cliques in the siblinghood hypergraph correspond to maximal sets of siblings. The claim for extant nodes is relatively trivial - extants are individuals, and so the maximal sets of siblings partition the set of extant nodes.

For k>0k>0, since each individual in a coupled node has one pair of parents, a coupled node can have at most two parents. Thus it can be part of at most two sets of siblings. Hence, it is part of at most two maximal cliques.

Finally, we need to establish that the maximal cliques in GkG_{k} are edge-disjoint. To do this, it suffices to show that the intersection between any two maximal cliques is less than 3, so there can be no hyper-edge. Indeed, if three nodes that are simultaneously in two maximal cliques, these three nodes would themselves form a clique with two different parents in level k+1k+1, which occurs with probability at most 11NeC3Tlogα1-\frac{1}{N}e^{C_{3}T\log\alpha} by Lemma 4.8. ∎

4.3 The joint LCA and its uniqueness

The next two lemmas are crucial in Section 6 to show that we can accurately collect symbols for accurately reconstructed coupled nodes. Here we define the joint lowest common ancestor, which is a special type of LCA for a triple of coupled nodes.

Definition 4.3.

Let u,v,wu,v,w denote coupled nodes in 𝒫\mathcal{P}. We say that u,v,wu,v,w have a joint LCA zz if it holds that z𝖫𝖢𝖠(u,v,w)z\in\mathsf{LCA}(u,v,w) and there exist distinct children cu,cv,cwc_{u},c_{v},c_{w} of zz so that for all x{u,v,w}x\in\{u,v,w\}, cxc_{x} is an ancestor of xx.

Lemma 4.10 (Joint LCA is unique).

Suppose that each triple of coupled nodes in 𝒫\mathcal{P} has at most 3 collisions. Further suppose that u,v,wu,v,w have a joint LCA z𝖫𝖢𝖠(u,v,w)z\in\mathsf{LCA}(u,v,w). Then zz is the unique LCA of u,v,wu,v,w.

Proof.

For the sake of contradiction, suppose that u,v,wu,v,w have another LCA zzz^{\prime}\neq z. By the definition of LCA, zz^{\prime} is neither an ancestor nor a descendant of zz.

If zz^{\prime} is a joint LCA of u,v,wu,v,w, then both zz and zz^{\prime} have outdegree 33 in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w), which by Lemma 4.3 implies that 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) has at least 2×(31)=42\times(3-1)=4 collisions.

If zz^{\prime} is not a joint LCA, then zz^{\prime} has outdegree 22 in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w). Moreover, there exists a unique lowest node y𝖽𝖾𝗌𝖼(z)𝖺𝗇𝖼(u,v,w)y\in\mathsf{desc}(z^{\prime})\cap\mathsf{anc}(u,v,w) that is an ancestor of precisely two nodes in {u,v,w}\{u,v,w\}. In particular, yy has outdegree at least 22 in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w). Observe that the nodes y,z,zy,z,z^{\prime} are all distinct. Hence by Lemma 4.3, the number of collisions is at least 2×(21)+1×(31)=42\times(2-1)+1\times(3-1)=4.

In either case, 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) has at least 44 collisions, which is a contradiction. ∎

Lemma 4.11 (Inheritance paths go through LCA).

Suppose that each triple of coupled nodes in 𝒫\mathcal{P} has at most 3 collisions. Further suppose that u,v,w𝒫u,v,w\in\mathcal{P} have an LCA zz. Let zz^{\prime} denote a strict ancestor of zz. Then for some x{u,v,w}x\in\{u,v,w\}, all paths from zz^{\prime} to xx in 𝒫\mathcal{P} pass through zz.

Proof.
zzzz^{\prime}uuvvww
Figure 5: “Proof-by-picture” of Lemma 4.11.

To draw a contradiction, suppose that for all x{u,v,w}x\in\{u,v,w\} that zz^{\prime} has a path to xx that does not go through zz. Suppose further, without loss of generality, that zz^{\prime} is the lowest node in 𝒫\mathcal{P} that is an ancestor of zz and has this property.

Let 𝒯\mathcal{T} denote a spanning tree on 𝖽𝖾𝗌𝖼(z)𝖺𝗇𝖼(u,v,w)\mathsf{desc}(z)\cap\mathsf{anc}(u,v,w) (red edges in Fig. 5). Also select a spanning tree 𝒯\mathcal{T}^{\prime} on the union of all paths from zz^{\prime} to u,v,wu,v,w that do not go through zz (blue edges in Fig. 5). Observe that zz^{\prime} has outdegree at least 22 in 𝒯\mathcal{T}^{\prime}. Since zz^{\prime} also has a path to zz, then zz^{\prime} has outdegree at least 33 in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w). Moreover, 𝒯\mathcal{T} has 22 collisions. Since zz^{\prime} is not contained in 𝒯\mathcal{T}, we conclude by Lemma 4.3 that 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) has at least 2+1×(31)=42+1\times(3-1)=4 collisions. The first terms accounts for the collisions in 𝒯\mathcal{T}, and the second applies Lemma 4.3 to zz^{\prime}. This is a contradiction. ∎

Note that by Corollary 4.5, Lemmas 4.10 and 4.11 hold for all triples u,v,w𝒫u,v,w\in\mathcal{P} with high probability.

5 Lemmas that enable reconstruction

In this section, we prove bounds on “overlap statistics” previously explored in Section 2. Since we now have switched to talking about coupled pedigrees, we re-define its notion now.

Definition 5.1 (Diploid blocks).

Let 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}} induce the coupled pedigree 𝒫\mathcal{P}. Given (haploid) gene sequences (σu)uV(𝒫𝗂𝗇𝖽𝗂𝗏)(\sigma_{u})_{u\in V(\mathcal{P}_{\mathsf{indiv}})}, we associate with each non-extant couple v={v1,v2}v=\{v_{1},v_{2}\} node a diploid sequence σv\sigma_{v} defined in terms of each block bb as a multiset σv(b):=σv1(b)σv2(b)\sigma_{v}(b):=\sigma_{v_{1}}(b)\cup\sigma_{v_{2}}(b). Each extant node’s block is thought of as a singleton set.

Definition 5.2 (Diploid overlap).

Three diploid sequences σ,σ,σ′′\sigma,\sigma^{\prime},\sigma^{\prime\prime} overlap in block bb if

σ(b)σ(b)σ′′(b).\sigma(b)\cap\sigma^{\prime}(b)\cap\sigma^{\prime\prime}(b)\neq\varnothing.

The term fraction of mutual overlaps between coupled nodes u,v,wu,v,w in refers to the statistic

# overlapping blocks of σu,σv,σwB=|{b[B]:σu(b)σv(b)σw(b)}|B.\frac{\#\textrm{ overlapping blocks of }\sigma_{u},\sigma_{v},\sigma_{w}}{B}=\frac{\left|\{b\in[B]:\sigma_{u}(b)\cap\sigma_{v}(b)\cap\sigma_{w}(b)\neq\varnothing\}\right|}{B}.

5.1 Distinguishing siblings from non-siblings: Coincidence probability bounds

In this section, we establish the following high-probability separation condition for triples of coupled nodes at the same level:

  • if u,v,wu,v,w are mutually siblings, they overlap in at least 1/41/4 fraction of blocks.

  • if u,v,wu,v,w are not mutually siblings, they overlap in at most 3/163/16 fraction of blocks.

In order to reconstruct the pedigree, we perform inference on the underlying pedigree structure from the symbols at the extant level. The key step of our reconstruction algorithm is to infer which triples of nodes are mutually siblings based on the overlap between their reconstructed symbols. The conditions stated above justify using the number of overlapping symbols in triples as a statistic for determining siblinghood. The first fact (Lemma 5.1) is easy to prove. In contrast, the second fact (Lemma 5.2) is rather non-trivial; we prove it using casework.

Lemma 5.1 (Symbol overlap in siblings).

With probability 1O(α3TN3exp(γ2B))1-O(\alpha^{3T}N^{3}\exp(-\gamma^{2}B)), the fraction of mutual overlap in symbols between any triple of coupled nodes uu, v,w𝒫v,w\in\mathcal{P} that are mutually siblings is at least 14γ\tfrac{1}{4}-\gamma for any arbitrarily small γ>0\gamma>0.

Proof.

It suffices to consider the overlap of the individuals u1,v1,w1u_{1},v_{1},w_{1} in u,v,wu,v,w, respectively, that are siblings, i.e., u1,v1,w1u_{1},v_{1},w_{1} have a common parent in 𝒫𝗂𝗇𝖽𝗂𝗏\mathcal{P}_{\mathsf{indiv}}. We claim that the expected fraction of overlap for u1,v1,w1u_{1},v_{1},w_{1} is at least 1/4. Indeed, any individual symbol at the parent (couple) node survives to all three children with probability 1/81/8, and there are 2B2B symbols at the parent (one per block per member of the couple). The Chernoff–Hoeffding bound gives that for any fixed triple (u,v,w)(u,v,w) of siblings, the probability that it has less than 1/4γ1/4-\gamma mutual overlap is at most exp(γ2B)\exp(-\gamma^{2}B). To be explicit, let XiX_{i} denote the indicator of an overlap between u,v,wu,v,w in block bb.

(average overlap<1/4γ)\displaystyle\mathbb{P}(\text{average overlap}<1/4-\gamma) =(1Bi=1BXi<1/4+γ)\displaystyle=\mathbb{P}\left(\frac{1}{B}\sum_{i=1}^{B}X_{i}<1/4+\gamma\right)
=(1Bi=1B(Xi𝔼[Xi])<1/4𝔼[X1]+γ)\displaystyle=\mathbb{P}\left(\frac{1}{B}\sum_{i=1}^{B}(X_{i}-\mathbb{E}[X_{i}])<1/4-\mathbb{E}[X_{1}]+\gamma\right)
(1Bi=1B(Xi𝔼[Xi])<γ)\displaystyle\leq\mathbb{P}\left(\frac{1}{B}\sum_{i=1}^{B}(X_{i}-\mathbb{E}[X_{i}])<-\gamma\right)
2exp(2Bγ2).\displaystyle\leq 2\exp(-2B\gamma^{2}).

In the second line we use that XiX_{i} are i.i.d., in the third line we use that the expectation is at least 1/41/4, and to finish we apply Chernoff–Hoeffding. A union bound over all O((αTN)3)O((\alpha^{T}N)^{3}) triples of siblings yields the result. ∎

Lemma 5.2 (Symbol overlap in non-siblings).

Fix γ>0\gamma>0. With probability 1O(1/NT)O(α3TN3exp(γ2B))1-O(1/N_{T})-O(\alpha^{3T}N^{3}\exp(-\gamma^{2}B)), every triple of coupled nodes uu, vv, and ww that are at the same level but are not mutual siblings share overlap in less than 316+γ\tfrac{3}{16}+\gamma fraction of their symbols.

5.1.1 Proof of Lemma 5.2

Remark 4.

In this proof, we condition on the high probability event from Corollary 4.5 that all triples u,v,wu,v,w of coupled nodes have at most 33 collisions in their ancestral subpedigree 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w).

It is clear that if u,v,wu,v,w are completely unrelated, then their mutual overlap is zero, since we assume an infinite alphabet. If u,v,wu,v,w have a common ancestor, then typically their ancestral pedigree has two collisions, and all triples have at most three collisions in their ancestral pedigree by our conditioning in Remark 4. We refer to triples with three collisions as being inbred and think of the extra collision as the site of inbreeding, a notion that we later formalize in this section.

Recall the definition of tree subpedigree (Definition 3.9), which we refer to simply as a tree in what follows. Also recall that an edge of multiplicity 2 in a pedigree is considered to be an undirected cycle of length 2. Thus, a tree subpedigree consists only of simple (multiplicity 1) edges. Our strategy for proving Lemma 5.2 follows the recipe below for casework.

  1. 1.

    u,v,wu,v,w have exactly two LCAs, and the ancestral pedigree of u,v,wu,v,w is a tree.

  2. 2.

    u,v,wu,v,w have exactly one LCA, and the LCA has a cycle above it.

  3. 3.

    u,v,wu,v,w have exactly one LCA, and the ancestral pedigree of u,v,wu,v,w is a tree.

  4. 4.

    u,v,wu,v,w have exactly one LCA, and the ancestral pedigree of u,v,wu,v,w contains a cycle that is not completely above the LCA.

We now assert that the above cases cover all possibilities; this is proven in the next two claims.

Claim 1.

For uu, vv, and ww to have a single LCA, their ancestors must have at least 2 collisions.

Proof.

All three nodes need a common ancestor, which means there are at least 2 collisions are present in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w). ∎

Claim 2.

The nodes uu, vv, and ww have at most two LCAs, with two LCAs only if 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) has three collisions. Furthermore, if there are two LCAs, then 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) is a tree pedigree.

Proof.

By the previous claim, creating a single LCA for three nodes requires 2 collisions in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w). By definition, one LCA cannot be an ancestor of another LCA. This means there must be at least one more collision in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) to create the second LCA, bringing the total number of collisions required in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) to three. This immediately yields the final part of the claim by Remark 4.

To establish that there are at most two LCAs, suppose we add a third LCA. Then by the same argument, this LCA cannot be an ancestor of either of the two other LCAs, and so there must be another collision to explain it. This leads to four collisions among the ancestors, which we have ruled out. ∎

We now upper bound the expected overlap between uu, vv and ww by doing the above casework on the structure of their ancestral pedigrees. We simply upper bound the expected overlap, relying on the independence of inheritance in the different blocks so that we can apply a Chernoff–Hoeffding bound.

Lemma 5.3 (Case 1: exactly two LCAs).

Suppose that uu, vv, and ww have exactly two LCAs. Then the expected fraction of mutual overlap is at most 1/81/8.

Proof.
zz^{\prime}zzppqquuvvww
Figure 6: The topologies of Lemma 5.3 with two LCAs. Others are obtained by swapping the roles of u,v,wu,v,w.

Fig. 6 illustrates the topology of interest. First we note that neither of the LCAs can have repeated symbols, since their ancestral pedigrees contain no collisions. Consider the ancestral pedigree from uu, vv, and ww up to any one particular LCA, noting that this pedigree is a tree by 2. Any configuration containing uu, vv, ww and their ancestors leading up to that LCA has at least 5 edges, since u,v,wu,v,w are not mutual siblings. Therefore, the probability that a single symbol propagates from that LCA to all of uu, vv, and ww is (1/2)5=1/32\leq(1/2)^{5}=1/32, which yields an expected 1/16 fraction of overlap since there are 2|B|2|B| symbols at the LCA (since it is a coupled node). Since there are two such LCAs, the expectation is at most 1/81/8. ∎

In the remaining cases, we assume there is exactly one LCA. Note that any common symbols across uu, vv, and ww must be present in this LCA—if uu, vv, and ww inherit a symbol that is not present in this LCA, then by tracing their paths of inheritance for the symbol we can find another LCA. However, this does not guarantee that all common symbols in uu, vv, and ww can be traced back to inheritance from the LCA— if there is inbreeding, some nodes in {u,v,w}\{u,v,w\} can potentially inherit a symbol via an ancestor of the LCA through a path does not go through the LCA, while the rest inherit it from the LCA.

Lemma 5.4 (Case 2: one LCA with cycle above).

Suppose that uu, vv, and ww have exactly one LCA zz. Furthermore, this LCA has at least one collision in its ancestral pedigree. Then the fraction of mutual overlap is at most 1/81/8 in expectation.

Proof.

We know that uu, vv, and ww, must have at least two distinct parents between them that are connected to zz (else zz would be their parent). This means there are at least two edges in the graph between zz and the parents of uu, vv, and ww, and at least three edges between uu, vv, and ww and their respective parents.

Since we know there are at most three collisions among the ancestors of uu, vv, and ww, there can be only one collision in the ancestral pedigree of zz, and the presence of this collision means there are no other collisions in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w). Therefore, each of the parent couples of uu, vv, and ww have an individual that is unrelated to zz, and so there are no repeated symbols within any of the parent couples. So even if the parents were to get 100%100\% overlap in the blocks due to inheritance from zz, it holds that uu, vv, and ww inherit at most 1/81/8 fraction of these blocks on expectation.

Finally, all common symbols between uu, vv, and ww must have been inherited from zz— if a common symbol was instead inherited by some x{u,v,w}x\in\{u,v,w\} from some ancestor of zz, this would create a fourth collision in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w). ∎

Lemma 5.5 (Case 3: one LCA and 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) is a tree).

Suppose uu, vv, and ww have exactly one LCA and 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) is a tree. Then the fraction of mutual overlap is at most 1/161/16 in expectation.

zzppqquuvvww
zzppqqrruuvvww
Figure 7: Exhaustive list of topologies from Lemma 5.5, up to re-labelling of u,v,wu,v,w. Each edge represents a path of length >1>1.
Proof.

The lack of any cycles in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) means that all inheritance of common symbols comes from the lone LCA zz. Any such union of paths from zz to uu, vv and ww forms a directed tree with at least five edges; see Fig. 7. In addition, zz has two distinct symbols in every block. Therefore, for any particular symbol the probability that all three of u,v,wu,v,w inherit it is (1/2)5=1/32\leq(1/2)^{5}=1/32, which yields an expected fraction of at most 1/161/16 overlapping blocks.

The final case is the most complicated one to analyze.

Lemma 5.6 (Case 4: one LCA with cycle not completely above).

Suppose uu, vv, and ww have exactly one LCA and 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) contains a cycle that does not lie completely above z=𝖫𝖢𝖠(u,v,w)z=\mathsf{LCA}(u,v,w). Then the fraction of mutual overlap is at most 3/163/16 in expectation.

As an aid in proving Lemma 5.6, it is helpful to first identify the “most recent” inbred node. We make this notion precise now.

Definition 5.3 (Witness).

We call a node g𝖺𝗇𝖼(u,v,w)g\in\mathsf{anc}(u,v,w) a witness to inbreeding or simply a witness if gg is the lowest node in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w) that is part of an undirected cycle.

Lemma 5.7 (Unique witness).

Under the conditions of Lemma 5.6, there exists a unique witness in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w). Moreover, this witness lies strictly below the LCA zz.

Proof.

We know that 𝒯:=𝖺𝗇𝖼(u,v,w)\mathcal{T}:=\mathsf{anc}(u,v,w) is not a tree, so there exists a cycle in 𝒯\mathcal{T}. We show that there can only be one cycle. Suppose that there exist two cycles 𝒞,𝒞\mathcal{C},\mathcal{C}^{\prime} in 𝒯\mathcal{T}. Then we claim that 𝖼𝗈𝗅𝗅(u,v,w)4\mathsf{coll}(u,v,w)\geq 4.

Consider a spanning tree 𝒯\mathcal{T}^{\prime} of 𝒯\mathcal{T}. Then 𝒯\mathcal{T}^{\prime} has two collisions. Moreover, 𝒯𝒞\mathcal{T}^{\prime}\cup\mathcal{C} contains a single cycle, so we conclude that there exists a node in 𝒯\mathcal{T}^{\prime} whose outdegree is increased by one upon adding the edges from 𝒞\mathcal{C} to 𝒯\mathcal{T}^{\prime} (Otherwise, 𝒯𝒞\mathcal{T}^{\prime}\cup\mathcal{C} would still be a tree). Therefore, by Lemma 4.3, 𝒯𝒞\mathcal{T}^{\prime}\cup\mathcal{C} has three collisions. By similar reasoning and using that 𝒞𝒞\mathcal{C}\neq\mathcal{C^{\prime}}, we conclude that 𝒯𝒞𝒞\mathcal{T}^{\prime}\cup\mathcal{C}\cup\mathcal{C^{\prime}} has 44 collisions. Since 𝒯𝒞𝒞𝒯\mathcal{T}^{\prime}\cup\mathcal{C}\cup\mathcal{C^{\prime}}\subset\mathcal{T}, we conclude that 𝒯\mathcal{T} has at least 44 collisions. But under our conditioning, no subpedigree has 44 or more collisions. It follows that in Lemma 5.6 there is exactly one cycle in 𝒯\mathcal{T}, and thus, exactly one witness.

To prove the final statement, note that if the witness is located above zz in 𝖺𝗇𝖼(u,v,w)\mathsf{anc}(u,v,w), then the cycle lies completely above zz. ∎

Proof of Lemma 5.6.

Consider u,v,wu,v,w and the subpedigree 𝒯=𝖺𝗇𝖼(u,v,w)\mathcal{T}=\mathsf{anc}(u,v,w) consisting of the ancestors of u,v,wu,v,w. Recall that zz is the unique LCA of u,v,wu,v,w. By Lemma 5.7, there is a unique witness g𝒯g\in\mathcal{T}, which is the lowest node in the unique cycle occurring in 𝒯\mathcal{T}.

Subcase 1: 𝖫𝖢𝖠(u,v)=𝖫𝖢𝖠(v,w)=𝖫𝖢𝖠(u,w)=𝖫𝖢𝖠(u,v,w)\mathsf{LCA}(u,v)=\mathsf{LCA}(v,w)=\mathsf{LCA}(u,w)=\mathsf{LCA}(u,v,w).

Without further loss of generality, suppose that the witness gg lies along the path from uu to zz. Then it follows that there is a unique path from vv to zz in 𝒯\mathcal{T}. Otherwise, there would exist two cycles in 𝒯\mathcal{T}, which is a contradiction as this would lead to 44 collisions in 𝒯\mathcal{T}. Similarly, there is a unique path from ww to zz in 𝒯\mathcal{T}. Moreover, 𝖺𝗇𝖼(z)\mathsf{anc}(z) is a tree. It follows that the subpedigree 𝖺𝗇𝖼(v,w)\mathsf{anc}(v,w) of the ancestors of vv and ww is a tree. Observe that zz is at least two levels above v,wv,w, and by the topology of this subcase, there are at least 4 edges in the tree subpedigree from zz to vv and ww. This implies that the expected overlap between vv and ww is at most 2(1/2)4=1/82\cdot(1/2)^{4}=1/8. Thus the expected overlap between u,v,wu,v,w is at most the expected overlap between uu and vv, which is bounded by 1/81/8.

Subcase 2: Without loss of generality, 𝖫𝖢𝖠(u,v)𝖫𝖢𝖠(u,v,w)\mathsf{LCA}(u,v)\neq\mathsf{LCA}(u,v,w).

z:{x,y}z:\{x,y\}ggppuuvvqqwwx\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}xx\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}x
z:{x,y}z:\{x,y\}ggppuuvvwwx\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}xx\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}x
Figure 8: Example of structures being analyzed in the proof of Lemma 5.6, Subcase 2. Here {x,y}\{x,y\} depict the symbols of the LCA zz in a specific block. The red edges delineate the inheritance events (possibly occurring simultaneously) of a common symbol xx.

Let p=𝖫𝖢𝖠(u,v)p=\mathsf{LCA}(u,v). Either gg is on the branch that leads to uu and vv, or it is on the branch that leads to ww. First, suppose that gg is on the branch that leads to uu and vv. Then we may further assume gg is on the path from zz to pp. For if, say, gg is on the path from pp to uu, then 𝖺𝗇𝖼(v,w)\mathsf{anc}(v,w) is a tree, in which case we can argue as in Subcase 1 that the mutual expected overlap between u,v,wu,v,w is at most 1/81/8.

Therefore, it suffices to consider the cases gg is on the path from zz to pp or gg is on the path from zz to ww (Fig. 8). In the first case, the descendants of gg form a tree with at least two edges. Moreover, there is a unique node qq at the same level as gg in 𝒯\mathcal{T}, and this individual is located on the path from zz to ww. Let σ(z)={x,y}\sigma(z)=\{x,y\} denote the (distinct) symbols of zz in a given block. By these facts, symmetry, and conditional independence of inheritance,

[σ(u)σ(v)σ(w)]\displaystyle\mathbb{P}[\sigma(u)\cap\sigma(v)\cap\sigma(w)\neq\varnothing]
2[σ(g)={x,x},xσ(u)σ(v)][xσ(q),xσ(w)]\displaystyle\leq 2\mathbb{P}[\sigma(g)=\{x,x\},\,x\in\sigma(u)\cap\sigma(v)]\mathbb{P}[x\in\sigma(q),x\in\sigma(w)]
+2[σ(g)={x,y},xσ(u)σ(v)][xσ(q),xσ(w)]\displaystyle\quad+2\mathbb{P}[\sigma(g)=\{x,y\},x\in\sigma(u)\cap\sigma(v)]\mathbb{P}[x\in\sigma(q),x\in\sigma(w)]
2×(14×1)×(12×12)+2×(12×14)×(12×12)\displaystyle\leq 2\times\left(\frac{1}{4}\times 1\right)\times\left(\frac{1}{2}\times\frac{1}{2}\right)+2\times\left(\frac{1}{2}\times\frac{1}{4}\right)\times\left(\frac{1}{2}\times\frac{1}{2}\right)
=316.\displaystyle=\frac{3}{16}.

The second line includes a factor of 22 to account for either xx or yy being passed down to u,v,wu,v,w. The terms in the third line are ordered to correspond to the events in the two lines above. In particular, we have by conditional independence of inheritance that

[σ(g)={x,x}]1/4\mathbb{P}[\sigma(g)=\{x,x\}]\leq 1/4

because there are at most 22 paths from zz to gg, and each has probability at most 1/21/2 of passing down xx. The bound

[σ(g)={x,y}]1/2\mathbb{P}[\sigma(g)=\{x,y\}]\leq 1/2

holds similarly.

Now suppose that gg is on the path from zz to ww. Then

[σ(u)σ(v)σ(w)]\displaystyle\mathbb{P}[\sigma(u)\cap\sigma(v)\cap\sigma(w)\neq\varnothing] 2[xσ(u)σ(v)][xσ(w)]\displaystyle\leq 2\mathbb{P}[x\in\sigma(u)\cap\sigma(v)]\mathbb{P}[x\in\sigma(w)]
21834=316.\displaystyle\leq 2\cdot\frac{1}{8}\cdot\frac{3}{4}=\frac{3}{16}.

Above, we used the fact that tree pedigree from zz to u,vu,v has at least 33 edges. We also used the fact

[xσ(w)]34,\mathbb{P}[x\in\sigma(w)]\leq\frac{3}{4},

which holds because there are at most two paths to ww from zz, each path has probability at least 1/21/2 of not passing down xx, and so by conditional independence of inheritance, the probability that both paths do not pass down xx is at least 1/41/4. ∎

Finally, to finish the proof of Lemma 5.2 using Lemmas 5.3, 5.4, 5.5, and 5.6, note that in all four cases the expected overlap between coupled nodes u,v,wu,v,w is at most 3/163/16. Thus, the probability that u,v,wu,v,w mutually share more than 3/16+γ3/16+\gamma fraction of symbols in all cases is at most 2exp(2Bγ2)2\exp(-2B\gamma^{2}) by Chernoff–Hoeffding, similar to the analysis of Lemma 5.1. Union bounding over all O((αTN)3)O((\alpha^{T}N)^{3}) possible triples gives an O(α3TN3exp(Bγ2))O(\alpha^{3T}N^{3}\exp(-B\gamma^{2})) upper bound of the chance that there is some triple with at least 3/16+γ3/16+\gamma overlap. By also ruling out the bad event in Corollary 4.5 (which occurs with probability O(1/NT)O(1/N_{T})), we obtain the desired upper bound.

5.2 Which ancestors are reconstructible?

In this section, we characterize nodes that are of importance in our analysis: couples whose history lacks inbreeding (e.g. graph structure is reconstructible using blocks) and have ample extant information (e.g. blocks are recoverable). We present this in two parts respectively in Definition 5.4 and Definition 5.5.

Definition 5.4 (Awesome Node).

Call a node in the pedigree 𝒫\mathcal{P} awesome if:

  1. 1.

    It is dd-rich.

  2. 2.

    It is not an ancestor of any extant node that has a collision within its own ancestral pedigree (including itself).

Definition 5.5 (bb-goodness).

Let b[B]b\in[B] be a specific block. Say that a coupled node vv in a pedigree 𝒫\mathcal{P} is bb-good if vv has at least two sets of three extant descendants x1x_{1}, y1y_{1}, z1z_{1} and x2x_{2}, y2y_{2}, z2z_{2} in 𝒫\mathcal{P} such that:

  1. 1.

    vv is a joint LCA of x1,y1,z1x_{1},y_{1},z_{1} and is a joint LCA of x2,y2,z2x_{2},y_{2},z_{2}.

  2. 2.

    x1x_{1}, y1y_{1}, and z1z_{1} all have the same symbol σ1\sigma_{1} in block bb, and x2x_{2}, y2y_{2}, and z2z_{2} all have the same symbol σ2\sigma_{2} in block bb.

  3. 3.

    σ1σ2\sigma_{1}\neq\sigma_{2}.

We furthermore define every extant node to be bb-good, for all b[B]b\in[B].

We now deliver the main message of this section: most nodes have these properties, given the assumptions of our model (Proposition 5.8 and Lemma 5.9). Therefore, this characterization enables a natural reconstruction algorithm (Section 6).

Proposition 5.8 (Many awesome nodes).

Let d>0d>0 (as in Definition 5.4) be a constant, let α\alpha be a sufficiently large constant with respect to dd, and let NN be sufficiently large with respect to both dd and α\alpha. With probability at least 1αΩ(T)1-\alpha^{-\Omega(T)}, in every layer of the pedigree at least 11/d1-1/d fraction of the nodes are awesome.

Proof.

Since α\alpha and NN are sufficiently large with respect to dd, we can apply Lemma 4.7 with τ=1/(2d)\tau=1/(2d) and δ=d\delta=d. This tells us that at least 11/(2d)1-1/(2d) fraction of nodes in each layer are dd-rich with probability 1Texp(C2αN)1-T\exp(-C_{2}\alpha N), where the constant C2=C2(1/(2d),d)C_{2}=C_{2}(1/(2d),d) depends only on dd.

Applying Corollary 4.6 with C=αTC=\alpha^{T}, there are at most αO(T)\alpha^{O(T)} nodes at the extant level with collisions in their ancestral pedigree, with probability 1αΩ(T)1-\alpha^{-\Omega(T)}. This means there are at most 2TαO(T)2^{T}\cdot\alpha^{O(T)} ancestors of these nodes. It follows that the number of nodes that are dd-rich but not awesome is at most 2TαO(T)2^{T}\cdot\alpha^{O(T)}. This is at most N2d\frac{N}{2d}, provided NN is sufficiently large with respect to dd and α\alpha and we take ε=T/logN\varepsilon=T/\log N to be small with respect to 1/log(α)1/\log(\alpha).

The first probability 1Texp(C2αN)1-T\exp(-C_{2}\alpha N) is exponentially small in NN, while the second probability 1αΩ(T)1-\alpha^{-\Omega(T)} is exponentially small in T=εlogNT=\varepsilon\log N. Therefore, the probability of both events occurring simultaneously can be lower bounded by 1αΩ(T)1-\alpha^{-\Omega(T)}, by taking the constant hidden in the Ω\Omega to be slightly smaller than what is found in the previous paragraph. ∎

Lemma 5.9 (Awesome implies bb-good).

Let d>0d>0 (as in Definition 5.4) be a sufficiently large constant. With probability 1exp(Ω(B))1-\exp(-\Omega(B)) over the symbol inheritance process, every awesome coupled node in 𝒫\mathcal{P} is bb-good for at least 99%99\% fraction of blocks b[B]b\in[B].

The figure “99%99\%” is an arbitrary choice for simplification. It can be replaced by anything arbitrarily close to 11, which changes the constant factor of Ω(B)\Omega(B) found in the lemma above. To prove Lemma 5.9, first we need a structural claim about awesome nodes:

Claim 3.

For any awesome coupled node, the subpedigree formed by it and its awesome descendants contains an induced dd-ary tree that goes down to the extant level.

Proof of 3.

First, we show that this subpedigree has no undirected cycles within it, which establishes the tree structure. Then, we argue that each node has dd children within this subpedigree.

Suppose that there an undirected cycle within this subpedigree. We show that this implies the presence of a collision within the subpedigree, contradicting the awesomeness of all nodes in the subpedigree. Note that there must be a node within this subpedigree with a cycle in its ancestral pedigree - for instance, take the node at the lowest level within the cycle. Applying Lemma 4.3 to this awesome node, we see it has a collision among its ancestors, which contradicts condition 2) of Definition 5.4.

Now we establish that each node has at least dd children in the subpedigree. An awesome coupled node vv has at least dd children that are dd-rich, since it is dd-rich itself. Furthermore, none of these children have descendants with collisions in their ancestral pedigree, so they are all awesome, which finishes the proof. ∎

Proof of Lemma 5.9.

Every awesome coupled node in 𝒫\mathcal{P} has exactly 22 distinct symbols in each block. Indeed, assume for contradiction that there is an awesome coupled node vv with a block in which it only has one distinct symbol. Due to the infinite alphabet assumption, we know that we can trace any symbol in a block back to a unique founder. Hence, there must be a collision in the ancestral pedigree of vv, which is a contradiction with condition 2) of (Definition 5.4).

Now we can proceed with showing that every awesome coupled node is bb-good for 99%99\% fraction of blocks b[B]b\in[B]. Fix an awesome node vv and a block b[B]b\in[B].

We use condition (1) of awesomeness to show that, with probability tending to 1 as dd\to\infty, there exist two sets of three extant nodes that both have vv as a joint LCA, where the first set has a symbol σ1\sigma_{1} in block bb, and the second set has a symbol σ2σ1\sigma_{2}\neq\sigma_{1}.

Towards this end, let us follow the inheritance of σ1\sigma_{1} among an induced dd-ary tree of awesome descendants, as guaranteed by 3. The inheritance follows a broadcast process with copy probability 1/21/2 on this dd-ary tree. The probability that the symbol makes it to at least three distinct children of vv, and this symbol in turn survives to the extant nodes can be expressed as

(1(1/2)d(1+d+(d2)))cd,1/2\left(1-(1/2)^{d}\left(1+d+\binom{d}{2}\right)\right)\cdot c_{d,1/2} (8)

where cd,1/2c_{d,1/2} refers to the survival probability of percolation on the dd-ary tree with copy probability 1/21/2. The first term refers to the probability that the symbol is inherited by at least 3 of the dd awesome children of vv. Additionally, these three extant nodes have vv as an LCA, as they have paths of inheritance from vv that do not all intersect at any other node.

Naturally, Eq. 8 also gives the probability that σ2\sigma_{2} is similarly inherited. Furthermore, from standard results about Galton-Watson processes (see e.g. [KA15]), we know that as dd\to\infty, cd,1/21c_{d,1/2}\to 1. Hence, we conclude that Eq. 8 tends to 1 as dd\to\infty. Thus it follows from the union bound the probability that there exist two sets of three extant nodes that both have vv as a lowest common ancestor, the first set has σ1\sigma_{1} in block bb, and the second set has σ2\sigma_{2}, also tends to 1 as dd\to\infty.

Hence, given a specific block bb, the probability that an awesome coupled node is bb-good is at least 0.9950.995. The high probability of this occurring for all blocks follows from a standard Chernoff–Hoeffding bound. ∎

6 Reconstructing the Pedigree

On the following page, we provide pseudocode for \namerefalgo:reconstruct_all which is the proposed reconstruction procedure, with details of the inner procedures following it (\namerefalgo:collect-symbols, \namerefalgo:test-siblings, and \namerefalgo:construct-parents). Note that for the first iteration of \namerefalgo:reconstruct_all, we do not need to collect symbols as the extant genetic data is given to us. Thus we simply test siblinghood at iteration k=1k=1 by using the true gene sequences.

Algorithm 1 Reconstruct a depth-TT coupled pedigree, given extant individuals V0V_{0}.
Rec-Gen\NR@gettitleRec-Gen
1:procedure Rec-Gen(T,V0T,V_{0})
2:     𝒫^(V=V0,E=)\hat{\mathcal{P}}\leftarrow(V=V_{0},E=\varnothing) \triangleright Extant Pedigree with no edges
3:     for k=1k=1 to TT do
4:         if k>1k>1 then
5:              for all vertices vv in level k1k-1 of 𝒫^\hat{\mathcal{P}} do
6:                  Collect-Symbols(v,𝒫^)\textsc{Collect-Symbols}(v,\hat{\mathcal{P}})                        
7:         G^Test-Siblinghood(𝒫^)\hat{G}\leftarrow\textsc{Test-Siblinghood}(\hat{\mathcal{P}})
8:         Assign-Parents(𝒫^,G^\hat{\mathcal{P}},\hat{G})      
9:     return 𝒫^\hat{\mathcal{P}}
\hypertarget
Algorithm 2 Empirically reconstruct the symbols of top-level node vv in 𝒫\mathcal{P}.
Collect-Symbols\NR@gettitleCollect-Symbols
1:procedure Collect-Symbols(v,𝒫^v,\hat{\mathcal{P}})
2:     for all blocks b[B]b\in[B] do
3:         repeat
4:               Find extant triple (x,y,z)(x,y,z) such that:           1) vv is a joint LCA of x,y,zx,y,z,           2) xx, yy, and zz all have the same symbol σ\sigma in bb, and           3) σ\sigma is not yet recorded for block bb in vv.
5:              Record the symbol σ\sigma for block bb in vv.
6:         until two distinct symbols are recorded for block bb, or no such triple exists.      
\hypertarget
Algorithm 3 Perform statistical tests to detect siblinghood
Test-Siblinghood\NR@gettitleTest-Siblinghood
1:procedure Test-Siblinghood(depth (k1)(k-1) pedigree 𝒫^\hat{\mathcal{P}})
2:      V{vVk1(𝒫^):(# fully recovered blocks of v)0.99|B|}V\leftarrow\{v\in V_{k-1}(\hat{\mathcal{P}}):(\text{\# fully recovered blocks of }v)\geq 0.99|B|\}
3:     EE\leftarrow\varnothing
4:     for all distinct triples {u,v,w}2V\{u,v,w\}\subset 2^{V} at level k1k-1 do
5:         if 0.21|B|\geq 0.21|B| blocks bb such that s^u(b)s^v(b)s^w(b)\hat{s}_{u}(b)\cap\hat{s}_{v}(b)\cap\hat{s}_{w}(b)\neq\varnothing then
6:              EE{u,v,w}E\leftarrow E\cup\{u,v,w\}               
7:     return G^=(V,E)\hat{G}=(V,E) \triangleright 3-wise sibling hypergraph
\hypertarget
Algorithm 4 Construct ancestors, given top-level 3-way sibling relationship.
Assign-Parents\NR@gettitleAssign-Parents
1:procedure Assign-Parents(𝒫,G\mathcal{P},G)
2:     repeat
3:         𝒞Any-Maximal-Clique(G)\mathcal{C}\leftarrow\textsc{Any-Maximal-Clique}(G)
4:         Remove one copy of all hyper-edges in 𝒞\mathcal{C} from GG.
5:         If |𝒞|d|\mathcal{C}|\geq d, attach a level-kk parent in 𝒫\mathcal{P} for all nodes from 𝒞\mathcal{C}.
6:     until no maximal cliques of size d\geq d remain in GG.
\hypertarget

The goal of the rest of this section is to prove the correctness of \namerefalgo:reconstruct_all. We now formally state our guarantee:

Theorem 6.1 (Main theorem, formal).

Let 𝒫^\hat{\mathcal{P}} be the depth-TT coupled pedigree output by the algorithm \namerefalgo:reconstruct_all, applied to the gene sequences in V0(𝒫)V_{0}(\mathcal{P}). With probability tending towards 11 as NN\to\infty, 𝒫^\hat{\mathcal{P}} is an induced subpedigree of 𝒫\mathcal{P} such that |Vi(𝒫^)|η(α)|Vi(𝒫)||V_{i}(\hat{\mathcal{P}})|\geq\eta(\alpha)|V_{i}(\mathcal{P})| for all levels i{0,,T}i\in\{0,\ldots,T\}, where η(α)1\eta(\alpha)\to 1 as α\alpha\to\infty. The probability is over the randomness of the coupled pedigree 𝒫\mathcal{P} and the inheritance procedure with parameters set as in Section 4.1.

We define η(α):=1(1/d(α))\eta(\alpha):=1-(1/d(\alpha)) where, for a given value of α\alpha, d(α)d(\alpha) is defined to be the largest value of dd such that Proposition 5.8 holds. Observe that d(α)d(\alpha)\to\infty as α\alpha\to\infty because Proposition 5.8 holds for arbitrarily large values dd. Therefore, η(α)1\eta(\alpha)\to 1 as α\alpha\to\infty.

We make use of the following high-probability events, provided α\alpha is a large enough constant so that d=d(α)d=d(\alpha) satisfies the hypothesis of Lemma 5.9, NN is sufficiently large with respect to α\alpha, the total number of generations is T=εlogNT=\varepsilon\log N, where ε=O(1/logα)\varepsilon=O(1/\log\alpha), and the gene sequence length is B=Ω(logN)B=\Omega(\log N).

Proposition 6.2 (Key Reductions).

With probability tending towards 11 as NN\to\infty, the pedigree 𝒫\mathcal{P} satisfies:

  1. 1.

    For each level kk, each clique of GkG_{k} has a single parent (Lemma 4.8).

  2. 2.

    For each level kk, the maximal cliques of GkG_{k} are edge-disjoint, in such a way that each vVk(𝒫)v\in V_{k}(\mathcal{P}) is contained in at most two maximal cliques (Lemma 4.9).

  3. 3.

    Each triple u,v,wu,v,w of nodes, has at most 3 collisions (Corollary 4.5), implying

    1. (a)

      their joint LCA is unique (Lemma 4.10), and

    2. (b)

      all inheritance paths for some node x{u,v,w}x\in\{u,v,w\} go through the unique LCA (Lemma 4.11).

  4. 4.

    The fraction of overlap is at least 24.9%24.9\% for siblings in 𝒫\mathcal{P} while for non-mutual siblings it is at most 18.85%18.85\% (Lemmas 5.1 and 5.2).

  5. 5.

    For each level kk, at least η(α)\eta(\alpha) fraction of nodes in Vk(𝒫)V_{k}(\mathcal{P}) are awesome (Proposition 5.8).

  6. 6.

    If uV(𝒫)u\in V(\mathcal{P}) is awesome, then it is bb-good for 99%99\% of blocks b[B]b\in[B] (Lemma 5.9).

The “probability tending towards 1” portion of Theorem 6.1 can be quantified via a union bound on the probability of failure of any of the events in Proposition 6.2, while the “|V(𝒫^)|η(α)|V(𝒫)||V(\hat{\mathcal{P}})|\geq\eta(\alpha)|V(\mathcal{P})|” guarantee comes from the fact that we recover 100%100\% of the awesome nodes in conjunction with Condition 5. With this as a simplification, we proceed with the proof of Theorem 6.1.

The upcoming lemma (Lemma 6.3) proves the correctness of the very first iteration (depth 1 from depth 0), and therefore serves as the base case. The inductive step (Lemma 6.4) is presented immediately afterwards. For the remainder of this section, we write 𝒫^k\hat{\mathcal{P}}_{k} to denote the depth-kk reconstructed pedigree after the kkth iteration of \namerefalgo:reconstruct_all, (𝒫^0\hat{\mathcal{P}}_{0} is the depth-0 pedigree of all the extant nodes). In contrast, let 𝒫k\mathcal{P}_{k} denote the subpedigree of 𝒫\mathcal{P} (the ground truth) induced by graded levels V0V_{0} up to VkV_{k}.

Lemma 6.3.

Let G^0\hat{G}_{0} denote the estimated 33-regular siblinghood hypergraph for the extant nodes (line 7 of \namerefalgo:test-siblings). Consider the pedigree 𝒫^1\hat{\mathcal{P}}_{1} created by \namerefalgo:construct-parents applied to (𝒫^0,G^0)(\hat{\mathcal{P}}_{0},\hat{G}_{0}). Then there exists an injective homomorphism ϕ:𝒫^1𝒫1\phi:\hat{\mathcal{P}}_{1}\to\mathcal{P}_{1} so that the induced subgraph on ϕ(𝒫^1)\phi(\hat{\mathcal{P}}_{1}) is isomorphic to 𝒫^1\hat{\mathcal{P}}_{1}. Moreover, ϕ(𝒫^1)\phi(\hat{\mathcal{P}}_{1}) contains A1A_{\leq 1}, where A1A_{\leq 1} is the set of awesome nodes at levels 1\leq 1 in 𝒫\mathcal{P}.

Proof.

Let G0G_{0} denote the true siblinghood hypergraph on extant nodes with at least two siblings. By Condition 4, we have that G0^G0\hat{G_{0}}\cong G_{0}. Since both graphs have the same set of vertices, we simply write G0^=G0\hat{G_{0}}=G_{0}.

This gives a natural, explicit characterization of ϕ\phi. For an extant node vV0(𝒫1)v\in V_{0}(\mathcal{P}_{1}), define ϕ(v)=v\phi(v)=v so that it is the identity map on the extant. Given couple u^V1(𝒫^1)\hat{u}\in V_{1}(\hat{\mathcal{P}}_{1}), define ϕ(u^)\phi(\hat{u}) to be the parent couple uV1(𝒫1)u\in V_{1}(\mathcal{P}_{1}) of the children of u^\hat{u}. The condition G0^G0\hat{G_{0}}\cong G_{0} implies that at least one such choice for uu exists, and moreover by Condition 1, uu is the unique parent.

ϕ\phi is injective: Let u^,v^V1(𝒫^1)\hat{u},\hat{v}\in V_{1}(\hat{\mathcal{P}}_{1}) with u^v^\hat{u}\neq\hat{v}. At the extant level, the maximal cliques in G0G_{0} are vertex disjoint by Condition 2. Hence, the children of u^\hat{u} and the children of v^\hat{v} have empty intersection. Moreover in 𝒫1\mathcal{P}_{1}, vertex-disjoint maximal cliques have distinct parents. Therefore, ϕ(u^)ϕ(v^)\phi(\hat{u})\neq\phi(\hat{v}), as desired.

ϕ\phi respects edges: We already know that (u^,v)E(𝒫^1)(ϕ(u^),v)E(𝒫1)(\hat{u},v)\in E(\hat{\mathcal{P}}_{1})\implies(\phi(\hat{u}),v)\in E(\mathcal{P}_{1}). Now suppose that (ϕ(u^),v)(\phi(\hat{u}),v) is an edge in 𝒫1\mathcal{P}_{1} for u^V1(𝒫^1)\hat{u}\in V_{1}(\hat{\mathcal{P}}_{1}) and vV0(𝒫^1)v\in V_{0}(\hat{\mathcal{P}}_{1}). Since uu is in the image of ϕ\phi, it follows that uu has at least 33 children w,x,yw,x,y that passed the siblings test in our algorithm. If vv is one of w,x,yw,x,y, we’re done, so suppose not. By Condition 5.1, the extant triples {v,w,x}\{v,w,x\}, {v,x,y}\{v,x,y\}, and {v,w,y}\{v,w,y\} all have at least 24%24\% overlap. Therefore, v,w,x,yv,w,x,y form a clique in G0^\hat{G_{0}}, and line 5 of \namerefalgo:construct-parents states that u^\hat{u} is a parent of all four, so (u^,v)(\hat{u},v) is an edge in 𝒫^1\hat{\mathcal{P}}_{1}.

The image of ϕ\phi contains the awesome nodes in 𝒫1\mathcal{P}_{1}: This part is trivially true for the extant nodes, so consider only the awesome nodes A1V1(𝒫1)A_{1}\subset V_{1}(\mathcal{P}_{1}). By definition, any awesome node uV1(𝒫1)u\in V_{1}(\mathcal{P}_{1}) is dd-rich. Since d3d\geq 3, the children of uu form a maximal clique of size at least 33 in G0G_{0}. Therefore, \namerefalgo:construct-parents creates a parent u^\hat{u} for these children in 𝒫^1\hat{\mathcal{P}}_{1}, which gives the pre-image of uu. ∎

Lemma 6.4.

Let k2k\geq 2 and suppose that we are given 𝒫^k1\hat{\mathcal{P}}_{k-1}. Assume that there exists an injective homomorphism ϕ:𝒫^k1𝒫k1\phi:\hat{\mathcal{P}}_{k-1}\to\mathcal{P}_{k-1} which satisfies

  1. 1.

    ϕ|𝒫^0Id\phi|_{\hat{\mathcal{P}}_{0}}\equiv Id,

  2. 2.

    ϕ(𝒫^k1)𝒫k1\phi(\hat{\mathcal{P}}_{k-1})\subset\mathcal{P}_{k-1} induces a subgraph isomorphic to 𝒫^k1\hat{\mathcal{P}}_{k-1}, and

  3. 3.

    ϕ(𝒫^k1)\phi(\hat{\mathcal{P}}_{k-1}) contains the awesome nodes in sets A0,A1,,Ak1A_{0},A_{1},\ldots,A_{k-1}.

Let 𝒫^k\hat{\mathcal{P}}_{k} be the level-kk extension of 𝒫^k1\hat{\mathcal{P}}_{k-1}, via lines 4 through 7 of \namerefalgo:reconstruct_all. Then there exists a level-kk extension of the map ϕ:𝒫^k𝒫k\phi:\hat{\mathcal{P}}_{k}\to\mathcal{P}_{k} with the same properties.

We prove this in two stages. The first part (Lemma 6.5) asserts that we reconstruct the sibling relationships correctly, while the latter (Lemma 6.6) assures that the cliques of this estimated siblinghood hypergraph are actually the faithful, “largest possible” groupings of siblings.

Lemma 6.5.

Assume the hypotheses of Lemma 6.4, and let G^k1\hat{G}_{k-1} be the estimated siblinghood hypergraph constructed by \namerefalgo:test-siblings, line 7, on input 𝒫^k1\hat{\mathcal{P}}_{k-1}. Then the subgraph of Gk1G_{k-1} induced by ϕ(G^k1)\phi(\hat{G}_{k-1}) is isomorphic to G^k1\hat{G}_{k-1}, and moreover ϕ(G^k1)\phi(\hat{G}_{k-1}) contains all of the awesome nodes Ak1A_{k-1} at level k1k-1.

The upcoming statements (4, 5 and 6) are pivotal for the proof of Lemma 6.5.

Definition 6.1.

For an awesome node u𝒫ku\in\mathcal{P}_{k}, its awesome subtree is the subgraph of 𝒫k\mathcal{P}_{k} that is the union of all paths from uu to extant nodes that consist entirely of awesome nodes.

Claim 4.

Suppose that there is a reconstruction map ϕ:𝒫^k1𝒫k1\phi:\hat{\mathcal{P}}_{k-1}\to\mathcal{P}_{k-1} satisfying the hypotheses in Lemma 6.4. Then for any awesome node u=ϕ(u^)Vk1(𝒫k1)u=\phi(\hat{u})\in V_{k-1}(\mathcal{P}_{k-1}), its awesome subtree SuS_{u} satisfies ϕ1(Su)=𝖽𝖾𝗌𝖼(u^)\phi^{-1}(S_{u})=\mathsf{desc}(\hat{u}).

Proof of 4.

Note that Line 5 of \namerefalgo:construct-parents ensures that every node in 𝒫^k1\hat{\mathcal{P}}_{k-1} is dd-rich. Since ϕ\phi is an injective homomorphism, it follows that every node in ϕ(𝖽𝖾𝗌𝖼(u^))\phi(\mathsf{desc}(\hat{u})) is also dd-rich in 𝒫\mathcal{P}. Furthermore, uu being awesome implies that all of its descendants are awesome in 𝒫\mathcal{P}, since none of its descendants can have collisions in its ancestral pedigree (Definition 5.4). By the definition of the awesome subtree (Definition 6.1), it holds that ϕ(𝖽𝖾𝗌𝖼(u^))Su\phi(\mathsf{desc}(\hat{u}))\subseteq S_{u}.

For the other direction (ϕ(𝖽𝖾𝗌𝖼(u^))Su\phi(\mathsf{desc}(\hat{u}))\supseteq S_{u}), let vV0(𝒫)v\in V_{0}(\mathcal{P}) be an extant node so that there is a path from uu to vv consisting only of awesome nodes. By condition 3 of Lemma 6.4, all of the nodes along this path are in the image of ϕ\phi. ∎

Claim 5.

Let ϕ\phi be as in Lemma 6.4, and let u=ϕ(u^)u=\phi(\hat{u}) for some u^Vk1(𝒫^k1)\hat{u}\in V_{k-1}(\hat{\mathcal{P}}_{k-1}). Suppose that in block bb the symbols σ^1\hat{\sigma}_{1} and σ^2\hat{\sigma}_{2} are recovered for u^\hat{u} by applying Algorithm 1 to u^\hat{u}. Then it holds that uu also has symbols σ^1,σ^2\hat{\sigma}_{1},\hat{\sigma}_{2} in block bb.

Proof of 5.

For i=1,2i=1,2, suppose that nodes xi,yi,ziV0(𝒫^0)=V0(𝒫)x_{i},y_{i},z_{i}\in V_{0}(\hat{\mathcal{P}}_{0})=V_{0}(\mathcal{P}) have the symbol σ^i\hat{\sigma}_{i} in block bb and are used by \namerefalgo:collect-symbols to recover σ^i\hat{\sigma}_{i} in block bb of u^\hat{u}. Recall that xi,yi,zix_{i},y_{i},z_{i} are all descended from distinct children of u^\hat{u}. Let ϕ(𝒫^k1)\phi(\hat{\mathcal{P}}_{k-1}) induce subpedigree 𝒬\mathcal{Q} in 𝒫\mathcal{P}.

By the hypotheses of Lemma 6.4, 𝒬𝒫^k1\mathcal{Q}\cong\hat{\mathcal{P}}_{k-1} and so uu must be a common ancestor of xi,yi,zix_{i},y_{i},z_{i} in 𝒬\mathcal{Q}. By line 4 of \namerefalgo:collect-symbols and because 𝒬𝒫^k1\mathcal{Q}\cong\hat{\mathcal{P}}_{k-1}, u^\hat{u} – and therefore uu – is their joint LCA. With respect to 𝒫\mathcal{P}, Conditions 3a and 3b tell us the much stronger condition that uu is their only LCA, and that all paths in 𝒫\mathcal{P} from any common ancestor of xi,yi,zix_{i},y_{i},z_{i} to xix_{i} (without loss of generality) must pass through uu. Therefore, if xi,yi,zix_{i},y_{i},z_{i} all inherit symbols σ^i\hat{\sigma}_{i} in block bb, the symbol σ^i\hat{\sigma}_{i} must have passed through block bb of uu via the infinite symbols assumption. ∎

Claim 6.

Let ϕ\phi be as in Lemma 6.4, and let u=ϕ(u^)u=\phi(\hat{u}) for some u^Vk1(𝒫^k1)\hat{u}\in V_{k-1}(\hat{\mathcal{P}}_{k-1}). Suppose that uu is awesome in 𝒫\mathcal{P}. If uu is bb-good and has symbols σ1,σ2\sigma_{1},\sigma_{2} in block bb, then \namerefalgo:collect-symbols recovers the symbols σ1\sigma_{1} and σ2\sigma_{2} for u^\hat{u} in block bb.

Proof of 6.

By 5, we only need to show that at least two symbols in block bb are reconstructed by \namerefalgo:collect-symbols applied to u^\hat{u}. Note that bb-goodness implies σ1σ2\sigma_{1}\neq\sigma_{2}.

By bb-goodness of uu, as in the proof of Lemma 5.9, there is a witnessing triple for each of the σi\sigma_{i} contained in the extant of the awesome subtree SuS_{u}. By 4, 𝖽𝖾𝗌𝖼(u^)\mathsf{desc}(\hat{u}) also contains these witnesses. Since extant nodes are the exact same in 𝒫\mathcal{P} compared to 𝒫^k1\hat{\mathcal{P}}_{k-1} by hypothesis 1 of Lemma 6.4, \namerefalgo:collect-symbols applied to u^\hat{u} recovers σ1,σ2\sigma_{1},\sigma_{2} in block bb. ∎

Proof of Lemma 6.5.

By assumption, ϕ:G^k1Gk1\phi:\hat{G}_{k-1}\to G_{k-1} is injective. To first see that ϕ\phi is a hypergraph homomorphism, let u^,v^,w^Vk1(𝒫^k1)\hat{u},\hat{v},\hat{w}\in V_{k-1}(\hat{\mathcal{P}}_{k-1}) be distinct nodes satisfying line 2 of \namerefalgo:test-siblings, and let u=ϕ(u^),v=ϕ(v^)u=\phi(\hat{u}),v=\phi(\hat{v}), and w=ϕ(w^)w=\phi(\hat{w}) denote their counterparts in 𝒫\mathcal{P}.

Suppose that u,v,wu,v,w are not mutually siblings. By Condition 4, u,v,wu,v,w have at most 0.1885|B|0.1885|B| mutually overlapping blocks. By 5, for all x^{u^,v^,w^}\hat{x}\in\{\hat{u},\hat{v},\hat{w}\}, the symbols reconstructed for x^\hat{x} in block bb using \namerefalgo:collect-symbols are a subset of the symbols in block bb of x:=ϕ(x^){u,v,w}x:=\phi(\hat{x})\in\{u,v,w\}. Therefore, u^,v^,w^\hat{u},\hat{v},\hat{w} have mutually overlapping symbols in at most 0.1885|B|0.1885|B| blocks. Since 0.1885<0.210.1885<0.21, \namerefalgo:test-siblings does not place a hyperedge between u^,v^,w^\hat{u},\hat{v},\hat{w} in G^1\hat{G}_{1}.

To conclude that the induced subgraph ϕ(G^k1)\phi(\hat{G}_{k-1}) is isomorphic to G^k1\hat{G}_{k-1}, it remains to show that if u,v,wu,v,w are mutual siblings in 𝒫\mathcal{P}, then {u^,v^,w^}\{\hat{u},\hat{v},\hat{w}\} is a hyperedge in G^k1\hat{G}_{k-1}. Note that 99%99\% of the blocks of u^,v^,w^\hat{u},\hat{v},\hat{w} were recovered by \namerefalgo:collect-symbols by the definition of G^k1\hat{G}_{k-1}, and by 5, the symbols of u^,v^,w^\hat{u},\hat{v},\hat{w} in block bb are a subset of the symbols of u,v,wu,v,w, respectively, in block bb. By Condition 4, the mutual overlap between the siblings u,v,wu,v,w is at least 0.249|B|0.249|B|. Thus, by a union bound on the occurrence of 1%-fraction of unrecovered blocks, the mutual overlap between u^,v^,w^\hat{u},\hat{v},\hat{w} is at least (0.2490.03)|B|0.21|B|(0.249-0.03)|B|\geq 0.21|B|. Therefore, \namerefalgo:test-siblings constructs a hyperedge on u^,v^,w^\hat{u},\hat{v},\hat{w}, as desired. It follows that the induced subgraph ϕ(G^k1)\phi(\hat{G}_{k-1}) is isomorphic to G^k1\hat{G}_{k-1}.

Finally, we show that the awesome nodes Ak1A_{k-1} are fully contained in ϕ(G^k1)\phi(\hat{G}_{k-1}). By Condition 6, awesome nodes are bb-good. Now apply 6, to conclude that \namerefalgo:collect-symbols reconstructs 99% of the blocks in each awesome node uu, so uG^k1u\in\hat{G}_{k-1} according to Line 2 of \namerefalgo:test-siblings. ∎

Lemma 6.6.

Let 𝒞\mathscr{C} denote the maximal (hyper)cliques in the subgraph of Gk1G_{k-1} induced by ϕ(G^k1)\phi(\hat{G}_{k-1}), and let 𝒞algo\mathscr{C}_{\text{algo}} denote the (hyper)cliques probed by \namerefalgo:construct-parents applied to G^k1\hat{G}_{k-1}. Given 𝒞𝒞algo\mathcal{C}\in\mathscr{C}_{\text{algo}}, define ϕ(𝒞)\phi(\mathcal{C}) to be the set given by the image of 𝒞\mathcal{C} under ϕ\phi. Then ϕ\phi is a bijection between 𝒞algo\mathscr{C}_{\text{algo}} and 𝒞\mathscr{C}.

Proof.

By Lemma 6.5, the subgraph HH induced by ϕ(G^k1)\phi(\hat{G}_{k-1}) is isomorphic to G^k1\hat{G}_{k-1}. Hence, it suffices to show that the cliques probed by \namerefalgo:construct-parents applied to HH are precisely the maximal cliques of HH. Recall that by Condition 2, the maximal cliques in HH are edge-disjoint, and every node of HH is involved in at most 22 cliques.

It is helpful to imagine the cliques 𝒞1,𝒞2,,𝒞M𝒞algo\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{M}\in\mathscr{C}_{\text{algo}} as being listed out in the same order that they are probed by \namerefalgo:construct-parents, indexed by timesteps m=1,2,,Mm=1,2,\ldots,M. Let H(0)=HH^{(0)}=H, and let H(m)H^{(m)} denote the result of removing the edges of the clique 𝒞t\mathcal{C}_{t} from H(m1)H^{(m-1)}.

We argue that for all mm, the graph H(m)H^{(m)} is a union of edge-disjoint maximal cliques, and any two maximal cliques intersect in at most a single vertex. The base case m=1m=1 is true by Condition 2. This holds for m>1m>1 because the above property is preserved when all of the edges are removed from a single maximal clique in H(m1)H^{(m-1)}. Moreover, for all mm, the maximal cliques in H(m)H^{(m)} are the same as those of H(m1)H^{(m-1)} but with a single maximal clique 𝒞m\mathcal{C}_{m} in H(m1)H^{(m-1)} removed. Hence, it also follows by induction that for all mm, the maximal clique 𝒞m\mathcal{C}_{m} in H(m1)H^{(m-1)} is also a maximal clique in HH.

Since \namerefalgo:construct-parents terminates at the first time MM when H(M)H^{(M)} has no hyperedges, we conclude that 𝒞1,,𝒞M\mathcal{C}_{1},\ldots,\mathcal{C}_{M} are all of the maximal cliques in HH, as desired. ∎

Proof of Lemma 6.4.

We first extend the definition of ϕ\phi to level kk. For u^Vk(𝒫^k)\hat{u}\in V_{k}(\hat{\mathcal{P}}_{k}), we define ϕ(u^)Vk(𝒫k)\phi(\hat{u})\in V_{k}(\mathcal{P}_{k}) as follows. Let 𝒞^Vk1(𝒫^k)\hat{\mathcal{C}}\subset V_{k-1}(\hat{\mathcal{P}}_{k}) denote the children of u^\hat{u}. By Lemmas 6.5 and 6.6, ϕ(𝒞^)\phi(\hat{\mathcal{C}}) is a clique in Gk1G_{k-1}. Define ϕ(u^)Vk(𝒫k)\phi(\hat{u})\in V_{k}(\mathcal{P}_{k}) to be the parent of the children of the clique ϕ(𝒞^)\phi(\hat{\mathcal{C}}) in 𝒫\mathcal{P}. The map ϕ\phi is well-defined at level kk because of Condition 1. It remains to show that ϕ\phi is an isomorphism onto its image, and moreover that its image contains all of the awesome nodes at level kk.

The map ϕ\phi is injective: We know this is true for ϕ|𝒫^k1\phi\big{|}_{\hat{\mathcal{P}}_{k-1}}, so it suffices to consider injectivity of ϕ\phi when restricted to the nodes at level kk in 𝒫^k\hat{\mathcal{P}}_{k}. Let u^,v^Vk(𝒫^k)\hat{u},\hat{v}\in V_{k}(\hat{\mathcal{P}}_{k}) with u^v^\hat{u}\neq\hat{v}. Let 𝒞\mathcal{C} (resp., 𝒞\mathcal{C}^{\prime}) denote the maximal clique in G^k1\hat{G}_{k-1} that consists of the children of u^\hat{u} (resp., v^\hat{v}). By Lemma 6.6, ϕ(𝒞)\phi(\mathcal{C}) and ϕ(𝒞)\phi(\mathcal{C}^{\prime}) are distinct maximal cliques in the induced subgraph ϕ(G^k1)\phi(\hat{G}_{k-1}), and therefore, are contained in distinct maximal cliques in Gk1G_{k-1}. Distinct maximal cliques in Gk1G_{k-1} have distinct parents, so by the definition of ϕ\phi, we conclude that ϕ(u^)ϕ(v^)\phi(\hat{u})\neq\phi(\hat{v}), as desired.

The map ϕ\phi is edge-preserving: Suppose that (u^,v^)(\hat{u},\hat{v}) is an edge in 𝒫^k\hat{\mathcal{P}}_{k} with u^Vk(𝒫^k)\hat{u}\in V_{k}(\hat{\mathcal{P}}_{k}) and v^Vk1(𝒫^k)\hat{v}\in V_{k-1}(\hat{\mathcal{P}}_{k}). Consider the maximal clique 𝒞^\hat{\mathcal{C}} containing v^\hat{v} in G^k1\hat{G}_{k-1}. By Lemma 6.6, ϕ(𝒞^)\phi(\hat{\mathcal{C}}) is a maximal clique in the induced subgraph ϕ(G^k1)Gk1\phi(\hat{G}_{k-1})\subset G_{k-1}, and by construction of ϕ\phi, the parent of ϕ(𝒞^)\phi(\hat{\mathcal{C}}) is ϕ(u^)\phi(\hat{u}). Therefore, the edge (ϕ(u^),ϕ(v^))(\phi(\hat{u}),\phi(\hat{v})) is in the pedigree 𝒫k\mathcal{P}_{k}.

Suppose now that the edge (u,v)=(ϕ(u^),ϕ(v^))(u,v)=(\phi(\hat{u}),\phi(\hat{v})) is in the pedigree 𝒫k\mathcal{P}_{k}. Consider the maximal clique 𝒞Gk1\mathcal{C}^{\prime}\subset G_{k-1} containing vv. By Lemma 6.6, 𝒞:=ϕ1(𝒞)={x𝒫^k:ϕ(x)𝒞}\mathcal{C}:=\phi^{-1}(\mathcal{C}^{\prime})=\{x\in\hat{\mathcal{P}}_{k}:\phi(x)\in\mathcal{C}^{\prime}\} is a maximal clique in G^k1\hat{G}_{k-1}. By Lemma 6.6 and the construction in \namerefalgo:construct-parents, we conclude that the parent of v^\hat{v} in 𝒫^k\hat{\mathcal{P}}_{k} is mapped to uu under ϕ\phi. By injectivity of ϕ\phi, this parent is precisely ϕ1(u)=u^\phi^{-1}(u)=\hat{u}. Therefore, (u^,v^)(\hat{u},\hat{v}) is an edge in 𝒫^k\hat{\mathcal{P}}_{k}.

The image of ϕ\phi contains the awesome nodes in 𝒫k\mathcal{P}_{k}: It suffices to prove the statement for the awesome nodes at level kk, which we denote by AkA_{k}. Suppose that uu is an awesome node at level kk of 𝒫\mathcal{P}. By awesomeness, uu has at least dd awesome children. Let 𝒞\mathcal{C}^{\prime} denote the clique in Gk1G_{k-1} given by the awesome children of uu. By Lemmas 6.5 and 6.6, 𝒞:=ϕ1(𝒞)\mathcal{C}:=\phi^{-1}(\mathcal{C}^{\prime}) satisfies |𝒞|=|𝒞|d|\mathcal{C}|=|\mathcal{C}^{\prime}|\geq d because all of the awesome children up to level k1k-1 are in the image of ϕ\phi, by the inductive hypotheses. By Lemma 6.6, the maximal clique 𝒞~\tilde{\mathcal{C}} containing 𝒞\mathcal{C} in G^k1\hat{G}_{k-1} satisfies that ϕ(𝒞~)\phi(\tilde{\mathcal{C}}) are all children of uu. By the definition of \namerefalgo:construct-parents and ϕ\phi at level kk, we conclude that a parent u^\hat{u} is constructed for 𝒞~𝒞\tilde{\mathcal{C}}\supset\mathcal{C} and ϕ(u^)=u\phi(\hat{u})=u, as desired. ∎

Acknowledgments We thank Vishesh Jain for many helpful discussions.

References

  • [23a] 23andMe. About us. \urlhttps://mediacenter.23andme.com/company/about-us/.
  • [Anc] Ancestry.com. Ancestry continues to lead the industry with world’s largest consumer dna network. \urlhttps://www.ancestry.com/corporate/newsroom/press-releases/ancestry%C2%AE-surpasses-15-million-members-its-dna-network-powering-unparalleled.
  • [DMR06] Constantinos Daskalakis, Elchanan Mossel, and Sébastien Roch. Optimal phylogenetic reconstruction. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 159–168, 2006.
  • [EKPS00] William Evans, Claire Kenyon, Yuval Peres, and Leonard J. Schulman. Broadcasting on trees and the ising model. Ann. Appl. Probab., 10(2):410–433, 05 2000.
  • [ESPC18] Yaniv Erlich, Tal Shor, Itsik Pe’er, and Shai Carmi. Identity inference of genomic data using long-range familial searches. Science, 362(6415):690–694, 2018.
  • [ESSW99] Péter L Erdős, Michael A Steel, László A Székely, and Tandy J Warnow. A few logs suffice to build (almost) all trees (i). Random Structures & Algorithms, 14(2):153–184, 1999.
  • [HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
  • [Hui17] Jisca Huisman. Pedigree reconstruction from snp data: parentage assignment, sibship clustering and beyond. Molecular ecology resources, 17(5):1009–1024, 2017.
  • [HWH+13] Dan He, Zhanyong Wang, Buhm Han, Laxmi Parida, and Eleazar Eskin. Iped: inheritance path-based pedigree reconstruction algorithm using genotype data. Journal of Computational Biology, 20(10):780–791, 2013.
  • [HWPE14] Dan He, Zhanyong Wang, Laxmi Parida, and Eleazar Eskin. Iped2: Inheritance path based pedigree reconstruction algorithm for complicated pedigrees. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB ’14, page 202–210, New York, NY, USA, 2014. Association for Computing Machinery.
  • [Ins19] National Human Genome Research Institute. The cost of sequencing a human genome. \urlhttps://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost, 2019.
  • [KA15] Marek Kimmel and David Axelrod. Branching Processes in Biology. Springer Publishing Company, Incorporated, 2nd edition, 2015.
  • [KKM+19] Younhun Kim, Frederic Koehler, Ankur Moitra, Elchanan Mossel, and Govind Ramnarayan. How many subpopulations is too many? exponential lower bounds for inferring population histories. Journal of Computational Biology, 2019.
  • [KLKH11] Bonnie Kirkpatrick, Shuai Cheng Li, Richard M Karp, and Eran Halperin. Pedigree reconstruction using identity by descent. Journal of Computational Biology, 18(11):1481–1493, 2011.
  • [KM18] Gina Kolata and Heather Murphy. The golden state killer is tracked through a thicket of dna, and experts shudder. The New York Times, 4 2018.
  • [Lal16] Steven Lalley. Poisson processes. Statistics 312: Stochastic Processes, 2016.
  • [MMP18] Anuran Makur, Elchanan Mossel, and Yury Polyanskiy. Broadcasting on bounded degree dags. arXiv preprint arXiv:1803.07527, 2018.
  • [Mos01] Elchanan Mossel. Reconstruction on trees: beating the second eigenvalue. Annals of Applied Probability, pages 285–300, 2001.
  • [Mos04] Elchanan Mossel. Phase transitions in phylogeny. Transactions of the American Mathematical Society, 356(6):2379–2404, 2004.
  • [MR05] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375, 2005.
  • [MyH] MyHeritage. Myheritage end-of-year infographic. \urlhttps://blog.myheritage.com/2019/12/wrapping-up-a-fantastic-2019.
  • [Pol15] D Pollard. Mini empirical. Manuscript. \urlhttp://www.stat.yale.edu/ pollard/Books/Mini/, 2015.
  • [SH06] Mike Steel and Jotun Hein. Reconstructing pedigrees: a combinatorial perspective. Journal of theoretical biology, 240(3):360–367, 2006.
  • [SS03] Charles Semple and Mike Steel. Phylogenetics, volume 24. Oxford University Press on Demand, 2003.
  • [STH14] Doron Shem-Tov and Eran Halperin. Historical pedigree reconstruction from extant populations using partitioning of relatives (prepare). PLoS computational biology, 10(6), 2014.
  • [Tho00] Elizabeth A. Thompson. Statistical inference from genetic data on pedigrees. NSF-CBMS Regional Conference Series in Probability and Statistics, 6:i–169, 2000.
  • [Tho13] Elizabeth A Thompson. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics, 194(2):301–326, 2013.
  • [TS08] Bhalchandra D Thatte and Mike Steel. Reconstructing pedigrees: a stochastic perspective. Journal of theoretical biology, 251(3):440–449, 2008.
  • [Wan19] Jinliang Wang. Pedigree reconstruction from poor quality genotype data. Heredity, 122(6):719–728, 2019.