This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A dissimilarity measure for semidirected networks

Michael Maxfield1 0000-0003-3087-0085    Jingcheng Xu1 0000-0003-2387-9188    Cécile Ané1,2 1 Department of Statistics, University of Wisconsin - Madison, USA.2 Department of Botany, University of Wisconsin - Madison, USA. 0000-0002-4702-8217
Abstract

Semidirected networks have received interest in evolutionary biology as the appropriate generalization of unrooted trees to networks, in which some but not all edges are directed. Yet these networks lack proper theoretical study. We define here a general class of semidirected phylogenetic networks, with a stable set of leaves, tree nodes and hybrid nodes. We prove that for these networks, if we locally choose the direction of one edge, then globally the set of directed paths starting by this edge is stable across all choices to root the network. We define an edge-based representation of semidirected phylogenetic networks and use it to define a dissimilarity between networks, which can be efficiently computed in near-quadratic time. Our dissimilarity extends the widely-used Robinson-Foulds distance on both rooted trees and unrooted trees. After generalizing the notion of tree-child networks to semidirected networks, we prove that our edge-based dissimilarity is in fact a distance on the space of tree-child semidirected phylogenetic networks.

Index Terms:
phylogenetic, admixture graph, Robinson-Foulds, tree-child, μ\mu-representation, ancestral profile
publicationid: pubid:
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I Introduction

Historical relationships between species, virus strains or languages are represented by phylogenies, which are rooted graphs, in which the edge direction indicates the flow of time going forward. Semidirected phylogenetic networks are to rooted networks what undirected trees are to rooted trees. They appeared recently, following studies showing that the root and the direction of some edges in the network may not be identifiable, from various data types [28, 3]. Consequently, several methods to infer phylogenies from data aim to estimate semidirected networks, rather than fully directed networks, such as SNaQ [28], NANUQ [1], admixtools2 [23], poolfstat [11], NetRAX [22], and PhyNEST [18].

The theoretical identifiability of semidirected networks is receiving increased attention [12, 24, 33, 2] but graph theory is still at an early stage for this type of network [but see 21]. In particular, adapted distance and dissimilarity measures are lacking, as are tools to test whether two phylogenetic semidirected networks are isomorphic. These tools are urgently needed for applications. For example, when an inference method is evaluated using simulations, its performance is quantified by how often the inferred network matches the true network used to simulate data, or how similar the inferred network is to the true network. Tools for semidirected networks would also help summarize a posterior sample of networks output by Bayesian inference methods. Even for a basic summary such as the posterior probability of a given topology, we need to decide which semidirected networks are isomorphic, in a potentially very large posterior sample of networks.

Unless additional structure is assumed, current methods appeal to a naive strategy that considers all possible ways to root semidirected networks, and then use methods designed for directed networks, e.g. to check if the candidate rooted networks are isomorphic or to minimize a dissimilarity between the two rooted networks across all possible root positions.

In this work, we first generalize the notion of semidirected phylogenetic networks in which edges are either of tree type or hybrid type, such that any edge can be directed or undirected. We relax the constraint of a single root (of unknown position). Multi-rooted phylogenetic networks were recently introduced, although for directed networks, to represent the history of closely related and admixed populations [29] or distant groups of species that exchanged genes nonetheless [14, 15]. Our general definition requires care to define a set of leaves consistent across all compatible directed phylogenetic networks.

For these semidirected networks, we define an edge-based “μ\mu-representation” μE\mu_{E}, extending the node-based μ\mu-representation, denoted here as μV\mu_{V}, by Cardona et al. [7]. The “ancestral profile” of a rooted network contain the same information as the node-based representation μV\mu_{V}, that is: the number of directed paths from each node to each leaf [5]. So our edge-based representation μE\mu_{E} also extends the notion of ancestral profile to semidirected networks, in which ancestral relationships are unknown between some nodes because the root is unknown. Briefly, μE\mu_{E}’s information for an edge depends on whether the edge direction is known or implicitly constrained by the direction of other edges in the network. For example, if NN is a semidirected tree and an edge is explicitly or implicitly directed, then μE\mu_{E} associates the edge to the cluster of taxa below it. If instead the edge direction depends on the unknown placement of the root(s), then μE\mu_{E} associates the edge to the bipartition of the taxa obtained by deleting the edge from the semidirected tree. If NN has reticulations, μV\mu_{V} uses μ\mu-vectors to generalize the notion of clusters, storing the number of directed paths from a given node to each taxon. Our extension μE\mu_{E} associates a directed edge to the μ\mu-vector of its child node, and an undirected edge to two μ\mu-vectors: one for each direction that the edge can take. To handle and distinguish semidirected networks with multiple roots, μE\mu_{E} also associates each root to a μ\mu-vector, well-defined even when the exact root position(s) are unknown. We then define a dissimilarity measure dμEd_{\mu_{E}} between semidirected networks.

Our network representation μE\mu_{E} can be calculated in polynomial time, namely 𝒪(n|E|)\mathcal{O}(n|E|) where nn is the number of leaves and |E||E| is the number of edges. The associated network dissimilarity dμEd_{\mu_{E}} can then be calculated in 𝒪(|E|(n+log|E|))\mathcal{O}\bigl{(}|E|(n+\log|E|)\bigr{)} time. It provides the first dissimilarity measure adapted to semidirected networks (without iteratively network re-rooting) that can be calculated in polynomial time. Linz and Wicke [21] also recently considered semidirected networks (with a single root of unknown position). They showed that “cut edge transfer” rearrangements, which transform one network into another, define a finite distance on the space of level-1 networks with a fixed number of hybrid nodes. This distance is NP-hard to compute, however, because it extends the subtree prune and regraft (SPR) distance on unrooted trees [13].

Finally, we generalize the notion of tree-child networks to semidirected networks, and prove that dμEd_{\mu_{E}} is a true distance on the subspace of tree-child semidirected networks, extending the result of [7] to semidirected networks using an edge-based representation. On trees, this dissimilarity equals the widely used Robinson-Foulds distance between unrooted trees [27]. We provide concrete algorithms to construct μE\mu_{E} from a given semidirected network, and to reconstruct the network from μE\mu_{E}.

As a proper distance, dμEd_{\mu_{E}} can decide in polynomial time if two tree-child semidirected networks are isomorphic. For rooted phylogenetic networks, the isomorphism problem is polynomially equivalent to the general graph isomorphism problem, even if restricted to tree-sibling time-consistent rooted networks [8]. As semidirected networks include single-rooted networks, the graph isomorphism problem for semidirected networks is necessarily more complex. The general graph isomorphism problem was shown to be of subexponential complexity [4] but is not known to be solvable in polynomial time. Therefore, there is little hope of finding a dissimilarity that can be computed in polynomial time, and that is a distance for general semidirected phylogenetic networks. Hence, dissimilarities like dμEd_{\mu_{E}} offer a balance between computation time and the extent of network space in which it can discriminate between distinct networks.

II Basic definitions for semidirected graphs

For a graph GG we denote its vertex set as V(G)V(G) and its edge set as E(G)E(G). The subgraph induced by a subset of vertices VV(G)V^{\prime}\subseteq V(G) is denoted as G[V]G[V^{\prime}], and the edge-induced subgraph is denoted as G[E]G[E^{\prime}] for a subset EE(G)E^{\prime}\subseteq E(G).

We use the following terminology for a directed graph G=(V,E)G=(V,E). For a node vv, its in-degree is denoted as degi(v,G)\mathrm{deg}_{\mathrm{i}}(v,G) and out-degree as dego(v,G)\mathrm{deg}_{\mathrm{o}}(v,G). Its total degree deg(v,G)\mathrm{deg}(v,G) is degi(v,G)+dego(v,G)\mathrm{deg}_{\mathrm{i}}(v,G)+\mathrm{deg}_{\mathrm{o}}(v,G). We may omit GG when no confusion is likely. For u,vVu,v\in V, we write u>vu>v if there exists a directed path uvu\leadsto v. A node vv is a leaf if dego(v)=0\mathrm{deg}_{\mathrm{o}}(v)=0; vv is an internal node otherwise. We denote the set of leaves as VL(G)V_{L}(G). A node vv is a root if degi(v)=0\mathrm{deg}_{\mathrm{i}}(v)=0; a tree node if degi(v)1\mathrm{deg}_{\mathrm{i}}(v)\leq 1; and a hybrid node otherwise. We denote the set of tree nodes as VTV_{T} and the set of hybrid nodes as VHV_{H}. An edge (u,v)E(u,v)\in E is a tree edge if vv is a tree node; a hybrid edge otherwise. We denote the set of tree edges as ETE_{T} and the set of hybrid edges as EHE_{H}. A descendant of vv is any node uu such that v>uv>u. A tree path is a directed path consisting only of tree edges. A tree descendant leaf of vv is any leaf uu such that there exists a tree path vuv\leadsto u. An elementary path in GG is a directed path such that the first node has out-degree 11 in GG and all intermediate nodes have in-degree and out-degree 11 in GG. In the remainder, we use “path” to mean “directed path” for brevity, unless otherwise specified.

We now extend these notions to semidirected graphs.

Definition 1 (semidirected graph).

A semidirected graph NN is a tuple N=(V,E)N=(V,E), where VV is the set of vertices, and E=EUEDE=E_{U}\sqcup E_{D} is the set of edges, EUE_{U} being the set of undirected edges and EDE_{D} the set of directed edges.

Undirected edges in EUE_{U} are denoted as uvuv for some u,vVu,v\in V, instead of the standard notation of {u,v}\{u,v\} for brevity. Directed edges in EDE_{D} are denoted as (u,v)(u,v) for some u,vVu,v\in V, implying the direction from uu to vv, with uu referred to as the parent of the edge, and vv as its child. A directed graph is a semidirected graph with no undirected edges: EU=E_{U}=\emptyset.

For vV(N)v\in V(N), degi(v,N)\mathrm{deg}_{\mathrm{i}}(v,N) denotes the number of directed edges with vv as their child, dego(v,N)\mathrm{deg}_{\mathrm{o}}(v,N) the number of directed edges with vv as their parent, degu(v,N)\mathrm{deg}_{\mathrm{u}}(v,N) the number of undirected edges in NN incident to vv, and deg=degi+dego+degu\mathrm{deg}=\mathrm{deg}_{\mathrm{i}}+\mathrm{deg}_{\mathrm{o}}+\mathrm{deg}_{\mathrm{u}}. We may omit NN when no confusion is likely. Furthermore, child(v,N)={wV:(v,w)ED}\mathrm{child}(v,N)=\{w\in V:(v,w)\in E_{D}\}. NN is binary if deg(v)=1\mathrm{deg}(v)=1 or 33 for all nodes. NN is bicombining if degiv=2\mathrm{deg}_{\mathrm{i}}{v}=2 for all hybrid nodes.

A semidirected graph N=(V,E)N^{\prime}=(V,E^{\prime}) is compatible with another semidirected graph N=(V,E)N=(V,E) if NN^{\prime} can be obtained from NN by directing some undirected edges in NN.

The contraction of NN, denoted as Cont(N)\mathrm{Cont}(N), is the directed graph obtained by contracting every undirected edge in NN. It is well defined, as it can be viewed as the quotient graph of NN under the partition that groups nodes connected by a series of undirected edges. For vV(N)v\in V(N), Cont(v,N)\mathrm{Cont}(v,N) is defined as the node in Cont(N)\mathrm{Cont}(N) which vv gets contracted into.

In a semidirected graph NN, a node vV(N)v\in V(N) is a root if it only has outgoing edges. It is a tree node if degi(v,N)1\mathrm{deg}_{\mathrm{i}}(v,N)\leq 1. Otherwise, vv is called a hybrid node. The set of tree nodes is denoted as VTV_{T} or VT(N)V_{T}(N) and the set of hybrid nodes as VHV_{H} or VH(N)V_{H}(N). A tree edge is an undirected edge, or a directed edge whose child is a tree node. A hybrid edge is a directed edge whose child is a hybrid node. We denote the set of tree edges as ETE_{T} or ET(N)E_{T}(N) and the set of hybrid edges as EHE_{H} or EH(N)E_{H}(N).

A semidirected cycle is a semidirected graph if its undirected edges can be directed so that it becomes a directed cycle. A semidirected graph is acyclic if it does not contain a semidirected cycle. We refer to directed acyclic graphs as DAGs and to semidirected acyclic graphs as SDAGs.

Next, we define a more stringent notion of compatibility to maintain the classification of nodes and edges as being of tree versus hybrid type, illustrated in Figs. 1 and 2.

Definition 2 (phylogenetically compatible, rooted partner, network).

An SDAG NN^{\prime} is phylogenetically compatible with another SDAG NN if NN^{\prime} is compatible with NN and EH(N)=EH(N)E_{H}(N^{\prime})=E_{H}(N). A rooted partner of NN is a DAG that is phylogenetically compatible with NN. A multi-root semidirected network, or network for short, is an SDAG that admits a rooted partner.

Note that if SDAG NN^{\prime} is phylogenetically compatible with SDAG NN, then VT(N)=VT(N)V_{T}(N)=V_{T}(N^{\prime}) and VH(N)=VH(N)V_{H}(N)=V_{H}(N^{\prime}). Note also the rooted partner of a binary network may not be binary by the typical definition for rooted networks, since the root can have degree 33 instead of 22.

Not all SDAGs are networks (see Fig. 1 for an example). We are interested in networks rather than general SDAGs, because a network can represent evolutionary history up to “rerooting”, as captured by its rooted partners. Fig. 1 gives an example network (NN, top) and 2 of its 4 rooted partners (bottom).

Refer to caption

Figure 1: Examples of SDAGs and rooted partners. Top left: SDAG that is not a network, because it has no rooted partner: directing uvuv would cause uu or vv to become a hybrid node, and result in a non-phylogenetically compatible DAG. Top right: NN is a network. Bottom: 2 of NN’s 4 rooted partners. Each rooted partner has 2 roots (brown dots), one from each brown edge in NN. Of the nodes incident to e2e_{2} for example, either uu (bottom left) or vv (bottom right) can serve as root.

Traditionally, phylogenetic trees and networks are connected and with a single root [30]. We consider here a broader class of graphs, allowing for multiple connected components and multiple roots per connected component. We also allow for non-simple graphs, that is, for multiple parallel edges between the same two nodes uu and vv, directed and in the same direction for the graph to be acyclic. With a slight abuse of notation, we keep referring to each parallel edge as (u,v)(u,v). Self-loops are not allowed as they would cause the graph to be cyclic. The term “rooted partner” was introduced by Linz and Wicke [21] in the context of traditional semidirected phylogenetic networks in which the set of directed edges is precisely the set of hybrid edges, and for which a rooted partner has a single root.

We will use the following results frequently. The first one is trivial.

Proposition 1.

Let NN be an SDAG phylogenetically compatible with SDAG NN^{\prime}. Then ED(N)ED(N)ET(N)E_{D}(N)\setminus E_{D}(N^{\prime})\subseteq E_{T}(N).

Proposition 2.

Let NN be a network, and NN^{\prime} the semidirected graph obtained from NN by undirecting some of the tree edges of NN. Then NN^{\prime} is a network, and NN is phylogenetically compatible with NN^{\prime}.

Proof.

Let AA be the set of tree edges in NN that are undirected to obtain NN^{\prime}. We first consider the case when AA consists of a single edge (u,v)(u,v). We shall establish the following claims:

  1. 1.

    NN^{\prime} is a network;

  2. 2.

    the edges of NN^{\prime} in E(N)AE(N)\setminus A retain their type (undirected, tree, or hybrid);

  3. 3.

    NN is phylogenetically compatible with NN^{\prime}.

For claim 1, suppose for contradiction that NN^{\prime} is not an SDAG. Then there exists in NN^{\prime} a semidirected cycle CC which contains uvuv. Let GG be a rooted partner of NN, and C+C^{+} the subgraph made of the corresponding edges in CC. Since C+C^{+} cannot be a directed cycle, there exist two hybrid edges (a,h)(a,h) and (b,h)(b,h) in C+C^{+}. Both are also directed in NN by phylogenetic compatibility. Because they are hybrid edges, they are distinct from (u,v)(u,v), so they are directed in NN^{\prime} and CC. This contradicts CC being a semidirected cycle, showing that NN^{\prime} is an SDAG. Since it also admits GG as a rooted partner, NN^{\prime} is a network.

Claim 2 follows from the observation that the types of nodes (tree vs hybrid) in NN^{\prime} stays the same. Claim 3 follows from claim 2.

If AA contains multiple edges, then we iteratively undirect one edge at a time. By the previous argument, the resulting graph at each step is a network with which NN is phylogenetically compatible, because phylogenetic compatibility is transitive. Hence NN^{\prime} is a network and NN is phylogenetically compatible with it. ∎

The next definition is motivated by the fact that phylogenies are inferred from data collected at leaves, which are known entities with labels (individuals, populations, or species), whereas non-leaf nodes are inferred and unlabeled.

Definition 3 (unrooted, rooted and ambiguous leaves).

A node vv in a network NN is a rooted leaf in NN if vv is a leaf in every rooted partner of NN; and an unrooted leaf if it is a leaf in some rooted partner NN. We denote the set of rooted leaves and unrooted leaves as VRL(N)V_{RL}(N) and VUL(N)V_{UL}(N) respectively. An ambiguous leaf is a node in VUL(N)VRL(N)V_{UL}(N)\setminus V_{RL}(N).

Refer to caption
Figure 2: Examples to illustrate definitions and notations. Directed edges (EDE_{D}) are shown with arrows; hybrid edges (EHE_{H}) in blue. Root components, whose nodes (VRV_{R}) can serve as roots, are shown in brown. N2N_{2} is compatible but not phylogenetically compatible with N1N_{1} (e.g. ww is hybrid in N2N_{2} and not in N1N_{1}). N3N_{3} is phylogenetically compatible with N1N_{1}. Rooted leaves are bb and cc in N1N_{1} and N2N_{2}; aa is an ambiguous leaf in N1N_{1}. N3N_{3} has no ambiguous leaves so it is an \mathcal{L}-network on =VUL(N3)=VRL(N3)={a,b,c}\mathcal{L}=V_{UL}(N_{3})=V_{RL}(N_{3})=\{a,b,c\}. N1N_{1} has 1 root component. Its rooted partner rooted at uu is tree-child, but none of the others are (rooted at aa, ww or vv), so it is weakly tree-child. N2N_{2} has 2 root components R1={a}R_{1}=\{a\} and R2={u,v}R_{2}=\{u,v\}, and 2 rooted partners: one from the root choice ρ(R1)=a\rho(R_{1})=a, ρ(R2)=u\rho(R_{2})=u; and the other from the root choice ρ(R1)=a\rho(R_{1})=a, ρ(R2)=v\rho(R_{2})=v. N2N_{2} is not tree-child in either sense, as none of its rooted partners are tree-child. N2=𝒞(N2)N_{2}=\mathcal{C}(N_{2}) is complete. N3N_{3} is not complete: in 𝒞(N3)\mathcal{C}(N_{3}), edges incident to bb and cc are directed.

Clearly, VRL(N)VUL(N)V_{RL}(N)\subseteq V_{UL}(N). If NN is directed, then VRL(N)=VUL(N)=VL(N)V_{RL}(N)=V_{UL}(N)=V_{L}(N). As we will see later in 12, an ambiguous leaf is of degree 1 in the undirected graph obtained by undirecting all edges in NN, hence a leaf in the traditional sense. In Fig. 2 for example, aa is an ambiguous leaf in N1N_{1} but a rooted leaf in N3N_{3}.

In Section III we make no assumption regarding VUL(N)V_{UL}(N) and VRL(N)V_{RL}(N). But to extend μ\mu-vectors to semidirected graphs in Section IV, we will need a stable set of leaves across rooted partners. For a vector of distinct labels \mathcal{L}, an \mathcal{L}-DAG is a DAG whose leaves are bijectively labeled by \mathcal{L}. To extend this definition we impose VUL(N)=VRL(N)V_{UL}(N)=V_{RL}(N) below, but we will see (after 12) that this requirement is less stringent than it appears.

Definition 4 (leaf-labeled network).

A network NN is labeled in \mathcal{L} and called an \mathcal{L}-network, if VUL(N)=VRL(N)V_{UL}(N)=V_{RL}(N) and VUL(N)V_{UL}(N) is bijectively labeled by elements of \mathcal{L}. The leaf set of an \mathcal{L}-network NN is defined as VL(N)=VUL(N)V_{L}(N)=V_{UL}(N).

For example, in Fig. 2 N3N_{3} is an \mathcal{L}-network with ={a,b,c}\mathcal{L}=\{a,b,c\}, while N1N_{1} is not because VUL(N1)VRL(N1)V_{UL}(N_{1})\neq V_{RL}(N_{1}). N2N_{2} is a {b,c}\{b,c\}-network.

A rooted partner GG of an \mathcal{L}-network NN must be an \mathcal{L}-DAG: since VRL(N)VL(G)VUL(N)V_{RL}(N)\subseteq V_{L}(G)\subseteq V_{UL}(N) generally, we have that VL(G)=VL(N)V_{L}(G)=V_{L}(N) and VL(G)V_{L}(G) can inherit the labels in NN.

Our main result concerns the class of tree-child graphs. If GG is a DAG, it is tree-child if every internal node vv of GG has at least one child that is a tree node [16]. Equivalently, GG is tree-child if every non-leaf node in GG has a tree descendant leaf [7]. We extend this notion to semidirected networks, illustrated in Fig. 2.

Definition 5 (semidirected tree-child).

A network is weakly tree-child if one of its rooted partners is tree-child. It is strongly tree-child, or simply tree-child, if all its rooted partners are tree-child.

Since a DAG GG is a network with a single rooted partner, GG is strongly and weakly tree-child if and only if it is tree-child as a DAG. In Fig. 2, N3N_{3} is weakly but not strongly tree-child: of its 3 rooted partners, only one is tree-child. Later in 11, we provide a characterization easy to check without enumerating rooted partners.

III Properties of semidirected acyclic graphs

Proposition 3.

A semidirected graph N=(V,E)N=(V,E) is acyclic if and only if the undirected graph induced by its undirected edges EUE_{U} consists of trees only (i.e. is a forest), and Cont(N)\mathrm{Cont}(N) is acyclic.

Proof.

Let NN be a semidirected graph. Suppose that N[EU]N[E_{U}] is not a forest. Then there exists a compatible directed graph of N[EU]N[E_{U}] which contains a cycle that exists in a compatible directed graph of NN, therefore NN is not acyclic. Next suppose that Cont(N)\mathrm{Cont}(N) contains a cycle CC^{\prime}. Then there exists a compatible directed graph GG of NN which contains a cycle CC such that CC^{\prime} is obtained from CC after contracting edges in GG that correspond to undirected edges in NN, and so NN is not acyclic.

Now suppose the graph induced by EUE_{U} is a forest and that Cont(N)\mathrm{Cont}(N) is acyclic. If NN is not acyclic, then by definition there exists a compatible directed graph GG that contains a cycle CC. Since Cont(N)\mathrm{Cont}(N) is acyclic, CC must contract into a single node in Cont(N)\mathrm{Cont}(N). This implies that CC contains undirected edges only, hence is contained in N[EU]N[E_{U}], a contradiction. Therefore NN is acyclic. ∎

In a network NN, a semidirected path from u0u_{0} to unu_{n} is a sequence of nodes u0u1unu_{0}u_{1}\ldots u_{n}, such that for i=1,ni=1,\ldots n, either ui1uiu_{i-1}u_{i} or (ui1,ui)(u_{i-1},u_{i}) is an edge in NN. On V(N)V(N) we define vuv\lesssim u if there is a semidirected path from uu to vv, and the associated equivalence relation: uvu\sim v if uvu\lesssim v and vuv\lesssim u. In Fig. 2, ava\lesssim v in N1N_{1} and N3N_{3}. Also vav\lesssim a so ava\sim v in N1N_{1}, but not in N3N_{3}. On equivalence classes, \lesssim becomes a partial order, with which we can define the following.

Definition 6 (undirected components, root components, directed part).

In a network NN, an undirected component is the subgraph induced by an equivalence class under \sim. A root component of NN is an undirected component that is maximal under \lesssim. A root component is trivial if it consists of a single node. The set of edges and nodes that are not in a root component is called the directed part of NN. We denote the set of nodes in the directed part as VDP(N)V_{DP}(N) and the set of edges EDP(N)E_{DP}(N). We also denote the set of nodes in root components VR(N)V_{R}(N) and the set of edges ER(N)E_{R}(N).

The directed part of NN is generally not a subgraph: it may contain an edge but not one of its incident nodes. In Fig. 2, edges in the directed part EDPE_{DP} are those in black (tree edges) and blue (hybrid edges). Edges in root components ERE_{R} are in brown. In N3N_{3}, vv is adjacent to the directed part, but is not in VDP(N3)V_{DP}(N_{3}). N2N_{2} has a trivial root component: {a}\{a\}. The following result justifies the name given to the equivalence classes.

Proposition 4.

For u,vu,v nodes in a network NN, uvu\sim v if and only if there is an undirected path between uu and vv.

Proof.

Consider uu and vv connected by an undirected path. Since this path does not contain any directed edge, clearly, uvu\lesssim v and vuv\lesssim u, hence uvu\sim v.

Now suppose uvu\sim v. By definition, there exists a semidirected path puv=u0u1unp_{uv}=u_{0}u_{1}\ldots u_{n} from u=u0u=u_{0} to v=unv=u_{n} and a semidirected path pvup_{vu} from vv to uu. If pvu=unun1u0p_{vu}=u_{n}u_{n-1}\ldots u_{0} then all edges in these paths must be shared and undirected. This is because NN is acyclic, in case puvp_{uv} and pvup_{vu} have distinct edges incident to uiu_{i} and ui+1u_{i+1}. Then, there is an undirected path between uu and vv as claimed. Otherwise, there exists i10i_{1}\geq 0 and i2>i1+1i_{2}>i_{1}+1 such that pvup_{vu} is the concatenation of semidirected paths unun1ui2u_{n}u_{n-1}\ldots u_{i_{2}}; pui2ui1p_{u_{i_{2}}u_{i_{1}}} from ui2u_{i_{2}} to ui1u_{i_{1}} not containing any uju_{j} for i1<j<i2i_{1}<j<i_{2}; and ui1ui11u0u_{i_{1}}u_{i_{1}-1}\ldots u_{0}. Then, the concatenation of pui2ui1p_{u_{i_{2}}u_{i_{1}}} with ui1ui1+1ui2u_{i_{1}}u_{i_{1}+1}\ldots u_{i_{2}} forms a semidirected cycle. Since NN is acyclic, this case cannot occur. ∎

We now characterize undirected components as the undirected trees in the forest induced by NN’s undirected edges.

Proposition 5.

Let NN be a network and FF the graph induced by the undirected edges of NN. Then FF is a forest where:

  1. 1.

    each tree corresponds to an undirected component of NN, and has at most one node vv with degi(v,N)1\mathrm{deg}_{\mathrm{i}}(v,N)\geq 1;

  2. 2.

    the root components of NN are exactly the trees without such nodes, and they contain tree nodes only.

Proof.

By 3, FF is a forest. By 4, each tree in FF is an undirected component. If a tree TT in FF had more than one node vv with degi(v,N)1\mathrm{deg}_{\mathrm{i}}(v,N)\geq 1, it would be impossible to direct the edges in TT without making one of them hybrid, contradicting the existence of a rooted partner of NN.

For the second claim, let TT be a tree in FF. Note that (u,v)ED(u,v)\in E_{D} implies that u≁vu\not\sim v and vv’s equivalence class is not maximal under \lesssim. So if TT is maximal under \lesssim, then all nodes vv in TT have degi(v,N)=0\mathrm{deg}_{\mathrm{i}}(v,N)=0, which implies that vv is a tree node. Conversely, if TT is not maximal, then there exist nodes vv in TT, uvu\gtrsim v in a different tree of FF, and a semidirected path from uu to vv containing a directed edge. Taking the last directed edge on this path gives a node vv^{\prime} in TT with degi(v,N)1\mathrm{deg}_{\mathrm{i}}(v^{\prime},N)\geq 1. ∎

Lemma 6.

Let NN be a network and GG a rooted partner of NN. Let vVR(N)v\in V_{R}(N). If v<uv<u in GG, then uVR(N)u\in V_{R}(N).

Proof.

Let v<uv<u in GG. Then vuv\lesssim u in NN. By Definition 6, uvu\lesssim v in NN as well. Therefore, uu belongs to the same root component as vv, hence uVR(N)u\in V_{R}(N). ∎

Proposition 7.

Let NN be a network. Let G1G_{1} and G2G_{2} be rooted partners of NN. Then edges in EDP(N)E_{DP}(N) have the same direction in G1G_{1} and G2G_{2}.

Proof.

Let TT be an undirected component of NN that is not a root component. We need to show that TT’s edges have the same direction in G1G_{1} and G2G_{2}. By 5, TT is an undirected tree in NN, and has exactly one node v0v_{0} with an incoming edge in NN. Then v0v_{0} must be the root of TT in both G1G_{1} and G2G_{2}, for G1G_{1} and G2G_{2} to be phylogenetically compatible with NN, which completes the proof. ∎

Definition 7 (root choice function).

Let NN be a network, and \mathcal{R} be the set of root components of NN. A root choice function of NN is a function ρ:V(N)\rho:\mathcal{R}\to V(N) such that for a root component TT\in\mathcal{R}, ρ(T)V(T)\rho(T)\in V(T). In other words, ρ\rho chooses a node for each root component.

Conceptually, a semidirected network represents uncertainty about the root(s) position. Next, we show that to resolve uncertainty, exactly 1 node from each root component must be chosen as root, and this choice can be made independently across root components. In other words, root choice functions are in one-to-one correspondence with rooted partners. As a result, all rooted partners have the same number of roots: the number of root components. In Fig. 1, NN has 2 root components each with 2 nodes, hence 2×2=42\times 2=4 rooted partners.

Proposition 8.

Given a root choice function ρ\rho of a network NN, there exists a unique rooted partner Nρ+N^{+}_{\rho} of NN such that the set of roots in Nρ+N^{+}_{\rho} is the image of ρ\rho. Conversely, given any rooted partner GG of NN, there exists a unique root choice function ρ\rho such that G=Nρ+G=N^{+}_{\rho}.

Proof.

Let G0G_{0} be a rooted partner of NN. For a root choice function ρ\rho, let Nρ+N^{+}_{\rho} be the graph compatible with NN obtained by directing edges in EDP(N)E_{DP}(N) as they are in G0G_{0}, and away from ρ(T)\rho(T) in any root component TT (which is possible by 5). Nρ+N^{+}_{\rho} is a DAG because NN is acyclic. To prove that Nρ+N^{+}_{\rho} is phylogenetically compatible with NN (and hence a rooted partner), we need to show that EH(Nρ+)=EH(G0)E_{H}(N^{+}_{\rho})=E_{H}(G_{0}). Since all edges in NN incident to both VR=VR(N)V_{R}=V_{R}(N) and VDP=VDP(N)V_{DP}=V_{DP}(N) are directed from VRV_{R} to VDPV_{DP}, they have the same direction in Nρ+N^{+}_{\rho} and G0G_{0}. Therefore nodes in VDPV_{DP} and edges in EDP(N)E_{DP}(N) are of the same type in Nρ+N^{+}_{\rho} and G0G_{0} (and NN). Furthermore, by construction and 5, all edges in ER(N)E_{R}(N) remain of tree type in both Nρ+N^{+}_{\rho} and G0G_{0}. Hence Nρ+N^{+}_{\rho} is a rooted partner of NN. Finally, the root set of Nρ+N^{+}_{\rho} is the image of ρ\rho because a root component’s root is a root of the full network (from degi(v,N)=0\mathrm{deg}_{\mathrm{i}}(v,N)=0 for any vVRv\in V_{R}) and because VDPV_{DP} cannot contain any root of Nρ+N^{+}_{\rho} (by 5 again).

To prove that Nρ+N^{+}_{\rho} is unique, let GG be a rooted partner of NN whose set of roots is the image of ρ\rho. By 7, edges in EDP(N)E_{DP}(N) have the same direction in GG and N+N^{+}. For a root component TT, G[V(T)]=Nρ+[V(T)]G[V(T)]=N^{+}_{\rho}[V(T)] because TT is an undirected tree in NN, rooted by the same ρ(T)\rho(T) in both GG and Nρ+N^{+}_{\rho}. Therefore G=Nρ+G=N^{+}_{\rho}.

Let GG be a rooted partner of NN. By 5 and phylogenetic compatibility, GG must have exactly one root in each root component TT. Define ρ\rho such that ρ(T)\rho(T) is this root of GG in TT. By the arguments above, GG cannot have any root in VDPV_{DP}, and then G=Nρ+G=N^{+}_{\rho}. ∎

We can define the following thanks to 7:

Definition 8 (network completion).

The completion 𝒞(N)\mathcal{C}(N) of a network NN is the semidirected graph obtained from NN by directing every undirected edge in its directed part, as it is in any rooted partner of NN. More precisely, let GG be a rooted partner of NN. We direct uvEDP(N)uv\in E_{DP}(N) as (u,v)(u,v) in 𝒞(N)\mathcal{C}(N) if (u,v)E(G)(u,v)\in E(G). A network NN is complete if 𝒞(N)=N\mathcal{C}(N)=N.

Proposition 9.

For a network NN, 𝒞(N)\mathcal{C}(N) is a network and phylogenetically compatible with NN.

Proof.

From 8, any rooted partner of NN is of the form Nρ+N^{+}_{\rho}. As seen in the proof of 8, 𝒞(N)\mathcal{C}(N) and Nρ+N^{+}_{\rho} differ in root components only: if eER(N)e\in E_{R}(N) then ee is undirected in NN and in 𝒞(N)\mathcal{C}(N), directed in Nρ+N^{+}_{\rho}, and is a tree edge in all. Therefore 𝒞(N)\mathcal{C}(N) is phylogenetically compatible with NN and is a network (admitting Nρ+N^{+}_{\rho} as rooted partner). ∎

Remark 1.

Propositions 5 and 8 yield practical algorithms. Finding the root components requires only traversing the network NN, tracking the forest FF of undirected components, and which nodes have nonzero in-degree. Computing 𝒞(N)\mathcal{C}(N) then consists in directing the edges away from such a node in each tree of FF, if one exists. In particular, in a single traversal of NN we can construct a rooted partner GG, record the roots of GG, record the edges of GG that were in root components of NN, and for each such edge record the corresponding root. To do this, for each tree TT in FF that is a root component, we arbitrarily choose and record a node uu as root, direct the edges in TT away from uu, record these edges as belonging to a root component of NN, and for all these edges record uu as the corresponding root. We then direct the rest of the undirected edges the same way as when computing the completion. We shall use this in Algorithm 1 later.

With as many directed edges as can be possibly implied by the directed edges in NN, 𝒞(N)\mathcal{C}(N) is the network that we are generally interested in. We define phylogenetic isomorphism between networks based on their completion.

Definition 9 (network isomorphism).

\mathcal{L}-networks NN and NN^{\prime} are phylogenetically isomorphic, denoted by NNN\cong N^{\prime}, if 𝒞(N)\mathcal{C}(N) and 𝒞(N)\mathcal{C}(N^{\prime}) are isomorphic as semidirected graphs, with an isomorphism that preserves the leaf labels.

In Fig. 2 for example, N2N_{2} is complete, but N3N_{3} is not. N1N_{1} and N3N_{3} are not phylogenetically isomorphic, because the edge incident to aa remains undirected in the completion 𝒞(N1)\mathcal{C}(N_{1}). N2N_{2} is not phylogenetically compatible with N1N_{1} or N3N_{3}, and not phylogenetically isomorphic to either.

Lemma 10.

Let NN be a complete network. There exists a directed edge (u,v)(u,v) in NN if and only if vVDP(N)v\in V_{DP}(N).

Proof.

If vVDP(N)v\in V_{DP}(N), its undirected component UU has no undirected edges by 8. Then by 5, U={v}U=\{v\} and degi(v)1\mathrm{deg}_{\mathrm{i}}(v)\geq 1, so vv is the child of some directed edge. Conversely, if there exists (u,v)E(N)(u,v)\in E(N) then vv’s equivalence class is not maximal and vVDP(N)v\in V_{DP}(N). ∎

Using 𝒞(N)\mathcal{C}(N), we can now decide if a network is weakly or strongly tree-child in a single traversal, thanks to the following.

Proposition 11.

Let NN be a complete network, \mathcal{R} its set of root components, and W0W_{0} the set of nodes that form trivial root components. For TT\in\mathcal{R}, let W1(T)W_{1}(T) be the set of nodes uu in TT adjacent to VDP(N)V_{DP}(N) with degu(u)=1\mathrm{deg}_{\mathrm{u}}(u)=1 and without a tree-child in NN. NN is weakly (resp. strongly) tree-child if and only if every non-leaf node in VDP(N)W0V_{DP}(N)\cup W_{0} has a tree child in NN; and for every TT\in\mathcal{R}, |W1(T)|1|W_{1}(T)|\leq 1 (resp. W1(T)=W_{1}(T)=\emptyset).

If NN is a DAG, then VDP(N)W0V_{DP}(N)\cup W_{0} is the full node set and we simply recover the tree-child definition. Recall that children are defined using directed edges only. For example, take NN in Fig. 1. In 𝒞(N)\mathcal{C}(N) the edges incident to bb and c1c_{1} are directed; eie_{i} remains undirected and forms a root component TiT_{i} (i=1,2i=1,2). W1(T1)=W_{1}(T_{1})=\emptyset but W1(T2)={u}W_{1}(T_{2})=\{u\}, because uu is incident to exactly 1 undirected edge and 2 outgoing hybrid edges. By 11, NN is weakly tree-child. Indeed, its partners rooted at uu are tree-child (e.g. Fig. 1 bottom left) but its partners rooted at vv are not (e.g. bottom right). The proof below shows that, more generally, a weakly tree-child network has at most one ‘problematic’ node in each root component, and this node must serve as root for a rooted partner to be tree-child.

Proof of 11.

We first characterize which rooted partners are tree-child. Suppose that NN is complete, and that every non-leaf node in VDP(N)W0V_{DP}(N)\cup W_{0} has a tree child in NN. For a root choice function ρ\rho, we claim that the rooted partner Nρ+N^{+}_{\rho} is tree-child if and only if W1(T){ρ(T)}W_{1}(T)\subseteq\{\rho(T)\} for every TT\in\mathcal{R}. To prove this claim, consider a node uu in NN. If uu is in VDP(N)W0V_{DP}(N)\cup W_{0} then uu has the same children in NN as in any rooted partner, so its tree child in NN is also its tree child in Nρ+N^{+}_{\rho}. Otherwise, uu is in some TT\in\mathcal{R}, TT is non-trivial and degu(u)1\mathrm{deg}_{\mathrm{u}}(u)\geq 1. If degu(u)2\mathrm{deg}_{\mathrm{u}}(u)\geq 2, then at least one of its neighbor in TT is its tree child in Nρ+N^{+}_{\rho}. If degu(u)=1\mathrm{deg}_{\mathrm{u}}(u)=1 and is not adjacent to VDP(N)V_{DP}(N), then dego(u)=0\mathrm{deg}_{\mathrm{o}}(u)=0 and uu is a leaf in Nρ+N^{+}_{\rho} or its unique neighbor is its tree child in Nρ+N^{+}_{\rho}. If uu has a tree child ww in NN, then ww is also its tree child in Nρ+N^{+}_{\rho}. Otherwise, uW1(T)u\in W_{1}(T). Let vv be its unique neighbor in TT. If ρ(T)u\rho(T)\neq u then vv is the parent of uu in Nρ+N^{+}_{\rho}, uu has the same children in Nρ+N^{+}_{\rho} as in NN, so uu has no tree child in Nρ+N^{+}_{\rho} (by definition of W1(T)W_{1}(T)). If ρ(T)=u\rho(T)=u then vv is a child of uu in Nρ+N^{+}_{\rho}, and since vv is a tree node (because vTv\in T) uu has a tree child in Nρ+N^{+}_{\rho}. Overall, we get that Nρ+N^{+}_{\rho} is tree-child if and only if any node in any W1(T)W_{1}(T) is a root, which proves the claim.

For the first direction of 11, suppose that NN is complete and weakly tree-child. If uu is a non-leaf node in VDP(N)W0V_{DP}(N)\cup W_{0}, then uu has the same children in NN as in any rooted partner, so uu has a tree child in NN. Also, we can apply our claim. Since Nρ+N^{+}_{\rho} is tree-child for some ρ\rho, by our claim we get W1(T){ρ(T)}W_{1}(T)\subseteq\{\rho(T)\} which implies that |W1(T)|1|W_{1}(T)|\leq 1 for each TT\in\mathcal{R}. Suppose further that NN is strongly tree-child. If W1(T)W_{1}(T)\neq\emptyset for some TT\in\mathcal{R} then TT is nontrivial, we can choose ρ(T)=vW1(T)\rho(T)=v\notin W_{1}(T) for which Nρ+N^{+}_{\rho} is not tree-child, a contradiction. This proves the first direction.

For the second direction, assume that every non-leaf node in VDP(N)W0V_{DP}(N)\cup W_{0} has a tree child in NN and |W1(T)|1|W_{1}(T)|\leq 1 (resp. W1(T)=W_{1}(T)=\emptyset) for every TT\in\mathcal{R}. Then NN admits at least one tree-child rooted partner (setting ρ\rho such that {ρ(T)}=W1(T)\{\rho(T)\}=W_{1}(T) for any TT with W1(T)W_{1}(T)\neq\emptyset) and NN is weakly tree-child. Further, if W1(T)=W_{1}(T)=\emptyset for all TT, then Nρ+N^{+}_{\rho} is tree-child for any root choice function ρ\rho, and NN is strongly tree-child. ∎

Finally, we characterize unambiguous leaves quickly.

Lemma 12.

In a network NN, vv is an ambiguous leaf if and only if vVR(N)v\in V_{R}(N), degu(v)=1\mathrm{deg}_{\mathrm{u}}(v)=1, and dego(v)=degi(v)=0\mathrm{deg}_{\mathrm{o}}(v)=\mathrm{deg}_{\mathrm{i}}(v)=0.

Proof.

Let vv be an ambiguous leaf. Then dego(v)=0\mathrm{deg}_{\mathrm{o}}(v)=0. By 7, vv is in VR(N)V_{R}(N) so degi(v)=0\mathrm{deg}_{\mathrm{i}}(v)=0. Finally, if degu(v)=0\mathrm{deg}_{\mathrm{u}}(v)=0, then vv is isolated, and a rooted leaf. If degu(v)2\mathrm{deg}_{\mathrm{u}}(v)\geq 2, then in any rooted partner one of the incident edges is directed away from vv, and vv is never a leaf. Therefore degu(v)=1\mathrm{deg}_{\mathrm{u}}(v)=1.

Conversely, let vVR(N)v\in V_{R}(N) be incident to exactly one edge, uvuv. By 8, we can find a rooted partner where vv is a non-leaf root as well as a rooted partner where uu is a root and vv is a leaf. Hence vv is an ambiguous leaf. ∎

Remark 2.

We argue that requiring VUL(N)=VRL(N)V_{UL}(N)=V_{RL}(N) for NN to be an 𝒩\mathcal{N}-network is reasonable in practice. By 12, an ambiguous leaf xx is in a root component and incident to only one undirected edge. The ambiguity is whether xx becomes a leaf or a root in a rooted partner. In practice, one knows which nodes are supposed to be leaves, with a label and data [10, 25]. Then one can, for each root component, either direct all incident edges to ambiguous leaves towards them, making them rooted leaves (as in Fig. 2, compare N1N_{1} and N3N_{3}), or pick one ambiguous leaf to serve as root for that component (as in N2N_{2}, Fig. 2) and direct edges accordingly, turning the remaining ambiguous leaves into rooted leaves. This yields a network with VUL=VRLV_{UL}=V_{RL}. By 5, this can be done in 𝒪(|E|)\mathcal{O}(|E|) where |E||E| is the number of edges.

IV Vectors and Representations

Formally a multiset is a tuple (A,m)(A,m), where AA is a set and mm: A+A\to\mathbb{Z}^{+} gives the multiplicity. We consider two multisets (A,mA)(A,m_{A}) and (B,mB)(B,m_{B}) to be the same if mA|AB=0m_{A}|_{A\setminus B}=0, mB|BA=0m_{B}|_{B\setminus A}=0, and mA|AB=mB|ABm_{A}|_{A\cap B}=m_{B}|_{A\cap B}. To simplify notations, we use \lbag\rbag to denote a multiset by enumerating each element as many times as its multiplicity. For example, A=a,a,bA=\lbag a,a,b\rbag contains aa with multiplicity 2 and bb with multiplicity 1. For brevity, we identify a multiset (A,m)(A,m) with the set AA if m1m\equiv 1, e.g. a,b,c={a,b,c}\lbag a,b,c\rbag=\{a,b,c\}. We adopt the standard notion of sum and difference for multisets. The symmetric difference between multisets is defined as AB=(AB)+(BA)A\triangle B=(A-B)+(B-A).

In what follows, we consider a vector of distinct labels \mathcal{L}, whose order is arbitrary but fixed, as it will determine the order of coordinates in all μ\mu vectors and representations. For an \mathcal{L}-DAG GG and for node vV(G)v\in V(G), the μ\mu-vector of vv is defined as the tuple μ(v,G)=(μ1(v),,μn(v))\mu(v,G)=(\mu_{1}(v),\ldots,\mu_{n}(v)) where nn is the number of labels in \mathcal{L} and μi(v)\mu_{i}(v) is the number of paths in GG from vv to the leaf with the ithi^{\mathrm{th}} label in \mathcal{L}. As in [7], the partial order \geq between μ\mu-vectors is the coordinatewise order. Namely, for m=(m1,,mn)m=(m_{1},\ldots,m_{n}) and m=(m1,,mn)m^{\prime}=(m_{1}^{\prime},\ldots,m_{n}^{\prime}), mmm\geq m^{\prime} if mimim_{i}\geq m_{i}^{\prime} for all i=1,,ni=1,\ldots,n. If mmm\not\leq m^{\prime} and mmm\not\geq m^{\prime} then mm and mm^{\prime} are incomparable. The node-based μ\mu-representation of GG from [7], denoted as μV(G)\mu_{V}(G), is defined as the multiset μ(v,G):vV(G)\lbag\mu(v,G):v\in V(G)\rbag. Algorithm 1 in [7] computes μV(G)\mu_{V}(G) recursively in post-order thanks to the following property, which is a slight extension of Lemma 4 in [7] allowing for parallel edges by summing over child edges instead of child nodes.

Lemma 13.

Let GG be a DAG and uu a node in GG. Then

μ(u,G)=vchild(u,G)(u,v)E(G)μ(v,G).\mu(u,G)=\sum_{v\in\mathrm{child}(u,G)}\;\;\sum_{(u,v)\in E(G)}\mu(v,G)\;.

We will make frequent use of the following result, whose original proof easily extends to DAGs with parallel edges thanks to 13. It is an extension of Lemma 5 of [7] stating the assumption used in the proof by [7], which is weaker than requiring a tree-child DAG.

Lemma 14.

Let GG be an \mathcal{L}-DAG and u,vu,v two nodes in GG.

  1. 1.

    If there exists a path uvu\leadsto v, then μ(u,G)μ(v,G)\mu(u,G)\geq\mu(v,G).

  2. 2.

    If μ(u,G)>μ(v,G)\mu(u,G)>\mu(v,G) and if vv has a tree descendant leaf, then there exists a path uvu\leadsto v.

  3. 3.

    If μ(u,G)=μ(v,G)\mu(u,G)=\mu(v,G) and if vv has a tree descendant leaf, then u,vu,v are connected by an elementary path.

Other results in [7] similarly hold when parallel edges are allowed, such as their Theorem 1 on tree-child networks (which must be non-binary if they have parallel edges).

The rest of the section is organized as follows. In part IV-A we define the edge-based μ\mu-representation for \mathcal{L}-networks, with Algorithm 1 to compute it. Part IV-B presents properties of this μ\mu-representation for tree-child networks, and part IV-C describes how to reconstruct a complete tree-child \mathcal{L}-network from its edge-based μ\mu-representation. The networks in Fig. 3 are used as examples throughout.

IV-A Edge-based μ\mu-representation

We first extend the notion of μ\mu-vectors to nodes in the directed part of an \mathcal{L}-network.

Proposition 15.

Let NN be an \mathcal{L}-network, and vVDP(N)v\in V_{DP}(N). Then the set of directed paths starting at vv, and consequently μ(v,G)\mu(v,G), are the same for any rooted partner GG of NN.

Using the proposition above, the following is well-defined.

Definition 10 (μ\mu-vector of a node in the directed part).

Let NN be an \mathcal{L}-network and GG any rooted partner of NN. For vVDP(N)v\in V_{DP}(N), we define μ(v,N)=μ(v,G)\mu(v,N)=\mu(v,G).

In Fig. 3 for example, e5e_{5} is in the directed part of NN. Applying 13 recursively on h1h_{1}, h2h_{2} and their parent in 𝒞(N){\mathcal{C}}(N), we get μ(e5,N)=(0,0,0,0,0,1,1,0)\mu(e_{5},N)=(0{,}0{,}0{,}0{,}0,1{,}1{,}0) with leaf order given by =(a1,a2,b,c,d,h1,h2,h3)\mathcal{L}=(a_{1},a_{2},b,c,d,h_{1},h_{2},h_{3}). All 3 hybrid edges have the same μ\mu-vector in NN: (0,0,0,0,0,1,1,1)(0{,}0{,}0{,}0{,}0,1{,}1{,}1).

Proof of 15.

Let GG be a rooted partner of NN. Let u1unu_{1}\ldots u_{n} (n1n\geq 1), where u1=vu_{1}=v, be a directed path starting at vv in GG. We claim ui,i=1,,nu_{i},i=1,\ldots,n are all in VDP(N)V_{DP}(N): Otherwise, we can find ii such that uiVDP(N)u_{i}\in V_{DP}(N) and ui+1VR(N)u_{i+1}\in V_{R}(N). Since (ui,ui+1)E(G)(u_{i},u_{i+1})\in E(G), either uiui+1u_{i}u_{i+1} or (ui,ui+1)(u_{i},u_{i+1}) is in E(N)E(N). By 4, this implies either uiVR(N)u_{i}\in V_{R}(N) or ui+1VDP(N)u_{i+1}\in V_{DP}(N), a contradiction. Therefore any directed paths from vv in GG lies entirely in G[VDP(N)]G[V_{DP}(N)]. The conclusion then follows from 7. ∎

Now we turn to root components. Here the μ\mu-vector for a node is not well-defined as it varies depending on the rooted partner. It turns out that if locally we choose the direction of an edge uvuv, say (u,v)(u,v), then globally the set of directed paths from vv across all rooted partners are the same, and consequently the μ\mu-vector of vv is fixed. We now formalize this.

Proposition 16.

Let NN be an \mathcal{L}-network with uvER(N)uv\in E_{R}(N). Then the set of directed paths starting at vv, and consequently μ(v,G)\mu(v,G), are the same for any rooted partner GG of NN where uvuv is directed as (u,v)(u,v), and there always exists such a rooted partner.

Using this proposition, we can define the following.

Definition 11 (directional μ\mu-vector).

Let NN be an \mathcal{L}-network, uvER(N)uv\in E_{R}(N), and GG any rooted partner of NN where uvuv is directed as (u,v)(u,v). We call μ(v,G)\mu(v,G) the directional μ\mu-vector of (u,v)(u,v), and write it as μd(u,v,N)\mu_{d}(u,v,N), or μd(u,v)\mu_{d}(u,v) if NN is clear in the context.

Refer to caption
Figure 3: Tree-child \mathcal{L}-networks on =(a1,a2,b,c,d,h1,h2,h3)\mathcal{L}=(a_{1},a_{2},b,c,d,h_{1},h_{2},h_{3}) with one root component. Directed edges are shown with arrows. Black: tree edges in the directed part, leading to A1A_{1} in Algorithm 1. Blue: hybrid edges, leading to A2A_{2} in Algorithm 1. Brown: edges in the root component (ERE_{R}), leading to A4A_{4} in Algorithm 1. For i5i\leq 5, edge eie_{i} in NN (left) and eie^{\prime}_{i} in NN^{\prime} (right) share the same μ\mu-vector set. The multisets μE(N)\mu_{E}(N) and μE(N)\mu_{E}(N^{\prime}) have 17 elements in common: 8 corresponding to edges incident to leaves, 5 from edges eie_{i} and eie^{\prime}_{i} for i5i\leq 5, 3 from hybrid edges, and 1 root μ\mu-vector set. μE(N)\mu_{E}(N) has 2 unique elements (from e6e_{6} and e7e_{7}) and μE(N)\mu_{E}(N^{\prime}) has 3, so dμE(N,N)=5d_{\mu_{E}}(N,N^{\prime})=5. See the Appendix for details.

In Fig. 3 for example, e1=uve_{1}=uv is in the root component of NN, with directional μ\mu-vectors: μd(v,u,N)=(1,1,0,0,0,0,0,0)\mu_{d}(v,u,N)=(1{,}1{,}0{,}0{,}0,0{,}0{,}0) and μd(u,v,N)=(0,0,1,1,1,3,3,3)\mu_{d}(u,v,N)=(0{,}0{,}1{,}1{,}1,3{,}3{,}3). In NN^{\prime}, e1e^{\prime}_{1} has these same directional μ\mu-vectors.

Before the proof we establish a lemma.

Lemma 17.

Let NN be a network with uvuv an edge in some root component. Let NN^{\prime} be the semidirected graph obtained from NN by directing uvuv as (u,v)(u,v). Then NN^{\prime} is a network phylogenetically compatible with NN.

Proof.

Let GG be a rooted partner of NN where uu is a root. Let A=E(G)E(N)A=E(G)\setminus E(N) be the set of edges that are directed in GG but not in NN. By phylogenetic compatibility, AA consists of tree edges only. AA also contains (u,v)(u,v).

Since from GG we get back NN^{\prime} if we undirect all the edges in A{(u,v)}A\setminus\{(u,v)\}, by 2, GG is phylogenetically compatible with NN^{\prime}. Then EH(N)=EH(G)=EH(N)E_{H}(N^{\prime})=E_{H}(G)=E_{H}(N) and NN^{\prime} is phylogenetically compatible with NN. ∎

Proof of 16.

The existence of a rooted partner where uvuv is directed as (u,v)(u,v) follows from 8.

Let G1G_{1} and G2G_{2} be rooted partners of NN such that uvuv is directed as (u,v)(u,v) in both. Let NN^{\prime} be the semidirected graph obtained from NN by directing uvuv as (u,v)(u,v). By Lemma 17, NN^{\prime} is a network phylogenetically compatible with NN. Thus, G1G_{1} and G2G_{2} are rooted partners of NN^{\prime}. Since degi(v,N)1\mathrm{deg}_{\mathrm{i}}(v,N^{\prime})\geq 1, vVDP(N)v\in V_{DP}(N^{\prime}) by 5. The conclusion then follows from 15. ∎

In the next lemma and definition, we associate each root component (rather than a node or edge) to a μ\mu-vector.

Lemma 18.

Let NN be an \mathcal{L}-network, and TT a root component of NN. Then the μ\mu-vector μ(ρ(T),Nρ+)\mu(\rho(T),N^{+}_{\rho}) is independent of the root choice function ρ\rho. Furthermore, if uvuv is an edge in TT, this μ\mu-vector is equal to μd(u,v)+μd(v,u)\mu_{d}(u,v)+\mu_{d}(v,u).

With this lemma we define the following:

Definition 12 (root μ\mu-vector).

Let NN be an \mathcal{L}-network and TT a root component of NN. The root μ\mu-vector of TT is defined as μ(ρ(T),Nρ+)\mu(\rho(T),N^{+}_{\rho}) where ρ\rho is any root choice function of NN. We write it as μr(T,N)\mu_{r}(T,N) or μr(T)\mu_{r}(T) if NN is clear from context.

In Fig. 3, NN has one root component TT (in brown), with μr(T,N)=(1,1,1,1,1,3,3,3)\mu_{r}(T,N)=(1{,}1{,}1{,}1{,}1,3{,}3{,}3). The root component of NN^{\prime} has the same root μ\mu-vector.

Proof of 18.

The claims obviously hold when TT is trivial. Now consider TT non-trivial and distinct nodes uvu\neq v in TT. Let GuG_{u} (resp. GvG_{v}) be a rooted partner of NN where uu (resp. vv) is a root. To prove the first claim, we show that μ(u,Gu)=μ(v,Gv)\mu(u,G_{u})=\mu(v,G_{v}) by constructing a bijection fuf_{u} between 𝒫u\mathcal{P}_{u} and 𝒫v\mathcal{P}_{v}, where 𝒫z\mathcal{P}_{z} (z=u,vz=u,v) is the set of directed paths from zz to xx in GzG_{z}, for an arbitrary but fixed leaf xx of NN. Suppose pu=uwx𝒫up_{u}=u\ldots w\ldots x\in\mathcal{P}_{u}, where ww is the last node such that uwu\ldots w lies in TT. We can modify pup_{u} to a new path pvp_{v} by changing the uwu\ldots w subpath to vwv\ldots w, the unique tree path between vv and ww in TT. By 6, the subpath wxw\ldots x only contain edges in EDP(N)E_{DP}(N). Then by 7, pv𝒫vp_{v}\in\mathcal{P}_{v}. Obviously, fuf_{u} is a bijection whose inverse is the map from 𝒫v\mathcal{P}_{v} to 𝒫u\mathcal{P}_{u} constructed by symmetry, proving the first claim.

For the second claim, let uvuv be an edge in TT and xx, GuG_{u}, GvG_{v}, 𝒫u\mathcal{P}_{u} and 𝒫v\mathcal{P}_{v} as before. Let 𝒫uv\mathcal{P}_{u}^{v} be the set of directed paths from vv to xx in GuG_{u}, such that |𝒫uv||\mathcal{P}_{u}^{v}| is the coordinate value for xx in μd(u,v)\mu_{d}(u,v). Define 𝒫vu\mathcal{P}_{v}^{u} similarly. We can partition 𝒫u=AB\mathcal{P}_{u}=A\sqcup B where paths in AA contain vv, and paths in BB do not. It is easily verified that B=𝒫vuB=\mathcal{P}_{v}^{u}, and that prepending uu to a path gives a bijection 𝒫uvA\mathcal{P}_{u}^{v}\to A. Since xx is arbitrary, we get μr(T)=μd(u,v)+μd(v,u)\mu_{r}(T)=\mu_{d}(u,v)+\mu_{d}(v,u). ∎

We are now ready to define the edge-based μ\mu-representation of an \mathcal{L}-network, and an algorithm to compute it.

Definition 13 (edge-based μ\mu-representation).

Let NN be a complete \mathcal{L}-network. To edge ee of NN we associate a set μ(e)\mu(e), called the edge μ\mu-vector set of the edge ee, as follows:

  • For e=(u,v)e=(u,v), by 10 we have vVDPv\in V_{DP}, and we define μ(e)={(μ(v,N),t)}\mu(e)=\{(\mu(v,N),t)\} using 10, where tt is a tag taking value :𝐭\mathbf{:{\mkern-5.0mu}t} if ee is a tree edge, and :𝐡\mathbf{:{\mkern-5.0mu}h} otherwise.

  • For e=uvERe=uv\in E_{R}, using 11 we define μ(e)={(μd(u,v),:𝐭),(μd(v,u),:𝐭)}\mu(e)=\{(\mu_{d}(u,v),\mathbf{:{\mkern-5.0mu}t}),(\mu_{d}(v,u),\mathbf{:{\mkern-5.0mu}t})\}.

Let \mathcal{R} be the set of root components of NN, then the edge-based μ\mu-representation of NN, denoted by μE(N)\mu_{E}(N), is defined as the multiset

μ(e):eE(N)+{(μr(T),:𝐫)}:T\lbag\mu(e):e\in E(N)\rbag+\lbag\{(\mu_{r}(T),\mathbf{:{\mkern-5.0mu}r})\}:T\in\mathcal{R}\rbag

with μr\mu_{r} from 12 and :𝐫\mathbf{:{\mkern-5.0mu}r} a tag value indicating a root μ\mu-vector. For an \mathcal{L}-network NN^{\prime}, μE(N)\mu_{E}(N^{\prime}) is defined as μE(𝒞(N))\mu_{E}(\mathcal{C}(N^{\prime})).

In Fig. 3 for example, μE(N)\mu_{E}(N) contains 19 μ\mu-vector sets: 9 in A1A_{1}, 3 in A2A_{2}, 6 in A3A_{3} and 1 in A4A_{4}, using notations as in Algorithm 1. A1A_{1} contains the unidirectional μ\mu-vector sets from the 8 edges incident to the leaves, such as {((0,0,0,0,0,1,0,0),:𝐭)}\{((0{,}0{,}0{,}0{,}0,1{,}0{,}0),\mathbf{:{\mkern-5.0mu}t})\} for the edge to h1h_{1}, and {((0,0,0,0,0,1,1,0),:𝐭)}\{((0{,}0{,}0{,}0{,}0,1{,}1{,}0),\mathbf{:{\mkern-5.0mu}t})\} from e5e_{5}. A2A_{2} is from the hybrid edges: {((0,0,0,0,0,1,1,1),:𝐡)}\{((0{,}0{,}0{,}0{,}0,1{,}1{,}1),\mathbf{:{\mkern-5.0mu}h})\} with multiplicity 3. A3A_{3} has only 1 element: {((1,1,1,1,1,3,3,3),:𝐫)}\{((1{,}1{,}1{,}1{,}1,3{,}3{,}3),\mathbf{:{\mkern-5.0mu}r})\}. Finally, A4A_{4} contains the bidirectional μ\mu-vector sets from the 6 edges in the root component(s). For example, e1e_{1} contributes {((1,1,0,0,0,0,0,0),:𝐭),((0,0,1,1,1,3,3,3),:𝐭)}\{((1{,}1{,}0{,}0{,}0,0{,}0{,}0),\mathbf{:{\mkern-5.0mu}t}),((0{,}0{,}1{,}1{,}1,3{,}3{,}3),\mathbf{:{\mkern-5.0mu}t})\} See the Appendix for the other 5.

Refer to caption
Figure 4: Networks illustrating the need to include the μ\mu-vector of trivial root components in 13, to distinguish networks with multiple roots. NN and NN^{\prime} are tree-child and non-isomorphic. They have the same multiset of edge μ\mu-vector sets, but different root μ\mu-vectors so μE(N)μE(N)\mu_{E}(N)\neq\mu_{E}(N^{\prime}).

18 together with 5 yields the following Algorithm 1 to compute the edge-based μ\mu-representation of an \mathcal{L}-network N=(V,E)N=(V,E) with nn leaves. As discussed in Remark 1, line 1 in Algorithm 1 takes a single traversal of NN and 𝒪(|E|)\mathcal{O}(|E|) time. Computing the node-based μ\mu-representation by Algorithm 1 in [7] takes 𝒪(n|E|)\mathcal{O}(n|E|) time. The remaining steps iterate over edges and take 𝒪(n|E|)\mathcal{O}(n|E|) time, giving an overall complexity of 𝒪(n|E|)\mathcal{O}(n|E|).

Algorithm 1 Given \mathcal{L}-network NN, compute its edge-based μ\mu-representation A=μE(N)A=\mu_{E}(N)
1:   compute a rooted partner GG of NN, and store:
  • RR the set of roots in GG

  • ρ\rho: VR(N)RV_{R}(N)\to R the function that maps a node in a root component TT of NN to the root of TT in GG

  • ER+E_{R}^{+} the set of edges in GG that corresponds to ER(N)E_{R}(N)

  • EDP+=E(G)ER+E_{DP}^{+}=E(G)\setminus E_{R}^{+}

2:  compute the node-based μ\mu-representation of GG, let μ=μV(,G)\mu=\mu_{V}(\cdot,G)
3:  A1{(μ(v),:𝐭)}:(u,v)EDP+ET(G)A_{1}\leftarrow\lbag\{(\mu(v),\mathbf{:{\mkern-5.0mu}t})\}:(u,v)\in E_{DP}^{+}\cap E_{T}(G)\rbag
4:  A2{(μ(v),:𝐡)}:(u,v)EDP+EH(G)A_{2}\leftarrow\lbag\{(\mu(v),\mathbf{:{\mkern-5.0mu}h})\}:(u,v)\in E_{DP}^{+}\cap E_{H}(G)\rbag
5:  A3{(μ(r),:𝐫)}:rRA_{3}\leftarrow\lbag\{(\mu(r),\mathbf{:{\mkern-5.0mu}r})\}:r\in R\rbag
6:  A4{(μ(v),:𝐭),(μ(ρ(v))μ(v),:𝐭)}:(u,v)ER+A_{4}\leftarrow\lbag\{(\mu(v),\mathbf{:{\mkern-5.0mu}t}),\bigl{(}\mu(\rho(v))-\mu(v),\mathbf{:{\mkern-5.0mu}t}\bigr{)}\}:(u,v)\in E_{R}^{+}\rbag
7:  return  A=A1+A2+A3+A4A=A_{1}+A_{2}+A_{3}+A_{4}

Compared to the node-based representation μV\mu_{V}, μE\mu_{E} has two features. Unsurprisingly, each undirected edge (whose direction is not resolved by completion) is represented as bidirectional using two μ\mu-vectors. This is similar to the representation of edges in unrooted trees, as bipartitions of \mathcal{L}. The second feature is the inclusion of a μ\mu-vector for each root component, which may seem surprising. For a non-trivial root component TT, μr(T)\mu_{r}(T) is redundant with information from μ(e)\mu(e) for any ee in TT, by 18. The purpose of including the root μ\mu-vectors in μE(N)\mu_{E}(N) is to keep information from trivial root components, for networks with multiple roots. Without this information, μE\mu_{E} cannot discriminate simple networks with multiple roots when one or more root component is trivial, as illustrated in Fig. 4.

Networks with a unique and non-trivial root component correspond to standard phylogenetic rooted networks, with uncertainty about the root location. For these networks, we could use edge μ\mu-vectors only: μE(N)=μ(e):eE(N)\mu_{E}(N)=\lbag\mu(e):e\in E(N)\rbag, that is, omit A3A_{3} in Algorithm 1. Indeed, the root μ\mu-vector of the unique root component TT is redundant with μ(e)\mu(e) of any edge ee in TT. For this class of standard networks, then, our results below also hold using the simplified definition of μE\mu_{E}.

IV-B Properties for tree-child networks

We will use the following results to reconstruct a tree-child network from its edge-based μ\mu-representation. First we characterize when and how μ\mu-vectors are comparable.

Proposition 19.

Let T1T_{1} and T2T_{2} be distinct nontrivial root components of a strongly tree-child \mathcal{L}-network NN. Then directional μ\mu-vectors from T1T_{1} and from T2T_{2} are incomparable to one another.

Proof.

Let uuE(T1)uu^{\prime}\in E(T_{1}) and vvE(T2)vv^{\prime}\in E(T_{2}). Suppose for contradiction that μd(u,u)μd(v,v)\mu_{d}(u,u^{\prime})\geq\mu_{d}(v,v^{\prime}). Let GG be a rooted partner of NN in which uu and vv are roots. Then μd(u,u)=μ(u,G)\mu_{d}(u,u^{\prime})=\mu(u^{\prime},G) and μd(v,v)=μ(v,G)\mu_{d}(v,v^{\prime})=\mu(v^{\prime},G). Since GG is tree-child, there exists a path uvu^{\prime}\leadsto v^{\prime} in GG by 14 (possibly up to relabeling if μd(u,u)=μd(v,v)\mu_{d}(u,u^{\prime})=\mu_{d}(v,v^{\prime})) Since (v,v)(v,v^{\prime}) is a tree edge in GG, by Lemma 1 in [7], uvu^{\prime}\leadsto v^{\prime} contains or is contained in (v,v)(v,v^{\prime}). Both cases imply that u{v,v}u^{\prime}\in\{v,v^{\prime}\} (using that vv is a root for the first case), a contradiction. ∎

Proposition 20.

In a weakly tree-child \mathcal{L}-network, different root components have incomparable root μ\mu-vectors.

Proof.

Let T1T2T_{1}\neq T_{2} be root components of a tree-child \mathcal{L}-network NN. In a tree-child rooted partner of NN, there is no directed path between the roots of T1T_{1} and T2T_{2}. Therefore, by Lemmas 14 and 18, μr(T1)\mu_{r}(T_{1}) and μr(T2)\mu_{r}(T_{2}) are incomparable. ∎

Lemma 21.

Let NN be a strongly tree-child \mathcal{L}-network. Suppose uv,stuv,st are two (not necessarily distinct) edges in root component TT of NN such that the undirected tree path from uu to tt in TT contains vv and ss. Then:

  1. 1.

    μd(u,v)μd(s,t)\mu_{d}(u,v)\geq\mu_{d}(s,t),

  2. 2.

    μd(v,u)\mu_{d}(v,u) is incomparable to μd(s,t)\mu_{d}(s,t),

  3. 3.

    μd(u,v)\mu_{d}(u,v) is incomparable to μd(t,s)\mu_{d}(t,s).

Proof.

Let GuG_{u} (resp. GvG_{v}) be a rooted partner of NN with uu (resp. vv) as a root. For part 1, by 14, we have μd(u,v)=μ(v,Gu)μ(t,Gu)=μd(s,t)\mu_{d}(u,v)=\mu(v,G_{u})\geq\mu(t,G_{u})=\mu_{d}(s,t).

For part 2, by symmetry, it suffices to show that μd(v,u)μd(s,t)\mu_{d}(v,u)\not\leq\mu_{d}(s,t). Let w1,,wkw_{1},\ldots,w_{k} be the neighbors of uu besides vv. Then μ(wi,Gu)=μ(wi,Gv)\mu(w_{i},G_{u})=\mu(w_{i},G_{v}) by 15 if wiVDP(N)w_{i}\in V_{DP}(N), or 16 if wiV(T)w_{i}\in V(T). Then by 13 we have μd(v,u)=μ(u,Gv)=i=1k(u,wi)E(Gv)μ(wi,Gv)=i=1k(u,wi)E(Gv)μ(wi,Gu)\mu_{d}(v,u)=\mu(u,G_{v})=\sum_{i=1}^{k}\sum_{(u,w_{i})\in E(G_{v})}\mu(w_{i},G_{v})=\sum_{i=1}^{k}\sum_{(u,w_{i})\in E(G_{v})}\mu(w_{i},G_{u}).

First, suppose for contradiction that μd(v,u)<μd(s,t)=μ(t,Gu)\mu_{d}(v,u)<\mu_{d}(s,t)=\mu(t,G_{u}). Then for each ii, μ(t,Gu)>μ(wi,Gu)\mu(t,G_{u})>\mu(w_{i},G_{u}), and since GuG_{u} is tree-child there exists a path twit\leadsto w_{i} in GuG_{u} by 14. As uu is a root in GuG_{u} and not contained in these paths, w1,,wkw_{1},\ldots,w_{k} are hybrid nodes. Then uu does not have a tree child in GvG_{v}, a contradiction.

Now suppose instead μd(v,u)=μd(s,t)=μ(t,Gu)\mu_{d}(v,u)=\mu_{d}(s,t)=\mu(t,G_{u}), then μ(t,Gu)μ(wi,Gu)\mu(t,G_{u})\geq\mu(w_{i},G_{u}) for all ii. If μ(t,Gu)=μ(wi,Gu)\mu(t,G_{u})=\mu(w_{i},G_{u}) for some ii, then wi=w1w_{i}=w_{1} is the only neighbor of uu other than vv. By 14, tt and w1w_{1} are connected by an elementary path in GuG_{u}, which is impossible as both have uu as a parent. Therefore μ(t,Gu)>μ(wi,Gu)\mu(t,G_{u})>\mu(w_{i},G_{u}) for all ii, which leads to a contradiction by the argument above.

Part 3 follows from part 2 using that μd(a,b)+μd(b,a)=μr(T)\mu_{d}(a,b)+\mu_{d}(b,a)=\mu_{r}(T) for any abE(T)ab\in E(T) by 18. ∎

Next, we relate edges with identical directional μ\mu-vectors.

Lemma 22.

Let NN be a strongly tree-child \mathcal{L}-network, and xx a fixed μ\mu-vector. If we direct all edges uvER(N)uv\in E_{R}(N) with μd(u,v)=x\mu_{d}(u,v)=x as (u,v)(u,v), then these edges form a directed path. If the path is nonempty, we denote the first node as h(x,N)h(x,N).

Proof.

Let Ex={uvER(N):μd(u,v)=x}E_{x}=\{uv\in E_{R}(N):\mu_{d}(u,v)=x\}. Take uvuv and stst in ExE_{x}. By 19, they are in the same root component TT, so there is an undirected path pp in TT connecting uvuv and stst. By applying 21 multiple times and permuting labels if necessary, we may assume pp is of the form uvstuv\ldots st with μd(u,v)=μd(s,t)=x\mu_{d}(u,v)=\mu_{d}(s,t)=x. For a rooted partner GG of NN with uu a root, we have μ(v,G)=μ(t,G)=x\mu(v,G)=\mu(t,G)=x, which implies by 14 there is an elementary path in GG from vv to tt. By 6, this path lies in ER(N)E_{R}(N). But since ER(N)E_{R}(N) induces a forest, this elementary path must be the vstv\ldots st part of the path pp. Therefore all the intermediate nodes ww in pp have degu(w,N)=2\mathrm{deg}_{\mathrm{u}}(w,N)=2 and degi(w,N)=dego(w,N)=0\mathrm{deg}_{\mathrm{i}}(w,N)=\mathrm{deg}_{\mathrm{o}}(w,N)=0. Furthermore, if w1w_{1} and w2w_{2} are consecutive nodes in pp, then w1w2Exw_{1}w_{2}\in E_{x} because x=μ(v,G)μ(w2,G)=μd(w1,w2)μ(t,G)=xx=\mu(v,G)\geq\mu(w_{2},G)=\mu_{d}(w_{1},w_{2})\geq\mu(t,G)=x. Therefore all edges in pp are in ExE_{x}, and form a directed path when directed as in the statement.

Now take an undirected path p0p_{0} of edges in ExE_{x}, of maximum length, and let ee one of its edges. To show that p0p_{0} contains all the edges in ExE_{x}, take eExe^{\prime}\in E_{x}. By the previous argument, ee and ee^{\prime} are connected by a tree path p1p_{1}. Also by the previous argument, all intermediate nodes in p0p_{0} and in p1p_{1} have degu=2\mathrm{deg}_{\mathrm{u}}=2. Since p1p_{1} cannot extend p0p_{0} by definition of p0p_{0}, p1p_{1} must be contained in p0p_{0}, therefore ee^{\prime} is in p0p_{0} as claimed. ∎

Finally, the next result was proved for orchard DAGs, which include tree-child DAGs [9, Proposition 10]. We restrict its statement to hybrid nodes here, because we allow networks to have in and out degree-1 nodes.

Lemma 23.

Let GG be a tree-child \mathcal{L}-DAG. Let u,vu,v be distinct hybrid nodes in GG. Then μ(u,G)μ(v,G)\mu(u,G)\neq\mu(v,G).

IV-C Reconstructing a complete tree-child network

To reconstruct a complete tree-child network NN from its edge-based μ\mu-representation, Algorithm 2 will first construct μV(G)\mu_{V}(G) for a rooted partner GG from μE(N)\mu_{E}(N). Then Algorithm 3 will use μE(N)\mu_{E}(N) to undirect some edges in GG and recover NN.

Algorithm 2 Given A=μE(N)A=\mu_{E}(N) from a tree-child \mathcal{L}-network NN, compute B=μV(G)B=\mu_{V}(G) for some rooted partner GG of NN
0:  multiset AA
0:  multiset BB
1:  B1x:{(x,:𝐭)}AB_{1}\leftarrow\lbag x\colon\{(x,\mathbf{:{\mkern-5.0mu}t})\}\in A\rbag
2:  B2{x:{(x,:𝐡)}A}B_{2}\leftarrow\{x\colon\{(x,\mathbf{:{\mkern-5.0mu}h})\}\in A\}
3:  B3{x:{(x,:𝐫)}A}B_{3}\leftarrow\{x\colon\{(x,\mathbf{:{\mkern-5.0mu}r})\}\in A\}
4:  B4B_{4}\leftarrow\lbag\rbag
5:  for zz in B3B_{3} do
6:        M(z)x:{(x,:𝐭),(zx,:𝐭)}AM(z)\leftarrow\lbag x:\{(x,\mathbf{:{\mkern-5.0mu}t}),(z-x,\mathbf{:{\mkern-5.0mu}t})\}\in A\rbag
7:        if M(z)M(z) is empty then skip to next iteration
8:        r(z)r(z)\leftarrow some arbitrary element of M(z)M(z)
9:        for {(x1,:𝐭),(x2,:𝐭)}A\{(x_{1},\mathbf{:{\mkern-5.0mu}t}),(x_{2},\mathbf{:{\mkern-5.0mu}t})\}\in A with x1+x2=zx_{1}+x_{2}=z do
10:              B4B4+yB_{4}\leftarrow B_{4}+\lbag y\rbag where y=xiy=x_{i} if xir(z)x_{i}\leq r(z) else y=zxiy=z-x_{i} if xi>r(z)x_{i}>r(z) (i=1i=1 or 22)
11:  BB1+B2+B3+B4B\leftarrow B_{1}+B_{2}+B_{3}+B_{4}
12:  return  BB

Continuing with NN in Fig. 3 (left), A=μE(N)A=\mu_{E}(N) is given in the Appendix. Algorithm 2 starts with B1={(1,0,0,0,0,0,0,0),,(0,0,0,0,0,0,0,1),(0,0,0,0,0,1,1,0)}B_{1}=\{(1{,}0{,}0{,}0{,}0,0{,}0{,}0),\ldots,(0{,}0{,}0{,}0{,}0,0{,}0{,}1),(0{,}0{,}0{,}0{,}0,1{,}1{,}0)\} for edges incident to leaves and e5e_{5} (in black in Fig. 3). B2={(0,0,0,0,0,1,1,1)}B_{2}=\{(0{,}0{,}0{,}0{,}0,1{,}1{,}1)\}, for the unique μ\mu-vector shared by all 3 hybrid edges in NN. B3B_{3} has a single element z=(1,1,1,1,1,3,3,3)z=(1{,}1{,}1{,}1{,}1,3{,}3{,}3) because NN has a single root component, so the loop on line 5 has a single iteration and all elements {(x1,:𝐭),(x2,:𝐭)}\{(x_{1},\mathbf{:{\mkern-5.0mu}t}),(x_{2},\mathbf{:{\mkern-5.0mu}t})\} in AA satisfy x1+x2=zx_{1}+x_{2}=z (from edges in brown in Fig. 3). On line 6, M(z)M(z) has 12 elements (see the Appendix). We can arbitrarily pick r(z)=(0,0,0,1,1,2,2,2)M(z)r(z)=(0{,}0{,}0{,}1{,}1,2{,}2{,}2)\in M(z), which corresponds to e7e_{7} directed rightward. Then B=μV(G)B=\mu_{V}(G) for the partner GG of NN rooted at the node incident to e6e_{6} and e7e_{7} (see Fig. A7).

Proof of correctness for Algorithm 2.

As NN and 𝒞(N)\mathcal{C}(N) have the same rooted partners and μE(N)=μE(𝒞(N))\mu_{E}(N)=\mu_{E}(\mathcal{C}(N)), we may assume NN to be complete.

Let ρ\rho be a root choice function such that ρ(T)=h(r(μr(T)),N)\rho(T)=h(r(\mu_{r}(T)),N), where rr is the function on line 8 and hh is defined in 22. By 18 and 20, ρ(T)V(T)\rho(T)\in V(T) is well-defined. Let G=Nρ+G=N_{\rho}^{+}. We shall show that Algorithm 2, with A=μE(N)A=\mu_{E}(N) as input, produces the output B=μV(G)B=\mu_{V}(G).

Consider partitioning V(N)=V(G)V(N)=V(G) into the following sets:

  1. V1={vV_{1}=\{v is a tree node in the directed part of N}N\},

  2. V2={vV_{2}=\{v is a hybrid node}\},

  3. V3={vV_{3}=\{v is a root in G}G\},

  4. V4={vV_{4}=\{v in a root component of NN, but not a root in G}G\}.

We will establish Bi=μ(v,G):vViB_{i}=\lbag\mu(v,G):v\in V_{i}\rbag for the multisets BiB_{i} in the algorithm (i=1,,4i=1,\ldots,4), to conclude the proof.

By 10, (u,v)v(u,v)\mapsto v is a bijection between the directed tree edges and V1V_{1}. Then by 13, B1=μ(v,N):(u,v)ET(N)=μ(v,G):vV1B_{1}=\lbag\mu(v,N):(u,v)\in E_{T}(N)\rbag=\lbag\mu(v,G):v\in V_{1}\rbag, which concludes case i=1i=1.

For i=2i=2, 23 implies that μ(u,G)μ(v,G)\mu(u,G)\neq\mu(v,G) for distinct uvu\neq v in V2V_{2}. Therefore μ(v,G):vV2={μ(v,G):vV2}\lbag\mu(v,G):v\in V_{2}\rbag=\{\mu(v,G):v\in V_{2}\}. By the definition of hybrid edges, B2={x:{(x,:𝐡)}A}={μ(v,G):(u,v)EH(N)}B_{2}=\{x\colon\{(x,\mathbf{:{\mkern-5.0mu}h})\}\in A\}=\{\mu(v,G):(u,v)\in E_{H}(N)\} is equal to {μ(v,G):vV2}\{\mu(v,G):v\in V_{2}\}, which implies B2=μ(v,G):vV2B_{2}=\lbag\mu(v,G):v\in V_{2}\rbag.

For i=3i=3, by 18 and 20, the μ\mu-vectors of the roots in GG are the same as the root μ\mu-vectors, and are all distinct. Hence B3=μ(v,G):vV3B_{3}=\lbag\mu(v,G):v\in V_{3}\rbag.

For i=4i=4, let ER+E_{R}^{+} be the set of edges in GG that corresponds to ER(N)E_{R}(N). Consider the map V4ER+V_{4}\to E_{R}^{+} that associates vv to its parent edge (u,v)(u,v) in GG. It is well-defined because V4V_{4} excludes the roots of GG, root components only contain tree nodes (5) and uvER(N)uv\in E_{R}(N) by 10. Furthermore, the map is a bijection. Therefore we have μ(v,G):vV4=μd(u,v):(u,v)ER+\lbag\mu(v,G):v\in V_{4}\rbag=\lbag\mu_{d}(u,v):(u,v)\in E_{R}^{+}\rbag.

B4B_{4} is constructed in Algorithm 2 by taking a μ\mu-vector from the pair μd(s,t)\mu_{d}(s,t) and μd(t,s)\mu_{d}(t,s), for each undirected edge stst in each root component. Let TT be the root component that contains stst and let z=μr(T)B3z=\mu_{r}(T)\in B_{3}. Then μd(s,t)+μd(t,s)\mu_{d}(s,t)+\mu_{d}(t,s) equals zz but no other root μ\mu-vector by 20, so μ(st)\mu(st) is considered at exactly one iteration of the loop on line 9. Next we need to show that on line 10, exactly one μ\mu-vector gets chosen, and is y=μd(s,t)y=\mu_{d}(s,t) where (s,t)ER+(s,t)\in E_{R}^{+}.

From 22, let u=h(r(z),N)u=h(r(z),N) be the root of TT in GG and vv such that r(z)=μd(u,v)r(z)=\mu_{d}(u,v). Since uu is a root in GG and (s,t)E(G)(s,t)\in E(G), the tree path pp in TT from uu to tt contains ss. If pp also contains vv, then μd(s,t)μd(u,v)\mu_{d}(s,t)\leq\mu_{d}(u,v) and μd(t,s)μd(u,v)\mu_{d}(t,s)\not\leq\mu_{d}(u,v) by 21, so line 10 defines y=μd(s,t)y=\mu_{d}(s,t) as claimed. If pp does not contain vv, then the tree path from vv to tt contains uu and ss, so by 21 μd(s,t)\mu_{d}(s,t) is incomparable to μd(u,v)\mu_{d}(u,v) and μd(t,s)μd(u,v)\mu_{d}(t,s)\geq\mu_{d}(u,v). Further, μd(t,s)>μd(u,v)\mu_{d}(t,s)>\mu_{d}(u,v) by the choice u=h(r(z),N)u=h(r(z),N) and 22. Therefore line 10 defines y=zμd(t,s)=μd(s,t)y=z-\mu_{d}(t,s)=\mu_{d}(s,t), which concludes the proof. ∎

Algorithm 3 Given A=μE(N)A=\mu_{E}(N) from a tree-child \mathcal{L}-network NN, and a rooted partner GG of NN, modify GG to obtain 𝒞(N)\mathcal{C}(N)
1:  BμE(G)B\leftarrow\mu_{E}(G)
2:  Fx:{(x,:𝐭)}BAF\leftarrow\lbag x:\{(x,\mathbf{:{\mkern-5.0mu}t})\}\in B-A\rbag
3:  for xUnique(F)x\in\text{Unique}(F) do
4:        m(x)m(x)\leftarrow multiplicity of xx in FF
5:        p(x)p(x)\leftarrow the directed path in GG formed by {(u,v)E(G):μV(v,G)=x}\{(u,v)\in E(G):\mu_{V}(v,G)=x\}
6:        undirect the first m(x)m(x) edges in p(x)p(x)
7:  return  GG

Given μE(N)\mu_{E}(N) from NN in Fig. 3 and GG from Algorithm 2 (Fig. A7), FF contains 6 μ\mu-vectors (see the Appendix) so 6 edges in GG are undirected to obtain NN. One of them x=(0,0,1,0,0,0,0,0)x=(0{,}0{,}1{,}0{,}0,0{,}0{,}0) has multiplicity 1 in FF but corresponds to an elementary path of 2 edges in GG adjacent to bb. On this path, only e2e_{2} is undirected by Algorithm 3.

Proof of correctness of Algorithm 3.

Note that line 5 uses 14 to claim that p(x)p(x) is a directed path, and so line 6 can be applied.

Let ER+E_{R}^{+} denote the set of edges in GG that corresponds to edges in ER(N)E_{R}(N). It suffices to show that line 6 undirects all edges in ER+E_{R}^{+} and no other. Obviously, line 2 defines F=μ(v,G):(u,v)ER+F=\lbag\mu(v,G):(u,v)\in E_{R}^{+}\rbag. Suppose FF consists of elements x1,,xkx_{1},\ldots,x_{k} with multiplicities m1,,mkm_{1},\ldots,m_{k}. We know that ER+E_{R}^{+} exactly consists of mim_{i} edges whose children have μ\mu-vector xix_{i}, for i=1,,ki=1,\ldots,k. Thus we only need to show that for each xix_{i}, the first mim_{i} edges in p(xi)p(x_{i}) are in ER+E_{R}^{+}. By 6, if an edge ee is not in ER+E_{R}^{+}, then all edges below it are also not in ER+E_{R}^{+}. Therefore along the path p(xi)p(x_{i}), edges in ER+E_{R}^{+} must come first before any edge not in ER+E_{R}^{+}, which finishes the proof. ∎

The following theorem derives directly from Theorem 1 in [7] to reconstruct GG from μV(G)\mu_{V}(G), and the application of Algorithms 2 and 3.

Theorem 1.

Let N1N_{1} and N2N_{2} be strongly tree-child \mathcal{L}-networks. Then μE(N1)=μE(N2)\mu_{E}(N_{1})=\mu_{E}(N_{2}) if and only if N1N_{1} and N2N_{2} are phylogenetically isomorphic.

1 does not generally hold for weakly tree-child networks, as seen in a counter-example in Fig. 5.

Refer to caption

Figure 5: Weakly tree-child (a,b,c)(a,b,c)-networks for which 1 does not hold. For each network, the rooted partner rooted at the degree-2 node is tree-child. The other 2 rooted partners are not. N1N_{1} and N2N_{2} are not phylogenetically isomorphic yet μE(N1)=μE(N2)=A1++A4\mu_{E}(N_{1})=\mu_{E}(N_{2})=A_{1}+\cdots+A_{4} with A1={((1,0,0),:𝐭)},{((0,1,0),:𝐭)},{((0,0,1),:𝐭)}A_{1}=\lbag\{((1,0,0),\mathbf{:{\mkern-5.0mu}t})\},\{((0,1,0),\mathbf{:{\mkern-5.0mu}t})\},\{((0,0,1),\mathbf{:{\mkern-5.0mu}t})\}\rbag,
A2={((0,0,1),:𝐡)},{((0,0,1),:𝐡)}A_{2}=\lbag\{((0,0,1),\mathbf{:{\mkern-5.0mu}h})\},\{((0,0,1),\mathbf{:{\mkern-5.0mu}h})\}\rbag, A3={((1,1,2),:𝐫)}A_{3}=\lbag\{((1,1,2),\mathbf{:{\mkern-5.0mu}r})\}\rbag and A4={((1,0,1),:𝐭),((0,1,1),:𝐭)},{((1,1,1),:𝐭),((0,0,1),:𝐭)}A_{4}=\lbag\{((1,0,1),\mathbf{:{\mkern-5.0mu}t}),((0,1,1),\mathbf{:{\mkern-5.0mu}t})\},\{((1,1,1),\mathbf{:{\mkern-5.0mu}t}),((0,0,1),\mathbf{:{\mkern-5.0mu}t})\}\rbag.

V The edge-based μ\mu-distance

Definition 14 (edge-based μ\mu-distance).

Let N1N_{1} and N2N_{2} be \mathcal{L}-networks. The edge-based μ\mu-dissimilarity between N1N_{1} and N2N_{2} is defined as

dμE(N1,N2)=|μE(N1)μE(N2)|.d_{\mu_{E}}(N_{1},N_{2})=|\mu_{E}(N_{1})\triangle\mu_{E}(N_{2})|\;.

For example, dμE(N,N)=2d_{\mu_{E}}(N,N^{\prime})=2 in Fig. 6. For the networks in Fig. 3, dμE(N,N)=5d_{\mu_{E}}(N,N^{\prime})=5 due to non-matching μ\mu-vector sets for edges e6e_{6} and e7e_{7} in NN and the 3 unlabelled tree edges in NN^{\prime}. (see Fig. 3 and the Appendix for details).

We are now ready to state our main theorem, which justifies why we may refer to dμEd_{\mu_{E}} as a distance.

Refer to caption

Figure 6: Example rooted networks for which the node-based and edge-based μ\mu distances differ, on leaves (a,b,c,d)(a,b,c,d): Left: NN is tree-child, non-bicombining. Right: NN^{\prime} is not tree-child, but bicombining. Both have 3 nodes (including leaf cc) with μ\mu-vector (0,0,1,0)(0{,}0{,}1{,}0), and dμV(N,N)=0d_{\mu_{V}}(N,N^{\prime})=0. NN and NN^{\prime} both have 5 edges with μ\mu-vector (0,0,1,0)(0{,}0{,}1{,}0), of which 3 (resp. 4) are of hybrid type in NN (resp. NN^{\prime}), and dμE(N,N)=2d_{\mu_{E}}(N,N^{\prime})=2.
Theorem 2.

For a vector of leaf labels \mathcal{L}, dμEd_{\mu_{E}} is a distance on the class of (complete) strongly tree-child \mathcal{L}-networks.

Proof.

From the properties of the symmetric difference, dμEd_{\mu_{E}} is a dissimilarity in the sense that is it symmetric, non-negative, and satisfies the triangle inequality. It remains to show that dμEd_{\mu_{E}} satisfies the separation property. Let N1N_{1} and N2N_{2} be tree-child \mathcal{L}-networks. If dμE(N1,N2)=0d_{\mu_{E}}(N_{1},N_{2})=0 then μE(N1)=μE(N2)\mu_{E}(N_{1})=\mu_{E}(N_{2}) and by 1, N1N2N_{1}\cong N_{2}. ∎

For unrooted trees, the μ\mu-vector of each undirected edge encodes the bipartition on \mathcal{L} associated with the edge, hence dμEd_{\mu_{E}} agrees with the Robinson-Foulds distance on unrooted trees.

On rooted trees, dμEd_{\mu_{E}} agrees with dμVd_{\mu_{V}}. Indeed, if TT is a directed tree or forest on \mathcal{L}, then each non-root node vv has a unique parent edge ee with element {(μV(v),:𝐭)}\{(\mu_{V}(v),\mathbf{:{\mkern-5.0mu}t})\} in μE(T)\mu_{E}(T); and each root uu forms to a trivial root component with element μE(u)={(μV(u),:𝐫)}\mu_{E}(u)=\{(\mu_{V}(u),\mathbf{:{\mkern-5.0mu}r})\} in μE(T)\mu_{E}(T).

However, dμEd_{\mu_{E}} does not generally extend dμVd_{\mu_{V}}. For example, consider the rooted networks in Fig. 6. They have the same μV\mu_{V} representation, hence dμV(N,N)=0d_{\mu_{V}}(N,N^{\prime})=0. However, their μE\mu_{E} representations differ, due to edges with the same μ\mu vector (1 path to cc only) but different tags (tree edge in NN versus hybrid edge in NN^{\prime}). Hence dμEd_{\mu_{E}} can distinguish these networks: dμE(N,N)>0d_{\mu_{E}}(N,N^{\prime})>0.


We can compute dμEd_{\mu_{E}} using a variant of Algorithm 3 in [7]. Specifically, we first group the elements of μE(Ni)\mu_{E}(N_{i}) (i=1,2i=1,2) by their type: of the form {(x,:𝐫)}\{(x,\mathbf{:{\mkern-5.0mu}r})\}, {(x,:𝐭)}\{(x,\mathbf{:{\mkern-5.0mu}t})\}, {(x,:𝐡)}\{(x,\mathbf{:{\mkern-5.0mu}h})\}, or {(x,:𝐭),(y,:𝐭)}\{(x,\mathbf{:{\mkern-5.0mu}t}),(y,\mathbf{:{\mkern-5.0mu}t})\}. Then it suffices to equip a total order and apply Algorithm 3 in [7] to each group, then add the distances obtained from the 4 groups. For the first 3 types we can simply use the lexical order on the μ\mu-vector xx. For the last type, we may compare two elements by comparing the lexically smaller μ\mu-vector first and then the larger one, to obtain a total order within the group.

As in [7], with μE(Ni)\mu_{E}(N_{i}) computed and sorted, the above takes 𝒪(n|E|)\mathcal{O}(n|E|) time where |E|=max(E1,E2)|E|=\max(E_{1},E_{2}). Taking into account computing and sorting μE(Ni)\mu_{E}(N_{i}), computing dμEd_{\mu_{E}} takes O(|E|(n+log|E|))O\bigl{(}|E|(n+\log|E|)\bigr{)} time.

This complexity can also be expressed in terms of the number of leaves and root components, thanks to the following straightforward generalization of Proposition 1 in [7], allowing for multiple root components.

Proposition 24.

Let NN be a tree-child \mathcal{L}-network with nn leaves and tt root components. Then |VH|nt|V_{H}|\leq n-t.
A node vv is called elementary if it is a tree node and either dego(v)=deg(v)=1\mathrm{deg}_{\mathrm{o}}(v)=\mathrm{deg}(v)=1, or dego(v)<deg(v)=2\mathrm{deg}_{\mathrm{o}}(v)<\mathrm{deg}(v)=2. If NN has no elementary nodes then

|V|2nt+vVHdegi(v)(m+2)(nt)+t|V|\leq 2n-t+\sum_{v\in V_{H}}\mathrm{deg}_{\mathrm{i}}(v)\leq(m+2)(n-t)+t

where m=maxvVH{degi(v)}\displaystyle m=\max_{v\in V_{H}}\{\mathrm{deg}_{\mathrm{i}}(v)\}, and |E|(2m+1)(nt)|E|\leq(2m+1)(n-t).

Proof.

Since NN is an \mathcal{L}-network, it has no ambiguous leaves, and the elementary nodes in NN are the tree nodes of out-degree 1 in any rooted partner. By considering a rooted partner, we may assume that NN is a DAG with tt roots, and follow the proof of Proposition 1 in [7]. Their arguments remain valid for the bounds on |VH||V_{H}| and |V||V| when NN has t1t\geq 1 roots, and when the removal of all but 1 parent hybrid edges at each hybrid node gives a forest instead of a tree. To bound |E||E| we enumerate the parent edges of each node: |E|(|VT|t)+m|VH|=|V|t+(m1)|VH||E|\leq(|V_{T}|-t)+m|V_{H}|=|V|-t+(m-1)|V_{H}| then use the previous bounds. ∎

Therefore, as long as mm and tt are bounded and there are no elementary nodes, for example in binary tree-child networks with a single root component, then |E|=𝒪(n)|E|=\mathcal{O}(n). Consequently, computing μE\mu_{E} on one such network or computing dμEd_{\mu_{E}} on two such networks takes 𝒪(n2)\mathcal{O}(n^{2}) time.

VI Conclusion and extensions

For rooted networks, the node-based representation μV\mu_{V}, or equivalently the ancestral profile, is known to provide a distance between networks beyond the class of tree-child networks, such as the class of semibinary tree-sibling time-consistent networks [6] and stack-free orchard binary networks [5], a class that includes binary tree-child networks. Orchard networks can be characterized as rooted trees with additional “horizontal arcs” [32]. They were first defined as cherry-picking networks: networks that can be reduced to a single edge by iteratively reducing a cherry or a reticulated cherry [17]. A cherry is a pair of leaves (x,y)(x,y) with a common parent. A reticulated cherry is a pair of leaves (x,y)(x,y) such that the parent uu of yy is a tree node and the parent vv of xx is a hybrid node with e=(u,v)e=(u,v) as a parent hybrid edge. Reducing the pair C=(x,y)C=(x,y) means removing taxon xx if CC is a cherry or removing hybrid edge ee if CC is a reticulated cherry, and subsequently suppressing uu and vv if they are of degree 2. Cherries and reticulated cherries are both well-defined on the class of semidirected networks considered here, because leaves are well-defined (stable across rooted partners), each leaf is incident to a single tree edge, and hybrid nodes / edges are well-defined. A stack is a pair of hybrid nodes connected by a hybrid edge, and a rooted network is stack-free if it has no stack. As hybrid edges are well-defined on our general class of networks, the concepts of stacks and stack-free networks also generalize directly. Therefore, we conjecture that for semidirected networks, our edge-based representation μE\mu_{E} and the associated dissimilarity dμEd_{\mu_{E}} also separate distinct networks well beyond the tree-child class, possibly to stack-free orchard semidirected networks.

To discriminate distinct orchard networks with possible stacks, Cardona et al. [9] introduced an “extended” node-based μ\mu-representation of rooted phylogenetic networks. In this representation, the μ\mu-vector for each node vv is extended by one more coordinate, μ0(v)\mu_{0}(v), counting the number of paths from vv to a hybrid node (any hybrid node). On rooted networks, adding this extension allows μV\mu_{V} to distinguish between any two orchard networks, even if they contain stacks (but assumed binary, without parallel edges and without outdegree-1 tree nodes in [9]). For semidirected networks, we conjecture that the edge-based representation μE\mu_{E} can also be extended in the same way, and that this extension may provide a proper distance on the space of semidirected orchard networks.

Phylogenetic networks are most often used as metric networks with edge lengths and inheritance probabilities. Dissimilarities are needed to compare metric networks using both their topologies and edge parameters. For trees, extensions of the RF distance, which dμEd_{\mu_{E}} extends, are widely used. They can be expressed using edge-based μ\mu-vectors as

d(T1,T2)=mμE(T1)μE(T2)|(m,T1)(m,T2)|pd(T_{1},T_{2})=\sum_{m\in\mu_{E}(T_{1})\cup\mu_{E}(T_{2})}|\ell(m,T_{1})-\ell(m,T_{2})|^{p} (1)

where (m,Ti)\ell(m,T_{i}) is the length in tree TiT_{i} of the edge corresponding to the μ\mu-vector mm, considered to be 0 if mm is absent from μE(Ti)\mu_{E}(T_{i}). The weighted RF distance uses p=1p=1 [26] and the branch score distance uses p=2p=2 [20]. If all weights (m,T)\ell(m,T) are 1 for mμE(T)m\in\mu_{E}(T), then (1) boils down to the RF distance when restricted to trees (either rooted or unrooted), and to our dμEd_{\mu_{E}} dissimilarity on semidirected phylogenetic networks more generally. For networks with edge lengths, (1) could be used to extend dμEd_{\mu_{E}}, where (μE(e),N)\ell(\mu_{E}(e),N) is defined as the length of edge ee in NN as it is for trees. A root μ\mu-vector could be assigned weight (μr(T),N)=0\ell(\mu_{r}(T),N)=0, because in standard cases, such as for networks with a single root component, the root μ\mu-vector(s) carry redundant information.

Alternatively, using inheritance probabilities could be useful to capture the similarity between a network having a hybrid edge with inheritance very close to 0 and a network lacking this edge. To this end, we could modify μ\mu-vectors. Recall that [7] defined μ(v,N)=(μ1,,μn)\mu(v,N)=(\mu_{1},\ldots,\mu_{n}) with μi\mu_{i} equal to the number mim_{i} of paths from vv to taxon ii in a directed network NN. We could generalize μi\mu_{i} to be a function of these mim_{i} paths, possibly reflecting inheritance probabilities. For example, we could use the weight of a path pp, defined as γ(p)=epγ(e)\gamma(p)=\prod_{e\in p}\gamma(e). These weights sum to 1 over up-down paths between vv and ii [33], although not over the mim_{i} directed paths from vv to ii. The weights of the mim_{i} paths could then be normalized before calculating their entropy Hi=p:viγ(p)logγ(p)H_{i}=-\sum_{p:v\leadsto i}\gamma(p)\log\gamma(p) and then define μi=eHi\mu_{i}=e^{H_{i}}. The original definition μi=mi\mu_{i}=m_{i} corresponds to giving all paths viv\leadsto i equal weight 1/mi1/m_{i}. This extension carries over from directed to semidirected networks because we proved here that the set of directed paths from an edge e=(u,v)e=(u,v) to ii is independent of the root choice, given a fixed admissible direction assigned to ee, as shown in Propositions 15 and 16. With this extension, μ\mu-vectors are in the continuous space 0n\mathbb{R}_{\geq 0}^{n} instead of 0n\mathbb{Z}_{\geq 0}^{n}. To use them in a dissimilarity between networks NN and NN^{\prime}, we could use non-trivial distance between μ\mu-vectors (such as the L1L^{1} or L2L^{2} norm) then get the score of an optimal matching between μ\mu-vectors in μE(N)\mu_{E}(N) and μE(N)\mu_{E}(N^{\prime}). Searching for an optimal matching would increase the computational complexity of the dissimilarity, but would remain polynomial using the Hungarian algorithm [19].

To reduce the dependence of dμEd_{\mu_{E}} on the number of taxa nn in the two networks, dμEd_{\mu_{E}} should be normalized by a factor depending on nn only. This is particularly useful to compare networks with different leaf labels, by taking the dissimilarity between the subnetworks on their shared leaves. Ideally, the normalization factor is the diameter of the network space, that is, the maximum distance dμE(N,N)d_{\mu_{E}}(N,N^{\prime}) over all networks NN and NN^{\prime} in a subspace of interest. For the subspace of unrooted trees on nn leaves, this is 2(n3)2(n-3) [30]. Future work could study the diameter of other semidirected network spaces, such as level-1 or tree-child semidirected networks (which have ntn-t or fewer hybrid nodes, 24) or orchard semidirected networks (whose number of hybrids is unbounded).

To compare semidirected networks N1N_{1} on leaf set 1\mathcal{L}_{1} and N2N_{2} on leaf set 2\mathcal{L}_{2} with a non-zero dissimilarity if 12\mathcal{L}_{1}\neq\mathcal{L}_{2}, one idea is to consider the subnetworks N~1\tilde{N}_{1} and N~2\tilde{N}_{2} on their common leaf set =12\mathcal{L}=\mathcal{L}_{1}\cap\mathcal{L}_{2} then use a penalized dissimilarity:

dμE(N~1,N~2)+λdSymm(1,2)d_{\mu_{E}}(\tilde{N}_{1},\tilde{N}_{2})+\lambda d_{\mathrm{Symm}}(\mathcal{L}_{1},\mathcal{L}_{2})

for some constant λ0\lambda\geq 0. This dissimilarity may not satisfy the triangle inequality, which might be acceptable in some contexts. For example, consider as input a set of semidirected networks N1,,NnN_{1},\ldots,N_{n} with NiN_{i} on leaf set i\mathcal{L}_{i}, and consider the full leaf set =ini\mathcal{L}=\cup_{i}^{n}\mathcal{L}_{i}. We may then seek an \mathcal{L}-network NN that minimizes some criterion, such as

i=1nd(N,Ni).\sum_{i=1}^{n}d(N,N_{i})\;. (2)

When NN is constrained to be an unrooted tree, input networks NiN_{i} are unrooted trees and when dd is the RF distance using NN pruned to i\mathcal{L}_{i}, this is the well-studied RF supertree problem [31]. When the input trees NiN_{i} are further restricted to be on 4 taxa, (2) is the criterion used by ASTRAL [34]. The very wide use of ASTRAL and its high accuracy points to the impact of distances that are fast to calculate, such as our proposed dμEd_{\mu_{E}}.

Acknowledgements

This work was supported in part by the National Science Foundation (DMS 2023239) and by a H. I. Romnes faculty fellowship to C.A. provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research with funding from the Wisconsin Alumni Research Foundation.

References

  • Allman et al. [2019] Elizabeth S. Allman, Hector Baños, and John A. Rhodes. NANUQ: a method for inferring species networks from gene trees under the coalescent model. Algorithms for Molecular Biology, 14(1), 2019. doi: 10.1186/s13015-019-0159-2.
  • Ané et al. [2024] Cécile Ané, John Fogg, Elizabeth S. Allman, Hector Baños, and John A. Rhodes. Anomalous networks under the multispecies coalescent: theory and prevalence. Journal of Mathematical Biology, 88:29, 2024. doi: 10.1007/s00285-024-02050-7.
  • Baños [2019] Hector Baños. Identifying species network features from gene tree quartets under the coalescent model. Bulletin of Mathematical Biology, 81(2):494–534, 2019. doi: 10.1007/s11538-018-0485-4.
  • Babai [2016] László Babai. Graph isomorphism in quasipolynomial time. arXiv, 2016. doi: 10.48550/arXiv.1512.03547.
  • Bai et al. [2021] Allan Bai, Péter L. Erdős, Charles Semple, and Mike Steel. Defining phylogenetic networks using ancestral profiles. Mathematical Biosciences, 332:108537, 2021. doi: 10.1016/j.mbs.2021.108537.
  • Cardona et al. [2008] Gabriel Cardona, Mercè Llabrés, Francesc Rosselló, and Gabriel Valiente. A distance metric for a class of tree-sibling phylogenetic networks. Bioinformatics, 24(13):1481–1488, 2008. doi: 10.1093/bioinformatics/btn231.
  • Cardona et al. [2009] Gabriel Cardona, Francesc Rosselló, and Gabriel Valiente. Comparison of tree-child phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):552–569, 2009. doi: 10.1109/tcbb.2007.70270.
  • Cardona et al. [2014] Gabriel Cardona, Mercè Llabrés, Francesc Rosselló, and Gabriel Valiente. The comparison of tree-sibling time consistent phylogenetic networks is graph isomorphism-complete. The Scientific World Journal, 2014:254279, 2014. doi: 10.1155/2014/254279.
  • Cardona et al. [2024] Gabriel Cardona, Joan Carles Pons, Gerard Ribas, and Tomás Martínez Coronado. Comparison of orchard networks using their extended μ\mu-representation. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 21(3):501–507, 2024. doi: 10.1109/TCBB.2024.3361390.
  • Felsenstein [2004] Joseph Felsenstein. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts, 2004. ISBN 0878931775. doi: 10.1086/383584.
  • Gautier et al. [2022] Mathieu Gautier, Renaud Vitalis, Laurence Flori, and Arnaud Estoup. f-statistics estimation and admixture graph construction with pool-seq or allele count data using the R package poolfstat. Molecular Ecology Resources, 22(4):1394–1416, 2022. doi: 10.1111/1755-0998.13557.
  • Gross et al. [2021] Elizabeth Gross, Leo van Iersel, Remie Janssen, Mark Jones, Colby Long, and Yukihiro Murakami. Distinguishing level-1 phylogenetic networks on the basis of data generated by Markov processes. Journal of Mathematical Biology, 83(3):32, 2021. doi: 10.1007/s00285-021-01653-8.
  • Hickey et al. [2008] Glenn Hickey, Frank Dehne, Andrew Rau-Chaplin, and Christian Blouin. SPR distance computation for unrooted trees. Evolutionary Bioinformatics, 4:EBO.S419, 2008. doi: 10.4137/EBO.S419.
  • Huber et al. [2022] Katharina T. Huber, Vincent Moulton, and Guillaume E. Scholz. Forest-based networks. Bulletin of Mathematical Biology, 84:119, 2022. doi: 10.1007/s11538-022-01081-9.
  • Huber et al. [2025] Katharina T. Huber, Leo van Iersel, Vincent Moulton, and Guillaume E. Scholz. Is this network proper forest-based? Information Processing Letters, 187:106500, 2025. doi: 10.1016/j.ipl.2024.106500.
  • Huson et al. [2010] Daniel H. Huson, Regula Rupp, and Celine Scornavacca. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, Cambridge, 2010. doi: 10.1017/CBO9780511974076.
  • Janssen and Murakami [2021] Remie Janssen and Yukihiro Murakami. On cherry-picking and network containment. Theoretical Computer Science, 856:121–150, 2021. doi: 10.1016/j.tcs.2020.12.031.
  • Kong et al. [2022] Sungsik Kong, David L. Swofford, and Laura S. Kubatko. Inference of phylogenetic networks from sequence data using composite likelihood. bioRxiv, 2022. doi: 10.1101/2022.11.14.516468.
  • Kuhn [1955] Harold W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955. doi: 10.1002/nav.3800020109.
  • Kuhner and Felsenstein [1994] M K Kuhner and J Felsenstein. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution, 11(3):459–468, 1994. doi: 10.1093/oxfordjournals.molbev.a040126.
  • Linz and Wicke [2023] Simone Linz and Kristina Wicke. Exploring spaces of semi-directed level-1 networks. Journal of Mathematical Biology, 87(70), 2023. doi: 10.1007/s00285-023-02004-5.
  • Lutteropp et al. [2022] Sarah Lutteropp, Celine Scornavacca, Alexey M Kozlov, Benoit Morel, and Alexandros Stamatakis. NetRAX: accurate and fast maximum likelihood phylogenetic network inference. Bioinformatics, 38(15):3725–3733, 2022. doi: 10.1093/bioinformatics/btac396.
  • Maier et al. [2023] Robert Maier, Pavel Flegontov, Olga Flegontova, Ulaş Işıldak, Piya Changmai, and David Reich. On the limits of fitting complex models of population history to f-statistics. eLife, 12:e85492, 2023. doi: 10.7554/eLife.85492.
  • Martin et al. [2023] Samuel Martin, Vincent Moulton, and Richard M. Leggett. Algebraic invariants for inferring 4-leaf semi-directed phylogenetic networks. bioRxiv, 2023. doi: 10.1101/2023.09.11.557152.
  • Neureiter et al. [2022] Nico Neureiter, Peter Ranacher, Nour Efrat-Kowalsky, Gereon A. Kaiping, Robert Weibel, Paul Widmer, and Remco R. Bouckaert. Detecting contact in language trees: a Bayesian phylogenetic model with horizontal transfer. Humanities and Social Sciences Communications, 9(1):205, 2022. doi: 10.1057/s41599-022-01211-7.
  • Robinson and Foulds [1979] D. F. Robinson and L. R. Foulds. Comparison of weighted labelled trees. In A. F. Horadam and W. D. Wallis, editors, Combinatorial Mathematics VI, pages 119–126, Berlin, Heidelberg, 1979. Springer Berlin Heidelberg. ISBN 978-3-540-34857-3.
  • Robinson and Foulds [1981] D.F. Robinson and L.R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1):131–147, 1981. doi: doi.org/10.1016/0025-5564(81)90043-2.
  • Solís-Lemus and Ané [2016] Claudia Solís-Lemus and Cécile Ané. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLOS Genetics, 12(3):e1005896, 2016. doi: 10.1371/journal.pgen.1005896.
  • Soraggi and Wiuf [2019] Samuele Soraggi and Carsten Wiuf. General theory for stochastic admixture graphs and f-statistics. Theoretical Population Biology, 125:56–66, 2019. doi: 10.1016/j.tpb.2018.12.002.
  • Steel [2016] Mike Steel. Phylogeny: Discrete and Random Processes in Evolution. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2016. doi: 10.1137/1.9781611974485.
  • Vachaspati and Warnow [2017] Pranjal Vachaspati and Tandy Warnow. Fastrfs: fast and accurate Robinson-Foulds supertrees using constrained exact optimization. Bioinformatics, 33(5):631–639, 2017. doi: 10.1093/bioinformatics/btw600.
  • van Iersel et al. [2022] Leo van Iersel, Remie Janssen, Mark Jones, and Yukihiro Murakami. Orchard networks are trees with additional horizontal arcs. Bulletin of Mathematical Biology, 84:76, 2022. doi: 10.1007/s11538-022-01037-z.
  • Xu and Ané [2023] Jingcheng Xu and Cécile Ané. Identifiability of local and global features of phylogenetic networks from average distances. Journal of Mathematical Biology, 86(1):12, 2023. doi: 10.1007/s00285-022-01847-8.
  • Zhang et al. [2018] Chao Zhang, Maryam Rabiee, Erfan Sayyari, and Siavash Mirarab. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics, 19(Suppl 6):153, 2018. doi: 10.1186/s12859-018-2129-y.

[Example edge-based μ\mu-representation]

Let =(a1,a2,b,c,d,h1,h2,h3)\mathcal{L}=(a_{1},a_{2},b,c,d,h_{1},h_{2},h_{3}). In Fig. 3 left, NN is an \mathcal{L}-network. In μ\mu-vectors, leaves are ordered as in \mathcal{L}. Following notations in Algorithm 1, μE(N)=A1+A2+A3+A4\mu_{E}(N)=A_{1}+A_{2}+A_{3}+A_{4} where the multisets AiA_{i} (i4i\leq 4) partition the μ\mu-vector sets based on their type (uni/bidirectional) and tags (:𝐭\mathbf{:{\mkern-5.0mu}t}, :𝐡\mathbf{:{\mkern-5.0mu}h} or :𝐫\mathbf{:{\mkern-5.0mu}r}).

A1A_{1} contains the μ\mu-vector sets of tree edges in the directed part (in black in Fig. 3), which are here the pendent edges (incident to leaves) and e5e_{5}:

A1\displaystyle A_{1} =\displaystyle= \displaystyle\lbag {((1,0,0,0,0, 0,0,0), :t)}, # pendent to a_1

A2A_{2} contains the μ\mu-vector sets of hybrid edges (in blue in Fig. 3):

A2\displaystyle A_{2} =\displaystyle= \displaystyle\lbag {((0,0,0,0,0, 1,1,1), :h)},

A3A_{3} contains a single μ\mu-vector set, because NN has only 1 root component:

A3\displaystyle A_{3} =\displaystyle= \displaystyle\lbag {((1,1,1,1,1, 3,3,3), :r)} ⟆ .

A4A_{4} contains the bidirectional μ\mu-vector sets of edges in the root component (in brown in Fig. 3):

A4\displaystyle A_{4} =\displaystyle= \displaystyle\lbag {((1,1,0,0,0, 0,0,0), :t), ((0,0,1,1,1, 3,3,3), :t)}, # e_1

Now we reconstruct NN from μE(N)\mu_{E}(N), following notations from Algorithm 2 and Algorithm 3.

B1B_{1} contains the μ\mu-vectors from those in A1A_{1}, corresponding to the tree nodes in the directed part:

B1\displaystyle B_{1} =\displaystyle= \displaystyle\lbag (1,0,0,0,0, 0,0,0), (0,0,0,1,0, 0,0,0), (0,0,0,0,0, 0,1,0),

B2B_{2} contains the hybrid node’s μ\mu-vector (in the directed part):

B2\displaystyle B_{2} =\displaystyle= {\displaystyle\{ (0,0,0,0,0, 1,1,1) } .

B3B_{3} contains a single root μ\mu-vector:

B3={z} with z=(1,1,1,1,1,3,3,3).B_{3}=\{z\}\mbox{ with }z=(1{,}1{,}1{,}1{,}1,3{,}3{,}3)\;.

M(z)M(z) contains all directional μ\mu-vectors of all edges in the root component corresponding to zz:

M(z)\displaystyle M(z) =\displaystyle= \displaystyle\lbag (1,1,0,0,0, 0,0,0), (0,0,1,0,0, 0,0,0), (1,1,1,1,0, 1,1,1),

After arbitrarily choosing r(z)=(0,0,0,1,1,2,2,2)r(z)=(0{,}0{,}0{,}1{,}1,2{,}2{,}2), B4B_{4} contains a subset of directional μ\mu-vectors:

B4\displaystyle B_{4} =\displaystyle= \displaystyle\lbag (1,1,0,0,0, 0,0,0), (0,0,1,0,0, 0,0,0), (0,0,0,0,1, 2,2,2),
Refer to caption
Figure A7: For NN in Fig. 3 (left), rooted partner GG from Algorithm 1 when choosing r(z)=(0,0,0,1,1,2,2,2)r(z)=(0{,}0{,}0{,}1{,}1,2{,}2{,}2), which corresponds to e7e_{7} (red arrow). The undirected edges in NN (brown labels) are identified by Algorithm 3.

Fig. A7 shows GG, the rooted partner such that μV(G)=B1+B2+B3+B4\mu_{V}(G)=B_{1}+B_{2}+B_{3}+B_{4}. Algorithm 3 starts by obtaining B=μE(G)B=\mu_{E}(G):

B\displaystyle B =\displaystyle= \displaystyle\lbag {((1,0,0,0,0, 0,0,0), :t)}, {((0,0,1,0,0, 0,0,0), :t)},

Then FF contains the μ\mu-vectors xx of elements {(x,:𝐭)}\{(x,\mathbf{:{\mkern-5.0mu}t})\} that are in BB but not in μE(N)\mu_{E}(N):

F\displaystyle F =\displaystyle= \displaystyle\lbag (1,1,0,0,0, 0,0,0), (0,0,1,0,0, 0,0,0),

For x=(0,0,1,0,0,0,0,0)x=(0{,}0{,}1{,}0{,}0,0{,}0{,}0), {(x,:𝐭)}\{(x,\mathbf{:{\mkern-5.0mu}t})\} has multiplicity 2 in BB, corresponding to an elementary path of 2 edges: e2e_{2} and the pendent edge incident to bb. As xx has multiplicity 1 in FF, only e2e_{2} (in GG) is undirected to obtain NN. The other elements in FF also have multiplicity 1. Each one corresponds to a single edge in GG (labeled in brown in Fig. A7) and is undirected to obtain NN.