This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Bounding the softwired parsimony score of a phylogenetic network

Janosch Döcker School of Computer Science, University of Auckland, New Zealand Simone Linz School of Computer Science, University of Auckland, New Zealand Kristina Wicke Department of Mathematical Sciences, New Jersey Institute of Technology, USA
Abstract

In comparison to phylogenetic trees, phylogenetic networks are more suitable to represent complex evolutionary histories of species whose past includes reticulation such as hybridisation or lateral gene transfer. However, the reconstruction of phylogenetic networks remains challenging and computationally expensive due to their intricate structural properties. For example, the small parsimony problem that is solvable in polynomial time for phylogenetic trees, becomes NP-hard on phylogenetic networks under softwired and parental parsimony, even for a single binary character and structurally constrained networks. To calculate the parsimony score of a phylogenetic network NN, these two parsimony notions consider different exponential-size sets of phylogenetic trees that can be extracted from NN and infer the minimum parsimony score over all trees in the set. In this paper, we ask: What is the maximum difference between the parsimony score of any phylogenetic tree that is contained in the set of considered trees and a phylogenetic tree whose parsimony score equates to the parsimony score of NN? Given a gap-free sequence alignment of multi-state characters and a rooted binary level-kk phylogenetic network, we use the novel concept of an informative blob to show that this difference is bounded by k+1k+1 times the softwired parsimony score of NN. In particular, the difference is independent of the alignment length and the number of character states. We show that an analogous bound can be obtained for the softwired parsimony score of semi-directed networks, while under parental parsimony on the other hand, such a bound does not hold.

Keywords: Phylogenetic networks; level-kk; softwired parsimony; parental parsimony

1 Introduction

The generalisation of phylogenetic trees to phylogenetic networks goes along with the development of new methods to reconstruct phylogenetic networks from genomic data. Since phylogenetic networks are structurally much more complicated than phylogenetic trees, the algorithms to infer networks are typically computationally expensive. Indeed, several optimisation problems that can be solved efficiently for phylogenetic trees, become computationally difficult for phylogenetic networks. For example, the small parsimony problem, which seeks to find the parsimony score of a given phylogenetic tree with character states assigned to its leaves is solvable in polynomial time for phylogenetic trees, e.g., using the well-known Fitch-Hartigan algorithm [7, 10], but becomes NP-hard under different notions of parsimony for phylogenetic networks, even for a single binary character and structurally constrained networks [6, 24].

Despite the popularity of model-based methods to infer phylogenetic trees, maximum parsimony (see, e.g. [4] and references therein) continues to be widely used in certain areas of evolutionary biology, such as the analysis of morphological data (see, e.g. [17, 18, 20]). Moreover, since calculating the parsimony score of a phylogenetic tree is computationally less expensive than calculating its likelihood, parsimony trees are often used as starting trees from which a search through tree space is started [23] and are also used in Bayesian phylogenetic inference [25].

Recently, different notions for parsimony on rooted phylogenetic networks have been proposed, referred to as hardwired, softwired, and parental parsimony. Softwired [16] and parental [24] parsimony both consider collections of trees that can be extracted from a rooted phylogenetic network (so-called displayed trees, respectively parental trees) and define the parsimony score as the minimum parsimony score of any tree in the collection. Softwired parsimony is implemented in the popular software package PhyloNetworks [22] and is the main focus of this paper. Hardwired parsimony [13], on the other hand, calculates the parsimony score of a phylogenetic network by considering character-state transitions along all edges of the network. As the sets of rooted phylogenetic trees that are evaluated when computing the softwired and parental parsimony scores of a phylogenetic network have exponential size, it is of interest to investigate the differences in parsimony scores of elements in these sets. Given a gap-free alignment of multi-state characters and a rooted binary level-kk network NN (formally defined below), we analyse how different the parsimony score of any phylogenetic tree displayed by NN and the softwired parsimony score of NN can be. We show that independent of the alignment length and number of character states, this difference is bounded by k+1k+1 times the parsimony score of NN. Thus, while computing the softwired parsimony score is in general an NP-hard problem (even for a single binary character and a structurally constrained network [6, Theorem 4.3]), our result implies that an upper bound for the softwired parsimony score of NN can be obtained in polynomial time by simply evaluating the parsimony score of an arbitrary phylogenetic tree that is displayed by NN. In particular if the level of NN is small, this upper bound gives a good indication of the magnitude of the softwired parsimony score of NN.

Related to our result, it was shown by Fischer et al. [6, Theorem 5.7] that the NP-hard optimisation problem of computing the softwired parsimony of a rooted level-kk network for a single multi-state character is, on the positive side, also fixed-parameter tractable, when the parameter is kk. If one considers more than a single binary character, the softwired parsimony problem is NP-hard even for a rooted level-11 network [15, Theorem 1]. As a consequence of the latter negative result, Kelk et al. [15] posed the following question. Are there good (i.e. constant-factor) approximation algorithms for computing the softwired parsimony score of a rooted phylogenetic network NN and a sequence alignment AA without gaps under the following three restrictions: (i) NN is level-11, (ii) each biconnected component of NN has exactly three outgoing edges, and (iii) AA consists of binary characters?

As hinted at above, from an algorithmic perspective, our upper bound result implies a (k+1)(k+1)-approximation algorithm for computing the softwired parsimony score of a rooted binary level-kk network NN. Specifically, take an arbitrary phylogenetic tree that is displayed by NN, compute its parsimony score, and use this to approximate the softwired parsimony score of NN. If the level of NN is fixed, this algorithm provides a polynomial-time constant factor approximation. Hence, we answer the aforementioned question by Kelk et al. affirmatively for a much larger class of rooted phylogenetic networks in the sense that our result holds for level-kk networks (for a fixed non-negative integer kk), it does not require restriction (ii), and it holds for gap-free alignments independent of the number of character states. Our result also complements a recent paper by Frohn and Kelk [8], in which the authors establish a 2-approximation algorithm for the softwired parsimony problem on binary tree-child networks for a single character.

While softwired parsimony for rooted phylogenetic networks is the main focus of our paper, we additionally show that an analogous upper bound for the softwired parsimony score holds for semi-directed networks that are obtained from rooted phylogenetic networks by deleting the root and omitting the direction of all but reticulation edges. Semi-directed networks have recently been central to studying identifiability questions related to phylogenetic networks and to developing phylogenetic network estimation algorithms (e.g. [1, 9, 11, 21]). We also briefly turn to the notion of parental parsimony (on rooted phylogenetic networks) and show by way of counterexample that an analogous bound for the parental parsimony score does not hold.

The remainder of this paper is organised as follows. We define all relevant concepts related to phylogenetic trees and networks, introduce the notion of softwired parsimony, and state the main result in Section 2. In Section 3, we revisit the rSPR distance and establish an upper bound on this distance for two phylogenetic trees that are both displayed by a given phylogenetic network. In Sections 4 and 5, we introduce the notion of an informative blob and a blob reduction, respectively. Informative blobs are a novel concept that is crucial for obtaining our main result, the upper bound on the softwired parsimony score, which we establish in Section 6. Sections 7 and 8 are then devoted to parental parsimony on rooted networks and softwired parsimony on semi-directed networks, respectively. We end the paper with some concluding remarks and directions for future research in Section 9.

2 Preliminaries and statement of main result

This section introduces notation and terminology, and states the main result. Throughout this paper, XX denotes a non-empty finite set. Let GG be a directed graph. We use V(G)V(G) and E(G)E(G) to denote the vertex set and edge set, respectively, of GG. Furthermore, for each edge (u,v)(u,v) of GG, uu is called a parent of vv and vv is called a child of uu. We sometimes also refer to uu and vv as neighbours in GG. In a similar vein, for two (not necessarily distinct) vertices ss and tt of GG, we say that ss (resp. tt) is an ancestor (resp. descendant) of tt (resp. ss) if there is a directed path of length zero or more from ss to tt. Now let GG and GG^{\prime} be two directed graphs, and let e=(u,w)e=(u,w) be an edge of GG. Then subdividing ee is the operation of replacing ee with two new edges (u,v)(u,v) and (v,w)(v,w). Furthermore, we call GG^{\prime} a subdivision of GG if GG^{\prime} can be obtained from GG by repeatedly subdividing an edge. We also consider GG to be a subdivision of itself.

Phylogenetic trees and networks. A rooted binary phylogenetic network NN on XX is a rooted acyclic digraph with no loops and no parallel edges that satisfies the following three properties:

  1. (i)

    the set of leaves is XX,

  2. (ii)

    the out-degree of the (unique) root ρ\rho is exactly one, and

  3. (iii)

    every other vertex has either in-degree one and out-degree two, or in-degree two and out-degree one.

The set XX is also sometimes called the label set of NN. Furthermore, a vertex of NN is referred to as a reticulation if it has in-degree two and as a tree vertex if it has in-degree one and out-degree two. Similarly, an edge of NN that is directed into a reticulation is referred to as a reticulation edge. We denote the number of reticulations in NN by h(N)h(N).

Let NN be a rooted binary phylogenetic network on XX. If NN has no reticulation, then it is called a rooted binary phylogenetic XX-tree. Since all phylogenetic trees and networks are rooted and binary throughout this paper except for Section 8, we refer to a rooted binary phylogenetic network as a phylogenetic network on XX and to a rooted binary phylogenetic tree as a phylogenetic XX-tree.

Let SS be a subdivision of a phylogenetic XX-tree. We call the directed path from the root of SS to its closest degree-three vertex its root path. If SS is a phylogenetic tree, then the root path consists of a single edge, in which case we sometimes refer to the root path as the root edge.

Now let NN be a phylogenetic network. A biconnected component of NN is a maximal subgraph of NN that is connected and cannot be disconnected by deleting exactly one of its vertices. Furthermore, a vertex of a biconnected component of NN is called a reticulation if it is a reticulation in NN. With this definition in hand, we say that NN is level-kk if the maximum number of reticulations of a biconnected component of NN is at most kk. Lastly, we call a biconnected component of NN a blob if it has at least one reticulation. For a blob BB of NN, we refer to the unique vertex with in-degree zero and out-degree two in BB as the source of BB. A phylogenetic network NN on {x1,x2,,x8}\{x_{1},x_{2},\ldots,x_{8}\} with two blobs BB and BB^{\prime} is shown on the left-hand side of Figure 1.

Clusters. Let MM be a subdivision of a phylogenetic network on XX, and let YY be a subset of XX. We call YY a cluster of MM if there exists a vertex vv in MM that has precisely YY as its set of descendant leaves. Note that there may be more than one vertex in MM whose cluster is YY and that this may also be the case if MM is a subdivision of a phylogenetic XX-tree. Furthermore, we use clM(v)\mathrm{cl}_{M}(v) or cl(v)\mathrm{cl}(v) if the subscript is clear from the context to denote the cluster of a given vertex vv of MM.

Displaying. Let NN be a phylogenetic network on XX with root ρ\rho, and let TT be a phylogenetic XX^{\prime}-tree with XXX^{\prime}\subseteq X. We say that TT is displayed by NN if there exists a subgraph of NN that is a subdivision of TT that includes ρ\rho, in which case this subgraph is called an embedding of TT in NN. The set of all phylogenetic XX-trees that are displayed by NN is referred to as the display set of NN and denoted by D(N)D(N). Ignoring the assignment of 0 and 1 to vertices for the moment, Figure 1 shows a phylogenetic network NN, a phylogenetic tree TT that is displayed by NN, and an embedding SS of TT in NN. Now, consider a subset RR of the reticulation edges of NN. We refer to RR as a switching if, for each reticulation vv in NN, it contains exactly one of the two edges that are directed into vv. By deleting each reticulation edge of NN that is not in RR, we obtain a connected subgraph GG of NN with no underlying cycle and, for each leaf X\ell\in X, there is a directed path from the root of GG, which coincides with ρ\rho, to \ell. If we repeatedly suppress each vertex in GG with in-degree one and out-degree one, and delete each vertex in GG with out-degree zero that is not in XX until no such operation is possible, we obtain a phylogenetic XX-tree TRT_{R}. We say that RR yields TRT_{R}. By construction, TRT_{R} is displayed by NN. Conversely, observe that, for each phylogenetic XX-tree TT in D(N)D(N), there exists at least one switching that yields TT. In summary, we have the following observation.

Refer to caption
Figure 1: A phylogenetic network NN (left), an embedding SS of a phylogenetic XX-tree displayed by NN together with a binary character ff and an extension FF (middle; also indicated by the dashed lines in the left panel), and the phylogenetic XX-tree TT obtained from SS by suppressing vertices of in-degree one and out-degree one (right).
Observation 2.1.

Let NN be a phylogenetic network on XX, and let TT be a phylogenetic XX-tree. Then TT is displayed by NN if and only if there exists a switching of NN that yields TT.

rSPR operation. Let TT be a phylogenetic XX-tree, and let e=(u,v)e=(u,v) be an edge of TT that is not incident with the root. Let TT^{\prime} be a phylogenetic XX-tree obtained from TT by deleting ee and reattaching the resulting rooted subtree that contains vv via a new edge ff in the following way: Subdivide an edge of the component that contains the root of TT with a new vertex uu^{\prime}, join uu^{\prime} and vv with ff, and suppress uu. We say that TT^{\prime} has been obtained from TT by a rooted subtree prune and regraft (rSPR) operation. The rSPR distance between any two phylogenetic XX-trees TT and TT^{\prime}, denoted by drSPR(T,T)d_{\mathrm{rSPR}}(T,T^{\prime}), is the minimum number of rSPR operations that transform TT into TT^{\prime}. It is well known that TT^{\prime} can always be obtained from TT by a sequence of single rSPR operations and, so, drSPR(T,T)d_{\mathrm{rSPR}}(T,T^{\prime}) is well defined.

Characters. An rr-state character on XX is a surjective function f:XCf\colon X\rightarrow C from XX into a set CC of character states with r=|C|1r=|C|\geq 1. If r=2r=2, then ff is called a binary character. Throughout this paper, all results are established for rr-state characters with rr being fixed and arbitrarily large. For simplicity, we refer to an rr-state character on XX as a character on XX.

Let GG be an acyclic digraph with leaf set XX, and let f:XCf\colon X\rightarrow C be a character on XX. An extension of ff to V(G)V(G) is a function F:V(G)CF\colon V(G)\rightarrow C such that F()=f()F(\ell)=f(\ell) for each element X\ell\in X. For an extension FF of ff, we set

ch(F,G)=|{(u,v)E(G):F(u)F(v)}|,{\rm ch}(F,G)=|\{(u,v)\in E(G):F(u)\neq F(v)\}|,

and refer to ch(F,G){\rm ch}(F,G) as the changing number of FF. Intuitively, each edge of GG that contributes to the changing number of FF requires a character-state transition to explain ff on GG. Lastly, we say that an extension FF of ff to V(G)V(G) is minimum if there exists no extension of ff to V(G)V(G) whose changing number is strictly smaller than that of FF.

In what follows, we often consider a sequence (f1,f2,,fn)(f_{1},f_{2},\ldots,f_{n}) of characters on XX instead of a single character. We call such a sequence an alignment. Unless stated otherwise, all alignments in this paper are sequences of rr-state characters for r2r\geq 2 that do not contain the gap symbol “–”. Such an alignment is referred to as gap-free. In applied phylogenetics, multiple sequence alignments frequently contain gaps which, intuitively, are placeholders that can take on any of the other rr character states. We will see in the last section why the restriction to gap-free alignments is necessary. Lastly, we denote a sequence (f1)(f_{1}) that consists of a single element by f1f_{1} and omit parentheses for simplicity.

Parsimony on phylogenetic trees and their subdivisions. Given an alignment A=(f1,f2,,fn)A=(f_{1},f_{2},\ldots,f_{n}) of characters on XX and an arbitrary rooted tree TT with leaf set XX, we refer to

PS(A,T)=i=1nminFi(ch(Fi,T))PS(A,T)=\sum_{i=1}^{n}\min_{F_{i}}({\rm ch}(F_{i},T))

as the parsimony score of AA on TT, where the minimum is taken over all extensions of fif_{i} to V(T)V(T).

Instead of calculating the parsimony score of a phylogenetic XX-tree TT, we are often interested in calculating the parsimony score of a subdivision of TT in the upcoming sections. The next lemma states that both scores are equal. Its correctness can be established analogously to the proof of Lemma 4.5 in [6]. In the proof of this lemma, Fischer et al. have shown that the softwired parsimony score of a character ff on a phylogenetic tree TT is equal to the parsimony score of ff on a particular rooted tree that is a generalisation of a subdivision of TT in the sense that it may contain unlabeled leaves in addition to the leaves in XX.

Lemma 2.2.

Let ff be a character on XX, and let SS be a subdivision of a phylogenetic XX-tree TT. Then PS(f,S)=PS(f,T)PS(f,S)=PS(f,T).

We also have the following observation.

Observation 2.3.

Let ff be a character on XX, and let SS be a subdivision of a phylogenetic XX-tree. If FF is an extension of ff to V(S)V(S) such that F(u)F(v)F(u)\neq F(v) for some edge (u,v)(u,v) of the root path of SS, then there exists an extension FF^{\prime} of ff to V(S)V(S) such that F(u)=F(v)F^{\prime}(u)=F^{\prime}(v) and ch(F,T)<ch(F,T){\rm ch}(F^{\prime},T)<{\rm ch}(F,T).

By Observation 2.3, we freely assume throughout the remainder of the paper that every extension FF of a character to the vertices of a subdivision of a phylogenetic tree has the additional property that there is no character state transition on any edge of its root path.

Parsimony on phylogenetic networks. As outlined in the introduction, several notions of parsimony have been introduced that generalise parsimony from phylogenetic trees to phylogenetic networks. In this paper, we are focusing on the notion of softwired parsimony and briefly touch on parental parsimony in Section 7. Roughly speaking, the softwired parsimony score of an alignment AA of characters on a phylogenetic network NN is defined to be the smallest number of character-state transitions that is necessary to explain AA on any phylogenetic tree that is displayed by NN. Following Nakhleh et al. [16], we now make this precise. Let A=(f1,f2,,fn)A=(f_{1},f_{2},\ldots,f_{n}) be an alignment of characters on XX, and let NN be a phylogenetic network on XX. The softwired parsimony score of AA on NN is defined as

PSsw(A,N)=i=1nPSsw(fi,N)\displaystyle PS_{\mathrm{sw}}(A,N)=\sum\limits_{i=1}^{n}PS_{\mathrm{sw}}(f_{i},N) =i=1nminTD(N)minFi(ch(Fi,T))\displaystyle=\sum_{i=1}^{n}\min_{T\in D(N)}\min_{F_{i}}({\rm ch}(F_{i},T))
=i=1nminTD(N)PS(fi,T),\displaystyle=\sum_{i=1}^{n}\min_{T\in D(N)}PS(f_{i},T), (1)

where, for each character fif_{i}, the first minimum is taken over all phylogenetic trees in the display set of NN and the second minimum is taken over all extensions of fif_{i} to V(T)V(T). As per Equation (2), each character in AA can follow a different tree in D(N)D(N). A slightly more restricted definition of softwired parsimony, which has also appeared in the literature (e.g. see [14, 15]), is the following

PSsw(A,N)=minTD(N)i=1nminFi(ch(Fi,T))=minTD(N)i=1nPS(fi,T),\displaystyle PS_{\mathrm{sw}^{\prime}}(A,N)=\min\limits_{T\in D(N)}\sum\limits_{i=1}^{n}\min_{F_{i}}({\rm ch}(F_{i},T))=\min\limits_{T\in D(N)}\sum\limits_{i=1}^{n}PS(f_{i},T), (2)

where all characters in AA follow the same tree in D(N)D(N). Clearly, if n=1n=1, then PSsw(A,N)=PSsw(A,N)PS_{\mathrm{sw}}(A,N)=PS_{\mathrm{sw}^{\prime}}(A,N). On the other hand, for n1n\geq 1, it follows from the definition that PSsw(A,N)PSsw(A,N)PS_{\mathrm{sw}}(A,N)\leq PS_{\mathrm{sw}^{\prime}}(A,N). For the purpose of the upcoming sections, we adopt the softwired parsimony definition as formalised in Equation (2) and will see later, that our main result holds also under the definition given in Equation (2). In this context, it is worth mentioning that the hardness result for computing the softwired parsimony score of a level-11 network for an alignment of at least two binary characters [15, Theorem 1] as mentioned in the introduction has been established for the definition given in Equation (2).

Statement of main result. The main result of this paper is the following theorem which we establish in Section 6.

Theorem 2.4.

Let NN be a phylogenetic network on XX, and let TT be a phylogenetic XX-tree in D(N)D(N). Furthermore, let AA be an alignment of characters on XX. Then

PS(A,T)(k+1)PSsw(A,N) and PS(A,T)(k+1)PSsw(A,N),PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N)\text{ and }PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(A,N),

where kk is the level of NN.

For example, if NN is a level-1 network, Theorem 2.4 implies that the parsimony score of an arbitrary tree displayed by NN is at most twice the parsimony score of a tree displayed by NN whose parsimony score is equal to PSsw(A,N)PS_{\mathrm{sw}}(A,N). Moreover, we show in Section 6 that the bound as stated in Theorem 2.4 is sharp.

The next corollary positively answers the open problem that is detailed in the introduction and that was first posed in [15].

Corollary 2.5.

For a fixed non-negative integer kk, let NN be a level-kk network on XX, and let AA be an alignment of characters on XX. There exists a polynomial (k+1)(k+1)-approximation algorithm to calculate PSsw(A,N)PS_{\mathrm{sw}}(A,N) and PSsw(A,N)PS_{\mathrm{sw}^{\prime}}(A,N).

Proof.

Clearly, we can construct a phylogenetic XX-tree TT such that TD(N)T\in D(N) in time that is polynomial in |V(N)||V(N)|. Furthermore, it takes time that is polynomial in |X||X| to calculate PS(A,T)PS(A,T) by applying Fitch’s algorithm [7]. The result now follows immediately from Theorem 2.4. ∎

3 Bounding the rSPR distance

In this section, we establish an upper bound on the rSPR distance between two phylogenetic trees for when both trees are displayed by a given network. Let NN be a phylogenetic network, and let RR and RR^{\prime} be two switchings of NN. We define

dswitch(R,R)=h(N)|RR|d_{\mathrm{switch}}(R,R^{\prime})=h(N)-|R\cap R^{\prime}|

to be the switching distance between RR and RR^{\prime}. Intuitively, dswitch(R,R)d_{\mathrm{switch}}(R,R^{\prime}) is the number of reticulations in NN for which RR and RR^{\prime} contain different reticulation edges.

The following lemma is a generalisation of [3, Lemma 3.1].

Lemma 3.1.

Let NN be a phylogenetic network on XX, and let TRT_{R} and TRT_{R^{\prime}} be two phylogenetic XX-trees that are yielded by two switchings RR and RR^{\prime}, respectively, of NN. Then

drSPR(TR,TR)dswitch(R,R).d_{\mathrm{rSPR}}(T_{R},T_{R^{\prime}})\leq d_{\mathrm{switch}}(R,R^{\prime}).
Proof.

Let SS (resp. SS^{\prime}) be an embedding of TRT_{R} (resp. TRT_{R^{\prime}}) in NN whose edge set contains each edge in RR (resp. RR^{\prime}). Obtain a directed acyclic graph NN^{\prime} from NN by deleting each edge that is not contained in E(S)E(S)E(S)\cup E(S^{\prime}) and, subsequently, applying any of the following two operations until no further operation is possible.

  1. (i)

    Suppress a vertex of in-degree one and out-degree one.

  2. (ii)

    If ee and ee^{\prime} are two edges in parallel, delete ee^{\prime}.

By construction, NN^{\prime} is a phylogenetic network on XX and each reticulation edge that is not contained in RRR\cup R^{\prime} is deleted in obtaining NN^{\prime}. Hence, h(N)dswitch(R,R)h(N^{\prime})\leq d_{\mathrm{switch}}(R,R^{\prime}). Furthermore, as SS and SS^{\prime} are embeddings of TRT_{R} and TRT_{R^{\prime}}, respectively, TRT_{R} and TRT_{R^{\prime}} are displayed by NN^{\prime}. Now, let h(TR,TR)h(T_{R},T_{R^{\prime}}) be the minimum number of reticulations of any phylogenetic network that displays TRT_{R} and TRT_{R^{\prime}}. Clearly, h(TR,TR)h(N)h(T_{R},T_{R^{\prime}})\leq h(N^{\prime}) and, thus

dswitch(R,R)h(N)h(TR,TR)drSPR(TR,TR),d_{\mathrm{switch}}(R,R^{\prime})\geq h(N^{\prime})\geq h(T_{R},T_{R^{\prime}})\geq d_{\mathrm{rSPR}}(T_{R},T_{R^{\prime}}),

where the last inequality follows from [19, Equation 10.1]. ∎

4 Informative blobs

To establish the main result of this paper, we introduce the novel concept of informative and non-informative blobs in this section. After giving a formal definition of these blobs, we establish results related to the changing number of character extensions to embeddings of phylogenetic trees that are displayed by phylogenetic networks that consist of a single informative or non-informative blob. Let NN be a phylogenetic network on XX, and let BB be a blob of NN. Furthermore, let CN(B)C_{N}(B) be the subset of V(N)V(N) that contains precisely each vertex that is not in BB and that is a child of a vertex of BB. We refer to CN(B)C_{N}(B), as the children of BB. As an immediate consequence of the definition of CN(B)C_{N}(B), we have the following lemma that we will freely use throughout the remainder of the paper.

Lemma 4.1.

Let BB be a blob of a phylogenetic network NN on XX, and let vv be a vertex of CN(B)C_{N}(B). Furthermore, let SS be an embedding of a phylogenetic XX-tree that is displayed by NN. Then, vv is a vertex of SS. Moreover, if vv is a vertex of a blob BB^{\prime} in NN, then vv is the source of BB^{\prime}.

Proof.

Suppose that vv is a vertex of a blob BB^{\prime} in NN. Since vCN(B)v\in C_{N}(B), we have BBB^{\prime}\neq B. Towards a contradiction, assume that vv is not the source of BB^{\prime}. It follows that vv is a vertex of in-degree two. Let uu be the parent of vv in BB, and let uu^{\prime} be the other parent of vv. Then there is a directed path from the root of NN to vv that traverses uu and a directed path from the root of NN to vv that traverses uu^{\prime}, thereby contradicting that BB and BB^{\prime} are two distinct blobs in NN. We complete the proof by noting that SS contains each edge of NN whose deletion disconnects NN into more than one connected component and, so, vv is a vertex of SS. ∎

Now, let ss be the source of a blob BB in a phylogenetic network NN on XX. Furthermore, let SS be an embedding of a phylogenetic XX-tree that is displayed by NN. Let ff be a character on XX, and let FF be an extension of ff to V(S)V(S). We set ind(F,B,S)=0{\rm ind}(F,B,S)=0 if each element in CN(B)C_{N}(B) is assigned to the same character state under FF and, otherwise, we set ind(F,B,S)=1{\rm ind}(F,B,S)=1. By Lemma 4.1, recall that each vertex in CN(B)C_{N}(B) is also a vertex of SS and, thus, ind(F,B,S){\rm ind}(F,B,S) is well defined. Moreover, we say that BB is a non-informative blob relative to SS and ff if there exists an extension FF of ff to V(S)V(S) such that PS(f,S)=ch(F,S)PS(f,S)={\rm ch}(F,S) and ind(F,B,S)=0{\rm ind}(F,B,S)=0. Otherwise, we say that BB is an informative blob. We next extend the concept of a single informative blob to all blobs B1,B2,,BmB_{1},B_{2},\ldots,B_{m} of NN and set

b(f,N,S)=minFj such thatPS(f,S)=ch(Fj,S)(i=1mind(Fj,Bi,S)),b(f,N,S)=\min\limits_{\genfrac{}{}{0.0pt}{1}{F_{j}\text{ such that}}{PS(f,S)={\rm ch}(F_{j},S)}}\left(\sum_{i=1}^{m}{\rm ind}(F_{j},B_{i},S)\right),

where the minimum is taken over all extensions FjF_{j} of ff to V(S)V(S) whose changing number is equal to PS(f,S)PS(f,S). Then b(f,N,S)b(f,N,S) denotes the number of informative blobs relative to SS and ff in NN. If FjF_{j} is an extension of ff to V(S)V(S) such that b(f,N,S)=i=1mind(Fj,Bi,S)b(f,N,S)=\sum_{i=1}^{m}{\rm ind}(F_{j},B_{i},S), then we say that FjF_{j} realises b(f,N,S)b(f,N,S). See Figure 1 for an example of a phylogenetic network NN on X={x1,x2,,x8}X=\{x_{1},x_{2},\ldots,x_{8}\} with two blobs BB and BB^{\prime}, an embedding SS of a phylogenetic XX-tree displayed by NN, and a binary character ff on XX such that b(f,N,S)=1b(f,N,S)=1. Here, BB is non-informative because there exists a minimum extension FF of ff that assigns character state 0 to all elements of CN(B)C_{N}(B), where CN(B)C_{N}(B) contains the source of BB^{\prime} and leaves x7x_{7} and x8x_{8}. Blob BB^{\prime}, on the other hand, is informative, as the elements in CN(B)={x1,x2,,x6}C_{N}(B^{\prime})=\{x_{1},x_{2},\ldots,x_{6}\} are assigned two different states by ff and thus by any extension of it. To see that FF is indeed minimum, notice that ch(F,S)=ch(F,T)=1{\rm ch}(F,S)={\rm ch}(F,T)=1. This is minimum since ff employs two states, and it is well-known and easy to see that, in this case, any extension of ff requires at least one change.

Lemma 4.2.

Let ff be a character on XX, and let NN be a phylogenetic network on XX with a single blob BB whose source ss is the child of the root. Let SS and SS^{\prime} be embeddings of two phylogenetic XX-trees that are displayed by NN. Suppose that BB is non-informative relative to SS^{\prime} and ff. Let FF^{\prime} be an extension of ff to V(S)V(S^{\prime}) with ch(F,S)=PS(f,S){\rm ch}(F^{\prime},S^{\prime})=PS(f,S^{\prime}) that assigns the same character state to each vertex in CN(B)C_{N}(B). Then there exists an extension FF of ff to V(S)V(S) such that

  1. (i)

    ch(F,S)=ch(F,S){\rm ch}(F,S)={\rm ch}(F^{\prime},S^{\prime}) and

  2. (ii)

    F(s)=F(s)F(s)=F^{\prime}(s).

Proof.

Since BB is non-informative, recall that FF^{\prime} exists. Furthermore, by the definition of an embedding, ss is the child of the roots of SS and SS^{\prime}. Let VV be the subset of V(N)V(N) that precisely contains each vertex that is not a vertex of BB. Since BB is the only blob of NN, each vertex in VV is also a vertex of SS and SS^{\prime}. Furthermore, each vertex of SS or SS^{\prime} that is not in VV, is a vertex of BB. Since ss is the child of the root of SS^{\prime} and FF^{\prime} assigns the same character state, say α\alpha, to each vertex in CN(B)C_{N}(B), it follows that FF^{\prime} also assigns α\alpha to each vertex in V(S)V(S^{\prime}) that is an ancestor of some vertex in CN(B)C_{N}(B). In particular, F(s)=αF^{\prime}(s)=\alpha.

Now, consider SS. Set F(u)=F(u)F(u)=F^{\prime}(u) for each vertex uVu\in V and set F(u)=αF(u^{\prime})=\alpha for each vertex uV(S)Vu^{\prime}\in V(S)\setminus V. By definition of FF, we again have that each vertex in V(S)V(S) that is an ancestor of some vertex in CN(B)C_{N}(B) is assigned to α\alpha under FF. Hence, as FF^{\prime} is an extension of ff to V(S)V(S^{\prime}), FF is an extension of ff to V(S)V(S) with F(s)=F(s)=αF(s)=F^{\prime}(s)=\alpha; thereby satisfying (ii). Moreover, since SS and SS^{\prime} are embeddings of two phylogenetic XX-trees that are displayed by NN, the edges of NN satisfy the following property: If e=(u,v)e=(u,v) is an edge of SS^{\prime} (resp. SS) but not an edge of SS (resp. SS^{\prime}), then ee is an edge of BB and, consequently, F(u)=F(v)=αF^{\prime}(u)=F^{\prime}(v)=\alpha (resp. F(u)=F(v)=αF(u)=F(v)=\alpha). It follows that ch(F,S)=ch(F,S){\rm ch}(F,S)={\rm ch}(F^{\prime},S^{\prime}) which satisfies (i) and, therefore, completes the proof of the lemma. ∎

Lemma 4.3.

Let ff be a character on XX. Let TT and TT^{\prime} be two phylogenetic XX-trees. Furthermore, let FF^{\prime} be an extension of ff to V(T)V(T^{\prime}). Then there exists an extension FF of ff to V(T)V(T) such that

  1. (i)

    ch(F,T)ch(F,T)+drSPR(T,T){\rm ch}(F,T)\leq{\rm ch}(F^{\prime},T^{\prime})+d_{\mathrm{rSPR}}(T^{\prime},T) and

  2. (ii)

    F(ρ)=F(ρ)F(\rho)=F^{\prime}(\rho^{\prime}), where ρ\rho and ρ\rho^{\prime} is the root of TT and TT^{\prime}, respectively.

Proof.

We show by induction on drSPR(T,T)d_{\mathrm{rSPR}}(T^{\prime},T) that there exists an extension FF of ff to V(T)V(T) that satisfies (i)–(ii). Suppose that drSPR(T,T)=1d_{\mathrm{rSPR}}(T^{\prime},T)=1. Then there exists a single rSPR operation that transforms TT^{\prime} into TT. Given such an rSPR operation, let (u,v)(u^{\prime},v^{\prime}) be the edge of TT^{\prime} that is deleted in the pruning part of the operation. Let upu_{p}^{\prime} and ucvu_{c}^{\prime}\neq v^{\prime} be the parent and other child of uu^{\prime} in TT^{\prime}. Further, let uu be the vertex that subdivides an edge, say (up,uc)(u_{p},u_{c}), when reattaching the resulting subtree with root vv^{\prime} such that (u,v)(u,v^{\prime}) is an edge in TT. Noting that each vertex in TT except for uu is also a vertex of TT^{\prime}, we next obtain an extension FF of ff to V(T)V(T) with no character state transition on the root edge of TT as follows: For each vertex wuw\neq u, we set F(w)=F(w)F(w)=F^{\prime}(w). In particular, we have F(ρ)=F(ρ)F(\rho)=F^{\prime}(\rho^{\prime}) and, so, (ii) follows. Moreover, if up=ρu_{p}=\rho, we set F(u)=F(up)F(u)=F(u_{p}). Otherwise, we set F(u)=αF(u)=\alpha, where α\alpha is a character state that has been assigned to at least one neighbour of uu in TT under FF and there is no other character state that has been assigned to strictly more neighbours of uu in TT under FF. We next show that (i) holds. Consider the edges of TT. Except for the edges (u,v)(u,v^{\prime}) used to reattach the subtree with root vv^{\prime}, (up,uc)(u_{p}^{\prime},u_{c}^{\prime}) obtained from suppressing uu^{\prime}, and (up,u)(u_{p},u) and (u,uc)(u,u_{c}) obtained from subdividing (up,uc)(u_{p},u_{c}) with uu, each edge of TT is also an edge of TT^{\prime}. If F(up)F(uc)F(u_{p}^{\prime})\neq F(u_{c}^{\prime}), then either F(up)F(u)F^{\prime}(u_{p}^{\prime})\neq F^{\prime}(u^{\prime}) or F(u)F(uc)F^{\prime}(u^{\prime})\neq F^{\prime}(u_{c}^{\prime}). Hence, suppressing uu^{\prime} does not increase the changing number. On the other hand, when assigning a character state to uu the changing number may increase. More specifically, we consider three cases. First, if F(up)=F(uc)F(u_{p})=F(u_{c}), then F(u)=F(up)F(u)=F(u_{p}) by definition of FF. Note that upu_{p} may be ρ\rho. Thus, there is no character state transition on the two edges (up,u)(u_{p},u) and (u,uc)(u,u_{c}), and at most one such transition on the edge (u,v)(u,v^{\prime}) under FF in TT. Second, if |{F(up),F(uc),F(v)}|=3|\{F(u_{p}),F(u_{c}),F(v^{\prime})\}|=3, then there is a character state transition on the edge (up,uc)(u_{p},u_{c}) under FF^{\prime} and we have two character state transitions on the three edges (up,u)(u_{p},u), (u,uc)(u,u_{c}), and (u,v)(u,v^{\prime}) under FF. Third, if F(up)F(uc)F(u_{p})\neq F(u_{c}) and |{F(up),F(uc),F(v)}|=2|\{F(u_{p}),F(u_{c}),F(v^{\prime})\}|=2, then there is again a character state transition on the edge (up,uc)(u_{p},u_{c}) under FF^{\prime} and we have one character state transition on the three edges (up,u)(u_{p},u), (u,uc)(u,u_{c}), and (u,v)(u,v^{\prime}) under FF. Hence, regardless of which case applies

ch(F,T)ch(F,T)+1=ch(F,T)+drSPR(T,T);{\rm ch}(F,T)\leq{\rm ch}(F^{\prime},T^{\prime})+1={\rm ch}(F^{\prime},T^{\prime})+d_{\mathrm{rSPR}}(T^{\prime},T);

thereby satisfying (i) for when drSPR(T,T)=1d_{\mathrm{rSPR}}(T^{\prime},T)=1.

Now suppose that drSPR(T,T)2d_{\mathrm{rSPR}}(T^{\prime},T)\geq 2 and that (i)–(ii) are satisfied for all pairs of phylogenetic trees whose rSPR distance is strictly smaller than drSPR(T,T)d_{\mathrm{rSPR}}(T^{\prime},T). Let T′′T^{\prime\prime} be a phylogenetic XX-tree such that drSPR(T,T′′)=1d_{\mathrm{rSPR}}(T^{\prime},T^{\prime\prime})=1 and drSPR(T′′,T)=drSPR(T,T)1d_{\mathrm{rSPR}}(T^{\prime\prime},T)=d_{\mathrm{rSPR}}(T^{\prime},T)-1. Recalling that FF^{\prime} is an extension of ff to V(T)V(T^{\prime}), it follows from the induction hypothesis, that there is an extension F′′F^{\prime\prime} of ff to V(T′′)V(T^{\prime\prime}) that satisfies (ii) and ch(F′′,T′′)ch(F,T)+1{\rm ch}(F^{\prime\prime},T^{\prime\prime})\leq{\rm ch}(F^{\prime},T^{\prime})+1. Again, by the induction hypothesis, there exists an extension FF of ff to V(T)V(T) that satisfies (ii) and ch(F,T)ch(F′′,T′′)+drSPR(T′′,T){\rm ch}(F,T)\leq{\rm ch}(F^{\prime\prime},T^{\prime\prime})+d_{\mathrm{rSPR}}(T^{\prime\prime},T). Hence, by combining the two inequalities we obtain

ch(F,T)\displaystyle{\rm ch}(F,T) ch(F′′,T′′)+drSPR(T′′,T)\displaystyle\leq{\rm ch}(F^{\prime\prime},T^{\prime\prime})+d_{\mathrm{rSPR}}(T^{\prime\prime},T)
ch(F,T)+1+drSPR(T′′,T)=ch(F,T)+drSPR(T,T)\displaystyle\leq{\rm ch}(F^{\prime},T^{\prime})+1+d_{\mathrm{rSPR}}(T^{\prime\prime},T)={\rm ch}(F^{\prime},T^{\prime})+d_{\mathrm{rSPR}}(T^{\prime},T)

and F(ρ)=F(ρ)F(\rho)=F^{\prime}(\rho^{\prime}). Hence, FF satisfies (i)–(ii). This completes the proof of the lemma. ∎

Corollary 4.4.

Let ff be a character on XX. Let NN be a phylogenetic network on XX with a single blob, and let TT and TT^{\prime} be two phylogenetic XX-trees displayed by NN. Furthermore, let FF^{\prime} be an extension of ff to V(T)V(T^{\prime}). Then there exists an extension FF of ff to V(T)V(T) such that

  1. (i)

    ch(F,T)ch(F,T)+k{\rm ch}(F,T)\leq{\rm ch}(F^{\prime},T^{\prime})+k, where kk is the level of NN, and

  2. (ii)

    F(ρ)=F(ρ)F(\rho)=F^{\prime}(\rho^{\prime}), where ρ\rho and ρ\rho^{\prime} is the root of TT and TT^{\prime}, respectively.

Proof.

Let RR and RR^{\prime} be two switchings of NN that yield TT and TT^{\prime}, respectively. By Lemma 3.1, we have drSPR(T,T)dswitch(R,R)d_{\mathrm{rSPR}}(T,T^{\prime})\leq d_{\mathrm{switch}}(R,R^{\prime}). Noting that NN has a single blob, we have dswitch(R,R)kd_{\mathrm{switch}}(R,R^{\prime})\leq k and the corollary now follows from Lemma 4.3. ∎

While Lemma 4.2 is restricted to phylogenetic networks that consist of a single non-informative blob, the next lemma establishes an analogous result for all phylogenetic networks that consist of a single blob.

Lemma 4.5.

Let ff be a character on XX, and let NN be a phylogenetic network on XX with a single blob BB whose source ss is the child of the root. Let SS and SS^{\prime} be embeddings of two phylogenetic trees that are displayed by NN. Furthermore, let FF^{\prime} be an extension of ff to V(S)V(S^{\prime}). Then there exists an extension FF of ff to V(S)V(S) such that

  1. (i)

    ch(F,S)ch(F,S)+k{\rm ch}(F,S)\leq{\rm ch}(F^{\prime},S^{\prime})+k, where kk is the level of NN, and

  2. (ii)

    F(s)=F(s)F(s)=F^{\prime}(s).

Proof.

Let TT (resp. TT^{\prime}) be the two phylogenetic XX-trees such that SS (resp. SS^{\prime}) is an embedding of TT (resp. TT^{\prime}) in NN. Furthermore, let F(s)=αF^{\prime}(s)=\alpha. Observe that each vertex in TT^{\prime} is a unique degree-three vertex in SS^{\prime}. First, let FTF^{\prime}_{T^{\prime}} be the extension of ff to V(T)V(T^{\prime}) obtained from FF^{\prime} by setting FT(w)=F(w)F^{\prime}_{T^{\prime}}(w)=F^{\prime}(w) for each wV(T)w\in V(T^{\prime}). Then

ch(FT,T)ch(F,S),{\rm ch}(F^{\prime}_{T^{\prime}},T^{\prime})\leq{\rm ch}(F^{\prime},S^{\prime}), (3)

and, because there is no character state transition on any edge of the root path of SS^{\prime} that contains ss, the root of TT^{\prime} is assigned to α\alpha. Second, by Corollary 4.4, there exists an extension FTF_{T} of ff to V(T)V(T) such that

ch(FT,T)ch(FT,T)+k{\rm ch}(F_{T},T)\leq{\rm ch}(F^{\prime}_{T^{\prime}},T^{\prime})+k (4)

and the root of TT is also assigned to α\alpha. Third, we obtain an extension FF of ff to V(S)V(S) from FTF_{T} as follows. Noting that each edge (v,v)(v,v^{\prime}) in TT corresponds to a unique directed path v=v1,v2,,vs=vv=v_{1},v_{2},\ldots,v_{s}=v^{\prime} in SS whose non-terminal vertices all have degree two, we set F(v1)=F(v2)==F(vs1)=FT(v)F(v_{1})=F(v_{2})=\cdots=F(v_{s-1})=F_{T}(v) and F(vs)=FT(v)F(v_{s})=F_{T}(v^{\prime}). Then

ch(F,S)=ch(FT,T),{\rm ch}(F,S)={\rm ch}(F_{T},T), (5)

and, since the root of TT is assigned to α\alpha, it follows that each vertex on the root path of SS, in particular ss, is also assigned to α\alpha. Hence (ii) is satisfied. Moreover, by combining Equations (3)–(5), we have

ch(F,S)=ch(FT,T)ch(FT,T)+kch(F,S)+k.{\rm ch}(F,S)={\rm ch}(F_{T},T)\leq{\rm ch}(F^{\prime}_{T^{\prime}},T^{\prime})+k\leq{\rm ch}(F^{\prime},S^{\prime})+k.

This concludes the proof of the lemma. ∎

5 Blob reduction

In this section, we introduce the notion of a blob reduction. Intuitively, this allows us to decompose a phylogenetic network NN into two smaller phylogenetic networks and calculate the parsimony score of an embedding SS of a phylogenetic XX-tree displayed by NN based on these two smaller networks.

Let NN be a phylogenetic network on XX, let BB be a blob of NN whose source, say ss, has maximum distance from the root of NN, and let Y=cl(s)Y=\mathrm{cl}(s). For some yXy\notin X, the blob reduction of BB reduces NN to two smaller phylogenetic networks as follows. Let N(Y¯)N(\bar{Y}) be the phylogenetic network on (XY){y}(X\setminus Y)\cup\{y\} that is obtained from NN by replacing the subnetwork of NN that is rooted at ss with a single new leaf yy. Furthermore, let N(Y)N(Y) be the phylogenetic network on YY that is obtained from the subnetwork of NN that is rooted at ss by adding a new vertex ρY\rho_{Y} and edge (ρY,s)(\rho_{Y},s). By construction, each of N(Y¯)N(\bar{Y}) and N(Y)N(Y) contains at least one leaf.

Now, let SS be an embedding of a phylogenetic XX-tree TT that is displayed by NN. Recall that ss is a vertex of SS and cl(s)=Y\mathrm{cl}(s)=Y. Let ff be a character on XX, and let FF be an extension of ff to V(S)V(S). Using the aforementioned blob reduction of BB as a guide, we next also reduce SS to two smaller trees such that one of the resulting trees is an embedding of a subtree of TT in N(Y¯)N(\bar{Y}) and the other one is an embedding of another subtree of TT in N(Y)N(Y). More specifically, let S(Y¯)S(\bar{Y}) be the tree with leaf set (XY){y}(X\setminus Y)\cup\{y\} that is obtained from SS by replacing the subtree of SS that is rooted at ss with a single new leaf yy. Furthermore, let S(Y)S(Y) be the tree with leaf set YY that is obtained from the subtree of SS that is rooted at ss by adding a new vertex ρY\rho_{Y} and edge (ρY,s)(\rho_{Y},s). We call (S(Y),S(Y¯))(S(Y),S(\bar{Y})) the cluster tree pair of SS relative to BB. Let fYf_{Y} and fY¯f_{\bar{Y}} be a character on YY and on (XY){y}(X\setminus Y)\cup\{y\}, respectively, such that fY()=f()f_{Y}(\ell)=f(\ell) for each Y\ell\in Y and fY¯()=f()f_{\bar{Y}}(\ell^{\prime})=f(\ell^{\prime}) for each XY\ell^{\prime}\in X\setminus Y. We refer to extensions FYF_{Y} of fYf_{Y} to V(S(Y))V(S(Y)) and FY¯F_{\bar{Y}} of fY¯f_{\bar{Y}} to V(S(Y¯))V(S(\bar{Y})) as a pair of cluster extensions with respect to ff if FY¯(y)=FY(ρY)F_{\bar{Y}}(y)=F_{Y}(\rho_{Y}). Except for fY¯(y)f_{\bar{Y}}(y), observe that ff uniquely determines the character state of each leaf in S(Y)S(Y) and S(Y¯)S(\bar{Y}). Moreover, since the root path of S(Y)S(Y) contains at least one edge, the definition of a pair of cluster extensions implies that FY(ρY)=FY(s)F_{Y}(\rho_{Y})=F_{Y}(s) by our assumption following Observation 2.3. The next lemma shows how the changing number of extensions of characters ff, fYf_{Y}, and fY¯f_{\bar{Y}} to V(S)V(S), V(S(Y))V(S(Y)), and V(S(Y¯))V(S(\bar{Y})), respectively, are related to each other.

Lemma 5.1.

Let BB be a blob of a phylogenetic network NN on XX to which the blob reduction can be applied. Let ff be a character on XX, and let SS be an embedding of a phylogenetic XX-tree that is displayed by NN. Furthermore, let (S(Y),S(Y¯))(S(Y),S(\bar{Y})) be the cluster tree pair of SS relative to BB. Then, the following two statements hold.

  1. (i)

    If FF is an extension of ff to V(S)V(S), then there exists a pair of cluster extensions (FY,FY¯)(F_{Y},F_{\bar{Y}}) such that

    ch(F,S)=ch(FY,S(Y))+ch(FY¯,S(Y¯)).{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(F_{\bar{Y}},S(\bar{Y})).
  2. (ii)

    If (FY,FY¯)(F_{Y},F_{\bar{Y}}) is a pair of cluster extensions with respect to ff, then there exists an extension FF of ff to V(S)V(S) such that

    ch(F,S)=ch(FY,S(Y))+ch(FY¯,S(Y¯)).{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(F_{\bar{Y}},S(\bar{Y})).
Proof.

Let ss be the source of BB, and let Y=clN(s)Y=\mathrm{cl}_{N}(s). By the definition of a cluster tree pair, S(Y)S(Y) has leaf set YY and root ρY\rho_{Y}, and S(Y¯)S(\bar{Y}) has leaf set (XY){y}(X\setminus Y)\cup\{y\} and root ρ\rho, where ρ\rho is also the root of SS. Lastly, as ss is a vertex of SS, it follows from the construction of S(Y)S(Y) and S(Y¯)S(\bar{Y}) that ss corresponds to the child of ρY\rho_{Y} in S(Y)S(Y) and to yy in S(Y¯)S(\bar{Y}), whereas each other vertex of SS corresponds to a unique vertex in either S(Y)S(Y) or S(Y¯)S(\bar{Y}). To ease reading, we refer to the child of ρY\rho_{Y} in S(Y)S(Y) as sYs_{Y}. Reversely, the only vertex of S(Y)S(Y) and S(Y¯)S(\bar{Y}) that does not correspond to a unique vertex in SS is ρY\rho_{Y}. We next show that (i) and (ii) hold.

First, let FF be an extension of ff to V(S)V(S). Obtain a pair of cluster extensions FYF_{Y} and FY¯F_{\bar{Y}} of characters fYf_{Y} and fY¯f_{\bar{Y}} to V(S(Y))V(S(Y)) and V(S(Y¯))V(S(\bar{Y})), respectively, in the following way. For each vertex ww of V(S(Y)){ρY}V(S(Y))\setminus\{\rho_{Y}\}, set FY(w)=F(w)F_{Y}(w)=F(w^{\prime}), where ww^{\prime} is the vertex of SS that ww corresponds to, and set FY(ρY)=FY(sY)F_{Y}(\rho_{Y})=F_{Y}(s_{Y}). Similarly, for each vertex ww of V(S(Y¯))V(S(\bar{Y})), set FY¯(w)=F(w)F_{\bar{Y}}(w)=F(w^{\prime}), where ww^{\prime} is the vertex of SS that ww corresponds to. Since yy and sYs_{Y} both correspond to ss, it follows that FY¯(y)=FY(ρY)F_{\bar{Y}}(y)=F_{Y}(\rho_{Y}). It is is now easily checked that FYF_{Y} and FY¯F_{\bar{Y}} is a pair of cluster extensions with respect to ff and that

ch(F,S)=ch(FY,S(Y))+ch(FY¯,S(Y¯)).{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(F_{\bar{Y}},S(\bar{Y})).

Hence, (i) holds.

Now, let FYF_{Y} and FY¯F_{\bar{Y}} be a pair of cluster extensions with respect to ff. In particular, FY()=f()F_{Y}(\ell)=f(\ell) for each Y\ell\in Y, FY¯()=f()F_{\bar{Y}}(\ell)=f(\ell) for each XY\ell\in X\setminus Y, and FY¯(y)=FY(ρY)F_{\bar{Y}}(y)=F_{Y}(\rho_{Y}). Now obtain an extension FF of ff to V(S)V(S) from FYF_{Y} and FY¯F_{\bar{Y}} in the following way. For each vertex ww^{\prime} of V(S)V(S) that corresponds to a vertex ww of S(Y)S(Y), set F(w)=FY(w)F(w^{\prime})=F_{Y}(w) and, for each vertex ww^{\prime}of V(S){s}V(S)\setminus\{s\} that corresponds to a vertex ww of S(Y¯)S(\bar{Y}), set F(w)=FY¯(w)F(w^{\prime})=F_{\bar{Y}}(w). Since FY(sy)=FY(ρY)F_{Y}(s_{y})=F_{Y}(\rho_{Y}), it follows that

ch(F,S)=ch(FY,S(Y))+ch(FY¯,S(Y¯)).{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(F_{\bar{Y}},S(\bar{Y})).

Thus, (ii) holds as well. ∎

6 Proof of Theorem 2.4

In this section, we establish the proof of Theorem 2.4 and show that the bound that is given in the theorem is sharp. Most work in proving Theorem 2.4 goes into establishing the following lemma.

Lemma 6.1.

Let ff be a character on XX. Let NN be a phylogenetic network on XX, and let SS and SS^{\prime} be embeddings of two phylogenetic XX-trees that are displayed by NN. Furthermore, let FF^{\prime} be an extension of ff to V(S)V(S^{\prime}) that realises b(f,N,S)b(f,N,S^{\prime}). Then there exists an extension FF of ff to V(S)V(S) such that

ch(F,S)ch(F,S)+kb(f,N,S),{\rm ch}(F,S)\leq{\rm ch}(F^{\prime},S^{\prime})+k\cdot b(f,N,S^{\prime}),

where kk is the level of NN.

Proof.

Let B1,B2,,BmB_{1},B_{2},\ldots,B_{m} be the blobs of NN. The proof is by induction on mm. If m=0m=0, then NN is a phylogenetic tree with k=0k=0 and so the result clearly follows since S=SS=S^{\prime} and, therefore,

ch(F,S)ch(F,S)+0{\rm ch}(F,S)\leq{\rm ch}(F^{\prime},S^{\prime})+0

when setting F=FF=F^{\prime}. Now assume that m1m\geq 1 and that the statement is true for all phylogenetic networks with at most m1m-1 blobs. Let BB be a blob of NN whose source ss has maximum distance from the root of NN over all its blobs, and let Y=clN(s)Y=\mathrm{cl}_{N}(s). Without loss of generality, we assume that B=BmB=B_{m}.

For some yXy\notin X, let N(Y¯)N(\bar{Y}) be the phylogenetic network on (XY){y}(X\setminus Y)\cup\{y\}, and let N(Y)N(Y) be the phylogenetic network on YY and with root ρY\rho_{Y} resulting from NN by applying a blob reduction to BmB_{m}. Notice that by construction, N(Y)N(Y) consists of the single blob BmB_{m} with ss being the child of ρY\rho_{Y}, whereas N(Y¯)N(\bar{Y}) contains precisely m1m-1 blobs. Moreover, let (S(Y),S(Y¯))(S(Y),S(\bar{Y})) be the cluster tree pair of SS relative to BmB_{m}, and let (S(Y),S(Y¯))(S^{\prime}(Y),S^{\prime}(\bar{Y})) be the cluster tree pair of SS^{\prime} relative to BmB_{m}. Since ss is a vertex of SS and SS^{\prime}, ss is also a vertex of S(Y)S(Y) and S(Y)S^{\prime}(Y). Now, by Lemma 5.1, Part (i), there exists a pair of cluster extensions (FY,GY¯)(F^{\prime}_{Y},G^{\prime}_{\bar{Y}}) with respect to ff such that

ch(F,S)=ch(FY,S(Y))+ch(GY¯,S(Y¯)){\rm ch}(F^{\prime},S^{\prime})={\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y})) (6)

and GY¯(y)=FY(ρY)=FY(s)G^{\prime}_{\bar{Y}}(y)=F^{\prime}_{Y}(\rho_{Y})=F^{\prime}_{Y}(s). Let fYf_{Y} and gY¯g_{\bar{Y}} be the characters on YY and (XY){y}(X\setminus Y)\cup\{y\}, respectively, such that FYF^{\prime}_{Y} and GY¯G^{\prime}_{\bar{Y}} are extensions of fYf_{Y} and gY¯g_{\bar{Y}}, respectively.

We next consider N(Y¯)N(\bar{Y}) and start by making two observations. First, since ch(F,S)=PS(f,S){\rm ch}(F^{\prime},S^{\prime})=PS(f,S^{\prime}), it follows from Lemma 5.1, Parts (i) and (ii), that ch(GY¯,S(Y¯))=PS(gY¯,S(Y¯)){\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))=PS(g_{\bar{Y}},S^{\prime}(\bar{Y})). Second, by the construction of the pair of cluster extensions in Lemma 5.1 Part (i), we may assume that, for each vertex ww of V(S(Y¯))V(S^{\prime}(\bar{Y})), we have GY¯(w)=F(w)G^{\prime}_{\bar{Y}}(w)=F^{\prime}(w^{\prime}), where ww^{\prime} is the vertex of SS^{\prime} that ww corresponds to. Then, as FF^{\prime} realises b(f,N,S)b(f,N,S^{\prime}), GY¯G^{\prime}_{\bar{Y}} realises b(gY¯,N(Y¯),S(Y¯))b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y})). Noting that N(Y¯)N(\bar{Y}) has m1m-1 blobs and level at most kk, we now apply the induction hypothesis to obtain an extension GY¯G_{\bar{Y}} of gY¯g_{\bar{Y}} to V(S(Y¯))V(S(\bar{Y})) that satisfies

ch(GY¯,S(Y¯))ch(GY¯,S(Y¯))+kb(gY¯,N(Y¯),S(Y¯)){\rm ch}(G_{\bar{Y}},S(\bar{Y}))\leq{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))+k\cdot b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y})) (7)

such that GY¯(y)=gY¯(y)=GY¯(y)=FY(s)G_{\bar{Y}}(y)=g_{\bar{Y}}(y)=G^{\prime}_{\bar{Y}}(y)=F^{\prime}_{Y}(s).

To complete the proof, we consider N(Y)N(Y). Here, we distinguish two cases depending on whether its single blob BmB_{m} whose level is at most kk is informative or non-informative.

First, assume that BmB_{m} is informative. By Lemma 4.5, there exists an extension FYF_{Y} of fYf_{Y} to V(S(Y))V(S(Y)) such that

ch(FY,S(Y))ch(FY,S(Y))+kandFY(s)=FY(s).{\rm ch}(F_{Y},S(Y))\leq{\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+k\quad\text{and}\quad F_{Y}(s)=F^{\prime}_{Y}(s). (8)

Since FY(s)=FY(s)=GY¯(y)F_{Y}(s)=F^{\prime}_{Y}(s)=G_{\bar{Y}}(y), the pair (FY,GY¯)(F_{Y},G_{\bar{Y}}) is a pair of cluster extensions with respect to ff. Thus, by Lemma 5.1, Part (ii), there exists an extension FF of ff to V(S)V(S) such that

ch(F,S)=ch(FY,S(Y))+ch(GY¯,S(Y¯)).{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(G_{\bar{Y}},S(\bar{Y})).

Now, using Inequalities (7) and (8), we obtain

ch(F,S)\displaystyle{\rm ch}(F,S) =ch(FY,S(Y))+ch(GY¯,S(Y¯))\displaystyle={\rm ch}(F_{Y},S(Y))+{\rm ch}(G_{\bar{Y}},S(\bar{Y}))
ch(FY,S(Y))+k+ch(GY¯,S(Y¯))+kb(gY¯,N(Y¯),S(Y¯))\displaystyle\leq{\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+k+{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))+k\cdot b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y}))
=ch(FY,S(Y))+ch(GY¯,S(Y¯))+k(1+b(gY¯,N(Y¯),S(Y¯)))\displaystyle={\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))+k\cdot(1+b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y})))
=ch(F,S)+kb(f,N,S),\displaystyle={\rm ch}(F^{\prime},S^{\prime})+k\cdot b(f,N,S^{\prime}),

where the last equality follows from Equation (6) and the fact that BmB_{m} is informative.

Second, assume that BmB_{m} is non-informative. Then by Lemma 4.2, there exists an extension FYF_{Y} of fYf_{Y} to V(S(Y))V(S(Y)) such that

ch(FY,S(Y))=ch(FY,S(Y))andFY(s)=FY(s).{\rm ch}(F_{Y},S(Y))={\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))\quad\text{and}\quad F_{Y}(s)=F^{\prime}_{Y}(s). (9)

The remainder of the proof is now similar to the first case. In particular, noting that the pair (FY,GY¯)(F_{Y},G_{\bar{Y}}) is a pair of cluster extensions with respect to ff, by Lemma 5.1, Part(ii), there exists an extension FF of ff to V(S)V(S) such that

ch(F,S)\displaystyle{\rm ch}(F,S) =ch(FY,S(Y))+ch(GY¯,S(Y¯)).\displaystyle={\rm ch}(F_{Y},S(Y))+{\rm ch}(G_{\bar{Y}},S(\bar{Y})).

Using Inequalities in (7) and (9), we obtain

ch(F,S)\displaystyle{\rm ch}(F,S) =ch(FY,S(Y))+ch(GY¯,S(Y¯))\displaystyle={\rm ch}(F_{Y},S(Y))+{\rm ch}(G_{\bar{Y}},S(\bar{Y}))
ch(FY,S(Y))+ch(GY¯,S(Y¯))+kb(gY¯,N(Y¯),S(Y¯))\displaystyle\leq{\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))+k\cdot b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y}))
=ch(F,S)+kb(f,N,S),\displaystyle={\rm ch}(F^{\prime},S^{\prime})+k\cdot b(f,N,S^{\prime}),

where the last equality follows from Equation (6) and the fact that BmB_{m} is non-informative.

In both cases, we obtain an extension FF of ff to V(S)V(S) such that

ch(F,S)ch(F,S)+kb(f,N,S).{\rm ch}(F,S)\leq{\rm ch}(F^{\prime},S^{\prime})+k\cdot b(f,N,S^{\prime}).

This concludes the proof of the lemma. ∎

Corollary 6.2.

Let ff be a character on XX. Let NN be a phylogenetic network on XX, and let SS and SS^{\prime} be embeddings of two phylogenetic XX-trees displayed by NN such that PSsw(f,N)=PS(f,S)PS_{\mathrm{sw}}(f,N)=PS(f,S^{\prime}). Then

PS(f,S)PS(f,S)+kb(f,N,S),PS(f,S)\leq PS(f,S^{\prime})+k\cdot b(f,N,S^{\prime}),

where kk is the level of NN.

Proof.

Let FF^{\prime} be an extension of ff to V(S)V(S^{\prime}) that realises b(f,N,S)b(f,N,S^{\prime}). By Lemma 6.1, there exists an extension FF of ff to V(S)V(S) such that PS(f,S)ch(F,S)PS(f,S)+kb(f,N,S)PS(f,S)\leq{\rm ch}(F,S)\leq PS(f,S^{\prime})+k\cdot b(f,N,S^{\prime}). ∎

We are finally in a position to establish the main result of this paper, which we restate for convenience.

Theorem 2.4. Let NN be a phylogenetic network on XX, and let TT be a phylogenetic XX-tree in D(N)D(N). Let AA be an alignment of characters on XX. Then

PS(A,T)(k+1)PSsw(A,N) and PS(A,T)(k+1)PSsw(A,N),PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N)\text{ and }PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(A,N),

where kk is the level of NN.

Proof.

We establish that

PS(A,T)(k+1)PSsw(A,N).PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N).

Since PSsw(A,N)PSsw(A,N)PS_{\mathrm{sw}}(A,N)\leq PS_{\mathrm{sw}^{\prime}}(A,N), it immediately follows that PS(A,T)(k+1)PSsw(A,N)PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(A,N) also holds.

First, assume that AA consists of a single character ff. Let SS be an embedding of TT in NN, and let SS^{\prime} be an embedding of a phylogenetic XX-tree in NN such that PSsw(f,N)=PS(f,S)PS_{\mathrm{sw}}(f,N)=PS(f,S^{\prime}). By Corollary 6.2, we have

PS(f,T)=PS(f,S)PS(f,S)+kb(f,N,S),\displaystyle PS(f,T)=PS(f,S)\leq PS(f,S^{\prime})+k\cdot b(f,N,S^{\prime}),

where the first equality follows from Lemma 2.2. Now, as every blob BB in NN that is informative relative to SS^{\prime} contributes at least one to PS(f,S)PS(f,S^{\prime}), we have

b(f,N,S)PS(f,S).\displaystyle b(f,N,S^{\prime})\leq PS(f,S^{\prime}).

By Lemma 2.2 and combining the last two inequalities, we obtain

PS(f,T)=PS(f,S)\displaystyle PS(f,T)=PS(f,S) PS(f,S)+kPS(f,S)\displaystyle\leq PS(f,S^{\prime})+k\cdot PS(f,S^{\prime})
=(k+1)PS(f,S)\displaystyle=(k+1)\cdot PS(f,S^{\prime})
=(k+1)PSsw(f,N).\displaystyle=(k+1)\cdot PS_{\mathrm{sw}}(f,N). (10)

Now, assume that A=(f1,,fn)A=(f_{1},\ldots,f_{n}) with n1n\geq 1. Then, we apply Inequality (10) to each character, and obtain

PS(A,T)\displaystyle PS(A,T) =i=1nPS(fi,T)\displaystyle=\sum\limits_{i=1}^{n}PS(f_{i},T)
i=1n(k+1)PSsw(fi,N)\displaystyle\leq\sum\limits_{i=1}^{n}(k+1)\cdot PS_{\mathrm{sw}}(f_{i},N)
(k+1)i=1nPSsw(fi,N)\displaystyle\leq(k+1)\cdot\sum\limits_{i=1}^{n}PS_{\mathrm{sw}}(f_{i},N)
=(k+1)PSsw(A,N).\displaystyle=(k+1)\cdot PS_{\mathrm{sw}}(A,N).

Refer to caption
Figure 2: A level-kk network NN on X={x0}{xi,xi,xi′′:1ik}X=\{x_{0}\}\cup\{x_{i},x^{\prime}_{i},x^{\prime\prime}_{i}:1\leq i\leq k\} (left) and two phylogenetic XX-trees TT and TT^{\prime} displayed by NN together with a binary character ff and extensions FF and FF^{\prime} to V(T)V(T) and V(T)V(T^{\prime}), respectively (right).

We close this section by presenting, for each k0k\geq 0, a level-kk network and a binary character such that the upper bound stated in Theorem 2.4 is sharp. As level-0 networks on XX are phylogenetic XX-trees, the bound is sharp for any level-0 network and any binary character on XX. For k1k\geq 1, let NN be the level-kk network on X={x0}{xi,xi,xi′′:1ik}X=\{x_{0}\}\cup\{x_{i},x^{\prime}_{i},x^{\prime\prime}_{i}:1\leq i\leq k\} that is depicted in Figure 2. Further, let f:X{0,1}f\colon X\rightarrow\{0,1\} be the binary character with f(x)=1f(x)=1 if x{x0,x1,,xk}x\in\{x_{0},x_{1},\ldots,x_{k}\} and f(x)=0f(x)=0 otherwise. Let us consider the two phylogenetic XX-trees T,TD(N)T,T^{\prime}\in D(N) that are illustrated on the right-hand side of Figure 2 together with extensions FF and FF^{\prime} of ff to V(T)V(T) and V(T)V(T^{\prime}), respectively. By using Fitch’s algorithm it is easy to verify that FF and FF^{\prime} are minimum. Hence, we have PS(f,T)=1PS(f,T^{\prime})=1 and, thus, PSsw(f,N)=1PS_{\mathrm{sw}}(f,N)=1 (since ff employs two character states, PSsw(f,N)1PS_{\mathrm{sw}}(f,N)\geq 1). Moreover, PS(f,T)=k+1PS(f,T)=k+1. In summary, we have PS(f,T)=(k+1)PSsw(f,N)PS(f,T)=(k+1)\cdot PS_{\mathrm{sw}}(f,N). As the construction shown in Figure 2 involves a single character, the two notions of softwired parsimony on NN coincide, and we have PS(f,T)=(k+1)PSsw(f,N)PS(f,T)=(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(f,N) for the same example.

7 Parental parsimony for phylogenetic networks

We now briefly consider the notion of parental parsimony introduced by van Iersel et al. [24] as an alternative to softwired and hardwired parsimony. Intuitively, instead of defining the parsimony score of a phylogenetic network NN by considering its display set D(N)D(N), parental parsimony considers the set of parental trees (sometimes also called weakly displayed trees [12]), which is a superset of D(N)D(N).

A multilabelled tree on XX is a leaf-labelled rooted tree whose root has out-degree one, all other interior vertices have in-degree one and out-degree one, or in-degree one and out-degree two, and, for each element xx in XX, there exists at least one leaf in TT that is labelled xx. Now using the same notation as [24], let U(N)U^{\ast}(N) be the multilabelled obtained from a phylogenetic network NN on XX as follows: The vertices of U(N)U^{\ast}(N) are the directed paths in NN starting at the root of NN, and for each pair of directed paths p,pp,p^{\prime}, there is an edge (p,p)(p,p^{\prime}) in U(N)U^{\ast}(N) if and only if pp^{\prime} is an extension of pp by one additional edge of NN. Furthermore, each vertex in U(N)U^{\ast}(N) corresponding to a path in NN starting at the root of NN and ending at xXx\in X is labelled by xx. For an example of the multilabelled tree U(N)U^{\ast}(N) obtained from a phylogenetic network NN, see Figure 3. Now, a phylogenetic XX-tree is called a parental tree of NN if it can be obtained from a subgraph of U(N)U^{\ast}(N) by suppressing vertices of in-degree and out-degree one. To denote the set of all parental trees of NN, we use P(N)P(N). Informally speaking, a tree is a parental tree of a phylogenetic network if it can be drawn inside the network in such a way that the tree vertices of the tree correspond to tree vertices of the network. Importantly, though, a parental tree is not necessarily a displayed tree (see Figure 3), whereas every displayed tree is also parental.

Refer to caption
Figure 3: A phylogenetic network NN on X={x1,x2,,x4}X=\{x_{1},x_{2},\ldots,x_{4}\}, a phylogenetic XX-tree TT such that TP(N)T\in P(N) and TD(N)T\notin D(N), and the multilabelled tree U(N)U^{\ast}(N). The dashed lines in NN indicate how TT can be drawn inside NN.

Given a character ff on XX and a phylogenetic network NN on XX, the parental parsimony score of ff on NN is now defined as

PSpa(f,N)\displaystyle PS_{\mathrm{pa}}(f,N) =minTP(N)PS(f,T),\displaystyle=\min\limits_{T\in P(N)}PS(f,T),

where the minimum is taken over all parental trees for NN.

It was shown by van Iersel et al. [24, Theorem 2] that computing the parental parsimony score is NP-hard even if ff is a binary character and NN is a restricted type of a so-called tree-child network [2]. It is thus a natural question if our main result for softwired parsimony (Theorem 2.4) generalises to parental parsimony. Unfortunately, this is not the case. Suppose that nn is an even integer and that NN is the level-11 network on n+1n+1 leaves depicted in Figure 4. Then, both phylogenetic trees TT and TT^{\prime} as depicted in the same figure are parental trees of NN and TD(N)T^{\prime}\notin D(N). Additionally, suppose that ff is the binary character that assigns state 0 to leaves x2,x4,x6,,xnx_{2},x_{4},x_{6},\ldots,x_{n}, and state 11 to leaves x1,x3,,xn+1x_{1},x_{3},\ldots,x_{n+1}. Crucially, we have PS(f,T)=n/2PS(f,T)=n/2 and PS(f,T)=1PS(f,T^{\prime})=1. In particular, PSpa(f,N)=1PS_{\mathrm{pa}}(f,N)=1, and thus,

PS(f,T)=n/22PSpa(f,N)=2PS(f,T)=n/2\not\leq 2\cdot PS_{\mathrm{pa}}(f,N)=2

for each n>4n>4 (a similar argument applies to nn being odd), which shows that Theorem 2.4 does not generalise to parental parsimony even if the phylogenetic network is level-11 with a single blob and the alignment consists of a single binary character.

Refer to caption
Figure 4: A level-11 network NN on X={x1,x2,,xn+1}X=\{x_{1},x_{2},\ldots,x_{n+1}\} with nn being even and two parental trees TT and TT^{\prime} of NN together with a binary character ff on XX. Here, PSpa(N,f)=PS(f,T)=1PS_{\mathrm{pa}}(N,f)=PS(f,T^{\prime})=1, whereas PS(f,T)=n/2PS(f,T)=n/2.

8 Softwired parsimony for semi-directed and unrooted networks

In this section, we show that binary semi-directed networks have an analogous bound on the softwired parsimony score as the one we established for rooted binary phylogenetic networks. We briefly turn our attention to unrooted binary phylogenetic networks at the end of the section and show that the approach we take to establish the bound for semi-directed networks does not work in this setting. However, this does not exclude the possibility that similar bounds can be obtained for unrooted phylogenetic networks by other means.

A binary semi-directed network NsN_{s} on XX is a leaf-labelled mixed multigraph without any loops that can be obtained from a rooted binary phylogenetic network NrN_{r} by deleting its root, suppressing the child of the root, and omitting the direction of each edge that is not a reticulation edge. We call NrN_{r} a rooted partner of NsN_{s}. By construction, NsN_{s} has at most one pair of parallel edges. This is precisely the case when NrN_{r} has an underlying 3-cycle that contains the child of the root. Note that NsN_{s} may have multiple rooted partners. A vertex vv in NsN_{s} is called a reticulation if NsN_{s} contains two edges, referred to as reticulation edges, that are directed into vv. A semi-directed level-kk network is a semi-directed network NsN_{s} such that a rooted partner of NsN_{s} is a rooted binary level-kk network. Lastly, an unrooted binary phylogenetic XX-tree TuT_{u} is a connected undirected acyclic graph whose leaf set is XX and whose inner vertices all have degree three. Note that an unrooted binary phylogenetic tree is a binary semi-directed network without reticulations. In the following, all rooted (resp. semi-directed) phylogenetic networks and rooted (resp. unrooted) phylogenetic trees are assumed to be binary.

Now, let TuT_{u} be an unrooted binary phylogenetic XX-tree, and let NsN_{s} be a semi-directed network on XX. We say that TuT_{u} is displayed by NsN_{s} if there exists a subgraph SS of NsN_{s} such that SS is a subdivision of TuT_{u} (omitting the directions of the reticulation edges) and SS contains, for each reticulation vv in NsN_{s}, at most one reticulation edge incident with vv. Similar to rooted phylogenetic networks, if TuT_{u} is displayed by NsN_{s}, we call SS an embedding of TuT_{u} in NsN_{s}. Furthermore, we refer to the set of all unrooted phylogenetic XX-trees that are displayed by NsN_{s} as the unrooted display set of NsN_{s} and denote it by D(Ns)D(N_{s}).

Let A=(f1,f2,,fn)A=(f_{1},f_{2},\ldots,f_{n}) be an alignment of characters on XX, and let NsN_{s} be a semi-directed network on XX. We define the softwired parsimony score of AA on NsN_{s} as

PSsw(A,Ns)=i=1nminTuD(Ns)PS(fi,Tu),\displaystyle PS_{\mathrm{sw}}(A,N_{s})=\sum_{i=1}^{n}\min_{T_{u}\in D(N_{s})}PS(f_{i},T_{u}), (11)

where the parsimony score of an unrooted phylogenetic XX-tree TT is defined as in the rooted case since the corresponding concepts of the changing number and (minimum) extensions naturally translate to undirected graphs.

The next lemma shows how the set of unrooted phylogenetic trees that are displayed by a semi-directed network NsN_{s} is related to the set of rooted phylogenetic trees that are displayed by a rooted partner of NsN_{s}.

Lemma 8.1.

Let NsN_{s} be a semi-directed network on XX, and let NrN_{r} be a rooted partner of NsN_{s}. Then, the following two statements hold.

  1. (i)

    If TuT_{u} is an unrooted phylogenetic XX-tree that is displayed by NsN_{s}, then there exists a rooted phylogenetic XX-tree TD(Nr)T\in D(N_{r}) such that TuT_{u} can be obtained from TT by deleting the root and suppressing its child.

  2. (ii)

    If TT is a rooted phylogenetic XX-tree that is displayed by NrN_{r}, then there exists an unrooted phylogenetic XX-tree TuD(Ns)T_{u}\in D(N_{s}) such that TuT_{u} can be obtained from TT by deleting the root and suppressing its child.

Proof.

Throughout the proof, we assume that NsN_{s} does not contain a pair of parallel edges. Indeed, if NsN_{s} contains such a pair of edges, e1e_{1} and e2e_{2} say, we can obtain a semi-directed network NsN_{s}^{\prime} on XX from NsN_{s}, by deleting e1e_{1}, suppressing the two resulting degree-two vertices, and omitting the direction of e2e_{2}. Clearly D(Ns)=D(Ns)D(N_{s})=D(N_{s}^{\prime}). Now, let vv be the child of the root ρ\rho in NrN_{r}, and let ww and ww^{\prime} be the children of vv. If either ww or ww^{\prime} is a reticulation, we assume without loss of generality that ww^{\prime} is a reticulation. Observe that at most one of ww and ww^{\prime} is a reticulation. Each edge in NrN_{r} that is not incident with vv corresponds to exactly one edge in NsN_{s}, and each edge in NsN_{s} except eρ={w,w}e_{\rho}=\{w,w^{\prime}\} (resp.  eρ=(w,w)e_{\rho}=(w,w^{\prime}) if ww^{\prime} is a reticulation in NrN_{r}) corresponds to exactly one edge in NrN_{r}. In particular, each reticulation edge in NrN_{r} corresponds to exactly one such edge in NsN_{s} and vice versa. First, let TuT_{u} be an unrooted phylogenetic XX-tree that is displayed by NsN_{s}. By definition, there exists an embedding SuS_{u} of TuT_{u} in NsN_{s}. If eρe_{\rho} is an edge in SuS_{u}, we obtain an embedding SS of a rooted phylogenetic XX-tree TT in NrN_{r} from SuS_{u} as follows. We replace eρe_{\rho} with the three directed edges (ρ,v)(\rho,v), (v,w)(v,w) and (v,w)(v,w^{\prime}) and each edge eeρe\neq e_{\rho} with its directed counterpart in NrN_{r}. By definition, TT is displayed by NrN_{r}, that is TD(Nr)T\in D(N_{r}). Further, by construction, TuT_{u} can be obtained from TT by deleting the root and suppressing its child. Now, let us consider the case that eρe_{\rho} is not an edge in SuS_{u}. Let SS^{\prime} be the connected acyclic subgraph of NrN_{r} that is obtained from SuS_{u} by replacing each edge in SuS_{u} with its directed counterpart in NrN_{r}. We next show that there is a unique vertex vsv_{s} in SS^{\prime} with in-degree zero and out-degree two. Since SS^{\prime} is acyclic, it follows that SS^{\prime} contains a vertex with in-degree zero. Clearly, the out-degree of this vertex cannot be one or three and, thus, vsv_{s} exists. Assume towards a contradiction that vsvsv_{s}^{\prime}\neq v_{s} is another vertex with this property. As SS^{\prime} is connected and does not contain a vertex with in-degree two, one of vsv_{s}^{\prime} and vsv_{s} is a descendant of the other in SS^{\prime}. But then one of these vertices has in-degree one, a contradiction. It now follows that each vertex in SS^{\prime} is a descendant of vsv_{s}. Let π=(ρ=v1,v2,,vt=vs)\pi=(\rho=v_{1},v_{2},\ldots,v_{t}=v_{s}) be a directed path in NrN_{r}. Such a path π\pi exists as there is a directed path from the root to any vertex in NrN_{r}. Since each vertex in SS^{\prime} is a descendant of vsv_{s}, it follows that viv_{i} with i<ti<t is not in SS^{\prime}. Then, we obtain an embedding SS of a rooted phylogenetic XX-tree TT in NrN_{r} from SS^{\prime} by adding the t1t-1 directed edges (vi,vi+1)(v_{i},v_{i+1}), 1i<t1\leq i<t, to SS^{\prime}. Again, we have TD(Nr)T\in D(N_{r}) and, by construction, TuT_{u} can be obtained from TT by deleting the root and suppressing its child. Hence, (i) holds.

Second, let TT be a rooted phylogenetic XX-tree that is displayed by NrN_{r}. Let SS be an embedding of TT in NrN_{r}, and let π=(ρ=v1,v2,,vt)\pi=(\rho=v_{1},v_{2},\ldots,v_{t}) be the root path of SS. Then, we obtain an embedding SuS_{u} of an unrooted phylogenetic XX-tree TuT_{u} in NsN_{s} from SS by deleting each vertex that lies on π\pi except for vtv_{t}, turning directed edges into undirected ones and, if t=2t=2, suppressing vtv_{t}. If t=2t=2, then π\pi consists of the single edge (ρ,v)(\rho,v), and SS contains (v,w)(v,w) and (v,w)(v,w^{\prime}). Hence, SuS_{u} contains the edge {w,w}\{w,w^{\prime}\} as a result of suppressing vv. Otherwise, if t3t\geq 3, then SS contains (ρ,v)(\rho,v) and exactly one of (v,w)(v,w) and (v,w)(v,w^{\prime}), and vtv_{t} is a vertex in NsN_{s}. Furthermore, if t4t\geq 4, each edge (vi,vi+1)(v_{i},v_{i+1}) with 3i<t3\leq i<t that is traversed by π\pi corresponds to a unique edge consisting of the same vertices in NsN_{s}. It now follows that SuS_{u} is indeed an embedding of TuT_{u} in NsN_{s}, that is, TuD(Ns)T_{u}\in D(N_{s}). By construction, TuT_{u} can be obtained from TT by deleting the root and suppressing its child. Hence, (ii) holds as well. ∎

We next obtain the following corollary from Lemma 8.1.

Corollary 8.2.

Let NsN_{s} be a semi-directed network on XX, let NrN_{r} be a rooted partner of NsN_{s}, and let AA be an alignment of characters on XX. Then PSsw(A,Ns)=PSsw(A,Nr)PS_{\mathrm{sw}}(A,N_{s})=PS_{\mathrm{sw}}(A,N_{r}).

Proof.

The corollary follows from Lemma 8.1 and the fact that, if TT is a rooted phylogenetic XX-tree and TuT_{u} is an unrooted phylogenetic XX-tree such that TuT_{u} can be obtained from TT by deleting the root and suppressing its child, then PS(A,T)=PS(A,Tu)PS(A,T)=PS(A,T_{u}). ∎

We are now in a position to state the main result of the section.

Theorem 8.3.

Let NsN_{s} be a semi-directed network on XX, and let TuT_{u} be an unrooted phylogenetic XX-tree that is displayed by NsN_{s}. Furthermore, let AA be an alignment of characters on XX. Then

PS(A,Tu)(k+1)PSsw(A,Ns),PS(A,T_{u})\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N_{s}),

where kk is the level of NsN_{s}.

Proof.

Let NrN_{r} be a rooted partner of NsN_{s}, and let TD(Nr)T\in D(N_{r}) be a rooted phylogenetic XX-tree such that deleting the root in TT and suppressing its child yields TuT_{u}. The rooted phylogenetic XX-tree TT exists by Lemma 8.1 (i). Since an unrooted phylogenetic tree is a semi-directed network without reticulations, we have (i) PS(A,Tu)=PS(A,T)PS(A,T_{u})=PS(A,T) by Corollary 8.2. Furthermore, by Theorem 2.4, we have (ii) PS(A,T)(k+1)PSsw(A,Nr)PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N_{r}). Finally, by Corollary 8.2, we have (iii) PSsw(A,Nr)=PSsw(A,Ns)PS_{\mathrm{sw}}(A,N_{r})=PS_{\mathrm{sw}}(A,N_{s}). Combining (i)–(iii) yields PS(A,Tu)(k+1)PSsw(A,Ns)PS(A,T_{u})\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N_{s}). ∎

We now shift our focus from semi-directed networks to unrooted phylogenetic networks. An unrooted binary phylogenetic network UU on XX is a connected undirected graph without any loops or edges in parallel, whose leaf set is XX, and whose inner vertices all have degree three. As before, we omit the term binary in the following as all unrooted phylogenetic networks considered in this section are binary.

Let UU be an unrooted phylogenetic network on XX. We say that an unrooted phylogenetic XX-tree TuT_{u} is displayed by UU if there exists a subgraph of UU that is a subdivision of TuT_{u}. Furthermore, we refer to the set of all unrooted phylogenetic XX-trees that are displayed by UU as the display set of UU and denote it by D(U)D(U). We call UU an unrooted level-kk network if at most kk edges have to be deleted in each biconnected component of UU such that the resulting graph is acyclic. Lastly, if UU can be obtained from a rooted phylogenetic network NN by deleting its root, suppressing the child of the root, and omitting all edge directions, we say that NN is an orientation of UU.

Let A=(f1,f2,,fn)A=(f_{1},f_{2},\ldots,f_{n}) be an alignment of characters on XX, and let UU be an unrooted phylogenetic network on XX. We define the softwired parsimony score of AA on UU as

PSsw(A,U)=i=1nminTD(U)PS(fi,T).\displaystyle PS_{\mathrm{sw}}(A,U)=\sum_{i=1}^{n}\min_{T\in D(U)}PS(f_{i},T). (12)

Next, we present an unrooted level-1 network UU and a binary character ff such that PSsw(f,U)PSsw(f,N)PS_{\mathrm{sw}}(f,U)\neq PS_{\mathrm{sw}}(f,N) for an orientation NN of UU, that is, we show that Corollary 8.2 does not translate from semi-directed networks to unrooted phylogenetic networks. Moreover, we give two different orientations NN and NN^{\prime} of UU with PSsw(f,N)PSsw(f,N)PS_{\mathrm{sw}}(f,N)\neq PS_{\mathrm{sw}}(f,N^{\prime}).

Refer to caption
Figure 5: An unrooted level-1 network UU (left) with its display set D(U)D(U) (middle), and two orientations NN and NN^{\prime} of UU (right) with their respective display sets indicated by a dashed rectangle. More precisely, D(N)D(N) and D(N)D(N^{\prime}) can be obtained by orienting the enclosed unrooted phylogenetic XX-trees.

To this end, let UU be the unrooted level-1 network on X={a,b,c,d,e}X=\{a,b,c,d,e\}, and let NN and NN^{\prime} be the two orientations of UU as shown in Figure 5. The display set D(U)D(U) of size five is shown in the middle of the same figure. By deleting the root and suppressing its child in each element of D(N)D(N) (resp. D(N)D(N^{\prime})), we obtain a subset of D(U)D(U) as indicated by the two dashed rectangles that each enclose NN (resp. NN^{\prime}) and two elements of D(U)D(U). Now for the single binary character f:X{0,1}f\colon X\rightarrow\{0,1\} with

f(a)=f(b)=f(c)=0 and f(d)=f(e)=1,f(a)=f(b)=f(c)=0\text{ and }f(d)=f(e)=1,

we have PSsw(f,U)=1PS_{\mathrm{sw}}(f,U)=1, PSsw(f,N)=2PS_{\mathrm{sw}}(f,N)=2 and PSsw(f,N)=1PS_{\mathrm{sw}}(f,N^{\prime})=1. Since the softwired parsimony score of an unrooted phylogenetic network UU is not necessarily the same as the score of an orientation of UU, we cannot represent an unrooted phylogenetic network by an arbitrary orientation in the way we used a rooted partner of a semi-directed network to obtain Theorem 8.3. While our example shows that using the same approach as in the semi-directed setting is not viable, it does not exclude the existence of similar bounds for unrooted phylogenetic networks or classes thereof (such as unrooted level-1 networks, for example).

9 Concluding remarks

In this paper, we have obtained a bound on the softwired parsimony score of a gap-free alignment of multistate characters on rooted as well as semi-directed phylogenetic level-kk networks. To be precise, we have shown that the maximum difference between the softwired parsimony score of a phylogenetic network NN and the parsimony score of any tree displayed by NN is bounded by k+1k+1 times the parsimony score of NN. Unfortunately, our approximation result as stated in Theorem 2.4 cannot be generalised to alignments with gaps since it was already shown in [15, Corollary 2] that computing the softwired parsimony score of a level-11 network for an alignment of binary characters that additionally allows gaps is APX-hard.

Extending the notion of softwired parsimony to semi-directed networks and exploiting a connection between the display sets of semi-directed networks and their rooted partners, we have shown that an analogous bound holds for semi-directed networks. For unrooted networks, on the other hand, the approach via rooted partners (more formally, via orientations) does not seem to be viable. Nevertheless, it would be an interesting question for future research to investigate if an analogous or similar bound for the softwired parsimony score can be obtained in some other way for unrooted phylogenetic networks.

Another interesting direction for future research would be to analyse whether our results also apply in the case of non-binary phylogenetic networks, i.e., phylogenetic networks that may have vertices of degree strictly greater than three. While there exists a polynomial time algorithm to compute the parsimony score of a given non-binary phylogenetic tree with character states assigned to its leaves, namely the Fitch-Hartigan algorithm [7, 10], other concepts that our results rely on, such as the rSPR distance and its relation to the switching distance, have been studied less for non-binary phylogenetic trees (see [5, Section 2] for a related discussion).

Acknowledgements. The first and second authors thank the New Zealand Marsden Fund for financial support. All authors thank Steven Kelk for helpful discussions.

References

  • Allman et al. [2019] E. S. Allman, H. Baños, and J. A. Rhodes. NANUQ: A method for inferring species networks from gene trees under the coalescent model. Algorithms for Molecular Biology, 14:1–25, 2019.
  • Cardona et al. [2009] G. Cardona, F. Rossello, and G. Valiente. Comparison of tree-child phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):552–569, 2009.
  • Döcker et al. [2024] J. Döcker, S. Linz, and C. Semple. Hypercubes and Hamilton cycles of display sets of rooted phylogenetic networks. Advances in Applied Mathematics, 152:102595, 2024.
  • Felsenstein [2004] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, US, 2004.
  • Fischer and Kelk [2016] M. Fischer and S. Kelk. On the maximum parsimony distance between phylogenetic trees. Annals of Combinatorics, 20:87–113, 2016.
  • Fischer et al. [2015] M. Fischer, L. van Iersel, S. Kelk, and C. Scornavacca. On computing the maximum parsimony score of a phylogenetic network. SIAM Journal on Discrete Mathematics, 29(1):559–585, 2015.
  • Fitch [1971] W. M. Fitch. Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology, 20(4):406–416, 1971.
  • Frohn and Kelk [2023] M. Frohn and S. Kelk. A 22-approximation algorithm for the softwired parsimony problem on binary, tree-child phylogenetic networks. submitted, 2023.
  • Gross and Long [2018] E. Gross and C. Long. Distinguishing phylogenetic networks. SIAM Journal on Applied Algebra and Geometry, 2(1):72–93, 2018.
  • Hartigan [1973] J. A. Hartigan. Minimum mutation fits to a given tree. Biometrics, 29(1):53, 1973.
  • Hollering and Sullivant [2021] B. Hollering and S. Sullivant. Identifiability in phylogenetics using algebraic matroids. Journal of Symbolic Computation, 104:142–158, 2021.
  • Huber et al. [2016] K. T. Huber, V. Moulton, M. Steel, and T. Wu. Folding and unfolding phylogenetic trees and networks. Journal of Mathematical Biology, 73(6–7):1761–1780, 2016.
  • Kannan and Wheeler [2012] L. Kannan and W. C. Wheeler. Maximum parsimony on phylogenetic networks. Algorithms for Molecular Biology, 7:1–10, 2012.
  • Kelk and Fischer [2017] S. Kelk and M. Fischer. On the complexity of computing MP distance between binary phylogenetic trees. Annals of Combinatorics, 21:573–604, 2017.
  • Kelk et al. [2019] S. Kelk, F. Pardi, C. Scornavacca, and L. van Iersel. Finding a most parsimonious or likely tree in a network with respect to an alignment. Journal of Mathematical Biology, 78:527–547, 2019.
  • Nakhleh et al. [2005] L. Nakhleh, G. Jin, F. Zhao, and J. Mellor-Crummey. Reconstructing phylogenetic networks using maximum parsimony. In 2005 IEEE Computational Systems Bioinformatics Conference (CSB’05), pages 93–102. IEEE, 2005.
  • Sansom et al. [2018] R. S. Sansom, P. G. Choate, J. N. Keating, and E. Randle. Parsimony, not Bayesian analysis, recovers more stratigraphically congruent phylogenetic trees. Biology Letters, 14(6):20180263, 2018.
  • Schrago et al. [2018] C. G. Schrago, B. O. Aguiar, and B. Mello. Comparative evaluation of maximum parsimony and Bayesian phylogenetic reconstruction using empirical morphological data. Journal of Evolutionary Biology, 31(10):1477–1484, 2018.
  • Semple [2007] C. Semple. Hybridization networks. In O. Gascuel and M. Steel, editors, Reconstructing Evolution: New Mathematical and Computational Advances, pages 277–314. Oxford University Press, UK, 2007.
  • Smith [2019] M. R. Smith. Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets. Biology Letters, 15(2):20180632, 2019.
  • Solís-Lemus and Ané [2016] C. Solís-Lemus and C. Ané. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics, 12(3):e1005896, 2016.
  • Solís-Lemus et al. [2017] C. Solís-Lemus, P. Bastide, and C. Ané. PhyloNetworks: a package for phylogenetic networks. Molecular Biology and Evolution, 34(12):3292–3298, 2017.
  • Stamatakis et al. [2005] A. Stamatakis, T. Ludwig, and H. Meier. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics, 21(4):456–463, 2005.
  • van Iersel et al. [2018] L. van Iersel, M. Jones, and C. Scornavacca. Improved maximum parsimony models for phylogenetic networks. Systematic Biology, 67(3):518–542, 2018.
  • Zhang et al. [2020] C. Zhang, J. P. Huelsenbeck, and F. Ronquist. Using parsimony-guided tree proposals to accelerate convergence in Bayesian phylogenetic inference. Systematic Biology, 69(5):1016–1032, 2020.